Showing 82 open source projects for "data cleaning"

View related business solutions
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • Forever Free Full-Stack Observability | Grafana Cloud Icon
    Forever Free Full-Stack Observability | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 1
    NYC Taxi Data

    NYC Taxi Data

    Import public NYC taxi and for-hire vehicle (Uber, Lyft)

    The nyc-taxi-data repository is a rich dataset and exploratory project around New York City taxi trip records. It collects and preprocesses large-scale trip datasets (fares, pickup/dropoff, timestamps, locations, passenger counts) to enable data analysis, modeling, and visualization efforts. The project includes scripts and notebooks for cleaning and filtering the raw data, memory-efficient processing for large CSV/Parquet files, and aggregation workflows (e.g. trips per hour, heatmaps of pickups/dropoffs). ...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 2
    AI Data Science Team

    AI Data Science Team

    An AI-powered data science team of agents

    AI Data Science Team is a Python library and agent ecosystem designed to accelerate and automate common data science workflows by modeling them as specialized AI “agents” that can be orchestrated to perform tasks like data cleaning, transformation, analysis, visualization, and machine learning. It provides a modular agent framework where each agent focuses on a step in the typical data science pipeline — for example, loading data from CSV/Excel files, cleaning and wrangling messy datasets, engineering predictive features, building models with AutoML, connecting to SQL databases, and producing visual outputs — all driven by natural language or programmatic instructions. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    janitor

    janitor

    Simple tools for data cleaning in R

    janitor provides simple, convenient tools for data cleaning, formatting, and exploration in R. It is especially useful for cleaning messy data frames, removing duplicates, formatting column names, and producing frequency tables in a tidy workflow.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 4
    The Data Engineering Handbook

    The Data Engineering Handbook

    Links to everything you'd ever want to learn about data engineering

    ...It includes beginner and intermediate boot camps, interview guides, data cleaning and transformation resources, and curated lists of newsletters and industry communities, making it useful both for self-study and technical interview preparation. The repository is actively maintained and widely starred, reflecting its role as a go-to reference for newcomers and experienced practitioners alike.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Custom VMs From 1 to 96 vCPUs With 99.95% Uptime Icon
    Custom VMs From 1 to 96 vCPUs With 99.95% Uptime

    General-purpose, compute-optimized, or GPU/TPU-accelerated. Built to your exact specs.

    Live migration and automatic failover keep workloads online through maintenance. One free e2-micro VM every month.
    Try Free
  • 5
    Agentic Data Scientist

    Agentic Data Scientist

    An end-to-end Data Scientist

    Agentic Data Scientist is an experimental AI-driven research framework that orchestrates data science workflows through autonomous agents that can reason, plan, and execute complex analytics tasks. Unlike traditional scripted pipelines, this project lets AI agents break down high-level research goals into sub-tasks such as data acquisition, cleaning, modeling, evaluation, and reporting, with minimal human direction.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    Perfect Roadmap To Learn Data Science

    Perfect Roadmap To Learn Data Science

    Basic To Intermediate Python data science guide

    ...What makes it particularly valuable is its holistic nature: rather than focusing only on modeling or theory, it also addresses the broader lifecycle of data-science work, data ingestion, cleaning, EDA, feature engineering, model building, validation, deployment, etc.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    litlyx

    litlyx

    Analytics for developers, setup Analytics in 30 seconds

    The easiest, developer-centric analytics tool. Litlyxis an open-source, self-hostable analytics solution for the modern framework. Litlyx offers a unique eyewear cleaning system that includes a special cleaning solution and reusable microfiber swabs. This system is designed to provide a more thorough and eco-friendly way to clean glasses, lenses, and screens. The brand emphasizes sustainability by reducing single-use plastics and promoting long-term use of their products. Their cleaning kit...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    CSV Lint

    CSV Lint

    CSV Lint plug-in for Notepad++ for syntax highlighting

    CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files. Use CSV Lint for metadata discovery, technical data validation, and reformatting on tabular data files. It is not meant to be a replacement for spreadsheet programs like Excel or SPSS, but rather it's a quality control tool to examine, verify or polish up a dataset before further processing.
    Downloads: 21 This Week
    Last Update:
    See Project
  • 9
    ExtractThinker

    ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs

    ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.
    Downloads: 1 This Week
    Last Update:
    See Project
  • Fully Managed MySQL, PostgreSQL, and SQL Server Icon
    Fully Managed MySQL, PostgreSQL, and SQL Server

    Automatic backups, patching, replication, and failover. Focus on your app, not your database.

    Cloud SQL handles your database ops end to end, so you can focus on your app.
    Try Free
  • 10
    MeshLab

    MeshLab

    The open source mesh processing system

    ...The open source system for processing and editing 3D triangular meshes. It provides a set of tools for editing, cleaning, healing, inspecting, rendering, texturing and converting meshes. It offers features for processing raw data produced by 3D digitization tools/devices and for preparing models for 3D printing.
    Downloads: 31 This Week
    Last Update:
    See Project
  • 11
    nb-clean

    nb-clean

    Clean Jupyter notebooks of outputs, metadata, and empty cells

    nb-clean cleans Jupyter notebooks of cell execution counts, metadata, outputs, and (optionally) empty cells, preparing them for committing to version control. It provides both a Git filter and pre-commit hook to automatically clean notebooks before they're staged, and can also be used with other version control systems, as a command line tool, and as a Python library. It can determine if a notebook is clean or not, which can be used as a check in your continuous integration pipelines....
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    FDUPES

    FDUPES

    FDUPES is a program for identifying or deleting duplicate files

    ...Because it operates directly on file content rather than just filenames, fdupes can accurately detect true copies and guide cleaning operations in data cleanup or migration tasks. It’s a simple, efficient, and widely used utility on Unix-like systems, appreciated by administrators, developers, and power users.
    Downloads: 10 This Week
    Last Update:
    See Project
  • 13
    DOLMA

    DOLMA

    Data and tools for generating and inspecting OLMo pre-training data

    DOLMA (Data Optimization and Learning for Model Alignment) is a framework designed to manage large-scale datasets for training and fine-tuning language models efficiently.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    Crowbook LaTeX

    Crowbook LaTeX

    Converts books written in Markdown to HTML, LaTeX/PDF and EPUB

    Crowbook's aim is to allow you to write a book in Markdown without worrying about formatting or typography and let the program generate HTML, PDF and EPUB output for you. Its focus is novels and fiction, and the default settings should (hopefully) generate readable books with correct typography without requiring you to worry about it.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Miller

    Miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data

    Miller is like awk, sed, cut, join, and sort for data formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more. Miller operates on key-value-pair data while the...
    Downloads: 9 This Week
    Last Update:
    See Project
  • 16
    Java Tablesaw

    Java Tablesaw

    Java dataframe and visualization library

    Tablesaw is a dataframe and visualization library that supports loading, cleaning, transforming, filtering, and summarizing data. If you work with data in Java, it may save you time and effort. Tablesaw also supports descriptive statistics and can be used to prepare data for working with machine learning libraries like Smile, Tribuo, H20.ai, DL4J. Import data from RDBMS, Excel, CSV, TSV, JSON, HTML, or Fixed Width text files, whether they are local or remote (http, S3, etc.) ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    PandasAI

    PandasAI

    PandasAI is a Python library that integrates generative AI

    PandasAI is a Python library that adds Generative AI capabilities to pandas, the popular data analysis and manipulation tool. It is designed to be used in conjunction with pandas, and is not a replacement for it. PandasAI makes pandas (and all the most used data analyst libraries) conversational, allowing you to ask questions to your data in natural language. For example, you can ask PandasAI to find all the rows in a DataFrame where the value of a column is greater than 5, and it will...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    Automated Tool for Optimized Modelling

    Automated Tool for Optimized Modelling

    Automated Tool for Optimized Modelling

    During the exploration phase of a machine learning project, a data scientist tries to find the optimal pipeline for his specific use case. This usually involves applying standard data cleaning steps, creating or selecting useful features, trying out different models, etc. Testing multiple pipelines requires many lines of code, and writing it all in the same notebook often makes it long and cluttered.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Mac Cleaner CLI

    Mac Cleaner CLI

    Scan and remove junk files, caches, logs, and more

    Mac Cleaner CLI is a free and open-source terminal-based utility that helps users scan, identify, and remove unnecessary files from their macOS systems to reclaim storage space and keep systems tidy. Through a simple command-line interface, the tool performs deep scans to find caches, temporary files, logs, browser data, and other clutter, presenting results in an organized interactive menu where users can choose exactly what to clean. It emphasizes safety by allowing users to exclude...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    NeMo Curator

    NeMo Curator

    Scalable data pre processing and curation toolkit for LLMs

    NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for large language model (LLM) use-cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and paramter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    All-in-RAG

    All-in-RAG

    Big Model Application Development Practice 1

    All-in-RAG is an open-source educational project designed to teach developers how to build applications using retrieval-augmented generation techniques. The repository provides a structured learning path that covers both theoretical foundations and practical implementation steps for RAG systems. It explains the full development pipeline required to create knowledge-aware AI assistants, including data preparation, document indexing, vector embedding generation, and retrieval strategies. The...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    Practical Machine Learning with Python

    Practical Machine Learning with Python

    Master the essential skills needed to recognize and solve problems

    Practical Machine Learning with Python is a comprehensive repository built to accompany a project-centered guide for applying machine learning techniques to real-world problems using Python’s mature data science ecosystem. It centralizes example code, datasets, model pipelines, and explanatory notebooks that teach users how to approach problems from data ingestion and cleaning all the way through feature engineering, model selection, evaluation, tuning, and production-ready deployment patterns. The repository emphasizes end-to-end workflows rather than isolated code snippets, showing how to handle common challenges like class imbalance, overfitting, hyperparameter optimization, and interpretability. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    HtmlSanitizer

    HtmlSanitizer

    Cleans HTML to avoid XSS attacks

    HtmlSanitizer is a .NET library for cleaning HTML fragments and documents from constructs that can lead to XSS attacks. It uses AngleSharp to parse, manipulate, and render HTML and CSS. Because HtmlSanitizer is based on a robust HTML parser it can also shield you from deliberate or accidental "tag poisoning" where invalid HTML in one fragment can corrupt the whole document leading to broken layout or style. In order to facilitate different use cases, HtmlSanitizer can be customized at...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 24
    Open Interpreter

    Open Interpreter

    A natural language interface for computers

    Open Interpreter is an open-source tool that provides a natural-language interface for interacting with your computer. It lets large language models (LLMs) run code locally (Python, JavaScript, shell, etc.), enabling you to ask your computer to do tasks like data analysis, file manipulation, browsing, etc. in human terms (“chat with your computer”), with safeguards. Runs locally or via configured remote LLM servers/inference backends, giving flexibility to use models you trust or have...
    Downloads: 15 This Week
    Last Update:
    See Project
  • 25
    handson-ml2

    handson-ml2

    Jupyter notebooks that walk you through the fundamentals of ML

    This repository contains the Jupyter notebooks and code for the second edition of a popular hands-on machine learning book that teaches both classical ML and deep learning using modern tooling. The notebooks emphasize end-to-end workflows: data preparation, model selection, tuning, and reliable evaluation. Deep learning sections use the contemporary Keras/TensorFlow 2 ecosystem, highlighting clean APIs and eager execution to make experiments easier to reason about. Traditional ML topics...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • Next
MongoDB Logo MongoDB