data cleaning free download

Showing 82 open source projects for "data cleaning"

1

NYC Taxi Data

Import public NYC taxi and for-hire vehicle (Uber, Lyft)

The nyc-taxi-data repository is a rich dataset and exploratory project around New York City taxi trip records. It collects and preprocesses large-scale trip datasets (fares, pickup/dropoff, timestamps, locations, passenger counts) to enable data analysis, modeling, and visualization efforts. The project includes scripts and notebooks for cleaning and filtering the raw data, memory-efficient processing for large CSV/Parquet files, and aggregation workflows (e.g. trips per hour, heatmaps of pickups/dropoffs). ...

Downloads: 6 This Week

Last Update: 2025-10-01
See Project
2

AI Data Science Team

An AI-powered data science team of agents

AI Data Science Team is a Python library and agent ecosystem designed to accelerate and automate common data science workflows by modeling them as specialized AI “agents” that can be orchestrated to perform tasks like data cleaning, transformation, analysis, visualization, and machine learning. It provides a modular agent framework where each agent focuses on a step in the typical data science pipeline — for example, loading data from CSV/Excel files, cleaning and wrangling messy datasets, engineering predictive features, building models with AutoML, connecting to SQL databases, and producing visual outputs — all driven by natural language or programmatic instructions. ...

Downloads: 0 This Week

Last Update: 2026-01-26
See Project
3

janitor

Simple tools for data cleaning in R

janitor provides simple, convenient tools for data cleaning, formatting, and exploration in R. It is especially useful for cleaning messy data frames, removing duplicates, formatting column names, and producing frequency tables in a tidy workflow.

Downloads: 1 This Week

Last Update: 2025-07-30
See Project
4

The Data Engineering Handbook

Links to everything you'd ever want to learn about data engineering

...It includes beginner and intermediate boot camps, interview guides, data cleaning and transformation resources, and curated lists of newsletters and industry communities, making it useful both for self-study and technical interview preparation. The repository is actively maintained and widely starred, reflecting its role as a go-to reference for newcomers and experienced practitioners alike.

Downloads: 0 This Week

Last Update: 2026-03-18
See Project
5

Agentic Data Scientist

An end-to-end Data Scientist

Agentic Data Scientist is an experimental AI-driven research framework that orchestrates data science workflows through autonomous agents that can reason, plan, and execute complex analytics tasks. Unlike traditional scripted pipelines, this project lets AI agents break down high-level research goals into sub-tasks such as data acquisition, cleaning, modeling, evaluation, and reporting, with minimal human direction.

Downloads: 0 This Week

Last Update: 2026-02-05
See Project
6

Perfect Roadmap To Learn Data Science

Basic To Intermediate Python data science guide

...What makes it particularly valuable is its holistic nature: rather than focusing only on modeling or theory, it also addresses the broader lifecycle of data-science work: data ingestion, cleaning, EDA, feature engineering, model building, validation, deployment, etc.

Downloads: 0 This Week

Last Update: 2025-12-02
See Project
7

litlyx

Analytics for developers, setup Analytics in 30 seconds

The easiest, developer-centric analytics tool. Litlyx is an open-source, self-hostable analytics solution for the modern framework. Litlyx offers a unique eyewear cleaning system that includes a special cleaning solution and reusable microfiber swabs. This system is designed to provide a more thorough and eco-friendly way to clean glasses, lenses, and screens. The brand emphasizes sustainability by reducing single-use plastics and promoting long-term use of their products. Their cleaning kit...

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
8

CSV Lint

CSV Lint plug-in for Notepad++ for syntax highlighting

CSV Lint is a Notepad++ plug-in for syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing the datetime format and decimal separator, sorting data, counting unique values, and converting to XML, JSON, SQL, etc. It is a plugin for data cleaning and working with messy data files. Use CSV Lint for metadata discovery, technical data validation, and reformatting of tabular data files. It is not meant to replace spreadsheet programs like Excel or SPSS; rather, it is a quality control tool to examine, verify, or polish up a dataset before further processing.

Downloads: 21 This Week

Last Update: 2025-08-08
See Project
9

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 1 This Week

Last Update: 2025-06-09
See Project
10

MeshLab

The open source mesh processing system

...The open source system for processing and editing 3D triangular meshes. It provides a set of tools for editing, cleaning, healing, inspecting, rendering, texturing and converting meshes. It offers features for processing raw data produced by 3D digitization tools/devices and for preparing models for 3D printing.

Downloads: 31 This Week

Last Update: 2025-07-22
See Project
11

nb-clean

Clean Jupyter notebooks of outputs, metadata, and empty cells

nb-clean cleans Jupyter notebooks of cell execution counts, metadata, outputs, and (optionally) empty cells, preparing them for committing to version control. It provides both a Git filter and pre-commit hook to automatically clean notebooks before they're staged, and can also be used with other version control systems, as a command line tool, and as a Python library. It can determine if a notebook is clean or not, which can be used as a check in your continuous integration pipelines....

Downloads: 0 This Week

Last Update: 2024-10-19
See Project
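
As a rough illustration of the kind of cleaning the nb-clean entry describes, the sketch below strips execution counts, cell metadata, and outputs from a notebook that has already been loaded as a Python dict. It is not nb-clean's own code or API: the function names `strip_notebook` and `is_clean` are hypothetical, the nbformat 4 JSON layout is assumed, and only cell-level fields are touched.

```python
import copy


def strip_notebook(nb: dict, remove_empty_cells: bool = False) -> dict:
    """Return a cleaned copy of a notebook dict (assumes nbformat 4 layout)."""
    cleaned = copy.deepcopy(nb)  # never mutate the caller's notebook
    kept = []
    for cell in cleaned.get("cells", []):
        # "source" may be a string or a list of strings; join handles both.
        text = "".join(cell.get("source", ""))
        if remove_empty_cells and not text.strip():
            continue
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
        kept.append(cell)
    cleaned["cells"] = kept
    return cleaned


def is_clean(nb: dict) -> bool:
    """True if stripping the notebook would change nothing (usable as a CI check)."""
    return strip_notebook(nb) == nb
```

A check such as `is_clean(json.load(open(path)))` could then gate a commit or a CI job, which mirrors the "determine if a notebook is clean" use described above.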
12

FDUPES

FDUPES is a program for identifying or deleting duplicate files

...Because it operates directly on file content rather than just filenames, fdupes can accurately detect true copies and guide cleaning operations in data cleanup or migration tasks. It’s a simple, efficient, and widely used utility on Unix-like systems, appreciated by administrators, developers, and power users.

Downloads: 10 This Week

Last Update: 2026-01-19
See Project
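
To make the idea of matching on file content rather than filenames concrete, here is a minimal sketch that groups files by size and then by a SHA-256 digest. It is an assumption-laden illustration, not FDUPES's actual algorithm: the names `find_duplicates` and `_digest` are hypothetical, and a hash match is treated as sufficient without a final byte-by-byte comparison.

```python
import hashlib
import os
from collections import defaultdict


def _digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's content in chunks so large files are not read at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: str) -> list[list[str]]:
    """Return groups of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_content = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate; skip hashing
        for path in paths:
            by_content[(size, _digest(path))].append(path)

    return [sorted(group) for group in by_content.values() if len(group) > 1]
```

Grouping by size first is a cheap filter: only files that share a size are ever hashed, which keeps the scan fast on directories where most files differ.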
13

DOLMA

Data and tools for generating and inspecting OLMo pre-training data

DOLMA (Data Optimization and Learning for Model Alignment) is a framework designed to manage large-scale datasets for training and fine-tuning language models efficiently.

Downloads: 0 This Week

Last Update: 2025-06-25
See Project
14

Crowbook LaTeX

Converts books written in Markdown to HTML, LaTeX/PDF and EPUB

Crowbook's aim is to allow you to write a book in Markdown without worrying about formatting or typography and let the program generate HTML, PDF and EPUB output for you. Its focus is novels and fiction, and the default settings should (hopefully) generate readable books with correct typography without requiring you to worry about it.

Downloads: 0 This Week

Last Update: 2025-06-07
See Project
15

Miller

Miller is like awk, sed, cut, join, and sort for name-indexed data

Miller is like awk, sed, cut, join, and sort for data formats such as CSV, TSV, JSON, JSON Lines, and positionally indexed text. With Miller, you use named fields instead of counting positional indices. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more. Miller operates on key-value-pair data while the...

Downloads: 9 This Week

Last Update: 2026-02-21
See Project
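
The "new fields which are functions of existing fields" idea in the Miller entry can be sketched with Python's standard `csv` module. This is only an analogy for name-indexed processing, not Miller itself; the helper `add_field` and the sample column names are hypothetical.

```python
import csv
import io


def add_field(csv_text: str, name: str, func) -> str:
    """Append a column whose value is computed from each row's named fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    fieldnames = list(reader.fieldnames or []) + [name]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Fields are addressed by header name, never by position.
        row[name] = func(row)
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = "item,price,qty\npen,2,3\nbook,10,1\n"
    print(add_field(sample, "total", lambda r: int(r["price"]) * int(r["qty"])))
```

In Miller itself, a transformation of this kind is typically written with the `put` verb, along the lines of `mlr --csv put '$total = $price * $qty' file.csv`.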
16

Java Tablesaw

Java dataframe and visualization library

Tablesaw is a dataframe and visualization library that supports loading, cleaning, transforming, filtering, and summarizing data. If you work with data in Java, it may save you time and effort. Tablesaw also supports descriptive statistics and can be used to prepare data for working with machine learning libraries like Smile, Tribuo, H2O.ai, and DL4J. Import data from RDBMS, Excel, CSV, TSV, JSON, HTML, or fixed-width text files, whether they are local or remote (HTTP, S3, etc.) ...

Downloads: 0 This Week

Last Update: 2025-06-27
See Project
17

PandasAI

PandasAI is a Python library that integrates generative AI

PandasAI is a Python library that adds Generative AI capabilities to pandas, the popular data analysis and manipulation tool. It is designed to be used in conjunction with pandas, and is not a replacement for it. PandasAI makes pandas (and all the most used data analyst libraries) conversational, allowing you to ask questions to your data in natural language. For example, you can ask PandasAI to find all the rows in a DataFrame where the value of a column is greater than 5, and it will...

Downloads: 0 This Week

Last Update: 2025-10-07
See Project
18

Automated Tool for Optimized Modelling

Automated Tool for Optimized Modelling

During the exploration phase of a machine learning project, a data scientist tries to find the optimal pipeline for their specific use case. This usually involves applying standard data cleaning steps, creating or selecting useful features, trying out different models, etc. Testing multiple pipelines requires many lines of code, and writing it all in the same notebook often makes it long and cluttered.

Downloads: 0 This Week

Last Update: 2024-07-05
See Project
19

Mac Cleaner CLI

Scan and remove junk files, caches, logs, and more

Mac Cleaner CLI is a free and open-source terminal-based utility that helps users scan, identify, and remove unnecessary files from their macOS systems to reclaim storage space and keep systems tidy. Through a simple command-line interface, the tool performs deep scans to find caches, temporary files, logs, browser data, and other clutter, presenting results in an organized interactive menu where users can choose exactly what to clean. It emphasizes safety by allowing users to exclude...

Downloads: 0 This Week

Last Update: 2026-01-29
See Project
20

NeMo Curator

Scalable data pre processing and curation toolkit for LLMs

NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for large language model (LLM) use cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline...

Downloads: 0 This Week

Last Update: 2026-02-23
See Project
21

All-in-RAG

Big Model Application Development Practice 1

All-in-RAG is an open-source educational project designed to teach developers how to build applications using retrieval-augmented generation techniques. The repository provides a structured learning path that covers both theoretical foundations and practical implementation steps for RAG systems. It explains the full development pipeline required to create knowledge-aware AI assistants, including data preparation, document indexing, vector embedding generation, and retrieval strategies. The...

Downloads: 0 This Week

Last Update: 2026-03-17
See Project
22

Practical Machine Learning with Python

Master the essential skills needed to recognize and solve problems

Practical Machine Learning with Python is a comprehensive repository built to accompany a project-centered guide for applying machine learning techniques to real-world problems using Python’s mature data science ecosystem. It centralizes example code, datasets, model pipelines, and explanatory notebooks that teach users how to approach problems from data ingestion and cleaning all the way through feature engineering, model selection, evaluation, tuning, and production-ready deployment patterns. The repository emphasizes end-to-end workflows rather than isolated code snippets, showing how to handle common challenges like class imbalance, overfitting, hyperparameter optimization, and interpretability. ...

Downloads: 0 This Week

Last Update: 2026-02-17
See Project
23

HtmlSanitizer

Cleans HTML to avoid XSS attacks

HtmlSanitizer is a .NET library for cleaning HTML fragments and documents from constructs that can lead to XSS attacks. It uses AngleSharp to parse, manipulate, and render HTML and CSS. Because HtmlSanitizer is based on a robust HTML parser it can also shield you from deliberate or accidental "tag poisoning" where invalid HTML in one fragment can corrupt the whole document leading to broken layout or style. In order to facilitate different use cases, HtmlSanitizer can be customized at...

Downloads: 1 This Week

Last Update: 2026-02-02
See Project
24

Open Interpreter

A natural language interface for computers

Open Interpreter is an open-source tool that provides a natural-language interface for interacting with your computer. It lets large language models (LLMs) run code locally (Python, JavaScript, shell, etc.), so you can ask your computer in plain language ("chat with your computer") to do tasks like data analysis, file manipulation, and browsing, with safeguards. It runs locally or via configured remote LLM servers/inference backends, giving flexibility to use models you trust or have...

Downloads: 15 This Week

Last Update: 2025-09-12
See Project
25

handson-ml2

Jupyter notebooks that walk you through the fundamentals of ML

This repository contains the Jupyter notebooks and code for the second edition of a popular hands-on machine learning book that teaches both classical ML and deep learning using modern tooling. The notebooks emphasize end-to-end workflows: data preparation, model selection, tuning, and reliable evaluation. Deep learning sections use the contemporary Keras/TensorFlow 2 ecosystem, highlighting clean APIs and eager execution to make experiments easier to reason about. Traditional ML topics...

Downloads: 0 This Week

Last Update: 2026-03-19
See Project