text-dedup

text-dedup is a Python library for deduplicating large text corpora. It uses MinHash and other probabilistic techniques to detect near-duplicate content, which matters for NLP tasks where duplicated training data can skew model performance. text-dedup scales to billions of documents and provides tools for chunking, hashing, and comparing text with low memory usage. It supports Jaccard similarity thresholding, parallel execution, and flexible deduplication strategies, which makes it well suited to cleaning web-scraped data, language model training datasets, and document archives.
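
To make the Jaccard similarity mentioned above concrete, here is a minimal sketch in plain Python. It does not use text-dedup's API; the helper names `shingles` and `jaccard`, and the choice of word 3-grams, are assumptions made for illustration only.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in text.

    Hypothetical helper: word-level 3-grams are an illustrative choice,
    not necessarily what text-dedup uses.
    """
    words = text.lower().split()
    if len(words) < n:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(doc_a), shingles(doc_b))
print(f"Jaccard similarity: {similarity:.2f}")
```

Two documents would be treated as near-duplicates when this score meets a chosen threshold. Computing it exactly for every pair is expensive on large corpora, which is why probabilistic estimates such as MinHash are used instead.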

Features

Fast and scalable near-duplicate detection
Uses MinHash and Jaccard similarity for fuzzy matching
Designed for web-scale datasets with billions of documents
Supports customizable deduplication thresholds
Multi-threaded and memory-efficient processing
Hashing-based representation of text chunks
Optional GPU acceleration for faster computation
Suitable for cleaning NLP and LLM training data
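
The MinHash and threshold features listed above can be sketched with the standard library alone. This is an illustration of the general technique under stated assumptions, not text-dedup's implementation or API: the function names, the 128-hash signature size, the word-level tokens, and the 0.7 threshold are all hypothetical.

```python
import hashlib

NUM_PERM = 128  # assumed signature length; more hashes give a tighter estimate


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=NUM_PERM):
    """For each seeded hash function, keep the minimum hash over the token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.7, num_perm=NUM_PERM):
    """Keep a document only if it is below the threshold against all kept ones."""
    kept = []  # list of (document, signature) pairs
    for doc in docs:
        tokens = doc.lower().split() or [""]  # empty docs hash as one empty token
        sig = minhash_signature(tokens, num_perm)
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]


corpus = ["a b c", "a b c", "x y z"]
print(deduplicate(corpus))
```

This sketch compares each document against every kept document, which is quadratic in the worst case. Systems built for large corpora typically pair MinHash with locality-sensitive hashing so that only likely matches are compared.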

License

Apache License V2.0

Additional Project Details

Programming Language: Python
Related Categories: Python Stream Processing Tool
Registered: 2025-04-08
