The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
License
GNU Library or Lesser General Public License version 2.0 (LGPLv2)