The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.
Features
- deeply and thoroughly harvests website content
- works on any Java platform (Linux recommended)
- stores content to ARC or ISO WARC aggregate/transcript format
- web interface for operator control and monitoring of crawls
License
Apache License V2.0, GNU Library or Lesser General Public License version 2.0 (LGPLv2)
User Reviews
- Cool
- Cool.
- Useful project. Thanks
- Great software, thank you.
- The app works well in my PC. Serves its purpose too, so no regrets for me.