The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

Features

  • deeply and thoroughly harvests website content
  • works on any Java platform (Linux recommended)
  • stores content to ARC or ISO WARC aggregate/transcript format
  • web interface for operator control and monitoring of crawls

License

Apache License V2.0, GNU Library or Lesser General Public License version 2.0 (LGPLv2)

User Ratings

  • 5 stars: 21
  • 4 stars: 0
  • 3 stars: 0
  • 2 stars: 0
  • 1 star: 0

  • Ease: 4 / 5
  • Features: 4 / 5
  • Design: 4 / 5
  • Support: 4 / 5

User Reviews

  • Cool
  • Cool.
  • Useful project. Thanks
  • Great software, thank you.
  • The app works well in my PC. Serves its purpose too, so no regrets for me.

Additional Project Details

Operating Systems

Linux

Languages

English

Intended Audience

Advanced End Users, Developers, Education, Government, Information Technology, Non-Profit Organizations

User Interface

Web-based

Programming Language

Java

Database Environment

Berkeley/Sleepycat/Gdbm (DBM)

Related Categories

Java Library Management Software, Java Archiving Software, Java Web Scrapers

Registered

2003-02-12