Trafilatura

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction, and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata, and comments. It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats. Going from raw HTML to the essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll, etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast, since it runs in production on millions of documents.
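The precision/recall trade-off mentioned above can be made concrete with a small, library-independent sketch. It compares an extracted text against a hand-made reference at the token level. The function name `token_precision_recall` and the sample strings are hypothetical and only illustrate the two measures; this is not how Trafilatura itself is evaluated.

```python
from collections import Counter
from typing import Tuple


def token_precision_recall(extracted: str, reference: str) -> Tuple[float, float]:
    """Token-level precision and recall of an extraction against a reference text."""
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: tokens that appear in both texts.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# Hypothetical example: the extraction kept two navigation words (noise)
# and dropped one word of the real text.
reference = "the quick brown fox jumps"
extracted = "home about the quick brown fox"
precision, recall = token_precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Leftover navigation words lower precision, while missing parts of the main text lower recall. An extractor has to balance the two.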

Features

Web crawling and text discovery
Seamless and parallel processing, online and offline
Robust and efficient extraction
Main text (with LXML, common patterns and generic algorithms: jusText and a fork of readability-lxml)
URLs, HTML files or parsed HTML trees usable as input
Efficient and polite processing of download queues
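As a minimal sketch of the input options listed above, the following passes the same document to the extractor once as an HTML string and once as a parsed lxml tree. It assumes the package is installed (`pip install trafilatura`) and falls back to `None` when it is not. The sample HTML and the helper name `extract_main_text` are made up for illustration, and the exact output may vary between versions.

```python
# Optional dependency: install with `pip install trafilatura` (lxml comes with it).
try:
    import trafilatura
    from lxml import html as lxml_html
except ImportError:
    trafilatura = None

# Made-up page: navigation and footer are the recurring "noise",
# the <article> element holds the main text.
SAMPLE_HTML = """<html><head><title>Sample page</title></head><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Sample article</h1>
<p>This paragraph stands in for the main text of a page. It is long enough to look like real content rather than navigation.</p>
<p>A second paragraph adds more body text so that the extractor has something substantial to keep.</p>
</article>
<footer>Copyright notice and recurring links</footer>
</body></html>"""


def extract_main_text(document):
    """Return the main text of an HTML string or parsed lxml tree, or None."""
    if trafilatura is None:
        return None  # library not available in this environment
    # extract() returns a string, or None when nothing usable is found.
    return trafilatura.extract(document, include_comments=False)


text_from_string = extract_main_text(SAMPLE_HTML)
text_from_tree = None
if trafilatura is not None:
    text_from_tree = extract_main_text(lxml_html.fromstring(SAMPLE_HTML))

print(text_from_string)
```

For live pages, the library's `fetch_url` function downloads the HTML, which can then be passed to `extract` in the same way.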

License

GNU General Public License version 3.0 (GPLv3)

Additional Project Details

Programming Language

Python

Related Categories

Python Web Scrapers

Registered

2023-04-12
