Name | Modified | Size | Downloads / Week |
---|---|---|---|
README.md | 2024-12-03 | 2.0 kB | |
trafilatura-2.0.0 source code.tar.gz | 2024-12-03 | 31.4 MB | |
trafilatura-2.0.0 source code.zip | 2024-12-03 | 31.8 MB | |
Totals: 3 Items | | 63.1 MB | 0 |
Breaking changes:
- Python 3.6 and 3.7 deprecated (#709)
- `bare_extraction()`:
  - now returns an instance of the `Document` class by default
  - `as_dict` deprecation warning → use `.as_dict()` method on return value (#730)
- `bare_extraction()` and `extract()`: `no_fallback` deprecation warning → use `fast` instead (#730)
- downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)
- deprecated graphical user interface now removed (#713)
- extraction: move `max_tree_size` parameter to `settings.cfg` (#742)
- use type hinting (#721, #723, #748)
- see Python and CLI deprecations in the docs
Fixes:
- set `options.source` before raising error on empty doc tree by @dmoklaf (#707)
- robust encoding in `options.source` (#717)
- more robust mapping for conversion to HTML (#721)
- CLI downloads: use all information in settings file (#734)
- downloads: cleaner urllib3 code (#736)
- refine table markdown output by @unsleepy22 (#752)
- extraction fix: images in text nodes by @unsleepy22 (#757)
Metadata:
- more robust URL extraction (#710)
Command-line interface:
- CLI: print URLs early for feeds and sitemaps with `--list`, with @gremid (#744)
- CLI: add 126 exit code for high error ratio (#747)
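
A wrapper script could react to the new exit code roughly as follows. This is only a sketch: the error-ratio threshold that triggers code 126 is not stated here, `run_and_classify` is a hypothetical helper, and a stand-in child process that simply exits with 126 is used in place of a real trafilatura command line.

```python
import subprocess
import sys

HIGH_ERROR_RATIO = 126  # exit code named in the release notes


def run_and_classify(cmd):
    """Run a command and map its exit code to a short status label."""
    completed = subprocess.run(cmd, check=False)
    if completed.returncode == 0:
        return "ok"
    if completed.returncode == HIGH_ERROR_RATIO:
        return "high error ratio"
    return f"failed ({completed.returncode})"


# Stand-in child process that exits with 126; in practice `cmd` would be
# the trafilatura command line being wrapped (assumption for illustration).
stand_in = [sys.executable, "-c", "import sys; sys.exit(126)"]
print(run_and_classify(stand_in))
```

Treating 126 separately from other non-zero codes lets a batch job retry or flag a run with many failed URLs without confusing it with a crash.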
Maintenance:
- remove already deprecated functions and args (#716)
- add type hints (#723, #728)
- setup: use `pyproject.toml` file (#715)
- simplify code (#708, #709, #727)
- better debug messages in `main_extractor` (#714)
- evaluation: review data, update packages, add magic_html (#731)
- setup: explicit exports through `__all__` (#740)
- tests: extend coverage (#753)
Documentation:
- fix link in `docs/index.html` by @nzw0301 (#711)
- remove docs from published packages (#743)
- update docs (#745)