Name | Modified | Size | Downloads / Week |
---|---|---|---|
py_data_juicer-1.4.1-py3-none-any.whl | 2025-07-16 | 1.8 MB | |
README.md | 2025-07-16 | 1.9 kB | |
Release v1.4.1_ MCP server_ GPU-based Minhash deduplicator_ Improved unit test coverage. source code.tar.gz | 2025-07-16 | 33.4 MB | |
Release v1.4.1_ MCP server_ GPU-based Minhash deduplicator_ Improved unit test coverage. source code.zip | 2025-07-16 | 34.0 MB | |
Totals: 4 Items | | 69.1 MB | 1 |
Major Updates
- 🔧 Introduce the Data-Juicer MCP server. Users can now conveniently access Data-Juicer's data processing capabilities through MCP. [#690] [#737]
- 💪🏻 Unit test coverage is raised to 85%+, and several bugs in test cases (OOM, encoding errors, and so on) are resolved, making Data-Juicer more reliable. [#698] [#717] [#720] [#727]
- 🤝 GPU-based MinHash deduplication is supported, in collaboration with developers from NVIDIA. [#694] [#644]
- 🧩 RayExporter can now export a Ray dataset in more formats in addition to json/jsonl. [#687]
- 🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usage, and sandbox. [#738]
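For readers unfamiliar with the technique behind the new deduplicator, here is a minimal CPU-only sketch of MinHash near-duplicate detection. It is not Data-Juicer's GPU implementation and does not use its API; the function names, the character 5-gram shingling, the 64 hash seeds, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams after lowercasing and
    collapsing whitespace (an assumed normalization)."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_perm=64, k=5):
    """For each of num_perm salted hash functions, keep the minimum
    hash value over all shingles of the text."""
    grams = [g.encode("utf-8") for g in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard
    similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts, threshold=0.8):
    """Keep the first text of each near-duplicate group.
    Pairwise comparison is O(n^2); real systems typically add
    locality-sensitive hashing to avoid comparing every pair."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept
```

Calling `deduplicate` on a list of strings returns the list with later near-duplicates dropped. Computing the per-seed minimum over many documents is the data-parallel part of this workload.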
New Operators
- download_file_mapper: downloads data from URLs to local files or specified fields. [#709]
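To illustrate the behavior described for this operator, the sketch below downloads the URL stored in a sample and either saves it to a local directory or stores the raw bytes in a field. It is a standalone illustration using only the Python standard library, not the operator's actual code or configuration; `download_sample`, `url_key`, `save_dir`, `save_field`, and `local_path` are hypothetical names.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_sample(sample, url_key="url", save_dir=None,
                    save_field=None, timeout=30):
    """Hypothetical helper: fetch sample[url_key] and return a copy of
    the sample with the result attached.

    - save_dir: if given, write the content to a file in this directory
      and record its path under "local_path".
    - save_field: if given, store the raw bytes under this field.
    """
    url = sample[url_key]
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        content = resp.read()

    result = dict(sample)  # leave the input sample untouched
    if save_dir is not None:
        os.makedirs(save_dir, exist_ok=True)
        # Derive a file name from the URL path, with a fallback.
        name = os.path.basename(urlparse(url).path) or "download.bin"
        path = os.path.join(save_dir, name)
        with open(path, "wb") as f:
            f.write(content)
        result["local_path"] = path
    if save_field is not None:
        result[save_field] = content
    return result
```

For example, `download_sample({"url": some_url}, save_dir="downloads")` would return the sample with a `local_path` entry, while passing `save_field="content"` keeps the bytes in the sample instead.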
Enhancements
- New analysis method: correlation analysis among stats is added. [#663]
- Several core dependencies are updated and pinned to newer versions, and dependency conflicts are resolved. [#715] [#717] [#723]
- The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of the sandbox. [#710]
- Apply more reliable pre-commit tools to improve the code style of Data-Juicer. [#714]
- Support storing and processing image bytes data in the dataset. [#725]
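As a rough illustration of what correlation analysis among stats involves, the sketch below computes pairwise Pearson correlations between per-sample statistics. It is an assumption-laden example rather than Data-Juicer's analyzer: `pearson` and `correlation_matrix` are hypothetical helpers, the stat names passed in are up to the caller, and the choice of Pearson correlation is itself an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Returns NaN when either sequence has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(stats):
    """stats maps a stat name to its list of per-sample values.
    Returns a nested dict: matrix[a][b] is the correlation of a and b."""
    names = list(stats)
    return {a: {b: pearson(stats[a], stats[b]) for b in names}
            for a in names}
```

Given a dict such as `{"text_len": [...], "num_words": [...]}`, the resulting matrix shows which stats move together, which can help spot redundant filters.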
Bugs Fixed
- Fix the wheel and Docker image build bug. [#706]
- Fix bugs in log_summarization. [#710]
- Fix "no module named data_juicer" error after installing from the wheel file. [#727]
Acknowledgement
- @fanronghai helped fix the parameter error in the dataset_splitting_by_language tool. [#713]
- @ayushdg helped add support for a GPU-based MinHash deduplicator. [#644]
- @ricksun2023 helped fix the bugs that occur when more than one OP with the same name appears in the configs. [#730]
Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.4.0...v1.4.1