Name | Modified | Size | Downloads / Week |
---|---|---|---|
py_data_juicer-1.4.0-py3-none-any.whl | 2025-06-16 | 448.3 kB | |
README.md | 2025-06-13 | 2.3 kB | |
v1.4.0 Major Refactor for Env Management, Doc, Sandbox_ Derivative Works (TPAMI Survey_ Trinity-RFT _ DetailMaster) source code.tar.gz | 2025-06-13 | 32.9 MB | |
v1.4.0 Major Refactor for Env Management, Doc, Sandbox_ Derivative Works (TPAMI Survey_ Trinity-RFT _ DetailMaster) source code.zip | 2025-06-13 | 33.4 MB | |
Totals: 4 Items | | 66.7 MB | 0 |
Summary: 200+ files changed with 18,535 additions and 3,720 deletions.
Major Refactors & Improvements
- Sandbox Usability (#686):
  - Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
  - Includes the InternVL example as a showcase.
- DJ-Doc Redesign (https://github.com/modelscope/data-juicer/pull/675):
  - Now with multilingual support (English / Chinese) and a modernized style.
- Dependency Management Update (#660, #680):
  - Migrated to `uv` for faster dependency resolution.
  - Added sub-groups for better organization.
New Features & Integrations (#683, #688, #692)
- Additional Repo Supported:
  - Trinity-RFT is now supported by Data-Juicer.
- DJ-Awesome-List:
  - A survey paper accepted by TPAMI'25!
- Synthetic Benchmark Added:
  - DetailMaster, a new benchmark for synthetic data evaluation.
- New Operators Introduced (#673, #701):
  - `llm_analysis_filter`
  - `general_field_filter`
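To give a rough feel for what a field-level filter such as `general_field_filter` does conceptually, here is a minimal sketch in plain Python. It does not use the Data-Juicer API and makes no claim about how the operator is implemented or configured; the helper names `get_field` and `filter_by_field_condition`, the sample layout, and the predicate are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List

Sample = Dict[str, Any]


def get_field(sample: Sample, dotted_key: str) -> Any:
    """Look up a possibly nested field such as 'meta.score'; return None if missing."""
    value: Any = sample
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def filter_by_field_condition(
    samples: Iterable[Sample],
    field: str,
    predicate: Callable[[Any], bool],
) -> List[Sample]:
    """Keep only samples whose field exists and satisfies the predicate."""
    kept = []
    for sample in samples:
        value = get_field(sample, field)
        if value is not None and predicate(value):
            kept.append(sample)
    return kept


# Hypothetical samples; the 'meta.score' field is an assumption for illustration.
samples = [
    {"text": "hello", "meta": {"lang": "en", "score": 0.9}},
    {"text": "bonjour", "meta": {"lang": "fr", "score": 0.4}},
    {"text": "no metadata"},
]

high_quality = filter_by_field_condition(samples, "meta.score", lambda s: s >= 0.5)
print([s["text"] for s in high_quality])  # ['hello']
```

Samples that lack the field are dropped rather than raising an error; this is a design choice of the sketch, not documented behavior of the operator.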
Core Optimizations & Bug Fixes
- Ray Executor Enhancements (#697):
  - File extension detection added.
  - Support for more data formats.
- Startup Time Optimization:
  - Improved startup performance. (#684)
- Text Embedding Support:
  - Added support for text embedding via API and local model. (#681)
- Docker Build Improvement:
  - Ignore installed `distutils` libraries during Docker image building. (#668)
- Mapper Module Fix:
  - Fixed issue with module initialization. (#700)
- Warning Suppression:
  - Suppressed unnecessary warnings from fasttext. (#696)
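The file extension detection mentioned for the Ray executor can be illustrated with a small, generic sketch. It is not the project's implementation: the suffix table, the handling of compression suffixes, and the function name `detect_format` are assumptions made for illustration, and the release notes do not say which formats are supported.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file suffix to a format label (not taken from Data-Juicer).
SUFFIX_TO_FORMAT = {
    ".jsonl": "json",
    ".json": "json",
    ".csv": "csv",
    ".tsv": "csv",
    ".parquet": "parquet",
    ".txt": "text",
}

# Assumed compression suffixes that wrap the real data format.
COMPRESSION_SUFFIXES = {".gz", ".zst", ".bz2"}


def detect_format(path: str) -> Optional[str]:
    """Guess a data format from the file name, ignoring trailing compression suffixes."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Strip wrappers such as '.gz' so 'train.jsonl.gz' is treated like 'train.jsonl'.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return SUFFIX_TO_FORMAT.get(suffixes[-1])


for name in ["data/train.jsonl.gz", "data/part-0001.parquet", "notes"]:
    print(name, "->", detect_format(name))
```

Returning `None` for unknown or missing extensions leaves the fallback decision to the caller, for example raising an error or defaulting to a configured format.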