Files in v1.4.0 (name, modified, size):

  • py_data_juicer-1.4.0-py3-none-any.whl, 2025-06-16, 448.3 kB
  • README.md, 2025-06-13, 2.3 kB
  • v1.4.0 Major Refactor for Env Management, Doc, Sandbox_ Derivative Works (TPAMI Survey_ Trinity-RFT _ DetailMaster) source code.tar.gz, 2025-06-13, 32.9 MB
  • v1.4.0 Major Refactor for Env Management, Doc, Sandbox_ Derivative Works (TPAMI Survey_ Trinity-RFT _ DetailMaster) source code.zip, 2025-06-13, 33.4 MB

Totals: 4 items, 66.7 MB

Summary: 200+ files changed, with 18,535 additions and 3,720 deletions.


🔧 Major Refactors & Improvements

  • 🔄 Sandbox Usability (#686):
      • Supports multiple pipelines, context info, and an environment manager that runs different commands in different environments.
      • Includes the InternVL example as a showcase.

  • 📘 DJ-Doc Redesign (https://github.com/modelscope/data-juicer/pull/675):
      • Now with multilingual support (English / Chinese) and a modernized style.

  • 📦 Dependency Management Update (#660, #680):
      • Migrated to uv for faster dependency resolution.
      • Added dependency sub-groups for better organization.

🌐 New Features & Integrations (#683, #688, #692)

  • 🆕 Additional Repo Supported:
      • Trinity-RFT is now supported by Data-Juicer.

  • 📜 DJ-Awesome-List:
      • A survey paper accepted by TPAMI'25!

  • 🧪 Synthetic Benchmark Added:
      • DetailMaster: a new benchmark for synthetic data evaluation.

  • 🛠️ New Operators Introduced (#673, #701):
      • llm_analysis_filter
      • general_field_filter

🚀 Core Optimizations & Bug Fixes

  • ✅ Ray Executor Enhancements (#697):
      • File extension detection added.
      • Support for more data formats.

  • ⏱️ Startup Time Optimization:
      • Improved startup performance. (#684)

  • 🧠 Text Embedding Support:
      • Added support for text embedding via API and local model. (#681)

  • 🐳 Docker Build Improvement:
      • Ignore installed distutils libraries during Docker image building. (#668)

  • 🛠️ Mapper Module Fix:
      • Fixed an issue with module initialization. (#700)

  • 🗑️ Warning Suppression:
      • Suppressed unnecessary warnings from fasttext. (#696)
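
These notes do not say how the Ray executor's file extension detection (#697) works internally. As a rough illustration of the general idea only, the sketch below picks a data format from the suffixes of a set of file paths. The function name `detect_format`, the suffix-to-format mapping, and the "most common format wins" rule are all assumptions made for this example and are not taken from Data-Juicer's code.

```python
from collections import Counter
from pathlib import Path

# Hypothetical suffix-to-format mapping (an assumption for this sketch,
# not Data-Juicer's actual table of supported formats).
SUFFIX_TO_FORMAT = {
    ".jsonl": "json",
    ".json": "json",
    ".csv": "csv",
    ".tsv": "csv",
    ".parquet": "parquet",
    ".txt": "text",
}


def detect_format(paths):
    """Return the most common known format among the paths, or None."""
    counts = Counter()
    for p in paths:
        # Compare suffixes case-insensitively, e.g. ".JSONL" -> ".jsonl".
        suffix = Path(p).suffix.lower()
        fmt = SUFFIX_TO_FORMAT.get(suffix)
        if fmt is not None:
            counts[fmt] += 1
    if not counts:
        # No file had a recognized suffix.
        return None
    return counts.most_common(1)[0][0]
```

Under these assumptions, a call such as `detect_format(["a.jsonl", "b.JSONL", "c.csv"])` would select the JSON reader, while a directory with no recognized suffixes yields `None` so the caller can fall back to an explicitly configured format.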

📚 Full Changelog

View all changes since v1.3.3 →

Source: README.md, updated 2025-06-13