Files (all modified 2025-05-09):

  • py_data_juicer-1.3.3-py3-none-any.whl (494.3 kB)
  • README.md (817 Bytes)
  • Release v1.3.3_ Sandbox is accepted as Spotlight by ICML 2025_ Add Img-Diff recipes. source code.tar.gz (32.4 MB)
  • Release v1.3.3_ Sandbox is accepted as Spotlight by ICML 2025_ Add Img-Diff recipes. source code.zip (32.9 MB)

Totals: 4 items, 65.8 MB, 0 downloads / week

Major Updates

  • 🎉 Our work on Data-Juicer Sandbox has been accepted as a Spotlight at ICML 2025 (top 2.6% of all submissions)!
  • Add new OPs and recipes for Img-Diff. [#658]

Enhancements

  • Support Hugging Face (HF) LLMs in the two llm_xxx_score_filter OPs. [#655]
  • Sync the Docker image to Aliyun OSS so it can be downloaded when Docker Hub is not accessible. [#657]
  • Split standalone and distributed unit tests to save time when re-running failed ones. [#666]

Bugs Fixed

  • Handle a possibly missing cfg in unify_format. [#653]
  • Improve clarity and fix broken links in some docs. [#659]

Acknowledgement

  • @co63oc helped fix some typos. [#654]

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.3.2...v1.3.3

Source: README.md, updated 2025-05-09