| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README.md | 2025-02-17 | 5.1 kB | |
| v0.5.0 source code.tar.gz | 2025-02-17 | 583.5 kB | |
| v0.5.0 source code.zip | 2025-02-17 | 625.4 kB | |
| Totals: 3 Items | | 1.2 MB | 0 |
🚨 Breaking changes
- All chunkers except `TokenChunker` have their argument `tokenizer` renamed to `tokenizer_or_token_counter`, to denote that the chunkers support callable token counters as well.
- A `DeprecatedWarning` is now issued when `chunk_overlap>0`; users are encouraged to use `OverlapRefinery` instead, for its speed and flexibility.
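
To illustrate what a callable token counter looks like, here is a minimal sketch that does not import Chonkie. Only the keyword name `tokenizer_or_token_counter` and the `(text: str) -> int` shape come from these notes; `whitespace_token_counter` and `toy_chunk` are hypothetical stand-ins written for this example, not Chonkie's implementation.

```python
from typing import Callable, List


def whitespace_token_counter(text: str) -> int:
    """Hypothetical counter: one token per whitespace-separated word."""
    return len(text.split())


def toy_chunk(
    text: str,
    tokenizer_or_token_counter: Callable[[str], int],
    chunk_size: int,
) -> List[str]:
    """Toy stand-in for a chunker (assumption, not Chonkie code).

    Greedily packs sentences into a chunk until adding the next
    sentence would exceed `chunk_size` tokens, as measured by the
    supplied callable.
    """
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and tokenizer_or_token_counter(candidate) > chunk_size:
            # The budget would be exceeded: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = toy_chunk(
    "One two three. Four five. Six seven eight nine.",
    whitespace_token_counter,
    chunk_size=5,
)
print(chunks)
```

With a real chunker, the same function would presumably be passed through the renamed keyword argument in place of a tokenizer object.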
✨ Highlights
- All chunkers now support a `return_type="texts"` parameter, which makes the chunker output only `List[str]`: you skip the metadata available in the `Chunk` dataclass and get only the texts. This saves a little memory as well.
- All chunkers support a `Callable` in their `tokenizer_or_token_counter` argument. This allows you to pass functions defined like `def token_counter(text: str) -> int: ...` into the chunkers.
- All chunkers that use delimiters (e.g. `SentenceChunker`, `RecursiveChunker`, `LateChunker`) accept `include_delim="next"`, which puts the delimiter in the next chunk. This is useful for processing Markdown files properly.
- Added initial support for Chonkie's pre-processing classes, `Chef`, with `TextChef`, which can handle loading and pre-processing text and Markdown files.
- All `Chunk` dataclasses have `to_dict` and `from_dict` methods, which allow converting `Chunk <--> Dict`. This is especially useful if you want to store chunks as JSON or JSON Lines files.
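
The delimiter placement and the dictionary round trip described above can be sketched in plain Python. This is an illustration under stated assumptions, not Chonkie's code: `split_with_delim` and `ToyChunk` are hypothetical names, the `"prev"` mode is included only for contrast and is an assumption, and the `ToyChunk` fields are invented for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


def split_with_delim(text: str, delim: str, include_delim: str = "next") -> List[str]:
    """Hypothetical splitter showing where the delimiter ends up."""
    parts = text.split(delim)
    if include_delim == "next":
        # Attach each delimiter to the piece that follows it.
        pieces = parts[:1] + [delim + p for p in parts[1:]]
    elif include_delim == "prev":
        # Assumed counterpart: attach each delimiter to the piece before it.
        pieces = [p + delim for p in parts[:-1]] + parts[-1:]
    else:
        pieces = parts
    return [p for p in pieces if p]


@dataclass
class ToyChunk:
    """Stand-in for a chunk dataclass; the field names are assumptions."""
    text: str
    start_index: int
    end_index: int

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ToyChunk":
        return cls(**data)


markdown = "# Title\nIntro.\n## A\nText a.\n## B\nText b.\n"

# With "next", every "## " heading stays attached to its own section.
pieces = split_with_delim(markdown, "\n## ", include_delim="next")

chunks = []
offset = 0
for piece in pieces:
    chunks.append(ToyChunk(piece, offset, offset + len(piece)))
    offset += len(piece)

# JSON Lines: one JSON object per line, then read it back.
jsonl = "\n".join(json.dumps(c.to_dict()) for c in chunks)
restored = [ToyChunk.from_dict(json.loads(line)) for line in jsonl.splitlines()]
```

Because the delimiter travels with the following piece, joining the pieces reproduces the original text, and each Markdown heading opens the chunk that contains its body.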
What's Changed
- [FEAT] Support `return_type` as `texts` for direct text handling by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/146
- [FEAT] Support `return_type` with `texts` output type by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/147
- [FEAT] Add support for callable `token_counter` as input for rule-based Chunkers by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/155
- [DOCS] Benchmarking update by @shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/145
- [DOCS] Update Benchmarks - Include Wikipedia-100k and Wikipedia-500k run timings by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/156
- [Feat] Add `include_delim='next'` as an optional argument in SentenceChunker and RecursiveChunker by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/157
- Update Benchmarks + Remove numpy base dependency + Add `include_delim` in Chunkers by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/158
- [Fix] [#151]: Provide warning to user when `min_sentences_per_chunk` is not satisfied by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/159
- [Minor] Update Benchmarks + Add `include_delim="next"` + Fix [#151] by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/160
- [Fix] Default to Tokenizers while AutoTikTokenizer issues are resolved by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/161
- Shift to HF Tokenizers as the default, until AutoTikTokenizer issues are resolved by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/162
- [Fix] Correct the `_split_text`/`_split_sentence` logic to give proper splits by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/164
- Add chonkbook by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/165
- Add ChonkBook by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/166
- [FEAT] Support Cohere Embeddings for SemanticChunker and SDPMChunker [#118] by @Udayk02 in https://github.com/chonkie-ai/chonkie/pull/130
- Update Example URL by @shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/167
- [Feat] Add `mode="recursive"` in `OverlapRefinery` for all methods by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/168
- Add CohereEmbeddings + `recursive` mode in OverlapRefinery + Initial support for master Tokenizer class by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/169
- Bump up version to `v0.5.0` by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/170
- Add `to_dict` and `from_dict` to all Chonkie data-classes by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/171
- Add tests for types.py + Make the tests pass :) by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/172
- Add `to_dict` + `from_dict` to Chonkie dataclasses + Add `__repr__` to classes by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/173
- [Minor] Remove `token_processor.py` by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/174
- [Feat] Add initial support for Chef via `BaseChef` and `TextChef` by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/175
- [Feat] Add initial support for Chefs via `BaseChef` and `TextChef` by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/176
- [Feat] Switch to using `Chonkie.Tokenizer` for Chunkers, Refineries by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/178
- [Fix] Use default model in `AutoEmbeddings` if `Error: model not found` + bad `__repr__` for `SemanticSentence` by @bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/179
Full Changelog: https://github.com/chonkie-ai/chonkie/compare/v0.4.1...v0.5.0