| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README.md | 2024-12-24 | 3.0 kB | |
| v1.10.1 - rebuild with UD 2.15 source code.tar.gz | 2024-12-24 | 1.3 MB | |
| v1.10.1 - rebuild with UD 2.15 source code.zip | 2024-12-24 | 1.6 MB | |
| Totals: 3 Items | | 2.8 MB | 3 |
In this release, we rebuild all of the models with UD 2.15, allowing for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish. We also add an Albanian model composed of the two available UD treebanks and an Old English model based on a prototype dataset not yet published in UD.
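Once the rebuilt models are downloaded, a newly added language works like any other. A minimal sketch, assuming `ka` is the language code Stanza uses for Georgian (substitute the code for Komi Zyrian, Low Saxon, Ottoman Turkish, Albanian, or Old English as appropriate):

```python
import stanza

# Download the Georgian models built from UD 2.15
# ("ka" is assumed here to be the Georgian language code)
stanza.download("ka")

# Build a pipeline and annotate a short document
nlp = stanza.Pipeline("ka")
doc = nlp("გამარჯობა")  # "hello" in Georgian
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos)
```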
Other notable changes:
- Include a contextual lemmatizer in English for `'s` -> `be` or `have` in the `default_accurate` package (see the first sketch after this list). A HI (Hindi) model is also built, with others potentially to follow. Now with fewer bugs at startup. https://github.com/stanfordnlp/stanza/pull/1422
- Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold https://github.com/stanfordnlp/stanza/commit/ad1f938276ef81ac9a602d7f1f21f50fd67e5d24
- PyTorch compatibility: set `weights_only=True` when loading models (see the second sketch after this list). https://github.com/stanfordnlp/stanza/pull/1430 https://github.com/stanfordnlp/stanza/issues/1429
- augment MWT tokenization to accommodate unexpected `'` characters, including `"` used in `"s`: https://github.com/stanfordnlp/stanza/pull/1437 https://github.com/stanfordnlp/stanza/issues/1436
- when training the lemmatizer, take advantage of `CorrectForm` annotations in the UD treebanks: https://github.com/stanfordnlp/stanza/commit/dbdf429aff4175fec33856501e6899e96b390e86
- add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: https://github.com/stanfordnlp/stanza/commit/99f7038634101ea7b92140696c8383a333af1cbc
- add VLSP 2023 constituency dataset: https://github.com/stanfordnlp/stanza/commit/1159d0db8ea1d20c6cf9fb37f8fa8676e0f60f49
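A minimal sketch of picking up the new contextual English lemmatizer via the `default_accurate` package; the example sentences are illustrative, chosen so that `'s` resolves to `have` in one and `be` in the other:

```python
import stanza

# The contextual lemmatizer ships in the default_accurate package
stanza.download("en", package="default_accurate")
nlp = stanza.Pipeline("en", package="default_accurate")

# "'s" should be lemmatized as "have" or "be" depending on context
doc = nlp("He's done it before.  He's a doctor.")
for sentence in doc.sentences:
    print([(word.text, word.lemma) for word in sentence.words])
```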
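The `weights_only` change follows PyTorch's own hardening of `torch.load`. A generic sketch of the pattern, not Stanza's internal loading code (the `model.pt` path is a placeholder for illustration):

```python
import torch

# Save a small state dict so the example is self-contained
torch.save({"weight": torch.zeros(2, 2)}, "model.pt")

# weights_only=True restricts unpickling to tensors and basic
# containers, avoiding execution of arbitrary pickled code from
# untrusted checkpoint files
checkpoint = torch.load("model.pt", weights_only=True, map_location="cpu")
print(checkpoint["weight"].shape)
```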
Bugfixes:
- `raise_for_status` earlier when failing to download something, so that the proper error gets displayed (see the sketch after this list). Thank you @pattersam https://github.com/stanfordnlp/stanza/pull/1432
- Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: https://github.com/stanfordnlp/stanza/commit/53081c28ba3128fc89ad36919762a54f6cb88f77
- reset the start/end character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: https://github.com/stanfordnlp/stanza/commit/1a36efb53135e53dd40ad550bc3a659c81b15980 https://github.com/stanfordnlp/stanza/issues/1436
- similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: https://github.com/stanfordnlp/stanza/commit/215c69e53bf9f11e174b82bb064767749f7dd403
- missing text for a Document does not cause the NER model to crash: https://github.com/stanfordnlp/stanza/commit/07326289ce0efef1ba17a0632c011652f884363c https://github.com/stanfordnlp/stanza/issues/1428
- tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: https://github.com/stanfordnlp/stanza/commit/f59ccd86b9d146737dd5c0325ac31e4da814ddfa https://github.com/stanfordnlp/stanza/issues/1423
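For context on the download fix: `requests.Response.raise_for_status` raises an `HTTPError` for 4xx/5xx responses, so calling it before consuming the body surfaces the real HTTP error instead of a confusing failure later when parsing a truncated or HTML error page. A generic sketch of the pattern, not Stanza's actual download logic (the function name and arguments are hypothetical):

```python
import requests

def download_file(url: str, dest: str) -> None:
    resp = requests.get(url, stream=True)
    # Raise immediately on a 4xx/5xx response so the genuine HTTP
    # error is reported, rather than a later failure on a bad body
    resp.raise_for_status()
    with open(dest, "wb") as fout:
        for chunk in resp.iter_content(chunk_size=1 << 16):
            fout.write(chunk)
```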