SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems in which the vocabulary size is fixed before the neural model is trained. It implements subword units, such as byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo], and extends them with direct training from raw sentences. This makes it possible to build a purely end-to-end system that does not depend on language-specific pre- or postprocessing. SentencePiece is purely data driven: it trains its tokenization and detokenization models from sentences, and pre-tokenization (Moses tokenizer, MeCab, KyTea) is not always required. It treats sentences simply as sequences of Unicode characters, with no language-dependent logic.
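The idea of training subword units directly from raw sentences can be illustrated with a small, self-contained sketch. This is a toy BPE written for illustration only, not SentencePiece's implementation or API: the function names are hypothetical, and escaping spaces with the "▁" (U+2581) marker is modeled on the library's convention but simplified here. Because whitespace is kept as an ordinary symbol, detokenization is just concatenation, with no language-specific rules.

```python
from collections import Counter

WS = "\u2581"  # visible whitespace marker used by this sketch

def to_symbols(sentence):
    # Treat the raw sentence as a plain sequence of Unicode characters,
    # escaping spaces so they survive tokenization.
    return list(sentence.replace(" ", WS))

def apply_merge(seq, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(sentences, num_merges):
    # Toy BPE training: repeatedly merge the most frequent adjacent pair.
    corpus = [to_symbols(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Break ties by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(seq, best) for seq in corpus]
    return merges

def encode(sentence, merges):
    # Apply the learned merges in the order they were learned.
    seq = to_symbols(sentence)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

def decode(pieces):
    # Detokenize: concatenate the pieces and restore the spaces.
    return "".join(pieces).replace(WS, " ")

if __name__ == "__main__":
    merges = learn_merges(["low lower lowest", "new newer"], 8)
    pieces = encode("newer low", merges)
    print(pieces)
    print(decode(pieces))
```

The round trip is lossless as long as the input does not already contain the marker character. The same code runs unchanged on text in any script, since it only ever looks at Unicode characters.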

Features

Multiple subword algorithms
Subword regularization
Fast and lightweight
Self-contained
Direct vocabulary id generation
NFKC-based normalization
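
As a rough illustration of what NFKC-based normalization does to input text, the sketch below uses Python's standard `unicodedata` module. It shows plain Unicode NFKC only. SentencePiece's own normalization rules may differ in detail, and the `normalize` helper is a hypothetical name used for this example.

```python
import unicodedata

def normalize(text):
    # Apply Unicode NFKC: compatibility decomposition, then composition.
    return unicodedata.normalize("NFKC", text)

if __name__ == "__main__":
    samples = [
        "\uff21\uff22\uff23",  # full-width Latin letters
        "\ufb01",              # the "fi" ligature
        "\u2460",              # circled digit one
        "\uff76",              # half-width katakana KA
    ]
    for s in samples:
        print(repr(s), "->", repr(normalize(s)))
```

Normalizing before training and before encoding means that visually equivalent variants of a character map to the same vocabulary entries.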

License

Apache License V2.0

Additional Project Details

Operating Systems

Mac

Programming Language

C++

Related Categories

C++ Machine Learning Software

Registered

2021-10-06
