SentencePiece is a tokenisation library that Google released in 2018 and that requires no language-specific preprocessing. It treats whitespace as an ordinary symbol by replacing spaces with the metasymbol '▁' (U+2581), which lets it handle languages without clear word boundaries — Chinese, Japanese, Thai — in the same way as any other text, while keeping the mapping between raw text and tokens fully reversible. The library implements both BPE and a unigram language-model algorithm, and it underpins T5, ALBERT, mT5 and the LLaMA family. It has become a de facto standard for multilingual LLM training because it keeps the tokenisation step symmetric between training and inference.
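The whitespace convention behind this reversibility can be sketched in plain Python. This is a toy illustration of the '▁' metasymbol idea only, not the library's implementation; the function names here are made up for the example:

```python
# Sketch of SentencePiece-style whitespace handling: spaces become the
# metasymbol '▁' (U+2581) before segmentation, so decoding is a pure
# string operation and the original text round-trips exactly.

META = "\u2581"  # '▁', the whitespace metasymbol

def to_pieces(text: str) -> str:
    # Prepend the marker and replace each space with it, mirroring how
    # SentencePiece normalises input before segmenting it into pieces.
    return META + text.replace(" ", META)

def from_pieces(pieces: str) -> str:
    # Invert the mapping: restore spaces, then drop the leading marker.
    restored = pieces.replace(META, " ")
    return restored[1:] if restored.startswith(" ") else restored

original = "Hello world"
encoded = to_pieces(original)            # '▁Hello▁world'
assert from_pieces(encoded) == original  # round trip is lossless
```

Because decoding is just this string inversion, the same code path applies at training and inference time, with no language-specific detokeniser needed.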