WordPiece

WordPiece is a sub-word tokenisation algorithm that Google developed in 2012 for its Japanese and Korean voice search system and later popularised through BERT. Like BPE, it builds a vocabulary by repeatedly merging smaller units, but it chooses the pair whose merge most increases the likelihood of the training corpus rather than the pair that is simply most frequent. Pieces that continue a word rather than start one are prefixed with '##'; for example, 'tokenization' might become 'token', '##iz', '##ation'. WordPiece sits behind BERT, DistilBERT and many Google-derived Transformer variants, though newer LLMs have largely moved on to byte-level BPE and SentencePiece-based tokenisers.
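The two ideas above, likelihood-style pair selection and '##' continuation pieces, can be illustrated with a minimal Python sketch. It rests on some assumptions: the score count(ab) / (count(a) * count(b)) is a commonly cited approximation of the likelihood gain, not a confirmed description of Google's training code; the splitting routine follows the greedy longest-match-first strategy used by BERT-style tokenisers; and the names pair_scores, tokenize, the toy corpus and the toy vocabulary are all hypothetical.

```python
from collections import Counter


def pair_scores(words):
    """Score each adjacent pair as count(ab) / (count(a) * count(b)).

    `words` is a list of (pieces, frequency) tuples. The formula is an
    assumed stand-in for the likelihood gain of merging a and b.
    """
    unit_counts, pair_counts = Counter(), Counter()
    for pieces, freq in words:
        for piece in pieces:
            unit_counts[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_counts[(a, b)] += freq
    return {
        (a, b): n / (unit_counts[a] * unit_counts[b])
        for (a, b), n in pair_counts.items()
    }


def tokenize(word, vocab, unk="[UNK]"):
    """Split one word greedily, always taking the longest matching piece."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a non-initial piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy corpus: words already split into characters, with frequencies.
corpus = [
    (["h", "##u", "##g"], 10),
    (["p", "##u", "##n"], 12),
    (["b", "##u", "##n"], 4),
    (["h", "##u", "##g", "##s"], 5),
]
scores = pair_scores(corpus)
best_pair = max(scores, key=scores.get)

toy_vocab = {"token", "##iz", "##ation"}
print(best_pair)
print(tokenize("tokenization", toy_vocab))
```

In this toy corpus the most frequent pair is ('##u', '##n') with 16 occurrences, which is what a BPE-style frequency rule would merge first. The score instead favours ('##g', '##s'): it occurs only 5 times, but '##s' never appears without '##g' before it, so the merge explains the data better relative to how common its parts are.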