WordPiece is a sub-word tokenisation algorithm that Google introduced in 2012 for its Japanese and Korean voice search system, reused in its 2016 neural machine translation system, and later popularised through BERT. Like BPE, it builds a vocabulary by repeatedly merging smaller units, but it chooses the pair to merge by the gain in training-corpus likelihood rather than by raw pair frequency. Pieces that do not start a word are prefixed with '##'; for example, 'tokenization' might become 'token', '##iz', '##ation'. WordPiece sits behind BERT, DistilBERT and many Google-derived Transformer variants, though newer LLMs have largely moved on to SentencePiece and byte-level BPE.
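To make the '##' convention concrete, here is a minimal sketch of the greedy longest-match-first split that BERT-style WordPiece tokenizers apply to a single word once a vocabulary exists. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, chosen only to reproduce the 'tokenization' example; real vocabularies hold tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry '##'
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for illustration only).
toy_vocab = {"token", "##iz", "##ation", "##s", "[UNK]"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##iz', '##ation']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Note that one unmatched position sends the entire word to the unknown token rather than only the failing fragment, which is one reason byte-level schemes are attractive for open-ended text.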
MEVZU · N°124 · ISTANBUL · YEAR I · VOL. III
Glossary · Intermediate · 2016
WordPiece
Google's likelihood-driven sub-word algorithm, similar in spirit to BPE and used by BERT.
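As a rough illustration of how a likelihood-driven criterion can differ from BPE's, the sketch below scores every adjacent pair two ways: by raw count (BPE) and by count(ab) / (count(a) × count(b)), a commonly cited approximation of the likelihood gain. The formula is an assumption here, not a confirmed description of Google's training code, and the corpus counts are made up for the example.

```python
from collections import Counter


def pair_scores(words):
    """words maps a tuple of symbols to its corpus count.

    Returns (bpe_counts, wp_scores): raw pair counts, and pair counts
    normalised by the product of the two symbols' individual counts.
    """
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, count in words.items():
        for s in symbols:
            unit_freq[s] += count
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += count
    bpe_counts = dict(pair_freq)
    wp_scores = {
        (a, b): f / (unit_freq[a] * unit_freq[b])
        for (a, b), f in pair_freq.items()
    }
    return bpe_counts, wp_scores


# Made-up toy corpus, already split into characters with '##' on non-initial ones.
corpus = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

bpe_counts, wp_scores = pair_scores(corpus)
print(max(bpe_counts, key=bpe_counts.get))  # ('##u', '##g'): the most frequent pair
print(max(wp_scores, key=wp_scores.get))    # ('##g', '##s'): rare parts that co-occur
```

Under this scoring, a pair built from frequent symbols such as '##u' is penalised, so the merge goes to symbols that rarely appear apart.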
- EN (English term): WordPiece
- TR (Turkish term): WordPiece