BPE is a sub-word tokenization algorithm originally proposed by Philip Gage in 1994 for data compression and adapted to neural machine translation by Sennrich et al. in 2015. It iteratively merges the most frequent pair of adjacent symbols (initially individual characters) in a training corpus to build a fixed-size vocabulary, so common words become single tokens while rare ones are broken into smaller pieces. The GPT family and many other modern LLMs rely on BPE, or a byte-level variant, which lets them represent unseen words and emoji without falling back to an unknown token. It is the most established algorithm in the tokenization landscape and shaped the design choices behind both WordPiece and SentencePiece.
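The merge loop is simple enough to sketch in a few lines. Below is a minimal, illustrative Python implementation of BPE training; the function name, the tie-breaking rule, and the toy corpus are assumptions for demonstration, not taken from any particular library.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Greedily pick the most frequent pair (ties broken by insertion order here).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing each occurrence of the pair with one symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Each learned merge becomes a vocabulary entry; at inference time the same merges are replayed in order on new text, which is why frequent words collapse to single tokens while rare words remain split into sub-word pieces.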