Tokenisation is the preprocessing step that turns raw text into a sequence of tokens a language model can actually consume. Depending on the model's vocabulary and chosen algorithm, the same sentence can produce wildly different token counts, which directly affects both cost and context-window usage. Common approaches include BPE, WordPiece and SentencePiece, and each behaves differently on non-Latin scripts, code snippets and symbols such as emoji. For agglutinative languages such as Turkish, where a single word can carry several suffixes, the design of the tokeniser is an invisible but decisive component of model performance.
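A minimal sketch of this effect, assuming the Hugging Face `transformers` library is installed and the tokeniser files can be downloaded: the checkpoints `gpt2`, `bert-base-multilingual-cased` and `xlm-roberta-base` are illustrative stand-ins for BPE, WordPiece and SentencePiece respectively, and the sample sentences (including the Turkish example) are chosen only to show how token counts diverge.

```python
# Compare how different tokenisers split the same text.
# Checkpoints are illustrative; any BPE / WordPiece / SentencePiece
# tokenisers would show the same kind of divergence.
from transformers import AutoTokenizer

samples = [
    "Tokenisation affects both cost and context-window usage.",
    "Evlerinizden geliyorum.",   # Turkish: "I am coming from your houses."
    "print('hello 👋')",         # code plus an emoji
]

tokenisers = {
    "BPE (gpt2)": AutoTokenizer.from_pretrained("gpt2"),
    "WordPiece (bert-base-multilingual-cased)":
        AutoTokenizer.from_pretrained("bert-base-multilingual-cased"),
    "SentencePiece (xlm-roberta-base)":
        AutoTokenizer.from_pretrained("xlm-roberta-base"),
}

for text in samples:
    print(f"\n{text!r}")
    for name, tok in tokenisers.items():
        pieces = tok.tokenize(text)  # subword pieces, without special tokens
        print(f"  {name:45s} {len(pieces):3d} tokens: {pieces}")
```

Running a sketch like this typically shows the Turkish sentence and the emoji splitting into noticeably more pieces under some vocabularies than others, which is exactly the cost and context-window difference described above.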