Pre-training is the initial phase in which a language model acquires its general linguistic and world knowledge from large-scale, broadly sourced data. Modern models are trained on terabytes of text, code, and multimodal data, usually with a next-token-prediction (autoregressive) or Masked Language Modeling objective. It is by far the most compute-hungry stage and the largest single investment in most frontier-lab budgets; Scaling Laws research is essentially the science of predicting how capable a model this phase will produce for a given compute and data budget. The 'raw' model that emerges from pre-training does not yet follow instructions, which is why Post-training and Fine-tuning come next.
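To make the next-token-prediction objective concrete, the sketch below computes the autoregressive cross-entropy loss for one toy batch. `TinyLM`, its dimensions, and the random token ids are illustrative placeholders, not any particular lab's setup; a real pre-training run would use a transformer decoder and stream tokenized text from a large corpus.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Minimal stand-in for a decoder-only language model: embeds tokens and
    predicts the next token with a single linear layer (no attention, purely
    to make the objective concrete)."""
    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):          # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.embed(token_ids))

def next_token_loss(model, token_ids):
    # Shift by one position: the model sees tokens 0..T-2 and predicts 1..T-1.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                 # (batch, seq-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Toy batch of random token ids; real pre-training streams trillions of tokens
# produced by a tokenizer over web text, code, and other corpora.
vocab_size = 32_000
model = TinyLM(vocab_size)
batch = torch.randint(0, vocab_size, (4, 128))
loss = next_token_loss(model, batch)
loss.backward()                            # one gradient step of "pre-training"
print(f"next-token cross-entropy: {loss.item():.3f}")
```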
Glossary · Intermediate · 2018
Pre-training
The initial training phase where a model learns general language ability from trillions of tokens of generic data.
- EN (English term): Pre-training
- TR (Turkish term): Ön Eğitim