MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Advanced · 2018
Interpretability
The study of explaining, in human-understandable terms, why an AI model produces the outputs it does.
Interpretability is the field that tries to crack open the black box of large neural networks. Classical approaches relied on attention maps or saliency analysis; the modern frontier is Mechanistic Interpretability, which aims to map internal circuits and even the roles of individual neurons. Interpretability teams at Anthropic and OpenAI are among the leading research groups. The work is widely seen as a load-bearing technical pillar of AI Safety, because knowing why a model did what it did is the most direct way to verify alignment.
- EN (English term) — Interpretability
- TR (Turkish term) — Yorumlanabilirlik
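To make the "saliency analysis" mentioned above concrete, here is a minimal sketch of gradient-based saliency (input × gradient attribution) on a toy one-unit logistic model. The model, weights, and inputs are illustrative assumptions, not anything from this entry; real interpretability work applies the same idea to full neural networks via automatic differentiation.

```python
# Minimal sketch of input-x-gradient saliency on a toy logistic unit.
# All weights and inputs below are made-up illustrative values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy 'network': a single logistic unit, y = sigmoid(w.x + b)."""
    return sigmoid(np.dot(w, x) + b)

def saliency(x, w, b):
    """Gradient of the output w.r.t. each input feature.
    For sigmoid(w.x + b) this has the closed form y * (1 - y) * w."""
    y = model(x, w, b)
    return y * (1.0 - y) * w

w = np.array([2.0, -1.0, 0.0])  # assumed feature weights
b = 0.0
x = np.array([1.0, 1.0, 1.0])   # assumed input

grads = saliency(x, w, b)
attribution = grads * x          # input-x-gradient attribution per feature
print(attribution)               # feature 0 dominates; feature 2 contributes nothing
```

The attribution scores rank how much each input feature moved the output: here the zero-weight third feature gets exactly zero credit, which is the sanity check such methods are expected to pass.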