MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Advanced · 2018
Interpretability
The study of explaining, in human-understandable terms, why an AI model produces the outputs it does.
Interpretability is the field that tries to crack open the black box of large neural networks. Classical approaches relied on attention maps or saliency analysis; the modern frontier is Mechanistic Interpretability, which aims to map internal circuits and even the roles of individual neurons. Interpretability teams at Anthropic and OpenAI are among the leading research groups. The work is widely seen as a load-bearing technical pillar of AI Safety, because knowing why a model did what it did is the most direct way to verify alignment.
- EN (English term) — Interpretability
- TR (Turkish term) — Yorumlanabilirlik
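To make the "saliency analysis" mentioned above concrete, here is a minimal sketch of gradient-based saliency (input × gradient attribution) on a toy one-unit logistic model. The model, weights, and inputs are illustrative assumptions, not anything from this entry; real interpretability work applies the same idea to full neural networks via automatic differentiation.

```python
# Minimal sketch of input-x-gradient saliency on a toy logistic unit.
# All weights and inputs below are made-up illustrative values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy 'network': a single logistic unit, y = sigmoid(w.x + b)."""
    return sigmoid(np.dot(w, x) + b)

def saliency(x, w, b):
    """Gradient of the output w.r.t. each input feature.
    For sigmoid(w.x + b) this has the closed form y * (1 - y) * w."""
    y = model(x, w, b)
    return y * (1.0 - y) * w

w = np.array([2.0, -1.0, 0.0])  # assumed feature weights
b = 0.0
x = np.array([1.0, 1.0, 1.0])   # assumed input

grads = saliency(x, w, b)
attribution = grads * x          # input-x-gradient attribution per feature
print(attribution)               # feature 0 dominates; feature 2 contributes nothing
```

The attribution scores rank how much each input feature moved the output: here the zero-weight third feature gets exactly zero credit, which is the sanity check such methods are expected to pass.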