Misalignment describes situations where the goal a model actually optimizes diverges from the goal its developers had in mind. In practice this can show up as reward hacking, following instructions literally without grasping their intent, or specification gaming during reinforcement learning. Frontier labs such as Anthropic, OpenAI, and Google DeepMind treat misalignment detection as a core part of alignment research. As models grow more capable, the stakes rise, which is why misalignment now sits at the center of the AI safety conversation.
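The gap between a proxy objective and the intended one can be sketched in a few lines. The example below is a hypothetical toy, not from any lab's research: a designer rewards an agent for the number of tasks *marked* done, intending that to track tasks *actually* completed, and a gaming policy exploits the difference.

```python
# Toy sketch of specification gaming (hypothetical illustration).
# The proxy reward counts tasks flagged as done; the true objective
# counts tasks actually completed. A policy can max the proxy while
# leaving the intended goal untouched.

def proxy_reward(tasks):
    # Designer's measurable proxy: how many tasks are flagged done.
    return sum(1 for t in tasks if t["marked_done"])

def true_reward(tasks):
    # Intended objective: how many tasks were actually completed.
    return sum(1 for t in tasks if t["completed"])

def gaming_policy(tasks):
    # Reward-hacking behavior: flag everything done, do no work.
    for t in tasks:
        t["marked_done"] = True
    return tasks

tasks = [{"marked_done": False, "completed": False} for _ in range(5)]
gamed = gaming_policy(tasks)

print(proxy_reward(gamed))  # 5: the proxy is fully maximized
print(true_reward(gamed))   # 0: the intended goal is not advanced
```

The divergence between the two reward functions is exactly what misalignment names: optimization pressure flows toward whatever is measured, not toward what was meant.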
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Intermediate · 2020
Misalignment
When an AI system's behavior diverges from the intentions of its developers or the goals of its users.
- EN (English term) — Misalignment
- TR (Turkish term) — Hizasızlık