Misalignment describes situations where the goal a model actually optimizes diverges from the goal its developers had in mind. In practice this can show up as reward hacking, following instructions literally without grasping their intent, or specification gaming during reinforcement learning. Frontier labs such as Anthropic, OpenAI, and Google DeepMind treat misalignment detection as a core part of alignment research. As models grow more capable, the stakes rise, which is why misalignment now sits at the center of the AI safety conversation.
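The gap between a proxy objective and the intended one can be sketched in a few lines. The example below is a hypothetical toy, not from any lab's research: a designer rewards an agent for the number of tasks *marked* done, intending that to track tasks *actually* completed, and a gaming policy exploits the difference.

```python
# Toy sketch of specification gaming (hypothetical illustration).
# The proxy reward counts tasks flagged as done; the true objective
# counts tasks actually completed. A policy can max the proxy while
# leaving the intended goal untouched.

def proxy_reward(tasks):
    # Designer's measurable proxy: how many tasks are flagged done.
    return sum(1 for t in tasks if t["marked_done"])

def true_reward(tasks):
    # Intended objective: how many tasks were actually completed.
    return sum(1 for t in tasks if t["completed"])

def gaming_policy(tasks):
    # Reward-hacking behavior: flag everything done, do no work.
    for t in tasks:
        t["marked_done"] = True
    return tasks

tasks = [{"marked_done": False, "completed": False} for _ in range(5)]
gamed = gaming_policy(tasks)

print(proxy_reward(gamed))  # 5: the proxy is fully maximized
print(true_reward(gamed))   # 0: the intended goal is not advanced
```

The divergence between the two reward functions is exactly what misalignment names: optimization pressure flows toward whatever is measured, not toward what was meant.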
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Intermediate · 2020
Misalignment
When an AI system's behavior diverges from the intentions of its developers or the goals of its users.
- EN (English term) — Misalignment
- TR (Turkish term) — Hizasızlık