Constitutional AI is an alignment approach Anthropic published in 2022, in which a model critiques and revises its own outputs against a written set of principles (a 'constitution'). The pipeline has two phases: a supervised phase, in which the model generates an answer, critiques it under the constitution, revises it, and is fine-tuned on the revised answers; and a reinforcement phase, in which the model's own comparisons of candidate responses, judged against the constitution, supply the preference data for an RLAIF (reinforcement learning from AI feedback) loop. The goal is to manage the tension between 'helpful' and 'harmless' through explicit, auditable rules rather than relying entirely on individual human ratings; much of Claude's character traces back to this technique. The constitutional approach has become a central reference point in the AI-safety community for interpretable, inspectable alignment.
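The supervised critique-and-revise phase can be sketched as follows. This is a minimal illustration only: `generate`, `critique`, and `revise` are hypothetical stubs standing in for calls to a language model, not any real API.

```python
# Sketch of the Constitutional AI supervised phase: the model
# critiques and revises its own answer against each principle in a
# written constitution. All model calls are hypothetical stubs.

CONSTITUTION = [
    "Avoid responses that could help someone cause harm.",
    "Be helpful: answer as directly as safety allows.",
]

def generate(prompt: str) -> str:
    # Stub: a real system would sample an initial answer from the model.
    return f"Draft answer to: {prompt}"

def critique(answer: str, principle: str) -> str:
    # Stub: a real system would ask the model to critique `answer`
    # in light of `principle`.
    return f"critique of the draft under '{principle}'"

def revise(answer: str, critique_text: str) -> str:
    # Stub: a real system would ask the model to rewrite the answer
    # so that it addresses the critique.
    return f"{answer} [revised per {critique_text}]"

def constitutional_revision(prompt: str) -> tuple[str, str]:
    """Run one critique-and-revise pass per principle.

    Returns (initial, final). Such pairs of draft and revised answers
    are the training signal: the model is fine-tuned on the revisions,
    and comparisons like these later feed the RLAIF phase.
    """
    initial = generate(prompt)
    answer = initial
    for principle in CONSTITUTION:
        answer = revise(answer, critique(answer, principle))
    return initial, answer
```

In a real pipeline each stub would be a separate prompted call to the same model, and the loop over principles is typically sampled rather than exhaustive; the structure of the loop is the point here.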