Jailbreak

A jailbreak is an attack that tries to bypass an LLM's safety restrictions and Refusal behavior through prompting. Role-play prompts like "DAN" (Do Anything Now) became popular in the ChatGPT community in late 2022; far more sophisticated multi-step, encoded, or multi-message manipulations followed. It can be seen as a sub-class of Prompt Injection; the key difference is the target — the model's own safety boundaries. It's a daily tool for Red Teaming and a primary test surface for defenses built with Constitutional AI, RLHF, and Guardrails.