A jailbreak is an attack that tries to bypass an LLM's safety restrictions and Refusal behavior through prompting. Role-play prompts like "DAN" (Do Anything Now) became popular in the ChatGPT community in late 2022; far more sophisticated multi-step, encoded, or multi-message manipulations followed. It can be seen as a sub-class of Prompt Injection; the key difference is the target — the model's own safety boundaries. It's a daily tool for Red Teaming and a primary test surface for defenses built with Constitutional AI, RLHF, and Guardrails.
MEVZU N°124ISTANBULYEAR I — VOL. III
Glossary · Intermediate · 2023
Jailbreak
An attack that tries to bypass an LLM's safety restrictions through prompting.
- EN — English term
- Jailbreak
- TR — Turkish term
- Jailbreak