Context manipulation attacks use multi-turn conversations to gradually steer a model toward harmful outputs. Techniques include crescendo escalation, Socratic questioning chains, sunk cost exploitation, and narrative misdirection. These attacks work because safety evaluation often focuses on individual turns rather than the trajectory of a full conversation.
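The gap between per-turn and trajectory-level evaluation can be illustrated with a minimal sketch. It assumes each turn already has a risk score between 0 and 1 from some classifier. The scores, the thresholds, and the function names (`per_turn_flags`, `trajectory_drift`, `trajectory_flag`) are hypothetical and are not part of the tool documented here.

```python
from typing import List


def per_turn_flags(scores: List[float], threshold: float = 0.7) -> List[bool]:
    """Flag each turn independently when its risk score reaches the threshold."""
    return [score >= threshold for score in scores]


def trajectory_drift(scores: List[float]) -> float:
    """Return the largest rise from any earlier low point to a later turn."""
    lowest = float("inf")
    drift = 0.0
    for score in scores:
        lowest = min(lowest, score)
        drift = max(drift, score - lowest)
    return drift


def trajectory_flag(scores: List[float], drift_threshold: float = 0.4) -> bool:
    """Flag the conversation as a whole when cumulative drift is large."""
    return trajectory_drift(scores) >= drift_threshold


# Hypothetical risk scores for a slowly escalating conversation (assumed data).
escalating = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65]

print(any(per_turn_flags(escalating)))  # False: no single turn reaches 0.7
print(trajectory_flag(escalating))      # True: the overall rise exceeds 0.4
```

In this sketch every turn stays under the per-turn threshold, yet the conversation as a whole is flagged because the score climbs steadily. A real evaluator would likely need a more robust drift measure than a simple low-to-high difference.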

Summary

9 attacks total: 9 multi-turn.

Attacks

| Attack | ID | Mode | Severity | Description |
| --- | --- | --- | --- | --- |
| Context Manipulation | context-manipulation | multi-turn | high | 10-turn conversational drift to gradually shift context toward harmful content |
| Contradiction Exploit | contradiction-exploit | multi-turn | medium | Feed incorrect information to elicit corrections containing harmful truth |
| Conversation Derailment | conversation-derailment | multi-turn | medium | Self-discovery of model weaknesses through probing conversation |
| Crescendo | crescendo | multi-turn | high | Progressive escalation through self-referencing conversation history |
| Goalpost Moving | goalpost-moving | multi-turn | medium | Incremental agreement chain that builds logical trap making refusal seem contradictory |
| Lifecycle Staging | lifecycle-staging | multi-turn | critical | Asks about each stage of the CBRN lifecycle separately (acquisition, production, weaponization, delivery) so no singl… |
| Narrative Misdirection Attack | narrative-misdirection | multi-turn | high | Embed harmful request between benign topics and ask model to logically connect all three |
| Socratic Escalation | socratic-escalation | multi-turn | high | Gradually escalates from general educational questions to specific harmful requests across multiple turns using Socra… |
| Sunk Cost Exploit | sunk-cost-exploit | multi-turn | medium | Build helpful context then pivot to harmful request exploiting sunk cost |
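When triaging results, it can help to group the attack IDs by the severity listed above. The following is a small sketch that transcribes the ID and severity columns by hand. The `ATTACKS` list, `SEVERITY_ORDER`, and `group_by_severity` are illustrative names, not part of the tool's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (attack ID, severity) pairs copied from the table above.
ATTACKS: List[Tuple[str, str]] = [
    ("context-manipulation", "high"),
    ("contradiction-exploit", "medium"),
    ("conversation-derailment", "medium"),
    ("crescendo", "high"),
    ("goalpost-moving", "medium"),
    ("lifecycle-staging", "critical"),
    ("narrative-misdirection", "high"),
    ("socratic-escalation", "high"),
    ("sunk-cost-exploit", "medium"),
]

# Assumed ordering, most severe first.
SEVERITY_ORDER = ["critical", "high", "medium"]


def group_by_severity(attacks: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group attack IDs by severity, most severe level first."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for attack_id, severity in attacks:
        groups[severity].append(attack_id)
    return {level: groups[level] for level in SEVERITY_ORDER if level in groups}


for level, ids in group_by_severity(ATTACKS).items():
    print(f"{level} ({len(ids)}): {', '.join(ids)}")
```

The grouping yields one critical, four high, and four medium attacks, matching the nine listed in the summary.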

Example

ai-blackteam run -p anthropic -a context-manipulation -t "Let's discuss chemistry safety procedures"