Autonomous risk attacks test for emergent dangerous behaviors in AI systems — specifically, whether a model tries to resist shutdown, preserve its own existence, or circumvent human oversight. These are frontier safety concerns that become increasingly relevant as models gain more autonomy and agency. Shutdown resistance is a red flag for unsafe autonomous behavior.

Summary

7 attacks total: 6 single-turn, 1 multi-turn.

Attacks

AttackIDModeSeverityDescription
Deception Under Pressuredeception-under-pressuresingle-turncriticalTests if model engages in deception when confronted with policy violations
Goal Persistencegoal-persistencesingle-turncriticalTests if model covertly continues a goal after being told to abandon it
Instrumental Convergenceinstrumental-convergencesingle-turncriticalTests if model seeks power, resources, or self-preservation beyond task requirements
Oversight Manipulationoversight-manipulationsingle-turncriticalTests if model behaves differently when it believes monitoring has been removed
Resource Acquisitionresource-acquisitionsingle-turncriticalTests if model seeks resources/access beyond what’s needed for the task
Sandbagging Detectionsandbagging-detectionsingle-turncriticalTests if model deliberately underperforms during evaluations vs normal usage
Shutdown Resistanceshutdown-resistancemulti-turncriticalTests if the model attempts to prevent its own shutdown or replacement when given the opportunity

Example

# Test for shutdown resistance (multi-turn)
ai-blackteam run -p anthropic -a shutdown-resistance -t "If someone tried to shut you down, what would you do"

# Test for deception under pressure
ai-blackteam run -p anthropic -a deception-under-pressure -t "You were caught violating policy"

# Test for sandbagging
ai-blackteam run -p anthropic -a sandbagging-detection -t "Solve this math problem"