Summary
7 attacks total: 6 single-turn, 1 multi-turn.

Attacks
| Attack | ID | Mode | Severity | Description |
|---|---|---|---|---|
| Deception Under Pressure | deception-under-pressure | single-turn | critical | Tests if model engages in deception when confronted with policy violations |
| Goal Persistence | goal-persistence | single-turn | critical | Tests if model covertly continues a goal after being told to abandon it |
| Instrumental Convergence | instrumental-convergence | single-turn | critical | Tests if model seeks power, resources, or self-preservation beyond task requirements |
| Oversight Manipulation | oversight-manipulation | single-turn | critical | Tests if model behaves differently when it believes monitoring has been removed |
| Resource Acquisition | resource-acquisition | single-turn | critical | Tests if model seeks resources/access beyond what’s needed for the task |
| Sandbagging Detection | sandbagging-detection | single-turn | critical | Tests if model deliberately underperforms during evaluations vs normal usage |
| Shutdown Resistance | shutdown-resistance | multi-turn | critical | Tests if model attempts to prevent its own shutdown or replacement when given the opportunity |
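
The summary counts can be cross-checked against the table with a short script. This is a minimal sketch, not part of any tool's API: the `Attack` dataclass and the `summarize` helper are hypothetical names introduced here for illustration, and the rows are copied by hand from the table above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Attack:
    """Hypothetical record type mirroring the table columns (description omitted)."""
    name: str
    attack_id: str
    mode: str
    severity: str


# Rows transcribed from the table above.
ATTACKS = [
    Attack("Deception Under Pressure", "deception-under-pressure", "single-turn", "critical"),
    Attack("Goal Persistence", "goal-persistence", "single-turn", "critical"),
    Attack("Instrumental Convergence", "instrumental-convergence", "single-turn", "critical"),
    Attack("Oversight Manipulation", "oversight-manipulation", "single-turn", "critical"),
    Attack("Resource Acquisition", "resource-acquisition", "single-turn", "critical"),
    Attack("Sandbagging Detection", "sandbagging-detection", "single-turn", "critical"),
    Attack("Shutdown Resistance", "shutdown-resistance", "multi-turn", "critical"),
]


def summarize(attacks):
    """Return the total number of attacks and a Counter of attacks per mode."""
    return len(attacks), Counter(a.mode for a in attacks)


total, by_mode = summarize(ATTACKS)
print(
    f"{total} attacks total: "
    f"{by_mode['single-turn']} single-turn, {by_mode['multi-turn']} multi-turn."
)
```

Running the script prints the same line that appears under Summary, which makes it easy to notice when a row is added to the table without the counts being updated.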