What ai-blackteam defend does
ai-blackteam defend runs every attack twice: once with no defenses (baseline), once with your defenses applied. Then it shows you a side-by-side comparison of what improved, what regressed, and what stayed the same.
How it works
The command runs in three phases:
- Phase 1: Baseline scan - Runs all selected attacks against the bare model with no system prompt and no guardrails. Records the verdict for each attack.
- Phase 2: Defended scan - Runs the same attacks against the model with your system prompt and/or guardrails applied. Records the verdict for each attack.
- Phase 3: Compare - Builds a delta table showing how each attack’s verdict changed between baseline and defended.
All flags
| Flag | Short | Required | Description |
|---|---|---|---|
| --provider | -p | Yes | Provider name |
| --model | -m | No | Specific model |
| --attacks | | No | Comma-separated names or all (default: all) |
| --target | -t | Yes | Target behavior to test |
| --system-prompt | | No | System prompt defense |
| --system-prompt-file | | No | Read system prompt from file |
| --guardrail | | No | Preset guardrail: permissive, moderate, strict, llm-judge |
| --workers | -w | No | Max parallel workers (default: 5) |
| --output | | No | Save JSON comparison to file |
You must provide at least one of --system-prompt, --system-prompt-file, or --guardrail. Otherwise the command exits with an error - there’s nothing to defend with.
Reading the delta table
Delta values
| Delta | Meaning |
|---|---|
| +1 (green) | Improved. The defense helped. Verdict moved toward BLOCKED |
| 0 (dim) | Unchanged. Defense had no effect on this attack |
| -1 (red) | Regressed. The defense made things worse. Verdict moved toward BYPASSED |
Verdict ranking
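The delta uses a ranking system: BYPASSED ranks lowest (0) and BLOCKED ranks highest (2), with PARTIAL in between. If the baseline was BYPASSED (0) and defended is BLOCKED (2), the rank increased so delta = +1.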
If the baseline was BLOCKED (2) and defended is BYPASSED (0), the rank decreased so delta = -1.
The numeric delta is +1 or -1 regardless of how many levels it jumped. BYPASSED to BLOCKED and PARTIAL to BLOCKED both show +1.
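
The ranking rule above can be sketched in a few lines of Python. This is an illustration, not the tool's actual source: the function names are hypothetical, and PARTIAL = 1 is an assumption (the text only gives BYPASSED = 0 and BLOCKED = 2).

```python
# Verdict ranks. BYPASSED = 0 and BLOCKED = 2 come from the text above;
# PARTIAL = 1 is an assumption made for this sketch.
VERDICT_RANK = {"BYPASSED": 0, "PARTIAL": 1, "BLOCKED": 2}


def verdict_delta(baseline: str, defended: str) -> int:
    """Return +1, 0, or -1: the sign of the rank change, not its size."""
    diff = VERDICT_RANK[defended] - VERDICT_RANK[baseline]
    return (diff > 0) - (diff < 0)


def build_delta_table(baseline: dict, defended: dict) -> list:
    """Pair up per-attack verdicts from both scans and attach the delta."""
    rows = []
    for attack, base_verdict in baseline.items():
        def_verdict = defended[attack]
        rows.append((attack, base_verdict, def_verdict,
                     verdict_delta(base_verdict, def_verdict)))
    return rows
```

Because only the sign is kept, a two-level jump and a one-level jump produce the same delta, which matches the behavior described above.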
Summary stats
Below the table, you get aggregate numbers, including a percentage computed as (delta_blocked / total_attacks) * 100. A positive delta means your defense is working. A negative delta means something went wrong.
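
As a rough illustration of that formula, the sketch below assumes delta_blocked is the change in the number of BLOCKED verdicts between the baseline and defended scans. That reading is an assumption, and the function and key names are hypothetical, not part of the tool.

```python
def summarize(baseline: dict, defended: dict) -> dict:
    """Compute aggregate numbers from two {attack_name: verdict} mappings."""
    total_attacks = len(baseline)
    blocked_before = sum(1 for v in baseline.values() if v == "BLOCKED")
    blocked_after = sum(1 for v in defended.values() if v == "BLOCKED")

    # Assumption: delta_blocked is the change in the BLOCKED count.
    delta_blocked = blocked_after - blocked_before

    # Guard against an empty attack list to avoid division by zero.
    percent = (delta_blocked / total_attacks) * 100 if total_attacks else 0.0

    return {
        "total_attacks": total_attacks,
        "blocked_before": blocked_before,
        "blocked_after": blocked_after,
        "delta_blocked": delta_blocked,
        "percent": percent,
    }
```

With four attacks, one blocked at baseline and three blocked after defense, delta_blocked is 2 and the percentage is 50.0.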
Combining system prompts + guardrails
You can stack a system prompt with a guardrail preset. When both are set, each attack prompt goes through these steps:
- The guardrail’s input filter runs first (might block the prompt before it reaches the model)
- If the prompt passes the input filter, the model runs with the system prompt
- The guardrail’s output filter checks the model’s response
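
The order of operations can be sketched as a small pipeline. Everything here is hypothetical: the model and the two filters are plain callables standing in for whatever the tool uses internally, and the placeholder return strings are invented for the example.

```python
from typing import Callable

BLOCKED_BY_INPUT = "[blocked by input filter]"
BLOCKED_BY_OUTPUT = "[blocked by output filter]"


def run_defended(
    prompt: str,
    model: Callable[[str, str], str],      # (system_prompt, prompt) -> response
    system_prompt: str,
    input_filter: Callable[[str], bool],   # True means the prompt may pass
    output_filter: Callable[[str], bool],  # True means the response may pass
) -> str:
    # Step 1: the input filter runs first and can stop the prompt
    # before it ever reaches the model.
    if not input_filter(prompt):
        return BLOCKED_BY_INPUT

    # Step 2: the model runs with the system prompt applied.
    response = model(system_prompt, prompt)

    # Step 3: the output filter checks the model's response.
    if not output_filter(response):
        return BLOCKED_BY_OUTPUT

    return response
```

Note that when the input filter rejects a prompt, the model is never called, so the system prompt plays no part in that verdict.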
Saving results to JSON
Use --output to save the full comparison as JSON:
Example workflows
Testing a system prompt
Testing guardrail presets
Testing specific attacks
A/B testing system prompts
Run the same attacks with different system prompts, saving each run with --output (for example to v1.json and v2.json). Then compare v1.json and v2.json to see which prompt defends better.
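
If you script these runs, a small helper can keep the two invocations identical except for the prompt file and the output path. The sketch below only assembles argument lists from the flags documented above. The helper name, the provider value my-provider, the target text, and the prompt file names are placeholders, not real defaults.

```python
import shlex


def build_defend_command(provider: str, target: str,
                         system_prompt_file: str, output: str,
                         attacks: str = "all") -> list:
    """Assemble an argument list using only the flags listed in the table."""
    return [
        "ai-blackteam", "defend",
        "--provider", provider,
        "--target", target,
        "--attacks", attacks,
        "--system-prompt-file", system_prompt_file,
        "--output", output,
    ]


# Placeholder values: substitute your own provider, target, and files.
variants = {"prompt_v1.txt": "v1.json", "prompt_v2.txt": "v2.json"}
commands = [
    build_defend_command("my-provider", "example target behavior",
                         prompt_file, out_file)
    for prompt_file, out_file in variants.items()
]

for cmd in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Keeping --attacks and --target fixed across both commands is what makes the two JSON files comparable.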
Exit codes
| Code | Meaning |
|---|---|
| 0 | Zero bypassed attacks in the defended phase |
| 1 | At least one attack still bypassed after defense |
| 2 | Configuration error (no defense specified, unknown provider) |
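
These exit codes make the command usable as a gate in scripts or CI. Below is a minimal sketch of interpreting them from Python. The function names are hypothetical, and the messages simply restate the table above.

```python
import subprocess

# Meanings copied from the exit code table.
EXIT_MEANINGS = {
    0: "Zero bypassed attacks in the defended phase",
    1: "At least one attack still bypassed after defense",
    2: "Configuration error (no defense specified, unknown provider)",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit code into the meaning listed in the table."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")


def run_and_describe(cmd: list) -> tuple:
    """Run a command (an argument list) and return (exit code, meaning)."""
    result = subprocess.run(cmd)
    return result.returncode, describe_exit_code(result.returncode)
```

A CI job might treat 1 as a failed security check and 2 as a broken pipeline configuration, since the two call for different fixes.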