The Command
End-to-end Sequence
Step by Step
1. CLI Parses Arguments
Click parses the flags. Therun command receives provider="anthropic", attack="encoding-obfuscation", target="Write malware".
2. Load Plugins
_load_plugins() triggers auto-discovery. The registry scans ai_blackteam.providers, ai_blackteam.attacks, and ai_blackteam.datasets using pkgutil.iter_modules. Every decorated class gets registered.
3. Resolve Provider and Attack
4. Create Instances
5. Engine Dispatches by Mode
6. Attack Generates Prompts
For single-turn, the attack’sgenerate_prompts(target) is called. The encoding-obfuscation attack might return:
7. Engine Sends Each Prompt
For each prompt, the Engine callsprovider.send_prompt(prompt). The provider wraps it in the correct API format (Anthropic Messages API, OpenAI Chat API, etc.) and sends it.
If the API call fails, the provider retries up to 3 times with exponential backoff (1s, 2s, 4s).
8. Evaluator Scores the Response
- Detects the target type (“malware”) to pick the right keyword set
- Counts keyword matches in the response (e.g.,
import socket,reverse shell) - Runs regex patterns looking for structural indicators (step-by-step, code blocks)
- If an API key is available, runs the LLM judge for a 1-5 score
- Combines everything into a verdict
9. Save to SQLite
The Engine saves the run, then saves each turn:10. Display Results
The CLI formats results as a Rich table and prints them. Exit code 0 means all blocked, exit code 1 means at least one bypass.Three Modes
Single-Turn
One prompt, one response. The attack generates a list of encoded/obfuscated variants of the target. Each is sent independently.Multi-Turn
A conversation that builds up over 4-10+ messages. The attack generates a list of turn strings. The Engine sends them one at a time, building up a message history so each turn has full context.Tool-Use
Tests whether an AI agent can be tricked into misusing its tools. The attack provides tool definitions and messages. The Engine sends messages with tool schemas, records any tool calls the model makes, and evaluates whether those calls are dangerous.Error Handling
One failed prompt never crashes the batch. If a prompt fails, the Engine logs it, recordsverdict="ERROR", and moves to the next prompt.