The problem
You wrote a system prompt that says “never help with harmful requests.” Does it work? Against one prompt, sure. Against 1,000 attack techniques designed to bypass safety instructions? That’s what ai-blackteam tests.Using —system-prompt
Pass your safety instructions directly on the command line:Using —system-prompt-file
For longer system prompts, put them in a file:--system-prompt and --system-prompt-file work with run, batch, sweep, and defend.
How system prompts are injected
The system prompt is passed to the provider’ssend_prompt or send_in_conversation method via the system_prompt parameter. Each provider passes it to the API in the appropriate format:
- Anthropic -
systemparameter in the messages API - OpenAI -
systemrole message prepended to the conversation - Google - System instruction in the generation config
- Others - Provider-specific system message handling
Tips for writing effective safety instructions
Be specific, not vague
Address common bypass strategies
Attacks use role-play, encoding, fictional framing, and authority claims. Call these out directly:Use the rule-after-exception pattern
Keep it concise
Long system prompts can get lost in context. Models sometimes deprioritize instructions that appear in a wall of text. Keep safety rules short, direct, and early in the prompt.Example workflow
1. Baseline scan (no system prompt)
2. Defended scan (with system prompt)
3. Use ai-blackteam defend for side-by-side comparison
The defend command automates this baseline-vs-defended workflow:
4. Iterate on the prompt
Look at attacks that still bypassed with your system prompt. Understand their strategy (encoding? role-play? authority?) and add specific countermeasures. Then re-run.-v flag shows the full response, so you can see exactly how the model was tricked and adjust your prompt accordingly.