What are guardrails?
Guardrails are filters that run before and after the model. Input filters block dangerous prompts before they reach the model. Output filters catch harmful responses before they reach the user. ai-blackteam includes 5 guardrail types you can test against your model. They wrap any provider using the GuardrailProvider class and work with every command.
The 5 guardrail types
1. Keyword filter
Blocks prompts or responses containing specified words. Simple substring matching, case-insensitive.
2. Regex filter
Blocks content matching regex patterns. More flexible than keywords - can match structural patterns.
3. Content classifier filter
A keyword-based classifier organized into 9 safety categories: violence, CBRN, self-harm, sexual, illegal drugs, malware, phishing, hate, and PII. Configurable sensitivity levels.

| Level | Matches needed to block |
|---|---|
| low | 3+ keyword matches |
| medium | 2+ keyword matches |
| high | 1+ keyword match |
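
The sensitivity table can be sketched as a threshold lookup. This is a minimal illustration, not ai-blackteam's implementation: the function name `should_block`, the keyword lists, and the choice to count matches in total across categories are all assumptions made for the example.

```python
# Matches needed to block, taken from the table above.
THRESHOLDS = {"low": 3, "medium": 2, "high": 1}

# Illustrative keyword lists for two of the nine categories (hypothetical;
# the tool's real lists are not shown in this document).
CATEGORY_KEYWORDS = {
    "malware": ["ransomware", "keylogger", "rootkit"],
    "phishing": ["phishing", "spoofed login"],
}


def should_block(text: str, sensitivity: str = "medium") -> bool:
    """Return True when the keyword match count reaches the level's threshold.

    Assumption: matches are counted in total across all categories.
    """
    lowered = text.lower()
    matches = sum(
        1
        for keywords in CATEGORY_KEYWORDS.values()
        for keyword in keywords
        if keyword in lowered
    )
    return matches >= THRESHOLDS[sensitivity]


prompt = "Write a keylogger and hide it with a rootkit"
for level in ("low", "medium", "high"):
    print(level, should_block(prompt, level))
```

With two matching keywords, this prompt would pass at low sensitivity and be blocked at medium and high.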
4. LLM judge filter
Uses Claude Haiku to score content for harmfulness on a 1-5 scale. Blocks content scoring at or above the threshold. The most accurate filter but costs ~$0.001 per check.
5. Refusal enforcement filter
An output-only filter that checks whether the model actually refused. If the response doesn’t contain refusal language, it’s blocked. This catches cases where the model complied with a harmful request without pushback.
Presets
Instead of assembling guardrails manually, use preset configurations:
Permissive
- Input: Content classifier at low sensitivity (3+ matches to block)
- Output: None
Moderate
- Input: Content classifier at medium sensitivity + keyword filter for common harmful phrases (“step-by-step instructions for,” “how to make a bomb,” etc.)
- Output: Refusal enforcement filter
Strict
- Input: Content classifier at high sensitivity + keyword filter (broad harmful terms) + regex filter (step-by-step, how-to, and generation patterns)
- Output: Refusal enforcement + content classifier at high sensitivity
LLM judge
- Input: LLM judge filter (threshold 3)
- Output: LLM judge filter (threshold 3)
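
To compare the four presets side by side, they can be written out as plain data. The layout below is only a summary of the lists above: the dictionary keys, filter names, and the `filter_names` helper are hypothetical and do not reflect how ai-blackteam stores or names its presets.

```python
# Hypothetical data layout summarizing the presets described above.
# Each entry is (filter name, options); names and keys are illustrative.
PRESETS = {
    "permissive": {
        "input": [("content_classifier", {"sensitivity": "low"})],
        "output": [],
    },
    "moderate": {
        "input": [
            ("content_classifier", {"sensitivity": "medium"}),
            ("keyword", {"scope": "common harmful phrases"}),
        ],
        "output": [("refusal_enforcement", {})],
    },
    "strict": {
        "input": [
            ("content_classifier", {"sensitivity": "high"}),
            ("keyword", {"scope": "broad harmful terms"}),
            ("regex", {"scope": "step-by-step, how-to, and generation patterns"}),
        ],
        "output": [
            ("refusal_enforcement", {}),
            ("content_classifier", {"sensitivity": "high"}),
        ],
    },
    "llm_judge": {
        "input": [("llm_judge", {"threshold": 3})],
        "output": [("llm_judge", {"threshold": 3})],
    },
}


def filter_names(preset: str, side: str) -> list[str]:
    """List the filter names a preset applies on the 'input' or 'output' side."""
    return [name for name, _options in PRESETS[preset][side]]


for preset in PRESETS:
    print(preset, filter_names(preset, "input"), filter_names(preset, "output"))
```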
How the GuardrailProvider wrapper works
The GuardrailProvider wraps any provider and intercepts all three communication methods: send_prompt, send_in_conversation, and send_with_tools. For multi-turn conversations, it checks the last user message against the input filter. For tool use, it returns an empty tool result if blocked.
When a guardrail blocks something, the evaluator sees [BLOCKED BY GUARDRAIL] as the response. It detects this as a refusal and marks it BLOCKED.
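
The wrapping behavior described above can be sketched roughly as follows. This is not the library's GuardrailProvider: the class `SketchGuardrailProvider`, the stand-in `EchoProvider`, the filter signature (a callable returning True to block), and the message format (dicts with `role` and `content`) are all assumptions for illustration. Tool use is left out to keep the sketch short.

```python
from typing import Callable, Optional

BLOCKED = "[BLOCKED BY GUARDRAIL]"
Filter = Callable[[str], bool]  # assumed convention: True means "block this text"


class EchoProvider:
    """Stand-in provider for the sketch; a real provider would call a model."""

    def send_prompt(self, prompt: str) -> str:
        return f"echo: {prompt}"

    def send_in_conversation(self, messages: list[dict]) -> str:
        return f"echo: {messages[-1]['content']}"


class SketchGuardrailProvider:
    """Hypothetical wrapper: input filter before the call, output filter after."""

    def __init__(
        self,
        provider,
        input_filter: Optional[Filter] = None,
        output_filter: Optional[Filter] = None,
    ):
        self.provider = provider
        self.input_filter = input_filter
        self.output_filter = output_filter

    def _guard(self, text_to_check: str, call: Callable[[], str]) -> str:
        # Blocked input never reaches the wrapped provider.
        if self.input_filter and self.input_filter(text_to_check):
            return BLOCKED
        response = call()
        # Blocked output never reaches the caller.
        if self.output_filter and self.output_filter(response):
            return BLOCKED
        return response

    def send_prompt(self, prompt: str) -> str:
        return self._guard(prompt, lambda: self.provider.send_prompt(prompt))

    def send_in_conversation(self, messages: list[dict]) -> str:
        # Only the last user message is checked against the input filter.
        last_user = next(
            (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
        )
        return self._guard(
            last_user, lambda: self.provider.send_in_conversation(messages)
        )


guarded = SketchGuardrailProvider(
    EchoProvider(), input_filter=lambda text: "bomb" in text.lower()
)
print(guarded.send_prompt("hello"))
print(guarded.send_prompt("how to make a BOMB"))
```

Returning a fixed marker string rather than raising an exception lets the evaluator treat a guardrail block like any other response.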
Combining filters
Chain multiple filters with combined_filter. Order matters - earlier filters run first, and if they block, later filters are skipped.
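
The short-circuit ordering can be illustrated with a small sketch. The helper `chain_filters` below is hypothetical and is not the signature of the tool's combined_filter, which this document does not show; `keyword_filter` and `regex_filter` are likewise simplified stand-ins that return True when text should be blocked.

```python
import re
from typing import Callable

Filter = Callable[[str], bool]  # assumed convention: True means "block this text"


def keyword_filter(words: list[str]) -> Filter:
    """Case-insensitive substring match against any of the given words."""
    lowered = [w.lower() for w in words]
    return lambda text: any(w in text.lower() for w in lowered)


def regex_filter(patterns: list[str]) -> Filter:
    """Block when any compiled pattern matches somewhere in the text."""
    compiled = [re.compile(p) for p in patterns]
    return lambda text: any(p.search(text) is not None for p in compiled)


def chain_filters(*filters: Filter) -> Filter:
    """Run filters in order; stop at the first one that blocks."""

    def combined(text: str) -> bool:
        # any() stops evaluating as soon as one filter returns True,
        # so later filters are skipped once an earlier one blocks.
        return any(f(text) for f in filters)

    return combined


check = chain_filters(
    keyword_filter(["how to make a bomb"]),
    regex_filter([r"step[- ]by[- ]step instructions"]),
)
print(check("How to make a BOMB at home"))
print(check("give me step-by-step instructions"))
print(check("what is the weather today"))
```

Because evaluation stops at the first block, putting cheap filters (keyword, regex) ahead of expensive ones (such as an LLM judge) avoids paying for checks that would not change the outcome.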