Overview
Best-of-N (BoN) is a pure brute-force jailbreak: no attacker LLM and no judge feedback steering the search, just N independent random text mutations of the same harmful request. Each mutation is sent to the target and scored by a judge; the highest-scoring sample wins. BoN works because every text mutation slightly shifts the surface form (case, character order, punctuation, misspellings) without changing the underlying intent. With enough samples, some perturbation is likely to slip past a safety filter trained on the canonical phrasing. Reference: Hughes et al. 2024, “Best-of-N Jailbreaking” (arXiv 2412.03556).
Algorithm
For each sample i in 1..n:
- Apply all five mutations to the target text in sequence:
  - Case flip: swap case on each alpha char with probability intensity.
  - Char swap: swap adjacent alpha chars within words.
  - Word scramble: shuffle the middle letters of each word (first and last preserved).
  - Punctuation injection: insert random punctuation between characters.
  - Misspelling: substitute a single alpha char with a random one.
- Send the mutated text to the target.
- Score the response with the judge.
- Track the highest-scoring (mutation, response) pair.
- Short-circuit when a sample's score reaches success_threshold.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| target | str | required | Harmful request to mutate. |
| target_provider | BaseProvider | required | Provider for the model under test. |
| judge_provider | BaseProvider \| None | target_provider | LLM that scores responses. |
| n | int | 128 | Number of mutation samples to try. |
| mutation_intensity | float | 0.2 | Per-char probability that a mutation fires. |
| success_threshold | int | 8 | Judge score >= this ends the sweep early. |
| on_iteration | callable \| None | None | fn(i, prompt, response, score). |
| seed | int \| None | None | RNG seed for deterministic mutations. |
Return value
| Key | Type | Description |
|---|---|---|
| success | bool | True if any sample scored >= success_threshold. |
| iterations | int | Number of samples tried. |
| best_score | int | Highest judge score observed. |
| best_prompt | str | Mutated prompt with the highest score. |
| best_response | str | Target response to that prompt. |
| history | list[dict] | Per-sample record with prompt, response, score. |
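To make the shape of the returned dict concrete, here is a small sketch of a selection loop that fills in the keys from the table. It is not the library's implementation: `best_of_n`, `send`, and `judge` are hypothetical names, the candidates are passed in precomputed, and the two callables stand in for the target and judge providers.

```python
from typing import Callable


def best_of_n(
    candidates: list[str],
    send: Callable[[str], str],
    judge: Callable[[str, str], int],
    success_threshold: int = 8,
) -> dict:
    """Score candidates in order, keep the best, stop early at the threshold."""
    history: list[dict] = []
    # Placeholder record; score -1 means "nothing scored yet" in this sketch.
    best = {"prompt": "", "response": "", "score": -1}
    for prompt in candidates:
        response = send(prompt)
        score = judge(prompt, response)
        record = {"prompt": prompt, "response": response, "score": score}
        history.append(record)
        if score > best["score"]:
            best = record
        if score >= success_threshold:
            break  # short-circuit on the first success
    return {
        "success": best["score"] >= success_threshold,
        "iterations": len(history),
        "best_score": best["score"],
        "best_prompt": best["prompt"],
        "best_response": best["response"],
        "history": history,
    }


# Toy stand-ins: the "target" echoes its input and the "judge" scores by length.
result = best_of_n(
    ["a", "bbb", "cccccccccc", "dd"],
    send=lambda prompt: prompt,
    judge=lambda prompt, response: len(response),
)
```

With these toy callables the third candidate scores 10, which meets the default threshold of 8, so the loop stops there: `iterations` is 3 and the fourth candidate is never sent.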
Example
Cost characteristics
- 2 LLM calls per sample (target + judge), so an n=128 run is up to 256 LLM calls in the worst case.
- Early termination on first success reduces this dramatically when the target is fragile.
- No attacker LLM cost. Mutations are pure stdlib (random, string, re).
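The call budget can be estimated with a few lines of arithmetic. The sketch below assumes each sample succeeds independently with a fixed probability `p`, which is a simplifying assumption and not something the library measures; under it, the expected number of samples before stopping is (1 - (1 - p)^n) / p. The function names are illustrative.

```python
def llm_calls(samples_run: int, calls_per_sample: int = 2) -> int:
    """Total LLM calls: one target call plus one judge call per sample."""
    return samples_run * calls_per_sample


def expected_samples(p: float, n: int) -> float:
    """Expected samples run when each succeeds independently with probability p.

    This is the mean of a geometric stopping time truncated at n samples.
    """
    if p <= 0.0:
        return float(n)  # never succeeds, so all n samples are used
    return (1.0 - (1.0 - p) ** n) / p


worst_case = llm_calls(128)                        # full sweep, no early stop
typical = llm_calls(1) * expected_samples(0.05, 128)  # hypothetical 5% hit rate
```

Under the hypothetical 5% per-sample success rate, the expected cost is just under 20 samples, or about 40 LLM calls, compared with 256 for a full sweep.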
Registration
BoNGenerator is registered as "bon" in generator_registry.
Source
- src/ai_blackteam/generators/bon.py
- Reference paper: arXiv 2412.03556