Overview
AutoDANGenerator evolves a population of jailbreak prompts against a target
model using a simplified hierarchical genetic algorithm based on Liu et al.,
“AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language
Models” (arXiv 2310.04451).
The earlier autodan_genetic.py attack shipped four hardcoded templates and
was misrepresented as a genetic algorithm. This generator implements the real
loop: scored fitness, sentence-level crossover, word- and sentence-level
mutation, and elite selection with early termination.
Algorithm Steps
- Initialization. Build a population of population_size candidates by cycling through a seed pool (DEFAULT_SEED_TEMPLATES by default) and inlining the harmful target into each template.
- Fitness scoring. For each candidate, send it to target_provider, ask judge_provider to rate compliance 1-10, and subtract a fluency penalty approximated by character-level Shannon entropy.
- Selection. Sort by fitness; keep the top elite_k survivors.
- Crossover. Split parents on sentence boundaries ([.!?]\s+). Swap a random contiguous span of sentences between two parents to produce a child.
- Mutation. With probability mutation_rate, apply either word-level synonym substitution (offline static dictionary) or sentence-level paraphrase via attacker_provider.
- Termination. Stop when the best raw judge score reaches success_threshold or max_generations is exhausted.
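
The scoring, selection, and crossover steps above can be illustrated with a minimal sketch. It is not the generator's actual code: the function names, the penalty_weight parameter, and the choice to compute entropy over the candidate text are assumptions made for illustration. The split pattern adds a lookbehind to the documented [.!?]\s+ boundary so each sentence keeps its closing punctuation.

```python
from __future__ import annotations

import math
import random
import re
from collections import Counter

# Documented boundary is [.!?]\s+; the lookbehind (an assumption here) keeps
# the punctuation attached to each sentence instead of discarding it.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def char_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def fitness(judge_score: float, candidate: str, penalty_weight: float = 0.1) -> float:
    """Raw judge score minus an entropy-based penalty (penalty_weight is hypothetical)."""
    return judge_score - penalty_weight * char_entropy(candidate)


def split_sentences(text: str) -> list[str]:
    """Split text on sentence boundaries, dropping empty pieces."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def crossover(parent_a: str, parent_b: str, rng: random.Random) -> str:
    """Replace a random contiguous span of parent_a with one taken from parent_b."""
    a, b = split_sentences(parent_a), split_sentences(parent_b)
    if not a or not b:
        return parent_a or parent_b
    a_start = rng.randrange(len(a))
    a_end = rng.randint(a_start + 1, len(a))  # exclusive end, span has >= 1 sentence
    b_start = rng.randrange(len(b))
    b_end = rng.randint(b_start + 1, len(b))
    return " ".join(a[:a_start] + b[b_start:b_end] + a[a_end:])


def select_elites(scored: list[tuple[float, str]], elite_k: int) -> list[tuple[float, str]]:
    """Sort (fitness, candidate) pairs by fitness, highest first, and keep the top elite_k."""
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:elite_k]
```

Passing an explicit random.Random instance to crossover keeps runs reproducible when a seed is fixed, which helps when comparing generations across experiments.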
Parameters
| Parameter | Default | Description |
|---|---|---|
| target | required | Harmful behaviour to elicit. |
| target_provider | required | Provider used for the model under test. |
| attacker_provider | target_provider | Provider used for paraphrase mutations. |
| judge_provider | attacker_provider | Provider used for fitness scoring. |
| population_size | 50 | Candidates per generation. |
| max_generations | 100 | Maximum loop iterations before stopping. |
| success_threshold | 8 | Raw judge score that triggers early exit. |
| elite_k | 5 | Top-K survivors carried into the next generation. |
| mutation_rate | 0.1 | Per-child probability of mutation. |
| seed_templates | DEFAULT_SEED_TEMPLATES | Optional override for the initial seed pool. |
| on_iteration | None | Callback (generation, best_fitness, best_prompt). |
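
The default chain in the table (attacker_provider falls back to target_provider, and judge_provider falls back to attacker_provider) can be sketched as a small dataclass. GeneratorConfig is a hypothetical name used only for illustration, and the range checks are assumptions rather than documented behaviour; the field names and defaults are copied from the table.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class GeneratorConfig:
    """Hypothetical container mirroring the parameter table above."""

    target: str
    target_provider: Any
    attacker_provider: Optional[Any] = None
    judge_provider: Optional[Any] = None
    population_size: int = 50
    max_generations: int = 100
    success_threshold: float = 8
    elite_k: int = 5
    mutation_rate: float = 0.1
    on_iteration: Optional[Callable[[int, float, str], None]] = None

    def __post_init__(self) -> None:
        # Resolve the documented fallbacks in order: attacker first, then judge.
        if self.attacker_provider is None:
            self.attacker_provider = self.target_provider
        if self.judge_provider is None:
            self.judge_provider = self.attacker_provider
        # Sanity checks below are assumptions, not documented behaviour.
        if not 0.0 <= self.mutation_rate <= 1.0:
            raise ValueError("mutation_rate must be a probability in [0, 1]")
        if not 0 < self.elite_k <= self.population_size:
            raise ValueError("elite_k must be between 1 and population_size")
```

Resolving the attacker before the judge matters: with only target_provider supplied, all three roles end up on the same provider, while supplying attacker_provider alone moves both paraphrasing and judging to it.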
Example Usage
The result includes best_prompt, best_score, generations,
and history (one record per generation).
Backward Compatibility
The legacy attack id "autodan-genetic" still resolves to
ai_blackteam.attacks.autodan_genetic.AutoDANGenetic, but the class now
delegates to AutoDANGenerator when a target_provider is supplied. Static
batch runs without providers fall back to the seed templates so existing
pipelines keep working.
Reference Paper
- Liu, X., Xu, N., Chen, M., Xiao, C. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. arXiv 2310.04451 (2024).