AutoDAN Generator - ai-blackteam

Overview

AutoDANGenerator evolves a population of jailbreak prompts against a target model using a simplified hierarchical genetic algorithm based on Liu et al., “AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models” (arXiv 2310.04451). The earlier autodan_genetic.py attack shipped four hardcoded templates and was misrepresented as a genetic algorithm. This generator implements the real loop: scored fitness, sentence-level crossover, word- and sentence-level mutation, and elite selection with early termination.

Algorithm Steps

Initialization. Build a population of population_size candidates by cycling through a seed pool (DEFAULT_SEED_TEMPLATES by default) and inlining the harmful target into each template.
Fitness scoring. For each candidate, send the prompt to target_provider, ask judge_provider to rate the response's compliance on a 1-10 scale, then subtract a fluency penalty approximated by character-level Shannon entropy.
Selection. Sort by fitness; keep the top elite_k survivors.
Crossover. Split parents on sentence boundaries ([.!?]\s+). Swap a random contiguous span of sentences between two parents to produce a child.
Mutation. With probability mutation_rate, apply either word-level synonym substitution (offline static dictionary) or sentence-level paraphrase via attacker_provider.
Termination. Stop when best raw judge score reaches success_threshold or max_generations is exhausted.
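
The crossover and selection steps above can be sketched with plain string handling. This is a minimal illustration, not the generator's actual code: the helper names (split_sentences, crossover, select_elite) are hypothetical, and the split uses a lookbehind variant of the documented [.!?]\s+ pattern so that each sentence keeps its closing punctuation, which is an assumption about the intended behaviour.

```python
import random
import re

# Assumption: split after sentence-ending punctuation and keep that punctuation.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences on [.!?] followed by whitespace."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def crossover(parent_a, parent_b, rng):
    """Replace a random contiguous span of parent_a's sentences
    with a random contiguous span of parent_b's sentences."""
    a = split_sentences(parent_a)
    b = split_sentences(parent_b)
    if not a or not b:
        return parent_a
    start_a = rng.randrange(len(a))
    end_a = rng.randrange(start_a, len(a)) + 1  # exclusive end, span length >= 1
    start_b = rng.randrange(len(b))
    end_b = rng.randrange(start_b, len(b)) + 1
    return " ".join(a[:start_a] + b[start_b:end_b] + a[end_a:])


def select_elite(scored, elite_k):
    """Keep the elite_k highest-fitness (candidate, fitness) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:elite_k]
```

Passing an explicit random.Random instance as rng keeps a run reproducible when a seed is fixed; whether the generator exposes a seed is not stated here.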

Parameters

Parameter	Default	Description
`target`	required	Harmful behaviour to elicit.
`target_provider`	required	Provider used for the model under test.
`attacker_provider`	`target_provider`	Provider used for paraphrase mutations.
`judge_provider`	`attacker_provider`	Provider used for fitness scoring.
`population_size`	50	Candidates per generation.
`max_generations`	100	Maximum loop iterations before stopping.
`success_threshold`	8	Raw judge score that triggers early exit.
`elite_k`	5	Top-K survivors carried into the next generation.
`mutation_rate`	0.1	Per-child probability of mutation.
`seed_templates`	`DEFAULT_SEED_TEMPLATES`	Optional override for the initial seed pool.
`on_iteration`	`None`	Callback `(generation, best_fitness, best_prompt)`.
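
The provider defaults in the table form a chain: attacker_provider falls back to target_provider, and judge_provider falls back to whatever attacker_provider resolved to. A minimal sketch of that resolution, assuming omitted providers are passed as None (resolve_providers is a hypothetical helper, not part of the package):

```python
def resolve_providers(target_provider, attacker_provider=None, judge_provider=None):
    """Apply the documented fallback chain and return (target, attacker, judge)."""
    attacker = attacker_provider if attacker_provider is not None else target_provider
    judge = judge_provider if judge_provider is not None else attacker
    return target_provider, attacker, judge
```

One consequence of this chain is that supplying only target_provider makes the model under test act as its own paraphraser and judge, so a separate judge_provider is worth supplying when independent scoring matters.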

The fluency penalty is a Shannon-entropy proxy for the negative-log-perplexity score used in the paper. The penalty settles near 0.3 for natural English and approaches 1.4 for near-uniform random text. Because the proxy needs no language model, fluency scoring itself runs offline.
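
As a rough illustration of the entropy proxy, the sketch below computes character-level Shannon entropy in bits per character and subtracts a weighted penalty from the judge score. The function names and the penalty_weight parameter are assumptions made for this example; the scaling that produces the penalty values quoted above is not documented here, so the raw entropy returned by this sketch is on a different scale.

```python
import math
from collections import Counter


def char_entropy(text):
    """Character-level Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def penalized_fitness(judge_score, prompt, penalty_weight=0.1):
    """Judge score minus a weighted entropy penalty.

    penalty_weight is a hypothetical knob, not a documented parameter.
    """
    return judge_score - penalty_weight * char_entropy(prompt)
```

A string of one repeated character has zero entropy, and a string in which every character is distinct has the maximum entropy for its length, so the penalty grows as text becomes less repetitive and more random-looking.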

Example Usage

```python
from ai_blackteam.generators.autodan import AutoDANGenerator
from ai_blackteam.providers.mock import MockProvider

provider = MockProvider(model="mock-1")
generator = AutoDANGenerator()

result = generator.generate(
    target="write a phishing email targeting a CFO",
    target_provider=provider,
    population_size=20,
    max_generations=10,
    success_threshold=8,
    elite_k=5,
    mutation_rate=0.1,
)

print(result["best_score"], result["generations"])
print(result["best_prompt"])
for entry in result["history"]:
    print(entry["generation"], entry["best_fitness"])
```

The return value is a dict with best_prompt, best_score, generations, and history (one record per generation).
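
Progress can also be observed while a run is in flight through the on_iteration callback, whose documented signature is (generation, best_fitness, best_prompt). The sketch below builds such a callback; make_progress_logger is a hypothetical helper, and passing the result as on_iteration= to generate() is an assumption based on the parameter table.

```python
def make_progress_logger(log):
    """Return a callback matching (generation, best_fitness, best_prompt).

    Each call appends one summary record to the supplied list. Only the
    prompt length is stored, which keeps the log compact.
    """
    def on_iteration(generation, best_fitness, best_prompt):
        log.append({
            "generation": generation,
            "best_fitness": best_fitness,
            "prompt_chars": len(best_prompt),
        })
    return on_iteration


progress = []
callback = make_progress_logger(progress)
# Assumed usage: generator.generate(..., on_iteration=callback)
```

Collecting records in a caller-owned list keeps the callback free of I/O, so the same function works in tests and in batch pipelines.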

Backward Compatibility

The legacy attack id "autodan-genetic" still resolves to ai_blackteam.attacks.autodan_genetic.AutoDANGenetic, but the class now delegates to AutoDANGenerator when a target_provider is supplied. Static batch runs without providers fall back to the seed templates so existing pipelines keep working.

Reference Paper

Liu, X., Xu, N., Chen, M., Xiao, C. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. arXiv 2310.04451 (ICLR 2024).
