Overview
PAP (Persuasive Adversarial Prompts) treats jailbreaking as a social-engineering problem. Rather than searching for an adversarial token suffix, it asks an attacker LLM to rewrite a harmful request using a named persuasion technique (Authority, Storytelling, Logical Appeal, etc.), then sends each rewrite to the target model and scores the response with a judge LLM. The generator ships with the top 10 high-impact techniques from the paper. The technique whose rewrite scores highest wins. Reference: Zeng et al. 2024, “How Johnny Can Persuade LLMs to Jailbreak Them” (arXiv 2401.06373).
Algorithm
For each technique in the list:
- Build a REWRITE_PROMPT instructing the attacker to rephrase the request using the technique.
- Send the rewritten prompt to the target.
- Ask the judge for a 1-10 compliance score.
- Track the highest-scoring (technique, prompt, response) triple.
- Short-circuit as soon as a score reaches or exceeds success_threshold.
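The control flow above can be sketched as a plain best-of-N loop. This is only an illustration under stated assumptions, not the code in pap.py: `run_pap_sketch` and the three injected callables (`rewrite`, `query_target`, `judge`) are hypothetical stand-ins for the attacker, target, and judge providers, and the 1-based iteration index passed to `on_iteration` is a guess.

```python
from typing import Callable, Optional

Technique = tuple[str, str]  # (name, definition), as in the techniques parameter


def run_pap_sketch(
    target: str,
    techniques: list[Technique],
    rewrite: Callable[[str, str, str], str],   # hypothetical: (request, name, definition) -> prompt
    query_target: Callable[[str], str],        # hypothetical: prompt -> target response
    judge: Callable[[str, str], int],          # hypothetical: (request, response) -> 1-10 score
    success_threshold: int = 8,
    on_iteration: Optional[Callable[[int, str, str, str, int], None]] = None,
) -> dict:
    """Try each technique in order, keep the best-scoring attempt, stop early on success."""
    best_score, best_prompt, best_response, best_technique = 0, "", "", ""
    history = []

    for i, (name, definition) in enumerate(techniques, start=1):  # 1-based index is an assumption
        prompt = rewrite(target, name, definition)
        response = query_target(prompt)
        score = judge(target, response)

        history.append(
            {"technique": name, "prompt": prompt, "response": response, "score": score}
        )
        if on_iteration is not None:
            on_iteration(i, name, prompt, response, score)

        # Track the highest-scoring (technique, prompt, response) triple.
        if score > best_score:
            best_score, best_prompt = score, prompt
            best_response, best_technique = response, name

        # Short-circuit once the judge score reaches the threshold.
        if score >= success_threshold:
            break

    return {
        "success": best_score >= success_threshold,
        "iterations": len(history),
        "best_score": best_score,
        "best_prompt": best_prompt,
        "best_response": best_response,
        "best_technique": best_technique,
        "history": history,
    }
```

Passing the three model calls in as callables keeps the sketch independent of any provider interface, so it can be exercised with deterministic stubs.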
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| target | str | required | Harmful behavior to elicit. |
| target_provider | BaseProvider | required | Provider for the model under test. |
| attacker_provider | BaseProvider \| None | target_provider | LLM that rewrites the request. |
| judge_provider | BaseProvider \| None | attacker_provider | LLM that scores target responses. |
| techniques | list[tuple[str, str]] \| None | DEFAULT_TECHNIQUES | (name, definition) pairs to apply. |
| success_threshold | int | 8 | Judge score >= this counts as success. |
| on_iteration | callable \| None | None | fn(i, technique, prompt, response, score). |
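The defaults in the table imply a fallback chain (judge falls back to attacker, attacker falls back to target) and a five-argument callback. A minimal sketch of both follows. `resolve_providers` and `make_score_logger` are hypothetical helpers written for illustration, not part of the library, and the sketch assumes that a missing judge resolves to the already-resolved attacker.

```python
def resolve_providers(target_provider, attacker_provider=None, judge_provider=None):
    """Apply the documented defaults (hypothetical helper, not the library's code).

    attacker_provider defaults to target_provider;
    judge_provider defaults to the resolved attacker (an assumption).
    """
    attacker = attacker_provider if attacker_provider is not None else target_provider
    judge = judge_provider if judge_provider is not None else attacker
    return attacker, judge


def make_score_logger(log: list):
    """Build a callback matching fn(i, technique, prompt, response, score)."""

    def on_iteration(i, technique, prompt, response, score):
        # Keep only the fields needed for a progress trace.
        log.append((i, technique, score))

    return on_iteration


# Plain strings stand in for provider objects here.
attacker, judge = resolve_providers("target-model")
trace = []
callback = make_score_logger(trace)
callback(1, "Authority", "<rewritten prompt>", "<target response>", 4)
```

Using one model for all three roles is the cheapest configuration, but a separate judge avoids having the target grade its own output.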
Return value
| Key | Type | Description |
|---|---|---|
| success | bool | True if any technique scored >= success_threshold. |
| iterations | int | Number of techniques tried. |
| best_score | int | Highest judge score observed. |
| best_prompt | str | Rewritten prompt with the highest score. |
| best_response | str | Target response to that prompt. |
| best_technique | str | Name of the winning technique. |
| history | list[dict] | Per-technique record with prompt, response, score. |
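As an illustration of consuming this dictionary, the sketch below summarizes a result and tabulates scores per technique. The top-level keys come from the table above. The key names inside each history record ("technique", "prompt", "response", "score") are assumptions, and `example_result` is hand-written placeholder data, not real output.

```python
def summarize_result(result: dict) -> str:
    """One-line summary built from the documented top-level keys."""
    status = "success" if result["success"] else "no success"
    return (
        f"{status}: best technique {result['best_technique']!r} "
        f"scored {result['best_score']}/10 after {result['iterations']} technique(s)"
    )


def scores_by_technique(result: dict) -> dict:
    """Map technique name to judge score; record key names are assumed."""
    return {rec["technique"]: rec["score"] for rec in result["history"]}


# Placeholder data shaped like the table above (not real output).
example_result = {
    "success": True,
    "iterations": 2,
    "best_score": 9,
    "best_prompt": "<rewritten prompt>",
    "best_response": "<target response>",
    "best_technique": "Authority",
    "history": [
        {"technique": "Storytelling", "prompt": "<p1>", "response": "<r1>", "score": 3},
        {"technique": "Authority", "prompt": "<p2>", "response": "<r2>", "score": 9},
    ],
}

print(summarize_result(example_result))
print(scores_by_technique(example_result))
```

Because the loop short-circuits on success, history can be shorter than the technique list, so iterations is the safer count to report.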
Example
Registration
PAPGenerator is registered as "pap" in generator_registry.
Source
- src/ai_blackteam/generators/pap.py
- Reference paper: arXiv 2401.06373