HF_TOKEN set, the realistic ceiling is ~100K.
The 19 loaders
| Loader | Records | License | Paper / Source |
|---|---|---|---|
| harmbench | 400 | MIT | HarmBench (CAIS, ICML 2024) |
| advbench | 520 | MIT | AdvBench (Zou et al., 2023) |
| jailbreakbench | 100 | MIT | JailbreakBench (NeurIPS 2024) |
| wmdp-bio | 1,273 | MIT | WMDP biosecurity (CAIS) |
| wmdp-cyber | 1,987 | MIT | WMDP cybersecurity (CAIS) |
| wmdp-chem | 408 | MIT | WMDP chemistry (CAIS) |
| do_not_answer | 939 | Apache-2.0 | DoNotAnswer (Wang et al., 2023) |
| wildguard | varies | Apache-2.0 | WildGuard (AI2). Gated; needs HF_TOKEN + terms |
| redbench | ~29K | MIT | RedBench (knoveleng, ICLR 2026) |
| salad_bench | ~30K | MIT | SALAD-Bench (Li et al., ACL 2024) |
| sorry_bench | varies | Apache-2.0 | SORRY-Bench (NeurIPS 2024). Gated |
| strongreject | 313 | MIT | StrongREJECT (Souly et al., arXiv 2402.10260) |
| aart | 3,269 | CC-BY-4.0 | AART (Google, Radharapu et al., 2023) |
| forbidden_questions | 390 | research-only | Do Anything Now (Shen et al., 2024) |
| beavertails | ~30K | CC-BY-NC-4.0 | BeaverTails (Ji et al., 2023). Non-commercial |
| realtoxicityprompts | ~10K | Apache-2.0 | RealToxicityPrompts (Gehman et al., 2020). Filtered to toxicity > 0.5 |
| jailbreakv_28k | 28,000 | research-only | JailBreakV-28K (Luo et al., arXiv 2404.03027) |
| redteam2k | 2,000 | research-only | RedTeam-2K subset of JailBreakV |
| agentharm | 440 | MIT | AgentHarm (Andriushchenko et al., UK AISI, ICLR 2025) |
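The license column matters when choosing which loaders to include in a run. The following is a minimal sketch, not part of ai-blackteam itself: it copies the license labels from the table above into a plain dictionary and filters out the loaders marked `research-only` or `CC-BY-NC-4.0`. The names `LOADER_LICENSES`, `RESTRICTED`, and `unrestricted_loaders` are hypothetical, and treating exactly those two labels as restricted is an assumption.

```python
# License labels copied from the loader table above.
LOADER_LICENSES = {
    "harmbench": "MIT",
    "advbench": "MIT",
    "jailbreakbench": "MIT",
    "wmdp-bio": "MIT",
    "wmdp-cyber": "MIT",
    "wmdp-chem": "MIT",
    "do_not_answer": "Apache-2.0",
    "wildguard": "Apache-2.0",
    "redbench": "MIT",
    "salad_bench": "MIT",
    "sorry_bench": "Apache-2.0",
    "strongreject": "MIT",
    "aart": "CC-BY-4.0",
    "forbidden_questions": "research-only",
    "beavertails": "CC-BY-NC-4.0",
    "realtoxicityprompts": "Apache-2.0",
    "jailbreakv_28k": "research-only",
    "redteam2k": "research-only",
    "agentharm": "MIT",
}

# Assumption: these are the labels that restrict commercial use.
RESTRICTED = {"research-only", "CC-BY-NC-4.0"}


def unrestricted_loaders(licenses=LOADER_LICENSES, restricted=RESTRICTED):
    """Return the sorted loader names whose license label is not restricted."""
    return sorted(name for name, lic in licenses.items() if lic not in restricted)


if __name__ == "__main__":
    for name in unrestricted_loaders():
        print(name)
```

Gated loaders such as wildguard and sorry_bench still pass this filter, since gating is an access requirement rather than a license label.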
Listing datasets
Downloading datasets
Downloaded datasets are saved to ~/.ai-blackteam/datasets/ as JSONL files. You only need to download once.
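Because the files are plain JSONL, they can be inspected with the Python standard library alone. The sketch below assumes one JSON object per line. The helper `load_jsonl` and the file name `harmbench.jsonl` are hypothetical, since the actual naming scheme is not documented in this section.

```python
import json
from pathlib import Path

# Download location stated above.
DATASET_DIR = Path.home() / ".ai-blackteam" / "datasets"


def load_jsonl(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Hypothetical file name, used only for illustration.
    path = DATASET_DIR / "harmbench.jsonl"
    if path.exists():
        print(len(load_jsonl(path)), "records")
    else:
        print("not downloaded yet:", path)
```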
Dataset statistics
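Record counts can also be checked directly against the downloaded files. This is a rough sketch rather than the tool's own statistics command: it assumes every dataset is stored as a `*.jsonl` file with one record per line, and the helper name `count_records` is hypothetical.

```python
from pathlib import Path


def count_records(dataset_dir):
    """Map each *.jsonl file stem to its number of non-blank lines."""
    counts = {}
    for path in sorted(Path(dataset_dir).glob("*.jsonl")):
        with open(path, encoding="utf-8") as fh:
            counts[path.stem] = sum(1 for line in fh if line.strip())
    return counts


if __name__ == "__main__":
    dataset_dir = Path.home() / ".ai-blackteam" / "datasets"
    for name, n in count_records(dataset_dir).items():
        print(f"{name}: {n}")
```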
The 14 datasets
HarmBench
- Prompts: 510
- Focus: Harmful behaviors across multiple categories
- Source: HarmBench paper
- License: MIT
AdvBench
- Prompts: ~520
- Focus: Adversarial attack prompts
- Source: Universal and Transferable Adversarial Attacks
- License: MIT
JailbreakBench
- Prompts: ~100
- Focus: Standardized jailbreak evaluation
- Source: JailbreakBench
- License: MIT
SorryBench
- Prompts: ~450
- Focus: Fine-grained safety evaluation across 45 categories
- Source: SORRY-Bench
- License: Apache-2.0
WMDP (Bio, Cyber, Chem)
- Prompts: ~3,668 total (1,273 bio + 1,987 cyber + 408 chem)
- Focus: Weapons of mass destruction proliferation knowledge
- Source: WMDP Benchmark from CAIS
- License: CC BY-NC
DoNotAnswer
- Prompts: ~900
- Focus: Questions models should refuse
- Source: Do Not Answer
- License: CC BY-NC-SA
WildGuard
- Prompts: ~1,700
- Focus: Real-world adversarial prompts
- Source: WildGuard
- License: Apache-2.0
StrongREJECT
- Prompts: 313
- Focus: Multi-dimensional jailbreak quality evaluation
- Source: StrongREJECT
- License: MIT
XSTest
- Prompts: 450
- Focus: Over-refusal testing with safe/unsafe prompt pairs
- Source: XSTest
- License: CC-BY-4.0
SimpleSafetyTests
- Prompts: 100
- Focus: Baseline safety testing with simple harmful prompts
- Source: SimpleSafetyTests
- License: CC-BY-2.0
BeaverTails
- Prompts: 700
- Focus: Fine-grained harm categorization
- Source: BeaverTails
- License: CC-BY-NC-4.0
Multi-Turn Human Jailbreaks (MHJ)
- Prompts: 2,912
- Focus: Real human-written multi-turn jailbreak conversations
- Source: MHJ
- License: CC-BY-NC-4.0
Using datasets with attacks
Datasets feed into the mega-sweep command:
Each prompt’s structure
Every dataset prompt includes a category field. This field maps to ai-blackteam’s harm taxonomy, so you can filter by category across datasets.
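Cross-dataset filtering on the category field might look like the sketch below. Only the `category` key comes from the description above. The `prompt` and `source` keys, the sample category values, and the helper names are hypothetical placeholders, not the tool's real schema or taxonomy.

```python
from collections import defaultdict


def filter_by_category(records, category):
    """Keep only the records whose 'category' field equals the given value."""
    return [r for r in records if r.get("category") == category]


def group_by_category(records):
    """Group records by 'category'; records without one go under 'uncategorized'."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get("category", "uncategorized")].append(record)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder records standing in for prompts merged from several datasets.
    # Field names other than 'category' are assumptions for illustration.
    records = [
        {"prompt": "<prompt 1>", "category": "category_a", "source": "dataset_x"},
        {"prompt": "<prompt 2>", "category": "category_b", "source": "dataset_y"},
        {"prompt": "<prompt 3>", "category": "category_a", "source": "dataset_y"},
    ]
    for name, items in group_by_category(records).items():
        print(name, len(items))
```

Grouping first makes it easy to see how unevenly the categories are represented before picking one to sweep.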