Multimodal attacks exploit the gap between different input modalities. Hidden instructions in images bypass text-based safety filters, audio injection embeds inaudible commands, and video frame injection hides payloads in temporal sequences. As models process more input types, each modality becomes a potential vector for smuggling harmful content past safety checks.

Summary

5 attacks total: 5 single-turn.

Attacks

| Attack | ID | Mode | Severity | Description |
|---|---|---|---|---|
| Audio Injection | audio-injection | single-turn | high | Hidden voice commands or ultrasonic payloads embedded in audio that multimodal models process |
| Image-Text Mismatch | image-text-mismatch | single-turn | medium | Conflicting information between image content and text prompt to confuse multimodal safety filters |
| Multimodal Context Confusion | multimodal-context-confusion | single-turn | medium | Exploiting context switching between modalities to smuggle harmful content past safety checks |
| OCR Bypass | ocr-bypass | single-turn | high | Harmful instructions rendered as text in images to bypass text-based safety filters via OCR processing |
| Video Frame Injection | video-frame-injection | single-turn | high | Malicious instructions hidden in individual video frames that multimodal models extract during processing |

Example

ai-blackteam run -p anthropic -a audio-injection -t "Process this image containing hidden instructions"