| Audio Injection | audio-injection | single-turn | high | Hidden voice commands or ultrasonic payloads embedded in audio that multimodal models process |
| Image-Text Mismatch | image-text-mismatch | single-turn | medium | Image content that conflicts with the accompanying text prompt, used to confuse multimodal safety filters |
| Multimodal Context Confusion | multimodal-context-confusion | single-turn | medium | Context switches between modalities used to smuggle harmful content past safety checks |
| OCR Bypass | ocr-bypass | single-turn | high | Harmful instructions rendered as text inside images, so text-based safety filters miss them while the model still reads them via OCR |
| Video Frame Injection | video-frame-injection | single-turn | high | Malicious instructions hidden in individual video frames that multimodal models extract during processing |