| I've been researching what happens when you split a prompt injection across multiple input modalities instead of putting it all in one text field. The short answer: per-channel detection breaks completely. The idea is simple. Instead of sending
Each fragment scores well below detection thresholds individually. A DistilBERT classifier sees each piece at 0.43-0.53 confidence. No single channel triggers anything. But the LLM processes all channels as one token stream and reconstructs the full attack. I ran these against a three-stage detection pipeline (regex fast-reject, fine-tuned DistilBERT ONNX INT8, modality-specific preprocessing) and documented everything that got through. Modality combinations covered
Attack categoriesExfiltration, compliance forcing, context switching, template injection, encoding obfuscation (base64, hex, ROT13, reversed text, unicode homoglyphs), multilingual injection, DAN/jailbreak, roleplay manipulation, authority impersonation, and delimiter injection. Sources and references
Repogithub.com/Josh-blythe/bordair-multimodal-v1 All JSON payloads, no executable code required. Intended for red teams and anyone building or evaluating multimodal LLM detection systems. Interested in hearing from anyone who's working on cross-modal defence. The fundamental question seems to be: do you reassemble extracted text across channels before classification, or do you need a different architectural approach entirely? [link] [comments] |