What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks
Qin Yang, Lu Malloy, Joshua Lee, Xiaohan Chang, Meisam Mohammady, Doowon Kim, Yuan Hong
Your LLM moderation stack has a blind spot. Typographic attacks bypass tokenization-based filters while remaining obvious to users. Audit your guardrails against visual manipulation, not just semantic evasion.
LLM content moderation systems tokenize text but ignore visual cues humans use to interpret harmful content. This creates a perceptual mismatch: what humans recognize as harmful becomes invisible to automated filters.
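A toy illustration of this mismatch, offered as a sketch and not taken from the paper: a word-level blocklist filter catches a term written normally but misses the same term once its letters are spaced out, even though a reader still sees the word. The names `BLOCKLIST`, `tokenize`, and `toy_filter` are hypothetical, the word "forbidden" is a benign placeholder, and real moderation models use learned sub-word tokenizers, not a blocklist.

```python
import re

# Benign placeholder standing in for a flagged term (assumption for illustration).
BLOCKLIST = {"forbidden"}

def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphabetic tokens (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def toy_filter(text: str) -> bool:
    """Return True if any token exactly matches a blocklisted term."""
    return any(tok in BLOCKLIST for tok in tokenize(text))

plain = "this word is forbidden here"
spaced = "this word is f o r b i d d e n here"  # same word, letters spaced out

plain_flagged = toy_filter(plain)    # the filter sees the token "forbidden"
spaced_flagged = toy_filter(spaced)  # the filter sees only single-letter tokens
```

The spaced version breaks into single-letter tokens, so the exact-match check never fires, while a human reader reassembles the word at a glance.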
Method: Human-Perceptible Adversarial Attacks (HPAA) embed harmful expressions into benign text through typographic manipulations: spacing, visual emphasis, and spatial arrangement. In black-box settings, using only three detector queries, the attacks achieved over 86% human recognition while keeping detection rates below 1% across ten deployed moderation systems, including commercial APIs and state-of-the-art guardrails.
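For teams auditing their own guardrails, the detection-rate metric can be measured with a small harness like the sketch below. It is not the authors' HPAA pipeline: the transforms `letter_spacing` and `one_char_per_line` are simplified, hypothetical stand-ins for the spacing and spatial-arrangement families, `toy_detector` is a placeholder substring check to be swapped for a call to the real moderation system, and "forbidden" is a benign placeholder term.

```python
from typing import Callable, Dict, List

Detector = Callable[[str], bool]   # returns True when the text is flagged
Transform = Callable[[str], str]   # returns a typographic variant of the text

def identity(text: str) -> str:
    """Baseline: leave the text unchanged."""
    return text

def letter_spacing(text: str) -> str:
    """Space out the letters of each word; words are separated by two spaces."""
    return "  ".join(" ".join(word) for word in text.split())

def one_char_per_line(text: str) -> str:
    """Arrange the text vertically, one character per line."""
    return "\n".join(text)

def detection_rate(detector: Detector, samples: List[str], transform: Transform) -> float:
    """Fraction of transformed samples that the detector flags."""
    if not samples:
        raise ValueError("samples must not be empty")
    flagged = sum(1 for s in samples if detector(transform(s)))
    return flagged / len(samples)

def audit(detector: Detector, samples: List[str],
          transforms: Dict[str, Transform]) -> Dict[str, float]:
    """Detection rate per named transform."""
    return {name: detection_rate(detector, samples, fn)
            for name, fn in transforms.items()}

def toy_detector(text: str) -> bool:
    """Placeholder detector (assumption): flags a literal substring match."""
    return "forbidden" in text.lower()

TRANSFORMS: Dict[str, Transform] = {
    "identity": identity,
    "letter_spacing": letter_spacing,
    "one_char_per_line": one_char_per_line,
}

samples = ["this is forbidden", "forbidden content here"]
report = audit(toy_detector, samples, TRANSFORMS)
```

A large gap between the `identity` rate and the rates for the typographic variants indicates the kind of blind spot described above. A complete audit would also need a human-recognition measurement, which this sketch does not attempt.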
Caveats: Results were obtained on current moderation architectures. Defense strategies are discussed but not validated in deployment.
Reflections: Can vision-language models close the perceptual gap, or do they inherit similar tokenization blind spots? · What's the minimum typographic complexity needed to evade detection while preserving human readability? · How do moderation systems perform when adversaries combine typographic and semantic evasion techniques?