Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations
David Hartmann, Amin Oueslati, Dimitri Staufer, Lena Pohlmann, Simon Munzert, Hendrik Heuer
Audit your moderation stack before deployment. Test against identity-swapped content and dialect variations. Don't rely on a single API: ensemble methods reduce both over- and under-moderation. Budget for human review of edge cases involving reclaimed language.
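A minimal sketch of the ensemble idea, assuming each provider is wrapped behind a common text-to-score function; the score_api_* stubs, the 0.5 threshold, and the two-vote rule are illustrative assumptions, not the paper's configuration:

```python
from typing import Callable, List

def score_api_a(text: str) -> float:
    """Placeholder for a real client call, e.g. a Perspective toxicity score."""
    return 0.0

def score_api_b(text: str) -> float:
    """Placeholder for a second provider, e.g. a normalized Azure severity."""
    return 0.0

def score_api_c(text: str) -> float:
    """Placeholder for a third provider, e.g. an OpenAI moderation score."""
    return 0.0

SCORERS: List[Callable[[str], float]] = [score_api_a, score_api_b, score_api_c]

def ensemble_flag(text: str, threshold: float = 0.5, min_votes: int = 2) -> bool:
    """Flag text only if at least `min_votes` providers exceed `threshold`.

    Requiring agreement dampens over-moderation driven by any single biased
    API; lowering `min_votes` to 1 trades that for higher recall on hate
    that only one provider catches.
    """
    votes = sum(score(text) >= threshold for score in SCORERS)
    return votes >= min_votes
```

Tuning `threshold` and `min_votes` on a held-out, identity-balanced sample is where the over- vs. under-moderation tradeoff gets decided; cases near the threshold are the ones worth routing to human review.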
Commercial moderation APIs flag legitimate speech while missing actual hate. Five million test cases reveal systematic failures across identity groups and dialects.
Method: The audit framework tests five commercial APIs (Google Perspective, Azure Content Safety, OpenAI Moderation, AWS Comprehend, Hive) against controlled variations of hate speech targeting different groups. It perturbs identity markers (e.g., swapping 'gay' for 'straight') and linguistic features (AAVE vs. Standard English) to expose bias patterns. The APIs over-moderate reclaimed slurs and AAVE by 23-47% while under-detecting coded hate speech that swaps explicit slurs for euphemisms.
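A hedged sketch of the identity-swap step, assuming each API is wrapped as a `score_fn(text) -> float`; the swap pairs, function names, and report fields are illustrative stand-ins, not the paper's actual templates or pipeline:

```python
from typing import Callable, Dict, List, Tuple

# Illustrative swap pairs; the audit uses its own controlled templates.
IDENTITY_SWAPS: List[Tuple[str, str]] = [
    ("gay", "straight"),
    ("black", "white"),
    ("muslim", "christian"),
]

def generate_counterfactuals(text: str) -> Dict[str, str]:
    """Return the original text plus one variant per applicable swap pair."""
    variants = {"original": text}
    lowered = text.lower()
    for a, b in IDENTITY_SWAPS:
        if a in lowered:
            variants[f"{a}->{b}"] = lowered.replace(a, b)
    return variants

def audit_scores(
    texts: List[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[dict]:
    """Score each variant and record whether the moderation decision flips.

    A flip on otherwise identical text means the identity term, not the
    content, is driving the decision (over- or under-moderation).
    """
    report = []
    for text in texts:
        scores = {name: score_fn(v) for name, v in generate_counterfactuals(text).items()}
        flags = {s >= threshold for s in scores.values()}
        report.append({"text": text, "scores": scores, "decision_flips": len(flags) > 1})
    return report
```

Running this over a labeled template set and comparing flip rates per API mirrors the over-moderation comparison in spirit; the dialect audit works the same way, with AAVE/Standard English paraphrase pairs in place of term swaps.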
Caveats: The framework requires labeled hate speech datasets and doesn't address multimodal content (images, video), where most evasion happens.
Reflections: Can ensemble methods be optimized to minimize both over- and under-moderation simultaneously, or is there an irreducible tradeoff? · How do moderation APIs perform on emerging coded hate speech that evolves faster than training data? · What's the optimal human-in-the-loop intervention rate for different community contexts?