All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs?
Sola Kim, Marco A. Janssen, Jieshu Wang, Ame Min-Venditti, Neha Karanjia, John M. Anderies
If you're procuring LLMs for government or civic engagement, add fairness benchmarks to your evaluation criteria—FedRAMP doesn't cover this. Model selection implicitly selects a level of socioeconomic bias.
Federal agencies are deploying LLMs to summarize public comments during rulemaking. If these systems treat identical comments differently based on demographic signals, they could systematically distort democratic input.
Method: Researchers held comment content constant and varied only demographic attribution across 182 public comments, generating over 106,000 summaries from eight LLMs. Occupation produced consistent differential treatment: the same comment attributed to a street vendor received summaries that preserved less original meaning, used simpler language, and shifted emotional tone compared to attribution to a financial analyst. This pattern held across all names, prompts, models, and regulatory contexts. Race and gender effects were inconsistent or absent.
Caveats: Tested on public comment summarization only. Other government text processing tasks may show different bias patterns.
Reflections: Do occupation-based biases persist when LLMs are fine-tuned on government-specific corpora? · Can prompt engineering reduce socioeconomic bias without degrading summary quality? · How do these biases compound when LLMs are used for downstream decision-making, not just summarization?