Week 17 / April 2026

Aggregate Benchmarks Collapse Under Individual Variation

LLM rankings, medical evaluations, and civic AI all fail when averaged metrics meet real users

Synthesized using AI

Analyzed 115 papers. AI models can occasionally hallucinate, please verify critical details.

The evaluation methods we use to select and validate AI systems systematically obscure the failures that matter most in deployment. Individual LLM preferences diverge so dramatically from aggregate benchmarks that public leaderboards are nearly useless for predicting user satisfaction—Bradley-Terry coefficients show ρ=0.04 average correlation, with 57% of users exhibiting near-zero or negative correlation with Chatbot Arena rankings. Medical QA systems demonstrate the same pattern from a different angle: semantic similarity metrics that measure text coherence miss entity recognition failures that cause 13.8% performance drops on chronic conditions affecting vulnerable populations. Public comment summarization reveals occupation-based bias that persists across all models and prompts while race and gender effects remain absent. These aren't isolated measurement problems—they're three instances of the same failure mode: aggregate metrics optimize for average-case performance while deployment requires understanding variance.

The practical implication cuts across domains. If you're selecting LLMs based on aggregate benchmarks, you're making deployment decisions on metrics that don't predict individual user outcomes. If you're evaluating medical AI with semantic similarity alone, you're certifying systems that will systematically underperform for specific patient populations. If you're procuring civic tech based on standard fairness audits, you're missing socioeconomic bias that demographic testing doesn't capture. The solution isn't better averaging—it's abandoning the assumption that averaged performance predicts deployment outcomes. Personalized benchmarking, component-wise medical evaluation, and occupation-specific fairness testing all point toward the same shift: measure what varies, instrument for heterogeneity, and build evaluation frameworks that surface the gaps between lab performance and field failure before systems ship.

Featured(1/6)

2604.17247

All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs?

Sola Kim, Marco A. Janssen, Jieshu Wang, Ame Min-Venditti, Neha Karanjia, John M. Anderies

Preprint·2026-04-19

If you're procuring LLMs for government or civic engagement, add fairness benchmarks to your evaluation criteria—FedRAMP doesn't cover this. Model selection implicitly selects a level of socioeconomic bias.

Federal agencies are deploying LLMs to summarize public comments during rulemaking. If these systems treat identical comments differently based on demographic signals, they could systematically distort democratic input.

Method: Researchers held comment content constant and varied only demographic attribution across 182 public comments, generating over 106,000 summaries from eight LLMs. Occupation produced consistent differential treatment: the same comment attributed to a street vendor received summaries that preserved less original meaning, used simpler language, and shifted emotional tone compared to attribution to a financial analyst. This pattern held across all names, prompts, models, and regulatory contexts. Race and gender effects were inconsistent or absent.

Caveats: Tested on public comment summarization only. Other government text processing tasks may show different bias patterns.

Reflections: Do occupation-based biases persist when LLMs are fine-tuned on government-specific corpora? · Can prompt engineering reduce socioeconomic bias without degrading summary quality? · How do these biases compound when LLMs are used for downstream decision-making, not just summarization?

ai-interactionethicstrust-safetyprivacy-security

2604.19281

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

Accepted·2026-04-21

Don't rely on semantic evaluation alone for medical AI. Demand component-wise accuracy metrics, especially entity recognition. Expect systematic underperformance on conditions affecting vulnerable populations.

LLMs are answering patient medical questions, but semantic similarity metrics don't measure medical accuracy or reveal health equity risks embedded in their responses.

Method: The VB-Score framework evaluates medical QA across four components: entity recognition, semantic similarity, factual consistency, and structured information completeness. Testing three widely-used LLMs on 48 public health topics revealed severe performance failures and a major discrepancy between semantic and entity accuracy. Models exhibited 13.8% lower performance on chronic conditions affecting older and minority populations compared to overall average—evidence of condition-based algorithmic discrimination.

Caveats: Framework tested on public health topics from authoritative sources. Clinical diagnostic accuracy may require additional evaluation dimensions.

Reflections: Can targeted fine-tuning on underrepresented health conditions close the performance gap? · Do these disparities persist when LLMs are used as retrieval systems rather than generative answerers? · How do entity recognition failures translate into patient harm in real-world deployment?

healthcareai-interactionethicsevaluation-methodstrust-safety

2604.19114

OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting

Tengyou Xu, Detao Ma, Xiang 'Anthony' Chen

Preprint·2026-04-21

If you're building LLM interfaces for complex tasks, move beyond text boxes. Structure prompts as composable objects that users can manipulate, version, and reuse.

Composing prompts as linear text strings becomes unwieldy when users need to express multifaceted intents, iterate on components, or reuse prompt structures across contexts.

Method: Object-Oriented Prompting (OOPrompt) treats prompts as structured, manipulable artifacts rather than text strings. A formative study with 20 participants informed a design space, followed by a validation study of the full prototype. Users could create, edit, iterate, and reuse prompt components as discrete objects, unifying several existing point systems into a coherent interaction paradigm.

Caveats: Validation study focused on understanding added values and trade-offs. Long-term adoption patterns and learning curves remain uncharacterized.

Reflections: What's the cognitive overhead of learning object-oriented prompting versus traditional text-based prompting? · How do teams collaborate when prompts are structured artifacts rather than shared text? · Can OOPrompt patterns be automatically extracted from existing text-based prompts?

ai-interactiondesign-toolsprogramming-tools

1 / 6

Featured

Findings(1/5)

Evaluation shifts from semantic similarity to component-wise medical accuracy·Prompting moves from linear text to structured, reusable artifacts·Benchmarking fragments from aggregate ratings to personalized rankings·Data collection interfaces embed sensemaking into input, not just review·Self-related cues interrupt mindless consumption through de-immersion

Medical question-answering systems are moving beyond semantic similarity metrics that measure how closely LLM outputs match reference answers. Component-wise evaluation frameworks now assess medical accuracy directly, exposing health equity risks that semantic measures miss. This matters because semantic similarity can mask clinically dangerous errors—a response might match the reference text closely while recommending contraindicated treatments. The shift forces practitioners to instrument for harm, not just coherence.

2604.19281

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

Surprises(1/3)

Demographic attribution doesn't bias LLM processing of public comments·Mixed reality sharing fails at protocol legibility, not technical capability·Fairness audits require institutional training data, not just model access

Holding comment content constant while varying only the commenter's race, gender, and socioeconomic status across eight LLMs revealed no systematic bias in how federal agencies' deployed systems process public input during rulemaking. This contradicts the assumption that LLMs inherit demographic biases in high-stakes civic applications. The null result suggests bias may concentrate in other pipeline stages—problem framing, output interpretation, or downstream decision-making—rather than in the language model itself.

2604.17247

All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs?

TOOLBOX(8)

VB-Score

Framework

Verification-Based Score evaluation framework for medical question answering systems. Provides separate evaluation across four components: entity recognition, semantic similarity, factual consistency, and structured information completeness. Used to assess LLM performance on 48 public health topics, revealing 13.8% lower performance on chronic conditions affecting older and minority populations.

2604.19281

OOPrompt

Tool

Object-Oriented Prompting prototype that enables users to create, edit, iterate, and reuse prompts as structured, manipulable artifacts rather than linear text strings. Unifies and generalizes existing point systems for LLM-based interactive systems. Validated through formative study with 20 participants and subsequent validation study.

2604.19114

KUBO

Tool

Kumunidad at Balitang Opisyal - hyperlocal communication platform prototype for the Philippines. Integrates home module for verified local government advisories and curated headlines, plus community module for resident-powered neighborhood reports. Within-subjects evaluation showed significantly reduced task completion times (p<0.001) and improved information recall (p=0.010) versus Facebook.

2604.20973

Kenya-polarization classifier

Model

Fine-tuned BERT-based text classifier for monitoring online polarization in Kenya. Developed through participatory annotation process with peacebuilders and domain experts. Trained on collaboratively annotated datasets with enhanced contextual alignment and reduced cultural nuance misclassification. Open-source and accessible via HuggingFace.

2604.21034

Sudan-hate speech classifier

Model

Fine-tuned BERT-based text classifier for monitoring hate speech in Sudan. Developed through participatory annotation with peacebuilders and data scientists. Evaluated against held-out test sets with improved contextual validity and practitioner ownership. Open-source and accessible via HuggingFace.

2604.21034

Remindful

Tool

Caregiver-informed reminder platform for dementia care that extends task prompting with caregiver-facing alerts, summaries, and review features. Supports awareness in home-based dementia care through reminder-based caregiver awareness. Deployed in-home with two caregiver-PLwD dyads, enabling household coordination and routine awareness over time.

2604.19574

Autark

Framework

Serverless toolkit for rapid prototyping of urban visual analytics systems. Provides domain-aware abstractions through self-contained architecture for spatial data management, analytical processing, and visualization. Enables transition from design intention to deployed, shareable systems within hours. Well-suited for AI-assisted coding workflows with structured, tightly scoped interfaces.

2604.20759

TouchPort

Tool

Embodied sharing protocol for mixed reality that collapses multi-stage encounter sequence (Discover, Consent, Confirm, Allow, Spatial Colocation, Sync Objects, Permission Management) into single handshake-and-pull gesture. Enables spontaneous sharing of mixed realities in public settings by simultaneously signaling intent, negotiating consent, and initiating temporary shared encounter layer.

2604.19423

Auditing and Controlling AI Agent Actions in Spreadsheets

Exposes AI agent reasoning by making spreadsheet cells the audit trail—each formula becomes a checkpoint where users can inspect, override, or redirect autonomous work before it compounds.

2604.21152

Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

Separates stated identity from linguistic style to reveal LLMs penalize how people write, not who they are—dialect carries bias even when demographics are hidden.

2604.20468

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

Lets non-experts teach industrial robots by physically guiding them, speaking commands, or sketching trajectories—switching modalities mid-task as the situation demands.

2604.20711

Participatory provenance as representational auditing for AI-mediated public consultation

Builds a formal framework to audit whether AI-generated policy summaries actually represent public input—tracking which voices got amplified, merged, or erased in synthesis.

2604.19309

Co-Refine: AI-Powered Tool Supporting Qualitative Analysis

Detects when your interpretation of a qualitative code drifts across hundreds of interviews—then suggests which past excerpts to recode for consistency.

2604.20279

AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents

Replaces the foreground-or-nothing paradigm with adaptive overlays—mobile agents show just enough UI to keep users informed without blocking multitasking.

2604.19276

Designing Transparent AI-Mediated Language Support for Intergenerational Family Communication

Translates Gen Z slang for parents and formal language for teens—but tests whether showing the translation process builds trust or just makes conversations awkward.

2604.21461

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

Reveals that smart glasses fail at basic pointing gestures—models confuse spatial semantics with object recognition, missing what your finger actually indicates.

REFLECTION(4)

Explainability scales, but understanding doesn't

Across 26 papers this week, researchers built systems that expose algorithmic reasoning—dashboards, captions, decision trees, uncertainty estimates. Yet practitioners still can't act on what they see. The interpretability tax isn't computational; it's the design work required to translate what systems measure into what humans actually need to decide.

We've solved the technical problem of making models transparent, but users still distrust or ignore the explanations we give them. If explainability doesn't change behavior, are we optimizing for the wrong audience—building explanations for auditors instead of practitioners?

1 / 4

Week 16April 2026

ABOUT THIS ISSUE

How was this newsletter synthesized?

Methodology

This newsletter is generated by an AI pipeline (leveraging Anthropic Sonnet 4.5 & Haiku 4.5) that processes the metadata and abstracts of every new arXiv HCI paper from the past week—115 this issue. Each paper is scored on three dimensions: Practice (applicability for practitioners), Research (scientific contribution), and Strategy (industry implications), with scores from 1-5. Papers passing threshold are grouped into topic clusters, and each cluster is summarized to capture what that body of research is exploring.

Selection Criteria

The pipeline builds a curated selection that balances high scores with topic diversity—and deliberately includes at least one 'contrarian' paper that challenges prevailing assumptions. This selection is then analyzed to identify key findings (patterns across multiple papers) and surprises (results that contradict conventional wisdom). A narrative synthesis ties the week's research together under a unifying frame.

Key Themes Discovered

Field Report: ai-interaction

Trust, Transparency, and Grounding

This cluster examines how users develop trust and maintain control when collaborating with AI systems across knowledge work, policy, healthcare, and social contexts. Core tensions emerge: systems must balance transparency (exposing reasoning) with usability; support active participation in decisions rather than post-hoc review; and ground AI outputs in domain expertise and user intent. Work spans fairness audits, participatory design, and interaction design—asking not whether AI works, but whether users can meaningfully oversee, redirect, and take ownership of AI-mediated outcomes.

1/10

Top Papers in this Theme

2604.21152

Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

2604.21461

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

2604.17730

MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

2604.17843

Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

2604.20055

Synthesized using AI

All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs?

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs?

VB-Score

OOPrompt

KUBO

Kenya-polarization classifier

Sudan-hate speech classifier

Remindful

Autark

TouchPort

Auditing and Controlling AI Agent Actions in Spreadsheets

Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

Participatory provenance as representational auditing for AI-mediated public consultation

Co-Refine: AI-Powered Tool Supporting Qualitative Analysis

AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents

Designing Transparent AI-Mediated Language Support for Intergenerational Family Communication

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

Explainability scales, but understanding doesn't

How was this newsletter synthesized?

Methodology

Selection Criteria

Key Themes Discovered

Field Report: ai-interaction

Trust, Transparency, and Grounding

Top Papers in this Theme

Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI