Same Performance, Hidden Bias: Evaluating Hypothesis- and Recommendation-Driven AI
Michaela Benk, Tim Miller
Stop evaluating AI assistants on accuracy alone. Recommendation-driven interfaces corrupt decision processes by lowering evidence standards. If judgment quality matters more than speed, use hypothesis-driven designs that preserve stable thresholds.
AI decision support systems are evaluated on accuracy and reliance, but identical performance can mask how users actually form judgments. Do recommendation-driven interfaces change evidence standards even when outcomes stay the same?
Method: A 290-person experiment using Signal Detection Theory found that recommendation-driven designs lowered users' thresholds for sufficient evidence compared to hypothesis-driven designs, even when task performance remained identical. This created a "hidden bias": a shifted distribution of errors at matched accuracy. Experts showed the same threshold shifts as novices, meaning experience did not protect against the systematic effect.
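The finding can be illustrated with a minimal equal-variance Signal Detection Theory sketch. This is not the authors' analysis code, and all numbers (sensitivity d', the two criterion placements) are invented for illustration: it only shows how a lowered decision criterion redistributes errors toward false alarms while overall accuracy barely moves.

```python
# Illustrative equal-variance SDT model: noise ~ N(0,1), signal ~ N(d',1),
# equal base rates. Sensitivity (d') is held fixed; only the criterion moves.
from math import erf, sqrt


def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def sdt_outcomes(d_prime: float, criterion: float):
    """Hit rate, false-alarm rate, and accuracy for a given criterion."""
    hit_rate = 1.0 - phi(criterion - d_prime)  # P(respond "signal" | signal)
    fa_rate = 1.0 - phi(criterion)             # P(respond "signal" | noise)
    accuracy = 0.5 * hit_rate + 0.5 * (1.0 - fa_rate)
    return hit_rate, fa_rate, accuracy


d_prime = 1.5                 # same sensitivity in both conditions (assumed)
neutral_c = d_prime / 2       # hypothesis-driven: unbiased criterion
lowered_c = 0.25              # recommendation-driven: lowered evidence bar

hit_n, fa_n, acc_n = sdt_outcomes(d_prime, neutral_c)
hit_l, fa_l, acc_l = sdt_outcomes(d_prime, lowered_c)

print(f"neutral criterion: hit={hit_n:.2f}  FA={fa_n:.2f}  acc={acc_n:.2f}")
print(f"lowered criterion: hit={hit_l:.2f}  FA={fa_l:.2f}  acc={acc_l:.2f}")
```

With these illustrative parameters, accuracy shifts by only a few points while the false-alarm rate nearly doubles, which is the pattern the summary calls a "hidden bias": the same headline performance, but a different mix of errors underneath.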
Caveats: Web-based experiment. Real-world stakes and repeated use might amplify or attenuate threshold shifts.
Reflections: Do threshold shifts persist or normalize with extended exposure to recommendation-driven systems? · Can interface design restore stable evidence standards while keeping recommendation convenience? · Which professional domains are most vulnerable to hidden bias from lowered thresholds?