Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
David Fraile Navarro, Farah Magrabi, Enrico Coiera
Don't trust health AI evaluations that use exam-style protocols. If you're evaluating consumer health tools, test under conditions that reflect actual use: conversational interaction, with clarifying questions allowed. The headline under-triage rate is an artifact of the evaluation methodology, not of model capability.
A Nature Medicine study claimed ChatGPT under-triages 51.6% of emergencies, suggesting consumer health AI poses safety risks. The evaluation used exam-style forced-choice formats that don't reflect how people actually use chatbots.
Method: Testing five frontier LLMs under constrained (exam-style, forced A/B/C/D output) versus naturalistic (patient-style messages) conditions revealed that the evaluation format, not the models, was the failure mechanism. Naturalistic interaction improved triage accuracy by 6.4 percentage points. Three models scored 0-24% with forced choice but 100% with free text (all p < 10^-8): the same models consistently recommended emergency care in their own words even as the forced-choice format registered under-triage. Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions.
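As an illustration of the two conditions, here is a minimal sketch of how the constrained and naturalistic prompts and their scoring might be set up. This is not the authors' code: `ask_model` is a placeholder for any chat-completion call, the keyword-based scoring heuristic is an assumption (real scoring of free-text replies would use blinded human raters or a validated rubric), and Fisher's exact test is one plausible choice for the per-model comparison, not necessarily the test used in the study.

```python
# Illustrative sketch of constrained vs. naturalistic triage evaluation.
# All names, prompts, and the statistical test are assumptions, not the study's code.
from scipy.stats import fisher_exact

TRIAGE_LEVELS = [
    "A) Self-care at home",
    "B) See a GP within a week",
    "C) Urgent care within 24 hours",
    "D) Emergency department now",
]

def constrained_prompt(scenario: str) -> str:
    # Exam-style, forced-choice: the model must answer with a single letter.
    options = "\n".join(TRIAGE_LEVELS)
    return (f"{scenario}\n\nChoose the most appropriate triage level.\n"
            f"{options}\nAnswer with one letter (A/B/C/D) only.")

def naturalistic_prompt(scenario: str) -> str:
    # Patient-style message: free text, clarifying questions allowed.
    return (f"I'm worried about these symptoms: {scenario}\n"
            "What should I do? Feel free to ask me anything you need to know.")

def score_constrained(reply: str, correct_letter: str) -> bool:
    # Correct if the forced-choice answer starts with the expected letter.
    return reply.strip().upper().startswith(correct_letter.upper())

def score_naturalistic(reply: str) -> bool:
    # Crude keyword check for an emergency-care recommendation (placeholder only).
    text = reply.lower()
    return any(k in text for k in ("emergency", "call 911", "go to the ed", "a&e"))

def run_trial(ask_model, scenario: str, correct_letter: str) -> tuple[bool, bool]:
    # ask_model: any callable mapping a prompt string to a reply string.
    constrained_ok = score_constrained(ask_model(constrained_prompt(scenario)), correct_letter)
    naturalistic_ok = score_naturalistic(ask_model(naturalistic_prompt(scenario)))
    return constrained_ok, naturalistic_ok

def compare_conditions(correct_constrained: int, correct_naturalistic: int, n_trials: int):
    # 2x2 table of correct/incorrect trials per condition; Fisher's exact test
    # is an illustrative choice for comparing the two accuracy rates.
    table = [[correct_constrained, n_trials - correct_constrained],
             [correct_naturalistic, n_trials - correct_naturalistic]]
    return fisher_exact(table)
```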
Caveats: Tested on 17 scenarios from the original study; a full replication across all original cases was not performed.
Reflections: What other high-stakes AI evaluations are contaminated by format artifacts that don't reflect deployment conditions? · How do clarifying questions versus forced-choice responses affect triage accuracy across different medical conditions? · What evaluation protocols would capture both safety risks and realistic usage patterns for consumer health AI?