Systems Optimize for the Wrong Thing When Deployed at Scale

ABOUT THIS ISSUE

How was this newsletter synthesized?

Methodology

This newsletter is generated by an AI pipeline (using Anthropic's Sonnet 4.5 and Haiku 4.5) that processes the metadata and abstracts of every new arXiv HCI paper from the past week: 77 in this issue. Each paper is scored from 1 to 5 on three dimensions: Practice (applicability for practitioners), Research (scientific contribution), and Strategy (industry implications). Papers that pass the score threshold are grouped into topic clusters, and each cluster is summarized to capture what that body of research is exploring.
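The scoring and filtering step described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `ScoredPaper` record, the use of the mean of the three scores, and the 3.5 cutoff are hypothetical, since the newsletter does not say how the threshold is defined or how topics are assigned.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPaper:
    """Hypothetical record for one scored paper (field names are assumptions)."""
    arxiv_id: str
    title: str
    practice: int   # applicability for practitioners, 1-5
    research: int   # scientific contribution, 1-5
    strategy: int   # industry implications, 1-5
    topic: str      # cluster label, assumed to be assigned upstream

    def __post_init__(self):
        # Reject scores outside the 1-5 range stated in the methodology.
        for name in ("practice", "research", "strategy"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    @property
    def mean_score(self) -> float:
        # Assumption: the three dimensions are weighted equally.
        return (self.practice + self.research + self.strategy) / 3


def group_by_topic(papers, threshold=3.5):
    """Keep papers whose mean score meets the (hypothetical) threshold,
    then group the survivors by topic cluster."""
    clusters = defaultdict(list)
    for paper in papers:
        if paper.mean_score >= threshold:
            clusters[paper.topic].append(paper)
    return dict(clusters)


# Placeholder papers with made-up scores, for illustration only.
papers = [
    ScoredPaper("0000.00001", "Paper A", 5, 4, 4, "ai-interaction"),
    ScoredPaper("0000.00002", "Paper B", 2, 3, 2, "ai-interaction"),
    ScoredPaper("0000.00003", "Paper C", 4, 4, 3, "privacy"),
]
clusters = group_by_topic(papers)
for topic, members in clusters.items():
    print(topic, [p.title for p in members])
```

A real pipeline might threshold each dimension separately or weight them differently; the equal-weight mean is only the simplest choice that fits the description.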

Selection Criteria

The pipeline builds a curated selection that balances high scores with topic diversity, and it deliberately includes at least one 'contrarian' paper that challenges prevailing assumptions. This selection is then analyzed to identify key findings (patterns across multiple papers) and surprises (results that contradict conventional wisdom). A narrative synthesis ties the week's research together under a unifying frame.
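One way to read the selection rule above is as a greedy pick with a diversity penalty and a reserved slot for a contrarian paper. The sketch below rests on stated assumptions rather than the pipeline's actual logic: the `Candidate` record, the `contrarian` flag, and the `diversity_penalty` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate paper (field names are assumptions)."""
    title: str
    score: float
    topic: str
    contrarian: bool = False


def curate(candidates, k, diversity_penalty=0.5):
    """Greedily pick up to k papers.

    A slot is reserved for the highest-scoring contrarian paper, if any.
    Each later pick maximizes score minus a penalty that grows with the
    number of already-selected papers from the same topic.
    """
    if k <= 0:
        return []
    remaining = list(candidates)
    selected = []
    topic_counts = {}

    def take(paper):
        selected.append(paper)
        remaining.remove(paper)
        topic_counts[paper.topic] = topic_counts.get(paper.topic, 0) + 1

    contrarians = [c for c in remaining if c.contrarian]
    if contrarians:
        take(max(contrarians, key=lambda c: c.score))

    while remaining and len(selected) < k:
        take(max(
            remaining,
            key=lambda c: c.score - diversity_penalty * topic_counts.get(c.topic, 0),
        ))
    return selected


# Placeholder candidates with made-up scores, for illustration only.
pool = [
    Candidate("A", 4.8, "trust"),
    Candidate("B", 4.6, "trust"),
    Candidate("C", 4.3, "privacy"),
    Candidate("D", 3.9, "accessibility", contrarian=True),
    Candidate("E", 4.5, "trust"),
]
picked = curate(pool, k=3)
print([p.title for p in picked])
```

With these placeholder numbers the contrarian paper is taken first, and the penalty then lets a lower-scoring paper from a new topic beat a second paper from a topic that is already represented. Raising `diversity_penalty` favors breadth; setting it to 0 reduces the loop to a plain top-k by score.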

Key Themes Discovered

Field Report: ai-interaction

Trust Calibration in Human-AI Workflows

This cluster examines how users decide when to trust, verify, and rely on AI systems across real-world tasks. The core questions are how uncertainty displays, explanations, and agent timing shape verification behavior, and when users over-rely or under-rely on AI suggestions. The research spans trust calibration mechanisms (uncertainty granularity, confidence displays), reliance measurement (offloading scores, delegation vs. adoption), and contextual factors (warmth, source labels, interaction type). The work is methodologically diverse (controlled experiments, field deployments, audits) but unified by a focus on behavioral outcomes rather than system accuracy alone. It is primarily relevant for interaction designers and AI system architects.

Top Papers in this Theme

2605.25856

Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

2605.29473

Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles

2605.29543

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

2605.28571

Not All Uncertainty Is Equal: How Uncertainty Granularity Shapes Human Verification in LLM-Assisted Decision Making

2605.24830

Macaron-A2UI: A Model for Generative UI in Personal Agents

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

EyeSpy: Inferring Eye Gaze via Side-Channel Attacks Against Foveated Rendering

ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows

Why Meditation Wearables Fail: Reward Misspecification in Closed-Loop EEG and Biofeedback Systems

Mobilio

ATWL (Artifact-Transform Workflow Language)

ATWL Workflow Library

SCOPE (Semantic reasoning for Communication via Open-set Plug-in with Examples)

DARE (Directionality-Aware Reslicing)

SmartIterator

IteraScope

Puno Quechua Speech Corpus

Puno Quechua ASR Models

The Trust Paradox: How CS Researchers Engage LLM Leaderboards

Not All Uncertainty Is Equal: How Uncertainty Granularity Shapes Human Verification in LLM-Assisted Decision Making

Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

Structuring Human-AI Productive Interdependence by Strategic Level of Automation Selection for Qualitative Inquiry

GUI Agents for Continual Game Generation

Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs

Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles

Synthesized using AI

Correct answers, broken workflows
