The Missing Evaluation Axis: What 10,000 Student Submissions Reveal About AI Tutor Effectiveness
Rose Niousha, Samantha Boatright Smith, Bita Akram, Peter Brusilovsky, Arto Hellas, Juho Leinonen, John DeNero, Narges Norouzi
Stop evaluating AI tutors on feedback quality alone. Add behavioral metrics: did students revise their code after feedback? Did they apply it correctly? These signals predict perceived helpfulness better than expert ratings of pedagogy.
AI tutors are evaluated on pedagogical quality—how good the feedback sounds—but not on whether students actually use it. A tutor can give perfect advice that students ignore.
Method: Analyzed 10,235 code submissions to measure whether students act on AI feedback and apply it correctly. This behavioral dimension, engagement patterns, correlated more strongly with students' perceived helpfulness than pedagogical quality did. Two deployed tutors with similar pedagogical scores showed substantial differences in engagement that a pedagogy-only evaluation missed entirely.
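To make the behavioral dimension concrete, here is a minimal sketch of how such engagement signals could be computed from interaction logs. The `Interaction` schema, its field names, and the binary engagement signal are illustrative assumptions, not the paper's actual pipeline; the paper does not specify its implementation.

```python
# Hypothetical sketch: compute behavioral engagement metrics from
# (feedback, next-submission) pairs and rank-correlate them with
# students' helpfulness ratings. Schema and fields are assumptions.
from dataclasses import dataclass
from scipy.stats import spearmanr

@dataclass
class Interaction:
    student_id: str
    code_before: str        # submission that triggered AI feedback
    code_after: str | None  # next submission by the same student, if any
    issue_resolved: bool    # did the flagged problem disappear afterward?
    helpfulness: int        # student's 1-5 rating of the feedback

def _revised(e: Interaction) -> bool:
    """Did the student change their code after receiving feedback?"""
    return e.code_after is not None and e.code_after != e.code_before

def revision_rate(events: list[Interaction]) -> float:
    """Fraction of feedback events followed by a changed submission."""
    return sum(_revised(e) for e in events) / len(events)

def correct_application_rate(events: list[Interaction]) -> float:
    """Among revisions, fraction where the flagged issue was actually fixed."""
    revised = [e for e in events if _revised(e)]
    return sum(e.issue_resolved for e in revised) / max(len(revised), 1)

def engagement_vs_helpfulness(events: list[Interaction]):
    """Rank-correlate a per-event engagement signal (revised AND fixed)
    with the student's perceived-helpfulness rating."""
    engaged = [int(_revised(e) and e.issue_resolved) for e in events]
    ratings = [e.helpfulness for e in events]
    return spearmanr(engaged, ratings)
```

Under this framing, the paper's headline comparison reduces to checking whether the engagement signal's correlation with helpfulness ratings exceeds that of expert pedagogy scores on the same events.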
Caveats: Tested only in introductory programming. Other domains may show different engagement patterns.
Reflections: Do behavioral engagement patterns predict long-term learning outcomes, or just immediate satisfaction? · Can tutors be optimized directly for engagement metrics, or does that create perverse incentives? · How do engagement patterns differ across disciplines beyond programming?