Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations
David Hartmann, Amin Oueslati, Dimitri Staufer, Lena Pohlmann, Simon Munzert, Hendrik Heuer
Audit your moderation stack before deployment. Test against identity-swapped content and dialect variations. Don't rely on a single API: ensemble methods reduce both over- and under-moderation. Budget for human review of edge cases involving reclaimed language.
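A minimal sketch of the ensemble idea, assuming each provider is wrapped behind a common text-to-score function; the score_api_* stubs, the 0.5 threshold, and the two-vote rule are illustrative assumptions, not the paper's configuration:

```python
from typing import Callable, List

def score_api_a(text: str) -> float:
    """Placeholder for a real client call, e.g. a Perspective toxicity score."""
    return 0.0

def score_api_b(text: str) -> float:
    """Placeholder for a second provider, e.g. a normalized Azure severity."""
    return 0.0

def score_api_c(text: str) -> float:
    """Placeholder for a third provider, e.g. an OpenAI moderation score."""
    return 0.0

SCORERS: List[Callable[[str], float]] = [score_api_a, score_api_b, score_api_c]

def ensemble_flag(text: str, threshold: float = 0.5, min_votes: int = 2) -> bool:
    """Flag text only if at least `min_votes` providers exceed `threshold`.

    Requiring agreement dampens over-moderation driven by any single biased
    API; lowering `min_votes` to 1 trades that for higher recall on hate
    that only one provider catches.
    """
    votes = sum(score(text) >= threshold for score in SCORERS)
    return votes >= min_votes
```

Tuning `threshold` and `min_votes` on a held-out, identity-balanced sample is where the over- vs. under-moderation tradeoff gets decided; cases near the threshold are the ones worth routing to human review.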
Commercial moderation APIs flag legitimate speech while missing actual hate. Five million test cases reveal systematic failures across identity groups and dialects.
Method: The audit framework tests five commercial APIs (Google Perspective, Azure Content Safety, OpenAI Moderation, AWS Comprehend, Hive) against controlled variations of hate speech targeting different groups. It perturbs identity markers (e.g., swapping 'gay' for 'straight') and linguistic features (AAVE vs. Standard English) to expose bias patterns. The APIs over-moderate reclaimed slurs and AAVE by 23-47% while under-detecting coded hate speech that swaps explicit slurs for euphemisms.
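A hedged sketch of the identity-swap step, assuming each API is wrapped as a `score_fn(text) -> float`; the swap pairs, function names, and report fields are illustrative stand-ins, not the paper's actual templates or pipeline:

```python
from typing import Callable, Dict, List, Tuple

# Illustrative swap pairs; the audit uses its own controlled templates.
IDENTITY_SWAPS: List[Tuple[str, str]] = [
    ("gay", "straight"),
    ("black", "white"),
    ("muslim", "christian"),
]

def generate_counterfactuals(text: str) -> Dict[str, str]:
    """Return the original text plus one variant per applicable swap pair."""
    variants = {"original": text}
    lowered = text.lower()
    for a, b in IDENTITY_SWAPS:
        if a in lowered:
            variants[f"{a}->{b}"] = lowered.replace(a, b)
    return variants

def audit_scores(
    texts: List[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[dict]:
    """Score each variant and record whether the moderation decision flips.

    A flip on otherwise identical text means the identity term, not the
    content, is driving the decision (over- or under-moderation).
    """
    report = []
    for text in texts:
        scores = {name: score_fn(v) for name, v in generate_counterfactuals(text).items()}
        flags = {s >= threshold for s in scores.values()}
        report.append({"text": text, "scores": scores, "decision_flips": len(flags) > 1})
    return report
```

Running this over a labeled template set and comparing flip rates per API mirrors the over-moderation comparison in spirit; the dialect audit works the same way, with AAVE/Standard English paraphrase pairs in place of term swaps.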
Caveats: The framework requires labeled hate speech datasets and doesn't address multimodal content (images, video), where most evasion happens.
Reflections: Can ensemble methods be optimized to minimize both over- and under-moderation simultaneously, or is there an irreducible tradeoff? · How do moderation APIs perform on emerging coded hate speech that evolves faster than training data? · What's the optimal human-in-the-loop intervention rate for different community contexts?