Vipera: Towards systematic auditing of generative text-to-image models at scale
Yanwei Huang, Wesley Hanwen Deng, Sijia Xiao, Motahhare Eslami, Jason I. Hong, Adam Perer
Stop running unstructured prompt tests. Use scene graphs to map your audit space before you start generating images. Prioritize this for high-stakes domains like medical or legal imagery, where biased outputs do the most harm.
Problem: Auditing text-to-image models for bias and misinformation is largely ad hoc and unscalable; auditors lack structured ways to explore the space of possible failure modes.
Method: Vipera uses scene graphs (structured representations of a scene's objects and their relationships) to organize audit criteria hierarchically. Instead of testing prompts at random, auditors navigate structured categories (e.g., 'gender' → 'occupation' → 'doctor') and observe how the model responds across systematic variations. The scene graph acts as a visual index: auditors drill down from broad categories to specific edge cases, then collect and compare the generated images in clusters.
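The expansion step is easy to picture as code. Below is a minimal Python sketch, not Vipera's actual implementation: `SceneNode` and `audit_prompts` are hypothetical names, and the sketch covers only how audit axes organized as graph nodes sweep out a systematic cross-product of prompts; the real system layers interactive visualization and image clustering on top.

```python
# Hypothetical sketch of scene-graph-driven prompt expansion.
# SceneNode and audit_prompts are illustrative names, not Vipera's API.
from dataclasses import dataclass, field
from itertools import product

@dataclass
class SceneNode:
    """One node in the audit scene graph: an object, attribute, or relation."""
    label: str
    variants: list[str] = field(default_factory=list)    # values to sweep over
    children: list["SceneNode"] = field(default_factory=list)  # deeper drill-down criteria (unused here)

def audit_prompts(template: str, axes: dict[str, SceneNode]) -> list[str]:
    """Expand a prompt template over every combination of variant values,
    so each audit axis (e.g., gender x occupation) is covered exhaustively
    rather than by ad hoc sampling."""
    names = list(axes)
    value_lists = [axes[name].variants for name in names]
    return [template.format(**dict(zip(names, combo)))
            for combo in product(*value_lists)]

# Drill down from broad categories to a specific edge case:
gender = SceneNode("gender", variants=["a woman", "a man", "a nonbinary person"])
occupation = SceneNode("occupation", variants=["doctor", "nurse", "surgeon"])

prompts = audit_prompts("a photo of {gender} working as a {occupation}",
                        {"gender": gender, "occupation": occupation})
# Yields 9 prompts; send each to the T2I model, then cluster and compare outputs.
```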
Caveats: Requires manual construction of scene graphs for each domain. No automated taxonomy generation yet.
Reflections: Can scene graphs be auto-generated from existing prompt datasets to reduce setup overhead? · How do different T2I models respond to identical scene graph structures? · What's the minimum viable scene graph depth for catching 80% of bias patterns?