Week 14 / April 2025

AI Systems Become Infrastructure Before Users Can Verify Them

Healthcare, professional tools, and wearables deploy models faster than humans can calibrate trust or maintain expertise

Synthesized using AI

Analyzed 144 papers. AI models can occasionally hallucinate; please verify critical details.

AI systems are crossing a threshold from optional assistance to mandatory infrastructure, and users are losing the ability to verify what these systems do. A study of cardiac arrest prognosis models exposes the core problem: different clinically reasonable ways of defining outcome labels produce fundamentally different AI recommendations, not because of technical failures but because ground truth doesn't exist. Parallel work on surgical risk surveillance, expert knowledge work with LLMs, and mental health measurement all document the same pattern: when AI handles judgment calls, humans can't maintain independent verification capability. The GUI grounding research for professional workflows and the work on dyslexia reading assistance with LLM annotation extend this further: systems that eliminate tedious mechanics between intent and execution also eliminate the visibility needed to catch errors.

The physical interface research offers a counterpoint that's equally uncomfortable for different reasons. Wrist-based PPG monitoring achieves better accuracy by reconstructing physiological signals with generative models than by improving raw sensor fidelity. VR text editing works better with simplified microgestures than realistic hand tracking. An ultra-low-power ring mouse succeeds precisely because it strips away the power-hungry fidelity that shortens battery life. Across wearables, AR/VR, and visualization, task-adapted abstractions consistently outperform faithful physical replication. This inverts decades of 'natural interface' ideology that assumed mimicking reality was the design goal.

The collision between these two patterns creates the week's central tension: we're building systems that hide implementation details to reduce cognitive load, but verification requires seeing what the system is doing. The accessibility checker using LLMs achieves 69% coverage versus 31% for rule-based tools, but needs expert oversight to catch hallucinations. Emotion regulation nudges reduce disinformation sharing by making anger visible, but only work when users can perceive their own state. The design challenge isn't choosing between automation and transparency — it's figuring out which details to surface for verification when you can't show everything.

Featured(1/5)

2504.07981

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua

Preprint·2025-04-04

Don't deploy GUI automation in professional tools yet. If you're building AI assistants for design software, test at native 4K resolution with real toolbar density—not downsampled screenshots. Prioritize single-window workflows until multi-pane grounding improves.

MLLMs fail at professional software interfaces. Targets are 3-5x smaller than consumer apps, and 4K displays expose models trained on low-res screenshots.

Method: ScreenSpot-Pro tests GUI agents on Photoshop, Blender, and CAD tools at native resolutions up to 3840x2160. Current best models hit only 25-40% accuracy on small UI elements (buttons under 50px), compared to 70%+ on consumer interfaces. The benchmark isolates three failure modes: resolution degradation during preprocessing, inability to parse dense toolbars, and confusion in multi-window layouts with overlapping coordinate systems.
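A rough sketch of why preprocessing hurts small targets. It assumes grounding is scored by whether the predicted click point lands inside the target's bounding box, which is a common convention but is not stated in this digest. The names `Box`, `grounding_accuracy`, and `scaled_side`, the 1280 px model input width, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned target bounding box in screen pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions, targets):
    """Fraction of predicted (x, y) points that land inside their target box."""
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


def scaled_side(side_px: float, native_width: int, model_width: int) -> float:
    """Side length of a UI element after the screenshot is resized to model_width."""
    return side_px * model_width / native_width


# A 50 px button on a 3840 px wide screenshot, resized to an assumed 1280 px input.
small_button = scaled_side(50, native_width=3840, model_width=1280)
print(round(small_button, 1))  # 16.7

# Made-up example: one large target, one small toolbar icon.
targets = [Box(100, 100, 150, 150), Box(3000, 40, 3024, 64)]
predictions = [(120, 130), (3030, 50)]  # second click misses by 6 px
print(grounding_accuracy(predictions, targets))  # 0.5
```

Under these assumptions a 50 px button shrinks to under 17 px after resizing, so a miss of a few pixels, harmless on a consumer-sized target, is enough to fail.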

Caveats: Benchmark focuses on static screenshots, not dynamic interactions like drag-and-drop or modal dialogs that appear mid-workflow.

Reflections: Can fine-tuning on high-res professional UI datasets close the 30-point accuracy gap? · How do coordinate system transformations in multi-monitor setups affect grounding performance? · What's the minimum viable resolution for acceptable accuracy in dense interfaces?

design-tools · ai-interaction · evaluation-methods

2503.24037

Digital Nudges Using Emotion Regulation to Reduce Online Disinformation Sharing

Haruka Nakajima Suzuki, Midori Inaba

Journal 2025·2025-03-31

Add emotion-detection triggers to share buttons on high-engagement posts. Gate the share action with a 3-second 'label your reaction' interstitial when sentiment analysis flags anger. Don't apply this universally—false positives on neutral content will annoy users.

Anger drives disinformation sharing. Users repost false content when emotionally activated, bypassing deliberation entirely.

Method: Two nudges outperformed controls: an emotion-labeling prompt ('You seem angry—take a moment') and a reappraisal nudge ('Consider another perspective'). The emotion-labeling intervention reduced sharing intent by 18% when users reported strong anger (p<0.01). The mechanism is cognitive reappraisal—forcing users to name their emotion creates a 3-5 second pause that disrupts automatic sharing. Critically, the nudge only worked for high-anger content; neutral posts saw no effect.
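A minimal sketch of how such a gate might sit in front of a share button. The `anger_score` input, the 0.7 threshold, and the names `ShareDecision` and `gate_share` are assumptions for illustration; the study does not describe an implementation, and the prompt wording is adapted from the example above.

```python
from dataclasses import dataclass

ANGER_THRESHOLD = 0.7   # assumed cutoff; the study does not specify one
PAUSE_SECONDS = 3       # interstitial length suggested in this digest's recommendation


@dataclass(frozen=True)
class ShareDecision:
    show_interstitial: bool
    pause_seconds: int
    prompt: str


def gate_share(anger_score: float, threshold: float = ANGER_THRESHOLD) -> ShareDecision:
    """Decide whether to interpose an emotion-labeling prompt before sharing.

    anger_score is a hypothetical 0..1 output from some upstream classifier.
    """
    if not 0.0 <= anger_score <= 1.0:
        raise ValueError("anger_score must be in [0, 1]")
    if anger_score >= threshold:
        return ShareDecision(True, PAUSE_SECONDS, "You seem angry. Take a moment.")
    # Neutral content passes straight through: the study saw no effect there.
    return ShareDecision(False, 0, "")


print(gate_share(0.9).show_interstitial)  # True
print(gate_share(0.2).show_interstitial)  # False
```

Keeping the threshold as a parameter matters here, because the false-positive cost on neutral posts is the main risk the recommendation flags.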

Caveats: Study used hypothetical sharing scenarios, not real platform behavior. Habituation effects unknown after repeated exposure.

Reflections: Does the nudge lose effectiveness after 5-10 exposures as users learn to dismiss it? · Can real-time sentiment analysis accurately flag anger without misclassifying passionate agreement? · What's the optimal delay duration before habituation sets in?

trust-safety · ethics · social-computing

2504.00941

Let AI Read First: Enhancing Reading Abilities for Individuals with Dyslexia through Artificial Intelligence

Sihang Zhao, Shoucong Carol Xiong, Bo Pang, Xiaoying Tang, Pinjia He

CHI 2025·2025-04-01

Integrate LLM-based annotation layers into reading apps for dyslexic users. Prioritize phonetic breakdowns for words over 8 letters and contextual synonyms for high-frequency ambiguous terms. Test with real dyslexic users—don't assume neurotypical preferences transfer.

Dyslexic readers struggle with phonological decoding and working memory load. Existing tools either distort meaning (text-to-speech) or require expensive tutoring.

Method: LARF uses GPT-4 to pre-annotate text with three layers: phonetic breakdowns for complex words, contextual synonyms for ambiguous terms, and sentence-level summaries for long passages. In a 42-participant study, dyslexic readers using LARF improved comprehension scores by 23% and reduced reading time by 19% compared to plain text. The key mechanism is cognitive offloading—annotations reduce working memory demands by externalizing phonological and semantic processing.
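To make the annotation layers concrete, here is a small sketch of the pre-selection step that could run before any LLM call: picking out which words would get a phonetic breakdown and which passages would get a summary. It is not LARF's implementation. The "over 8 letters" rule comes from the recommendation above, while the function names and the 120-word summary cutoff are assumptions.

```python
import re

LONG_WORD_MIN_LETTERS = 9  # "over 8 letters", per this digest's recommendation


def phonetic_candidates(text: str) -> list[str]:
    """Return unique words longer than 8 letters, in order of first appearance."""
    seen = set()
    candidates = []
    for word in re.findall(r"[A-Za-z]+", text):
        key = word.lower()
        if len(key) >= LONG_WORD_MIN_LETTERS and key not in seen:
            seen.add(key)
            candidates.append(key)
    return candidates


def needs_summary(passage: str, max_words: int = 120) -> bool:
    """Flag long passages for a summary; the 120-word cutoff is an assumption."""
    return len(passage.split()) > max_words


sample = "The phonological decoding of unfamiliar vocabulary is exhausting."
print(phonetic_candidates(sample))
# ['phonological', 'unfamiliar', 'vocabulary', 'exhausting']
```

Note that "decoding" has exactly 8 letters and is skipped. A fixed length cutoff is crude, which is one reason the caveat about annotation density applies.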

Caveats: Tested only on English narrative text. Technical documentation and non-Latin scripts untested. Annotation density may overwhelm some users.

Reflections: What's the optimal annotation density before cognitive load increases instead of decreases? · Can users customize which annotation types to display based on individual dyslexia profiles? · Does long-term use improve unaided reading ability or create dependency?

accessibility · ai-interaction · healthcare


Findings(1/5)

AI assistance shifts from task automation to expertise preservation·Measurement validity emerges as the bottleneck in high-stakes AI deployment·Interaction design moves from explicit commands to ambient inference·Professional GUI environments expose the limits of general-purpose AI agents·Behavioral interventions target emotional regulation over information correction

Generative AI is forcing a reframe: the goal isn't replacing expert labor but maintaining expert cognition under delegation pressure. Studies of survey authoring and document analysis show experts welcome AI for repetitive tasks but resist offloading judgment-intensive work, revealing that AI's value lies in protecting scarce cognitive resources rather than eliminating them. The design challenge becomes identifying which tasks degrade expertise when automated versus which tasks protect it by reducing cognitive load.

2503.24334

Augmenting Expert Cognition in the Age of Generative AI: Insights from Document-Centric Knowledge Work

2504.02551

Human-Centered Development of an Explainable AI Framework for Real-Time Surgical Risk Surveillance

Surprises(1/3)

Experts resist AI delegation precisely where it promises the most value·Teens receive support for body image struggles but participate in shaming others·Live coding's effectiveness depends more on infrastructure than instructor skill

Domain experts in survey authoring and document analysis welcome AI for repetitive information tasks but resist offloading judgment-intensive work—the very tasks where AI could provide the largest productivity gains. The resistance isn't irrational: experts recognize that delegating complex reasoning degrades the cognitive skills that make them experts. Productivity optimization and expertise preservation may be fundamentally opposed objectives.

2503.24334

Augmenting Expert Cognition in the Age of Generative AI: Insights from Document-Centric Knowledge Work

TOOLBOX(7)

TransforMerger

Code

Transformer-based reasoning model that fuses voice and gesture inputs into unified sentences for robotic manipulation commands. Uses probabilistic embeddings to handle uncertainty and contextual scene understanding to resolve ambiguous references. Enables robust human-robot communication in noisy, misaligned scenarios. Code and datasets available for simulated and real-world experiments.

2504.01708

VizCreativity Coded Papers Database

Dataset

Systematic review database of 58 papers characterizing creativity in visualization design. Includes coded analysis of creative representations (infographics, pictorials, data comics), design activities (sketching, storyboarding), and creative tasks. Provides structured taxonomy for understanding how creativity manifests in visualization design processes and authoring tools.

2504.02204

Cybersickness Assessment Framework (TestBed)

Framework

Standardized VR environment framework for conducting consistent cybersickness experiments. Provides pre-built VR content and experiment management tools to eliminate variability across studies. Enables researchers to evaluate and compare impact of various factors on cybersickness with common foundation, reducing resource requirements for creating custom VR environments.

2504.02675

CP-PPG

Model

Deep adversarial framework that transforms contact pressure-distorted PPG signals into high-fidelity waveforms with ideal morphology. Integrates custom data collection protocol, signal processing pipeline, and PPG-aware loss function. Improves signal fidelity by 40% and downstream physiological monitoring (HR, HRV, RR, BP) by 4-46% across metrics.

2504.02735

ScreenSpot-Pro

Dataset

Benchmark dataset for evaluating GUI grounding capabilities of MLLMs in high-resolution professional settings. Contains authentic high-resolution images with expert annotations spanning 23 applications across five industries and three operating systems. Designed to test models on smaller targets and complex professional workflows.

2504.07981

ScreenSeekeR

Model

Visual search method for GUI grounding that uses strong planner's GUI knowledge to guide cascaded search in professional applications. Achieves 48.1% accuracy on ScreenSpot-Pro benchmark (state-of-the-art) without additional training by strategically reducing search area. Designed for high-resolution professional GUI agent tasks.

2504.07981

microGEXT

Tool

Lightweight microgesture-based system for text editing in VR without external sensors. Uses small, subtle hand movements to reduce physical strain compared to standard gestures. Reduces overall edit time and fatigue versus ray-casting + pinch menu baseline. Offers device-free, ergonomic alternative to traditional VR text editing.

2504.04198

From Intuition to Understanding: Using AI Peers to Overcome Physics Misconceptions

Designs an AI "peer" that helps students correct physics misconceptions without becoming a crutch. Tests whether conversational scaffolding can build critical thinking instead of dependency.

2504.04299

AI-induced sexual harassment: Investigating Contextual Characteristics and User Reactions of Sexual Harassment by a Companion Chatbot

Documents users experiencing sexual harassment from Replika and similar companion chatbots. Maps the contextual triggers and emotional fallout when AI crosses boundaries it wasn't supposed to have.

2504.02217

The Plot Thickens: Quantitative Part-by-Part Exploration of MLLM Visualization Literacy

Systematically tests what makes data visualizations legible to multimodal LLMs. Discovers that color, shape, and text affect model comprehension in ways that diverge from human perception.

2504.02664

How humans evaluate AI systems for person detection in automatic train operation: Not all misses are alike

Shows that humans judge AI detection failures differently based on context—missing a person near tracks matters more than missing distant pedestrians. Challenges uniform accuracy metrics for safety systems.

2504.01153

Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations

Tests whether integrated web search helps users catch LLM hallucinations. Finds that search results can paradoxically make verification harder when context misleads rather than clarifies.

2504.00221

GazeLLM: Multimodal LLMs incorporating Human Visual Attention

Feeds human gaze data into multimodal LLMs to improve activity understanding from first-person video. Explores whether attention patterns help models figure out what humans actually care about.

2503.23688

Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions

Generates 19,712 responses across 11 LLMs using English and Chinese prompts about U.S.-China relations. Maps systematic geopolitical biases that shift based on language and question framing.

2504.01367

Enhancing Computational Notebooks with Code+Data Space Versioning

Addresses the mismatch between nonlinear data exploration and sequential notebook design. Proposes versioning that tracks both code and data state for branching, undos, and complete reverts.

REFLECTION(4)

Infrastructure demands trust we can't verify

AI systems are shifting from optional tools to mandatory infrastructure in healthcare, education, and knowledge work. Yet the research shows users cannot reliably calibrate trust or verify outputs when AI becomes the layer they depend on—creating a fundamental mismatch between adoption speed and safety assurance.

Abstraction accelerates workflows but erases visibility. When design tools eliminate mechanical steps between intent and execution, users lose the checkpoints that once caught misalignment—so how do you validate understanding when the implementation details vanish?


ABOUT THIS ISSUE

How was this newsletter synthesized?

Methodology

This newsletter is generated by an AI pipeline (leveraging Anthropic Sonnet 4.5 & Haiku 4.5) that processes the metadata and abstracts of every new arXiv HCI paper from the past week (144 this issue). Each paper is scored from 1 to 5 on three dimensions: Practice (applicability for practitioners), Research (scientific contribution), and Strategy (industry implications). Papers that pass the threshold are grouped into topic clusters, and each cluster is summarized to capture what that body of research is exploring.

Selection Criteria

The pipeline builds a curated selection that balances high scores with topic diversity—and deliberately includes at least one 'contrarian' paper that challenges prevailing assumptions. This selection is then analyzed to identify key findings (patterns across multiple papers) and surprises (results that contradict conventional wisdom). A narrative synthesis ties the week's research together under a unifying frame.
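One way the score-plus-diversity selection could work is sketched below. This is an illustration, not the pipeline's actual code. The summed score, the threshold of 9, the greedy one-per-topic pass, the `contrarian` flag, and the placeholder paper IDs are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    paper_id: str
    topic: str
    practice: int   # 1-5
    research: int   # 1-5
    strategy: int   # 1-5
    contrarian: bool = False

    @property
    def total(self) -> int:
        return self.practice + self.research + self.strategy


def select(papers, k=3, threshold=9):
    """Greedy pick: best score first, preferring unseen topics, then force in
    a contrarian paper if none was chosen."""
    eligible = sorted((p for p in papers if p.total >= threshold),
                      key=lambda p: p.total, reverse=True)
    chosen, topics = [], set()
    for p in eligible:                      # first pass: new topics only
        if len(chosen) < k and p.topic not in topics:
            chosen.append(p)
            topics.add(p.topic)
    for p in eligible:                      # second pass: fill remaining slots
        if len(chosen) < k and p not in chosen:
            chosen.append(p)
    if chosen and not any(p.contrarian for p in chosen):
        best_contrarian = next((p for p in eligible if p.contrarian), None)
        if best_contrarian is not None:
            chosen[-1] = best_contrarian    # swap out the last pick
    return chosen


papers = [
    Paper("paper-a", "ai-interaction", 5, 4, 4),
    Paper("paper-b", "ai-interaction", 4, 4, 4),
    Paper("paper-c", "accessibility", 4, 3, 4),
    Paper("paper-d", "trust-safety", 3, 3, 3, contrarian=True),
    Paper("paper-e", "wearables", 3, 4, 3),
    Paper("paper-f", "education", 2, 2, 2),  # below threshold
]
print([p.paper_id for p in select(papers)])
# ['paper-a', 'paper-c', 'paper-d']
```

In this toy run, paper-b loses its slot to topic diversity despite the second-highest score, and paper-e is displaced so the selection contains a contrarian paper.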

Key Themes Discovered

Field Report: ai-interaction

Trust, Calibration, and Cognitive Augmentation

This cluster examines how humans calibrate reliance on AI systems and maintain cognitive agency during collaboration. Core tensions emerge: users must decide when to delegate versus retain control, how to verify AI outputs without over-trusting or under-utilizing capabilities, and whether AI augments or erodes expertise. Research spans knowledge work, education, safety-critical domains, and emotional support, revealing systematic mismatches between AI capabilities and human expectations. Design implications center on preserving deliberate practice, enabling selective delegation aligned with expertise, and supporting metacognitive awareness rather than passive consumption.


Top Papers in this Theme

2503.24150

Learning a Canonical Basis of Human Preferences from Binary Ratings

2504.00408

From Intuition to Understanding: Using AI Peers to Overcome Physics Misconceptions

2504.02664

How humans evaluate AI systems for person detection in automatic train operation: Not all misses are alike

2504.04299

AI-induced sexual harassment: Investigating Contextual Characteristics and User Reactions of Sexual Harassment by a Companion Chatbot

2503.23574

Navigating Uncertainties: Understanding How GenAI Developers Document Their Models on Open-Source Platforms