Generalization Bias in Large Language Model Summarization of Scientific Research
Uwe Peters, Benjamin Chin-Yee
If you're building AI summarization for research, science, or medical content, add a constraint-preservation layer. Force the model to extract and surface limitations before generating summaries. Test outputs against original abstracts for scope creep.
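Below is a minimal sketch of such a constraint-preservation layer, under assumed names (`complete`, `summarize_with_constraints`, the prompt templates are all illustrative, not from the paper): extract the stated limitations first, then require the summarizer to stay within them.

```python
# Minimal sketch of a constraint-preservation layer (illustrative names throughout).
# `complete` is assumed to be any callable that sends a prompt to your LLM of
# choice and returns its text response; wire it to whatever client you use.
from typing import Callable

EXTRACT_PROMPT = (
    "List every methodological limitation stated in this abstract "
    "(sample size, population, species, study conditions, follow-up period). "
    "Return one limitation per line, verbatim where possible.\n\nABSTRACT:\n{abstract}"
)

SUMMARIZE_PROMPT = (
    "Summarize the abstract below in 2-3 sentences for a general audience. "
    "Keep the scope of every claim within these stated limitations:\n"
    "{limitations}\n\nABSTRACT:\n{abstract}"
)

def summarize_with_constraints(abstract: str, complete: Callable[[str], str]) -> dict:
    """Extract limitations first, then force the summary to respect them."""
    limitations = complete(EXTRACT_PROMPT.format(abstract=abstract)).strip()
    summary = complete(
        SUMMARIZE_PROMPT.format(limitations=limitations, abstract=abstract)
    ).strip()
    return {"limitations": limitations, "summary": summary}
```

The two-step structure is the point: a single "summarize but keep caveats" prompt is easier for the model to ignore than a summary conditioned on an explicit, already-extracted limitation list it can be audited against.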
LLMs summarizing research papers strip out caveats and scope limitations, turning "this worked in lab mice" into "this works." Readers get confident conclusions without the constraints.
Method: Tested 10 LLMs including ChatGPT-4o and Claude-3.5-Sonnet on scientific abstracts. The models systematically omitted methodological constraints—sample sizes, population specifics, controlled conditions—that limit generalizability. When asked to summarize findings, LLMs produced broader claims than the original studies warranted. The pattern held across models: they preserved the headline result but dropped the footnotes that matter.
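A rough sketch of the kind of scope-creep check this finding implies (a heuristic of my own, not the authors' coding scheme): flag summaries that drop qualifiers present in the original abstract or introduce unhedged effectiveness claims absent from the source.

```python
# Heuristic scope-creep check, not the study's instrument: compare a summary
# against its source abstract for dropped qualifiers and introduced claims.
import re

QUALIFIERS = [
    "in mice", "in rats", "in vitro", "pilot", "preliminary", "small sample",
    "may", "might", "suggests", "appears to", "under controlled conditions",
    "in this cohort", "short-term",
]

def scope_creep_flags(abstract: str, summary: str) -> list[str]:
    """Return human-readable flags where the summary looks broader than the source."""
    flags = []
    a, s = abstract.lower(), summary.lower()
    # 1. Qualifiers stated in the abstract but missing from the summary.
    for q in QUALIFIERS:
        if q in a and q not in s:
            flags.append(f"dropped qualifier: '{q}'")
    # 2. Generic effectiveness claims in the summary not grounded in the abstract.
    for claim in re.findall(r"\b(works|is effective|cures|prevents)\b", s):
        if claim not in a:
            flags.append(f"unhedged claim introduced: '{claim}'")
    return flags
```

A keyword check like this will miss paraphrased hedges; in practice you would pair it with an LLM-as-judge pass or manual spot checks against the original abstracts.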
Caveats: Study focused on abstracts, not full papers. Real-world summarization often involves longer, more complex documents where omissions may differ.
Reflections: Can fine-tuning or prompt engineering reliably preserve methodological constraints without sacrificing readability? · Do users actually notice or care about missing caveats when reading AI-generated summaries? · How does this generalization bias compound when LLMs summarize summaries?