Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
Joel Becker, Nate Rush, Elizabeth Barnes, David Rein
Stop extrapolating from student studies or greenfield tasks. Demand productivity metrics from developers with multi-year tenure on the same codebase. Use this RCT design—random assignment within familiar projects—as your benchmark for evaluating AI tooling ROI.
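To make the recommendation concrete, here is a minimal sketch of the within-developer randomization such an evaluation would use: each developer's own backlog is split at random into AI-allowed and AI-blocked tasks, so skill and repo familiarity are held constant across arms. All names (`Task`, `assign_arms`) are hypothetical illustrations, not the authors' code, and the balanced split is an assumption of this sketch.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    developer: str
    ai_allowed: bool = False

def assign_arms(tasks: list[Task], seed: int = 0) -> list[Task]:
    """Randomly assign each task to the AI-allowed or AI-blocked arm,
    balancing the split within each developer so differences in skill
    and project familiarity cannot explain the measured effect."""
    rng = random.Random(seed)
    by_dev: dict[str, list[Task]] = {}
    for t in tasks:
        by_dev.setdefault(t.developer, []).append(t)
    for dev_tasks in by_dev.values():
        rng.shuffle(dev_tasks)
        half = len(dev_tasks) // 2
        for i, t in enumerate(dev_tasks):
            t.ai_allowed = i < half
    return tasks
```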
AI coding assistants promise productivity gains, but evidence from experienced developers working on real codebases—not toy problems—remains scarce.
Method: 16 developers, averaging 5 years of experience on their projects, completed 246 tasks in mature open-source repos; each task was randomly assigned to allow or block early-2025 AI tools (primarily Cursor Pro with Claude 3.5/3.7 Sonnet). This is the first RCT to measure AI impact on developers working in their own production codebases rather than synthetic benchmarks, and because assignment is randomized within each developer's own repository, the design isolates the tool effect from developer skill and project familiarity.
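One simple way to summarize data from such a design is to compare completion times across arms on a log scale, so the estimate reads as a multiplicative slowdown or speedup. The sketch below is an illustrative analysis under assumed record fields (`hours`, `ai_allowed`), not the paper's exact estimator.

```python
import math
from statistics import mean

def estimated_time_ratio(records: list[dict]) -> float:
    """records: one dict per completed task with keys
    'hours' (float) and 'ai_allowed' (bool).
    Returns the ratio of geometric-mean completion time for
    AI-allowed vs AI-blocked tasks: < 1.0 means AI-allowed tasks
    finished faster, > 1.0 means they took longer."""
    log_ai = [math.log(r["hours"]) for r in records if r["ai_allowed"]]
    log_no = [math.log(r["hours"]) for r in records if not r["ai_allowed"]]
    return math.exp(mean(log_ai) - mean(log_no))

# Toy example with two tasks per arm (fabricated numbers for illustration only).
tasks = [
    {"hours": 3.0, "ai_allowed": True},
    {"hours": 2.5, "ai_allowed": True},
    {"hours": 2.0, "ai_allowed": False},
    {"hours": 2.2, "ai_allowed": False},
]
print(f"time ratio (AI / no AI): {estimated_time_ratio(tasks):.2f}")
```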
Caveats: Only 16 developers, all open-source contributors. Enterprise codebases with stricter review processes may show different patterns.
Reflections: How does AI impact vary between bug fixes, feature additions, and refactoring tasks in mature codebases? · Do productivity gains persist after 6+ months of continuous AI tool usage? · What's the effect on code review burden when AI-assisted code enters production?