Week 25 / June 2026

Optimization Targets That Mask Process-Level Harm

LLMs pass safety tests while eroding reflection; robots reduce anxiety but increase social difficulties

Synthesized using AI

Analyzed 95 papers. AI models can occasionally hallucinate; please verify critical details.

Systems optimized for engagement or performance metrics are undermining the human capacities they claim to support, and the gap is widening. Research on cognitive atrophy in LLMs introduces a 20-attribute clinical schema applied to 42,230 responses across five models, finding moderate-to-high rates of directive advice and problem-solving that reinforce dependence rather than reflection. The models passed standard safety benchmarks but failed on process-level measures: they responded to overt safety cues but adapted poorly when users sought solutions, creating interactions that erode users' ability to think independently. A social robotics study using a withdrawal design found continued robot access reduced anxiety in autistic children but was associated with increased parent-reported social difficulties and weaker emotion recognition gains versus removal: high engagement siloed social behavior rather than transferring skills to human relationships. Mobile agent research shows CLI-first architectures match or beat GUI baselines while completing tasks in roughly 40% fewer steps (10.7 versus 18.6), questioning whether screen-based interaction was ever necessary for agents that can access device APIs directly.

The pattern isn't about AI capability; it's about optimization targets that obscure second-order effects. Cognitive atrophy measures what happens to users over time, not what AI produces in the moment. Withdrawal design tests whether assistive technology scaffolds independence or creates dependence. CLI evaluation reveals that GUI interaction may compound errors rather than enable task completion. Meanwhile, mixed reality research demonstrates that perceptual fidelity has crossed a threshold: four attacks on Apple Vision Pro achieved 85-100% success rates because users cannot reliably distinguish virtual from physical content, even when they recognize something is virtual. The implication: as systems become more capable, the distance between what we measure and what matters grows larger. Surface metrics (safety scores, engagement rates, task completion) actively mislead when the goal is long-term human capacity rather than short-term performance.

Featured(1/6)

2606.18129

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Preprint·2026-06-16

Don't trust surface-level safety scores for therapeutic AI. Audit for cognitive atrophy: does your chatbot solve problems for users or scaffold their own thinking? If it's the former, you're building dependence.

LLMs in mental-health support pass safety benchmarks but may undermine users' ability to reflect, cope, and decide independently. Existing metrics miss this process-level harm.

Method: Introduced COGNITIVE ATROPHY BENCH: 1,576 human counseling conversations, 42,230 LLM responses, and a 20-attribute clinical schema applied by six trained reviewers. Five LLMs showed moderate-to-high atrophy-aligned behavior—directive advice, problem-solving, and validation that reinforces dependence rather than reflection. Models responded to overt safety cues but adapted poorly when users sought solutions.

Caveats: Schema developed for mental-health contexts. Transfer to other sensitive domains (legal advice, medical triage) unverified.

Reflections: Can LLMs be fine-tuned to reduce cognitive atrophy without sacrificing perceived helpfulness? · Do users recognize atrophy-inducing patterns in their own interactions, or does dependence develop invisibly? · How does cognitive atrophy manifest in non-therapeutic domains like education or workplace coaching?

ai-interaction · healthcare · ethics · trust-safety

2606.19388

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

Preprint·2026-06-16

Stop forcing mobile agents through the GUI. CLI agents match or beat GUI baselines in roughly 40% fewer steps (10.7 versus 18.6). Best for bulk operations, multi-condition filtering, and cross-app workflows where GUI navigation compounds errors.

Mobile agents default to GUI interactions—perceiving screens, tapping buttons—but phones also expose command-line interfaces with direct access to services and data. CLI remains unexplored.

Method: Evaluated three coding agents on AndroidWorld and MobileWorld without mobile-specific training. Claude Code (Opus 4.7) reached 71.8% and 51.9% respectively, outperforming every reproducible GUI baseline (best: 69.3% and 43.2%). Oracle CLI solutions hit 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld. CLI agents completed tasks in 10.7 steps versus 18.6 for GUI agents.

Caveats: Oracle solutions show ceiling, but no CLI agent yet approaches it. Gap suggests substantial room for prompt engineering or tool design.
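The figures quoted for this paper can be sanity-checked with a few lines of arithmetic. The sketch below only re-derives numbers already stated in the summary (the oracle solve rate, the step reduction, and the gap between the best CLI agent and the oracle ceiling); the variable names are illustrative and not taken from the paper.

```python
# Figures copied from the summary above; names are illustrative only.
cli_solvable, total_tasks = 103, 116          # AndroidWorld tasks solvable via CLI
cli_steps, gui_steps = 10.7, 18.6             # reported steps per task

# Oracle ceiling on AndroidWorld, as a percentage.
oracle_androidworld = round(100 * cli_solvable / total_tasks, 1)

# Share of steps saved by CLI agents relative to GUI agents.
step_reduction = round(100 * (1 - cli_steps / gui_steps), 1)

# Headroom between the oracle ceiling and the best reported CLI agent,
# in percentage points, for AndroidWorld and MobileWorld.
gap_androidworld = round(88.8 - 71.8, 1)
gap_mobileworld = round(86.3 - 51.9, 1)

print(oracle_androidworld, step_reduction, gap_androidworld, gap_mobileworld)
# prints: 88.8 42.5 17.0 34.4
```

On these numbers the step saving is closer to two-fifths than one-half, and the headroom to the oracle is about 17 points on AndroidWorld and about 34 points on MobileWorld.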

Reflections: Can hybrid agents dynamically choose between CLI and GUI based on task structure? · Do CLI agents generalize better across OS versions than GUI agents, which break on UI redesigns? · What percentage of real-world user intents are CLI-solvable versus GUI-only?

ai-interaction · mobile-interfaces · programming-tools · evaluation-methods

2606.16439

Beyond Usability: A UX Case Study on Using "Withdrawal Design" to Challenge Engagement Metrics in Social Robotics

Yibo Meng, Qiuyu Long, Richard Chen, Yan Guan, Xiaolan Ding

Preprint·2026-06-15

Don't optimize for retention with vulnerable users. Design for separation that bridges back to human relationships. High engagement can mask ecological harm when it substitutes for, rather than scaffolds, human connection.

Social robots for autistic children are evaluated by engagement and interaction quality, assuming the robot scaffolds social skills. But what happens when you remove the robot?

Method: 8-week randomized controlled trial (N=40) with withdrawal design. Continued robot access reduced anxiety (SCARED/RCADS) but was associated with lower parent-reported social motivation and weaker emotion recognition gains (SMS/RMET) versus withdrawal. Interviews revealed removal sometimes prompted children to seek human interaction, while continued use siloed social behavior within the child-robot dyad—despite exceptionally high usability (SUS).

Caveats: Small sample (N=40), home-based deployment. Findings may not generalize to classroom or clinical settings with different social structures.

Reflections: Can robots be designed with explicit 'graduation' mechanics that fade interaction over time? · Do other assistive technologies (voice assistants, therapy apps) show similar engagement-versus-transfer tradeoffs? · How long does the post-withdrawal effect last—does human-seeking behavior persist or regress?

social-computing · healthcare · ethics · evaluation-methods

Findings(1/5)

Evaluation shifts from output quality to behavioral trajectory · Interface design moves from single-user optimization to role-preserving asymmetry · Security research identifies perceptual confusion as an exploitable primitive · Documentation practices expand to capture improvisation, not just specification · Command-line interfaces return as first-class mobile interaction paradigm

Benchmarks traditionally measure what AI produces—accuracy, safety scores, response quality. Two papers now measure what AI does to users over time. One formalizes cognitive atrophy as a process-level metric for whether LLM interactions erode users' reflection and decision-making capacity. Another uses withdrawal design to test what changes when a social robot is removed from autistic children's homes, finding continued access reduced anxiety but was associated with increased parent-reported social difficulties. The shift: from scoring artifacts to tracking influence.

2606.18129

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

2606.16439

Beyond Usability: A UX Case Study on Using "Withdrawal Design" to Challenge Engagement Metrics in Social Robotics

Surprises(1/3)

Continued robot access reduced anxiety but increased social difficulties · Learners anchor to code and resist visual scaffolds despite design principles · LLMs generate synthetic lived experience narratives without authentic possession

In an 8-week randomized controlled trial with 40 children with autism, continued access to a social robot reduced anxiety scores, yet was associated with increased parent-reported social difficulties compared to withdrawal. The assumption was that social scaffolding robots uniformly benefit development. The data suggests prolonged access may create dependencies that complicate peer interaction. Benefit isn't monotonic with exposure duration.

2606.16439

Beyond Usability: A UX Case Study on Using "Withdrawal Design" to Challenge Engagement Metrics in Social Robotics

TOOLBOX(8)

COGNITIVE ATROPHY BENCH

Dataset

Clinically grounded benchmark for measuring cognitive atrophy in LLM mental-health support. Built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Includes 20-attribute schema developed by three clinical experts, with 5,324 span-grounded reviewer judgments from six trained clinical reviewers. Introduces User-Input Risk Index (UIRI) and Cognitive Atrophy Risk Index (ARI) for auditing model behavior.

2606.18129

CLI-Advantage Task Suite

Dataset

Benchmark suite for evaluating mobile CLI agents on everyday user intents beyond GUI scope. Comprises 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Designed to test mobile agent capabilities where command-line interfaces provide advantages over graphical interfaces. Authors will open-source the suite with agent implementations, oracle solutions, and evaluation infrastructure.

2606.19388

WULPUS

Framework

Fully wearable multimodal platform for real-time VR interaction combining A-mode ultrasound and inertial sensing from forearm and upper arm. Integrates end-to-end software framework for real-time acquisition, visualization, and Unity-based VR communication. Achieves 80±6% inter-session accuracy for hand pose estimation and 77±7% for forearm position estimation. Power consumption of 19.9 mW enables over 2.5 days continuous use on 350 mAh LiPo battery.

2606.17741

DroneLets

Framework

Design artifact framework extending Collaboration Engineering to embodied drone agents in emergency services. Derived from four field trials and 95 interviews, capturing 44 interaction patterns grouped into 10 meta-patterns. DroneLets capture setup requirements, drone capabilities, environmental constraints, and coordinated human-drone actions. Provides modular framework for designing repeatable, scalable collaboration processes including reconnaissance, communication, and logistical support patterns.

2606.17839

DataMagic

Tool

End-to-end interactive system transforming raw tabular data and natural language queries into narrative data-insight videos. Introduces DVSpec declarative specification binding visual and animation elements to data fields. Uses Generate-then-Orchestrate multi-agent architecture generating candidate scenes in parallel with global narrative optimization. Supports three interaction modes and structured provenance-based data Q&A. Evaluated on 109 real-world samples.

2606.20388

DVSpec

Framework

Declarative specification for data videos that ensures data fidelity by binding visual and animation elements to underlying data fields through data-driven semantic references. Decouples logic from rendering, enabling transformation of one-way videos into explorable interactive data interfaces. Core component of DataMagic system for generating narrative data-insight videos from tabular data.

2606.20388

WhoamI Today (WIT)

Tool

Friendship-supportive social media platform for youth designed around three pillars: social understanding (legible norms, intentions, trust, reciprocity, accountability), placeness (spatial and embodied affordances), and identity alignment (authentic, current, plural, interpretable expression). Deployed with 99 youth across the United States and Korea for validation of youth social media design framework centered on friendship building.

2606.16651

Pentimento

Tool

Documentation tool for tracking on-site construction adaptations across material lifecycles in circular economy construction. Leverages video documentation and 3D Gaussian Splatting to spatially, temporally, and semantically represent on-site adaptations in relation to designed models. Addresses 'building drift' taxonomy including Tending the Site, Foraging for Fit, Interpreting the Material, Marking Measurements, and Coordinating Across Communities. Developed through ReShelter case study.

2606.19609

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

Replaces human trainers with AI pilots in air traffic control simulations. The bottleneck in training critical infrastructure operators is now automated—what could go wrong?

2606.17786

Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars

Trains therapists using AI patients that respond to ACT interventions. The system gives feedback on therapeutic technique, raising questions about who defines 'good' therapy.

2606.16206

Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

Exposes that tutoring benchmarks can't distinguish between AI that teaches and AI that just gives answers. Stronger models don't necessarily create stronger learning.

2606.20482

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

Trains reward models on cursor movements and dwell time instead of explicit ratings. Your hesitation before clicking becomes training data—no feedback form required.

2606.18671

HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification

Converts web agent action logs into human-verifiable decision points. Tackles the accountability gap when AI makes purchases or bookings on your behalf.

2606.16337

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

Uses LLMs to generate clinician-readable decision rules from tabular data. Aims for the accuracy of black boxes with the auditability of simple heuristics.

2606.15902

The Missing Layer: Why EdTech Needs Design-Time Generative UI, Not Just Runtime Personalization

Argues runtime adaptation puts impossible demands on teachers who can't predict what AI will generate. Proposes letting educators design with AI before students see it.

2606.16009

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

Documents why machine interpreting feels terrible despite benchmark parity with humans. Turns out live translation needs more than accurate words—timing and repair matter.

REFLECTION(4)

Capability kills the competence it claims to save

AI systems are solving problems so effectively that they're dismantling the human judgment they were meant to augment. Across education, healthcare, and workplace contexts, the research reveals a structural paradox: the better the system performs, the faster users lose the skills that make them trustworthy decision-makers.

Students bypass learning scaffolds when AI solves instantly; clinicians defer judgment when models predict accurately; workers trust biased assistants because they're faster than thinking. At what point does removing friction become removing agency?

ABOUT THIS ISSUE

How was this newsletter synthesized?

Methodology

This newsletter is generated by an AI pipeline (leveraging Anthropic Sonnet 4.5 & Haiku 4.5) that processes the metadata and abstracts of every new arXiv HCI paper from the past week (95 this issue). Each paper is scored from 1 to 5 on three dimensions: Practice (applicability for practitioners), Research (scientific contribution), and Strategy (industry implications). Papers that pass the score threshold are grouped into topic clusters, and each cluster is summarized to capture what that body of research is exploring.

Selection Criteria

The pipeline builds a curated selection that balances high scores with topic diversity—and deliberately includes at least one 'contrarian' paper that challenges prevailing assumptions. This selection is then analyzed to identify key findings (patterns across multiple papers) and surprises (results that contradict conventional wisdom). A narrative synthesis ties the week's research together under a unifying frame.
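As a rough illustration of the scoring and selection steps described above, here is a minimal sketch. It is not the newsletter's actual pipeline: the mean-of-three aggregate, the `THRESHOLD` and `DIVERSITY_PENALTY` values, the greedy strategy, and all names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    arxiv_id: str
    topic: str
    practice: int      # each dimension is scored 1-5
    research: int
    strategy: int
    contrarian: bool = False

    @property
    def score(self) -> float:
        # Assumption: dimensions are combined with a plain mean.
        return (self.practice + self.research + self.strategy) / 3


THRESHOLD = 3.0          # hypothetical cut-off
DIVERSITY_PENALTY = 0.5  # hypothetical cost per paper already picked in a topic


def select(papers, k, threshold=THRESHOLD, penalty=DIVERSITY_PENALTY):
    """Greedily pick up to k papers, trading score against topic diversity."""
    pool = [p for p in papers if p.score >= threshold]
    chosen, topic_counts = [], {}
    while pool and len(chosen) < k:
        best = max(pool, key=lambda p: p.score - penalty * topic_counts.get(p.topic, 0))
        chosen.append(best)
        pool.remove(best)
        topic_counts[best.topic] = topic_counts.get(best.topic, 0) + 1
    # Guarantee at least one contrarian paper by swapping out the last pick.
    if chosen and not any(p.contrarian for p in chosen):
        contrarians = [p for p in pool if p.contrarian]
        if contrarians:
            chosen[-1] = max(contrarians, key=lambda p: p.score)
    return chosen


papers = [
    Paper("A", "ai-interaction", 5, 5, 5),
    Paper("B", "ai-interaction", 5, 4, 4),
    Paper("C", "healthcare", 4, 4, 4),
    Paper("D", "security", 3, 3, 4, contrarian=True),
    Paper("E", "education", 2, 2, 2),  # falls below the threshold
]
selected = [p.arxiv_id for p in select(papers, k=3)]
print(selected)  # ['A', 'C', 'D']
```

In this toy run, paper C outranks the higher-scoring B because a second ai-interaction pick is penalized, and the contrarian paper D then replaces B so the selection contains at least one dissenting view.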

Key Themes Discovered

Field Report: ai-interaction

Trust, Transparency, and Task Fit

This cluster examines how users calibrate trust in AI systems and what design choices enable or undermine effective human-AI collaboration. Research spans mental-health support, mobile agents, professional training, and workplace adoption—asking when AI assistance helps versus harms. Central tensions emerge: AI can provide immediate support but risks creating dependence; transparency mechanisms compete with usability; and capability alone doesn't ensure reliable human oversight. The work prioritizes process-level evaluation over surface metrics, role-differentiated interaction design, and mechanisms for users to verify or challenge AI outputs.

Top Papers in this Theme

2606.20205

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

2606.17786

Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars

2606.18319

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

2606.18671

HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification

2606.16009