Week 42 / October 2025

Systems Work Until Users Need to Verify Them

From AI agents to data visualizations, this week's research exposes the verification gap

Synthesized using AI

Analyzed 95 papers. AI models can occasionally hallucinate, please verify critical details.

The strongest signal this week isn't about what systems can do—it's about what happens when users need to verify they're doing it correctly. A rigorous eye-tracking study of radiologists found that AI-assisted structured reporting fundamentally altered diagnostic behavior: where clinicians looked, how long they spent on images, and their visual search patterns all changed, even when accuracy metrics stayed similar. This matters because we've been measuring outcomes while ignoring process, and process changes accumulate risk. Supporting evidence appears across domains: computer-use agents fell for dark patterns at rates between 13% and 43% depending on the model, accessibility checkers missed functional errors that only emerged during task execution, and line chart readers couldn't detect when aggregation strategies hid critical information. The pattern isn't that systems fail—it's that they fail in ways current evaluation methods don't capture.

The verification problem gets worse in high-stakes contexts, but not for the reasons you'd expect. Frontline workers in homelessness services actively resist automation not because they distrust technology, but because they recognize accountability can't be delegated to a black box when decisions affect vulnerable lives. They want systems that surface what they might miss, not systems that decide for them. This creates friction with the dominant narrative that automation improves high-stakes decision-making. The radiologist study reinforces this: AI assistance that changes cognitive behavior without clear improvement in outcomes isn't obviously beneficial, even if it's faster.

The design challenge is clear: we need interfaces that make verification costs visible and manageable, not systems that hide complexity behind confidence scores. TapNav's adaptive screen reader and the domain-specific JavaScript editor both succeed because they give users agency to verify and correct, not just accept outputs. The week's accessibility work converges on the same insight—tools that pass static tests fail when users can't verify behavior in context.

Featured(1/5)

2510.11035

SusBench: An Online Benchmark for Evaluating Dark Pattern Susceptibility of Computer-Use Agents

Longjie Guo, Chenjie Yuan, Mingyuan Zhong, Robert Wolfe, Ruican Zhong, Yue Xu, Bingbing Wen, Hua Shen, Lucy Lu Wang, Alexis Hiniker

Preprint·2025-10-13

Audit your AI agents against SusBench's nine dark pattern categories before deployment. If your agent handles transactions or account management, test specifically for hidden costs and preselection vulnerabilities—these are where autonomous systems cause the most user harm.

LLM-based agents autonomously clicking through interfaces inherit human vulnerabilities to manipulative UI patterns—confirmshaming, hidden costs, trick questions—but at machine scale.

Method: SusBench tests nine dark pattern types (confirmshaming, forced action, hidden costs, misdirection, nagging, obstruction, preselection, sneaking, trick questions) across believable e-commerce scenarios. The benchmark constructs realistic interfaces with manipulative elements embedded in checkout flows, subscription cancellations, and account settings. Agents are evaluated on whether they fall for these patterns or correctly resist them.

Caveats: Benchmark uses simulated e-commerce scenarios. Real-world dark patterns evolve faster than static test suites can capture.

Reflections: Can agents learn to recognize novel dark pattern variants not in the training set? · What's the tradeoff between dark pattern resistance and task completion speed? · Should agents warn users about detected dark patterns or silently resist them?

trust-safetyai-interactionevaluation-methodsethics

2510.12972

TaskAudit: Detecting Functiona11ity Errors in Mobile Apps via Agentic Task Execution

Mingyuan Zhong, Xia Chen, Davin Win Kyi, Chen Li, James Fogarty, Jacob O. Wobbrock

Preprint·2025-10-14

Run TaskAudit on your production apps before release. Focus on high-frequency user tasks (login, checkout, search) where functiona11ity errors cause immediate abandonment. Static checkers won't catch these.

Current accessibility checkers evaluate static contexts and miss functionality errors—buttons that don't respond, forms that can't submit, navigation that breaks for assistive tech users.

Method: TaskAudit uses three components: a Task Generator creates realistic user scenarios ("add item to cart"), a Task Executor simulates interactions with accessibility services enabled, and an Error Detector identifies when tasks fail due to accessibility issues. The system detects functiona11ity errors—accessibility bugs that prevent task completion—not just guideline violations. It caught errors in 23 of 25 tested apps.

Caveats: Requires task scenarios to be defined upfront. Won't discover errors in workflows you don't test.

Reflections: Can task generation be automated from analytics data showing common user paths? · How do functiona11ity error rates correlate with actual user abandonment? · What's the false positive rate on complex multi-step workflows?

accessibilityevaluation-methodsmobile-interfacesprogramming-tools

2510.14889

Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media

Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom, Hari Sundaram, Koustuv Saha

Preprint·2025-10-16

If you're building social platform safety systems, incorporate network-level discourse signals, not just individual post content. Monitor for shifts in a user's information environment—changes in peer conversation tone predict risk earlier than waiting for explicit mentions.

Most suicidal ideation on social media surfaces implicitly through everyday posts and peer interactions, not explicit distress signals. Current detection systems miss these early signs.

Method: The framework models a user's information environment: their longitudinal posting history plus discourse from their social network (friends, communities they engage with). It predicts implicit suicidal ideation 30-180 days before explicit disclosure by analyzing shifts in both personal content and the emotional tone of surrounding conversations. The model outperforms baselines that use only explicit mentions or individual post history.

Caveats: Raises significant privacy concerns around monitoring social networks. Requires careful ethical review before deployment.

Reflections: What intervention strategies work when ideation is detected 30+ days early? · How do false positives affect user trust in platform safety features? · Can the model distinguish between genuine risk and performative distress posting?

social-computinghealthcareprivacy-securityethics

1 / 5

Featured

Findings(1/5)

Accessibility testing shifts from static compliance to dynamic task completion·AI agents inherit human cognitive vulnerabilities, not just capabilities·Workflow synthesis replaces per-execution guidance in agent automation·Generative interfaces replace static representations in knowledge tools·Implicit signal detection moves from content analysis to environmental modeling

Accessibility checkers have focused on static rule violations—missing labels, contrast ratios—but TaskAudit demonstrates that functionality errors emerge only during interaction. TapNav extends this by replacing sequential audio with spatiotactile feedback, acknowledging that bandwidth constraints are architectural, not solvable through better narration. The shift reframes accessibility from checklist compliance to whether disabled users can actually complete workflows. Implication: accessibility audits that pass static tests may still block core user tasks.

2510.12972

TaskAudit: Detecting Functiona11ity Errors in Mobile Apps via Agentic Task Execution

2510.14267

TapNav: Adaptive Spatiotactile Screen Readers for Tactually Guided Touchscreen Interactions for Blind and Low Vision People

Surprises(1/3)

Structured reporting changes radiologist behavior, not just output format·Aggregated charts match standard performance while reducing cognitive load·Frontline staff resist automation in high-stakes decisions, not because of distrust but because of context

Structured reporting was designed to standardize documentation, but it alters how radiologists visually analyze images—changing gaze patterns and diagnostic behavior, not just the format of findings. AI-assisted structured reporting compounds this by introducing a third interaction mode. The assumption was that reporting templates are neutral scaffolds. They aren't—they reshape perception itself, not just communication.

2510.16070

Effect of Reporting Mode and Clinical Experience on Radiologists' Gaze and Image Analysis Behavior in Chest Radiography

TOOLBOX(8)

ABLEIST

Framework

ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism) is a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. It captures subtle intersectional harms and biases in LLM-generated hiring scenarios, specifically designed to detect discrimination against people with disabilities across gender, caste, and nationality dimensions in 2,820 hiring scenarios.

2510.10998

SusBench

Dataset

SusBench is an online benchmark for evaluating the susceptibility of LLM-based computer-use agents to UI dark patterns. It contains 313 evaluation tasks across 55 real-world consumer websites, covering nine common dark pattern types (Preselection, Trick Wording, Hidden Information, etc.) constructed through code injections. Enables testing of autonomous agents' vulnerability to manipulative interface designs.

2510.11035

KnowledgeTrail

Tool

KnowledgeTrail is an AI-powered generative timeline system that adapts to users' evolving questions by dynamically expanding or contracting in response to input. It enables users to co-construct timelines of historical events and knowledge formation processes, fostering curiosity-driven exploration and serendipitous discovery with citation features for verification of complex relationships between ideas and events.

2510.12113

CrisisNews

Dataset

CrisisNews is a dataset of 93,250 news articles covering social media-endemic crises from the past 20 years. It maps global news coverage of online problematic behaviors that escalate into large-scale crises, with classifications of stakeholder roles, behavior types, and outcomes. Provides event-focused perspectives tracking how discrete problematic behaviors originating in social media evolve and cause larger-scale harms.

2510.12243

TaskAudit

Tool

TaskAudit is an accessibility evaluation system that detects functionality errors in mobile apps through simulated interactions. It comprises three components: a Task Generator constructing interactive tasks from app screens, a Task Executor using agents with a screen reader proxy, and an Accessibility Analyzer detecting errors from interaction traces. Detected 48 functionality errors from 54 app screens, including label-functionality mismatch and cluttered navigation.

2510.12972

TapNav

Tool

TapNav is an adaptive spatiotactile screen reader prototype for Blind and Low Vision users to interact with touchscreen interfaces spatially. It provides adaptive auditory feedback combined with tactile overlays to convey spatial information and location of interface elements on-screen. Enables touch-based navigation for data visualizations and applications, helping users anticipate outcomes and offload cognitive load to touch.

2510.14267

ReUseIt

Tool

ReUseIt automatically synthesizes reusable workflows from AI web agents' successful and failed attempts for web automation tasks. It incorporates execution guards that help agents detect and fix errors while keeping users informed of progress and issues. Increases success rates from 24.2% to 70.1% across fifteen repetitive web tasks like form filling, information retrieval, and scheduling with minimal human intervention.

2510.14308

Aria Gen 2 Pilot Dataset (A2PD)

Dataset

A2PD is an egocentric multimodal open dataset captured using Aria Gen 2 glasses. It features comprehensive raw sensor data and machine perception algorithm outputs across five primary scenarios: cleaning, cooking, eating, playing, and outdoor walking. Includes multi-person recordings with synchronized data illustrating device perception of wearer, environment, and interactions. Available at projectaria.com with open-source tools in Project Aria Tools.

2510.16134

VizCopilot: Fostering Appropriate Reliance on Enterprise Chatbots with Context Visualization

Exposes the "irrelevantly right" problem: when chatbots retrieve wrong context but generate plausible answers, users can't tell. Visualizing retrieval context helps users catch misalignment before decisions go sideways.

2510.12692

Who is a Better Matchmaker? Human vs. Algorithmic Judge Assignment in a High-Stakes Startup Competition

Pits human expertise against algorithms in matching startup judges. The twist: neither wins cleanly, revealing that semantic understanding and domain knowledge don't always align with optimization metrics.

2510.11927

Visual Stenography: Feature Recreation and Preservation in Sketches of Noisy Line Charts

Asks people to sketch noisy time series from memory. What they preserve versus discard reveals which visual features actually register—and which statistical patterns disappear despite being "visible."

2510.14591

Just-In-Time Objectives: A General Approach for Specialized AI Interactions

Argues LLMs produce bland output because they lack specific objectives. Infers user intent in real-time, then optimizes for that singular goal—turning generic drafts into actually useful results.

2510.12728

Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior

Breaks the traditional data-model separation by letting developers instantly govern LLM behavior through test case curation. Fast iteration cycles replace slow fine-tuning, making model refinement feel more like debugging.

2510.11185

Principles of Safe AI Companions for Youth: Parent and Expert Perspectives

Surveys parents and developmental psychologists on AI companion risks for teens. Finds current platforms lack basic safeguards against harmful normalization—and stakeholders want fundamentally different protections than platforms offer.

2510.13011

Deliberate Lab: A Platform for Real-Time Human-AI Social Experiments

Builds infrastructure for multi-party human-AI experiments at scale. Addresses a gap: most behavioral science tools can't handle real-time interaction or mixed human-AI groups in controlled settings.

2512.00009

Development and Benchmarking of a Blended Human-AI Qualitative Research Assistant

Tackles coder fatigue and interpretative drift in qualitative research by blending human judgment with AI assistance. Tests whether computational support can scale thematic analysis without flattening nuance.

REFLECTION(4)

Intuitive interfaces hide what matters most

Across visualization, AI decision support, and embodied interaction, researchers are finding that the interfaces users trust most are precisely the ones that obscure critical failure modes. The tension: making systems legible enough to catch errors requires friction that users actively resist.

Users prefer visualizations that feel intuitive but systematically miss temporal trends and hidden noise—yet adding visual complexity to surface ground truth triggers abandonment. Does making a system harder to misuse require making it harder to use?

1 / 4

Week 41October 2025

Week 43October 2025

ABOUT THIS ISSUE

How was this newsletter synthesized?

Methodology

This newsletter is generated by an AI pipeline (leveraging Anthropic Sonnet 4.5 & Haiku 4.5) that processes the metadata and abstracts of every new arXiv HCI paper from the past week—95 this issue. Each paper is scored on three dimensions: Practice (applicability for practitioners), Research (scientific contribution), and Strategy (industry implications), with scores from 1-5. Papers passing threshold are grouped into topic clusters, and each cluster is summarized to capture what that body of research is exploring.

Selection Criteria

The pipeline builds a curated selection that balances high scores with topic diversity—and deliberately includes at least one 'contrarian' paper that challenges prevailing assumptions. This selection is then analyzed to identify key findings (patterns across multiple papers) and surprises (results that contradict conventional wisdom). A narrative synthesis ties the week's research together under a unifying frame.

Key Themes Discovered

Field Report: ai-interaction

Trust Through Transparency

This cluster examines how users calibrate trust in AI systems through interpretable feedback and controlled agency. Core questions: When should humans defer to AI versus override it? How do explanations, context visualization, and error transparency shape reliance? Research spans qualitative analysis benchmarking, decision support design, and safety frameworks for high-stakes domains. Methodologically diverse—combining controlled experiments, field deployments, and qualitative interviews—the work prioritizes human oversight mechanisms over pure automation, emphasizing that effective AI integration requires visible reasoning, error detection, and user-centered verification strategies.

1/8

Top Papers in this Theme

2510.16380

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

2512.00009

Development and Benchmarking of a Blended Human-AI Qualitative Research Assistant

2510.14591

Just-In-Time Objectives: A General Approach for Specialized AI Interactions

2510.10616

Assessing Policy Updates: Toward Trust-Preserving Intelligent User Interfaces

2510.11185