Sima AIunty: Caste Audit in LLM-Driven Matchmaking
Atharva Naik, Shounok Kar, Varnika Sharma, Ashwin Rajadesingan, Koustuv Saha
Don't deploy LLMs in socially sensitive matchmaking without caste-aware auditing and intervention. Off-the-shelf models will reproduce historical exclusion patterns at scale.
LLMs are being deployed in matchmaking contexts where caste hierarchies have historically shaped marital decisions. Do these models reproduce or disrupt caste-based stratification?
Method: Controlled audit of five LLM families (GPT, Gemini, Llama, Qwen, BharatGPT) using real matrimonial profiles with systematically varied caste identities (Brahmin, Kshatriya, Vaishya, Shudra, Dalit) and income levels. Same-caste matches received compatibility ratings up to 25% higher on a 10-point scale than inter-caste matches, and ratings for inter-caste matches were further ordered by the traditional caste hierarchy across all models tested.
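The audit design above can be sketched as a small harness: hold all profile attributes fixed, vary only the caste field across pairs, and compare mean same-caste vs. inter-caste ratings. This is an illustrative sketch, not the authors' code; `rate_match` is stubbed where a real audit would prompt an LLM for a 1-10 rating, and `biased_model` is a toy scorer that merely mimics the reported pattern.

```python
from itertools import product
from statistics import mean

CASTES = ["Brahmin", "Kshatriya", "Vaishya", "Shudra", "Dalit"]

def rate_match(model, profile_a, profile_b):
    """Obtain a 1-10 compatibility rating for a profile pair.
    Stubbed here; a real audit would send the two profiles to an
    LLM with a fixed prompt template and parse the numeric rating."""
    return model(profile_a, profile_b)

def audit(model):
    """Mean same-caste vs. inter-caste rating over all ordered caste pairs,
    with every non-caste attribute held constant."""
    same, inter = [], []
    for a, b in product(CASTES, repeat=2):
        r = rate_match(model, {"caste": a}, {"caste": b})
        (same if a == b else inter).append(r)
    return mean(same), mean(inter)

# Toy scorer reproducing the reported pattern (for demonstration only):
# same-caste pairs score highest, and inter-caste pairs degrade with
# distance along the traditional hierarchy.
def biased_model(p, q):
    if p["caste"] == q["caste"]:
        return 9.0
    return 8.0 - 0.5 * abs(CASTES.index(p["caste"]) - CASTES.index(q["caste"]))

same_mean, inter_mean = audit(biased_model)
gap_pct = 100 * (same_mean - inter_mean) / inter_mean
print(f"same-caste mean {same_mean:.1f}, inter-caste mean {inter_mean:.1f}, "
      f"gap {gap_pct:.1f}%")
```

A positive gap flags same-caste preference; sorting the inter-caste ratings by caste pair then reveals whether they track the traditional hierarchy, the second effect the audit reports.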
Caveats: Tested on South Asian matrimonial contexts. Other cultural hierarchies may manifest differently.
Reflections: Can fine-tuning on counter-stereotypical training data disrupt these hierarchical patterns, or are they too deeply embedded in pre-training? · Do users perceive LLM-mediated matchmaking recommendations as more 'objective' than human matchmakers, thereby legitimizing caste bias? · How do these biases interact with other identity dimensions like religion, region, or disability status in multi-attribute matching scenarios?