Week 12 / March 2025

AI Systems Outpace Human Steering Capacity

Generative models produce impressive outputs users can't reliably control, verify, or trust

Synthesized using AI

Analyzed 152 papers. AI models can occasionally hallucinate; please verify critical details.

Generative AI has a control problem, and it's not the one we've been measuring. Research from Cornell and Harvard introduces a mathematical decomposition showing that producibility (what a model can generate) diverges sharply from steerability (what users can actually make it generate). In empirical tests, models that produced high-quality outputs across diverse prompts still left users unable to reach specific goals because the interface and control mechanisms failed. This finding lands alongside a Northwestern inventory revealing that trust in visualizations operates across at least four independent dimensions (accuracy, design, source, emotion), meaning interventions that improve one dimension don't transfer to others. Georgia Tech's audit shows GenAI browser assistants collecting detailed browsing activity for profiling that users never consented to. MIT's four-week study (n=981, 300k+ messages) finds that chatbot interaction modes significantly affect loneliness and emotional dependence. The pattern becomes clear: we've deployed systems whose outputs we can't steer, whose trustworthiness we can't calibrate, whose data practices we can't audit, and whose psychosocial effects we're only beginning to measure.

The wearables and sensing work offers a different kind of precision. Tokyo researchers demonstrate machine-knittable NFC textiles that enable body-scale wireless sensor networks without the electromagnetic interference that plagued previous approaches, removing a key obstacle to full-body monitoring for continuous health tracking. A VR passthrough study from China shows that users prefer having interruptions from behind projected in front of them over turning their heads, which reduces nausea and maintains immersion. These aren't conceptual breakthroughs; they're engineering solutions to deployment barriers, the kind of work that moves technologies from lab to product.

The tension is between capability and control. AI systems generate, sense, and respond at scales that outpace human verification. The steerability research names what practitioners have felt: impressive demos don't predict whether users can actually accomplish tasks. The trust inventory reveals why calibration is hard: trust isn't one thing to fix. The privacy audit shows the gap between user mental models and system behavior. The design challenge isn't making AI more capable—it's building interfaces and evaluation methods that let humans actually drive these systems where they need to go.

Featured (1/5)

2503.17482

What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

Keyon Vafa, Sarah Bentley, Jon Kleinberg, Sendhil Mullainathan

Preprint·2025-03-21

Stop evaluating generative tools solely on output quality galleries. Measure whether your users can actually reach their intended outputs through your interface controls. If your model scores well on FID (Fréchet Inception Distance, where lower is better) but users can't steer it, you've built a slot machine, not a tool.

Generative models produce impressive outputs in demos, but users with specific goals can't reliably steer them to produce what they actually need.

Method: The paper introduces a mathematical decomposition separating producibility (what a model can generate) from steerability (whether users can actually reach desired outputs through available controls). It formalizes steerability as the probability that a user can navigate the control space to achieve their goal, independent of the model's raw generation capabilities. This reframes evaluation from 'what can this model do?' to 'can I make it do what I need?'
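To make the producibility/steerability split concrete, here is a minimal Python sketch. It is not the paper's estimator: the `Attempt` record, both helper functions, and the five trial rows are hypothetical, and steerability is approximated here as the share of producible goals that users actually reached.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    goal_id: str      # hypothetical identifier for a user's target output
    producible: bool  # could the model generate the goal at all?
    reached: bool     # did the user get there through the available controls?


def producibility(attempts):
    """Fraction of goals the model is able to generate."""
    return sum(a.producible for a in attempts) / len(attempts)


def steerability(attempts):
    """Fraction of producible goals that users actually reached."""
    producible = [a for a in attempts if a.producible]
    if not producible:
        return 0.0
    return sum(a.reached for a in producible) / len(producible)


# Made-up trial data, for illustration only.
attempts = [
    Attempt("g1", True, True),
    Attempt("g2", True, False),
    Attempt("g3", True, False),
    Attempt("g4", True, True),
    Attempt("g5", False, False),
]

print(producibility(attempts))  # 0.8
print(steerability(attempts))   # 0.5
```

In this toy data the model can produce four of five goals, yet users reach only half of those, which is the kind of gap an output-quality gallery would not reveal.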

Caveats: Framework is model-agnostic but requires defining a goal space and control interface—non-trivial for open-ended creative tasks.

Reflections: How do different interface paradigms (sliders vs. text prompts vs. examples) affect steerability for the same underlying model? · Can we predict steerability from model architecture before deployment? · What's the minimum control dimensionality needed for acceptable steerability in different creative domains?

ai-interaction · evaluation-methods · design-tools

2503.17473

How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study

Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal

Preprint·2025-03-21

If you're designing conversational AI for wellbeing, audit your voice design and conversation prompts. Engaging voice features increase dependency risk. Don't assume 'more conversation' helps—design for active disclosure, not passive chatting. Consider exposure limits for high-intimacy modes.

Users increasingly turn to AI chatbots for emotional support, but we don't know whether extended use creates dependency or affects real-world social connections.

Method: A four-week RCT with 981 participants, who generated over 300,000 messages, tested three interaction modes (text, neutral voice, engaging voice) and three conversation types (open-ended, non-personal, personal). Engaging voice increased emotional dependence on AI. Personal conversations reduced loneliness, but only when users actively disclosed; passive chatting had no effect. The mechanism matters: it's not chatbot use itself, but the combination of voice intimacy and self-disclosure that drives outcomes.
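The design described above can be read as a 3 x 3 grid of conditions. The sketch below assumes the two factors are fully crossed and that participants are dealt evenly across cells; the newsletter does not describe the study's actual randomization procedure, so `assign_conditions` and its seed are hypothetical.

```python
import itertools
import random
from collections import Counter

MODES = ["text", "neutral voice", "engaging voice"]
CONVERSATIONS = ["open-ended", "non-personal", "personal"]


def assign_conditions(participant_ids, seed=0):
    """Shuffle participants, then deal them round-robin across all cells."""
    conditions = list(itertools.product(MODES, CONVERSATIONS))
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Round-robin dealing keeps cell sizes within one participant of each other.
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(shuffled)}


assignment = assign_conditions(range(981))
cell_sizes = Counter(assignment.values())

print(len(cell_sizes))           # 9 cells
print(set(cell_sizes.values()))  # {109}, since 981 / 9 = 109
```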

Caveats: Four-week study; long-term dependency patterns and withdrawal effects unknown. Sample skewed toward users seeking AI companionship.

Reflections: What happens after users stop using high-intimacy chatbots—do dependency effects reverse or persist? · Can we design 'scaffolding' that transitions users from AI support to human connections? · Do cultural differences in emotional expression change these dependency patterns?

ai-interaction · ethics · healthcare

2503.16586

Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants

Yash Vekaria, Aurelio Loris Canino, Jonathan Levitsky, Alex Ciechonski, Patricia Callejo, Anna Maria Mandalari, Zubair Shafiq

Preprint·2025-03-20

If you're building or evaluating AI browser assistants, implement transparent data collection boundaries. Show users exactly what browsing data feeds the model versus what stays local. Don't hide autonomous actions—surface every form fill and navigation decision before execution.

GenAI browser assistants can track every search, click, and form input while autonomously performing tasks—raising privacy concerns beyond traditional browser extensions.

Method: A systematic audit of GenAI browser assistants revealed that they track detailed browsing activity, including search queries, clicked links, and form data. Unlike passive extensions, these assistants can autonomously fill forms and navigate sites. The study developed an auditing methodology that examines data collection practices, server-side profiling, and personalization behaviors. It found that assistants collect far more granular behavioral data than users expect, with unclear retention and usage policies.

Caveats: Audit methodology relies on observable network traffic and disclosed policies; server-side processing and actual data usage remain partially opaque.

Reflections: Can we build effective GenAI assistants with on-device processing only? · What's the minimum browsing context needed for useful assistance versus privacy-invasive profiling? · How do users' mental models of 'assistant' versus 'tracker' diverge from actual system behavior?

privacy-security · ai-interaction · ethics


Findings (1/5)

Generative AI evaluation shifts from output quality to user control·Trust measurement fragments as platforms multiply interaction modes·Privacy threats migrate from data collection to autonomous action·Disagreement becomes a design feature, not a bug to smooth over·Inoculation theory moves from public health to interface design

Generative models are being reframed around steerability—whether users can actually produce outputs that satisfy specific goals—rather than producibility, the breadth of what models can generate. This matters because current benchmarks measure capability without measuring access to that capability. The gap between what's technically possible and what's practically achievable determines real-world value, not model performance on static datasets. Evaluation frameworks must now account for the interaction layer.

2503.17482

What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

Surprises (1/3)

Trusted institutions reproduce the harms they claim to prevent·Extended AI companionship doesn't reduce loneliness—it shifts social effort·Automation in journalism saves time but doesn't solve the bottleneck

Governments and regulated entities that deliver digital services create insecurities structurally similar to those of criminal fraud, through efficiency-driven design and weakened social institutions. Research from northern England shows that 'digital by default' policies open users to deception not just from fraudsters but from supposedly trusted actors in a marketized digital economy. The threat model that separates legitimate services from scams breaks down when both exploit the same vulnerabilities.

2503.16992

Friend or Foe? Navigating and Re-configuring "Snipers' Alley"

TOOLBOX (5)

ShieldUp!

Tool

ShieldUp! is a mobile game prototype that inoculates users against online scams using psychological inoculation theory. It exposes users to weakened versions of scammer manipulation tactics for 15 minutes, teaching recognition and refutation techniques. It was validated through an RCT with 3,000 participants in India, which showed a significant improvement in scam identification that was maintained at the 21-day follow-up.

2503.12341

Scam Discernment Ability Test (SDAT-10)

Dataset

SDAT-10 is a newly developed assessment instrument for measuring users' ability to discern online scams from legitimate offers. It was used in the pre-test, post-test, and 21-day follow-up evaluations of the ShieldUp! RCT with 3,000 participants. It provides a standardized measurement of scam identification skills across intervention groups.

2503.12341

Iffy-Or-Not (ION)

Tool

ION is a browser extension that invokes critical thinking when reading online texts. Guided by argumentation theory, it highlights fallacious content, suggests diverse queries to probe claims, and offers deeper questions for consideration and discussion. User study (N=18) validated that ION encourages attentiveness and expands perspectives on potentially misleading content.

2503.14412

CheapVS

Framework

CheapVS is a human-centered drug discovery framework combining preferential multi-objective Bayesian optimization with docking models for binding affinity measurement. It captures chemist intuition through pairwise comparisons of drug property trade-offs. On 100K chemical candidates targeting EGFR and DRD2, it recovered 16/37 EGFR and 37/58 DRD2 known drugs while screening only 6% of the library.

2503.16841
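The CheapVS figures above can be checked with a few lines of arithmetic. The recall values follow directly from the reported counts; the enrichment ratio is a derived illustration, not a number from the source, and it assumes that screening a random 6% of the library would recover about 6% of the known drugs.

```python
LIBRARY_SIZE = 100_000
SCREENED_FRACTION = 0.06


def recall(found, known):
    """Share of known drugs recovered among the screened candidates."""
    return found / known


screened = round(LIBRARY_SIZE * SCREENED_FRACTION)
egfr = recall(16, 37)
drd2 = recall(37, 58)

print(screened)                    # 6000 candidates screened
print(f"EGFR recall: {egfr:.1%}")  # EGFR recall: 43.2%
print(f"DRD2 recall: {drd2:.1%}")  # DRD2 recall: 63.8%

# Derived, not reported in the source: recall relative to a random 6% screen.
print(f"EGFR enrichment: {egfr / SCREENED_FRACTION:.1f}x")  # 7.2x
print(f"DRD2 enrichment: {drd2 / SCREENED_FRACTION:.1f}x")  # 10.6x
```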

Trust in Visualizations Inventory

Tool

An eight-item standardized inventory for measuring trust in data visualizations, derived through exploratory factor analysis. Comprises four core items measuring trust (credibility, comprehensibility, usability) and four optional items controlling for baseline trust tendency. Validated through McDonald's omega reliability testing, content validity alignment, and criterion validity via two trust games with real-world stakes.

2503.17670

ChatBench: From Static Benchmarks to Human-AI Evaluation

Converts MMLU into a user study to measure what humans and LLMs achieve *together*, not in isolation. Turns out AI-alone benchmarks miss the whole point of chatbots.

2503.15484

Value Profiles for Encoding Human Variation

Represents individuals as natural language descriptions of their values, compressed from in-context demonstrations. A surprisingly elegant approach to modeling human disagreement in rating tasks.

2503.13975

Navigating Rifts in Human-LLM Grounding: Study and Benchmark

Examines why LLMs fail at the collaborative aspects of conversation that humans do naturally. The grounding breakdowns range from frustrating to genuinely dangerous.

2503.16114

The Impact of Revealing Large Language Model Stochasticity on Trust, Reliability, and Anthropomorphization

Tests whether showing users multiple LLM responses instead of one reduces over-trust and anthropomorphization. Spoiler: exposing the probabilistic nature changes how people relate to the model.

2503.16517

From G-Factor to A-Factor: Establishing a Psychometric Framework for AI Literacy

Builds a psychometric framework for measuring AI literacy across 517 participants. Treats AI competence like intelligence testing—coherent, measurable, and surprisingly stable.

2503.15120

Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition

Explores hybrid systems where humans correct ASR errors in real-time for DHH accessibility. Balances automation efficiency with the scarcity of trained CART providers.

2503.16227

Flight Testing an Optionally Piloted Aircraft: a Case Study on Trust Dynamics in Human-Autonomy Teaming

Tracks how trust forms and erodes during actual flight tests with autonomous aircraft. Goes beyond static trust models to capture temporal dynamics in high-stakes teaming.

2503.16632

Benchmarking Visual Language Models on Standardized Visualization Literacy Tests

Systematically tests VLMs on visualization interpretation tasks using standardized literacy measures. Reveals where models excel and where they catastrophically misread charts.

REFLECTION (3)

Steerability without control is just capability theater

This week's research exposes a widening gap: AI systems produce stunning outputs, but users can't reliably direct them toward specific goals. Explainability doesn't improve task performance. Trust calibration fails. Grounding breaks down. The uncomfortable truth is that capability and controllability are decoupling—and we're still measuring the former while ignoring the latter.

We celebrate models that generate novel solutions, yet users struggle to steer them toward intended outcomes. If a system produces brilliant outputs that users can't reliably reproduce or modify, are we building tools or just impressive black boxes?


ABOUT THIS ISSUE

How was this newsletter synthesized?

Methodology

This newsletter is generated by an AI pipeline (leveraging Anthropic Sonnet 4.5 & Haiku 4.5) that processes the metadata and abstracts of every new arXiv HCI paper from the past week (152 in this issue). Each paper is scored from 1 to 5 on three dimensions: Practice (applicability for practitioners), Research (scientific contribution), and Strategy (industry implications). Papers that pass a score threshold are grouped into topic clusters, and each cluster is summarized to capture what that body of research is exploring.
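The scoring and clustering step described above might look roughly like the sketch below. The newsletter does not say how the three scores are combined or what the threshold is, so the mean aggregation, the `THRESHOLD` value of 3.5, the `ScoredPaper` type, and the placeholder papers are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

THRESHOLD = 3.5  # assumed cut-off; the real value is not stated


@dataclass
class ScoredPaper:
    paper_id: str  # placeholder IDs, not real arXiv identifiers
    topic: str
    practice: int
    research: int
    strategy: int

    @property
    def mean_score(self):
        # Assumed aggregation: a plain average of the three dimensions.
        return (self.practice + self.research + self.strategy) / 3


def cluster_passing(papers, threshold=THRESHOLD):
    """Keep papers at or above the threshold and group them by topic."""
    clusters = defaultdict(list)
    for paper in papers:
        scores = (paper.practice, paper.research, paper.strategy)
        if any(not 1 <= s <= 5 for s in scores):
            raise ValueError(f"{paper.paper_id}: scores must be in 1..5")
        if paper.mean_score >= threshold:
            clusters[paper.topic].append(paper.paper_id)
    return dict(clusters)


# Made-up scores, for illustration only.
papers = [
    ScoredPaper("paper-a", "ai-interaction", 5, 4, 4),
    ScoredPaper("paper-b", "ai-interaction", 3, 3, 2),
    ScoredPaper("paper-c", "privacy-security", 4, 4, 3),
    ScoredPaper("paper-d", "healthcare", 2, 3, 3),
]

clusters = cluster_passing(papers)
print(clusters)  # {'ai-interaction': ['paper-a'], 'privacy-security': ['paper-c']}
```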

Selection Criteria

The pipeline builds a curated selection that balances high scores with topic diversity—and deliberately includes at least one 'contrarian' paper that challenges prevailing assumptions. This selection is then analyzed to identify key findings (patterns across multiple papers) and surprises (results that contradict conventional wisdom). A narrative synthesis ties the week's research together under a unifying frame.
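One simple way to balance scores against topic diversity while guaranteeing a contrarian pick is a greedy pass with a per-topic cap. This is an illustrative sketch only: the pipeline's real selection logic is not described, and `Candidate`, `select`, `max_per_topic`, and the sample pool are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    paper_id: str  # placeholder IDs
    topic: str
    score: float
    contrarian: bool = False


def select(candidates, k, max_per_topic=2):
    """Pick up to k papers by score, capping each topic and reserving
    one slot for the highest-scoring contrarian paper if one exists."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    chosen = []
    per_topic = {}

    contrarian = next((c for c in ranked if c.contrarian), None)
    if contrarian is not None and k > 0:
        chosen.append(contrarian)
        per_topic[contrarian.topic] = 1

    for cand in ranked:
        if len(chosen) >= k:
            break
        if cand in chosen or per_topic.get(cand.topic, 0) >= max_per_topic:
            continue
        chosen.append(cand)
        per_topic[cand.topic] = per_topic.get(cand.topic, 0) + 1

    return sorted(chosen, key=lambda c: c.score, reverse=True)


# Made-up pool, for illustration only.
pool = [
    Candidate("p1", "ai-interaction", 4.7),
    Candidate("p2", "ai-interaction", 4.5),
    Candidate("p3", "ai-interaction", 4.4),
    Candidate("p4", "privacy-security", 4.0),
    Candidate("p5", "healthcare", 3.2, contrarian=True),
    Candidate("p6", "wearables", 3.9),
]

picked = select(pool, k=4)
print([c.paper_id for c in picked])  # ['p1', 'p2', 'p4', 'p5']
```

Here the third ai-interaction paper is skipped despite its high score because the topic cap is reached, and the contrarian paper keeps its slot even though it ranks last.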

Key Themes Discovered

Field Report: ai-interaction

Trust, Steerability, and Alignment

This cluster examines how users calibrate trust in AI systems and whether AI outputs align with user intent. Core tensions emerge: generative models produce high-quality outputs but fail at steerability—users cannot reliably steer them toward specific goals. Research spans trust formation in human-autonomy teaming, grounding failures in LLM conversations, and preference-based personalization. A secondary thread addresses explainability's inconsistent effects on task performance. Collectively, these papers reframe the evaluation problem: capability alone is insufficient; systems must be steerable, interpretable, and aligned with human expectations across diverse contexts and user populations.


A Case Study of Scalable Content Annotation Using Multi-LLM Consensus and Human Review

2503.16517

From G-Factor to A-Factor: Establishing a Psychometric Framework for AI Literacy

2503.12757


Top Papers in this Theme

Value Profiles for Encoding Human Variation

ChatBench: From Static Benchmarks to Human-AI Evaluation

A Case Study of Scalable Content Annotation Using Multi-LLM Consensus and Human Review

From G-Factor to A-Factor: Establishing a Psychometric Framework for AI Literacy

MAP: Multi-user Personalization with Collaborative LLM-powered Agents