Week 5 / January 2025

Systems Work But Users Can't Verify Them

From AI explanations to AR privacy, deployment fails when capability outpaces human evaluation

Synthesized using AI

Analyzed 134 papers. AI models can occasionally hallucinate, please verify critical details.

A year-long study of 12,000+ metaphors from a nationally representative U.S. sample found that public mental models of AI shifted from instrumental descriptions like 'tool' and 'assistant' to adversarial ones like 'thief' and 'replacement' as deployment accelerated. The timing matters: perceptions soured not because systems failed but because opacity became undeniable at scale. Supporting work on synthetic training data reveals how AI-generated datasets obscure provenance and quality assessment, while research on LLM agent behavior shows that alignment masks implicit biases in decision-making rather than eliminating them. Together, these papers identify a verification crisis—systems work, but users can't evaluate whether to trust them.

The response appears in sensing research that abandons cameras entirely. Thermal arrays achieve sub-centimeter 3D hand tracking with $50 hardware and zero visual surveillance, while neuromorphic cameras enable gesture recognition on deformable materials without capturing identifiable features. Seven AR papers converged on privacy-preserving sensing not as exploration but as deployment necessity—camera-based systems face regulatory barriers and user rejection that technical improvements can't solve. Meanwhile, accessibility research documents a parallel verification gap: blind software professionals report that commercial development tools remain inaccessible, forcing them to build custom solutions that prioritize autonomy over functional workarounds. The pattern extends to social platforms, where users re-appropriate hashtags to control algorithmic audience reach when platforms don't expose that control directly.

What emerges is a field responding to deployment failures invisible in lab studies. The bottleneck isn't capability—it's that users have developed accurate mental models of systems they can't verify or control. The design challenge is building verification mechanisms, not better explanations, and recognizing that when users systematically work around your system, they're designing the agency you failed to provide.

Featured(1/5)

2501.17799

Leveraging Multimodal LLM for Inspirational User Interface Search

Seokhyeon Park, Yumin Song, Soohyun Lee, Jaeyoung Kim, Jinwook Seo

CHI·2025-01-29

Stop building UI search around component tags and color filters. Index your design system screenshots with multimodal LLMs to enable natural language queries. Best for teams maintaining large reference libraries where manual tagging is a bottleneck.

Designers hunting for UI inspiration waste hours filtering irrelevant results. Existing search tools miss semantic context like target users or app mood, and require metadata like view hierarchies that most screenshots lack.

Method: A multimodal LLM extracts semantic attributes directly from UI screenshots—no metadata required. It interprets queries like 'calming meditation app for seniors' and matches against extracted features including target demographics, visual mood, and interaction patterns. The system processes raw screenshots through vision-language understanding to build a searchable semantic index.

Caveats: Effectiveness depends on LLM's ability to infer user intent from visual cues alone—may struggle with niche design patterns or cultural context.

Reflections: Can this approach identify anti-patterns or accessibility issues during inspirational search? · How does semantic search quality degrade with stylistically unconventional or experimental UIs? · What's the latency cost of real-time LLM inference versus pre-computed embeddings?

design-toolsai-interactiondata-visualization

2501.18148

The Dilemma of Building Do-It-Yourself (DIY) Solutions for Workplace Accessibility

Yoonha Cha, Victoria Jackson, Karina Kohl, Rafael Prikladnicki, André van der Hoek, Stacy M. Branham

CHI 2025·2025-01-30

Audit your dev toolchain for accessibility gaps that force workarounds. Prioritize fixing tools that require ongoing DIY maintenance—these compound career penalties. If you're building IDEs or CI/CD platforms, interview blind engineers before launch.

Commercial dev tools are inaccessible to blind and low vision software professionals, forcing them to build custom workarounds. This creates a hidden tax: time spent on DIY accessibility tools instead of core work.

Method: Interviews with 30 blind and low vision software professionals revealed they build DIY tools for everything from screen reader-friendly code navigation to custom IDE plugins. These tools are brittle, undocumented, and break with every software update. The research maps the ecosystem of workarounds and identifies which commercial tool gaps force DIY solutions versus which reflect personal preference.

Caveats: Study focuses on software professionals; findings may not generalize to other technical roles or non-technical accessibility needs.

Reflections: What's the total productivity cost of DIY tool maintenance versus advocating for commercial fixes? · Can open-source communities sustainably maintain accessibility-focused forks of popular dev tools? · Which tool categories see the highest DIY abandonment rates due to maintenance burden?

accessibilityprogramming-toolsethics

2501.17420

Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models

Yuxuan Li, Hirokazu Shirado, Sauvik Das

Preprint·2025-01-29

Test your LLM applications with persona-based decision audits, not just prompt-response pairs. If you're deploying agents for hiring, lending, or content moderation, measure outcome disparities across demographic personas before launch.

LLMs pass explicit fairness tests but may harbor implicit biases that surface in decision-making. Alignment training teaches models to avoid biased language, not biased reasoning.

Method: Researchers created LLM agents with sociodemographically-informed personas and measured decision disparities in simulated scenarios. Even when models avoid explicitly biased statements, agents with different demographic personas make systematically different choices in identical situations—revealing implicit bias in the decision logic itself, not just the output text.

Caveats: Simulated personas may not reflect real-world decision patterns; findings need validation with actual user populations.

Reflections: Do implicit biases compound when LLM agents interact with each other in multi-agent systems? · Can adversarial training on decision outcomes reduce implicit bias without degrading task performance? · Which decision domains show the largest gaps between explicit fairness and implicit bias?

ai-interactionethicstrust-safetybias-issues

1 / 5

Featured

Findings(1/5)

Accessibility shifts from retrofit to adversarial design·Interaction design abandons the screen as the primary canvas·AI evaluation moves from model outputs to behavioral traces·Synthetic data reconfigures the AI pipeline from training artifact to infrastructure·Proactive systems replace reactive tools in knowledge work

Blind developers now build DIY tools to work around inaccessible commercial software, while marginalized creators on Xiaohongshu hijack unrelated hashtags to evade algorithmic exposure to hostile audiences. Accessibility is no longer about compliance—it's about users weaponizing systems against their own constraints. This reframes accessibility as a continuous adversarial game where platforms, tools, and users are locked in tactical adaptation cycles rather than converging toward universal design.

2501.18148

The Dilemma of Building Do-It-Yourself (DIY) Solutions for Workplace Accessibility

2501.18210

Hashtag Re-Appropriation for Audience Control on Recommendation-Driven Social Media Xiaohongshu (rednote)

Surprises(1/3)

Two operators don't reliably outperform one in dynamic teleoperation·LLM-generated summaries don't improve crowdsourced fact-checking accuracy·Algorithm audits fail without high-quality data access, even with sophisticated methods

Joint decision-making was supposed to reduce errors under uncertainty, a principle validated in static tasks. But in dynamic robot teleoperation, two operators don't consistently make better decisions than one. The collaboration overhead in real-time, high-stakes environments can negate the benefits of shared cognition. The reframe: decision quality in dynamic systems depends less on cognitive redundancy and more on minimizing coordination latency.

2503.15510

Joint Decision-Making in Robot Teleoperation: When are Two Heads Better Than One?

"It makes you think": Provocations Help Restore Critical Thinking to AI-Assisted Knowledge Work

Injects brief textual critiques into AI suggestions to counter the critical thinking collapse that happens when people lean on generative tools. The provocations work—users question more, accept less blindly.

2501.16627

Engaging with AI: How Interface Design Shapes Human-AI Collaboration in High-Stakes Decision-Making

Shows that human + AI teams perform worse than AI alone in healthcare decisions. The culprit? Interface design that fails to calibrate trust appropriately in high-stakes contexts.

2501.15463

Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?

Borrows the "value-action gap" from psychology to test whether LLMs actually behave according to their stated values. Spoiler: they don't, and the discrepancies are measurable.

2501.17258

Controlling AI Agent Participation in Group Conversations: A Human-Centered Approach

Tackles the awkward question of when AI agents should speak up in group chats. Turns out the turn-taking mechanics we take for granted in 1:1 conversations completely break down.

2501.15727

Gensors: Authoring Personalized Visual Sensors with Multimodal Foundation Models and Reasoning

Lets end-users describe sensing tasks in natural language and get back personalized AI sensors that reason about complex situations. Think "tell me when the baby wakes up" without hardcoding.

2501.17299

"Ownership, Not Just Happy Talk": Co-Designing a Participatory Large Language Model for Journalism

Co-designs an LLM with journalists who want ownership over the model, not just consultation theater. Reveals the tension between financial pressures pushing LLM adoption and professional autonomy.

2501.18642

DebiasPI: Inference-time Debiasing by Prompt Iteration of a Text-to-Image Generative Model

Iteratively tweaks prompts at inference time to counter demographic biases in image generation. No retraining required, and it actually hits desired gender/race distributions.

2501.15678

Blissful (A)Ignorance: People form overly positive impressions of others based on their written messages, despite wide-scale adoption of Generative AI

Finds that people still trust written messages as authentic social signals even though GenAI makes them cheap to produce. Signaling theory predicted this would collapse; it hasn't.

REFLECTION(4)

Explainability builds trust, then breaks it

The research shows that showing users how AI works doesn't reliably fix miscalibration—it sometimes deepens it. Users either dismiss explanations and over-rely anyway, or weaponize them to reject systems that could genuinely help. The tension isn't between transparency and opacity; it's between the assumption that understanding leads to appropriate trust and the evidence that it often doesn't.

Explainability is treated as a trust-building tool, but the data suggests it's a trust-revealing tool—it exposes where users' mental models diverge from system capability. Does showing your work actually calibrate trust, or does it just make miscalibration more visible and harder to correct?

1 / 4

Week 04January 2025

Week 06February 2025