Week 24 / June 2025

The Infrastructure Beneath AI Claims Is Collapsing

Platform APIs fail audits, evaluation methods can't support their conclusions, and expertise resists capture

Synthesized using AI

Analyzed 132 papers. AI models can occasionally hallucinate; please verify critical details.

The research infrastructure supporting AI evaluation and platform studies is failing under scrutiny. Rieder et al.'s six-month audit of YouTube's Data API reveals that the search endpoint—used in hundreds of academic papers—returns incomplete, inconsistent results, with the 'relevance' ranking parameter often retrieving off-topic videos while missing pertinent content. Twitch's AutoMod shows parallel failures, silencing empowerment language from marginalized communities while allowing bigotry to pass. These aren't edge cases; they're systematic breakdowns in the tools academics and practitioners use to understand online platforms. Meanwhile, Wei et al. demonstrate that human performance baselines in AI evaluations lack the rigor to support 'superhuman' claims, and Jeon et al. document widespread misuse of t-SNE and UMAP for inter-cluster analysis—a mathematically invalid practice appearing in peer-reviewed visualization research.

The evaluation crisis extends beyond platforms into embodied systems. Zielasko et al. show that VR-induced cybersickness produces physiological stress responses lasting 90 minutes after headset removal, with cortisol and alpha-amylase elevation persisting alongside working memory impairment. Current safety guidelines assume immediate recovery, creating liability gaps for any deployment where users must drive or make critical decisions post-session. Separately, multiple papers expose the limits of formalizing expertise: craft workflows rely on improvisational knowledge that resists documentation, AI citation frameworks collapse when outputs are ephemeral, and animation tools can't capture the tacit reasoning behind timing choices.

The through-line is uncomfortable: the measurement and documentation systems we've built to support AI deployment and platform research can't bear the weight we're placing on them. Practitioners making decisions based on benchmark performance, API-derived insights, or formalized expertise transfer are operating with compromised foundations. The question isn't whether to proceed—it's how to build verification mechanisms that don't depend on infrastructure we now know to be unreliable.

Featured

2506.08725

Stop Misusing t-SNE and UMAP for Visual Analytics

Hyeon Jeon, Jeongin Park, Sungbok Shin, Jinwook Seo

Preprint·2025-06-10

Stop inferring inter-cluster relationships from t-SNE/UMAP plots. Use them only for within-cluster exploration. For cluster comparisons, overlay distance metrics directly or use MDS-based methods that preserve global structure.

Practitioners routinely use t-SNE and UMAP projections to compare cluster distances, even though these algorithms distort inter-cluster relationships. The projections preserve local neighborhoods but misrepresent how far apart clusters are.

Method: The authors reviewed 136 papers and found widespread misuse: analysts treat projected distances as ground truth for cluster similarity. They interviewed researchers who admitted they knew the projections were unreliable but used them anyway because "everyone does." The core issue: t-SNE and UMAP optimize for local neighborhood preservation, not global structure. When you see two clusters far apart in a projection, that distance is meaningless—it's an artifact of the algorithm's cost function, not your data.
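One way to act on the recommendation above is to compute cluster-to-cluster distances in the original feature space instead of reading them off a 2-D projection. The sketch below is illustrative only: the helper names (`centroid`, `inter_cluster_distances`) and the toy data are assumptions, not code from the paper, and centroid distance is just one of several possible inter-cluster measures.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Mean of a list of equal-length numeric vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def inter_cluster_distances(clusters):
    """Map each pair of cluster labels to the Euclidean distance
    between their centroids in the ORIGINAL feature space."""
    centers = {label: centroid(pts) for label, pts in clusters.items()}
    return {
        (a, b): dist(centers[a], centers[b])
        for a, b in combinations(sorted(centers), 2)
    }


# Hypothetical toy data: three clusters in a 4-dimensional space.
clusters = {
    "A": [[0, 0, 0, 0], [2, 0, 0, 0]],
    "B": [[1, 3, 0, 0], [1, 5, 0, 0]],
    "C": [[1, 0, 0, 12], [1, 0, 0, 12]],
}

distances = inter_cluster_distances(clusters)
for pair, d in distances.items():
    print(pair, round(d, 2))
```

A table of these distances can be shown next to the projection, so readers compare clusters using numbers from the data and use the plot only for within-cluster exploration.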

Caveats: The paper focuses on academic misuse; production dashboards may already use better methods, but the interviews suggest even experts fall into this trap.

Reflections: What alternative projection methods preserve both local and global structure without sacrificing interpretability? · How can visualization tools programmatically warn users when they're misinterpreting projections? · Do practitioners misuse other dimensionality reduction techniques (PCA, autoencoders) in similar ways?

data-visualization · evaluation-methods

2506.11727

Forgetful by Design? A Critical Audit of YouTube's Search API for Academic Research

Bernhard Rieder, Adrian Padilla, Oscar Coromina

Information, Communication and Society, 1-20·2025-06-13

Stop treating YouTube API results as representative samples. Document your ranking parameters and run repeated queries to measure consistency. For longitudinal studies, archive raw results—don't assume you can re-query later.

Researchers rely on YouTube's Data API to study content trends, but the API returns inconsistent, incomplete results. The same query run twice yields different video sets, undermining reproducibility.

Method: Over six months of weekly searches using eleven queries, the authors found that the "relevance" ranking parameter retrieved numerous off-topic videos while "date" ranking showed better precision but lower recall. The API's completeness and consistency varied wildly—searches for identical terms returned different result sets across time, suggesting the API doesn't expose YouTube's full index. The audit used systematic logging to quantify recall gaps and ranking drift, revealing that the API is a lossy, biased window into YouTube's catalog.
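The advice to run repeated queries and measure consistency can be made concrete with a set-overlap score. The sketch below assumes the video IDs returned by each run of the same query have already been collected. The IDs, the `jaccard` and `mean_pairwise_consistency` helpers, and the choice of Jaccard similarity are illustrative assumptions, not the authors' audit method.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of IDs (1.0 = identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets are trivially identical
    return len(a & b) / len(a | b)


def mean_pairwise_consistency(runs):
    """Average Jaccard similarity over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure consistency")
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


# Hypothetical video IDs from three runs of the same query.
runs = [
    ["v1", "v2", "v3", "v4"],
    ["v1", "v2", "v3", "v5"],
    ["v1", "v2", "v6", "v7"],
]

score = mean_pairwise_consistency(runs)
print(f"mean pairwise Jaccard: {score:.3f}")
```

A score well below 1.0 across runs of an identical query is a signal to archive the raw results and to report the variation alongside any findings.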

Caveats: The study used eleven queries over six months; broader query sets or different time windows might reveal different patterns of incompleteness.

Reflections: Do other platform APIs (TikTok, Instagram) exhibit similar inconsistencies, or is this YouTube-specific? · Can researchers build correction models to estimate the true distribution from biased API samples? · How much of the inconsistency stems from API design versus YouTube's internal ranking volatility?

evaluation-methods · privacy-security · ethics

2506.11536

Do Not Immerse and Drive? Prolonged Effects of Cybersickness on Physiological Stress Markers And Cognitive Performance

Daniel Zielasko, Ben Rehling, Bernadette von Dawans, Gregor Domes

Preprint·2025-06-13

Design VR experiences with mandatory cooldown periods. Don't let users jump into high-stakes tasks (driving simulations, surgical training) immediately after intense VR. Test for working memory deficits, not just nausea.

VR sessions can induce cybersickness, but how long the aftereffects last is poorly understood. Users might leave a VR experience feeling fine while their cognitive performance is still impaired.

Method: Using a carousel simulation to trigger cybersickness, the researchers measured salivary cortisol and alpha-amylase (physiological stress markers) plus working memory performance after exposure. They found that subjective discomfort (measured via SSQ and FMS) persisted beyond the VR session, and physiological stress markers remained elevated. The study tracked these markers over time to quantify recovery windows, revealing that cognitive impairment outlasts the immediate nausea.

Caveats: The study used a carousel simulation, which is a strong cybersickness trigger; milder VR experiences might have shorter aftereffects.

Reflections: Do certain VR locomotion techniques (teleportation vs. smooth movement) produce different recovery timelines? · Can pre-exposure training or gradual acclimation reduce the duration of aftereffects? · How do individual differences (age, VR experience, susceptibility to motion sickness) modulate recovery time?

wearables · augmented-reality · evaluation-methods · safety


Findings

Platform auditing reveals infrastructural amnesia as a research barrier · AI tutoring systems expose process, not just output · Citation frameworks break when collaborators aren't retrievable · Visualization tools are systematically misused despite designer intent · Avatar-mediated environments unlock identity expression where physical spaces constrain it

YouTube's API systematically fails to retrieve older content, a pattern the audit frames as possibly 'forgetful by design', while Twitch's AutoMod shows inconsistent enforcement patterns. These look less like bugs than like business logic that treats historical data as a liability rather than a record. For researchers, this means platform-provided tools are unreliable proxies for actual platform behavior. The implication: academic findings based on API access may be studying the access layer, not the platform itself.

2506.11727

Forgetful by Design? A Critical Audit of YouTube's Search API for Academic Research

2506.07667

Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch

Surprises

Human baselines in AI evaluations are less rigorous than the models they benchmark · Cybersickness effects persist long after the headset comes off · Privacy tools fail because users don't understand the threat model

Models are routinely claimed to achieve 'super-human' performance, but the paper reveals that human baseline methodologies lack the rigor applied to model evaluations—inconsistent sampling, poorly controlled conditions, and opaque reporting. The humans being outperformed may not represent competent performance at all. This inverts the assumption that human benchmarks are the gold standard; they may be the weakest link in the evaluation chain.

2506.13776

Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations

TOOLBOX (6)

Human Baselines Checklist and Dataset

Dataset

A systematic review dataset containing 115 human baseline studies from foundation model evaluations, accompanied by a reporting checklist framework. Enables researchers to assess rigor in human-AI performance comparisons and supports meta-analysis of evaluation practices. Available on GitHub for reproducibility and extension.

2506.13776

Gear8

Dataset

An industrial assembly dataset featuring gear components for training lightweight object detection models in factory environments. Includes automated data construction pipeline with two-stage refinement strategy to improve visual robustness under domain shift and common visual corruptions. Designed for on-device deployment in privacy-constrained industrial settings.

2507.21072

SRB-300

Dataset

A 300-hour annotated Swiss German speech corpus featuring real-world long-audio recordings from 39 radio and TV stations. Captures spontaneous conversational speech across all major Swiss dialects in realistic environments. Used to fine-tune Whisper models, achieving 19-33% WER improvements and 8-40% BLEU score increases for low-resource speech-to-text systems.

2506.08836

CraftLink

Tool

An interface tool implementing an elementary grammar for documenting improvisational craft workflows. Analyzes expert videos and semi-automatically generates documentation capturing material and contextual variations in craft practices. Enables craftspeople to share tacit knowledge beyond linear step-by-step instructions, supporting collaborative knowledge archives within craft communities.

2506.10891

Needle

Tool

An interactive visual analytics system for navigating threaded online discussions like Reddit. Summarizes conversational metrics including activity, toxicity levels, and voting trends over time with both high-level insights and detailed thread breakdowns. Reduces cognitive load for moderators by helping prioritize areas needing attention in large-scale discussions.

2506.11276

Socheton

Tool

A culturally appropriate AI-mediated platform for reproductive well-being education in Bangladesh. Integrates healthcare professionals, AI-language teachers, and community members to moderate activity-based content. Combats misinformation using culturally appropriate language while promoting democratic participation, designed following distributive justice principles for marginalized communities.

2506.12357

Levels of Autonomy for AI Agents

Proposes treating agent autonomy as a design parameter with deliberate calibration points. Directly confronts the double-edged sword problem: more autonomy unlocks capability but multiplies risk.

2506.12347

Why AI Agents Still Need You: Findings from Developer-Agent Collaborations in the Wild

Studies real developer-agent collaborations beyond benchmarks. Reveals that complex, ambiguous tasks still break autonomous agents, forcing them back into interactive mode.

2506.10197

Intergenerational AI Literacy in Korean Immigrant Families: Interpretive Gatekeeping Meets Convenient Critical Deferment

Maps how Korean immigrant families negotiate ChatGPT and smart assistants across language barriers and generational divides. Trust becomes a family negotiation, not an individual choice.

2506.09220

Beyond the Hype: Mapping Uncertainty and Gratification in AI Assistant Use

Interviews early adopters of Rabbit R1 and Humane AI Pin to map the gap between promise and performance. Uses gratification theory to explain why people keep using disappointing devices.

2506.10249

Extended Creativity: A Conceptual Framework for Understanding Human-AI Creative Relations

Builds a relational cognition framework for human-AI creativity. Identifies three fundamental modes of creative enhancement—because 'AI makes you creative' needs actual theoretical grounding.

2506.09707

Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

Automates fidelity assessment in PTSD therapy by timestamping key therapeutic elements in session audio. Tackles the labor-intensive bottleneck in scaling evidence-based mental health treatment.

2506.10587

IDEA: Augmenting Design Intelligence through Design Space Exploration

Formalizes design spaces mathematically to enable computational exploration. Confronts the experience-dependency problem: how do you make good design decisions without years of tacit knowledge?

2506.09212

Show Me Your Best Side: Characteristics of User-Preferred Perspectives for 3D Graph Drawings

Identifies what makes a 'good' camera angle for 3D graph visualizations in AR/VR. Turns out viewpoint selection is critical—3D layouts are useless if you're looking from the wrong side.

REFLECTION (4)

When infrastructure demands we stop trusting

AI systems are moving from optional assistants to embedded collaborators—and our ability to evaluate them is collapsing in real time. The research reveals a pattern: the more essential AI becomes, the less we can verify what it's actually doing, and the more we depend on it anyway.

We've built evaluation frameworks around 'superhuman' baselines that are themselves unverifiable—crowdsourced data is contaminated, human benchmarks are phantoms, and stakeholder perspectives are systematically absent. If our measurement infrastructure is this brittle, are we measuring capability or just measurement failure?


ABOUT THIS ISSUE

How was this newsletter synthesized?

Methodology

This newsletter is generated by an AI pipeline (using Anthropic's Claude Sonnet 4.5 and Haiku 4.5) that processes the metadata and abstracts of every new arXiv HCI paper from the past week: 132 in this issue. Each paper is scored from 1 to 5 on three dimensions: Practice (applicability for practitioners), Research (scientific contribution), and Strategy (industry implications). Papers that pass a score threshold are grouped into topic clusters, and each cluster is summarized to capture what that body of research is exploring.

Selection Criteria

The pipeline builds a curated selection that balances high scores with topic diversity—and deliberately includes at least one 'contrarian' paper that challenges prevailing assumptions. This selection is then analyzed to identify key findings (patterns across multiple papers) and surprises (results that contradict conventional wisdom). A narrative synthesis ties the week's research together under a unifying frame.
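The scoring and selection steps described above could be sketched roughly as follows. The newsletter does not publish its threshold, how the three scores are combined, or how diversity is enforced, so the mean-score threshold of 3.5, the round-robin pass over clusters, and every name and data value here are assumptions made purely for illustration. The 'contrarian' pick is not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Paper:
    paper_id: str   # hypothetical identifier
    cluster: str    # topic cluster label
    practice: int   # 1-5
    research: int   # 1-5
    strategy: int   # 1-5

    @property
    def mean_score(self):
        return (self.practice + self.research + self.strategy) / 3


def select(papers, threshold=3.5, k=3):
    """Keep papers at or above the threshold, then pick up to k of them
    by taking the best remaining paper from each cluster in turn."""
    by_cluster = defaultdict(list)
    for p in papers:
        if p.mean_score >= threshold:
            by_cluster[p.cluster].append(p)
    for group in by_cluster.values():
        group.sort(key=lambda p: p.mean_score, reverse=True)

    # Visit clusters in order of their strongest paper.
    order = sorted(by_cluster, key=lambda c: by_cluster[c][0].mean_score,
                   reverse=True)
    picked = []
    while len(picked) < k and any(by_cluster[c] for c in order):
        for c in order:
            if by_cluster[c] and len(picked) < k:
                picked.append(by_cluster[c].pop(0))
    return picked


# Hypothetical scored papers.
papers = [
    Paper("p1", "evaluation", 5, 5, 4),
    Paper("p2", "evaluation", 5, 4, 4),
    Paper("p3", "vr", 4, 4, 4),
    Paper("p4", "vr", 2, 3, 2),      # falls below the threshold
    Paper("p5", "craft", 4, 3, 4),
]

chosen = [p.paper_id for p in select(papers)]
print(chosen)
```

In this toy run the second-highest-scoring paper (`p2`) is passed over in favor of papers from other clusters, which is the trade-off between high scores and topic diversity that the selection criteria describe.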

Key Themes Discovered

Field Report: ai-interaction

Trust, Personalization, and Collaboration

This cluster examines how users calibrate trust in AI systems and negotiate personalized, collaborative relationships with them. Core questions: How do people decide when to rely on AI? What breaks trust? How should systems adapt to individual preferences and contexts? Research spans educational chatbots, healthcare agents, workplace collaboration, and social companions—revealing persistent gaps between AI capability and user expectations. Dominant pattern: systems that enable transparency, user control, and iterative feedback outperform autonomous designs. Secondary focus on intergenerational and cultural dimensions of AI adoption.


Top Papers in this Theme

2506.12248

ProVox: Personalization and Proactive Planning for Situated Human-Robot Collaboration

2507.21071

FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents

2506.09220

Beyond the Hype: Mapping Uncertainty and Gratification in AI Assistant Use

2506.10197

Intergenerational AI Literacy in Korean Immigrant Families: Interpretive Gatekeeping Meets Convenient Critical Deferment

2506.10249