Same Performance, Hidden Bias: Evaluating Hypothesis- and Recommendation-Driven AI
Michaela Benk, Tim Miller
Stop evaluating AI assistants on accuracy alone. Recommendation-driven interfaces corrupt decision processes by lowering evidence standards. If judgment quality matters more than speed, use hypothesis-driven designs that preserve stable thresholds.
AI decision support systems are evaluated on accuracy and reliance, but identical performance can mask how users actually form judgments. Do recommendation-driven interfaces change evidence standards even when outcomes stay the same?
Method: A 290-person experiment using Signal Detection Theory found that recommendation-driven designs lowered users' thresholds for sufficient evidence compared to hypothesis-driven designs, even when task performance remained identical. This created a "hidden bias": a shifted distribution of errors at matched accuracy. Experts showed the same threshold shifts as novices, meaning experience did not protect against the systematic effect.
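The finding can be illustrated with a minimal equal-variance Signal Detection Theory sketch. This is not the authors' analysis code, and all numbers (sensitivity d', the two criterion placements) are invented for illustration: it only shows how a lowered decision criterion redistributes errors toward false alarms while overall accuracy barely moves.

```python
# Illustrative equal-variance SDT model: noise ~ N(0,1), signal ~ N(d',1),
# equal base rates. Sensitivity (d') is held fixed; only the criterion moves.
from math import erf, sqrt


def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def sdt_outcomes(d_prime: float, criterion: float):
    """Hit rate, false-alarm rate, and accuracy for a given criterion."""
    hit_rate = 1.0 - phi(criterion - d_prime)  # P(respond "signal" | signal)
    fa_rate = 1.0 - phi(criterion)             # P(respond "signal" | noise)
    accuracy = 0.5 * hit_rate + 0.5 * (1.0 - fa_rate)
    return hit_rate, fa_rate, accuracy


d_prime = 1.5                 # same sensitivity in both conditions (assumed)
neutral_c = d_prime / 2       # hypothesis-driven: unbiased criterion
lowered_c = 0.25              # recommendation-driven: lowered evidence bar

hit_n, fa_n, acc_n = sdt_outcomes(d_prime, neutral_c)
hit_l, fa_l, acc_l = sdt_outcomes(d_prime, lowered_c)

print(f"neutral criterion: hit={hit_n:.2f}  FA={fa_n:.2f}  acc={acc_n:.2f}")
print(f"lowered criterion: hit={hit_l:.2f}  FA={fa_l:.2f}  acc={acc_l:.2f}")
```

With these illustrative parameters, accuracy shifts by only a few points while the false-alarm rate nearly doubles, which is the pattern the summary calls a "hidden bias": the same headline performance, but a different mix of errors underneath.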
Caveats: Web-based experiment. Real-world stakes and repeated use might amplify or attenuate threshold shifts.
Reflections: Do threshold shifts persist or normalize with extended exposure to recommendation-driven systems? · Can interface design restore stable evidence standards while keeping recommendation convenience? · Which professional domains are most vulnerable to hidden bias from lowered thresholds?