What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience
Eduard Kuric, Peter Demcak, Matus Krajcovic
Don't use GPT to simulate user behavior in click tests or navigation studies. The distortions are systematic, not random noise you can average out. If budget forces a choice between synthetic data and no data, choose no data: misleading data steers design decisions worse than intuition does.
UX teams are replacing real user testing with GPT-simulated participants to cut costs and speed up iteration. The question is whether synthetic clicks predict real behavior well enough to guide design decisions.
Method: GPT failed to predict where humans click in 53% of first-click tests across twelve diverse UX studies (n=3431). Participant personas, chain-of-thought reasoning, and sampling parameter tweaks produced no meaningful improvement in fidelity; they only made the output look more believable. The synthetic responses showed systematic distortions in both click patterns and cognitive reasoning, which stem from LLMs' statistical nature and linguistic training data.
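To make the idea of a per-test prediction failure concrete, here is a minimal sketch of one way to score it. It assumes a "failure" means the most-clicked target among synthetic participants differs from the most-clicked target among human participants. That definition, the function names, and the sample data are all illustrative assumptions; the summary above does not state which metric the authors used.

```python
from collections import Counter


def modal_target(clicks):
    """Return the most frequently clicked target label.

    Ties go to the label that appears first in the list.
    """
    if not clicks:
        raise ValueError("no clicks recorded for this test")
    return Counter(clicks).most_common(1)[0][0]


def mismatch_rate(human_by_test, synthetic_by_test):
    """Fraction of tests where the synthetic modal target differs from the human one.

    Both arguments map a test name to a list of clicked target labels.
    Only tests present in both mappings are compared.
    """
    shared = human_by_test.keys() & synthetic_by_test.keys()
    if not shared:
        raise ValueError("no tests in common")
    mismatches = sum(
        modal_target(human_by_test[t]) != modal_target(synthetic_by_test[t])
        for t in shared
    )
    return mismatches / len(shared)


# Hypothetical data: two first-click tests, three participants each.
human = {
    "checkout": ["cart", "cart", "menu"],
    "search": ["searchbar", "searchbar", "menu"],
}
synthetic = {
    "checkout": ["cart", "cart", "cart"],
    "search": ["menu", "menu", "searchbar"],
}

rate = mismatch_rate(human, synthetic)
```

Comparing modal targets is a deliberately coarse choice: it ignores how clicks are spread across the remaining targets, so a distribution-level comparison may be preferable when the distribution of clicks matters more than the single winner.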
Caveats: Tested on first-click tests only. Other UX methods (surveys, interviews) may show different failure modes.
Reflections: Do LLM simulation failures extend to other behavioral methods like card sorting or tree testing? · Can hybrid approaches (real users for critical paths, synthetic for edge cases) avoid the worst distortions? · Are there specific task types where synthetic participants perform acceptably?