Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots
Licol Zeinfeld, Alona Strugatski, Ziva Bar-Dov, Ron Blonder, Shelley Rap, Giora Alexandron
Stop using aggregate benchmark scores to assess AI vulnerability. Run DIF analysis on your assessments to pinpoint which items need redesign. Best for high-stakes exams where validity matters more than convenience.
Educators need to know which test questions are vulnerable to LLM cheating, but current benchmarks report only aggregate scores, not the item-level diagnostics that reveal where AI systematically outperforms or underperforms humans.
Method: Differential Item Functioning (DIF) analysis, borrowed from bias detection in psychometrics, flags test items on which humans and chatbots show systematic response differences. Applied to a high school chemistry test and a university entrance exam with six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet), the method reliably identified items where LLM performance diverges from that of human learners, enabling subject-matter experts to characterize the task dimensions that make problems particularly easy or particularly difficult for AI.
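The summary does not say which DIF procedure the authors used, so the sketch below shows one standard technique: logistic-regression DIF, which conditions each item response on a matching ability score (here, the rest score) and flags items where group membership (human vs. chatbot) still predicts success. All data, variable names, the 170/30 group split, and the 0.05 flagging threshold are hypothetical.

```python
# Logistic-regression DIF sketch for human-vs-chatbot item analysis.
# One common DIF method; not necessarily the paper's exact model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_respondents, n_items = 200, 10

# Simulated 0/1 response matrix: rows = respondents, cols = items.
# First 170 rows are humans, last 30 are chatbot runs (hypothetical split).
responses = rng.integers(0, 2, size=(n_respondents, n_items))
group = np.array([0] * 170 + [1] * 30)   # 0 = human, 1 = chatbot
total = responses.sum(axis=1)            # matching criterion (total score)

flagged = []
for j in range(n_items):
    df = pd.DataFrame({
        "y": responses[:, j],
        "ability": total - responses[:, j],  # rest score avoids self-matching
        "group": group,
    })
    # Uniform DIF: does group membership still shift the odds of a correct
    # response after conditioning on ability?
    fit = smf.logit("y ~ ability + group", data=df).fit(disp=0)
    if fit.pvalues["group"] < 0.05:          # simple per-item flag
        flagged.append((j, round(fit.params["group"], 2)))

print("Items showing human/chatbot DIF (index, group effect):", flagged)
```

In practice one would also include a group × ability interaction term to detect non-uniform DIF and correct the per-item p-values for multiple comparisons before handing flagged items to subject-matter experts.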
Caveats: Tested on two STEM assessments. Transfer to humanities or open-ended tasks unverified.
Reflections: Do DIF-flagged items remain stable across LLM versions, or does each model update require re-analysis? · Can DIF patterns predict which item types will be vulnerable to future AI capabilities? · How do DIF results change when students use LLMs as assistants rather than direct answer generators?