The Missing Evaluation Axis: What 10,000 Student Submissions Reveal About AI Tutor Effectiveness
Rose Niousha, Samantha Boatright Smith, Bita Akram, Peter Brusilovsky, Arto Hellas, Juho Leinonen, John DeNero, Narges Norouzi
Stop evaluating AI tutors on feedback quality alone. Add behavioral metrics: did students revise their code after feedback? Did they apply it correctly? These signals predict perceived helpfulness better than expert ratings of pedagogy.
AI tutors are evaluated on pedagogical quality—how good the feedback sounds—but not on whether students actually use it. A tutor can give perfect advice that students ignore.
Method: Analyzed 10,235 code submissions to measure whether students act on AI feedback and apply it correctly. This behavioral dimension, engagement patterns, correlated more strongly with students' perceived helpfulness than pedagogical quality did. Two deployed tutors with similar pedagogical scores showed substantial differences in engagement that a pedagogy-only evaluation missed entirely.
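To make the behavioral dimension concrete, here is a minimal sketch of how such engagement signals could be computed from interaction logs. The `Interaction` schema, its field names, and the binary engagement signal are illustrative assumptions, not the paper's actual pipeline; the paper does not specify its implementation.

```python
# Hypothetical sketch: compute behavioral engagement metrics from
# (feedback, next-submission) pairs and rank-correlate them with
# students' helpfulness ratings. Schema and fields are assumptions.
from dataclasses import dataclass
from scipy.stats import spearmanr

@dataclass
class Interaction:
    student_id: str
    code_before: str        # submission that triggered AI feedback
    code_after: str | None  # next submission by the same student, if any
    issue_resolved: bool    # did the flagged problem disappear afterward?
    helpfulness: int        # student's 1-5 rating of the feedback

def _revised(e: Interaction) -> bool:
    """Did the student change their code after receiving feedback?"""
    return e.code_after is not None and e.code_after != e.code_before

def revision_rate(events: list[Interaction]) -> float:
    """Fraction of feedback events followed by a changed submission."""
    return sum(_revised(e) for e in events) / len(events)

def correct_application_rate(events: list[Interaction]) -> float:
    """Among revisions, fraction where the flagged issue was actually fixed."""
    revised = [e for e in events if _revised(e)]
    return sum(e.issue_resolved for e in revised) / max(len(revised), 1)

def engagement_vs_helpfulness(events: list[Interaction]):
    """Rank-correlate a per-event engagement signal (revised AND fixed)
    with the student's perceived-helpfulness rating."""
    engaged = [int(_revised(e) and e.issue_resolved) for e in events]
    ratings = [e.helpfulness for e in events]
    return spearmanr(engaged, ratings)
```

Under this framing, the paper's headline comparison reduces to checking whether the engagement signal's correlation with helpfulness ratings exceeds that of expert pedagogy scores on the same events.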
Caveats: Tested only in introductory programming. Other domains may show different engagement patterns.
Reflections: Do behavioral engagement patterns predict long-term learning outcomes, or just immediate satisfaction? · Can tutors be optimized directly for engagement metrics, or does that create perverse incentives? · How do engagement patterns differ across disciplines beyond programming?