Imagine submitting a paper as a student. A few days later, you receive feedback. Detailed. Polite. Well-structured. Perhaps even empathetic and warm-hearted. Except: no one has actually read your text.
That scenario is closer than many people think. A new British research project involving Cambridge, Nottingham, and Manchester Metropolitan University, among others, investigated how well AI systems such as ChatGPT, Claude, and Gemini can grade university essays. Not with artificial tests, but with real exams taken by psychology students from three different universities. The researchers compared AI scores with the grades originally assigned by human assessors.
And the result is both impressive and problematic, though perhaps not in the way you might suspect. The AI systems turned out to be remarkably consistent. Sometimes their agreement with the human grades even came close to what you normally see between two human evaluators. Moreover, the AI systems were often more consistent with each other than with humans.
But… consistency is not necessarily the same as good assessment. For as soon as the researchers took a closer look at exactly what AI was responding to, things went wrong. The systems turned out to be systematically more sensitive to language form than human assessors. Longer essays received higher scores more easily. The same applied to texts with more complex sentences, richer vocabulary, and more connecting words between ideas.
That is actually logical. A language model predicts language. It does not “understand” an essay in the way a teacher or instructor tries to do. The software looks for statistical patterns that often go hand in hand with good texts. However, those patterns are not always synonymous with quality.
Even more striking was a second effect: AI tended to pull everything towards the middle. Weak essays received relatively high scores, strong essays relatively low ones. Precisely where evaluation often becomes most important, at the boundaries between pass and fail or between average and excellent, AI proved to be the least reliable.
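For readers who like to see numbers, here is a minimal Python sketch of how high consistency and a pull towards the middle can coexist. Everything in it is an assumption made for illustration: the marks are invented, the pass mark of 50 is hypothetical, and the helper names `pearson`, `slope`, and `boundary_flips` are not taken from the study.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation: how closely two sets of marks move together."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def slope(xs, ys):
    """Least-squares slope of ys on xs; a value below 1 means ys are compressed."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / var_x

def boundary_flips(xs, ys, cutoff):
    """Count essays that land on different sides of a cutoff for the two graders."""
    return sum((x >= cutoff) != (y >= cutoff) for x, y in zip(xs, ys))

# Invented marks on a 0-100 scale, NOT data from the study.
human_marks = [35, 45, 55, 65, 75, 85]
ai_marks = [48, 52, 58, 62, 68, 72]  # same ranking, squeezed towards 60

PASS_MARK = 50  # hypothetical pass/fail boundary

print("correlation:", round(pearson(human_marks, ai_marks), 3))
print("slope:", round(slope(human_marks, ai_marks), 3))
print("pass/fail disagreements:", boundary_flips(human_marks, ai_marks, PASS_MARK))
```

In this toy example the two sets of marks correlate almost perfectly (above 0.99), yet the slope is only about 0.5: the AI marks cover roughly half the range of the human ones. The essay that a human marked 45 scores 52 with the AI and crosses the hypothetical pass mark, which is exactly the kind of boundary case described above.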
That makes this research more interesting than so many discussions about “AI works” or “AI doesn’t work”. In fact, the systems clearly *do* work (partially). Just not necessarily in the way educational assessment is intended to work.