When AI Grades Your Essay: Impressive, but Problematic

Imagine submitting a paper as a student. A few days later, you receive feedback. Detailed. Polite. Well-structured. Perhaps even empathetic and warm-hearted. Except: no one has actually read your text.

That scenario is closer than many people think. A new British research project involving Cambridge, Nottingham, and Manchester Metropolitan University, among others, investigated how well AI systems such as ChatGPT, Claude, and Gemini can grade university essays. Not with artificial tests, but with real exams taken by psychology students from three different universities. The researchers compared AI scores with the grades originally assigned by human assessors.

And the result is impressive and problematic at once, though perhaps not in the way you might suspect. The AI systems turned out to be remarkably consistent. Sometimes their agreement even came close to what you normally see between two human evaluators. Moreover, the AI systems were often more consistent with each other than with humans.

But… consistency is not necessarily the same as good assessment. As soon as the researchers looked more closely at what exactly the AI was responding to, problems surfaced. The systems turned out to be systematically more sensitive to language form than human assessors. Longer essays received higher scores more easily. The same applied to texts with more complex sentences, richer vocabulary, and more connecting words between ideas.

That is actually logical. A language model predicts language. It does not “understand” an essay in the way a teacher or instructor tries to do. The software looks for statistical patterns that often go hand in hand with good texts. However, those patterns are not always synonymous with quality.

Even more striking was a second effect: AI tended to pull everything towards the middle. Weak essays received relatively high scores, strong essays relatively low ones. Precisely where evaluation often matters most, at the boundaries between pass and fail or between average and excellent, AI proved to be the least reliable.
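For readers who want to see what "pulling towards the middle" looks like in numbers, here is a minimal sketch. It is not the researchers' method, and the marks are invented purely for illustration. The function name `compression_slope` is hypothetical. The idea: if you regress AI marks on human marks, a slope of 1 means the AI spreads its marks as widely as the human does, while a slope well below 1 means weak essays are marked up and strong essays are marked down.

```python
from statistics import mean


def compression_slope(human, ai):
    """Least-squares slope of AI marks regressed on human marks.

    A slope near 1 means a similar spread; a slope well below 1
    means the AI marks are squeezed towards the middle.
    """
    h_bar, a_bar = mean(human), mean(ai)
    cov = sum((h - h_bar) * (a - a_bar) for h, a in zip(human, ai))
    var = sum((h - h_bar) ** 2 for h in human)
    return cov / var


# Invented marks, for illustration only (NOT data from the study).
human_marks = [40, 50, 60, 70, 80]
ai_marks = [52, 56, 60, 64, 68]

slope = compression_slope(human_marks, ai_marks)
print(round(slope, 2))  # prints 0.4
```

In this made-up example both graders average 60, yet the AI's marks run only from 52 to 68 where the human's run from 40 to 80. An essay the human marked as a clear fail (40) ends up at 52, which is exactly the kind of shift that matters at a pass boundary.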

That makes this research more interesting than the many discussions about “AI works” or “AI doesn’t work”. In fact, the systems clearly *do* work (partially). Just not necessarily in the way educational assessment is intended to work.

And that is where the issue becomes pedagogically relevant.

In the focus groups, the discussion ultimately became less about technology and more about relationships. Students and teachers described assessment as part of a kind of social contract. Feedback is not only about grades, but also about recognition. About the feeling that someone has genuinely read your work. That someone is trying to understand your thinking.

One participating student put it sharply:

“If both the students and the teachers are using it, then like, who’s learning? What are we doing here?”

It is a question I have asked myself, in similar words, more than once.

That may sound dramatic, but it touches on a fundamental question. What is assessment actually for? Merely an efficient way to classify performance? Or also an essential part of education itself?

The researchers, incidentally, are remarkably nuanced. They do not argue that AI should never be used in assessment. On the contrary, they see potential applications in moderation, quality control, or additional feedback. The latter is also one of the ways I use it myself. I noticed the same tendency towards the middle there as well, though fortunately, I still read and assess everything myself first and foremost. Some students have even discovered over the past months that I sometimes go so far as to track down and review their cited sources*.

And all of this aligns with the researchers’ warning that universities should be extremely cautious about using automated systems as the primary evaluator.

* I know that some of my students read this blog regularly and are probably smiling in recognition right now.
