AI in Education: What the Evidence Really Says

People say many things about what AI will do to education. It will make teaching more efficient, assessment fairer, and learning more personalised. Or, depending on who you ask, it will render teachers redundant, make students less intelligent, and exacerbate inequality.

But what do we actually know?

Every time I get asked about this, I sigh. I try to keep up with the research myself, but it’s not an easy task. Beyond the sheer flood of publications, I keep running into a more fundamental problem: the quality of the evidence itself.

Look closely at the literature and the answer is less reassuring than the headlines suggest. Most studies on AI in education are methodologically weak: often intriguing, sometimes promising, but still fragile.

Small Studies, Big Claims

Over the past five years, hundreds of papers have been published on the application of AI in education. Systematic reviews (e.g., Létourneau et al., 2025; Li et al., 2025; Zhang et al., 2025) generally indicate small, positive effects. Students working with an AI tutor tend to perform slightly better on short knowledge tests than those in control groups.

But these effects usually come from small, context-poor studies: thirty pupils here, fifty there, often within a single school or university. Replications are rare. And most systems are evaluated by the very people who developed them.

As John Ioannidis warned back in 2005, that’s the perfect recipe for inflated effects. In the case of AI, there’s an extra ingredient: the pressure to innovate, to be seen as keeping up.

A Lack of Independent Evaluation

Where robust independent evaluations are missing, the risk of positive bias increases. Many papers conclude that an AI tool “improves learning”, but rarely mention that the difference sometimes amounts to just a few extra questions correct on a short test. Or that the comparison group received no intervention at all.

1 Padlock out of 5

The Education Endowment Foundation (EEF) is aware of this. That’s why the EEF reports not only the size of an effect, but also the security of the evidence, its “padlock rating”. A programme with five padlocks has been independently tested in multiple contexts; one padlock means preliminary evidence with high uncertainty.

If we applied that system to current AI research, most studies would score somewhere between zero and two padlocks.

Little Insight into How It Works

Even when an effect exists, we rarely know why. Many studies describe the technology, but hardly touch on the pedagogy behind it. Was it the AI feedback that made the difference? Or simply that students practised more? Was it the personalised guidance, or the extra time spent on task?

Without that context, it’s almost impossible to learn from what works. As Nancy Cartwright and Jeremy Hardie wrote in Evidence-Based Policy (2012):

“Evidence shows what worked somewhere, not what will work here.”

Short-Term and Superficial Outcomes

Another recurring issue is what gets measured. Most studies focus on short-term outcomes such as test scores, motivation, and time-on-task. Few follow students long enough to see whether they actually learn to write better, think more deeply, or collaborate more effectively: the very things that are so often promised.

A Little Light, Too

That doesn’t mean we know nothing.
Some classic intelligent tutoring systems, such as Carnegie Learning, have been repeatedly tested and consistently show modest gains, mainly when used as a complement to, rather than a replacement for, the teacher.

Newer studies on AI-based feedback also show potential, provided the tools are well embedded in classroom practice.
Organisations like UNESCO are right to stress that the question isn’t whether schools should use AI, but how, and under what safeguards.

What We Can Learn From This

The quality of the current evidence is, frankly, substandard. But that’s no reason for cynicism. It’s a reason for precision.
We can do three things at once:

  1. Experiment, because understanding AI in education requires hands-on experience.

  2. Report honestly, with clear boundaries around what we do and don’t know.

  3. Build evidence as we go, by linking implementation to independent evaluation, transparency, and replication.

That’s precisely what UNESCO means by responsible evidence-building: not waiting until everything is proven, but using each step to create better evidence.

Ultimately, the problem isn’t that the evidence is too weak to talk about AI.
The problem is that we discuss weak evidence with far too much confidence.
