The end of the illusion? Microsoft punctures its own medical AI hype

Imagine this: a model like GPT-5 achieves top results on medical exams. It tops benchmarks featured in leading journals and seems ready to support doctors. Sounds impressive. Yet as soon as the tests become a little more demanding, the façade crumbles. That is the message of a new paper from Microsoft Research, tellingly titled The Illusion of Readiness (Gu et al., 2025).

I am usually quite critical when companies publish research on products that are (partly) their own. In this case, though, it would be difficult to be more critical than Microsoft is of itself.

The researchers show that current models often succeed for the wrong reasons. They exploit patterns and statistical shortcuts rather than actually understanding medicine. A model can, for instance, “diagnose” pneumonia correctly without ever looking at the X-ray, simply because the words “fever” and “cough” frequently co-occur with that diagnosis in its training data.

Stress tests instead of benchmarks

Microsoft subjected six leading multimodal models – including GPT-5, Gemini 2.5 Pro and OpenAI’s o-series – to so-called stress tests, examining how they performed when the conditions were altered: images were removed, answer options were reshuffled, distractors were replaced, and subtle visual modifications were introduced.
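
To make this concrete, here is a minimal sketch (my own, not taken from the paper) of one such perturbation: reshuffling the answer options of a multiple-choice item and checking whether the model’s accuracy survives. The ask_model function is a hypothetical placeholder for whatever model API is under test, and the item format is an assumption.

    import random

    # Hypothetical stand-in for a real model call (e.g. an API client);
    # it should return the letter of the option the model chooses.
    def ask_model(question: str, options: dict[str, str]) -> str:
        raise NotImplementedError("plug in the model under test here")

    def shuffle_options(options, answer, rng):
        """Randomly reassign option letters; return the new options and the
        new correct letter. Assumes option texts are distinct."""
        letters = sorted(options)          # e.g. ["A", "B", "C", "D"]
        texts = list(options.values())
        rng.shuffle(texts)
        shuffled = dict(zip(letters, texts))
        # The correct answer text is unchanged; find the letter it landed on.
        new_answer = next(k for k, v in shuffled.items() if v == options[answer])
        return shuffled, new_answer

    def stress_test(items, n_permutations=5, seed=0):
        """Accuracy on the original option order vs. on permuted orders."""
        rng = random.Random(seed)
        base_hits, perm_hits, perm_total = 0, 0, 0
        for item in items:  # item: {"question": str, "options": {letter: text}, "answer": letter}
            base_hits += ask_model(item["question"], item["options"]) == item["answer"]
            for _ in range(n_permutations):
                opts, ans = shuffle_options(item["options"], item["answer"], rng)
                perm_hits += ask_model(item["question"], opts) == ans
                perm_total += 1
        return base_hits / len(items), perm_hits / perm_total

If accuracy on the permuted items falls well below accuracy on the originals, the model was leaning on option position rather than on the medical content of the question.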

The results are painful:

  • Without images, models often continued to score surprisingly well, suggesting they were not really interpreting the visuals at all but relying on textual hints or memorisation.

  • When the order of answer options was changed, accuracy dropped sharply. That suggests a reliance on superficial patterns rather than genuine understanding.

  • Even more troubling: the models produced convincing explanations that turned out to be completely fabricated. They described in detail the visual features they claimed to see, when in fact no image was present.

In other words, AI can appear to reason, but often it is little more than a parlour trick.

What does this mean for the future?

The study underlines an uncomfortable truth. Current medical benchmarks are not reliable predictors of how models will behave in real hospitals. They measure test-taking strategies rather than robust understanding. And yet these scores are frequently presented as evidence of “medical readiness”.

Microsoft, therefore, calls for a radical rethink: benchmarks should be regarded as diagnostic instruments, not end goals. Stress tests that simulate uncertainty, incomplete data and misleading cues are essential if we are to know whether a model can truly be trusted.

That Microsoft is saying this so loudly makes the message even stronger. The company has every incentive to showcase its progress, yet here it opts for transparency and self-criticism. The signal is clear: anyone wishing to deploy AI in healthcare must look beyond leaderboard competitions.

A broader lesson

What holds for medicine applies equally to other fields where AI is leaping fast from demo to practice. Whether in education, the legal system or transport, models must not only give the right answers. They must give them for the right reasons. The difference between a lucky guess and a diagnosis can literally mean life or death.

The future of AI, therefore, depends not only on building ever larger models but equally on developing better ways of testing and evaluating them. Transparency, resilience under stress and robust reasoning will be the new watchwords.

And the fact that Microsoft itself is articulating this so sharply may well mark a welcome turning point in how we talk about AI progress.

