There’s a growing body of research looking at how AI models like ChatGPT handle science. A new article in Learned Publishing by Thelwall and colleagues zooms in on a particular question: does ChatGPT recognise when a scientific paper has been retracted?
The answer is sobering. In this study, 217 high-profile retracted or otherwise problematic papers were fed to ChatGPT 4o-mini, each evaluated 30 times (I came across the study via Retraction Watch). Not once, across 6,510 reports, did the model mention that a retraction or correction had taken place. On the contrary, it often gave these studies high-quality ratings, sometimes even “world-leading”. And when the researchers extracted individual claims from the retracted papers and asked ChatGPT to assess them, the model said they were true or mostly true in about two-thirds of cases, including at least one claim that had been shown to be false more than a decade ago.
At first glance, this sounds like a serious alarm bell. But a few caveats are in order. First, this is about one very specific version of ChatGPT (4o-mini, tested in summer 2024), with a knowledge cut-off in October 2023. More recent versions may handle retraction notices or live web data more effectively. So it would be misleading to generalise these results to “ChatGPT” as a whole.
Second, the methodological choice is both interesting and limiting. The researchers asked the model to evaluate abstracts for research quality, not explicitly whether a paper had been retracted. Human readers wouldn’t necessarily pick up on that either from an abstract alone. Still, the core finding stands: even for very well-known retractions, widely covered in the media and on Wikipedia, ChatGPT failed to flag them.
What this study underlines is that we shouldn’t blindly trust AI-generated literature summaries. If an LLM labels an article “high quality,” that doesn’t mean the scientific community still agrees. And one personal observation: this isn’t just an AI issue. Human researchers also keep citing retracted papers, often without realising their status has changed.
The takeaway is twofold. On the one hand, large language models can be useful for early exploration, but not for the final check. On the other hand, we as researchers, lecturers and students need to be much more alert to the status of our sources. Tools like Retraction Watch exist, but are still far too rarely consulted.
It would be great if AI could actually help with this — automatically flagging when a paper has been retracted, or when a claim sits in the danger zone. But we’re not there yet. Until then, an old lesson still applies, perhaps more than ever: trust is good, checking is better.
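To make that concrete: a first step in that direction does not even require AI, because retraction and correction notices are registered in Crossref metadata (and Crossref has, as far as I know, also taken over the Retraction Watch database). The Python sketch below queries the public Crossref REST API for notices that update a given DOI. The exact filter name (`updates`) and response field (`update-to`) reflect my reading of the Crossref schema, so treat this as an illustration to verify against the current documentation rather than a finished tool.

```python
"""
Minimal sketch: check whether Crossref lists a retraction or correction
notice for a given DOI. The 'updates' filter and 'update-to' field are
assumptions about the Crossref schema; verify before relying on this.
"""
import requests

CROSSREF_API = "https://api.crossref.org/works"


def retraction_notices(doi: str) -> list[dict]:
    """Return Crossref records that declare themselves updates (retractions,
    corrections, errata) of the given DOI. An empty list means no notice was
    found, which is not proof of a clean record."""
    resp = requests.get(
        CROSSREF_API,
        params={"filter": f"updates:{doi}", "rows": 20},
        # Polite-pool convention: identify yourself with a contact address.
        headers={"User-Agent": "retraction-check-sketch (mailto:you@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json().get("message", {}).get("items", [])

    notices = []
    for item in items:
        # Each notice should point back at the work it updates.
        for update in item.get("update-to", []):
            if update.get("DOI", "").lower() == doi.lower():
                notices.append(
                    {
                        "notice_doi": item.get("DOI"),
                        "update_type": update.get("type"),  # e.g. 'retraction', 'correction'
                        "title": (item.get("title") or [""])[0],
                    }
                )
    return notices


if __name__ == "__main__":
    doi = "10.1000/exampledoi"  # hypothetical placeholder; substitute a DOI you want to check
    for notice in retraction_notices(doi):
        print(f"{notice['update_type']}: {notice['notice_doi']} ({notice['title']})")
```

A check like this could sit in front of any AI-generated literature summary, so that a “high quality” verdict at least comes with a warning when a retraction notice exists; the absence of a notice, of course, proves nothing.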
Abstract of the study:
Large language models (LLMs) like ChatGPT seem to be increasingly used for information seeking and analysis, including to support academic literature reviews. To test whether the results might sometimes include retracted research, we identified 217 retracted or otherwise concerning academic studies with high altmetric scores and asked ChatGPT 4o-mini to evaluate their quality 30 times each. Surprisingly, none of its 6510 reports mentioned that the articles were retracted or had relevant errors, and it gave 190 relatively high scores (world leading, internationally excellent, or close). The 27 articles with the lowest scores were mostly accused of being weak, although the topic (but not the article) was described as controversial in five cases (e.g., about hydroxychloroquine for COVID-19). In a follow-up investigation, 61 claims were extracted from retracted articles from the set, and ChatGPT 4o-mini was asked 10 times whether each was true. It gave a definitive yes or a positive response two-thirds of the time, including for at least one statement that had been shown to be false over a decade ago. The results therefore emphasise, from an academic knowledge perspective, the importance of verifying information from LLMs when using them for information seeking or analysis.