Experienced exam markers may struggle to spot answers generated by AI

This study may surprise many readers. The rise of AI systems like ChatGPT poses a serious threat to academic integrity, especially in unsupervised assessments. This study investigates the extent of this issue by inserting AI-generated submissions into a university’s examination process. Alarmingly, 94% of these submissions went undetected and often outperformed real student work.

From the press release:

The study was conducted at the University of Reading, UK, where university leaders are working to identify potential risks and opportunities of AI for research, teaching, learning, and assessment, with updated advice already issued to staff and students as a result of their findings.

The researchers are calling on the global education sector to follow the example of Reading and other institutions that are forming new policies and guidance, and to do more to address this emerging issue.

In a rigorous blind test of a real-life university examinations system, published today (26 June) in PLOS ONE, ChatGPT-generated exam answers submitted for several undergraduate psychology modules went undetected in 94% of cases and, on average, attained higher grades than real student submissions.

This was the largest and most robust blind study of its kind, to date, to challenge human educators to detect AI-generated content.

Associate Professor Peter Scarfe and Professor Etienne Roesch, who led the study at Reading’s School of Psychology and Clinical Language Sciences, said their findings should provide a “wakeup call” for educators across the world. A recent UNESCO survey of 450 schools and universities found that less than 10% had policies or guidance on the use of generative AI.

Dr Scarfe said: “Many institutions have moved away from traditional exams to make assessment more inclusive. Our research shows it is of international importance to understand how AI will affect the integrity of educational assessments.

“We won’t necessarily go back fully to hand-written exams, but the global education sector will need to evolve in the face of AI.

“It is testament to the candid academic rigour and commitment to research integrity at Reading that we have turned the microscope on ourselves to lead in this.”

Professor Roesch said: “As a sector, we need to agree how we expect students to use and acknowledge the role of AI in their work. The same is true of the wider use of AI in other areas of life to prevent a crisis of trust across society.

“Our study highlights the responsibility we have as producers and consumers of information. We need to double down on our commitment to academic and research integrity.”

Professor Elizabeth McCrum, Pro-Vice-Chancellor for Education and Student Experience at the University of Reading, said: “It is clear that AI will have a transformative effect in many aspects of our lives, including how we teach students and assess their learning.

“At Reading, we have undertaken a huge programme of work to consider all aspects of our teaching, including making greater use of technology to enhance student experience and boost graduate employability skills.

“Solutions include moving away from outmoded ideas of assessment and towards those that are more aligned with the skills that students will need in the workplace, including making use of AI. Sharing alternative approaches that enable students to demonstrate their knowledge and skills, with colleagues across disciplines, is vitally important.”

Abstract of the study:

The recent rise in artificial intelligence systems, such as ChatGPT, poses a fundamental problem for the educational sector. In universities and schools, many forms of assessment, such as coursework, are completed without invigilation. Therefore, students could hand in work as their own which is in fact completed by AI. Since the COVID pandemic, the sector has additionally accelerated its reliance on unsupervised ‘take home exams’. If students cheat using AI and this is undetected, the integrity of the way in which students are assessed is threatened. We report a rigorous, blind study in which we injected 100% AI written submissions into the examinations system in five undergraduate modules, across all years of study, for a BSc degree in Psychology at a reputable UK university. We found that 94% of our AI submissions were undetected. The grades awarded to our AI submissions were on average half a grade boundary higher than that achieved by real students. Across modules there was an 83.4% chance that the AI submissions on a module would outperform a random selection of the same number of real student submissions.
