Are AI models smarter than students?

I have bright students, trust me, but are large language models such as ChatGPT getting smarter than them? In a new experiment, six generative large language models were tested against students in an online introductory biomedical and health informatics course. And now the depressing answer: the models scored higher than as many as three-quarters of the real-world students in the class.

But does this say something about how bright AI models are, or about how limited online tests are? Or both? The answer is in what follows.

From the press release:

William Hersh, M.D., who has taught generations of medical and clinical informatics students at Oregon Health & Science University, found himself curious about the growing influence of artificial intelligence. He wondered how AI would perform in his own class.

So, he decided to try an experiment.

He tested six generative large language models (ChatGPT, for example) in an online version of his popular introductory course in biomedical and health informatics to see how they performed compared with living, thinking students. A study published in the journal npj Digital Medicine revealed the answer: Better than as many as three-quarters of his human students.

“This does raise concern about cheating, but there is a larger issue here,” Hersh said. “How do we know that our students are actually learning and mastering the knowledge and skills they need for their future professional work?”

As a professor of medical informatics and clinical epidemiology in the OHSU School of Medicine, Hersh is especially attuned to new technologies. The role of technology in education is nothing new, Hersh said, recalling his own experience as a high school student in the 1970s during the transition from slide rules to calculators.

Yet, the shift to generative AI represents an exponential leap forward.

“Clearly, everyone should have some kind of foundation of knowledge in their field,” Hersh said. “What is the foundation of knowledge you expect people to have to be able to think critically?”

Large-language models

Hersh and co-author Kate Fultz Hollis, an OHSU informatician, pulled the knowledge assessment scores of 139 students who took the introductory course in biomedical and health informatics in 2023. They prompted six generative AI large language models with student assessment materials from the course. Depending on the model, the AI scored between the 50th and 75th percentiles of students, both on the multiple-choice questions used in quizzes and on a final exam that required short written responses to questions.

“The results of this study raise significant questions for the future of student assessment in most, if not all, academic disciplines,” the authors write.

The study is the first to compare large-language models to students for a full academic course in the biomedical field. Hersh and Fultz Hollis noted that a knowledge-based course such as this one may be especially ripe for generative, large-language models, in contrast to more participatory academic courses that help students develop more complex skills and abilities.

Hersh remembers his experience in medical school.

“When I was a medical student, one of my attending physicians told me I needed to have all the knowledge in my head,” he said. “Even in the 1980s, that was a stretch. The knowledge base of medicine has long surpassed the capacity of the human brain to memorize it all.”

Maintaining the human touch

Yet, he believes there’s a fine line between making sensible use of technical resources to advance learning and over-reliance to the point that it inhibits learning. Ultimately, the goal of an academic health center like OHSU is to educate health care professionals capable of caring for patients and optimizing the use of data and information about them in the real world.

In that sense, he said, medicine will always require the human touch.

“There are a lot of things that health care professionals do that are pretty straightforward, but there are those instances where it gets more complicated and you have to make judgment calls,” he said. “That’s when it helps to have that broader perspective, without necessarily needing to have every last fact in your brain.”

With fall classes starting soon, Hersh said he’s not worried about cheating.

“I update the course each year,” he said. “In any scientific field, there are new advancements all the time and large-language models aren’t necessarily up to date on all of it. This just means we’ll have to look at newer or more nuanced tests where you won’t get the answer out of ChatGPT.”

Abstract of the study:

Generative artificial intelligence (AI) systems have performed well at many biomedical tasks, but few studies have assessed their performance directly compared to students in higher-education courses. We compared student knowledge-assessment scores with prompting of 6 large-language model (LLM) systems as they would be used by typical students in a large online introductory course in biomedical and health informatics that is taken by graduate, continuing education, and medical students. The state-of-the-art LLM systems were prompted to answer multiple-choice questions (MCQs) and final exam questions. We compared the scores for 139 students (30 graduate students, 85 continuing education students, and 24 medical students) to the LLM systems. All of the LLMs scored between the 50th and 75th percentiles of students for MCQ and final exam questions. The performance of LLMs raises questions about student assessment in higher education, especially in courses that are knowledge-based and online.
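
For readers wondering what "between the 50th and 75th percentiles of students" means in practice, here is a minimal sketch of a percentile-rank calculation. It is not the authors' method, which the abstract does not spell out. The function name, the tie-handling rule (ties count as half), and every score below are assumptions made up for illustration, not data from the study.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(student_scores, model_score):
    """Percent of students scoring below the model, counting ties as half.

    The tie-handling rule is an assumption; other conventions exist.
    """
    ordered = sorted(student_scores)
    below = bisect_left(ordered, model_score)
    ties = bisect_right(ordered, model_score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical scores for illustration only; NOT data from the study.
students = [62, 70, 74, 78, 81, 85, 88, 90, 93, 97]
models = {"model_a": 89, "model_b": 85}

for name, score in models.items():
    rank = percentile_rank(students, score)
    print(f"{name}: percentile rank {rank:.0f}")
```

With a class of 139 rather than ten, the same calculation places each model's score within the distribution of student scores. A rank of 75 would mean the model outscored roughly three-quarters of the class.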
