In recent years, I have regularly heard that AI systems work with attention. It is no coincidence that this word appears in the title of the famous article that made the current generation of language models possible: Attention Is All You Need, a paper published nine years ago today. But how comparable is that attention actually to human attention? A new study by Suketu Patel and colleagues attempted to answer this question using one of the most famous experiments in cognitive psychology: the Stroop test.
You might remember this test from psychology classes. You are shown words like “red,” “blue,” or “green,” but the colour of the letters does not match the word itself. For example, the word RED is written in blue letters. The task is simple: name the colour of the letters and ignore the word.
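For readers who like to see such a task spelled out, here is a minimal sketch of how Stroop items could be generated. It is my own illustration, not the stimulus code from the study: the colour set, the `StroopTrial` class, and the `make_trials` helper are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

# Assumed colour set for illustration; the study's actual set may differ.
COLOURS = ["red", "blue", "green", "yellow"]


@dataclass(frozen=True)
class StroopTrial:
    word: str  # the colour name that is written
    ink: str   # the colour the letters are shown in

    @property
    def congruent(self) -> bool:
        # A trial is congruent when word and ink colour match.
        return self.word == self.ink


def make_trials(n: int, incongruent_ratio: float = 0.5, seed: int = 0) -> list[StroopTrial]:
    """Build n trials; roughly incongruent_ratio of them have mismatched ink."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        word = rng.choice(COLOURS)
        if rng.random() < incongruent_ratio:
            # Pick any ink colour except the one the word names.
            ink = rng.choice([c for c in COLOURS if c != word])
        else:
            ink = word
        trials.append(StroopTrial(word, ink))
    return trials


def correct_answer(trial: StroopTrial) -> str:
    # The task is to name the ink colour and ignore the word.
    return trial.ink


if __name__ == "__main__":
    for t in make_trials(5):
        label = "congruent" if t.congruent else "incongruent"
        print(f"{t.word.upper()} in {t.ink} letters ({label}) -> say '{correct_answer(t)}'")
```

The only rule that matters is in `correct_answer`: the right response is always the ink colour, whatever the word says.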
That turns out to be surprisingly difficult. After all, our brains read words automatically. Even when we know we should not, the word’s meaning keeps intruding. The Stroop test has therefore been used for nearly a century to measure something psychologists call executive control: the ability to suppress an automatic response and keep attention focused on what is relevant at that moment.
The researchers had GPT-4o and Claude 3.5 Sonnet perform this task. At first glance, everything seemed fine. With short lists of five words, both models performed quite well. Just like humans, they made more mistakes when the word and the colour did not match than when they did. Up to that point, the findings even seemed reassuring. But then the researchers made the lists longer, and things started to go wrong.
While people generally maintain their performance quite well, the language models struggled more and more as the lists grew. Performance collapsed, particularly with longer sequences of twenty or forty words. GPT-4o dropped to barely 15% correct answers in the incongruent condition. Claude maintained its performance a little longer but eventually also fell to about 24%. At the same time, reading the words themselves remained virtually perfect.
So the problem was not that the models could not see or recognise the words. The problem was that they kept performing the wrong task. They continued, as it were, to read the word automatically, even though the instruction was to name the colour.
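To make that distinction concrete, here is a small sketch of how responses could be scored so that "read the word instead of naming the colour" is counted separately from other mistakes. This is my own illustration under simple assumptions (trials as `(word, ink)` pairs, one plain-text response per trial); the names `classify`, `score`, and `accuracy` are hypothetical and this is not the authors' analysis code.

```python
from collections import Counter


def classify(word: str, ink: str, response: str) -> str:
    """Label a single response to a colour-naming trial."""
    response = response.strip().lower()
    if response == ink:
        return "correct"
    if response == word:
        # The model (or person) read the word instead of naming the colour.
        return "word_intrusion"
    return "other_error"


def score(trials: list[tuple[str, str]], responses: list[str]) -> dict[str, Counter]:
    """Count outcomes separately for congruent and incongruent trials."""
    if len(trials) != len(responses):
        raise ValueError("need exactly one response per trial")
    counts = {"congruent": Counter(), "incongruent": Counter()}
    for (word, ink), response in zip(trials, responses):
        condition = "congruent" if word == ink else "incongruent"
        counts[condition][classify(word, ink, response)] += 1
    return counts


def accuracy(counter: Counter) -> float:
    """Share of correct responses; NaN when there were no trials."""
    total = sum(counter.values())
    return counter["correct"] / total if total else float("nan")


if __name__ == "__main__":
    # Made-up example data, not results from the study.
    trials = [("red", "blue"), ("green", "green"), ("blue", "red"), ("red", "green")]
    responses = ["blue", "green", "blue", "yellow"]
    result = score(trials, responses)
    print(result["incongruent"])
    print(accuracy(result["incongruent"]))
```

A high share of `word_intrusion` on incongruent trials, combined with near-perfect word reading, is the pattern that would suggest the wrong task is being performed rather than the stimuli being misperceived.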
According to the authors, this points to a difference between transformer attention and human attention. Humans possess not only mechanisms for selecting information but also mechanisms for resolving conflicts and maintaining task goals. When we notice that a task is becoming difficult, we can exert additional control. That form of executive control appears to be much less developed in current language models, or at least in those tested in these experiments.
However, some caution is warranted. The study does not claim that AI lacks attention. Nor does it suggest that AI is “stupid”. On the contrary. The models performed excellently on other parts of the task and can, of course, do many things that humans cannot. Moreover, this concerns a specific experimental task designed to measure one particular aspect of attention. As always, a laboratory task is not a complete description of intelligence.
Nevertheless, I found the study interesting enough to write about because it exposes something we sometimes forget when working with AI. When a language model makes a mistake, we often assume it is a knowledge problem: the model either lacks the relevant information or has learned something incorrectly. This study suggests that some errors may have less to do with missing information and more to do with a difficulty in consistently holding on to the correct task goal when competing signals are present.
This points to a striking paradox. Modern language models have context windows of hundreds of thousands of tokens. They can process enormous amounts of information. But more memory does not automatically mean more control. The researchers argue that future AI systems may need not only larger context windows but also better mechanisms for selecting relevant information, resolving conflicts, and maintaining task goals.
Intelligence is not just about how much information you can process. Sometimes it is also about being able to ignore information that is irrelevant at that moment. That already proves difficult for humans. And, for the time being, apparently for AI as well. Somehow, I find that strangely reassuring.