How a Relatively Small Paper Laid the Foundation for ChatGPT and Gemini

Anyone using ChatGPT, Gemini or Claude today relies, indirectly, on an idea from 2017. Barbara Oakley reminded me of this today. That idea appears in a paper with a strikingly confident title: Attention Is All You Need. In hindsight, the title was not bravado but an accurate summary of what followed. Oakley also noted that when the authors first submitted the paper to a conference, it attracted little attention outside a small circle of specialists and was ultimately presented in what she described as an academic back room.

Until then, language models were organised mainly around sequences. They processed words one by one, strictly in the order in which they appeared in a sentence. Recurrent neural networks, including the LSTM variant, performed that work: each new word built on the previous one. This approach worked reasonably well, but its limitations were clear. Training progressed slowly, models resisted parallelisation, and the further apart two words stood in a sentence, the harder the models found it to learn meaningful relationships between them.

The paper asked a deceptively simple question: what if a sequence does not have to serve as the organising principle? What if a model could look at an entire sentence at once and decide for itself which words matter to each other?

The authors answered that question by building the whole model around a mechanism called “attention”. Researchers had already used attention as an add-on to recurrent networks; the paper's bold step was to drop the recurrence and keep only the attention. Instead of processing words sequentially, the model allows each word to explicitly “look at” every word in the sentence, itself included, and assess its relevance. The system does not rely on intuition or vague heuristics. It assigns precise mathematical weights to the relationships between words, depending on how useful those relationships are for the task at hand. During training, the model learns these relevance patterns itself.
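To make those "precise mathematical weights" concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification rather than the paper's full implementation: the three word vectors are made-up numbers, and the same vectors serve as queries, keys and values, whereas a real Transformer first passes them through learned projections. The function names are my own.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this word's query against every word's key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors according to those weights.
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "word" vectors (illustrative numbers, not learned embeddings).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` sums to 1 and says how much one word draws on every word in the sentence. In this toy input the first word gives equal weight to itself and the third word, and less to the second, because its vector overlaps with theirs and not with the second word's.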

This change may sound like a technical nuance, but it has fundamental consequences. Models no longer need to pass relationships between words through long chains of intermediate steps. Every word can connect directly to every other word. Structural distance within a sentence no longer constrains what the model can learn.

The second breakthrough concerns parallelisation. Because the model no longer processes text step by step, it can handle every position in a sentence simultaneously during training. This shift dramatically speeds up training and, crucially, makes large-scale models feasible. Without this property, large language models would have been far harder to scale, no matter how much data or computing power engineers threw at the problem.
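The difference shows up even in a toy example. In the sketch below, which uses single numbers instead of word vectors and function names of my own choosing, the recurrent version has to walk through the sentence in order because every state needs the previous one. The attention version computes each position from the raw inputs alone, so the positions can be handled in any order, or on separate processors at once.

```python
import math

def recurrent_states(inputs, decay=0.5):
    """Toy recurrence: each state needs the previous state first."""
    state, states = 0.0, []
    for x in inputs:  # inherently sequential loop
        state = math.tanh(decay * state + x)
        states.append(state)
    return states

def attention_output(i, inputs):
    """Toy attention for position i: depends only on the raw inputs."""
    scores = [inputs[i] * x for x in inputs]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))

inputs = [0.2, -1.0, 0.7, 1.5]

# Attention: positions computed in order, then in an arbitrary order.
in_order = [attention_output(i, inputs) for i in range(len(inputs))]
any_order = {i: attention_output(i, inputs) for i in [3, 0, 2, 1]}

# Recurrence: no such freedom, step t must wait for step t - 1.
states = recurrent_states(inputs)
```

Both orderings of the attention computation give identical results, which is exactly the property that lets hardware work on all positions at the same time.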

To strengthen this mechanism further, the authors introduced “multi-head attention”. Instead of looking at a sentence once, the model examines it several times in parallel, each time from a different perspective. One attention head may focus on grammatical structure, another on semantic meaning, and a third on long-distance dependencies. The programmer does not hard-code these perspectives. The model discovers them during training.
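A rough sketch of the idea, again in plain Python with invented numbers: each word vector is cut into equal slices, every slice gets its own attention pass (one "head"), and the results are glued back together. This is a deliberate simplification. In the actual architecture each head applies its own learned projections to the full vector and a final learned projection mixes the heads, and those learned parts are what allow the heads to specialise. The helper names `attend` and `multi_head` are hypothetical.

```python
import math

def attend(q_vecs, k_vecs, v_vecs):
    """Scaled dot-product attention for one head."""
    d_k = len(k_vecs[0])
    out = []
    for q in q_vecs:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in k_vecs
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([
            sum(w * v[j] for w, v in zip(weights, v_vecs))
            for j in range(len(v_vecs[0]))
        ])
    return out

def multi_head(x, num_heads):
    """Run one attention pass per slice of the vectors, then concatenate."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        part = [vec[lo:hi] for vec in x]  # this head's view of each word
        head_outputs.append(attend(part, part, part))
    # Concatenate the heads' results word by word.
    return [
        [value for head in head_outputs for value in head[i]]
        for i in range(len(x))
    ]

# Three toy word vectors of size 4, split across two heads.
x = [[1.0, 0.0, 0.5, 2.0], [0.0, 1.0, 1.5, 0.0], [1.0, 1.0, 0.0, 1.0]]
combined = multi_head(x, num_heads=2)
```

The output has the same shape as the input, three vectors of size 4, but each half of every vector now reflects a separate set of attention weights.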

Because the architecture itself no longer enforces order, the model needs explicit position information. The authors provide this through positional encodings: mathematical signals that indicate where each word appears in a sentence. They do not add these signals because order has lost its importance, but because the model no longer assumes order by default.
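The paper's fixed encoding uses sine and cosine waves of different frequencies: even dimensions get sin(pos / 10000^(2i/d)) and odd dimensions get the matching cosine. The sketch below builds that table in plain Python. The function name and the small sizes are my own choices for illustration, and learned position embeddings are a common alternative.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal position table: one row of d_model values per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k + 1 share the same frequency.
            exponent = (i - i % 2) / d_model
            angle = pos / (10000 ** exponent)
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=4)
# Position 0 always encodes as alternating sin(0) = 0 and cos(0) = 1.
# pe[0] -> [0.0, 1.0, 0.0, 1.0]
```

Each row is added to the corresponding word vector before attention runs, so two otherwise identical words at different positions no longer look the same to the model.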

Together, these ideas form the Transformer architecture. At the time of publication, researchers tested it mainly on machine translation tasks. The model immediately outperformed existing approaches while requiring less training time. Yet those results mattered less than what the architecture enabled next.

The Transformer does not function as a language model in its own right. Instead, it offers a powerful way to structure information. Once researchers recognised its potential, they quickly applied it to other tasks: text comprehension, summarisation, question answering, and eventually large-scale language modelling. GPT, BERT, PaLM, Gemini and ChatGPT do not copy this paper directly, but they all build on its central ideas.

What makes this paper remarkable is not that it made AI “more human” or suddenly introduced “intelligence”. The authors did something far more prosaic and far more powerful. They reframed how we think about language processing. Rather than treating it as a chain of steps, they treated it as a network of relationships. Meaning emerges not from sequence alone, but from connections.

In hindsight, the idea now seems obvious. At the time, it broke sharply with what researchers had taken for granted. That break explains why Attention Is All You Need forms the quiet foundation beneath almost everything we do today with large language models.

When you ask ChatGPT to connect ideas, maintain context, or incorporate earlier sentences into an answer, it still operates according to that idea from 2017: anything can receive attention, as long as it proves relevant.

I held back for a long time from pointing out the irony that we now mostly struggle with problems of attention. Oops, too late. Jokes aside, this relatively short paper has now attracted well over 100,000 citations.
