Chapter II

The Art of Paying Attention

On how context changes meaning — and why a machine that has never understood anything can still tell the difference between a river bank and a savings bank.

Consider the word bank. A perfectly ordinary word — until you try to nail down what it means.

In the sentence "she walked along the bank of the river," bank means a slope of earth beside moving water. In "she transferred money to her bank," it means a financial institution. The word is identical. The meaning is entirely different. And you — effortlessly, instantly — knew which was which.

How? You didn't look it up. You didn't reason through it. You felt the surrounding words pull the meaning into place. River was nearby. Or money. The context spoke, and the word obeyed.

This is the problem that attention mechanisms were designed to solve. And the solution, once you see it, looks less like engineering and more like something Wittgenstein would have approved of.

"The meaning of a word is its use in the language." — Ludwig Wittgenstein, Philosophical Investigations (1953)

Wittgenstein didn't say meaning is a definition. He said meaning is use — the pattern of contexts in which a word appears, the company it keeps, the situations it inhabits. A word isolated from its surroundings has no meaning; it has only potential.

The transformer architecture implements this insight literally. When the model processes a word, it doesn't stop at a fixed, looked-up representation. It asks: given everything else in this sentence, what should this word become? The word's final representation is shaped, and in that sense transformed, by what surrounds it.

That shaping process is called attention. And it is one of the few things about a language model's inner life that you can actually see.

§

Here is what attention looks like from the inside. Every token in a sequence generates three vectors: a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("what do I contribute if selected?"). Each token's query is compared to the key of every token in the sequence, its own included. Each comparison yields a score, and the scores are normalised into weights that determine how much each token's value contributes to the new representation of the token doing the asking.
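The query, key, and value comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention, not any particular model's implementation. The three tokens and their two-dimensional vectors are invented for the example; in a real transformer, queries, keys, and values are learned projections of each token's embedding.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    weights[i][j] is how strongly token i attends to token j;
    outputs[i] is the resulting weighted mix of value vectors.
    """
    scale = math.sqrt(len(keys[0]))
    weights = [softmax([dot(q, k) / scale for k in keys]) for q in queries]
    width = len(values[0])
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(width)]
        for row in weights
    ]
    return weights, outputs

# Hypothetical vectors, set by hand purely for illustration.
tokens = ["river", "bank", "money"]
queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
values = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

weights, outputs = attention(queries, keys, values)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row is one token asking its question, and each entry in the row is the weight given to one answering token. Every row sums to 1, which is what lets the weights be read as shares of attention.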

The matrix below shows these weights for a sample sentence. Each row is a token asking its question. Each column is a token answering. The intensity of colour shows the weight of attention — how strongly the row token is shaped by the column token. Click any token to highlight its attention pattern.

Interactive: Attention Matrix (simulated weights; the colour scale runs from low attention to high attention)
Note: Attention weights are simulated to illustrate the concept. In a real transformer, these weights are learned during training and vary by layer and attention head. A single model may have hundreds of attention heads, each learning different relationships.

What you're seeing is a form of X-RAY. In most of the model's processing, the math is opaque: billions of multiplications producing outputs that can't be directly interpreted. Attention weights are different. They are, in a real sense, the model's own account of what it found relevant. Not a full explanation. But a window.

Notice how the word bank attends differently depending on the sentence. In the river sentence, it pulls toward river, steep, beside. In the financial sentence, toward withdrew, money. Same word, different context, different representation. Wittgenstein's insight, implemented in matrix multiplication.
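To make the contrast concrete, here is a toy sketch of the word bank attending within the two example sentences. Everything numeric in it is an assumption: the three-dimensional embeddings are invented by hand, and the embeddings themselves stand in for queries, keys, and values, where a trained model would use learned projections.

```python
import math

# Hypothetical embeddings: [water-ness, finance-ness, filler]. Values are made up.
EMBEDDINGS = {
    "the": [0.0, 0.0, 0.1], "was": [0.0, 0.0, 0.1],
    "she": [0.0, 0.0, 0.1], "from": [0.0, 0.0, 0.1],
    "bank": [1.0, 1.0, 0.0],  # ambiguous on its own
    "beside": [0.6, 0.0, 0.1],
    "river": [2.0, 0.0, 0.0],
    "steep": [1.0, 0.0, 0.0],
    "withdrew": [0.0, 1.5, 0.0],
    "money": [0.0, 2.0, 0.0],
}

def attend(sentence, target="bank"):
    """Return (token, weight) pairs for the target's query, plus its context vector."""
    tokens = sentence.lower().rstrip(".").split()
    vectors = [EMBEDDINGS[t] for t in tokens]
    query = EMBEDDINGS[target]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(query))
    ]
    return list(zip(tokens, weights)), context

def top_neighbours(pairs, target="bank", n=2):
    """The n tokens, other than the target itself, with the highest weights."""
    others = [(t, w) for t, w in pairs if t != target]
    ranked = sorted(others, key=lambda pair: pair[1], reverse=True)
    return [t for t, _ in ranked[:n]]

river_pairs, river_context = attend("The bank beside the river was steep.")
money_pairs, money_context = attend("She withdrew money from the bank.")

print(top_neighbours(river_pairs, n=3), [round(x, 2) for x in river_context])
print(top_neighbours(money_pairs), [round(x, 2) for x in money_context])
```

The starting vector for bank is identical in both calls. Only the neighbours differ, and that alone is enough to tip the resulting context vector toward water in one sentence and toward finance in the other.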

§

But attention is not just disambiguation. It is the mechanism by which meaning becomes relational. A word doesn't have a representation — it has a family of representations, one for each context it appears in. The model never sees the word bank; it sees bank-in-this-sentence.

This is the geometry of attention: each token begins as a fixed point in space — the embedding you saw in Chapter I. By the time attention has finished its work, that point has moved. It has been pulled toward the tokens that mattered and pushed away from those that didn't. The final position encodes not just what the word is, but what the word is here, now, in this particular sequence of words.
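The movement described above can be sketched as a single step: the starting point plus an attention-weighted mix of its neighbours. The two-dimensional vectors and the weights 0.7 and 0.3 below are hypothetical, chosen only to make the geometry visible. A real transformer layer also applies learned projections, normalisation, and a feed-forward step, all omitted here.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def shift(point, weights, values):
    """Add the attention-weighted mix of value vectors to the starting point."""
    mix = [
        sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(point))
    ]
    return [p + m for p, m in zip(point, mix)]

# Hypothetical axes: [water-ness, finance-ness].
bank = [1.0, 1.0]  # starts exactly between the two senses
river, steep, money = [1.0, 0.0], [0.8, 0.1], [0.0, 1.0]

# Assumed attention weights for the river sentence: mostly river, some steep.
moved = shift(bank, weights=[0.7, 0.3], values=[river, steep])

print("toward river:", round(cosine(bank, river), 2), "->", round(cosine(moved, river), 2))
print("toward money:", round(cosine(bank, money), 2), "->", round(cosine(moved, money), 2))
```

In this sketch nothing literally pushes the point away from money. The point is only pulled toward what it attended to, and its direction swings away from the unattended sense as a consequence.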

Interactive: Context Shift (simulated)

Select a sentence to see which tokens most influence the highlighted word.

The bank beside the river was steep.
She withdrew money from the bank.
The light from the plant in the window was warm.
Workers at the plant went on strike.
Panels: Attending tokens; Meaning pulled toward
What you're seeing: A simplified view of which tokens most influence the target word's final representation. The highlighted tokens have the highest simulated attention weights toward the target.

There is something quietly remarkable about this. The model was not told that bank has two meanings. It was not given a disambiguation rule. It was trained on text, and from that training it learned — implicitly, geometrically — that the right representation of a word depends on its neighbourhood.

This is what Wittgenstein meant by language as practice. Meaning doesn't live inside words like furniture inside a house. It emerges from use, from context, from the web of relationships that a word inhabits. The transformer didn't discover this insight. It rediscovered it through an alien process, at a scale no human could replicate, arriving at the same place from an entirely different direction.

Whether the model knows any of this is a question we should hold carefully. What we can say is this: it behaves as if it understands context. The attention weights are evidence — partial, imperfect, but real — that something structurally similar to contextual understanding is happening inside.

That is the X-RAY view of attention. Not the full picture. Never the full picture. But a crack of light through the black box, and that is more than most systems ever offer.