The Function
Same arrows. Different inside.
The previous chapter ended with a problem: the Q-table cannot scale. A table needs one row per state, and real problems have more states than can be stored anywhere. Something had to replace it.
The replacement is a neural network. Instead of storing a value for every state, the network learns a function: a mapping from any state to estimated Q-values. It generalizes. It can often say something reasonable about states it has never seen, because it has learned the pattern underneath them. This approach is called a Deep Q-Network, or DQN.
One difference is visible before you press Run. The Q-learning maze starts empty — no arrows anywhere. The network maze already shows arrows pointing in various directions. This is not a bug. Q-learning stores explicit zeros: it knows nothing until it visits a cell and receives a reward. The network, by contrast, has randomly initialized weights — and a function with random weights still computes something for every input. The arrows are there from the start, but they point at random. Watch them shift and stabilize as learning progresses.
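The contrast between an empty table and a randomly initialized network can be sketched in a few lines. This is an illustration, not the demo's code: the single hidden layer, the tanh activation, the weight range, the action order, and the names make_network, q_values, table_arrow and network_arrow are all assumptions.

```python
import math
import random

ARROWS = ["^", ">", "v", "<"]  # assumed action order: north, east, south, west

def make_network(n_in=2, n_hidden=32, n_out=4, seed=0):
    """Random weights for a hypothetical one-hidden-layer network."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [rng.uniform(-1, 1) for _ in range(n_out)]
    return w1, b1, w2, b2

def q_values(net, state):
    """Forward pass: a state of two floats becomes four estimated Q-values."""
    w1, b1, w2, b2 = net
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def table_arrow(q_table, cell):
    """An unvisited cell holds four equal zeros, so there is no arrow to draw."""
    values = q_table.get(cell, [0.0] * 4)
    if max(values) == min(values):
        return "."
    return ARROWS[values.index(max(values))]

def network_arrow(net, cell, size=7):
    """The network returns four numbers for any cell, trained or not."""
    state = (cell[0] / (size - 1), cell[1] / (size - 1))
    values = q_values(net, state)
    return ARROWS[values.index(max(values))]

net = make_network()
empty_table = {}
for r in range(7):
    left = " ".join(table_arrow(empty_table, (r, c)) for c in range(7))
    right = " ".join(network_arrow(net, (r, c)) for c in range(7))
    print(left, "   ", right)
```

The left grid prints only dots. The right grid prints an arrow in every cell, and none of those arrows means anything yet.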
The two agents below solve the same maze. Left uses Q-learning with a table, everything we know from Chapter V. Right uses DQN with a small neural network instead of a table. Watch them run. Once both have trained, they look the same: the arrows point in similar directions and settle into similar policies. In the finished output there is no visible difference between the two.
Open the "Show Inside" panel. Click a cell in the Q-learning table. You see four numbers — one per direction — and the logic behind the decision is fully exposed. The agent chose north because north had the highest value. You can verify this. You can trace how that value was built, update by update, reward by reward.
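What "trace how that value was built" means can be shown with the standard Q-learning update rule. The sketch below is not the demo's code: the learning rate, the discount factor, the cell coordinates, the rewards and the history list are all chosen for illustration.

```python
from collections import defaultdict

ACTIONS = ["north", "east", "south", "west"]
ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)

# One row of four values per cell; unseen cells default to zeros.
q_table = defaultdict(lambda: [0.0] * len(ACTIONS))
history = []  # every change ever made to the table

def update(state, action, reward, next_state):
    """Standard Q-learning update, logged so it can be inspected later."""
    a = ACTIONS.index(action)
    old = q_table[state][a]
    target = reward + GAMMA * max(q_table[next_state])
    q_table[state][a] = old + ALPHA * (target - old)
    history.append((state, action, reward, old, q_table[state][a]))

def best_action(state):
    """The decision rule: pick the direction with the highest value."""
    values = q_table[state]
    return ACTIONS[values.index(max(values))]

# Hypothetical experience: moving north from this cell was rewarded twice.
update((4, 1), "north", 1.0, (3, 1))
update((4, 1), "north", 1.0, (3, 1))
update((4, 1), "east", 0.0, (4, 2))

print(q_table[(4, 1)])      # [0.75, 0.0, 0.0, 0.0]
print(best_action((4, 1)))  # north
for entry in history:
    print(entry)
```

Every number in the row can be explained by reading the history list from top to bottom. Nothing else contributed to it.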
Now click a cell on the DQN side. You also see four numbers — and the best direction is also marked. The output format is identical. But ask yourself: where do these numbers come from? Not from a table. From 1,024 weights distributed across two layers of a neural network. Backpropagation updated those weights. The numbers are the result of a computation you cannot follow step by step.
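To make "backpropagation updated those weights" concrete, here is a minimal sketch of one gradient step on a tiny network. It is deliberately smaller than the network in the demo, and the hidden size, the tanh activation, the learning rate and the target value are all assumptions. In DQN the target would come from the reward and the next state's Q-values. Here it is simply given.

```python
import math
import random

rng = random.Random(0)
N_HIDDEN = 8  # kept small for readability; the chapter's network is larger

# Hypothetical network: 2 inputs -> N_HIDDEN tanh units -> 4 outputs.
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(N_HIDDEN)]
b1 = [0.0] * N_HIDDEN
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)] for _ in range(4)]
b2 = [0.0] * 4

def forward(state):
    """Return the hidden activations and the four Q-values for a state."""
    hidden = [math.tanh(w1[j][0] * state[0] + w1[j][1] * state[1] + b1[j])
              for j in range(N_HIDDEN)]
    q = [sum(w2[k][j] * hidden[j] for j in range(N_HIDDEN)) + b2[k]
         for k in range(4)]
    return hidden, q

def train_step(state, action, target, lr=0.05):
    """One gradient step on 0.5 * (Q(state)[action] - target) ** 2."""
    hidden, q = forward(state)
    error = q[action] - target
    for j in range(N_HIDDEN):
        # Gradient reaching hidden unit j, passed through the tanh derivative.
        grad_pre = error * w2[action][j] * (1.0 - hidden[j] ** 2)
        w2[action][j] -= lr * error * hidden[j]
        w1[j][0] -= lr * grad_pre * state[0]
        w1[j][1] -= lr * grad_pre * state[1]
        b1[j] -= lr * grad_pre
    b2[action] -= lr * error
    return error

state, action, target = (0.5, 0.0), 0, 1.0
before = forward(state)[1][action]
train_step(state, action, target)
after = forward(state)[1][action]
print(f"Q before: {before:.3f}, after: {after:.3f}, target: {target}")
```

A single step nudges many weights at once, each by a small amount. The output moves toward the target, but no individual weight records why.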
Look at the two numbers in the input layer. When you click cell (row 4, col 1), they show 0.50 and 0.00. Not 4 and 1, but normalized coordinates: the zero-based row index divided by the maximum row index, and the same for the column. The fourth row has index 3 and the first column has index 0, so on this 7×7 grid, where the maximum index is 6, the inputs are 3/6 = 0.50 and 0/6 = 0.00. The entire grid maps to the unit square [0,0]–[1,1]. Neural networks generally train more reliably on small inputs of similar scale; raw grid indices would require very different weight magnitudes. The normalization is invisible to you as a user, but it is the actual language the network speaks.
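The normalization itself is one line of arithmetic. The sketch below assumes zero-based indices on a 7×7 grid, so the fourth row and first column are index (3, 0). The function name normalize is hypothetical.

```python
def normalize(row, col, n_rows=7, n_cols=7):
    """Map zero-based grid indices to coordinates in the unit square."""
    return row / (n_rows - 1), col / (n_cols - 1)

print(normalize(3, 0))  # (0.5, 0.0)
print(normalize(0, 0))  # (0.0, 0.0), one corner of the grid
print(normalize(6, 6))  # (1.0, 1.0), the opposite corner
```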
Because it takes coordinates as input, the network also generalizes. The Q-table is a dictionary: if the agent has never visited a cell, the value is zero, full stop. The network is a function: it computes something for every input, including cells it has never seen, and nearby coordinates produce similar outputs. Look at the two mazes. Q-learning shows arrows only where the agent has actually been. Cells behind walls and cells the agent skipped stay blank. The DQN maze shows an arrow in nearly every cell from the very first step, because random initial weights already produce some output for every coordinate. That is the difference made visible: the table is honest about what it does not know. The network always has an answer, right or wrong.
Walk the path from (0,0) to (0,6) in the Show Inside panel and watch the hidden layer. Thirty-two neurons activate in shifting patterns. You can see that something is happening. You cannot read what each neuron means.
This is the trade-off. The network scales — it can handle state spaces that would destroy any table. But it does so by becoming opaque. The decision is made. The reason is not accessible.
There is a second cost, less obvious than opacity. Run both agents on this 7×7 maze and watch the step counter. Q-learning converges in thousands of steps. DQN needs an order of magnitude more — the replay buffer must fill, the target network must sync repeatedly, backpropagation must adjust 1,024 weights across thousands of updates before the policy stabilizes. The network is slower to learn precisely because it is more powerful: it carries machinery designed for problems far larger than this maze. DQN is overkill here. The Wall showed that Q-learning cannot scale up. This shows the reverse: DQN does not scale down. Every tool has the problem it was built for.
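The extra machinery can be sketched as a training loop skeleton. Everything specific here is an assumption made for illustration: the buffer capacity, the batch size, the sync interval, and the dictionary that stands in for real network weights. The gradient update is a placeholder, not an actual learning step.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions, sampled at random for training."""
    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest transitions fall off

    def add(self, transition):
        self.items.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.items), batch_size)

    def __len__(self):
        return len(self.items)

BATCH_SIZE = 32   # assumed
SYNC_EVERY = 100  # assumed: steps between target network syncs

rng = random.Random(0)
buffer = ReplayBuffer(capacity=1000)
online = {"weights": [0.0]}     # stand-in for the network being trained
target = copy.deepcopy(online)  # frozen copy used to compute learning targets
updates = 0
syncs = 0

for step in range(1, 501):
    buffer.add((step, 0, 0.0, step + 1))  # placeholder transition
    if len(buffer) < BATCH_SIZE:
        continue                          # nothing is learned until a batch exists
    batch = buffer.sample(BATCH_SIZE, rng)
    online["weights"][0] += 0.01          # placeholder for one gradient update
    updates += 1
    if step % SYNC_EVERY == 0:
        target = copy.deepcopy(online)    # target network catches up
        syncs += 1

print(updates, syncs)
```

The first 31 steps produce no learning at all, and the target network changes only five times in 500 steps. A table, by contrast, updates on every single step.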
"The Q-table told you everything. The network tells you the answer."
This opacity is not a flaw to be fixed. It is a structural property of function approximation. When a system learns a compressed representation of a high-dimensional space, the individual weights stop being interpretable in isolation. The network works — often better than the table — but it works in a way that resists inspection. The next chapter is about what this means.