Chapter VI

The Wall

Where the table runs out of room

The Q-table worked. In a 7×7 maze, it filled with values, the agent learned, and we could watch every step of the process. That transparency was the point: Q-learning is an X-RAY tool precisely because it makes knowledge visible.

Now try something simple: make the maze bigger. Drag the slider below. The maze grows, and the Q-table grows with it: one row per cell, four values per row. At 7×7 you have 49 states and 196 Q-values. At 10×10 you have 100 states and 400 Q-values. At 15×15 you have 225 states and 900 Q-values. Watch the policy map. The arrows become smaller, harder to read, and eventually meaningless as symbols. The table is still there, but you can no longer see through it.
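The arithmetic behind those numbers fits in a few lines. This is a minimal sketch, assuming a square grid with one state per cell and four actions; the function name `q_table_size` is made up for illustration.

```python
ACTIONS = 4  # up, down, right, left


def q_table_size(side: int, actions: int = ACTIONS) -> tuple[int, int]:
    """Return (states, q_values) for a side x side grid with one row per cell."""
    states = side * side
    return states, states * actions


if __name__ == "__main__":
    for side in (7, 10, 15, 20):
        states, q_values = q_table_size(side)
        print(f"{side}x{side}: {states} states, {q_values} Q-values")
```

The table grows with the square of the side length, so doubling the maze's width quadruples the number of values the agent has to learn.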

This is the wall. It is not a conceptual limit but a physical one. A game of chess has somewhere between 10⁴⁰ and 10⁵⁰ legal positions. For scale, the observable universe contains roughly 10⁸⁰ atoms. A table with one row per chess position is far beyond anything we could build, store, or fill in by visiting each row. Q-learning, as we have seen it, simply stops being a viable approach.

[Interactive demo: Q-Learning with memory. A grid-size slider (5, 7, 10, 13, 16, 20; shown at 7×7) drives a live Q-table size readout (49 states × 4 actions = 196 Q-values at 7×7, against roughly 10⁴⁵ states for chess), along with counters for steps, times the goal was reached, first goal step, best episode, and convergence (% of cells with a learned policy, which grows more slowly as the grid expands).]

What these numbers mean

ε (exploration): The probability that the agent picks a random direction instead of its best-known move. It starts at 1.0 (fully random) and decays toward 0.05 as learning progresses.


Episodes: Each time the agent finds the goal and resets to the start, one episode ends. The agent carries everything it has learned in the Q-table into the next episode.


Convergence: The agent has "converged" when its policy map stops changing: every visited cell has a stable best direction and the agent finds the goal reliably. On a 7×7 grid this happens in ~50 episodes. On a 15×15 grid it may take hundreds. On a real-world problem, it may never happen.
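The quantities defined above can be sketched in code. This is a minimal illustration, not the demo's actual implementation: the start and floor values of ε come from the text, while the multiplicative decay rate `EPS_DECAY` and every function name are assumptions. The Q-table is taken to be a list of rows, one per cell, each holding four values.

```python
import random

EPS_START, EPS_MIN = 1.0, 0.05  # values given in the text
EPS_DECAY = 0.995               # assumption: the text gives no decay schedule


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy choice. Returns (action_index, was_random)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row)), True
    best = max(q_row)
    # Break ties at random so an unlearned row does not always pick action 0.
    return rng.choice([a for a, q in enumerate(q_row) if q == best]), False


def decay(epsilon):
    """Shrink epsilon, never dropping below the floor."""
    return max(EPS_MIN, epsilon * EPS_DECAY)


def greedy_policy(q_table):
    """Best action per cell, or None where every Q-value is still zero."""
    return [row.index(max(row)) if any(q != 0.0 for q in row) else None
            for row in q_table]


def learned_fraction(q_table):
    """Share of cells that already have a learned best action."""
    policy = greedy_policy(q_table)
    return sum(action is not None for action in policy) / len(policy)
```

One way to test for convergence in the sense described here is to snapshot `greedy_policy(q_table)` at the end of each episode and check whether it equals the previous snapshot for several episodes in a row.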

[Interactive panel: Q-table policy map. Arrow = best action per cell · Dot = not yet learned · Heat = Q-value intensity. Clicking a cell shows its four Q-values (↑, ↓, →, ←), each 0.000 before any learning.]
the wall

Watch what changes as the grid grows: the agent still learns, but slower. With more states to explore, it takes longer to visit each one enough times to get reliable estimates. The convergence bar fills from the goal outward — that warmth has to travel further and further before it reaches the start. The wall is not that learning stops. The wall is that learning slows beyond use.
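The slowdown can be reproduced outside the demo with a small experiment. The sketch below rests on several assumptions that are not taken from the text: an open grid with no inner walls, start in the top-left corner, goal in the bottom-right, a reward of 1 on reaching the goal, and hypothetical values for the learning rate, discount, and ε decay.

```python
import random

MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # up, down, right, left


def train(side, episodes=200, alpha=0.5, gamma=0.95, seed=0):
    """Tabular Q-learning on an open side x side grid.

    Returns (q_table, steps_per_episode).
    """
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(side * side)]
    goal = side * side - 1
    epsilon = 1.0
    steps_per_episode = []
    for _ in range(episodes):
        state, steps = 0, 0
        while state != goal:
            if rng.random() < epsilon:
                action = rng.randrange(4)          # explore
            else:
                best = max(q[state])               # exploit, random tie-break
                action = rng.choice([a for a in range(4) if q[state][a] == best])
            r, c = divmod(state, side)
            dr, dc = MOVES[action]
            nr, nc = r + dr, c + dc
            if not (0 <= nr < side and 0 <= nc < side):
                nr, nc = r, c                      # outer wall: stay in place
            nxt = nr * side + nc
            reward = 1.0 if nxt == goal else 0.0
            future = 0.0 if nxt == goal else gamma * max(q[nxt])
            q[state][action] += alpha * (reward + future - q[state][action])
            state, steps = nxt, steps + 1
        steps_per_episode.append(steps)
        epsilon = max(0.05, epsilon * 0.97)        # assumed decay schedule
    return q, steps_per_episode


if __name__ == "__main__":
    for side in (5, 10, 15):
        _, steps = train(side)
        print(f"{side}x{side}: {sum(steps)} total steps over {len(steps)} episodes")
```

Comparing the total step counts across grid sizes gives a rough feel for how much more experience the same algorithm needs as the table grows; exact numbers depend on the seed and the assumed hyperparameters.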

At some point, the approach breaks entirely. Not because the algorithm is wrong, but because the representation is wrong. A table assumes you can enumerate every possible situation in advance and give it its own row. Reality doesn't allow this.

What we need is something that can generalize — that can say: this situation is similar to that one, so its value should be similar too. A function, not a table. Something that takes a state as input and outputs a value estimate, without needing to have seen that exact state before. That something has a name.
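To make the idea of "a function, not a table" concrete before the neural network arrives, here is a deliberately simpler stand-in: a linear approximator with one weight vector per action. It is not the method of the next chapter. The features (a bias term plus normalized row and column) and all names are hypothetical choices for illustration.

```python
def features(row, col, side):
    """Hypothetical features: bias plus coordinates scaled to [0, 1]."""
    scale = max(side - 1, 1)
    return [1.0, row / scale, col / scale]


class LinearQ:
    """Q(s, a) = w_a . phi(s): one weight vector per action, no row per state."""

    def __init__(self, n_features=3, n_actions=4):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def value(self, phi, action):
        return sum(w * x for w, x in zip(self.w[action], phi))

    def update(self, phi, action, target, alpha=0.1):
        """Gradient step that moves Q(phi, action) toward the target."""
        error = target - self.value(phi, action)
        for i, x in enumerate(phi):
            self.w[action][i] += alpha * error * x

    def n_parameters(self):
        return sum(len(weights) for weights in self.w)


if __name__ == "__main__":
    model = LinearQ()
    # Train on two corner cells of a 5x5 grid only.
    for _ in range(500):
        model.update(features(0, 0, 5), action=0, target=0.2)
        model.update(features(4, 4, 5), action=0, target=1.0)
    # Query a cell that was never trained on.
    print(model.value(features(2, 2, 5), action=0))
    print(model.n_parameters())  # same count for any grid size
```

The centre cell was never visited, yet it gets an estimate that lies between the two trained values, because nearby inputs produce nearby outputs. The parameter count also stays at 12 whether the grid is 5×5 or 500×500, which is exactly what a table cannot offer. The price is that a function this simple can only express values that vary linearly across the grid.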

"A table can only know what it has seen. A function can know what it hasn't."

The next chapter replaces the table with a neural network. The agent stops memorizing states and starts learning patterns. The Q-table, our transparent, inspectable, X-RAY-friendly structure, will disappear. What takes its place is faster, more powerful, and considerably harder to see through. That trade-off is the subject of everything that follows.