Chapter V

Memory

What changes when wandering becomes learning

The wanderer from the previous chapter had no memory. Each step was the first step. It could stumble across the goal and still not know how to find it again. There was no accumulation — only noise.

Now we run two agents on the same maze at the same time. The left one wanders randomly, exactly as before. The right one uses Q-learning: it keeps a table of four numbers per cell, one for each direction, and updates those numbers as it moves, using the rewards it receives. Watch what happens when the Q-learning agent reaches the goal for the first time. The simulation pauses so you can compare the two sides.
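The table and its update can be sketched in a few lines. This is a minimal illustration of the standard tabular Q-learning rule, not the simulation's own code: the names `make_q_table` and `q_update`, the action order, and the values of `alpha` (learning rate) and `gamma` (discount) are assumptions chosen for the example.

```python
from collections import defaultdict

# Assumed action order; the index into each row of the table.
ACTIONS = ("up", "down", "right", "left")

def make_q_table():
    # Every cell starts with four zeros, one per direction.
    return defaultdict(lambda: [0.0] * len(ACTIONS))

def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new value.

    alpha and gamma are illustrative values, not taken from the simulation.
    """
    # At the goal the episode ends, so there is no future value to add.
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target.
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]

q = make_q_table()
# Stepping right from cell (3, 4) into the goal at (3, 5) with reward 1.
first = q_update(q, (3, 4), 2, 1.0, (3, 5), done=True)
# Stepping right from (3, 3) into (3, 4): no reward, but (3, 4) now has value.
second = q_update(q, (3, 3), 2, 0.0, (3, 4), done=False)
```

The second call earns no reward at all, yet the number for "right" in cell (3, 3) still rises, because the target includes the best value of the cell the agent lands in.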

Left: a heatmap of visits — where the random walker has been. Right: a policy map — what the Q-learning agent has decided to do in every cell it has visited. One agent has covered more ground. The other has learned something.

[Interactive simulation. Left panel: Random Walker (no memory). Right panel: Q-Learning (with memory). Each panel reports the steps taken, the number of times the goal was reached, and the step at which the goal was first reached.]

What these numbers mean

ε (exploration): Probability that the agent picks a random direction instead of its best known move. Starts at 1.0 (fully random) and decays toward 0.05 as learning progresses. The panel shows the current value of ε next to running counts of random moves and greedy moves.

Episodes: Each time the agent finds the goal and resets to the start, one episode ends. The Q-table is not reset, so the agent carries everything it has learned into the next episode. The panel also reports the best episode so far.
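The exploration rule described above is usually called ε-greedy. Here is a hedged sketch of it. The starting value 1.0 and the floor 0.05 come from the text; the multiplicative decay and its `rate` are assumptions, since the chapter does not say how ε decays, and `choose_action` and `decay_epsilon` are hypothetical names.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Return (action_index, was_random) for one epsilon-greedy decision."""
    if rng.random() < epsilon:
        # Explore: ignore the table and pick any direction.
        return rng.randrange(len(q_values)), True
    # Exploit: pick the direction with the highest estimate.
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return best, False

def decay_epsilon(epsilon, rate=0.995, floor=0.05):
    # Assumed schedule: shrink by a constant factor, never below the floor.
    return max(floor, epsilon * rate)

rng = random.Random(0)
epsilon = 1.0
random_moves = greedy_moves = 0
for _ in range(2000):
    _, was_random = choose_action([0.0, 0.2, 0.5, 0.1], epsilon, rng)
    if was_random:
        random_moves += 1
    else:
        greedy_moves += 1
    epsilon = decay_epsilon(epsilon)
```

The two counters play the role of the "Random moves" and "Greedy moves" readouts: early on almost every move is random, and the balance shifts as ε approaches its floor.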

Q-Table: policy map

Arrow = best action per cell · Grey dot = not yet learned · Click a cell to see its four Q-values (↑, ↓, →, ←)


The right side paused when Q-learning first reached the goal. At that moment, the random walker may have also passed through the goal — perhaps many times. But it has no policy. It cannot reliably find the goal again. It will find it again only by accident.

The Q-learning agent has something different. Look at the arrows in the cells it has visited. They are not yet complete — many cells are still blank, not yet explored. But the cells near the goal already have confident arrows pointing toward it. Knowledge is propagating backward from the reward.
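Backward propagation is easiest to see in a stripped-down case. The sketch below is not the maze from the simulation: it assumes a straight corridor of five cells with the goal at the right end, an agent that always moves right, and illustrative values for `alpha` and `gamma`. Only the value of "move right" is stored for each cell.

```python
def run_corridor(n_cells=5, episodes=3, alpha=0.5, gamma=0.9):
    """Return the value of 'move right' in every cell after each episode."""
    q_right = [0.0] * n_cells
    goal = n_cells - 1
    history = []
    for _ in range(episodes):
        state = 0
        while state != goal:
            next_state = state + 1
            reached_goal = next_state == goal
            reward = 1.0 if reached_goal else 0.0
            # The goal is terminal, so nothing is carried back from beyond it.
            future = 0.0 if reached_goal else q_right[next_state]
            q_right[state] += alpha * (reward + gamma * future - q_right[state])
            state = next_state
        history.append(list(q_right))
    return history

for episode, values in enumerate(run_corridor(), start=1):
    print(episode, [round(v, 3) for v in values])
```

After the first episode only the cell next to the goal holds a nonzero value. Each further episode pushes a nonzero value one cell farther from the goal, which is the same pattern the arrows show in the policy map.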

This is the X-RAY view of learning: not what the agent did, but what it knows. The table is readable. Click any cell in the right maze and you see four numbers: the estimated value of moving north, south, east, or west from that position. The brightest bar is the agent's current best guess. You can trace the reasoning behind every future decision before it is made.
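Turning the table into a policy map takes very little code, which is part of why it is so readable. The sketch below assumes the Q-table is a dictionary keyed by (row, column) with values in the order ↑, ↓, →, ←, and it assumes a cell counts as "not yet learned" while all four of its values are still zero. The names `best_arrow` and `policy_map` are hypothetical.

```python
ARROWS = ("↑", "↓", "→", "←")
UNLEARNED = "·"  # stands in for the grey dot

def best_arrow(q_values):
    """Return the arrow for the highest-valued action, or a dot if unlearned."""
    if all(v == 0.0 for v in q_values):
        return UNLEARNED
    best = max(range(len(ARROWS)), key=lambda i: q_values[i])
    return ARROWS[best]

def policy_map(q_table, rows, cols):
    """Render one text line per maze row."""
    empty = (0.0, 0.0, 0.0, 0.0)
    return [
        " ".join(best_arrow(q_table.get((r, c), empty)) for c in range(cols))
        for r in range(rows)
    ]

# A tiny hand-made table: two learned cells in a 2 x 3 maze.
q_table = {
    (0, 0): (0.0, 0.2, 0.5, 0.0),  # right looks best
    (0, 1): (0.0, 0.9, 0.0, 0.0),  # down looks best
}
for line in policy_map(q_table, rows=2, cols=3):
    print(line)
```

Clicking a cell in the simulation corresponds to looking up one entry such as `q_table[(0, 0)]`: the four raw numbers behind the arrow.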

"What fires together, wires together."

Resume the simulation. Watch the arrows fill in. Watch the policy crystallize. Then ask: what happens if the maze has ten times as many cells? The table needs ten times as many rows. A hundred times as many cells? A hundred times as many rows. There is a point where the table becomes too large to exist. That point is the subject of the next chapter.
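The arithmetic behind that question is short enough to run. This sketch assumes a rectangular maze with one table row per cell, four actions per row, and 8 bytes per stored value. The maze sizes are made up for illustration and are not the size used in the simulation.

```python
def q_table_entries(rows, cols, n_actions=4):
    """Number of stored values: one per (cell, action) pair."""
    return rows * cols * n_actions

def q_table_bytes(rows, cols, n_actions=4, bytes_per_value=8):
    # Assumes each value is stored as a 64-bit float.
    return q_table_entries(rows, cols, n_actions) * bytes_per_value

for side in (10, 100, 1000, 10000):
    entries = q_table_entries(side, side)
    megabytes = q_table_bytes(side, side) / 1_000_000
    print(f"{side} x {side} maze: {entries} entries, about {megabytes} MB")
```

Note that making each side of a square maze ten times longer multiplies the number of cells, and therefore the number of rows, by a hundred. And every one of those rows still has to be visited before its arrow means anything.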