A model that only ever read an agent's movement symbols built an internal map of its world, all on its own. Here you see what it sees, decode that map from its mind, then edit it and watch its behavior follow the false belief.
A stream of moves. No picture. No map. No coordinates. Just these symbols, one after another.
We point a "mind-reader" (a linear probe) at its hidden activations.
A 155,000-parameter network was trained to do one thing: predict the next move-symbol. It was never shown a grid, a map, or a single coordinate. Yet to do its job it built a model of the world the symbols describe — and kept it in its activations, where a simple linear probe can read it out.
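To make "a simple linear probe can read it out" concrete, here is a minimal sketch of a linear probe. It is not the code behind this page. The 3x3 grid, the 32-wide activations, and the `fake_activation` stand-in are all assumptions made for illustration; a real probe would be fit on activations recorded from the trained network. The probe itself is a multiclass perceptron, one of the simplest ways to fit a linear readout.

```python
import random

random.seed(0)
N_CELLS, HIDDEN = 9, 32  # hypothetical: 3x3 grid, 32-dimensional activations

# Stand-in for the real network (an assumption): each cell gets a fixed random
# direction in activation space, and each recorded activation is that
# direction plus a little noise.
centers = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(N_CELLS)]

def fake_activation(cell):
    return [m + random.gauss(0, 0.1) for m in centers[cell]]

def make_dataset(per_cell):
    data = [(fake_activation(c), c) for c in range(N_CELLS) for _ in range(per_cell)]
    random.shuffle(data)
    return data

def predict(W, b, x):
    # Linear readout: one score per cell, highest score wins.
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b_c for w, b_c in zip(W, b)]
    return max(range(N_CELLS), key=scores.__getitem__)

def train_probe(data, epochs=20):
    # Multiclass perceptron: on a mistake, pull the true cell's weights toward
    # the activation and push the wrongly guessed cell's weights away from it.
    W = [[0.0] * HIDDEN for _ in range(N_CELLS)]
    b = [0.0] * N_CELLS
    for _ in range(epochs):
        for x, cell in data:
            guess = predict(W, b, x)
            if guess != cell:
                for i, x_i in enumerate(x):
                    W[cell][i] += x_i
                    W[guess][i] -= x_i
                b[cell] += 1.0
                b[guess] -= 1.0
    return W, b

def accuracy(W, b, data):
    return sum(predict(W, b, x) == cell for x, cell in data) / len(data)

train_set, test_set = make_dataset(20), make_dataset(20)
W, b = train_probe(train_set)
probe_accuracy = accuracy(W, b, test_set)
print(f"held-out probe accuracy: {probe_accuracy:.3f} (chance {1 / N_CELLS:.3f})")
```

The probe has no hidden layers of its own, so whatever it decodes has to be laid out linearly in the activations already. That is why high probe accuracy counts as evidence about the model rather than about the probe.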
When you clicked a cell, you overwrote its internal sense of place (activation patching). It then acted on the false belief: it refused exits that are walls only at the place it thinks it is, and "saw" a lamp that isn't there. The representation is causal, not decorative.
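The mechanics of activation patching can be sketched with a toy two-stage model. Everything here is a hand-written stand-in, not this page's trained network: the 3x3 grid, the one-hot hidden vector, and the `encode`, `head`, and `run` helpers are hypothetical names chosen for illustration. The pattern is the real one, though: cache the hidden state from one run, overwrite the hidden state of another run with it, and let the rest of the model continue from there.

```python
GRID = 3  # hypothetical 3x3 world; cells are numbered row by row, 0..8
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}

def encode(moves, start=(1, 1)):
    """Stage 1: fold a string of moves into a hidden vector (one-hot over cells)."""
    r, c = start
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
    hidden = [0.0] * (GRID * GRID)
    hidden[r * GRID + c] = 1.0
    return hidden

def head(hidden):
    """Stage 2: read the hidden vector and return the moves it treats as legal."""
    cell = max(range(len(hidden)), key=hidden.__getitem__)
    r, c = divmod(cell, GRID)
    return {m for m, (dr, dc) in MOVES.items()
            if 0 <= r + dr < GRID and 0 <= c + dc < GRID}

def run(moves, patch=None):
    """Forward pass; if `patch` is given, it replaces the hidden state mid-run."""
    hidden = encode(moves)
    if patch is not None:
        hidden = list(patch)  # the activation patch: overwrite, then carry on
    return head(hidden), hidden

# Clean run: no moves yet, so the agent is at the centre and every exit is open.
clean_out, _ = run("")

# Source run: walk north then west to the top-left corner and cache its hidden state.
corner_out, corner_hidden = run("NW")

# Patched run: same input as the clean run, but with the corner's hidden state.
patched_out, _ = run("", patch=corner_hidden)

print("clean:  ", sorted(clean_out))
print("patched:", sorted(patched_out))
```

The input to the patched run never changes. Only the hidden state does, and the output follows the hidden state: the model now refuses north and west even though the agent is still standing at the centre.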
Measured on this very model:
✓ position decodable 98.8% (chance 2%)
✓ predicts only legal moves 100%
✓ belief-edit changes behavior 100%
✓ phantom-lamp on command 99.7%
Related patterns show up in larger systems. Othello-GPT, trained only on Othello moves, builds an internal board representation (Li et al.; Nanda), and Llama-2 models have been found to encode linear representations of real-world place and time (Gurnee & Tegmark). This toy model is small enough to verify end to end, and to run live in your browser.
Next in this series: render the real-world map of place and time hidden inside an actual LLM.
Refs: Othello-GPT world model · Language Models Represent Space & Time · Golden Gate Claude
A note on words: here, "belief" means a decoded internal state representation, not consciousness or human-like understanding. This is a controlled toy grid-world experiment; it shows that this model learned a measurable, causally relevant world-state representation — not that all large models have cleanly readable or editable beliefs.