From GnuGo to AlphaGo Zero: A Roadmap for Solving Difficult Problems
Originally posted 2025-04-14
Tagged: alphago, machine_learning, strategy
Obligatory disclaimer: all opinions are my own and not those of my employer
What do difficult problems like digital assistants, self-driving cars, theorem-proving systems, and compilers have in common? They all have a certain level of fuzziness in their problem specification and a large solution space that makes it incredibly difficult to find optimal solutions.
As it turns out, the game of Go is difficult in many of the same ways, and has proven to be a fertile testbed for artificial intelligence techniques. Computer Go researchers have gone through many iterations, and the ways in which each iteration fell short are instructive to understand. In this essay, I review four distinct eras of computer Go, and draw a roadmap for how to tackle difficult problems.
Why is Go difficult?
Go is a simple game to learn. There are just three official rules:
1. Players alternate playing stones on a 19x19 board, until both players agree on the ownership of each point of the board.
2. If a chain of stones is surrounded on all available adjacencies, it is captured and removed from the board (the inevitability of capture drives consensus on the “ownership” in rule 1).
3. No repeating previous positions (the “Ko” rule).
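Rule 2 is the only rule with real algorithmic content, and it amounts to a flood fill. Here is a minimal sketch in Python, assuming a plain 19x19 array board with EMPTY/BLACK/WHITE values; it illustrates the capture condition only and is not code from any Go engine.

EMPTY, BLACK, WHITE = 0, 1, 2

def neighbors(x, y, size=19):
    # The four orthogonally adjacent points that lie on the board.
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield nx, ny

def has_liberties(board, x, y):
    # Flood fill the chain containing (x, y); the chain survives if any of its
    # stones touches at least one empty point (a "liberty").
    color = board[y][x]
    seen, stack = {(x, y)}, [(x, y)]
    while stack:
        cx, cy = stack.pop()
        for nx, ny in neighbors(cx, cy, len(board)):
            if board[ny][nx] == EMPTY:
                return True                      # found a liberty
            if board[ny][nx] == color and (nx, ny) not in seen:
                seen.add((nx, ny))
                stack.append((nx, ny))
    return False                                 # fully surrounded: the chain is captured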
The rules are so simple that they teach you absolutely nothing about how to play the game. You might think you’re ahead because you captured a big lump of stones, but a stronger player would recognize that you’d just wasted several moves on stones that were not worth it for either side to rescue or finish off. You learn by playing games, but it’s often not obvious what’s being learned.
In Go, balance is important. If you play your stones too far apart, they don’t support each other and can get separated; if you play your stones too closely, you won’t claim territory as rapidly as your opponent. If you play too close to the edge, you can secure early-game territory at the cost of whole-board influence that secures late-game territory; if you play too close to the center, then you lose out on easy territory. My favorite Go proverb is “If one player has all four corners, the weaker player should resign”, implying that if such an unbalanced situation arose, then somebody – most likely the weaker player – has messed up.
The Expert Systems Era (1980s-2006)
The first serious attempt at Go AI was the expert system, best typified by GnuGo. GnuGo was composed of rules, heuristics, and decision trees programmed by experts in each domain, with modules for each area of the game – breakin.c (invasions), dragon.c (whole-board fights), endgame.c, fuseki.c (opening theory), influence.c, semeai.c (capture races), and so on.
Here is some representative code, showing typical expert system ideas:
if (eyespace_neighbors >= 2)
  if (make_solid_eye(pos, color)) {
    bonus += 20;
    if (do_capture_dead_stones && opponent_dragons > 0)
      bonus += 10;
  }

score[pos] = 4 * eyespace_neighbors + bonus;

if (safety == INVINCIBLE) {
  score[pos] += own_neighbors;
  if (own_neighbors < 2)
    score[pos] += own_diagonals;
  if (own_worms > 1 && eyespace_neighbors >= 1)
    score[pos] += 10 + 5 * (own_worms - 2);
}
else if (eyespace_neighbors > 2)
  score[pos] += own_diagonals;

/* Splitting bonus. */
if (opponent_dragons > 1)
  score[pos] += 10 * (opponent_dragons - 1);
Note:
- A variety of hand-tuned magic numbers.
- Black-and-white determinations like “INVINCIBLE”, with no shades of grey.
- Human-interpretable concepts like “dragons” reified in code.
- Giant hand-tuned if-else trees.
GnuGo had a number of glaring flaws. It focused narrowly on local situations when other parts of the board were more important. It was unable to consider whole-board interactions. It rigidly stuck with local shapes and patterns even when doing so was unwarranted. It wasted moves clarifying the status of 99.9%-alive groups, due to its desire to meet strictly correct definitions of “dragon” or “invincible”. It had blind spots where weaker fallback code was activated. All of these are flavors of the same problem: treating Go in absolutes rather than in subtleties. GnuGo played at the 25%ile of casual players.
Lest you think that expert systems are worthless, keep in mind that Deep Blue, which defeated Kasparov in 1997, was also an expert system built atop hand-tuned board evaluation functions and modules for opening, midgame, and endgame.
The Monte-Carlo Tree Search era (2006-2016)
See also: Coulom, my deep dive on MCTS.
Monte-Carlo Tree Search brought a paradigm shift: instead of trying to divide an intrinsically fuzzy game into expert modules, use a single probabilistic framework (Upper Confidence bounds applied to Trees, or UCT). UCT searches the game tree for the best move by prioritizing game variations with the highest potential value. This gave it the flexibility to search wide and deep, unlike Deep Blue’s minimax, which required finishing a depth level before going deeper.
UCT seemed to work best in conjunction with Monte-Carlo rollouts – random stones thrown onto the board until the game reached an end state – to estimate who was winning. To get more reliable estimates from these random rollouts, researchers added basic heuristics to make them somewhat realistic, and applied a massive amount of compute to smooth out the noise. Together, this system was called MCTS.
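To make the selection rule concrete, here is a minimal Python sketch of UCT plus a random rollout. It is my own illustration, not any engine’s code; the game-state interface (legal_moves, play, is_terminal, result) is assumed for the example.

import math
import random

class Node:
    # One node of the search tree: a game state plus visit/win statistics.
    def __init__(self, state):
        self.state = state
        self.children = []     # expanded child nodes
        self.visits = 0
        self.wins = 0.0        # sum of rollout results seen through this node

def uct_score(parent, child, c=1.4):
    # Exploitation (average result so far) plus an exploration bonus that
    # shrinks as a child gets visited more often.
    if child.visits == 0:
        return float("inf")    # always try unvisited moves first
    return child.wins / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def uct_select(node, c=1.4):
    # Descend toward the child with the highest upper confidence bound.
    return max(node.children, key=lambda ch: uct_score(node, ch, c))

def rollout(state):
    # Play (semi-)random moves to the end of the game; return 1 for a win, 0 for a loss.
    while not state.is_terminal():
        state = state.play(random.choice(state.legal_moves()))
    return state.result()

A full MCTS iteration repeats selection with uct_select until it reaches a leaf, expands the leaf, runs rollout, and backs the result up the visited path, updating visits and wins.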
MCTS’s style was very inhuman; it tended to steer games towards large, simple territories that the Monte-Carlo rollout approach could more reliably assess. It tended to suffer in sharp tactical positions where its rollout heuristics might not capture the right techniques. Yet, MCTS bots far outscaled GnuGo, reaching the 90%ile of human casual players.
The Deep Learning era (2015)
Even during the MCTS era, hand-written rollout heuristics were increasingly discarded in favor of statistically learned heuristics. As Deep Learning became more widely known, multiple groups applied convolutional neural networks to Go to learn better move heuristics. Surprisingly, it was discovered that these trained neural networks could play Go at the 75%ile of human players, even without search! Later on, with better data, network architectures and training techniques, neural networks have been able to play at roughly the 98%ile of human players without search.
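As a rough illustration of what “learning a move heuristic” means here, the sketch below is a toy convolutional policy network in PyTorch: feature planes describing the board go in, a probability over the 361 points comes out. The plane count and layer sizes are mine for illustration, not AlphaGo’s architecture.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # Toy policy network: a stack of convolutions over 19x19 feature planes
    # (e.g. own stones, opponent stones, empty points), ending in one logit
    # per board point.
    def __init__(self, in_planes=3, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),
        )

    def forward(self, planes):                  # planes: (batch, in_planes, 19, 19)
        logits = self.trunk(planes).flatten(1)  # (batch, 361)
        return torch.softmax(logits, dim=1)     # probability of playing each point

Training on human games is then ordinary supervised classification: minimize the cross-entropy between the network’s output and the move the professional actually played.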
AlphaGo relied on three key neural networks: a move proposer, a position evaluator, and a rollout move proposer. The move proposer was used to bias the tree search towards more promising moves; the position evaluator estimated who was winning, and the rollout move proposer, a remnant from the MCTS era, quickly suggested moves for playing rollouts.
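At each leaf the search evaluated, the published AlphaGo blended the position evaluator’s estimate with a fast rollout result. The sketch below shows the idea; the function names are placeholders rather than DeepMind’s API, and the 0.5 mixing weight is roughly what the AlphaGo paper reported working best.

def evaluate_leaf(position, value_net, play_rollout, lambda_mix=0.5):
    # Position evaluator: estimated probability of winning, in [0, 1].
    v = value_net(position)
    # Fast rollout to the end of the game using the rollout move proposer: 0 or 1.
    z = play_rollout(position)
    # Blend the two signals; the "split brain" effect described below shows up
    # when they disagree sharply.
    return (1 - lambda_mix) * v + lambda_mix * z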
AlphaGo’s style was mostly human, intuitive, and even beautiful. Its least human trait was that it could and would accurately make gigantic trades, thanks to its architecture forcing fresh evaluations of every position. In contrast, humans tended to judge positions as an accumulation of good and bad exchanges. However, this also meant it was structurally unable to “save” an analysis across game trees, leading to the thrashing effect shown in Game 4 of Lee Sedol vs. AlphaGo. It was also susceptible to a “split brain” phenomenon when the rollout and evaluation methods disagreed. AlphaGo’s policy network started at the 95%ile of human players and added a massive amount of compute (~700 GPUs used in the AlphaGo vs. Lee Sedol match) to reach 100%ile.
The Reinforcement Learning era (2017)
See also: Minigo, LeelaZero and KataGo
700 GPUs was an admittedly brute-force way to defeat the human Go champion. Was there anything better? Well, now that you have an AI capable of playing at a higher level than any human, you have an effectively infinite source of training data.
Recall that when DeepMind trained neural networks on human professional data, those neural networks had a distillation gap - the raw neural network couldn’t play as well as the humans it was trained on. They then closed that gap through additional compute and tree search techniques. Researchers at DeepMind would work on three primary directions:
- (tick) neural architecture/training innovations that minimized the distillation gap from data -> neural network.
- (tock) tree search innovations that maximized the strength uplift per search iteration.
- (clock) optimization of macro parameters such as self-play/training compute ratio, training lookback windows, explore/exploit balance, and data quality maximization through early resignation.
The first two of these three directions are doable in isolation: one does not need to run the full reinforcement learning loop in order to optimize the individual steps. In fact, you can think of DeepMind’s original AlphaGo as a single manually executed tick/tock cycle. An intermediate version of AlphaGo called AlphaGo Master was the result of several more manual tick/tock cycles. Once this reinforcement learning loop was ironed out and optimized, it became so robust that AlphaGo Zero was able to bootstrap itself from completely random play, not even needing human professional data. AlphaGo Zero’s policy network (with no search) plays at the 99%ile of human players, and the full system is probably 1000+ Elo points stronger than any human.
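Once all three directions are in place, the loop that ties them together is conceptually small. Here is a schematic Python sketch; every callable passed in stands for a large subsystem (distributed self-play, replay buffers, evaluation gating), and the names are placeholders rather than any project’s real API.

def reinforcement_learning_loop(net, self_play, extract_examples, train,
                                num_generations, games_per_generation):
    for generation in range(num_generations):
        # "tock": tree search amplifies the current network into stronger games.
        games = [self_play(net) for _ in range(games_per_generation)]

        # "tick": training distills that stronger play back into the network.
        examples = extract_examples(games)   # (position, search policy, final result)
        net = train(net, examples)

        # "clock": macro parameters (self-play/training compute ratio, lookback
        # window, resignation threshold, exploration noise) are tuned around this loop.
    return net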
How to tackle difficult problems
If we compress 30 years of Computer Go into a roadmap, it would look like the following:
- Start with brute-force search, guided by some hand-encoded heuristics.
- Replace these with smarter search and machine-learned heuristics.
- Optimize the tick-tock cycle of better search and better machine-learned heuristics.
- Fully automate the tick-tock steps as a reinforcement learning loop.
Put this way, it seems pretty obvious, right? But there are many ways to screw this up.
Bitter lesson
Let’s look at the first stage: brute force and hand-coded heuristics. Rich Sutton’s now-famous essay The Bitter Lesson has the top-line takeaway that, thanks to Moore’s law, methods that scale with computing power (like search and statistical learning techniques) will win over methods that merely scale with human effort (like hand-encoding of heuristics).
In this essay, the very first example Sutton gives of the bitter lesson is Deep Blue: Deep Blue’s brute-force search beat out the efforts of computer chess researchers to encode human chess expertise. True. Deep Blue optimized for faster, less accurate heuristics to enable its brute force. But do you think that the GnuGo developers didn’t know the lessons of Deep Blue?
When I look at the GnuGo codebase through a modern machine learning lens, this effort is the moral equivalent of a group of people spending hundreds of human-years running sklearn.ensemble.RandomForestClassifier().fit() by hand.
Why did they persist so long? Sutton offers the following sequence:
- AI researchers have often tried to build knowledge into their agents
- this always helps in the short term, and is personally satisfying to the researcher, but
- in the long run it plateaus and even inhibits further progress, and
- breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
I offer another point: these stages happen over the span of a decade or so. Over this decade, PhDs are minted, career identities built, promotion criteria set, and organizational culture annealed. Much in the way that science advances one funeral at a time, progress on difficult problems comes one company, project, or organizational shutdown at a time.
Jumping straight to Reinforcement Learning
The last stage, reinforcement learning, is so powerful and tempting, yet incredibly hard to get right. If you ever take a class on RL theory, you might think that RL can only be done by theory geniuses who understand inscrutable math equations - just skim through Sutton and Barto’s textbook for a taste of this. The people who graduate with PhDs in RL don’t help this picture, either - the math is indeed so hard that students don’t have time to gain experience with such mundanities as building a working RL system.
In practice, jumping straight to RL is like pouring rocket fuel into your car’s gas tank and expecting it to magically work. All of your car’s systems are designed around a certain amount of engine power: they provide a certain amount of cooling capacity, deliver a certain amount of power through the drivetrain, and work within the limitations of physics and human biology. The result isn’t a supercar, but a car that randomly explodes on the test track and veers uncontrollably off-course. See Alex Irpan’s compilation of RL failures for countless examples of RL PhDs adding rocket fuel to cars.
Unfortunately, the only way to build a really fast car is to start with a slow car and work your way up, figuring out what breaks at each stage of scaleup. Looking back at the time I replicated AlphaGo Zero, I’m most surprised at how much effort went into building monitoring systems and inspection tools to understand when and how our system was broken. We introduced and fixed countless bugs related to edge cases in endgame scoring, tree search, incomplete shuffling, random noise injection, and MCTS parameter selection.
Successful applications of RL are far more mundane than the theory suggests: a focus on the basic tick-tock cycle of search and deep learning, combined with a strong emphasis on developing correct objective functions and supporting infrastructure, like monitoring and inspection tools.
Conclusion
The evolution of Go AI from expert systems to reinforcement learning offers a blueprint for tackling other fuzzy, complex problems. While it’s tempting to jump straight to advanced techniques like reinforcement learning, success typically requires progressing through stages: starting with basic search and heuristics, gradually incorporating machine learning, and only then attempting full automation through reinforcement learning. Each stage requires robust infrastructure, monitoring, and a willingness to abandon previous approaches. These lessons apply whether you’re building self-driving cars, digital assistants, or other complex AI systems - there are no shortcuts to superhuman performance.