Monte Carlo Tree Search (MCTS) in AlphaGo Zero

AlphaGo Zero uses MCTS to select the next move in a Go game.


MCTS searches for possible moves and records the results in a search tree. As more searches are performed, the tree grows larger and its predictions become more accurate. After 1,600 searches, it picks the next move with the highest chance of winning the game.


MCTS Overview

Let’s see how it picks move a3 for a board position s3.


In MCTS, a node represents a board position and an edge represents a move.


Starting with a board position (s3), it searches possible moves and uses a deep network f to evaluate

  • the policy p (which move to take next; to model uncertainty, the policy is a probability distribution over moves), and
  • the value function v (how likely it is to win from a board position s).

In the example below, it applies a deep network f (composed of convolutional layers) on s’3 to compute the policy p and the value function v.
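
As a rough sketch of that interface (not the actual network, which is a deep residual convolutional net), f can be pictured in Python as a function mapping a board position to (p, v); `state.legal_moves()` is a hypothetical game API and the uniform/neutral outputs are just placeholders:

```python
def evaluate(state):
    """Toy stand-in for the deep network f (not the real architecture).

    Returns:
      p -- a probability distribution over legal moves (the policy),
      v -- a scalar estimating how likely the current player is to win.
    """
    legal_moves = state.legal_moves()              # hypothetical game API
    p = {move: 1.0 / len(legal_moves) for move in legal_moves}
    v = 0.0                                        # neutral placeholder value
    return p, v
```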


Next, it expands the search tree by making a move. This adds a new board position and its corresponding moves to the search tree.


Let’s introduce another term called the action-value function Q. It measures the value of making a move. In (a) below, it takes the move in red (a3) and ends up with a 0.9 chance of winning, so Q is 0.9. In (b), it makes one more move and ends up with a 0.6 chance of winning. Now the move a3 has been taken twice (visit count N = 2). The Q value is simply the average of the previous results, i.e. W = 0.9 + 0.6 = 1.5 and Q = W/2 = 0.75. In (c), it explores another path. Now a3 has been picked 3 times and the Q value is (0.9 + 0.6 + 0.3)/3 = 0.6.
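
This running average is easy to express in code. Below is a minimal sketch that tracks W, N and Q for a single edge, replaying the numbers from the example above:

```python
# Accumulated value W and visit count N for one edge (s, a).
W, N = 0.0, 0

def record_result(v):
    """Add one simulation result v and return the updated Q = W / N."""
    global W, N
    N += 1
    W += v
    return W / N

print(record_result(0.9))   # (a) Q = 0.9
print(record_result(0.6))   # (b) Q = (0.9 + 0.6) / 2 = 0.75
print(record_result(0.3))   # (c) Q = (0.9 + 0.6 + 0.3) / 3 = 0.6
```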


The Q value serves a similar purpose to the policy p estimated by the deep network f: both suggest which move looks good. However, as exploration continues, the search tree grows bigger and the noise in the individual estimates averages out, so Q becomes more accurate as more moves are searched.


However, the search space for Go is huge, so we need to prioritize the search. There are two factors to consider: exploitation and exploration.

  • exploitation: perform more searches on moves that look promising (i.e. high Q value).
  • exploration: perform searches on moves that have not been tried often (i.e. low visit count N).

Mathematically, it selects the move a according to:

a = argmax_a ( Q(s, a) + u(s, a) )

where u is

u(s, a) = c · P(s, a) · √( Σ_b N(s, b) ) / ( 1 + N(s, a) )

In this equation, Q controls exploitation and u (which is inversely proportional to the visit count) controls exploration. Starting from the root, it uses this policy (the tree policy) to select the path that it wants to search further.


In the beginning, MCTS focuses more on exploration, but as the iterations increase, most searches become exploitation and Q gets more and more accurate.

AlphaGo Zero iterates the steps above 1,600 times to expand the tree. It then uses the search tree to create a policy π to pick the next move for the board position s3.


Surprisingly, it does not use Q to compute the policy π. Instead, π is derived from the visit count N.

π(a | s) = N(s, a)^(1/τ) / Σ_b N(s, b)^(1/τ)

After the initial iterations, moves with a higher Q value are visited more frequently. The visit count is used to calculate the policy because it is less prone to outliers. τ is a temperature controlling the level of exploration. When τ is 1, moves are selected in proportion to their visit counts. When τ → 0, only the move with the highest count is picked. So τ = 1 allows exploration while τ → 0 does not.
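
As a small illustration (the visit counts below are made up, not from an actual search), converting N into π with a temperature might look like this:

```python
import numpy as np

def visit_counts_to_policy(counts, tau=1.0):
    """Convert visit counts N into the move distribution pi = N^(1/tau) / sum N^(1/tau)."""
    counts = np.asarray(counts, dtype=np.float64)
    if tau == 0:                           # greedy limit: pick the most-visited move
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

print(visit_counts_to_policy([900, 500, 200], tau=1.0))   # [0.5625, 0.3125, 0.125]
print(visit_counts_to_policy([900, 500, 200], tau=0.1))   # almost one-hot on the first move
```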

With a board position s, MCTS calculates a more accurate policy π to decide the next move.


MCTS improves the policy evaluation (it refines Q), and it uses this new evaluation to improve the policy (policy improvement). It then re-applies the improved policy to evaluate positions again. These repeated rounds of policy evaluation and policy improvement are called policy iteration in RL. After self-playing many games, both the evaluation and the policy are refined to the point where they can beat the masters.

MCTS

MCTS consists of 4 major steps:

  • Step (a) selects a path (a sequence of moves) that it wants to search further.
  • Step (b) expands the search tree by taking one more move.
  • Step (c) updates the selected path backward with how promising the path is.

After repeating steps (a) to (c) 1,600 times, it uses the search tree to create a policy π3 for the board position s3. In step (d), it samples the next move from π3.
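
Putting the pieces together, here is a rough Python sketch of the tree data structure and the outer loop. The names (`Node`, `run_mcts`, and the step functions) are illustrative, not the paper's actual code; the step functions are filled in under the following sections.

```python
class Node:
    """One board position in the search tree (a rough sketch)."""
    def __init__(self, state, prior=0.0):
        self.state = state
        self.prior = prior        # P(s, a): move probability assigned by the network f
        self.children = {}        # move -> child Node
        self.N = 0                # visit count
        self.W = 0.0              # accumulated action value
        self.Q = 0.0              # mean action value W / N

def run_mcts(root_state, f, simulations=1600):
    """Repeat the select / expand-and-evaluate / backup steps, then play."""
    root = Node(root_state)
    for _ in range(simulations):
        path = select(root)                   # (a) walk down the tree using Q + u
        v = expand_and_evaluate(path[-1], f)  # (b) add the leaf's children, get v from f
        backup(path, v)                       # (c) update N, W, Q along the selected path
    return root                               # (d) the play step samples from the visit counts
```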

Select

The first step is to select a path from the tree for further search. Let’s assume our search tree looks like the one below.


It starts from the root and selects moves based on the equation below:

a = argmax_a ( Q(s, a) + u(s, a) ),  where  u(s, a) = c · P(s, a) · √( Σ_b N(s, b) ) / ( 1 + N(s, a) )

where u controls exploration. It depends on N, the visit count; P, the policy estimated by the network f; and c, a hyperparameter controlling the level of exploration. Q controls exploitation; it is the action-value function calculated by MCTS.
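
A sketch of this selection step, using the `Node` class from the skeleton above (the value of c here is an arbitrary placeholder, not the tuned hyperparameter):

```python
import math

C = 1.0   # exploration hyperparameter c (placeholder value)

def select(root):
    """Walk from the root to a leaf, picking argmax_a (Q + u) at every node."""
    path = [root]
    node = root
    while node.children:                                   # stop when we reach a leaf
        total_visits = sum(child.N for child in node.children.values())
        def score(item):
            _, child = item
            u = C * child.prior * math.sqrt(total_visits) / (1 + child.N)
            return child.Q + u
        _, node = max(node.children.items(), key=score)
        path.append(node)
    return path
```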

Expand and evaluate

With the selected path decided, a leaf node is added to the search tree. It uses the deep network f to compute the policy p and the value v for the new node.


The leaf node is expanded with edges (s, a). Each edge is initialized with:

N(s, a) = 0,  W(s, a) = 0,  Q(s, a) = 0,  P(s, a) = p_a

It sets the visit count N, the accumulated action-value W, and the action-value Q to zero. P is initialized with the corresponding move probability p estimated by f.
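
A sketch of the expand-and-evaluate step under the same assumptions as before (`f` returns a dict of move probabilities and a value, and `state.play(move)` is a hypothetical game API):

```python
def expand_and_evaluate(leaf, f):
    """Add one edge per legal move under the leaf and evaluate the leaf with f."""
    p, v = f(leaf.state)                     # policy p (move -> probability) and value v
    for move, prob in p.items():
        next_state = leaf.state.play(move)                   # hypothetical game API
        leaf.children[move] = Node(next_state, prior=prob)    # N = 0, W = 0, Q = 0, P = p_a
    return v
```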

Backup


Once p and v are calculated for the leaf node, we back up v to update the action-value Q of every edge (s, a) in the selected path.

N(s, a) ← N(s, a) + 1
W(s, a) ← W(s, a) + v
Q(s, a) = W(s, a) / N(s, a)
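
In code, the backup step might look like the sketch below; for simplicity it omits the perspective (sign) flip between the two players that a full two-player implementation needs:

```python
def backup(path, v):
    """Propagate the leaf value v back along the selected path."""
    for node in path:
        node.N += 1                   # N(s, a) <- N(s, a) + 1
        node.W += v                   # W(s, a) <- W(s, a) + v
        node.Q = node.W / node.N      # Q(s, a) <- W(s, a) / N(s, a)
```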

Play


To decide the next move, AlphaGo Zero samples a move a based on

π(a | s) = N(s, a)^(1/τ) / Σ_b N(s, b)^(1/τ)

The node reached by the selected move a becomes the new root of the search tree, with all its children preserved while the rest of the tree is discarded. Then MCTS is repeated for the next move.
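
A sketch of this play step with subtree reuse, again using the hypothetical `Node` structure from above:

```python
import numpy as np

def play(root, tau=1.0):
    """Sample the next move from pi ~ N^(1/tau) and reuse the chosen subtree as the new root."""
    moves = list(root.children)
    counts = np.array([root.children[m].N for m in moves], dtype=np.float64)
    if tau == 0:
        index = int(np.argmax(counts))
    else:
        pi = counts ** (1.0 / tau)
        pi /= pi.sum()
        index = int(np.random.choice(len(moves), p=pi))
    move = moves[index]
    new_root = root.children[move]    # keep this subtree; the rest of the tree is discarded
    return move, new_root
```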
