AlphaGo Zero uses MCTS to select the next move in a Go game.
MCTS searches through possible moves and records the results in a search tree. As more searches are performed, the tree grows larger and its predictions become more accurate. After 1,600 searches, it picks the next move with the highest chance of winning the game.
Let’s see how it picks a3 for a board position s3.
In MCTS, a node represents a board position and an edge represents a move.
Starting with a board position (s3), it searches possible moves and uses a deep network f to evaluate
- the policy p (which controls what move to take next; to model uncertainty, the policy is a probability distribution over moves), and
- the value function v (how likely it is to win from a board position s).
In the example below, it applies a deep network f (composed of convolutional layers) to s’3 to compute the policy p and the value function v.
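As a sketch, the network call can be thought of as a function mapping a board position to (p, v). The names below are hypothetical stand-ins — the real f is a large convolutional network, not this toy:

```python
def legal_moves(state):
    # Hypothetical helper: a real implementation enumerates legal Go moves.
    return ["a1", "a2", "a3"]

def f(state):
    """Toy stand-in for the deep network f.  It returns a policy p
    (a probability distribution over legal moves) and a value v."""
    moves = legal_moves(state)
    p = {a: 1.0 / len(moves) for a in moves}  # uniform prior as a placeholder
    v = 0.0                                   # neutral value estimate in [-1, 1]
    return p, v

p, v = f("s3")
print(p, v)
```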
Next, it expands the search tree by making a move. This adds a new board position and its corresponding moves to the search tree.
Let’s introduce another term, the action-value function Q. It measures the value of making a move. In (a) below, it takes the move in red (a3) and ends up with a 0.9 chance of winning, so Q is 0.9. In (b), it makes one more move through a3 and ends up with a 0.6 chance of winning. Now the move a3 has been taken twice (visit count N=2). The Q value is simply the average of the previous results, i.e. W = 0.9 + 0.6 = 1.5 and Q = W/2 = 0.75. In (c), it explores another path. Now a3 has been picked 3 times and the Q value is (0.9 + 0.6 + 0.3)/3 = 0.6.
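The running-average update for Q can be sketched in a few lines (the `Edge` class here is an illustrative helper, not AlphaGo Zero’s actual data structure):

```python
class Edge:
    """Per-move statistics: visit count N, accumulated value W, mean value Q."""
    def __init__(self):
        self.N, self.W, self.Q = 0, 0.0, 0.0

    def record(self, v):
        self.N += 1
        self.W += v
        self.Q = self.W / self.N  # Q is the running average of the results

edge = Edge()
for v in (0.9, 0.6, 0.3):   # the three results from (a), (b) and (c)
    edge.record(v)
print(edge.N, round(edge.Q, 2))  # 3 0.6
```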
The Q value serves a similar purpose to the policy p estimated by the deep network f: both rank moves. However, as it explores more, the search tree grows bigger. If it continues to play, the noise in the individual estimates cancels out, so Q gets more accurate as more moves are made.
However, the search space for Go is huge, so we need to prioritize the search. There are two factors to consider: exploitation and exploration.
- exploitation: perform more searches on moves that look promising (i.e. high Q value).
- exploration: perform searches on moves that have not been explored frequently (i.e. low visit count N).
Mathematically, it selects the move a according to:

$$a_t = \underset{a}{\arg\max}\left( Q(s_t, a) + u(s_t, a) \right)$$

where u is

$$u(s, a) = c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$
In this equation, Q controls the exploitation and u (inversely proportional to the visit count) controls the exploration. Starting from the root, it uses this policy (the tree policy) to select the path that it wants to search further.
In the beginning, MCTS focuses more on exploration, but as the iterations increase, most searches become exploitation and Q gets more accurate.
AlphaGo Zero iterates the steps above 1,600 times to expand the tree. It uses the search tree to create a policy π to pick our next move for the board position s3.
Surprisingly, it does not use Q to compute the policy π. Instead, π is derived from the visit count N.
After the initial iterations, moves with higher Q values are visited more frequently. It uses the visit count to calculate the policy because it is less prone to outliers:

$$\pi(a \mid s) = \frac{N(s, a)^{1/\tau}}{\sum_b N(s, b)^{1/\tau}}$$

τ is a temperature controlling the level of exploration. When τ is 1, it selects moves proportionally to the visit count. When τ → 0, only the move with the highest count will be picked. So τ = 1 allows exploration while τ → 0 does not.
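A minimal sketch of deriving π from the visit counts with temperature τ (the function name is illustrative):

```python
def policy_from_counts(counts, tau):
    """pi(a) is proportional to N(a)^(1/tau); as tau -> 0 this collapses
    onto the most-visited move."""
    if tau == 0:
        best = max(counts, key=counts.get)
        return {a: 1.0 if a == best else 0.0 for a in counts}
    weights = {a: n ** (1.0 / tau) for a, n in counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

counts = {"a1": 10, "a2": 30, "a3": 60}
print(policy_from_counts(counts, 1.0))  # proportional to the counts
print(policy_from_counts(counts, 0))    # all probability on a3
```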
With a board position s, MCTS calculates a more accurate policy π to decide the next move.
MCTS improves our policy evaluation (it improves Q), and it uses the new evaluation to improve the policy (policy improvement). Then it re-applies the improved policy to evaluate positions again. These repeated iterations of policy evaluation and policy improvement are called policy iteration in RL. After self-playing many games, both policy evaluation and policy improvement are optimized to the point where it can beat human masters.
MCTS consists of 4 major steps:
- Step (a) selects a path (a sequence of moves) that it wants to search further.
- Step (b) expands the search tree by taking one more move.
- Step (c) updates the current path backward with how promising the path is.
After repeating steps (a) to (c) 1,600 times, it uses the search tree to create a policy π3 for the board position s3. In step (d), it samples the next move from π3.
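The four steps can be sketched end-to-end on a toy one-level tree. This is a heavy simplification under stated assumptions: the “network” returns a fixed value per move instead of evaluating positions, and the tree is only one move deep:

```python
import math

def toy_mcts(priors, values, c=1.0, sims=1600):
    """(a) select a move by Q + u, (b) 'evaluate' it with a stand-in network
    that returns a fixed value, (c) back up N, W, Q.  Returns visit counts."""
    N = {a: 0 for a in priors}
    W = {a: 0.0 for a in priors}
    for _ in range(sims):
        total = sum(N.values())
        def score(a):
            Q = W[a] / N[a] if N[a] else 0.0
            u = c * priors[a] * math.sqrt(total) / (1 + N[a])
            return Q + u
        a = max(priors, key=score)   # (a) selection
        v = values[a]                # (b) evaluation (toy stand-in for f)
        N[a] += 1                    # (c) backup
        W[a] += v
    return N

N = toy_mcts({"a1": 0.3, "a2": 0.3, "a3": 0.4},
             {"a1": 0.2, "a2": 0.4, "a3": 0.9})
print(max(N, key=N.get))  # (d) the most promising move ends up most visited
```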
The first step is to select a path from the tree for further search. Let’s assume our search tree looks like the one below.
It starts from the root and selects moves based on:

$$a_t = \underset{a}{\arg\max}\left( Q(s_t, a) + u(s_t, a) \right), \quad u(s, a) = c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$

where u controls exploration. It depends on N, the visit count; P, the policy estimated by the network f; and c, a hyperparameter controlling the level of exploration. Q controls exploitation and is the action-value function calculated by MCTS.
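As a small numeric illustration of the trade-off (hedged: `c=1.0` is an arbitrary choice here; in practice this hyperparameter is tuned):

```python
import math

def puct_score(Q, P, N_a, N_parent, c=1.0):
    """Q(s,a) + u(s,a), with u = c * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    u = c * P * math.sqrt(N_parent) / (1 + N_a)
    return Q + u

# A well-explored move with decent Q...
print(puct_score(Q=0.6, P=0.2, N_a=50, N_parent=100))
# ...can be outscored by an unvisited move with a good prior:
print(puct_score(Q=0.0, P=0.5, N_a=0, N_parent=100))  # 5.0
```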
Expand and evaluate
With the path selected, a leaf node is added to the search tree. It uses the deep network f to compute the policy p and the value v for the leaf node.
The leaf node is expanded with edges (s, a). Each edge is initialized with:

$$N(s, a) = 0, \quad W(s, a) = 0, \quad Q(s, a) = 0, \quad P(s, a) = p_a$$

It sets the visit count N, the accumulated action-value W, and the action-value Q to zero. P is initialized with the prior probability p estimated by f.
Once p and v are calculated for the leaf node, we back up v to update the visit count N, the accumulated value W, and the action-value Q for every edge (s, a) in the selected path.
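The backup step can be sketched as follows (edges are plain dicts here for illustration; P is omitted since it is not changed by the backup):

```python
def backup(path, v):
    """Update every edge (s, a) along the selected path with the leaf value v:
    N += 1, W += v, Q = W / N."""
    for edge in path:
        edge["N"] += 1
        edge["W"] += v
        edge["Q"] = edge["W"] / edge["N"]

# One fresh edge (N = W = Q = 0) and one edge visited once with value 0.9:
path = [{"N": 0, "W": 0.0, "Q": 0.0}, {"N": 1, "W": 0.9, "Q": 0.9}]
backup(path, 0.6)
print(path[0]["Q"], path[1]["Q"])  # 0.6 and (0.9 + 0.6) / 2 = 0.75
```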
To decide the next move, AlphaGo Zero samples a move a from

$$\pi(a \mid s_0) = \frac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$$
The child node for the selected move a becomes the new root of the search tree; its subtree is preserved while the rest of the tree is discarded. Then it repeats MCTS again for the next move.
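Sampling the move and reusing its subtree can be sketched as below (the `Node` class and `play_move` helper are hypothetical, not AlphaGo Zero's actual code):

```python
import random

class Node:
    """Minimal tree node: children maps a move to its child subtree."""
    def __init__(self, children=None):
        self.children = children or {}
        self.parent = None

def play_move(root, pi):
    """Sample the next move a from the visit-count policy pi, then keep the
    chosen child's subtree as the new root and discard the rest of the tree."""
    moves = list(pi)
    a = random.choices(moves, weights=[pi[m] for m in moves])[0]
    new_root = root.children[a]
    new_root.parent = None  # detach; the unused siblings can be freed
    return a, new_root

root = Node({"a1": Node(), "a2": Node(), "a3": Node()})
a, new_root = play_move(root, {"a1": 0.1, "a2": 0.2, "a3": 0.7})
print(a)  # the sampled move; its subtree survives for the next MCTS run
```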