Mastering the game of Go with deep neural networks and tree search
30 Aug 2017 | Paper: Nature
For the full paper, please Google it yourself.
Key idea:
We introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves.
Background knowledge:
Previous approach:
- Depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function that predicts the outcome from state s.
- The breadth of the search may be reduced by sampling actions from a policy that is a probability distribution over possible moves a in position s.
- Monte Carlo tree search (MCTS) uses Monte Carlo rollouts to estimate the value of each state in a search tree.
- However, prior work has been limited to shallow policies or value functions based on a linear combination of input features.
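To make the two reductions concrete, here is a toy sketch (not from the paper) of a truncated search that backs up an approximate value function instead of searching to the end, and that only expands a few moves sampled from a policy. `value_fn`, `policy_fn` and the game API (`is_terminal`, `play`) are hypothetical placeholders.

```python
# Toy illustration of depth reduction (value-function cutoff) and breadth
# reduction (sampling moves from a policy) in a negamax-style search.
import random

def truncated_search(state, depth, value_fn, policy_fn, n_samples=5):
    """Negamax that cuts off at `depth`, returning v(s) instead of searching deeper,
    and only expands a small sample of moves drawn from the policy p(a|s)."""
    if depth == 0 or state.is_terminal():
        return value_fn(state)                      # depth reduction: v(s) ~ v*(s)
    moves, probs = policy_fn(state)                 # probability distribution over moves
    sampled = random.choices(moves, weights=probs, k=min(n_samples, len(moves)))
    best = -float("inf")
    for a in set(sampled):                          # breadth reduction: few sampled moves
        best = max(best, -truncated_search(state.play(a), depth - 1,
                                           value_fn, policy_fn, n_samples))
    return best
```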
This paper approach:
- We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
- Structure (Check Figure 1 in paper):
- Supervised Learning (SL) policy network directly from expert human moves
- Reinforcement Learning (RL) policy network that improves the SL policy network by optimizing the final outcome of games of self-play.
- Value network that predicts the winner of games played by the RL policy network against itself.
- MCTS with policy and value network
Notation:
- $s$: state (the current state of the Go game; check Extended Data Table 2 in the paper)
- $a$: move (a legal move available in state $s$)
Supervised Learning (SL) policy network
- Task: predict the probability $p_\sigma(a \mid s)$ of a human expert's next move $a$ given the current state $s$
- Input: state $s$
- Output: probability distribution over all legal moves $a$
- Structure: alternating convolutional layers and rectifier nonlinearities, with a softmax as the final output.
- Detailed network: 13 layers
- Data: 30 million state-action pairs from the KGS Go Server.
- Performance: 57.0% accuracy using all input features, 55.7% using only the raw board position and move history as inputs.
- Time consumption: 3 ms to select an action
- Extra: a faster rollout policy $p_\pi(a \mid s)$ with 24.2% accuracy and only 2 μs to select an action.
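A minimal sketch of such a 13-layer policy network in PyTorch, roughly following the paper's description (48 input feature planes, $k = 192$ filters); this is an illustrative assumption, not the original implementation.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, in_planes=48, k=192, board=19):
        super().__init__()
        layers = [nn.Conv2d(in_planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                           # layers 2-12: 3x3 convolutions
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, kernel_size=1)]    # layer 13: one logit per board point
        self.body = nn.Sequential(*layers)

    def forward(self, s):                             # s: (batch, 48, 19, 19) feature planes
        logits = self.body(s).flatten(1)              # (batch, 361)
        return torch.softmax(logits, dim=1)           # probability distribution over moves

# Supervised training is cross-entropy on (state, expert move) pairs from KGS, e.g.
# loss = torch.nn.functional.nll_loss(torch.log(net(s) + 1e-12), expert_move_index)
```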
Reinforcement Learning (RL) policy network
- Objective: improving the policy network by policy gradient reinforcement learning (RL)
- Network: the RL policy network $p_\rho$ is identical in structure to the SL policy network
- Training procedure:
- The RL policy network weights $\rho$ are initialized to the same values as the SL policy network, $\rho = \sigma$.
- Play Go games between the current policy network and a randomly selected previous iteration of the policy network.
- At the end of a game, update the network for every move at time steps $t = 1:T$ by the policy gradient $\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho}\, z_t$, where $z_t$ is the outcome of the game from the perspective of the current player at step $t$ (+1 for winning, -1 for losing).
- Performance: 80% winning rate against the SL policy network, and 85% against the strongest open-source Go program, Pachi (a previous state-of-the-art supervised CNN only achieved an 11% winning rate against Pachi).
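A minimal sketch of the self-play REINFORCE update, assuming the PolicyNet sketch above; the `(states, actions, z)` batching is an illustrative placeholder, not the paper's implementation.

```python
import torch

def reinforce_update(net, optimizer, games):
    """games: list of (states, actions, z) for one player's moves in a self-play game;
    z = +1 if that player won the game, -1 if it lost."""
    optimizer.zero_grad()
    loss = torch.tensor(0.0)
    for states, actions, z in games:
        probs = net(states)                                       # (T, 361) move probabilities
        logp = torch.log(probs.gather(1, actions[:, None]).squeeze(1) + 1e-12)
        loss = loss - (logp * z).mean()                           # ascend z * grad log p_rho(a_t|s_t)
    loss.backward()
    optimizer.step()
    return loss.item()

# Opponents: the paper plays the current network against a randomly selected earlier
# checkpoint (e.g. opponent = random.choice(checkpoint_pool)) to stabilize training.
```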
Value network
- Objective: position evaluation of any given state $s$
- Task: estimate a value function $v^p(s)$ that predicts the outcome from position $s$ of games played by using policy $p$ for both players
- Input: state-outcome pairs $(s, z)$
- Output: estimated value function $v_\theta(s)$
- Network: a value network with a single prediction output, similar in architecture to the previous two policy networks, with weights $\theta$
- Training: use state-outcome pairs $(s, z)$ from both the KGS dataset and a self-play dataset, with SGD to minimize the MSE between the predicted value $v_\theta(s)$ and the corresponding outcome $z$: $\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta}\,(z - v_\theta(s))$
- Performance: 0.226 and 0.234 MSE on the training and test sets respectively, approaching the accuracy of rollouts with the RL policy network but with 15,000 times less computation.
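A minimal sketch of the value network head and one MSE training step in PyTorch, reusing the convolutional trunk shape from the PolicyNet sketch above; the exact head sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, in_planes=48, k=192, board=19):
        super().__init__()
        layers = [nn.Conv2d(in_planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(k * board * board, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),            # single scalar prediction in [-1, 1]
        )

    def forward(self, s):
        return self.head(self.trunk(s)).squeeze(1)   # v_theta(s)

def value_sgd_step(net, optimizer, s, z):
    """One SGD step minimizing MSE between v_theta(s) and the game outcome z (+-1)."""
    optimizer.zero_grad()
    loss = ((net(s) - z) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```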
MCTS (Searching) with policy and value networks
- Objective: determine the next move $a$ to play
- Input: state $s$
- Output: estimated optimal move $a$
- Algorithm: run many simulations for the same input state $s$; each simulation traverses the tree starting from the root state (the input state $s$). For one simulation (Figure 3):
- For a node with children, branch to the child that maximizes the action value plus a bonus, $a_t = \operatorname{argmax}_a \big( Q(s_t, a) + u(s_t, a) \big)$, where the bonus $u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$ favors moves with high prior probability and low visit count.
- A leaf node may be expanded by the SL policy network $p_\sigma$, and the output probabilities are stored as prior probabilities $P(s, a) = p_\sigma(a \mid s)$ for each action.
- At the end of a simulation, the leaf node $s_L$ is evaluated in two ways: using the value network, $v_\theta(s_L)$; and by running a rollout to the end of the game with the fast rollout policy $p_\pi$, giving outcome $z_L$. These are combined into a leaf evaluation $V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda z_L$.
- Update visit counts and action values: $N(s, a) = \sum_{i=1}^{n} 1(s, a, i)$ and $Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} 1(s, a, i)\, V(s_L^i)$, where $1(s, a, i)$ indicates whether edge $(s, a)$ was traversed during the $i$th simulation and $s_L^i$ is the leaf node from the $i$th simulation. Once the search completes, AlphaGo plays the most visited move from the root position.
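A minimal sketch of one such simulation (selection, expansion, evaluation, backup). `policy_net`, `value_net`, `rollout` and the game API (`legal` moves via the policy, `play`, `is_terminal`) are assumed placeholders, the exploration bonus follows the PUCT-style form from the paper's Methods, and player-perspective sign handling is omitted for brevity.

```python
import math

class Node:
    def __init__(self, prior):
        self.P = prior       # prior probability from the SL policy network
        self.N = 0           # visit count
        self.W = 0.0         # total backed-up value
        self.children = {}   # move -> Node

    def Q(self):
        return self.W / self.N if self.N else 0.0

def simulate(root, state, policy_net, value_net, rollout, c_puct=5.0, lam=0.5):
    node, path = root, []
    # Selection: descend while the node is already expanded
    while node.children:
        total_N = sum(ch.N for ch in node.children.values())
        def score(item):
            a, ch = item
            u = c_puct * ch.P * math.sqrt(total_N) / (1 + ch.N)   # exploration bonus u(s, a)
            return ch.Q() + u
        a, node = max(node.children.items(), key=score)
        state = state.play(a)
        path.append(node)
    # Expansion: store prior probabilities P(s, a) from the SL policy network
    if not state.is_terminal():
        for a, p in policy_net(state):
            node.children[a] = Node(prior=p)
    # Evaluation: mix the value network with a fast rollout outcome
    V = (1 - lam) * value_net(state) + lam * rollout(state)
    # Backup: update visit counts and mean action values along the traversed edges
    for n in path:
        n.N += 1
        n.W += V
    return V
```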
Performance
- Against several other programs:
- 5 s computation time per move
- AlphaGo (single-machine) winning 494 out of 495 games
- With four handicap stones, AlphaGo wins 77%, 86% and 99% of games against Crazy Stone, Zen and Pachi respectively.
- AlphaGo (distributed) wins 100% of games against other programs and 77% against AlphaGo (single-machine)
- Variants of AlphaGo that evaluate positions using just the value network or just rollouts were also tested; the mixed evaluation gives the best results.
- Against professional player Fan Hui: AlphaGo won 5 games to 0