
Go in numbers

3,000 years old

40M players

10^170 positions

The Rules of Go

Capture

Territory

Why is Go hard for computers to play?


Brute-force search is intractable:
1. The search space is huge
2. It is impossible for computers to evaluate who is winning

Game tree complexity = b^d
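For a rough sense of scale (standard back-of-the-envelope figures for Go and chess, not taken from this slide): with branching factor b ≈ 250 and typical game length d ≈ 150 for Go,

\[
b^{d} \approx 250^{150} = 10^{150\,\log_{10} 250} \approx 10^{360},
\]

compared with roughly \(35^{80} \approx 10^{123}\) for chess, which is why brute-force search cannot work here.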

Convolutional neural network

Value network
Evaluation: v(s)
Position: s

Policy network
Move probabilities: p(a|s)
Position: s
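A minimal sketch of the two outputs (illustrative only, in numpy; the real models are 12-layer convolutional networks over 19x19 feature planes, not single linear layers): the policy network maps a position s to a probability p(a|s) over the 361 points, and the value network maps s to a single evaluation v(s) in [-1, 1].

import numpy as np

BOARD = 19 * 19  # 361 board points

# Toy, randomly initialised weights; purely illustrative stand-ins for the CNNs.
rng = np.random.default_rng(0)
W_policy = rng.normal(0.0, 0.01, size=(BOARD, BOARD))
W_value = rng.normal(0.0, 0.01, size=BOARD)

def policy(s):
    """p(a|s): probability distribution over the 361 possible moves (softmax)."""
    logits = W_policy @ s
    e = np.exp(logits - logits.max())
    return e / e.sum()

def value(s):
    """v(s): single scalar evaluation of the position, squashed into [-1, 1]."""
    return float(np.tanh(W_value @ s))

# Toy position encoding: +1 black stone, -1 white stone, 0 empty.
s = rng.integers(-1, 2, size=BOARD).astype(float)
print(policy(s).sum(), value(s))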

Exhaustive search

Monte-Carlo rollouts

Reducing depth with value network

Reducing breadth with policy network

Deep reinforcement learning in AlphaGo


Human expert
positions

Supervised Learning
policy network

Reinforcement Learning
policy network

Self-play data

Value network

Supervised learning of policy networks


Policy network: 12-layer convolutional neural network
Training data: 30M positions from human expert games (KGS 5+ dan)
Training algorithm: maximise likelihood by stochastic gradient descent

Training time: 4 weeks on 50 GPUs using Google Cloud


Results: 57% accuracy on held-out test data (state-of-the-art was 44%)
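A hedged sketch of the training step described above: one stochastic-gradient update that increases the log-likelihood of the expert move (equivalently, decreases the cross-entropy). The linear "network", weight names, and move index are toy stand-ins for the 12-layer CNN trained on the 30M KGS positions.

import numpy as np

N_MOVES = 361
rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.01, size=(N_MOVES, N_MOVES))  # toy linear stand-in for the CNN

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sl_step(W, s, a_expert, lr=0.01):
    """One SGD step that increases log p(a_expert | s)."""
    p = softmax(W @ s)
    g = -p
    g[a_expert] += 1.0                 # d log p(a*|s) / d logits = onehot(a*) - p
    return W + lr * np.outer(g, s)     # gradient ascent on the log-likelihood

s = rng.integers(-1, 2, size=N_MOVES).astype(float)   # toy position
W = sl_step(W, s, a_expert=72)                         # 72: arbitrary example move index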

Reinforcement learning of policy networks


Policy network: 12-layer convolutional neural network
Training data: games of self-play between policy networks
Training algorithm: maximise wins z by policy gradient reinforcement learning

Training time: 1 week on 50 GPUs using Google Cloud


Results: 80% win rate vs the supervised learning policy network. Raw network ~3 amateur dan.
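A minimal REINFORCE-style sketch of the policy-gradient idea: after a self-play game, nudge the weights towards the moves that were actually played, scaled by the final outcome z (+1 for a win, -1 for a loss). The data layout and the linear stand-in for the network are assumptions for illustration, not the actual training code.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rl_step(W, game, z, lr=0.01):
    """Policy gradient update: reinforce played moves in proportion to the outcome z.

    game: list of (position s, move index a) pairs from one self-play game.
    z: +1 if this player won, -1 if it lost.
    """
    for s, a in game:
        p = softmax(W @ s)
        g = -p
        g[a] += 1.0                        # gradient of log p(a|s) w.r.t. the logits
        W = W + lr * z * np.outer(g, s)
    return W

rng = np.random.default_rng(2)
W = rng.normal(0.0, 0.01, size=(361, 361))
s = rng.integers(-1, 2, size=361).astype(float)
W = rl_step(W, [(s, 72)], z=+1.0)          # toy one-move "game" that was won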

Reinforcement learning of value networks


Value network: 12-layer convolutional neural network
Training data: 30 million games of self-play
Training algorithm: minimise MSE by stochastic gradient descent

Training time: 1 week on 50 GPUs using Google Cloud


Results: First strong position evaluation function - previously thought impossible
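A small sketch of the regression described above: one SGD step shrinking the squared error between the predicted evaluation v(s) and the self-play game outcome z. Again, a toy linear stand-in replaces the 12-layer value CNN.

import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(0.0, 0.01, size=361)        # toy linear stand-in for the value network

def v(w, s):
    return float(np.tanh(w @ s))           # evaluation in [-1, 1]

def value_step(w, s, z, lr=0.01):
    """One SGD step minimising the squared error (v(s) - z)^2 against the outcome z."""
    pred = v(w, s)
    grad = 2.0 * (pred - z) * (1.0 - pred ** 2) * s   # chain rule through tanh
    return w - lr * grad

s = rng.integers(-1, 2, size=361).astype(float)
w = value_step(w, s, z=+1.0)               # z = +1: this self-play game was won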

Monte-Carlo tree search in AlphaGo: selection

P prior probability
Q action value
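In the paper, selection picks the edge maximising Q(s,a) + u(s,a), where the exploration bonus u is proportional to the prior P(s,a) and decays with the visit count N(s,a). A small sketch of that rule; the constant c_puct and the edge data layout are illustrative choices, not AlphaGo's actual data structures.

import math

def select(node, c_puct=5.0):
    """Return the move maximising Q + u, with u = c_puct * P * sqrt(sum N) / (1 + N)."""
    total_n = sum(e["N"] for e in node.values())
    return max(node, key=lambda a: node[a]["Q"]
               + c_puct * node[a]["P"] * math.sqrt(total_n) / (1 + node[a]["N"]))

# Each edge stores: P (prior from the policy network), N (visit count), Q (mean action value).
node = {
    "D4":  {"P": 0.40, "N": 10, "Q": 0.10},
    "Q16": {"P": 0.20, "N": 2,  "Q": 0.30},
}
print(select(node))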

Monte-Carlo tree search in AlphaGo: expansion

p Policy network

P prior probability

Monte-Carlo tree search in AlphaGo: evaluation

Value network

Monte-Carlo tree search in AlphaGo: rollout

Value network
r Game scorer

Monte-Carlo tree search in AlphaGo: backup

Action value
v Value network
r Game scorer
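At the leaf, the evaluation from the value network v and the rollout outcome z returned by the game scorer r are mixed as (1 - λ)v + λz (λ = 0.5 in the Nature paper), and the result is backed up the visited edges so that each Q stays the mean of the evaluations beneath it. A minimal sketch under those assumptions:

def leaf_value(v_net, z_rollout, lam=0.5):
    """Mix the value-network estimate v and the rollout outcome z: (1 - lam) * v + lam * z."""
    return (1.0 - lam) * v_net + lam * z_rollout

def backup(path, leaf_v):
    """Walk back up the visited edges: bump visit counts and keep Q as the running mean."""
    for edge in path:          # path: edges traversed from the root down to the leaf
        edge["N"] += 1
        edge["W"] += leaf_v    # W: total value accumulated through this edge
        edge["Q"] = edge["W"] / edge["N"]

path = [{"N": 3, "W": 0.6, "Q": 0.2}, {"N": 1, "W": -0.1, "Q": -0.1}]
backup(path, leaf_value(v_net=0.4, z_rollout=1.0))
print(path)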

Deep Blue
Handcrafted chess knowledge
Alpha-beta search guided by a heuristic evaluation function
200 million positions / second

AlphaGo
Knowledge learned from expert games and self-play
Monte-Carlo search guided by policy and value networks
60,000 positions / second

Nature AlphaGo

Seoul AlphaGo

Evaluating Nature AlphaGo against computers


[Elo rating chart: Nature AlphaGo rated around 3500 Elo, at professional dan level; existing Go programs (Crazy Stone, Zen, Pachi, Fuego, GnuGo) range from beginner kyu to low amateur dan.]

494/495 wins against computer opponents
>75% winning rate with 4 stone handicap
Even stronger using distributed machines

Evaluating Nature AlphaGo against humans


Fan Hui (2p): European Champion 2013 - 2016

Match was played in October 2015

AlphaGo won the match 5-0

First program ever to beat a professional player on a full-size 19x19 board in an even game

Seoul AlphaGo
Deep reinforcement learning (same as Nature AlphaGo)

Improved value network

Improved policy network

Improved search

Improved hardware (TPU vs GPU)

Evaluating Seoul AlphaGo against computers


[Elo rating chart: AlphaGo (Seoul v18) rated around 4500 Elo, versus AlphaGo (Nature v13) at around 3500 Elo; earlier Go programs remain at amateur kyu/dan levels.]

>50% win rate against Nature AlphaGo with 3 to 4 stone handicap
CAUTION: ratings based on self-play results

Evaluating Seoul AlphaGo against humans


Lee Sedol (9p): winner of 18 world titles

Match was played in Seoul, March 2016

AlphaGo won the match 4-1

AlphaGo vs Lee Sedol: Game 1

AlphaGo vs Lee Sedol: Game 2

AlphaGo vs Lee Sedol: Game 3

AlphaGo vs Lee Sedol: Game 4

AlphaGo vs Lee Sedol: Game 5

Deep Reinforcement Learning: Beyond AlphaGo

What's Next?

Team
Dave Silver, Aja Huang, Chris Maddison, Arthur Guez, Laurent Sifre,
George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou,
Veda Panneershelvam, Yutian Chen, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Tim Lillicrap, Maddy Leach,
Koray Kavukcuoglu, Thore Graepel, Demis Hassabis
With thanks to: Lucas Baker, David Szepesvari, Malcolm Reynolds, Ziyu Wang,
Nando De Freitas, Mike Johnson, Ilya Sutskever, Jeff Dean, Mike Marty, Sanjay Ghemawat.

Demis Hassabis
