Leela Chess Zero's neural network is largely based on DeepMind's AlphaGo Zero [1] and AlphaZero [2] architecture. There are, however, some changes.
Network topology
The core of the network is a residual tower with Squeeze and Excitation [3] (SE) layers.
The number of residual BLOCKS and FILTERS (channels) per block differs between networks.
Typical values for BLOCKS×FILTERS are 10×128, 20×256, 24×320.
SE layers have SE_CHANNELS channels (typically 32 or so).
The input to the neural network is 112 planes of size 8×8 each.
The network consists of a “body” (residual tower) and several output “heads” attached to it.
All convolution layers also include bias layers.
A fully connected layer is a MatMul followed by adding a bias.
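To get a feel for how BLOCKS and FILTERS drive network size, the following is a rough sketch that counts the weights and biases of the body only. It assumes 3×3 kernels (as stated in the Body section), one bias per output channel, an SE layer in every block, and SE_CHANNELS = 32. The function name and its defaults are illustrative, not part of lc0, and the heads are not counted.

```python
def body_parameter_count(blocks: int, filters: int, se_channels: int = 32,
                         input_planes: int = 112, kernel: int = 3) -> int:
    """Count weights and biases of the input convolution plus the residual tower.

    Assumptions (not taken from lc0 code): every block has an SE layer,
    batch normalization is folded into the weights, heads are excluded.
    """
    def conv(c_in: int, c_out: int) -> int:
        # kernel*kernel weights per (input, output) channel pair, plus one bias per output channel
        return c_in * c_out * kernel * kernel + c_out

    def dense(n_in: int, n_out: int) -> int:
        # MatMul weights plus bias
        return n_in * n_out + n_out

    input_conv = conv(input_planes, filters)
    se_layer = dense(filters, se_channels) + dense(se_channels, 2 * filters)
    block = 2 * conv(filters, filters) + se_layer
    return input_conv + blocks * block


for blocks, filters in [(10, 128), (20, 256), (24, 320)]:
    count = body_parameter_count(blocks, filters)
    print(f"{blocks}x{filters}: {count:,} body parameters")
```

The two 3×3 convolutions per block dominate the count, so size grows roughly linearly in BLOCKS and quadratically in FILTERS.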
Body
- Input convolution: from 112×8×8 to FILTERS×8×8.
- Residual tower consisting of BLOCKS blocks:
  - Convolution from FILTERS×8×8 to FILTERS×8×8.
  - Convolution from FILTERS×8×8 to FILTERS×8×8.
  - SE layer (only in network type NETWORK_SE_WITH_HEADFORMAT [current]), i.e.:
    - Global average pooling layer (FILTERS×8×8 to FILTERS)
    - Fully connected layer (FILTERS to SE_CHANNELS)
    - ReLU
    - Fully connected layer (SE_CHANNELS to 2×FILTERS).
    - 2×FILTERS is split into two FILTERS sized vectors W and B
    - Z = Sigmoid(W)
    - Output of the SE layer is (Z × input) + B.
  - Adding the residual tower skip connection.
  - ReLU activation function.
All convolutions have kernel size 3×3 and stride 1.
Batch normalization is already folded into the weights, so no normalization is needed during inference.
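The SE layer and the end of a residual block described above can be sketched in plain Python as follows. This is a minimal illustration, not lc0's implementation: the data layout (a list of FILTERS planes, each a flat list of 64 values), the weight layout (one row per output), and the order of the split (W first, then B) are assumptions, and the two convolutions that precede the SE layer are omitted.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: MatMul plus bias. weights has one row per output."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def se_layer(x, w1, b1, w2, b2):
    """Apply the SE layer to x, a list of FILTERS planes of 64 values each."""
    filters = len(x)
    pooled = [sum(plane) / len(plane) for plane in x]      # global average pooling
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]  # FILTERS -> SE_CHANNELS, ReLU
    out = dense(hidden, w2, b2)                            # SE_CHANNELS -> 2*FILTERS
    w, b = out[:filters], out[filters:]                    # assumed order: W first, then B
    z = [1.0 / (1.0 + math.exp(-v)) for v in w]            # Z = Sigmoid(W)
    # (Z * input) + B, applied per channel
    return [[z[c] * v + b[c] for v in x[c]] for c in range(filters)]


def finish_block(block_input, se_output):
    """Add the skip connection, then apply ReLU."""
    return [[max(0.0, s + r) for s, r in zip(se_plane, in_plane)]
            for se_plane, in_plane in zip(se_output, block_input)]


# Toy example: FILTERS = 2, SE_CHANNELS = 1, all-zero weights so only biases matter.
x = [[2.0] * 64, [4.0] * 64]
w1, b1 = [[0.0, 0.0]], [0.0]
w2, b2 = [[0.0], [0.0], [0.0], [0.0]], [0.0, 0.0, 1.0, -1.0]
se_out = se_layer(x, w1, b1, w2, b2)
block_out = finish_block(x, se_out)
print(se_out[0][0], se_out[1][0], block_out[0][0], block_out[1][0])
```

In the toy example W is zero, so Z is 0.5 for both channels and the SE output is half the input plus the per-channel bias B.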
Policy head
Format: POLICY_CONVOLUTION [current]
- Convolution from FILTERS×8×8 to FILTERS×8×8.
- Convolution from FILTERS×8×8 to 80×8×8.
- The vector of length 1858 is gathered from the 80×8×8 matrix using this mapping (only 73×8×8 is actually used, the rest is for padding).
- (note there is no activation function on the output)
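The gather step in the POLICY_CONVOLUTION format amounts to an index lookup. The sketch below shows the idea only: the real mapping table is not reproduced here, so toy_mapping is a placeholder, and both the flattening order of the 80×8×8 output and the direction of the table (policy index to flat position) are assumptions.

```python
POLICY_SIZE = 1858
CONV_OUT_SIZE = 80 * 8 * 8


def gather_policy(conv_out_flat, mapping):
    """Pick one value from the flattened 80x8x8 output for each of the 1858 policy entries.

    mapping[i] is assumed to hold the flat index that feeds policy entry i.
    """
    if len(conv_out_flat) != CONV_OUT_SIZE:
        raise ValueError("expected a flattened 80x8x8 tensor")
    if len(mapping) != POLICY_SIZE:
        raise ValueError("expected one index per policy entry")
    return [conv_out_flat[i] for i in mapping]


# Placeholder data: NOT the real lc0 mapping, just the first 1858 positions.
toy_mapping = list(range(POLICY_SIZE))
conv_out_flat = [float(i) for i in range(CONV_OUT_SIZE)]
policy_logits = gather_policy(conv_out_flat, toy_mapping)
print(len(policy_logits))
```

The result is a vector of raw values, since no activation function is applied to the policy output.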
Format: POLICY_CLASSICAL
POLICY_CONV_SIZE is a parameter.
- Convolution from FILTERS×8×8 to POLICY_CONV_SIZE×8×8
- Fully connected from POLICY_CONV_SIZE×8×8 to a vector of length 1858
- (note there is no activation function on the output)
Value head
Common part
- Convolution from FILTERS×8×8 to 32×8×8
- Fully connected from 32×8×8 to a vector of length 128
- ReLU
Format: VALUE_WDL [current]
- Fully connected from vector of length 128 to the vector of length 3
- Softmax
Format: VALUE_CLASSICAL
- Fully connected from vector of length 128 to a scalar
- Tanh
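The two value formats differ only in the final activation. As a small sketch, assuming the last fully connected layer has already produced its raw output (three values for VALUE_WDL, one for VALUE_CLASSICAL), the activations could look like this. The function names are illustrative and not taken from lc0.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def value_wdl(fc_out):
    """VALUE_WDL: three raw values become three probabilities that sum to 1."""
    if len(fc_out) != 3:
        raise ValueError("VALUE_WDL expects a vector of length 3")
    return softmax(fc_out)


def value_classical(fc_out):
    """VALUE_CLASSICAL: a raw scalar is squashed into the range (-1, 1)."""
    return math.tanh(fc_out)


print(value_wdl([2.0, 0.5, -1.0]))
print(value_classical(0.8))
```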
Moves left head
MLH_CHANNELS and FC_SIZE are parameters.
- Convolution from FILTERS×8×8 to MLH_CHANNELS×8×8.
- Fully connected from MLH_CHANNELS×8×8 to a vector of size FC_SIZE.
- ReLU
- Fully connected from a vector of size FC_SIZE to a scalar
- ReLU
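The part of the moves left head after the convolution is two fully connected layers, each followed by ReLU. The following sketch uses toy sizes in place of MLH_CHANNELS×8×8 and FC_SIZE. The helper names and the weight layout (one row per output) are assumptions made for illustration.

```python
def dense(x, weights, bias):
    """Fully connected layer: MatMul plus bias. weights has one row per output."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def moves_left_tail(conv_out_flat, w1, b1, w2, b2):
    """Flattened convolution output -> FC_SIZE vector -> scalar, with ReLU after each layer."""
    hidden = relu(dense(conv_out_flat, w1, b1))
    return relu(dense(hidden, w2, b2))[0]


# Toy sizes: 2 inputs instead of MLH_CHANNELS*64, FC_SIZE = 2.
conv_out_flat = [1.0, 2.0]
w1, b1 = [[1.0, 1.0], [-1.0, -1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 5.0]], [-1.0]
moves_left = moves_left_tail(conv_out_flat, w1, b1, w2, b2)
print(moves_left)
```

Because the last step is a ReLU, the output of this head is never negative.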
[1] AlphaGo Zero: https://deepmind.com/research/publications/mastering-game-go-without-human-knowledge (scroll down for the paper link).
[2] AlphaZero: https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go (scroll down for the paper link).
[3] Squeeze and Excitation networks: https://arxiv.org/abs/1709.01507