Transformer cost curves
Created by: guoshengCS
Fluid and PyTorch both use the model and training parameters below (dropout_op currently has a bug, so dropout is disabled here for now) together with the same initialization method; with these settings, the Transformer trained on the WMT'16 dataset yields the cost-curve comparison shown in the figure.
# number of sequences contained in a mini-batch.
batch_size = 64
# the hyper-parameters for the Adam optimizer.
learning_rate = 0.001
beta1 = 0.9
beta2 = 0.98
eps = 1e-9
# the parameters for learning rate scheduling.
warmup_steps = 4000
# sizes of the source and target vocabularies.
src_vocab_size = 2909
trg_vocab_size = 3149
# the dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
d_model = 512
# size of the hidden layer in position-wise feed-forward networks.
d_inner_hid = 1024
# the dimension that keys are projected to for dot-product attention.
d_key = 64
# the dimension that values are projected to for dot-product attention.
d_value = 64
# number of heads used in multi-head attention.
n_head = 8
# number of layers to be stacked in the encoder and decoder.
n_layer = 6
# dropout rate used by all dropout layers.
dropout = 0.
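For reference, warmup_steps above corresponds to the learning-rate schedule from the original Transformer paper, where the rate grows linearly for the first warmup_steps updates and then decays with the inverse square root of the step number. The sketch below is a minimal plain-Python illustration of that schedule, not the exact code used in either framework; the function name noam_lr is made up here, and whether the base learning_rate = 0.001 is additionally multiplied onto this factor depends on the implementation.

# Sketch of the warmup schedule implied by warmup_steps above:
#   lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
def noam_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero on the very first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the rate rises until roughly step == warmup_steps, then decays.
for s in (1, 1000, 4000, 20000):
    print(s, noam_lr(s))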