Transformer cost curves
Created by: guoshengCS
Fluid and PyTorch both use the model and training parameters below (dropout_op currently has a bug, so dropout is disabled here for now) together with the same initialization method; with these settings, the Transformer trained on the WMT'16 dataset yields the cost-curve comparison shown in the figure.
# number of sequences contained in a mini-batch.
batch_size = 64
# the hyper-parameters for the Adam optimizer.
learning_rate = 0.001
beta1 = 0.9
beta2 = 0.98
eps = 1e-9
# the parameters for learning rate scheduling.
warmup_steps = 4000
# sizes of the source and target vocabularies.
src_vocab_size = 2909
trg_vocab_size = 3149
# the dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
d_model = 512
# size of the hidden layer in position-wise feed-forward networks.
d_inner_hid = 1024
# the dimension that keys are projected to for dot-product attention.
d_key = 64
# the dimension that values are projected to for dot-product attention.
d_value = 64
# number of heads used in multi-head attention.
n_head = 8
# number of layers to be stacked in the encoder and decoder.
n_layer = 6
# dropout rate used by all dropout layers.
dropout = 0.
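For reference, warmup_steps above corresponds to the learning-rate schedule from the original Transformer paper, where the rate grows linearly for the first warmup_steps updates and then decays with the inverse square root of the step number. The sketch below is a minimal plain-Python illustration of that schedule, not the exact code used in either framework; the function name noam_lr is made up here, and whether the base learning_rate = 0.001 is additionally multiplied onto this factor depends on the implementation.

# Sketch of the warmup schedule implied by warmup_steps above:
#   lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
def noam_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero on the very first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the rate rises until roughly step == warmup_steps, then decays.
for s in (1, 1000, 4000, 20000):
    print(s, noam_lr(s))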