resolve conflict

0033bc54 · LielinJiang · 6c770910 · c88f4570 · 0033bc54 · 0033bc54
53 changed files
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
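+Running the example models in this directory requires PaddlePaddle Fluid 1.7. If your installed PaddlePaddle version is lower than this, please update it by following the instructions in the [installation guide](https://www.paddlepaddle.org.cn/#quick-start).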
+# Sequence to Sequence (Seq2Seq)
+The directory structure of this example, with a brief description of each file, is as follows:
+```
+.
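+├── README.md              # Documentation (this file)
+├── args.py                # Configuration of training, prediction, and model arguments
+├── reader.py              # Data reading
+├── download.py            # Data download script
+├── train.py               # Main training program
+├── predict.py             # Main prediction program
+├── seq2seq_attn.py        # Translation model with attention
+└── seq2seq_base.py        # Translation model without attention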
+```
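+## Introduction
+Sequence to Sequence (Seq2Seq) uses an encoder-decoder structure: the encoder encodes the source sequence into a vector, and the decoder decodes that vector into the target sequence. Seq2Seq is widely used in tasks such as machine translation, dialogue systems, automatic document summarization, and image captioning.
+This directory contains a classic Seq2Seq example, machine translation. It implements a base model (without attention) and a translation model with attention. The Seq2Seq translation model mimics how a human translates: first parse the source language and understand its meaning, then write a sentence in the target language based on that meaning. For the principles and mathematical formulation of machine translation, we recommend the [machine translation case](https://www.paddlepaddle.org.cn/documentation/docs/zh/user_guides/nlp_case/machine_translation/README.cn.html) on the PaddlePaddle website.
+## Model Overview
+For the encoder, this model uses a multi-layer LSTM-based RNN encoder. For the decoder, it uses an RNN decoder with attention, and also provides a decoder without attention for comparison. At prediction time, beam search is used to generate the target sentence.
+## Data
+This tutorial uses the English-to-Vietnamese data from the [IWSLT'15 English-Vietnamese data](https://nlp.stanford.edu/projects/nmt/) set as the training corpus, tst2012 as the development set, and tst2013 as the test set.
+### Getting the Data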
+```
+python download.py
+```
+## Model Training
+Run the following command to train the Seq2Seq machine translation model with attention:
+```sh
+export CUDA_VISIBLE_DEVICES=0
+python train.py \
+    --src_lang en --tar_lang vi \
+    --attention True \
+    --num_layers 2 \
+    --hidden_size 512 \
+    --src_vocab_size 17191 \
+    --tar_vocab_size 7709 \
+    --batch_size 128 \
+    --dropout 0.2 \
+    --init_scale  0.1 \
+    --max_grad_norm 5.0 \
+    --train_data_prefix data/en-vi/train \
+    --eval_data_prefix data/en-vi/tst2012 \
+    --test_data_prefix data/en-vi/tst2013 \
+    --vocab_prefix data/en-vi/vocab \
+    --use_gpu True \
+    --model_path ./attention_models
+```
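+To train the Seq2Seq model without attention, set the `attention` argument to False. See `args.py` for a description of each argument. The training program saves the model once at the end of every epoch.
+Training can run in dynamic graph or static graph mode, selected by the `eager_run` argument. Setting it to False trains in static graph mode, as follows: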
+```sh
+export CUDA_VISIBLE_DEVICES=0
+python train.py \
+    --src_lang en --tar_lang vi \
+    --attention True \
+    --num_layers 2 \
+    --hidden_size 512 \
+    --src_vocab_size 17191 \
+    --tar_vocab_size 7709 \
+    --batch_size 128 \
+    --dropout 0.2 \
+    --init_scale  0.1 \
+    --max_grad_norm 5.0 \
+    --train_data_prefix data/en-vi/train \
+    --eval_data_prefix data/en-vi/tst2012 \
+    --test_data_prefix data/en-vi/tst2013 \
+    --vocab_prefix data/en-vi/vocab \
+    --use_gpu True \
+    --model_path ./attention_models \
+    --eager_run False
+```
+## Model Prediction
+After training, use a saved model (specified by `--reload_model`) to run beam search decoding on the test data (specified by `--infer_file`) with the following command:
+```sh
+export CUDA_VISIBLE_DEVICES=0
+python predict.py \
+    --attention True \
+    --src_lang en --tar_lang vi \
+    --num_layers 2 \
+    --hidden_size 512 \
+    --src_vocab_size 17191 \
+    --tar_vocab_size 7709 \
+    --batch_size 128 \
+    --dropout 0.2 \
+    --init_scale  0.1 \
+    --max_grad_norm 5.0 \
+    --vocab_prefix data/en-vi/vocab \
+    --infer_file data/en-vi/tst2013.en \
+    --reload_model attention_models/10 \
+    --infer_output_file infer_output.txt \
+    --beam_size 10 \
+    --use_gpu True
+```
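+See `args.py` for a description of each argument. Note that the model hyperparameters used for prediction must match those used for training. As with training, prediction can also run in static graph mode, as follows: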
+```sh
+export CUDA_VISIBLE_DEVICES=0
+python predict.py \
+    --attention True \
+    --src_lang en --tar_lang vi \
+    --num_layers 2 \
+    --hidden_size 512 \
+    --src_vocab_size 17191 \
+    --tar_vocab_size 7709 \
+    --batch_size 128 \
+    --dropout 0.2 \
+    --init_scale  0.1 \
+    --max_grad_norm 5.0 \
+    --vocab_prefix data/en-vi/vocab \
+    --infer_file data/en-vi/tst2013.en \
+    --reload_model attention_models/10 \
+    --infer_output_file infer_output.txt \
+    --beam_size 10 \
+    --use_gpu True \
+    --eager_run False  
+```
+## Evaluation
+Use the [*multi-bleu.perl*](https://github.com/moses-smt/mosesdecoder.git) script to evaluate the translation quality of the model's predictions, as follows:
+```sh
+mosesdecoder/scripts/generic/multi-bleu.perl tst2013.vi < infer_output.txt
+```
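+Each model was trained 10 times. For each run, the model saved at the 10th epoch was used for prediction with beam_size=10. The results are shown below (the 10 results of each set are sorted in ascending order for readability):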
+```
+> no attention
+tst2012 BLEU:
+[10.75 10.85 10.9  10.94 10.97 11.01 11.01 11.04 11.13 11.4]
+tst2013 BLEU:
+[10.71 10.71 10.74 10.76 10.91 10.94 11.02 11.16 11.21 11.44]
+> with attention
+tst2012 BLEU:
+[21.14 22.34 22.54 22.65 22.71 22.71 23.08 23.15 23.3  23.4]
+tst2013 BLEU:
+[23.41 24.79 25.11 25.12 25.19 25.24 25.39 25.61 25.61 25.63]
+```
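To compare the two models at a glance, the sorted per-run scores above can be reduced to a mean and a range. The sketch below only reuses the BLEU numbers listed in the README; the `summarize` helper and the dictionary layout are illustrative names, not part of this change.

```python
from statistics import mean

# BLEU scores copied from the README table (10 runs each, sorted ascending).
BLEU = {
    ("no attention", "tst2012"):
        [10.75, 10.85, 10.9, 10.94, 10.97, 11.01, 11.01, 11.04, 11.13, 11.4],
    ("no attention", "tst2013"):
        [10.71, 10.71, 10.74, 10.76, 10.91, 10.94, 11.02, 11.16, 11.21, 11.44],
    ("with attention", "tst2012"):
        [21.14, 22.34, 22.54, 22.65, 22.71, 22.71, 23.08, 23.15, 23.3, 23.4],
    ("with attention", "tst2013"):
        [23.41, 24.79, 25.11, 25.12, 25.19, 25.24, 25.39, 25.61, 25.61, 25.63],
}


def summarize(scores):
    """Return (mean rounded to 2 decimals, min, max) for a list of scores."""
    return round(mean(scores), 2), min(scores), max(scores)


for (model, dataset), scores in BLEU.items():
    avg, low, high = summarize(scores)
    print("%-15s %s  mean=%.2f  range=[%.2f, %.2f]"
          % (model, dataset, avg, low, high))
```

A mean with its range is a coarse summary; with only 10 runs per setting, the full sorted lists in the README remain the more informative view.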
--- a/examples/seq2seq/args.py
+++ b/examples/seq2seq/args.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import argparse
+import distutils.util
+def parse_args():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--train_data_prefix", type=str, help="file prefix for train data")
+    parser.add_argument(
+        "--eval_data_prefix", type=str, help="file prefix for eval data")
+    parser.add_argument(
+        "--test_data_prefix", type=str, help="file prefix for test data")
+    parser.add_argument(
+        "--vocab_prefix", type=str, help="file prefix for vocab")
+    parser.add_argument("--src_lang", type=str, help="source language suffix")
+    parser.add_argument("--tar_lang", type=str, help="target language suffix")
+    parser.add_argument(
+        "--attention",
+        type=eval,
+        default=False,
+        help="Whether to use the attention model")
+    parser.add_argument(
+        "--optimizer",
+        type=str,
+        default='adam',
+        help="optimizer to use, only support [sgd|adam]")
+    parser.add_argument(
+        "--learning_rate",
+        type=float,
+        default=0.001,
+        help="learning rate for optimizer")
+    parser.add_argument(
+        "--num_layers",
+        type=int,
+        default=1,
+        help="layers number of encoder and decoder")
+    parser.add_argument(
+        "--hidden_size",
+        type=int,
+        default=100,
+        help="hidden size of encoder and decoder")
+    parser.add_argument("--src_vocab_size", type=int, help="source vocab size")
+    parser.add_argument("--tar_vocab_size", type=int, help="target vocab size")
+    parser.add_argument(
+        "--batch_size", type=int, help="batch size of each step")
+    parser.add_argument(
+        "--max_epoch", type=int, default=12, help="max epoch for the training")
+    parser.add_argument(
+        "--max_len",
+        type=int,
+        default=50,
+        help="max length for source and target sentence")
+    parser.add_argument(
+        "--dropout", type=float, default=0.0, help="drop probability")
+    parser.add_argument(
+        "--init_scale",
+        type=float,
+        default=0.0,
+        help="init scale for parameter")
+    parser.add_argument(
+        "--max_grad_norm",
+        type=float,
+        default=5.0,
+        help="max grad norm for global norm clip")
+    parser.add_argument(
+        "--log_freq",
+        type=int,
+        default=100,
+        help="The frequency to print training logs")
+    parser.add_argument(
+        "--model_path",
+        type=str,
+        default='model',
+        help="model path for model to save")
+    parser.add_argument(
+        "--reload_model", type=str, help="reload model to inference")
+    parser.add_argument(
+        "--infer_file", type=str, help="file name for inference")
+    parser.add_argument(
+        "--infer_output_file",
+        type=str,
+        default='infer_output',
+        help="file name for inference output")
+    parser.add_argument(
+        "--beam_size", type=int, default=10, help="beam size for beam search")
+    parser.add_argument(
+        '--use_gpu',
+        type=eval,
+        default=False,
+        help='Whether using gpu [True|False]')
+    parser.add_argument(
+        '--eager_run', type=eval, default=False, help='Whether to use dygraph')
+    parser.add_argument(
+        "--enable_ce",
+        action='store_true',
+        help="The flag indicating whether to run the task "
+        "for continuous evaluation.")
+    parser.add_argument(
+        "--profile", action='store_true', help="Whether to enable profiling.")
+    # NOTE: profiler args, used for benchmark
+    parser.add_argument(
+        "--profiler_path",
+        type=str,
+        default='./seq2seq.profile',
+        help="the profiler output file path. (used for benchmark)")
+    args = parser.parse_args()
+    return args
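The boolean flags above (`--attention`, `--use_gpu`, `--eager_run`) are parsed with `type=eval`, which evaluates whatever text is passed on the command line as Python. A minimal sketch of a stricter alternative follows. `str2bool` and `build_parser` are hypothetical helpers, not part of this change, and only standard `argparse` calls are used.

```python
import argparse


def str2bool(value):
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    text = value.strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    # argparse turns this exception into a normal usage error message.
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    """Build a parser holding only the three boolean flags (illustrative)."""
    parser = argparse.ArgumentParser(description="boolean flag sketch")
    for flag in ("--attention", "--use_gpu", "--eager_run"):
        parser.add_argument(flag, type=str2bool, default=False)
    return parser


args = build_parser().parse_args(
    ["--attention", "True", "--eager_run", "False"])
print(args.attention, args.use_gpu, args.eager_run)
```

The command lines in the README (`--attention True`, `--use_gpu True`) would parse the same way under this sketch, while an arbitrary expression would be rejected instead of evaluated.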
--- a/examples/seq2seq/download.py
+++ b/examples/seq2seq/download.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the 'License');
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an 'AS IS' BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''
+Script for downloading training data.
+'''
+import os
+import urllib
+import sys
+if sys.version_info >= (3, 0):
+    import urllib.request
+import zipfile
+URLLIB = urllib
+if sys.version_info >= (3, 0):
+    URLLIB = urllib.request
+remote_path = 'https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi'
+base_path = 'data'
+tar_path = os.path.join(base_path, 'en-vi')
+filenames = [
+    'train.en', 'train.vi', 'tst2012.en', 'tst2012.vi', 'tst2013.en',
+    'tst2013.vi', 'vocab.en', 'vocab.vi'
+]
+def main(arguments):
+    print("Downloading data......")
+    if not os.path.exists(tar_path):
+        if not os.path.exists(base_path):
+            os.mkdir(base_path)
+        os.mkdir(tar_path)
+    for filename in filenames:
+        url = remote_path + '/' + filename
+        tar_file = os.path.join(tar_path, filename)
+        URLLIB.urlretrieve(url, tar_file)
+    print("Download succeeded......")
+if __name__ == '__main__':
+    sys.exit(main(sys.argv[1:]))
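The download loop above joins one remote URL and one local path per file name. The sketch below isolates that mapping so it can be checked without network access. `build_download_plan` and `download_all` are hypothetical helpers, not part of this change, and the sketch assumes Python 3 only, using `os.makedirs` and `urllib.request.urlretrieve` from the standard library.

```python
import os
import urllib.request

REMOTE_PATH = "https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi"
FILENAMES = [
    "train.en", "train.vi", "tst2012.en", "tst2012.vi", "tst2013.en",
    "tst2013.vi", "vocab.en", "vocab.vi"
]


def build_download_plan(remote_path, target_dir, filenames):
    """Return one (url, local_path) pair per file name."""
    return [(remote_path + "/" + name, os.path.join(target_dir, name))
            for name in filenames]


def download_all(plan):
    """Fetch every (url, local_path) pair, creating directories as needed."""
    for url, local_path in plan:
        directory = os.path.dirname(local_path)
        if directory:
            # exist_ok avoids the nested os.path.exists checks.
            os.makedirs(directory, exist_ok=True)
        urllib.request.urlretrieve(url, local_path)


plan = build_download_plan(REMOTE_PATH, os.path.join("data", "en-vi"),
                           FILENAMES)
print(plan[0])
# download_all(plan)  # uncomment to actually fetch the files
```

Separating the plan from the fetch keeps the URL construction testable offline; the network call itself is left commented out here.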
--- a/examples/seq2seq/predict.py
+++ b/examples/seq2seq/predict.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import os
+import io
+import random
+from functools import partial
+import numpy as np
+import paddle.fluid as fluid
+from paddle.fluid.layers.utils import flatten
+from paddle.fluid.io import DataLoader
+from hapi.model import Input, set_device
+from args import parse_args
+from seq2seq_base import BaseInferModel
+from seq2seq_attn import AttentionInferModel
+from reader import Seq2SeqDataset, Seq2SeqBatchSampler, SortType, prepare_infer_input
+def post_process_seq(seq, bos_idx, eos_idx, output_bos=False,
+                     output_eos=False):
+    """
+    Post-process the decoded sequence.
+    """
+    eos_pos = len(seq) - 1
+    for i, idx in enumerate(seq):
+        if idx == eos_idx:
+            eos_pos = i
+            break
+    seq = [
+        idx for idx in seq[:eos_pos + 1]
+        if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
+    ]
+    return seq
+def do_predict(args):
+    device = set_device("gpu" if args.use_gpu else "cpu")
+    if args.eager_run:
+        fluid.enable_dygraph(device)
+    # define model
+    inputs = [
+        Input(
+            [None, None], "int64", name="src_word"),
+        Input(
+            [None], "int64", name="src_length"),
+    ]
+    # define dataloader
+    dataset = Seq2SeqDataset(
+        fpattern=args.infer_file,
+        src_vocab_fpath=args.vocab_prefix + "." + args.src_lang,
+        trg_vocab_fpath=args.vocab_prefix + "." + args.tar_lang,
+        token_delimiter=None,
+        start_mark="<s>",
+        end_mark="</s>",
+        unk_mark="<unk>")
+    trg_idx2word = Seq2SeqDataset.load_dict(
+        dict_path=args.vocab_prefix + "." + args.tar_lang, reverse=True)
+    (args.src_vocab_size, args.tar_vocab_size, bos_id, eos_id,
+     unk_id) = dataset.get_vocab_summary()
+    batch_sampler = Seq2SeqBatchSampler(
+        dataset=dataset, use_token_batch=False, batch_size=args.batch_size)
+    data_loader = DataLoader(
+        dataset=dataset,
+        batch_sampler=batch_sampler,
+        places=device,
+        feed_list=None
+        if fluid.in_dygraph_mode() else [x.forward() for x in inputs],
+        collate_fn=partial(
+            prepare_infer_input, bos_id=bos_id, eos_id=eos_id, pad_id=eos_id),
+        num_workers=0,
+        return_list=True)
+    model_maker = AttentionInferModel if args.attention else BaseInferModel
+    model = model_maker(
+        args.src_vocab_size,
+        args.tar_vocab_size,
+        args.hidden_size,
+        args.hidden_size,
+        args.num_layers,
+        args.dropout,
+        bos_id=bos_id,
+        eos_id=eos_id,
+        beam_size=args.beam_size,
+        max_out_len=256)
+    model.prepare(inputs=inputs)
+    # load the trained model
+    assert args.reload_model, (
+        "Please set reload_model to load the infer model.")
+    model.load(args.reload_model)
+    # TODO(guosheng): use model.predict when support variant length
+    with io.open(args.infer_output_file, 'w', encoding='utf-8') as f:
+        for data in data_loader():
+            finished_seq = model.test_batch(inputs=flatten(data))[0]
+            finished_seq = finished_seq[:, :, np.newaxis] if len(
+                finished_seq.shape) == 2 else finished_seq
+            finished_seq = np.transpose(finished_seq, [0, 2, 1])
+            for ins in finished_seq:
+                for beam_idx, beam in enumerate(ins):
+                    id_list = post_process_seq(beam, bos_id, eos_id)
+                    word_list = [trg_idx2word[id] for id in id_list]
+                    sequence = " ".join(word_list) + "\n"
+                    f.write(sequence)
+                    break
+if __name__ == "__main__":
+    args = parse_args()
+    do_predict(args)
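To make the post-processing step concrete, the sketch below repeats the truncate-at-first-EOS and filter logic of `post_process_seq` on a toy beam. The token ids (0 for `<s>`, 1 for `</s>`) and the tiny vocabulary are made up for illustration; real ids come from the vocab files.

```python
def post_process_seq(seq, bos_idx, eos_idx, output_bos=False,
                     output_eos=False):
    """Cut the sequence at the first EOS and optionally drop BOS/EOS."""
    eos_pos = len(seq) - 1
    for i, idx in enumerate(seq):
        if idx == eos_idx:
            eos_pos = i
            break
    return [
        idx for idx in seq[:eos_pos + 1]
        if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
    ]


# Toy ids and vocabulary, assumed for this example only.
BOS_ID, EOS_ID = 0, 1
idx2word = {0: "<s>", 1: "</s>", 2: "hello", 3: "world"}

# Everything after the first EOS (index 3) is discarded.
beam = [0, 2, 3, 1, 3, 1]
ids = post_process_seq(beam, BOS_ID, EOS_ID)
print(" ".join(idx2word[i] for i in ids))
```

If a beam never emits EOS, the whole sequence is kept, since `eos_pos` starts at the last index.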
--- a/examples/seq2seq/reader.py
+++ b/examples/seq2seq/reader.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import glob
+import six
+import os
+import io
+import itertools
+from functools import partial
+import numpy as np
+import paddle.fluid as fluid
+from paddle.fluid.dygraph.parallel import ParallelEnv
+from paddle.fluid.io import BatchSampler, DataLoader, Dataset
+def create_data_loader(args, device, for_train=True):
+    data_loaders = [None, None]
+    data_prefixes = [args.train_data_prefix, args.eval_data_prefix
+                     ] if args.eval_data_prefix else [args.train_data_prefix]
+    for i, data_prefix in enumerate(data_prefixes):
+        dataset = Seq2SeqDataset(
+            fpattern=data_prefix + "." + args.src_lang,
+            trg_fpattern=data_prefix + "." + args.tar_lang,
+            src_vocab_fpath=args.vocab_prefix + "." + args.src_lang,
+            trg_vocab_fpath=args.vocab_prefix + "." + args.tar_lang,
+            token_delimiter=None,
+            start_mark="<s>",
+            end_mark="</s>",
+            unk_mark="<unk>",
+            max_length=args.max_len if i == 0 else None,
+            truncate=True,
+            trg_add_bos_eos=True)
+        (args.src_vocab_size, args.tar_vocab_size, bos_id, eos_id,
+         unk_id) = dataset.get_vocab_summary()
+        batch_sampler = Seq2SeqBatchSampler(
+            dataset=dataset,
+            use_token_batch=False,
+            batch_size=args.batch_size,
+            pool_size=args.batch_size * 20,
+            sort_type=SortType.POOL,
+            shuffle=False if args.enable_ce else True,
+            distribute_mode=True if i == 0 else False)
+        data_loader = DataLoader(
+            dataset=dataset,
+            batch_sampler=batch_sampler,
+            places=device,
+            collate_fn=partial(
+                prepare_train_input,
+                bos_id=bos_id,
+                eos_id=eos_id,
+                pad_id=eos_id),
+            num_workers=0,
+            return_list=True)
+        data_loaders[i] = data_loader
+    return data_loaders
+def prepare_train_input(insts, bos_id, eos_id, pad_id):
+    src, src_length = pad_batch_data(
+        [inst[0] for inst in insts], pad_id=pad_id)
+    trg, trg_length = pad_batch_data(
+        [inst[1] for inst in insts], pad_id=pad_id)
+    trg_length = trg_length - 1
+    return src, src_length, trg[:, :-1], trg_length, trg[:, 1:, np.newaxis]
+def prepare_infer_input(insts, bos_id, eos_id, pad_id):
+    src, src_length = pad_batch_data(insts, pad_id=pad_id)
+    return src, src_length
+def pad_batch_data(insts, pad_id):
+    """
+    Pad the instances to the max sequence length in the batch, and return
+    the padded data together with the original length of each instance.
+    """
+    inst_lens = np.array([len(inst) for inst in insts], dtype="int64")
+    max_len = np.max(inst_lens)
+    inst_data = np.array(
+        [inst + [pad_id] * (max_len - len(inst)) for inst in insts],
+        dtype="int64")
+    return inst_data, inst_lens
+class SortType(object):
+    GLOBAL = 'global'
+    POOL = 'pool'
+    NONE = "none"
+class Converter(object):
+    def __init__(self, vocab, beg, end, unk, delimiter, add_beg, add_end):
+        self._vocab = vocab
+        self._beg = beg
+        self._end = end
+        self._unk = unk
+        self._delimiter = delimiter
+        self._add_beg = add_beg
+        self._add_end = add_end
+    def __call__(self, sentence):
+        return ([self._beg] if self._add_beg else []) + [
+            self._vocab.get(w, self._unk)
+            for w in sentence.split(self._delimiter)
+        ] + ([self._end] if self._add_end else [])
+class ComposedConverter(object):
+    def __init__(self, converters):
+        self._converters = converters
+    def __call__(self, fields):
+        return [
+            converter(field)
+            for field, converter in zip(fields, self._converters)
+        ]
+class SentenceBatchCreator(object):
+    def __init__(self, batch_size):
+        self.batch = []
+        self._batch_size = batch_size
+    def append(self, info):
+        self.batch.append(info)
+        if len(self.batch) == self._batch_size:
+            tmp = self.batch
+            self.batch = []
+            return tmp
+class TokenBatchCreator(object):
+    def __init__(self, batch_size):
+        self.batch = []
+        self.max_len = -1
+        self._batch_size = batch_size
+    def append(self, info):
+        cur_len = info.max_len
+        max_len = max(self.max_len, cur_len)
+        if max_len * (len(self.batch) + 1) > self._batch_size:
+            result = self.batch
+            self.batch = [info]
+            self.max_len = cur_len
+            return result
+        else:
+            self.max_len = max_len
+            self.batch.append(info)
+class SampleInfo(object):
+    def __init__(self, i, lens):
+        self.i = i
+        self.lens = lens
+        self.max_len = lens[0]  # to be consistent with the original reader
+    def get_ranges(self, min_length=None, max_length=None, truncate=False):
+        ranges = []
+        # source
+        if (min_length is None or self.lens[0] >= min_length) and (
+                max_length is None or self.lens[0] <= max_length or truncate):
+            end = max_length if truncate and max_length else self.lens[0]
+            ranges.append([0, end])
+        # target
+        if len(self.lens) == 2:
+            if (min_length is None or self.lens[1] >= min_length) and (
+                    max_length is None or self.lens[1] <= max_length + 2 or
+                    truncate):
+                end = max_length + 2 if truncate and max_length else self.lens[
+                    1]
+                ranges.append([0, end])
+        return ranges if len(ranges) == len(self.lens) else None
+class MinMaxFilter(object):
+    def __init__(self, max_len, min_len, underlying_creator):
+        self._min_len = min_len
+        self._max_len = max_len
+        self._creator = underlying_creator
+    def append(self, info):
+        if (self._min_len is None or info.min_len >= self._min_len) and (
+                self._max_len is None or info.max_len <= self._max_len):
+            return self._creator.append(info)
+    @property
+    def batch(self):
+        return self._creator.batch
+class Seq2SeqDataset(Dataset):
+    def __init__(self,
+                 src_vocab_fpath,
+                 trg_vocab_fpath,
+                 fpattern,
+                 field_delimiter="\t",
+                 token_delimiter=" ",
+                 start_mark="<s>",
+                 end_mark="<e>",
+                 unk_mark="<unk>",
+                 trg_fpattern=None,
+                 trg_add_bos_eos=False,
+                 byte_data=False,
+                 min_length=None,
+                 max_length=None,
+                 truncate=False):
+        if byte_data:
+            # The WMT16 BPE data used here seems to include bytes that cannot
+            # be decoded as UTF-8. Thus convert str to bytes and use byte data.
+            field_delimiter = field_delimiter.encode("utf8")
+            token_delimiter = token_delimiter.encode("utf8")
+            start_mark = start_mark.encode("utf8")
+            end_mark = end_mark.encode("utf8")
+            unk_mark = unk_mark.encode("utf8")
+        self._byte_data = byte_data
+        self._src_vocab = self.load_dict(src_vocab_fpath, byte_data=byte_data)
+        self._trg_vocab = self.load_dict(trg_vocab_fpath, byte_data=byte_data)
+        self._bos_idx = self._src_vocab[start_mark]
+        self._eos_idx = self._src_vocab[end_mark]
+        self._unk_idx = self._src_vocab[unk_mark]
+        self._field_delimiter = field_delimiter
+        self._token_delimiter = token_delimiter
+        self._min_length = min_length
+        self._max_length = max_length
+        self._truncate = truncate
+        self._trg_add_bos_eos = trg_add_bos_eos
+        self.load_src_trg_ids(fpattern, trg_fpattern)
+    def load_src_trg_ids(self, fpattern, trg_fpattern=None):
+        src_converter = Converter(
+            vocab=self._src_vocab,
+            beg=self._bos_idx,
+            end=self._eos_idx,
+            unk=self._unk_idx,
+            delimiter=self._token_delimiter,
+            add_beg=False,
+            add_end=False)
+        trg_converter = Converter(
+            vocab=self._trg_vocab,
+            beg=self._bos_idx,
+            end=self._eos_idx,
+            unk=self._unk_idx,
+            delimiter=self._token_delimiter,
+            add_beg=True if self._trg_add_bos_eos else False,
+            add_end=True if self._trg_add_bos_eos else False)
+        converters = ComposedConverter([src_converter, trg_converter])
+        self._src_seq_ids = []
+        self._trg_seq_ids = []
+        self._sample_infos = []
+        slots = [self._src_seq_ids, self._trg_seq_ids]
+        for i, line in enumerate(self._load_lines(fpattern, trg_fpattern)):
+            fields = converters(line)
+            lens = [len(field) for field in fields]
+            sample = SampleInfo(i, lens)
+            field_ranges = sample.get_ranges(self._min_length,
+                                             self._max_length, self._truncate)
+            if field_ranges:
+                for field, field_range, slot in zip(fields, field_ranges,
+                                                    slots):
+                    slot.append(field[field_range[0]:field_range[1]])
+                self._sample_infos.append(sample)
+    def _load_lines(self, fpattern, trg_fpattern=None):
+        fpaths = glob.glob(fpattern)
+        fpaths = sorted(fpaths)  # TODO: Add custom sort
+        assert len(fpaths) > 0, "no matching file to the provided data path"
+        (f_mode, f_encoding,
+         endl) = ("rb", None, b"\n") if self._byte_data else ("r", "utf8",
+                                                              "\n")
+        if trg_fpattern is None:
+            for fpath in fpaths:
+                with io.open(fpath, f_mode, encoding=f_encoding) as f:
+                    for line in f:
+                        fields = line.strip(endl).split(self._field_delimiter)
+                        yield fields
+        else:
+            # separated source and target language data files
+            # assume we can get aligned data by sorting the two language files
+            # TODO: Need more rigorous check
+            trg_fpaths = glob.glob(trg_fpattern)
+            trg_fpaths = sorted(trg_fpaths)
+            assert len(fpaths) == len(trg_fpaths), (
+                "the number of source language data files must equal "
+                "that of target language data files")
+            for fpath, trg_fpath in zip(fpaths, trg_fpaths):
+                with io.open(fpath, f_mode, encoding=f_encoding) as f:
+                    with io.open(
+                            trg_fpath, f_mode, encoding=f_encoding) as trg_f:
+                        for line in zip(f, trg_f):
+                            fields = [field.strip(endl) for field in line]
+                            yield fields
+    @staticmethod
+    def load_dict(dict_path, reverse=False, byte_data=False):
+        word_dict = {}
+        (f_mode, f_encoding,
+         endl) = ("rb", None, b"\n") if byte_data else ("r", "utf8", "\n")
+        with io.open(dict_path, f_mode, encoding=f_encoding) as fdict:
+            for idx, line in enumerate(fdict):
+                if reverse:
+                    word_dict[idx] = line.strip(endl)
+                else:
+                    word_dict[line.strip(endl)] = idx
+        return word_dict
+    def get_vocab_summary(self):
+        return len(self._src_vocab), len(
+            self._trg_vocab), self._bos_idx, self._eos_idx, self._unk_idx
+    def __getitem__(self, idx):
+        return (self._src_seq_ids[idx], self._trg_seq_ids[idx]
+                ) if self._trg_seq_ids else self._src_seq_ids[idx]
+    def __len__(self):
+        return len(self._sample_infos)
+class Seq2SeqBatchSampler(BatchSampler):
+    def __init__(self,
+                 dataset,
+                 batch_size,
+                 pool_size=10000,
+                 sort_type=SortType.NONE,
+                 min_length=None,
+                 max_length=None,
+                 shuffle=False,
+                 shuffle_batch=False,
+                 use_token_batch=False,
+                 clip_last_batch=False,
+                 distribute_mode=True,
+                 seed=0):
+        for arg, value in locals().items():
+            if arg != "self":
+                setattr(self, "_" + arg, value)
+        self._random = np.random
+        self._random.seed(seed)
+        # for multi-devices
+        self._distribute_mode = distribute_mode
+        self._nranks = ParallelEnv().nranks
+        self._local_rank = ParallelEnv().local_rank
+        self._device_id = ParallelEnv().dev_id
+    def __iter__(self):
+        # global sort or global shuffle
+        if self._sort_type == SortType.GLOBAL:
+            infos = sorted(
+                self._dataset._sample_infos, key=lambda x: x.max_len)
+        else:
+            if self._shuffle:
+                infos = self._dataset._sample_infos
+                self._random.shuffle(infos)
+            else:
+                infos = self._dataset._sample_infos
+            if self._sort_type == SortType.POOL:
+                reverse = True
+                for i in range(0, len(infos), self._pool_size):
+                    # to avoid placing short next to long sentences
+                    reverse = False  # not reverse
+                    infos[i:i + self._pool_size] = sorted(
+                        infos[i:i + self._pool_size],
+                        key=lambda x: x.max_len,
+                        reverse=reverse)
+        batches = []
+        batch_creator = TokenBatchCreator(
+            self.
+            _batch_size) if self._use_token_batch else SentenceBatchCreator(
+                self._batch_size * self._nranks)
+        batch_creator = MinMaxFilter(self._max_length, self._min_length,
+                                     batch_creator)
+        for info in infos:
+            batch = batch_creator.append(info)
+            if batch is not None:
+                batches.append(batch)
+        if not self._clip_last_batch and len(batch_creator.batch) != 0:
+            batches.append(batch_creator.batch)
+        if self._shuffle_batch:
+            self._random.shuffle(batches)
+        if not self._use_token_batch:
+            # when producing batches by sequence count, to ensure that
+            # neighboring batches, which are fed and run in parallel, have
+            # similar length (thus similar computational cost) after shuffling,
+            # we treat them as a whole when shuffling and split them here
+            batches = [[
+                batch[self._batch_size * i:self._batch_size * (i + 1)]
+                for i in range(self._nranks)
+            ] for batch in batches]
+            batches = list(itertools.chain.from_iterable(batches))
+        # for multi-device
+        for batch_id, batch in enumerate(batches):
+            if not self._distribute_mode or (
+                    batch_id % self._nranks == self._local_rank):
+                batch_indices = [info.i for info in batch]
+                yield batch_indices
+        if self._distribute_mode and len(batches) % self._nranks != 0:
+            if self._local_rank >= len(batches) % self._nranks:
+                # use previous data to pad
+                yield batch_indices
+    def __len__(self):
+        if not self._use_token_batch:
+            batch_number = (
+                len(self._dataset) + self._batch_size * self._nranks - 1) // (
+                    self._batch_size * self._nranks)
+        else:
+            # TODO(guosheng): fix the uncertain length
+            batch_number = 1
+        return batch_number
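The per-rank selection and padding at the end of the sampler's `__iter__` can be illustrated in isolation. The following is a minimal sketch, not the sampler itself: `shard_batches` is a hypothetical helper name, and the `last is not None` guard is an added assumption to keep the sketch safe when a rank receives no batch at all.

```python
def shard_batches(batches, nranks, local_rank):
    """Yield the batches assigned to `local_rank` (hypothetical helper).

    Batches are dealt out round-robin. If the number of batches is not a
    multiple of `nranks`, the ranks that would otherwise come up one batch
    short repeat their previous batch so that every rank yields equally often.
    """
    last = None
    for batch_id, batch in enumerate(batches):
        if batch_id % nranks == local_rank:
            last = batch
            yield batch
    remainder = len(batches) % nranks
    if remainder != 0 and local_rank >= remainder and last is not None:
        # Pad with this rank's previous batch (assumption: at least one exists).
        yield last


batches = [[0], [1], [2], [3], [4]]
rank0 = list(shard_batches(batches, nranks=2, local_rank=0))
rank1 = list(shard_batches(batches, nranks=2, local_rank=1))
```

With five batches and two ranks, rank 0 receives three batches and rank 1 receives two plus one repeated batch, so both ranks run the same number of steps.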
--- a/examples/seq2seq/seq2seq_attn.py
+++ b/examples/seq2seq/seq2seq_attn.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+from paddle.fluid import ParamAttr
+from paddle.fluid.initializer import UniformInitializer
+from paddle.fluid.dygraph import Embedding, Linear, Layer
+from paddle.fluid.layers import BeamSearchDecoder
+from hapi.model import Model, Loss
+from hapi.text import DynamicDecode, RNN, BasicLSTMCell, RNNCell
+from seq2seq_base import Encoder
+class AttentionLayer(Layer):
+    def __init__(self, hidden_size, bias=False, init_scale=0.1):
+        super(AttentionLayer, self).__init__()
+        self.input_proj = Linear(
+            hidden_size,
+            hidden_size,
+            param_attr=ParamAttr(initializer=UniformInitializer(
+                low=-init_scale, high=init_scale)),
+            bias_attr=bias)
+        self.output_proj = Linear(
+            hidden_size + hidden_size,
+            hidden_size,
+            param_attr=ParamAttr(initializer=UniformInitializer(
+                low=-init_scale, high=init_scale)),
+            bias_attr=bias)
+    def forward(self, hidden, encoder_output, encoder_padding_mask):
+        # query = self.input_proj(hidden)
+        encoder_output = self.input_proj(encoder_output)
+        attn_scores = layers.matmul(
+            layers.unsqueeze(hidden, [1]), encoder_output, transpose_y=True)
+        if encoder_padding_mask is not None:
+            attn_scores = layers.elementwise_add(attn_scores,
+                                                 encoder_padding_mask)
+        attn_scores = layers.softmax(attn_scores)
+        attn_out = layers.squeeze(
+            layers.matmul(attn_scores, encoder_output), [1])
+        attn_out = layers.concat([attn_out, hidden], 1)
+        attn_out = self.output_proj(attn_out)
+        return attn_out
+class DecoderCell(RNNCell):
+    def __init__(self,
+                 num_layers,
+                 input_size,
+                 hidden_size,
+                 dropout_prob=0.,
+                 init_scale=0.1):
+        super(DecoderCell, self).__init__()
+        self.dropout_prob = dropout_prob
+        # use add_sublayer to add multi-layers
+        self.lstm_cells = []
+        for i in range(num_layers):
+            self.lstm_cells.append(
+                self.add_sublayer(
+                    "lstm_%d" % i,
+                    BasicLSTMCell(
+                        input_size=input_size + hidden_size
+                        if i == 0 else hidden_size,
+                        hidden_size=hidden_size,
+                        param_attr=ParamAttr(initializer=UniformInitializer(
+                            low=-init_scale, high=init_scale)))))
+        self.attention_layer = AttentionLayer(hidden_size)
+    def forward(self,
+                step_input,
+                states,
+                encoder_output,
+                encoder_padding_mask=None):
+        lstm_states, input_feed = states
+        new_lstm_states = []
+        step_input = layers.concat([step_input, input_feed], 1)
+        for i, lstm_cell in enumerate(self.lstm_cells):
+            out, new_lstm_state = lstm_cell(step_input, lstm_states[i])
+            step_input = layers.dropout(
+                out,
+                self.dropout_prob,
+                dropout_implementation='upscale_in_train'
+            ) if self.dropout_prob > 0 else out
+            new_lstm_states.append(new_lstm_state)
+        out = self.attention_layer(step_input, encoder_output,
+                                   encoder_padding_mask)
+        return out, [new_lstm_states, out]
+class Decoder(Layer):
+    def __init__(self,
+                 vocab_size,
+                 embed_dim,
+                 hidden_size,
+                 num_layers,
+                 dropout_prob=0.,
+                 init_scale=0.1):
+        super(Decoder, self).__init__()
+        self.embedder = Embedding(
+            size=[vocab_size, embed_dim],
+            param_attr=ParamAttr(initializer=UniformInitializer(
+                low=-init_scale, high=init_scale)))
+        self.lstm_attention = RNN(DecoderCell(
+            num_layers, embed_dim, hidden_size, dropout_prob, init_scale),
+                                  is_reverse=False,
+                                  time_major=False)
+        self.output_layer = Linear(
+            hidden_size,
+            vocab_size,
+            param_attr=ParamAttr(initializer=UniformInitializer(
+                low=-init_scale, high=init_scale)),
+            bias_attr=False)
+    def forward(self, target, decoder_initial_states, encoder_output,
+                encoder_padding_mask):
+        inputs = self.embedder(target)
+        decoder_output, _ = self.lstm_attention(
+            inputs,
+            initial_states=decoder_initial_states,
+            encoder_output=encoder_output,
+            encoder_padding_mask=encoder_padding_mask)
+        predict = self.output_layer(decoder_output)
+        return predict
+class AttentionModel(Model):
+    def __init__(self,
+                 src_vocab_size,
+                 trg_vocab_size,
+                 embed_dim,
+                 hidden_size,
+                 num_layers,
+                 dropout_prob=0.,
+                 init_scale=0.1):
+        super(AttentionModel, self).__init__()
+        self.hidden_size = hidden_size
+        self.encoder = Encoder(src_vocab_size, embed_dim, hidden_size,
+                               num_layers, dropout_prob, init_scale)
+        self.decoder = Decoder(trg_vocab_size, embed_dim, hidden_size,
+                               num_layers, dropout_prob, init_scale)
+    def forward(self, src, src_length, trg):
+        # encoder
+        encoder_output, encoder_final_state = self.encoder(src, src_length)
+        # decoder initial states: use input_feed and the structure is
+        # [[h,c] * num_layers, input_feed], consistent with DecoderCell.states
+        decoder_initial_states = [
+            encoder_final_state,
+            self.decoder.lstm_attention.cell.get_initial_states(
+                batch_ref=encoder_output, shape=[self.hidden_size])
+        ]
+        # attention mask to avoid attending to padding positions
+        src_mask = layers.sequence_mask(
+            src_length,
+            maxlen=layers.shape(src)[1],
+            dtype=encoder_output.dtype)
+        encoder_padding_mask = (src_mask - 1.0) * 1e9
+        encoder_padding_mask = layers.unsqueeze(encoder_padding_mask, [1])
+        # decoder with attention
+        predict = self.decoder(trg, decoder_initial_states, encoder_output,
+                               encoder_padding_mask)
+        return predict
+class AttentionInferModel(AttentionModel):
+    def __init__(self,
+                 src_vocab_size,
+                 trg_vocab_size,
+                 embed_dim,
+                 hidden_size,
+                 num_layers,
+                 dropout_prob=0.,
+                 bos_id=0,
+                 eos_id=1,
+                 beam_size=4,
+                 max_out_len=256):
+        args = dict(locals())
+        args.pop("self")
+        args.pop("__class__", None)  # py3
+        self.bos_id = args.pop("bos_id")
+        self.eos_id = args.pop("eos_id")
+        self.beam_size = args.pop("beam_size")
+        self.max_out_len = args.pop("max_out_len")
+        super(AttentionInferModel, self).__init__(**args)
+        # dynamic decoder for inference
+        decoder = BeamSearchDecoder(
+            self.decoder.lstm_attention.cell,
+            start_token=bos_id,
+            end_token=eos_id,
+            beam_size=beam_size,
+            embedding_fn=self.decoder.embedder,
+            output_fn=self.decoder.output_layer)
+        self.beam_search_decoder = DynamicDecode(
+            decoder, max_step_num=max_out_len, is_test=True)
+    def forward(self, src, src_length):
+        # encoding
+        encoder_output, encoder_final_state = self.encoder(src, src_length)
+        # decoder initial states
+        decoder_initial_states = [
+            encoder_final_state,
+            self.decoder.lstm_attention.cell.get_initial_states(
+                batch_ref=encoder_output, shape=[self.hidden_size])
+        ]
+        # attention mask to avoid attending to padding positions
+        src_mask = layers.sequence_mask(
+            src_length,
+            maxlen=layers.shape(src)[1],
+            dtype=encoder_output.dtype)
+        encoder_padding_mask = (src_mask - 1.0) * 1e9
+        encoder_padding_mask = layers.unsqueeze(encoder_padding_mask, [1])
+        # Tile the batch dimension with beam_size
+        encoder_output = BeamSearchDecoder.tile_beam_merge_with_batch(
+            encoder_output, self.beam_size)
+        encoder_padding_mask = BeamSearchDecoder.tile_beam_merge_with_batch(
+            encoder_padding_mask, self.beam_size)
+        # dynamic decoding with beam search
+        rs, _ = self.beam_search_decoder(
+            inits=decoder_initial_states,
+            encoder_output=encoder_output,
+            encoder_padding_mask=encoder_padding_mask)
+        return rs
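The expression `(src_mask - 1.0) * 1e9` in `seq2seq_attn.py` turns a 0/1 sequence mask into an additive mask: real positions get 0 and padded positions get a large negative value, so their softmax weight becomes effectively zero. The sketch below reproduces that idea in plain Python for a single sequence; the function names are illustrative and are not part of the Paddle API.

```python
import math


def additive_padding_mask(length, max_len):
    """Return 0.0 for real positions and -1e9 for padded positions."""
    return [((1.0 if i < length else 0.0) - 1.0) * 1e9 for i in range(max_len)]


def masked_softmax(scores, mask):
    """Softmax over `scores` after adding the additive `mask`."""
    shifted = [s + m for s, m in zip(scores, mask)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


mask = additive_padding_mask(length=2, max_len=4)
weights = masked_softmax([1.0, 1.0, 5.0, 5.0], mask)
```

Even though the padded positions have the highest raw scores in this example, all of the attention weight ends up on the two real positions.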
--- a/examples/seq2seq/seq2seq_base.py
+++ b/examples/seq2seq/seq2seq_base.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+from paddle.fluid import ParamAttr
+from paddle.fluid.initializer import UniformInitializer
+from paddle.fluid.dygraph import Embedding, Linear, Layer
+from paddle.fluid.layers import BeamSearchDecoder
+from hapi.model import Model, Loss
+from hapi.text import DynamicDecode, RNN, BasicLSTMCell, RNNCell
+class CrossEntropyCriterion(Loss):
+    def __init__(self):
+        super(CrossEntropyCriterion, self).__init__()
+    def forward(self, outputs, labels):
+        predict, (trg_length, label) = outputs[0], labels
+        # for target padding mask
+        mask = layers.sequence_mask(
+            trg_length, maxlen=layers.shape(predict)[1], dtype=predict.dtype)
+        cost = layers.softmax_with_cross_entropy(
+            logits=predict, label=label, soft_label=False)
+        masked_cost = layers.elementwise_mul(cost, mask, axis=0)
+        batch_mean_cost = layers.reduce_mean(masked_cost, dim=[0])
+        seq_cost = layers.reduce_sum(batch_mean_cost)
+        return seq_cost
+class EncoderCell(RNNCell):
+    def __init__(self,
+                 num_layers,
+                 input_size,
+                 hidden_size,
+                 dropout_prob=0.,
+                 init_scale=0.1):
+        super(EncoderCell, self).__init__()
+        self.dropout_prob = dropout_prob
+        # use add_sublayer to add multi-layers
+        self.lstm_cells = []
+        for i in range(num_layers):
+            self.lstm_cells.append(
+                self.add_sublayer(
+                    "lstm_%d" % i,
+                    BasicLSTMCell(
+                        input_size=input_size if i == 0 else hidden_size,
+                        hidden_size=hidden_size,
+                        param_attr=ParamAttr(initializer=UniformInitializer(
+                            low=-init_scale, high=init_scale)))))
+    def forward(self, step_input, states):
+        new_states = []
+        for i, lstm_cell in enumerate(self.lstm_cells):
+            out, new_state = lstm_cell(step_input, states[i])
+            step_input = layers.dropout(
+                out,
+                self.dropout_prob,
+                dropout_implementation='upscale_in_train'
+            ) if self.dropout_prob > 0 else out
+            new_states.append(new_state)
+        return step_input, new_states
+    @property
+    def state_shape(self):
+        return [cell.state_shape for cell in self.lstm_cells]
+class Encoder(Layer):
+    def __init__(self,
+                 vocab_size,
+                 embed_dim,
+                 hidden_size,
+                 num_layers,
+                 dropout_prob=0.,
+                 init_scale=0.1):
+        super(Encoder, self).__init__()
+        self.embedder = Embedding(
+            size=[vocab_size, embed_dim],
+            param_attr=ParamAttr(initializer=UniformInitializer(
+                low=-init_scale, high=init_scale)))
+        self.stack_lstm = RNN(EncoderCell(num_layers, embed_dim, hidden_size,
+                                          dropout_prob, init_scale),
+                              is_reverse=False,
+                              time_major=False)
+    def forward(self, sequence, sequence_length):
+        inputs = self.embedder(sequence)
+        encoder_output, encoder_state = self.stack_lstm(
+            inputs, sequence_length=sequence_length)
+        return encoder_output, encoder_state
+DecoderCell = EncoderCell
+class Decoder(Layer):
+    def __init__(self,
+                 vocab_size,
+                 embed_dim,
+                 hidden_size,
+                 num_layers,
+                 dropout_prob=0.,
+                 init_scale=0.1):
+        super(Decoder, self).__init__()
+        self.embedder = Embedding(
+            size=[vocab_size, embed_dim],
+            param_attr=ParamAttr(initializer=UniformInitializer(
+                low=-init_scale, high=init_scale)))
+        self.stack_lstm = RNN(DecoderCell(num_layers, embed_dim, hidden_size,
+                                          dropout_prob, init_scale),
+                              is_reverse=False,
+                              time_major=False)
+        self.output_layer = Linear(
+            hidden_size,
+            vocab_size,
+            param_attr=ParamAttr(initializer=UniformInitializer(
+                low=-init_scale, high=init_scale)),
+            bias_attr=False)
+    def forward(self, target, decoder_initial_states):
+        inputs = self.embedder(target)
+        decoder_output, _ = self.stack_lstm(
+            inputs, initial_states=decoder_initial_states)
+        predict = self.output_layer(decoder_output)
+        return predict
+class BaseModel(Model):
+    def __init__(self,
+                 src_vocab_size,
+                 trg_vocab_size,
+                 embed_dim,
+                 hidden_size,
+                 num_layers,
+                 dropout_prob=0.,
+                 init_scale=0.1):
+        super(BaseModel, self).__init__()
+        self.hidden_size = hidden_size
+        self.encoder = Encoder(src_vocab_size, embed_dim, hidden_size,
+                               num_layers, dropout_prob, init_scale)
+        self.decoder = Decoder(trg_vocab_size, embed_dim, hidden_size,
+                               num_layers, dropout_prob, init_scale)
+    def forward(self, src, src_length, trg):
+        # encoder
+        encoder_output, encoder_final_states = self.encoder(src, src_length)
+        # decoder
+        predict = self.decoder(trg, encoder_final_states)
+        return predict
+class BaseInferModel(BaseModel):
+    def __init__(self,
+                 src_vocab_size,
+                 trg_vocab_size,
+                 embed_dim,
+                 hidden_size,
+                 num_layers,
+                 dropout_prob=0.,
+                 bos_id=0,
+                 eos_id=1,
+                 beam_size=4,
+                 max_out_len=256):
+        args = dict(locals())
+        args.pop("self")
+        args.pop("__class__", None)  # py3
+        self.bos_id = args.pop("bos_id")
+        self.eos_id = args.pop("eos_id")
+        self.beam_size = args.pop("beam_size")
+        self.max_out_len = args.pop("max_out_len")
+        super(BaseInferModel, self).__init__(**args)
+        # dynamic decoder for inference
+        decoder = BeamSearchDecoder(
+            self.decoder.stack_lstm.cell,
+            start_token=bos_id,
+            end_token=eos_id,
+            beam_size=beam_size,
+            embedding_fn=self.decoder.embedder,
+            output_fn=self.decoder.output_layer)
+        self.beam_search_decoder = DynamicDecode(
+            decoder, max_step_num=max_out_len, is_test=True)
+    def forward(self, src, src_length):
+        # encoding
+        encoder_output, encoder_final_states = self.encoder(src, src_length)
+        # dynamic decoding with beam search
+        rs, _ = self.beam_search_decoder(inits=encoder_final_states)
+        return rs
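`CrossEntropyCriterion` in `seq2seq_base.py` masks per-token costs with the target length, averages over the batch, and sums over time steps. A plain-Python sketch of that reduction follows, assuming the per-token costs have already been computed; `masked_sequence_loss` is a hypothetical name used only for illustration.

```python
def masked_sequence_loss(token_costs, lengths):
    """Sum over time steps of the batch-mean masked cost.

    token_costs: list of per-sequence lists, shape [batch][time].
    lengths: number of valid (non-padding) steps for each sequence.
    """
    batch_size = len(token_costs)
    max_len = len(token_costs[0])
    total = 0.0
    for t in range(max_len):
        # Padded positions contribute zero but still count in the batch mean.
        step_sum = sum(token_costs[b][t] for b in range(batch_size)
                       if t < lengths[b])
        total += step_sum / batch_size
    return total


loss = masked_sequence_loss([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], lengths=[3, 1])
```

In this example the second sequence has only one valid step, so its costs 5.0 and 6.0 are ignored.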
--- a/examples/seq2seq/train.py
+++ b/examples/seq2seq/train.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import os
+import random
+from functools import partial
+import numpy as np
+import paddle.fluid as fluid
+from paddle.fluid.io import DataLoader
+from hapi.model import Input, set_device
+from args import parse_args
+from seq2seq_base import BaseModel, CrossEntropyCriterion
+from seq2seq_attn import AttentionModel
+from reader import create_data_loader
+from utility import PPL, TrainCallback
+def do_train(args):
+    device = set_device("gpu" if args.use_gpu else "cpu")
+    fluid.enable_dygraph(device) if args.eager_run else None
+    if args.enable_ce:
+        fluid.default_main_program().random_seed = 102
+        fluid.default_startup_program().random_seed = 102
+    # define model
+    inputs = [
+        Input(
+            [None, None], "int64", name="src_word"),
+        Input(
+            [None], "int64", name="src_length"),
+        Input(
+            [None, None], "int64", name="trg_word"),
+    ]
+    labels = [
+        Input(
+            [None], "int64", name="trg_length"),
+        Input(
+            [None, None, 1], "int64", name="label"),
+    ]
+    # define dataloader
+    train_loader, eval_loader = create_data_loader(args, device)
+    model_maker = AttentionModel if args.attention else BaseModel
+    model = model_maker(args.src_vocab_size, args.tar_vocab_size,
+                        args.hidden_size, args.hidden_size, args.num_layers,
+                        args.dropout)
+    grad_clip = fluid.clip.GradientClipByGlobalNorm(
+        clip_norm=args.max_grad_norm)
+    optimizer = fluid.optimizer.Adam(
+        learning_rate=args.learning_rate,
+        parameter_list=model.parameters(),
+        grad_clip=grad_clip)
+    ppl_metric = PPL(reset_freq=100)  # ppl for every 100 batches
+    model.prepare(
+        optimizer,
+        CrossEntropyCriterion(),
+        ppl_metric,
+        inputs=inputs,
+        labels=labels)
+    model.fit(train_data=train_loader,
+              eval_data=eval_loader,
+              epochs=args.max_epoch,
+              eval_freq=1,
+              save_freq=1,
+              save_dir=args.model_path,
+              callbacks=[TrainCallback(ppl_metric, args.log_freq)])
+if __name__ == "__main__":
+    args = parse_args()
+    do_train(args)
--- a/examples/seq2seq/utility.py
+++ b/examples/seq2seq/utility.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import math
+import paddle.fluid as fluid
+from hapi.metrics import Metric
+from hapi.callbacks import ProgBarLogger
+class TrainCallback(ProgBarLogger):
+    def __init__(self, ppl, log_freq, verbose=2):
+        super(TrainCallback, self).__init__(log_freq, verbose)
+        self.ppl = ppl
+    def on_train_begin(self, logs=None):
+        super(TrainCallback, self).on_train_begin(logs)
+        self.train_metrics = ["ppl"]  # remove loss to not print it
+    def on_epoch_begin(self, epoch=None, logs=None):
+        super(TrainCallback, self).on_epoch_begin(epoch, logs)
+        self.ppl.reset()
+    def on_train_batch_end(self, step, logs=None):
+        logs["ppl"] = self.ppl.cal_acc_ppl(logs["loss"][0], logs["batch_size"])
+        if step > 0 and step % self.ppl.reset_freq == 0:
+            self.ppl.reset()
+        super(TrainCallback, self).on_train_batch_end(step, logs)
+    def on_eval_begin(self, logs=None):
+        super(TrainCallback, self).on_eval_begin(logs)
+        self.eval_metrics = ["ppl"]
+        self.ppl.reset()
+    def on_eval_batch_end(self, step, logs=None):
+        logs["ppl"] = self.ppl.cal_acc_ppl(logs["loss"][0], logs["batch_size"])
+        super(TrainCallback, self).on_eval_batch_end(step, logs)
+class PPL(Metric):
+    def __init__(self, reset_freq=100, name=None):
+        super(PPL, self).__init__()
+        self._name = name or "ppl"
+        self.reset_freq = reset_freq
+        self.reset()
+    def add_metric_op(self, pred, seq_length, label):
+        word_num = fluid.layers.reduce_sum(seq_length)
+        return word_num
+    def update(self, word_num):
+        self.word_count += word_num
+        return word_num
+    def reset(self):
+        self.total_loss = 0
+        self.word_count = 0
+    def accumulate(self):
+        return self.word_count
+    def name(self):
+        return self._name
+    def cal_acc_ppl(self, batch_loss, batch_size):
+        self.total_loss += batch_loss * batch_size
+        ppl = math.exp(self.total_loss / self.word_count)
+        return ppl
\ No newline at end of file
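The `PPL` metric in `utility.py` accumulates a total loss and a word count, then reports `exp(total_loss / word_count)`. The sketch below shows the same running computation in one small class. `RunningPerplexity` is a hypothetical name, and the code assumes `word_num` is positive so the division is defined. The batch loss is multiplied by the batch size presumably because the criterion averages over the batch before summing over time steps.

```python
import math


class RunningPerplexity:
    """Accumulate loss and word counts, report perplexity (illustrative only)."""

    def __init__(self):
        self.total_loss = 0.0
        self.word_count = 0

    def update(self, batch_loss, batch_size, word_num):
        # Undo the batch mean to recover the summed token loss of this batch.
        self.total_loss += batch_loss * batch_size
        self.word_count += word_num
        return math.exp(self.total_loss / self.word_count)


ppl = RunningPerplexity()
value = ppl.update(batch_loss=3 * math.log(4.0), batch_size=2, word_num=6)
```

Here the average loss per word is `log(4)`, which corresponds to a perplexity of about 4.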
--- a/examples/sequence_tagging/README.md
+++ b/examples/sequence_tagging/README.md
@@ -6,7 +6,7 @@ Sequence Tagging is a sequence labeling model that can be used for word segmentation
 |模型|Precision|Recall|F1-score|
 |:-:|:-:|:-:|:-:|
-|Lexical Analysis|88.26%|89.20%|88.73%|
+|Lexical Analysis|89.57%|89.96%|89.76%|
 ## 2. Quick start
@@ -22,7 +22,7 @@ Sequence Tagging is a sequence labeling model that can be used for word segmentation
 Clone the toolkit repository to your local machine
 ```bash
 git clone https://github.com/PaddlePaddle/hapi.git
- cd hapi/sequence_tagging
+ cd hapi/examples/sequence_tagging
 ```
 #### 3. Environment dependencies
@@ -70,7 +70,7 @@ python -u train.py \
          --dynamic False
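 # --device: whether to use a GPU or CPU device
-# --dynamic: whether to train in dynamic graph mode; set to True for static graph training and False for dynamic graph training
+# --dynamic: whether to train in dynamic graph mode; set to False for static graph training and True for dynamic graph training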
 ```
 Multi-GPU training
@@ -84,7 +84,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3  train.py \
          --dynamic False
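 # --device: whether to use a GPU or CPU device
-# --dynamic: whether to train in dynamic graph mode; set to True for static graph training and False for dynamic graph training
+# --dynamic: whether to train in dynamic graph mode; set to False for static graph training and True for dynamic graph training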
 ```
 Training on CPU
@@ -95,7 +95,7 @@ python -u train.py \
          --dynamic False
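 # --device: whether to use a GPU or CPU device
-# --dynamic: whether to train in dynamic graph mode; set to True for static graph training and False for dynamic graph training
+# --dynamic: whether to train in dynamic graph mode; set to False for static graph training and True for dynamic graph training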
 ```
 ### Model prediction
@@ -105,15 +105,13 @@ python -u train.py \
 python predict.py \
      --init_from_checkpoint  model_baseline/params \
      --output_file predict.result  \
-      --mode predict \
      --device cpu  \
      --dynamic False
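 # --init_from_checkpoint: model to initialize from
 # --output_file: file for the prediction results
 # --device: whether to use a GPU or CPU device
-# --mode: run mode; set to train for training and to predict for prediction
+# --dynamic: whether to train in dynamic graph mode; set to False for static graph training and True for dynamic graph training
-# --dynamic: whether to train in dynamic graph mode; set to True for static graph training and False for dynamic graph training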
 ```
 ### Model evaluation
@@ -123,14 +121,12 @@ python predict.py \
 # baseline model
 python eval.py \
        --init_from_checkpoint  ./model_baseline/params \
-        --mode predict \
        --device cpu  \
        --dynamic False
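 # --init_from_checkpoint: model to initialize from
 # --device: whether to use a GPU or CPU device
-# --mode: run mode; set to train for training and to predict for prediction
+# --dynamic: whether to train in dynamic graph mode; set to False for static graph training and True for dynamic graph training
-# --dynamic: whether to train in dynamic graph mode; set to True for static graph training and False for dynamic graph training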
 ```
@@ -168,7 +164,7 @@ Overall Architecture of GRU-CRF-MODEL
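 Users can organize their own training data according to their application scenario. Apart from the first line, which is the fixed header `text_a\tlabel`, every line consists of two tab-separated columns. The first column is UTF-8 encoded Chinese text with characters separated by `\002`; the second column holds the label of each character, also separated by `\002`. We use the IOB2 tagging scheme: X-B marks the beginning of a word of type X, X-I marks its continuation, and O marks characters that are not of interest (in practice, O does not occur in joint part-of-speech and named-entity tagging). An example follows: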
 ```text
-除\002了\002他\002续\002任\002十\002二\002届\002政\002协\002委\002员\002,\002马\002化\002腾\002,\002雷\002军\002,\002李\002彦\002宏\002也\002被\002推\002选\002为\002新\002一\002届\002全\002国\002人\002大\002代\002表\002或\002全\002国\002政\002协\002委\002员	p-B\002p-I\002r-B\002v-B\002v-I\002m-B\002m-I\002m-I\002ORG-B\002ORG-I\002n-B\002n-I\002w-B\002PER-B\002PER-I\002PER-I\002w-B\002PER-B\002PER-I\002w-B\002PER-B\002PER-I\002PER-I\002d-B\002p-B\002v-B\002v-I\002v-B\002a-B\002m-B\002m-I\002ORG-B\002ORG-I\002ORG-I\002ORG-I\002n-B\002n-I\002c-B\002n-B\002n-I\002ORG-B\002ORG-I\002n-B\002n-I
+除\002了\002他\002续\002任\002十\002二\002届\002政\002协\002委\002员\002,\002马\002化\002腾\002,\002雷\002军\002,\002李\002彦\002宏\002也\002被\002推\002选\002为\002新\002一\002届\002全\002国\002人\002大\002代\002表\002或\002全\002国\002政\002协\002委\002员    p-B\002p-I\002r-B\002v-B\002v-I\002m-B\002m-I\002m-I\002ORG-B\002ORG-I\002n-B\002n-I\002w-B\002PER-B\002PER-I\002PER-I\002w-B\002PER-B\002PER-I\002w-B\002PER-B\002PER-I\002PER-I\002d-B\002p-B\002v-B\002v-I\002v-B\002a-B\002m-B\002m-I\002ORG-B\002ORG-I\002ORG-I\002ORG-I\002n-B\002n-I\002c-B\002n-B\002n-I\002ORG-B\002ORG-I\002n-B\002n-I
 ```
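The two-column format described above can be read with a few lines of Python. This is a minimal sketch rather than the repository's `reader.py`: the helper names are hypothetical, and it assumes that `\002` denotes the control character U+0002 and that the two columns are separated by a single tab.

```python
SEP = "\x02"  # assumption: `\002` in the data is the control character U+0002


def parse_line(line):
    """Split one data line into parallel lists of characters and tags."""
    text, labels = line.rstrip("\n").split("\t")
    chars = text.split(SEP)
    tags = labels.split(SEP)
    if len(chars) != len(tags):
        raise ValueError("each character needs exactly one tag")
    return chars, tags


def group_words(chars, tags):
    """Merge characters into (word, type) pairs using the X-B / X-I tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag.endswith("-I") and words and words[-1][1] == tag[:-2]:
            # Continuation of the current word of the same type.
            words[-1] = (words[-1][0] + ch, words[-1][1])
        else:
            word_type = tag[:-2] if tag.endswith(("-B", "-I")) else tag
            words.append((ch, word_type))
    return words


chars, tags = parse_line("除\x02了\x02他\tp-B\x02p-I\x02r-B\n")
words = group_words(chars, tags)
```

For the first three characters of the sample line, this groups `除` and `了` into one word of type `p` and leaves `他` as a word of type `r`.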
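 + We release the full model and its dependency data together with the code. However, because the training data is very large, we do not release it; the `data` directory contains only a few samples that illustrate the input format.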
@@ -196,6 +192,7 @@ Overall Architecture of GRU-CRF-MODEL
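 ├── eval.py                               # script for lexical analysis evaluation
 ├── downloads.py                      # script for downloading data and models
 ├── downloads.sh                      # script for downloading data and models
+├── sequence_tagging.yaml             # configuration parameters for model training, prediction, and evaluation
 └──reader.py                           # file-reading functions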
 ```
@@ -207,11 +204,11 @@ Overall Architecture of GRU-CRF-MODEL
 ```text
 @article{jiao2018LAC,
-	title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network},
+    title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network},
-	author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke},
+    author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke},
-	journal={arXiv preprint arXiv:1807.01882},
+    journal={arXiv preprint arXiv:1807.01882},
-	year={2018},
+    year={2018},
-	url={https://arxiv.org/abs/1807.01882}
+    url={https://arxiv.org/abs/1807.01882}
 }
 ```
 ### How to contribute code

--- a/examples/sequence_tagging/downloads.py
+++ b/examples/sequence_tagging/downloads.py
@@ -35,7 +35,7 @@ FILE_INFO = {
    },
    'MODEL': {
        'name': 'sequence_tagging_dy.tar.gz',
-        'md5': "1125d374c03c8218b6e47325dcf607e3"
+        'md5': "6ba37ceea8f1f764ba1fe227295a6a3b"
    },
 }

--- a/examples/sequence_tagging/eval.py
+++ b/examples/sequence_tagging/eval.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-SequenceTagging network structure
+SequenceTagging eval structure
 """
 from __future__ import division
@@ -25,18 +25,16 @@ import math
 import argparse
 import numpy as np
-from train import SeqTagging
+from train import SeqTagging, ChunkEval, LacLoss
 from utils.configure import PDConfig
 from utils.check import check_gpu, check_version
-from utils.metrics import chunk_count
+from reader import LacDataset, LacDataLoader
-from reader import LacDataset, create_lexnet_data_generator, create_dataloader
 work_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 sys.path.append(os.path.join(work_dir, "../"))
 from hapi.model import set_device, Input
 import paddle.fluid as fluid
-from paddle.fluid.optimizer import AdamOptimizer
 from paddle.fluid.layers.utils import flatten
@@ -44,51 +42,33 @@ def main(args):
    place = set_device(args.device)
    fluid.enable_dygraph(place) if args.dynamic else None
-    inputs = [Input([None, None], 'int64', name='words'), 
+    inputs = [
-              Input([None], 'int64', name='length')] 
+        Input(
+            [None, None], 'int64', name='words'), Input(
+                [None], 'int64', name='length'), Input(
+                    [None, None], 'int64', name='target')
+    ]
+    labels = [Input([None, None], 'int64', name='labels')]
-    feed_list = None if args.dynamic else [x.forward() for x in inputs]
    dataset = LacDataset(args)
-    eval_path = args.test_file
+    eval_dataset = LacDataLoader(args, place, phase="test")
-    chunk_evaluator = fluid.metrics.ChunkEvaluator()
-    chunk_evaluator.reset()
-    eval_generator = create_lexnet_data_generator(
-        args, reader=dataset, file_name=eval_path, place=place, mode="test")
-    eval_dataset = create_dataloader(
-        eval_generator, place, feed_list=feed_list)
    vocab_size = dataset.vocab_size
    num_labels = dataset.num_labels
-    model = SeqTagging(args, vocab_size, num_labels)
+    model = SeqTagging(args, vocab_size, num_labels, mode="test")
-    optim = AdamOptimizer(
-        learning_rate=args.base_learning_rate,
-        parameter_list=model.parameters())
    model.mode = "test"
-    model.prepare(inputs=inputs)
+    model.prepare(
+        metrics=ChunkEval(num_labels),
+        inputs=inputs,
+        labels=labels,
+        device=place)
    model.load(args.init_from_checkpoint, skip_mismatch=True)
-    for data in eval_dataset():
+    model.evaluate(eval_dataset.dataloader, batch_size=args.batch_size)
-        if len(data) == 1: 
-            batch_data = data[0]
-            targets = np.array(batch_data[2])
-        else: 
-            batch_data = data
-            targets = batch_data[2].numpy()
-        inputs_data = [batch_data[0], batch_data[1]]
-        crf_decode, length = model.test(inputs=inputs_data)
-        num_infer_chunks, num_label_chunks, num_correct_chunks = chunk_count(crf_decode, targets, length, dataset.id2label_dict)
-        chunk_evaluator.update(num_infer_chunks, num_label_chunks, num_correct_chunks)
-    precision, recall, f1 = chunk_evaluator.eval()
-    print("[test] P: %.5f, R: %.5f, F1: %.5f" % (precision, recall, f1))
-if __name__ == '__main__': 
+if __name__ == '__main__':
    args = PDConfig(yaml_file="sequence_tagging.yaml")
    args.build()
    args.Print()

--- a/examples/sequence_tagging/predict.py
+++ b/examples/sequence_tagging/predict.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-SequenceTagging network structure
+SequenceTagging predict structure
 """
 from __future__ import division
@@ -28,14 +28,13 @@ import numpy as np
 from train import SeqTagging
 from utils.check import check_gpu, check_version
 from utils.configure import PDConfig
-from reader import LacDataset, create_lexnet_data_generator, create_dataloader
+from reader import LacDataset, LacDataLoader
 work_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 sys.path.append(os.path.join(work_dir, "../"))
 from hapi.model import set_device, Input
 import paddle.fluid as fluid
-from paddle.fluid.optimizer import AdamOptimizer
 from paddle.fluid.layers.utils import flatten
@@ -43,26 +42,18 @@ def main(args):
    place = set_device(args.device)
    fluid.enable_dygraph(place) if args.dynamic else None
-    inputs = [Input([None, None], 'int64', name='words'), 
+    inputs = [
-              Input([None], 'int64', name='length')]
+        Input(
+            [None, None], 'int64', name='words'), Input(
+                [None], 'int64', name='length')
+    ]
-    feed_list = None if args.dynamic else [x.forward() for x in inputs]
    dataset = LacDataset(args)
-    predict_path = args.predict_file
+    predict_dataset = LacDataLoader(args, place, phase="predict")
-    predict_generator = create_lexnet_data_generator(
-        args, reader=dataset, file_name=predict_path, place=place, mode="predict")
-    predict_dataset = create_dataloader(
-        predict_generator, place, feed_list=feed_list)
    vocab_size = dataset.vocab_size
    num_labels = dataset.num_labels
-    model = SeqTagging(args, vocab_size, num_labels)
+    model = SeqTagging(args, vocab_size, num_labels, mode="predict")
-    optim = AdamOptimizer(
-        learning_rate=args.base_learning_rate,
-        parameter_list=model.parameters())
    model.mode = "test"
    model.prepare(inputs=inputs)
@@ -70,20 +61,20 @@ def main(args):
    model.load(args.init_from_checkpoint, skip_mismatch=True)
    f = open(args.output_file, "wb")
-    for data in predict_dataset(): 
+    for data in predict_dataset.dataloader:
-        if len(data) == 1: 
+        if len(data) == 1:
            input_data = data[0]
-        else: 
+        else:
            input_data = data
-        results, length = model.test(inputs=flatten(input_data))
+        results, length = model.test_batch(inputs=flatten(input_data))
-        for i in range(len(results)): 
+        for i in range(len(results)):
            word_len = length[i]
-            word_ids = results[i][: word_len]
+            word_ids = results[i][:word_len]
            tags = [dataset.id2label_dict[str(id)] for id in word_ids]
            f.write("\002".join(tags) + "\n")
-if __name__ == '__main__': 
+if __name__ == '__main__':
    args = PDConfig(yaml_file="sequence_tagging.yaml")
    args.build()
    args.Print()
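The post-processing loop in `predict.py` above trims each decoded row to its true length and maps the remaining ids to tag strings. Below is a minimal standalone sketch of that step, assuming plain Python lists in place of the model output. `decode_tags` and the toy `id2label` table are hypothetical names used only for illustration; they are not part of this change.

```python
def decode_tags(results, lengths, id2label, sep="\002"):
    """Map each row of predicted ids to a separator-joined tag string.

    results  : list of id rows, padded to a common width
    lengths  : true length of each row (padding is ignored)
    id2label : dict keyed by the *string* form of the id, as in the diff
    """
    lines = []
    for ids, n in zip(results, lengths):
        tags = [id2label[str(i)] for i in ids[:n]]  # drop padded positions
        lines.append(sep.join(tags))
    return lines


# Toy data (assumed for illustration only)
id2label = {"0": "n-B", "1": "n-I", "2": "O"}
results = [[0, 1, 2, 0], [2, 0, 0, 0]]
lengths = [3, 1]

lines = decode_tags(results, lengths, id2label)
for line in lines:
    print(repr(line))
```

Under Python 3 the joined lines are `str` values, so writing them out needs a file opened in text mode or an explicit `.encode("utf-8")`.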

--- a/examples/sequence_tagging/reader.py
+++ b/examples/sequence_tagging/reader.py
@@ -19,12 +19,19 @@ from __future__ import division
 from __future__ import print_function
 import io
+import os
+import leveldb
 import numpy as np
+import shutil
+from functools import partial
 import paddle
+from paddle.io import BatchSampler, DataLoader, Dataset
+from paddle.fluid.dygraph.parallel import ParallelEnv
+from hapi.distributed import DistributedBatchSampler
-class LacDataset(object):
+class LacDataset(Dataset):
    """
    Load lexical analysis dataset
    """
@@ -34,6 +41,7 @@ class LacDataset(object):
        self.label_dict_path = args.label_dict_path
        self.word_rep_dict_path = args.word_rep_dict_path
        self._load_dict()
+        self.examples = []
    def _load_dict(self):
        self.word2id_dict = self.load_kv_dict(
@@ -108,152 +116,135 @@ class LacDataset(object):
            label_ids.append(label_id)
        return label_ids
-    def file_reader(self,
+    def file_reader(self, filename, phase="train"):
-                    filename,
-                    mode="train",
-                    batch_size=32,
-                    max_seq_len=126):
        """
        yield (word_idx, target_idx) one by one from file,
            or yield (word_idx, ) in `infer` mode
        """
+        self.phase = phase
-        def wrapper():
+        with io.open(filename, "r", encoding="utf8") as fr:
-            fread = io.open(filename, "r", encoding="utf-8")
+            if phase in ["train", "test"]:
-            if mode == "train": 
+                headline = next(fr)
-                headline = next(fread)
-                headline = headline.strip().split('\t')
-                assert len(headline) == 2 and headline[0] == "text_a" and headline[
-                    1] == "label"
-                buf = []
-                for line in fread:
-                    words, labels = line.strip("\n").split("\t")
-                    if len(words) < 1:
-                        continue
-                    word_ids = self.word_to_ids(words.split("\002"))
-                    label_ids = self.label_to_ids(labels.split("\002"))
-                    assert len(word_ids) == len(label_ids)
-                    words_len = np.int64(len(word_ids))
-                    word_ids = word_ids[0:max_seq_len]
-                    words_len = np.int64(len(word_ids))
-                    word_ids += [0 for _ in range(max_seq_len - words_len)]
-                    label_ids = label_ids[0:max_seq_len]
-                    label_ids += [0 for _ in range(max_seq_len - words_len)]
-                    assert len(word_ids) == len(label_ids)
-                    yield word_ids, label_ids, words_len
-            elif mode == "test": 
-                headline = next(fread)
                headline = headline.strip().split('\t')
-                assert len(headline) == 2 and headline[0] == "text_a" and headline[
+                assert len(headline) == 2 and headline[
-                           1] == "label"
+                    0] == "text_a" and headline[1] == "label"
-                buf = []
-                for line in fread:
+                for line in fr:
-                    words, labels = line.strip("\n").split("\t")
+                    line_str = line.strip("\n")
-                    if len(words) < 1:
+                    if len(line_str) < 1 or len(line_str.split('\t')) < 2:
-                        continue
-                    word_ids = self.word_to_ids(words.split("\002"))
-                    label_ids = self.label_to_ids(labels.split("\002"))
-                    assert len(word_ids) == len(label_ids)
-                    words_len = np.int64(len(word_ids))
-                    yield word_ids, label_ids, words_len
-            else: 
-                for line in fread: 
-                    words = line.strip("\n").split('\t')[0]
-                    if words == u"text_a": 
                        continue
-                    if "\002" not in words: 
-                        word_ids = self.word_to_ids(words)
-                    else: 
-                        word_ids = self.word_to_ids(words.split("\002"))
-                    words_len = np.int64(len(word_ids))
-                    yield word_ids, words_len
-            fread.close()
+                    self.examples.append(line_str)
+            else:
+                for idx, line in enumerate(fr):
+                    words = line.strip("\n").split("\t")[0]
+                    self.examples.append(words)
+    def __getitem__(self, idx):
+        line_str = self.examples[idx]
+        if self.phase in ["train", "test"]:
+            words, labels = line_str.split('\t')
+            word_ids = self.word_to_ids(words.split("\002"))
+            label_ids = self.label_to_ids(labels.split("\002"))
+            assert len(word_ids) == len(label_ids)
+            return word_ids, label_ids
+        else:
+            words = [w for w in line_str]
+            word_ids = self.word_to_ids(words)
+            return word_ids
+    def __len__(self):
-        return wrapper
+        return len(self.examples)
-def create_lexnet_data_generator(args, reader, file_name, place, mode="train"): 
+def create_lexnet_data_generator(args, insts, phase="train"):
-    def padding_data(max_len, batch_data): 
+    def padding_data(max_len, batch_data, if_len=False):
        padding_batch_data = []
-        for data in batch_data: 
+        padding_lens = []
+        for data in batch_data:
+            data = data[:max_len]
+            if if_len:
+                seq_len = np.int64(len(data))
+                padding_lens.append(seq_len)
            data += [0 for _ in range(max_len - len(data))]
            padding_batch_data.append(data)
-        return padding_batch_data
+        if if_len:
+            return np.array(padding_batch_data), np.array(padding_lens)
-    def wrapper(): 
+        else:
-        if mode == "train": 
+            return np.array(padding_batch_data)
-            batch_words, batch_labels, seq_lens = [], [], []
-            for epoch in xrange(args.epoch):
+    if phase == "train":
-                for instance in reader.file_reader(
+        batch_words = [inst[0] for inst in insts]
-                        file_name, mode, max_seq_len=args.max_seq_len)():
+        batch_labels = [inst[1] for inst in insts]
-                    words, labels, words_len = instance
+        padding_batch_words, padding_lens = padding_data(
-                    if len(seq_lens) < args.batch_size:
+            args.max_seq_len, batch_words, if_len=True)
-                        batch_words.append(words)
+        padding_batch_labels = padding_data(args.max_seq_len, batch_labels)
-                        batch_labels.append(labels)
+        return [
-                        seq_lens.append(words_len)
+            padding_batch_words, padding_lens, padding_batch_labels,
-                    if len(seq_lens) == args.batch_size: 
+            padding_batch_labels
-                        yield batch_words, seq_lens, batch_labels, batch_labels
+        ]
-                        batch_words, batch_labels, seq_lens = [], [], []
+    elif phase == "test":
+        batch_words = [inst[0] for inst in insts]
-            if len(seq_lens) > 0:
+        seq_len = [len(inst[0]) for inst in insts]
-                yield batch_words, seq_lens, batch_labels, batch_labels
+        max_seq_len = max(seq_len)
-        elif mode == "test": 
+        batch_labels = [inst[1] for inst in insts]
-            batch_words, batch_labels, seq_lens, max_len = [], [], [], 0
+        padding_batch_words, padding_lens = padding_data(
-            for instance in reader.file_reader(
+            max_seq_len, batch_words, if_len=True)
-                file_name, mode, max_seq_len=args.max_seq_len)():
+        padding_batch_labels = padding_data(max_seq_len, batch_labels)
-                words, labels, words_len = instance
+        return [
-                max_len = words_len if words_len > max_len else max_len
+            padding_batch_words, padding_lens, padding_batch_labels,
-                if len(seq_lens) < args.batch_size:
+            padding_batch_labels
-                    batch_words.append(words)
+        ]
-                    seq_lens.append(words_len)
-                    batch_labels.append(labels)
-                if len(seq_lens) == args.batch_size: 
-                    padding_batch_words = padding_data(max_len, batch_words)
-                    padding_batch_labels = padding_data(max_len, batch_labels)
-                    yield padding_batch_words, seq_lens, padding_batch_labels, padding_batch_labels
-                    batch_words, batch_labels, seq_lens, max_len = [], [], [], 0
-            if len(seq_lens) > 0: 
-                padding_batch_words = padding_data(max_len, batch_words)
-                padding_batch_labels = padding_data(max_len, batch_labels)
-                yield padding_batch_words, seq_lens, padding_batch_labels, padding_batch_labels
-        else: 
-            batch_words, seq_lens, max_len = [], [], 0
-            for instance in reader.file_reader(
-                   file_name, mode, max_seq_len=args.max_seq_len)():
-                words, words_len = instance
-                if len(seq_lens) < args.batch_size:
-                    batch_words.append(words)
-                    seq_lens.append(words_len)
-                    max_len = words_len if words_len > max_len else max_len
-                if len(seq_lens) == args.batch_size: 
-                    padding_batch_words = padding_data(max_len, batch_words)
-                    yield padding_batch_words, seq_lens
-                    batch_words, seq_lens, max_len = [], [], 0
-            if len(seq_lens) > 0: 
-                padding_batch_words = padding_data(max_len, batch_words)
-                yield padding_batch_words, seq_lens
-    return wrapper
-def create_dataloader(generator, place, feed_list=None):
-    if not feed_list:
-        data_loader = paddle.io.DataLoader.from_generator(
-            capacity=50,
-            use_double_buffer=True,
-            iterable=True,
-            return_list=True)
    else:
-        data_loader = paddle.io.DataLoader.from_generator(
+        batch_words = insts
-            feed_list=feed_list,
+        seq_len = [len(inst) for inst in insts]
-            capacity=50,
+        max_seq_len = max(seq_len)
-            use_double_buffer=True,
+        padding_batch_words, padding_lens = padding_data(
-            iterable=True,
+            max_seq_len, batch_words, if_len=True)
+        return [padding_batch_words, padding_lens]
+class LacDataLoader(object):
+    def __init__(self,
+                 args,
+                 place,
+                 phase="train",
+                 shuffle=False,
+                 num_workers=0,
+                 drop_last=False):
+        assert phase in [
+            "train", "test", "predict"
+        ], "phase should be in [train, test, predict], but got %s" % phase
+        if phase == "train":
+            file_name = args.train_file
+        elif phase == "test":
+            file_name = args.test_file
+        elif phase == "predict":
+            file_name = args.predict_file
+        self.dataset = LacDataset(args)
+        self.dataset.file_reader(file_name, phase=phase)
+        if phase == "train":
+            self.sampler = DistributedBatchSampler(
+                dataset=self.dataset,
+                batch_size=args.batch_size,
+                shuffle=shuffle,
+                drop_last=drop_last)
+        else:
+            self.sampler = BatchSampler(
+                dataset=self.dataset,
+                batch_size=args.batch_size,
+                shuffle=shuffle,
+                drop_last=drop_last)
+        self.dataloader = DataLoader(
+            dataset=self.dataset,
+            batch_sampler=self.sampler,
+            places=place,
+            collate_fn=partial(
+                create_lexnet_data_generator, args, phase=phase),
+            num_workers=num_workers,
            return_list=True)
-    data_loader.set_batch_generator(generator, places=place)
-    return data_loader
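The new `padding_data` helper in `reader.py` truncates each sequence to a maximum length, right-pads it with zeros, and optionally returns the true lengths. The same idea can be sketched with NumPy alone. `pad_batch` and `pad_id` are hypothetical names, and this is an illustration under those assumptions rather than the code used by `LacDataLoader`.

```python
import numpy as np


def pad_batch(batch, max_len, pad_id=0):
    """Truncate each sequence to max_len, then right-pad with pad_id.

    Returns the padded id matrix and the lengths after truncation.
    """
    padded, lengths = [], []
    for seq in batch:
        seq = list(seq[:max_len])  # slicing copies, so the caller's list is untouched
        lengths.append(len(seq))
        padded.append(seq + [pad_id] * (max_len - len(seq)))
    return np.array(padded, dtype="int64"), np.array(lengths, dtype="int64")


# Toy batch of word ids (assumed for illustration only)
words = [[5, 8, 2], [7], [1, 2, 3, 4, 9, 6]]
ids, lens = pad_batch(words, max_len=4)
print(ids)
print(lens)
```

In the diff, the train phase pads to the fixed `args.max_seq_len`, while the test and predict phases pad to the longest sequence in the batch; either choice maps onto the `max_len` argument here.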
--- a/examples/sequence_tagging/sequence_tagging.yaml
+++ b/examples/sequence_tagging/sequence_tagging.yaml
 word_dict_path: "./conf/word.dic"
 label_dict_path: "./conf/tag.dic"
 word_rep_dict_path: "./conf/q2b.dic"
-device: "cpu"
+device: "gpu"
 dynamic: True
 epoch: 10
 base_learning_rate: 0.001
@@ -14,7 +14,7 @@ batch_size: 300
 max_seq_len: 126
 num_devices: 1
 save_dir: "model"
-init_from_checkpoint: "model_baseline/params"
+init_from_checkpoint: ""
 init_from_pretrain_model: ""
 save_freq: 1
 eval_freq: 1
@@ -22,4 +22,3 @@ output_file: "predict.result"
 test_file: "./data/test.tsv"
 train_file: "./data/train.tsv"
 predict_file: "./data/infer.tsv"
-mode: "train"
--- a/examples/sequence_tagging/train.py
+++ b/examples/sequence_tagging/train.py
@@ -35,14 +35,17 @@ from hapi.text.text import SequenceTagging
 from utils.check import check_gpu, check_version
 from utils.configure import PDConfig
-from reader import LacDataset, create_lexnet_data_generator, create_dataloader
+from reader import LacDataset, LacDataLoader
 import paddle.fluid as fluid
 from paddle.fluid.optimizer import AdamOptimizer
+__all__ = ["SeqTagging", "LacLoss", "ChunkEval"]
 class SeqTagging(Model):
-    def __init__(self, args, vocab_size, num_labels, length=None):
+    def __init__(self, args, vocab_size, num_labels, length=None,
+                 mode="train"):
        super(SeqTagging, self).__init__()
        """
        define the lexical analysis network structure
@@ -53,7 +56,7 @@ class SeqTagging(Model):
            for infer: return the prediction
            otherwise: return the prediction
        """
-        self.mode_type = args.mode
+        self.mode_type = mode
        self.word_emb_dim = args.word_emb_dim
        self.vocab_size = vocab_size
        self.num_labels = num_labels
@@ -219,23 +222,13 @@ def main(args):
    feed_list = None if args.dynamic else [
        x.forward() for x in inputs + labels
    ]
-    dataset = LacDataset(args)
-    train_path = args.train_file
-    test_path = args.test_file
-    train_generator = create_lexnet_data_generator(
+    dataset = LacDataset(args)
-        args, reader=dataset, file_name=train_path, place=place, mode="train")
+    train_dataset = LacDataLoader(args, place, phase="train")
-    test_generator = create_lexnet_data_generator(
-        args, reader=dataset, file_name=test_path, place=place, mode="test")
-    train_dataset = create_dataloader(
-        train_generator, place, feed_list=feed_list)
-    test_dataset = create_dataloader(
-        test_generator, place, feed_list=feed_list)
    vocab_size = dataset.vocab_size
    num_labels = dataset.num_labels
-    model = SeqTagging(args, vocab_size, num_labels)
+    model = SeqTagging(args, vocab_size, num_labels, mode="train")
    optim = AdamOptimizer(
        learning_rate=args.base_learning_rate,
@@ -255,8 +248,7 @@ def main(args):
    if args.init_from_pretrain_model:
        model.load(args.init_from_pretrain_model, reset_optimizer=True)
-    model.fit(train_dataset,
+    model.fit(train_dataset.dataloader,
-              test_dataset,
              epochs=args.epoch,
              batch_size=args.batch_size,
              eval_freq=args.eval_freq,

--- a/examples/sequence_tagging/utils/configure.py
+++ b/examples/sequence_tagging/utils/configure.py
@@ -195,13 +195,19 @@ class PDConfig(object):
                               "Whether to perform predicting.")
        self.default_g.add_arg("do_eval", bool, False,
                               "Whether to perform evaluating.")
-        self.default_g.add_arg("do_save_inference_model", bool, False,
+        self.default_g.add_arg(
-                               "Whether to perform model saving for inference.")
+            "do_save_inference_model", bool, False,
+            "Whether to perform model saving for inference.")
        # NOTE: args for profiler
-        self.default_g.add_arg("is_profiler", int, 0, "the switch of profiler tools. (used for benchmark)")
+        self.default_g.add_arg(
-        self.default_g.add_arg("profiler_path", str, './', "the profiler output file path. (used for benchmark)")
+            "is_profiler", int, 0,
-        self.default_g.add_arg("max_iter", int, 0, "the max train batch num.(used for benchmark)")
+            "the switch of profiler tools. (used for benchmark)")
+        self.default_g.add_arg(
+            "profiler_path", str, './',
+            "the profiler output file path. (used for benchmark)")
+        self.default_g.add_arg("max_iter", int, 0,
+                               "the max train batch num.(used for benchmark)")
        self.parser = parser

--- a/examples/sequence_tagging/utils/metrics.py
+++ b/examples/sequence_tagging/utils/metrics.py
@@ -23,7 +23,7 @@ import paddle.fluid as fluid
 __all__ = ['chunk_count', "build_chunk"]
-def build_chunk(data_list, id2label_dict): 
+def build_chunk(data_list, id2label_dict):
    """
    Assembly entity
    """
@@ -31,29 +31,29 @@ def build_chunk(data_list, id2label_dict):
    ner_dict = {}
    ner_str = ""
    ner_start = 0
-    for i in range(len(tag_list)): 
+    for i in range(len(tag_list)):
        tag = tag_list[i]
-        if tag == u"O": 
+        if tag == u"O":
-            if i != 0: 
+            if i != 0:
                key = "%d_%d" % (ner_start, i - 1)
                ner_dict[key] = ner_str
            ner_start = i
-            ner_str = tag 
+            ner_str = tag
-        elif tag.endswith(u"B"): 
+        elif tag.endswith(u"B"):
-            if i != 0: 
+            if i != 0:
                key = "%d_%d" % (ner_start, i - 1)
                ner_dict[key] = ner_str
            ner_start = i
            ner_str = tag.split('-')[0]
-        elif tag.endswith(u"I"): 
+        elif tag.endswith(u"I"):
-            if tag.split('-')[0] != ner_str: 
+            if tag.split('-')[0] != ner_str:
-                if i != 0: 
+                if i != 0:
                    key = "%d_%d" % (ner_start, i - 1)
                    ner_dict[key] = ner_str
                ner_start = i
                ner_str = tag.split('-')[0]
    return ner_dict
 def chunk_count(infer_numpy, label_numpy, seq_len, id2label_dict):
    """
@@ -62,15 +62,14 @@ def chunk_count(infer_numpy, label_numpy, seq_len, id2label_dict):
    num_infer_chunks, num_label_chunks, num_correct_chunks = 0, 0, 0
    assert infer_numpy.shape[0] == label_numpy.shape[0]
-    for i in range(infer_numpy.shape[0]): 
+    for i in range(infer_numpy.shape[0]):
-        infer_list = infer_numpy[i][: seq_len[i]]
+        infer_list = infer_numpy[i][:seq_len[i]]
-        label_list = label_numpy[i][: seq_len[i]]
+        label_list = label_numpy[i][:seq_len[i]]
        infer_dict = build_chunk(infer_list, id2label_dict)
        num_infer_chunks += len(infer_dict)
        label_dict = build_chunk(label_list, id2label_dict)
        num_label_chunks += len(label_dict)
-        for key in infer_dict: 
+        for key in infer_dict:
-            if key in label_dict and label_dict[key] == infer_dict[key]: 
+            if key in label_dict and label_dict[key] == infer_dict[key]:
                num_correct_chunks += 1
    return num_infer_chunks, num_label_chunks, num_correct_chunks
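`chunk_count` above returns three counters: predicted chunks, labelled chunks, and chunks that match. A short sketch of how precision, recall, and F1 follow from them, assuming the usual definitions. `chunk_prf` is a hypothetical helper, not a function from `utils/metrics.py` or from `ChunkEval`.

```python
def chunk_prf(num_infer_chunks, num_label_chunks, num_correct_chunks):
    """Precision, recall and F1 from chunk counts; returns 0.0 on empty denominators."""
    precision = (float(num_correct_chunks) / num_infer_chunks
                 if num_infer_chunks else 0.0)
    recall = (float(num_correct_chunks) / num_label_chunks
              if num_label_chunks else 0.0)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy counts (assumed for illustration only)
precision, recall, f1 = chunk_prf(8, 10, 6)
print("[test] P: %.5f, R: %.5f, F1: %.5f" % (precision, recall, f1))
```

Counts from several batches can be summed before calling the helper, which gives a corpus-level score rather than an average of per-batch scores.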
--- a/examples/style-transfer/README.md
+++ b/examples/style-transfer/README.md
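+# Image Style Transfer
+Image style transfer is one of the more interesting applications of convolutional neural networks. So what is style transfer? In the first row of the figure below, the image on the left is an ordinary photograph taken with a camera, and the image on the right is Van Gogh's famous painting The Starry Night. How can the ordinary photograph on the left take on the style of The Starry Night? Neural style transfer can generate an image like the one in the second row.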
+<div align=center>
+ <img src="images/markdown/img1.png" width = "600" height = "300"  />
+<br/>
+ <img src="images/markdown/img2.png" width = "300" height = "300" />
+<div align=left>
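+## Basic Principle
+The goal of style transfer is to make the content of the generated image as similar as possible to the content image, while making its style as similar as possible to the style image. Because a computer represents an image pixel by pixel, the content similarity of two images can be measured by the Euclidean distance between their pixels. For style similarity, we use the Euclidean distance between the Gram matrices of the feature maps that the two images produce at the same layer of a convolutional neural network. The Gram matrix of a feature map is computed as follows: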
+```python
+# tensor shape is [1, c, h, w]
+_, c, h, w = tensor.shape
+tensor = fluid.layers.reshape(tensor, [c, h * w])
+# gram matrix with shape: [c, c]
+gram_matrix = fluid.layers.matmul(tensor, fluid.layers.transpose(tensor, [1, 0]))
+```
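+In the end, style transfer becomes the problem of minimizing the two Euclidean distances above. Note that we use a VGG16 model pretrained on ImageNet and keep its parameters fixed; the optimizer updates only the values of the generated input image.
+## Implementation
+Next, we implement the style transfer shown above step by step in code.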
+```python
+# Import the required modules
+%matplotlib inline
+import numpy as np
+import matplotlib.pyplot as plt
+from hapi.model import Model, Loss
+from hapi.vision.models import vgg16
+from hapi.vision.transforms import transforms
+from paddle import fluid
+from paddle.fluid.io import Dataset
+import cv2
+import copy
+# Image preprocessing function, and the function that restores a tensor to a natural image
+from style_transfer import load_image, image_restore
+```
+```python
+# Enable dygraph (dynamic graph) mode
+fluid.enable_dygraph()
+```
+```python
+# Content image, the image to be stylized
+content_path = './images/chicago_cropped.jpg'
+# Style image
+style_path = './images/Starry-Night-by-Vincent-Van-Gogh-painting.jpg'
+```
+```python
+# Visualize the two images
+content = load_image(content_path)
+style = load_image(style_path, shape=tuple(content.shape[-2:]))
+fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 10))
+ax1.imshow(image_restore(content))
+ax2.imshow(image_restore(style))
+```
+![png](images/markdown/output_10_1.png)
+```python
+# Define the style-transfer model, using VGG16 pretrained on ImageNet as the base model
+class StyleTransferModel(Model):
+    def __init__(self):
+        super(StyleTransferModel, self).__init__()
+        # With pretrained=True, the ImageNet-pretrained weights are downloaded and loaded automatically
+        vgg = vgg16(pretrained=True)
+        self.base_model = vgg.features
+        for p in self.base_model.parameters():
+            p.stop_gradient=True
+        self.layers = {
+                  '0': 'conv1_1',
+                  '3': 'conv2_1',
+                  '6': 'conv3_1',
+                  '10': 'conv4_1',
+                  '11': 'conv4_2',  ## content representation
+                  '14': 'conv5_1'
+                 }
+    def forward(self, image):
+        outputs = []
+        for name, layer in self.base_model.named_sublayers():
+            image = layer(image)
+            if name in self.layers:
+                outputs.append(image)
+        return outputs
+```
+```python
+# Define the style-transfer loss function
+class StyleTransferLoss(Loss):
+    def __init__(self, content_loss_weight=1, style_loss_weight=1e5, style_weights=[1.0, 0.8, 0.5, 0.3, 0.1]):
+        super(StyleTransferLoss, self).__init__()
+        self.content_loss_weight = content_loss_weight
+        self.style_loss_weight = style_loss_weight
+        self.style_weights = style_weights
+    def forward(self, outputs, labels):
+        content_features = labels[-1]
+        style_features = labels[:-1]
+        # Loss for image content similarity
+        content_loss = fluid.layers.mean((outputs[-2] - content_features)**2)
+        # Loss for style similarity
+        style_loss = 0
+        style_grams = [self.gram_matrix(feat) for feat in style_features ]
+        style_weights = self.style_weights
+        for i, weight in enumerate(style_weights):
+            target_gram = self.gram_matrix(outputs[i])
+            layer_loss = weight * fluid.layers.mean((target_gram - style_grams[i])**2)
+            b, d, h, w = outputs[i].shape
+            style_loss += layer_loss / (d * h * w)
+        total_loss = self.content_loss_weight * content_loss + self.style_loss_weight * style_loss
+        return total_loss
+    def gram_matrix(self, A):
+        if len(A.shape) == 4:
+            batch_size, c, h, w = A.shape
+            A = fluid.layers.reshape(A, (c, h*w))
+        GA = fluid.layers.matmul(A, fluid.layers.transpose(A, [1, 0]))
+        return GA
+```
+```python
+# Create the model
+model = StyleTransferModel()
+```
+```python
+# Create the loss function
+style_loss = StyleTransferLoss()
+```
+```python
+# Initialize the image to be generated from the content image
+target = Model.create_parameter(model, shape=content.shape)
+target.set_value(content.numpy())
+```
+```python
+# Create the optimizer
+optimizer = fluid.optimizer.Adam(parameter_list=[target], learning_rate=0.001)
+```
+```python
+# Initialize the high-level API
+model.prepare(optimizer, style_loss)
+```
+```python
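+# Extract content features and style features from the content image and the style image
+content_features = model.test_batch(content)
+style_features = model.test_batch(style)
+```
+```python
+# Combine the two sets of features and pass them to the model as the label of the loss function
+feats = style_features + [content_features[-2]]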
+```
+```python
+# Train for 5000 steps and plot the generated image every 500 steps to check progress
+steps = 5000
+for i in range(steps):
+    outs = model.train_batch(target, feats)
+    if i % 500 == 0:
+        print('iters:', i, 'loss:', outs[0])
+        plt.imshow(image_restore(target))
+        plt.show()
+```
+    iters: 0 loss: [8.829961e+09]
+![png](images/markdown/output_20_1.png)
+    iters: 500 loss: [3.728548e+08]
+![png](images/markdown/output_20_3.png)
+    iters: 1000 loss: [1.6327214e+08]
+![png](images/markdown/output_20_5.png)
+    iters: 1500 loss: [1.0806553e+08]
+![png](images/markdown/output_20_7.png)
+    iters: 2000 loss: [81069480.]
+![png](images/markdown/output_20_9.png)
+    iters: 2500 loss: [64284104.]
+![png](images/markdown/output_20_11.png)
+    iters: 3000 loss: [52580884.]
+![png](images/markdown/output_20_13.png)
+    iters: 3500 loss: [43825304.]
+![png](images/markdown/output_20_15.png)
+    iters: 4000 loss: [37048400.]
+![png](images/markdown/output_20_17.png)
+    iters: 4500 loss: [31719670.]
+![png](images/markdown/output_20_19.png)
+```python
+# Image after style transfer
+fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 10))
+ax1.imshow(image_restore(content))
+ax2.imshow(image_restore(target))
+ax3.imshow(image_restore(style))
+```
+![png](images/markdown/output_21_1.png)
+## Summary
+All of the runnable code above is in [style-transfer.ipynb](./style-transfer.ipynb). We also provide the [style_transfer.py](./style_transfer.py) script; run the following command to perform style transfer on your own images:
+```shell
+python -u style_transfer.py --content-image /path/to/your-content-image --style-image /path/to/your-style-image --save-dir /path/to/your-output-dir
+```
+The image generated by style transfer is saved in ```--save-dir```.
+## References
+[A Neural Algorithm of Artistic Style](https://arxiv.org/abs/1508.06576)
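The README above measures style similarity as the mean squared difference between the Gram matrices of two feature maps. As a framework-free illustration, the same computation can be sketched in NumPy. `gram_matrix` mirrors the README snippet for a single-image batch, while `style_distance` and the random feature maps are assumptions made for this example and are not taken from `style_transfer.py`.

```python
import numpy as np


def gram_matrix(feature_map):
    """Gram matrix of a feature map with shape [1, c, h, w]; result is [c, c]."""
    _, c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)  # one row per channel
    return flat @ flat.T


def style_distance(feat_a, feat_b):
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(feat_a) - gram_matrix(feat_b)
    return float(np.mean(diff ** 2))


# Random stand-ins for feature maps from the same layer (illustration only)
rng = np.random.default_rng(0)
feat_target = rng.standard_normal((1, 3, 4, 5)).astype("float32")
feat_style = rng.standard_normal((1, 3, 4, 5)).astype("float32")

gram = gram_matrix(feat_target)
print(gram.shape)
print(style_distance(feat_target, feat_style))
```

Entry `(i, j)` of the Gram matrix is the dot product of channels `i` and `j` over all spatial positions, so it records which channels activate together while discarding where they activate. That is why it serves as a style descriptor rather than a content descriptor.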
--- a/examples/style-transfer/images/Starry-Night-by-Vincent-Van-Gogh-painting.jpg
+++ b/examples/style-transfer/images/Starry-Night-by-Vincent-Van-Gogh-painting.jpg
--- a/examples/style-transfer/images/chicago_cropped.jpg
+++ b/examples/style-transfer/images/chicago_cropped.jpg
--- a/examples/style-transfer/images/janelle.png
+++ b/examples/style-transfer/images/janelle.png
--- a/examples/style-transfer/images/markdown/img1.png
+++ b/examples/style-transfer/images/markdown/img1.png
--- a/examples/style-transfer/images/markdown/img2.png
+++ b/examples/style-transfer/images/markdown/img2.png
--- a/examples/style-transfer/images/markdown/output_10_1.png
+++ b/examples/style-transfer/images/markdown/output_10_1.png
--- a/examples/style-transfer/images/markdown/output_20_1.png
+++ b/examples/style-transfer/images/markdown/output_20_1.png
--- a/examples/style-transfer/images/markdown/output_20_11.png
+++ b/examples/style-transfer/images/markdown/output_20_11.png
--- a/examples/style-transfer/images/markdown/output_20_13.png
+++ b/examples/style-transfer/images/markdown/output_20_13.png
--- a/examples/style-transfer/images/markdown/output_20_15.png
+++ b/examples/style-transfer/images/markdown/output_20_15.png
--- a/examples/style-transfer/images/markdown/output_20_17.png
+++ b/examples/style-transfer/images/markdown/output_20_17.png
--- a/examples/style-transfer/images/markdown/output_20_19.png
+++ b/examples/style-transfer/images/markdown/output_20_19.png
--- a/examples/style-transfer/images/markdown/output_20_3.png
+++ b/examples/style-transfer/images/markdown/output_20_3.png
--- a/examples/style-transfer/images/markdown/output_20_5.png
+++ b/examples/style-transfer/images/markdown/output_20_5.png
--- a/examples/style-transfer/images/markdown/output_20_7.png
+++ b/examples/style-transfer/images/markdown/output_20_7.png
--- a/examples/style-transfer/images/markdown/output_20_9.png
+++ b/examples/style-transfer/images/markdown/output_20_9.png
--- a/examples/style-transfer/images/markdown/output_21_1.png
+++ b/examples/style-transfer/images/markdown/output_21_1.png
--- a/examples/style-transfer/style-transfer.ipynb
+++ b/examples/style-transfer/style-transfer.ipynb
--- a/examples/style-transfer/style_transfer.py
+++ b/examples/style-transfer/style_transfer.py
+import os
+import argparse
+import numpy as np
+import matplotlib.pyplot as plt
+from hapi.model import Model, Loss
+from hapi.vision.models import vgg16
+from hapi.vision.transforms import transforms
+from paddle import fluid
+from paddle.fluid.io import Dataset
+import cv2
+import copy
+def load_image(image_path, max_size=400, shape=None):
+    image = cv2.imread(image_path)
+    image = image.astype('float32') / 255.0
+    size = shape if shape is not None else max_size if max(
+        image.shape[:2]) > max_size else max(image.shape[:2])
+    transform = transforms.Compose([
+        transforms.Resize(size), transforms.Permute(),
+        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
+    ])
+    image = transform(image)[np.newaxis, :3, :, :]
+    image = fluid.dygraph.to_variable(image)
+    return image
+def image_restore(image):
+    image = np.squeeze(image.numpy(), 0)
+    image = image.transpose(1, 2, 0)
+    image = image * np.array((0.229, 0.224, 0.225)) + np.array(
+        (0.485, 0.456, 0.406))
+    image = image.clip(0, 1)
+    return image
+class StyleTransferModel(Model):
+    def __init__(self):
+        super(StyleTransferModel, self).__init__()
+        # With pretrained=True, the ImageNet pretrained weights are downloaded and loaded automatically
+        vgg = vgg16(pretrained=True)
+        self.base_model = vgg.features
+        for p in self.base_model.parameters():
+            p.stop_gradient = True
+        self.layers = {
+            '0': 'conv1_1',
+            '3': 'conv2_1',
+            '6': 'conv3_1',
+            '10': 'conv4_1',
+            '11': 'conv4_2',  ## content representation
+            '14': 'conv5_1'
+        }
+    def forward(self, image):
+        outputs = []
+        for name, layer in self.base_model.named_sublayers():
+            image = layer(image)
+            if name in self.layers:
+                outputs.append(image)
+        return outputs
+class StyleTransferLoss(Loss):
+    def __init__(self,
+                 content_loss_weight=1,
+                 style_loss_weight=1e5,
+                 style_weights=[1.0, 0.8, 0.5, 0.3, 0.1]):
+        super(StyleTransferLoss, self).__init__()
+        self.content_loss_weight = content_loss_weight
+        self.style_loss_weight = style_loss_weight
+        self.style_weights = style_weights
+    def forward(self, outputs, labels):
+        content_features = labels[-1]
+        style_features = labels[:-1]
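+        # Content loss: how closely the generated image matches the content features
+        content_loss = fluid.layers.mean((outputs[-2] - content_features)**2)
+        # Style loss: how closely the Gram matrices match the style features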
+        style_loss = 0
+        style_grams = [self.gram_matrix(feat) for feat in style_features]
+        style_weights = self.style_weights
+        for i, weight in enumerate(style_weights):
+            target_gram = self.gram_matrix(outputs[i])
+            layer_loss = weight * fluid.layers.mean((target_gram - style_grams[
+                i])**2)
+            b, d, h, w = outputs[i].shape
+            style_loss += layer_loss / (d * h * w)
+        total_loss = self.content_loss_weight * content_loss + self.style_loss_weight * style_loss
+        return total_loss
+    def gram_matrix(self, A):
+        if len(A.shape) == 4:
+            _, c, h, w = A.shape
+            A = fluid.layers.reshape(A, (c, h * w))
+        GA = fluid.layers.matmul(A, fluid.layers.transpose(A, [1, 0]))
+        return GA
+def main():
+    # Enable dygraph (dynamic graph) mode
+    fluid.enable_dygraph()
+    content = load_image(FLAGS.content_image)
+    style = load_image(FLAGS.style_image, shape=tuple(content.shape[-2:]))
+    model = StyleTransferModel()
+    style_loss = StyleTransferLoss()
+    # Initialize the image to be generated from the content image
+    target = Model.create_parameter(model, shape=content.shape)
+    target.set_value(content.numpy())
+    optimizer = fluid.optimizer.Adam(
+        parameter_list=[target], learning_rate=FLAGS.lr)
+    model.prepare(optimizer, style_loss)
+    content_fetures = model.test_batch(content)
+    style_features = model.test_batch(style)
+    # Combine both feature sets and pass them to the model as the loss labels
+    feats = style_features + [content_fetures[-2]]
+    # Train for FLAGS.steps steps (default 5000) and print the loss every 500 steps
+    steps = FLAGS.steps
+    for i in range(steps):
+        outs = model.train_batch(target, feats)
+        if i % 500 == 0:
+            print('iters:', i, 'loss:', outs[0][0])
+    if not os.path.exists(FLAGS.save_dir):
+        os.makedirs(FLAGS.save_dir)
+    # Save the generated image
+    name = FLAGS.content_image.split(os.sep)[-1]
+    output_path = os.path.join(FLAGS.save_dir, 'generated_' + name)
+    cv2.imwrite(output_path,
+                cv2.cvtColor((image_restore(target) * 255).astype('uint8'),
+                             cv2.COLOR_RGB2BGR))
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser("Style Transfer")
+    parser.add_argument(
+        "--content-image",
+        type=str,
+        default='./images/chicago_cropped.jpg',
+        help="content image")
+    parser.add_argument(
+        "--style-image",
+        type=str,
+        default='./images/Starry-Night-by-Vincent-Van-Gogh-painting.jpg',
+        help="style image")
+    parser.add_argument(
+        "--save-dir", type=str, default='./output', help="output dir")
+    parser.add_argument(
+        "--steps", default=5000, type=int, help="number of steps to run")
+    parser.add_argument(
+        '--lr',
+        '--learning-rate',
+        default=1e-3,
+        type=float,
+        metavar='LR',
+        help='initial learning rate')
+    FLAGS = parser.parse_args()
+    main()
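Review note: the per-layer style term in `StyleTransferLoss` compares Gram matrices of the generated and style feature maps. The sketch below reproduces that idea in plain Python on tiny inputs so the arithmetic can be checked by hand. The helper names `gram_matrix` and `mse` here are hypothetical stand-ins, not the Paddle calls used in the diff, and the per-layer weight and the `d * h * w` normalization are left out.

```python
# Illustrative only: a pure-Python Gram matrix, mirroring the
# reshape-to-(c, h*w)-then-matmul idea in StyleTransferLoss.gram_matrix.


def gram_matrix(features):
    """features: list of c rows, each one a flattened channel of length h*w."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
            for row_i in features]


def mse(x, y):
    """Mean squared difference between two equally shaped 2-D lists."""
    diffs = [(a - b) ** 2 for rx, ry in zip(x, y) for a, b in zip(rx, ry)]
    return sum(diffs) / len(diffs)


# Two channels, each a flattened 1x2 feature map (assumed toy values).
target = [[1.0, 2.0], [3.0, 4.0]]
style = [[1.0, 0.0], [0.0, 1.0]]

target_gram = gram_matrix(target)  # channel-by-channel inner products
style_gram = gram_matrix(style)
layer_loss = mse(target_gram, style_gram)
```

The Gram matrix discards spatial layout and keeps only channel correlations, which is why the style image can be resized to the content image's shape without affecting what the loss measures.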
--- a/transformer/README.md
+++ b/transformer/README.md
@@ -201,7 +201,7 @@ python -u predict.py \
  --special_token '<s>' '<e>' '<unk>' \
  --predict_file gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de \
  --batch_size 32 \
-  --init_from_params base_model_dygraph/step_100000/transformer \
+  --init_from_params big_model_dygraph/step_100000/transformer \
  --beam_size 5 \
  --max_out_len 255 \
  --output_file predict.txt \

--- a/transformer/gen_data.sh
+++ b/transformer/gen_data.sh
--- a/transformer/images/multi_head_attention.png
+++ b/transformer/images/multi_head_attention.png
--- a/transformer/images/transformer_network.png
+++ b/transformer/images/transformer_network.png
--- a/transformer/predict.py
+++ b/transformer/predict.py
@@ -14,9 +14,6 @@
 import logging
 import os
-import six
-import sys
-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 from functools import partial
 import numpy as np
@@ -28,9 +25,9 @@ from paddle.fluid.layers.utils import flatten
 from utils.configure import PDConfig
 from utils.check import check_gpu, check_version
-from model import Input, set_device
+from hapi.model import Input, set_device
 from reader import prepare_infer_input, Seq2SeqDataset, Seq2SeqBatchSampler
-from transformer import InferTransformer, position_encoding_init
+from transformer import InferTransformer
 def post_process_seq(seq, bos_idx, eos_idx, output_bos=False,
@@ -132,7 +129,7 @@ def do_predict(args):
    # TODO: use model.predict when support variant length
    f = open(args.output_file, "wb")
    for data in data_loader():
-        finished_seq = transformer.test(inputs=flatten(data))[0]
+        finished_seq = transformer.test_batch(inputs=flatten(data))[0]
        finished_seq = np.transpose(finished_seq, [0, 2, 1])
        for ins in finished_seq:
            for beam_idx, beam in enumerate(ins):

--- a/transformer/reader.py
+++ b/transformer/reader.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 import glob
-import six
+import sys
 import os
 import io
 import itertools
@@ -26,7 +26,7 @@ from paddle.io import BatchSampler, DataLoader, Dataset
 def create_data_loader(args, device):
-    data_loaders = [None, None]
+    data_loaders = [(None, None)] * 2
    data_files = [args.training_file, args.validation_file
                  ] if args.validation_file else [args.training_file]
    for i, data_file in enumerate(data_files):
@@ -65,7 +65,7 @@ def create_data_loader(args, device):
                n_head=args.n_head),
            num_workers=0,  # TODO: use multi-process
            return_list=True)
-        data_loaders[i] = data_loader
+        data_loaders[i] = (data_loader, batch_sampler.__len__)
    return data_loaders
@@ -289,7 +289,6 @@ class Seq2SeqDataset(Dataset):
                 start_mark="<s>",
                 end_mark="<e>",
                 unk_mark="<unk>",
-                 only_src=False,
                 trg_fpattern=None,
                 byte_data=False):
        if byte_data:
@@ -477,6 +476,7 @@ class Seq2SeqBatchSampler(BatchSampler):
                for i in range(self._nranks)
            ] for batch in batches]
            batches = list(itertools.chain.from_iterable(batches))
+        self.batch_number = (len(batches) + self._nranks - 1) // self._nranks
        # for multi-device
        for batch_id, batch in enumerate(batches):
@@ -490,11 +490,13 @@ class Seq2SeqBatchSampler(BatchSampler):
                yield batch_indices
    def __len__(self):
+        if hasattr(self, "batch_number"):  # set during iteration
+            return self.batch_number
        if not self._use_token_batch:
            batch_number = (
                len(self._dataset) + self._batch_size * self._nranks - 1) // (
                    self._batch_size * self._nranks)
        else:
-            # TODO(guosheng): fix the uncertain length
-            batch_number = 1
+            # for uncertain batch number, the actual value is self.batch_number
+            batch_number = sys.maxsize
        return batch_number
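Review note: the `__len__` change above reports a placeholder until iteration has computed the real per-rank batch count with a ceiling division. A minimal sketch of that behaviour follows. `SketchSampler` and its constructor arguments are hypothetical and do not reproduce the `Seq2SeqBatchSampler` API.

```python
import sys


class SketchSampler(object):
    """Hypothetical stand-in for a token-based batch sampler."""

    def __init__(self, batches, nranks, use_token_batch=True):
        self._batches = batches
        self._nranks = nranks
        self._use_token_batch = use_token_batch

    def __iter__(self):
        # Ceiling division: each rank sees ceil(len(batches) / nranks) batches.
        self.batch_number = (
            len(self._batches) + self._nranks - 1) // self._nranks
        for batch in self._batches:
            yield batch

    def __len__(self):
        # Prefer the value computed during iteration, when available.
        if hasattr(self, "batch_number"):
            return self.batch_number
        if not self._use_token_batch:
            return (len(self._batches) + self._nranks - 1) // self._nranks
        # Length is unknown before the first pass over token-based batches.
        return sys.maxsize


sampler = SketchSampler(batches=[[0, 1], [2], [3, 4], [5], [6]], nranks=2)
length_before = len(sampler)  # placeholder value

consumed = []
for batch in sampler:  # an explicit loop avoids list()'s length-hint preallocation
    consumed.append(batch)

length_after = len(sampler)  # real value, set once iteration has started
```

Because the real count only exists after iteration begins, a caller that needs it (such as a progress bar) has to read it lazily, which is what passing `batch_sampler.__len__` as a callable achieves.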
--- a/transformer/train.py
+++ b/transformer/train.py
@@ -14,9 +14,6 @@
 import logging
 import os
-import six
-import sys
-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 import numpy as np
 import paddle
@@ -26,14 +23,18 @@ from paddle.io import DataLoader
 from utils.configure import PDConfig
 from utils.check import check_gpu, check_version
-from model import Input, set_device
+from hapi.model import Input, set_device
-from callbacks import ProgBarLogger
+from hapi.callbacks import ProgBarLogger
 from reader import create_data_loader
 from transformer import Transformer, CrossEntropyCriterion
 class TrainCallback(ProgBarLogger):
-    def __init__(self, args, verbose=2):
+    def __init__(self,
+                 args,
+                 verbose=2,
+                 train_steps_fn=None,
+                 eval_steps_fn=None):
        # TODO(guosheng): save according to step
        super(TrainCallback, self).__init__(args.print_step, verbose)
        # the best cross-entropy value with label smoothing
@@ -42,11 +43,17 @@ class TrainCallback(ProgBarLogger):
                (1. - args.label_smooth_eps)) + args.label_smooth_eps *
            np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20))
        self.loss_normalizer = loss_normalizer
+        self.train_steps_fn = train_steps_fn
+        self.eval_steps_fn = eval_steps_fn
    def on_train_begin(self, logs=None):
        super(TrainCallback, self).on_train_begin(logs)
        self.train_metrics += ["normalized loss", "ppl"]
+    def on_train_batch_begin(self, step, logs=None):
+        if step == 0 and self.train_steps_fn:
+            self.train_progbar._num = self.train_steps_fn()
    def on_train_batch_end(self, step, logs=None):
        logs["normalized loss"] = logs["loss"][0] - self.loss_normalizer
        logs["ppl"] = np.exp(min(logs["loss"][0], 100))
@@ -57,6 +64,10 @@ class TrainCallback(ProgBarLogger):
        self.eval_metrics = list(
            self.eval_metrics) + ["normalized loss", "ppl"]
+    def on_eval_batch_begin(self, step, logs=None):
+        if step == 0 and self.eval_steps_fn:
+            self.eval_progbar._num = self.eval_steps_fn()
    def on_eval_batch_end(self, step, logs=None):
        logs["normalized loss"] = logs["loss"][0] - self.loss_normalizer
        logs["ppl"] = np.exp(min(logs["loss"][0], 100))
@@ -104,7 +115,8 @@ def do_train(args):
    ]
    # def dataloader
-    train_loader, eval_loader = create_data_loader(args, device)
+    (train_loader, train_steps_fn), (
+        eval_loader, eval_steps_fn) = create_data_loader(args, device)
    # define model
    transformer = Transformer(
@@ -142,7 +154,12 @@ def do_train(args):
                    eval_freq=1,
                    save_freq=1,
                    save_dir=args.save_model,
-                    callbacks=[TrainCallback(args)])
+                    callbacks=[
+                        TrainCallback(
+                            args,
+                            train_steps_fn=train_steps_fn,
+                            eval_steps_fn=eval_steps_fn)
+                    ])
 if __name__ == "__main__":
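Review note: `create_data_loader` now returns `(loader, steps_fn)` pairs, and `do_train` unpacks two such pairs even when no validation file is given. The sketch below shows why the placeholder has to be `(None, None)` rather than `None`. `create_pairs` and the string "loaders" are hypothetical simplifications.

```python
def create_pairs(data_files):
    """Return two (loader, steps_fn) pairs; unused slots stay (None, None)."""
    pairs = [(None, None)] * 2
    for i, data_file in enumerate(data_files):
        loader = "loader:" + data_file  # stand-in for a real DataLoader
        # Bind the length now; the callable is evaluated later by the caller.
        steps_fn = lambda n=len(data_file): n
        pairs[i] = (loader, steps_fn)
    return pairs


# Only a training file is supplied, so the eval pair stays (None, None).
(train_loader, train_steps_fn), (eval_loader, eval_steps_fn) = create_pairs(
    ["train.txt"])
```

With a bare `None` in the second slot, the nested unpacking would raise a `TypeError`. The pair placeholder keeps the call site uniform, and the callback guards each function with `if step == 0 and self.eval_steps_fn`.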

--- a/transformer/transformer.py
+++ b/transformer/transformer.py
@@ -20,8 +20,8 @@ import paddle.fluid as fluid
 import paddle.fluid.layers as layers
 from paddle.fluid.dygraph import Embedding, LayerNorm, Linear, Layer, to_variable
 from paddle.fluid.dygraph.learning_rate_scheduler import LearningRateDecay
-from model import Model, CrossEntropy, Loss
+from hapi.model import Model, CrossEntropy, Loss
-from text import TransformerBeamSearchDecoder, DynamicDecode
+from hapi.text import TransformerBeamSearchDecoder, DynamicDecode
 def position_encoding_init(n_position, d_pos_vec):

--- a/transformer/transformer.yaml
+++ b/transformer/transformer.yaml
--- a/transformer/utils/__init__.py
+++ b/transformer/utils/__init__.py
--- a/transformer/utils/check.py
+++ b/transformer/utils/check.py
--- a/transformer/utils/configure.py
+++ b/transformer/utils/configure.py
@@ -195,13 +195,19 @@ class PDConfig(object):
                               "Whether to perform predicting.")
        self.default_g.add_arg("do_eval", bool, False,
                               "Whether to perform evaluating.")
-        self.default_g.add_arg("do_save_inference_model", bool, False,
-                               "Whether to perform model saving for inference.")
+        self.default_g.add_arg(
+            "do_save_inference_model", bool, False,
+            "Whether to perform model saving for inference.")
        # NOTE: args for profiler
-        self.default_g.add_arg("is_profiler", int, 0, "the switch of profiler tools. (used for benchmark)")
-        self.default_g.add_arg("profiler_path", str, './', "the profiler output file path. (used for benchmark)")
-        self.default_g.add_arg("max_iter", int, 0, "the max train batch num.(used for benchmark)")
+        self.default_g.add_arg(
+            "is_profiler", int, 0,
+            "the switch of profiler tools. (used for benchmark)")
+        self.default_g.add_arg(
+            "profiler_path", str, './',
+            "the profiler output file path. (used for benchmark)")
+        self.default_g.add_arg("max_iter", int, 0,
+                               "the max train batch num.(used for benchmark)")
        self.parser = parser

--- a/hapi/callbacks.py
+++ b/hapi/callbacks.py
@@ -215,13 +215,13 @@ class ProgBarLogger(Callback):
        if self.train_step % self.log_freq == 0 and self.verbose and ParallelEnv(
        ).local_rank == 0:
-            # if steps is not None, last step will update in on_epoch_end
-            if self.steps and self.train_step < self.steps:
+            if self.steps is None or self.train_step < self.steps:
                self._updates(logs, 'train')
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
-        if self.verbose and ParallelEnv().local_rank == 0:
+        if self.train_step % self.log_freq != 0 and self.verbose and ParallelEnv(
+        ).local_rank == 0:
            self._updates(logs, 'train')
    def on_eval_begin(self, logs=None):
@@ -242,14 +242,14 @@ class ProgBarLogger(Callback):
        if self.eval_step % self.log_freq == 0 and self.verbose and ParallelEnv(
        ).local_rank == 0:
-            # if steps is not None, last step will update in on_epoch_end
-            if self.eval_steps and self.eval_step < self.eval_steps:
+            if self.eval_steps is None or self.eval_step < self.eval_steps:
                self._updates(logs, 'eval')
    def on_eval_end(self, logs=None):
        logs = logs or {}
        if self.verbose and ParallelEnv().local_rank == 0:
-            self._updates(logs, 'eval')
+            if self.eval_step % self.log_freq != 0:
+                self._updates(logs, 'eval')
            print('Eval samples: %d' % (self.evaled_samples))

--- a/hapi/model.py
+++ b/hapi/model.py
@@ -553,8 +553,8 @@ class DynamicGraphAdapter(object):
        self.model.clear_gradients()
        metrics = []
        for metric in self.model._metrics:
-            metric_outs = metric.add_metric_op(
-                *(to_list(outputs) + to_list(labels)))
+            metric_outs = metric.add_metric_op(*(
+                to_list(outputs) + to_list(labels)))
            m = metric.update(*[to_numpy(m) for m in to_list(metric_outs)])
            metrics.append(m)
@@ -593,8 +593,8 @@ class DynamicGraphAdapter(object):
                    self._merge_count[self.mode + '_total'] += samples
                    self._merge_count[self.mode + '_batch'] = samples
-            metric_outs = metric.add_metric_op(
-                *(to_list(outputs) + to_list(labels)))
+            metric_outs = metric.add_metric_op(*(
+                to_list(outputs) + to_list(labels)))
            m = metric.update(*[to_numpy(m) for m in to_list(metric_outs)])
            metrics.append(m)
@@ -834,8 +834,16 @@ class Model(fluid.dygraph.Layer):
            global _parallel_context_initialized
            if ParallelEnv().nranks > 1 and not _parallel_context_initialized:
                if fluid.in_dygraph_mode():
+                    main_prog_seed = fluid.default_main_program().random_seed
+                    startup_prog_seed = fluid.default_startup_program(
+                    ).random_seed
                    fluid.disable_dygraph()
                    fluid.enable_dygraph(self._place)
+                    # enable_dygraph would create and switch to a new program,
+                    # thus also copy seed to the new program
+                    fluid.default_main_program().random_seed = main_prog_seed
+                    fluid.default_startup_program(
+                    ).random_seed = startup_prog_seed
                    fluid.dygraph.parallel.prepare_context()
                else:
                    prepare_distributed_context(self._place)
@@ -970,7 +978,7 @@ class Model(fluid.dygraph.Layer):
        do_eval = eval_loader is not None
        self._test_dataloader = eval_loader
        metrics_name = self._metrics_name()
-        steps = len(train_loader) if hasattr(train_loader, '__len__') else None
+        steps = self._len_data_loader(train_loader)
        cbks = config_callbacks(
            callbacks,
            model=self,
@@ -998,8 +1006,7 @@ class Model(fluid.dygraph.Layer):
                if not isinstance(eval_loader, Iterable):
                    loader = eval_loader()
-                eval_steps = len(loader) if hasattr(loader,
-                                                    '__len__') else None
+                eval_steps = self._len_data_loader(loader)
                cbks.on_begin('eval', {
                    'steps': eval_steps,
                    'metrics_name': metrics_name
@@ -1075,7 +1082,7 @@ class Model(fluid.dygraph.Layer):
        if not isinstance(eval_loader, Iterable):
            loader = eval_loader()
-        eval_steps = len(loader) if hasattr(loader, '__len__') else None
+        eval_steps = self._len_data_loader(loader)
        cbks.on_begin('eval',
                      {'steps': eval_steps,
                       'metrics_name': metrics_name})
@@ -1228,7 +1235,7 @@ class Model(fluid.dygraph.Layer):
                       mode,
                       metrics_name,
                       epoch=None):
-        size = len(data_loader) if hasattr(data_loader, '__len__') else None
+        size = self._len_data_loader(data_loader)
        logs = {
            'steps': size,
            'metrics_name': metrics_name,
@@ -1303,3 +1310,10 @@ class Model(fluid.dygraph.Layer):
        for m in self._metrics:
            metrics_name.extend(to_list(m.name()))
        return metrics_name
+    def _len_data_loader(self, data_loader):
+        try:
+            steps = len(data_loader)
+        except Exception:
+            steps = None
+        return steps
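Review note: `_len_data_loader` replaces the `hasattr(loader, '__len__')` checks with a try/except. One plausible reason, stated here as an assumption rather than the author's rationale, is that an object can define `__len__` and still raise when it is called. The sketch below uses a hypothetical `len_or_none` helper and a hypothetical `BrokenLen` class to show the difference.

```python
def len_or_none(data_loader):
    """Return len(data_loader), or None when the length cannot be determined."""
    try:
        return len(data_loader)
    except Exception:
        return None


class BrokenLen(object):
    """Defines __len__ but cannot actually report a length."""

    def __len__(self):
        raise TypeError("length is not known for this loader")


sized = len_or_none([1, 2, 3])          # a list has a usable length
unsized = len_or_none(iter([1, 2]))     # an iterator has no __len__ at all
broken = len_or_none(BrokenLen())       # __len__ exists but raises
passes_hasattr = hasattr(BrokenLen(), "__len__")
```

The old `hasattr` test would have called `len()` on `BrokenLen` and propagated the exception, whereas the try/except form degrades to `None`, which the callbacks already treat as "unknown number of steps".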
--- a/hapi/text/text.py
+++ b/hapi/text/text.py
@@ -49,7 +49,7 @@ __all__ = [
    'BeamSearchDecoder', 'MultiHeadAttention', 'FFN',
    'TransformerEncoderLayer', 'TransformerEncoder', 'TransformerDecoderLayer',
    'TransformerDecoder', 'TransformerBeamSearchDecoder', 'Linear_chain_crf',
-    'Crf_decoding', 'SequenceTagging'
+    'Crf_decoding', 'SequenceTagging', 'GRUEncoderLayer'
 ]
@@ -238,8 +238,9 @@ class BasicLSTMCell(RNNCell):
        self._bias_attr = bias_attr
        self._gate_activation = gate_activation or layers.sigmoid
        self._activation = activation or layers.tanh
-        self._forget_bias = layers.fill_constant(
-            [1], dtype=dtype, value=forget_bias)
+        # TODO(guosheng): find better way to resolve constants in __init__
+        self._forget_bias = layers.create_global_var(
+            shape=[1], dtype=dtype, value=forget_bias, persistable=True)
        self._forget_bias.stop_gradient = False
        self._dtype = dtype
        self._input_size = input_size
@@ -762,7 +763,7 @@ class BasicGRUCell(RNNCell):
        c = self._activation(candidate)
        new_hidden = u * pre_hidden + (1 - u) * c
-        return new_hidden
+        return new_hidden, new_hidden
    @property
    def state_shape(self):
@@ -817,7 +818,7 @@ class RNN(fluid.dygraph.Layer):
                    lambda x: fluid.layers.transpose(x, [1, 0] + list(
                        range(2, len(x.shape)))), inputs)
-            if sequence_length:
+            if sequence_length is not None:
                mask = fluid.layers.sequence_mask(
                    sequence_length,
                    maxlen=time_steps,
@@ -828,7 +829,7 @@ class RNN(fluid.dygraph.Layer):
                inputs = map_structure(
                    lambda x: fluid.layers.reverse(x, axis=[0]), inputs)
                mask = fluid.layers.reverse(
-                    mask, axis=[0]) if sequence_length else None
+                    mask, axis=[0]) if sequence_length is not None else None
            states = initial_states
            outputs = []
@@ -836,7 +837,7 @@ class RNN(fluid.dygraph.Layer):
                step_inputs = map_structure(lambda x: x[i], inputs)
                step_outputs, new_states = self.cell(step_inputs, states,
                                                     **kwargs)
-                if sequence_length:
+                if sequence_length is not None:
                    new_states = map_structure(
                        partial(
                            _maybe_copy, step_mask=mask[i]),
@@ -1740,6 +1741,64 @@ class Crf_decoding(fluid.dygraph.Layer):
        return viterbi_path
+class GRUEncoderLayer(Layer):
+    def __init__(self,
+                 input_dim,
+                 grnn_hidden_dim,
+                 init_bound,
+                 num_layers=1,
+                 h_0=None,
+                 is_bidirection=False):
+        super(GRUEncoderLayer, self).__init__()
+        self.h_0 = h_0
+        self.num_layers = num_layers
+        self.is_bidirection = is_bidirection
+        self.gru_list = []
+        self.gru_r_list = []
+        for i in range(num_layers):
+            self.basic_gru_cell = BasicGRUCell(
+                input_size=input_dim if i == 0 else input_dim * 2,
+                hidden_size=grnn_hidden_dim,
+                param_attr=fluid.ParamAttr(
+                    initializer=fluid.initializer.UniformInitializer(
+                        low=-init_bound, high=init_bound),
+                    regularizer=fluid.regularizer.L2DecayRegularizer(
+                        regularization_coeff=1e-4)))
+            self.gru_list.append(
+                self.add_sublayer(
+                    "gru_%d" % i,
+                    RNN(self.basic_gru_cell,
+                        is_reverse=False,
+                        time_major=False)))
+        if self.is_bidirection:
+            for i in range(num_layers):
+                self.basic_gru_cell_r = BasicGRUCell(
+                    input_size=input_dim if i == 0 else input_dim * 2,
+                    hidden_size=grnn_hidden_dim,
+                    param_attr=fluid.ParamAttr(
+                        initializer=fluid.initializer.UniformInitializer(
+                            low=-init_bound, high=init_bound),
+                        regularizer=fluid.regularizer.L2DecayRegularizer(
+                            regularization_coeff=1e-4)))
+                self.gru_r_list.append(
+                    self.add_sublayer(
+                        "gru_r_%d" % i,
+                        RNN(self.basic_gru_cell_r,
+                            is_reverse=True,
+                            time_major=False)))
+    def forward(self, input_feature):
+        for i in range(self.num_layers):
+            pre_gru, pre_state = self.gru_list[i](input_feature)
+            if self.is_bidirection:
+                gru_r, r_state = self.gru_r_list[i](input_feature)
+                out = fluid.layers.concat(input=[pre_gru, gru_r], axis=-1)
+            else:
+                out = pre_gru
+            input_feature = out
+        return out
 class SequenceTagging(fluid.dygraph.Layer):
    def __init__(self,
                 vocab_size,
@@ -1789,26 +1848,13 @@ class SequenceTagging(fluid.dygraph.Layer):
            force_cpu=True,
            name='h_0')
-        self.bigru_units = []
-        for i in range(self.bigru_num):
-            if i == 0:
-                self.bigru_units.append(
-                    self.add_sublayer(
-                        "bigru_units%d" % i,
-                        BiGRU(
-                            self.grnn_hidden_dim,
-                            self.grnn_hidden_dim,
-                            self.init_bound,
-                            h_0=h_0)))
-            else:
-                self.bigru_units.append(
-                    self.add_sublayer(
-                        "bigru_units%d" % i,
-                        BiGRU(
-                            self.grnn_hidden_dim * 2,
-                            self.grnn_hidden_dim,
-                            self.init_bound,
-                            h_0=h_0)))
+        self.gru_encoder = GRUEncoderLayer(
+            input_dim=self.grnn_hidden_dim,
+            grnn_hidden_dim=self.grnn_hidden_dim,
+            init_bound=self.init_bound,
+            num_layers=self.bigru_num,
+            h_0=h_0,
+            is_bidirection=True)
        self.fc = Linear(
            input_dim=self.grnn_hidden_dim * 2,
@@ -1836,10 +1882,7 @@ class SequenceTagging(fluid.dygraph.Layer):
        word_embed = self.word_embedding(word)
        input_feature = word_embed
-        for i in range(self.bigru_num):
-            bigru_output = self.bigru_units[i](input_feature)
-            input_feature = bigru_output
+        bigru_output = self.gru_encoder(input_feature)
        emission = self.fc(bigru_output)
        if target is not None: