Merge pull request #59 from PaddlePaddle/update_pd_131

Update with the release of paddle fluid 1.3.1

95c7f715 · Yibing Liu · GitHub · 65df295a · 11b46946 · 95c7f715
17 changed files
--- a/BERT/README.md
+++ b/BERT/README.md
-
 # BERT on PaddlePaddle

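 [BERT](https://arxiv.org/abs/1810.04805) is a general-purpose semantic representation model with strong transfer ability, built on the [Transformer](https://arxiv.org/abs/1706.03762) as its basic network component, with the bidirectional `Masked Language Model`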
@@ -6,11 +5,11 @@

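 ### Release highlights

-1) Full support for BERT model training, including:
+1) Full support for BERT from model training through deployment, including:

 - BERT pre-training on GPU, single-machine and distributed
 - BERT multi-GPU Fine-tuning
-
+- A demo of the BERT prediction interface, to ease production deployment across multiple hardware devices

 2) Support for FP16/FP32 mixed-precision training and Fine-tuning, which reduces GPU memory usage and speeds up training;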

@@ -37,29 +36,31 @@
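  - [Data preprocessing](#数据预处理)
  - [Single-machine training](#单机训练)
  - [Distributed training](#分布式训练)
-- [**Fine-Tuning**: how to apply the pre-trained model to specific NLP tasks](#fine-tuning-任务)
+- [**Fine-Tuning**: how to apply the pre-trained model to specific NLP tasks](#nlp-任务的-fine-tuning)
  - [Sentence and sentence-pair classification tasks](#语句和句对分类任务)
  - [Reading comprehension: SQuAD](#阅读理解-squad)
 - [**Mixed-precision training**: speeding up training with mixed precision](#混合精度训练)
 - [**Model conversion**: how to convert a BERT TensorFlow model into a Paddle Fluid model](#模型转换)
-
+- [**Model deployment**: support for deploying models on multiple hardware environments](#模型部署)
+  - [Producing an inference model for deployment](#保存-inference-model)
+  - [Example of calling the inference interface](#inference-接口调用示例)

 ## Installation
-This project depends on Paddle Fluid 1.3.0; please follow the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it.
+This project depends on Paddle Fluid **1.3.1**; please follow the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it. If you need to convert TensorFlow model parameters to Paddle Fluid parameters, TensorFlow 1.12 must be installed as well.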

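 ## Pre-training

 ### Data preprocessing

-Taking pre-training of the Chinese model as an example, sentence pairs with contextual relations can be constructed from Chinese Wikipedia data as training data. The CharTokenizer in [`tokenization.py`](tokenization.py) tokenizes the constructed sentence pairs to produce tokenized plain-text data, which is then mapped to id data according to the vocabulary [`config/vocab.txt`](config/vocab.txt) and used as training data;
+Taking pre-training of the Chinese model as an example, sentence pairs with contextual relations can be constructed from Chinese Wikipedia data as training data. The CharTokenizer in [`tokenization.py`](tokenization.py) tokenizes the constructed sentence pairs to produce tokenized plain-text data, which is then mapped to id data according to the vocabulary [`vocab.txt`](data/demo_config/vocab.txt) and used as training data. This example vocabulary and the model configuration [`bert_config.json`](./data/demo_config/bert_config.json) both come from [BERT-Base, Chinese](https://bert-models.bj.bcebos.com/chinese_L-12_H-768_A-12.tar.gz).

-We provide sample tokenized plain-text data: [`data/demo_wiki_tokens.txt`](./data/demo_wiki_tokens.txt), in which each line consists of 2 tab-separated sentences, for example:
+We provide sample tokenized plain-text data: [`demo_wiki_tokens.txt`](./data/demo_wiki_tokens.txt), in which each line consists of 2 tab-separated sentences, for example: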

 ```
 1 . 雏 凤 鸣 剧 团   2 . 古 典 之 门 ： 帝 女 花 3 . 戏 曲 之 旅 ： 第 155 期 心 系 唐 氏 慈 善 戏 曲 晚 会 4 . 区 文 凤 , 郑 燕 虹 1999 编 ， 香 港 当 代 粤 剧 人 名 录 ， 中 大 音 乐 系 5 . 王 胜 泉 , 张 文 珊 2011 编 ， 香 港 当 代 粤 剧 人 名 录 ， 中 大 音 乐 系
 ```

-We also provide part of the id-converted training data, [`data/train/demo_wiki_train.gz`](./data/train/demo_wiki_train.gz), and test data, [`data/validation/demo_wiki_validation.gz`](./data/validation/demo_wiki_validation.gz); each line is 1 training sample, for example:
+We also provide part of the id-converted training data, [`demo_wiki_train.gz`](./data/train/demo_wiki_train.gz), and test data, [`demo_wiki_validation.gz`](./data/validation/demo_wiki_validation.gz); each line is 1 training sample, for example:

 ```
 1 7987 3736 8577 8020 2398 969 1399 8038 8021 3221 855 754 7270 7029 1344 7649 4506 2356 4638 676 6823 1298 928 5632 1220 6756 6887 722 769 3837 6887 511 2 4385 3198 6820 3313 1423 4500 511 2;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1;0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41;1
@@ -76,7 +77,7 @@ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 ./train.sh -local y
 ```

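-Note in particular that setting the parameter `generate_neg_sample` to `True` means that, during pre-training, negative samples for the `next sentence prediction` task are generated dynamically from the positive samples in the training data. The sample training data `data/train/demo_wiki_train.gz` that we provide contains only positive samples for the `next sentence prediction` task; if positive and negative samples for the `next sentence prediction` task have already been constructed in advance, `generate_neg_sample` must be set to `False`.
+Note in particular that setting the parameter `generate_neg_sample` to `True` means that, during pre-training, negative samples for the `Next Sentence Prediction` task are generated dynamically from the positive samples in the training data. The sample training data [`demo_wiki_train.gz`](data/train/demo_wiki_train.gz) that we provide contains only positive samples for the `Next Sentence Prediction` task; if positive and negative samples for the `Next Sentence Prediction` task have already been constructed in advance, `generate_neg_sample` must be set to `False`.

 While the pre-training task runs, it prints the current learning rate, the number of epochs over the training data, the total number of iteration steps so far, the training error, the training speed, and similar information. Depending on the `--validation_steps ${N}` setting, it prints the model's metrics on the validation set every `N` steps: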

@@ -112,7 +113,7 @@ export current_endpoint=192.168.0.17:9185
 ./train.sh -local n
 ```

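-## Fine-tuning tasks
+## Fine-tuning for NLP tasks

 Once BERT pre-training is complete, the pre-trained parameters can be used for Fine-tuning on specific NLP tasks. The following uses the open-source pre-trained models to show how to run Fine-tuning for classification and reading-comprehension tasks. To run these tasks, first download the corresponding pre-trained model through the links provided in the [Release highlights](#发布要点) section.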

@@ -191,8 +192,8 @@ CHECKPOINT_PATH=/path/to/save/checkpoints/
 SQUAD_PATH=/path/to/squad/data/

 python -u run_squad.py --use_cuda true\
-                    --batch_size 4068 \
-                    --in_tokens true\
+                    --batch_size 12 \
+                    --in_tokens false\
                    --init_pretraining_params ${BERT_BASE_PATH}/params \
                    --checkpoints ${CHECKPOINT_PATH} \
                    --vocab_path ${BERT_BASE_PATH}/vocab.txt \
@@ -207,7 +208,6 @@ python -u run_squad.py --use_cuda true\
                    --predict_file ${SQUAD_PATH}/dev-v1.1.json \
                    --do_lower_case true \
                    --doc_stride 128 \
-                    --n_best_size 20 \
                    --train_file ${SQUAD_PATH}/train-v1.1.json \
                    --learning_rate 3e-5 \
                    --lr_scheduler linear_warmup_decay \
@@ -217,7 +217,7 @@ python -u run_squad.py --use_cuda true\
 Evaluate the prediction results:

 ```shell
-python ${SQUAD_PATH}/evaluate-v1.1.py ${SQUAD_PATH}/dev-v1.1.json ${CHECKPOINT_PATH}predictions.json
+python ${SQUAD_PATH}/evaluate-v1.1.py ${SQUAD_PATH}/dev-v1.1.json ${CHECKPOINT_PATH}/predictions.json
 ```

 The output will look similar to the following:
@@ -235,8 +235,8 @@ CHECKPOINT_PATH=/path/to/save/checkpoints/
 SQUAD_PATH=/path/to/squad/data/

 python -u run_squad.py --use_cuda true \
-                    --batch_size 4068 \
-                    --in_tokens true\
+                    --batch_size 12 \
+                    --in_tokens false\
                    --init_pretraining_params ${BERT_BASE_PATH}/params \
                    --checkpoints ${CHECKPOINT_PATH} \
                    --vocab_path ${BERT_BASE_PATH}/vocab.txt \
@@ -316,8 +316,80 @@ python convert_params.py \
 ```
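 This completes the model conversion.

-**Note**: to run the conversion script successfully, both TensorFlow and Paddle Fluid 1.3 must be installed.

+## Model deployment
+
+To apply a deep learning model in real-world scenarios, the model has to be deployed: the trained model is placed on different machines, which requires considering different hardware environments (GPU/CPU, a single machine or a distributed cluster, or embedded devices), the software environment (for example, whether the corresponding deep learning framework is installed on every target machine), and runtime performance. Requiring the whole framework in every deployment environment is inconvenient, so one feasible solution is to let the model run independently of the framework. Paddle Fluid takes this approach: build the [Paddle Fluid inference](http://paddlepaddle.org/documentation/docs/zh/1.2/advanced_usage/deploy/inference/build_and_install_lib_cn.html) library and write a `C++` inference interface that loads the model. At prediction time, it is enough to load the saved prediction network structure and model parameters to run prediction on input data; the whole framework is no longer needed, only the Paddle Fluid inference library, which makes model deployment flexible.
+
+The following uses the sentence and sentence-pair classification task as an example to describe model deployment. First train the model, then save the model used for deployment, and finally write a `C++` inference program that loads the model and parameters to run prediction.
+
+The earlier section [Sentence and sentence-pair classification tasks](#语句和句对分类任务) described how to train a model for the XNLI task and save checkpoints. Note, however, that these checkpoints contain only the model parameters and the state information needed during training (see [params](http://paddlepaddle.org/documentation/docs/zh/1.3/api_cn/io_cn.html#save-params) and [persistables](http://paddlepaddle.org/documentation/docs/zh/1.3/api_cn/io_cn.html#save-persistables)). To generate an [inference model](http://paddlepaddle.org/documentation/docs/zh/1.2/api_cn/io_cn.html#permalink-5-save_inference_model) for prediction, follow the steps below.
+
+### Saving the inference model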
+
+```shell
+BERT_BASE_PATH="chinese_L-12_H-768_A-12"
+TASK_NAME="XNLI"
+DATA_PATH=/path/to/xnli/data/
+INIT_CKPT_PATH=/path/to/a/finetuned/checkpoint/
+
+python -u predict_classifier.py --task_name ${TASK_NAME} \
+       --use_cuda true \
+       --batch_size 64 \
+       --data_dir ${DATA_PATH} \
+       --vocab_path ${BERT_BASE_PATH}/vocab.txt \
+       --init_checkpoint ${INIT_CKPT_PATH} \
+       --do_lower_case true \
+       --max_seq_len 128 \
+       --bert_config_path ${BERT_BASE_PATH}/bert_config.json \
+       --do_predict true \
+       --save_inference_model_path ${INIT_CKPT_PATH}
+```
+
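+The script above does two things:
+
+1. It loads model parameters from a given `init_checkpoint`; if `--do_predict` is set to `true`, it then runs on the `test` dataset and outputs the prediction results.
+2. It generates the inference model corresponding to `init_checkpoint`, which is saved in the `${INIT_CKPT_PATH}/{CKPT_NAME}_inference_model` directory.
+
+### Example of calling the inference interface
+
+Prediction with `C++` requires the Paddle Fluid inference library; for a concrete usage example, see `README.md` in the [`inference`](./inference) directory.
+
+The code below demonstrates prediction with `C++`. For more details, see the example in the [`inference`](./inference) directory, which can serve as a reference for writing your own inference code.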
+
+``` cpp
+#include <paddle_inference_api.h>
+
+// create and set configuration
+paddle::NativeConfig config;
+config.model_dir = "xxx";
+config.use_gpu = false;
+
+// create predictor
+auto predictor = CreatePaddlePredictor(config);
+
+// create input tensors
+paddle::PaddleTensor src_id;
+src_id.dtype = paddle::PaddleDType::INT64;
+src_id.shape = ...;
+src_id.data.Reset(...);
+
+paddle::PaddleTensor pos_id;
+paddle::PaddleTensor segment_id;
+paddle::PaddleTensor self_attention_bias;
+paddle::PaddleTensor next_segment_index;
+
+// create output tensors and run prediction
+std::vector<paddle::PaddleTensor> output;
+predictor->Run({src_id, pos_id, segment_id, self_attention_bias, next_segment_index}, &output);
+
+std::cout << "example_id\tcontradiction\tentailment\tneutral";
+for (size_t i = 0; i < output.front().data.length() / sizeof(float); i += 3) {
+  std::cout << static_cast<float *>(output.front().data.data())[i] << "\t"
+            << static_cast<float *>(output.front().data.data())[i + 1] << "\t"
+            << static_cast<float *>(output.front().data.data())[i + 2] << std::endl;
+}
+```

 ## Contributors


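The id-converted sample shown in the README's data preprocessing section packs several `;`-separated fields into one line. A minimal Python sketch of reading such a line follows. The field names (`token_ids`, `sent_ids`, `pos_ids`, `label`) and the helper `parse_pretraining_line` are assumptions made for illustration, inferred from the layout of the sample rather than taken from the repository's reader.

```python
def parse_pretraining_line(line):
    """Split one id-converted sample into its four ';'-separated fields."""
    fields = line.strip().split(";")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    # The first three fields are space-separated integer sequences.
    token_ids, sent_ids, pos_ids = (
        [int(tok) for tok in field.split()] for field in fields[:3])
    # The last field is a single integer (assumed to be the sentence-pair label).
    label = int(fields[3])
    # All three sequences describe the same tokens, so their lengths must agree.
    if not (len(token_ids) == len(sent_ids) == len(pos_ids)):
        raise ValueError("field lengths differ")
    return token_ids, sent_ids, pos_ids, label


# Shortened, made-up sample in the same layout as the README example.
sample = "1 7987 3736 2 4385 511 2;0 0 0 0 1 1 1;0 1 2 3 4 5 6;1"
token_ids, sent_ids, pos_ids, label = parse_pretraining_line(sample)
print(len(token_ids), sent_ids, label)
```

The length check is a cheap way to catch truncated or mis-joined lines before they reach batching.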
--- a/BERT/batching.py
+++ b/BERT/batching.py
@@ -81,7 +81,7 @@ def prepare_batch_data(insts,
                       cls_id=None,
                       sep_id=None,
                       mask_id=None,
-                       return_attn_bias=True,
+                       return_input_mask=True,
                       return_max_len=True,
                       return_num_token=False):
    """
@@ -114,22 +114,25 @@ def prepare_batch_data(insts,
    else:
        out = batch_src_ids
    # Second step: padding
-    src_id, next_sent_index, self_attn_bias = pad_batch_data(
-        out, pad_idx=pad_id, return_next_sent_pos=True, return_attn_bias=True)
+    src_id, self_input_mask = pad_batch_data(
+        out, pad_idx=pad_id, return_input_mask=True)
    pos_id = pad_batch_data(
-        batch_pos_ids, pad_idx=pad_id, return_pos=False, return_attn_bias=False)
+        batch_pos_ids,
+        pad_idx=pad_id,
+        return_pos=False,
+        return_input_mask=False)
    sent_id = pad_batch_data(
        batch_sent_ids,
        pad_idx=pad_id,
        return_pos=False,
-        return_attn_bias=False)
+        return_input_mask=False)

    if mask_id >= 0:
-        return_list = [src_id, pos_id, sent_id, self_attn_bias, mask_label, mask_pos] \
-                      + labels_list + [next_sent_index]
+        return_list = [
+            src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
+        ] + labels_list
    else:
-        return_list = [src_id, pos_id, sent_id, self_attn_bias] + labels_list \
-                     + [next_sent_index]
+        return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list

    return return_list if len(return_list) > 1 else return_list[0]

@@ -137,32 +140,23 @@ def prepare_batch_data(insts,
 def pad_batch_data(insts,
                   pad_idx=0,
                   return_pos=False,
-                   return_next_sent_pos=False,
-                   return_attn_bias=False,
+                   return_input_mask=False,
                   return_max_len=False,
                   return_num_token=False):
    """
    Pad the instances to the max sequence length in batch, and generate the
-    corresponding position data and attention bias.
+    corresponding position data and input mask.
    """
    return_list = []
    max_len = max(len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and make no effect on parameter gradients.

-    inst_data = np.array(
-        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
+    inst_data = np.array([
+        list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts
+    ])
    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]

-    # next_sent_pos for extract first token embedding of each sentence
-    if return_next_sent_pos:
-        batch_size = inst_data.shape[0]
-        max_seq_len = inst_data.shape[1]
-        next_sent_index = np.array(
-            range(0, batch_size * max_seq_len, max_seq_len)).astype(
-                "int64").reshape(-1, 1)
-        return_list += [next_sent_index]
-
    # position data
    if return_pos:
        inst_pos = np.array([
@@ -172,13 +166,12 @@ def pad_batch_data(insts,

        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]

-    if return_attn_bias:
+    if return_input_mask:
        # This is used to avoid attention on paddings.
-        slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] *
-                                       (max_len - len(inst)) for inst in insts])
-        slf_attn_bias_data = np.tile(
-            slf_attn_bias_data.reshape([-1, 1, max_len]), [1, max_len, 1])
-        return_list += [slf_attn_bias_data.astype("float32")]
+        input_mask_data = np.array([[1] * len(inst) + [0] *
+                                    (max_len - len(inst)) for inst in insts])
+        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
+        return_list += [input_mask_data.astype("float32")]

    if return_max_len:
        return_list += [max_len]

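The switch from `return_attn_bias` to `return_input_mask` changes what the padding step hands to the model: a `[batch, max_len, 1]` mask of ones and zeros instead of a tiled `[batch, max_len, max_len]` bias. The sketch below reproduces just that part of `pad_batch_data` with NumPy; the function name `pad_with_input_mask` and the sample ids are hypothetical.

```python
import numpy as np


def pad_with_input_mask(insts, pad_idx=0):
    """Pad token-id lists to the batch max length and build a 0/1 input mask."""
    max_len = max(len(inst) for inst in insts)
    padded = np.array(
        [list(inst) + [pad_idx] * (max_len - len(inst)) for inst in insts])
    mask = np.array(
        [[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts])
    # Shapes follow the diff: ids are [batch, max_len, 1] int64,
    # the mask is [batch, max_len, 1] float32.
    ids = padded.astype("int64").reshape([-1, max_len, 1])
    mask = np.expand_dims(mask, axis=-1).astype("float32")
    return ids, mask


ids, mask = pad_with_input_mask([[101, 7, 8, 102], [101, 9, 102]])
print(ids.shape, mask.shape)
print(mask[1, :, 0])
```

The mask grows linearly with sequence length, whereas the old bias grew quadratically, so less data has to be fed per batch.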
--- a/BERT/config/bert_config.json
+++ b/BERT/config/bert_config.json
--- a/BERT/config/vocab.txt
+++ b/BERT/config/vocab.txt
--- a/BERT/inference/CMakeLists.txt
+++ b/BERT/inference/CMakeLists.txt
 CMAKE_MINIMUM_REQUIRED(VERSION 3.2)
 PROJECT(inference_demo)
-SET(CMAKE_C_COMPILER gcc-4.8)
-SET(CMAKE_CXX_COMPILER g++-4.8)
-ADD_COMPILE_OPTIONS(-std=c++11)
+SET(CMAKE_C_COMPILER gcc)
+SET(CMAKE_CXX_COMPILER g++)
+ADD_COMPILE_OPTIONS(-std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0)

 SET(FLUID_INFER_LIB fluid_inference)
 SET(FLUID_INC_PATH ${FLUID_INFER_LIB}/paddle/include)

--- a/BERT/inference/gen_demo_data.py
+++ b/BERT/inference/gen_demo_data.py
@@ -39,7 +39,7 @@ def main():
        args.batch_size, phase='test', epoch=1, shuffle=False)()

    for i, data in enumerate(gen):
-        data = data[:4] + [data[5]]
+        data = data[:4]
        sample = []
        for field in data:
            shape_str = ' '.join(map(str, field.shape))

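With `next_sent_index` gone, each demo line now carries four fields. Judging from `ParseLine` and `ParseTensor` in `inference.cc`, fields are separated by `;` and each field has the form `shape:data`. The Python sketch below round-trips that layout; the space-separated encoding of the data part and the helper names `serialize_field` and `parse_line` are assumptions for illustration only.

```python
def serialize_field(shape, values):
    """Render one tensor as 'd0 d1 ...:v0 v1 ...'."""
    shape_str = " ".join(map(str, shape))
    # Assumption: values are written space-separated after the ':'.
    data_str = " ".join(map(str, values))
    return shape_str + ":" + data_str


def parse_line(line, num_fields=4):
    """Mirror the checks in ParseLine/ParseTensor; return None on bad input."""
    fields = line.split(";")
    if len(fields) < num_fields:
        return None
    tensors = []
    for field in fields[:num_fields]:
        parts = field.split(":")
        if len(parts) < 2:
            return None
        shape = [int(dim) for dim in parts[0].split()]
        tensors.append((shape, parts[1].split()))
    return tensors


line = ";".join([
    serialize_field([1, 3, 1], [101, 7, 102]),    # src_id
    serialize_field([1, 3, 1], [0, 1, 2]),        # pos_id
    serialize_field([1, 3, 1], [0, 0, 0]),        # segment_id
    serialize_field([1, 3, 1], [1.0, 1.0, 1.0]),  # input mask
])
parsed = parse_line(line)
print(parsed[0])
```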
--- a/BERT/inference/inference.cc
+++ b/BERT/inference/inference.cc
@@ -16,15 +16,19 @@
 #include <glog/logging.h>
 #include <paddle_inference_api.h>
 #include <chrono>
+#include <iostream>
 #include <fstream>
 #include <numeric>
 #include <sstream>
 #include <string>
 #include <vector>

-DEFINE_string(model_dir, "", "model directory");
-DEFINE_string(data, "", "input data path");
-DEFINE_int32(repeat, 1, "repeat");
+DEFINE_string(model_dir, "", "Inference model directory.");
+DEFINE_string(data, "", "Input data path.");
+DEFINE_int32(repeat, 1, "Repeat times.");
+DEFINE_int32(num_labels, 3, "Number of labels.");
+DEFINE_bool(output_prediction, false, "Whether to output the prediction results.");
+DEFINE_bool(use_gpu, false, "Whether to use GPU for prediction.");

 template <typename T>
 void GetValueFromStream(std::stringstream *ss, T *t) {
@@ -73,12 +77,16 @@ constexpr paddle::PaddleDType GetPaddleDType<float>() {
  return paddle::PaddleDType::FLOAT32;
 }

+
 // Parse tensor from string
 template <typename T>
 bool ParseTensor(const std::string &field, paddle::PaddleTensor *tensor) {
  std::vector<std::string> data;
  Split(field, ':', &data);
-  if (data.size() < 2) return false;
+  if (data.size() < 2) {
+    LOG(ERROR) << "parse tensor error!";
+    return false;
+  }

  std::string shape_str = data[0];

@@ -107,10 +115,10 @@ bool ParseLine(const std::string &line,
  std::vector<std::string> fields;
  Split(line, ';', &fields);

-  if (fields.size() < 5) return false;
+  if (fields.size() < 4) return false;

  tensors->clear();
-  tensors->reserve(5);
+  tensors->reserve(4);

  int i = 0;
  // src_id
@@ -128,29 +136,53 @@ bool ParseLine(const std::string &line,
  ParseTensor<int64_t>(fields[i++], &segment_id);
  tensors->push_back(segment_id);

-  // self_attention_bias
-  paddle::PaddleTensor self_attention_bias;
-  ParseTensor<float>(fields[i++], &self_attention_bias);
-  tensors->push_back(self_attention_bias);
-
-  // next_segment_index
-  paddle::PaddleTensor next_segment_index;
-  ParseTensor<int64_t>(fields[i++], &next_segment_index);
-  tensors->push_back(next_segment_index);
+  // input mask
+  paddle::PaddleTensor input_mask;
+  ParseTensor<float>(fields[i++], &input_mask);
+  tensors->push_back(input_mask);

  return true;
 }

+template <typename T>
+void PrintTensor(const paddle::PaddleTensor &t) {
+  std::stringstream ss;
+  ss.str({});
+  ss.clear();
+  ss << "Tensor: shape[";
+  for (auto i: t.shape) {
+    ss << i << " ";
+  }
+  ss << "], data[";
+  T *data = static_cast<T *>(t.data.data());
+  for (int i = 0; i < t.data.length() / sizeof(T); i++) {
+    ss << data[i] << " ";
+  }
+
+  ss << "]";
+  LOG(INFO) << ss.str();
+}
+
+void PrintInputs(const std::vector<paddle::PaddleTensor> &inputs) {
+  for (const auto &t : inputs) {
+    if (t.dtype == paddle::PaddleDType::INT64) {
+      PrintTensor<int64_t>(t);
+    } else {
+      PrintTensor<float>(t);
+    }
+  }
+}
+
 // Print outputs to log
-void PrintOutputs(const std::vector<paddle::PaddleTensor> &outputs) {
-  LOG(INFO) << "example_id\tcontradiction\tentailment\tneutral";
-
-  for (size_t i = 0; i < outputs.front().data.length() / sizeof(float); i += 3) {
-    LOG(INFO) << (i / 3) << "\t"
-              << static_cast<float *>(outputs.front().data.data())[i] << "\t"
-              << static_cast<float *>(outputs.front().data.data())[i + 1]
-              << "\t"
-              << static_cast<float *>(outputs.front().data.data())[i + 2];
+void PrintOutputs(const std::vector<paddle::PaddleTensor> &outputs, int &cnt) {
+  for (size_t i = 0; i < outputs.front().data.length() / sizeof(float); 
+       i += FLAGS_num_labels) {
+    std::cout << cnt << "\t";
+    for (size_t j = 0; j < FLAGS_num_labels; ++j) {
+      std::cout  << static_cast<float *>(outputs.front().data.data())[i+j] << "\t";
+    }
+    std::cout << std::endl;
+    cnt += 1;
  }
 }

@@ -176,11 +208,6 @@ bool LoadInputData(std::vector<std::vector<paddle::PaddleTensor>> *inputs) {
  return true;
 }

-// Bert inference demo
-// Options:
-//     --model_dir: bert model file directory
-//     --data: data path
-//     --repeat: repeat num
 int main(int argc, char *argv[]) {
  google::InitGoogleLogging(*argv);
  gflags::ParseCommandLineFlags(&argc, &argv, true);
@@ -192,7 +219,11 @@ int main(int argc, char *argv[]) {

  paddle::NativeConfig config;
  config.model_dir = FLAGS_model_dir;
-  config.use_gpu = false;
+  if (FLAGS_use_gpu) {
+    config.use_gpu = true;
+    config.fraction_of_gpu_memory = 0.15;
+    config.device = 0;
+  }

  auto predictor = CreatePaddlePredictor(config);

@@ -204,26 +235,31 @@ int main(int argc, char *argv[]) {

  std::vector<paddle::PaddleTensor> fetch;
  int total_time{0};
-  // auto predict_timer = []()
  int num_samples{0};
+  int out_cnt = 0;
  for (int i = 0; i < FLAGS_repeat; i++) {
    for (auto feed : inputs) {
+      fetch.clear();
      auto start = std::chrono::system_clock::now();
      predictor->Run(feed, &fetch);
+      if (FLAGS_output_prediction && i == 0) {
+        PrintOutputs(fetch, out_cnt);
+      }
      auto end = std::chrono::system_clock::now();
      if (!fetch.empty()) {
        total_time +=
            std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                .count();
-        num_samples += fetch.front().data.length() / 3;
+        num_samples += fetch.front().data.length() / FLAGS_num_labels / sizeof(float);
      }
    }
  }
+  

  auto per_sample_ms =
      static_cast<float>(total_time) / num_samples;
-  LOG(INFO) << "Run " << num_samples
-            << " samples, average latency: " << per_sample_ms
+  LOG(INFO) << "Run on " << num_samples 
+            << " samples over "<< FLAGS_repeat << " times, average latency: " << per_sample_ms
            << "ms per sample.";

  return 0;

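The `num_samples` change is more than a rename: `data.length()` is used as a byte count (it is divided by `sizeof(float)` elsewhere in this file), so dividing by the label count alone overstates the number of samples. A small Python sketch of the arithmetic, using `struct` to stand in for the raw output buffer; `count_samples` is a hypothetical helper.

```python
import struct

FLOAT_SIZE = struct.calcsize("<f")  # 4 bytes for a 32-bit float


def count_samples(buffer_bytes, num_labels):
    """Number of samples in a buffer holding num_labels floats per sample."""
    return len(buffer_bytes) // num_labels // FLOAT_SIZE


# Two samples with three class probabilities each.
probs = [0.1, 0.7, 0.2, 0.3, 0.3, 0.4]
buf = struct.pack("<6f", *probs)

print(count_samples(buf, 3))  # prints 2
print(len(buf) // 3)          # prints 8, the count the old expression would give
```

Because the average latency is `total_time / num_samples`, the old expression would have reported a per-sample latency four times too small.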
--- a/BERT/model/bert.py
+++ b/BERT/model/bert.py
@@ -52,7 +52,7 @@ class BertModel(object):
                 src_ids,
                 position_ids,
                 sentence_ids,
-                 self_attn_mask,
+                 input_mask,
                 config,
                 weight_sharing=True,
                 use_fp16=False):
@@ -73,14 +73,14 @@ class BertModel(object):
        self._sent_emb_name = "sent_embedding"
        self._dtype = "float16" if use_fp16 else "float32"

-        # Initialize all weigths by truncated normal initializer, and all biases
+        # Initialize all weigths by truncated normal initializer, and all biases 
        # will be initialized by constant zero by default.
        self._param_initializer = fluid.initializer.TruncatedNormal(
            scale=config['initializer_range'])

-        self._build_model(src_ids, position_ids, sentence_ids, self_attn_mask)
+        self._build_model(src_ids, position_ids, sentence_ids, input_mask)

-    def _build_model(self, src_ids, position_ids, sentence_ids, self_attn_mask):
+    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(
            input=src_ids,
@@ -110,9 +110,12 @@ class BertModel(object):
            emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')

        if self._dtype == "float16":
-            self_attn_mask = fluid.layers.cast(
-                x=self_attn_mask, dtype=self._dtype)
+            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)

+        self_attn_mask = fluid.layers.matmul(
+            x=input_mask, y=input_mask, transpose_y=True)
+        self_attn_mask = fluid.layers.scale(
+            x=self_attn_mask, scale=1000.0, bias=-1.0, bias_after_scale=False)
        n_head_self_attn_mask = fluid.layers.stack(
            x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
@@ -138,13 +141,11 @@ class BertModel(object):
    def get_sequence_output(self):
        return self._enc_out

-    def get_pooled_output(self, next_sent_index):
+    def get_pooled_output(self):
        """Get the first feature of each sequence for classification"""
-        self._reshaped_emb_out = fluid.layers.reshape(
-            x=self._enc_out, shape=[-1, self._emb_size], inplace=True)
-        next_sent_index = fluid.layers.cast(x=next_sent_index, dtype='int32')
-        next_sent_feat = fluid.layers.gather(
-            input=self._reshaped_emb_out, index=next_sent_index)
+
+        next_sent_feat = fluid.layers.slice(
+            input=self._enc_out, axes=[1], starts=[0], ends=[1])
        next_sent_feat = fluid.layers.fc(
            input=next_sent_feat,
            size=self._emb_size,
@@ -154,17 +155,17 @@ class BertModel(object):
            bias_attr="pooled_fc.b_0")
        return next_sent_feat

-    def get_pretraining_output(self, mask_label, mask_pos, labels,
-                               next_sent_index):
+    def get_pretraining_output(self, mask_label, mask_pos, labels):
        """Get the loss & accuracy for pretraining"""

        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')

        # extract the first token feature in each sentence
-        next_sent_feat = self.get_pooled_output(next_sent_index)
+        next_sent_feat = self.get_pooled_output()
+        reshaped_emb_out = fluid.layers.reshape(
+            x=self._enc_out, shape=[-1, self._emb_size])
        # extract masked tokens' feature
-        mask_feat = fluid.layers.gather(
-            input=self._reshaped_emb_out, index=mask_pos)
+        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)

        # transform: fc
        mask_trans_feat = fluid.layers.fc(
@@ -175,7 +176,7 @@ class BertModel(object):
                name='mask_lm_trans_fc.w_0',
                initializer=self._param_initializer),
            bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
-        # transform: layer norm
+        # transform: layer norm 
        mask_trans_feat = pre_process_layer(
            mask_trans_feat, 'n', name='mask_lm_trans')


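The new `matmul` and `scale` calls rebuild the attention bias inside the network from the `[batch, seq_len, 1]` input mask. With `bias_after_scale=False`, `fluid.layers.scale` computes `scale * (x + bias)`, so pairs of real tokens map to 0 and any pair involving padding maps to -1000. The NumPy sketch below mirrors that arithmetic; it is an illustration of the computation, not the Paddle code path, and `attn_bias_from_input_mask` is a hypothetical name.

```python
import numpy as np


def attn_bias_from_input_mask(input_mask, scale=1000.0, bias=-1.0):
    """input_mask: [batch, seq_len, 1] of 1.0 (token) / 0.0 (padding)."""
    # Outer product per example: 1 where both positions are real tokens.
    pair_mask = np.matmul(input_mask, np.transpose(input_mask, (0, 2, 1)))
    # Same order of operations as bias_after_scale=False: scale * (x + bias).
    return scale * (pair_mask + bias)


# One sequence of length 3 whose last position is padding.
input_mask = np.array([[[1.0], [1.0], [0.0]]], dtype="float32")
attn_bias = attn_bias_from_input_mask(input_mask)
print(attn_bias[0])
```

Unlike the previous tiled bias, rows that belong to padding positions are also filled with the negative value here; after softmax such a row is uniform, and those positions are padding in any case.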
--- a/BERT/model/classifier.py
+++ b/BERT/model/classifier.py
@@ -30,25 +30,24 @@ def create_model(args,
    pyreader = fluid.layers.py_reader(
        capacity=50,
        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, args.max_seq_len], [-1, 1], [-1, 1]],
-        dtypes=['int64', 'int64', 'int64', 'float', 'int64', 'int64'],
-        lod_levels=[0, 0, 0, 0, 0, 0],
+                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], [-1, 1]],
+        dtypes=['int64', 'int64', 'int64', 'float32', 'int64'],
+        lod_levels=[0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

-    (src_ids, pos_ids, sent_ids, self_attn_mask, labels,
-     next_sent_index) = fluid.layers.read_file(pyreader)
+    (src_ids, pos_ids, sent_ids, input_mask,
+     labels) = fluid.layers.read_file(pyreader)

    bert = BertModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
-        self_attn_mask=self_attn_mask,
+        input_mask=input_mask,
        config=bert_config,
        use_fp16=args.use_fp16)

-    cls_feats = bert.get_pooled_output(next_sent_index)
+    cls_feats = bert.get_pooled_output()
    cls_feats = fluid.layers.dropout(
        x=cls_feats,
        dropout_prob=0.1,
@@ -65,8 +64,7 @@ def create_model(args,
    if is_prediction:
        probs = fluid.layers.softmax(logits)
        feed_targets_name = [
-            src_ids.name, pos_ids.name, sent_ids.name, self_attn_mask.name,
-            next_sent_index.name
+            src_ids.name, pos_ids.name, sent_ids.name, input_mask.name
        ]
        return pyreader, probs, feed_targets_name


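Dropping `next_sent_index` works because the first-token feature can be taken with a plain slice along the sequence axis. The NumPy sketch below checks that the slice selects the same rows as the old reshape-and-gather approach; the array sizes are arbitrary and the variable names only loosely follow the diff.

```python
import numpy as np

batch_size, seq_len, emb_size = 2, 3, 4
enc_out = np.arange(
    batch_size * seq_len * emb_size, dtype="float32").reshape(
        batch_size, seq_len, emb_size)

# New approach: slice position 0 on axis 1, giving [batch, 1, emb].
first_by_slice = enc_out[:, 0:1, :]

# Old approach: flatten to [batch * seq_len, emb] and gather the rows
# 0, seq_len, 2 * seq_len, ... that start each sequence.
next_sent_index = np.arange(0, batch_size * seq_len, seq_len)
first_by_gather = enc_out.reshape(-1, emb_size)[next_sent_index]

print(np.array_equal(first_by_slice[:, 0, :], first_by_gather))  # prints True
```

The slice needs no extra feed tensor, which is why the classifier's `feed_targets_name` shrinks to four inputs.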
--- a/BERT/predict_classifier.py
+++ b/BERT/predict_classifier.py
@@ -21,8 +21,8 @@ import os
 import time
 import argparse
 import numpy as np
-import paddle.fluid as fluid
 import multiprocessing
+import paddle.fluid as fluid

 import reader.cls as reader
 from model.bert import BertConfig

--- a/BERT/reader/cls.py
+++ b/BERT/reader/cls.py
@@ -82,7 +82,7 @@ class DataProcessor(object):
                            total_token_num,
                            voc_size=-1,
                            mask_id=-1,
-                            return_attn_bias=True,
+                            return_input_mask=True,
                            return_max_len=False,
                            return_num_token=False):
        return prepare_batch_data(
@@ -93,7 +93,7 @@ class DataProcessor(object):
            cls_id=self.vocab["[CLS]"],
            sep_id=self.vocab["[SEP]"],
            mask_id=-1,
-            return_attn_bias=True,
+            return_input_mask=True,
            return_max_len=False,
            return_num_token=False)

@@ -185,7 +185,7 @@ class DataProcessor(object):
                    total_token_num,
                    voc_size=-1,
                    mask_id=-1,
-                    return_attn_bias=True,
+                    return_input_mask=True,
                    return_max_len=False,
                    return_num_token=False)
                yield batch_data

--- a/BERT/reader/pretraining.py
+++ b/BERT/reader/pretraining.py
@@ -278,7 +278,7 @@ class DataReader(object):
                    cls_id=self.cls_id,
                    sep_id=self.sep_id,
                    mask_id=self.mask_id,
-                    return_attn_bias=True,
+                    return_input_mask=True,
                    return_max_len=False,
                    return_num_token=False)


--- a/BERT/reader/squad.py
+++ b/BERT/reader/squad.py
@@ -559,7 +559,7 @@ class DataProcessor(object):
                        cls_id=self.cls_id,
                        sep_id=self.sep_id,
                        mask_id=-1,
-                        return_attn_bias=True,
+                        return_input_mask=True,
                        return_max_len=False,
                        return_num_token=False)


--- a/BERT/run_classifier.py
+++ b/BERT/run_classifier.py
@@ -76,6 +76,7 @@ data_g.add_arg("random_seed",   int,  0,     "Random seed.")
 run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
 run_type_g.add_arg("use_cuda",                     bool,   True,  "If set, use GPU for training.")
 run_type_g.add_arg("use_fast_executor",            bool,   False, "If set, use fast parallel executor (in experiment).")
+run_type_g.add_arg("num_iteration_per_drop_scope", int,    1,     "The iteration intervals to clean up temporary variables.")
 run_type_g.add_arg("task_name",                    str,    None,
                   "The name of task to perform fine-tuning, should be in {'xnli', 'mnli', 'cola', 'mrpc'}.")
 run_type_g.add_arg("do_train",                     bool,   True,  "Whether to perform training.")
@@ -244,9 +245,9 @@ def main(args):

    if args.do_train:
        exec_strategy = fluid.ExecutionStrategy()
-        if args.use_fast_executor:
-            exec_strategy.use_experimental_executor = True
+        exec_strategy.use_experimental_executor = args.use_fast_executor
        exec_strategy.num_threads = dev_count
+        exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope

        train_exe = fluid.ParallelExecutor(
            use_cuda=args.use_cuda,

--- a/BERT/run_squad.py
+++ b/BERT/run_squad.py
@@ -22,9 +22,7 @@ import collections
 import multiprocessing
 import os
 import time
-
 import numpy as np
-
 import paddle
 import paddle.fluid as fluid

@@ -85,10 +83,11 @@ data_g.add_arg("null_score_diff_threshold", float, 0.0,
 data_g.add_arg("random_seed",               int,   0,      "Random seed.")

 run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
-run_type_g.add_arg("use_cuda",              bool,   True,  "If set, use GPU for training.")
-run_type_g.add_arg("use_fast_executor",     bool,   False, "If set, use fast parallel executor (in experiment).")
-run_type_g.add_arg("do_train",              bool,   True,  "Whether to perform training.")
-run_type_g.add_arg("do_predict",            bool,   True,  "Whether to perform prediction.")
+run_type_g.add_arg("use_cuda",                     bool,   True,  "If set, use GPU for training.")
+run_type_g.add_arg("use_fast_executor",            bool,   False, "If set, use fast parallel executor (in experiment).")
+run_type_g.add_arg("num_iteration_per_drop_scope", int,    1,     "The iteration intervals to clean up temporary variables.")
+run_type_g.add_arg("do_train",                     bool,   True,  "Whether to perform training.")
+run_type_g.add_arg("do_predict",                   bool,   True,  "Whether to perform prediction.")

 args = parser.parse_args()
 # yapf: enable.
@@ -99,34 +98,31 @@ def create_model(pyreader_name, bert_config, is_training=False):
            capacity=50,
            shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
                    [-1, args.max_seq_len, 1],
-                    [-1, args.max_seq_len, args.max_seq_len], [-1, 1], [-1, 1],
-                    [-1, 1]],
+                    [-1, args.max_seq_len, 1], [-1, 1], [-1, 1]],
            dtypes=[
-                'int64', 'int64', 'int64', 'float', 'int64', 'int64', 'int64'
-            ],
-            lod_levels=[0, 0, 0, 0, 0, 0, 0],
+                'int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
+            lod_levels=[0, 0, 0, 0, 0, 0],
            name=pyreader_name,
            use_double_buffer=True)
-        (src_ids, pos_ids, sent_ids, self_attn_mask, start_positions,
-         end_positions, next_sent_index) = fluid.layers.read_file(pyreader)
+        (src_ids, pos_ids, sent_ids, input_mask, start_positions,
+         end_positions) = fluid.layers.read_file(pyreader)
    else:
        pyreader = fluid.layers.py_reader(
            capacity=50,
            shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
                    [-1, args.max_seq_len, 1],
-                    [-1, args.max_seq_len, args.max_seq_len], [-1, 1], [-1, 1]],
-            dtypes=['int64', 'int64', 'int64', 'float', 'int64', 'int64'],
-            lod_levels=[0, 0, 0, 0, 0, 0],
+                    [-1, args.max_seq_len, 1], [-1, 1]],
+            dtypes=['int64', 'int64', 'int64', 'float32', 'int64'],
+            lod_levels=[0, 0, 0, 0, 0],
            name=pyreader_name,
            use_double_buffer=True)
-        (src_ids, pos_ids, sent_ids, self_attn_mask, unique_id,
-         next_sent_index) = fluid.layers.read_file(pyreader)
+        (src_ids, pos_ids, sent_ids, input_mask, unique_id) = fluid.layers.read_file(pyreader)

    bert = BertModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
-        self_attn_mask=self_attn_mask,
+        input_mask=input_mask,
        config=bert_config,
        use_fp16=args.use_fp16)

@@ -343,9 +339,9 @@ def train(args):

    if args.do_train:
        exec_strategy = fluid.ExecutionStrategy()
-        if args.use_fast_executor:
-            exec_strategy.use_experimental_executor = True
+        exec_strategy.use_experimental_executor = args.use_fast_executor
        exec_strategy.num_threads = dev_count
+        exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope

        train_exe = fluid.ParallelExecutor(
            use_cuda=args.use_cuda,

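The hunks above replace the precomputed `self_attn_mask` reader input of shape `[-1, max_seq_len, max_seq_len]` with a per-token `input_mask` of shape `[-1, max_seq_len, 1]`. The model-side change is not visible in this part of the diff, so the following is only a sketch of one common way a pairwise attention mask can be derived from a per-token mask: an outer product, followed by conversion to an additive bias. The function names and the `10000.0` constant are assumptions for illustration, and plain Python lists stand in for tensors.

```python
def pairwise_attention_mask(input_mask):
    """Outer product of a per-token mask with itself.

    input_mask: list of floats for one sequence, 1.0 for a real token
    and 0.0 for padding. Returns a [seq_len][seq_len] nested list whose
    entry (i, j) is 1.0 only when both token i and token j are real.
    """
    return [[q * k for k in input_mask] for q in input_mask]


def attention_bias(input_mask, large=10000.0):
    """Turn the pairwise mask into an additive bias for attention logits.

    Valid pairs map to 0.0 and padded pairs to -large, so that a softmax
    over the biased logits gives padded positions near-zero weight.
    `large` is a hypothetical constant chosen for this sketch.
    """
    pairwise = pairwise_attention_mask(input_mask)
    return [[(m - 1.0) * large for m in row] for row in pairwise]


if __name__ == "__main__":
    # Two real tokens followed by one padding token.
    mask = [1.0, 1.0, 0.0]
    for row in attention_bias(mask):
        print(row)
```

Feeding the smaller per-token mask means the reader moves `max_seq_len` values per example instead of `max_seq_len * max_seq_len`, which is a plausible motivation for the change, though the commit itself does not state one.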
--- a/BERT/train.py
+++ b/BERT/train.py
@@ -72,10 +72,11 @@ data_g.add_arg("in_tokens",           bool, True,
               "Otherwise, it will be the maximum number of examples in one batch.")

 run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
-run_type_g.add_arg("is_distributed",    bool,   False,  "If set, then start distributed training.")
-run_type_g.add_arg("use_cuda",          bool,   True,   "If set, use GPU for training.")
-run_type_g.add_arg("use_fast_executor", bool,   False,  "If set, use fast parallel executor (in experiment).")
-run_type_g.add_arg("do_test",           bool,   False,  "Whether to perform evaluation on test data set.")
+run_type_g.add_arg("is_distributed",               bool,   False,  "If set, then start distributed training.")
+run_type_g.add_arg("use_cuda",                     bool,   True,   "If set, use GPU for training.")
+run_type_g.add_arg("use_fast_executor",            bool,   False,  "If set, use fast parallel executor (in experiment).")
+run_type_g.add_arg("num_iteration_per_drop_scope", int,    1,      "The iteration intervals to clean up temporary variables.")
+run_type_g.add_arg("do_test",                      bool,   False,  "Whether to perform evaluation on test data set.")

 args = parser.parse_args()
 # yapf: enable.
@@ -86,30 +87,28 @@ def create_model(pyreader_name, bert_config):
        capacity=70,
        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
                [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, args.max_seq_len], [-1, 1], [-1, 1],
-                [-1, 1], [-1, 1]],
+                [-1, args.max_seq_len, 1], [-1, 1], [-1, 1],
+                [-1, 1]],
        dtypes=[
-            'int64', 'int64', 'int64', 'float', 'int64', 'int64', 'int64',
-            'int64'
+            'int64', 'int64', 'int64', 'float32', 'int64', 'int64', 'int64'
        ],
-        lod_levels=[0, 0, 0, 0, 0, 0, 0, 0],
+        lod_levels=[0, 0, 0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

-    (src_ids, pos_ids, sent_ids, self_attn_mask, mask_label, mask_pos, labels,
-     next_sent_index) = fluid.layers.read_file(pyreader)
+    (src_ids, pos_ids, sent_ids, input_mask, mask_label, mask_pos, labels) = fluid.layers.read_file(pyreader)

    bert = BertModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
-        self_attn_mask=self_attn_mask,
+        input_mask=input_mask,
        config=bert_config,
        weight_sharing=args.weight_sharing,
        use_fp16=args.use_fp16)

    next_sent_acc, mask_lm_loss, total_loss = bert.get_pretraining_output(
-        mask_label, mask_pos, labels, next_sent_index)
+        mask_label, mask_pos, labels)

    if args.use_fp16 and args.loss_scaling > 1.0:
        total_loss *= args.loss_scaling
@@ -310,17 +309,13 @@ def train(args):
        generate_neg_sample=args.generate_neg_sample)

    exec_strategy = fluid.ExecutionStrategy()
-    if args.use_fast_executor:
-        exec_strategy.use_experimental_executor = True
+    exec_strategy.use_experimental_executor = args.use_fast_executor
    exec_strategy.num_threads = dev_count
-
-    build_strategy = fluid.BuildStrategy()
-    build_strategy.remove_unnecessary_lock = False
+    exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope

    train_exe = fluid.ParallelExecutor(
        use_cuda=args.use_cuda,
        loss_name=total_loss.name,
-        build_strategy=build_strategy,
        exec_strategy=exec_strategy,
        main_program=train_program,
        num_trainers=nccl2_num_trainers,

--- a/BERT/train.sh
+++ b/BERT/train.sh
@@ -29,8 +29,8 @@ WEIGHT_DECAY=0.01
 MAX_LEN=512
 TRAIN_DATA_DIR=data/train
 VALIDATION_DATA_DIR=data/validation
-CONFIG_PATH=config/bert_config.json
-VOCAB_PATH=config/vocab.txt
+CONFIG_PATH=data/demo_config/bert_config.json
+VOCAB_PATH=data/demo_config/vocab.txt

 # Change your train arguments:
 python -u ./train.py ${is_distributed}\
@@ -48,4 +48,8 @@ python -u ./train.py ${is_distributed}\
        --weight_decay ${WEIGHT_DECAY:-0} \
        --max_seq_len ${MAX_LEN} \
        --skip_steps 20 \
-        --validation_steps 1000
+        --validation_steps 1000 \
+        --num_iteration_per_drop_scope 10 \
+        --use_fp16 false \
+        --loss_scaling 8.0
+
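`train.sh` now passes `--use_fp16 false`, which only behaves as intended if the `bool` arguments are parsed from text rather than through Python's built-in `bool` (where `bool("false")` is `True`). The `ArgumentGroup` helper is not shown in this diff, so the sketch below is an assumption about how such a helper could be written with `argparse`; `str2bool` and `build_parser` are hypothetical names, and only flags that appear in the diff are registered.

```python
import argparse


def str2bool(value):
    """Parse a command-line string into a bool (assumed helper)."""
    return value.lower() in ("true", "t", "1")


class ArgumentGroup(object):
    """Thin wrapper around an argparse argument group (illustrative only)."""

    def __init__(self, parser, title, des):
        self._group = parser.add_argument_group(title=title, description=des)

    def add_arg(self, name, type, default, help):
        # Swap in the string-aware parser for bool arguments.
        arg_type = str2bool if type == bool else type
        self._group.add_argument(
            "--" + name, default=default, type=arg_type, help=help)


def build_parser():
    parser = argparse.ArgumentParser()
    group = ArgumentGroup(parser, "run_type", "running type options.")
    group.add_arg("use_fast_executor", bool, False,
                  "If set, use fast parallel executor (in experiment).")
    group.add_arg("num_iteration_per_drop_scope", int, 1,
                  "The iteration intervals to clean up temporary variables.")
    group.add_arg("use_fp16", bool, False, "Whether to use fp16 training.")
    group.add_arg("loss_scaling", float, 1.0, "Loss scaling factor for fp16.")
    return parser


if __name__ == "__main__":
    # Mirrors the flags appended to train.sh in this commit.
    args = build_parser().parse_args([
        "--num_iteration_per_drop_scope", "10",
        "--use_fp16", "false",
        "--loss_scaling", "8.0",
    ])
    print(args)
```

With parsing handled this way, the simplified assignment `exec_strategy.use_experimental_executor = args.use_fast_executor` in the diff receives a real `bool` in both the default and the explicitly passed case.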