Unverified commit f6e5d4ff, authored by pkpk, committed by GitHub

Merge pull request #1 from xixiaoyao/master

Release PALM 1.0
PALM
===
PALM (PAddLE Multitask) is a flexible and easy-to-use multi-task learning framework. It has rich model backbones (BERT, ERNIE, etc.), common task paradigms (classification, matching, sequence labeling, machine reading comprehension, etc.), and dataset reading and processing tools built in. For typical task scenarios, users can add a new task with almost no code; for special task scenarios, new tasks can be supported by implementing the framework's predefined interfaces.
## Installation
Currently, the only supported way to use PALM is to clone the source code:
```shell
git clone https://github.com/PaddlePaddle/PALM.git
```
**Dependencies**
- Python >= 2.7
- cuda >= 9.0
- cudnn >= 7.0
- PaddlePaddle >= 1.5.0 (see the [installation guide](http://www.paddlepaddle.org/#quick-start))
## Directory structure
- backbone: backbone networks for multi-task learning; bert, ernie, xlnet, etc. are supported, and users can add their own
- config: configuration files of the individual tasks; when adding a task, create its configuration file here
- data: datasets of the individual tasks
- pretrain_model: pretrained models, vocabularies, and their configurations
- optimizer: optimizers; users can define custom optimizers here
- reader: data reading and processing modules of the individual tasks, plus the joint_reader file that merges the per-task readers
- paradigm: network structures of the task output layers
- utils: shared utility functions
- mtl_run.py: the main multi-task learning workflow
- run.sh: launch script for multi-task learning
## Usage
The framework ships with three fully implemented example tasks: *Machine Reading Comprehension*, *Mask Language Model*, and *Question Answer Matching*. In `mtl_config.yaml`, *Machine Reading Comprehension* is configured as the main task and the others as auxiliary tasks. Multi-task learning can then be started with:
```
bash run.sh
```
### Multi-task learning configuration
`mtl_config.yaml` holds the main configuration for multi-task training and inference. It contains the following fields.
***Required fields***
- main_task: *(str)* name of the main task; currently only a single main task is supported. The name is taken from a configuration file name in the `config` folder (without the `.yaml` extension and without the intermediate suffix used for task sharing)
- auxiliary_task: *(str)* names of the auxiliary tasks; multiple auxiliary tasks are supported, separated by spaces. The names are taken from configuration file names in the `config` folder (without the `.yaml` extension and without the intermediate suffix used for task sharing)
- do_train: *(bool)* training flag
- do_predict: *(bool)* prediction flag; currently only the main task supports prediction
- checkpoint_path: *(str)* path for saving models, resuming training from checkpoints, and loading models for prediction; when loading from this path, the model of the last training step is read by default
- backbone_model: *(str)* backbone network to use; the name is taken from a module in the `backbone` directory
- vocab_path: *(str)* vocabulary file, stored as plain text with one token per line
- optimizer: *(str)* optimizer name, taken from a file name in `optimizer`
- learning_rate: *(str)* learning rate for the training phase
- skip_steps: *(int)* frequency (in steps) of printing logs during training
- epoch: *(int)* number of training epochs of the main task
- use_cuda: *(bool)* flag for training on GPU
- warmup_proportion: *(float)* warmup proportion when fine-tuning the pretrained model
- use_ema: *(bool)* whether to enable EMA (exponential moving average) for training and inference
- ema_decay: *(float)* decay rate when EMA is enabled
- random_seed: *(int)* random seed
- use_fp16: *(bool)* flag for mixed-precision training
- loss_scaling: *(float)* loss scaling factor when mixed-precision training is enabled
***Optional fields***
- pretrain_model_path: *(str)* path from which the pretrained model is loaded; the path should contain a params folder storing the model parameters
- pretrain_config_path: *(str)* configuration file of the pretrained model, in JSON format
- do_lower_case: *(bool)* whether to lower-case the input during preprocessing
- other user-defined fields (a minimal configuration sketch is shown below)
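The following sketch assembles the required fields above into a minimal `mtl_config.yaml`; all values are illustrative.
```yaml
main_task: "reading_comprehension"
auxiliary_task: "mask_language_model answer_matching"
do_train: True
do_predict: True
checkpoint_path: "output_model/firstrun"
backbone_model: "bert_model"
vocab_path: "pretrain_model/bert/vocab.txt"
optimizer: "bert_optimizer"
learning_rate: 3e-5
skip_steps: 10
epoch: 2
use_cuda: True
warmup_proportion: 0.1
use_ema: True
ema_decay: 0.9999
random_seed: 0
use_fp16: False
loss_scaling: 1.0
```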
### Adding a new task
To add a new task, after preparing the task's dataset, three pieces of development are required:
***config module***
Located in the `./config` directory, which stores the configuration file of each task instance, described in `yaml` format. Required fields of a configuration file:
- in_tokens: whether to build batches as LoD tensors; when `in_tokens` is False, batches are built by padding.
- batch_size: number of samples used per training or inference step. When `in_tokens` is True, `batch_size` is the number of tokens contained in each step.
Additional required fields for the training phase:
- train_file: path of the training set file
- mix_ratio: sampling weight of this task during training (1.0 means the expected number of sampling rounds equals that of the main task)
Additional required fields for the inference phase:
- predict_file: path of the test set file
In addition, users can define other hyperparameters as needed; they can be accessed when the task model is created. A minimal task configuration is sketched below.
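A minimal sketch of a task configuration; the paths and values are illustrative (compare the example task configs shipped with the framework):
```yaml
train_file: "data/my_task/train.txt"   # required for training
predict_file: "data/my_task/dev.txt"   # required for inference
mix_ratio: 0.5                         # sampling weight relative to the main task
batch_size: 4
in_tokens: False
# any user-defined hyperparameters may follow
```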
***reader module***
Located in the `./reader` directory; it implements dataset reading and processing. A new reader should be placed in the `reader` directory and must contain a `get_input_shape` function and a `DataProcessor` class (a skeleton is sketched after this list).
- **get_input_shape**: *(function)* defines the shapes and dtypes of the data the reader produces for the backbone and the task paradigm; definitions for both the training and inference phases must be returned.
  - Arguments
    - args: *(dict)* the parsed task configuration
  - Returns
    - train_input_shape: *(dict)* contains the two keys backbone and task; each value is a list of `(shape, dtype)` tuples
    - test_input_shape: *(dict)* contains the two keys backbone and task; each value is a list of `(shape, dtype)` tuples
- **DataProcessor**: *(class)* defines dataset loading, preprocessing, and iteration
  - \_\_init\_\_: constructor; parses and stores the relevant arguments and performs any necessary initialization
    - Arguments
      - args: *(dict)* the parsed task configuration
    - Returns
      - (none)
  - data_generator: *(function)* iterator over the dataset; yields one batch each time it is traversed
    - Arguments
      - phase: *(str)* the task's current phase, either training (`train`) or inference (`predict`)
      - shuffle: *(bool)* whether to shuffle the dataset during training
      - dev_count: *(int)* number of available GPUs or CPUs
    - Yields
      - tensors: *(list)* data yielded as a list, following the input shapes and dtypes declared for the backbone and the task in `get_input_shape`; the head elements of the yielded list are the inputs required by the backbone, and the following elements are the inputs required by the task
  - get_num_examples: *(function)* returns the number of examples. Note that, due to mechanisms such as sliding windows, the number of examples produced at runtime may exceed the number of examples in the dataset; in that case the runtime number should be returned
    - Arguments
      - (none)
    - Returns
      - num_examples: *(int)* number of examples
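A minimal sketch of a reader module under these conventions (e.g. `reader/yelp_senti_reader.py`); the BERT-style input layout and `args.max_seq_len` are illustrative assumptions, and attribute-style access to args mirrors the framework's parsed config:
```python
def get_input_shape(args):
    """Declare the (shape, dtype) pairs fed to the backbone and the task."""
    backbone_shape = [([-1, args.max_seq_len, 1], 'int64'),    # src_ids
                      ([-1, args.max_seq_len, 1], 'int64'),    # pos_ids
                      ([-1, args.max_seq_len, 1], 'int64'),    # sent_ids
                      ([-1, args.max_seq_len, 1], 'float32')]  # input_mask
    train_input_shape = {'backbone': backbone_shape,
                         'task': [([-1, 1], 'int64')]}  # e.g. labels
    test_input_shape = {'backbone': backbone_shape,
                        'task': [([-1, 1], 'int64')]}   # e.g. unique ids
    return train_input_shape, test_input_shape

class DataProcessor(object):
    def __init__(self, args):
        self._args = args
        self._num_examples = 0  # filled in while loading the dataset

    def data_generator(self, phase='train', shuffle=True, dev_count=1):
        """Return a generator creator; each traversal yields one batch."""
        def wrapper():
            # load args.train_file or args.predict_file, tokenize, batch...
            for batch in []:  # placeholder for the real batching loop
                # head elements: backbone inputs; tail elements: task inputs
                yield batch
        return wrapper

    def get_num_examples(self):
        return self._num_examples
```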
***task_paradigm module***
Located in the `./paradigm` directory; it describes the task paradigm (e.g., classification, matching, reading comprehension). A new task paradigm should be placed in the `paradigm` directory and must contain the two required functions `compute_loss` and `create_model`, and may additionally contain the two optional functions `postprocess` and `global_postprocess` (a skeleton is sketched after this list).
- create_model: *(function)* creates the task model
  - Arguments
    - reader_input: *(nested Variables)* outputs of the data input layer, as declared in the `get_input_shape` function of this task's reader module. The first N elements are the backbone's inputs; the following elements are the task's inputs.
    - base_model: *(Model)* the backbone model instance; the task is connected to the backbone by calling the backbone's public output interface. In general, a backbone's output interface includes at least the `final_sentence_representation` and `final_word_representation` properties.
      - base_model.final_sentence_representation: *(Variable)* vector representation of the input text, with shape `[batch_size, hidden_size]`
      - base_model.final_word_representation: *(Variable)* vector representation of each token of the input text, with shape `[batch_size, max_seqlen, hidden_size]`
    - is_training: *(bool)* training flag
    - args: *(Argument)* task-specific configuration; the concrete parameters are defined in the config folder
  - Returns
    - output_tensors: *(dict)* the task's output tensors. During training, the dictionary must contain at least a num_seqs entry, which records the number of samples in the batch (when the input is a LoD tensor, i.e. args.in_tokens is set to True, the samples are flattened and there is no explicit sample-count dimension)
- compute_loss: *(function)* computes the task's batch-averaged loss during training
  - Arguments
    - output_tensors: *(dict)* the value returned when the task is created (by calling `create_model`); maps the names of the Variables needed for the loss computation to their instances
    - args: *(Argument)* task-specific configuration; the concrete parameters are defined in the config folder
  - Returns
    - total_loss: *(Variable)* average loss over the current batch
- postprocess: *(function)* at inference time, post-processes the fetch_results of each inference step; returns the post-processed results for every sample of that step
  - Arguments
    - fetch_results: *(dict)* results computed for the fetch_dict at the current inference step, where fetch_dict is defined and returned by create_model.
  - Returns
    - processed_results: *(list)* post-processed results of all samples in the current inference step.
- global_postprocess: *(function)* after inference finishes, performs the final processing over the post-processed results of all samples (e.g., saving results, second-stage post-processing)
  - Arguments
    - pred_buf: the post-processed prediction results of all test-set samples
    - processor: the instance of the task's DataProcessor (dataset loading and processing) class
    - mtl_args: the multi-task learning configuration, defined in `mtl_config.yaml`
    - task_args: task-specific configuration, defined in the `config` folder
  - Returns
    - (none)
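A minimal sketch of a task paradigm module (e.g. `paradigm/yelp_senti.py`), assuming a binary classification task; the layer size and the bare fc layer are illustrative:
```python
import paddle.fluid as fluid

def create_model(reader_input, base_model=None, is_training=True, args=None):
    labels = reader_input[-1]  # task inputs follow the backbone inputs
    cls_feats = base_model.final_sentence_representation
    logits = fluid.layers.fc(input=cls_feats, size=2)
    num_seqs = fluid.layers.fill_constant(shape=[1], value=1, dtype='int64')
    return {'labels': labels, 'logits': logits, 'num_seqs': num_seqs}

def compute_loss(output_tensors, args=None):
    ce_loss = fluid.layers.softmax_with_cross_entropy(
        logits=output_tensors['logits'], label=output_tensors['labels'])
    return fluid.layers.mean(x=ce_loss)

def postprocess(fetch_results):  # optional
    return list(fetch_results['logits'])

def global_postprocess(pred_buf, processor, mtl_args, task_args):  # optional
    pass
```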
***Naming convention***
After creating the config, task_paradigm, and reader files of a new task, give the three files the same base name, and append a `_reader` suffix to the reader file name. For example, if the new task is named yelp_senti, the config file is `yelp_senti.yaml`, placed in the config folder; the task_paradigm file is `yelp_senti.py`, placed in the paradigm folder; and the reader file is `yelp_senti_reader.py`, placed in the reader folder, as laid out below.
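For the yelp_senti example, the resulting layout is:
```
config/yelp_senti.yaml          # task configuration
paradigm/yelp_senti.py          # task paradigm (create_model / compute_loss)
reader/yelp_senti_reader.py     # dataset reading and processing
```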
***One-to-One mode (task-layer sharing)***
By default, the framework performs multi-task training in one-to-many mode: the tasks share the encoder but not the output layers. This version also supports one-to-one mode, in which the tasks share both the encoder and the output layer (the model parameters are fully shared, but the data sources differ). The mode is enabled through the naming of the config files, as follows.
```
1. In mtl_config.yaml, configure the task names, e.g. main_task: "reading_comprehension"
2. If a task's dataset comes from multiple sources, add one task config per source under config for the same task. For example, if the task "reading_comprehension" has two datasets to train on, and the data within each batch comes from a single dataset, add the two configuration files reading_comprehension.name1.yaml and reading_comprehension.name2.yaml, where name1 and name2 can be chosen freely; the framework does not restrict these names.
3. Start multi-task learning: sh run.sh
```
## License
This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](https://github.com/PaddlePaddle/models/blob/develop/LICENSE).
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT model"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.fluid as fluid
from paddle.fluid import layers
import backbone.utils.transformer as transformer
class Model(object):
def __init__(self,
config,
is_training=False,
model_name=''):
self._emb_size = config["hidden_size"]
self._n_layer = config["num_hidden_layers"]
self._n_head = config["num_attention_heads"]
self._voc_size = config["vocab_size"]
self._max_position_seq_len = config["max_position_embeddings"]
self._sent_types = config["type_vocab_size"]
self._hidden_act = config["hidden_act"]
self._prepostprocess_dropout = config["hidden_dropout_prob"]
self._attention_dropout = config["attention_probs_dropout_prob"]
self._is_training = is_training
self.model_name = model_name
self._word_emb_name = self.model_name + "word_embedding"
self._pos_emb_name = self.model_name + "pos_embedding"
self._sent_emb_name = self.model_name + "sent_embedding"
        # Initialize all weights with a truncated normal initializer; all
        # biases are initialized to constant zero by default.
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config["initializer_range"])
def build_model(self, reader_input, use_fp16=False):
dtype = "float16" if use_fp16 else "float32"
src_ids, pos_ids, sent_ids, input_mask = reader_input[:4]
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
self.emb_out = emb_out
position_emb_out = fluid.layers.embedding(
input=pos_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
self.position_emb_out = position_emb_out
sent_emb_out = fluid.layers.embedding(
sent_ids,
size=[self._sent_types, self._emb_size],
dtype=dtype,
param_attr=fluid.ParamAttr(
name=self._sent_emb_name, initializer=self._param_initializer))
self.sent_emb_out = sent_emb_out
emb_out = emb_out + position_emb_out + sent_emb_out
emb_out = transformer.pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
if dtype == "float16":
input_mask = fluid.layers.cast(x=input_mask, dtype=dtype)
        self_attn_mask = fluid.layers.matmul(
            x=input_mask, y=input_mask, transpose_y=True)
        self_attn_mask = fluid.layers.scale(
            x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
        n_head_self_attn_mask = fluid.layers.stack(
            x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = transformer.encoder(
            enc_input=emb_out,
            attn_bias=n_head_self_attn_mask,
            n_layer=self._n_layer,
            n_head=self._n_head,
            d_key=self._emb_size // self._n_head,
            d_value=self._emb_size // self._n_head,
            d_model=self._emb_size,
            d_inner_hid=self._emb_size * 4,
            prepostprocess_dropout=self._prepostprocess_dropout,
            attention_dropout=self._attention_dropout,
            relu_dropout=0,
            hidden_act=self._hidden_act,
            preprocess_cmd="",
            postprocess_cmd="dan",
            param_initializer=self._param_initializer,
            name=self.model_name + 'encoder')
        next_sent_feat = fluid.layers.slice(
            input=self._enc_out, axes=[1], starts=[0], ends=[1])
        self.next_sent_feat = fluid.layers.fc(
            input=next_sent_feat,
            size=self._emb_size,
            act="tanh",
            param_attr=fluid.ParamAttr(
                name=self.model_name + "pooled_fc.w_0",
                initializer=self._param_initializer),
            bias_attr="pooled_fc.b_0")
@property
def final_word_representation(self):
"""final layer output of transformer encoder as the (contextual) word representation"""
return self._enc_out
@property
def final_sentence_representation(self):
"""final representation of the first token ([CLS]) as sentence representation """
return self.next_sent_feat
if __name__ == "__main__":
print("hello world!")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ernie model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import paddle.fluid as fluid
import backbone.utils.transformer4ernie as transformer
from backbone.interface import backbone
class Model(backbone):
def __init__(self,
config,
is_training=False,
):
self._emb_size = config['hidden_size']
self._n_layer = config['num_hidden_layers']
self._n_head = config['num_attention_heads']
self._voc_size = config['vocab_size']
self._max_position_seq_len = config['max_position_embeddings']
if config['sent_type_vocab_size']:
self._sent_types = config['sent_type_vocab_size']
else:
self._sent_types = config['type_vocab_size']
self._hidden_act = config['hidden_act']
self._prepostprocess_dropout = config['hidden_dropout_prob']
self._attention_dropout = config['attention_probs_dropout_prob']
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._sent_emb_name = "sent_embedding"
self._task_emb_name = "task_embedding"
self._emb_dtype = "float32"
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
def build_model(self, reader_input, use_fp16=False):
dtype = "float16" if use_fp16 else "float32"
src_ids, pos_ids, sent_ids, input_mask = reader_input[:4]
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
position_emb_out = fluid.layers.embedding(
input=pos_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding(
sent_ids,
size=[self._sent_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._sent_emb_name, initializer=self._param_initializer))
emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out
emb_out = transformer.pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
if dtype == "float16":
emb_out = fluid.layers.cast(x=emb_out, dtype=dtype)
input_mask = fluid.layers.cast(x=input_mask, dtype=dtype)
self_attn_mask = fluid.layers.matmul(
x=input_mask, y=input_mask, transpose_y=True)
self_attn_mask = fluid.layers.scale(
x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out = transformer.encoder(
enc_input=emb_out,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
name='encoder')
if dtype == "float16":
self._enc_out = fluid.layers.cast(
x=self._enc_out, dtype=self._emb_dtype)
@property
def final_word_representation(self):
return self._enc_out
@property
def final_sentence_representation(self):
"""Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(
input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(
name="pooled_fc.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc.b_0")
return next_sent_feat
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial, reduce  # reduce is a builtin only on Python 2
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddle.fluid.layer_helper import LayerHelper
def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None):
helper = LayerHelper('layer_norm', **locals())
mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True)
shift_x = layers.elementwise_sub(x=x, y=mean, axis=0)
variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True)
r_stdev = layers.rsqrt(variance + epsilon)
norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0)
param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])]
param_dtype = norm_x.dtype
scale = helper.create_parameter(
attr=param_attr,
shape=param_shape,
dtype=param_dtype,
default_initializer=fluid.initializer.Constant(1.))
bias = helper.create_parameter(
attr=bias_attr,
shape=param_shape,
dtype=param_dtype,
is_bias=True,
default_initializer=fluid.initializer.Constant(0.))
out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1)
out = layers.elementwise_add(x=out, y=bias, axis=-1)
return out
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
        q = layers.fc(input=queries,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_query_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_query_fc.b_0')
        k = layers.fc(input=keys,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_key_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_key_fc.b_0')
        v = layers.fc(input=values,
                      size=d_value * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_value_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x = x, shape = [0, 0, n_head, hidden_size // n_head], inplace=False)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x = trans_x,
shape = [0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace = False)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
    proj_out = layers.fc(input=out,
                         size=d_model,
                         num_flatten_dims=2,
                         param_attr=fluid.ParamAttr(
                             name=name + '_output_fc.w_0',
                             initializer=param_initializer),
                         bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
    if dropout_rate:
        hidden = layers.dropout(
            hidden,
            dropout_prob=dropout_rate,
            dropout_implementation="upscale_in_train",
            is_test=False)
    out = layers.fc(input=hidden,
                    size=d_hid,
                    num_flatten_dims=2,
                    param_attr=fluid.ParamAttr(
                        name=name + '_fc_1.w_0',
                        initializer=param_initializer),
                    bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float32")
            out = layer_norm(
                out,
                begin_norm_axis=len(out.shape) - 1,
                param_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_scale',
                    initializer=fluid.initializer.Constant(1.)),
                bias_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_bias',
                    initializer=fluid.initializer.Constant(0.)))
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float16")
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(
                    out,
                    dropout_prob=dropout_rate,
                    dropout_implementation="upscale_in_train",
                    is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
        param_initializer=param_initializer,
        name=name + '_multi_head_att')
    attn_output = post_process_layer(
        enc_input,
        attn_output,
        postprocess_cmd,
        prepostprocess_dropout,
        name=name + '_post_att')
    ffd_output = positionwise_feed_forward(
        pre_process_layer(
            attn_output,
            preprocess_cmd,
            prepostprocess_dropout,
            name=name + '_pre_ffn'),
        d_inner_hid,
        d_model,
        relu_dropout,
        hidden_act,
        param_initializer=param_initializer,
        name=name + '_ffn')
    return post_process_layer(
        attn_output,
        ffd_output,
        postprocess_cmd,
        prepostprocess_dropout,
        name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name='',
            return_all=False):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
enc_outputs = []
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
            param_initializer=param_initializer,
            name=name + '_layer_' + str(i))
enc_input = enc_output
if i < n_layer - 1:
enc_outputs.append(enc_output)
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
enc_outputs.append(enc_output)
if not return_all:
return enc_output
else:
return enc_output, enc_outputs
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
train_file: "data/am4mrqa/train.txt"
mix_ratio: 0.4
batch_size: 4
in_tokens: False
generate_neg_sample: False
train_file: "data/mlm4mrqa"
mix_ratio: 0.4
batch_size: 4
in_tokens: False
generate_neg_sample: False
train_file: "data/mrqa/mrqa-combined.train.raw.json"
predict_file: "data/mrqa/mrqa-combined.dev.raw.json"
sample_rate: 0.02
mix_ratio: 1.0
batch_size: 4
in_tokens: false
doc_stride: 128
with_negative: false
max_query_length: 64
max_answer_length: 30
n_best_size: 20
null_score_diff_threshold: 0.0
verbose: False
# mtl_config.yaml (the multi-task configuration described in the README above)
main_task: "reading_comprehension"
auxiliary_task: "mask_language_model answer_matching"
do_train: True
do_predict: True
checkpoint_path: "output_model/firstrun"
backbone_model: "bert_model"
pretrain_model_path: "pretrain_model/bert"
pretrain_config_path: "pretrain_model/bert/bert_config.json"
vocab_path: "pretrain_model/bert/vocab.txt"
# backbone_model: "ernie_model"
# pretrain_model_path: "pretrain_model/ernie/params"
# pretrain_config_path: "pretrain_model/ernie/ernie_config.json"
# vocab_path: "pretrain_model/ernie/vocab.txt"
optimizer: "bert_optimizer"
learning_rate: 3e-5
lr_scheduler: "linear_warmup_decay"
skip_steps: 10
save_steps: 10000
epoch: 2
use_cuda: True
warmup_proportion: 0.1
weight_decay: 0.1
do_lower_case: False
max_seq_len: 512
use_ema: True
ema_decay: 0.9999
random_seed: 0
use_fp16: False
loss_scaling: 1.0
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# encoding=utf8
import os
import sys
import time
import argparse
import importlib
import collections
import numpy as np
import multiprocessing
import paddle
import paddle.fluid as fluid
from utils.configure import PDConfig
from utils.placeholder import Placeholder
from utils.configure import JsonConfig, ArgumentGroup, print_arguments
from utils.init import init_pretraining_params, init_checkpoint
sys.path.append("reader")
import joint_reader
from joint_reader import create_reader
sys.path.append("optimizer")
sys.path.append("paradigm")
sys.path.append("backbone")
TASKSET_PATH="config"
def train(multitask_config):
# load task config
print("Loading multi_task configure...................")
args = PDConfig(yaml_file=[multitask_config])
args.build()
index = 0
reader_map_task = dict()
task_args_list = list()
reader_args_list = list()
id_map_task = {index: args.main_task}
print("Loading main task configure....................")
main_task_name = args.main_task
task_config_files = [i for i in os.listdir(TASKSET_PATH) if i.endswith('.yaml')]
main_config_list = [config for config in task_config_files if config.split('.')[0] == main_task_name]
main_args = None
for config in main_config_list:
main_yaml = os.path.join(TASKSET_PATH, config)
main_args = PDConfig(yaml_file=[multitask_config, main_yaml])
main_args.build()
main_args.Print()
if not task_args_list or main_task_name != task_args_list[-1][0]:
task_args_list.append((main_task_name, main_args))
        # str.strip removes a set of characters rather than a suffix, so use
        # os.path.splitext to drop the .yaml extension safely
        reader_args_list.append((os.path.splitext(config)[0], main_args))
        reader_map_task[os.path.splitext(config)[0]] = main_task_name
print("Loading auxiliary tasks configure...................")
aux_task_name_list = args.auxiliary_task.strip().split()
for aux_task_name in aux_task_name_list:
index += 1
id_map_task[index] = aux_task_name
print("Loading %s auxiliary tasks configure......." % aux_task_name)
aux_config_list = [config for config in task_config_files if config.split('.')[0] == aux_task_name]
for aux_yaml in aux_config_list:
aux_yaml = os.path.join(TASKSET_PATH, aux_yaml)
aux_args = PDConfig(yaml_file=[multitask_config, aux_yaml])
aux_args.build()
aux_args.Print()
if aux_task_name != task_args_list[-1][0]:
task_args_list.append((aux_task_name, aux_args))
            reader_args_list.append((os.path.splitext(aux_yaml)[0], aux_args))
            reader_map_task[os.path.splitext(aux_yaml)[0]] = aux_task_name
# import tasks reader module and build joint_input_shape
input_shape_list = []
reader_module_dict = {}
input_shape_dict = {}
for params in task_args_list:
task_reader_mdl = "%s_reader" % params[0]
reader_module = importlib.import_module(task_reader_mdl)
reader_servlet_cls = getattr(reader_module, "get_input_shape")
reader_input_shape = reader_servlet_cls(params[1])
reader_module_dict[params[0]] = reader_module
input_shape_list.append(reader_input_shape)
input_shape_dict[params[0]] = reader_input_shape
train_input_shape, test_input_shape, task_map_id = joint_reader.joint_input_shape(input_shape_list)
# import backbone model
backbone_mdl = args.backbone_model
backbone_cls = "Model"
backbone_module = importlib.import_module(backbone_mdl)
backbone_servlet = getattr(backbone_module, backbone_cls)
if not (args.do_train or args.do_predict):
raise ValueError("For args `do_train` and `do_predict`, at "
"least one of them must be True.")
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
startup_prog = fluid.default_startup_program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
#create joint pyreader
print('creating readers...')
gens = []
main_generator = ""
for params in reader_args_list:
generator_cls = getattr(reader_module_dict[reader_map_task[params[0]]], "DataProcessor")
generator_inst = generator_cls(params[1])
reader_generator = generator_inst.data_generator(phase='train', shuffle=True, dev_count=dev_count)
if not main_generator:
main_generator = generator_inst
gens.append((reader_generator, params[1].mix_ratio, reader_map_task[params[0]]))
joint_generator, train_pyreader, model_inputs = create_reader("train_reader", train_input_shape, True, task_map_id, gens)
train_pyreader.decorate_tensor_provider(joint_generator)
# build task inputs
task_inputs_list = []
main_test_input = []
task_id = model_inputs[0]
backbone_inputs = model_inputs[task_map_id[0][0]: task_map_id[0][1]]
for i in range(1, len(task_map_id)):
task_inputs = backbone_inputs + model_inputs[task_map_id[i][0]: task_map_id[i][1]]
task_inputs_list.append(task_inputs)
# build backbone model
print('building model backbone...')
conf = vars(args)
if args.pretrain_config_path is not None:
model_conf = JsonConfig(args.pretrain_config_path).asdict()
for k, v in model_conf.items():
if k in conf:
                    assert v == conf[k], "ERROR: argument {} in pretrain_model_config is NOT consistent with which in main.yaml".format(k)
conf.update(model_conf)
backbone_inst = backbone_servlet(conf, is_training=True)
print('building task models...')
num_train_examples = main_generator.get_num_examples()
if main_args.in_tokens:
max_train_steps = int(main_args.epoch * num_train_examples) // (
main_args.batch_size // main_args.max_seq_len) // dev_count
else:
max_train_steps = int(main_args.epoch * num_train_examples) // (
main_args.batch_size) // dev_count
mix_ratio_list = [task_args[1].mix_ratio for task_args in task_args_list]
args.max_train_steps = int(max_train_steps * (sum(mix_ratio_list) / main_args.mix_ratio))
print("Max train steps: %d" % max_train_steps)
build_strategy = fluid.BuildStrategy()
train_program = fluid.default_main_program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
backbone_inst.build_model(backbone_inputs)
all_loss_list = []
for i in range(len(task_args_list)):
task_name = task_args_list[i][0]
task_args = task_args_list[i][1]
if hasattr(task_args, 'paradigm'):
task_net = task_args.paradigm
else:
task_net = task_name
task_net_mdl = importlib.import_module(task_net)
task_net_cls = getattr(task_net_mdl, "create_model")
output_tensor = task_net_cls(task_inputs_list[i], base_model=backbone_inst, is_training=True, args=task_args)
loss_cls = getattr(task_net_mdl, "compute_loss")
task_loss = loss_cls(output_tensor, task_args)
all_loss_list.append(task_loss)
num_seqs = output_tensor['num_seqs']
task_one_hot = fluid.layers.one_hot(task_id, len(task_args_list))
all_loss = fluid.layers.concat(all_loss_list, axis=0)
loss = fluid.layers.reduce_sum(task_one_hot * all_loss)
programs = [train_program, startup_prog]
optimizer_mdl = importlib.import_module(args.optimizer)
optimizer_inst = getattr(optimizer_mdl, "optimization")
optimizer_inst(loss, programs, args=args)
loss.persistable = True
num_seqs.persistable = True
ema = fluid.optimizer.ExponentialMovingAverage(args.ema_decay)
ema.update()
train_compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
if args.do_predict:
conf = vars(args)
if args.pretrain_config_path is not None:
model_conf = JsonConfig(args.pretrain_config_path).asdict()
for k, v in model_conf.items():
if k in conf:
assert v == conf[k], "ERROR: argument {} in pretrain_model_config is NOT consistent with which in main.yaml".format(k)
conf.update(model_conf)
mod = reader_module_dict[main_task_name]
DataProcessor = getattr(mod, 'DataProcessor')
predict_processor = DataProcessor(main_args)
test_generator = predict_processor.data_generator(
phase='predict',
shuffle=False,
dev_count=dev_count)
new_test_input_shape = input_shape_dict[main_task_name][1]['backbone'] + input_shape_dict[main_task_name][1]['task']
assert new_test_input_shape == test_input_shape
build_strategy = fluid.BuildStrategy()
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
placeholder = Placeholder(test_input_shape)
test_pyreader, model_inputs = placeholder.build(
capacity=100, reader_name="test_reader")
test_pyreader.decorate_tensor_provider(test_generator)
# create model
backbone_inst = backbone_servlet(conf, is_training=False)
backbone_inst.build_model(model_inputs)
task_net_mdl = importlib.import_module(main_task_name)
task_net_cls = getattr(task_net_mdl, "create_model")
postprocess = getattr(task_net_mdl, "postprocess")
global_postprocess = getattr(task_net_mdl, "global_postprocess")
output_tensor = task_net_cls(model_inputs, base_model=backbone_inst, is_training=False, args=main_args)
if 'ema' not in dir():
ema = fluid.optimizer.ExponentialMovingAverage(args.ema_decay)
pred_fetch_names = []
fetch_vars = []
for i,j in output_tensor.items():
pred_fetch_names.append(i)
fetch_vars.append(j)
for var in fetch_vars:
var.persistable = True
pred_fetch_list = [i.name for i in fetch_vars]
test_prog = test_prog.clone(for_test=True)
test_compiled_program = fluid.CompiledProgram(test_prog).with_data_parallel(
build_strategy=build_strategy)
exe.run(startup_prog)
if args.do_train:
if args.pretrain_model_path:
init_pretraining_params(
exe,
args.pretrain_model_path,
main_program=startup_prog,
use_fp16=args.use_fp16)
if args.checkpoint_path:
if os.path.exists(args.checkpoint_path):
init_checkpoint(
exe,
args.checkpoint_path,
main_program=startup_prog,
use_fp16=args.use_fp16)
else:
os.makedirs(args.checkpoint_path)
elif args.do_predict:
if not args.checkpoint_path:
raise ValueError("args 'checkpoint_path' should be set if"
"only doing prediction!")
init_checkpoint(
exe,
args.checkpoint_path,
main_program=test_prog,
use_fp16=args.use_fp16)
if args.do_train:
print('start training...')
train_pyreader.start()
steps = 0
total_cost, total_num_seqs = [], []
time_begin = time.time()
while True:
try:
steps += 1
if steps % args.skip_steps == 0:
fetch_list = [loss.name, num_seqs.name, task_id.name]
else:
fetch_list = []
outputs = exe.run(train_compiled_program, fetch_list=fetch_list)
if steps % args.skip_steps == 0:
np_loss, np_num_seqs, np_task_id = outputs
total_cost.extend(np_loss * np_num_seqs)
total_num_seqs.extend(np_num_seqs)
time_end = time.time()
used_time = time_end - time_begin
current_example, epoch = main_generator.get_train_progress()
cur_task_name = id_map_task[np_task_id[0][0]]
print("epoch: %d, task_name: %s, progress: %d/%d, step: %d, loss: %f, "
"speed: %f steps/s" %
(epoch, cur_task_name, current_example, num_train_examples, steps,
np.sum(total_cost) / np.sum(total_num_seqs),
args.skip_steps / used_time))
total_cost, total_num_seqs = [], []
time_begin = time.time()
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoint_path,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps == max_train_steps:
save_path = os.path.join(args.checkpoint_path,
"step_" + str(steps) + "_final")
fluid.io.save_persistables(exe, save_path, train_program)
break
except paddle.fluid.core.EOFException as err:
save_path = os.path.join(args.checkpoint_path,
"step_" + str(steps) + "_final")
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
if args.do_predict:
print('start predicting...')
cnt = 0
if args.use_ema:
with ema.apply(exe):
test_pyreader.start()
pred_buf = []
while True:
try:
fetch_res = exe.run(fetch_list=pred_fetch_list, program=test_compiled_program)
cnt += 1
if cnt % 200 == 0:
print('predicting {}th batch...'.format(cnt))
fetch_dict = {}
for key,val in zip(pred_fetch_names, fetch_res):
fetch_dict[key] = val
res = postprocess(fetch_dict)
if res is not None:
pred_buf.extend(res)
except fluid.core.EOFException:
test_pyreader.reset()
break
global_postprocess(pred_buf, predict_processor, args, main_args)
else:
test_pyreader.start()
pred_buf = []
while True:
try:
fetch_res = exe.run(fetch_list=pred_fetch_list, program=test_compiled_program)
cnt += 1
if cnt % 200 == 0:
print('predicting {}th batch...'.format(cnt))
fetch_dict = {}
for key,val in zip(pred_fetch_names, fetch_res):
fetch_dict[key] = val
res = postprocess(fetch_dict)
if res is not None:
pred_buf.extend(res)
except fluid.core.EOFException:
test_pyreader.reset()
break
global_postprocess(pred_buf, predict_processor, args, main_args)
if __name__ == '__main__':
multitask_config = "mtl_config.yaml"
train(multitask_config)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from utils.fp16 import create_master_params_grads, master_param_to_train_param
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
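For intuition, the same schedule can be written in plain Python (a sketch for checking values outside a Paddle program; with power=1.0 and end_learning_rate=0.0, polynomial_decay reduces to linear decay):
```python
def scheduled_lr(step, learning_rate, warmup_steps, num_train_steps):
    """Linear warmup from 0, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / float(warmup_steps)
    # global_step is effectively capped at decay_steps when cycle=False
    frac = min(step, num_train_steps) / float(num_train_steps)
    return learning_rate * (1.0 - frac)

# e.g. learning_rate=3e-5, 100 warmup steps, 1000 total steps
print([scheduled_lr(s, 3e-5, 100, 1000) for s in (0, 50, 100, 500, 1000)])
```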
def optimization(loss, programs, args):
train_program = programs[0]
startup_prog = programs[1]
warmup_steps = args.max_train_steps * args.warmup_proportion
if warmup_steps > 0:
if args.lr_scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(float(args.learning_rate) ** 2)),
warmup_steps)
elif args.lr_scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(float(args.learning_rate), warmup_steps,
args.max_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
else:
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
scheduled_lr = args.learning_rate
clip_norm_thres = 1.0
# When using mixed precision training, scale the gradient clip threshold
# by loss_scaling
if args.use_fp16 and args.loss_scaling > 1.0:
clip_norm_thres *= args.loss_scaling
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm_thres))
def exclude_from_weight_decay(name):
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
param_list = dict()
if args.use_fp16:
param_grads = optimizer.backward(loss)
master_param_grads = create_master_params_grads(
param_grads, train_program, startup_prog, args.loss_scaling)
for param, _ in master_param_grads:
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
optimizer.apply_gradients(master_param_grads)
if args.weight_decay > 0:
for param, grad in master_param_grads:
if exclude_from_weight_decay(param.name.rstrip(".master")):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
                    updated_param = param - param_list[
                        param.name] * args.weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
master_param_to_train_param(master_param_grads, param_grads,
train_program)
else:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if args.weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * args.weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
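The function reads a handful of attributes from args; a minimal sketch of the fields it expects (attribute names taken from the code above, values illustrative; SimpleNamespace stands in for the framework's parsed config):
```python
from types import SimpleNamespace

args = SimpleNamespace(
    max_train_steps=1000,  # set by mtl_run.py before optimization() is called
    warmup_proportion=0.1,
    lr_scheduler="linear_warmup_decay",
    learning_rate=3e-5,
    use_fp16=False,
    loss_scaling=1.0,
    weight_decay=0.01,
)
# inside a fluid.program_guard(train_program, startup_prog) block with a loss Variable:
#     optimization(loss, [train_program, startup_prog], args=args)
```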
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# encoding=utf8
import paddle.fluid as fluid
def compute_loss(output_tensors, args=None):
"""Compute loss for mrc model"""
labels = output_tensors['labels']
logits = output_tensors['logits']
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
if args.use_fp16 and args.loss_scaling > 1.0:
loss *= args.loss_scaling
return loss
def create_model(reader_input, base_model=None, is_training=True, args=None):
"""
given the base model, reader_input
return the output tensors
"""
labels = reader_input[-1]
cls_feats = base_model.final_sentence_representation
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=cls_feats,
size=2,
param_attr=fluid.ParamAttr(
name="cls_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
num_seqs = fluid.layers.fill_constant(shape=[1], value=512, dtype='int64')
output_tensors = {}
output_tensors['labels'] = labels
output_tensors['logits'] = logits
output_tensors['num_seqs'] = num_seqs
return output_tensors
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from backbone.utils.transformer import pre_process_layer
from utils.configure import JsonConfig
def compute_loss(output_tensors, args=None):
"""Compute loss for mlm model"""
fc_out = output_tensors['mlm_out']
mask_label = output_tensors['mask_label']
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
return mean_mask_lm_loss
def create_model(reader_input, base_model=None, is_training=True, args=None):
"""
given the base model, reader_input
return the output tensors
"""
src_ids, pos_ids, sent_ids, input_mask, mask_label, mask_pos = reader_input
config = JsonConfig(args.pretrain_config_path)
_emb_size = config['hidden_size']
_voc_size = config['vocab_size']
_hidden_act = config['hidden_act']
_word_emb_name = "word_embedding"
_dtype = "float16" if args.use_fp16 else "float32"
_param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
enc_out = base_model.final_word_representation
    # flatten to [batch_size * seq_len, emb_size] so that the features of the
    # masked positions can be gathered by index
    reshaped_emb_out = fluid.layers.reshape(
        x=enc_out, shape=[-1, _emb_size])
# extract masked tokens' feature
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
    # constant tensor exposed through the num_seqs output slot
    num_seqs = fluid.layers.fill_constant(shape=[1], value=512, dtype='int64')
# transform: fc
mask_trans_feat = fluid.layers.fc(
input=mask_feat,
size=_emb_size,
act=_hidden_act,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_fc.w_0',
initializer=_param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = pre_process_layer(
mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr(
name="mask_lm_out_fc.b_0",
initializer=fluid.initializer.Constant(value=0.0))
fc_out = fluid.layers.matmul(
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(
_word_emb_name),
transpose_y=True)
fc_out += fluid.layers.create_parameter(
shape=[_voc_size],
dtype=_dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
output_tensors = {}
output_tensors['num_seqs'] = num_seqs
output_tensors['mlm_out'] = fc_out
output_tensors['mask_label'] = mask_label
return output_tensors
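# ----------------------------------------------------------------------------
# Illustrative, hedged numpy sketch of the indexing scheme above: word
# representations are flattened to [batch * seq_len, emb] and masked-token
# features are gathered by mask_pos, which the batching code computes as
# sent_index * max_len + token_index. The sizes below are invented for the
# sketch.
def _demo_masked_gather():
    import numpy as np
    batch_size, max_len, emb_size = 2, 4, 3
    enc_out = np.arange(batch_size * max_len * emb_size,
                        dtype='float32').reshape([batch_size, max_len, emb_size])
    # masked positions: token 1 of sentence 0, token 2 of sentence 1
    mask_pos = np.array([0 * max_len + 1, 1 * max_len + 2])
    flat = enc_out.reshape([-1, emb_size])
    mask_feat = flat[mask_pos]
    assert (mask_feat[0] == enc_out[0, 1]).all()
    assert (mask_feat[1] == enc_out[1, 2]).all()
    print(mask_feat)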
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# encoding=utf8
import paddle.fluid as fluid
import collections
import os
from paddle.fluid import layers
def compute_loss(output_tensors, args=None):
"""Compute loss for mrc model"""
def _compute_single_loss(logits, positions):
"""Compute start/end loss for mrc model"""
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=positions)
loss = fluid.layers.mean(x=loss)
return loss
start_logits = output_tensors['start_logits']
end_logits = output_tensors['end_logits']
start_positions = output_tensors['start_positions']
end_positions = output_tensors['end_positions']
start_loss = _compute_single_loss(start_logits, start_positions)
end_loss = _compute_single_loss(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2.0
if args.use_fp16 and args.loss_scaling > 1.0:
total_loss = total_loss * args.loss_scaling
return total_loss
def create_model(reader_input, base_model=None, is_training=True, args=None):
"""
given the base model, reader_input
return the output tensors
"""
if is_training:
src_ids, pos_ids, sent_ids, input_mask, start_positions, end_positions = reader_input
else:
src_ids, pos_ids, sent_ids, input_mask, unique_id = reader_input
enc_out = base_model.final_word_representation
logits = fluid.layers.fc(
input=enc_out,
size=2,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name="cls_squad_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_squad_out_b", initializer=fluid.initializer.Constant(0.)))
logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])
start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)
batch_ones = fluid.layers.fill_constant_batch_size_like(
input=start_logits, dtype='int64', shape=[1], value=1)
num_seqs = fluid.layers.reduce_sum(input=batch_ones)
output_tensors = {}
output_tensors['start_logits'] = start_logits
output_tensors['end_logits'] = end_logits
output_tensors['num_seqs'] = num_seqs
if is_training:
output_tensors['start_positions'] = start_positions
output_tensors['end_positions'] = end_positions
else:
output_tensors['unique_id'] = unique_id
return output_tensors
RawResult = collections.namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
def postprocess(fetch_results):
    """Convert fetched numpy results into RawResult tuples, skipping padding
    entries (negative unique_id)."""
    np_unique_ids = fetch_results['unique_id']
    np_start_logits = fetch_results['start_logits']
    np_end_logits = fetch_results['end_logits']
ret = []
for idx in range(np_unique_ids.shape[0]):
if np_unique_ids[idx] < 0:
continue
unique_id = int(np_unique_ids[idx])
start_logits = [float(x) for x in np_start_logits[idx].flat]
end_logits = [float(x) for x in np_end_logits[idx].flat]
ret.append(
RawResult(
unique_id=unique_id,
start_logits=start_logits,
end_logits=end_logits))
return ret
def global_postprocess(pred_buf, processor, mtl_args, task_args):
    """Write MRC predictions (plus n-best and null-odds files) to checkpoint_path."""
    if not os.path.exists(mtl_args.checkpoint_path):
        os.makedirs(mtl_args.checkpoint_path)
output_prediction_file = os.path.join(mtl_args.checkpoint_path, "predictions.json")
output_nbest_file = os.path.join(mtl_args.checkpoint_path, "nbest_predictions.json")
output_null_log_odds_file = os.path.join(mtl_args.checkpoint_path, "null_odds.json")
processor.write_predictions(pred_buf, task_args.n_best_size, task_args.max_answer_length,
task_args.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file,
task_args.with_negative,
task_args.null_score_diff_threshold, task_args.verbose)
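# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: feed postprocess a fake fetch_results dict
# shaped like the tensors fetched at predict time. The sizes are invented;
# entries with a negative unique_id are skipped as padding.
def _demo_postprocess():
    import numpy as np
    fake_results = {
        'unique_id': np.array([[1000], [-1]], dtype='int64'),
        'start_logits': np.random.rand(2, 8).astype('float32'),
        'end_logits': np.random.rand(2, 8).astype('float32'),
    }
    for raw in postprocess(fake_results):
        print(raw.unique_id, len(raw.start_logits), len(raw.end_logits))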
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import types
import csv
import numpy as np
from utils import tokenization
from utils.batching import prepare_batch_data
def get_input_shape(args):
"""
define answer matching model input shape
"""
train_input_shape = {"backbone": [([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'float32')],
"task": [([-1, 1], 'int64')]
}
test_input_shape = {"backbone": [([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'float32')],
"task": [([-1, 1], 'int64')]
}
return train_input_shape, test_input_shape
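# ----------------------------------------------------------------------------
# Illustrative, hedged sketch of the shape definitions for a concrete
# max_seq_len. `_FakeArgs` is a stand-in for the parsed task config.
def _demo_get_input_shape():
    class _FakeArgs(object):
        max_seq_len = 128
    train_shape, test_shape = get_input_shape(_FakeArgs())
    # 4 backbone inputs (src/pos/sent ids + input mask) and 1 task input (label)
    print(train_shape["backbone"])
    print(train_shape["task"])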
class BaseProcessor(object):
"""Base class for data converters for sequence classification data sets."""
    def __init__(self, args):
        self.train_file = args.train_file
        # dev/test files are optional in the task config; default to None so
        # only the phases actually used need to be configured
        self.dev_file = getattr(args, 'dev_file', None)
        self.test_file = getattr(args, 'test_file', None)
        self.max_seq_len = args.max_seq_len
        self.batch_size = args.batch_size
        self.epoch = args.epoch
self.tokenizer = tokenization.FullTokenizer(
vocab_file=args.vocab_path, do_lower_case=args.do_lower_case)
self.vocab = self.tokenizer.vocab
self.in_tokens = args.in_tokens
self.current_train_example = -1
self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
self.current_train_epoch = -1
def get_train_examples(self, file_path):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, file_path):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, file_path):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
def convert_example(self, index, example, labels, max_seq_len, tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
feature = convert_single_example(index, example, labels, max_seq_len,
tokenizer)
return feature
def generate_instance(self, feature):
"""
generate instance with given feature
Args:
feature: InputFeatures(object). A single set of features of data.
"""
input_pos = list(range(len(feature.input_ids)))
return [
feature.input_ids, feature.segment_ids, input_pos, feature.label_id
]
    def generate_batch_data(self,
                            batch_data,
                            total_token_num,
                            voc_size=-1,
                            mask_id=-1,
                            return_input_mask=True,
                            return_max_len=False,
                            return_num_token=False):
        # forward the caller's options instead of hard-coding them, so the
        # defaults in the signature actually take effect
        return prepare_batch_data(
            batch_data,
            total_token_num,
            voc_size=voc_size,
            pad_id=self.vocab["[PAD]"],
            cls_id=self.vocab["[CLS]"],
            sep_id=self.vocab["[SEP]"],
            mask_id=mask_id,
            return_input_mask=return_input_mask,
            return_max_len=return_max_len,
            return_num_token=return_num_token)
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
lines.append(line)
return lines
def get_num_examples(self, phase):
"""Get number of examples for train, dev or test."""
if phase not in ['train', 'dev', 'test']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
return self.num_examples[phase]
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
    def data_generator(self,
                       phase='train',
                       dev_count=1,
                       shuffle=True):
        """
        Generate data for train, dev or test.
        Args:
            phase: string. The phase for which to generate data.
            dev_count: int. The number of devices batches are grouped for.
            shuffle: bool. Whether to shuffle examples.
        """
if phase == 'train':
examples = self.get_train_examples(self.train_file)
self.num_examples['train'] = len(examples)
elif phase == 'dev':
examples = self.get_dev_examples(self.dev_file)
self.num_examples['dev'] = len(examples)
elif phase == 'test':
examples = self.get_test_examples(self.test_file)
self.num_examples['test'] = len(examples)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
def instance_reader():
for epoch_index in range(self.epoch):
if shuffle:
np.random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
for (index, example) in enumerate(examples):
if phase == 'train':
self.current_train_example = index + 1
feature = self.convert_example(
index, example,
self.get_labels(), self.max_seq_len, self.tokenizer)
instance = self.generate_instance(feature)
yield instance
def batch_reader(reader, batch_size, in_tokens):
batch, total_token_num, max_len = [], 0, 0
for instance in reader():
token_ids, sent_ids, pos_ids, label = instance[:4]
max_len = max(max_len, len(token_ids))
if in_tokens:
to_append = (len(batch) + 1) * max_len <= batch_size
else:
to_append = len(batch) < batch_size
if to_append:
batch.append(instance)
total_token_num += len(token_ids)
else:
yield batch, total_token_num
batch, total_token_num, max_len = [instance], len(
token_ids), len(token_ids)
if len(batch) > 0:
yield batch, total_token_num
def wrapper():
all_dev_batches = []
for batch_data, total_token_num in batch_reader(
instance_reader, self.batch_size, self.in_tokens):
batch_data = self.generate_batch_data(
batch_data,
total_token_num,
voc_size=-1,
mask_id=-1,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
return wrapper
class DataProcessor(BaseProcessor):
"""Processor for the MultiNLI data set (GLUE version)."""
def get_train_examples(self, file_path):
"""See base class."""
return self._create_examples(
self._read_tsv(file_path), "train")
def get_dev_examples(self, file_path):
"""See base class."""
return self._create_examples(
self._read_tsv(file_path), "dev")
def get_test_examples(self, file_path):
"""See base class."""
return self._create_examples(
self._read_tsv(file_path), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%d" % (set_type, i)
text_a = tokenization.convert_to_unicode(line[1])
text_b = tokenization.convert_to_unicode(line[2])
if set_type == "test":
label = "0"
else:
label = tokenization.convert_to_unicode(line[0])
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
def convert_single_example_to_unicode(guid, single_example):
text_a = tokenization.convert_to_unicode(single_example[0])
text_b = tokenization.convert_to_unicode(single_example[1])
label = tokenization.convert_to_unicode(single_example[2])
return InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
    # For classification tasks, the first vector (corresponding to [CLS]) is
    # used as the "sentence vector". Note that this only makes sense because
    # the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
label_id = label_map[example.label]
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer):
"""Convert a set of `InputExample`s to a list of `InputFeatures`."""
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
print("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
features.append(feature)
return features
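# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: run convert_single_example with a toy
# whitespace tokenizer so the feature layout can be inspected without a real
# vocab file. `_ToyTokenizer` is a stand-in invented for the sketch; the real
# pipeline uses tokenization.FullTokenizer.
def _demo_convert_single_example():
    class _ToyTokenizer(object):
        def tokenize(self, text):
            return text.split()

        def convert_tokens_to_ids(self, tokens):
            # deterministic toy ids; FullTokenizer looks them up in the vocab
            return [abs(hash(t)) % 1000 for t in tokens]

    example = InputExample(guid="demo-0", text_a="is this jacksonville",
                           text_b="no it is not", label="1")
    feature = convert_single_example(0, example, ["0", "1"], 16,
                                     _ToyTokenizer())
    print(feature.input_ids)    # [CLS] a-tokens [SEP] b-tokens [SEP]
    print(feature.segment_ids)  # 0s for the first segment, 1s for the second
    print(feature.label_id)     # 1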
if __name__ == '__main__':
pass
#encoding=utf8
import os
import sys
import random
import numpy as np
import paddle
import paddle.fluid as fluid
from utils.placeholder import Placeholder
def repeat(reader):
"""Repeat a generator forever"""
generator = reader()
while True:
try:
yield next(generator)
except StopIteration:
generator = reader()
yield next(generator)
def create_joint_generator(input_shape, generators, task_map_id, is_multi_task=True):
    def empty_output(input_shape, batch_size=1):
        dtype_map = {'int32': np.int32, 'int64': np.int64,
                     'float32': np.float32, 'float64': np.float64}
        results = []
        for i in range(len(input_shape)):
            dtype = dtype_map[input_shape[i][1]]
            # copy before editing: the shared input_shape list must not be
            # mutated across calls
            shape = list(input_shape[i][0])
            shape[0] = batch_size
            pad_tensor = np.zeros(shape=shape, dtype=dtype)
            results.append(pad_tensor)
        return results
def wrapper():
generators_inst = [repeat(gen[0]) for gen in generators]
generators_ratio = [gen[1] for gen in generators]
weights = [ratio/sum(generators_ratio) for ratio in generators_ratio]
task_names = [gen[2] for gen in generators]
task_names_ids = [0]
for i in range(1, len(task_names)):
if task_names[i] == task_names[i - 1]:
task_names_ids.append(task_names_ids[-1])
else:
task_names_ids.append(task_names_ids[-1] + 1)
run_task_id = range(len(generators))
while True:
idx = np.random.choice(run_task_id, p=weights)
gen_results = next(generators_inst[idx])
if not gen_results:
break
batch_size = gen_results[0].shape[0]
results = empty_output(input_shape, batch_size)
task_id_tensor = np.array([[task_names_ids[idx]]]).astype("int64")
results[0] = task_id_tensor
            backbone_range_start = task_map_id[0][0]
            backbone_range_end = task_map_id[0][1]
            for i in range(backbone_range_start, backbone_range_end):
                results[i] = gen_results[i - 1]
            # `i` is intentionally reused after the loop: the sampled task's
            # outputs continue right where the backbone inputs ended
            cur_gene_task = task_names_ids[idx] + 1
            for j in range(task_map_id[cur_gene_task][0], task_map_id[cur_gene_task][1]):
                results[j] = gen_results[i]
                i += 1
yield results
return wrapper
def create_reader(reader_name, input_shape, is_multi_task, task_map_id, *gens):
"""
build reader for multi_task_learning
"""
placeholder = Placeholder(input_shape)
pyreader, model_inputs = placeholder.build(capacity=16, reader_name=reader_name)
joint_generator = create_joint_generator(input_shape, gens[0], task_map_id, is_multi_task=is_multi_task)
return joint_generator, pyreader, model_inputs
def joint_input_shape(input_shape_list):
"""
joint main task and auxiliary tasks input shape
"""
joint_test_input_shape = input_shape_list[0][1]["backbone"] + input_shape_list[0][1]["task"]
joint_train_input_shape = [([1, 1], 'int64')] # task_id_shape
backbone_input_shape = input_shape_list[0][0]["backbone"]
joint_train_input_shape.extend(backbone_input_shape)
task_map_id = [(1, len(input_shape_list[0][0]["backbone"]) + 1)]
for input_shape in input_shape_list:
task_input_shape = input_shape[0]["task"]
joint_train_input_shape.extend(task_input_shape)
task_map_id.append((task_map_id[-1][1], task_map_id[-1][1] + len(task_input_shape)))
return joint_train_input_shape, joint_test_input_shape, task_map_id
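# ----------------------------------------------------------------------------
# Illustrative, hedged sketch of how joint_input_shape merges a main task and
# one auxiliary task. Each list entry is the (train_shape, test_shape) pair
# returned by a reader's get_input_shape; the shapes below are invented.
def _demo_joint_input_shape():
    backbone = [([-1, 128, 1], 'int64'), ([-1, 128, 1], 'float32')]
    main_task = ({"backbone": backbone, "task": [([-1, 1], 'int64')]},
                 {"backbone": backbone, "task": [([-1, 1], 'int64')]})
    aux_task = ({"backbone": backbone, "task": [([-1, 1], 'int64'),
                                                ([-1, 1], 'int64')]},
                {"backbone": backbone, "task": [([-1, 1], 'int64')]})
    train_shape, test_shape, task_map_id = joint_input_shape(
        [main_task, aux_task])
    # slot 0 is the task_id tensor, slots 1..2 the shared backbone inputs,
    # then each task's own input slots follow
    print(len(train_shape))  # 1 task_id + 2 backbone + 1 main + 2 aux = 6
    print(task_map_id)       # [(1, 3), (3, 4), (4, 6)]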
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
from __future__ import division
import os
import numpy as np
import types
import gzip
import logging
import re
import six
import collections
from utils import tokenization
from utils.batching import prepare_batch_data
def get_input_shape(args):
"""
define mask language model input shape
"""
train_input_shape = {"backbone": [([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'float32')],
"task": [([-1, 1], 'int64'),
([-1, 1], 'int64')]
}
test_input_shape = {"backbone": [([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'float32')],
"task": [([-1, 1], 'int64'),
([-1, 1], 'int64')]
}
return train_input_shape, test_input_shape
class DataProcessor(object):
def __init__(self, args):
self.vocab = self.load_vocab(args.vocab_path)
self.data_dir = args.train_file
self.batch_size = args.batch_size
self.in_tokens = args.in_tokens
self.epoch = args.epoch
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.current_file = None
self.generate_neg_sample = args.generate_neg_sample
self.voc_size = len(self.vocab)
self.max_seq_len = args.max_seq_len
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
if self.in_tokens:
assert self.batch_size >= self.max_seq_len, "The number of " \
"tokens in batch should not be smaller than max seq length."
def get_progress(self):
"""return current progress of training data
"""
return self.current_epoch, self.current_file_index, self.total_file, self.current_file
def parse_line(self, line, max_seq_len=512):
""" parse one line to token_ids, sentence_ids, pos_ids, label
"""
line = line.strip().decode().split(";")
assert len(line) == 4, "One sample must have 4 fields!"
(token_ids, sent_ids, pos_ids, label) = line
token_ids = [int(token) for token in token_ids.split(" ")]
sent_ids = [int(token) for token in sent_ids.split(" ")]
pos_ids = [int(token) for token in pos_ids.split(" ")]
assert len(token_ids) == len(sent_ids) == len(
pos_ids
), "[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
label = int(label)
if len(token_ids) > max_seq_len:
return None
return [token_ids, sent_ids, pos_ids, label]
def read_file(self, file):
assert file.endswith('.gz'), "[ERROR] %s is not a gzip file" % file
file_path = self.data_dir + "/" + file
with gzip.open(file_path, "rb") as f:
for line in f:
parsed_line = self.parse_line(
line, max_seq_len=self.max_seq_len)
if parsed_line is None:
continue
yield parsed_line
def convert_to_unicode(self, text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
    def load_vocab(self, vocab_file):
        """Loads a vocabulary file into a dictionary."""
        vocab = collections.OrderedDict()
        with open(vocab_file) as fin:
            for num, line in enumerate(fin):
                items = self.convert_to_unicode(line.strip()).split("\t")
                if len(items) > 2:
                    break
                token = items[0]
                index = items[1] if len(items) == 2 else num
                token = token.strip()
                vocab[token] = int(index)
        return vocab
def random_pair_neg_samples(self, pos_samples):
""" randomly generate negative samples using pos_samples
Args:
pos_samples: list of positive samples
Returns:
neg_samples: list of negative samples
"""
np.random.shuffle(pos_samples)
num_sample = len(pos_samples)
neg_samples = []
miss_num = 0
for i in range(num_sample):
pair_index = (i + 1) % num_sample
            origin_src_ids = pos_samples[i][0]
            # token id 2 is assumed to be [SEP], consistent with self.sep_id
            origin_sep_index = origin_src_ids.index(2)
            pair_src_ids = pos_samples[pair_index][0]
            pair_sep_index = pair_src_ids.index(2)
src_ids = origin_src_ids[:origin_sep_index + 1] + pair_src_ids[
pair_sep_index + 1:]
if len(src_ids) >= self.max_seq_len:
miss_num += 1
continue
sent_ids = [0] * len(origin_src_ids[:origin_sep_index + 1]) + [
1
] * len(pair_src_ids[pair_sep_index + 1:])
pos_ids = list(range(len(src_ids)))
neg_sample = [src_ids, sent_ids, pos_ids, 0]
            assert len(src_ids) == len(sent_ids) == len(
                pos_ids
            ), "[ERROR] len(src_ids) == len(sent_ids) == len(pos_ids) must be True"
neg_samples.append(neg_sample)
return neg_samples, miss_num
def mixin_negative_samples(self, pos_sample_generator, buffer=1000):
""" 1. generate negative samples by randomly group sentence_1 and sentence_2 of positive samples
2. combine negative samples and positive samples
Args:
pos_sample_generator: a generator producing a parsed positive sample, which is a list: [token_ids, sent_ids, pos_ids, 1]
Returns:
sample: one sample from shuffled positive samples and negative samples
"""
pos_samples = []
num_total_miss = 0
pos_sample_num = 0
try:
while True:
while len(pos_samples) < buffer:
pos_sample = next(pos_sample_generator)
label = pos_sample[3]
assert label == 1, "positive sample's label must be 1"
pos_samples.append(pos_sample)
pos_sample_num += 1
neg_samples, miss_num = self.random_pair_neg_samples(
pos_samples)
num_total_miss += miss_num
samples = pos_samples + neg_samples
pos_samples = []
np.random.shuffle(samples)
for sample in samples:
yield sample
except StopIteration:
print("stopiteration: reach end of file")
if len(pos_samples) == 1:
yield pos_samples[0]
elif len(pos_samples) == 0:
yield None
else:
neg_samples, miss_num = self.random_pair_neg_samples(
pos_samples)
num_total_miss += miss_num
samples = pos_samples + neg_samples
pos_samples = []
np.random.shuffle(samples)
for sample in samples:
yield sample
print("miss_num:%d\tideal_total_sample_num:%d\tmiss_rate:%f" %
(num_total_miss, pos_sample_num * 2,
num_total_miss / (pos_sample_num * 2)))
def data_generator(self, phase='train', shuffle=True, dev_count=1):
"""
data_generator
"""
files = os.listdir(self.data_dir)
self.total_file = len(files)
assert self.total_file > 0, "[Error] data_dir is empty"
def wrapper():
def reader():
if phase == "train":
epoch_num = self.epoch
is_test = False
else:
epoch_num = 1
is_test = True
for epoch in range(epoch_num):
self.current_epoch = epoch + 1
if shuffle:
np.random.shuffle(files)
for index, file in enumerate(files):
self.current_file_index = index + 1
self.current_file = file
sample_generator = self.read_file(file)
if not is_test and self.generate_neg_sample:
sample_generator = self.mixin_negative_samples(
sample_generator)
for sample in sample_generator:
if sample is None:
continue
yield sample
def batch_reader(reader, batch_size, in_tokens):
batch, total_token_num, max_len = [], 0, 0
for parsed_line in reader():
token_ids, sent_ids, pos_ids, label = parsed_line
max_len = max(max_len, len(token_ids))
if in_tokens:
to_append = (len(batch) + 1) * max_len <= batch_size
else:
to_append = len(batch) < batch_size
if to_append:
batch.append(parsed_line)
total_token_num += len(token_ids)
else:
yield batch, total_token_num
batch, total_token_num, max_len = [parsed_line], len(
token_ids), len(token_ids)
if len(batch) > 0:
yield batch, total_token_num
for batch_data, total_token_num in batch_reader(
reader, self.batch_size, self.in_tokens):
yield prepare_batch_data(
batch_data,
total_token_num,
voc_size=self.voc_size,
pad_id=self.pad_id,
cls_id=self.cls_id,
sep_id=self.sep_id,
mask_id=self.mask_id,
max_len=self.max_seq_len,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
return wrapper
if __name__ == "__main__":
pass
#!/bin/bash
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1
# comment out the export below to fall back to CPU; the check that follows
# only takes the CPU branch when CUDA_VISIBLE_DEVICES is unset or empty
export CUDA_VISIBLE_DEVICES=0
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
export CPU_NUM=1
use_cuda=false
else
use_cuda=true
fi
python -u mtl_run.py
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
"""
Add mask for batch_tokens, return out, mask_label, mask_pos;
Note: mask_pos responding the batch_tokens after padded;
"""
max_len = max([len(sent) for sent in batch_tokens])
mask_label = []
mask_pos = []
prob_mask = np.random.rand(total_token_num)
# Note: the first token is [CLS], so [low=1]
replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)
pre_sent_len = 0
prob_index = 0
for sent_index, sent in enumerate(batch_tokens):
mask_flag = False
prob_index += pre_sent_len
for token_index, token in enumerate(sent):
prob = prob_mask[prob_index + token_index]
if prob > 0.15:
continue
elif 0.03 < prob <= 0.15:
# mask
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
elif 0.015 < prob <= 0.03:
# random replace
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = replace_ids[prob_index + token_index]
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
else:
# keep the original token
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
mask_pos.append(sent_index * max_len + token_index)
pre_sent_len = len(sent)
# ensure at least mask one word in a sentence
while not mask_flag:
token_index = int(np.random.randint(1, high=len(sent) - 1, size=1))
if sent[token_index] != SEP and sent[token_index] != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
return batch_tokens, mask_label, mask_pos
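# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: apply mask() to a toy batch. The special-token
# ids (CLS=1, SEP=2, MASK=3) match the function's defaults; vocab_size is
# invented for the sketch.
def _demo_mask():
    batch_tokens = [[1, 11, 12, 13, 2], [1, 14, 15, 2]]
    total_token_num = sum(len(sent) for sent in batch_tokens)
    out, mask_label, mask_pos = mask(
        batch_tokens, total_token_num, vocab_size=100)
    # mask_pos entries index the padded batch: sent_index * max_len + token_index
    print(out)
    print(mask_label.reshape([-1]), mask_pos.reshape([-1]))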
def prepare_batch_data(insts,
total_token_num,
max_len=None,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
1. generate Tensor of data
2. generate Tensor of position
3. generate self attention mask, [shape: batch_size * max_len * max_len]
"""
batch_src_ids = [inst[0] for inst in insts]
batch_sent_ids = [inst[1] for inst in insts]
batch_pos_ids = [inst[2] for inst in insts]
labels_list = []
# compatible with mrqa, whose example includes start/end positions,
# or unique id
for i in range(3, len(insts[0]), 1):
labels = [inst[i] for inst in insts]
labels = np.array(labels).astype("int64").reshape([-1, 1])
labels_list.append(labels)
# First step: do mask without padding
if mask_id >= 0:
out, mask_label, mask_pos = mask(
batch_src_ids,
total_token_num,
vocab_size=voc_size,
CLS=cls_id,
SEP=sep_id,
MASK=mask_id)
else:
out = batch_src_ids
# Second step: padding
src_id, self_input_mask = pad_batch_data(
out,
max_len=max_len,
pad_idx=pad_id, return_input_mask=True)
pos_id = pad_batch_data(
batch_pos_ids,
max_len=max_len,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
sent_id = pad_batch_data(
batch_sent_ids,
max_len=max_len,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
if mask_id >= 0:
return_list = [
src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
] + labels_list
else:
return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list
return return_list if len(return_list) > 1 else return_list[0]
def pad_batch_data(insts,
max_len=None,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and input mask.
"""
return_list = []
if max_len is None:
max_len = max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array([
list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts
])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
return return_list if len(return_list) > 1 else return_list[0]
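# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: pad a toy batch and build the input mask. The
# shapes follow the reshape rules above.
def _demo_pad_batch_data():
    insts = [[11, 12, 13], [14, 15]]
    src_ids, input_mask = pad_batch_data(
        insts, pad_idx=0, return_input_mask=True)
    print(src_ids.shape)     # (2, 3, 1)
    print(input_mask.shape)  # (2, 3, 1)
    print(input_mask.reshape([2, 3]))  # [[1. 1. 1.] [1. 1. 0.]]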
if __name__ == "__main__":
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import argparse
import json
import yaml
import six
import logging
logging_only_message = "%(message)s"
logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s"
class JsonConfig(object):
"""
    A high-level api for handling json configuration files.
"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
assert isinstance(config_dict, dict), "Object in {} is NOT a dict.".format(config_path)
        except Exception:
            raise IOError("Error in parsing bert model config file '%s'" %
                          config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def asdict(self):
return self._config_dict
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
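# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: round-trip a small dict through a temporary
# json file and read it back with JsonConfig. The keys are invented for the
# sketch.
def _demo_json_config():
    import tempfile
    path = tempfile.mktemp(suffix=".json")
    with open(path, "w") as fout:
        json.dump({"hidden_size": 768, "vocab_size": 21128}, fout)
    conf = JsonConfig(path)
    print(conf["hidden_size"])  # 768
    conf.print_config()
    os.remove(path)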
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
class ArgConfig(object):
"""
A high-level api for handling argument configs.
"""
def __init__(self):
parser = argparse.ArgumentParser()
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5,
"Learning rate used to train with warmup.")
train_g.add_arg(
"lr_scheduler",
str,
"linear_warmup_decay",
"scheduler of learning rate.",
choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01,
"Weight decay rate for L2 regularizer.")
train_g.add_arg(
"warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for."
)
train_g.add_arg("save_steps", int, 1000,
"The steps interval to save checkpoints.")
train_g.add_arg(
"loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled."
)
train_g.add_arg("pred_dir", str, None,
"Path to save the prediction results")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10,
"The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True,
"If set, use GPU for training.")
run_type_g.add_arg(
"use_fast_executor", bool, False,
"If set, use fast parallel executor (in experiment).")
run_type_g.add_arg(
"num_iteration_per_drop_scope", int, 1,
"Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("do_train", bool, True,
"Whether to perform training.")
run_type_g.add_arg("do_predict", bool, True,
"Whether to perform prediction.")
custom_g = ArgumentGroup(parser, "customize", "customized options.")
self.custom_g = custom_g
self.parser = parser
def add_arg(self, name, dtype, default, descrip):
self.custom_g.add_arg(name, dtype, default, descrip)
def build_conf(self):
return self.parser.parse_args()
def str2bool(v):
    # argparse does not parse strings like "True"/"False" into Python
    # booleans directly, so map them explicitly
    return v.lower() in ("true", "t", "1")
def print_arguments(args, log=None):
if not log:
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
else:
log.info('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
class PDConfig(object):
"""
A high-level API for managing configuration files in PaddlePaddle.
    Can work jointly with command-line arguments, json files and yaml files.
"""
def __init__(self, json_file="", yaml_file=[], fuse_args=True):
"""
        Init function for PDConfig.
json_file: the path to the json configure file.
yaml_file: the path to the yaml configure file.
fuse_args: if fuse the json/yaml configs with argparse.
"""
assert isinstance(json_file, str)
assert isinstance(yaml_file, list)
if json_file != "" and yaml_file != []:
raise Warning(
"json_file and yaml_file can not co-exist for now. please only use one configure file type."
)
return
self.args = None
self.arg_config = {}
self.json_config = {}
self.yaml_config = {}
parser = argparse.ArgumentParser()
self.default_g = ArgumentGroup(parser, "default", "default options.")
self.yaml_g = ArgumentGroup(parser, "yaml", "options from yaml.")
self.json_g = ArgumentGroup(parser, "json", "options from json.")
self.com_g = ArgumentGroup(parser, "custom", "customized options.")
"""
self.default_g.add_arg("epoch", int, 2,
"Number of epoches for training.")
self.default_g.add_arg("do_train", bool, False,
"Whether to perform training.")
self.default_g.add_arg("do_predict", bool, False,
"Whether to perform predicting.")
self.default_g.add_arg("do_eval", bool, False,
"Whether to perform evaluating.")
"""
self.parser = parser
if json_file != "":
self.load_json(json_file, fuse_args=fuse_args)
if yaml_file:
self.load_yaml(yaml_file, fuse_args=fuse_args)
def load_json(self, file_path, fuse_args=True):
        if not os.path.exists(file_path):
            raise IOError("the json file %s does not exist." % file_path)
        with open(file_path, "r") as fin:
            self.json_config = json.loads(fin.read())
if fuse_args:
for name in self.json_config:
if not isinstance(self.json_config[name], int) \
and not isinstance(self.json_config[name], float) \
and not isinstance(self.json_config[name], str) \
and not isinstance(self.json_config[name], bool):
continue
self.json_g.add_arg(name,
type(self.json_config[name]),
self.json_config[name],
"This is from %s" % file_path)
def load_yaml(self, file_path_list, fuse_args=True):
for file_path in file_path_list:
            if not os.path.exists(file_path):
                raise IOError("the yaml file %s does not exist." % file_path)
            with open(file_path, "r") as fin:
                self.yaml_config = yaml.load(fin, Loader=yaml.SafeLoader)
if fuse_args:
for name in self.yaml_config:
if not isinstance(self.yaml_config[name], int) \
and not isinstance(self.yaml_config[name], float) \
and not isinstance(self.yaml_config[name], str) \
and not isinstance(self.yaml_config[name], bool):
continue
self.yaml_g.add_arg(name,
type(self.yaml_config[name]),
self.yaml_config[name],
"This is from %s" % file_path)
def build(self):
self.args = self.parser.parse_args()
self.arg_config = vars(self.args)
def __add__(self, new_arg):
assert isinstance(new_arg, list) or isinstance(new_arg, tuple)
assert len(new_arg) >= 3
assert self.args is None
name = new_arg[0]
dtype = new_arg[1]
dvalue = new_arg[2]
desc = new_arg[3] if len(
new_arg) == 4 else "Description is not provided."
self.com_g.add_arg(name, dtype, dvalue, desc)
return self
def __getattr__(self, name):
if name in self.arg_config:
return self.arg_config[name]
if name in self.json_config:
return self.json_config[name]
if name in self.yaml_config:
return self.yaml_config[name]
raise Warning("The argument %s is not defined." % name)
def Print(self):
print("-" * 70)
for name in self.arg_config:
print("{: <25}\t{}".format(str(name), str(self.arg_config[name])))
for name in self.json_config:
if name not in self.arg_config:
print("{: <25}\t{}" %
(str(name), str(self.json_config[name])))
for name in self.yaml_config:
if name not in self.arg_config:
print("{: <25}\t{}" %
(str(name), str(self.yaml_config[name])))
print("-" * 70)
if __name__ == "__main__":
pd_config = PDConfig(yaml_file="./test/bert_config.yaml")
pd_config += ("my_age", int, 18, "I am forever 18.")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
print(pd_config.my_age)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import paddle
import paddle.fluid as fluid
def cast_fp16_to_fp32(i, o, prog):
prog.global_block().append_op(
type="cast",
inputs={"X": i},
outputs={"Out": o},
attrs={
"in_dtype": fluid.core.VarDesc.VarType.FP16,
"out_dtype": fluid.core.VarDesc.VarType.FP32
})
def cast_fp32_to_fp16(i, o, prog):
prog.global_block().append_op(
type="cast",
inputs={"X": i},
outputs={"Out": o},
attrs={
"in_dtype": fluid.core.VarDesc.VarType.FP32,
"out_dtype": fluid.core.VarDesc.VarType.FP16
})
def copy_to_master_param(p, block):
v = block.vars.get(p.name, None)
if v is None:
raise ValueError("no param name %s found!" % p.name)
new_p = fluid.framework.Parameter(
block=block,
shape=v.shape,
dtype=fluid.core.VarDesc.VarType.FP32,
type=v.type,
lod_level=v.lod_level,
stop_gradient=p.stop_gradient,
trainable=p.trainable,
optimize_attr=p.optimize_attr,
regularizer=p.regularizer,
gradient_clip_attr=p.gradient_clip_attr,
error_clip=p.error_clip,
name=v.name + ".master")
return new_p
def create_master_params_grads(params_grads, main_prog, startup_prog,
loss_scaling):
master_params_grads = []
tmp_role = main_prog._current_role
OpRole = fluid.core.op_proto_and_checker_maker.OpRole
main_prog._current_role = OpRole.Backward
for p, g in params_grads:
# create master parameters
master_param = copy_to_master_param(p, main_prog.global_block())
startup_master_param = startup_prog.global_block()._clone_variable(
master_param)
startup_p = startup_prog.global_block().var(p.name)
cast_fp16_to_fp32(startup_p, startup_master_param, startup_prog)
# cast fp16 gradients to fp32 before apply gradients
if g.name.find("layer_norm") > -1:
if loss_scaling > 1:
scaled_g = g / float(loss_scaling)
else:
scaled_g = g
master_params_grads.append([p, scaled_g])
continue
master_grad = fluid.layers.cast(g, "float32")
if loss_scaling > 1:
master_grad = master_grad / float(loss_scaling)
master_params_grads.append([master_param, master_grad])
main_prog._current_role = tmp_role
return master_params_grads
def master_param_to_train_param(master_params_grads, params_grads, main_prog):
for idx, m_p_g in enumerate(master_params_grads):
train_p, _ = params_grads[idx]
if train_p.name.find("layer_norm") > -1:
continue
with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
cast_fp32_to_fp16(m_p_g[0], train_p, main_prog)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def cast_fp32_to_fp16(exe, main_program):
print("Cast parameters to float16 data format.")
for param in main_program.global_block().all_parameters():
if not param.name.endswith(".master"):
param_t = fluid.global_scope().find_var(param.name).get_tensor()
data = np.array(param_t)
if param.name.find("layer_norm") == -1:
param_t.set(np.float16(data).view(np.uint16), exe.place)
master_param_var = fluid.global_scope().find_var(param.name +
".master")
if master_param_var is not None:
master_param_var.get_tensor().set(data, exe.place)
def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False, skip_list = []):
    assert os.path.exists(
        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
assert os.path.isdir(init_checkpoint_path), '{} is not a dir.'.format(init_checkpoint_path)
path = init_checkpoint_path
if not os.path.split(init_checkpoint_path)[-1].startswith('step_') and 'params' != os.path.split(init_checkpoint_path)[-1]:
max_step = 0
for d in os.listdir(init_checkpoint_path):
if os.path.isdir(os.path.join(init_checkpoint_path, d)):
if d.startswith('step_'):
step = int(d.lstrip('step_').rstrip('_final'))
if step > max_step:
path = os.path.join(init_checkpoint_path, d)
max_step = step
    def existed_persistables(var):
        if not fluid.io.is_persistable(var):
            return False
        if var.name in skip_list:
            return False
        return os.path.exists(os.path.join(path, var.name))
    print("loading checkpoint from {}...".format(path))
    fluid.io.load_vars(
        exe,
        path,
        main_program=main_program,
        predicate=existed_persistables)
if use_fp16:
cast_fp32_to_fp16(exe, main_program)
def init_pretraining_params(exe,
pretraining_params_path,
main_program,
use_fp16=False):
    assert os.path.exists(pretraining_params_path
                          ), "[%s] can't be found." % pretraining_params_path
assert os.path.isdir(pretraining_params_path), '{} is not a dir.'.format(pretraining_params_path)
if os.path.exists(os.path.join(pretraining_params_path, 'params')):
pretraining_params_path = os.path.join(pretraining_params_path, 'params')
    if not os.path.split(pretraining_params_path)[-1] == 'params':
        raise ValueError('Dir "params" not found in {}.'.format(pretraining_params_path))
max_step = 0
path = pretraining_params_path
for d in os.listdir(pretraining_params_path):
if os.path.isdir(os.path.join(pretraining_params_path, d)):
if d.startswith('step_'):
step = int(d.lstrip('step_').rstrip('_final'))
if step > max_step:
path = os.path.join(pretraining_params_path, d)
max_step = step
pretraining_params_path = path
print("loading pretrained parameters from {}...".format(
pretraining_params_path))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
if use_fp16:
cast_fp32_to_fp16(exe, main_program)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
class Placeholder(object):
    def __init__(self, input_shapes=None):
        self.shapes = []
        self.dtypes = []
        self.lod_levels = []
        self.names = []
        if input_shapes is None:
            return
        for new_holder in input_shapes:
shape = new_holder[0]
dtype = new_holder[1]
lod_level = new_holder[2] if len(new_holder) >= 3 else 0
name = new_holder[3] if len(new_holder) >= 4 else ""
self.append_placeholder(shape, dtype, lod_level = lod_level, name = name)
def append_placeholder(self, shape, dtype, lod_level = 0, name = ""):
self.shapes.append(shape)
self.dtypes.append(dtype)
self.lod_levels.append(lod_level)
self.names.append(name)
def build(self, capacity, reader_name, use_double_buffer = False):
pyreader = fluid.layers.py_reader(
capacity = capacity,
shapes = self.shapes,
dtypes = self.dtypes,
lod_levels = self.lod_levels,
name = reader_name,
use_double_buffer = use_double_buffer)
return [pyreader, fluid.layers.read_file(pyreader)]
    def __add__(self, new_holder):
        assert isinstance(new_holder, tuple) or isinstance(new_holder, list)
        assert len(new_holder) >= 2
        shape = new_holder[0]
        dtype = new_holder[1]
        lod_level = new_holder[2] if len(new_holder) >= 3 else 0
        name = new_holder[3] if len(new_holder) >= 4 else ""
        self.append_placeholder(shape, dtype, lod_level=lod_level, name=name)
        # return self so `holder + (...)` and `holder += (...)` keep working
        return self
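# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: declare the shapes for one int64 id tensor plus
# a label tensor, then extend the holder with `+`. The shapes are invented;
# build() would turn them into a py_reader.
def _demo_placeholder():
    holder = Placeholder([([-1, 128, 1], 'int64', 0, 'src_ids')])
    holder = holder + ([-1, 1], 'int64')
    print(holder.shapes)   # [[-1, 128, 1], [-1, 1]]
    print(holder.dtypes)   # ['int64', 'int64']
    print(holder.names)    # ['src_ids', '']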
if __name__ == "__main__":
print("hello world!")