MRQA2019-D-NET (#3413)

* MRQA2019-D-NET
* delete_data_file

e7b1fef2 · 0YuanZhang0 · GitHub · 529ad161 · e7b1fef2 · e7b1fef2
97 changed files
--- a/PaddleNLP/Research/MRQA2019-D-NET/README.md
+++ b/PaddleNLP/Research/MRQA2019-D-NET/README.md
+# D-NET
+## Introduction
+D-NET is the system Baidu submitted for the MRQA (Machine Reading for Question Answering) 2019 Shared Task, which focused on the generalization of machine reading comprehension (MRC) models. Our system is built on a pre-training and fine-tuning framework. Pre-trained language models, multi-task learning and knowledge distillation are employed to improve the generalization of MRC models, and the experimental results show the effectiveness of these strategies. Our system ranked first among all participants in terms of average F1 score. Additionally, we took first place on 10 of the 12 test sets and second place on the other two in terms of F1 score.
+## Framework
+<p align="center">
+<img src="./images/D-NET_framework.png" width="500">
+</p>
+### D-NET includes 3 parts: 
+#### multi_task_learning
+We use the PaddlePaddle PALM multi-task learning library ([Link](https://github.com/PaddlePaddle/PALM)) to train a single model for the MRQA 2019 Shared Task.
+#### knowledge_distillation 
+Model ensembling can improve the generalization of MRC models, but ensemble inference is slow and consumes a huge amount of resources. We use distillation to compress multiple models into a single model without loss of accuracy, which speeds up inference and reduces resource usage.
+#### server
+The MRQA 2019 submission environment, with Baidu's BERT inference model and XLNet inference model.
+## Copyright and License
+Copyright 2019 Baidu.com, Inc. All Rights Reserved Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
--- a/PaddleNLP/Research/MRQA2019-D-NET/images/D-NET_framework.png
+++ b/PaddleNLP/Research/MRQA2019-D-NET/images/D-NET_framework.png
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/README.md
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/README.md
+# knowledge_distillation
+## 1. Introduction
+Model ensembling can improve the generalization of MRC models. However, this approach is inefficient, because inference with an ensemble is slow and requires a huge amount of resources. We use distillation to compress multiple models into a single model, which solves the problem of slow inference.
+## 2. Quick Start
+### Environment
+- Python >= 2.7
+- cuda >= 9.0
+- cudnn >= 7.0
+- PaddlePaddle >= 1.5.0. Please refer to the [Installation Guide](http://www.paddlepaddle.org/#quick-start).
+### Data and Models Preparation
+You can download the data and the trained knowledge distillation models we provide directly:
+```
+bash wget_models_and_data.sh
+```
+After the script finishes, the following data and model directories are available:
+data:
+- ./data/input/mlm_data: masked language model dataset.
+- ./data/input/mrqa_distill_data: MRQA dataset in two parts: mrqa_distill.json (JSON data computed from the teacher models) and mrqa-combined.all_dev.raw.json (all MRQA dev datasets merged).
+- ./data/input/mrqa_evaluation_dataset: MRQA evaluation data (in-domain and out-of-domain JSON data).
+
+models:
+- ./data/pretrain_model/squad2_model: pretrained model (Google's SQuAD 2.0 model, [Model Link](https://worksheets.codalab.org/worksheets/0x3852e60a51d2444680606556d404c657)).
+- ./saved_models/knowledge_distillation_model: the knowledge distillation model trained by Baidu.
+## 3. Train and Predict
+Train the knowledge distillation model and run prediction:
+```
+bash run_distill.sh
+```
+## 4. Evaluation
+To evaluate the result, run:
+```
+sh run_evaluation.sh
+```
+Note that we use the evaluation script for SQuAD 1.1 here, which is equivalent to the official one.
+## 5. Performance
+| Model | Dev in-domain (Macro-F1) | Dev out-of-domain (Macro-F1) |
+| ------------- | ------------ | ------------ |
+| Official baseline | 77.87 | 58.67 |
+| KD (4 teacher models -> student) | 83.67 | 67.34 |
+
+KD: knowledge distillation model (4 teacher models distilled into one student model).
+## Copyright and License
+Copyright 2019 Baidu.com, Inc. All Rights Reserved Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and
+limitations under the License.
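For context on the ensemble that distillation replaces: an ensemble combines the span scores of several models at inference time, which is why it costs one forward pass per model. The sketch below averages per-position start logits from several teachers. The averaging rule, the function name `ensemble_logits`, and the numbers are illustrative assumptions, since the README does not state how the teacher outputs in `mrqa_distill.json` were combined.

```python
def ensemble_logits(teacher_logits):
    """Average per-position logits across teachers (one list per teacher)."""
    n_teachers = len(teacher_logits)
    return [sum(position) / n_teachers for position in zip(*teacher_logits)]


# Hypothetical start logits from three teachers over a 4-token passage.
teachers = [
    [0.0, 2.0, 1.0, -1.0],
    [1.0, 3.0, 0.0, -2.0],
    [2.0, 1.0, 2.0, 0.0],
]
soft_target = ensemble_logits(teachers)
# Index of the highest averaged start score.
best_start = max(range(len(soft_target)), key=soft_target.__getitem__)
print(soft_target, best_start)
```

A student trained against such soft targets needs only a single forward pass at inference time, which is the efficiency argument the README makes.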
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/data/input/input.md
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/data/input/input.md
+input data dir: mrqa distillation dataset and mask language model dataset
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/data/output/output.md
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/data/output/output.md
+save checkpoints dir
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/data/pretrain_model/pretrain_model.md
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/data/pretrain_model/pretrain_model.md
+pretrain model dir
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/data/saved_models/saved_models.md
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/data/saved_models/saved_models.md
+MRQA2019 baidu trained knowledge distillation model
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/bert.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/bert.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""BERT model."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import six
+import json
+import numpy as np
+import paddle.fluid as fluid
+from model.transformer_encoder import encoder as encoder
+from model.transformer_encoder import pre_process_layer as pre_process_layer
+class BertModel(object):
+    def __init__(self,
+                 src_ids,
+                 position_ids,
+                 sentence_ids,
+                 input_mask,
+                 config,
+                 weight_sharing=True,
+                 use_fp16=False,
+                 model_name = ''):
+        self._emb_size = config["hidden_size"]
+        self._n_layer = config["num_hidden_layers"]
+        self._n_head = config["num_attention_heads"]
+        self._voc_size = config["vocab_size"]
+        self._max_position_seq_len = config["max_position_embeddings"]
+        self._sent_types = config["type_vocab_size"]
+        self._hidden_act = config["hidden_act"]
+        self._prepostprocess_dropout = config["hidden_dropout_prob"]
+        self._attention_dropout = config["attention_probs_dropout_prob"]
+        self._weight_sharing = weight_sharing
+        self.model_name = model_name
+        self._word_emb_name = self.model_name + "word_embedding"
+        self._pos_emb_name = self.model_name + "pos_embedding"
+        self._sent_emb_name = self.model_name + "sent_embedding"
+        self._dtype = "float16" if use_fp16 else "float32"
+        # Initialize all weights with a truncated normal initializer; all biases
+        # are initialized to constant zero by default.
+        self._param_initializer = fluid.initializer.TruncatedNormal(
+            scale=config["initializer_range"])
+        self._build_model(src_ids, position_ids, sentence_ids, input_mask, config)
+    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask, config):
+        # padding id in vocabulary must be set to 0
+        emb_out = fluid.layers.embedding(
+            input=src_ids,
+            size=[self._voc_size, self._emb_size],
+            dtype=self._dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._word_emb_name, initializer=self._param_initializer),
+            is_sparse=False)
+        self.emb_out = emb_out
+        position_emb_out = fluid.layers.embedding(
+            input=position_ids,
+            size=[self._max_position_seq_len, self._emb_size],
+            dtype=self._dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._pos_emb_name, initializer=self._param_initializer))
+        self.position_emb_out = position_emb_out
+        sent_emb_out = fluid.layers.embedding(
+            sentence_ids,
+            size=[self._sent_types, self._emb_size],
+            dtype=self._dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._sent_emb_name, initializer=self._param_initializer))
+        self.sent_emb_out = sent_emb_out
+        emb_out = emb_out + position_emb_out
+        emb_out = emb_out + sent_emb_out
+        emb_out = pre_process_layer(
+            emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
+        if self._dtype == "float16":
+            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
+        self_attn_mask = fluid.layers.matmul(
+            x = input_mask, y = input_mask, transpose_y = True)
+        self_attn_mask = fluid.layers.scale(
+            x = self_attn_mask, scale = 10000.0, bias = -1.0, bias_after_scale = False)
+        n_head_self_attn_mask = fluid.layers.stack(
+            x=[self_attn_mask] * self._n_head, axis=1)
+        n_head_self_attn_mask.stop_gradient = True
+        self._enc_out = encoder(
+            enc_input = emb_out,
+            attn_bias = n_head_self_attn_mask,
+            n_layer = self._n_layer,
+            n_head = self._n_head,
+            d_key = self._emb_size // self._n_head,
+            d_value = self._emb_size // self._n_head,
+            d_model = self._emb_size,
+            d_inner_hid = self._emb_size * 4,
+            prepostprocess_dropout = self._prepostprocess_dropout,
+            attention_dropout = self._attention_dropout,
+            relu_dropout = 0,
+            hidden_act = self._hidden_act,
+            preprocess_cmd = "",
+            postprocess_cmd = "dan",
+            param_initializer = self._param_initializer,
+            name = self.model_name + 'encoder')
+    def get_sequence_output(self):
+        return self._enc_out
+    def get_pooled_output(self):
+        """Get the first feature of each sequence for classification"""
+        next_sent_feat = fluid.layers.slice(
+            input = self._enc_out, axes = [1], starts = [0], ends = [1])
+        next_sent_feat = fluid.layers.fc(
+            input = next_sent_feat,
+            size = self._emb_size,
+            act = "tanh",
+            param_attr = fluid.ParamAttr(
+                name = self.model_name + "pooled_fc.w_0", 
+                initializer = self._param_initializer),
+            bias_attr = "pooled_fc.b_0")
+        return next_sent_feat
+    def get_pretraining_output(self, mask_label, mask_pos, labels):
+        """Get the loss & accuracy for pretraining"""
+        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
+        # extract the first token feature in each sentence
+        next_sent_feat = self.get_pooled_output()
+        reshaped_emb_out = fluid.layers.reshape(
+            x=self._enc_out, shape = [-1, self._emb_size])
+        # extract masked tokens' feature
+        mask_feat = fluid.layers.gather(input = reshaped_emb_out, index = mask_pos)
+        # transform: fc
+        mask_trans_feat = fluid.layers.fc(
+            input = mask_feat,
+            size = self._emb_size,
+            act = self._hidden_act,
+            param_attr = fluid.ParamAttr(
+                name = self.model_name + 'mask_lm_trans_fc.w_0',
+                initializer = self._param_initializer),
+            bias_attr = fluid.ParamAttr(name = self.model_name + 'mask_lm_trans_fc.b_0'))
+        # transform: layer norm 
+        mask_trans_feat = pre_process_layer(
+            mask_trans_feat, 'n', name = self.model_name + 'mask_lm_trans')
+        mask_lm_out_bias_attr = fluid.ParamAttr(
+            name = self.model_name + "mask_lm_out_fc.b_0",
+            initializer = fluid.initializer.Constant(value = 0.0))
+        if self._weight_sharing:
+            fc_out = fluid.layers.matmul(
+                x = mask_trans_feat,
+                y = fluid.default_main_program().global_block().var(
+                    self._word_emb_name),
+                transpose_y = True)
+            fc_out += fluid.layers.create_parameter(
+                shape = [self._voc_size],
+                dtype = self._dtype,
+                attr = mask_lm_out_bias_attr,
+                is_bias = True)
+        else:
+            fc_out = fluid.layers.fc(input = mask_trans_feat,
+                                     size = self._voc_size,
+                                     param_attr = fluid.ParamAttr(
+                                         name = self.model_name + "mask_lm_out_fc.w_0",
+                                         initializer = self._param_initializer),
+                                     bias_attr = mask_lm_out_bias_attr)
+        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
+            logits = fc_out, label = mask_label)
+        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
+        next_sent_fc_out = fluid.layers.fc(
+            input = next_sent_feat,
+            size = 2,
+            param_attr = fluid.ParamAttr(
+                name = self.model_name + "next_sent_fc.w_0", 
+                initializer = self._param_initializer),
+            bias_attr = self.model_name + "next_sent_fc.b_0")
+        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(
+            logits = next_sent_fc_out, label = labels, return_softmax = True)
+        next_sent_acc = fluid.layers.accuracy(
+            input = next_sent_softmax, label = labels)
+        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
+        loss = mean_next_sent_loss + mean_mask_lm_loss
+        return next_sent_acc, mean_mask_lm_loss, loss
+if __name__ == "__main__":
+    print("hello world!")
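The attention-bias construction in `_build_model` above (the `matmul` of `input_mask` with its transpose, then `scale` with `bias=-1.0` and `bias_after_scale=False`) can be reproduced without Paddle. This is a minimal sketch in plain Python for a single sequence, assuming `input_mask` holds 1.0 for real tokens and 0.0 for padding; the helper name `attention_bias` is hypothetical and not part of this commit.

```python
def attention_bias(input_mask, scale=10000.0):
    """Return a [seq_len, seq_len] additive attention bias for one sequence.

    Each entry is (mask_i * mask_j - 1) * scale: 0 when both positions are
    real tokens, and a large negative number when either one is padding.
    """
    return [[(mi * mj - 1.0) * scale for mj in input_mask] for mi in input_mask]


mask = [1.0, 1.0, 0.0]  # two real tokens followed by one padding token
bias = attention_bias(mask)
for row in bias:
    print(row)
```

Adding this bias to the attention logits before the softmax drives the weight of padded positions to effectively zero. The model then stacks the same bias once per attention head.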
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/bert_model.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/bert_model.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import sys
+import argparse
+import collections
+import numpy as np
+import multiprocessing
+from copy import deepcopy as copy
+import paddle
+import paddle.fluid as fluid
+from model.bert import BertModel
+from utils.configure import JsonConfig
+class ModelBERT(object):
+    def __init__(
+        self, 
+        conf, 
+        name = "", 
+        is_training = False,
+        base_model = None):
+        # the name of this task
+        # name is used for identifying parameters
+        self.name = name
+        # deep copy the configure of model
+        self.conf = copy(conf)
+        self.is_training = is_training
+        ## the overall loss of this task
+        self.loss = None
+        ## outputs may be useful for the other models
+        self.outputs = {}
+        ## the prediction of this task
+        self.predict = []
+    def create_model(self,
+                      args,
+                      reader_input,
+                      base_model = None):
+        """
+            Given the base model and reader_input,
+            return the function that creates this model.
+        """
+        def _create_model():
+            src_ids, pos_ids, sent_ids, input_mask = reader_input
+            bert_conf = JsonConfig(self.conf["bert_conf_file"])
+            self.bert = BertModel(
+                src_ids = src_ids,
+                position_ids = pos_ids,
+                sentence_ids = sent_ids,
+                input_mask = input_mask,
+                config = bert_conf,
+                use_fp16 = args.use_fp16,
+                model_name = self.name)
+            self.loss = None
+            self.outputs = {
+                "sequence_output":self.bert.get_sequence_output(),
+            }
+        return _create_model
+    def get_output(self, name):
+        return self.outputs[name]
+    def get_outputs(self):
+        return self.outputs
+    def get_predict(self):
+        return self.predict
+if __name__ == "__main__":
+    # create_model() reads the BERT config path from conf["bert_conf_file"]
+    bert_model = ModelBERT(conf = {"bert_conf_file" : "./data/pretrained_models/squad2_model/bert_config.json"})
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/mlm_net.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/mlm_net.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import paddle.fluid as fluid
+from model.transformer_encoder import pre_process_layer
+from utils.configure import JsonConfig
+def compute_loss(output_tensors, args=None):
+    """Compute loss for mlm model"""
+    fc_out = output_tensors['mlm_out']
+    mask_label = output_tensors['mask_label']
+    mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
+        logits=fc_out, label=mask_label)
+    mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
+    return mean_mask_lm_loss
+def create_model(reader_input, base_model=None, is_training=True, args=None):
+    """
+        given the base model, reader_input
+        return the output tensors
+    """
+    mask_label, mask_pos = reader_input
+    config = JsonConfig(args.bert_config_path)
+    _emb_size = config['hidden_size']
+    _voc_size = config['vocab_size']
+    _hidden_act = config['hidden_act']
+    _word_emb_name = "word_embedding"
+    _dtype = "float16" if args.use_fp16 else "float32"
+    _param_initializer = fluid.initializer.TruncatedNormal(
+        scale=config['initializer_range'])
+    mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
+    enc_out = base_model.get_output("sequence_output")
+    # flatten the encoder output to [batch_size * seq_len, hidden_size]
+    reshaped_emb_out = fluid.layers.reshape(
+        x=enc_out, shape=[-1, _emb_size])
+    # extract masked tokens' feature
+    mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
+    num_seqs = fluid.layers.fill_constant(shape=[1], value=512, dtype='int64')
+    # transform: fc
+    mask_trans_feat = fluid.layers.fc(
+        input=mask_feat,
+        size=_emb_size,
+        act=_hidden_act,
+        param_attr=fluid.ParamAttr(
+            name='mask_lm_trans_fc.w_0',
+            initializer=_param_initializer),
+        bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
+    # transform: layer norm
+    mask_trans_feat = pre_process_layer(
+        mask_trans_feat, 'n', name='mask_lm_trans')
+    mask_lm_out_bias_attr = fluid.ParamAttr(
+        name="mask_lm_out_fc.b_0",
+        initializer=fluid.initializer.Constant(value=0.0))
+    fc_out = fluid.layers.matmul(
+        x=mask_trans_feat,
+        y=fluid.default_main_program().global_block().var(
+            _word_emb_name),
+        transpose_y=True)
+    fc_out += fluid.layers.create_parameter(
+        shape=[_voc_size],
+        dtype=_dtype,
+        attr=mask_lm_out_bias_attr,
+        is_bias=True)
+    output_tensors = {}
+    output_tensors['num_seqs'] = num_seqs
+    output_tensors['mlm_out'] = fc_out
+    output_tensors['mask_label'] = mask_label
+    return output_tensors
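The reshape and gather steps in `mlm_net.create_model` above pick out the hidden vectors of the masked tokens. This is a small sketch in plain Python, assuming the data reader supplies `mask_pos` as flat row indices of the form `sequence_index * seq_len + token_index`. The reader is not shown in this part of the diff, so that convention and the helper names are assumptions.

```python
def flatten(enc_out):
    """Reshape [batch, seq_len, hidden] to [batch * seq_len, hidden]."""
    return [token for sequence in enc_out for token in sequence]


def gather(rows, index):
    """Pick rows by position, like a gather along axis 0."""
    return [rows[i] for i in index]


# Toy encoder output: batch of 2 sequences, 3 tokens each, hidden size 2.
enc_out = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1]],
]
seq_len = 3
# (sequence index, token index) pairs of the masked tokens.
masked = [(0, 1), (1, 2)]
# Assumed convention: each pair becomes a flat row index.
mask_pos = [b * seq_len + t for b, t in masked]

mask_feat = gather(flatten(enc_out), mask_pos)
print(mask_pos)
print(mask_feat)
```

Only the gathered rows go through the transform layer and the vocabulary projection, so the cost of the MLM head scales with the number of masked tokens rather than with the full sequence length.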
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/mrqa_net.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/mrqa_net.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import paddle.fluid as fluid
+def compute_loss(output_tensors, args=None):
+    """Compute loss for mrc model"""
+    def _compute_single_loss(logits, positions):
+        """Compute start/end loss for mrc model"""
+        loss = fluid.layers.softmax_with_cross_entropy(
+            logits=logits, label=positions)
+        loss = fluid.layers.mean(x=loss)
+        return loss
+    start_logits = output_tensors['start_logits']
+    end_logits = output_tensors['end_logits']
+    start_positions = output_tensors['start_positions']
+    end_positions = output_tensors['end_positions']
+    start_loss = _compute_single_loss(start_logits, start_positions)
+    end_loss = _compute_single_loss(end_logits, end_positions)
+    total_loss = (start_loss + end_loss) / 2.0
+    if args.use_fp16 and args.loss_scaling > 1.0:
+        total_loss = total_loss * args.loss_scaling
+    return total_loss
+def compute_distill_loss(output_tensors, args=None):
+    """Compute the distillation loss of the student against teacher logits"""
+    start_logits = output_tensors['start_logits']
+    end_logits = output_tensors['end_logits']
+    start_logits_truth = output_tensors['start_logits_truth']
+    end_logits_truth = output_tensors['end_logits_truth']
+    input_mask = output_tensors['input_mask']
+    def _mask(logits, input_mask, nan=1e5):
+        input_mask = fluid.layers.reshape(input_mask, [-1, 512])
+        logits = logits - (1.0 - input_mask) * nan
+        return logits
+    start_logits = _mask(start_logits, input_mask)
+    end_logits = _mask(end_logits, input_mask)
+    start_logits_truth = _mask(start_logits_truth, input_mask)
+    end_logits_truth = _mask(end_logits_truth, input_mask)
+    start_logits_truth = fluid.layers.reshape(start_logits_truth, [-1, 512])
+    end_logits_truth = fluid.layers.reshape(end_logits_truth, [-1, 512])
+    T = 1.0
+    start_logits_softmax = fluid.layers.softmax(input=start_logits/T)
+    end_logits_softmax = fluid.layers.softmax(input=end_logits/T)
+    start_logits_truth_softmax = fluid.layers.softmax(input=start_logits_truth/T)
+    end_logits_truth_softmax = fluid.layers.softmax(input=end_logits_truth/T)
+    start_logits_truth_softmax.stop_gradient = True
+    end_logits_truth_softmax.stop_gradient = True
+    start_loss = fluid.layers.cross_entropy(start_logits_softmax, start_logits_truth_softmax, soft_label=True)
+    end_loss = fluid.layers.cross_entropy(end_logits_softmax, end_logits_truth_softmax, soft_label=True)
+    start_loss = fluid.layers.mean(x=start_loss)
+    end_loss = fluid.layers.mean(x=end_loss)
+    total_loss = (start_loss + end_loss) / 2.0
+    return total_loss
+def create_model(reader_input, base_model=None, is_training=True, args=None):
+    """
+        given the base model, reader_input
+        return the output tensors
+    """
+    if is_training: 
+        if args.do_distill: 
+            src_ids, pos_ids, sent_ids, input_mask, \
+                start_logits_truth, end_logits_truth, start_positions, end_positions = reader_input
+        else: 
+            src_ids, pos_ids, sent_ids, input_mask, \
+                start_positions, end_positions = reader_input
+    else:
+        src_ids, pos_ids, sent_ids, input_mask, unique_id = reader_input
+    enc_out = base_model.get_output("sequence_output")
+    logits = fluid.layers.fc(
+        input=enc_out,
+        size=2,
+        num_flatten_dims=2,
+        param_attr=fluid.ParamAttr(
+            name="cls_squad_out_w",
+            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
+        bias_attr=fluid.ParamAttr(
+            name="cls_squad_out_b", initializer=fluid.initializer.Constant(0.)))
+    logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])
+    start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)
+    batch_ones = fluid.layers.fill_constant_batch_size_like(
+        input=start_logits, dtype='int64', shape=[1], value=1)
+    num_seqs = fluid.layers.reduce_sum(input=batch_ones)
+    output_tensors = {}
+    output_tensors['start_logits'] = start_logits
+    output_tensors['end_logits'] = end_logits
+    output_tensors['num_seqs'] = num_seqs
+    output_tensors['input_mask'] = input_mask
+    if is_training:
+        output_tensors['start_positions'] = start_positions
+        output_tensors['end_positions'] = end_positions
+        if args.do_distill: 
+            output_tensors['start_logits_truth'] = start_logits_truth
+            output_tensors['end_logits_truth'] = end_logits_truth
+    else:
+        output_tensors['unique_id'] = unique_id
+    return output_tensors
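`compute_distill_loss` above masks padded positions, applies a softmax with temperature `T`, and takes the cross entropy of the student distribution against the teacher's soft labels. The sketch below reproduces that calculation for one sequence in plain Python so the numbers can be checked by hand. The function names and example logits are hypothetical, and the sketch makes no claim about how Paddle computes it internally.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mask_logits(logits, input_mask, penalty=1e5):
    """Push padded positions (mask == 0) far down so softmax ignores them."""
    return [x - (1.0 - m) * penalty for x, m in zip(logits, input_mask)]


def soft_cross_entropy(student_logits, teacher_logits, input_mask, temperature=1.0):
    """Cross entropy of the student distribution against teacher soft labels."""
    student = softmax(mask_logits(student_logits, input_mask), temperature)
    teacher = softmax(mask_logits(teacher_logits, input_mask), temperature)
    # Terms with zero teacher probability contribute nothing to the sum.
    return -sum(t * math.log(s) for s, t in zip(student, teacher) if t > 0.0)


mask = [1.0, 1.0, 1.0, 0.0]        # last position is padding
teacher = [2.0, 0.5, -1.0, 7.0]    # the padded logit is ignored after masking
matched = soft_cross_entropy(teacher, teacher, mask)
mismatched = soft_cross_entropy([-1.0, 0.5, 2.0, 0.0], teacher, mask)
print(matched, mismatched)
```

The loss is smallest when the student reproduces the teacher distribution, and it is unaffected by whatever logit value sits at a padded position.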
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/transformer_encoder.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/model/transformer_encoder.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Transformer encoder."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+# reduce is not a builtin on Python 3; layer_norm below relies on it
+from functools import partial, reduce
+import numpy as np
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+from paddle.fluid.layer_helper import LayerHelper
+def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None):
+    helper = LayerHelper('layer_norm', **locals())
+    mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True)
+    shift_x = layers.elementwise_sub(x=x, y=mean, axis=0)
+    variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True)
+    r_stdev = layers.rsqrt(variance + epsilon)
+    norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0)
+    param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])]
+    param_dtype = norm_x.dtype
+    scale = helper.create_parameter(
+        attr=param_attr,
+        shape=param_shape,
+        dtype=param_dtype,
+        default_initializer=fluid.initializer.Constant(1.))
+    bias = helper.create_parameter(
+        attr=bias_attr,
+        shape=param_shape,
+        dtype=param_dtype,
+        is_bias=True,
+        default_initializer=fluid.initializer.Constant(0.))
+    out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1)
+    out = layers.elementwise_add(x=out, y=bias, axis=-1)
+    return out
+def multi_head_attention(queries,
+                         keys,
+                         values,
+                         attn_bias,
+                         d_key,
+                         d_value,
+                         d_model,
+                         n_head=1,
+                         dropout_rate=0.,
+                         cache=None,
+                         param_initializer=None,
+                         name='multi_head_att'):
+    """
+    Multi-Head Attention. Note that attn_bias is added to the logit before
+    computing softmax activation to mask certain selected positions so that
+    they will not be considered in attention weights.
+    """
+    keys = queries if keys is None else keys
+    values = keys if values is None else values
+    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
+        raise ValueError(
+            "Inputs: queries, keys and values should all be 3-D tensors.")
+    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
+        """
+        Add linear projection to queries, keys, and values.
+        """
+        q = layers.fc(input = queries,
+                      size = d_key * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_query_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_query_fc.b_0')
+        k = layers.fc(input = keys,
+                      size = d_key * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_key_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_key_fc.b_0')
+        v = layers.fc(input = values,
+                      size = d_value * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_value_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_value_fc.b_0')
+        return q, k, v
+    def __split_heads(x, n_head):
+        """
+        Reshape the last dimension of input tensor x so that it becomes two
+        dimensions and then transpose. Specifically, input a tensor with shape
+        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
+        with shape [bs, n_head, max_sequence_length, hidden_dim].
+        """
+        hidden_size = x.shape[-1]
+        # The value 0 in shape attr means copying the corresponding dimension
+        # size of the input as the output dimension size.
+        reshaped = layers.reshape(
+            x = x, shape = [0, 0, n_head, hidden_size // n_head], inplace=False)
+        # permute the dimensions into:
+        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
+        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
+    def __combine_heads(x):
+        """
+        Transpose and then reshape the last two dimensions of input tensor x
+        so that it becomes one dimension, which is the reverse of __split_heads.
+        """
+        if len(x.shape) == 3: return x
+        if len(x.shape) != 4:
+            raise ValueError("Input(x) should be a 4-D Tensor.")
+        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
+        # The value 0 in shape attr means copying the corresponding dimension
+        # size of the input as the output dimension size.
+        return layers.reshape(
+            x = trans_x,
+            shape = [0, 0, trans_x.shape[2] * trans_x.shape[3]],
+            inplace = False)
+    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
+        """
+        Scaled Dot-Product Attention
+        """
+        scaled_q = layers.scale(x = q, scale = d_key**-0.5)
+        product = layers.matmul(x = scaled_q, y = k, transpose_y = True)
+        if attn_bias:
+            product += attn_bias
+        weights = layers.softmax(product)
+        if dropout_rate:
+            weights = layers.dropout(
+                weights,
+                dropout_prob=dropout_rate,
+                dropout_implementation="upscale_in_train",
+                is_test=False)
+        out = layers.matmul(weights, v)
+        return out
+    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
+    if cache is not None:  # use cache and concat time steps
+        # Since the inplace reshape in __split_heads changes the shape of k and
+        # v, which is the cache input for next time step, reshape the cache
+        # input from the previous time step first.
+        k = cache["k"] = layers.concat(
+            [layers.reshape(
+                cache["k"], shape=[0, 0, d_model]), k], axis=1)
+        v = cache["v"] = layers.concat(
+            [layers.reshape(
+                cache["v"], shape=[0, 0, d_model]), v], axis=1)
+    q = __split_heads(q, n_head)
+    k = __split_heads(k, n_head)
+    v = __split_heads(v, n_head)
+    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
+                                                  dropout_rate)
+    out = __combine_heads(ctx_multiheads)
+    # Project back to the model size.
+    proj_out = layers.fc(input = out,
+                         size = d_model,
+                         num_flatten_dims = 2,
+                         param_attr=fluid.ParamAttr(
+                             name = name + '_output_fc.w_0',
+                             initializer = param_initializer),
+                         bias_attr = name + '_output_fc.b_0')
+    return proj_out
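For reference, the computation performed by `scaled_dot_product_attention` for a single head can be sketched in plain Python, without Paddle. The helper names `softmax` and `attention` and the toy inputs are illustrative assumptions rather than part of this PR, and dropout is left out to keep the sketch deterministic.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, attn_bias=None):
    """Single-head scaled dot-product attention on nested lists.

    q: [len_q][d_key], k: [len_k][d_key], v: [len_k][d_value].
    attn_bias: optional [len_q][len_k] additive bias; a large negative
    entry effectively removes that key position from the weights.
    """
    d_key = len(q[0])
    scale = d_key ** -0.5
    out = []
    for i, q_row in enumerate(q):
        # Scaled dot product of one query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row))
                  for k_row in k]
        if attn_bias is not None:
            scores = [s + b for s, b in zip(scores, attn_bias[i])]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
unmasked = attention(q, k, v)
masked = attention(q, k, v, attn_bias=[[0.0, -1e9]])
```

With the bias applied, the second key receives essentially zero weight, so `masked` reduces to the first value vector.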
+def positionwise_feed_forward(x,
+                              d_inner_hid,
+                              d_hid,
+                              dropout_rate,
+                              hidden_act,
+                              param_initializer=None,
+                              name='ffn'):
+    """
+    Position-wise Feed-Forward Networks.
+    This module consists of two linear transformations with an activation
+    (selected by hidden_act) in between, applied to each position separately
+    and identically.
+    """
+    hidden = layers.fc(input=x,
+                       size=d_inner_hid,
+                       num_flatten_dims=2,
+                       act=hidden_act,
+                       param_attr=fluid.ParamAttr(
+                           name=name + '_fc_0.w_0',
+                           initializer=param_initializer),
+                       bias_attr=name + '_fc_0.b_0')
+    if dropout_rate:
+        hidden = layers.dropout(
+            hidden,
+            dropout_prob=dropout_rate,
+            dropout_implementation="upscale_in_train",
+            is_test = False)
+    out = layers.fc(input = hidden,
+                    size = d_hid,
+                    num_flatten_dims = 2,
+                    param_attr=fluid.ParamAttr(
+                        name = name + '_fc_1.w_0', 
+                        initializer = param_initializer),
+                    bias_attr = name + '_fc_1.b_0')
+    return out
+def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
+                           name=''):
+    """
+    Add residual connection, layer normalization and dropout to the out tensor
+    optionally according to the value of process_cmd.
+    This will be used before or after multi-head attention and position-wise
+    feed-forward networks.
+    """
+    for cmd in process_cmd:
+        if cmd == "a":  # add residual connection
+            out = out + prev_out if prev_out else out
+        elif cmd == "n":  # add layer normalization
+            out_dtype = out.dtype
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x = out, dtype = "float32")
+            out = layer_norm(
+                out,
+                begin_norm_axis=len(out.shape) - 1,
+                param_attr=fluid.ParamAttr(
+                    name = name + '_layer_norm_scale',
+                    initializer = fluid.initializer.Constant(1.)),
+                bias_attr=fluid.ParamAttr(
+                    name = name + '_layer_norm_bias',
+                    initializer = fluid.initializer.Constant(0.)))
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x = out, dtype = "float16")
+        elif cmd == "d":  # add dropout
+            if dropout_rate:
+                out = layers.dropout(
+                    out,
+                    dropout_prob = dropout_rate,
+                    dropout_implementation = "upscale_in_train",
+                    is_test = False)
+    return out
+pre_process_layer = partial(pre_post_process_layer, None)
+post_process_layer = pre_post_process_layer
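The `process_cmd` string is read one character at a time, so `"n"` (the default pre-process command) normalizes its input and `"da"` (the default post-process command) applies dropout and then adds the residual. A minimal sketch of that dispatch on plain lists follows; `toy_layer_norm` and `apply_process_cmd` are hypothetical names, the scale and bias parameters are assumed to sit at their initial values of 1 and 0, and dropout is a no-op here.

```python
import math


def toy_layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (scale=1, bias=0)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def apply_process_cmd(prev_out, out, process_cmd):
    """Apply each command character in order, as pre_post_process_layer does."""
    for cmd in process_cmd:
        if cmd == "a" and prev_out is not None:  # residual connection
            out = [o + p for o, p in zip(out, prev_out)]
        elif cmd == "n":  # layer normalization
            out = toy_layer_norm(out)
        elif cmd == "d":  # dropout, omitted in this sketch
            pass
    return out


residual = [1.0, 2.0, 3.0]
sublayer_out = [0.5, 0.5, 0.5]
post = apply_process_cmd(residual, sublayer_out, "da")
pre = apply_process_cmd(None, residual, "n")
```

Passing `None` as `prev_out` mirrors how `pre_process_layer` is built with `partial`, which makes the `"a"` command a no-op on the pre-process path.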
+def encoder_layer(enc_input,
+                  attn_bias,
+                  n_head,
+                  d_key,
+                  d_value,
+                  d_model,
+                  d_inner_hid,
+                  prepostprocess_dropout,
+                  attention_dropout,
+                  relu_dropout,
+                  hidden_act,
+                  preprocess_cmd="n",
+                  postprocess_cmd="da",
+                  param_initializer=None,
+                  name=''):
+    """The encoder layers that can be stacked to form a deep encoder.
+    This module consists of a multi-head (self) attention followed by
+    position-wise feed-forward networks, and both components are accompanied
+    by the post_process_layer to add residual connection, layer normalization
+    and dropout.
+    """
+    attn_output = multi_head_attention(
+        pre_process_layer(
+            enc_input,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name=name + '_pre_att'),
+        None,
+        None,
+        attn_bias,
+        d_key,
+        d_value,
+        d_model,
+        n_head,
+        attention_dropout,
+        param_initializer = param_initializer,
+        name = name + '_multi_head_att')
+    attn_output = post_process_layer(
+        enc_input,
+        attn_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name = name + '_post_att')
+    ffd_output = positionwise_feed_forward(
+        pre_process_layer(
+            attn_output,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name = name + '_pre_ffn'),
+        d_inner_hid,
+        d_model,
+        relu_dropout,
+        hidden_act,
+        param_initializer = param_initializer,
+        name = name + '_ffn')
+    return post_process_layer(
+        attn_output,
+        ffd_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name = name + '_post_ffn')
+def encoder(enc_input,
+            attn_bias,
+            n_layer,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd="n",
+            postprocess_cmd="da",
+            param_initializer=None,
+            name='',
+            return_all = False):
+    """
+    The encoder is composed of a stack of identical layers returned by calling
+    encoder_layer.
+    """
+    enc_outputs = []
+    for i in range(n_layer):
+        enc_output = encoder_layer(
+            enc_input,
+            attn_bias,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd,
+            postprocess_cmd,
+            param_initializer = param_initializer,
+            name = name + '_layer_' + str(i))
+        enc_input = enc_output
+        if i < n_layer - 1:
+            enc_outputs.append(enc_output)
+    enc_output = pre_process_layer(
+        enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
+    enc_outputs.append(enc_output)
+    if not return_all:
+        return enc_output
+    else:
+        return enc_output, enc_outputs
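The bookkeeping around `return_all` is easy to misread: intermediate layer outputs are collected as they are, while the last layer's output is stored only after the final pre-process step. A toy sketch of that control flow, with ordinary functions standing in for `encoder_layer` and the final normalization (`run_stack` and the lambdas are illustrative assumptions, not code from this PR):

```python
def run_stack(x, layers, final_norm, return_all=False):
    """Chain layers; keep raw intermediate outputs plus the normalized last one."""
    outputs = []
    out = x
    for i, layer in enumerate(layers):
        out = layer(out)
        if i < len(layers) - 1:
            # Every layer except the last is recorded without final normalization.
            outputs.append(out)
    out = final_norm(out)
    outputs.append(out)
    return (out, outputs) if return_all else out


toy_layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]
final, collected = run_stack(1, toy_layers, final_norm=lambda h: h * 10,
                             return_all=True)
only_final = run_stack(1, toy_layers, final_norm=lambda h: h * 10)
```

The collected list therefore has one entry per layer, and its last entry equals the value returned when `return_all` is false.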
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/optimizer/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/optimizer/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/optimizer/optimization.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/optimizer/optimization.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Optimization and learning rate scheduling."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import numpy as np
+import paddle.fluid as fluid
+from utils.fp16 import create_master_params_grads, master_param_to_train_param
+def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
+    """ Applies linear warmup of learning rate from 0 and decay to 0."""
+    with fluid.default_main_program()._lr_schedule_guard():
+        lr = fluid.layers.tensor.create_global_var(
+            shape=[1],
+            value=0.0,
+            dtype='float32',
+            persistable=True,
+            name="scheduled_learning_rate")
+        global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
+        with fluid.layers.control_flow.Switch() as switch:
+            with switch.case(global_step < warmup_steps):
+                warmup_lr = learning_rate * (global_step / warmup_steps)
+                fluid.layers.tensor.assign(warmup_lr, lr)
+            with switch.default():
+                decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
+                    learning_rate=learning_rate,
+                    decay_steps=num_train_steps,
+                    end_learning_rate=0.0,
+                    power=1.0,
+                    cycle=False)
+                fluid.layers.tensor.assign(decayed_lr, lr)
+        return lr
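The schedule built by `linear_warmup_decay` can be checked numerically with a plain-Python equivalent. This sketch assumes the documented behavior of `polynomial_decay` with `power=1.0`, `end_learning_rate=0.0` and `cycle=False` (the step is clamped to `decay_steps`); the function name `linear_warmup_decay_value` is hypothetical.

```python
def linear_warmup_decay_value(step, learning_rate, warmup_steps,
                              num_train_steps):
    """Learning rate at a given global step under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup branch: ramp linearly from 0 towards learning_rate.
        return learning_rate * step / warmup_steps
    # Decay branch: linear decay to 0 over num_train_steps, then stay at 0.
    step = min(step, num_train_steps)
    return learning_rate * (1.0 - step / num_train_steps)


schedule = [linear_warmup_decay_value(s, 1.0, 10, 100)
            for s in (0, 5, 10, 55, 100, 150)]
```

Because the decay is measured from step 0 rather than from the end of warmup, the rate at `step == warmup_steps` is `learning_rate * (1 - warmup_steps / num_train_steps)`, slightly below the peak value.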
+def optimization(loss,
+                 warmup_steps,
+                 num_train_steps,
+                 learning_rate,
+                 train_program,
+                 startup_prog,
+                 weight_decay,
+                 scheduler='linear_warmup_decay',
+                 use_fp16=False,
+                 loss_scaling=1.0):
+    if warmup_steps > 0:
+        if scheduler == 'noam_decay':
+            scheduled_lr = fluid.layers.learning_rate_scheduler\
+             .noam_decay(1/(warmup_steps *(learning_rate ** 2)),
+                         warmup_steps)
+        elif scheduler == 'linear_warmup_decay':
+            scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
+                                               num_train_steps)
+        else:
+            raise ValueError("Unknown learning rate scheduler, should be "
+                             "'noam_decay' or 'linear_warmup_decay'")
+        optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
+    else:
+        optimizer = fluid.optimizer.Adam(learning_rate=learning_rate)
+        scheduled_lr = learning_rate
+    clip_norm_thres = 1.0
+    # When using mixed precision training, scale the gradient clip threshold
+    # by loss_scaling
+    if use_fp16 and loss_scaling > 1.0:
+        clip_norm_thres *= loss_scaling
+    fluid.clip.set_gradient_clip(
+        clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm_thres))
+    def exclude_from_weight_decay(name):
+        if name.find("layer_norm") > -1:
+            return True
+        bias_suffix = ["_bias", "_b", ".b_0"]
+        for suffix in bias_suffix:
+            if name.endswith(suffix):
+                return True
+        return False
+    param_list = dict()
+    if use_fp16:
+        param_grads = optimizer.backward(loss)
+        master_param_grads = create_master_params_grads(
+            param_grads, train_program, startup_prog, loss_scaling)
+        for param, _ in master_param_grads:
+            param_list[param.name] = param * 1.0
+            param_list[param.name].stop_gradient = True
+        optimizer.apply_gradients(master_param_grads)
+        if weight_decay > 0:
+            for param, grad in master_param_grads:
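+                # str.rstrip(".master") strips a set of characters, not a
+                # suffix, so remove the ".master" suffix explicitly.
+                master_suffix = ".master"
+                param_name = param.name
+                if param_name.endswith(master_suffix):
+                    param_name = param_name[:-len(master_suffix)]
+                if exclude_from_weight_decay(param_name):
+                    continue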
+                with param.block.program._optimized_guard(
+                    [param, grad]), fluid.framework.name_scope("weight_decay"):
+                    updated_param = param - param_list[
+                        param.name] * weight_decay * scheduled_lr
+                    fluid.layers.assign(output=param, input=updated_param)
+        master_param_to_train_param(master_param_grads, param_grads,
+                                    train_program)
+    else:
+        for param in train_program.global_block().all_parameters():
+            param_list[param.name] = param * 1.0
+            param_list[param.name].stop_gradient = True
+        _, param_grads = optimizer.minimize(loss)
+        if weight_decay > 0:
+            for param, grad in param_grads:
+                if exclude_from_weight_decay(param.name):
+                    continue
+                with param.block.program._optimized_guard(
+                    [param, grad]), fluid.framework.name_scope("weight_decay"):
+                    updated_param = param - param_list[
+                        param.name] * weight_decay * scheduled_lr
+                    fluid.layers.assign(output=param, input=updated_param)
+    return scheduled_lr
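The weight decay in `optimization` is applied outside Adam: each parameter appears to be snapshotted before the optimizer step, and afterwards `snapshot * weight_decay * scheduled_lr` is subtracted from the updated value, skipping layer-norm and bias parameters. A minimal sketch of that rule on plain floats, where `decoupled_weight_decay` and the dictionaries are illustrative assumptions:

```python
def exclude_from_weight_decay(name):
    """Layer-norm parameters and biases are not decayed."""
    if "layer_norm" in name:
        return True
    return name.endswith(("_bias", "_b", ".b_0"))


def decoupled_weight_decay(params_before, params_after, weight_decay, lr):
    """Subtract lr * weight_decay * (pre-update value) from each decayed param."""
    decayed = {}
    for name, value in params_after.items():
        if exclude_from_weight_decay(name):
            decayed[name] = value
        else:
            decayed[name] = value - params_before[name] * weight_decay * lr
    return decayed


before = {"fc.w_0": 2.0, "fc.b_0": 1.0, "enc_layer_norm_scale": 1.0}
after = {"fc.w_0": 1.5, "fc.b_0": 0.5, "enc_layer_norm_scale": 0.9}
result = decoupled_weight_decay(before, after, weight_decay=0.1, lr=0.5)
```

Only `fc.w_0` changes here; the bias and the layer-norm scale keep the values produced by the optimizer step.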
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/joint_reader.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/joint_reader.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import sys
+import random
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+from utils.placeholder import Placeholder
+def repeat(reader):
+    """Repeat a generator forever"""
+    generator = reader()
+    while True:
+        try:
+            yield next(generator)
+        except StopIteration:
+            generator = reader()
+            yield next(generator)
+def create_joint_generator(input_shape, generators, do_distill, is_multi_task=True):
+    def empty_output(input_shape, batch_size=1):
+        results = []
+        for i in range(len(input_shape)):
+            if input_shape[i][1] == 'int32':
+                dtype = np.int32
+            if input_shape[i][1] == 'int64':
+                dtype = np.int64
+            if input_shape[i][1] == 'float32':
+                dtype = np.float32
+            if input_shape[i][1] == 'float64':
+                dtype = np.float64
+            # copy so that the caller's shape list is not mutated
+            shape = list(input_shape[i][0])
+            shape[0] = batch_size
+            pad_tensor = np.zeros(shape=shape, dtype=dtype)
+            results.append(pad_tensor)
+        return results
+    def wrapper(): 
+        """wrapper data"""
+        generators_inst = [repeat(gen[0]) for gen in generators]
+        generators_ratio = [gen[1] for gen in generators]
+        weights = [ratio/sum(generators_ratio) for ratio in generators_ratio]
+        run_task_id = range(len(generators))
+        while True:
+            idx = np.random.choice(run_task_id, p=weights)
+            gen_results = next(generators_inst[idx])
+            if not gen_results:
+                break
+            batch_size = gen_results[0].shape[0]
+            results = empty_output(input_shape, batch_size)
+            task_id_tensor = np.array([[idx]]).astype("int64")
+            results[0] = task_id_tensor
+            for i in range(4):
+                results[i+1] = gen_results[i]
+            if do_distill: 
+                if idx == 0: 
+                    results[5] = gen_results[4]
+                    results[6] = gen_results[5]
+                    results[7] = gen_results[6]
+                    results[8] = gen_results[7]
+                else: 
+                    results[9] = gen_results[4]
+                    results[10] = gen_results[5]
+            else: 
+                if idx == 0:
+                    # mrc batch
+                    results[5] = gen_results[4]
+                    results[6] = gen_results[5]
+                elif idx == 1:
+                    # mlm batch
+                    results[7] = gen_results[4]
+                    results[8] = gen_results[5]
+            # idx stands for the task index
+            yield results
+    return wrapper
+def create_reader(reader_name, input_shape, is_multi_task, do_distill, *gens):
+    """
+    build reader for multi_task_learning
+    """
+    placeholder = Placeholder(input_shape)
+    pyreader, model_inputs = placeholder.build(capacity=100, reader_name=reader_name)
+    joint_generator = create_joint_generator(input_shape, gens[0], do_distill, is_multi_task=is_multi_task)
+    return joint_generator, pyreader, model_inputs
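The core of `create_joint_generator` is ratio-weighted sampling over endlessly repeated per-task readers. The sketch below reproduces only that part with the standard library: `random.choices` replaces `np.random.choice`, and `sample_tasks` and the toy readers are hypothetical names. Padding of the unused slots and the distillation fields are not modeled.

```python
import random


def repeat(reader):
    """Restart a (non-empty) reader whenever it is exhausted."""
    while True:
        for item in reader():
            yield item


def sample_tasks(readers_with_ratio, num_batches, seed=0):
    """Draw batches from several tasks in proportion to their mix ratios."""
    rng = random.Random(seed)
    streams = [repeat(reader) for reader, _ in readers_with_ratio]
    ratios = [ratio for _, ratio in readers_with_ratio]
    task_ids = list(range(len(streams)))
    picked = []
    for _ in range(num_batches):
        # The task index is what the joint reader stores as the task id tensor.
        idx = rng.choices(task_ids, weights=ratios, k=1)[0]
        picked.append((idx, next(streams[idx])))
    return picked


toy_readers = [
    (lambda: iter(["mrc-0", "mrc-1"]), 3.0),  # main task, sampled more often
    (lambda: iter(["mlm-0"]), 1.0),           # auxiliary task
]
picked = sample_tasks(toy_readers, num_batches=8)
```

Each task keeps its own position in its stream, so a task that is sampled rarely still walks through its data in order.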
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/mlm_reader.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/mlm_reader.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import print_function
+from __future__ import division
+import os
+import re
+import six
+import gzip
+import types
+import logging
+import numpy as np
+import collections
+import paddle
+import paddle.fluid as fluid
+from utils import tokenization
+from utils.batching import prepare_batch_data
+class DataReader(object):
+    def __init__(self,
+                 data_dir,
+                 vocab_path,
+                 batch_size=4096,
+                 in_tokens=True,
+                 max_seq_len=512,
+                 shuffle_files=True,
+                 epoch=100,
+                 voc_size=0,
+                 is_test=False,
+                 generate_neg_sample=False):
+        self.vocab = self.load_vocab(vocab_path)
+        self.data_dir = data_dir
+        self.batch_size = batch_size
+        self.in_tokens = in_tokens
+        self.shuffle_files = shuffle_files
+        self.epoch = epoch
+        self.current_epoch = 0
+        self.current_file_index = 0
+        self.total_file = 0
+        self.current_file = None
+        self.voc_size = voc_size
+        self.max_seq_len = max_seq_len
+        self.pad_id = self.vocab["[PAD]"]
+        self.cls_id = self.vocab["[CLS]"]
+        self.sep_id = self.vocab["[SEP]"]
+        self.mask_id = self.vocab["[MASK]"]
+        self.is_test = is_test
+        self.generate_neg_sample = generate_neg_sample
+        if self.in_tokens:
+            assert self.batch_size >= self.max_seq_len, "The number of " \
+                   "tokens in batch should not be smaller than max seq length."
+        if self.is_test:
+            self.epoch = 1
+            self.shuffle_files = False
+    def get_progress(self):
+        """return current progress of training data
+        """
+        return self.current_epoch, self.current_file_index, self.total_file, self.current_file
+    def parse_line(self, line, max_seq_len=512):
+        """ parse one line to token_ids, sentence_ids, pos_ids, label
+        """
+        line = line.strip().decode().split(";")
+        assert len(line) == 4, "One sample must have 4 fields!"
+        (token_ids, sent_ids, pos_ids, label) = line
+        token_ids = [int(token) for token in token_ids.split(" ")]
+        sent_ids = [int(token) for token in sent_ids.split(" ")]
+        pos_ids = [int(token) for token in pos_ids.split(" ")]
+        assert len(token_ids) == len(sent_ids) == len(
+            pos_ids
+        ), "[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
+        label = int(label)
+        if len(token_ids) > max_seq_len:
+            return None
+        return [token_ids, sent_ids, pos_ids, label]
+    def read_file(self, file):
+        assert file.endswith('.gz'), "[ERROR] %s is not a gzip file" % file
+        file_path = self.data_dir + "/" + file
+        with gzip.open(file_path, "rb") as f:
+            for line in f:
+                parsed_line = self.parse_line(
+                    line, max_seq_len=self.max_seq_len)
+                if parsed_line is None:
+                    continue
+                yield parsed_line
+    def convert_to_unicode(self, text):
+        """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
+        if six.PY3:
+            if isinstance(text, str):
+                return text
+            elif isinstance(text, bytes):
+                return text.decode("utf-8", "ignore")
+            else:
+                raise ValueError("Unsupported string type: %s" % (type(text)))
+        elif six.PY2:
+            if isinstance(text, str):
+                return text.decode("utf-8", "ignore")
+            elif isinstance(text, unicode):
+                return text
+            else:
+                raise ValueError("Unsupported string type: %s" % (type(text)))
+        else:
+            raise ValueError("Not running on Python2 or Python 3?")
+    def load_vocab(self, vocab_file):
+        """Loads a vocabulary file into a dictionary."""
+        vocab = collections.OrderedDict()
+        # use a context manager so the vocabulary file is always closed
+        with open(vocab_file) as fin:
+            for num, line in enumerate(fin):
+                items = self.convert_to_unicode(line.strip()).split("\t")
+                if len(items) > 2:
+                    break
+                token = items[0]
+                index = items[1] if len(items) == 2 else num
+                token = token.strip()
+                vocab[token] = int(index)
+        return vocab
+    def random_pair_neg_samples(self, pos_samples):
+        """ randomly generate negative samples using pos_samples
+            Args:
+                pos_samples: list of positive samples
+            Returns:
+                neg_samples: list of negative samples
+        """
+        np.random.shuffle(pos_samples)
+        num_sample = len(pos_samples)
+        neg_samples = []
+        miss_num = 0
+        for i in range(num_sample):
+            pair_index = (i + 1) % num_sample
+            origin_src_ids = pos_samples[i][0]
+            origin_sep_index = origin_src_ids.index(2)
+            pair_src_ids = pos_samples[pair_index][0]
+            pair_sep_index = pair_src_ids.index(2)
+            src_ids = origin_src_ids[:origin_sep_index + 1] + pair_src_ids[
+                pair_sep_index + 1:]
+            if len(src_ids) >= self.max_seq_len:
+                miss_num += 1
+                continue
+            sent_ids = [0] * len(origin_src_ids[:origin_sep_index + 1]) + [
+                1
+            ] * len(pair_src_ids[pair_sep_index + 1:])
+            pos_ids = list(range(len(src_ids)))
+            neg_sample = [src_ids, sent_ids, pos_ids, 0]
+            assert len(src_ids) == len(sent_ids) == len(
+                pos_ids
+            ), "[ERROR]len(src_id) == len(sent_id) == len(pos_id) must be True"
+            neg_samples.append(neg_sample)
+        return neg_samples, miss_num
+    def mixin_negtive_samples(self, pos_sample_generator, buffer=1000):
+        """ 1. generate negative samples by randomly grouping sentence_1 and sentence_2 of positive samples
+            2. combine negative samples and positive samples
+            Args:
+                pos_sample_generator: a generator producing a parsed positive sample, which is a list: [token_ids, sent_ids, pos_ids, 1]
+            Returns:
+                sample: one sample from shuffled positive samples and negative samples
+        """
+        pos_samples = []
+        num_total_miss = 0
+        pos_sample_num = 0
+        try:
+            while True:
+                while len(pos_samples) < buffer:
+                    pos_sample = next(pos_sample_generator)
+                    label = pos_sample[3]
+                    assert label == 1, "positive sample's label must be 1"
+                    pos_samples.append(pos_sample)
+                    pos_sample_num += 1
+                neg_samples, miss_num = self.random_pair_neg_samples(
+                    pos_samples)
+                num_total_miss += miss_num
+                samples = pos_samples + neg_samples
+                pos_samples = []
+                np.random.shuffle(samples)
+                for sample in samples:
+                    yield sample
+        except StopIteration:
+            print("stopiteration: reach end of file")
+            if len(pos_samples) == 1:
+                yield pos_samples[0]
+            elif len(pos_samples) == 0:
+                yield None
+            else:
+                neg_samples, miss_num = self.random_pair_neg_samples(
+                    pos_samples)
+                num_total_miss += miss_num
+                samples = pos_samples + neg_samples
+                pos_samples = []
+                np.random.shuffle(samples)
+                for sample in samples:
+                    yield sample
+            print("miss_num:%d\tideal_total_sample_num:%d\tmiss_rate:%f" %
+                  (num_total_miss, pos_sample_num * 2,
+                   num_total_miss / (pos_sample_num * 2)))
+    def data_generator(self):
+        """
+        data_generator
+        """
+        files = os.listdir(self.data_dir)
+        self.total_file = len(files)
+        assert self.total_file > 0, "[Error] data_dir is empty"
+        def wrapper():
+            def reader():
+                for epoch in range(self.epoch):
+                    self.current_epoch = epoch + 1
+                    if self.shuffle_files:
+                        np.random.shuffle(files)
+                    for index, file in enumerate(files):
+                        self.current_file_index = index + 1
+                        self.current_file = file
+                        sample_generator = self.read_file(file)
+                        if not self.is_test and self.generate_neg_sample:
+                            sample_generator = self.mixin_negtive_samples(
+                                sample_generator)
+                        for sample in sample_generator:
+                            if sample is None:
+                                continue
+                            yield sample
+            def batch_reader(reader, batch_size, in_tokens):
+                batch, total_token_num, max_len = [], 0, 0
+                for parsed_line in reader():
+                    token_ids, sent_ids, pos_ids, label = parsed_line
+                    max_len = max(max_len, len(token_ids))
+                    if in_tokens:
+                        to_append = (len(batch) + 1) * max_len <= batch_size
+                    else:
+                        to_append = len(batch) < batch_size
+                    if to_append:
+                        batch.append(parsed_line)
+                        total_token_num += len(token_ids)
+                    else:
+                        yield batch, total_token_num
+                        batch, total_token_num, max_len = [parsed_line], len(
+                            token_ids), len(token_ids)
+                if len(batch) > 0:
+                    yield batch, total_token_num
+            for batch_data, total_token_num in batch_reader(
+                    reader, self.batch_size, self.in_tokens):
+                yield prepare_batch_data(
+                    batch_data,
+                    total_token_num,
+                    voc_size=self.voc_size,
+                    pad_id=self.pad_id,
+                    cls_id=self.cls_id,
+                    sep_id=self.sep_id,
+                    mask_id=self.mask_id,
+                    max_len=self.max_seq_len,
+                    return_input_mask=True,
+                    return_max_len=False,
+                    return_num_token=False)
+        return wrapper
+if __name__ == "__main__":
+    pass
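The `in_tokens` mode of `batch_reader` treats `batch_size` as a budget on padded tokens: a sample joins the current batch only if `(len(batch) + 1) * max_len` stays within the budget, where `max_len` is the longest sequence seen since the batch started. A standalone sketch of that rule, where `batch_by_tokens` is a hypothetical name and samples are plain token lists:

```python
def batch_by_tokens(samples, batch_size, in_tokens=True):
    """Group samples by padded-token budget or by plain sample count."""
    batch, max_len = [], 0
    for sample in samples:
        max_len = max(max_len, len(sample))
        if in_tokens:
            # Padded size if this sample were added to the batch.
            fits = (len(batch) + 1) * max_len <= batch_size
        else:
            fits = len(batch) < batch_size
        if fits:
            batch.append(sample)
        else:
            yield batch
            # The sample that did not fit opens the next batch.
            batch, max_len = [sample], len(sample)
    if batch:
        yield batch


samples = [[1] * 3, [1] * 4, [1] * 2, [1] * 8, [1] * 1]
token_batches = [[len(s) for s in b] for b in batch_by_tokens(samples, 8)]
count_batches = [[len(s) for s in b]
                 for b in batch_by_tokens(samples, 2, in_tokens=False)]
```

Note that `max_len` is updated before the fit check, so a long sample that is pushed to the next batch can also close the current batch early.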
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/mrqa_distill_reader.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/mrqa_distill_reader.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Run MRQA"""
+import six
+import math
+import json
+import random
+import collections
+import numpy as np
+from utils import tokenization
+from utils.batching import prepare_batch_data
+class DataProcessorDistill(object): 
+    def __init__(self): 
+        self.num_examples = -1
+        self.current_train_example = -1
+        self.current_train_epoch = -1
+    def get_features(self, data_path): 
+        with open(data_path, 'r') as fr: 
+            for line in fr: 
+                yield line.strip()
+    def data_generator(self, 
+                       data_file, 
+                       batch_size, 
+                       max_len, 
+                       in_tokens, 
+                       dev_count,
+                       epochs,
+                       shuffle): 
+        with open(data_file, "r") as fr:
+            self.num_examples = sum(1 for _ in fr)
+        def batch_reader(data_file, in_tokens, batch_size): 
+            batch = []
+            for feature in self.get_features(data_file): 
+                to_append = len(batch) < batch_size
+                if to_append: 
+                    batch.append(feature)
+                else: 
+                    yield batch
+                    # Start the next batch with the current feature so it is not dropped.
+                    batch = [feature]
+            if len(batch) > 0: 
+                yield batch
+        def wrapper(): 
+            for epoch in range(epochs): 
+                all_batches = []
+                for batch_data in batch_reader(data_file, in_tokens, batch_size): 
+                    batch_data_segment = []
+                    for feature in batch_data: 
+                        data = json.loads(feature.strip())
+                        example_index = data['example_index']
+                        unique_id = data['unique_id']
+                        input_ids = data['input_ids']
+                        position_ids = data['position_ids']
+                        input_mask = data['input_mask']
+                        segment_ids = data['segment_ids']
+                        start_position = data['start_position']
+                        end_position = data['end_position']
+                        start_logits = data['start_logits']
+                        end_logits = data['end_logits']
+                        instance = [input_ids, position_ids, segment_ids, input_mask, start_logits, end_logits, start_position, end_position]
+                        batch_data_segment.append(instance)
+                    batch_data = batch_data_segment
+                    src_ids = [inst[0] for inst in batch_data]
+                    pos_ids = [inst[1] for inst in batch_data]
+                    sent_ids = [inst[2] for inst in batch_data]
+                    input_mask = [inst[3] for inst in batch_data]
+                    start_logits = [inst[4] for inst in batch_data]
+                    end_logits = [inst[5] for inst in batch_data]
+                    src_ids = np.array(src_ids).astype("int64").reshape([-1, max_len, 1])
+                    pos_ids = np.array(pos_ids).astype("int64").reshape([-1, max_len, 1])
+                    sent_ids = np.array(sent_ids).astype("int64").reshape([-1, max_len, 1])
+                    input_mask = np.array(input_mask).astype("float32").reshape([-1, max_len, 1])
+                    start_logits = np.array(start_logits).astype("float32").reshape([-1, max_len])
+                    end_logits = np.array(end_logits).astype("float32").reshape([-1, max_len])
+                    start_positions = [inst[6] for inst in batch_data]
+                    end_positions = [inst[7] for inst in batch_data]
+                    start_positions = np.array(start_positions).astype("int64").reshape([-1, 1])
+                    end_positions = np.array(end_positions).astype("int64").reshape([-1, 1])
+                    batch_data = [src_ids, pos_ids, sent_ids, input_mask, start_logits, end_logits, start_positions, end_positions]
+                    if len(all_batches) < dev_count:
+                        all_batches.append(batch_data)
+                    if len(all_batches) == dev_count: 
+                        for batch in all_batches: 
+                            yield batch
+                        all_batches = []
+        return wrapper
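The distillation reader combines two ideas: fixed-size batching of pre-computed feature lines, and emitting batches only in groups of `dev_count` so that every device receives one batch per step. The sketch below illustrates both in plain Python. The helper names `fixed_size_batches` and `groups_for_devices` are hypothetical and do not appear in this commit, and the batching helper is written so that the item arriving at a full batch opens the next one.

```python
def fixed_size_batches(items, batch_size):
    """Yield lists of at most `batch_size` items without losing any item."""
    batch = []
    for item in items:
        if len(batch) < batch_size:
            batch.append(item)
        else:
            yield batch
            batch = [item]  # the item that triggered the flush opens the next batch
    if batch:
        yield batch


def groups_for_devices(batches, dev_count):
    """Yield batches only in complete groups of `dev_count`.

    A trailing incomplete group is dropped, mirroring the reader above.
    """
    group = []
    for batch in batches:
        group.append(batch)
        if len(group) == dev_count:
            for ready in group:
                yield ready
            group = []


if __name__ == "__main__":
    batches = fixed_size_batches(range(5), batch_size=2)
    print(list(groups_for_devices(batches, dev_count=2)))
```

Dropping the incomplete trailing group means a few examples per epoch may go unused when the number of batches is not a multiple of `dev_count`.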
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/mrqa_reader.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/reader/mrqa_reader.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Run MRQA"""
+import six
+import math
+import json
+import random
+import collections
+import numpy as np
+from utils import tokenization
+from utils.batching import prepare_batch_data
+class MRQAExample(object):
+    """A single training/test example for MRQA extractive question answering.
+     For examples without an answer, the start and end position are -1.
+  """
+    def __init__(self,
+                 qas_id,
+                 question_text,
+                 doc_tokens,
+                 orig_answer_text=None,
+                 start_position=None,
+                 end_position=None,
+                 is_impossible=False):
+        self.qas_id = qas_id
+        self.question_text = question_text
+        self.doc_tokens = doc_tokens
+        self.orig_answer_text = orig_answer_text
+        self.start_position = start_position
+        self.end_position = end_position
+        self.is_impossible = is_impossible
+    def __str__(self):
+        return self.__repr__()
+    def __repr__(self):
+        s = ""
+        s += "qas_id: %s" % (tokenization.printable_text(self.qas_id))
+        s += ", question_text: %s" % (
+            tokenization.printable_text(self.question_text))
+        s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens))
+        if self.start_position is not None:
+            s += ", start_position: %d" % (self.start_position)
+        if self.end_position is not None:
+            s += ", end_position: %d" % (self.end_position)
+        if self.is_impossible:
+            s += ", is_impossible: %r" % (self.is_impossible)
+        return s
+class InputFeatures(object):
+    """A single set of features of data."""
+    def __init__(self,
+                 unique_id,
+                 example_index,
+                 doc_span_index,
+                 tokens,
+                 token_to_orig_map,
+                 token_is_max_context,
+                 input_ids,
+                 input_mask,
+                 segment_ids,
+                 start_position=None,
+                 end_position=None,
+                 is_impossible=None):
+        self.unique_id = unique_id
+        self.example_index = example_index
+        self.doc_span_index = doc_span_index
+        self.tokens = tokens
+        self.token_to_orig_map = token_to_orig_map
+        self.token_is_max_context = token_is_max_context
+        self.input_ids = input_ids
+        self.input_mask = input_mask
+        self.segment_ids = segment_ids
+        self.start_position = start_position
+        self.end_position = end_position
+        self.is_impossible = is_impossible
+def read_mrqa_examples(input_file, is_training, with_negative=False):
+    """Read an MRQA JSON file into a list of MRQAExample."""
+    with open(input_file, "r") as reader:
+        input_data = json.load(reader)["data"]
+    def is_whitespace(c):
+        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
+            return True
+        return False
+    examples = []
+    for entry in input_data:
+        for paragraph in entry["paragraphs"]:
+            paragraph_text = paragraph["context"]
+            doc_tokens = []
+            char_to_word_offset = []
+            prev_is_whitespace = True
+            for c in paragraph_text:
+                if is_whitespace(c):
+                    prev_is_whitespace = True
+                else:
+                    if prev_is_whitespace:
+                        doc_tokens.append(c)
+                    else:
+                        doc_tokens[-1] += c
+                    prev_is_whitespace = False
+                char_to_word_offset.append(len(doc_tokens) - 1)
+            for qa in paragraph["qas"]:
+                qas_id = qa["id"]
+                question_text = qa["question"]
+                start_position = None
+                end_position = None
+                orig_answer_text = None
+                is_impossible = False
+                if is_training:
+                    if with_negative:
+                        is_impossible = qa["is_impossible"]
+                    if (len(qa["answers"]) != 1) and (not is_impossible):
+                        raise ValueError(
+                            "For training, each question should have exactly 1 answer."
+                        )
+                    if not is_impossible:
+                        answer = qa["answers"][0]
+                        orig_answer_text = answer["text"]
+                        answer_offset = answer["answer_start"]
+                        answer_length = len(orig_answer_text)
+                        start_position = char_to_word_offset[answer_offset]
+                        end_position = char_to_word_offset[answer_offset +
+                                                           answer_length - 1]
+                        # Only add answers where the text can be exactly recovered from the
+                        # document. If this CAN'T happen it's likely due to weird Unicode
+                        # stuff so we will just skip the example.
+                        #
+                        # Note that this means for training mode, every example is NOT
+                        # guaranteed to be preserved.
+                        actual_text = " ".join(doc_tokens[start_position:(
+                            end_position + 1)])
+                        cleaned_answer_text = " ".join(
+                            tokenization.whitespace_tokenize(orig_answer_text))
+                        if actual_text.find(cleaned_answer_text) == -1:
+                            print("Could not find answer: '%s' vs. '%s'" %
+                                  (actual_text, cleaned_answer_text))
+                            continue
+                    else:
+                        start_position = -1
+                        end_position = -1
+                        orig_answer_text = ""
+                example = MRQAExample(
+                    qas_id=qas_id,
+                    question_text=question_text,
+                    doc_tokens=doc_tokens,
+                    orig_answer_text=orig_answer_text,
+                    start_position=start_position,
+                    end_position=end_position,
+                    is_impossible=is_impossible)
+                examples.append(example)
+    return examples
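`read_mrqa_examples` converts character-based answer annotations into word indices by building a `char_to_word_offset` table while it splits the context on whitespace. A minimal sketch of that mapping, under the assumption that the answer text occurs at `answer_start`; the function names `split_with_offsets` and `answer_word_span` are hypothetical and not part of this commit.

```python
def split_with_offsets(text):
    """Whitespace-split `text` and record, per character, its word index."""
    doc_tokens, char_to_word = [], []
    prev_is_whitespace = True
    for c in text:
        if c in " \t\r\n" or ord(c) == 0x202F:
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(c)      # first character of a new word
            else:
                doc_tokens[-1] += c       # continue the current word
            prev_is_whitespace = False
        char_to_word.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word


def answer_word_span(char_to_word, answer_start, answer_text):
    """Map a character-level answer to inclusive (start, end) word indices."""
    start = char_to_word[answer_start]
    end = char_to_word[answer_start + len(answer_text) - 1]
    return start, end


if __name__ == "__main__":
    context = "The leader was John Smith (1895-1943)."
    tokens, offsets = split_with_offsets(context)
    print(tokens)
    print(answer_word_span(offsets, context.index("John Smith"), "John Smith"))
```

An answer such as "1895" maps to the whole word "(1895-1943).", which is why the span is refined again after WordPiece tokenization.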
+def convert_examples_to_features(
+        examples,
+        tokenizer,
+        max_seq_length,
+        doc_stride,
+        max_query_length,
+        is_training,
+        #output_fn
+):
+    """Yields one `InputFeatures` per document span for each example."""
+    unique_id = 1000000000
+    for (example_index, example) in enumerate(examples):
+        query_tokens = tokenizer.tokenize(example.question_text)
+        if len(query_tokens) > max_query_length:
+            query_tokens = query_tokens[0:max_query_length]
+        tok_to_orig_index = []
+        orig_to_tok_index = []
+        all_doc_tokens = []
+        for (i, token) in enumerate(example.doc_tokens):
+            orig_to_tok_index.append(len(all_doc_tokens))
+            sub_tokens = tokenizer.tokenize(token)
+            for sub_token in sub_tokens:
+                tok_to_orig_index.append(i)
+                all_doc_tokens.append(sub_token)
+        tok_start_position = None
+        tok_end_position = None
+        if is_training and example.is_impossible:
+            tok_start_position = -1
+            tok_end_position = -1
+        if is_training and not example.is_impossible:
+            tok_start_position = orig_to_tok_index[example.start_position]
+            if example.end_position < len(example.doc_tokens) - 1:
+                tok_end_position = orig_to_tok_index[example.end_position +
+                                                     1] - 1
+            else:
+                tok_end_position = len(all_doc_tokens) - 1
+            (tok_start_position, tok_end_position) = _improve_answer_span(
+                all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
+                example.orig_answer_text)
+        # The -3 accounts for [CLS], [SEP] and [SEP]
+        max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
+        # We can have documents that are longer than the maximum sequence length.
+        # To deal with this we do a sliding window approach, where we take chunks
+        # of up to our max length with a stride of `doc_stride`.
+        _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
+            "DocSpan", ["start", "length"])
+        doc_spans = []
+        start_offset = 0
+        while start_offset < len(all_doc_tokens):
+            length = len(all_doc_tokens) - start_offset
+            if length > max_tokens_for_doc:
+                length = max_tokens_for_doc
+            doc_spans.append(_DocSpan(start=start_offset, length=length))
+            if start_offset + length == len(all_doc_tokens):
+                break
+            start_offset += min(length, doc_stride)
+        for (doc_span_index, doc_span) in enumerate(doc_spans):
+            tokens = []
+            token_to_orig_map = {}
+            token_is_max_context = {}
+            segment_ids = []
+            tokens.append("[CLS]")
+            segment_ids.append(0)
+            for token in query_tokens:
+                tokens.append(token)
+                segment_ids.append(0)
+            tokens.append("[SEP]")
+            segment_ids.append(0)
+            for i in range(doc_span.length):
+                split_token_index = doc_span.start + i
+                token_to_orig_map[len(tokens)] = tok_to_orig_index[
+                    split_token_index]
+                is_max_context = _check_is_max_context(
+                    doc_spans, doc_span_index, split_token_index)
+                token_is_max_context[len(tokens)] = is_max_context
+                tokens.append(all_doc_tokens[split_token_index])
+                segment_ids.append(1)
+            tokens.append("[SEP]")
+            segment_ids.append(1)
+            input_ids = tokenizer.convert_tokens_to_ids(tokens)
+            # The mask has 1 for real tokens and 0 for padding tokens. Only real
+            # tokens are attended to.
+            input_mask = [1] * len(input_ids)
+            # Zero-pad up to the sequence length.
+            #while len(input_ids) < max_seq_length:
+            #  input_ids.append(0)
+            #  input_mask.append(0)
+            #  segment_ids.append(0)
+            #assert len(input_ids) == max_seq_length
+            #assert len(input_mask) == max_seq_length
+            #assert len(segment_ids) == max_seq_length
+            start_position = None
+            end_position = None
+            if is_training and not example.is_impossible:
+                # For training, if our document chunk does not contain an annotation
+                # we throw it out, since there is nothing to predict.
+                doc_start = doc_span.start
+                doc_end = doc_span.start + doc_span.length - 1
+                out_of_span = False
+                if not (tok_start_position >= doc_start and
+                        tok_end_position <= doc_end):
+                    out_of_span = True
+                if out_of_span:
+                    start_position = 0
+                    end_position = 0
+                    continue
+                else:
+                    doc_offset = len(query_tokens) + 2
+                    start_position = tok_start_position - doc_start + doc_offset
+                    end_position = tok_end_position - doc_start + doc_offset
+            """
+            if is_training and example.is_impossible:
+                start_position = 0
+                end_position = 0
+            """
+            if example_index < 3:
+                print("*** Example ***")
+                print("unique_id: %s" % (unique_id))
+                print("example_index: %s" % (example_index))
+                print("doc_span_index: %s" % (doc_span_index))
+                print("tokens: %s" % " ".join(
+                    [tokenization.printable_text(x) for x in tokens]))
+                print("token_to_orig_map: %s" % " ".join([
+                    "%d:%d" % (x, y)
+                    for (x, y) in six.iteritems(token_to_orig_map)
+                ]))
+                print("token_is_max_context: %s" % " ".join([
+                    "%d:%s" % (x, y)
+                    for (x, y) in six.iteritems(token_is_max_context)
+                ]))
+                print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
+                print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
+                print("segment_ids: %s" %
+                      " ".join([str(x) for x in segment_ids]))
+                if is_training and example.is_impossible:
+                    print("impossible example")
+                if is_training and not example.is_impossible:
+                    answer_text = " ".join(tokens[start_position:(end_position +
+                                                                  1)])
+                    print("start_position: %d" % (start_position))
+                    print("end_position: %d" % (end_position))
+                    print("answer: %s" %
+                          (tokenization.printable_text(answer_text)))
+            feature = InputFeatures(
+                unique_id=unique_id,
+                example_index=example_index,
+                doc_span_index=doc_span_index,
+                tokens=tokens,
+                token_to_orig_map=token_to_orig_map,
+                token_is_max_context=token_is_max_context,
+                input_ids=input_ids,
+                input_mask=input_mask,
+                segment_ids=segment_ids,
+                start_position=start_position,
+                end_position=end_position,
+                is_impossible=example.is_impossible)
+            unique_id += 1
+            yield feature
+def estimate_runtime_examples(data_path, sample_rate, tokenizer, \
+                              max_seq_length, doc_stride, max_query_length, \
+                              remove_impossible_questions=True, filter_invalid_spans=True):
+    """Estimate the number of runtime examples, which may differ from the number of raw samples because of the sliding window and span filtering. This is useful for computing the correct number of warmup steps for training."""
+    assert sample_rate > 0.0 and sample_rate <= 1.0, "sample_rate must be set between 0.0~1.0"
+    print("loading data with json parser...")
+    with open(data_path, "r") as reader:
+        data = json.load(reader)["data"]
+    num_raw_examples = 0
+    for entry in data:
+        for paragraph in entry["paragraphs"]:
+            paragraph_text = paragraph["context"]
+            for qa in paragraph["qas"]:
+                num_raw_examples += 1
+    print("num raw examples:{}".format(num_raw_examples))
+    def is_whitespace(c):
+        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
+            return True
+        return False
+    sampled_examples = []
+    for entry in data:
+        for paragraph in entry["paragraphs"]:
+            doc_tokens = None
+            for qa in paragraph["qas"]:
+                if random.random() > sample_rate and sample_rate < 1.0:
+                    continue
+                if doc_tokens is None:
+                    paragraph_text = paragraph["context"]
+                    doc_tokens = []
+                    char_to_word_offset = []
+                    prev_is_whitespace = True
+                    for c in paragraph_text:
+                        if is_whitespace(c):
+                            prev_is_whitespace = True
+                        else:
+                            if prev_is_whitespace:
+                                doc_tokens.append(c)
+                            else:
+                                doc_tokens[-1] += c
+                            prev_is_whitespace = False
+                        char_to_word_offset.append(len(doc_tokens) - 1)
+                assert len(qa["answers"]) == 1, "For training, each question should have exactly 1 answer."
+                qas_id = qa["id"]
+                question_text = qa["question"]
+                start_position = None
+                end_position = None
+                orig_answer_text = None
+                is_impossible = False
+                if ('is_impossible' in qa) and (qa["is_impossible"]):
+                    if remove_impossible_questions or filter_invalid_spans:
+                        continue
+                    else:
+                        start_position = -1
+                        end_position = -1
+                        orig_answer_text = ""
+                        is_impossible = True
+                else:
+                    answer = qa["answers"][0]
+                    orig_answer_text = answer["text"]
+                    answer_offset = answer["answer_start"]
+                    answer_length = len(orig_answer_text)
+                    start_position = char_to_word_offset[answer_offset]
+                    end_position = char_to_word_offset[answer_offset +
+                                                       answer_length - 1]
+                    # remove corrupt samples
+                    actual_text = " ".join(doc_tokens[start_position:(
+                        end_position + 1)])
+                    cleaned_answer_text = " ".join(
+                        tokenization.whitespace_tokenize(orig_answer_text))
+                    if actual_text.find(cleaned_answer_text) == -1:
+                        print("Could not find answer: '%s' vs. '%s'" %
+                              (actual_text, cleaned_answer_text))
+                        continue
+                example = MRQAExample(
+                    qas_id=qas_id,
+                    question_text=question_text,
+                    doc_tokens=doc_tokens,
+                    orig_answer_text=orig_answer_text,
+                    start_position=start_position,
+                    end_position=end_position,
+                    is_impossible=is_impossible)
+                sampled_examples.append(example)
+    runtime_sample_rate = len(sampled_examples) / float(num_raw_examples)
+    # print("DEBUG-> runtime sampled examples: {}, sample rate: {}.".format(len(sampled_examples), runtime_sample_rate))
+    runtime_samp_cnt = 0
+    for example in sampled_examples:
+        query_tokens = tokenizer.tokenize(example.question_text)
+        if len(query_tokens) > max_query_length:
+            query_tokens = query_tokens[0:max_query_length]
+        tok_to_orig_index = []
+        orig_to_tok_index = []
+        all_doc_tokens = []
+        for (i, token) in enumerate(example.doc_tokens):
+            orig_to_tok_index.append(len(all_doc_tokens))
+            sub_tokens = tokenizer.tokenize(token)
+            for sub_token in sub_tokens:
+                tok_to_orig_index.append(i)
+                all_doc_tokens.append(sub_token)
+        tok_start_position = None
+        tok_end_position = None
+        tok_start_position = orig_to_tok_index[example.start_position]
+        if example.end_position < len(example.doc_tokens) - 1:
+            tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
+        else:
+            tok_end_position = len(all_doc_tokens) - 1
+        (tok_start_position, tok_end_position) = _improve_answer_span(
+            all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
+            example.orig_answer_text)
+        # The -3 accounts for [CLS], [SEP] and [SEP]
+        max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
+        _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
+            "DocSpan", ["start", "length"])
+        doc_spans = []
+        start_offset = 0
+        while start_offset < len(all_doc_tokens):
+            length = len(all_doc_tokens) - start_offset
+            if length > max_tokens_for_doc:
+                length = max_tokens_for_doc
+            doc_spans.append(_DocSpan(start=start_offset, length=length))
+            if start_offset + length == len(all_doc_tokens):
+                break
+            start_offset += min(length, doc_stride)
+        for (doc_span_index, doc_span) in enumerate(doc_spans):
+            doc_start = doc_span.start
+            doc_end = doc_span.start + doc_span.length - 1
+            if filter_invalid_spans and not (tok_start_position >= doc_start and tok_end_position <= doc_end):
+                continue
+            runtime_samp_cnt += 1
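+    # Avoid dividing by zero when no example was sampled.
+    if runtime_sample_rate == 0:
+        return 0
+    return int(runtime_samp_cnt / runtime_sample_rate)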
+def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
+                         orig_answer_text):
+    """Returns tokenized answer spans that better match the annotated answer."""
+    # The MRQA annotations are character based. We first project them to
+    # whitespace-tokenized words. But then after WordPiece tokenization, we can
+    # often find a "better match". For example:
+    #
+    #   Question: What year was John Smith born?
+    #   Context: The leader was John Smith (1895-1943).
+    #   Answer: 1895
+    #
+    # The original whitespace-tokenized answer will be "(1895-1943).". However
+    # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match
+    # the exact answer, 1895.
+    #
+    # However, this is not always possible. Consider the following:
+    #
+    #   Question: What country is the top exporter of electronics?
+    #   Context: The Japanese electronics industry is the largest in the world.
+    #   Answer: Japan
+    #
+    # In this case, the annotator chose "Japan" as a character sub-span of
+    # the word "Japanese". Since our WordPiece tokenizer does not split
+    # "Japanese", we just use "Japanese" as the annotation. This is fairly rare
+    # in MRQA, but does happen.
+    tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
+    for new_start in range(input_start, input_end + 1):
+        for new_end in range(input_end, new_start - 1, -1):
+            text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
+            if text_span == tok_answer_text:
+                return (new_start, new_end)
+    return (input_start, input_end)
+def _check_is_max_context(doc_spans, cur_span_index, position):
+    """Check if this is the 'max context' doc span for the token."""
+    # Because of the sliding window approach taken to scoring documents, a single
+    # token can appear in multiple document spans. E.g.
+    #  Doc: the man went to the store and bought a gallon of milk
+    #  Span A: the man went to the
+    #  Span B: to the store and bought
+    #  Span C: and bought a gallon of
+    #  ...
+    #
+    # Now the word 'bought' will have two scores from spans B and C. We only
+    # want to consider the score with "maximum context", which we define as
+    # the *minimum* of its left and right context (the *sum* of left and
+    # right context will always be the same, of course).
+    #
+    # In the example the maximum context for 'bought' would be span C since
+    # it has 1 left context and 3 right context, while span B has 4 left context
+    # and 0 right context.
+    best_score = None
+    best_span_index = None
+    for (span_index, doc_span) in enumerate(doc_spans):
+        end = doc_span.start + doc_span.length - 1
+        if position < doc_span.start:
+            continue
+        if position > end:
+            continue
+        num_left_context = position - doc_span.start
+        num_right_context = end - position
+        score = min(num_left_context,
+                    num_right_context) + 0.01 * doc_span.length
+        if best_score is None or score > best_score:
+            best_score = score
+            best_span_index = span_index
+    return cur_span_index == best_span_index
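The sliding-window split and the "max context" rule can be checked against the example in the comments above (12 words, windows of 5 tokens, stride 3). The sketch below restates both steps as standalone functions; `make_doc_spans` and `best_span_index` are hypothetical names, and the scoring mirrors `_check_is_max_context`.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def make_doc_spans(num_tokens, max_tokens_for_doc, doc_stride):
    """Split a document of `num_tokens` tokens into overlapping spans."""
    spans, start = [], 0
    while start < num_tokens:
        length = min(num_tokens - start, max_tokens_for_doc)
        spans.append(DocSpan(start=start, length=length))
        if start + length == num_tokens:
            break
        start += min(length, doc_stride)
    return spans


def best_span_index(spans, position):
    """Return the index of the span that gives `position` the most context."""
    best_score, best_index = None, None
    for index, span in enumerate(spans):
        end = span.start + span.length - 1
        if position < span.start or position > end:
            continue
        # Score by the smaller side of the context, with a small bonus for longer spans.
        score = min(position - span.start, end - position) + 0.01 * span.length
        if best_score is None or score > best_score:
            best_score, best_index = score, index
    return best_index


if __name__ == "__main__":
    doc = "the man went to the store and bought a gallon of milk".split()
    spans = make_doc_spans(len(doc), max_tokens_for_doc=5, doc_stride=3)
    print(spans)
    print(best_span_index(spans, doc.index("bought")))
```

For the word "bought" (index 7), span B gives 4 tokens of left context and 0 of right, while span C gives 1 and 3, so span C (index 2) is selected.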
+class DataProcessor(object):
+    def __init__(self, vocab_path, do_lower_case, max_seq_length, in_tokens,
+                 doc_stride, max_query_length):
+        self._tokenizer = tokenization.FullTokenizer(
+            vocab_file=vocab_path, do_lower_case=do_lower_case)
+        self._max_seq_length = max_seq_length
+        self._doc_stride = doc_stride
+        self._max_query_length = max_query_length
+        self._in_tokens = in_tokens
+        self.vocab = self._tokenizer.vocab
+        self.vocab_size = len(self.vocab)
+        self.pad_id = self.vocab["[PAD]"]
+        self.cls_id = self.vocab["[CLS]"]
+        self.sep_id = self.vocab["[SEP]"]
+        self.mask_id = self.vocab["[MASK]"]
+        self.current_train_example = -1
+        self.num_train_examples = -1
+        self.current_train_epoch = -1
+        self.train_examples = None
+        self.predict_examples = None
+        self.num_examples = {'train': -1, 'predict': -1}
+    def get_train_progress(self):
+        """Gets progress for training phase."""
+        return self.current_train_example, self.current_train_epoch
+    def get_examples(self,
+                     data_path,
+                     is_training,
+                     with_negative=False):
+        examples = read_mrqa_examples(
+            input_file=data_path,
+            is_training=is_training,
+            with_negative=with_negative)
+        return examples
+    def get_num_examples(self, phase):
+        if phase not in ['train', 'predict']:
+            raise ValueError(
+                "Unknown phase, which should be in ['train', 'predict'].")
+        return self.num_examples[phase]
+    def estimate_runtime_examples(self, data_path, sample_rate=0.01, \
+                                 remove_impossible_questions=True, filter_invalid_spans=True):
+        """Note that this API only supports the training phase."""
+        return estimate_runtime_examples(data_path, sample_rate, self._tokenizer, \
+                                  self._max_seq_length, self._doc_stride, self._max_query_length, \
+                                  remove_impossible_questions=remove_impossible_questions, \
+                                  filter_invalid_spans=filter_invalid_spans)
+    def get_features(self, examples, is_training):
+        features = convert_examples_to_features(
+            examples=examples,
+            tokenizer=self._tokenizer,
+            max_seq_length=self._max_seq_length,
+            doc_stride=self._doc_stride,
+            max_query_length=self._max_query_length,
+            is_training=is_training)
+        return features
+    def data_generator(self,
+                       data_path,
+                       batch_size,
+                       max_len=None,
+                       phase='train',
+                       shuffle=False,
+                       dev_count=1,
+                       with_negative=False,
+                       epoch=1):
+        if phase == 'train':
+            self.train_examples = self.get_examples(
+                data_path,
+                is_training=True,
+                with_negative=with_negative)
+            examples = self.train_examples
+            self.num_examples['train'] = len(self.train_examples)
+        elif phase == 'predict':
+            self.predict_examples = self.get_examples(
+                data_path,
+                is_training=False,
+                with_negative=with_negative)
+            examples = self.predict_examples
+            self.num_examples['predict'] = len(self.predict_examples)
+        else:
+            raise ValueError(
+                "Unknown phase, which should be in ['train', 'predict'].")
+        def batch_reader(features, batch_size, in_tokens):
+            batch, total_token_num, max_len = [], 0, 0
+            for (index, feature) in enumerate(features):
+                if phase == 'train':
+                    self.current_train_example = index + 1
+                seq_len = len(feature.input_ids)
+                labels = [feature.unique_id
+                          ] if feature.start_position is None else [
+                              feature.start_position, feature.end_position
+                          ]
+                # list(...) keeps position ids a real list under Python 3.
+                example = [
+                    feature.input_ids, feature.segment_ids, list(range(seq_len))
+                ] + labels
+                max_len = max(max_len, seq_len)
+                #max_len = max(max_len, len(token_ids))
+                if in_tokens:
+                    to_append = (len(batch) + 1) * max_len <= batch_size
+                else:
+                    to_append = len(batch) < batch_size
+                if to_append:
+                    batch.append(example)
+                    total_token_num += seq_len
+                else:
+                    yield batch, total_token_num
+                    batch, total_token_num, max_len = [example
+                                                       ], seq_len, seq_len
+            if len(batch) > 0:
+                yield batch, total_token_num
+        def wrapper():
+            for epoch_index in range(epoch):
+                if shuffle:
+                    random.shuffle(examples)
+                if phase == 'train':
+                    self.current_train_epoch = epoch_index
+                    features = self.get_features(examples, is_training=True)
+                else:
+                    features = self.get_features(examples, is_training=False)
+                all_dev_batches = []
+                for batch_data, total_token_num in batch_reader(
+                        features, batch_size, self._in_tokens):
+                    batch_data = prepare_batch_data(
+                        batch_data,
+                        total_token_num,
+                        max_len=max_len,
+                        voc_size=-1,
+                        pad_id=self.pad_id,
+                        cls_id=self.cls_id,
+                        sep_id=self.sep_id,
+                        mask_id=-1,
+                        return_input_mask=True,
+                        return_max_len=False,
+                        return_num_token=False)
+                    if len(all_dev_batches) < dev_count:
+                        all_dev_batches.append(batch_data)
+                    if len(all_dev_batches) == dev_count:
+                        for batch in all_dev_batches:
+                            yield batch
+                        all_dev_batches = []
+                if phase == 'predict' and len(all_dev_batches) > 0:
+                    fake_batch = all_dev_batches[-1]
+                    fake_batch = fake_batch[:-1] + [np.array([-1]*len(fake_batch[0]))]
+                    all_dev_batches = all_dev_batches + [fake_batch] * (dev_count - len(all_dev_batches))
+                    for batch in all_dev_batches:
+                        yield batch
+        return wrapper
+def write_predictions(all_examples, all_features, all_results, n_best_size,
+                      max_answer_length, do_lower_case, output_prediction_file,
+                      output_nbest_file, output_null_log_odds_file,
+                      with_negative, null_score_diff_threshold,
+                      verbose):
+    """Write final predictions to the json file and log-odds of null if needed."""
+    print("Writing predictions to: %s" % (output_prediction_file))
+    print("Writing nbest to: %s" % (output_nbest_file))
+    example_index_to_features = collections.defaultdict(list)
+    for feature in all_features:
+        example_index_to_features[feature.example_index].append(feature)
+    unique_id_to_result = {}
+    for result in all_results:
+        unique_id_to_result[result.unique_id] = result
+    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name
+        "PrelimPrediction", [
+            "feature_index", "start_index", "end_index", "start_logit",
+            "end_logit"
+        ])
+    all_predictions = collections.OrderedDict()
+    all_nbest_json = collections.OrderedDict()
+    scores_diff_json = collections.OrderedDict()
+    for (example_index, example) in enumerate(all_examples):
+        features = example_index_to_features[example_index]
+        prelim_predictions = []
+        # keep track of the minimum score of null start+end of position 0
+        score_null = 1000000  # large and positive
+        min_null_feature_index = 0  # the paragraph slice with min null score
+        null_start_logit = 0  # the start logit at the slice with min null score
+        null_end_logit = 0  # the end logit at the slice with min null score
+        for (feature_index, feature) in enumerate(features):
+            result = unique_id_to_result[feature.unique_id]
+            start_indexes = _get_best_indexes(result.start_logits, n_best_size)
+            end_indexes = _get_best_indexes(result.end_logits, n_best_size)
+            # if we could have irrelevant answers, get the min score of irrelevant
+            if with_negative:
+                feature_null_score = result.start_logits[0] + result.end_logits[
+                    0]
+                if feature_null_score < score_null:
+                    score_null = feature_null_score
+                    min_null_feature_index = feature_index
+                    null_start_logit = result.start_logits[0]
+                    null_end_logit = result.end_logits[0]
+            for start_index in start_indexes:
+                for end_index in end_indexes:
+                    # We could hypothetically create invalid predictions, e.g., predict
+                    # that the start of the span is in the question. We throw out all
+                    # invalid predictions.
+                    if start_index >= len(feature.tokens):
+                        continue
+                    if end_index >= len(feature.tokens):
+                        continue
+                    if start_index not in feature.token_to_orig_map:
+                        continue
+                    if end_index not in feature.token_to_orig_map:
+                        continue
+                    if not feature.token_is_max_context.get(start_index, False):
+                        continue
+                    if end_index < start_index:
+                        continue
+                    length = end_index - start_index + 1
+                    if length > max_answer_length:
+                        continue
+                    prelim_predictions.append(
+                        _PrelimPrediction(
+                            feature_index=feature_index,
+                            start_index=start_index,
+                            end_index=end_index,
+                            start_logit=result.start_logits[start_index],
+                            end_logit=result.end_logits[end_index]))
+        if with_negative:
+            prelim_predictions.append(
+                _PrelimPrediction(
+                    feature_index=min_null_feature_index,
+                    start_index=0,
+                    end_index=0,
+                    start_logit=null_start_logit,
+                    end_logit=null_end_logit))
+        prelim_predictions = sorted(
+            prelim_predictions,
+            key=lambda x: (x.start_logit + x.end_logit),
+            reverse=True)
+        _NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name
+            "NbestPrediction", ["text", "start_logit", "end_logit"])
+        seen_predictions = {}
+        nbest = []
+        for pred in prelim_predictions:
+            if len(nbest) >= n_best_size:
+                break
+            feature = features[pred.feature_index]
+            if pred.start_index > 0:  # this is a non-null prediction
+                tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1
+                                                              )]
+                orig_doc_start = feature.token_to_orig_map[pred.start_index]
+                orig_doc_end = feature.token_to_orig_map[pred.end_index]
+                orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end +
+                                                                 1)]
+                tok_text = " ".join(tok_tokens)
+                # De-tokenize WordPieces that have been split off.
+                tok_text = tok_text.replace(" ##", "")
+                tok_text = tok_text.replace("##", "")
+                # Clean whitespace
+                tok_text = tok_text.strip()
+                tok_text = " ".join(tok_text.split())
+                orig_text = " ".join(orig_tokens)
+                final_text = get_final_text(tok_text, orig_text, do_lower_case,
+                                            verbose)
+                if final_text in seen_predictions:
+                    continue
+                seen_predictions[final_text] = True
+            else:
+                final_text = ""
+                seen_predictions[final_text] = True
+            nbest.append(
+                _NbestPrediction(
+                    text=final_text,
+                    start_logit=pred.start_logit,
+                    end_logit=pred.end_logit))
+        # if we didn't include the empty option in the n-best, include it
+        if with_negative:
+            if "" not in seen_predictions:
+                nbest.append(
+                    _NbestPrediction(
+                        text="",
+                        start_logit=null_start_logit,
+                        end_logit=null_end_logit))
+        # In very rare edge cases we could have no valid predictions. So we
+        # just create a nonce prediction in this case to avoid failure.
+        if not nbest:
+            nbest.append(
+                _NbestPrediction(
+                    text="empty", start_logit=0.0, end_logit=0.0))
+        assert len(nbest) >= 1
+        total_scores = []
+        best_non_null_entry = None
+        for entry in nbest:
+            total_scores.append(entry.start_logit + entry.end_logit)
+            if not best_non_null_entry:
+                if entry.text:
+                    best_non_null_entry = entry
+        # Warn when every n-best entry is the empty (null) answer.
+        if best_non_null_entry is None:
+            print("Warning: no non-null prediction found for this example")
+        probs = _compute_softmax(total_scores)
+        nbest_json = []
+        for (i, entry) in enumerate(nbest):
+            output = collections.OrderedDict()
+            output["text"] = entry.text
+            output["probability"] = probs[i]
+            output["start_logit"] = entry.start_logit
+            output["end_logit"] = entry.end_logit
+            nbest_json.append(output)
+        assert len(nbest_json) >= 1
+        if not with_negative:
+            all_predictions[example.qas_id] = nbest_json[0]["text"]
+        else:
+            # predict "" iff the null score - the score of best non-null > threshold
+            score_diff = score_null - best_non_null_entry.start_logit - (
+                best_non_null_entry.end_logit)
+            scores_diff_json[example.qas_id] = score_diff
+            if score_diff > null_score_diff_threshold:
+                all_predictions[example.qas_id] = ""
+            else:
+                all_predictions[example.qas_id] = best_non_null_entry.text
+        all_nbest_json[example.qas_id] = nbest_json
+    with open(output_prediction_file, "w") as writer:
+        writer.write(json.dumps(all_predictions, indent=4) + "\n")
+    with open(output_nbest_file, "w") as writer:
+        writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
+    if with_negative:
+        with open(output_null_log_odds_file, "w") as writer:
+            writer.write(json.dumps(scores_diff_json, indent=4) + "\n")
+def get_final_text(pred_text, orig_text, do_lower_case, verbose):
+    """Project the tokenized prediction back to the original text."""
+    # When we created the data, we kept track of the alignment between original
+    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So
+    # now `orig_text` contains the span of our original text corresponding to the
+    # span that we predicted.
+    #
+    # However, `orig_text` may contain extra characters that we don't want in
+    # our prediction.
+    #
+    # For example, let's say:
+    #   pred_text = steve smith
+    #   orig_text = Steve Smith's
+    #
+    # We don't want to return `orig_text` because it contains the extra "'s".
+    #
+    # We don't want to return `pred_text` because it's already been normalized
+    # (the MRQA eval script also does punctuation stripping/lower casing but
+    # our tokenizer does additional normalization like stripping accent
+    # characters).
+    #
+    # What we really want to return is "Steve Smith".
+    #
+    # Therefore, we have to apply a semi-complicated alignment heuristic between
+    # `pred_text` and `orig_text` to get a character-to-character alignment. This
+    # can fail in certain cases, in which case we just return `orig_text`.
+    def _strip_spaces(text):
+        ns_chars = []
+        ns_to_s_map = collections.OrderedDict()
+        for (i, c) in enumerate(text):
+            if c == " ":
+                continue
+            ns_to_s_map[len(ns_chars)] = i
+            ns_chars.append(c)
+        ns_text = "".join(ns_chars)
+        return (ns_text, ns_to_s_map)
+    # We first tokenize `orig_text`, strip whitespace from the result
+    # and `pred_text`, and check if they are the same length. If they are
+    # NOT the same length, the heuristic has failed. If they are the same
+    # length, we assume the characters are one-to-one aligned.
+    tokenizer = tokenization.BasicTokenizer(do_lower_case=do_lower_case)
+    tok_text = " ".join(tokenizer.tokenize(orig_text))
+    start_position = tok_text.find(pred_text)
+    if start_position == -1:
+        if verbose:
+            print("Unable to find text: '%s' in '%s'" % (pred_text, orig_text))
+        return orig_text
+    end_position = start_position + len(pred_text) - 1
+    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)
+    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)
+    if len(orig_ns_text) != len(tok_ns_text):
+        if verbose:
+            print("Length not equal after stripping spaces: '%s' vs '%s'" %
+                  (orig_ns_text, tok_ns_text))
+        return orig_text
+    # We then project the characters in `pred_text` back to `orig_text` using
+    # the character-to-character alignment.
+    tok_s_to_ns_map = {}
+    for (i, tok_index) in six.iteritems(tok_ns_to_s_map):
+        tok_s_to_ns_map[tok_index] = i
+    orig_start_position = None
+    if start_position in tok_s_to_ns_map:
+        ns_start_position = tok_s_to_ns_map[start_position]
+        if ns_start_position in orig_ns_to_s_map:
+            orig_start_position = orig_ns_to_s_map[ns_start_position]
+    if orig_start_position is None:
+        if verbose:
+            print("Couldn't map start position")
+        return orig_text
+    orig_end_position = None
+    if end_position in tok_s_to_ns_map:
+        ns_end_position = tok_s_to_ns_map[end_position]
+        if ns_end_position in orig_ns_to_s_map:
+            orig_end_position = orig_ns_to_s_map[ns_end_position]
+    if orig_end_position is None:
+        if verbose:
+            print("Couldn't map end position")
+        return orig_text
+    output_text = orig_text[orig_start_position:(orig_end_position + 1)]
+    return output_text
+def _get_best_indexes(logits, n_best_size):
+    """Get the indexes of the n-best logits from a list."""
+    index_and_score = sorted(
+        enumerate(logits), key=lambda x: x[1], reverse=True)
+    best_indexes = []
+    for i in range(len(index_and_score)):
+        if i >= n_best_size:
+            break
+        best_indexes.append(index_and_score[i][0])
+    return best_indexes
+def _compute_softmax(scores):
+    """Compute softmax probability over raw logits."""
+    if not scores:
+        return []
+    max_score = None
+    for score in scores:
+        if max_score is None or score > max_score:
+            max_score = score
+    exp_scores = []
+    total_sum = 0.0
+    for score in scores:
+        x = math.exp(score - max_score)
+        exp_scores.append(x)
+        total_sum += x
+    probs = []
+    for score in exp_scores:
+        probs.append(score / total_sum)
+    return probs
+if __name__ == '__main__':
+    train_file = 'data/mrqa-combined.all_dev.raw.json'
+    vocab_file = 'uncased_L-12_H-768_A-12/vocab.txt'
+    do_lower_case = True
+    tokenizer = tokenization.FullTokenizer(
+        vocab_file=vocab_file, do_lower_case=do_lower_case)
+    train_examples = read_mrqa_examples(
+        input_file=train_file, is_training=True)
+    print("begin converting")
+    for (index, feature) in enumerate(
+            convert_examples_to_features(
+                examples=train_examples,
+                tokenizer=tokenizer,
+                max_seq_length=384,
+                doc_stride=128,
+                max_query_length=64,
+                is_training=True,
+                #output_fn=train_writer.process_feature
+            )):
+        if index < 10:
+            print(index, feature.input_ids, feature.input_mask,
+                  feature.segment_ids)
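The post-processing in `write_predictions` above ranks candidate answer spans by the sum of their start and end logits and then normalizes the scores with a softmax. The following is a minimal standalone sketch of that idea, not the code from this patch: the helper names `top_k_indexes`, `softmax` and `best_spans` are hypothetical, the logits are invented toy values, and the checks against `token_to_orig_map` and `token_is_max_context` are left out for brevity.

```python
import math


def top_k_indexes(logits, k):
    """Return the indexes of the k largest logits, highest first."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in ranked[:k]]


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    if not scores:
        return []
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_spans(start_logits, end_logits, n_best_size, max_answer_length):
    """Enumerate valid (start, end) pairs and rank them by summed logits."""
    candidates = []
    for start in top_k_indexes(start_logits, n_best_size):
        for end in top_k_indexes(end_logits, n_best_size):
            # Discard spans that run backwards or are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            candidates.append((start, end, score))
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:n_best_size]


# Toy logits, invented for illustration only.
start_logits = [0.1, 2.0, 0.3, 1.5]
end_logits = [0.2, 0.1, 3.0, 0.4]
spans = best_spans(start_logits, end_logits, n_best_size=2, max_answer_length=3)
probs = softmax([score for _, _, score in spans])
```

Subtracting the maximum score before calling `math.exp` does not change the resulting probabilities but avoids overflow for large logits, which is the same reason `_compute_softmax` tracks `max_score`.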
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/run_distill.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/run_distill.sh
+#!/bin/bash
+export FLAGS_sync_nccl_allreduce=0
+export FLAGS_eager_delete_tensor_gb=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+if  [ ! "$CUDA_VISIBLE_DEVICES" ]
+then
+    export CPU_NUM=1
+    use_cuda=false
+else
+    use_cuda=true
+fi
+# path of pre_train model
+INPUT_PATH="data/input"
+PRETRAIN_MODEL_PATH="data/pretrain_model/squad2_model"
+# path to save checkpoint
+CHECKPOINT_PATH="data/output/output_mrqa"
+mkdir -p $CHECKPOINT_PATH
+python -u train.py --use_cuda ${use_cuda} \
+        --batch_size 8 \
+        --in_tokens false \
+        --init_pretraining_params ${PRETRAIN_MODEL_PATH}/params \
+        --checkpoints $CHECKPOINT_PATH \
+        --vocab_path ${PRETRAIN_MODEL_PATH}/vocab.txt \
+        --do_distill true \
+        --do_train true \
+        --do_predict true \
+        --save_steps 10000 \
+        --warmup_proportion 0.1 \
+        --weight_decay  0.01 \
+        --sample_rate 0.02 \
+        --epoch 2 \
+        --max_seq_len 512 \
+        --bert_config_path ${PRETRAIN_MODEL_PATH}/bert_config.json \
+        --predict_file ${INPUT_PATH}/mrqa_distill_data/mrqa-combined.all_dev.raw.json \
+        --do_lower_case false \
+        --doc_stride 128 \
+        --train_file ${INPUT_PATH}/mrqa_distill_data/mrqa_distill.json \
+        --mlm_path ${INPUT_PATH}/mlm_data \
+        --mix_ratio 2.0 \
+        --learning_rate 3e-5 \
+        --lr_scheduler linear_warmup_decay \
+        --skip_steps 100 
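The flags `--learning_rate 3e-5`, `--warmup_proportion 0.1` and `--lr_scheduler linear_warmup_decay` in the script above describe a learning rate that ramps up linearly and then decays. Below is a minimal sketch of one common form of such a schedule, assuming linear decay to zero after warmup. The function name, its signature and the `total_steps` value are hypothetical; the actual formula lives in `optimizer/optimization.py`, which is not shown in this part of the diff and may differ.

```python
def linear_warmup_decay(step, total_steps, peak_lr=3e-5, warmup_proportion=0.1):
    """Learning rate at `step` (illustrative only, not the patch's code)."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 up to peak_lr during warmup.
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr down to 0 over the remaining steps.
    decay_steps = max(total_steps - warmup_steps, 1)
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


total_steps = 1000  # hypothetical number of training steps
schedule = [linear_warmup_decay(s, total_steps) for s in (0, 50, 100, 550, 1000)]
```

With these assumed values the rate peaks at step 100, the end of warmup, and falls back to zero at the final step.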
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/run_evaluation.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/run_evaluation.sh
+#!/usr/bin/env bash
+# ==============================================================================
+# Copyright 2017 Baidu.com, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# path of dev data
+PATH_dev=./data/input/mrqa_evaluation_dataset
+# path of dev predict
+KD_prediction=./prediction_results/KD_ema_predictions.json
+files=$(ls ./prediction_results/*.log 2> /dev/null | wc -l)
+if [ "$files" != "0" ];
+then
+    rm prediction_results/*.log
+fi
+# evaluation KD model
+echo "evaluate knowledge distillation model........................................."
+for dataset in `ls $PATH_dev/in_domain_dev/*.raw.json`;do
+    echo $dataset >> prediction_results/KD.log
+    python ../multi_task_learning/scripts/evaluate-v1.1.py $dataset $KD_prediction >> prediction_results/KD.log
+done
+for dataset in `ls $PATH_dev/out_of_domain_dev/*.raw.json`;do
+    echo $dataset >> prediction_results/KD.log
+    python ../multi_task_learning/scripts/evaluate-v1.1.py $dataset $KD_prediction >> prediction_results/KD.log
+done
+python ../multi_task_learning/scripts/macro_avg.py prediction_results/KD.log
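The script above scores one prediction file against each dev set and then averages the per-dataset results. As a rough sketch of what a SQuAD v1.1 style token-level F1 and a macro average look like, under simplifying assumptions: answers are only lowercased and split on whitespace (a real evaluation script typically also strips punctuation and articles), and the function names and dataset names below are hypothetical. The log format that `macro_avg.py` actually parses is not shown in this diff, so none is assumed here.

```python
from collections import Counter


def token_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def macro_average(per_dataset_scores):
    """Unweighted mean over datasets, regardless of dataset size."""
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)


# Invented scores for two made-up datasets, for illustration only.
scores = {"dataset_a": 80.0, "dataset_b": 60.0}
average = macro_average(scores)
```

A macro average weights every dataset equally, so a small out-of-domain set influences the final number as much as a large in-domain one.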
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/train.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/train.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import time
+import argparse
+import collections
+import numpy as np
+import multiprocessing
+import paddle
+import paddle.fluid as fluid
+from utils.placeholder import Placeholder
+from utils.init import init_pretraining_params, init_checkpoint
+from utils.configure import ArgumentGroup, print_arguments, JsonConfig
+from model import mlm_net
+from model import mrqa_net
+from optimizer.optimization import optimization
+from model.bert_model import ModelBERT
+from reader.mrqa_reader import DataProcessor, write_predictions
+from reader.mrqa_distill_reader import DataProcessorDistill 
+from reader.mlm_reader import DataReader
+from reader.joint_reader import create_reader
+parser = argparse.ArgumentParser(__doc__)
+model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
+model_g.add_arg("bert_config_path", str, None, "Path to the json file for bert model config.")
+model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
+model_g.add_arg("init_pretraining_params", str, None,
+                "Init pre-training params from which fine-tuning is performed. If the "
+                "arg 'init_checkpoint' has been set, this argument is ignored.")
+model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
+train_g = ArgumentGroup(parser, "training", "training options.")
+train_g.add_arg("epoch", int, 3, "Number of epochs for fine-tuning.")
+train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
+train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
+                "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
+train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
+train_g.add_arg("use_ema", bool, True, "Whether to use ema.")
+train_g.add_arg("ema_decay", float, 0.9999, "Decay rate for exponential moving average.")
+train_g.add_arg("warmup_proportion", float, 0.1,
+                "Proportion of training steps to perform linear learning rate warmup for.")
+train_g.add_arg("save_steps", int, 1000, "The steps interval to save checkpoints.")
+train_g.add_arg("sample_rate", float, 0.02, "Sampling rate used when estimating the number of training examples.")
+train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
+train_g.add_arg("mix_ratio", float, 0.4, "batch mix ratio for masked language model task")
+train_g.add_arg("loss_scaling", float, 1.0,
+                "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
+train_g.add_arg("do_distill", bool, False, "do distillation")
+log_g = ArgumentGroup(parser, "logging", "logging related.")
+log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
+log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
+data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
+data_g.add_arg("train_file", str, None, "json data for training.")
+data_g.add_arg("mlm_path", str, None, "data for masked language model training.")
+data_g.add_arg("predict_file", str, None, "json data for predictions.")
+data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
+data_g.add_arg("with_negative", bool, False,
+               "If true, the examples contain some that do not have an answer.")
+data_g.add_arg("max_seq_len", int, 512, "Number of tokens in the longest sequence.")
+data_g.add_arg("max_query_length", int, 64, "Max query length.")
+data_g.add_arg("max_answer_length", int, 30, "Max answer length.")
+data_g.add_arg("batch_size", int, 12,
+               "Total examples' number in batch for training. see also --in_tokens.")
+data_g.add_arg("in_tokens", bool, False,
+               "If set, the batch size will be the maximum number of tokens in one batch. "
+               "Otherwise, it will be the maximum number of examples in one batch.")
+data_g.add_arg("do_lower_case", bool, True,
+               "Whether to lower case the input text. Should be True for uncased models and False for cased models.")
+data_g.add_arg("doc_stride", int, 128,
+               "When splitting up a long document into chunks, how much stride to take between chunks.")
+data_g.add_arg("n_best_size", int, 20,
+               "The total number of n-best predictions to generate in the nbest_predictions.json output file.")
+data_g.add_arg("null_score_diff_threshold", float, 0.0,
+               "If null_score - best_non_null is greater than the threshold predict null.")
+data_g.add_arg("random_seed", int, 0, "Random seed.")
+run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
+run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
+run_type_g.add_arg("use_fast_executor", bool, False,
+                   "If set, use fast parallel executor (in experiment).")
+run_type_g.add_arg("num_iteration_per_drop_scope", int, 1,
+                   "The iteration interval at which temporary variables are cleaned up.")
+run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
+run_type_g.add_arg("do_predict", bool, True, "Whether to perform prediction.")
+args = parser.parse_args()
+max_seq_len = args.max_seq_len
+if args.do_distill: 
+    input_shape = [
+        ([1, 1], 'int64'),
+        ([-1, max_seq_len, 1], 'int64'), # src_ids
+        ([-1, max_seq_len, 1], 'int64'), # pos_ids
+        ([-1, max_seq_len, 1], 'int64'), # sent_ids
+        ([-1, max_seq_len, 1], 'float32'), # input_mask
+        ([-1, max_seq_len, 1], 'float32'), # start_logits_truth
+        ([-1, max_seq_len, 1], 'float32'), # end_logits_truth
+        ([-1, 1], 'int64'),  # start label
+        ([-1, 1], 'int64'),  # end label
+        ([-1, 1], 'int64'),  # masked label
+        ([-1, 1], 'int64')]  # masked pos
+else: 
+    input_shape = [
+        ([1, 1], 'int64'),
+        ([-1, max_seq_len, 1], 'int64'),
+        ([-1, max_seq_len, 1], 'int64'),
+        ([-1, max_seq_len, 1], 'int64'),
+        ([-1, max_seq_len, 1], 'float32'),
+        ([-1, 1], 'int64'),  # start label
+        ([-1, 1], 'int64'),  # end label
+        ([-1, 1], 'int64'),  # masked label
+        ([-1, 1], 'int64')]  # masked pos
+# yapf: enable.
+RawResult = collections.namedtuple("RawResult",
+                                   ["unique_id", "start_logits", "end_logits"])
+def predict(test_exe, test_program, test_pyreader, fetch_list, processor, prefix=''):
+    if not os.path.exists(args.checkpoints):
+        os.makedirs(args.checkpoints)
+    output_prediction_file = os.path.join(args.checkpoints, prefix + "predictions.json")
+    output_nbest_file = os.path.join(args.checkpoints, prefix + "nbest_predictions.json")
+    output_null_log_odds_file = os.path.join(args.checkpoints, prefix + "null_odds.json")
+    test_pyreader.start()
+    all_results = []
+    time_begin = time.time()
+    while True:
+        try:
+            np_unique_ids, np_start_logits, np_end_logits, np_num_seqs = test_exe.run(
+                fetch_list=fetch_list, program=test_program)
+            for idx in range(np_unique_ids.shape[0]):
+                if np_unique_ids[idx] < 0:
+                    continue
+                if len(all_results) % 1000 == 0:
+                    print("Processing example: %d" % len(all_results))
+                unique_id = int(np_unique_ids[idx])
+                start_logits = [float(x) for x in np_start_logits[idx].flat]
+                end_logits = [float(x) for x in np_end_logits[idx].flat]
+                all_results.append(
+                    RawResult(
+                        unique_id=unique_id,
+                        start_logits=start_logits,
+                        end_logits=end_logits))
+        except fluid.core.EOFException:
+            test_pyreader.reset()
+            break
+    time_end = time.time()
+    features = processor.get_features(
+        processor.predict_examples, is_training=False)
+    write_predictions(processor.predict_examples, features, all_results,
+                      args.n_best_size, args.max_answer_length,
+                      args.do_lower_case, output_prediction_file,
+                      output_nbest_file, output_null_log_odds_file,
+                      args.with_negative,
+                      args.null_score_diff_threshold, args.verbose)
+def train(args):
+    if not (args.do_train or args.do_predict):
+        raise ValueError("For args `do_train` and `do_predict`, at "
+                         "least one of them must be True.")
+    if args.use_cuda:
+        place = fluid.CUDAPlace(0)
+        dev_count = fluid.core.get_cuda_device_count()
+    else:
+        place = fluid.CPUPlace()
+        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
+    exe = fluid.Executor(place)
+    startup_prog = fluid.default_startup_program()
+    if args.random_seed is not None:
+        startup_prog.random_seed = args.random_seed
+    if args.do_train: 
+        if args.do_distill: 
+            train_processor = DataProcessorDistill()
+            mrc_train_generator = train_processor.data_generator(
+                data_file=args.train_file,
+                batch_size=args.batch_size,
+                max_len=args.max_seq_len,
+                in_tokens=False,
+                dev_count=dev_count,
+                epochs=args.epoch,
+                shuffle=True)
+        else: 
+            train_processor = DataProcessor(
+                vocab_path=args.vocab_path,
+                do_lower_case=args.do_lower_case,
+                max_seq_length=args.max_seq_len,
+                in_tokens=args.in_tokens,
+                doc_stride=args.doc_stride,
+                max_query_length=args.max_query_length)
+            mrc_train_generator = train_processor.data_generator(
+                data_path=args.train_file,
+                batch_size=args.batch_size,
+                max_len=args.max_seq_len,
+                phase='train',
+                shuffle=True,
+                dev_count=dev_count,
+                with_negative=args.with_negative,
+                epoch=args.epoch)
+        bert_conf = JsonConfig(args.bert_config_path)
+        data_reader = DataReader(
+            args.mlm_path,
+            vocab_path=args.vocab_path,
+            batch_size=args.batch_size,
+            in_tokens=args.in_tokens,
+            voc_size=bert_conf['vocab_size'],
+            shuffle_files=False,
+            epoch=args.epoch,
+            max_seq_len=args.max_seq_len,
+            is_test=False)
+        mlm_train_generator = data_reader.data_generator()
+        gens = [
+            (mrc_train_generator, 1.0),
+            (mlm_train_generator, args.mix_ratio)
+        ]
+        # create joint pyreader
+        joint_generator, train_pyreader, model_inputs = \
+            create_reader("train_reader", input_shape, True, args.do_distill, 
+                          gens)
+        train_pyreader.decorate_tensor_provider(joint_generator)
+        task_id = model_inputs[0]
+        if args.do_distill: 
+            bert_inputs = model_inputs[1:5]
+            mrc_inputs = model_inputs[1:9]
+            mlm_inputs = model_inputs[9:11]
+        else: 
+            bert_inputs = model_inputs[1:5]
+            mrc_inputs = model_inputs[1:7]
+            mlm_inputs = model_inputs[7:9]
+        # create model
+        train_bert_model = ModelBERT(
+            conf={"bert_conf_file": args.bert_config_path},
+            is_training=True)
+        train_create_bert = train_bert_model.create_model(args, bert_inputs)
+        build_strategy = fluid.BuildStrategy()
+        if args.do_distill: 
+            num_train_examples = train_processor.num_examples
+            print("runtime number of examples:")
+            print(num_train_examples)
+        else: 
+            print("estimating runtime number of examples...")
+            num_train_examples = train_processor.estimate_runtime_examples(
+                args.train_file, sample_rate=args.sample_rate)
+            print("runtime number of examples:")
+            print(num_train_examples)
+        if args.in_tokens:
+            max_train_steps = args.epoch * num_train_examples // (
+                    args.batch_size // args.max_seq_len) // dev_count
+        else:
+            max_train_steps = args.epoch * num_train_examples // (
+                args.batch_size) // dev_count
+        max_train_steps = int(max_train_steps * (1 + args.mix_ratio))
+        warmup_steps = int(max_train_steps * args.warmup_proportion)
+        print("Device count: %d" % dev_count)
+        print("Num train examples: %d" % num_train_examples)
+        print("Max train steps: %d" % max_train_steps)
+        print("Num warmup steps: %d" % warmup_steps)
+        train_program = fluid.default_main_program()
+        with fluid.program_guard(train_program, startup_prog):
+            with fluid.unique_name.guard():
+                train_create_bert()
+                mlm_output_tensors = mlm_net.create_model(
+                    mlm_inputs, base_model=train_bert_model, is_training=True, args=args
+                )
+                mrc_output_tensors = mrqa_net.create_model(
+                    mrc_inputs, base_model=train_bert_model, is_training=True, args=args
+                )
+                task_one_hot = fluid.layers.one_hot(task_id, 2)
+                mrc_loss = mrqa_net.compute_loss(mrc_output_tensors, args)
+                if args.do_distill: 
+                    distill_loss = mrqa_net.compute_distill_loss(mrc_output_tensors, args)
+                    mrc_loss = mrc_loss + distill_loss
+                num_seqs = mrc_output_tensors['num_seqs']
+                mlm_loss = mlm_net.compute_loss(mlm_output_tensors)
+                num_seqs = mlm_output_tensors['num_seqs']
+                all_loss = fluid.layers.concat([mrc_loss, mlm_loss], axis=0)
+                loss = fluid.layers.reduce_sum(task_one_hot * all_loss)
+                scheduled_lr = optimization(
+                    loss=loss,
+                    warmup_steps=warmup_steps,
+                    num_train_steps=max_train_steps,
+                    learning_rate=args.learning_rate,
+                    train_program=train_program,
+                    startup_prog=startup_prog,
+                    weight_decay=args.weight_decay,
+                    scheduler=args.lr_scheduler,
+                    use_fp16=args.use_fp16,
+                    loss_scaling=args.loss_scaling)
+                loss.persistable = True
+                num_seqs.persistable = True
+                ema = fluid.optimizer.ExponentialMovingAverage(args.ema_decay)
+                ema.update()
+        train_compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
+            loss_name=loss.name, build_strategy=build_strategy)
+        if args.verbose:
+            if args.in_tokens:
+                lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
+                    program=train_program,
+                    batch_size=args.batch_size // args.max_seq_len)
+            else:
+                lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
+                    program=train_program, batch_size=args.batch_size)
+            print("Theoretical memory usage in training:  %.3f - %.3f %s" %
+                  (lower_mem, upper_mem, unit))
+    if args.do_predict:
+        predict_processor = DataProcessor(
+            vocab_path=args.vocab_path,
+            do_lower_case=args.do_lower_case,
+            max_seq_length=args.max_seq_len,
+            in_tokens=args.in_tokens,
+            doc_stride=args.doc_stride,
+            max_query_length=args.max_query_length)
+        mrc_test_generator = predict_processor.data_generator(
+            data_path=args.predict_file,
+            batch_size=args.batch_size,
+            max_len=args.max_seq_len,
+            phase='predict',
+            shuffle=False,
+            dev_count=dev_count,
+            epoch=1)
+        test_input_shape = [
+            ([-1, max_seq_len, 1], 'int64'),
+            ([-1, max_seq_len, 1], 'int64'),
+            ([-1, max_seq_len, 1], 'int64'),
+            ([-1, max_seq_len, 1], 'float32'),
+            ([-1, 1], 'int64')]
+        build_strategy = fluid.BuildStrategy()
+        test_prog = fluid.Program()
+        with fluid.program_guard(test_prog, startup_prog):
+            with fluid.unique_name.guard():
+                placeholder = Placeholder(test_input_shape)
+                test_pyreader, model_inputs = placeholder.build(
+                    capacity=100, reader_name="test_reader")
+                test_pyreader.decorate_tensor_provider(mrc_test_generator)
+                # create model
+                bert_inputs = model_inputs[0:4]
+                mrc_inputs = model_inputs
+                test_bert_model = ModelBERT(
+                    conf={"bert_conf_file": args.bert_config_path},
+                    is_training=False)
+                test_create_bert = test_bert_model.create_model(args, bert_inputs)
+                test_create_bert()
+                mrc_output_tensors = mrqa_net.create_model(
+                    mrc_inputs, base_model=test_bert_model, is_training=False, args=args
+                )
+                unique_ids = mrc_output_tensors['unique_id']
+                start_logits = mrc_output_tensors['start_logits']
+                end_logits = mrc_output_tensors['end_logits']
+                num_seqs = mrc_output_tensors['num_seqs']
+                if 'ema' not in dir():
+                    ema = fluid.optimizer.ExponentialMovingAverage(args.ema_decay)
+                unique_ids.persistable = True
+                start_logits.persistable = True
+                end_logits.persistable = True
+                num_seqs.persistable = True
+        test_prog = test_prog.clone(for_test=True)
+        test_compiled_program = fluid.CompiledProgram(test_prog).with_data_parallel(
+            build_strategy=build_strategy)
+    exe.run(startup_prog)
+    if args.do_train:
+        if args.init_checkpoint and args.init_pretraining_params:
+            print(
+                "WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
+                "both are set! Only arg 'init_checkpoint' is made valid.")
+        if args.init_checkpoint:
+            init_checkpoint(
+                exe,
+                args.init_checkpoint,
+                main_program=startup_prog,
+                use_fp16=args.use_fp16)
+        elif args.init_pretraining_params:
+            init_pretraining_params(
+                exe,
+                args.init_pretraining_params,
+                main_program=startup_prog,
+                use_fp16=args.use_fp16)
+    elif args.do_predict:
+        if not args.init_checkpoint:
+            raise ValueError("args 'init_checkpoint' should be set if "
+                             "only doing prediction!")
+        init_checkpoint(
+            exe,
+            args.init_checkpoint,
+            main_program=startup_prog,
+            use_fp16=args.use_fp16)
+    if args.do_train:
+        train_pyreader.start()
+        steps = 0
+        total_cost, total_num_seqs = [], []
+        time_begin = time.time()
+        while True:
+            try:
+                steps += 1
+                if steps % args.skip_steps == 0:
+                    if warmup_steps <= 0:
+                        fetch_list = [loss.name, num_seqs.name]
+                    else:
+                        fetch_list = [
+                            loss.name, scheduled_lr.name, num_seqs.name
+                        ]
+                else:
+                    fetch_list = []
+                outputs = exe.run(train_compiled_program, fetch_list=fetch_list)
+                if steps % args.skip_steps == 0:
+                    if warmup_steps <= 0:
+                        np_loss, np_num_seqs = outputs
+                    else:
+                        np_loss, np_lr, np_num_seqs = outputs
+                    total_cost.extend(np_loss * np_num_seqs)
+                    total_num_seqs.extend(np_num_seqs)
+                    if args.verbose:
+                        verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size(
+                        )
+                        verbose += "learning rate: %f" % (
+                            np_lr[0]
+                            if warmup_steps > 0 else args.learning_rate)
+                        print(verbose)
+                    time_end = time.time()
+                    used_time = time_end - time_begin
+                    print("progress: %d/%d, step: %d, loss: %f" % (steps, max_train_steps, steps, np.sum(total_cost) / np.sum(total_num_seqs)))
+                    total_cost, total_num_seqs = [], []
+                    time_begin = time.time()
+                if steps % args.save_steps == 0:
+                    save_path = os.path.join(args.checkpoints,
+                                             "step_" + str(steps))
+                    fluid.io.save_persistables(exe, save_path, train_program)
+                if steps == max_train_steps:
+                    save_path = os.path.join(args.checkpoints,
+                                             "step_" + str(steps) + "_final")
+                    fluid.io.save_persistables(exe, save_path, train_program)
+                    break
+            except paddle.fluid.core.EOFException as err:
+                save_path = os.path.join(args.checkpoints,
+                                         "step_" + str(steps) + "_final")
+                fluid.io.save_persistables(exe, save_path, train_program)
+                train_pyreader.reset()
+                break
+    if args.do_predict:
+        if args.use_ema:
+            with ema.apply(exe):
+                predict(exe, test_compiled_program, test_pyreader, [
+                    unique_ids.name, start_logits.name, end_logits.name, num_seqs.name
+                ], predict_processor, prefix='ema_')
+        else:
+            predict(exe, test_compiled_program, test_pyreader, [
+                unique_ids.name, start_logits.name, end_logits.name, num_seqs.name
+            ], predict_processor)
+if __name__ == '__main__':
+    print_arguments(args)
+    train(args)
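
The step-count arithmetic in train() above can be checked in isolation. The following is a minimal sketch, not part of the patch: `estimate_train_steps` is a hypothetical helper name, and the numbers in the usage lines are made up for illustration. It mirrors the integer divisions used for `max_train_steps` and `warmup_steps`, including the `(1 + mix_ratio)` factor that accounts for the auxiliary MLM batches.

```python
def estimate_train_steps(num_examples, epoch, batch_size, dev_count,
                         mix_ratio, warmup_proportion,
                         in_tokens=False, max_seq_len=512):
    """Hypothetical helper mirroring the step arithmetic in train()."""
    # With in_tokens, batch_size counts tokens, so convert it to sequences.
    seqs_per_batch = batch_size // max_seq_len if in_tokens else batch_size
    steps = epoch * num_examples // seqs_per_batch // dev_count
    # Scale by (1 + mix_ratio) to leave room for the mixed-in MLM batches.
    steps = int(steps * (1 + mix_ratio))
    warmup = int(steps * warmup_proportion)
    return steps, warmup


# Illustrative values only; they are not taken from the repository.
steps, warmup = estimate_train_steps(
    num_examples=10000, epoch=2, batch_size=32, dev_count=4,
    mix_ratio=0.5, warmup_proportion=0.1)
print(steps, warmup)
```

Because every division is an integer division, small datasets combined with large `dev_count` values can round the step count down noticeably, which is worth keeping in mind when the final checkpoint is expected at `max_train_steps`.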
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/batching.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/batching.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Mask, padding and batching."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import numpy as np
+def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
+    """
+    Add mask for batch_tokens, return out, mask_label, mask_pos;
+    Note: mask_pos refers to positions in batch_tokens after padding;
+    """
+    max_len = max([len(sent) for sent in batch_tokens])
+    mask_label = []
+    mask_pos = []
+    prob_mask = np.random.rand(total_token_num)
+    # Note: the first token is [CLS], so [low=1]
+    replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)
+    pre_sent_len = 0
+    prob_index = 0
+    for sent_index, sent in enumerate(batch_tokens):
+        mask_flag = False
+        prob_index += pre_sent_len
+        for token_index, token in enumerate(sent):
+            prob = prob_mask[prob_index + token_index]
+            if prob > 0.15:
+                continue
+            elif 0.03 < prob <= 0.15:
+                # mask
+                if token != SEP and token != CLS:
+                    mask_label.append(sent[token_index])
+                    sent[token_index] = MASK
+                    mask_flag = True
+                    mask_pos.append(sent_index * max_len + token_index)
+            elif 0.015 < prob <= 0.03:
+                # random replace
+                if token != SEP and token != CLS:
+                    mask_label.append(sent[token_index])
+                    sent[token_index] = replace_ids[prob_index + token_index]
+                    mask_flag = True
+                    mask_pos.append(sent_index * max_len + token_index)
+            else:
+                # keep the original token
+                if token != SEP and token != CLS:
+                    mask_label.append(sent[token_index])
+                    mask_pos.append(sent_index * max_len + token_index)
+        pre_sent_len = len(sent)
+        # ensure at least one token is masked in each sentence
+        while not mask_flag:
+            token_index = int(np.random.randint(1, high=len(sent) - 1))
+            if sent[token_index] != SEP and sent[token_index] != CLS:
+                mask_label.append(sent[token_index])
+                sent[token_index] = MASK
+                mask_flag = True
+                mask_pos.append(sent_index * max_len + token_index)
+    mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
+    mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
+    return batch_tokens, mask_label, mask_pos
+def prepare_batch_data(insts,
+                       total_token_num,
+                       max_len=None,
+                       voc_size=0,
+                       pad_id=None,
+                       cls_id=None,
+                       sep_id=None,
+                       mask_id=None,
+                       return_input_mask=True,
+                       return_max_len=True,
+                       return_num_token=False):
+    """
+    1. generate Tensor of data
+    2. generate Tensor of position
+    3. generate self attention mask, [shape: batch_size *  max_len * max_len]
+    """
+    batch_src_ids = [inst[0] for inst in insts]
+    batch_sent_ids = [inst[1] for inst in insts]
+    batch_pos_ids = [inst[2] for inst in insts]
+    labels_list = []
+    # compatible with mrqa, whose example includes start/end positions, 
+    # or unique id
+    for i in range(3, len(insts[0]), 1):
+        labels = [inst[i] for inst in insts]
+        labels = np.array(labels).astype("int64").reshape([-1, 1])
+        labels_list.append(labels)
+    # First step: do mask without padding
+    if mask_id is not None and mask_id >= 0:
+        out, mask_label, mask_pos = mask(
+            batch_src_ids,
+            total_token_num,
+            vocab_size=voc_size,
+            CLS=cls_id,
+            SEP=sep_id,
+            MASK=mask_id)
+    else:
+        out = batch_src_ids
+    # Second step: padding
+    src_id, self_input_mask = pad_batch_data(
+        out, 
+        max_len=max_len,
+        pad_idx=pad_id, return_input_mask=True)
+    pos_id = pad_batch_data(
+        batch_pos_ids,
+        max_len=max_len,
+        pad_idx=pad_id,
+        return_pos=False,
+        return_input_mask=False)
+    sent_id = pad_batch_data(
+        batch_sent_ids,
+        max_len=max_len,
+        pad_idx=pad_id,
+        return_pos=False,
+        return_input_mask=False)
+    if mask_id is not None and mask_id >= 0:
+        return_list = [
+            src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
+        ] + labels_list
+    else:
+        return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list
+    return return_list if len(return_list) > 1 else return_list[0]
+def pad_batch_data(insts,
+                   max_len=None,
+                   pad_idx=0,
+                   return_pos=False,
+                   return_input_mask=False,
+                   return_max_len=False,
+                   return_num_token=False):
+    """
+    Pad the instances to the max sequence length in batch, and generate the
+    corresponding position data and input mask.
+    """
+    return_list = []
+    if max_len is None:
+        max_len = max(len(inst) for inst in insts)
+    # Any token included in dict can be used to pad, since the paddings' loss
+    # will be masked out by weights and make no effect on parameter gradients.
+    inst_data = np.array([
+        list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts
+    ])
+    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
+    # position data
+    if return_pos:
+        inst_pos = np.array([
+            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
+            for inst in insts
+        ])
+        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
+    if return_input_mask:
+        # This is used to avoid attention on paddings.
+        input_mask_data = np.array([[1] * len(inst) + [0] *
+                                    (max_len - len(inst)) for inst in insts])
+        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
+        return_list += [input_mask_data.astype("float32")]
+    if return_max_len:
+        return_list += [max_len]
+    if return_num_token:
+        num_token = 0
+        for inst in insts:
+            num_token += len(inst)
+        return_list += [num_token]
+    return return_list if len(return_list) > 1 else return_list[0]
+if __name__ == "__main__":
+    pass
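
The probability thresholds in mask() above split each uniform draw into four outcomes. As a rough sketch (the helper name `masking_action` and the fixed seed are assumptions made for this illustration, not part of batching.py), the snippet below classifies draws with the same thresholds and measures the resulting shares among selected tokens. It ignores the CLS/SEP exclusion that mask() also applies.

```python
import numpy as np


def masking_action(prob):
    """Classify one uniform draw with the thresholds used in mask()."""
    if prob > 0.15:
        return "untouched"  # token is not a prediction target
    elif prob > 0.03:
        return "mask"       # token would be replaced by the MASK id
    elif prob > 0.015:
        return "random"     # token would be replaced by a random id
    return "keep"           # token is kept but still predicted


rng = np.random.RandomState(0)  # fixed seed, chosen only for this demo
draws = rng.rand(200000)
actions = [masking_action(p) for p in draws]
selected = [a for a in actions if a != "untouched"]
share = {name: selected.count(name) / float(len(selected))
         for name in ("mask", "random", "keep")}
print(share)
```

Under these thresholds about 15% of tokens become prediction targets, and of those roughly 80% are masked, 10% are randomly replaced and 10% are left unchanged, which matches the usual BERT masking recipe.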
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/configure.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/configure.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import sys
+import argparse
+import six
+import logging
+import json
+logging_only_message = "%(message)s"
+logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s"
+class JsonConfig(object):
+    def __init__(self, config_path):
+        self._config_dict = self._parse(config_path)
+    def _parse(self, config_path):
+        try:
+            with open(config_path) as json_file:
+                config_dict = json.load(json_file)
+        except Exception:
+            raise IOError("Error in parsing bert model config file '%s'" %
+                config_path)
+        else:
+            return config_dict
+    def __getitem__(self, key):
+        return self._config_dict[key]
+    def print_config(self):
+        for arg, value in sorted(six.iteritems(self._config_dict)):
+            print('%s: %s' % (arg, value))
+        print('------------------------------------------------')
+class ArgumentGroup(object):
+    def __init__(self, parser, title, des):
+        self._group = parser.add_argument_group(title=title, description=des)
+    def add_arg(self, name, type, default, help, **kwargs):
+        type = str2bool if type == bool else type
+        self._group.add_argument(
+            "--" + name,
+            default=default,
+            type=type,
+            help=help + ' Default: %(default)s.',
+            **kwargs)
+class ArgConfig(object):
+    def __init__(self):
+        parser = argparse.ArgumentParser()
+        train_g = ArgumentGroup(parser, "training", "training options.")
+        train_g.add_arg("epoch",             int,    3,      "Number of epochs for fine-tuning.")
+        train_g.add_arg("learning_rate",     float,  5e-5,   "Learning rate used to train with warmup.")
+        train_g.add_arg("lr_scheduler",      str,    "linear_warmup_decay",
+                        "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
+        train_g.add_arg("weight_decay",      float,  0.01,   "Weight decay rate for L2 regularizer.")
+        train_g.add_arg("warmup_proportion", float,  0.1,
+                        "Proportion of training steps to perform linear learning rate warmup for.")
+        train_g.add_arg("save_steps",        int,    1000,   "The steps interval to save checkpoints.")
+        train_g.add_arg("use_fp16",          bool,   False,  "Whether to use fp16 mixed precision training.")
+        train_g.add_arg("loss_scaling",      float,  1.0,
+                        "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
+        train_g.add_arg("pred_dir",   str,    None,   "Path to save the prediction results")
+        log_g = ArgumentGroup(parser, "logging", "logging related.")
+        log_g.add_arg("skip_steps",          int,    10,    "The steps interval to print loss.")
+        log_g.add_arg("verbose",             bool,   False, "Whether to output verbose log.")
+        run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
+        run_type_g.add_arg("use_cuda",                     bool,   True,  "If set, use GPU for training.")
+        run_type_g.add_arg("use_fast_executor",            bool,   False, "If set, use fast parallel executor (in experiment).")
+        run_type_g.add_arg("num_iteration_per_drop_scope", int,    1,     "The iteration interval to clean up temporary variables.")
+        run_type_g.add_arg("do_train",                     bool,   True,  "Whether to perform training.")
+        run_type_g.add_arg("do_predict",                   bool,   True,  "Whether to perform prediction.")
+        custom_g = ArgumentGroup(parser, "customize", "customized options.")
+        self.custom_g = custom_g
+        self.parser = parser
+    def add_arg(self, name, dtype, default, descrip):
+        self.custom_g.add_arg(name, dtype, default, descrip)
+    def build_conf(self):
+        return self.parser.parse_args()
+def str2bool(v):
+    # argparse cannot parse strings such as "true" or "False" into Python
+    # booleans directly, so convert them explicitly
+    return v.lower() in ("true", "t", "1")
+def print_arguments(args, log = None):
+    if not log:
+        print('-----------  Configuration Arguments -----------')
+        for arg, value in sorted(six.iteritems(vars(args))):
+            print('%s: %s' % (arg, value))
+        print('------------------------------------------------')
+    else:
+        log.info('-----------  Configuration Arguments -----------')
+        for arg, value in sorted(six.iteritems(vars(args))):
+            log.info('%s: %s' % (arg, value))
+        log.info('------------------------------------------------')
+if __name__ == "__main__":
+    args = ArgConfig()
+    args = args.build_conf()
+    # using print()
+    print_arguments(args)
+    logging.basicConfig(
+        level=logging.INFO,
+        format=logging_details,
+        datefmt='%Y-%m-%d %H:%M:%S')
+    # using logging
+    print_arguments(args, logging)
+    json_conf = JsonConfig("../../data/pretrained_models/uncased_L-12_H-768_A-12/bert_config.json")
+    json_conf.print_config()
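
The reason ArgumentGroup.add_arg swaps `bool` for `str2bool` can be seen with plain argparse. The sketch below is illustrative only: the flag name `--naive_flag` is hypothetical and exists just to contrast the two behaviours, while `str2bool` is copied from configure.py.

```python
import argparse


def str2bool(v):
    # Same conversion as in configure.py.
    return v.lower() in ("true", "t", "1")


parser = argparse.ArgumentParser()
parser.add_argument("--use_cuda", type=str2bool, default=True)
# Hypothetical flag showing the pitfall: bool("false") is True because
# any non-empty string is truthy.
parser.add_argument("--naive_flag", type=bool, default=True)

args = parser.parse_args(["--use_cuda", "false", "--naive_flag", "false"])
print(args.use_cuda, args.naive_flag)
```

Note that `str2bool` treats every string outside `"true"`, `"t"` and `"1"` as False, so a typo such as `--use_cuda ture` silently disables the option rather than raising an error.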
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/fp16.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/fp16.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import print_function
+import paddle
+import paddle.fluid as fluid
+def cast_fp16_to_fp32(i, o, prog):
+    prog.global_block().append_op(
+        type="cast",
+        inputs={"X": i},
+        outputs={"Out": o},
+        attrs={
+            "in_dtype": fluid.core.VarDesc.VarType.FP16,
+            "out_dtype": fluid.core.VarDesc.VarType.FP32
+        })
+def cast_fp32_to_fp16(i, o, prog):
+    prog.global_block().append_op(
+        type="cast",
+        inputs={"X": i},
+        outputs={"Out": o},
+        attrs={
+            "in_dtype": fluid.core.VarDesc.VarType.FP32,
+            "out_dtype": fluid.core.VarDesc.VarType.FP16
+        })
+def copy_to_master_param(p, block):
+    v = block.vars.get(p.name, None)
+    if v is None:
+        raise ValueError("no param name %s found!" % p.name)
+    new_p = fluid.framework.Parameter(
+        block=block,
+        shape=v.shape,
+        dtype=fluid.core.VarDesc.VarType.FP32,
+        type=v.type,
+        lod_level=v.lod_level,
+        stop_gradient=p.stop_gradient,
+        trainable=p.trainable,
+        optimize_attr=p.optimize_attr,
+        regularizer=p.regularizer,
+        gradient_clip_attr=p.gradient_clip_attr,
+        error_clip=p.error_clip,
+        name=v.name + ".master")
+    return new_p
+def create_master_params_grads(params_grads, main_prog, startup_prog,
+                               loss_scaling):
+    master_params_grads = []
+    tmp_role = main_prog._current_role
+    OpRole = fluid.core.op_proto_and_checker_maker.OpRole
+    main_prog._current_role = OpRole.Backward
+    for p, g in params_grads:
+        # create master parameters
+        master_param = copy_to_master_param(p, main_prog.global_block())
+        startup_master_param = startup_prog.global_block()._clone_variable(
+            master_param)
+        startup_p = startup_prog.global_block().var(p.name)
+        cast_fp16_to_fp32(startup_p, startup_master_param, startup_prog)
+        # cast fp16 gradients to fp32 before apply gradients
+        if g.name.find("layer_norm") > -1:
+            if loss_scaling > 1:
+                scaled_g = g / float(loss_scaling)
+            else:
+                scaled_g = g
+            master_params_grads.append([p, scaled_g])
+            continue
+        master_grad = fluid.layers.cast(g, "float32")
+        if loss_scaling > 1:
+            master_grad = master_grad / float(loss_scaling)
+        master_params_grads.append([master_param, master_grad])
+    main_prog._current_role = tmp_role
+    return master_params_grads
+def master_param_to_train_param(master_params_grads, params_grads, main_prog):
+    for idx, m_p_g in enumerate(master_params_grads):
+        train_p, _ = params_grads[idx]
+        if train_p.name.find("layer_norm") > -1:
+            continue
+        with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
+            cast_fp32_to_fp16(m_p_g[0], train_p, main_prog)
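
The motivation for `loss_scaling` and the float32 master gradients in fp16.py can be illustrated with NumPy alone. This is a sketch under stated assumptions: the gradient value and the scaling factor of 1024 are made up for the demonstration, and the code does not use the Paddle API.

```python
import numpy as np

loss_scaling = np.float32(1024.0)  # illustrative value
true_grad = np.float32(1e-8)       # a gradient too small for float16

# Without scaling, the gradient underflows to zero in half precision.
naive_fp16 = np.float16(true_grad)

# With scaling, the backward pass produces a representable float16 value.
scaled_fp16 = np.float16(true_grad * loss_scaling)

# As in create_master_params_grads: cast to float32, then undo the scaling.
master_grad = scaled_fp16.astype(np.float32) / loss_scaling

print(naive_fp16, scaled_fp16, master_grad)
```

The recovered `master_grad` is close to the original value but not identical, since the scaled gradient is still rounded to float16 before the cast.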
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/init.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/init.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import print_function
+import os
+import six
+import ast
+import copy
+import numpy as np
+import paddle.fluid as fluid
+def cast_fp32_to_fp16(exe, main_program):
+    print("Cast parameters to float16 data format.")
+    for param in main_program.global_block().all_parameters():
+        if not param.name.endswith(".master"):
+            param_t = fluid.global_scope().find_var(param.name).get_tensor()
+            data = np.array(param_t)
+            if param.name.find("layer_norm") == -1:
+                param_t.set(np.float16(data).view(np.uint16), exe.place)
+            master_param_var = fluid.global_scope().find_var(param.name +
+                                                             ".master")
+            if master_param_var is not None:
+                master_param_var.get_tensor().set(data, exe.place)
+def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False, skip_list=None):
+    # Avoid a mutable default argument; None means "skip nothing".
+    skip_list = skip_list or []
+    assert os.path.exists(
+        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
+    def existed_persitables(var):
+        if not fluid.io.is_persistable(var):
+            return False
+        if var.name in skip_list:
+            return False
+        return os.path.exists(os.path.join(init_checkpoint_path, var.name))
+    fluid.io.load_vars(
+        exe,
+        init_checkpoint_path,
+        main_program=main_program,
+        predicate=existed_persitables)
+    print("Load model from {}".format(init_checkpoint_path))
+    if use_fp16:
+        cast_fp32_to_fp16(exe, main_program)
+def init_pretraining_params(exe,
+                            pretraining_params_path,
+                            main_program,
+                            use_fp16=False):
+    assert os.path.exists(pretraining_params_path
+                          ), "[%s] can't be found." % pretraining_params_path
+    def existed_params(var):
+        if not isinstance(var, fluid.framework.Parameter):
+            return False
+        return os.path.exists(os.path.join(pretraining_params_path, var.name))
+    fluid.io.load_vars(
+        exe,
+        pretraining_params_path,
+        main_program=main_program,
+        predicate=existed_params)
+    print("Load pretraining parameters from {}.".format(
+        pretraining_params_path))
+    if use_fp16:
+        cast_fp32_to_fp16(exe, main_program)
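Note on `cast_fp32_to_fp16` in `utils/init.py`: the half-precision weights are written with `np.float16(data).view(np.uint16)`, that is, each value is rounded to float16 and its 16 raw bits are then reinterpreted as an unsigned integer. Presumably the tensor `set` call in this Paddle version expects float16 data in that form (an assumption; the diff does not say). A minimal sketch of the same reinterpretation using only the standard library `struct` module follows; `float_to_half_bits` and `half_bits_to_float` are hypothetical helper names, not part of the repository.

```python
import struct


def float_to_half_bits(value):
    """Round `value` to IEEE 754 half precision and return its 16 raw bits."""
    # "<e" packs a little-endian float16; "<H" reads the same two bytes
    # back as an unsigned 16-bit integer, without changing any bit.
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_to_float(bits):
    """Inverse of float_to_half_bits: interpret 16 raw bits as a float16."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


if __name__ == "__main__":
    print(hex(float_to_half_bits(1.0)))                  # 0x3c00
    print(half_bits_to_float(float_to_half_bits(0.1)))   # close to, but not exactly, 0.1
```

The round trip for `0.1` illustrates why the function also copies the unrounded values into the `.master` variables: the float16 copy loses precision that the float32 master copy keeps.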
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/placeholder.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/placeholder.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import six
+import ast
+import copy
+import numpy as np
+import paddle.fluid as fluid
+class Placeholder(object):
+    def __init__(self, input_shapes=None):
+        # A single constructor: Python keeps only the last `__init__` defined
+        # in a class, so the no-argument form uses a default of None.
+        self.shapes = []
+        self.dtypes = []
+        self.lod_levels = []
+        self.names = []
+        for new_holder in (input_shapes or []):
+            shape = new_holder[0]
+            dtype = new_holder[1]
+            lod_level = new_holder[2] if len(new_holder) >= 3 else 0
+            name = new_holder[3] if len(new_holder) >= 4 else ""
+            self.append_placeholder(shape, dtype, lod_level = lod_level, name = name)
+    def append_placeholder(self, shape, dtype, lod_level = 0, name = ""):
+        self.shapes.append(shape)
+        self.dtypes.append(dtype)
+        self.lod_levels.append(lod_level)
+        self.names.append(name)
+    def build(self, capacity, reader_name, use_double_buffer = False):
+        pyreader = fluid.layers.py_reader(
+            capacity = capacity,
+            shapes = self.shapes,
+            dtypes = self.dtypes,
+            lod_levels = self.lod_levels,
+            name = reader_name, 
+            use_double_buffer = use_double_buffer)
+        return [pyreader, fluid.layers.read_file(pyreader)]
+    def __add__(self, new_holder):
+        assert isinstance(new_holder, tuple) or isinstance(new_holder, list) 
+        assert len(new_holder) >= 2
+        shape = new_holder[0]
+        dtype = new_holder[1]
+        lod_level = new_holder[2] if len(new_holder) >= 3 else 0
+        name = new_holder[3] if len(new_holder) >= 4 else ""
+        self.append_placeholder(shape, dtype, lod_level = lod_level, name = name)
+        # Return the instance so that `holder + (shape, dtype)` evaluates to the holder.
+        return self
+if __name__ == "__main__":
+    print("hello world!")
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/tokenization.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/utils/tokenization.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import collections
+import unicodedata
+import six
+def convert_to_unicode(text):
+    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
+    if six.PY3:
+        if isinstance(text, str):
+            return text
+        elif isinstance(text, bytes):
+            return text.decode("utf-8", "ignore")
+        else:
+            raise ValueError("Unsupported string type: %s" % (type(text)))
+    elif six.PY2:
+        if isinstance(text, str):
+            return text.decode("utf-8", "ignore")
+        elif isinstance(text, unicode):
+            return text
+        else:
+            raise ValueError("Unsupported string type: %s" % (type(text)))
+    else:
+        raise ValueError("Not running on Python2 or Python 3?")
+def printable_text(text):
+    """Returns text encoded in a way suitable for print or `tf.logging`."""
+    # These functions want `str` for both Python2 and Python3, but in one case
+    # it's a Unicode string and in the other it's a byte string.
+    if six.PY3:
+        if isinstance(text, str):
+            return text
+        elif isinstance(text, bytes):
+            return text.decode("utf-8", "ignore")
+        else:
+            raise ValueError("Unsupported string type: %s" % (type(text)))
+    elif six.PY2:
+        if isinstance(text, str):
+            return text
+        elif isinstance(text, unicode):
+            return text.encode("utf-8")
+        else:
+            raise ValueError("Unsupported string type: %s" % (type(text)))
+    else:
+        raise ValueError("Not running on Python2 or Python 3?")
+def load_vocab(vocab_file):
+    """Loads a vocabulary file into a dictionary."""
+    vocab = collections.OrderedDict()
+    # Use a context manager so the vocabulary file is always closed.
+    with open(vocab_file) as fin:
+        for num, line in enumerate(fin):
+            items = convert_to_unicode(line.strip()).split("\t")
+            if len(items) > 2:
+                break
+            token = items[0]
+            index = items[1] if len(items) == 2 else num
+            token = token.strip()
+            vocab[token] = int(index)
+    return vocab
+def convert_by_vocab(vocab, items):
+    """Converts a sequence of [tokens|ids] using the vocab."""
+    output = []
+    for item in items:
+        output.append(vocab[item])
+    return output
+def convert_tokens_to_ids(vocab, tokens):
+    return convert_by_vocab(vocab, tokens)
+def convert_ids_to_tokens(inv_vocab, ids):
+    return convert_by_vocab(inv_vocab, ids)
+def whitespace_tokenize(text):
+    """Runs basic whitespace cleaning and splitting on a piece of text."""
+    text = text.strip()
+    if not text:
+        return []
+    tokens = text.split()
+    return tokens
+class FullTokenizer(object):
+    """Runs end-to-end tokenization."""
+    def __init__(self, vocab_file, do_lower_case=True):
+        self.vocab = load_vocab(vocab_file)
+        self.inv_vocab = {v: k for k, v in self.vocab.items()}
+        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+    def tokenize(self, text):
+        split_tokens = []
+        for token in self.basic_tokenizer.tokenize(text):
+            for sub_token in self.wordpiece_tokenizer.tokenize(token):
+                split_tokens.append(sub_token)
+        return split_tokens
+    def convert_tokens_to_ids(self, tokens):
+        return convert_by_vocab(self.vocab, tokens)
+    def convert_ids_to_tokens(self, ids):
+        return convert_by_vocab(self.inv_vocab, ids)
+class CharTokenizer(object):
+    """Runs end-to-end tokenization."""
+    def __init__(self, vocab_file, do_lower_case=True):
+        self.vocab = load_vocab(vocab_file)
+        self.inv_vocab = {v: k for k, v in self.vocab.items()}
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+    def tokenize(self, text):
+        split_tokens = []
+        for token in text.lower().split(" "):
+            for sub_token in self.wordpiece_tokenizer.tokenize(token):
+                split_tokens.append(sub_token)
+        return split_tokens
+    def convert_tokens_to_ids(self, tokens):
+        return convert_by_vocab(self.vocab, tokens)
+    def convert_ids_to_tokens(self, ids):
+        return convert_by_vocab(self.inv_vocab, ids)
+class BasicTokenizer(object):
+    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
+    def __init__(self, do_lower_case=True):
+        """Constructs a BasicTokenizer.
+        Args:
+            do_lower_case: Whether to lower case the input.
+        """
+        self.do_lower_case = do_lower_case
+        self._never_lowercase = ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
+    def tokenize(self, text):
+        """Tokenizes a piece of text."""
+        text = convert_to_unicode(text)
+        text = self._clean_text(text)
+        # This was added on November 1st, 2018 for the multilingual and Chinese
+        # models. This is also applied to the English models now, but it doesn't
+        # matter since the English models were not trained on any Chinese data
+        # and generally don't have any Chinese data in them (there are Chinese
+        # characters in the vocabulary because Wikipedia does have some Chinese
+        # words in the English Wikipedia.).
+        text = self._tokenize_chinese_chars(text)
+        orig_tokens = whitespace_tokenize(text)
+        split_tokens = []
+        for token in orig_tokens:
+            if self.do_lower_case and token not in self._never_lowercase:
+                token = token.lower()
+                token = self._run_strip_accents(token)
+            if token in self._never_lowercase:
+                split_tokens.extend([token])
+            else:
+                split_tokens.extend(self._run_split_on_punc(token))
+        output_tokens = whitespace_tokenize(" ".join(split_tokens))
+        return output_tokens
+    def _run_strip_accents(self, text):
+        """Strips accents from a piece of text."""
+        text = unicodedata.normalize("NFD", text)
+        output = []
+        for char in text:
+            cat = unicodedata.category(char)
+            if cat == "Mn":
+                continue
+            output.append(char)
+        return "".join(output)
+    def _run_split_on_punc(self, text):
+        """Splits punctuation on a piece of text."""
+        chars = list(text)
+        i = 0
+        start_new_word = True
+        output = []
+        while i < len(chars):
+            char = chars[i]
+            if _is_punctuation(char):
+                output.append([char])
+                start_new_word = True
+            else:
+                if start_new_word:
+                    output.append([])
+                start_new_word = False
+                output[-1].append(char)
+            i += 1
+        return ["".join(x) for x in output]
+    def _tokenize_chinese_chars(self, text):
+        """Adds whitespace around any CJK character."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if self._is_chinese_char(cp):
+                output.append(" ")
+                output.append(char)
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+    def _is_chinese_char(self, cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and handled
+        # like all of the other languages.
+        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
+            (cp >= 0x3400 and cp <= 0x4DBF) or  #
+            (cp >= 0x20000 and cp <= 0x2A6DF) or  #
+            (cp >= 0x2A700 and cp <= 0x2B73F) or  #
+            (cp >= 0x2B740 and cp <= 0x2B81F) or  #
+            (cp >= 0x2B820 and cp <= 0x2CEAF) or
+            (cp >= 0xF900 and cp <= 0xFAFF) or  #
+            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
+            return True
+        return False
+    def _clean_text(self, text):
+        """Performs invalid character removal and whitespace cleanup on text."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if cp == 0 or cp == 0xfffd or _is_control(char):
+                continue
+            if _is_whitespace(char):
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+class WordpieceTokenizer(object):
+    """Runs WordPiece tokenization."""
+    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+        self.vocab = vocab
+        self.unk_token = unk_token
+        self.max_input_chars_per_word = max_input_chars_per_word
+    def tokenize(self, text):
+        """Tokenizes a piece of text into its word pieces.
+        This uses a greedy longest-match-first algorithm to perform tokenization
+        using the given vocabulary.
+        For example:
+            input = "unaffable"
+            output = ["un", "##aff", "##able"]
+        Args:
+            text: A single token or whitespace separated tokens. This should have
+                already been passed through `BasicTokenizer`.
+        Returns:
+            A list of wordpiece tokens.
+        """
+        text = convert_to_unicode(text)
+        output_tokens = []
+        for token in whitespace_tokenize(text):
+            chars = list(token)
+            if len(chars) > self.max_input_chars_per_word:
+                output_tokens.append(self.unk_token)
+                continue
+            is_bad = False
+            start = 0
+            sub_tokens = []
+            while start < len(chars):
+                end = len(chars)
+                cur_substr = None
+                while start < end:
+                    substr = "".join(chars[start:end])
+                    if start > 0:
+                        substr = "##" + substr
+                    if substr in self.vocab:
+                        cur_substr = substr
+                        break
+                    end -= 1
+                if cur_substr is None:
+                    is_bad = True
+                    break
+                sub_tokens.append(cur_substr)
+                start = end
+            if is_bad:
+                output_tokens.append(self.unk_token)
+            else:
+                output_tokens.extend(sub_tokens)
+        return output_tokens
+def _is_whitespace(char):
+    """Checks whether `char` is a whitespace character."""
+    # \t, \n, and \r are technically control characters but we treat them
+    # as whitespace since they are generally considered as such.
+    if char == " " or char == "\t" or char == "\n" or char == "\r":
+        return True
+    cat = unicodedata.category(char)
+    if cat == "Zs":
+        return True
+    return False
+def _is_control(char):
+    """Checks whether `char` is a control character."""
+    # These are technically control characters but we count them as whitespace
+    # characters.
+    if char == "\t" or char == "\n" or char == "\r":
+        return False
+    cat = unicodedata.category(char)
+    if cat.startswith("C"):
+        return True
+    return False
+def _is_punctuation(char):
+    """Checks whether `char` is a punctuation character."""
+    cp = ord(char)
+    # We treat all non-letter/number ASCII as punctuation.
+    # Characters such as "^", "$", and "`" are not in the Unicode
+    # Punctuation class but we treat them as punctuation anyways, for
+    # consistency.
+    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
+        (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
+        return True
+    cat = unicodedata.category(char)
+    if cat.startswith("P"):
+        return True
+    return False
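Note on `WordpieceTokenizer.tokenize` in `utils/tokenization.py`: the docstring describes a greedy longest-match-first algorithm. The standalone sketch below isolates that loop for a single token so it can be run without a vocabulary file. The function name `wordpiece` and the contents of `toy_vocab` are made up for illustration; the real vocabulary comes from `load_vocab`.

```python
def wordpiece(token, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one token into word pieces."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole token becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "##a", "able"}
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("unaffably", toy_vocab))  # ['[UNK]']
```

The second call shows the all-or-nothing behavior: once any position cannot be matched, the pieces already found are discarded and the token maps to `[UNK]`, which mirrors the `is_bad` flag in the class above.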
--- a/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/wget_models_and_data.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/knowledge_distillation/wget_models_and_data.sh
+# wget pretrain model
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/squad2_model.tar.gz
+tar -xvf squad2_model.tar.gz
+rm squad2_model.tar.gz
+mv squad2_model ./data/pretrain_model/
+# wget knowledge_distillation dataset
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/d_net_knowledge_distillation_dataset.tar.gz
+tar -xvf d_net_knowledge_distillation_dataset.tar.gz
+rm d_net_knowledge_distillation_dataset.tar.gz
+mv mlm_data ./data/input
+mv mrqa_distill_data ./data/input
+# wget evaluation dev dataset
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/mrqa_evaluation_dataset.tar.gz 
+tar -xvf mrqa_evaluation_dataset.tar.gz 
+rm mrqa_evaluation_dataset.tar.gz 
+mv mrqa_evaluation_dataset ./data/input
+# wget predictions results
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/kd_prediction_results.tar.gz
+tar -xvf kd_prediction_results.tar.gz
+rm kd_prediction_results.tar.gz
+# wget MRQA baidu trained knowledge distillation model
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/knowledge_distillation_model.tar.gz
+tar -xvf knowledge_distillation_model.tar.gz
+rm knowledge_distillation_model.tar.gz
+mv knowledge_distillation_model ./data/saved_models
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/README.md
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/README.md
+# Multi_task_learning 
+## 1、Introduction
+Pre-training is usually performed on corpora from a restricted set of domains, so further pre-training on corpora from other domains is expected to improve generalization. Hence, in the fine-tuning phase we add a masked language model and a domain classification model, both trained on corpora from various domains, as auxiliary tasks alongside MRC. Additionally, we explore multi-task learning by incorporating supervised datasets from other NLP tasks to learn better language representations.
+## 2、Quick Start
+We use PaddlePaddle PALM (a multi-task learning library) to train the MRQA2019 MRC multi-task baseline model. To download PALM, run:
+```
+git clone https://github.com/PaddlePaddle/PALM.git
+```
+### Environment
+- Python >= 2.7
+- cuda >= 9.0
+- cudnn >= 7.0
+- PaddlePaddle >= 1.5.0. Please refer to the [Installation Guide](http://www.paddlepaddle.org/#quick-start).
+### Data Preparation
+#### Get data directly: 
+You can download the data we provide directly:
+```
+bash wget_data.sh
+```
+#### Convert MRC dataset to squad format data: 
+To download the MRQA datasets, run
+```
+cd scripts && bash download_data.sh && cd ..
+```
+The training and prediction datasets will be saved in `./data/train/` and `./data/dev/`, respectively.
+The Multi_task_learning model only supports dataset files in SQuAD format. Before running the model on the MRQA datasets, you need to convert the official MRQA data to SQuAD format. To do the conversion, run
+```
+cd scripts && bash convert_mrqa2squad.sh && cd ..
+```
+The output files will be named `xxx.raw.json`.
+For convenience, we provide a script that combines all the training data into one file and all the development data into another.
+```
+cd scripts && bash combine.sh && cd ..
+```
+The combined files will be saved in `./data/train/mrqa-combined.raw.json` and `./data/dev/mrqa-combined.raw.json`.
+### Models Preparation
+In this competition, we use the Google SQuAD 2.0 model as the pre-trained model: [Model Link](https://worksheets.codalab.org/worksheets/0x3852e60a51d2444680606556d404c657)
+We provide a script to convert the TensorFlow model to a Paddle model:
+```
+cd scripts && python convert_model_params.py  --init_tf_checkpoint tf_model --fluid_params_dir paddle_model && cd ..
+```
+Alternatively, you can download the pre-trained model and the multi-task learning models we trained:
+```
+bash wget_models.sh
+```
+## 3、Train and Predict
+Prepare the data, models, and task profiles for PALM:
+```
+bash run_build_palm.sh
+```
+Start training: 
+```
+cd PALM
+bash run_multi_task.sh
+```
+## 4、Evaluation
+To evaluate the result, run
+```
+bash run_evaluation.sh
+```
+Note that we use the evaluation script for SQuAD 1.1 here, which is equivalent to the official one.
+## 5、Performance
+|  | dev in_domain(Macro-F1)| dev out_of_domain(Macro-F1) |
+| ------------- | ------------ | ------------ |
+| Official baseline | 77.87 | 58.67 |
+| BERT | 82.40 | 66.35 |
+| BERT + MLM | 83.19 | 67.45 |
+| BERT + MLM + ParaRank | 83.51 | 66.83 |
+- BERT: reading comprehension single model.
+- BERT + MLM: reading comprehension single model as the main task, with a masked language model as the auxiliary task.
+- BERT + MLM + ParaRank: reading comprehension single model as the main task, with a masked language model and paragraph classification ranking as auxiliary tasks.
+- BERT config: configs/reading_comprehension.yaml
+- MLM config: configs/mask_language_model.yaml
+- ParaRank config: configs/answer_matching.yaml
+## Copyright and License
+Copyright 2019 Baidu.com, Inc. All Rights Reserved Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and
+limitations under the License.
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/configs/answer_matching.yaml
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/configs/answer_matching.yaml
+train_file: "data/am4mrqa/train.txt"
+mix_ratio: 0.8
+batch_size: 4
+in_tokens: False
+generate_neg_sample: False
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/configs/mask_language_model.yaml
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/configs/mask_language_model.yaml
+train_file: "data/mlm4mrqa"
+mix_ratio: 2.0
+batch_size: 4
+in_tokens: False
+generate_neg_sample: False
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/configs/mtl_config.yaml
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/configs/mtl_config.yaml
+main_task: "reading_comprehension"
+auxiliary_task: "mask_language_model answer_matching"
+do_train: True
+do_predict: True
+checkpoint_path: "output"
+backbone_model: "bert_model"
+pretrain_model_path: "pretrain_model/squad2_model"
+pretrain_config_path: "pretrain_model/squad2_model/bert_config.json"
+vocab_path: "pretrain_model/squad2_model/vocab.txt"
+optimizer: "bert_optimizer"
+learning_rate: 3e-5
+lr_scheduler: "linear_warmup_decay"
+skip_steps: 100
+save_steps: 10000
+epoch: 2
+use_cuda: True
+warmup_proportion: 0.1
+weight_decay: 0.1
+do_lower_case: False
+max_seq_len: 512
+use_ema: True
+ema_decay: 0.9999
+random_seed: 0
+use_fp16: False
+loss_scaling: 1.0
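Note on `mtl_config.yaml`: it selects `lr_scheduler: "linear_warmup_decay"` with `learning_rate: 3e-5` and `warmup_proportion: 0.1`. PALM's scheduler code is not part of this diff, so the sketch below only assumes the usual BERT-style shape: a linear ramp from 0 to the peak rate over the first 10% of steps, then a linear decay back to 0. The function signature and the `total_steps=1000` value are hypothetical.

```python
def linear_warmup_decay(step, total_steps, peak_lr=3e-5, warmup_proportion=0.1):
    """Learning rate at `step` for a linear warmup followed by a linear decay.

    Assumed shape only; the exact PALM implementation may differ.
    """
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


if __name__ == "__main__":
    for step in (0, 50, 100, 550, 1000):
        print(step, linear_warmup_decay(step, total_steps=1000))
```

With 1000 total steps the rate peaks at step 100 and is back to half the peak at step 550, halfway through the decay phase.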
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/configs/reading_comprehension.yaml
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/configs/reading_comprehension.yaml
+train_file: "data/mrqa/mrqa-combined.train.raw.json"
+predict_file: "data/mrqa/mrqa-combined.dev.raw.json"
+sample_rate: 0.02
+mix_ratio: 1.0
+batch_size: 4
+in_tokens: false
+doc_stride: 128
+with_negative: false
+max_query_length: 64
+max_answer_length: 30
+n_best_size: 20
+null_score_diff_threshold: 0.0
+verbose: False
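Note on `reading_comprehension.yaml`: `doc_stride: 128` and `max_query_length: 64`, together with `max_seq_len: 512` from `mtl_config.yaml`, control how contexts that do not fit in one input are cut into overlapping windows. The reader code is not shown in this section, so the sketch below assumes the sliding-window scheme from the original BERT SQuAD example, where three positions are reserved for `[CLS]` and two `[SEP]` tokens. `doc_spans` is a hypothetical name, and the 20-token query and 700-token context are made-up sizes.

```python
def doc_spans(num_doc_tokens, max_tokens_for_doc, doc_stride=128):
    """Return (start, length) windows covering a tokenized context."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # this window reaches the end of the context
        # Advance by the stride, so consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


max_seq_len = 512
query_len = 20  # illustrative; capped at max_query_length = 64 in the config
max_tokens_for_doc = max_seq_len - query_len - 3  # [CLS] + 2 x [SEP]
print(doc_spans(700, max_tokens_for_doc))
```

A smaller stride yields more windows and more overlap, which raises the chance that an answer appears with enough surrounding context in at least one window, at the cost of more inputs per example.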
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/run_build_palm.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/run_build_palm.sh
+#!/bin/bash
+cp -r configs/* PALM/config/
+cp configs/mtl_config.yaml PALM/
+rm -rf PALM/data
+mv data PALM/
+mv squad2_model PALM/pretrain_model
+mv mrqa_multi_task_models PALM/
+cp run_multi_task.sh PALM/
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/run_evaluation.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/run_evaluation.sh
+#!/usr/bin/env bash
+# ==============================================================================
+# Copyright 2017 Baidu.com, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# path of dev data
+PATH_dev=./PALM/data/mrqa_dev
+# path of dev prediction
+BERT_MLM_PATH_prediction=./prediction_results/BERT_MLM_ema_predictions.json 
+BERT_MLM_ParaRank_PATH_prediction=./prediction_results/BERT_MLM_ParaRank_ema_predictions.json
+files=$(ls ./prediction_results/*.log 2> /dev/null | wc -l)
+if [ "$files" != "0" ];
+then
+    rm prediction_results/BERT_MLM*.log
+fi
+# evaluation BERT_MLM
+echo "evaluate BERT_MLM model........................................."
+for dataset in `ls $PATH_dev/in_domain_dev/*.raw.json`;do
+    echo $dataset >> prediction_results/BERT_MLM.log
+    python scripts/evaluate-v1.1.py $dataset $BERT_MLM_PATH_prediction >> prediction_results/BERT_MLM.log
+done
+for dataset in `ls $PATH_dev/out_of_domain_dev/*.raw.json`;do
+    echo $dataset >> prediction_results/BERT_MLM.log
+    python scripts/evaluate-v1.1.py $dataset $BERT_MLM_PATH_prediction >> prediction_results/BERT_MLM.log
+done
+python scripts/macro_avg.py prediction_results/BERT_MLM.log
+# evaluation BERT_MLM_ParaRank_PATH_prediction
+echo "evaluate BERT_MLM_ParaRank model................................"
+for dataset in `ls $PATH_dev/in_domain_dev/*.raw.json`;do
+    echo $dataset >> prediction_results/BERT_MLM_ParaRank.log
+    python scripts/evaluate-v1.1.py $dataset $BERT_MLM_ParaRank_PATH_prediction >> prediction_results/BERT_MLM_ParaRank.log
+done
+for dataset in `ls $PATH_dev/out_of_domain_dev/*.raw.json`;do
+    echo $dataset >> prediction_results/BERT_MLM_ParaRank.log
+    python scripts/evaluate-v1.1.py $dataset $BERT_MLM_ParaRank_PATH_prediction >> prediction_results/BERT_MLM_ParaRank.log
+done
+python scripts/macro_avg.py prediction_results/BERT_MLM_ParaRank.log
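Note on `run_evaluation.sh`: each loop iteration appends the dataset path and then the output of `scripts/evaluate-v1.1.py` to a log, and `scripts/macro_avg.py` (not included in this section) reads that log. The sketch below shows what a macro average over such a log could look like, assuming the evaluation script prints one JSON object with `exact_match` and `f1` keys per dataset, as the official SQuAD v1.1 script does. `macro_average` is a hypothetical name and the sample numbers are placeholders, not results.

```python
import json


def macro_average(log_lines):
    """Unweighted mean of per-dataset scores found in an evaluation log."""
    scores = []
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # dataset path written by `echo`, not a score line
        scores.append(json.loads(line))
    if not scores:
        raise ValueError("no score lines found")
    count = len(scores)
    return {
        "exact_match": sum(s["exact_match"] for s in scores) / count,
        "f1": sum(s["f1"] for s in scores) / count,
    }


sample_log = [
    "data/in_domain_dev/a.raw.json",
    '{"exact_match": 70.0, "f1": 80.0}',
    "data/out_of_domain_dev/b.raw.json",
    '{"exact_match": 50.0, "f1": 60.0}',
]
print(macro_average(sample_log))
```

Because the mean is unweighted, every dataset counts equally regardless of how many questions it contains.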
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/run_multi_task.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/run_multi_task.sh
+#!/bin/bash
+# for gpu memory optimization
+export FLAGS_sync_nccl_allreduce=0
+export FLAGS_eager_delete_tensor_gb=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+python -u mtl_run.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/combine.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/combine.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+# ==============================================================================
+# Copyright 2017 Baidu.com, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""
+This module adds all train/dev data to a file named "mrqa-combined.raw.json".
+"""
+import json
+import argparse
+import glob
+# path of train/dev data
+parser = argparse.ArgumentParser()
+parser.add_argument('path', help='the path of train/dev data')
+args = parser.parse_args()
+path = args.path
+# all train/dev data files
+files = glob.glob(path + '/*.raw.json')
+print ('files:', files)
+# add all train/dev data to "datasets"
+with open(files[0]) as fin:
+    datasets = json.load(fin)
+for i in range(1, len(files)):
+    with open(files[i]) as fin:
+        dataset = json.load(fin)
+    datasets['data'].extend(dataset['data'])
+# save to "mrqa-combined.raw.json"
+with open(path + '/mrqa-combined.raw.json', 'w') as fout:
+    json.dump(datasets, fout, indent=4)
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/combine.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/combine.sh
+#!/usr/bin/env bash
+# ==============================================================================
+# Copyright 2017 Baidu.com, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# path of train and dev data
+PATH_train=train
+PATH_dev=dev
+# add all train data to a file "$PATH_train/mrqa-combined.raw.json".
+python combine.py $PATH_train
+# add all dev data to a file "$PATH_dev/mrqa-combined.raw.json".
+python combine.py $PATH_dev
\ No newline at end of file
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/convert_model_params.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/convert_model_params.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert Google n-gram mask reading comprehension models to Fluid parameters."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import numpy as np
+import argparse
+import collections
+from utils.args import print_arguments
+import tensorflow as tf
+import paddle.fluid as fluid
+from tensorflow.python import pywrap_tensorflow
+def parse_args():
+    parser = argparse.ArgumentParser(__doc__)
+    parser.add_argument(
+        "--init_tf_checkpoint",
+        type=str,
+        required=True,
+        help="Initial TF checkpoint (a pre-trained BERT model).")
+    parser.add_argument(
+        "--fluid_params_dir",
+        type=str,
+        required=True,
+        help="The directory to store converted Fluid parameters.")
+    args = parser.parse_args()
+    return args
+def parse(init_checkpoint):
+    tf_fluid_param_name_map = collections.OrderedDict()
+    tf_param_name_shape_map = collections.OrderedDict()
+    init_vars = tf.train.list_variables(init_checkpoint)
+    for (var_name, var_shape) in init_vars: 
+        print("%s\t%s" % (var_name, var_shape))
+        fluid_param_name = ''
+        if var_name.startswith('bert/'):
+            key = var_name[5:]
+            if (key.startswith('embeddings/')):
+                if (key.endswith('LayerNorm/gamma')):
+                    fluid_param_name = 'pre_encoder_layer_norm_scale'
+                elif (key.endswith('LayerNorm/beta')):
+                    fluid_param_name = 'pre_encoder_layer_norm_bias'
+                elif (key.endswith('position_embeddings')):
+                    fluid_param_name = 'pos_embedding'
+                elif (key.endswith('word_embeddings')):
+                    fluid_param_name = 'word_embedding'
+                elif (key.endswith('token_type_embeddings')):
+                    fluid_param_name = 'sent_embedding'
+                else:
+                    print("ignored param: %s" % var_name)
+            elif (key.startswith('encoder/')):
+                key = key[8:]
+                layer_num = int(key[key.find('_') + 1:key.find('/')])
+                suffix = "encoder_layer_" + str(layer_num)
+                if key.endswith('attention/output/LayerNorm/beta'):
+                    fluid_param_name = suffix + '_post_att_layer_norm_bias'
+                elif key.endswith('attention/output/LayerNorm/gamma'):
+                    fluid_param_name = suffix + '_post_att_layer_norm_scale'
+                elif key.endswith('attention/output/dense/bias'):
+                    fluid_param_name = suffix + '_multi_head_att_output_fc.b_0'
+                elif key.endswith('attention/output/dense/kernel'):
+                    fluid_param_name = suffix + '_multi_head_att_output_fc.w_0'
+                elif key.endswith('attention/self/key/bias'):
+                    fluid_param_name = suffix + '_multi_head_att_key_fc.b_0'
+                elif key.endswith('attention/self/key/kernel'):
+                    fluid_param_name = suffix + '_multi_head_att_key_fc.w_0'
+                elif key.endswith('attention/self/query/bias'):
+                    fluid_param_name = suffix + '_multi_head_att_query_fc.b_0'
+                elif key.endswith('attention/self/query/kernel'):
+                    fluid_param_name = suffix + '_multi_head_att_query_fc.w_0'
+                elif key.endswith('attention/self/value/bias'):
+                    fluid_param_name = suffix + '_multi_head_att_value_fc.b_0'
+                elif key.endswith('attention/self/value/kernel'):
+                    fluid_param_name = suffix + '_multi_head_att_value_fc.w_0'
+                elif key.endswith('intermediate/dense/bias'):
+                    fluid_param_name = suffix + '_ffn_fc_0.b_0'
+                elif key.endswith('intermediate/dense/kernel'):
+                    fluid_param_name = suffix + '_ffn_fc_0.w_0'
+                elif key.endswith('output/LayerNorm/beta'):
+                    fluid_param_name = suffix + '_post_ffn_layer_norm_bias'
+                elif key.endswith('output/LayerNorm/gamma'):
+                    fluid_param_name = suffix + '_post_ffn_layer_norm_scale'
+                elif key.endswith('output/dense/bias'):
+                    fluid_param_name = suffix + '_ffn_fc_1.b_0'
+                elif key.endswith('output/dense/kernel'):
+                    fluid_param_name = suffix + '_ffn_fc_1.w_0'
+                else:
+                    print("ignored param: %s" % var_name)
+            elif (key.startswith('pooler/')):
+                if key.endswith('dense/bias'):
+                    fluid_param_name = 'pooled_fc.b_0'
+                elif key.endswith('dense/kernel'):
+                    fluid_param_name = 'pooled_fc.w_0'
+                else:
+                    print("ignored param: %s" % var_name)
+            else:
+                print("ignored param: %s" % var_name)
+        elif var_name.startswith('output/'):
+            if var_name == 'output/passage_regression/weights':
+                fluid_param_name = 'passage_regression_weights'
+            elif var_name == 'output/span/start/weights':
+                fluid_param_name = 'span_start_weights'
+            elif var_name == "output/span/end/conditional/dense/kernel": 
+                fluid_param_name = 'conditional_fc_weights'
+            elif var_name == "output/span/end/conditional/dense/bias": 
+                fluid_param_name = 'conditional_fc_bias'
+            elif var_name == "output/span/end/conditional/LayerNorm/beta": 
+                fluid_param_name = 'conditional_layernorm_beta'
+            elif var_name == "output/span/end/conditional/LayerNorm/gamma": 
+                fluid_param_name = 'conditional_layernorm_gamma'
+            elif var_name == "output/span/end/weights": 
+                fluid_param_name = 'span_end_weights'
+            else:
+                print("ignored param: %s" % var_name)
+        else:
+            print("ignored param: %s" % var_name)
+        if fluid_param_name != '':
+            tf_fluid_param_name_map[var_name] = fluid_param_name
+            tf_param_name_shape_map[var_name] = var_shape
+            fluid_param_name = ''
+    return tf_fluid_param_name_map, tf_param_name_shape_map
+def convert(args):
+    tf_fluid_param_name_map, tf_param_name_shape_map = parse(
+        args.init_tf_checkpoint)
+    program = fluid.Program()
+    global_block = program.global_block()
+    for param in tf_fluid_param_name_map:
+        global_block.create_parameter(
+            name=tf_fluid_param_name_map[param],
+            shape=tf_param_name_shape_map[param],
+            dtype='float32',
+            initializer=fluid.initializer.Constant(value=0.0))
+    place = fluid.core.CPUPlace()
+    exe = fluid.Executor(place)
+    exe.run(program)
+    print('---------------------- Converted Parameters -----------------------')
+    print('###### [TF param name] --> [Fluid param name]  [param shape] ######')
+    print('-------------------------------------------------------------------')
+    reader = pywrap_tensorflow.NewCheckpointReader(args.init_tf_checkpoint)
+    for param in tf_fluid_param_name_map:
+        value = reader.get_tensor(param)
+        fluid.global_scope().find_var(tf_fluid_param_name_map[
+            param]).get_tensor().set(value, place)
+        print(param, ' --> ', tf_fluid_param_name_map[param], '  ', value.shape)
+    fluid.io.save_params(exe, args.fluid_params_dir, main_program=program)
+if __name__ == '__main__':
+    args = parse_args()
+    print_arguments(args)
+    convert(args)
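The encoder branch of `parse` above derives the Fluid name prefix from the TensorFlow variable name. A minimal standalone sketch of that step follows, assuming the usual BERT checkpoint naming `bert/encoder/layer_<N>/...`. The helper name `encoder_prefix` is hypothetical and is not part of the script.

```python
def encoder_prefix(tf_var_name):
    """Return the Fluid prefix for a TF encoder variable (hypothetical helper)."""
    # Drop the leading 'bert/encoder/' so the key starts with 'layer_<N>/'.
    key = tf_var_name[len('bert/encoder/'):]
    # The layer index sits between the first '_' and the first '/'.
    layer_num = int(key[key.find('_') + 1:key.find('/')])
    return 'encoder_layer_' + str(layer_num)


name = 'bert/encoder/layer_11/attention/self/query/kernel'
fluid_name = encoder_prefix(name) + '_multi_head_att_query_fc.w_0'
print(fluid_name)
```

Note that the script stores this string in a variable called `suffix`, although it is used as a prefix of the Fluid parameter name.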
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/convert_mrqa2squad.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/convert_mrqa2squad.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+# ==============================================================================
+# Copyright 2017 Baidu.com, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""
+This module converts official MRQA data to SQuAD format.
+"""
+import json
+import argparse
+import re
+def reader(filename):
+    """
+    This function reads an MRQA data file, skipping its first (header) line.
+    :param filename: name of an MRQA data file.
+    :return: original samples of an MRQA data file.
+    """
+    with open(filename) as fin:
+        for lidx, line in enumerate(fin):
+            if lidx == 0:
+                continue
+            sample = json.loads(line.strip())
+            yield sample
+def to_squad_para_train(sample):
+    """
+    This function converts training data from MRQA format to SQuAD format.
+    :param sample: one sample in MRQA format.
+    :return: paragraphs in SQuAD format.
+    """
+    squad_para = dict()
+    context = sample['context']
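+    # Replace the MRQA special tokens with [SEP] to avoid UNK in BERT. Every marker
+    # is five characters long, like [SEP], so character offsets are preserved.
+    context = re.sub(r'\[TLE\]|\[DOC\]|\[PAR\]', '[SEP]', context)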
+    squad_para['context'] = context
+    qas = []
+    for qa in sample['qas']:
+        text = qa['detected_answers'][0]['text']
+        new_start = context.find(text)
+        # Try to find an exact match (without normalization) of the reference answer.
+        # Articles such as {a|an|the} may be missing from the original spans.
+        # E.g. the reference answer is "The London Eye",
+        # while the original span may contain only "London Eye" due to normalization.
+        new_end = new_start + len(text) - 1
+        org_start = qa['detected_answers'][0]['char_spans'][0][0]
+        org_end = qa['detected_answers'][0]['char_spans'][0][1]
+        if new_start == -1 or len(text) < 8:
+            # If no exact match (without normalization) can be found, or the reference answer
+            # is too short (e.g. a single character "c", which find could match at the wrong place),
+            # fall back to the original span in the MRQA dataset.
+            answer = {
+                'text': squad_para['context'][org_start:org_end + 1],
+                'answer_start': org_start
+            }
+            answer_start = org_start
+            answer_end = org_end
+        else:
+            answer = {
+                'text': text,
+                'answer_start': new_start
+            }
+            answer_start = new_start
+            answer_end = new_end
+        # A sanity check
+        try:
+            assert answer['text'].lower() == squad_para['context'][answer_start:answer_end + 1].lower()
+        except AssertionError:
+            print(answer['text'])
+            print(squad_para['context'][answer_start:answer_end + 1])
+            continue
+        squad_qa = {
+            'question': qa['question'],
+            'id': qa['qid'],
+            'answers': [answer]
+        }
+        qas.append(squad_qa)
+    squad_para['qas'] = qas
+    return squad_para
+def to_squad_para_dev(sample):
+    """
+    This function converts development data from MRQA format to SQuAD format.
+    :param sample: one sample in MRQA format.
+    :return: paragraphs in SQuAD format.
+    """
+    squad_para = dict()
+    context = sample['context']
+    context = re.sub(r'\[TLE\]|\[DOC\]|\[PAR\]', '[SEP]', context)
+    squad_para['context'] = context
+    qas = []
+    for qa in sample['qas']:
+        org_answers = qa['answers']
+        answers = []
+        for org_answer in org_answers:
+            answer = {
+                'text': org_answer,
+                'answer_start': -1
+            }
+            answers.append(answer)
+        squad_qa = {
+            'question': qa['question'],
+            'id': qa['qid'],
+            'answers': answers
+        }
+        qas.append(squad_qa)
+    squad_para['qas'] = qas
+    return squad_para
+def doc_wrapper(squad_para, title=""):
+    """
+    This function wraps a paragraph into a document.
+    :param squad_para: a paragraph in SQuAD format.
+    :param title: the title of the document.
+    :return: a document dict holding the title and the paragraph.
+    """
+    squad_doc = {
+        'title': title,
+        'paragraphs': [squad_para]
+    }
+    return squad_doc
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument('input', help='the input file')
+    parser.add_argument('--dev', action='store_true', help='convert devset')
+    args = parser.parse_args()
+    file_prefix = args.input[0:-6]
+    squad = {
+        'data': [],
+        'version': "1.1"
+    }
+    to_squad_para = to_squad_para_dev if args.dev else to_squad_para_train
+    for org_sample in reader(args.input):
+        para = to_squad_para(org_sample)
+        doc = doc_wrapper(para)
+        squad['data'].append(doc)
+    with open('{}.raw.json'.format(file_prefix), 'w') as fout:
+        json.dump(squad, fout, indent=4)
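The span selection in `to_squad_para_train` above can be sketched in isolation as follows. This is an illustrative reduction, not the script itself: the function name `realign_answer` and the `min_len` parameter (mirroring the hard-coded threshold of 8) are assumptions made for the example.

```python
import re


def realign_answer(context, text, org_start, org_end, min_len=8):
    """Return (answer_text, answer_start) using the fallback rule (hypothetical helper)."""
    # The markers and [SEP] are all five characters long, so offsets do not shift.
    context = re.sub(r'\[TLE\]|\[DOC\]|\[PAR\]', '[SEP]', context)
    new_start = context.find(text)
    if new_start == -1 or len(text) < min_len:
        # No exact match, or the answer is too short to locate reliably:
        # keep the span given in the MRQA data (end index is inclusive).
        return context[org_start:org_end + 1], org_start
    return text, new_start


ctx = "[TLE] London [PAR] The London Eye is a wheel."
# The reference answer keeps the article, while the original span (23..32) does not.
long_answer = realign_answer(ctx, "The London Eye", 23, 32)
# A three-character answer falls back to the original span.
short_answer = realign_answer(ctx, "Eye", 23, 32)
print(long_answer, short_answer)
```

The fallback trades recall of the full reference string for safety: short strings can occur many times in a context, so the first `find` hit is not necessarily the annotated occurrence.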
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/convert_mrqa2squad.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/convert_mrqa2squad.sh
+#!/usr/bin/env bash
+# ==============================================================================
+# Copyright 2017 Baidu.com, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# path of train and dev data
+PATH_train=train
+PATH_dev=dev
+# Convert train data from MRQA format to SQuAD format
+NAME_LIST_train="SQuAD NewsQA TriviaQA SearchQA HotpotQA NaturalQuestions"
+for name in $NAME_LIST_train;do
+    echo "Converting training data from MRQA format to SQuAD format: ""$name"
+    python convert_mrqa2squad.py $PATH_train/$name.jsonl
+done
+# Convert dev data from MRQA format to SQuAD format
+NAME_LIST_dev="SQuAD NewsQA TriviaQA SearchQA HotpotQA NaturalQuestions BioASQ TextbookQA RelationExtraction DROP DuoRC RACE"
+for name in $NAME_LIST_dev;do
+    echo "Converting development data from MRQA format to SQuAD format: ""$name"
+    python convert_mrqa2squad.py --dev $PATH_dev/$name.jsonl
+done
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/dev/md5sum_dev.txt
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/dev/md5sum_dev.txt
+05f3f16c5c31ba8e46ff5fa80647ac46  SQuAD.jsonl.gz
+5c188c92a84ddffe2ab590ac7598bde2  NewsQA.jsonl.gz
+a7a3bd90db58524f666e757db659b047  TriviaQA.jsonl.gz
+bfcb304f1b3167693b627cbf0f98bc9e  SearchQA.jsonl.gz
+675de35c3605353ec039ca4d2854072d  HotpotQA.jsonl.gz
+c0347eebbca02d10d1b07b9a64efe61d  NaturalQuestions.jsonl.gz
+6408dc4fcf258535d0ea8b125bba5fbb  BioASQ.jsonl.gz
+76ca9cc16625dd8da75758d64676e6a1  TextbookQA.jsonl.gz
+128d318ea1391bf77234d8c1b69a45df  RelationExtraction.jsonl.gz
+8b03867e4da2817ef341707040d99785  DROP.jsonl.gz
+9e66769a70fdfdec4906a4bcef5f3d71  DuoRC.jsonl.gz
+94a7ef9b9ea9402671e5b0248b6a5395  RACE.jsonl.gz
\ No newline at end of file
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/download_data.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/download_data.sh
+#!/usr/bin/env bash
+# ==============================================================================
+# Copyright 2017 Baidu.com, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# path to save data
+OUTPUT_train=train
+OUTPUT_dev=dev
+DATA_URL="https://s3.us-east-2.amazonaws.com/mrqa/release/v2"
+# Wrap wget in a function: aliases are not expanded in non-interactive bash scripts.
+wget() { command wget -c --no-check-certificate "$@"; }
+# download training datasets
+wget $DATA_URL/train/SQuAD.jsonl.gz -O $OUTPUT_train/SQuAD.jsonl.gz
+wget $DATA_URL/train/NewsQA.jsonl.gz -O $OUTPUT_train/NewsQA.jsonl.gz
+wget $DATA_URL/train/TriviaQA-web.jsonl.gz -O $OUTPUT_train/TriviaQA.jsonl.gz
+wget $DATA_URL/train/SearchQA.jsonl.gz -O $OUTPUT_train/SearchQA.jsonl.gz
+wget $DATA_URL/train/HotpotQA.jsonl.gz -O $OUTPUT_train/HotpotQA.jsonl.gz
+wget $DATA_URL/train/NaturalQuestionsShort.jsonl.gz -O $OUTPUT_train/NaturalQuestions.jsonl.gz
+# download the in-domain development data
+wget $DATA_URL/dev/SQuAD.jsonl.gz -O $OUTPUT_dev/SQuAD.jsonl.gz
+wget $DATA_URL/dev/NewsQA.jsonl.gz -O $OUTPUT_dev/NewsQA.jsonl.gz
+wget $DATA_URL/dev/TriviaQA-web.jsonl.gz -O $OUTPUT_dev/TriviaQA.jsonl.gz
+wget $DATA_URL/dev/SearchQA.jsonl.gz -O $OUTPUT_dev/SearchQA.jsonl.gz
+wget $DATA_URL/dev/HotpotQA.jsonl.gz -O $OUTPUT_dev/HotpotQA.jsonl.gz
+wget $DATA_URL/dev/NaturalQuestionsShort.jsonl.gz -O $OUTPUT_dev/NaturalQuestions.jsonl.gz
+# download the out-of-domain development data
+wget http://participants-area.bioasq.org/MRQA2019/ -O $OUTPUT_dev/BioASQ.jsonl.gz
+wget $DATA_URL/dev/TextbookQA.jsonl.gz -O $OUTPUT_dev/TextbookQA.jsonl.gz
+wget $DATA_URL/dev/RelationExtraction.jsonl.gz -O $OUTPUT_dev/RelationExtraction.jsonl.gz
+wget $DATA_URL/dev/DROP.jsonl.gz -O $OUTPUT_dev/DROP.jsonl.gz
+wget $DATA_URL/dev/DuoRC.ParaphraseRC.jsonl.gz -O $OUTPUT_dev/DuoRC.jsonl.gz
+wget $DATA_URL/dev/RACE.jsonl.gz -O $OUTPUT_dev/RACE.jsonl.gz
+# check md5sum for training datasets
+cd $OUTPUT_train
+if md5sum --status -c md5sum_train.txt; then
+    echo  "finished downloading training data"
+else
+    echo  "md5sum check failed!"
+fi
+cd ..
+# check md5sum for development data
+cd $OUTPUT_dev
+if md5sum --status -c md5sum_dev.txt; then
+    echo  "finished downloading development data"
+else
+    echo  "md5sum check failed!"
+fi
+cd ..
+# decompress training datasets
+echo "unzipping train data"
+NAME_LIST_train="SQuAD NewsQA TriviaQA SearchQA HotpotQA NaturalQuestions"
+for name in $NAME_LIST_train;do
+    gzip -d $OUTPUT_train/$name.jsonl.gz
+done
+# decompress development data
+echo "unzipping dev data"
+NAME_LIST_dev="SQuAD NewsQA TriviaQA SearchQA HotpotQA NaturalQuestions BioASQ TextbookQA RelationExtraction DROP DuoRC RACE"
+for name in $NAME_LIST_dev;do
+    gzip -d $OUTPUT_dev/$name.jsonl.gz
+done
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/evaluate-v1.1.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/evaluate-v1.1.py
+""" Official evaluation script for v1.1 of the SQuAD dataset. """
+from __future__ import print_function
+from collections import Counter
+import string
+import re
+import argparse
+import json
+import sys
+def normalize_answer(s):
+    """Lower text and remove punctuation, articles and extra whitespace."""
+    def remove_articles(text):
+        return re.sub(r'\b(a|an|the)\b', ' ', text)
+    def white_space_fix(text):
+        return ' '.join(text.split())
+    def remove_punc(text):
+        exclude = set(string.punctuation)
+        return ''.join(ch for ch in text if ch not in exclude)
+    def lower(text):
+        return text.lower()
+    return white_space_fix(remove_articles(remove_punc(lower(s))))
+def f1_score(prediction, ground_truth):
+    prediction_tokens = normalize_answer(prediction).split()
+    ground_truth_tokens = normalize_answer(ground_truth).split()
+    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
+    num_same = sum(common.values())
+    if num_same == 0:
+        return 0
+    precision = 1.0 * num_same / len(prediction_tokens)
+    recall = 1.0 * num_same / len(ground_truth_tokens)
+    f1 = (2 * precision * recall) / (precision + recall)
+    return f1
+def exact_match_score(prediction, ground_truth):
+    return (normalize_answer(prediction) == normalize_answer(ground_truth))
+def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
+    scores_for_ground_truths = []
+    for ground_truth in ground_truths:
+        score = metric_fn(prediction, ground_truth)
+        scores_for_ground_truths.append(score)
+    return max(scores_for_ground_truths)
+def evaluate(dataset, predictions):
+    f1 = exact_match = total = 0
+    for article in dataset:
+        for paragraph in article['paragraphs']:
+            for qa in paragraph['qas']:
+                total += 1
+                if qa['id'] not in predictions:
+                    message = 'Unanswered question ' + qa['id'] + \
+                              ' will receive score 0.'
+                    print(message, file=sys.stderr)
+                    continue
+                ground_truths = list(map(lambda x: x['text'], qa['answers']))
+                prediction = predictions[qa['id']]
+                exact_match += metric_max_over_ground_truths(
+                    exact_match_score, prediction, ground_truths)
+                f1 += metric_max_over_ground_truths(
+                    f1_score, prediction, ground_truths)
+    exact_match = 100.0 * exact_match / total
+    f1 = 100.0 * f1 / total
+    return {'exact_match': exact_match, 'f1': f1}
+if __name__ == '__main__':
+    expected_version = '1.1'
+    parser = argparse.ArgumentParser(
+        description='Evaluation for SQuAD ' + expected_version)
+    parser.add_argument('dataset_file', help='Dataset file')
+    parser.add_argument('prediction_file', help='Prediction File')
+    args = parser.parse_args()
+    with open(args.dataset_file) as dataset_file:
+        dataset_json = json.load(dataset_file)
+        if (dataset_json['version'] != expected_version):
+            print('Evaluation expects v-' + expected_version +
+                  ', but got dataset with v-' + dataset_json['version'],
+                  file=sys.stderr)
+        dataset = dataset_json['data']
+    with open(args.prediction_file) as prediction_file:
+        predictions = json.load(prediction_file)
+    print(json.dumps(evaluate(dataset, predictions)))
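As a worked example of the token-level F1 computed by `f1_score` above, the sketch below applies the same overlap formula to token lists that are assumed to be already normalized. The name `token_f1` is hypothetical and the normalization step is left out for brevity.

```python
from collections import Counter


def token_f1(prediction_tokens, ground_truth_tokens):
    """Token-overlap F1 between two already-normalized token lists."""
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(prediction_tokens)
    recall = num_same / len(ground_truth_tokens)
    return 2 * precision * recall / (precision + recall)


# Prediction "london eye" against ground truth "big london eye":
# precision = 2/2, recall = 2/3, so F1 = 0.8.
score = token_f1("london eye".split(), "big london eye".split())
print(round(score, 3))
```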
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/macro_avg.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/macro_avg.py
+import numpy as np
+import argparse
+import json
+def extract_score(line):
+    # Score lines are expected to be JSON objects with "f1" and "exact_match" keys;
+    # parsing them as JSON makes the result independent of the key order.
+    result = json.loads(line)
+    f1 = float(result['f1'])
+    em = float(result['exact_match'])
+    return f1, em
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description='Calculate macro average for MRQA')
+    parser.add_argument('input_file', help='Score file')
+    args = parser.parse_args()
+    with open(args.input_file) as fin:
+        # Build a list: in Python 3, map() returns an iterator that cannot be indexed.
+        lines = [line.strip() for line in fin]
+    in_domain_scores = {}
+    for dataset_id in range(0, 12, 2):
+        f1, em = extract_score(lines[dataset_id+1])
+        in_domain_scores[lines[dataset_id]] = f1
+    out_of_domain_scores = {}
+    for dataset_id in range(12, 24, 2):
+        f1, em = extract_score(lines[dataset_id+1])
+        out_of_domain_scores[lines[dataset_id]] = f1
+    # Wrap the dict views in list() so np.mean also works under Python 3.
+    print('In domain avg: {}'.format(np.mean(list(in_domain_scores.values()))))
+    print('Out of domain avg: {}'.format(np.mean(list(out_of_domain_scores.values()))))
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/train/md5sum_train.txt
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/scripts/train/md5sum_train.txt
+efd6a551d2697c20a694e933210489f8  SQuAD.jsonl.gz
+182f4e977b849cb1dbfb796030b91444  NewsQA.jsonl.gz
+e18f586152612a9358c22f5536bfd32a  TriviaQA.jsonl.gz
+612245315e6e7c4d8446e5fcc3dc1086  SearchQA.jsonl.gz
+d212c7b3fc949bd0dc47d124e8c34907  HotpotQA.jsonl.gz
+e27d27bf7c49eb5ead43cef3f41de6be  NaturalQuestions.jsonl.gz
\ No newline at end of file
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/wget_data.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/wget_data.sh
+# wget train data
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/mrqa_multi_task_dataset.tar.gz
+tar -xvf mrqa_multi_task_dataset.tar.gz
+rm mrqa_multi_task_dataset.tar.gz
+# wget prediction results
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/muiti_task_prediction_results.tar.gz
+tar -xvf muiti_task_prediction_results.tar.gz
+rm muiti_task_prediction_results.tar.gz
--- a/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/wget_models.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/multi_task_learning/wget_models.sh
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/squad2_model.tar.gz
+tar -xvf squad2_model.tar.gz
+rm squad2_model.tar.gz
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/mrqa_multi_task_models.tar.gz
+tar -xvf mrqa_multi_task_models.tar.gz
+rm mrqa_multi_task_models.tar.gz
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/README.md
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/README.md
+# server
+## Introduction 
+MRQA 2019 Shared Task submissions are handled through the [Codalab](https://worksheets.codalab.org/) platform; see [these instructions](https://worksheets.codalab.org/worksheets/0x926e37ac8b4941f793bf9b9758cc01be/).
+We provide the D-NET submission environment for the MRQA competition. It includes two servers, a BERT server and an XLNet server, and merges the results of the two servers.
+## Inference Model Preparation 
+Download the BERT inference model and the XLNet inference model:
+```
+bash wget_server_inference_model.sh
+```
+## Start server
+```
+bash start.sh
+```
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/bert.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/bert.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""BERT model."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import six
+import json
+import numpy as np
+import paddle.fluid as fluid
+from pdnlp.module.transformer_encoder import encoder as encoder
+from pdnlp.module.transformer_encoder import pre_process_layer as pre_process_layer
+class BertModel(object):
+    def __init__(self,
+                 src_ids,
+                 position_ids,
+                 sentence_ids,
+                 input_mask,
+                 config,
+                 weight_sharing=True,
+                 use_fp16=False,
+                 model_name = ''):
+        self._emb_size = config["hidden_size"]
+        self._n_layer = config["num_hidden_layers"]
+        self._n_head = config["num_attention_heads"]
+        self._voc_size = config["vocab_size"]
+        self._max_position_seq_len = config["max_position_embeddings"]
+        self._sent_types = config["type_vocab_size"]
+        self._hidden_act = config["hidden_act"]
+        self._prepostprocess_dropout = config["hidden_dropout_prob"]
+        self._attention_dropout = config["attention_probs_dropout_prob"]
+        self._weight_sharing = weight_sharing
+        self.model_name = model_name
+        self._word_emb_name = self.model_name + "word_embedding"
+        self._pos_emb_name = self.model_name + "pos_embedding"
+        self._sent_emb_name = self.model_name + "sent_embedding"
+        self._dtype = "float16" if use_fp16 else "float32"
+        # Initialize all weights with a truncated normal initializer; all biases
+        # are initialized to constant zero by default.
+        self._param_initializer = fluid.initializer.TruncatedNormal(
+            scale=config["initializer_range"])
+        self._build_model(src_ids, position_ids, sentence_ids, input_mask, config)
+    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask, config):
+        # padding id in vocabulary must be set to 0
+        emb_out = fluid.layers.embedding(
+            input=src_ids,
+            size=[self._voc_size, self._emb_size],
+            dtype=self._dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._word_emb_name, initializer=self._param_initializer),
+            is_sparse=False)
+        self.emb_out = emb_out
+        position_emb_out = fluid.layers.embedding(
+            input=position_ids,
+            size=[self._max_position_seq_len, self._emb_size],
+            dtype=self._dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._pos_emb_name, initializer=self._param_initializer))
+        self.position_emb_out = position_emb_out
+        sent_emb_out = fluid.layers.embedding(
+            sentence_ids,
+            size=[self._sent_types, self._emb_size],
+            dtype=self._dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._sent_emb_name, initializer=self._param_initializer))
+        self.sent_emb_out = sent_emb_out
+        emb_out = emb_out + position_emb_out
+        emb_out = emb_out + sent_emb_out
+        emb_out = pre_process_layer(
+            emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
+        if self._dtype == "float16":
+            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
+        self_attn_mask = fluid.layers.matmul(
+            x = input_mask, y = input_mask, transpose_y = True)
+        self_attn_mask = fluid.layers.scale(
+            x = self_attn_mask, scale = 10000.0, bias = -1.0, bias_after_scale = False)
+        n_head_self_attn_mask = fluid.layers.stack(
+            x=[self_attn_mask] * self._n_head, axis=1)
+        n_head_self_attn_mask.stop_gradient = True
+        self._enc_out = encoder(
+            enc_input = emb_out,
+            attn_bias = n_head_self_attn_mask,
+            n_layer = self._n_layer,
+            n_head = self._n_head,
+            d_key = self._emb_size // self._n_head,
+            d_value = self._emb_size // self._n_head,
+            d_model = self._emb_size,
+            d_inner_hid = self._emb_size * 4,
+            prepostprocess_dropout = self._prepostprocess_dropout,
+            attention_dropout = self._attention_dropout,
+            relu_dropout = 0,
+            hidden_act = self._hidden_act,
+            preprocess_cmd = "",
+            postprocess_cmd = "dan",
+            param_initializer = self._param_initializer,
+            name = self.model_name + 'encoder')
+    def get_sequence_output(self):
+        return self._enc_out
+    def get_pooled_output(self):
+        """Get the first feature of each sequence for classification"""
+        next_sent_feat = fluid.layers.slice(
+            input = self._enc_out, axes = [1], starts = [0], ends = [1])
+        next_sent_feat = fluid.layers.fc(
+            input = next_sent_feat,
+            size = self._emb_size,
+            act = "tanh",
+            param_attr = fluid.ParamAttr(
+                name = self.model_name + "pooled_fc.w_0", 
+                initializer = self._param_initializer),
+            bias_attr = "pooled_fc.b_0")
+        return next_sent_feat
+    def get_pretraining_output(self, mask_label, mask_pos, labels):
+        """Get the loss & accuracy for pretraining"""
+        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
+        # extract the first token feature in each sentence
+        next_sent_feat = self.get_pooled_output()
+        reshaped_emb_out = fluid.layers.reshape(
+            x=self._enc_out, shape = [-1, self._emb_size])
+        # extract masked tokens' feature
+        mask_feat = fluid.layers.gather(input = reshaped_emb_out, index = mask_pos)
+        # transform: fc
+        mask_trans_feat = fluid.layers.fc(
+            input = mask_feat,
+            size = self._emb_size,
+            act = self._hidden_act,
+            param_attr = fluid.ParamAttr(
+                name = self.model_name + 'mask_lm_trans_fc.w_0',
+                initializer = self._param_initializer),
+            bias_attr = fluid.ParamAttr(name = self.model_name + 'mask_lm_trans_fc.b_0'))
+        # transform: layer norm 
+        mask_trans_feat = pre_process_layer(
+            mask_trans_feat, 'n', name = self.model_name + 'mask_lm_trans')
+        mask_lm_out_bias_attr = fluid.ParamAttr(
+            name = self.model_name + "mask_lm_out_fc.b_0",
+            initializer = fluid.initializer.Constant(value = 0.0))
+        if self._weight_sharing:
+            fc_out = fluid.layers.matmul(
+                x = mask_trans_feat,
+                y = fluid.default_main_program().global_block().var(
+                    self._word_emb_name),
+                transpose_y = True)
+            fc_out += fluid.layers.create_parameter(
+                shape = [self._voc_size],
+                dtype = self._dtype,
+                attr = mask_lm_out_bias_attr,
+                is_bias = True)
+        else:
+            fc_out = fluid.layers.fc(input = mask_trans_feat,
+                                     size = self._voc_size,
+                                     param_attr = fluid.ParamAttr(
+                                         name = self.model_name + "mask_lm_out_fc.w_0",
+                                         initializer = self._param_initializer),
+                                     bias_attr = mask_lm_out_bias_attr)
+        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
+            logits = fc_out, label = mask_label)
+        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
+        next_sent_fc_out = fluid.layers.fc(
+            input = next_sent_feat,
+            size = 2,
+            param_attr = fluid.ParamAttr(
+                name = self.model_name + "next_sent_fc.w_0", 
+                initializer = self._param_initializer),
+            bias_attr = self.model_name + "next_sent_fc.b_0")
+        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(
+            logits = next_sent_fc_out, label = labels, return_softmax = True)
+        next_sent_acc = fluid.layers.accuracy(
+            input = next_sent_softmax, label = labels)
+        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
+        loss = mean_next_sent_loss + mean_mask_lm_loss
+        return next_sent_acc, mean_mask_lm_loss, loss
+if __name__ == "__main__":
+    print("hello world!")
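Review note on the attention mask built in `BertModel`: the `matmul` of `input_mask` with its transpose, followed by `scale(..., scale=10000.0, bias=-1.0, bias_after_scale=False)`, turns a 0/1 padding mask into an additive attention bias. The sketch below reproduces that arithmetic in plain Python for a single sequence. The function name `attention_bias` is hypothetical and not part of the PR, and the batch dimension and the per-head `stack` are left out.

```python
def attention_bias(input_mask, scale=10000.0, bias=-1.0):
    """Build an additive attention bias from a 0/1 padding mask.

    input_mask: list of floats, 1.0 for a real token and 0.0 for padding.
    """
    n = len(input_mask)
    # Outer product: entry (i, j) is 1.0 only when tokens i and j are both real.
    pair_mask = [[input_mask[i] * input_mask[j] for j in range(n)]
                 for i in range(n)]
    # bias_after_scale=False means the bias is added first, then scaled.
    return [[(m + bias) * scale for m in row] for row in pair_mask]


# Two real tokens followed by one padding token.
for row in attention_bias([1.0, 1.0, 0.0]):
    print(row)
```

Pairs of real tokens get a bias of 0, while any pair involving padding gets -10000, which drives the corresponding softmax weight toward zero.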
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/bert_model.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/bert_model.py
+#encoding=utf8
+import os
+import sys
+import argparse
+from copy import deepcopy as copy
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+import collections
+import multiprocessing
+from pdnlp.nets.bert import BertModel
+from pdnlp.toolkit.configure import JsonConfig
+class ModelBERT(object):
+    def __init__(
+        self, 
+        conf, 
+        name = "", 
+        is_training = False,
+        base_model = None):
+        # the name of this task
+        # name is used for identifying parameters
+        self.name = name
+        # deep copy the configure of model
+        self.conf = copy(conf)
+        self.is_training = is_training
+        ## the overall loss of this task
+        self.loss = None
+        ## outputs may be useful for the other models
+        self.outputs = {}
+        ## the prediction of this task
+        self.predict = []
+    def create_model(self, 
+                      args,
+                      reader_input,
+                      base_model = None):
+        """
+            given the base model, reader_input
+            return the create fn for create this model
+        """
+        def _create_model():
+            src_ids, pos_ids, sent_ids, input_mask = reader_input
+            bert_conf = JsonConfig(self.conf["bert_conf_file"])
+            self.bert = BertModel(
+                src_ids = src_ids,
+                position_ids = pos_ids,
+                sentence_ids = sent_ids,
+                input_mask = input_mask,
+                config = bert_conf,
+                use_fp16 = args.use_fp16,
+                model_name = self.name)
+            self.loss = None
+            self.outputs = {
+                "sequence_output": self.bert.get_sequence_output(),
+                # "pooled_output": self.bert.get_pooled_output()
+            }
+        return _create_model
+    def get_output(self, name):
+        return self.outputs[name]
+    def get_outputs(self):
+        return self.outputs
+    def get_predict(self):
+        return self.predict
+if __name__ == "__main__":
+    bert_model = ModelBERT(conf = {"bert_conf_file" : "./data/pretrained_models/squad2_model/bert_config.json"})
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/model_wrapper.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/model_wrapper.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+"""BERT (PaddlePaddle) model wrapper"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import json
+import collections
+import multiprocessing
+import argparse
+import numpy as np
+import paddle.fluid as fluid
+from task_reader.mrqa import DataProcessor, get_answers
+from bert_model import ModelBERT
+import mrc_model
+ema_decay = 0.9999
+verbose = False
+max_seq_len = 512
+max_query_length = 64
+max_answer_length = 30
+in_tokens = False
+do_lower_case = False
+doc_stride = 128
+n_best_size = 20
+use_cuda = True
+class BertModelWrapper():
+    """
+    Wrap a PaddlePaddle BERT inference model.
+     The basic steps are preprocessing, running the saved inference program,
+     and postprocessing.
+    """
+    def __init__(self, model_dir):
+        """Create the executor, the data processor and load the inference model from model_dir"""
+        if use_cuda:
+            place = fluid.CUDAPlace(0)
+            dev_count = fluid.core.get_cuda_device_count()
+        else:
+            place = fluid.CPUPlace()
+            dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
+        self.exe = fluid.Executor(place)
+        self.bert_preprocessor = DataProcessor(
+            vocab_path=os.path.join(model_dir, 'vocab.txt'),
+            do_lower_case=do_lower_case,
+            max_seq_length=max_seq_len,
+            in_tokens=in_tokens,
+            doc_stride=doc_stride,
+            max_query_length=max_query_length)
+        self.inference_program, self.feed_target_names, self.fetch_targets = \
+            fluid.io.load_inference_model(dirname=model_dir, executor=self.exe)
+    def preprocessor(self, samples, batch_size, examples_start_id, features_start_id):
+        """Preprocess the input samples, including word seg, padding, token to ids"""
+        # Tokenization and paragraph padding
+        examples, features, batch = self.bert_preprocessor.data_generator(
+            samples, batch_size, max_len=max_seq_len, examples_start_id=examples_start_id, features_start_id=features_start_id)
+        self.samples = samples
+        return examples, features, batch
+    def call_mrc(self, batch, squeeze_dim0=False, return_list=False):
+        """MRC"""
+        if squeeze_dim0 and return_list:
+            raise ValueError("squeeze_dim0 only works for dict-type return values.")
+        src_ids = batch[0]
+        pos_ids = batch[1]
+        sent_ids = batch[2]
+        input_mask = batch[3]
+        unique_id = batch[4]
+        feed_dict = {
+            self.feed_target_names[0]: src_ids,
+            self.feed_target_names[1]: pos_ids,
+            self.feed_target_names[2]: sent_ids,
+            self.feed_target_names[3]: input_mask,
+            self.feed_target_names[4]: unique_id
+        }
+        np_unique_ids, np_start_logits, np_end_logits, np_num_seqs = \
+            self.exe.run(self.inference_program, feed=feed_dict, fetch_list=self.fetch_targets)
+        if len(np_unique_ids) == 1 and squeeze_dim0:
+            np_unique_ids = np_unique_ids[0]
+            np_start_logits = np_start_logits[0]
+            np_end_logits = np_end_logits[0]
+        if return_list:
+            mrc_results = [{'unique_ids': id, 'start_logits': st, 'end_logits': end} 
+                            for id, st, end in zip(np_unique_ids, np_start_logits, np_end_logits)]
+        else:
+            mrc_results = {
+                'unique_ids': np_unique_ids,
+                'start_logits': np_start_logits,
+                'end_logits': np_end_logits,
+            }
+        return mrc_results
+    def postprocessor(self, examples, features, mrc_results):
+        """Extract answer
+         examples, features: outputs of preprocessor
+         mrc_results: model results from call_mrc, either a single dict or a list of dicts, each of which holds a size=1 batch.
+        """
+        RawResult = collections.namedtuple("RawResult",
+                                           ["unique_id", "start_logits", "end_logits"])
+        results = []
+        if isinstance(mrc_results, list):
+            for res in mrc_results:
+                unique_id = res['unique_ids'][0]
+                start_logits = [float(x) for x in res['start_logits'].flat]
+                end_logits = [float(x) for x in res['end_logits'].flat]
+                results.append(
+                    RawResult(
+                        unique_id=unique_id,
+                        start_logits=start_logits,
+                        end_logits=end_logits))
+        else:
+            assert isinstance(mrc_results, dict)
+            for idx in range(mrc_results['unique_ids'].shape[0]):
+                unique_id = int(mrc_results['unique_ids'][idx])
+                start_logits = [float(x) for x in mrc_results['start_logits'][idx].flat]
+                end_logits = [float(x) for x in mrc_results['end_logits'][idx].flat]
+                results.append(
+                    RawResult(
+                        unique_id=unique_id,
+                        start_logits=start_logits,
+                        end_logits=end_logits))
+        answers = get_answers(
+            examples, features, results, n_best_size,
+            max_answer_length, do_lower_case, verbose)
+        return answers
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/mrc_model.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/mrc_model.py
+# encoding=utf8
+import paddle.fluid as fluid
+def compute_loss(output_tensors, args=None):
+    """Compute loss for mrc model"""
+    def _compute_single_loss(logits, positions):
+        """Compute start/end loss for mrc model"""
+        loss = fluid.layers.softmax_with_cross_entropy(
+            logits=logits, label=positions)
+        loss = fluid.layers.mean(x=loss)
+        return loss
+    start_logits = output_tensors['start_logits']
+    end_logits = output_tensors['end_logits']
+    start_positions = output_tensors['start_positions']
+    end_positions = output_tensors['end_positions']
+    start_loss = _compute_single_loss(start_logits, start_positions)
+    end_loss = _compute_single_loss(end_logits, end_positions)
+    total_loss = (start_loss + end_loss) / 2.0
+    if args is not None and args.use_fp16 and args.loss_scaling > 1.0:
+        total_loss = total_loss * args.loss_scaling
+    return total_loss
+def create_model(reader_input, base_model=None, is_training=True, args=None):
+    """
+        given the base model, reader_input
+        return the output tensors
+    """
+    if is_training:
+        src_ids, pos_ids, sent_ids, input_mask, \
+        start_positions, end_positions = reader_input
+    else:
+        src_ids, pos_ids, sent_ids, input_mask, unique_id = reader_input
+    enc_out = base_model.get_output("sequence_output")
+    logits = fluid.layers.fc(
+        input=enc_out,
+        size=2,
+        num_flatten_dims=2,
+        param_attr=fluid.ParamAttr(
+            name="cls_squad_out_w",
+            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
+        bias_attr=fluid.ParamAttr(
+            name="cls_squad_out_b", initializer=fluid.initializer.Constant(0.)))
+    logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])
+    start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)
+    batch_ones = fluid.layers.fill_constant_batch_size_like(
+        input=start_logits, dtype='int64', shape=[1], value=1)
+    num_seqs = fluid.layers.reduce_sum(input=batch_ones)
+    output_tensors = {}
+    output_tensors['start_logits'] = start_logits
+    output_tensors['end_logits'] = end_logits
+    output_tensors['num_seqs'] = num_seqs
+    if is_training:
+        output_tensors['start_positions'] = start_positions
+        output_tensors['end_positions'] = end_positions
+    else:
+        output_tensors['unique_id'] = unique_id
+    return output_tensors
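Review note on `compute_loss`: the span loss is the mean of two softmax cross-entropy terms, one over start positions and one over end positions. The sketch below shows the same computation for a single example in plain Python. The function names are hypothetical, and the batch mean and the optional fp16 loss scaling from the PR are omitted.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross entropy of softmax(logits) against an integer class label."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]


def span_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start-position and end-position losses."""
    start_loss = softmax_cross_entropy(start_logits, start_position)
    end_loss = softmax_cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2.0


# A confident, correct prediction gives a small loss.
print(span_loss([0.1, 5.0, 0.2], [0.0, 0.3, 6.0], 1, 2))
```

Averaging rather than summing the two terms keeps the loss on the same scale as a single classification loss.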
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/mrc_service.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/mrc_service.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+"""Some utilities for MRC online service"""
+import json
+import sys
+import logging
+import time
+import numpy as np
+from flask import Response
+from flask import request
+from copy import deepcopy
+verbose = False
+def _request_check(input_json):
+    """Check if the request json is valid"""
+    if input_json is None or not isinstance(input_json, dict):
+        return 'Can not parse the input json data - {}'.format(input_json)
+    try:
+        c = input_json['context']
+        qa = input_json['qas'][0]
+        qid = qa['qid']
+        q = qa['question']
+    except KeyError as e:
+        return 'Invalid request, key "{}" not found'.format(e)
+    return 'OK'
+def _abort(status_code, message):
+    """Create custom error message and status code"""
+    return Response(json.dumps(message), status=status_code, mimetype='application/json')
+def _timmer(init_start, start, current, process_name):
+    cumulated_elapsed_time = (current - init_start) * 1000
+    current_elapsed_time = (current - start) * 1000
+    print('{}\t-\t{:.2f}\t{:.2f}'.format(process_name, cumulated_elapsed_time,
+                                         current_elapsed_time))
+def _split_input_json(input_json):
+    if len(input_json['context_tokens']) > 810:
+        input_json['context'] = input_json['context'][:5000]
+    if len(input_json['qas']) == 1:
+        return [input_json]
+    else:
+        rets = []
+        for i in range(len(input_json['qas'])):
+            temp = deepcopy(input_json)
+            temp['qas'] = [input_json['qas'][i]]
+            rets.append(temp)
+        return rets
+class MRQAService(object):
+    """Provide basic MRC service for flask"""
+    def __init__(self, name, logger=None, log_data=False):
+        """Store the service name and logger; log_data enables request/response logging"""
+        self.name = name
+        if logger is None:
+            self.logger = logging.getLogger('flask')
+        else:
+            self.logger = logger
+        self.log_data = log_data
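+    def __call__(self, model, process_mode='serial', max_batch_size=5, timmer=False):
+        """Call the MRC model wrapper and handle exceptions.
+        Args:
+            model: wrapper that provides preprocessor, call_mrc and postprocessor
+            process_mode: 'serial' or 'parallel'
+            max_batch_size: batch size in parallel mode; in serial mode, the
+                maximum number of size-1 batches run per question
+            timmer: if True, print the elapsed time of each stage
+        """
+        if timmer:
+            start = time.time()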
+        self.input_json = request.get_json(silent=True)
+        try:
+            if timmer:
+                start_request_check = time.time()
+            request_status = _request_check(self.input_json)
+            if timmer:
+                current_time = time.time()
+                _timmer(start, start_request_check, current_time, 'request check')
+            if self.log_data:
+                if self.logger is None:
+                    logging.info(
+                        'Client input - {}'.format(json.dumps(self.input_json, ensure_ascii=False))
+                    )
+                else:
+                    self.logger.info(
+                        'Client input - {}'.format(json.dumps(self.input_json, ensure_ascii=False))
+                    )
+        except Exception as e:
+            self.logger.error('server request checker error')
+            self.logger.exception(e)
+            return _abort(500, 'server request checker error - {}'.format(e))
+        if request_status != 'OK':
+            return _abort(400, request_status)
+        # call preprocessor
+        try:
+            if timmer:
+                start_preprocess = time.time()
+            jsons = _split_input_json(self.input_json)
+            processed = []
+            ex_start_idx = 0
+            feat_start_idx = 1000000000
+            for i in jsons:
+                e,f,b = model.preprocessor(i, batch_size=max_batch_size if process_mode == 'parallel' else 1, examples_start_id=ex_start_idx, features_start_id=feat_start_idx)
+                ex_start_idx += len(e)
+                feat_start_idx += len(f)
+                processed.append([e,f,b])
+            if timmer:
+                current_time = time.time()
+                _timmer(start, start_preprocess, current_time, 'preprocess')
+        except Exception as e:
+            self.logger.error('preprocessor error')
+            self.logger.exception(e)
+            return _abort(500, 'preprocessor error - {}'.format(e))
+        def transpose(mat):
+            return zip(*mat)
+        # call mrc
+        try:
+            if timmer:
+                start_call_mrc = time.time()
+            self.mrc_results = []
+            self.examples = []
+            self.features = []
+            for e, f, batches in processed:
+                if verbose:
+                    if len(f) > max_batch_size:
+                        print("get a too long example....")
+                if process_mode == 'serial':
+                    self.mrc_results.extend([model.call_mrc(b, squeeze_dim0=True) for b in batches[:max_batch_size]])
+                elif process_mode == 'parallel':
+                    # only keep first max_batch_size features
+                    # batches = batches[0]
+                    for b in batches:
+                        self.mrc_results.extend(model.call_mrc(b, return_list=True))
+                else:
+                    raise NotImplementedError()
+                self.examples.extend(e)
+                # self.features.extend(f[:max_batch_size])
+                self.features.extend(f)
+            if timmer:
+                current_time = time.time()
+                _timmer(start, start_call_mrc, current_time, 'call mrc')
+        except Exception as e:
+            self.logger.error('call_mrc error')
+            self.logger.exception(e)
+            return _abort(500, 'call_mrc error - {}'.format(e))
+        # call post processor
+        try:
+            if timmer:
+                start_post_process = time.time()
+            self.results = model.postprocessor(self.examples, self.features, self.mrc_results)
+            # only the n-best results are POSTed back
+            self.results = self.results[1]
+            # self.results = self.results[0]
+            if timmer:
+                current_time = time.time()
+                _timmer(start, start_post_process, current_time, 'post process')
+        except Exception as e:
+            self.logger.error('postprocessor error')
+            self.logger.exception(e)
+            return _abort(500, 'postprocessor error - {}'.format(e))
+        return self._response_constructor()
+    def _response_constructor(self):
+        """construct http response object"""
+        try:
+            response = {
+                # 'requestID': self.input_json['requestID'],
+                'results': self.results
+            }
+            if self.log_data:
+                self.logger.info(
+                    'Response - {}'.format(json.dumps(response, ensure_ascii=False))
+                )
+            return Response(json.dumps(response), mimetype='application/json')
+        except Exception as e:
+            self.logger.error('response constructor error')
+            self.logger.exception(e)
+            return _abort(500, 'response constructor error - {}'.format(e))
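Review note on `_split_input_json`: a request that carries several questions for one context is split into one single-question request per entry in `qas`, so each question can be preprocessed with its own example and feature id offsets. The sketch below illustrates only the splitting step. The function name `split_by_question` and the sample request are hypothetical, and the context truncation performed in the PR is left out.

```python
from copy import deepcopy


def split_by_question(request):
    """Return one single-question request per entry in request['qas']."""
    parts = []
    for qa in request['qas']:
        part = deepcopy(request)  # keep the shared context in every part
        part['qas'] = [qa]
        parts.append(part)
    return parts


# Hypothetical request in the shape checked by _request_check.
request = {
    'context': 'Paris is the capital of France.',
    'qas': [
        {'qid': 'q1', 'question': 'What is the capital of France?'},
        {'qid': 'q2', 'question': 'Which country is Paris in?'},
    ],
}
for part in split_by_question(request):
    print(part['qas'][0]['qid'], len(part['qas']))
```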
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/__main__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/__main__.py
+from algorithm import optimization
+from algorithm import multitask
+from extension import fp16
+from module import transformer_encoder
+from toolkit import configure
+from toolkit import init
+from toolkit import placeholder
+from nets import bert
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/algorithm/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/algorithm/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/algorithm/multitask.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/algorithm/multitask.py
+#encoding=utf8
+import os
+import sys
+import random
+from copy import deepcopy as copy
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+import multiprocessing
+class Task:
+    def __init__(
+        self, 
+        conf,
+        name = "",
+        is_training = False,
+        _DataProcesser = None,
+        shared_name = ""):
+        self.conf = copy(conf)
+        self.name = name
+        self.shared_name = shared_name
+        self.is_training = is_training
+        self.DataProcesser = _DataProcesser
+    def _create_reader(self):
+        raise NotImplementedError("Task:_create_reader not implemented")
+    def _create_model(self):
+        raise NotImplementedError("Task:_create_model not implemented")
+    def prepare(self, args):
+        raise NotImplementedError("Task:prepare not implemented")
+    def train_step(self, args):
+        raise NotImplementedError("Task:train_step not implemented")
+    def predict(self, args):
+        raise NotImplementedError("Task:predict not implemented")
+class JointTask:
+    def __init__(self):
+        self.tasks = []
+        #self.startup_exe = None
+        #self.train_exe = None
+        self.exe = None
+        self.share_vars_from = None
+        self.startup_prog = fluid.Program()
+    def __add__(self, task):
+        assert isinstance(task, Task)
+        self.tasks.append(task)
+        return self
+    def prepare(self, args):
+        if args.use_cuda:
+            place = fluid.CUDAPlace(0)
+            dev_count = fluid.core.get_cuda_device_count()
+        else:
+            place = fluid.CPUPlace()
+            dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
+        #self.startup_exe = fluid.Executor(place)
+        self.exe = fluid.Executor(place)
+        for idx, task in enumerate(self.tasks):
+            if idx == 0:
+                print("for idx : %d" % idx)
+                task.prepare(args, exe = self.exe)
+                self.share_vars_from = task.compiled_train_prog
+            else:
+                print("for idx : %d" % idx)
+                task.prepare(args, exe = self.exe, share_vars_from = self.share_vars_from)
+    def train(self, args):
+        joint_steps = []
+        for i in range(0, len(self.tasks)):
+            for _ in range(0, self.tasks[i].max_train_steps):
+                joint_steps.append(i)
+        self.tasks[0].train_step(args, exe = self.exe)
+        random.shuffle(joint_steps)
+        for next_task_id in joint_steps:
+            self.tasks[next_task_id].train_step(args, exe = self.exe)
+if __name__ == "__main__":
+    basetask_a = Task(None)
+    basetask_b = Task(None)
+    joint_tasks = JointTask()
+    joint_tasks += basetask_a
+    print(joint_tasks.tasks)
+    joint_tasks += basetask_b
+    print(joint_tasks.tasks)
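Review note on `JointTask.train`: the training order is built by repeating each task index `max_train_steps` times and shuffling the result, so every task still gets its full step budget while the steps are interleaved at random. The sketch below shows that schedule construction in isolation. The function name `build_joint_schedule` and the `seed` parameter are hypothetical, and the extra first step on task 0 that the PR runs before the shuffled loop is not modeled.

```python
import random


def build_joint_schedule(max_train_steps, seed=None):
    """Return a shuffled list of task ids.

    max_train_steps: list with the number of training steps for each task.
    """
    schedule = []
    for task_id, steps in enumerate(max_train_steps):
        schedule.extend([task_id] * steps)
    # A private Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(schedule)
    return schedule


# Task 0 trains for 3 steps and task 1 for 2 steps, in random order.
print(build_joint_schedule([3, 2], seed=0))
```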
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/algorithm/optimization.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/algorithm/optimization.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Optimization and learning rate scheduling."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import numpy as np
+import paddle.fluid as fluid
+from pdnlp.extension.fp16 import create_master_params_grads, master_param_to_train_param
+def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
+    """ Applies linear warmup of learning rate from 0 and decay to 0."""
+    with fluid.default_main_program()._lr_schedule_guard():
+        lr = fluid.layers.tensor.create_global_var(
+            shape=[1],
+            value=0.0,
+            dtype='float32',
+            persistable=True,
+            name="scheduled_learning_rate")
+        global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
+        with fluid.layers.control_flow.Switch() as switch:
+            with switch.case(global_step < warmup_steps):
+                warmup_lr = learning_rate * (global_step / warmup_steps)
+                fluid.layers.tensor.assign(warmup_lr, lr)
+            with switch.default():
+                decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
+                    learning_rate=learning_rate,
+                    decay_steps=num_train_steps,
+                    end_learning_rate=0.0,
+                    power=1.0,
+                    cycle=False)
+                fluid.layers.tensor.assign(decayed_lr, lr)
+        return lr
+def optimization(loss,
+                 warmup_steps,
+                 num_train_steps,
+                 learning_rate,
+                 train_program,
+                 startup_prog,
+                 weight_decay,
+                 scheduler='linear_warmup_decay',
+                 use_fp16=False,
+                 loss_scaling=1.0):
+    if warmup_steps > 0:
+        if scheduler == 'noam_decay':
+            scheduled_lr = fluid.layers.learning_rate_scheduler\
+             .noam_decay(1/(warmup_steps *(learning_rate ** 2)),
+                         warmup_steps)
+        elif scheduler == 'linear_warmup_decay':
+            scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
+                                               num_train_steps)
+        else:
+            raise ValueError("Unknown learning rate scheduler, should be "
+                             "'noam_decay' or 'linear_warmup_decay'")
+        optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
+    else:
+        optimizer = fluid.optimizer.Adam(learning_rate=learning_rate)
+        scheduled_lr = learning_rate
+    clip_norm_thres = 1.0
+    # When using mixed precision training, scale the gradient clip threshold
+    # by loss_scaling
+    if use_fp16 and loss_scaling > 1.0:
+        clip_norm_thres *= loss_scaling
+    fluid.clip.set_gradient_clip(
+        clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm_thres))
+    def exclude_from_weight_decay(name):
+        if name.find("layer_norm") > -1:
+            return True
+        bias_suffix = ["_bias", "_b", ".b_0"]
+        for suffix in bias_suffix:
+            if name.endswith(suffix):
+                return True
+        return False
+    param_list = dict()
+    if use_fp16:
+        param_grads = optimizer.backward(loss)
+        master_param_grads = create_master_params_grads(
+            param_grads, train_program, startup_prog, loss_scaling)
+        for param, _ in master_param_grads:
+            param_list[param.name] = param * 1.0
+            param_list[param.name].stop_gradient = True
+        optimizer.apply_gradients(master_param_grads)
+        if weight_decay > 0:
+            for param, grad in master_param_grads:
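+                # str.rstrip strips a character set, not a suffix, so cut ".master" explicitly
+                param_name = param.name
+                if param_name.endswith(".master"):
+                    param_name = param_name[:-len(".master")]
+                if exclude_from_weight_decay(param_name):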
+                    continue
+                with param.block.program._optimized_guard(
+                    [param, grad]), fluid.framework.name_scope("weight_decay"):
+                    updated_param = param - param_list[
+                        param.name] * weight_decay * scheduled_lr
+                    fluid.layers.assign(output=param, input=updated_param)
+        master_param_to_train_param(master_param_grads, param_grads,
+                                    train_program)
+    else:
+        for param in train_program.global_block().all_parameters():
+            param_list[param.name] = param * 1.0
+            param_list[param.name].stop_gradient = True
+        _, param_grads = optimizer.minimize(loss)
+        if weight_decay > 0:
+            for param, grad in param_grads:
+                if exclude_from_weight_decay(param.name):
+                    continue
+                with param.block.program._optimized_guard(
+                    [param, grad]), fluid.framework.name_scope("weight_decay"):
+                    updated_param = param - param_list[
+                        param.name] * weight_decay * scheduled_lr
+                    fluid.layers.assign(output=param, input=updated_param)
+    return scheduled_lr
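Review note on `linear_warmup_decay`: the schedule ramps the learning rate linearly from 0 during warmup, then switches to a polynomial decay with `power=1.0` toward 0 over `num_train_steps`. The sketch below evaluates that rule for a given step in plain Python. The function name is hypothetical, and it assumes the decay branch reads the same global step counter as the warmup branch and clamps the step at `num_train_steps`, which is how `polynomial_decay` with `cycle=False` is documented to behave.

```python
def linear_warmup_decay_value(step, learning_rate, warmup_steps, num_train_steps):
    """Learning rate at a given global step under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup branch: linear ramp from 0 up to learning_rate.
        return learning_rate * float(step) / warmup_steps
    # Decay branch: power=1.0, end_learning_rate=0.0, cycle=False.
    clipped = min(step, num_train_steps)
    return learning_rate * (1.0 - float(clipped) / num_train_steps)


for step in (0, 50, 100, 500, 1000):
    print(step, linear_warmup_decay_value(step, 1.0, 100, 1000))
```

Under these assumptions the rate does not reach the full `learning_rate` at the end of warmup, because the decay branch is already partway through its ramp when it takes over.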
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/extension/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/extension/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/extension/fp16.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/extension/fp16.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import print_function
+import paddle
+import paddle.fluid as fluid
+def cast_fp16_to_fp32(i, o, prog):
+    prog.global_block().append_op(
+        type="cast",
+        inputs={"X": i},
+        outputs={"Out": o},
+        attrs={
+            "in_dtype": fluid.core.VarDesc.VarType.FP16,
+            "out_dtype": fluid.core.VarDesc.VarType.FP32
+        })
+def cast_fp32_to_fp16(i, o, prog):
+    prog.global_block().append_op(
+        type="cast",
+        inputs={"X": i},
+        outputs={"Out": o},
+        attrs={
+            "in_dtype": fluid.core.VarDesc.VarType.FP32,
+            "out_dtype": fluid.core.VarDesc.VarType.FP16
+        })
+def copy_to_master_param(p, block):
+    v = block.vars.get(p.name, None)
+    if v is None:
+        raise ValueError("no param name %s found!" % p.name)
+    new_p = fluid.framework.Parameter(
+        block=block,
+        shape=v.shape,
+        dtype=fluid.core.VarDesc.VarType.FP32,
+        type=v.type,
+        lod_level=v.lod_level,
+        stop_gradient=p.stop_gradient,
+        trainable=p.trainable,
+        optimize_attr=p.optimize_attr,
+        regularizer=p.regularizer,
+        gradient_clip_attr=p.gradient_clip_attr,
+        error_clip=p.error_clip,
+        name=v.name + ".master")
+    return new_p
+def create_master_params_grads(params_grads, main_prog, startup_prog,
+                               loss_scaling):
+    master_params_grads = []
+    tmp_role = main_prog._current_role
+    OpRole = fluid.core.op_proto_and_checker_maker.OpRole
+    main_prog._current_role = OpRole.Backward
+    for p, g in params_grads:
+        # create master parameters
+        master_param = copy_to_master_param(p, main_prog.global_block())
+        startup_master_param = startup_prog.global_block()._clone_variable(
+            master_param)
+        startup_p = startup_prog.global_block().var(p.name)
+        cast_fp16_to_fp32(startup_p, startup_master_param, startup_prog)
+        # cast fp16 gradients to fp32 before apply gradients
+        if g.name.find("layer_norm") > -1:
+            if loss_scaling > 1:
+                scaled_g = g / float(loss_scaling)
+            else:
+                scaled_g = g
+            master_params_grads.append([p, scaled_g])
+            continue
+        master_grad = fluid.layers.cast(g, "float32")
+        if loss_scaling > 1:
+            master_grad = master_grad / float(loss_scaling)
+        master_params_grads.append([master_param, master_grad])
+    main_prog._current_role = tmp_role
+    return master_params_grads
+def master_param_to_train_param(master_params_grads, params_grads, main_prog):
+    for idx, m_p_g in enumerate(master_params_grads):
+        train_p, _ = params_grads[idx]
+        if train_p.name.find("layer_norm") > -1:
+            continue
+        with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
+            cast_fp32_to_fp16(m_p_g[0], train_p, main_prog)
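The helpers in `fp16.py` keep an FP32 "master" copy of each FP16 parameter, cast the FP16 gradient to FP32, divide it by `loss_scaling`, and cast the updated master value back to FP16. The sketch below reproduces that flow in NumPy. It assumes a plain SGD update, which is only a stand-in for whatever optimizer the training script uses, and `mixed_precision_step` is a hypothetical name.

```python
import numpy as np


def mixed_precision_step(master_w, scaled_grad_fp16, lr, loss_scaling):
    """One update with an FP32 master weight and an FP16 training weight.

    Assumption: plain SGD is used purely for illustration.
    """
    # Cast the FP16 gradient to FP32 before applying it.
    master_grad = scaled_grad_fp16.astype(np.float32)
    # Undo the loss scaling that was applied before the backward pass.
    if loss_scaling > 1:
        master_grad = master_grad / float(loss_scaling)
    # Update the FP32 master copy.
    new_master_w = master_w - lr * master_grad
    # Copy the result back into the FP16 training parameter.
    train_w = new_master_w.astype(np.float16)
    return new_master_w, train_w


master = np.array([1.0], dtype=np.float32)
grad = np.array([128.0], dtype=np.float16)  # true gradient 1.0, scaled by 128
new_master, train = mixed_precision_step(master, grad, lr=0.5, loss_scaling=128)
```

In the source, parameters whose names contain `layer_norm` skip the master copy; the sketch does not model that special case.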
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/module/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/module/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/module/transformer_encoder.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/module/transformer_encoder.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Transformer encoder."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from functools import partial, reduce
+import numpy as np
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+from paddle.fluid.layer_helper import LayerHelper
+def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None):
+    helper = LayerHelper('layer_norm', **locals())
+    mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True)
+    shift_x = layers.elementwise_sub(x=x, y=mean, axis=0)
+    variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True)
+    r_stdev = layers.rsqrt(variance + epsilon)
+    norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0)
+    param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])]
+    param_dtype = norm_x.dtype
+    scale = helper.create_parameter(
+        attr=param_attr,
+        shape=param_shape,
+        dtype=param_dtype,
+        default_initializer=fluid.initializer.Constant(1.))
+    bias = helper.create_parameter(
+        attr=bias_attr,
+        shape=param_shape,
+        dtype=param_dtype,
+        is_bias=True,
+        default_initializer=fluid.initializer.Constant(0.))
+    out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1)
+    out = layers.elementwise_add(x=out, y=bias, axis=-1)
+    return out
+def multi_head_attention(queries,
+                         keys,
+                         values,
+                         attn_bias,
+                         d_key,
+                         d_value,
+                         d_model,
+                         n_head=1,
+                         dropout_rate=0.,
+                         cache=None,
+                         param_initializer=None,
+                         name='multi_head_att'):
+    """
+    Multi-Head Attention. Note that attn_bias is added to the logits before
+    computing the softmax activation to mask selected positions so that
+    they will not be considered in the attention weights.
+    """
+    keys = queries if keys is None else keys
+    values = keys if values is None else values
+    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
+        raise ValueError(
+            "Inputs: queries, keys and values should all be 3-D tensors.")
+    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
+        """
+        Add linear projection to queries, keys, and values.
+        """
+        q = layers.fc(input = queries,
+                      size = d_key * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_query_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_query_fc.b_0')
+        k = layers.fc(input = keys,
+                      size = d_key * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_key_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_key_fc.b_0')
+        v = layers.fc(input = values,
+                      size = d_value * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_value_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_value_fc.b_0')
+        return q, k, v
+    def __split_heads(x, n_head):
+        """
+        Reshape the last dimension of input tensor x so that it becomes two
+        dimensions and then transpose. Specifically, input a tensor with shape
+        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
+        with shape [bs, n_head, max_sequence_length, hidden_dim].
+        """
+        hidden_size = x.shape[-1]
+        # The value 0 in shape attr means copying the corresponding dimension
+        # size of the input as the output dimension size.
+        reshaped = layers.reshape(
+            x = x, shape = [0, 0, n_head, hidden_size // n_head], inplace=False)
+        # permute the dimensions into:
+        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
+        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
+    def __combine_heads(x):
+        """
+        Transpose and then reshape the last two dimensions of input tensor x
+        so that it becomes one dimension, which is the reverse of __split_heads.
+        """
+        if len(x.shape) == 3: return x
+        if len(x.shape) != 4:
+            raise ValueError("Input(x) should be a 4-D Tensor.")
+        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
+        # The value 0 in shape attr means copying the corresponding dimension
+        # size of the input as the output dimension size.
+        return layers.reshape(
+            x = trans_x,
+            shape = [0, 0, trans_x.shape[2] * trans_x.shape[3]],
+            inplace = False)
+    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
+        """
+        Scaled Dot-Product Attention
+        """
+        scaled_q = layers.scale(x = q, scale = d_key**-0.5)
+        product = layers.matmul(x = scaled_q, y = k, transpose_y = True)
+        if attn_bias:
+            product += attn_bias
+        weights = layers.softmax(product)
+        if dropout_rate:
+            weights = layers.dropout(
+                weights,
+                dropout_prob=dropout_rate,
+                dropout_implementation="upscale_in_train",
+                is_test=False)
+        out = layers.matmul(weights, v)
+        return out
+    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
+    if cache is not None:  # use cache and concat time steps
+        # Since the inplace reshape in __split_heads changes the shape of k and
+        # v, which is the cache input for next time step, reshape the cache
+        # input from the previous time step first.
+        k = cache["k"] = layers.concat(
+            [layers.reshape(
+                cache["k"], shape=[0, 0, d_model]), k], axis=1)
+        v = cache["v"] = layers.concat(
+            [layers.reshape(
+                cache["v"], shape=[0, 0, d_model]), v], axis=1)
+    q = __split_heads(q, n_head)
+    k = __split_heads(k, n_head)
+    v = __split_heads(v, n_head)
+    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
+                                                  dropout_rate)
+    out = __combine_heads(ctx_multiheads)
+    # Project back to the model size.
+    proj_out = layers.fc(input = out,
+                         size = d_model,
+                         num_flatten_dims = 2,
+                         param_attr=fluid.ParamAttr(
+                             name = name + '_output_fc.w_0',
+                             initializer = param_initializer),
+                         bias_attr = name + '_output_fc.b_0')
+    return proj_out
+def positionwise_feed_forward(x,
+                              d_inner_hid,
+                              d_hid,
+                              dropout_rate,
+                              hidden_act,
+                              param_initializer=None,
+                              name='ffn'):
+    """
+    Position-wise Feed-Forward Networks.
+    This module consists of two linear transformations with an activation
+    (selected by hidden_act) in between, applied to each position separately
+    and identically.
+    """
+    hidden = layers.fc(input=x,
+                       size=d_inner_hid,
+                       num_flatten_dims=2,
+                       act=hidden_act,
+                       param_attr=fluid.ParamAttr(
+                           name=name + '_fc_0.w_0',
+                           initializer=param_initializer),
+                       bias_attr=name + '_fc_0.b_0')
+    if dropout_rate:
+        hidden = layers.dropout(
+            hidden,
+            dropout_prob=dropout_rate,
+            dropout_implementation="upscale_in_train",
+            is_test = False)
+    out = layers.fc(input = hidden,
+                    size = d_hid,
+                    num_flatten_dims = 2,
+                    param_attr=fluid.ParamAttr(
+                        name = name + '_fc_1.w_0', 
+                        initializer = param_initializer),
+                    bias_attr = name + '_fc_1.b_0')
+    return out
+def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
+                           name=''):
+    """
+    Add residual connection, layer normalization and dropout to the out tensor
+    optionally according to the value of process_cmd.
+    This will be used before or after multi-head attention and position-wise
+    feed-forward networks.
+    """
+    for cmd in process_cmd:
+        if cmd == "a":  # add residual connection
+            out = out + prev_out if prev_out else out
+        elif cmd == "n":  # add layer normalization
+            out_dtype = out.dtype
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x = out, dtype = "float32")
+            out = layer_norm(
+                out,
+                begin_norm_axis=len(out.shape) - 1,
+                param_attr=fluid.ParamAttr(
+                    name = name + '_layer_norm_scale',
+                    initializer = fluid.initializer.Constant(1.)),
+                bias_attr=fluid.ParamAttr(
+                    name = name + '_layer_norm_bias',
+                    initializer = fluid.initializer.Constant(0.)))
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x = out, dtype = "float16")
+        elif cmd == "d":  # add dropout
+            if dropout_rate:
+                out = layers.dropout(
+                    out,
+                    dropout_prob = dropout_rate,
+                    dropout_implementation = "upscale_in_train",
+                    is_test = False)
+    return out
+pre_process_layer = partial(pre_post_process_layer, None)
+post_process_layer = pre_post_process_layer
+def encoder_layer(enc_input,
+                  attn_bias,
+                  n_head,
+                  d_key,
+                  d_value,
+                  d_model,
+                  d_inner_hid,
+                  prepostprocess_dropout,
+                  attention_dropout,
+                  relu_dropout,
+                  hidden_act,
+                  preprocess_cmd="n",
+                  postprocess_cmd="da",
+                  param_initializer=None,
+                  name=''):
+    """The encoder layers that can be stacked to form a deep encoder.
+    This module consists of a multi-head (self) attention followed by
+    position-wise feed-forward networks, with both components accompanied
+    by the post_process_layer to add residual connection, layer normalization
+    and dropout.
+    """
+    attn_output = multi_head_attention(
+        pre_process_layer(
+            enc_input,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name=name + '_pre_att'),
+        None,
+        None,
+        attn_bias,
+        d_key,
+        d_value,
+        d_model,
+        n_head,
+        attention_dropout,
+        param_initializer = param_initializer,
+        name = name + '_multi_head_att')
+    attn_output = post_process_layer(
+        enc_input,
+        attn_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name = name + '_post_att')
+    ffd_output = positionwise_feed_forward(
+        pre_process_layer(
+            attn_output,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name = name + '_pre_ffn'),
+        d_inner_hid,
+        d_model,
+        relu_dropout,
+        hidden_act,
+        param_initializer = param_initializer,
+        name = name + '_ffn')
+    return post_process_layer(
+        attn_output,
+        ffd_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name = name + '_post_ffn')
+def encoder(enc_input,
+            attn_bias,
+            n_layer,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd="n",
+            postprocess_cmd="da",
+            param_initializer=None,
+            name='',
+            return_all = False):
+    """
+    The encoder is composed of a stack of identical layers returned by calling
+    encoder_layer.
+    """
+    enc_outputs = []
+    for i in range(n_layer):
+        enc_output = encoder_layer(
+            enc_input,
+            attn_bias,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd,
+            postprocess_cmd,
+            param_initializer = param_initializer,
+            name = name + '_layer_' + str(i))
+        enc_input = enc_output
+        if i < n_layer - 1:
+            enc_outputs.append(enc_output)
+    enc_output = pre_process_layer(
+        enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
+    enc_outputs.append(enc_output)
+    if not return_all:
+        return enc_output
+    else:
+        return enc_output, enc_outputs
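The attention path in `multi_head_attention` splits the projected tensors into heads, scales the queries by `d_key**-0.5`, adds `attn_bias` to the logits, applies softmax, and merges the heads again. A NumPy sketch of those steps is given below. It omits the linear projections, dropout, and the cache, and the function names mirror the nested helpers only for readability; it is an illustration under those assumptions, not the Paddle graph itself.

```python
import numpy as np


def split_heads(x, n_head):
    # [bs, seq_len, n_head * d] -> [bs, n_head, seq_len, d]
    bs, seq_len, hidden = x.shape
    return x.reshape(bs, seq_len, n_head, hidden // n_head).transpose(0, 2, 1, 3)


def combine_heads(x):
    # [bs, n_head, seq_len, d] -> [bs, seq_len, n_head * d]
    bs, n_head, seq_len, d = x.shape
    return x.transpose(0, 2, 1, 3).reshape(bs, seq_len, n_head * d)


def scaled_dot_product_attention(q, k, v, attn_bias=None):
    d_key = q.shape[-1]
    # Scale the queries, then take dot products with the keys.
    product = (q * d_key ** -0.5) @ k.transpose(0, 1, 3, 2)
    if attn_bias is not None:
        # Large negative bias values mask out positions.
        product = product + attn_bias
    # Numerically stable softmax over the key axis.
    product = product - product.max(axis=-1, keepdims=True)
    weights = np.exp(product)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))
q = k = v = split_heads(x, n_head=2)
ctx = combine_heads(scaled_dot_product_attention(q, k, v))
```

Here `ctx` has the same shape as `x`, which is what the final output projection in the source expects as input.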
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/module/transformer_encoder.py.old
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/module/transformer_encoder.py.old
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Transformer encoder."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from functools import partial
+import numpy as np
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+def multi_head_attention(queries,
+                         keys,
+                         values,
+                         attn_bias,
+                         d_key,
+                         d_value,
+                         d_model,
+                         n_head=1,
+                         dropout_rate=0.,
+                         cache=None,
+                         param_initializer=None,
+                         name='multi_head_att'):
+    """
+    Multi-Head Attention. Note that attn_bias is added to the logits before
+    computing the softmax activation to mask selected positions so that
+    they will not be considered in the attention weights.
+    """
+    keys = queries if keys is None else keys
+    values = keys if values is None else values
+    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
+        raise ValueError(
+            "Inputs: queries, keys and values should all be 3-D tensors.")
+    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
+        """
+        Add linear projection to queries, keys, and values.
+        """
+        q = layers.fc(input = queries,
+                      size = d_key * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_query_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_query_fc.b_0')
+        k = layers.fc(input = keys,
+                      size = d_key * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_key_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_key_fc.b_0')
+        v = layers.fc(input = values,
+                      size = d_value * n_head,
+                      num_flatten_dims = 2,
+                      param_attr = fluid.ParamAttr(
+                          name = name + '_value_fc.w_0',
+                          initializer = param_initializer),
+                      bias_attr = name + '_value_fc.b_0')
+        return q, k, v
+    def __split_heads(x, n_head):
+        """
+        Reshape the last dimension of input tensor x so that it becomes two
+        dimensions and then transpose. Specifically, input a tensor with shape
+        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
+        with shape [bs, n_head, max_sequence_length, hidden_dim].
+        """
+        hidden_size = x.shape[-1]
+        # The value 0 in shape attr means copying the corresponding dimension
+        # size of the input as the output dimension size.
+        reshaped = layers.reshape(
+            x = x, shape = [0, 0, n_head, hidden_size // n_head], inplace=True)
+        # permute the dimensions into:
+        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
+        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
+    def __combine_heads(x):
+        """
+        Transpose and then reshape the last two dimensions of input tensor x
+        so that it becomes one dimension, which is the reverse of __split_heads.
+        """
+        if len(x.shape) == 3: return x
+        if len(x.shape) != 4:
+            raise ValueError("Input(x) should be a 4-D Tensor.")
+        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
+        # The value 0 in shape attr means copying the corresponding dimension
+        # size of the input as the output dimension size.
+        return layers.reshape(
+            x = trans_x,
+            shape = [0, 0, trans_x.shape[2] * trans_x.shape[3]],
+            inplace = True)
+    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
+        """
+        Scaled Dot-Product Attention
+        """
+        scaled_q = layers.scale(x = q, scale = d_key**-0.5)
+        product = layers.matmul(x = scaled_q, y = k, transpose_y = True)
+        if attn_bias:
+            product += attn_bias
+        weights = layers.softmax(product)
+        if dropout_rate:
+            weights = layers.dropout(
+                weights,
+                dropout_prob=dropout_rate,
+                dropout_implementation="upscale_in_train",
+                is_test=False)
+        out = layers.matmul(weights, v)
+        return out
+    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
+    if cache is not None:  # use cache and concat time steps
+        # Since the inplace reshape in __split_heads changes the shape of k and
+        # v, which is the cache input for next time step, reshape the cache
+        # input from the previous time step first.
+        k = cache["k"] = layers.concat(
+            [layers.reshape(
+                cache["k"], shape=[0, 0, d_model]), k], axis=1)
+        v = cache["v"] = layers.concat(
+            [layers.reshape(
+                cache["v"], shape=[0, 0, d_model]), v], axis=1)
+    q = __split_heads(q, n_head)
+    k = __split_heads(k, n_head)
+    v = __split_heads(v, n_head)
+    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
+                                                  dropout_rate)
+    out = __combine_heads(ctx_multiheads)
+    # Project back to the model size.
+    proj_out = layers.fc(input = out,
+                         size = d_model,
+                         num_flatten_dims = 2,
+                         param_attr=fluid.ParamAttr(
+                             name = name + '_output_fc.w_0',
+                             initializer = param_initializer),
+                         bias_attr = name + '_output_fc.b_0')
+    return proj_out
+def positionwise_feed_forward(x,
+                              d_inner_hid,
+                              d_hid,
+                              dropout_rate,
+                              hidden_act,
+                              param_initializer=None,
+                              name='ffn'):
+    """
+    Position-wise Feed-Forward Networks.
+    This module consists of two linear transformations with an activation
+    (selected by hidden_act) in between, applied to each position separately
+    and identically.
+    """
+    hidden = layers.fc(input=x,
+                       size=d_inner_hid,
+                       num_flatten_dims=2,
+                       act=hidden_act,
+                       param_attr=fluid.ParamAttr(
+                           name=name + '_fc_0.w_0',
+                           initializer=param_initializer),
+                       bias_attr=name + '_fc_0.b_0')
+    if dropout_rate:
+        hidden = layers.dropout(
+            hidden,
+            dropout_prob=dropout_rate,
+            dropout_implementation="upscale_in_train",
+            is_test = False)
+    out = layers.fc(input = hidden,
+                    size = d_hid,
+                    num_flatten_dims = 2,
+                    param_attr=fluid.ParamAttr(
+                        name = name + '_fc_1.w_0', 
+                        initializer = param_initializer),
+                    bias_attr = name + '_fc_1.b_0')
+    return out
+def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
+                           name=''):
+    """
+    Add residual connection, layer normalization and dropout to the out tensor
+    optionally according to the value of process_cmd.
+    This will be used before or after multi-head attention and position-wise
+    feed-forward networks.
+    """
+    for cmd in process_cmd:
+        if cmd == "a":  # add residual connection
+            out = out + prev_out if prev_out else out
+        elif cmd == "n":  # add layer normalization
+            out_dtype = out.dtype
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x = out, dtype = "float32")
+            out = layers.layer_norm(
+                out,
+                begin_norm_axis=len(out.shape) - 1,
+                param_attr=fluid.ParamAttr(
+                    name = name + '_layer_norm_scale',
+                    initializer = fluid.initializer.Constant(1.)),
+                bias_attr=fluid.ParamAttr(
+                    name = name + '_layer_norm_bias',
+                    initializer = fluid.initializer.Constant(0.)))
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x = out, dtype = "float16")
+        elif cmd == "d":  # add dropout
+            if dropout_rate:
+                out = layers.dropout(
+                    out,
+                    dropout_prob = dropout_rate,
+                    dropout_implementation = "upscale_in_train",
+                    is_test = False)
+    return out
+pre_process_layer = partial(pre_post_process_layer, None)
+post_process_layer = pre_post_process_layer
+def encoder_layer(enc_input,
+                  attn_bias,
+                  n_head,
+                  d_key,
+                  d_value,
+                  d_model,
+                  d_inner_hid,
+                  prepostprocess_dropout,
+                  attention_dropout,
+                  relu_dropout,
+                  hidden_act,
+                  preprocess_cmd="n",
+                  postprocess_cmd="da",
+                  param_initializer=None,
+                  name=''):
+    """The encoder layers that can be stacked to form a deep encoder.
+    This module consists of a multi-head (self) attention followed by
+    position-wise feed-forward networks, with both components accompanied
+    by the post_process_layer to add residual connection, layer normalization
+    and dropout.
+    """
+    attn_output = multi_head_attention(
+        pre_process_layer(
+            enc_input,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name=name + '_pre_att'),
+        None,
+        None,
+        attn_bias,
+        d_key,
+        d_value,
+        d_model,
+        n_head,
+        attention_dropout,
+        param_initializer = param_initializer,
+        name = name + '_multi_head_att')
+    attn_output = post_process_layer(
+        enc_input,
+        attn_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name = name + '_post_att')
+    ffd_output = positionwise_feed_forward(
+        pre_process_layer(
+            attn_output,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name = name + '_pre_ffn'),
+        d_inner_hid,
+        d_model,
+        relu_dropout,
+        hidden_act,
+        param_initializer = param_initializer,
+        name = name + '_ffn')
+    return post_process_layer(
+        attn_output,
+        ffd_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name = name + '_post_ffn')
+def encoder(enc_input,
+            attn_bias,
+            n_layer,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd="n",
+            postprocess_cmd="da",
+            param_initializer=None,
+            name='',
+            return_all = False):
+    """
+    The encoder is composed of a stack of identical layers returned by calling
+    encoder_layer.
+    """
+    enc_outputs = []
+    for i in range(n_layer):
+        enc_output = encoder_layer(
+            enc_input,
+            attn_bias,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd,
+            postprocess_cmd,
+            param_initializer = param_initializer,
+            name = name + '_layer_' + str(i))
+        enc_input = enc_output
+        if i < n_layer - 1:
+            enc_outputs.append(enc_output)
+    enc_output = pre_process_layer(
+        enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
+    enc_outputs.append(enc_output)
+    if not return_all:
+        return enc_output
+    else:
+        return enc_output, enc_outputs
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/nets/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/nets/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/toolkit/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/toolkit/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/toolkit/configure.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/toolkit/configure.py
+#encoding=utf8
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import sys
+import argparse
+import six
+import logging
+import json
+logging_only_message = "%(message)s"
+logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s"
+class JsonConfig(object):
+    def __init__(self, config_path):
+        self._config_dict = self._parse(config_path)
+    def _parse(self, config_path):
+        try:
+            with open(config_path) as json_file:
+                config_dict = json.load(json_file)
+        except Exception:
+            raise IOError("Error in parsing bert model config file '%s'" %
+                config_path)
+        else:
+            return config_dict
+    def __getitem__(self, key):
+        return self._config_dict[key]
+    def print_config(self):
+        for arg, value in sorted(six.iteritems(self._config_dict)):
+            print('%s: %s' % (arg, value))
+        print('------------------------------------------------')
+class ArgumentGroup(object):
+    def __init__(self, parser, title, des):
+        self._group = parser.add_argument_group(title=title, description=des)
+    def add_arg(self, name, type, default, help, **kwargs):
+        type = str2bool if type == bool else type
+        self._group.add_argument(
+            "--" + name,
+            default=default,
+            type=type,
+            help=help + ' Default: %(default)s.',
+            **kwargs)
+class ArgConfig(object):
+    def __init__(self):
+        parser = argparse.ArgumentParser()
+        train_g = ArgumentGroup(parser, "training", "training options.")
+        train_g.add_arg("epoch",             int,    3,      "Number of epochs for fine-tuning.")
+        train_g.add_arg("learning_rate",     float,  5e-5,   "Learning rate used to train with warmup.")
+        train_g.add_arg("lr_scheduler",      str,    "linear_warmup_decay",
+                        "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
+        train_g.add_arg("weight_decay",      float,  0.01,   "Weight decay rate for L2 regularizer.")
+        train_g.add_arg("warmup_proportion", float,  0.1,
+                        "Proportion of training steps to perform linear learning rate warmup for.")
+        train_g.add_arg("save_steps",        int,    1000,   "The steps interval to save checkpoints.")
+        train_g.add_arg("use_fp16",          bool,   False,  "Whether to use fp16 mixed precision training.")
+        train_g.add_arg("loss_scaling",      float,  1.0,
+                        "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
+        train_g.add_arg("pred_dir",   str,    None,   "Path to save the prediction results")
+        log_g = ArgumentGroup(parser, "logging", "logging related.")
+        log_g.add_arg("skip_steps",          int,    10,    "The steps interval to print loss.")
+        log_g.add_arg("verbose",             bool,   False, "Whether to output verbose log.")
+        run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
+        run_type_g.add_arg("use_cuda",                     bool,   True,  "If set, use GPU for training.")
+        run_type_g.add_arg("use_fast_executor",            bool,   False, "If set, use fast parallel executor (in experiment).")
+        run_type_g.add_arg("num_iteration_per_drop_scope", int,    1,     "The iteration intervals to clean up temporary variables.")
+        run_type_g.add_arg("do_train",                     bool,   True,  "Whether to perform training.")
+        run_type_g.add_arg("do_predict",                   bool,   True,  "Whether to perform prediction.")
+        custom_g = ArgumentGroup(parser, "customize", "customized options.")
+        self.custom_g = custom_g
+        self.parser = parser
+    def add_arg(self, name, dtype, default, descrip):
+        self.custom_g.add_arg(name, dtype, default, descrip)
+    def build_conf(self):
+        return self.parser.parse_args()
+def str2bool(v):
+    # argparse cannot parse strings such as "true" or "False" into Python
+    # booleans directly, so map the accepted spellings by hand
+    return v.lower() in ("true", "t", "1")
+def print_arguments(args, log = None):
+    if not log:
+        print('-----------  Configuration Arguments -----------')
+        for arg, value in sorted(six.iteritems(vars(args))):
+            print('%s: %s' % (arg, value))
+        print('------------------------------------------------')
+    else:
+        log.info('-----------  Configuration Arguments -----------')
+        for arg, value in sorted(six.iteritems(vars(args))):
+            log.info('%s: %s' % (arg, value))
+        log.info('------------------------------------------------')
+if __name__ == "__main__":
+    args = ArgConfig()
+    args = args.build_conf()
+    # using print()
+    print_arguments(args)
+    logging.basicConfig(
+        level=logging.INFO,
+        format=logging_details,
+        datefmt='%Y-%m-%d %H:%M:%S')
+    # using logging
+    print_arguments(args, logging)
+    json_conf = JsonConfig("../../data/pretrained_models/uncased_L-12_H-768_A-12/bert_config.json")
+    json_conf.print_config()
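Note on `str2bool`: `ArgumentGroup.add_arg` swaps `bool` for `str2bool` because `argparse` applies `type` to the raw string, and `bool("false")` is `True`. The following is a minimal standalone sketch of that behavior. The parser is a throwaway illustration rather than the project's `ArgConfig`; only the `--use_cuda` flag name is borrowed from the code above.

```python
import argparse


def str2bool(v):
    # Same rule as configure.py: only these spellings count as True.
    return v.lower() in ("true", "t", "1")


parser = argparse.ArgumentParser()
# With type=bool, "--use_cuda false" would still yield True,
# because any non-empty string is truthy.
parser.add_argument("--use_cuda", type=str2bool, default=True)

args_off = parser.parse_args(["--use_cuda", "false"])
args_on = parser.parse_args(["--use_cuda", "True"])
print(args_off.use_cuda, args_on.use_cuda)
```

Any spelling outside `true`, `t` and `1` (case-insensitive) is read as `False`, so a typo such as `--use_cuda yes` silently disables the flag.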
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/toolkit/init.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/toolkit/init.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import print_function
+import os
+import six
+import ast
+import copy
+import numpy as np
+import paddle.fluid as fluid
+def cast_fp32_to_fp16(exe, main_program):
+    print("Cast parameters to float16 data format.")
+    for param in main_program.global_block().all_parameters():
+        if not param.name.endswith(".master"):
+            param_t = fluid.global_scope().find_var(param.name).get_tensor()
+            data = np.array(param_t)
+            if param.name.find("layer_norm") == -1:
+                param_t.set(np.float16(data).view(np.uint16), exe.place)
+            master_param_var = fluid.global_scope().find_var(param.name +
+                                                             ".master")
+            if master_param_var is not None:
+                master_param_var.get_tensor().set(data, exe.place)
+def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False, skip_list=()):
+    assert os.path.exists(
+        init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path
+    def existed_persitables(var):
+        if not fluid.io.is_persistable(var):
+            return False
+        if var.name in skip_list:
+            return False
+        return os.path.exists(os.path.join(init_checkpoint_path, var.name))
+    fluid.io.load_vars(
+        exe,
+        init_checkpoint_path,
+        main_program=main_program,
+        predicate=existed_persitables)
+    print("Load model from {}".format(init_checkpoint_path))
+    if use_fp16:
+        cast_fp32_to_fp16(exe, main_program)
+def init_pretraining_params(exe,
+                            pretraining_params_path,
+                            main_program,
+                            use_fp16=False):
+    assert os.path.exists(pretraining_params_path
+                          ), "[%s] cannot be found." % pretraining_params_path
+    def existed_params(var):
+        if not isinstance(var, fluid.framework.Parameter):
+            return False
+        return os.path.exists(os.path.join(pretraining_params_path, var.name))
+    fluid.io.load_vars(
+        exe,
+        pretraining_params_path,
+        main_program=main_program,
+        predicate=existed_params)
+    print("Load pretraining parameters from {}.".format(
+        pretraining_params_path))
+    if use_fp16:
+        cast_fp32_to_fp16(exe, main_program)
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/toolkit/placeholder.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/pdnlp/toolkit/placeholder.py
+#encoding=utf8
+from __future__ import print_function
+import os
+import six
+import ast
+import copy
+import numpy as np
+import paddle.fluid as fluid
+class Placeholder(object):
+    def __init__(self, input_shapes=None):
+        # Python keeps only the last definition of a method, so the former
+        # no-argument __init__ was unreachable; one constructor with an
+        # optional argument covers both cases.
+        self.shapes = []
+        self.dtypes = []
+        self.lod_levels = []
+        self.names = []
+        for new_holder in (input_shapes or []):
+            shape = new_holder[0]
+            dtype = new_holder[1]
+            lod_level = new_holder[2] if len(new_holder) >= 3 else 0
+            name = new_holder[3] if len(new_holder) >= 4 else ""
+            self.append_placeholder(shape, dtype, lod_level = lod_level, name = name)
+    def append_placeholder(self, shape, dtype, lod_level = 0, name = ""):
+        self.shapes.append(shape)
+        self.dtypes.append(dtype)
+        self.lod_levels.append(lod_level)
+        self.names.append(name)
+    def build(self, capacity, reader_name, use_double_buffer = False):
+        pyreader = fluid.layers.py_reader(
+            capacity = capacity,
+            shapes = self.shapes,
+            dtypes = self.dtypes,
+            lod_levels = self.lod_levels,
+            name = reader_name, 
+            use_double_buffer = use_double_buffer)
+        return [pyreader, fluid.layers.read_file(pyreader)]
+    def __add__(self, new_holder):
+        assert isinstance(new_holder, tuple) or isinstance(new_holder, list) 
+        assert len(new_holder) >= 2
+        shape = new_holder[0]
+        dtype = new_holder[1]
+        lod_level = new_holder[2] if len(new_holder) >= 3 else 0
+        name = new_holder[3] if len(new_holder) >= 4 else ""
+        self.append_placeholder(shape, dtype, lod_level = lod_level, name = name)
+        # return the holder so that `p = p + (...)` does not rebind p to None
+        return self
+if __name__ == "__main__":
+    print("hello world!")
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/reader.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/reader.py
+#encoding=utf8
+import os
+import sys
+import random
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+from pdnlp.toolkit.placeholder import Placeholder
+def repeat(reader):
+    """Repeat a generator forever"""
+    generator = reader()
+    while True:
+        try:
+            yield next(generator)
+        except StopIteration:
+            generator = reader()
+            yield next(generator)
+def create_joint_generator(input_shape, generators, is_multi_task=True):
+    def empty_output(input_shape, batch_size=1):
+        results = []
+        for i in range(len(input_shape)):
+            if input_shape[i][1] == 'int32':
+                dtype = np.int32
+            if input_shape[i][1] == 'int64':
+                dtype = np.int64
+            if input_shape[i][1] == 'float32':
+                dtype = np.float32
+            if input_shape[i][1] == 'float64':
+                dtype = np.float64
+            # copy the shape so the caller's input_shape is not modified in place
+            shape = list(input_shape[i][0])
+            shape[0] = batch_size
+            pad_tensor = np.zeros(shape=shape, dtype=dtype)
+            results.append(pad_tensor)
+        return results
+    def wrapper(): 
+        """wrapper data"""
+        generators_inst = [repeat(gen[0]) for gen in generators]
+        generators_ratio = [gen[1] for gen in generators]
+        # float() avoids integer division under Python 2 when ratios are ints
+        weights = [ratio / float(sum(generators_ratio)) for ratio in generators_ratio]
+        run_task_id = range(len(generators))
+        while True:
+            idx = np.random.choice(run_task_id, p=weights)
+            gen_results = next(generators_inst[idx])
+            if not gen_results:
+                break
+            batch_size = gen_results[0].shape[0]
+            results = empty_output(input_shape, batch_size)
+            task_id_tensor = np.array([[idx]]).astype("int64")
+            results[0] = task_id_tensor
+            for i in range(4):
+                results[i+1] = gen_results[i]
+            if idx == 0:
+                # mrc batch
+                results[5] = gen_results[4]
+                results[6] = gen_results[5]
+            elif idx == 1:
+                # mlm batch
+                results[7] = gen_results[4]
+                results[8] = gen_results[5]
+            elif idx == 2:
+                # MNLI batch
+                results[9] = gen_results[4]
+            else:
+                raise RuntimeError('Invalid task ID - {}'.format(idx))
+            # idx stands for the task index
+            yield results
+    return wrapper
+def create_reader(reader_name, input_shape, is_multi_task, *gens):
+    """
+    build reader for multi_task_learning
+    """
+    placeholder = Placeholder(input_shape)
+    pyreader, model_inputs = placeholder.build(capacity=100, reader_name=reader_name)
+    joint_generator = create_joint_generator(input_shape, gens[0], is_multi_task=is_multi_task)
+    return joint_generator, pyreader, model_inputs
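Note on `create_joint_generator`: each step picks a task index with probability proportional to its ratio and pulls the next batch from that task's endlessly repeating reader. The sketch below reproduces only that sampling logic with the standard library. `sample_tasks`, `mrc_reader` and `mlm_reader` are hypothetical names introduced for illustration, `random.choices` stands in for `np.random.choice`, and the tensor packing done in `wrapper` is left out.

```python
import random


def repeat(reader):
    """Restart the reader each time it is exhausted (assumes it is non-empty)."""
    while True:
        for item in reader():
            yield item


def sample_tasks(generators, num_batches, seed=0):
    """Draw batches from (reader_fn, ratio) pairs in proportion to ratio."""
    rng = random.Random(seed)
    streams = [repeat(reader_fn) for reader_fn, _ in generators]
    ratios = [ratio for _, ratio in generators]
    total = float(sum(ratios))
    weights = [ratio / total for ratio in ratios]
    drawn = []
    for _ in range(num_batches):
        # Pick the task index first, then take that task's next batch.
        idx = rng.choices(range(len(streams)), weights=weights)[0]
        drawn.append((idx, next(streams[idx])))
    return drawn


def mrc_reader():
    return iter(["mrc-0", "mrc-1"])


def mlm_reader():
    return iter(["mlm-0"])


batches = sample_tasks([(mrc_reader, 3.0), (mlm_reader, 1.0)], num_batches=8)
for task_id, batch in batches:
    print(task_id, batch)
```

Because every reader is wrapped in `repeat`, a small auxiliary dataset is cycled as often as needed instead of ending the joint stream early.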
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/start.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/start.sh
+export FLAGS_fraction_of_gpu_memory_to_use=0.1
+python start_service.py ./infer_model 5118 &
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/start_service.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/start_service.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+"""Provide MRC service for TOP1 short answer extraction system
+Note the services here share some global pre/post process objects, which
+are **NOT THREAD SAFE**. Try to use multi-process instead of multi-thread
+for deployment.
+"""
+import json
+import sys
+import logging
+logging.basicConfig(
+    level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+import requests
+from flask import Flask
+from flask import Response
+from flask import request
+import mrc_service
+import model_wrapper
+import argparse
+assert len(sys.argv) == 3 or len(sys.argv) == 4, "Usage: python start_service.py <model_dir> <port> [process_mode]"
+if len(sys.argv) == 3:
+    _, model_dir, port = sys.argv
+    mode = 'parallel'
+else:
+    _, model_dir, port, mode = sys.argv
+max_batch_size = 5
+app = Flask(__name__)
+app.logger.setLevel(logging.INFO)
+model = model_wrapper.BertModelWrapper(model_dir=model_dir)
+server = mrc_service.MRQAService('MRQA service', app.logger)
+@app.route('/', methods=['POST'])
+def mrqa_service():
+    """Handle a POST request by running the shared BERT model through the MRQA service."""
+    return server(model, process_mode=mode, max_batch_size=max_batch_size)
+if __name__ == '__main__':
+    # sys.argv yields the port as a string; convert it before handing it to Flask
+    app.run(port=int(port), debug=False, threaded=False, processes=1)
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/task_reader/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/task_reader/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/task_reader/batching.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/task_reader/batching.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Mask, padding and batching."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import numpy as np
+def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
+    """
+    Mask tokens in batch_tokens and return (batch_tokens, mask_label, mask_pos).
+    Note: mask_pos holds flattened indices into the batch after padding.
+    """
+    max_len = max([len(sent) for sent in batch_tokens])
+    mask_label = []
+    mask_pos = []
+    prob_mask = np.random.rand(total_token_num)
+    # Note: the first token is [CLS], so [low=1]
+    replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)
+    pre_sent_len = 0
+    prob_index = 0
+    for sent_index, sent in enumerate(batch_tokens):
+        mask_flag = False
+        prob_index += pre_sent_len
+        for token_index, token in enumerate(sent):
+            prob = prob_mask[prob_index + token_index]
+            if prob > 0.15:
+                continue
+            elif 0.03 < prob <= 0.15:
+                # mask
+                if token != SEP and token != CLS:
+                    mask_label.append(sent[token_index])
+                    sent[token_index] = MASK
+                    mask_flag = True
+                    mask_pos.append(sent_index * max_len + token_index)
+            elif 0.015 < prob <= 0.03:
+                # random replace
+                if token != SEP and token != CLS:
+                    mask_label.append(sent[token_index])
+                    sent[token_index] = replace_ids[prob_index + token_index]
+                    mask_flag = True
+                    mask_pos.append(sent_index * max_len + token_index)
+            else:
+                # keep the original token
+                if token != SEP and token != CLS:
+                    mask_label.append(sent[token_index])
+                    mask_pos.append(sent_index * max_len + token_index)
+        pre_sent_len = len(sent)
+        # ensure that at least one word is masked in each sentence
+        while not mask_flag:
+            token_index = int(np.random.randint(1, high=len(sent) - 1, size=1))
+            if sent[token_index] != SEP and sent[token_index] != CLS:
+                mask_label.append(sent[token_index])
+                sent[token_index] = MASK
+                mask_flag = True
+                mask_pos.append(sent_index * max_len + token_index)
+    mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
+    mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
+    return batch_tokens, mask_label, mask_pos
+def prepare_batch_data(insts,
+                       total_token_num,
+                       max_len=None,
+                       voc_size=0,
+                       pad_id=None,
+                       cls_id=None,
+                       sep_id=None,
+                       mask_id=-1,  # negative disables masking; `None >= 0` raises TypeError on Python 3
+                       return_input_mask=True,
+                       return_max_len=True,
+                       return_num_token=False):
+    """
+    1. generate Tensor of data
+    2. generate Tensor of position
+    3. generate self attention mask, [shape: batch_size *  max_len * max_len]
+    """
+    batch_src_ids = [inst[0] for inst in insts]
+    batch_sent_ids = [inst[1] for inst in insts]
+    batch_pos_ids = [inst[2] for inst in insts]
+    labels_list = []
+    # compatible with mrqa, whose example includes start/end positions, 
+    # or unique id
+    for i in range(3, len(insts[0]), 1):
+        labels = [inst[i] for inst in insts]
+        labels = np.array(labels).astype("int64").reshape([-1, 1])
+        labels_list.append(labels)
+    # First step: do mask without padding
+    if mask_id >= 0:
+        out, mask_label, mask_pos = mask(
+            batch_src_ids,
+            total_token_num,
+            vocab_size=voc_size,
+            CLS=cls_id,
+            SEP=sep_id,
+            MASK=mask_id)
+    else:
+        out = batch_src_ids
+    # Second step: padding
+    src_id, self_input_mask = pad_batch_data(
+        out, 
+        max_len=max_len,
+        pad_idx=pad_id, return_input_mask=True)
+    pos_id = pad_batch_data(
+        batch_pos_ids,
+        max_len=max_len,
+        pad_idx=pad_id,
+        return_pos=False,
+        return_input_mask=False)
+    sent_id = pad_batch_data(
+        batch_sent_ids,
+        max_len=max_len,
+        pad_idx=pad_id,
+        return_pos=False,
+        return_input_mask=False)
+    if mask_id >= 0:
+        return_list = [
+            src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
+        ] + labels_list
+    else:
+        return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list
+    return return_list if len(return_list) > 1 else return_list[0]
+def pad_batch_data(insts,
+                   max_len=None,
+                   pad_idx=0,
+                   return_pos=False,
+                   return_input_mask=False,
+                   return_max_len=False,
+                   return_num_token=False):
+    """
+    Pad the instances to the max sequence length in batch, and generate the
+    corresponding position data and input mask.
+    """
+    return_list = []
+    if max_len is None:
+        max_len = max(len(inst) for inst in insts)
+    # Any token in the vocabulary can be used for padding, since the loss on
+    # padded positions is masked out by weights and has no effect on gradients.
+    inst_data = np.array([
+        list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts
+    ])
+    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
+    # position data
+    if return_pos:
+        inst_pos = np.array([
+            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
+            for inst in insts
+        ])
+        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
+    if return_input_mask:
+        # This is used to avoid attention on paddings.
+        input_mask_data = np.array([[1] * len(inst) + [0] *
+                                    (max_len - len(inst)) for inst in insts])
+        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
+        return_list += [input_mask_data.astype("float32")]
+    if return_max_len:
+        return_list += [max_len]
+    if return_num_token:
+        num_token = 0
+        for inst in insts:
+            num_token += len(inst)
+        return_list += [num_token]
+    return return_list if len(return_list) > 1 else return_list[0]
+if __name__ == "__main__":
+    pass
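Note on `pad_batch_data`: the function right-pads every instance to a common length and builds a mask with 1 for real tokens and 0 for padding. The sketch below shows the same idea on plain lists. `pad_with_mask` is a hypothetical helper written for this note, and it skips the NumPy conversion and the reshape to `[-1, max_len, 1]` that the real function performs.

```python
def pad_with_mask(insts, pad_idx=0, max_len=None):
    """Right-pad token id lists and build the matching input mask."""
    if max_len is None:
        max_len = max(len(inst) for inst in insts)
    padded = [list(inst) + [pad_idx] * (max_len - len(inst)) for inst in insts]
    # 1.0 marks a real token, 0.0 marks padding that attention should ignore.
    mask = [[1.0] * len(inst) + [0.0] * (max_len - len(inst)) for inst in insts]
    return padded, mask


padded, input_mask = pad_with_mask([[101, 7, 8, 102], [101, 9, 102]])
print(padded)
print(input_mask)
```

The shorter sequence gains one trailing `pad_idx`, and its mask row ends in `0.0` at the same position.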
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/task_reader/mrqa.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/task_reader/mrqa.py
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Run MRQA"""
+import re
+import six
+import math
+import json
+import random
+import collections
+import numpy as np
+import tokenization
+from batching import prepare_batch_data
+class MRQAExample(object):
+    """A single training/test example for MRQA span extraction.
+    For examples without an answer, the start and end position are -1.
+    """
+    def __init__(self,
+                 qas_id,
+                 question_text,
+                 doc_tokens,
+                 orig_answer_text=None,
+                 start_position=None,
+                 end_position=None,
+                 is_impossible=False):
+        self.qas_id = qas_id
+        self.question_text = question_text
+        self.doc_tokens = doc_tokens
+        self.orig_answer_text = orig_answer_text
+        self.start_position = start_position
+        self.end_position = end_position
+        self.is_impossible = is_impossible
+    def __str__(self):
+        return self.__repr__()
+    def __repr__(self):
+        s = ""
+        s += "qas_id: %s" % (tokenization.printable_text(self.qas_id))
+        s += ", question_text: %s" % (
+            tokenization.printable_text(self.question_text))
+        s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens))
+        # compare against None so that a valid position of 0 is still printed
+        if self.start_position is not None:
+            s += ", start_position: %d" % (self.start_position)
+        if self.end_position is not None:
+            s += ", end_position: %d" % (self.end_position)
+        if self.start_position is not None:
+            s += ", is_impossible: %r" % (self.is_impossible)
+        return s
+class InputFeatures(object):
+    """A single set of features of data."""
+    def __init__(self,
+                 unique_id,
+                 example_index,
+                 doc_span_index,
+                 tokens,
+                 token_to_orig_map,
+                 token_is_max_context,
+                 input_ids,
+                 input_mask,
+                 segment_ids,
+                 start_position=None,
+                 end_position=None,
+                 is_impossible=None):
+        self.unique_id = unique_id
+        self.example_index = example_index
+        self.doc_span_index = doc_span_index
+        self.tokens = tokens
+        self.token_to_orig_map = token_to_orig_map
+        self.token_is_max_context = token_is_max_context
+        self.input_ids = input_ids
+        self.input_mask = input_mask
+        self.segment_ids = segment_ids
+        self.start_position = start_position
+        self.end_position = end_position
+        self.is_impossible = is_impossible
+def read_mrqa_examples(sample, is_training=False, with_negative=False):
+    """Convert one parsed MRQA sample (a dict with "context" and "qas") into a list of MRQAExample."""
+    def is_whitespace(c):
+        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
+            return True
+        return False
+    examples = []
+    # sample = json.loads(raw_sample)
+    paragraph_text = sample["context"]
+    paragraph_text = re.sub(r'\[TLE\]|\[DOC\]|\[PAR\]', '[SEP]', paragraph_text)
+    doc_tokens = []
+    char_to_word_offset = []
+    prev_is_whitespace = True
+    for c in paragraph_text:
+        if is_whitespace(c):
+            prev_is_whitespace = True
+        else:
+            if prev_is_whitespace:
+                doc_tokens.append(c)
+            else:
+                doc_tokens[-1] += c
+            prev_is_whitespace = False
+        char_to_word_offset.append(len(doc_tokens) - 1)
+    for qa in sample["qas"]:
+        qas_id = qa["qid"]
+        question_text = qa["question"]
+        start_position = None
+        end_position = None
+        orig_answer_text = None
+        is_impossible = False
+        example = MRQAExample(
+            qas_id=qas_id,
+            question_text=question_text,
+            doc_tokens=doc_tokens)
+        examples.append(example)
+    return examples
+def convert_examples_to_features(
+        examples,
+        tokenizer,
+        max_seq_length,
+        doc_stride,
+        max_query_length,
+        is_training,
+        examples_start_id=0,
+        features_start_id=1000000000
+        #output_fn
+):
+    """Convert a list of MRQAExample into a list of InputFeatures."""
+    unique_id = features_start_id
+    example_index = examples_start_id
+    features = []
+    for example in examples:
+        query_tokens = tokenizer.tokenize(example.question_text)
+        if len(query_tokens) > max_query_length:
+            query_tokens = query_tokens[0:max_query_length]
+        tok_to_orig_index = []
+        orig_to_tok_index = []
+        all_doc_tokens = []
+        for (i, token) in enumerate(example.doc_tokens):
+            orig_to_tok_index.append(len(all_doc_tokens))
+            sub_tokens = tokenizer.tokenize(token)
+            for sub_token in sub_tokens:
+                tok_to_orig_index.append(i)
+                all_doc_tokens.append(sub_token)
+        tok_start_position = None
+        tok_end_position = None
+        if is_training and example.is_impossible:
+            tok_start_position = -1
+            tok_end_position = -1
+        if is_training and not example.is_impossible:
+            tok_start_position = orig_to_tok_index[example.start_position]
+            if example.end_position < len(example.doc_tokens) - 1:
+                tok_end_position = orig_to_tok_index[example.end_position +
+                                                     1] - 1
+            else:
+                tok_end_position = len(all_doc_tokens) - 1
+            (tok_start_position, tok_end_position) = _improve_answer_span(
+                all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
+                example.orig_answer_text)
+        # The -3 accounts for [CLS], [SEP] and [SEP]
+        max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
+        # We can have documents that are longer than the maximum sequence length.
+        # To deal with this we use a sliding window approach, where we take chunks
+        # of up to our max length with a stride of `doc_stride`.
+        _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
+            "DocSpan", ["start", "length"])
+        doc_spans = []
+        start_offset = 0
+        while start_offset < len(all_doc_tokens):
+            length = len(all_doc_tokens) - start_offset
+            if length > max_tokens_for_doc:
+                length = max_tokens_for_doc
+            doc_spans.append(_DocSpan(start=start_offset, length=length))
+            if start_offset + length == len(all_doc_tokens):
+                break
+            start_offset += min(length, doc_stride)
+        for (doc_span_index, doc_span) in enumerate(doc_spans):
+            tokens = []
+            token_to_orig_map = {}
+            token_is_max_context = {}
+            segment_ids = []
+            tokens.append("[CLS]")
+            segment_ids.append(0)
+            for token in query_tokens:
+                tokens.append(token)
+                segment_ids.append(0)
+            tokens.append("[SEP]")
+            segment_ids.append(0)
+            for i in range(doc_span.length):
+                split_token_index = doc_span.start + i
+                token_to_orig_map[len(tokens)] = tok_to_orig_index[
+                    split_token_index]
+                is_max_context = _check_is_max_context(
+                    doc_spans, doc_span_index, split_token_index)
+                token_is_max_context[len(tokens)] = is_max_context
+                tokens.append(all_doc_tokens[split_token_index])
+                segment_ids.append(1)
+            tokens.append("[SEP]")
+            segment_ids.append(1)
+            input_ids = tokenizer.convert_tokens_to_ids(tokens)
+            # The mask has 1 for real tokens and 0 for padding tokens. Only real
+            # tokens are attended to.
+            input_mask = [1] * len(input_ids)
+            # Zero-pad up to the sequence length.
+            #while len(input_ids) < max_seq_length:
+            #  input_ids.append(0)
+            #  input_mask.append(0)
+            #  segment_ids.append(0)
+            #assert len(input_ids) == max_seq_length
+            #assert len(input_mask) == max_seq_length
+            #assert len(segment_ids) == max_seq_length
+            start_position = None
+            end_position = None
+            if is_training and not example.is_impossible:
+                # For training, if our document chunk does not contain an annotation
+                # we throw it out, since there is nothing to predict.
+                doc_start = doc_span.start
+                doc_end = doc_span.start + doc_span.length - 1
+                out_of_span = False
+                if not (tok_start_position >= doc_start and
+                        tok_end_position <= doc_end):
+                    out_of_span = True
+                if out_of_span:
+                    start_position = 0
+                    end_position = 0
+                    continue
+                else:
+                    doc_offset = len(query_tokens) + 2
+                    start_position = tok_start_position - doc_start + doc_offset
+                    end_position = tok_end_position - doc_start + doc_offset
+            """
+            if is_training and example.is_impossible:
+                start_position = 0
+                end_position = 0
+            """
+            feature = InputFeatures(
+                unique_id=unique_id,
+                example_index=example_index,
+                doc_span_index=doc_span_index,
+                tokens=tokens,
+                token_to_orig_map=token_to_orig_map,
+                token_is_max_context=token_is_max_context,
+                input_ids=input_ids,
+                input_mask=input_mask,
+                segment_ids=segment_ids,
+                start_position=start_position,
+                end_position=end_position,
+                is_impossible=example.is_impossible)
+            unique_id += 1
+            features.append(feature)
+        example_index += 1
+    return features
+def estimate_runtime_examples(data_path, sample_rate, tokenizer, \
+                              max_seq_length, doc_stride, max_query_length, \
+                              remove_impossible_questions=True, filter_invalid_spans=True):
+    """Count runtime examples, which may differ from the number of raw samples because of the sliding-window operation, etc. This is useful for computing the correct number of warmup steps for training."""
+    assert sample_rate > 0.0 and sample_rate <= 1.0, "sample_rate must be set between 0.0~1.0"
+    print("loading data with json parser...")
+    with open(data_path, "r") as reader:
+        data = json.load(reader)["data"]
+    num_raw_examples = 0
+    for entry in data:
+        for paragraph in entry["paragraphs"]:
+            paragraph_text = paragraph["context"]
+            for qa in paragraph["qas"]:
+                num_raw_examples += 1
+    print("num raw examples:{}".format(num_raw_examples))
+    def is_whitespace(c):
+        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
+            return True
+        return False
+    sampled_examples = []
+    for entry in data:
+        for paragraph in entry["paragraphs"]:
+            doc_tokens = None
+            for qa in paragraph["qas"]:
+                if random.random() > sample_rate and sample_rate < 1.0:
+                    continue
+                if doc_tokens is None:
+                    paragraph_text = paragraph["context"]
+                    doc_tokens = []
+                    char_to_word_offset = []
+                    prev_is_whitespace = True
+                    for c in paragraph_text:
+                        if is_whitespace(c):
+                            prev_is_whitespace = True
+                        else:
+                            if prev_is_whitespace:
+                                doc_tokens.append(c)
+                            else:
+                                doc_tokens[-1] += c
+                            prev_is_whitespace = False
+                        char_to_word_offset.append(len(doc_tokens) - 1)
+                assert len(qa["answers"]) == 1, "For training, each question should have exactly 1 answer."
+                qas_id = qa["id"]
+                question_text = qa["question"]
+                start_position = None
+                end_position = None
+                orig_answer_text = None
+                is_impossible = False
+                if ('is_impossible' in qa) and (qa["is_impossible"]):
+                    if remove_impossible_questions or filter_invalid_spans:
+                        continue
+                    else:
+                        start_position = -1
+                        end_position = -1
+                        orig_answer_text = ""
+                        is_impossible = True
+                else:
+                    answer = qa["answers"][0]
+                    orig_answer_text = answer["text"]
+                    answer_offset = answer["answer_start"]
+                    answer_length = len(orig_answer_text)
+                    start_position = char_to_word_offset[answer_offset]
+                    end_position = char_to_word_offset[answer_offset +
+                                                       answer_length - 1]
+                    # remove corrupt samples
+                    actual_text = " ".join(doc_tokens[start_position:(
+                        end_position + 1)])
+                    cleaned_answer_text = " ".join(
+                        tokenization.whitespace_tokenize(orig_answer_text))
+                    if actual_text.find(cleaned_answer_text) == -1:
+                        print("Could not find answer: '%s' vs. '%s'" %
+                              (actual_text, cleaned_answer_text))
+                        continue
+                example = MRQAExample(
+                    qas_id=qas_id,
+                    question_text=question_text,
+                    doc_tokens=doc_tokens,
+                    orig_answer_text=orig_answer_text,
+                    start_position=start_position,
+                    end_position=end_position,
+                    is_impossible=is_impossible)
+                sampled_examples.append(example)
+    runtime_sample_rate = len(sampled_examples) / float(num_raw_examples)
+    # print("DEBUG-> runtime sampled examples: {}, sample rate: {}.".format(len(sampled_examples), runtime_sample_rate))
+    runtime_samp_cnt = 0
+    for example in sampled_examples:
+        query_tokens = tokenizer.tokenize(example.question_text)
+        if len(query_tokens) > max_query_length:
+            query_tokens = query_tokens[0:max_query_length]
+        tok_to_orig_index = []
+        orig_to_tok_index = []
+        all_doc_tokens = []
+        for (i, token) in enumerate(example.doc_tokens):
+            orig_to_tok_index.append(len(all_doc_tokens))
+            sub_tokens = tokenizer.tokenize(token)
+            for sub_token in sub_tokens:
+                tok_to_orig_index.append(i)
+                all_doc_tokens.append(sub_token)
+        tok_start_position = None
+        tok_end_position = None
+        tok_start_position = orig_to_tok_index[example.start_position]
+        if example.end_position < len(example.doc_tokens) - 1:
+            tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
+        else:
+            tok_end_position = len(all_doc_tokens) - 1
+        (tok_start_position, tok_end_position) = _improve_answer_span(
+            all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
+            example.orig_answer_text)
+        # The -3 accounts for [CLS], [SEP] and [SEP]
+        max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
+        _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
+            "DocSpan", ["start", "length"])
+        doc_spans = []
+        start_offset = 0
+        while start_offset < len(all_doc_tokens):
+            length = len(all_doc_tokens) - start_offset
+            if length > max_tokens_for_doc:
+                length = max_tokens_for_doc
+            doc_spans.append(_DocSpan(start=start_offset, length=length))
+            if start_offset + length == len(all_doc_tokens):
+                break
+            start_offset += min(length, doc_stride)
+        for (doc_span_index, doc_span) in enumerate(doc_spans):
+            doc_start = doc_span.start
+            doc_end = doc_span.start + doc_span.length - 1
+            if filter_invalid_spans and not (tok_start_position >= doc_start and tok_end_position <= doc_end):
+                continue
+            runtime_samp_cnt += 1
+    return int(runtime_samp_cnt/runtime_sample_rate)
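The sliding-window split used above can be illustrated in isolation. The following is a minimal sketch, not part of the commit: `make_doc_spans` is a hypothetical helper name, and it works on a token count rather than a real tokenized document.

```python
import collections

# Same shape as the _DocSpan namedtuple used in the diff.
DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def make_doc_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Split a document into overlapping windows (hypothetical helper)."""
    doc_spans = []
    start_offset = 0
    while start_offset < num_doc_tokens:
        # Each window holds at most max_tokens_for_doc tokens.
        length = min(num_doc_tokens - start_offset, max_tokens_for_doc)
        doc_spans.append(DocSpan(start=start_offset, length=length))
        # Stop once a window reaches the end of the document.
        if start_offset + length == num_doc_tokens:
            break
        # Advance by the stride, so consecutive windows overlap.
        start_offset += min(length, doc_stride)
    return doc_spans


spans = make_doc_spans(num_doc_tokens=10, max_tokens_for_doc=4, doc_stride=2)
print(spans)
```

With a stride smaller than the window size, one raw example produces several spans, which is why the runtime example count can exceed the raw sample count.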
+def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
+                         orig_answer_text):
+    """Returns tokenized answer spans that better match the annotated answer."""
+    # The MRQA annotations are character based. We first project them to
+    # whitespace-tokenized words. But then after WordPiece tokenization, we can
+    # often find a "better match". For example:
+    #
+    #   Question: What year was John Smith born?
+    #   Context: The leader was John Smith (1895-1943).
+    #   Answer: 1895
+    #
+    # The original whitespace-tokenized answer will be "(1895-1943).". However
+    # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match
+    # the exact answer, 1895.
+    #
+    # However, this is not always possible. Consider the following:
+    #
+    #   Question: What country is the top exporter of electronics?
+    #   Context: The Japanese electronics industry is the largest in the world.
+    #   Answer: Japan
+    #
+    # In this case, the annotator chose "Japan" as a character sub-span of
+    # the word "Japanese". Since our WordPiece tokenizer does not split
+    # "Japanese", we just use "Japanese" as the annotation. This is fairly rare
+    # in MRQA, but does happen.
+    tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
+    for new_start in range(input_start, input_end + 1):
+        for new_end in range(input_end, new_start - 1, -1):
+            text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
+            if text_span == tok_answer_text:
+                return (new_start, new_end)
+    return (input_start, input_end)
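The "John Smith" case from the comments can be reproduced with a small stand-alone sketch. It is not part of the commit: `ToyTokenizer` is an assumed regex-based stand-in for the WordPiece tokenizer (it only splits off punctuation), and `improve_answer_span` restates the search loop so the example runs on its own.

```python
import re


class ToyTokenizer(object):
    """Assumed stand-in tokenizer: splits words and punctuation only."""

    def tokenize(self, text):
        return re.findall(r"\w+|[^\w\s]", text)


def improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
                        orig_answer_text):
    """Search inside [input_start, input_end] for an exact token match."""
    tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
    for new_start in range(input_start, input_end + 1):
        for new_end in range(input_end, new_start - 1, -1):
            text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
            if text_span == tok_answer_text:
                return (new_start, new_end)
    # No exact match: fall back to the original (coarser) span.
    return (input_start, input_end)


tokenizer = ToyTokenizer()
doc_tokens = tokenizer.tokenize("The leader was John Smith (1895-1943).")
# The whitespace word "(1895-1943)." covers sub-tokens 5..10.
span = improve_answer_span(doc_tokens, 5, 10, tokenizer, "1895")
print(span, doc_tokens[span[0]:span[1] + 1])
```

The span narrows from the whole word to the single sub-token "1895". For the "Japan" vs. "Japanese" case no sub-span matches, so the input span is returned unchanged.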
+def _check_is_max_context(doc_spans, cur_span_index, position):
+    """Check if this is the 'max context' doc span for the token."""
+    # Because of the sliding window approach taken to scoring documents, a single
+    # token can appear in multiple documents. E.g.
+    #  Doc: the man went to the store and bought a gallon of milk
+    #  Span A: the man went to the
+    #  Span B: to the store and bought
+    #  Span C: and bought a gallon of
+    #  ...
+    #
+    # Now the word 'bought' will have two scores from spans B and C. We only
+    # want to consider the score with "maximum context", which we define as
+    # the *minimum* of its left and right context (the *sum* of left and
+    # right context will always be the same, of course).
+    #
+    # In the example the maximum context for 'bought' would be span C since
+    # it has 1 left context and 3 right context, while span B has 4 left context
+    # and 0 right context.
+    best_score = None
+    best_span_index = None
+    for (span_index, doc_span) in enumerate(doc_spans):
+        end = doc_span.start + doc_span.length - 1
+        if position < doc_span.start:
+            continue
+        if position > end:
+            continue
+        num_left_context = position - doc_span.start
+        num_right_context = end - position
+        score = min(num_left_context,
+                    num_right_context) + 0.01 * doc_span.length
+        if best_score is None or score > best_score:
+            best_score = score
+            best_span_index = span_index
+    return cur_span_index == best_span_index
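The 'bought' example from the comments can be checked numerically. This is a minimal sketch under stated assumptions: `max_context_span_index` is a hypothetical helper that applies the same scoring rule but returns the winning span index instead of a boolean, and the three spans are hard-coded to match Span A, B and C.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def max_context_span_index(doc_spans, position):
    """Return the index of the span giving `position` the most context."""
    best_score, best_index = None, None
    for index, span in enumerate(doc_spans):
        end = span.start + span.length - 1
        if position < span.start or position > end:
            continue  # the token is not inside this span
        left_context = position - span.start
        right_context = end - position
        # Minimum of left/right context, plus a small bonus for longer spans.
        score = min(left_context, right_context) + 0.01 * span.length
        if best_score is None or score > best_score:
            best_score, best_index = score, index
    return best_index


doc = "the man went to the store and bought a gallon of milk".split()
# Span A: tokens 0-4, Span B: tokens 3-7, Span C: tokens 6-10.
doc_spans = [DocSpan(0, 5), DocSpan(3, 5), DocSpan(6, 5)]
position = doc.index("bought")
best = max_context_span_index(doc_spans, position)
print(position, best)
```

Span B scores min(4, 0) + 0.05 and Span C scores min(1, 3) + 0.05, so Span C (index 2) is selected, as the comment states.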
+class DataProcessor(object):
+    def __init__(self, vocab_path, do_lower_case, max_seq_length, in_tokens,
+                 doc_stride, max_query_length):
+        self._tokenizer = tokenization.FullTokenizer(
+            vocab_file=vocab_path, do_lower_case=do_lower_case)
+        self._max_seq_length = max_seq_length
+        self._doc_stride = doc_stride
+        self._max_query_length = max_query_length
+        self._in_tokens = in_tokens
+        self.vocab = self._tokenizer.vocab
+        self.vocab_size = len(self.vocab)
+        self.pad_id = self.vocab["[PAD]"]
+        self.cls_id = self.vocab["[CLS]"]
+        self.sep_id = self.vocab["[SEP]"]
+        self.mask_id = self.vocab["[MASK]"]
+        self.current_train_example = -1
+        self.num_train_examples = -1
+        self.current_train_epoch = -1
+        self.train_examples = None
+        self.num_examples = {'train': -1, 'predict': -1}
+    def get_train_progress(self):
+        """Gets progress for training phase."""
+        return self.current_train_example, self.current_train_epoch
+    def get_examples(self,
+                     data_path,
+                     is_training,
+                     with_negative=False):
+        examples = read_mrqa_examples(
+            input_file=data_path,
+            is_training=is_training,
+            with_negative=with_negative)
+        return examples
+    def get_num_examples(self, phase):
+        if phase not in ['train', 'predict']:
+            raise ValueError(
+                "Unknown phase, which should be in ['train', 'predict'].")
+        return self.num_examples[phase]
+    def estimate_runtime_examples(self, data_path, sample_rate=0.01, \
+                                 remove_impossible_questions=True, filter_invalid_spans=True):
+        """Note that this API only supports the training phase."""
+        return estimate_runtime_examples(data_path, sample_rate, self._tokenizer, \
+                                  self._max_seq_length, self._doc_stride, self._max_query_length, \
+                                  remove_impossible_questions=remove_impossible_questions, \
+                                  filter_invalid_spans=filter_invalid_spans)
+    def get_features(self, examples, is_training, examples_start_id, features_start_id):
+        features = convert_examples_to_features(
+            examples=examples,
+            tokenizer=self._tokenizer,
+            max_seq_length=self._max_seq_length,
+            doc_stride=self._doc_stride,
+            max_query_length=self._max_query_length,
+            examples_start_id=examples_start_id,
+            features_start_id=features_start_id,
+            is_training=is_training)
+        return features
+    def data_generator(self,
+                       raw_samples,
+                       batch_size,
+                       max_len=None,
+                       phase='predict',
+                       shuffle=False,
+                       dev_count=1,
+                       with_negative=False,
+                       epoch=1,
+                       examples_start_id=0,
+                       features_start_id=1000000000):
+        examples = read_mrqa_examples(raw_samples)
+        def batch_reader(features, batch_size, in_tokens):
+            batch, total_token_num, max_len = [], 0, 0
+            for (index, feature) in enumerate(features):
+                if phase == 'train':
+                    self.current_train_example = index + 1
+                seq_len = len(feature.input_ids)
+                labels = [feature.unique_id
+                          ] if feature.start_position is None else [
+                              feature.start_position, feature.end_position
+                          ]
+                example = [
+                    feature.input_ids, feature.segment_ids, range(seq_len)
+                ] + labels
+                max_len = max(max_len, seq_len)
+                #max_len = max(max_len, len(token_ids))
+                if in_tokens:
+                    to_append = (len(batch) + 1) * max_len <= batch_size
+                else:
+                    to_append = len(batch) < batch_size
+                if to_append:
+                    batch.append(example)
+                    total_token_num += seq_len
+                else:
+                    yield batch, total_token_num
+                    batch, total_token_num, max_len = [example
+                                                       ], seq_len, seq_len
+            if len(batch) > 0:
+                yield batch, total_token_num
+        features = self.get_features(examples, is_training=False, examples_start_id=examples_start_id, features_start_id=features_start_id)
+        all_dev_batches = []
+        for batch_data, total_token_num in batch_reader(
+                features, batch_size, self._in_tokens):
+            batch_data = prepare_batch_data(
+                batch_data,
+                total_token_num,
+                max_len=max_len,
+                voc_size=-1,
+                pad_id=self.pad_id,
+                cls_id=self.cls_id,
+                sep_id=self.sep_id,
+                mask_id=-1,
+                return_input_mask=True,
+                return_max_len=False,
+                return_num_token=False)
+            all_dev_batches.append(batch_data)
+        return examples, features, all_dev_batches
+def get_answers(all_examples, all_features, all_results, n_best_size,
+                      max_answer_length, do_lower_case,
+                      verbose=False):
+    """Compute the final predictions and the n-best candidate list for each example and return both."""
+    example_index_to_features = collections.defaultdict(list)
+    for feature in all_features:
+        example_index_to_features[feature.example_index].append(feature)
+    unique_id_to_result = {}
+    for result in all_results:
+        unique_id_to_result[result.unique_id] = result
+    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name
+        "PrelimPrediction", [
+            "feature_index", "start_index", "end_index", "start_logit",
+            "end_logit"
+        ])
+    all_predictions = collections.OrderedDict()
+    all_nbest_json = collections.OrderedDict()
+    scores_diff_json = collections.OrderedDict()
+    for (example_index, example) in enumerate(all_examples):
+        features = example_index_to_features[example_index]
+        prelim_predictions = []
+        # keep track of the minimum score of null start+end of position 0
+        score_null = 1000000  # large and positive
+        min_null_feature_index = 0  # the paragraph slice with min null score
+        null_start_logit = 0  # the start logit at the slice with min null score
+        null_end_logit = 0  # the end logit at the slice with min null score
+        for (feature_index, feature) in enumerate(features):
+            result = unique_id_to_result[feature.unique_id]
+            start_indexes = _get_best_indexes(result.start_logits, n_best_size)
+            end_indexes = _get_best_indexes(result.end_logits, n_best_size)
+            # if we could have irrelevant answers, get the min score of irrelevant
+            for start_index in start_indexes:
+                for end_index in end_indexes:
+                    # We could hypothetically create invalid predictions, e.g., predict
+                    # that the start of the span is in the question. We throw out all
+                    # invalid predictions.
+                    if start_index >= len(feature.tokens):
+                        continue
+                    if end_index >= len(feature.tokens):
+                        continue
+                    if start_index not in feature.token_to_orig_map:
+                        continue
+                    if end_index not in feature.token_to_orig_map:
+                        continue
+                    if not feature.token_is_max_context.get(start_index, False):
+                        continue
+                    if end_index < start_index:
+                        continue
+                    length = end_index - start_index + 1
+                    if length > max_answer_length:
+                        continue
+                    prelim_predictions.append(
+                        _PrelimPrediction(
+                            feature_index=feature_index,
+                            start_index=start_index,
+                            end_index=end_index,
+                            start_logit=result.start_logits[start_index],
+                            end_logit=result.end_logits[end_index]))
+        prelim_predictions = sorted(
+            prelim_predictions,
+            key=lambda x: (x.start_logit + x.end_logit),
+            reverse=True)
+        _NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name
+            "NbestPrediction", ["text", "start_logit", "end_logit"])
+        seen_predictions = {}
+        nbest = []
+        for pred in prelim_predictions:
+            if len(nbest) >= n_best_size:
+                break
+            feature = features[pred.feature_index]
+            if pred.start_index > 0:  # this is a non-null prediction
+                tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1
+                                                              )]
+                orig_doc_start = feature.token_to_orig_map[pred.start_index]
+                orig_doc_end = feature.token_to_orig_map[pred.end_index]
+                orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end +
+                                                                 1)]
+                tok_text = " ".join(tok_tokens)
+                # De-tokenize WordPieces that have been split off.
+                tok_text = tok_text.replace(" ##", "")
+                tok_text = tok_text.replace("##", "")
+                # Clean whitespace
+                tok_text = tok_text.strip()
+                tok_text = " ".join(tok_text.split())
+                orig_text = " ".join(orig_tokens)
+                final_text = get_final_text(tok_text, orig_text, do_lower_case,
+                                            verbose)
+                if final_text in seen_predictions:
+                    continue
+                seen_predictions[final_text] = True
+            else:
+                final_text = ""
+                seen_predictions[final_text] = True
+            nbest.append(
+                _NbestPrediction(
+                    text=final_text,
+                    start_logit=pred.start_logit,
+                    end_logit=pred.end_logit))
+        # In very rare edge cases we could have no valid predictions. So we
+        # just create a nonce prediction in this case to avoid failure.
+        if not nbest:
+            nbest.append(
+                _NbestPrediction(
+                    text="empty", start_logit=0.0, end_logit=0.0))
+        assert len(nbest) >= 1
+        total_scores = []
+        best_non_null_entry = None
+        for entry in nbest:
+            total_scores.append(entry.start_logit + entry.end_logit)
+            if not best_non_null_entry:
+                if entry.text:
+                    best_non_null_entry = entry
+        # Debug aid: every n-best entry was a null (empty) prediction.
+        if best_non_null_entry is None:
+            print("Warning: no non-null prediction found for this example")
+        probs = _compute_softmax(total_scores)
+        nbest_json = []
+        for (i, entry) in enumerate(nbest):
+            output = collections.OrderedDict()
+            output["text"] = entry.text
+            output["probability"] = probs[i]
+            output["start_logit"] = entry.start_logit
+            output["end_logit"] = entry.end_logit
+            nbest_json.append(output)
+        assert len(nbest_json) >= 1
+        all_predictions[example.qas_id] = nbest_json[0]["text"]
+        all_nbest_json[example.qas_id] = nbest_json
+    return all_predictions, all_nbest_json
+def get_final_text(pred_text, orig_text, do_lower_case, verbose):
+    """Project the tokenized prediction back to the original text."""
+    # When we created the data, we kept track of the alignment between original
+    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So
+    # now `orig_text` contains the span of our original text corresponding to the
+    # span that we predicted.
+    #
+    # However, `orig_text` may contain extra characters that we don't want in
+    # our prediction.
+    #
+    # For example, let's say:
+    #   pred_text = steve smith
+    #   orig_text = Steve Smith's
+    #
+    # We don't want to return `orig_text` because it contains the extra "'s".
+    #
+    # We don't want to return `pred_text` because it's already been normalized
+    # (the MRQA eval script also does punctuation stripping/lower casing but
+    # our tokenizer does additional normalization like stripping accent
+    # characters).
+    #
+    # What we really want to return is "Steve Smith".
+    #
+    # Therefore, we have to apply a semi-complicated alignment heuristic between
+    # `pred_text` and `orig_text` to get a character-to-character alignment. This
+    # can fail in certain cases in which case we just return `orig_text`.
+    def _strip_spaces(text):
+        ns_chars = []
+        ns_to_s_map = collections.OrderedDict()
+        for (i, c) in enumerate(text):
+            if c == " ":
+                continue
+            ns_to_s_map[len(ns_chars)] = i
+            ns_chars.append(c)
+        ns_text = "".join(ns_chars)
+        return (ns_text, ns_to_s_map)
+    # We first tokenize `orig_text`, strip whitespace from the result
+    # and `pred_text`, and check if they are the same length. If they are
+    # NOT the same length, the heuristic has failed. If they are the same
+    # length, we assume the characters are one-to-one aligned.
+    tokenizer = tokenization.BasicTokenizer(do_lower_case=do_lower_case)
+    tok_text = " ".join(tokenizer.tokenize(orig_text))
+    start_position = tok_text.find(pred_text)
+    if start_position == -1:
+        if verbose:
+            print("Unable to find text: '%s' in '%s'" % (pred_text, orig_text))
+        return orig_text
+    end_position = start_position + len(pred_text) - 1
+    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)
+    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)
+    if len(orig_ns_text) != len(tok_ns_text):
+        if verbose:
+            print("Length not equal after stripping spaces: '%s' vs '%s'" %
+                  (orig_ns_text, tok_ns_text))
+        return orig_text
+    # We then project the characters in `pred_text` back to `orig_text` using
+    # the character-to-character alignment.
+    tok_s_to_ns_map = {}
+    for (i, tok_index) in six.iteritems(tok_ns_to_s_map):
+        tok_s_to_ns_map[tok_index] = i
+    orig_start_position = None
+    if start_position in tok_s_to_ns_map:
+        ns_start_position = tok_s_to_ns_map[start_position]
+        if ns_start_position in orig_ns_to_s_map:
+            orig_start_position = orig_ns_to_s_map[ns_start_position]
+    if orig_start_position is None:
+        if verbose:
+            print("Couldn't map start position")
+        return orig_text
+    orig_end_position = None
+    if end_position in tok_s_to_ns_map:
+        ns_end_position = tok_s_to_ns_map[end_position]
+        if ns_end_position in orig_ns_to_s_map:
+            orig_end_position = orig_ns_to_s_map[ns_end_position]
+    if orig_end_position is None:
+        if verbose:
+            print("Couldn't map end position")
+        return orig_text
+    output_text = orig_text[orig_start_position:(orig_end_position + 1)]
+    return output_text
+def _get_best_indexes(logits, n_best_size):
+    """Get the n-best logits from a list."""
+    index_and_score = sorted(
+        enumerate(logits), key=lambda x: x[1], reverse=True)
+    best_indexes = []
+    for i in range(len(index_and_score)):
+        if i >= n_best_size:
+            break
+        best_indexes.append(index_and_score[i][0])
+    return best_indexes
+def _compute_softmax(scores):
+    """Compute softmax probability over raw logits."""
+    if not scores:
+        return []
+    max_score = None
+    for score in scores:
+        if max_score is None or score > max_score:
+            max_score = score
+    exp_scores = []
+    total_sum = 0.0
+    for score in scores:
+        x = math.exp(score - max_score)
+        exp_scores.append(x)
+        total_sum += x
+    probs = []
+    for score in exp_scores:
+        probs.append(score / total_sum)
+    return probs
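Subtracting the maximum score before exponentiating is what keeps this softmax from overflowing on large logits. A minimal sketch of the same idea follows; `stable_softmax` is a hypothetical name and the input values are made up for illustration.

```python
import math


def stable_softmax(scores):
    """Softmax that shifts scores by their maximum before exponentiating."""
    if not scores:
        return []
    max_score = max(scores)
    # After the shift, every exponent is <= 0, so math.exp cannot overflow.
    exp_scores = [math.exp(score - max_score) for score in scores]
    total = sum(exp_scores)
    return [x / total for x in exp_scores]


# math.exp(1000.0) on its own would raise OverflowError.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
print(probs)
```

The shift does not change the result, because the common factor exp(-max_score) cancels between numerator and denominator.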
+if __name__ == '__main__':
+    train_file = 'data/mrqa-combined.all_dev.raw.json'
+    vocab_file = 'uncased_L-12_H-768_A-12/vocab.txt'
+    do_lower_case = True
+    tokenizer = tokenization.FullTokenizer(
+        vocab_file=vocab_file, do_lower_case=do_lower_case)
+    train_examples = read_mrqa_examples(
+        input_file=train_file, is_training=True)
+    print("begin converting")
+    for (index, feature) in enumerate(
+            convert_examples_to_features(
+                examples=train_examples,
+                tokenizer=tokenizer,
+                max_seq_length=384,
+                doc_stride=128,
+                max_query_length=64,
+                is_training=True,
+                #output_fn=train_writer.process_feature
+            )):
+        if index < 10:
+            print(index, feature.input_ids, feature.input_mask,
+                  feature.segment_ids)
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/task_reader/tokenization.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/bert_server/task_reader/tokenization.py
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#         http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import collections
+import unicodedata
+import six
+def convert_to_unicode(text):
+    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
+    if six.PY3:
+        if isinstance(text, str):
+            return text
+        elif isinstance(text, bytes):
+            return text.decode("utf-8", "ignore")
+        else:
+            raise ValueError("Unsupported string type: %s" % (type(text)))
+    elif six.PY2:
+        if isinstance(text, str):
+            return text.decode("utf-8", "ignore")
+        elif isinstance(text, unicode):
+            return text
+        else:
+            raise ValueError("Unsupported string type: %s" % (type(text)))
+    else:
+        raise ValueError("Not running on Python2 or Python 3?")
+def printable_text(text):
+    """Returns text encoded in a way suitable for print or `tf.logging`."""
+    # These functions want `str` for both Python2 and Python3, but in one case
+    # it's a Unicode string and in the other it's a byte string.
+    if six.PY3:
+        if isinstance(text, str):
+            return text
+        elif isinstance(text, bytes):
+            return text.decode("utf-8", "ignore")
+        else:
+            raise ValueError("Unsupported string type: %s" % (type(text)))
+    elif six.PY2:
+        if isinstance(text, str):
+            return text
+        elif isinstance(text, unicode):
+            return text.encode("utf-8")
+        else:
+            raise ValueError("Unsupported string type: %s" % (type(text)))
+    else:
+        raise ValueError("Not running on Python2 or Python 3?")
+def load_vocab(vocab_file):
+    """Loads a vocabulary file into a dictionary."""
+    vocab = collections.OrderedDict()
+    # Use a context manager so the vocab file is closed after reading.
+    with open(vocab_file) as fin:
+        for num, line in enumerate(fin):
+            items = convert_to_unicode(line.strip()).split("\t")
+            if len(items) > 2:
+                break
+            token = items[0]
+            index = items[1] if len(items) == 2 else num
+            token = token.strip()
+            vocab[token] = int(index)
+    return vocab
+def convert_by_vocab(vocab, items):
+    """Converts a sequence of [tokens|ids] using the vocab."""
+    output = []
+    for item in items:
+        output.append(vocab[item])
+    return output
+def convert_tokens_to_ids(vocab, tokens):
+    return convert_by_vocab(vocab, tokens)
+def convert_ids_to_tokens(inv_vocab, ids):
+    return convert_by_vocab(inv_vocab, ids)
+def whitespace_tokenize(text):
+    """Runs basic whitespace cleaning and splitting on a piece of text."""
+    text = text.strip()
+    if not text:
+        return []
+    tokens = text.split()
+    return tokens
+class FullTokenizer(object):
+    """Runs end-to-end tokenization."""
+    def __init__(self, vocab_file, do_lower_case=True):
+        self.vocab = load_vocab(vocab_file)
+        self.inv_vocab = {v: k for k, v in self.vocab.items()}
+        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+    def tokenize(self, text):
+        split_tokens = []
+        for token in self.basic_tokenizer.tokenize(text):
+            for sub_token in self.wordpiece_tokenizer.tokenize(token):
+                split_tokens.append(sub_token)
+        return split_tokens
+    def convert_tokens_to_ids(self, tokens):
+        return convert_by_vocab(self.vocab, tokens)
+    def convert_ids_to_tokens(self, ids):
+        return convert_by_vocab(self.inv_vocab, ids)
+class CharTokenizer(object):
+    """Runs end-to-end tokenization."""
+    def __init__(self, vocab_file, do_lower_case=True):
+        self.vocab = load_vocab(vocab_file)
+        self.inv_vocab = {v: k for k, v in self.vocab.items()}
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+    def tokenize(self, text):
+        split_tokens = []
+        for token in text.lower().split(" "):
+            for sub_token in self.wordpiece_tokenizer.tokenize(token):
+                split_tokens.append(sub_token)
+        return split_tokens
+    def convert_tokens_to_ids(self, tokens):
+        return convert_by_vocab(self.vocab, tokens)
+    def convert_ids_to_tokens(self, ids):
+        return convert_by_vocab(self.inv_vocab, ids)
+class BasicTokenizer(object):
+    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
+    def __init__(self, do_lower_case=True):
+        """Constructs a BasicTokenizer.
+        Args:
+            do_lower_case: Whether to lower case the input.
+        """
+        self.do_lower_case = do_lower_case
+        self._never_lowercase = ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
+    def tokenize(self, text):
+        """Tokenizes a piece of text."""
+        text = convert_to_unicode(text)
+        text = self._clean_text(text)
+        # This was added on November 1st, 2018 for the multilingual and Chinese
+        # models. This is also applied to the English models now, but it doesn't
+        # matter since the English models were not trained on any Chinese data
+        # and generally don't have any Chinese data in them (there are Chinese
+        # characters in the vocabulary because Wikipedia does have some Chinese
+        # words in the English Wikipedia.).
+        text = self._tokenize_chinese_chars(text)
+        orig_tokens = whitespace_tokenize(text)
+        split_tokens = []
+        for token in orig_tokens:
+            if self.do_lower_case and token not in self._never_lowercase:
+                token = token.lower()
+                token = self._run_strip_accents(token)
+            if token in self._never_lowercase:
+                split_tokens.extend([token])
+            else:
+                split_tokens.extend(self._run_split_on_punc(token))
+        output_tokens = whitespace_tokenize(" ".join(split_tokens))
+        return output_tokens
+    def _run_strip_accents(self, text):
+        """Strips accents from a piece of text."""
+        text = unicodedata.normalize("NFD", text)
+        output = []
+        for char in text:
+            cat = unicodedata.category(char)
+            if cat == "Mn":
+                continue
+            output.append(char)
+        return "".join(output)
+    def _run_split_on_punc(self, text):
+        """Splits punctuation on a piece of text."""
+        chars = list(text)
+        i = 0
+        start_new_word = True
+        output = []
+        while i < len(chars):
+            char = chars[i]
+            if _is_punctuation(char):
+                output.append([char])
+                start_new_word = True
+            else:
+                if start_new_word:
+                    output.append([])
+                start_new_word = False
+                output[-1].append(char)
+            i += 1
+        return ["".join(x) for x in output]
+    def _tokenize_chinese_chars(self, text):
+        """Adds whitespace around any CJK character."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if self._is_chinese_char(cp):
+                output.append(" ")
+                output.append(char)
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+    def _is_chinese_char(self, cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and handled
+        # like all of the other languages.
+        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
+            (cp >= 0x3400 and cp <= 0x4DBF) or  #
+            (cp >= 0x20000 and cp <= 0x2A6DF) or  #
+            (cp >= 0x2A700 and cp <= 0x2B73F) or  #
+            (cp >= 0x2B740 and cp <= 0x2B81F) or  #
+            (cp >= 0x2B820 and cp <= 0x2CEAF) or
+            (cp >= 0xF900 and cp <= 0xFAFF) or  #
+            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
+            return True
+        return False
+    def _clean_text(self, text):
+        """Performs invalid character removal and whitespace cleanup on text."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if cp == 0 or cp == 0xfffd or _is_control(char):
+                continue
+            if _is_whitespace(char):
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+class WordpieceTokenizer(object):
+    """Runs WordPiece tokenization."""
+    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+        self.vocab = vocab
+        self.unk_token = unk_token
+        self.max_input_chars_per_word = max_input_chars_per_word
+    def tokenize(self, text):
+        """Tokenizes a piece of text into its word pieces.
+        This uses a greedy longest-match-first algorithm to perform tokenization
+        using the given vocabulary.
+        For example:
+            input = "unaffable"
+            output = ["un", "##aff", "##able"]
+        Args:
+            text: A single token or whitespace separated tokens. This should have
+                already been passed through `BasicTokenizer`.
+        Returns:
+            A list of wordpiece tokens.
+        """
+        text = convert_to_unicode(text)
+        output_tokens = []
+        for token in whitespace_tokenize(text):
+            chars = list(token)
+            if len(chars) > self.max_input_chars_per_word:
+                output_tokens.append(self.unk_token)
+                continue
+            is_bad = False
+            start = 0
+            sub_tokens = []
+            while start < len(chars):
+                end = len(chars)
+                cur_substr = None
+                while start < end:
+                    substr = "".join(chars[start:end])
+                    if start > 0:
+                        substr = "##" + substr
+                    if substr in self.vocab:
+                        cur_substr = substr
+                        break
+                    end -= 1
+                if cur_substr is None:
+                    is_bad = True
+                    break
+                sub_tokens.append(cur_substr)
+                start = end
+            if is_bad:
+                output_tokens.append(self.unk_token)
+            else:
+                output_tokens.extend(sub_tokens)
+        return output_tokens
+def _is_whitespace(char):
+    """Checks whether `char` is a whitespace character."""
+    # \t, \n, and \r are technically control characters but we treat them
+    # as whitespace since they are generally considered as such.
+    if char == " " or char == "\t" or char == "\n" or char == "\r":
+        return True
+    cat = unicodedata.category(char)
+    if cat == "Zs":
+        return True
+    return False
+def _is_control(char):
+    """Checks whether `char` is a control character."""
+    # These are technically control characters but we count them as whitespace
+    # characters.
+    if char == "\t" or char == "\n" or char == "\r":
+        return False
+    cat = unicodedata.category(char)
+    if cat.startswith("C"):
+        return True
+    return False
+def _is_punctuation(char):
+    """Checks whether `char` is a punctuation character."""
+    cp = ord(char)
+    # We treat all non-letter/number ASCII as punctuation.
+    # Characters such as "^", "$", and "`" are not in the Unicode
+    # Punctuation class but we treat them as punctuation anyways, for
+    # consistency.
+    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
+        (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
+        return True
+    cat = unicodedata.category(char)
+    if cat.startswith("P"):
+        return True
+    return False
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/main_server.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/main_server.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+import json
+import sys
+import logging
+logging.basicConfig(
+    level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+import requests
+from flask import Flask
+from flask import Response
+from flask import request
+import numpy as np
+from multiprocessing.dummy import Pool as ThreadPool
+app = Flask(__name__)
+logger = logging.getLogger('flask')
+url_1 = 'http://127.0.0.1:5118'   # url for model1
+url_2 = 'http://127.0.0.1:5120'   # url for model2
+def ensemble_example(answers, n_models=None):
+    if n_models is None:
+        n_models = len(answers)
+    answer_dict = dict()
+    for nbest_predictions in answers:
+        for prediction in nbest_predictions:
+            score_list = answer_dict.setdefault(prediction['text'], [])
+            score_list.append(prediction['probability'])
+    ensemble_nbest_predictions = []
+    for answer, scores in answer_dict.items():
+        prediction = dict()
+        prediction['text'] = answer
+        prediction['probability'] = np.sum(scores) / n_models
+        ensemble_nbest_predictions.append(prediction)
+    ensemble_nbest_predictions = \
+        sorted(ensemble_nbest_predictions, key=lambda item: item['probability'], reverse=True)
+    return ensemble_nbest_predictions
+@app.route('/', methods=['POST'])
+def mrqa_main():
+    """Forwards the request to both model servers and returns the top ensembled answer per question."""
+    # parse input data
+    pred = {}
+    def _call_model(url, input_json):
+        nbest = requests.post(url, json=input_json)
+        return nbest
+    try:
+        input_json = request.get_json(silent=True)
+        pool = ThreadPool(2)
+        res1 = pool.apply_async(_call_model, (url_1, input_json))
+        res2 = pool.apply_async(_call_model, (url_2, input_json))
+        nbest1 = res1.get()
+        nbest2 = res2.get()
+        # print(res1)
+        # print(nbest1)
+        pool.close()
+        pool.join()
+        nbest1 = nbest1.json()['results']
+        nbest2 = nbest2.json()['results']
+        qids = list(nbest1.keys())
+        for qid in qids:
+            ensemble_nbest = ensemble_example([nbest1[qid], nbest2[qid]], n_models=2)
+            pred[qid] = ensemble_nbest[0]['text']
+    except Exception as e:
+        pred['error'] = 'empty'
+        # logger.error('Error in mrc server - {}'.format(e))
+        logger.exception(e)
+    # import pdb; pdb.set_trace()  # XXX BREAKPOINT
+    return Response(json.dumps(pred), mimetype='application/json')
+if __name__ == '__main__':
+    app.run(host='127.0.0.1', port=5121, debug=False, threaded=False, processes=1)
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/start.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/start.sh
+cd bert_server
+sh start.sh
+cd ../xlnet_server
+sh serve.sh
+cd ..
+sleep 60
+python main_server.py 
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/wget_server_inference_model.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/wget_server_inference_model.sh
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/D-Net/mrqa2019_inference_model.tar.gz
+tar -xvf mrqa2019_inference_model.tar.gz
+rm mrqa2019_inference_model.tar.gz
+mv infer_model bert_server
+mv infer_model_800_bs128 xlnet_server
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/data_utils.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/data_utils.py
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+special_symbols = {
+    "<unk>"  : 0,
+    "<s>"    : 1,
+    "</s>"   : 2,
+    "<cls>"  : 3,
+    "<sep>"  : 4,
+    "<pad>"  : 5,
+    "<mask>" : 6,
+    "<eod>"  : 7,
+    "<eop>"  : 8,
+}
+VOCAB_SIZE = 32000
+UNK_ID = special_symbols["<unk>"]
+CLS_ID = special_symbols["<cls>"]
+SEP_ID = special_symbols["<sep>"]
+MASK_ID = special_symbols["<mask>"]
+EOD_ID = special_symbols["<eod>"]
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/model/__init__.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/model/__init__.py
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/model/transformer_encoder.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/model/transformer_encoder.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Transformer encoder."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from functools import partial
+import numpy as np
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+def multi_head_attention(queries,
+                         keys,
+                         values,
+                         attn_bias,
+                         d_key,
+                         d_value,
+                         d_model,
+                         n_head=1,
+                         dropout_rate=0.,
+                         cache=None,
+                         param_initializer=None,
+                         name='multi_head_att'):
+    """
+    Multi-Head Attention. Note that attn_bias is added to the logit before
+    computing softmax activation to mask certain selected positions so that
+    they will not be considered in attention weights.
+    """
+    keys = queries if keys is None else keys
+    values = keys if values is None else values
+    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
+        raise ValueError(
+            "Inputs: queries, keys and values should all be 3-D tensors.")
+    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
+        """
+        Add linear projection to queries, keys, and values.
+        """
+        q = layers.fc(input=queries,
+                      size=d_key * n_head,
+                      num_flatten_dims=2,
+                      param_attr=fluid.ParamAttr(
+                          name=name + '_query_fc.w_0',
+                          initializer=param_initializer),
+                      bias_attr=name + '_query_fc.b_0')
+        k = layers.fc(input=keys,
+                      size=d_key * n_head,
+                      num_flatten_dims=2,
+                      param_attr=fluid.ParamAttr(
+                          name=name + '_key_fc.w_0',
+                          initializer=param_initializer),
+                      bias_attr=name + '_key_fc.b_0')
+        v = layers.fc(input=values,
+                      size=d_value * n_head,
+                      num_flatten_dims=2,
+                      param_attr=fluid.ParamAttr(
+                          name=name + '_value_fc.w_0',
+                          initializer=param_initializer),
+                      bias_attr=name + '_value_fc.b_0')
+        return q, k, v
+    def __split_heads(x, n_head):
+        """
+        Reshape the last dimension of input tensor x so that it becomes two
+        dimensions and then transpose. Specifically, input a tensor with shape
+        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
+        with shape [bs, n_head, max_sequence_length, hidden_dim].
+        """
+        hidden_size = x.shape[-1]
+        # The value 0 in shape attr means copying the corresponding dimension
+        # size of the input as the output dimension size.
+        reshaped = layers.reshape(
+            x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
+        # permute the dimensions into:
+        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
+        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
+    def __combine_heads(x):
+        """
+        Transpose and then reshape the last two dimensions of input tensor x
+        so that it becomes one dimension, which is reverse to __split_heads.
+        """
+        if len(x.shape) == 3: return x
+        if len(x.shape) != 4:
+            raise ValueError("Input(x) should be a 4-D Tensor.")
+        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
+        # The value 0 in shape attr means copying the corresponding dimension
+        # size of the input as the output dimension size.
+        return layers.reshape(
+            x=trans_x,
+            shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
+            inplace=True)
+    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
+        """
+        Scaled Dot-Product Attention
+        """
+        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
+        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
+        if attn_bias:
+            product += attn_bias
+        weights = layers.softmax(product)
+        if dropout_rate:
+            weights = layers.dropout(
+                weights,
+                dropout_prob=dropout_rate,
+                dropout_implementation="upscale_in_train",
+                is_test=False)
+        out = layers.matmul(weights, v)
+        return out
+    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
+    if cache is not None:  # use cache and concat time steps
+        # Since the inplace reshape in __split_heads changes the shape of k and
+        # v, which is the cache input for next time step, reshape the cache
+        # input from the previous time step first.
+        k = cache["k"] = layers.concat(
+            [layers.reshape(
+                cache["k"], shape=[0, 0, d_model]), k], axis=1)
+        v = cache["v"] = layers.concat(
+            [layers.reshape(
+                cache["v"], shape=[0, 0, d_model]), v], axis=1)
+    q = __split_heads(q, n_head)
+    k = __split_heads(k, n_head)
+    v = __split_heads(v, n_head)
+    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
+                                                  dropout_rate)
+    out = __combine_heads(ctx_multiheads)
+    # Project back to the model size.
+    proj_out = layers.fc(input=out,
+                         size=d_model,
+                         num_flatten_dims=2,
+                         param_attr=fluid.ParamAttr(
+                             name=name + '_output_fc.w_0',
+                             initializer=param_initializer),
+                         bias_attr=name + '_output_fc.b_0')
+    return proj_out
+def positionwise_feed_forward(x,
+                              d_inner_hid,
+                              d_hid,
+                              dropout_rate,
+                              hidden_act,
+                              param_initializer=None,
+                              name='ffn'):
+    """
+    Position-wise Feed-Forward Networks.
+    This module consists of two linear transformations with an activation
+    (`hidden_act`, e.g. ReLU) in between, which is applied to each position
+    separately and identically.
+    """
+    hidden = layers.fc(input=x,
+                       size=d_inner_hid,
+                       num_flatten_dims=2,
+                       act=hidden_act,
+                       param_attr=fluid.ParamAttr(
+                           name=name + '_fc_0.w_0',
+                           initializer=param_initializer),
+                       bias_attr=name + '_fc_0.b_0')
+    if dropout_rate:
+        hidden = layers.dropout(
+            hidden,
+            dropout_prob=dropout_rate,
+            dropout_implementation="upscale_in_train",
+            is_test=False)
+    out = layers.fc(input=hidden,
+                    size=d_hid,
+                    num_flatten_dims=2,
+                    param_attr=fluid.ParamAttr(
+                        name=name + '_fc_1.w_0', initializer=param_initializer),
+                    bias_attr=name + '_fc_1.b_0')
+    return out
+def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
+                           name=''):
+    """
+    Add residual connection, layer normalization and dropout to the out tensor
+    optionally according to the value of process_cmd.
+    This will be used before or after multi-head attention and position-wise
+    feed-forward networks.
+    """
+    for cmd in process_cmd:
+        if cmd == "a":  # add residual connection
+            out = out + prev_out if prev_out else out
+        elif cmd == "n":  # add layer normalization
+            out_dtype = out.dtype
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x=out, dtype="float32")
+            out = layers.layer_norm(
+                out,
+                begin_norm_axis=len(out.shape) - 1,
+                param_attr=fluid.ParamAttr(
+                    name=name + '_layer_norm_scale',
+                    initializer=fluid.initializer.Constant(1.)),
+                bias_attr=fluid.ParamAttr(
+                    name=name + '_layer_norm_bias',
+                    initializer=fluid.initializer.Constant(0.)))
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x=out, dtype="float16")
+        elif cmd == "d":  # add dropout
+            if dropout_rate:
+                out = layers.dropout(
+                    out,
+                    dropout_prob=dropout_rate,
+                    dropout_implementation="upscale_in_train",
+                    is_test=False)
+    return out
+pre_process_layer = partial(pre_post_process_layer, None)
+post_process_layer = pre_post_process_layer
+def encoder_layer(enc_input,
+                  attn_bias,
+                  n_head,
+                  d_key,
+                  d_value,
+                  d_model,
+                  d_inner_hid,
+                  prepostprocess_dropout,
+                  attention_dropout,
+                  relu_dropout,
+                  hidden_act,
+                  preprocess_cmd="n",
+                  postprocess_cmd="da",
+                  param_initializer=None,
+                  name=''):
+    """The encoder layers that can be stacked to form a deep encoder.
+    This module consists of a multi-head (self) attention followed by
+    position-wise feed-forward networks; both components are accompanied
+    by the post_process_layer to add residual connection, layer normalization
+    and dropout.
+    """
+    attn_output = multi_head_attention(
+        pre_process_layer(
+            enc_input,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name=name + '_pre_att'),
+        None,
+        None,
+        attn_bias,
+        d_key,
+        d_value,
+        d_model,
+        n_head,
+        attention_dropout,
+        param_initializer=param_initializer,
+        name=name + '_multi_head_att')
+    attn_output = post_process_layer(
+        enc_input,
+        attn_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name=name + '_post_att')
+    ffd_output = positionwise_feed_forward(
+        pre_process_layer(
+            attn_output,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name=name + '_pre_ffn'),
+        d_inner_hid,
+        d_model,
+        relu_dropout,
+        hidden_act,
+        param_initializer=param_initializer,
+        name=name + '_ffn')
+    return post_process_layer(
+        attn_output,
+        ffd_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name=name + '_post_ffn')
+def encoder(enc_input,
+            attn_bias,
+            n_layer,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd="n",
+            postprocess_cmd="da",
+            param_initializer=None,
+            name=''):
+    """
+    The encoder is composed of a stack of identical layers returned by calling
+    encoder_layer.
+    """
+    for i in range(n_layer):
+        enc_output = encoder_layer(
+            enc_input,
+            attn_bias,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd,
+            postprocess_cmd,
+            param_initializer=param_initializer,
+            name=name + '_layer_' + str(i))
+        enc_input = enc_output
+    enc_output = pre_process_layer(
+        enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
+    return enc_output
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/model/xlnet.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/model/xlnet.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""XLNet model."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import six
+import json
+import numpy as np
+import paddle.fluid as fluid
+from model.transformer_encoder import encoder, pre_process_layer
+import modeling
+def _get_initiliaizer(args):
+    if args.init == "uniform":
+        param_initializer = fluid.initializer.Uniform(
+            low=-args.init_range, high=args.init_range)
+    elif args.init == "normal":
+        param_initializer = fluid.initializer.Normal(scale=args.init_std)
+    else:
+        raise ValueError("Initializer {} not supported".format(args.init))
+    return param_initializer
+def init_attn_mask(args, place):
+    """create causal attention mask."""
+    qlen = args.max_seq_length
+    mlen=0 if 'mem_len' not in args else args.mem_len
+    same_length=False if 'same_length' not in args else args.same_length
+    dtype = 'float16' if args.use_fp16 else 'float32'
+    attn_mask = np.ones([qlen, qlen], dtype=dtype)
+    mask_u = np.triu(attn_mask)
+    mask_dia = np.diag(np.diag(attn_mask))
+    attn_mask_pad = np.zeros([qlen, mlen], dtype=dtype)
+    attn_mask = np.concatenate([attn_mask_pad, mask_u - mask_dia], 1)
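+    if same_length:
+        # `ret` was undefined here; use the padded mask built above, and take
+        # the lower triangle of the square all-ones matrix rather than of the
+        # padded mask so the shapes match.
+        mask_l = np.tril(np.ones([qlen, qlen], dtype=dtype))
+        attn_mask = np.concatenate(
+            [attn_mask[:, :qlen] + mask_l - mask_dia, attn_mask[:, qlen:]], 1)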
+    attn_mask = attn_mask[:, :, None, None]
+    attn_mask_t = fluid.global_scope().find_var("attn_mask").get_tensor()
+    attn_mask_t.set(attn_mask, place)
+class XLNetConfig(object):
+    def __init__(self, config_path):
+        self._config_dict = self._parse(config_path)
+    def _parse(self, config_path):
+        try:
+            with open(config_path) as json_file:
+                config_dict = json.load(json_file)
+        except Exception:
+            raise IOError("Error in parsing xlnet model config file '%s'" %
+                          config_path)
+        else:
+            return config_dict
+    def __getitem__(self, key):
+        return self._config_dict[key]
+    def has_key(self, key):
+        # dict.has_key() does not exist in Python 3; `in` works on both 2 and 3.
+        return key in self._config_dict
+    def print_config(self):
+        for arg, value in sorted(six.iteritems(self._config_dict)):
+            print('%s: %s' % (arg, value))
+        print('------------------------------------------------')
+class XLNetModel(object):
+    def __init__(self,
+                 xlnet_config,
+                 input_ids,
+                 seg_ids,
+                 input_mask,
+                 args,
+                 mems=None,
+                 perm_mask=None,
+                 target_mapping=None,
+                 inp_q=None):
+        self._tie_weight = True
+        self._d_head = xlnet_config['d_head']
+        self._d_inner = xlnet_config['d_inner']
+        self._d_model = xlnet_config['d_model']
+        self._ff_activation = xlnet_config['ff_activation']
+        self._n_head = xlnet_config['n_head']
+        self._n_layer = xlnet_config['n_layer']
+        self._n_token = xlnet_config['n_token']
+        self._untie_r = xlnet_config['untie_r']
+        self._mem_len=None if 'mem_len' not in args else args.mem_len
+        self._reuse_len=None if 'reuse_len' not in args else args.reuse_len
+        self._bi_data=False if 'bi_data' not in args else args.bi_data
+        self._clamp_len=args.clamp_len
+        self._same_length=False if 'same_length' not in args else args.same_length
+        # Initialize all weights by the specified initializer, and all biases
+        # will be initialized by constant zero by default.
+        self._param_initializer = _get_initiliaizer(args)
+        tfm_args = dict(
+            n_token=self._n_token,
+            initializer=self._param_initializer,
+            attn_type="bi",
+            n_layer=self._n_layer,
+            d_model=self._d_model,
+            n_head=self._n_head,
+            d_head=self._d_head,
+            d_inner=self._d_inner,
+            ff_activation=self._ff_activation,
+            untie_r=self._untie_r,
+            use_bfloat16=args.use_fp16,
+            dropout=args.dropout,
+            dropatt=args.dropatt,
+            mem_len=self._mem_len,
+            reuse_len=self._reuse_len,
+            bi_data=self._bi_data,
+            clamp_len=args.clamp_len,
+            same_length=self._same_length,
+            name='model_transformer')
+        input_args = dict(
+            inp_k=input_ids,
+            seg_id=seg_ids,
+            input_mask=input_mask,
+            mems=mems,
+            perm_mask=perm_mask,
+            target_mapping=target_mapping,
+            inp_q=inp_q)
+        tfm_args.update(input_args)
+        self.output, self.new_mems, self.lookup_table = modeling.transformer_xl(**tfm_args)
+        #self._build_model(input_ids, sentence_ids, input_mask)
+    def get_initializer(self):
+        return self._param_initializer
+    def get_sequence_output(self):
+        return self.output
+    def get_pooled_output(self):
+        """Get the first feature of each sequence for classification"""
+        next_sent_feat = fluid.layers.slice(
+            input=self._enc_out, axes=[1], starts=[0], ends=[1])
+        next_sent_feat = fluid.layers.fc(
+            input=next_sent_feat,
+            size=self._emb_size,
+            act="tanh",
+            param_attr=fluid.ParamAttr(
+                name="pooled_fc.w_0", initializer=self._param_initializer),
+            bias_attr="pooled_fc.b_0")
+        return next_sent_feat
+    def get_pretraining_output(self, mask_label, mask_pos, labels):
+        """Get the loss & accuracy for pretraining"""
+        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
+        # extract the first token feature in each sentence
+        next_sent_feat = self.get_pooled_output()
+        reshaped_emb_out = fluid.layers.reshape(
+            x=self._enc_out, shape=[-1, self._emb_size])
+        # extract masked tokens' feature
+        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
+        # transform: fc
+        mask_trans_feat = fluid.layers.fc(
+            input=mask_feat,
+            size=self._emb_size,
+            act=self._hidden_act,
+            param_attr=fluid.ParamAttr(
+                name='mask_lm_trans_fc.w_0',
+                initializer=self._param_initializer),
+            bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
+        # transform: layer norm 
+        mask_trans_feat = pre_process_layer(
+            mask_trans_feat, 'n', name='mask_lm_trans')
+        mask_lm_out_bias_attr = fluid.ParamAttr(
+            name="mask_lm_out_fc.b_0",
+            initializer=fluid.initializer.Constant(value=0.0))
+        if self._weight_sharing:
+            word_emb = fluid.default_main_program().global_block().var(
+                self._word_emb_name)
+            if self._emb_dtype != self._dtype:
+                word_emb = fluid.layers.cast(word_emb, self._dtype)
+            fc_out = fluid.layers.matmul(
+                x=mask_trans_feat, y=word_emb, transpose_y=True)
+            fc_out += fluid.layers.create_parameter(
+                shape=[self._voc_size],
+                dtype=self._dtype,
+                attr=mask_lm_out_bias_attr,
+                is_bias=True)
+        else:
+            fc_out = fluid.layers.fc(input=mask_trans_feat,
+                                     size=self._voc_size,
+                                     param_attr=fluid.ParamAttr(
+                                         name="mask_lm_out_fc.w_0",
+                                         initializer=self._param_initializer),
+                                     bias_attr=mask_lm_out_bias_attr)
+        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
+            logits=fc_out, label=mask_label)
+        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
+        next_sent_fc_out = fluid.layers.fc(
+            input=next_sent_feat,
+            size=2,
+            param_attr=fluid.ParamAttr(
+                name="next_sent_fc.w_0", initializer=self._param_initializer),
+            bias_attr="next_sent_fc.b_0")
+        next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy(
+            logits=next_sent_fc_out, label=labels, return_softmax=True)
+        next_sent_acc = fluid.layers.accuracy(
+            input=next_sent_softmax, label=labels)
+        mean_next_sent_loss = fluid.layers.mean(next_sent_loss)
+        loss = mean_next_sent_loss + mean_mask_lm_loss
+        return next_sent_acc, mean_mask_lm_loss, loss
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/modeling.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/modeling.py
+import re
+import numpy as np
+import paddle.fluid as fluid
+def einsum4x4(equation, x, y):
+    idx_x, idx_y, idx_z = re.split(",|->", equation)
+    repeated_idx = list(set(idx_x + idx_y) - set(idx_z))
+    unique_idx_x = list(set(idx_x) - set(idx_y))
+    unique_idx_y = list(set(idx_y) - set(idx_x))
+    common_idx = list(set(idx_x) & set(idx_y) - set(repeated_idx))
+    new_idx_x = common_idx + unique_idx_x + repeated_idx
+    new_idx_y = common_idx + unique_idx_y + repeated_idx
+    new_idx_z = common_idx + unique_idx_x + unique_idx_y
+    perm_x = [ idx_x.index(i) for i in new_idx_x]
+    perm_y = [ idx_y.index(i) for i in new_idx_y]
+    perm_z = [ new_idx_z.index(i) for i in idx_z]
+    x = fluid.layers.transpose(x, perm=perm_x)
+    y = fluid.layers.transpose(y, perm=perm_y)
+    z = fluid.layers.matmul(x=x, y=y, transpose_y=True)
+    z = fluid.layers.transpose(z, perm=perm_z)
+    return z
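The index bookkeeping in `einsum4x4` can be checked without Paddle. The sketch below is a hypothetical helper, `plan_einsum`, that is not part of this PR. It assumes the same idea: move the shared batch axes to the front, put the contracted axis last, and let a batched `matmul` with `transpose_y=True` do the contraction. It uses ordered lists instead of sets, so the axis order is deterministic and may differ from the order the original picks at runtime.

```python
import re


def plan_einsum(equation):
    """Return (perm_x, perm_y, perm_z) for a two-operand einsum equation.

    Hypothetical illustration only; assumes every contracted index
    appears in both operands.
    """
    idx_x, idx_y, idx_z = re.split(",|->", equation)
    # Axes summed over: shared by both inputs, absent from the output.
    contracted = [c for c in idx_x if c in idx_y and c not in idx_z]
    # Batch axes: shared by both inputs and kept in the output.
    batch = [c for c in idx_x if c in idx_y and c in idx_z]
    only_x = [c for c in idx_x if c not in idx_y]
    only_y = [c for c in idx_y if c not in idx_x]
    new_x = batch + only_x + contracted
    new_y = batch + only_y + contracted
    # Layout of matmul(x, y, transpose_y=True) before the final transpose.
    new_z = batch + only_x + only_y
    perm_x = [idx_x.index(c) for c in new_x]
    perm_y = [idx_y.index(c) for c in new_y]
    perm_z = [new_z.index(c) for c in idx_z]
    return perm_x, perm_y, perm_z


score_plan = plan_einsum('ibnd,jbnd->ijbn')   # attention scores
output_plan = plan_einsum('ijbn,jbnd->ibnd')  # attention output
```

For the score equation, both operands are permuted to `[b, n, i|j, d]`, so the matmul contracts over `d` and yields `[b, n, i, j]`, which the final permutation maps back to `ijbn`.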
+def positional_embedding(pos_seq, inv_freq, bsz=None):
+    pos_seq = fluid.layers.reshape(pos_seq, [-1, 1])
+    inv_freq = fluid.layers.reshape(inv_freq, [1, -1])
+    sinusoid_inp = fluid.layers.matmul(pos_seq, inv_freq)
+    pos_emb = fluid.layers.concat(input=[fluid.layers.sin(sinusoid_inp), 
+                                  fluid.layers.cos(sinusoid_inp)], axis=-1)
+    pos_emb = fluid.layers.unsqueeze(pos_emb, [1])
+    if bsz is not None:
+        pos_emb = fluid.layers.expand(pos_emb, [1, bsz, 1])
+    return pos_emb
+def positionwise_ffn(inp, d_model, d_inner, dropout_prob, param_initializer=None,
+                     act_type='relu', name='ff'):
+  """Position-wise Feed-forward Network."""
+  if act_type not in ['relu', 'gelu']:
+      raise ValueError('Unsupported activation type {}'.format(act_type))
+  output = fluid.layers.fc(input=inp, size=d_inner, act=act_type,
+                           num_flatten_dims=2,
+                           param_attr=fluid.ParamAttr(
+                             name=name+'_layer_1_weight', initializer=param_initializer),
+                           bias_attr=name+'_layer_1_bias')
+  output = fluid.layers.dropout(output, dropout_prob=dropout_prob,
+                                dropout_implementation="upscale_in_train", is_test=False)
+  output = fluid.layers.fc(output, size=d_model,
+                           num_flatten_dims=2,
+                           param_attr=fluid.ParamAttr(
+                               name=name+'_layer_2_weight', initializer=param_initializer),
+                           bias_attr=name+'_layer_2_bias')
+  output = fluid.layers.dropout(output, dropout_prob=dropout_prob,
+                             dropout_implementation="upscale_in_train", is_test=False)
+  output = fluid.layers.layer_norm(output + inp, begin_norm_axis=len(output.shape)-1,
+                                   epsilon=1e-12,
+                                   param_attr=fluid.ParamAttr(name=name+'_layer_norm_scale',
+                                       initializer=fluid.initializer.Constant(1.)),
+                                   bias_attr=fluid.ParamAttr(name+'_layer_norm_bias',
+                                       initializer=fluid.initializer.Constant(0.)))
+  return output
+def head_projection(h, d_model, n_head, d_head, param_initializer, name=''):
+  """Project hidden states to a specific head with a 4D-shape."""
+  proj_weight=fluid.layers.create_parameter(
+                shape=[d_model, n_head, d_head],
+                dtype=h.dtype,
+                attr=fluid.ParamAttr(name=name+'_weight', initializer=param_initializer),
+                is_bias=False)
+  # ibh,hnd->ibnd 
+  head = fluid.layers.mul(x=h, y=proj_weight, x_num_col_dims=2, y_num_col_dims=1)
+  return head 
+def post_attention(h, attn_vec, d_model, n_head, d_head, dropout,
+                   param_initializer, residual=True, name=''):
+  """Post-attention processing."""
+  # post-attention projection (back to `d_model`)
+  proj_o=fluid.layers.create_parameter(
+                shape=[d_model, n_head, d_head],
+                dtype=h.dtype,
+                attr=fluid.ParamAttr(name=name+'_o_weight', initializer=param_initializer),
+                is_bias=False)
+  # ibnd,hnd->ibh
+  proj_o = fluid.layers.transpose(proj_o, perm=[1, 2, 0])
+  attn_out = fluid.layers.mul(x=attn_vec, y=proj_o, x_num_col_dims=2, y_num_col_dims=2)
+  attn_out = fluid.layers.dropout(attn_out, dropout_prob=dropout,
+                             dropout_implementation="upscale_in_train", is_test=False)
+  if residual:
+      output = fluid.layers.layer_norm(attn_out + h, begin_norm_axis=len(attn_out.shape)-1,
+                                   epsilon=1e-12,
+                                   param_attr=fluid.ParamAttr(name=name+'_layer_norm_scale',
+                                       initializer=fluid.initializer.Constant(1.)),
+                                   bias_attr=fluid.ParamAttr(name+'_layer_norm_bias',
+                                       initializer=fluid.initializer.Constant(0.)))
+  else:
+      output = fluid.layers.layer_norm(attn_out, begin_norm_axis=len(attn_out.shape)-1,
+                                   epsilon=1e-12,
+                                   param_attr=fluid.ParamAttr(name=name+'_layer_norm_scale',
+                                       initializer=fluid.initializer.Constant(1.)),
+                                   bias_attr=fluid.ParamAttr(name+'_layer_norm_bias',
+                                       initializer=fluid.initializer.Constant(0.)))
+  return output
+def abs_attn_core(q_head, k_head, v_head, attn_mask, dropatt, scale):
+  """Core absolute positional attention operations."""
+  attn_score = einsum4x4('ibnd,jbnd->ijbn', q_head, k_head) 
+  attn_score *= scale
+  if attn_mask is not None:
+    attn_score = attn_score - 1e30 * attn_mask
+  # attention probability
+  attn_prob = fluid.layers.softmax(attn_score, axis=1)
+  attn_prob = fluid.layers.dropout(attn_prob, dropout_prob=dropatt, 
+                  dropout_implementation="upscale_in_train", is_test=False)
+  # attention output
+  attn_vec = einsum4x4('ijbn,jbnd->ibnd', attn_prob, v_head)
+  return attn_vec
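The masking step above subtracts `1e30 * attn_mask` from the scores before the softmax, so masked positions receive zero probability. A minimal pure-Python sketch of that effect on a single row of scores follows. The function name `masked_softmax` is an assumption for illustration and does not appear in this PR.

```python
import math


def masked_softmax(scores, mask, neg=1e30):
    """Softmax over one row; positions with mask == 1 get probability 0."""
    shifted = [s - neg * m for s, m in zip(scores, mask)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# The third position is masked out (mask value 1).
probs = masked_softmax([1.0, 2.0, 3.0], [0, 0, 1])
```

The sketch assumes at least one position per row is unmasked. If every position were masked, all shifted scores would be equal and the result would be a uniform distribution rather than zeros.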
+def rel_attn_core(q_head, k_head_h, v_head_h, k_head_r, seg_embed, seg_mat,
+                  r_w_bias, r_r_bias, r_s_bias, attn_mask, dropatt,
+                  scale):
+  """Core relative positional attention operations."""
+  ## content based attention score
+  ac = einsum4x4('ibnd,jbnd->ijbn', fluid.layers.elementwise_add(q_head, r_w_bias, 2), k_head_h) 
+  # position based attention score
+  bd = einsum4x4('ibnd,jbnd->ijbn', fluid.layers.elementwise_add(q_head, r_r_bias, 2), k_head_r)
+  #klen = fluid.layers.slice(fluid.layers.shape(ac), axes=[0], starts=[1], ends=[2])
+  bd = rel_shift(bd, klen=ac.shape[1])
+  # segment based attention score
+  if seg_mat is None:
+    ef = 0
+  else:
+    ef = 0
+    """
+    bsz = fluid.layers.slice(fluid.layers.shape(q_head), axes=[0], starts=[1], ends=[2])
+    bsz.stop_gradient = True
+    """
+    #seg_embed = fluid.layers.unsqueeze(input=seg_embed, axes=[0])
+    seg_embed = fluid.layers.stack([seg_embed]*q_head.shape[0], axis=0)
+    ef = einsum4x4('ibnd,isnd->ibns', fluid.layers.elementwise_add(q_head, r_s_bias, 2), seg_embed)
+    ef = einsum4x4('ijbs,ibns->ijbn', seg_mat, ef)
+  # merge attention scores and perform masking
+  attn_score = (ac + bd + ef) * scale
+  if attn_mask is not None:
+    # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask
+    attn_score = attn_score - 1e30 * attn_mask
+  # attention probability
+  #attn_prob = fluid.layers.softmax(attn_score, axis=1)
+  attn_score = fluid.layers.transpose(attn_score, [0, 2, 3, 1])
+  attn_prob = fluid.layers.softmax(attn_score)
+  attn_prob = fluid.layers.transpose(attn_prob, [0, 3, 1, 2])
+  attn_prob = fluid.layers.dropout(attn_prob, dropatt, 
+                                   dropout_implementation="upscale_in_train")
+  # attention output
+  attn_vec = einsum4x4('ijbn,jbnd->ibnd', attn_prob, v_head_h)
+  return attn_vec
+def rel_shift(x, klen=-1):
+    """perform relative shift to form the relative attention score."""
+    x_size = x.shape
+    x = fluid.layers.reshape(x, [x_size[1], x_size[0], x_size[2], x_size[3]])
+    x = fluid.layers.slice(x,  axes=[0], starts=[1], ends=[x_size[1]])
+    x = fluid.layers.reshape(x, [x_size[0], x_size[1] - 1, x_size[2], x_size[3]])
+    x = fluid.layers.slice(x, axes=[1], starts=[0], ends=[klen])
+    return x
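The reshape, slice, reshape sequence in `rel_shift` is easier to follow on a small 2-D example. The sketch below is a hypothetical helper, `rel_shift_2d`, operating on nested lists. It assumes the score matrix has one row per query and one column per relative position, ordered from the largest offset to the smallest, and drops the batch and head axes for brevity.

```python
def rel_shift_2d(x, klen):
    """Shift a [qlen, rlen] score matrix so column j of row i holds offset i - j."""
    qlen, rlen = len(x), len(x[0])
    flat = [v for row in x for v in row]  # row-major flatten
    # Viewing the data as [rlen, qlen] and dropping the first row
    # is the same as dropping the first qlen flat elements.
    flat = flat[qlen:]
    # View the remainder as [qlen, rlen - 1], then keep klen columns.
    width = rlen - 1
    shifted = [flat[i * width:(i + 1) * width] for i in range(qlen)]
    return [row[:klen] for row in shifted]


# qlen = 2, klen = 2, so offsets run 2, 1, 0, -1 along each row.
scores = [["a+2", "a+1", "a0", "a-1"],
          ["b+2", "b+1", "b0", "b-1"]]
aligned = rel_shift_2d(scores, klen=2)
```

After the shift, query 0 sees offsets `0, -1` and query 1 sees offsets `+1, 0`, so entry `[i][j]` corresponds to relative distance `i - j`.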
+def _cache_mem(curr_out, prev_mem, mem_len, reuse_len=None):
+  """cache hidden states into memory."""
+  if mem_len is None or mem_len == 0:
+      return None
+  else:
+      if reuse_len is not None and reuse_len > 0:
+          curr_out = curr_out[:reuse_len]
+      if prev_mem is None:
+          new_mem = curr_out[-mem_len:]
+      else:
+          new_mem = fluid.layers.concat([prev_mem, curr_out], 0)[-mem_len:]
+  new_mem.stop_gradient = True
+  return new_mem
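The caching rule in `_cache_mem` can be illustrated with plain lists standing in for tensors along the sequence axis. This is a sketch under that assumption: `cache_mem_list` is a hypothetical name, and gradient handling is omitted.

```python
def cache_mem_list(curr_out, prev_mem, mem_len, reuse_len=None):
    """Keep the last mem_len steps of (previous memory + current output)."""
    if mem_len is None or mem_len == 0:
        return None
    if reuse_len is not None and reuse_len > 0:
        # Only the first reuse_len steps of this segment are eligible.
        curr_out = curr_out[:reuse_len]
    if prev_mem is None:
        return curr_out[-mem_len:]
    return (prev_mem + curr_out)[-mem_len:]


first = cache_mem_list([1, 2, 3], None, mem_len=2)
second = cache_mem_list([4, 5, 6], first, mem_len=4)
reused = cache_mem_list([4, 5, 6], first, mem_len=4, reuse_len=2)
```

Note that `transformer_xl` in this file currently returns `new_mems = None`, so the memory path is not exercised by the server code.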
+def relative_positional_encoding(qlen, klen, d_model, clamp_len, attn_type,
+                                 bi_data, bsz=None, dtype=None):
+  """create relative positional encoding."""
+  freq_seq = fluid.layers.range(0, d_model, 2.0, 'float32')
+  if dtype is not None and dtype != 'float32':
+    freq_seq = fluid.layers.cast(freq_seq, dtype=dtype)
+  inv_freq = 1 / (10000 ** (freq_seq / d_model))
+  if attn_type == 'bi':
+    beg, end = klen, -qlen
+  elif attn_type == 'uni':
+    beg, end = klen, -1
+  else:
+    raise ValueError('Unknown `attn_type` {}.'.format(attn_type))
+  if bi_data:
+    fwd_pos_seq = fluid.layers.range(beg, end, -1.0, 'float32')
+    bwd_pos_seq = fluid.layers.range(-beg, -end, 1.0, 'float32')
+    if dtype is not None and dtype != 'float32':
+      fwd_pos_seq = fluid.layers.cast(fwd_pos_seq, dtype=dtype)
+      bwd_pos_seq = fluid.layers.cast(bwd_pos_seq, dtype=dtype)
+    if clamp_len > 0:
+      fwd_pos_seq = fluid.layers.clip(fwd_pos_seq, -clamp_len, clamp_len)
+      bwd_pos_seq = fluid.layers.clip(bwd_pos_seq, -clamp_len, clamp_len)
+    if bsz is not None:
+      # With bi_data, the batch size should be divisible by 2.
+      assert bsz % 2 == 0
+      fwd_pos_emb = positional_embedding(fwd_pos_seq, inv_freq, bsz//2)
+      bwd_pos_emb = positional_embedding(bwd_pos_seq, inv_freq, bsz//2)
+    else:
+      fwd_pos_emb = positional_embedding(fwd_pos_seq, inv_freq)
+      bwd_pos_emb = positional_embedding(bwd_pos_seq, inv_freq)
+    pos_emb = fluid.layers.concat([fwd_pos_emb, bwd_pos_emb], axis=1)
+  else:
+    fwd_pos_seq = fluid.layers.range(beg, end, -1.0, 'float32')
+    if dtype is not None and dtype != 'float32':
+      fwd_pos_seq = fluid.layers.cast(fwd_pos_seq, dtype=dtype)
+    if clamp_len > 0:
+      fwd_pos_seq = fluid.layers.clip(fwd_pos_seq, -clamp_len, clamp_len)
+    pos_emb = positional_embedding(fwd_pos_seq, inv_freq, bsz)
+    fluid.layers.reshape(pos_emb, [2*qlen, -1, d_model], inplace=True)
+  return pos_emb
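As a rough illustration of what `relative_positional_encoding` builds in the `attn_type == 'bi'`, `bi_data == False` case, the sketch below computes the position sequence and its sinusoidal embedding in pure Python. The helper names `relative_positions` and `sinusoid_embedding` are assumptions for this example, and the batch axis is omitted.

```python
import math


def relative_positions(qlen, klen, clamp_len=-1):
    """Offsets klen, klen - 1, ..., -qlen + 1, optionally clamped."""
    seq = [float(p) for p in range(klen, -qlen, -1)]
    if clamp_len > 0:
        limit = float(clamp_len)
        seq = [max(-limit, min(limit, p)) for p in seq]
    return seq


def sinusoid_embedding(pos_seq, d_model):
    """Concatenate sin and cos features, as positional_embedding does."""
    inv_freq = [1.0 / (10000 ** (float(k) / d_model))
                for k in range(0, d_model, 2)]
    rows = []
    for pos in pos_seq:
        angles = [pos * f for f in inv_freq]
        rows.append([math.sin(a) for a in angles] +
                    [math.cos(a) for a in angles])
    return rows


positions = relative_positions(qlen=2, klen=3)
clamped = relative_positions(qlen=2, klen=3, clamp_len=2)
emb = sinusoid_embedding(positions, d_model=4)
```

The sequence has `qlen + klen` entries, one per relative offset that any query/key pair can take. The row for offset 0 is all zeros in the sine half and all ones in the cosine half.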
+def rel_multihead_attn(h, r, r_w_bias, r_r_bias, seg_mat, r_s_bias, seg_embed,
+                       attn_mask, mems, d_model, n_head, d_head, dropout,
+                       dropatt, initializer, name=''):
+    """Multi-head attention with relative positional encoding."""
+    scale = 1 / (d_head ** 0.5)
+    if mems is not None and len(mems.shape) > 1:
+        cat = fluid.layers.concat([mems, h], 0)
+    else:
+        cat = h
+    # content heads
+    q_head_h = head_projection(
+        h, d_model, n_head, d_head, initializer, name+'_rel_attn_q')
+    k_head_h = head_projection(
+        cat, d_model, n_head, d_head, initializer, name+'_rel_attn_k')
+    v_head_h = head_projection(
+        cat, d_model, n_head, d_head, initializer, name+'_rel_attn_v')
+    # positional heads
+    k_head_r = head_projection(
+        r, d_model, n_head, d_head, initializer, name+'_rel_attn_r')
+    # core attention ops
+    attn_vec = rel_attn_core(
+        q_head_h, k_head_h, v_head_h, k_head_r, seg_embed, seg_mat, r_w_bias,
+        r_r_bias, r_s_bias, attn_mask, dropatt, scale)
+    # post processing
+    output = post_attention(h, attn_vec, d_model, n_head, d_head, dropout, initializer, name=name+'_rel_attn')
+    return output
+def transformer_xl(inp_k, n_token, n_layer, d_model, n_head,
+                d_head, d_inner, dropout, dropatt, attn_type,
+                bi_data, initializer, mem_len=None,
+                inp_q=None, mems=None,
+                same_length=False, clamp_len=-1, untie_r=False,
+                input_mask=None,
+                perm_mask=None, seg_id=None, reuse_len=None,
+                ff_activation='relu', target_mapping=None,
+                use_fp16=False, name='', **kwargs):
+    """
+    Defines a Transformer-XL computation graph with additional
+    support for XLNet.
+    Args:
+        inp_k: int32 Tensor in shape [len, bsz], the input token IDs.
+        seg_id: int32 Tensor in shape [len, bsz], the input segment IDs.
+        input_mask: float32 Tensor in shape [len, bsz], the input mask.
+          0 for real tokens and 1 for padding.
+        mems: a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
+          from previous batches. The length of the list equals n_layer.
+          If None, no memory is used.
+        perm_mask: float32 Tensor in shape [len, len, bsz].
+          If perm_mask[i, j, k] = 0, i attends to j in batch k;
+          if perm_mask[i, j, k] = 1, i does not attend to j in batch k.
+          If None, each position attends to all the others.
+        target_mapping: float32 Tensor in shape [num_predict, len, bsz].
+          If target_mapping[i, j, k] = 1, the i-th prediction in batch k is
+          on the j-th token.
+          Only used during pretraining for partial prediction.
+          Set to None during finetuning.
+        inp_q: float32 Tensor in shape [len, bsz].
+          1 for tokens with losses and 0 for tokens without losses.
+          Only used during pretraining for two-stream attention.
+          Set to None during finetuning.
+        n_layer: int, the number of layers.
+        d_model: int, the hidden size.
+        n_head: int, the number of attention heads.
+        d_head: int, the dimension size of each attention head.
+        d_inner: int, the hidden size in feed-forward layers.
+        ff_activation: str, "relu" or "gelu".
+        untie_r: bool, whether to untie the biases in attention.
+        n_token: int, the vocab size.
+        is_training: bool, whether in training mode.
+        use_tpu: bool, whether TPUs are used.
+        use_fp16: bool, use float16 instead of float32.
+        dropout: float, dropout rate.
+        dropatt: float, dropout rate on attention probabilities.
+        init: str, the initialization scheme, either "normal" or "uniform".
+        init_range: float, initialize the parameters with a uniform distribution
+          in [-init_range, init_range]. Only effective when init="uniform".
+        init_std: float, initialize the parameters with a normal distribution
+          with mean 0 and stddev init_std. Only effective when init="normal".
+        mem_len: int, the number of tokens to cache.
+        reuse_len: int, the number of tokens in the current batch to be cached
+          and reused in the future.
+        bi_data: bool, whether to use bidirectional input pipeline.
+          Usually set to True during pretraining and False during finetuning.
+        clamp_len: int, clamp all relative distances larger than clamp_len.
+          -1 means no clamping.
+        same_length: bool, whether to use the same attention length for each token.
+        summary_type: str, "last", "first", "mean", or "attn". The method
+          to pool the input to get a vector representation.
+        initializer: a Paddle fluid initializer.
+        scope: scope name for the computation graph.
+    """
+    print('memory input {}'.format(mems))
+    data_type = "float16" if use_fp16 else "float32"
+    print('Use float type {}'.format(data_type))
+    qlen = inp_k.shape[0]
+    mlen = mems[0].shape[0] if mems is not None else 0
+    klen = mlen + qlen
+    bsz = fluid.layers.slice(fluid.layers.shape(inp_k), axes=[0], starts=[1], ends=[2])
+    ##### Attention mask
+    # causal attention mask
+    if attn_type == 'uni':
+        attn_mask = fluid.layers.create_global_var(
+            name='attn_mask',
+            shape=[qlen, klen, 1, 1],
+            value=0.0,
+            dtype=data_type, persistable=True)
+    elif attn_type == 'bi':
+        attn_mask = None
+    else:
+        raise ValueError('Unsupported attention type: {}'.format(attn_type))
+    # data mask: input mask & perm mask
+    if input_mask is not None and perm_mask is not None:
+        data_mask = fluid.layers.unsqueeze(input_mask, [0]) + perm_mask
+    elif input_mask is not None and perm_mask is None:
+        data_mask = fluid.layers.unsqueeze(input_mask, [0])
+        print("input mask shape", input_mask.shape)
+    elif input_mask is None and perm_mask is not None:
+        data_mask = perm_mask
+    else:
+        data_mask = None
+    if data_mask is not None:
+        # all mems can be attended to
+        mems_mask = fluid.layers.zeros(shape=[data_mask.shape[0], mlen, 1], dtype='float32')
+        mems_mask = fluid.layers.expand(mems_mask, [1, 1, bsz])
+        data_mask = fluid.layers.concat([mems_mask, data_mask], 1)
+        if attn_mask is None:
+            attn_mask = fluid.layers.unsqueeze(data_mask, [-1])
+        else:
+            attn_mask += fluid.layers.unsqueeze(data_mask, [-1])
+        print("mems_mask, data_mask, attn mask shape", mems_mask.shape, data_mask.shape, attn_mask.shape)
+    if attn_mask is not None:
+        attn_mask = fluid.layers.cast(attn_mask > 0, dtype=data_type)
+    if attn_mask is not None:
+        non_tgt_mask = fluid.layers.diag(np.array([-1]*qlen).astype(data_type))
+        non_tgt_mask = fluid.layers.concat([fluid.layers.zeros([qlen, mlen], dtype=data_type),
+                                non_tgt_mask], axis=-1)
+        print("attn_mask, non_tgt_mask shape", attn_mask.shape, non_tgt_mask.shape)
+        attn_mask = fluid.layers.expand(attn_mask, [qlen, 1, 1, 1])
+        non_tgt_mask = fluid.layers.unsqueeze(non_tgt_mask, axes=[2, 3])
+        non_tgt_mask = fluid.layers.expand(non_tgt_mask, [1, 1, bsz, 1])
+        non_tgt_mask = fluid.layers.cast((attn_mask + non_tgt_mask) > 0,
+                             dtype=data_type)
+        non_tgt_mask.stop_gradient = True
+    else:
+        non_tgt_mask = None
+    if untie_r:
+      r_w_bias = fluid.layers.create_parameter(shape=[n_layer, n_head, d_head], dtype=data_type, 
+                                 attr=fluid.ParamAttr(name=name+'_r_w_bias', initializer=initializer), 
+                                 is_bias=True)
+      r_w_bias = [fluid.layers.slice(r_w_bias, axes=[0], starts=[i], ends=[i+1]) for i in range(n_layer)]
+      r_w_bias = [fluid.layers.squeeze(r_w_bias[i], axes=[0]) for i in range(n_layer)]
+      r_r_bias = fluid.layers.create_parameter(shape=[n_layer, n_head, d_head], dtype=data_type, 
+                                 attr=fluid.ParamAttr(name=name+'_r_r_bias', initializer=initializer), 
+                                 is_bias=True)
+      r_r_bias = [fluid.layers.slice(r_r_bias, axes=[0], starts=[i], ends=[i+1]) for i in range(n_layer)]
+      r_r_bias = [fluid.layers.squeeze(r_r_bias[i], axes=[0]) for i in range(n_layer)]
+    else:
+      r_w_bias = fluid.layers.create_parameter(shape=[n_head, d_head], dtype=data_type, 
+                                 attr=fluid.ParamAttr(name=name+'_r_w_bias', initializer=initializer), 
+                                 is_bias=True)
+      r_r_bias = fluid.layers.create_parameter(shape=[n_head, d_head], dtype=data_type, 
+                                 attr=fluid.ParamAttr(name=name+'_r_r_bias', initializer=initializer), 
+                                 is_bias=True)
+    lookup_table = fluid.layers.create_parameter(shape=[n_token, d_model], dtype=data_type, 
+                                 attr=fluid.ParamAttr(name=name+'_word_embedding', 
+                                         initializer=initializer), 
+                                 is_bias=True)
+    word_emb_k = fluid.layers.embedding(
+        input=inp_k,
+        size=[n_token, d_model],
+        dtype=data_type,
+        param_attr=fluid.ParamAttr(name=name+'_word_embedding', initializer=initializer))
+    if inp_q is not None:
+       pass
+    output_h = fluid.layers.dropout(word_emb_k, dropout_prob=dropout,
+                                   dropout_implementation="upscale_in_train") 
+    if inp_q is not None:
+       pass
+    if seg_id is not None:
+        if untie_r:
+            r_s_bias = fluid.layers.create_parameter(shape=[n_layer, n_head, d_head], dtype=data_type,
+                                     attr=fluid.ParamAttr(name=name+'_r_s_bias', initializer=initializer),
+                                     is_bias=True)
+            r_s_bias = [fluid.layers.slice(r_s_bias, axes=[0], starts=[i], ends=[i+1]) for i in range(n_layer)]
+            r_s_bias = [fluid.layers.squeeze(r_s_bias[i], axes=[0]) for i in range(n_layer)]
+        else:
+            r_s_bias = fluid.layers.create_parameter(shape=[n_head, d_head], dtype=data_type,
+                                     attr=fluid.ParamAttr(name=name+'_r_s_bias', initializer=initializer),
+                                     is_bias=True)
+        seg_embed = fluid.layers.create_parameter(shape=[n_layer, 2, n_head, d_head],
+                              dtype=data_type, attr=fluid.ParamAttr(name=name+'_seg_embed',
+                              initializer=initializer))
+        seg_embed = [fluid.layers.slice(seg_embed, axes=[0], starts=[i], ends=[i+1]) for i in range(n_layer)]
+        seg_embed = [fluid.layers.squeeze(seg_embed[i], axes=[0]) for i in range(n_layer)]
+        # Convert `seg_id` to one-hot seg_mat
+        # seg_id: [bsz, qlen, 1]
+        mem_pad = fluid.layers.fill_constant_batch_size_like(input=seg_id, shape=[-1, mlen], value=0, dtype='int64')
+        # cat_ids: [bsz, klen, 1]
+        cat_ids = fluid.layers.concat(input=[mem_pad, seg_id], axis=1)
+        seg_id = fluid.layers.stack([seg_id] * klen, axis=2)
+        cat_ids = fluid.layers.stack([cat_ids] * qlen, axis=2)
+        cat_ids = fluid.layers.transpose(cat_ids, perm=[0, 2, 1])
+        # seg_mat: [bsz, qlen, klen]
+        seg_mat = fluid.layers.cast(
+          fluid.layers.logical_not(fluid.layers.equal(seg_id, cat_ids)),
+          dtype='int64')
+        seg_mat = fluid.layers.transpose(seg_mat, perm=[1, 2, 0])
+        seg_mat = fluid.layers.unsqueeze(seg_mat, [-1])
+        seg_mat = fluid.layers.one_hot(seg_mat, 2)
+        seg_mat.stop_gradient = True
+    else:
+        seg_mat = None
+    pos_emb =  relative_positional_encoding(
+             qlen, klen, d_model, clamp_len, attn_type, bi_data,
+             bsz=bsz, dtype=data_type) 
+    pos_emb = fluid.layers.dropout(pos_emb, dropout,
+                            dropout_implementation="upscale_in_train")
+    pos_emb.stop_gradient = True
+    ##### Attention layers
+    if mems is None:
+      mems = [None] * n_layer
+    for i in range(n_layer):
+        # cache new mems
+        #new_mems.append(_cache_mem(output_h, mems[i], mem_len, reuse_len)) 
+        # segment bias
+        if seg_id is None:
+            r_s_bias_i = None
+            seg_embed_i = None
+        else:
+            r_s_bias_i = r_s_bias if not untie_r else r_s_bias[i]
+            seg_embed_i = seg_embed[i]
+        if inp_q is not None:
+            pass
+        else:
+            output_h = rel_multihead_attn(
+              h=output_h,
+              r=pos_emb,
+              r_w_bias=r_w_bias if not untie_r else r_w_bias[i],
+              r_r_bias=r_r_bias if not untie_r else r_r_bias[i],
+              seg_mat=seg_mat,
+              r_s_bias=r_s_bias_i,
+              seg_embed=seg_embed_i,
+              attn_mask=non_tgt_mask,
+              mems=mems[i],
+              d_model=d_model,
+              n_head=n_head,
+              d_head=d_head,
+              dropout=dropout,
+              dropatt=dropatt,
+              initializer=initializer,
+              name=name+'_layer_{}'.format(i))
+        if inp_q is not None:
+            pass
+        output_h = positionwise_ffn(inp=output_h, d_model=d_model, 
+                         d_inner=d_inner, dropout_prob=dropout, 
+                         param_initializer=initializer,
+                         act_type=ff_activation, name=name+'_layer_{}_ff'.format(i))
+    if inp_q is not None:
+        output = fluid.layers.dropout(output_g, dropout, 
+                                      dropout_implementation="upscale_in_train")
+    else:
+        output = fluid.layers.dropout(output_h, dropout,
+                                      dropout_implementation="upscale_in_train")
+    new_mems = None
+    return output, new_mems, lookup_table
+def lm_loss(hidden, target, n_token, d_model, initializer, lookup_table=None,
+            tie_weight=False, bi_data=True):
+    if tie_weight:
+        assert lookup_table is not None, \
+          'lookup_table cannot be None for tie_weight'
+        softmax_w = lookup_table
+    else:
+        softmax_w = fluid.layers.create_parameter(
+                shape=[n_token, d_model],
+                dtype=hidden.dtype,
+                attr=fluid.ParamAttr(name='model_loss_weight', initializer=initializer),
+                is_bias=False)
+    softmax_b = fluid.layers.create_parameter(
+                shape=[n_token],
+                dtype=hidden.dtype,
+                attr=fluid.ParamAttr(name='model_lm_loss_bias', initializer=initializer),
+                is_bias=False)
+    logits = fluid.layers.matmul(x=hidden, y=softmax_w, transpose_y=True) + softmax_b
+    loss = fluid.layers.softmax_with_cross_entropy(logits=logits, label=target)
+    return loss 
+def summarize_sequence(summary_type, hidden, d_model, n_head, d_head, dropout,
+                       dropatt, input_mask, is_training, initializer,
+                       scope=None, reuse=None, use_proj=True):
+  """
+      Different classification tasks may or may not share the same parameters
+      to summarize the sequence features.
+      If shared, one can keep the `scope` to the default value `None`.
+      Otherwise, one should specify a different `scope` for each task.
+  """
+  with tf.variable_scope(scope, 'sequnece_summary', reuse=reuse):
+    if summary_type == 'last':
+      summary = hidden[-1]
+    elif summary_type == 'first':
+      summary = hidden[0]
+    elif summary_type == 'mean':
+      summary = tf.reduce_mean(hidden, axis=0)
+    elif summary_type == 'attn':
+      bsz = tf.shape(hidden)[1]
+      summary_bias = tf.get_variable('summary_bias', [d_model],
+                                     dtype=hidden.dtype,
+                                     initializer=initializer)
+      summary_bias = tf.tile(summary_bias[None, None], [1, bsz, 1])
+      if input_mask is not None:
+        input_mask = input_mask[None, :, :, None]
+      summary = multihead_attn(summary_bias, hidden, hidden, input_mask,
+                               d_model, n_head, d_head, dropout, dropatt,
+                               is_training, initializer, residual=False)
+      summary = summary[0]
+    else:
+      raise ValueError('Unsupported summary type {}'.format(summary_type))
+    # use another projection as in BERT
+    if use_proj:
+      summary = tf.layers.dense(
+          summary,
+          d_model,
+          activation=tf.tanh,
+          initializer=initializer,
+          name='summary')
+    # dropout
+    summary = tf.layers.dropout(
+        summary, dropout, training=is_training,
+        name='dropout')
+  return summary
+def classification_loss(hidden, labels, n_class, initializer, name, reuse=None,
+                        return_logits=False):
+    """
+      Different classification tasks should use different scope names to ensure
+      different dense layers (parameters) are used to produce the logits.
+      An exception will be in transfer learning, where one hopes to transfer
+      the classification weights.
+    """
+    logits = fluid.layers.fc(
+        input=hidden,
+        size=n_class,
+        param_attr=fluid.ParamAttr(name=name+'_logits', initializer=initializer))
+    one_hot_target = fluid.layers.one_hot(labels, depth=n_class, dtype=hidden.dtype)
+    loss = -fluid.layers.reduce_sum(fluid.layers.log_softmax(logits) * one_hot_target, -1)
+    if return_logits:
+      return loss, logits
+    return loss
+def regression_loss(hidden, labels, initializer, name='transformer',
+                    return_logits=False):
+    logits = fluid.layers.fc(
+        input=hidden,
+        size=1,
+        param_attr=fluid.ParamAttr(name=name+'_logits', initializer=initializer))
+    logits = tf.squeeze(logits, axis=-1)
+    loss = tf.square(logits - labels)
+    if return_logits:
+      return loss, logits
+    return loss 
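The two loss functions above reduce to simple per-example formulas: the classification loss is the negative log-softmax of the logits picked out by a one-hot target, and the regression loss is the squared error. The following is a minimal pure-Python sketch of that arithmetic for a single example. It is not the Paddle or TensorFlow implementation; the helper names are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def classification_loss(logits, label):
    """Cross entropy written as -sum(log_softmax(logits) * one_hot(label))."""
    one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]
    return -sum(lp * t for lp, t in zip(log_softmax(logits), one_hot))


def regression_loss(prediction, label):
    """Squared error between a scalar prediction and its label."""
    return (prediction - label) ** 2


if __name__ == "__main__":
    # With two equal logits the model is maximally uncertain: loss = log(2).
    print(classification_loss([0.0, 0.0], 0))
    print(regression_loss(1.5, 1.0))
```

Subtracting the maximum logit before exponentiating avoids overflow for large logits without changing the result.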
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/prepro_utils.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/prepro_utils.py
+# coding=utf-8
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import unicodedata
+import six
+from functools import partial
+SPIECE_UNDERLINE = '▁'
+def printable_text(text):
+  """Returns text encoded in a way suitable for print or `tf.logging`."""
+  # These functions want `str` for both Python2 and Python3, but in one case
+  # it's a Unicode string and in the other it's a byte string.
+  if six.PY3:
+    if isinstance(text, str):
+      return text
+    elif isinstance(text, bytes):
+      return text.decode("utf-8", "ignore")
+    else:
+      raise ValueError("Unsupported string type: %s" % (type(text)))
+  elif six.PY2:
+    if isinstance(text, str):
+      return text
+    elif isinstance(text, unicode):
+      return text.encode("utf-8")
+    else:
+      raise ValueError("Unsupported string type: %s" % (type(text)))
+  else:
+    raise ValueError("Not running on Python2 or Python 3?")
+def print_(*args):
+  new_args = []
+  for arg in args:
+    if isinstance(arg, list):
+      s = [printable_text(i) for i in arg]
+      s = ' '.join(s)
+      new_args.append(s)
+    else:
+      new_args.append(printable_text(arg))
+  print(*new_args)
+def preprocess_text(inputs, lower=False, remove_space=True, keep_accents=False):
+  if remove_space:
+    outputs = ' '.join(inputs.strip().split())
+  else:
+    outputs = inputs
+  outputs = outputs.replace("``", '"').replace("''", '"')
+  if six.PY2 and isinstance(outputs, str):
+    outputs = outputs.decode('utf-8')
+  if not keep_accents:
+    outputs = unicodedata.normalize('NFKD', outputs)
+    outputs = ''.join([c for c in outputs if not unicodedata.combining(c)])
+  if lower:
+    outputs = outputs.lower()
+  return outputs
+def encode_pieces(sp_model, text, return_unicode=True, sample=False):
+  # return_unicode is used only for py2
+  # note(zhiliny): in some systems, sentencepiece only accepts str for py2
+  if six.PY2 and isinstance(text, unicode):
+    text = text.encode('utf-8')
+  if not sample:
+    pieces = sp_model.EncodeAsPieces(text)
+  else:
+    pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
+  new_pieces = []
+  for piece in pieces:
+    if len(piece) > 1 and piece[-1] == ',' and piece[-2].isdigit():
+      cur_pieces = sp_model.EncodeAsPieces(
+          piece[:-1].replace(SPIECE_UNDERLINE, ''))
+      if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
+        if len(cur_pieces[0]) == 1:
+          cur_pieces = cur_pieces[1:]
+        else:
+          cur_pieces[0] = cur_pieces[0][1:]
+      cur_pieces.append(piece[-1])
+      new_pieces.extend(cur_pieces)
+    else:
+      new_pieces.append(piece)
+  # note(zhiliny): convert back to unicode for py2
+  if six.PY2 and return_unicode:
+    ret_pieces = []
+    for piece in new_pieces:
+      if isinstance(piece, str):
+        piece = piece.decode('utf-8')
+      ret_pieces.append(piece)
+    new_pieces = ret_pieces
+  return new_pieces
+def encode_ids(sp_model, text, sample=False):
+  pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
+  ids = [sp_model.PieceToId(piece) for piece in pieces]
+  return ids
+if __name__ == '__main__':
+  import sentencepiece as spm
+  sp = spm.SentencePieceProcessor()
+  sp.load('sp10m.uncased.v3.model')
+  print_(u'I was born in 2000, and this is falsé.')
+  print_(u'ORIGINAL', sp.EncodeAsPieces(u'I was born in 2000, and this is falsé.'))
+  print_(u'OURS', encode_pieces(sp, u'I was born in 2000, and this is falsé.'))
+  print(encode_ids(sp, u'I was born in 2000, and this is falsé.'))
+  print_('')
+  prepro_func = partial(preprocess_text, lower=True)
+  print_(prepro_func('I was born in 2000, and this is falsé.'))
+  print_('ORIGINAL', sp.EncodeAsPieces(prepro_func('I was born in 2000, and this is falsé.')))
+  print_('OURS', encode_pieces(sp, prepro_func('I was born in 2000, and this is falsé.')))
+  print(encode_ids(sp, prepro_func('I was born in 2000, and this is falsé.')))
+  print_('')
+  print_('I was born in 2000, and this is falsé.')
+  print_('ORIGINAL', sp.EncodeAsPieces('I was born in 2000, and this is falsé.'))
+  print_('OURS', encode_pieces(sp, 'I was born in 2000, and this is falsé.'))
+  print(encode_ids(sp, 'I was born in 2000, and this is falsé.'))
+  print_('')
+  print_('I was born in 92000, and this is falsé.')
+  print_('ORIGINAL', sp.EncodeAsPieces('I was born in 92000, and this is falsé.'))
+  print_('OURS', encode_pieces(sp, 'I was born in 92000, and this is falsé.'))
+  print(encode_ids(sp, 'I was born in 92000, and this is falsé.'))
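The behaviour of `preprocess_text` and of the digit-plus-comma handling in `encode_pieces` can be checked without a SentencePiece model. The sketch below is a Python 3 only re-statement of those two steps. `normalize`, `split_trailing_comma` and `fake_encode` are hypothetical names introduced for illustration, and `fake_encode` is a stand-in for `sp_model.EncodeAsPieces`, not the real tokenizer.

```python
import unicodedata

SPIECE_UNDERLINE = "\u2581"


def normalize(text, lower=False, keep_accents=False):
    """Collapse whitespace, unify quote marks, optionally strip accents and lowercase."""
    out = " ".join(text.strip().split())
    out = out.replace("``", '"').replace("''", '"')
    if not keep_accents:
        out = unicodedata.normalize("NFKD", out)
        out = "".join(c for c in out if not unicodedata.combining(c))
    return out.lower() if lower else out


def split_trailing_comma(piece, encode):
    """Split a piece such as '2000,' into the re-encoded number and a separate ','."""
    if not (len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit()):
        return [piece]
    cur = list(encode(piece[:-1].replace(SPIECE_UNDERLINE, "")))
    # Re-encoding adds a leading underline; drop it if the piece had none.
    if piece[0] != SPIECE_UNDERLINE and cur[0][0] == SPIECE_UNDERLINE:
        if len(cur[0]) == 1:
            cur = cur[1:]
        else:
            cur[0] = cur[0][1:]
    return cur + [piece[-1]]


def fake_encode(text):
    """Hypothetical tokenizer stub: returns the text as one word-initial piece."""
    return [SPIECE_UNDERLINE + text]


if __name__ == "__main__":
    print(normalize("  I was  born in 2000, and this is fals\u00e9. ", lower=True))
    print(split_trailing_comma(SPIECE_UNDERLINE + "2000,", fake_encode))
```

Splitting the comma off keeps a number such as `2000` as the same piece whether or not punctuation follows it.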
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/serve.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/serve.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+"""Provide MRC service for TOP1 short answer extraction system
+Note the services here share some global pre/post process objects, which
+are **NOT THREAD SAFE**. Try to use multi-process instead of multi-thread
+for deployment.
+"""
+import json
+import sys
+import logging
+logging.basicConfig(
+    level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+import requests
+from flask import Flask
+from flask import Response
+from flask import request
+import server_utils
+import wrapper as bert_wrapper
+assert len(sys.argv) == 3 or len(sys.argv) == 4, "Usage: python serve.py <model_dir> <port> [process_mode]"
+if len(sys.argv) == 3:
+    _, model_dir, port = sys.argv
+    mode = 'parallel'
+else:
+    _, model_dir, port, mode = sys.argv
+app = Flask(__name__)
+app.logger.setLevel(logging.INFO)
+bert_model = bert_wrapper.BertModelWrapper(model_dir=model_dir)
+server = server_utils.BasicMRCService('Short answer MRC service', app.logger)
+@app.route('/', methods=['POST'])
+def mrqa_service():
+    """Description"""
+    model = bert_model
+    return server(model, process_mode=mode, max_batch_size=5)
+    # return server(model)
+if __name__ == '__main__':
+    app.run(port=port, debug=False, threaded=False, processes=1)
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/serve.sh
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/serve.sh
+export FLAGS_sync_nccl_allreduce=0
+export FLAGS_eager_delete_tensor_gb=1
+export FLAGS_fraction_of_gpu_memory_to_use=0.1
+python serve.py ./infer_model_800_bs128 5001 &
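Once `serve.sh` has started the service on port 5001, a client sends a JSON body by POST to `/`. The sketch below builds such a body with the keys that `server_utils.py` reads (`context`, `context_tokens`, `qas`, `qid`, `question`). The whitespace tokenization and the `[token, offset]` layout of `context_tokens` are assumptions made for illustration, since the server only checks the length of that list. `build_request` is a hypothetical helper.

```python
import json


def build_request(context, questions):
    """Build a request body with one entry in 'qas' per question."""
    tokens = []
    offset = 0
    for word in context.split():
        start = context.index(word, offset)
        tokens.append([word, start])  # assumed [token, char_offset] layout
        offset = start + len(word)
    qas = [{"qid": "q%d" % i, "question": q} for i, q in enumerate(questions)]
    return {"context": context, "context_tokens": tokens, "qas": qas}


if __name__ == "__main__":
    payload = build_request("Paris is the capital of France.",
                            ["What is the capital of France?"])
    print(json.dumps(payload, ensure_ascii=False))
    # To send it, assuming the server from serve.sh is running locally:
    #   import requests
    #   requests.post("http://127.0.0.1:5001/", json=payload).json()
```

The response is a JSON object whose `results` key holds the n-best predictions returned by the postprocessor.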
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/server_utils.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/server_utils.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+"""Some utilities for MRC online service"""
+import json
+import sys
+import logging
+import time
+import numpy as np
+from flask import Response
+from flask import request
+from copy import deepcopy
+verbose = False
+def _request_check(input_json):
+    """Check if the request json is valid"""
+    if input_json is None or not isinstance(input_json, dict):
+        return 'Can not parse the input json data - {}'.format(input_json)
+    try:
+        c = input_json['context']
+        qa = input_json['qas'][0]
+        qid = qa['qid']
+        q = qa['question']
+    except KeyError as e:
+        return 'Invalid request, key "{}" not found'.format(e)
+    return 'OK'
+def _abort(status_code, message):
+    """Create custom error message and status code"""
+    return Response(json.dumps(message), status=status_code, mimetype='application/json')
+def _timmer(init_start, start, current, process_name):
+    cumulated_elapsed_time = (current - init_start) * 1000
+    current_elapsed_time = (current - start) * 1000
+    print('{}\t-\t{:.2f}\t{:.2f}'.format(process_name, cumulated_elapsed_time,
+                                         current_elapsed_time))
+def _split_input_json(input_json):
+    if len(input_json['context_tokens']) > 810:
+        input_json['context'] = input_json['context'][:5000]
+    if len(input_json['qas']) == 1:
+        return [input_json]
+    else:
+        rets = []
+        for i in range(len(input_json['qas'])):
+            temp = deepcopy(input_json)
+            temp['qas'] = [input_json['qas'][i]]
+            rets.append(temp)
+        return rets
+class BasicMRCService(object):
+    """Provide basic MRC service for flask"""
+    def __init__(self, name, logger=None, log_data=False):
+        """ """
+        self.name = name
+        if logger is None:
+            self.logger = logging.getLogger('flask')
+        else:
+            self.logger = logger
+        self.log_data = log_data
+    def __call__(self, model, process_mode='serial', max_batch_size=5, timmer=False):
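+        """Call the MRC model wrapper and handle exceptions.
+        Args:
+            model: wrapper exposing preprocessor, call_mrc and postprocessor
+            process_mode: 'serial' or 'parallel'
+            max_batch_size: batch size in parallel mode; in serial mode, the
+                maximum number of single-feature batches that are run
+            timmer: if True, print the elapsed time of each stage
+        """
+        if timmer:
+            start = time.time()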
+        self.input_json = request.get_json(silent=True)
+        try:
+            if timmer:
+                start_request_check = time.time()
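+            request_status = _request_check(self.input_json)
+            # reject invalid requests with 400 before touching their fields
+            if request_status != 'OK':
+                return _abort(400, request_status)
+            jsons = _split_input_json(self.input_json)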
+            if timmer:
+                current_time = time.time()
+                _timmer(start, start_request_check, current_time, 'request check')
+            if self.log_data:
+                if self.logger is None:
+                    logging.info(
+                        'Client input - {}'.format(json.dumps(self.input_json, ensure_ascii=False))
+                    )
+                else:
+                    self.logger.info(
+                        'Client input - {}'.format(json.dumps(self.input_json, ensure_ascii=False))
+                    )
+        except Exception as e:
+            self.logger.error('server request checker error')
+            self.logger.exception(e)
+            return _abort(500, 'server request checker error - {}'.format(e))
+        if request_status != 'OK':
+            return _abort(400, request_status)
+        self.results = {}
+        for single_sample in jsons:
+            # call preprocessor
+            try:
+                if timmer:
+                    start_preprocess = time.time()
+                example,features,batches = model.preprocessor(single_sample, batch_size=max_batch_size if process_mode == 'parallel' else 1)
+                if timmer:
+                    current_time = time.time()
+                    _timmer(start, start_preprocess, current_time, 'preprocess')
+            except Exception as e:
+                self.logger.error('preprocessor error')
+                self.logger.exception(e)
+                return _abort(500, 'preprocessor error - {}'.format(e))
+            def transpose(mat):
+                return zip(*mat)
+            print(len(features))
+            print(len(batches))
+            # call mrc
+            try:
+                if timmer:
+                    start_call_mrc = time.time()
+                mrc_results = []
+                # new_features = []
+                if verbose:
+                    if len(features) > max_batch_size:
+                        print("get a too long example....")
+                if process_mode == 'serial':
+                    mrc_results = [model.call_mrc(b, squeeze_dim0=True) for b in batches[:max_batch_size]]
+                elif process_mode == 'parallel':
+                    # only keep first max_batch_size features
+                    # batches = batches[0]
+                    for b in batches:
+                        mrc_results.extend(model.call_mrc(b, return_list=True))
+                else:
+                    raise NotImplementedError()
+                # new_features = features[:max_batch_size]
+                print('num examples:')
+                print(len(example))
+                print('num features:')
+                print(len(features))
+                if timmer:
+                    current_time = time.time()
+                    _timmer(start, start_call_mrc, current_time, 'call mrc')
+            except Exception as e:
+                self.logger.error('call_mrc error')
+                self.logger.exception(e)
+                return _abort(500, 'call_mrc error - {}'.format(e))
+            # call post processor
+            try:
+                if timmer:
+                    start_post_precess = time.time()
+                results = model.postprocessor(example, features, mrc_results)
+                # only the n-best results are POSTed back
+                self.results.update(results[1])
+                # self.results = results[1]
+                # # self.results = results[0]
+                if timmer:
+                    current_time = time.time()
+                    _timmer(start, start_post_precess, current_time, 'post process')
+            except Exception as e:
+                self.logger.error('postprocessor error')
+                self.logger.exception(e)
+                return _abort(500, 'postprocessor error - {}'.format(e))
+        return self._response_constructor()
+    def _response_constructor(self):
+        """construct http response object"""
+        try:
+            response = {
+                # 'requestID': self.input_json['requestID'],
+                'results': self.results
+            }
+            if self.log_data:
+                self.logger.info(
+                    'Response - {}'.format(json.dumps(response, ensure_ascii=False))
+                )
+            return Response(json.dumps(response), mimetype='application/json')
+        except Exception as e:
+            self.logger.error('response constructor error')
+            self.logger.exception(e)
+            return _abort(500, 'response constructor error - {}'.format(e))
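`BasicMRCService` processes one question at a time: `_split_input_json` turns a request with several entries in `qas` into several single-question requests that share the same context. A minimal sketch of that splitting step follows, leaving out the context truncation. `split_by_question` is an illustrative name, not part of the module.

```python
from copy import deepcopy


def split_by_question(request_json):
    """Return one deep-copied request per question, each with a single-item 'qas'."""
    if len(request_json["qas"]) == 1:
        return [request_json]
    parts = []
    for qa in request_json["qas"]:
        part = deepcopy(request_json)
        part["qas"] = [qa]
        parts.append(part)
    return parts


if __name__ == "__main__":
    request_json = {
        "context": "Paris is the capital of France.",
        "qas": [
            {"qid": "q0", "question": "What is the capital of France?"},
            {"qid": "q1", "question": "Which country is Paris in?"},
        ],
    }
    for part in split_by_question(request_json):
        print(part["qas"][0]["qid"], len(part["qas"]))
```

The deep copy keeps the per-question requests independent, at the cost of duplicating the context for every question.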
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/squad_reader.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/squad_reader.py
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Run BERT on SQuAD 1.1 and SQuAD 2.0."""
+import re
+import six
+import sys
+import math
+import json
+import random
+import collections
+import gc
+import numpy as np
+sys.path.append('.')
+import squad_utils
+from data_utils import SEP_ID, CLS_ID, VOCAB_SIZE
+import sentencepiece as spm
+from prepro_utils import preprocess_text, encode_ids, encode_pieces, printable_text
+SPIECE_UNDERLINE = u'▁'
+SEG_ID_P   = 0
+SEG_ID_Q   = 1
+SEG_ID_CLS = 2
+SEG_ID_PAD = 3
+class SquadExample(object):
+  """A single training/test example for simple sequence classification.
+     For examples without an answer, the start and end position are -1.
+  """
+  def __init__(self,
+               qas_id,
+               question_text,
+               paragraph_text,
+               orig_answer_text=None,
+               start_position=None,
+               is_impossible=False):
+    self.qas_id = qas_id
+    self.question_text = question_text
+    self.paragraph_text = paragraph_text
+    self.orig_answer_text = orig_answer_text
+    self.start_position = start_position
+    self.is_impossible = is_impossible
+  def __str__(self):
+    return self.__repr__()
+  def __repr__(self):
+    s = ""
+    s += "qas_id: %s" % (printable_text(self.qas_id))
+    s += ", question_text: %s" % (
+        printable_text(self.question_text))
+    s += ", paragraph_text: [%s]" % (" ".join(self.paragraph_text))
+    if self.start_position is not None:
+      s += ", start_position: %d" % (self.start_position)
+    if self.start_position is not None:
+      s += ", is_impossible: %r" % (self.is_impossible)
+    return s
+class InputFeatures(object):
+  """A single set of features of data."""
+  def __init__(self,
+               unique_id,
+               example_index,
+               doc_span_index,
+               tok_start_to_orig_index,
+               tok_end_to_orig_index,
+               token_is_max_context,
+               input_ids,
+               input_mask,
+               p_mask,
+               segment_ids,
+               paragraph_len,
+               cls_index,
+               start_position=None,
+               end_position=None,
+               is_impossible=None):
+    self.unique_id = unique_id
+    self.example_index = example_index
+    self.doc_span_index = doc_span_index
+    self.tok_start_to_orig_index = tok_start_to_orig_index
+    self.tok_end_to_orig_index = tok_end_to_orig_index
+    self.token_is_max_context = token_is_max_context
+    self.input_ids = input_ids
+    self.input_mask = input_mask
+    self.p_mask = p_mask
+    self.segment_ids = segment_ids
+    self.paragraph_len = paragraph_len
+    self.cls_index = cls_index
+    self.start_position = start_position
+    self.end_position = end_position
+    self.is_impossible = is_impossible
+def read_squad_examples(sample, is_training):
+  """Convert one request sample (a dict with "context" and "qas") into a list of SquadExample."""
+  examples = []
+  paragraph_text = sample["context"]
+  paragraph_text = re.sub(r'\[TLE\]|\[DOC\]|\[PAR\]', '[SEP]', paragraph_text)
+  for qa in sample["qas"]:
+    qas_id = qa["qid"]
+    question_text = qa["question"]
+    start_position = None
+    orig_answer_text = None
+    is_impossible = False
+    example = SquadExample(
+        qas_id=qas_id,
+        question_text=question_text,
+        paragraph_text=paragraph_text)
+    examples.append(example)
+  return examples
+def _convert_index(index, pos, M=None, is_start=True):
+  if index[pos] is not None:
+    return index[pos]
+  N = len(index)
+  rear = pos
+  while rear < N - 1 and index[rear] is None:
+    rear += 1
+  front = pos
+  while front > 0 and index[front] is None:
+    front -= 1
+  assert index[front] is not None or index[rear] is not None
+  if index[front] is None:
+    if index[rear] >= 1:
+      if is_start:
+        return 0
+      else:
+        return index[rear] - 1
+    return index[rear]
+  if index[rear] is None:
+    if M is not None and index[front] < M - 1:
+      if is_start:
+        return index[front] + 1
+      else:
+        return M - 1
+    return index[front]
+  if is_start:
+    if index[rear] > index[front] + 1:
+      return index[front] + 1
+    else:
+      return index[rear]
+  else:
+    if index[rear] > index[front] + 1:
+      return index[rear] - 1
+    else:
+      return index[front]
+def convert_examples_to_features(examples, sp_model, max_seq_length,
+                                 doc_stride, max_query_length, is_training,
+                                 uncased):
+  """Yields one `InputFeatures` per document span of each example."""
+  cnt_pos, cnt_neg = 0, 0
+  unique_id = 1000000000
+  max_N, max_M = 1024, 1024
+  f = np.zeros((max_N, max_M), dtype=np.float32)
+  for (example_index, example) in enumerate(examples):
+    if example_index % 100 == 0:
+      print('Converting {}/{} pos {} neg {}'.format(
+          example_index, len(examples), cnt_pos, cnt_neg))
+    query_tokens = encode_ids(
+        sp_model,
+        preprocess_text(example.question_text, lower=uncased))
+    if len(query_tokens) > max_query_length:
+      query_tokens = query_tokens[0:max_query_length]
+    paragraph_text = example.paragraph_text
+    para_tokens = encode_pieces(
+        sp_model,
+        preprocess_text(example.paragraph_text, lower=uncased))
+    chartok_to_tok_index = []
+    tok_start_to_chartok_index = []
+    tok_end_to_chartok_index = []
+    char_cnt = 0
+    for i, token in enumerate(para_tokens):
+      chartok_to_tok_index.extend([i] * len(token))
+      tok_start_to_chartok_index.append(char_cnt)
+      char_cnt += len(token)
+      tok_end_to_chartok_index.append(char_cnt - 1)
+    tok_cat_text = ''.join(para_tokens).replace(SPIECE_UNDERLINE, ' ')
+    N, M = len(paragraph_text), len(tok_cat_text)
+    if N > max_N or M > max_M:
+      max_N = max(N, max_N)
+      max_M = max(M, max_M)
+      f = np.zeros((max_N, max_M), dtype=np.float32)
+      gc.collect()
+    g = {}
+    def _lcs_match(max_dist):
+      f.fill(0)
+      g.clear()
+      ### longest common sub sequence
+      # f[i, j] = max(f[i - 1, j], f[i, j - 1], f[i - 1, j - 1] + match(i, j))
+      for i in range(N):
+        # note(zhiliny):
+        # unlike standard LCS, this is specifically optimized for the setting
+        # because the mismatch between sentence pieces and original text will
+        # be small
+        for j in range(i - max_dist, i + max_dist):
+          if j >= M or j < 0: continue
+          if i > 0:
+            g[(i, j)] = 0
+            f[i, j] = f[i - 1, j]
+          if j > 0 and f[i, j - 1] > f[i, j]:
+            g[(i, j)] = 1
+            f[i, j] = f[i, j - 1]
+          f_prev = f[i - 1, j - 1] if i > 0 and j > 0 else 0
+          if (preprocess_text(paragraph_text[i], lower=uncased,
+              remove_space=False)
+              == tok_cat_text[j]
+              and f_prev + 1 > f[i, j]):
+            g[(i, j)] = 2
+            f[i, j] = f_prev + 1
+    max_dist = abs(N - M) + 5
+    for _ in range(2):
+      _lcs_match(max_dist)
+      if f[N - 1, M - 1] > 0.8 * N: break
+      max_dist *= 2
+    orig_to_chartok_index = [None] * N
+    chartok_to_orig_index = [None] * M
+    i, j = N - 1, M - 1
+    while i >= 0 and j >= 0:
+      if (i, j) not in g: break
+      if g[(i, j)] == 2:
+        orig_to_chartok_index[i] = j
+        chartok_to_orig_index[j] = i
+        i, j = i - 1, j - 1
+      elif g[(i, j)] == 1:
+        j = j - 1
+      else:
+        i = i - 1
+    if all(v is None for v in orig_to_chartok_index) or f[N - 1, M - 1] < 0.8 * N:
+      print('MISMATCH DETECTED!')
+      continue
+    tok_start_to_orig_index = []
+    tok_end_to_orig_index = []
+    for i in range(len(para_tokens)):
+      start_chartok_pos = tok_start_to_chartok_index[i]
+      end_chartok_pos = tok_end_to_chartok_index[i]
+      start_orig_pos = _convert_index(chartok_to_orig_index, start_chartok_pos,
+                                      N, is_start=True)
+      end_orig_pos = _convert_index(chartok_to_orig_index, end_chartok_pos,
+                                    N, is_start=False)
+      tok_start_to_orig_index.append(start_orig_pos)
+      tok_end_to_orig_index.append(end_orig_pos)
+    if not is_training:
+      tok_start_position = tok_end_position = None
+    if is_training and example.is_impossible:
+      tok_start_position = -1
+      tok_end_position = -1
+    if is_training and not example.is_impossible:
+      start_position = example.start_position
+      end_position = start_position + len(example.orig_answer_text) - 1
+      start_chartok_pos = _convert_index(orig_to_chartok_index, start_position,
+                                         is_start=True)
+      tok_start_position = chartok_to_tok_index[start_chartok_pos]
+      end_chartok_pos = _convert_index(orig_to_chartok_index, end_position,
+                                       is_start=False)
+      tok_end_position = chartok_to_tok_index[end_chartok_pos]
+      assert tok_start_position <= tok_end_position
+    def _piece_to_id(x):
+      if six.PY2 and isinstance(x, unicode):
+        x = x.encode('utf-8')
+      return sp_model.PieceToId(x)
+    all_doc_tokens = list(map(_piece_to_id, para_tokens))
+    # The -3 accounts for [CLS], [SEP] and [SEP]
+    max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
+    # We can have documents that are longer than the maximum sequence length.
+    # To deal with this we do a sliding window approach, where we take chunks
+    # of up to our max length with a stride of `doc_stride`.
+    _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
+        "DocSpan", ["start", "length"])
+    doc_spans = []
+    start_offset = 0
+    while start_offset < len(all_doc_tokens):
+      length = len(all_doc_tokens) - start_offset
+      if length > max_tokens_for_doc:
+        length = max_tokens_for_doc
+      doc_spans.append(_DocSpan(start=start_offset, length=length))
+      if start_offset + length == len(all_doc_tokens):
+        break
+      start_offset += min(length, doc_stride)
+    for (doc_span_index, doc_span) in enumerate(doc_spans):
+      tokens = []
+      token_is_max_context = {}
+      segment_ids = []
+      p_mask = []
+      cur_tok_start_to_orig_index = []
+      cur_tok_end_to_orig_index = []
+      for i in range(doc_span.length):
+        split_token_index = doc_span.start + i
+        cur_tok_start_to_orig_index.append(
+            tok_start_to_orig_index[split_token_index])
+        cur_tok_end_to_orig_index.append(
+            tok_end_to_orig_index[split_token_index])
+        is_max_context = _check_is_max_context(doc_spans, doc_span_index,
+                                               split_token_index)
+        token_is_max_context[len(tokens)] = is_max_context
+        tokens.append(all_doc_tokens[split_token_index])
+        segment_ids.append(SEG_ID_P)
+        p_mask.append(0)
+      paragraph_len = len(tokens)
+      tokens.append(SEP_ID)
+      segment_ids.append(SEG_ID_P)
+      p_mask.append(1)
+      # note(zhiliny): we put P before Q
+      # because during pretraining, B is always shorter than A
+      for token in query_tokens:
+        tokens.append(token)
+        segment_ids.append(SEG_ID_Q)
+        p_mask.append(1)
+      tokens.append(SEP_ID)
+      segment_ids.append(SEG_ID_Q)
+      p_mask.append(1)
+      cls_index = len(segment_ids)
+      tokens.append(CLS_ID)
+      segment_ids.append(SEG_ID_CLS)
+      p_mask.append(0)
+      input_ids = tokens
+      # The mask has 0 for real tokens and 1 for padding tokens. Only real
+      # tokens are attended to.
+      input_mask = [0] * len(input_ids)
+      # Zero-pad up to the sequence length.
+      while len(input_ids) < max_seq_length:
+        input_ids.append(0)
+        input_mask.append(1)
+        segment_ids.append(SEG_ID_PAD)
+        p_mask.append(1)
+      assert len(input_ids) == max_seq_length
+      assert len(input_mask) == max_seq_length
+      assert len(segment_ids) == max_seq_length
+      assert len(p_mask) == max_seq_length
+      span_is_impossible = example.is_impossible
+      start_position = None
+      end_position = None
+      if is_training and not span_is_impossible:
+        # For training, if our document chunk does not contain the annotated
+        # answer, we keep the chunk but mark its span as impossible.
+        doc_start = doc_span.start
+        doc_end = doc_span.start + doc_span.length - 1
+        out_of_span = False
+        if not (tok_start_position >= doc_start and
+                tok_end_position <= doc_end):
+          out_of_span = True
+        if out_of_span:
+          # continue
+          start_position = 0
+          end_position = 0
+          span_is_impossible = True
+        else:
+          # note(zhiliny): we put P before Q, so doc_offset should be zero.
+          # doc_offset = len(query_tokens) + 2
+          doc_offset = 0
+          start_position = tok_start_position - doc_start + doc_offset
+          end_position = tok_end_position - doc_start + doc_offset
+      if is_training and span_is_impossible:
+        start_position = cls_index
+        end_position = cls_index
+      if example_index < 0:
+        print("*** Example ***")
+        print("unique_id: %s" % (unique_id))
+        print("example_index: %s" % (example_index))
+        print("doc_span_index: %s" % (doc_span_index))
+        print("tok_start_to_orig_index: %s" % " ".join(
+            [str(x) for x in cur_tok_start_to_orig_index]))
+        print("tok_end_to_orig_index: %s" % " ".join(
+            [str(x) for x in cur_tok_end_to_orig_index]))
+        print("token_is_max_context: %s" % " ".join([
+            "%d:%s" % (x, y) for (x, y) in six.iteritems(token_is_max_context)
+        ]))
+        print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
+        print(
+            "input_mask: %s" % " ".join([str(x) for x in input_mask]))
+        print(
+            "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
+        if is_training and span_is_impossible:
+          print("impossible example span")
+        if is_training and not span_is_impossible:
+          pieces = [sp_model.IdToPiece(token) for token in
+                    tokens[start_position: (end_position + 1)]]
+          answer_text = sp_model.DecodePieces(pieces)
+          print("start_position: %d" % (start_position))
+          print("end_position: %d" % (end_position))
+          print(
+              "answer: %s" % (printable_text(answer_text)))
+          # note(zhiliny): With multi processing,
+          # the example_index is actually the index within the current process
+          # therefore we use example_index=None to avoid being used in the future.
+          # The current code does not use example_index of training data.
+      if is_training:
+        feat_example_index = None
+      else:
+        feat_example_index = example_index
+      feature = InputFeatures(
+          unique_id=unique_id,
+          example_index=feat_example_index,
+          doc_span_index=doc_span_index,
+          tok_start_to_orig_index=cur_tok_start_to_orig_index,
+          tok_end_to_orig_index=cur_tok_end_to_orig_index,
+          token_is_max_context=token_is_max_context,
+          input_ids=input_ids,
+          input_mask=input_mask,
+          p_mask=p_mask,
+          segment_ids=segment_ids,
+          paragraph_len=paragraph_len,
+          cls_index=cls_index,
+          start_position=start_position,
+          end_position=end_position,
+          is_impossible=span_is_impossible)
+      unique_id += 1
+      if span_is_impossible:
+        cnt_neg += 1
+      else:
+        cnt_pos += 1
+      yield feature
+  print("Total number of instances: {} = pos {} neg {}".format(
+      cnt_pos + cnt_neg, cnt_pos, cnt_neg))
+def _check_is_max_context(doc_spans, cur_span_index, position):
+  """Check if this is the 'max context' doc span for the token."""
+  # Because of the sliding window approach taken to scoring documents, a single
+  # token can appear in multiple spans. E.g.
+  #  Doc: the man went to the store and bought a gallon of milk
+  #  Span A: the man went to the
+  #  Span B: to the store and bought
+  #  Span C: and bought a gallon of
+  #  ...
+  #
+  # Now the word 'bought' will have two scores from spans B and C. We only
+  # want to consider the score with "maximum context", which we define as
+  # the *minimum* of its left and right context (the *sum* of left and
+  # right context will always be the same, of course).
+  #
+  # In the example the maximum context for 'bought' would be span C since
+  # it has 1 left context and 3 right context, while span B has 4 left context
+  # and 0 right context.
+  best_score = None
+  best_span_index = None
+  for (span_index, doc_span) in enumerate(doc_spans):
+    end = doc_span.start + doc_span.length - 1
+    if position < doc_span.start:
+      continue
+    if position > end:
+      continue
+    num_left_context = position - doc_span.start
+    num_right_context = end - position
+    score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
+    if best_score is None or score > best_score:
+      best_score = score
+      best_span_index = span_index
+  return cur_span_index == best_span_index
+class DataProcessor(object):
+    def __init__(self, spiece_model_file, uncased, max_seq_length,
+                 doc_stride, max_query_length):
+        self._sp_model = spm.SentencePieceProcessor()
+        self._sp_model.Load(spiece_model_file)
+        self._uncased = uncased 
+        self._max_seq_length = max_seq_length
+        self._doc_stride = doc_stride
+        self._max_query_length = max_query_length
+        self.current_train_example = -1
+        self.num_train_examples = -1
+        self.current_train_epoch = -1
+        self.train_examples = None
+        self.predict_examples = None
+        self.num_examples = {'train': -1, 'predict': -1}
+    def get_train_progress(self):
+        """Gets progress for training phase."""
+        return self.current_train_example, self.current_train_epoch
+    def get_examples(self,
+                     sample,
+                     is_training):
+        examples = read_squad_examples(
+            sample,
+            is_training=is_training)
+        return examples
+    def get_num_examples(self, phase):
+        if phase not in ['train', 'predict']:
+            raise ValueError(
+                "Unknown phase, which should be in ['train', 'predict'].")
+        return self.num_examples[phase]
+    def get_features(self, examples, is_training):
+        features = convert_examples_to_features(
+            examples=examples,
+            sp_model=self._sp_model,
+            max_seq_length=self._max_seq_length,
+            doc_stride=self._doc_stride,
+            max_query_length=self._max_query_length,
+            is_training=is_training,
+            uncased=self._uncased)
+        return features
+    def data_generator(self,
+                       sample,
+                       batch_size,
+                       phase='predict',
+                       shuffle=False,
+                       dev_count=1,
+                       epoch=1):
+        self.predict_examples = self.get_examples(
+            sample,
+            is_training=False)
+        examples = self.predict_examples
+        self.num_examples['predict'] = len(self.predict_examples)
+        def batch_reader(features, batch_size):
+            batch = []
+            feats = []
+            for (index, feature) in enumerate(features):
+                if phase == 'train':
+                    self.current_train_example = index + 1
+                labels = [feature.unique_id] if feature.start_position is None else [
+                              feature.start_position, feature.end_position, feature.is_impossible
+                          ]
+                example = [
+                    feature.input_ids, feature.segment_ids, feature.input_mask, 
+                    feature.cls_index, feature.p_mask
+                ] + labels
+                to_append = len(batch) < batch_size
+                if to_append:
+                    batch.append(example)
+                    feats.append(feature)
+                else:
+                    yield batch, feats
+                    batch = [example]
+                    feats = [feature]
+            if len(batch) > 0:
+                yield batch, feats
+        def prepare_batch_data(insts):
+            """Generate numpy tensors"""
+            input_ids = np.expand_dims(np.array([inst[0] for inst in insts]).astype('int64'), axis=-1)
+            segment_ids = np.array([inst[1] for inst in insts]).astype('int64')
+            input_mask = np.array([inst[2] for inst in insts]).astype('float32')
+            cls_index = np.expand_dims(np.array([inst[3] for inst in insts]).astype('int64'), axis=-1)
+            p_mask = np.array([inst[4] for inst in insts]).astype('float32')
+            ret_list = [input_ids, segment_ids, input_mask, cls_index, p_mask]
+            if phase == 'train':
+                start_positions = np.expand_dims(np.array([inst[5] for inst in insts]).astype('int64'), axis=-1)
+                end_positions = np.expand_dims(np.array([inst[6] for inst in insts]).astype('int64'), axis=-1)
+                is_impossible = np.expand_dims(np.array([inst[7] for inst in insts]).astype('float32'), axis=-1)
+                ret_list += [start_positions, end_positions, is_impossible]
+            else:
+                unique_ids = np.expand_dims(np.array([inst[5] for inst in insts]).astype('int64'), axis=-1)
+                ret_list += [unique_ids]
+            return ret_list
+        feature_gen = self.get_features(examples, is_training=False)
+        all_dev_batches = []
+        features = []
+        for batch_insts, feats in batch_reader(feature_gen, batch_size):
+            batch_data = prepare_batch_data(batch_insts)
+            all_dev_batches.append(batch_data)
+            features.extend(feats)
+        return examples, features, all_dev_batches
+_PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name
+    "PrelimPrediction",
+    ["feature_index", "start_index", "end_index",
+    "start_log_prob", "end_log_prob"])
+_NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name
+    "NbestPrediction", ["text", "start_log_prob", "end_log_prob"])
+def get_answers(all_examples, all_features, all_results, n_best_size,
+                      max_answer_length, start_n_top, end_n_top):
+  """Compute the best answer and the n-best list for every example.
+
+  Returns (all_predictions, all_nbest_json); nothing is written to disk.
+  """
+  example_index_to_features = collections.defaultdict(list)
+  for feature in all_features:
+    example_index_to_features[feature.example_index].append(feature)
+  unique_id_to_result = {}
+  for result in all_results:
+    unique_id_to_result[result.unique_id] = result
+  all_predictions = collections.OrderedDict()
+  all_nbest_json = collections.OrderedDict()
+  scores_diff_json = collections.OrderedDict()
+  for (example_index, example) in enumerate(all_examples):
+    features = example_index_to_features[example_index]
+    prelim_predictions = []
+    # keep track of the minimum score of null start+end of position 0
+    score_null = 1000000  # large and positive
+    for (feature_index, feature) in enumerate(features):
+      result = unique_id_to_result[feature.unique_id]
+      print("cls_logits", feature.unique_id, result.cls_logits)
+      cur_null_score = result.cls_logits
+      # if we could have irrelevant answers, get the min score of irrelevant
+      score_null = min(score_null, cur_null_score)
+      for i in range(start_n_top):
+        for j in range(end_n_top):
+          start_log_prob = result.start_top_log_probs[i]
+          start_index = result.start_top_index[i]
+          j_index = i * end_n_top + j
+          end_log_prob = result.end_top_log_probs[j_index]
+          end_index = result.end_top_index[j_index]
+          # We could hypothetically create invalid predictions, e.g., predict
+          # that the start of the span is in the question. We throw out all
+          # invalid predictions.
+          if start_index >= feature.paragraph_len - 1:
+            continue
+          if end_index >= feature.paragraph_len - 1:
+            continue
+          if not feature.token_is_max_context.get(start_index, False):
+            continue
+          if end_index < start_index:
+            continue
+          length = end_index - start_index + 1
+          if length > max_answer_length:
+            continue
+          prelim_predictions.append(
+              _PrelimPrediction(
+                  feature_index=feature_index,
+                  start_index=start_index,
+                  end_index=end_index,
+                  start_log_prob=start_log_prob,
+                  end_log_prob=end_log_prob))
+    prelim_predictions = sorted(
+        prelim_predictions,
+        key=lambda x: (x.start_log_prob + x.end_log_prob),
+        reverse=True)
+    seen_predictions = {}
+    nbest = []
+    for pred in prelim_predictions:
+      if len(nbest) >= n_best_size:
+        break
+      feature = features[pred.feature_index]
+      tok_start_to_orig_index = feature.tok_start_to_orig_index
+      tok_end_to_orig_index = feature.tok_end_to_orig_index
+      start_orig_pos = tok_start_to_orig_index[pred.start_index]
+      end_orig_pos = tok_end_to_orig_index[pred.end_index]
+      paragraph_text = example.paragraph_text
+      final_text = paragraph_text[start_orig_pos: end_orig_pos + 1].strip()
+      if final_text in seen_predictions:
+        continue
+      seen_predictions[final_text] = True
+      nbest.append(
+          _NbestPrediction(
+              text=final_text,
+              start_log_prob=pred.start_log_prob,
+              end_log_prob=pred.end_log_prob))
+    # In very rare edge cases we could have no valid predictions. So we
+    # just create a nonce prediction in this case to avoid failure.
+    if not nbest:
+      nbest.append(
+          _NbestPrediction(text="", start_log_prob=-1e6,
+          end_log_prob=-1e6))
+    total_scores = []
+    best_non_null_entry = None
+    for entry in nbest:
+      total_scores.append(entry.start_log_prob + entry.end_log_prob)
+      if not best_non_null_entry:
+        best_non_null_entry = entry
+    probs = _compute_softmax(total_scores)
+    nbest_json = []
+    for (i, entry) in enumerate(nbest):
+      output = collections.OrderedDict()
+      output["text"] = entry.text
+      output["probability"] = probs[i]
+      output["start_log_prob"] = entry.start_log_prob
+      output["end_log_prob"] = entry.end_log_prob
+      nbest_json.append(output)
+    assert len(nbest_json) >= 1
+    assert best_non_null_entry is not None
+    score_diff = score_null
+    scores_diff_json[example.qas_id] = score_diff
+    # note(zhiliny): always predict best_non_null_entry
+    # and the evaluation script will search for the best threshold
+    all_predictions[example.qas_id] = best_non_null_entry.text
+    all_nbest_json[example.qas_id] = nbest_json
+  return all_predictions, all_nbest_json
+  # with open(output_prediction_file, "w") as writer:
+  #   writer.write(json.dumps(all_predictions, indent=4) + "\n")
+  # with open(output_nbest_file, "w") as writer:
+  #   writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
+  # with open(output_null_log_odds_file, "w") as writer:
+  #   writer.write(json.dumps(scores_diff_json, indent=4) + "\n")
+  # qid_to_has_ans = squad_utils.make_qid_to_has_ans(orig_data)
+  # has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
+  # no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
+  # exact_raw, f1_raw = squad_utils.get_raw_scores(orig_data, all_predictions)
+  # out_eval = {}
+  # squad_utils.find_all_best_thresh_v2(out_eval, all_predictions, exact_raw, f1_raw,
+  #                                  scores_diff_json, qid_to_has_ans)
+  # return out_eval
+def _get_best_indexes(logits, n_best_size):
+  """Get the n-best logits from a list."""
+  index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)
+  best_indexes = []
+  for i in range(len(index_and_score)):
+    if i >= n_best_size:
+      break
+    best_indexes.append(index_and_score[i][0])
+  return best_indexes
+def _compute_softmax(scores):
+  """Compute softmax probability over raw logits."""
+  if not scores:
+    return []
+  max_score = None
+  for score in scores:
+    if max_score is None or score > max_score:
+      max_score = score
+  exp_scores = []
+  total_sum = 0.0
+  for score in scores:
+    x = math.exp(score - max_score)
+    exp_scores.append(x)
+    total_sum += x
+  probs = []
+  for score in exp_scores:
+    probs.append(score / total_sum)
+  return probs
+if __name__ == '__main__':
+    processor = DataProcessor(spiece_model_file="xlnet_cased_L-24_H-1024_A-16/spiece.model", 
+                                uncased=False, 
+                                max_seq_length=512,
+                                doc_stride=128, 
+                                max_query_length=64)
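+    # data_generator() takes `sample` (there is no `data_path` argument) and
+    # returns an (examples, features, batches) tuple, not a callable generator.
+    # Assumption: read_squad_examples() accepts the parsed JSON object.
+    import json
+    with open("squad_v2.0/dev-v2.0.json") as f:
+        sample = json.load(f)
+    examples, features, batches = processor.data_generator(
+            sample=sample,
+            batch_size=32,
+            phase='predict',
+            shuffle=True,
+            dev_count=1,
+            epoch=1)
+    for (index, batch) in enumerate(batches):
+        if index < 10:
+            print("index:", index)
+            for tensor in batch:
+                print(tensor.shape)
+        else:
+            break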
+    #for (index, example) in enumerate(train_examples):
+    #    if index < 5:
+    #        print(example)
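The max-context rule described in the comments of `_check_is_max_context` can be checked by hand on the 'bought' example. The following is a minimal standalone sketch, not part of this commit: `DocSpan` and `max_context_scores` are hypothetical names introduced here, and the span offsets are the token positions of Span A, B and C from that comment.

```python
import collections

# Hypothetical stand-in for the doc-span records used by the reader.
DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def max_context_scores(doc_spans, position):
    """Return {span_index: score} for every span that contains `position`."""
    scores = {}
    for span_index, span in enumerate(doc_spans):
        end = span.start + span.length - 1
        if position < span.start or position > end:
            continue
        left = position - span.start
        right = end - position
        # Same scoring rule as _check_is_max_context: the smaller of the two
        # context sizes, plus a small bonus for longer spans.
        scores[span_index] = min(left, right) + 0.01 * span.length
    return scores


# Doc: the man went to the store and bought a gallon of milk
# Token positions: the=0 man=1 went=2 to=3 the=4 store=5 and=6 bought=7 ...
# Span A covers tokens 0..4, Span B covers 3..7, Span C covers 6..10.
spans = [DocSpan(0, 5), DocSpan(3, 5), DocSpan(6, 5)]
scores = max_context_scores(spans, position=7)  # position of 'bought'
best_span = max(scores, key=scores.get)
print(scores, best_span)
```

Span A does not contain 'bought', so only spans B and C receive a score. Span C wins because its smaller context side (1 token) is larger than span B's (0 tokens).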
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/squad_utils.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/squad_utils.py
+"""Official evaluation script for SQuAD version 2.0.
+In addition to basic functionality, we also compute additional statistics and
+plot precision-recall curves if an additional na_prob.json file is provided.
+This file is expected to map question ID's to the model's predicted probability
+that a question is unanswerable.
+"""
+import argparse
+import collections
+import json
+import numpy as np
+import os
+import re
+import string
+import sys
+OPTS = None
+def parse_args():
+  parser = argparse.ArgumentParser('Official evaluation script for SQuAD version 2.0.')
+  parser.add_argument('data_file', metavar='data.json', help='Input data JSON file.')
+  parser.add_argument('pred_file', metavar='pred.json', help='Model predictions.')
+  parser.add_argument('--out-file', '-o', metavar='eval.json',
+                      help='Write accuracy metrics to file (default is stdout).')
+  parser.add_argument('--na-prob-file', '-n', metavar='na_prob.json',
+                      help='Model estimates of probability of no answer.')
+  parser.add_argument('--na-prob-thresh', '-t', type=float, default=1.0,
+                      help='Predict "" if no-answer probability exceeds this (default = 1.0).')
+  parser.add_argument('--out-image-dir', '-p', metavar='out_images', default=None,
+                      help='Save precision-recall curves to directory.')
+  parser.add_argument('--verbose', '-v', action='store_true')
+  if len(sys.argv) == 1:
+    parser.print_help()
+    sys.exit(1)
+  return parser.parse_args()
+def make_qid_to_has_ans(dataset):
+  qid_to_has_ans = {}
+  for article in dataset:
+    for p in article['paragraphs']:
+      for qa in p['qas']:
+        qid_to_has_ans[qa['id']] = bool(qa['answers'])
+  return qid_to_has_ans
+def normalize_answer(s):
+  """Lower text and remove punctuation, articles and extra whitespace."""
+  def remove_articles(text):
+    regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
+    return re.sub(regex, ' ', text)
+  def white_space_fix(text):
+    return ' '.join(text.split())
+  def remove_punc(text):
+    exclude = set(string.punctuation)
+    return ''.join(ch for ch in text if ch not in exclude)
+  def lower(text):
+    return text.lower()
+  return white_space_fix(remove_articles(remove_punc(lower(s))))
+def get_tokens(s):
+  if not s: return []
+  return normalize_answer(s).split()
+def compute_exact(a_gold, a_pred):
+  return int(normalize_answer(a_gold) == normalize_answer(a_pred))
+def compute_f1(a_gold, a_pred):
+  gold_toks = get_tokens(a_gold)
+  pred_toks = get_tokens(a_pred)
+  common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
+  num_same = sum(common.values())
+  if len(gold_toks) == 0 or len(pred_toks) == 0:
+    # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
+    return int(gold_toks == pred_toks)
+  if num_same == 0:
+    return 0
+  precision = 1.0 * num_same / len(pred_toks)
+  recall = 1.0 * num_same / len(gold_toks)
+  f1 = (2 * precision * recall) / (precision + recall)
+  return f1
+def get_raw_scores(dataset, preds):
+  exact_scores = {}
+  f1_scores = {}
+  for article in dataset:
+    for p in article['paragraphs']:
+      for qa in p['qas']:
+        qid = qa['id']
+        gold_answers = [a['text'] for a in qa['answers']
+                        if normalize_answer(a['text'])]
+        if not gold_answers:
+          # For unanswerable questions, only correct answer is empty string
+          gold_answers = ['']
+        if qid not in preds:
+          print('Missing prediction for %s' % qid)
+          continue
+        a_pred = preds[qid]
+        # Take max over all gold answers
+        exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers)
+        f1_scores[qid] = max(compute_f1(a, a_pred) for a in gold_answers)
+  return exact_scores, f1_scores
+def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
+  new_scores = {}
+  for qid, s in scores.items():
+    pred_na = na_probs[qid] > na_prob_thresh
+    if pred_na:
+      new_scores[qid] = float(not qid_to_has_ans[qid])
+    else:
+      new_scores[qid] = s
+  return new_scores
+def make_eval_dict(exact_scores, f1_scores, qid_list=None):
+  if not qid_list:
+    total = len(exact_scores)
+    return collections.OrderedDict([
+        ('exact', 100.0 * sum(exact_scores.values()) / total),
+        ('f1', 100.0 * sum(f1_scores.values()) / total),
+        ('total', total),
+    ])
+  else:
+    total = len(qid_list)
+    return collections.OrderedDict([
+        ('exact', 100.0 * sum(exact_scores[k] for k in qid_list) / total),
+        ('f1', 100.0 * sum(f1_scores[k] for k in qid_list) / total),
+        ('total', total),
+    ])
+def merge_eval(main_eval, new_eval, prefix):
+  for k in new_eval:
+    main_eval['%s_%s' % (prefix, k)] = new_eval[k]
+def plot_pr_curve(precisions, recalls, out_image, title):
+  plt.step(recalls, precisions, color='b', alpha=0.2, where='post')
+  plt.fill_between(recalls, precisions, step='post', alpha=0.2, color='b')
+  plt.xlabel('Recall')
+  plt.ylabel('Precision')
+  plt.xlim([0.0, 1.05])
+  plt.ylim([0.0, 1.05])
+  plt.title(title)
+  plt.savefig(out_image)
+  plt.clf()
+def make_precision_recall_eval(scores, na_probs, num_true_pos, qid_to_has_ans,
+                               out_image=None, title=None):
+  qid_list = sorted(na_probs, key=lambda k: na_probs[k])
+  true_pos = 0.0
+  cur_p = 1.0
+  cur_r = 0.0
+  precisions = [1.0]
+  recalls = [0.0]
+  avg_prec = 0.0
+  for i, qid in enumerate(qid_list):
+    if qid_to_has_ans[qid]:
+      true_pos += scores[qid]
+    cur_p = true_pos / float(i+1)
+    cur_r = true_pos / float(num_true_pos)
+    if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i+1]]:
+      # i.e., if we can put a threshold after this point
+      avg_prec += cur_p * (cur_r - recalls[-1])
+      precisions.append(cur_p)
+      recalls.append(cur_r)
+  if out_image:
+    plot_pr_curve(precisions, recalls, out_image, title)
+  return {'ap': 100.0 * avg_prec}
+def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs, 
+                                  qid_to_has_ans, out_image_dir):
+  if out_image_dir and not os.path.exists(out_image_dir):
+    os.makedirs(out_image_dir)
+  num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
+  if num_true_pos == 0:
+    return
+  pr_exact = make_precision_recall_eval(
+      exact_raw, na_probs, num_true_pos, qid_to_has_ans,
+      out_image=os.path.join(out_image_dir, 'pr_exact.png'),
+      title='Precision-Recall curve for Exact Match score')
+  pr_f1 = make_precision_recall_eval(
+      f1_raw, na_probs, num_true_pos, qid_to_has_ans,
+      out_image=os.path.join(out_image_dir, 'pr_f1.png'),
+      title='Precision-Recall curve for F1 score')
+  oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
+  pr_oracle = make_precision_recall_eval(
+      oracle_scores, na_probs, num_true_pos, qid_to_has_ans,
+      out_image=os.path.join(out_image_dir, 'pr_oracle.png'),
+      title='Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)')
+  merge_eval(main_eval, pr_exact, 'pr_exact')
+  merge_eval(main_eval, pr_f1, 'pr_f1')
+  merge_eval(main_eval, pr_oracle, 'pr_oracle')
+def histogram_na_prob(na_probs, qid_list, image_dir, name):
+  if not qid_list:
+    return
+  x = [na_probs[k] for k in qid_list]
+  weights = np.ones_like(x) / float(len(x))
+  plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
+  plt.xlabel('Model probability of no-answer')
+  plt.ylabel('Proportion of dataset')
+  plt.title('Histogram of no-answer probability: %s' % name)
+  plt.savefig(os.path.join(image_dir, 'na_prob_hist_%s.png' % name))
+  plt.clf()
+def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
+  num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
+  cur_score = num_no_ans
+  best_score = cur_score
+  best_thresh = 0.0
+  qid_list = sorted(na_probs, key=lambda k: na_probs[k])
+  for i, qid in enumerate(qid_list):
+    if qid not in scores: continue
+    if qid_to_has_ans[qid]:
+      diff = scores[qid]
+    else:
+      if preds[qid]:
+        diff = -1
+      else:
+        diff = 0
+    cur_score += diff
+    if cur_score > best_score:
+      best_score = cur_score
+      best_thresh = na_probs[qid]
+  return 100.0 * best_score / len(scores), best_thresh
+def find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):
+  num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
+  cur_score = num_no_ans
+  best_score = cur_score
+  best_thresh = 0.0
+  qid_list = sorted(na_probs, key=lambda k: na_probs[k])
+  for i, qid in enumerate(qid_list):
+    if qid not in scores: continue
+    if qid_to_has_ans[qid]:
+      diff = scores[qid]
+    else:
+      if preds[qid]:
+        diff = -1
+      else:
+        diff = 0
+    cur_score += diff
+    if cur_score > best_score:
+      best_score = cur_score
+      best_thresh = na_probs[qid]
+  has_ans_score, has_ans_cnt = 0, 0
+  for qid in qid_list:
+    if not qid_to_has_ans[qid]: continue
+    has_ans_cnt += 1
+    if qid not in scores: continue
+    has_ans_score += scores[qid]
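+  # Guard against datasets that contain no answerable questions.
+  has_ans_avg = 1.0 * has_ans_score / has_ans_cnt if has_ans_cnt else 0.0
+  return 100.0 * best_score / len(scores), best_thresh, has_ans_avg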
+def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):
+  best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)
+  best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)
+  main_eval['best_exact'] = best_exact
+  main_eval['best_exact_thresh'] = exact_thresh
+  main_eval['best_f1'] = best_f1
+  main_eval['best_f1_thresh'] = f1_thresh
+def find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):
+  best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(preds, exact_raw, na_probs, qid_to_has_ans)
+  best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(preds, f1_raw, na_probs, qid_to_has_ans)
+  main_eval['best_exact'] = best_exact
+  main_eval['best_exact_thresh'] = exact_thresh
+  main_eval['best_f1'] = best_f1
+  main_eval['best_f1_thresh'] = f1_thresh
+  main_eval['has_ans_exact'] = has_ans_exact
+  main_eval['has_ans_f1'] = has_ans_f1
+def main():
+  with open(OPTS.data_file) as f:
+    dataset_json = json.load(f)
+    dataset = dataset_json['data']
+  with open(OPTS.pred_file) as f:
+    preds = json.load(f)
+  new_orig_data = []
+  for article in dataset:
+    for p in article['paragraphs']:
+      for qa in p['qas']:
+        if qa['id'] in preds:
+          new_para = {'qas': [qa]}
+          new_article = {'paragraphs': [new_para]}
+          new_orig_data.append(new_article)
+  dataset = new_orig_data
+  if OPTS.na_prob_file:
+    with open(OPTS.na_prob_file) as f:
+      na_probs = json.load(f)
+  else:
+    na_probs = {k: 0.0 for k in preds}
+  qid_to_has_ans = make_qid_to_has_ans(dataset)  # maps qid to True/False
+  has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
+  no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
+  exact_raw, f1_raw = get_raw_scores(dataset, preds)
+  exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans,
+                                        OPTS.na_prob_thresh)
+  f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans,
+                                     OPTS.na_prob_thresh)
+  out_eval = make_eval_dict(exact_thresh, f1_thresh)
+  if has_ans_qids:
+    has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
+    merge_eval(out_eval, has_ans_eval, 'HasAns')
+  if no_ans_qids:
+    no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
+    merge_eval(out_eval, no_ans_eval, 'NoAns')
+  if OPTS.na_prob_file:
+    find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans)
+  if OPTS.na_prob_file and OPTS.out_image_dir:
+    run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs, 
+                                  qid_to_has_ans, OPTS.out_image_dir)
+    histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, 'hasAns')
+    histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, 'noAns')
+  if OPTS.out_file:
+    with open(OPTS.out_file, 'w') as f:
+      json.dump(out_eval, f)
+  else:
+    print(json.dumps(out_eval, indent=2))
+if __name__ == '__main__':
+  OPTS = parse_args()
+  if OPTS.out_image_dir:
+    import matplotlib
+    matplotlib.use('Agg')
+    import matplotlib.pyplot as plt 
+  main()
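The token-level F1 computed by `compute_f1` is easier to follow on a concrete pair of strings. The sketch below is a compact standalone re-statement of the same normalization and overlap logic, written for Python 3; `normalize` and `token_f1` are hypothetical names used only for illustration, and the example strings are made up.

```python
import collections
import re
import string

_PUNCT = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(gold, pred):
    """Token-overlap F1 between a gold answer and a predicted answer."""
    gold_toks = normalize(gold).split() if gold else []
    pred_toks = normalize(pred).split() if pred else []
    if not gold_toks or not pred_toks:
        # If either side is "no answer", F1 is 1 only when both are empty.
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


# Gold normalizes to 2 tokens, prediction to 4 tokens, 2 tokens overlap:
# precision = 2/4, recall = 2/2, so F1 = 2 * 0.5 * 1.0 / 1.5 = 2/3.
f1 = token_f1("The Eiffel Tower", "eiffel tower in Paris")
print(round(f1, 4))
```

Because articles and punctuation are removed before tokenizing, "The Eiffel Tower" and "eiffel tower" count as an exact match, while extra predicted tokens lower precision only.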
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/wrapper.py
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/wrapper.py
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+"""XLNet (PaddlePaddle) model wrapper"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import json
+import collections
+import multiprocessing
+import argparse
+import numpy as np
+import paddle.fluid as fluid
+from squad_reader import DataProcessor, get_answers
+from model.xlnet import XLNetConfig, XLNetModel
+conf_dir = "xlnet_config"
+bert_config_path = conf_dir+'/xlnet_config.json'
+spiece_model_file = conf_dir+'/spiece.model'
+ema_decay = 0.9999
+verbose = False
+vocab_path = conf_dir+'/vocab.txt'
+max_seq_len = 800
+max_query_length = 64
+max_answer_length = 30
+in_tokens = False
+do_lower_case = False
+doc_stride = 128
+n_best_size = 20
+start_n_top = 5
+end_n_top = 5
+use_cuda = True
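+class BertModelWrapper():
+    """
+    Wrap an XLNet inference model (despite the class name).
+     The basic processes include input checking, preprocessing, running the
+     Paddle inference program, and postprocessing.
+    """
+    def __init__(self, model_dir):
+        """Load the XLNet config, the data processor and the inference model in `model_dir`."""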
+        xlnet_config = XLNetConfig(bert_config_path)
+        xlnet_config.print_config()
+        if use_cuda:
+            place = fluid.CUDAPlace(0)
+            dev_count = fluid.core.get_cuda_device_count()
+        else:
+            place = fluid.CPUPlace()
+            dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
+        self.exe = fluid.Executor(place)
+        self.processor = DataProcessor(
+            spiece_model_file=spiece_model_file,
+            uncased=do_lower_case,
+            max_seq_length=max_seq_len,
+            doc_stride=doc_stride,
+            max_query_length=max_query_length)
+        self.inference_program, self.feed_target_names, self.fetch_targets = \
+            fluid.io.load_inference_model(dirname=model_dir, executor=self.exe)
+        # self.inference_program = fluid.compiler.CompiledProgram(self.inference_program)
+        # self.exe = fluid.ParallelExecutor(
+        #     use_cuda=use_cuda,
+        #     main_program=self.inference_program)
+    def preprocessor(self, samples, batch_size):
+        """Preprocess the input samples, including word seg, padding, token to ids"""
+        # Tokenization and paragraph padding
+        examples, features, batch = self.processor.data_generator(
+            samples, batch_size)
+        self.samples = samples
+        return examples, features, batch
+    def call_mrc(self, batch, squeeze_dim0=False, return_list=False):
+        """Run the inference program on one batch produced by `preprocessor`."""
+        if squeeze_dim0 and return_list:
+            raise ValueError("squeeze_dim0 only works for dict-type return values.")
+        # Order matches prepare_batch_data() in squad_reader.py (predict phase).
+        input_ids = batch[0]
+        segment_ids = batch[1]
+        input_mask = batch[2]
+        cls_index = batch[3]
+        p_mask = batch[4]
+        unique_ids = batch[5]
+        feed_dict = {
+            self.feed_target_names[0]: input_ids,
+            self.feed_target_names[1]: segment_ids,
+            self.feed_target_names[2]: input_mask,
+            self.feed_target_names[3]: cls_index,
+            self.feed_target_names[4]: p_mask,
+            self.feed_target_names[5]: unique_ids
+        }
+        np_unique_ids, np_start_logits, np_start_top_index, np_end_logits, np_end_top_index, np_cls_logits = \
+            self.exe.run(self.inference_program, feed=feed_dict, fetch_list=self.fetch_targets, use_program_cache=True)
+        # np_unique_ids, np_start_logits, np_end_logits, np_num_seqs = \
+        #     self.exe.run(feed=feed_dict, fetch_list=self.fetch_targets)
+        if len(np_unique_ids) == 1 and squeeze_dim0:
+            np_unique_ids = np_unique_ids[0]
+            np_start_logits = np_start_logits[0]
+            np_end_logits = np_end_logits[0]
+        if return_list:
+            mrc_results = [{'unique_ids': id, 'start_logits': st, 'start_idx': st_idx, 'end_logits': end, 'end_idx': end_idx, 'cls': cls} 
+                            for id, st, st_idx, end, end_idx, cls in zip(np_unique_ids, np_start_logits, np_start_top_index, np_end_logits, np_end_top_index, np_cls_logits)]
+        else:
+            raise NotImplementedError()
+        return mrc_results
+    def postprocessor(self, examples, features, mrc_results):
+        """Extract answers.
+         examples, features: outputs of preprocessor
+         mrc_results: model results from call_mrc. If mrc_results is a list, each element is a size=1 batch.
+        """
+        RawResult = collections.namedtuple("RawResult",
+                                            ["unique_id", "start_top_log_probs", "start_top_index",
+                                            "end_top_log_probs", "end_top_index", "cls_logits"])
+        results = []
+        if isinstance(mrc_results, list):
+            for res in mrc_results:
+                unique_id = res['unique_ids'][0]
+                start_logits = [float(x) for x in res['start_logits'].flat]
+                start_idx = [int(x) for x in res['start_idx'].flat]
+                end_logits = [float(x) for x in res['end_logits'].flat]
+                end_idx = [int(x) for x in res['end_idx'].flat]
+                cls_logits = float(res['cls'].flat[0])
+                results.append(
+                    RawResult(
+                        unique_id=unique_id,
+                        start_top_log_probs=start_logits,
+                        start_top_index=start_idx,
+                        end_top_log_probs=end_logits,
+                        end_top_index=end_idx,
+                        cls_logits=cls_logits))
+        else:
+            assert isinstance(mrc_results, dict)
+            # Dict-type results are not supported here; call call_mrc with return_list=True.
+            raise NotImplementedError()
+        answers = get_answers(
+            examples, features, results, n_best_size,
+            max_answer_length, start_n_top, end_n_top)
+        return answers
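`postprocessor` flattens the per-feature top-k arrays with `.flat`, and `get_answers` then looks up the end candidate for start candidate `i` at `i * end_n_top + j`. The sketch below illustrates that index arithmetic with plain lists. It assumes a row-major `(start_n_top, end_n_top)` layout, which is what the arithmetic implies but which this commit does not show explicitly; `end_candidate` and the values in the table are hypothetical.

```python
start_n_top, end_n_top = 2, 3

# Hypothetical table: row i holds the end indexes proposed for start candidate i.
end_top_index = [
    [11, 12, 13],
    [21, 22, 23],
]

# Row-major flatten, the same order that numpy's ndarray.flat iterates in.
flat = [value for row in end_top_index for value in row]


def end_candidate(flat_values, i, j, end_n_top):
    """Return the j-th end candidate that belongs to the i-th start candidate."""
    return flat_values[i * end_n_top + j]


# Rebuild the table from the flat list using the same arithmetic as get_answers.
recovered = [
    [end_candidate(flat, i, j, end_n_top) for j in range(end_n_top)]
    for i in range(start_n_top)
]
print(recovered == end_top_index)
```

If the inference model emitted the end candidates in a different layout, the lookup in `get_answers` would pair start and end candidates incorrectly, so the layout is worth verifying against the exported model.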
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/xlnet_config/spiece.model
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/xlnet_config/spiece.model
--- a/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/xlnet_config/xlnet_config.json
+++ b/PaddleNLP/Research/MRQA2019-D-NET/server/xlnet_server/xlnet_config/xlnet_config.json
+{
+    "d_head": 64, 
+    "d_inner": 4096, 
+    "d_model": 1024, 
+    "ff_activation": "gelu", 
+    "n_head": 16, 
+    "n_layer": 24, 
+    "n_token": 32000, 
+    "untie_r": true
+}
\ No newline at end of file