Unverified · Commit e34627d7 · Authored by Yibing Liu · Committed by GitHub

Add lang representation model XLNet (#3831)

Parent 43e4ec71
[中文版](README_cn.md)
This project is the implementation of [XLNet](https://github.com/zihangdai/xlnet) on Paddle Fluid, currently supporting fine-tuning on all downstream tasks, including natural language inference, question answering (SQuAD), etc.
There are many differences between XLNet and [BERT](../BERT). XLNet takes advantage of a novel architecture, [Transformer-XL](https://arxiv.org/abs/1901.02860), as the backbone of language representation, and adopts permutation language modeling as its optimization objective. XLNet was also pre-trained on much more data. As a result, it achieved SOTA results on several NLP tasks.
For more details, please refer to the research paper
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
## Installation
This project requires Paddle Fluid version 1.6.0 or later; please follow the [installation guide](https://www.paddlepaddle.org.cn/start) to install it.
## Pre-trained models
Two pre-trained models converted from the official release are available:
| Model | Layers | Hidden size | Heads |
| :------| :------: | :------: |:------: |
| [XLNet-Large, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz)| 24 | 1024 | 16 |
| [XLNet-Base, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz)| 12 | 768 | 12 |
Each compressed package contains one subdirectory and two files:
- `params`: a directory containing all converted parameters, one file per parameter.
- `spiece.model`: a [Sentence Piece](https://github.com/google/sentencepiece) model used for (de)tokenization.
- `xlnet_config.json`: a config file which specifies the hyperparameters of the model.
## Fine-tuning with XLNet
We provide scripts for fine-tuning XLNet on NLP tasks with multiple GPUs. Their correctness has been verified: all experiments on V100 GPUs achieve the same performance as officially reported (mainly on TPUs). In the following, we assume that the two pre-trained models have been downloaded and extracted.
### Text regression/classification
Fine-tuning for regression/classification can be performed via the script `run_classifier.py`, which contains examples for standard single-document classification, single-document regression, and document-pair classification. The two examples below, one for regression and one for classification, proceed as follows.
#### (1) STS-B: sentence pair relevance regression
- Download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to some directory `$GLUE_DIR`
- **Note**: You may encounter the error `ImportError: No module named request` when running the script under Python 2.x; this is because in Python 2 the module `urllib` has no submodule `request`. It can be resolved by replacing all occurrences of `urllib.request` with `urllib`, or by switching to a Python 3.x environment. A compatible import guard is sketched below.
- Perform fine-tuning on 4 V100 GPUs with XLNet-Large
```
export GLUE_DIR=glue_data
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=sts-b \
--data_dir=${GLUE_DIR}/STS-B \
--checkpoints=exp/sts-b \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--init_pretraining_params=${LARGE_DIR}/params \
--max_seq_length=128 \
--train_batch_size=8 \
--learning_rate=5e-5 \
--predict_dir=exp/sts-b-pred \
--skip_steps=10 \
--train_steps=1200 \
--warmup_steps=120 \
--save_steps=600 \
--is_regression=True
```
This configuration does not require much GPU memory; four V100 (or comparable) GPUs with 16GB each are sufficient.
When fine-tuning finishes, the evaluation on the dev set reports the average loss and the Pearson correlation coefficient:
```
[dev evaluation] ave loss: 0.383523, eval_pearsonr: 0.916912, elapsed time: 21.804057 s
```
The expected `eval_pearsonr` is `91.3+`, as quoted from the official repository, and this experiment reproduces that performance.
#### (2) IMDB: movie review sentiment classification
- Download and unpack the IMDB dataset by running
```shell
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
- Perform fine-tuning with XLNet-Large on 8 V100 GPUs (32GB) by running
```shell
export IMDB_DIR=aclImdb
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=imdb \
--checkpoints=exp/imdb \
--init_pretraining_params=${LARGE_DIR}/params \
--data_dir=${IMDB_DIR} \
--predict_dir=predict_imdb_1028 \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--max_seq_length=512 \
--train_batch_size=4 \
--eval_batch_size=8 \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
  --save_steps=500
```
The expected accuracy is `96.2+`; here is an example evaluation result
```
[dev evaluation] ave loss: 0.220047, eval_accuracy: 0.963480, elapsed time: 2799.974465 s
```
Fine-tuning on other NLP regression/classification tasks can be carried out in a similar way.
### SQuAD 2.0
- Download the SQuAD 2.0 data and put it in the `data/squad2.0` directory
```
mkdir -p data/squad2.0
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
- Perform fine-tuning on 6 V100 GPUs (32GB) by running the script `run_squad.py`
```
SQUAD_DIR=data/squad2.0
INIT_CKPT_DIR=xlnet_cased_L-24_H-1024_A-16
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
python run_squad.py \
--model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
--spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
--init_checkpoint=${INIT_CKPT_DIR}/params \
--train_file=${SQUAD_DIR}/train-v2.0.json \
--predict_file=${SQUAD_DIR}/dev-v2.0.json \
--uncased=False \
--checkpoints squad_2.0_0828 \
--max_seq_length=512 \
--do_train=True \
--do_predict=True \
--skip_steps=100 \
--epoch 200 \
--dropout=0.1 \
--dropatt=0.1 \
--train_batch_size=4 \
--predict_batch_size=3 \
--learning_rate=2e-5 \
--save_steps=1000 \
--train_steps=12000 \
--warmup_steps=1000 \
  --verbose=True
```
The final evaluation result after fine-tuning should look like
```
================================================================================
Result | best_f1 88.0893932758 | best_exact_thresh -2.07637166977 | best_exact 85.5049271456 | has_ans_f1 0.940979062625 | has_ans_exact 0.880566801619 | best_f1_thresh -2.07337403297 |
================================================================================
```
### Use your own data
Please refer to the data-format guidelines of GLUE/SQuAD if you want to use your own data for fine-tuning.
## Acknowledgement
We thank the authors of XLNet for their distinguished work!
"""this file is a copy of https://github.com/zihangdai/xlnet"""
import re
import numpy as np
from data_utils import SEP_ID, CLS_ID
SEG_ID_A = 0
SEG_ID_B = 1
SEG_ID_CLS = 2
SEG_ID_SEP = 3
SEG_ID_PAD = 4
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
    batches could cause silent errors.
"""
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenize_fn):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[1] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
if label_list is not None:
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenize_fn(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenize_fn(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for two [SEP] & one [CLS] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for one [SEP] & one [CLS] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[:max_seq_length - 2]
tokens = []
segment_ids = []
for token in tokens_a:
tokens.append(token)
segment_ids.append(SEG_ID_A)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_A)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(SEG_ID_B)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_B)
tokens.append(CLS_ID)
segment_ids.append(SEG_ID_CLS)
input_ids = tokens
# The mask has 0 for real tokens and 1 for padding tokens. Only real
# tokens are attended to.
input_mask = [0] * len(input_ids)
# Zero-pad up to the sequence length.
if len(input_ids) < max_seq_length:
delta_len = max_seq_length - len(input_ids)
input_ids = [0] * delta_len + input_ids
input_mask = [1] * delta_len + input_mask
segment_ids = [SEG_ID_PAD] * delta_len + segment_ids
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
if label_list is not None:
label_id = label_map[example.label]
else:
label_id = example.label
if ex_index < 1:
print("*** Example ***")
print("guid: %s" % (example.guid))
print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
print("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
print("label: {} (id = {})".format(example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
"""this file is a copy of https://github.com/zihangdai/xlnet"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
special_symbols = {
"<unk>": 0,
"<s>": 1,
"</s>": 2,
"<cls>": 3,
"<sep>": 4,
"<pad>": 5,
"<mask>": 6,
"<eod>": 7,
"<eop>": 8,
}
VOCAB_SIZE = 32000
UNK_ID = special_symbols["<unk>"]
CLS_ID = special_symbols["<cls>"]
SEP_ID = special_symbols["<sep>"]
MASK_ID = special_symbols["<mask>"]
EOD_ID = special_symbols["<eod>"]
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import paddle.fluid as fluid
import modeling
from model.xlnet import XLNetModel, _get_initializer
def get_regression_loss(args, xlnet_config, features, is_training=False):
"""Loss for downstream regression tasks."""
inp = fluid.layers.transpose(features["input_ids"], [1, 0, 2])
seg_id = features["segment_ids"]
inp_mask = fluid.layers.transpose(features["input_mask"], [1, 0])
label = features["label_ids"]
xlnet_model = XLNetModel(
input_ids=inp,
seg_ids=seg_id,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
summary = xlnet_model.get_pooled_out(args.summary_type, args.use_summ_proj)
per_example_loss, logits = modeling.regression_loss(
hidden=summary,
labels=label,
initializer=_get_initializer(args),
name="model_regression_{}".format(args.task_name.lower()),
return_logits=True)
total_loss = fluid.layers.reduce_mean(per_example_loss)
return total_loss, per_example_loss, logits
def get_classification_loss(args,
xlnet_config,
features,
n_class,
is_training=True):
"""Loss for downstream classification tasks."""
inp = fluid.layers.transpose(features["input_ids"], [1, 0, 2])
seg_id = features["segment_ids"]
inp_mask = fluid.layers.transpose(features["input_mask"], [1, 0])
label = features["label_ids"]
xlnet_model = XLNetModel(
input_ids=inp,
seg_ids=seg_id,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
summary = xlnet_model.get_pooled_out(args.summary_type, args.use_summ_proj)
per_example_loss, logits = modeling.classification_loss(
hidden=summary,
labels=label,
n_class=n_class,
initializer=xlnet_model.get_initializer(),
name="model_classification_{}".format(args.task_name),
return_logits=True)
total_loss = fluid.layers.reduce_mean(per_example_loss)
return total_loss, per_example_loss, logits
def create_model(args, xlnet_config, n_class, is_training=False):
label_ids_type = 'int64' if n_class else 'float32'
input_fields = {
'names': [
'input_ids', 'input_mask', 'segment_ids', 'label_ids',
'is_real_example'
],
'shapes': [[-1, args.max_seq_length, 1], [-1, args.max_seq_length],
[-1, args.max_seq_length], [-1, 1], [-1, 1]],
        # one dtype / lod level per input field
        'dtypes': ['int64', 'float32', 'int64', label_ids_type, 'int64'],
        'lod_levels': [0, 0, 0, 0, 0],
}
inputs = [
fluid.layers.data(
name=input_fields['names'][i],
shape=input_fields['shapes'][i],
dtype=input_fields['dtypes'][i],
lod_level=input_fields['lod_levels'][i])
for i in range(len(input_fields['names']))
]
(input_ids, input_mask, segment_ids, label_ids, is_real_example) = inputs
data_loader = fluid.io.DataLoader.from_generator(
feed_list=inputs, capacity=50, iterable=False)
features = collections.OrderedDict()
features["input_ids"] = input_ids
features["input_mask"] = input_mask
features["segment_ids"] = segment_ids
features["label_ids"] = label_ids
features["is_real_example"] = is_real_example
if args.is_regression:
(total_loss, per_example_loss, logits) = get_regression_loss(
args, xlnet_config, features, is_training)
else:
(total_loss, per_example_loss, logits) = get_classification_loss(
args, xlnet_config, features, n_class, is_training)
num_seqs = fluid.layers.fill_constant_batch_size_like(
input=label_ids, shape=[-1, 1], value=1, dtype="int64")
num_seqs = fluid.layers.reduce_sum(num_seqs)
return data_loader, total_loss, logits, num_seqs, label_ids
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""XLNet model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import json
import numpy as np
import paddle.fluid as fluid
import modeling
def _get_initializer(args):
if args.init == "uniform":
param_initializer = fluid.initializer.Uniform(
low=-args.init_range, high=args.init_range)
elif args.init == "normal":
param_initializer = fluid.initializer.Normal(scale=args.init_std)
else:
raise ValueError("Initializer {} not supported".format(args.init))
return param_initializer
def init_attn_mask(args, place):
"""create causal attention mask."""
qlen = args.max_seq_length
mlen = 0 if 'mem_len' not in args else args.mem_len
same_length = False if 'same_length' not in args else args.same_length
dtype = 'float16' if args.use_fp16 else 'float32'
attn_mask = np.ones([qlen, qlen], dtype=dtype)
mask_u = np.triu(attn_mask)
mask_dia = np.diag(np.diag(attn_mask))
attn_mask_pad = np.zeros([qlen, mlen], dtype=dtype)
attn_mask = np.concatenate([attn_mask_pad, mask_u - mask_dia], 1)
if same_length:
mask_l = np.tril(attn_mask)
attn_mask = np.concatenate(
            [attn_mask[:, :qlen] + mask_l - mask_dia, attn_mask[:, qlen:]], 1)
attn_mask = attn_mask[:, :, None, None]
attn_mask_t = fluid.global_scope().find_var("attn_mask").get_tensor()
attn_mask_t.set(attn_mask, place)
class XLNetConfig(object):
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing xlnet model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
    def has_key(self, key):
        return key in self._config_dict
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class XLNetModel(object):
def __init__(self,
xlnet_config,
input_ids,
seg_ids,
input_mask,
args,
mems=None,
perm_mask=None,
target_mapping=None,
inp_q=None):
self._tie_weight = True
self._d_head = xlnet_config['d_head']
self._d_inner = xlnet_config['d_inner']
self._d_model = xlnet_config['d_model']
self._ff_activation = xlnet_config['ff_activation']
self._n_head = xlnet_config['n_head']
self._n_layer = xlnet_config['n_layer']
self._n_token = xlnet_config['n_token']
self._untie_r = xlnet_config['untie_r']
self._xlnet_config = xlnet_config
self._dropout = args.dropout
self._dropatt = args.dropatt
self._mem_len = None if 'mem_len' not in args else args.mem_len
self._reuse_len = None if 'reuse_len' not in args else args.reuse_len
self._bi_data = False if 'bi_data' not in args else args.bi_data
self._clamp_len = args.clamp_len
self._same_length = False if 'same_length' not in args else args.same_length
# Initialize all weigths by the specified initializer, and all biases
# will be initialized by constant zero by default.
self._param_initializer = _get_initializer(args)
self.input_mask = input_mask
tfm_args = dict(
n_token=self._n_token,
initializer=self._param_initializer,
attn_type="bi",
n_layer=self._n_layer,
d_model=self._d_model,
n_head=self._n_head,
d_head=self._d_head,
d_inner=self._d_inner,
ff_activation=self._ff_activation,
untie_r=self._untie_r,
use_bfloat16=False,
dropout=self._dropout,
dropatt=self._dropatt,
mem_len=self._mem_len,
reuse_len=self._reuse_len,
bi_data=self._bi_data,
clamp_len=self._clamp_len,
same_length=self._same_length,
name='model_transformer')
input_args = dict(
inp_k=input_ids,
seg_id=seg_ids,
input_mask=input_mask,
mems=mems,
perm_mask=perm_mask,
target_mapping=target_mapping,
inp_q=inp_q)
tfm_args.update(input_args)
self.output, self.new_mems, self.lookup_table = modeling.transformer_xl(
**tfm_args)
#self._build_model(input_ids, sentence_ids, input_mask)
def get_initializer(self):
return self._param_initializer
def get_debug_ret(self):
return self.debug_ret
def get_sequence_output(self):
return self.output
def get_pooled_out(self, summary_type, use_summ_proj=True):
"""
Args:
summary_type: str, "last", "first", "mean", or "attn". The method
to pool the input to get a vector representation.
use_summ_proj: bool, whether to use a linear projection during pooling.
Returns:
float32 Tensor in shape [bsz, d_model], the pooled representation.
"""
summary = modeling.summarize_sequence(
summary_type=summary_type,
hidden=self.output,
d_model=self._d_model,
n_head=self._n_head,
d_head=self._d_head,
dropout=self._dropout,
dropatt=self._dropatt,
input_mask=self.input_mask,
initializer=self._param_initializer,
use_proj=use_summ_proj,
name='model_sequnece_summary')
return summary
import re
import numpy as np
import paddle.fluid as fluid
import collections
def log_softmax(logits, axis=-1):
logsoftmax = logits - fluid.layers.log(
fluid.layers.reduce_sum(fluid.layers.exp(logits), axis))
return logsoftmax
def einsum4x4(equation, x, y):
idx_x, idx_y, idx_z = re.split(",|->", equation)
repeated_idx = list(set(idx_x + idx_y) - set(idx_z))
unique_idx_x = list(set(idx_x) - set(idx_y))
unique_idx_y = list(set(idx_y) - set(idx_x))
common_idx = list(set(idx_x) & set(idx_y) - set(repeated_idx))
new_idx_x = common_idx + unique_idx_x + repeated_idx
new_idx_y = common_idx + unique_idx_y + repeated_idx
new_idx_z = common_idx + unique_idx_x + unique_idx_y
perm_x = [idx_x.index(i) for i in new_idx_x]
perm_y = [idx_y.index(i) for i in new_idx_y]
perm_z = [new_idx_z.index(i) for i in idx_z]
x = fluid.layers.transpose(x, perm=perm_x)
y = fluid.layers.transpose(y, perm=perm_y)
z = fluid.layers.matmul(x=x, y=y, transpose_y=True)
z = fluid.layers.transpose(z, perm=perm_z)
return z
def positional_embedding(pos_seq, inv_freq, bsz=None):
pos_seq = fluid.layers.reshape(pos_seq, [-1, 1])
inv_freq = fluid.layers.reshape(inv_freq, [1, -1])
sinusoid_inp = fluid.layers.matmul(pos_seq, inv_freq)
pos_emb = fluid.layers.concat(
input=[fluid.layers.sin(sinusoid_inp), fluid.layers.cos(sinusoid_inp)],
axis=-1)
pos_emb = fluid.layers.unsqueeze(pos_emb, [1])
if bsz is not None:
pos_emb = fluid.layers.expand(pos_emb, [1, bsz, 1])
return pos_emb
def positionwise_ffn(inp,
d_model,
d_inner,
dropout_prob,
param_initializer=None,
act_type='relu',
name='ff'):
"""Position-wise Feed-forward Network."""
if act_type not in ['relu', 'gelu']:
raise ValueError('Unsupported activation type {}'.format(act_type))
output = fluid.layers.fc(input=inp,
size=d_inner,
act=act_type,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_layer_1_weight',
initializer=param_initializer),
bias_attr=name + '_layer_1_bias')
output = fluid.layers.dropout(
output,
dropout_prob=dropout_prob,
dropout_implementation="upscale_in_train",
is_test=False)
output = fluid.layers.fc(output,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_layer_2_weight',
initializer=param_initializer),
bias_attr=name + '_layer_2_bias')
output = fluid.layers.dropout(
output,
dropout_prob=dropout_prob,
dropout_implementation="upscale_in_train",
is_test=False)
output = fluid.layers.layer_norm(
output + inp,
begin_norm_axis=len(output.shape) - 1,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
return output
def head_projection(h, d_model, n_head, d_head, param_initializer, name=''):
"""Project hidden states to a specific head with a 4D-shape."""
proj_weight = fluid.layers.create_parameter(
shape=[d_model, n_head, d_head],
dtype=h.dtype,
attr=fluid.ParamAttr(
name=name + '_weight', initializer=param_initializer),
is_bias=False)
# ibh,hnd->ibnd
head = fluid.layers.mul(x=h,
y=proj_weight,
x_num_col_dims=2,
y_num_col_dims=1)
return head
def post_attention(h,
attn_vec,
d_model,
n_head,
d_head,
dropout,
param_initializer,
residual=True,
name=''):
"""Post-attention processing."""
# post-attention projection (back to `d_model`)
proj_o = fluid.layers.create_parameter(
shape=[d_model, n_head, d_head],
dtype=h.dtype,
attr=fluid.ParamAttr(
name=name + '_o_weight', initializer=param_initializer),
is_bias=False)
# ibnd,hnd->ibh
proj_o = fluid.layers.transpose(proj_o, perm=[1, 2, 0])
attn_out = fluid.layers.mul(x=attn_vec,
y=proj_o,
x_num_col_dims=2,
y_num_col_dims=2)
attn_out = fluid.layers.dropout(
attn_out,
dropout_prob=dropout,
dropout_implementation="upscale_in_train",
is_test=False)
if residual:
output = fluid.layers.layer_norm(
attn_out + h,
begin_norm_axis=len(attn_out.shape) - 1,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
else:
output = fluid.layers.layer_norm(
attn_out,
begin_norm_axis=len(attn_out.shape) - 1,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
return output
def abs_attn_core(q_head, k_head, v_head, attn_mask, dropatt, scale):
"""Core absolute positional attention operations."""
attn_score = einsum4x4('ibnd,jbnd->ijbn', q_head, k_head)
attn_score *= scale
if attn_mask is not None:
attn_score = attn_score - 1e30 * attn_mask
# attention probability
attn_prob = fluid.layers.softmax(attn_score, axis=1)
attn_prob = fluid.layers.dropout(
attn_prob,
dropout_prob=dropatt,
dropout_implementation="upscale_in_train",
is_test=False)
# attention output
attn_vec = einsum4x4('ijbn,jbnd->ibnd', attn_prob, v_head)
return attn_vec
def rel_attn_core(q_head, k_head_h, v_head_h, k_head_r, seg_embed, seg_mat,
r_w_bias, r_r_bias, r_s_bias, attn_mask, dropatt, scale,
name):
"""Core relative positional attention operations."""
## content based attention score
ac = einsum4x4('ibnd,jbnd->ijbn',
fluid.layers.elementwise_add(q_head, r_w_bias, 2), k_head_h)
# position based attention score
bd = einsum4x4('ibnd,jbnd->ijbn',
fluid.layers.elementwise_add(q_head, r_r_bias, 2), k_head_r)
bd = rel_shift(bd, klen=ac.shape[1])
# segment based attention score
if seg_mat is None:
ef = 0
else:
seg_embed = fluid.layers.stack([seg_embed] * q_head.shape[0], axis=0)
ef = einsum4x4('ibnd,isnd->ibns',
fluid.layers.elementwise_add(q_head, r_s_bias, 2),
seg_embed)
ef = einsum4x4('ijbs,ibns->ijbn', seg_mat, ef)
attn_score = (ac + bd + ef) * scale
if attn_mask is not None:
# attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask
attn_score = attn_score - 1e30 * attn_mask
# attention probability
attn_prob = fluid.layers.softmax(attn_score, axis=1)
attn_prob = fluid.layers.dropout(
attn_prob, dropatt, dropout_implementation="upscale_in_train")
# attention output
attn_vec = einsum4x4('ijbn,jbnd->ibnd', attn_prob, v_head_h)
return attn_vec
def rel_shift(x, klen=-1):
"""perform relative shift to form the relative attention score."""
x_size = x.shape
x = fluid.layers.reshape(x, [x_size[1], x_size[0], x_size[2], x_size[3]])
x = fluid.layers.slice(x, axes=[0], starts=[1], ends=[x_size[1]])
x = fluid.layers.reshape(x,
[x_size[0], x_size[1] - 1, x_size[2], x_size[3]])
x = fluid.layers.slice(x, axes=[1], starts=[0], ends=[klen])
return x
def _cache_mem(curr_out, prev_mem, mem_len, reuse_len=None):
"""cache hidden states into memory."""
if mem_len is None or mem_len == 0:
return None
else:
if reuse_len is not None and reuse_len > 0:
curr_out = curr_out[:reuse_len]
if prev_mem is None:
new_mem = curr_out[-mem_len:]
else:
            new_mem = fluid.layers.concat([prev_mem, curr_out], 0)[-mem_len:]
new_mem.stop_gradient = True
return new_mem
def relative_positional_encoding(qlen,
klen,
d_model,
clamp_len,
attn_type,
bi_data,
bsz=None,
dtype=None):
"""create relative positional encoding."""
freq_seq = fluid.layers.range(0, d_model, 2.0, 'float32')
if dtype is not None and dtype != 'float32':
        freq_seq = fluid.layers.cast(freq_seq, dtype=dtype)
inv_freq = 1 / (10000**(freq_seq / d_model))
if attn_type == 'bi':
beg, end = klen, -qlen
elif attn_type == 'uni':
beg, end = klen, -1
else:
raise ValueError('Unknown `attn_type` {}.'.format(attn_type))
if bi_data:
fwd_pos_seq = fluid.layers.range(beg, end, -1.0, 'float32')
bwd_pos_seq = fluid.layers.range(-beg, -end, 1.0, 'float32')
if dtype is not None and dtype != 'float32':
fwd_pos_seq = fluid.layers.cast(fwd_pos_seq, dtype='float32')
bwd_pos_seq = fluid.layers.cast(bwd_pos_seq, dtype='float32')
if clamp_len > 0:
fwd_pos_seq = fluid.layers.clip(fwd_pos_seq, -clamp_len, clamp_len)
bwd_pos_seq = fluid.layers.clip(bwd_pos_seq, -clamp_len, clamp_len)
if bsz is not None:
# With bi_data, the batch size should be divisible by 2.
assert bsz % 2 == 0
fwd_pos_emb = positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)
bwd_pos_emb = positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)
else:
fwd_pos_emb = positional_embedding(fwd_pos_seq, inv_freq)
bwd_pos_emb = positional_embedding(bwd_pos_seq, inv_freq)
pos_emb = fluid.layers.concat([fwd_pos_emb, bwd_pos_emb], axis=1)
else:
fwd_pos_seq = fluid.layers.range(beg, end, -1.0, 'float32')
if dtype is not None and dtype != 'float32':
fwd_pos_seq = fluid.layers.cast(fwd_pos_seq, dtype=dtype)
if clamp_len > 0:
fwd_pos_seq = fluid.layers.clip(fwd_pos_seq, -clamp_len, clamp_len)
pos_emb = positional_embedding(fwd_pos_seq, inv_freq, bsz)
fluid.layers.reshape(pos_emb, [2 * qlen, -1, d_model], inplace=True)
return pos_emb
def multihead_attn(q,
k,
v,
attn_mask,
d_model,
n_head,
d_head,
dropout,
dropatt,
is_training,
kernel_initializer,
                   residual=True,
                   name='abs_attn'):
    """Standard multi-head attention with absolute positional embedding.

    `is_training` is kept for signature parity with the TF original; dropout
    in the helpers below always runs in upscale_in_train mode.
    """
    scale = 1 / (d_head**0.5)
    # attention heads
    q_head = head_projection(q, d_model, n_head, d_head, kernel_initializer,
                             name + '_q')
    k_head = head_projection(k, d_model, n_head, d_head, kernel_initializer,
                             name + '_k')
    v_head = head_projection(v, d_model, n_head, d_head, kernel_initializer,
                             name + '_v')
    # attention vector
    attn_vec = abs_attn_core(q_head, k_head, v_head, attn_mask, dropatt,
                             scale)
    # post-attention projection and residual/layer norm
    output = post_attention(v, attn_vec, d_model, n_head, d_head, dropout,
                            kernel_initializer, residual, name=name)
    return output
def rel_multihead_attn(h,
r,
r_w_bias,
r_r_bias,
seg_mat,
r_s_bias,
seg_embed,
attn_mask,
mems,
d_model,
n_head,
d_head,
dropout,
dropatt,
initializer,
name=''):
"""Multi-head attention with relative positional encoding."""
scale = 1 / (d_head**0.5)
if mems is not None and len(mems.shape) > 1:
cat = fluid.layers.concat([mems, h], 0)
else:
cat = h
# content heads
q_head_h = head_projection(
h, d_model, n_head, d_head, initializer, name=name + '_rel_attn_q')
k_head_h = head_projection(
cat, d_model, n_head, d_head, initializer, name=name + '_rel_attn_k')
v_head_h = head_projection(
cat, d_model, n_head, d_head, initializer, name=name + '_rel_attn_v')
# positional heads
k_head_r = head_projection(
r, d_model, n_head, d_head, initializer, name=name + '_rel_attn_r')
# core attention ops
attn_vec = rel_attn_core(q_head_h, k_head_h, v_head_h, k_head_r, seg_embed,
seg_mat, r_w_bias, r_r_bias, r_s_bias, attn_mask,
dropatt, scale, name)
# post processing
output = post_attention(
h,
attn_vec,
d_model,
n_head,
d_head,
dropout,
initializer,
name=name + '_rel_attn')
return output
def transformer_xl(inp_k,
n_token,
n_layer,
d_model,
n_head,
d_head,
d_inner,
dropout,
dropatt,
attn_type,
bi_data,
initializer,
mem_len=None,
inp_q=None,
mems=None,
same_length=False,
clamp_len=-1,
untie_r=False,
input_mask=None,
perm_mask=None,
seg_id=None,
reuse_len=None,
ff_activation='relu',
target_mapping=None,
use_fp16=False,
name='',
**kwargs):
"""
Defines a Transformer-XL computation graph with additional
support for XLNet.
Args:
inp_k: int64 Tensor in shape [len, bsz], the input token IDs.
seg_id: int64 Tensor in shape [len, bsz], the input segment IDs.
input_mask: float32 Tensor in shape [len, bsz], the input mask.
0 for real tokens and 1 for padding.
mems: a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
from previous batches. The length of the list equals n_layer.
If None, no memory is used.
perm_mask: float32 Tensor in shape [len, len, bsz].
        If perm_mask[i, j, k] = 0, i attends to j in batch k;
if perm_mask[i, j, k] = 1, i does not attend to j in batch k.
If None, each position attends to all the others.
target_mapping: float32 Tensor in shape [num_predict, len, bsz].
        If target_mapping[i, j, k] = 1, the i-th prediction in batch k is
on the j-th token.
Only used during pretraining for partial prediction.
Set to None during finetuning.
inp_q: float32 Tensor in shape [len, bsz].
1 for tokens with losses and 0 for tokens without losses.
Only used during pretraining for two-stream attention.
Set to None during finetuning.
n_layer: int, the number of layers.
d_model: int, the hidden size.
n_head: int, the number of attention heads.
d_head: int, the dimension size of each attention head.
d_inner: int, the hidden size in feed-forward layers.
ff_activation: str, "relu" or "gelu".
untie_r: bool, whether to untie the biases in attention.
n_token: int, the vocab size.
is_training: bool, whether in training mode.
use_tpu: bool, whether TPUs are used.
      use_fp16: bool, whether to use float16 instead of float32.
dropout: float, dropout rate.
dropatt: float, dropout rate on attention probabilities.
init: str, the initialization scheme, either "normal" or "uniform".
init_range: float, initialize the parameters with a uniform distribution
in [-init_range, init_range]. Only effective when init="uniform".
init_std: float, initialize the parameters with a normal distribution
with mean 0 and stddev init_std. Only effective when init="normal".
mem_len: int, the number of tokens to cache.
      reuse_len: int, the number of tokens in the current batch to be cached
and reused in the future.
bi_data: bool, whether to use bidirectional input pipeline.
Usually set to True during pretraining and False during finetuning.
clamp_len: int, clamp all relative distances larger than clamp_len.
-1 means no clamping.
same_length: bool, whether to use the same attention length for each token.
summary_type: str, "last", "first", "mean", or "attn". The method
to pool the input to get a vector representation.
"""
print('memory input {}'.format(mems))
data_type = "float16" if use_fp16 else "float32"
print('Use float type {}'.format(data_type))
qlen = inp_k.shape[0]
mlen = mems[0].shape[0] if mems is not None else 0
klen = mlen + qlen
bsz = fluid.layers.slice(
fluid.layers.shape(inp_k), axes=[0], starts=[1], ends=[2])
##### Attention mask
# causal attention mask
if attn_type == 'uni':
attn_mask = fluid.layers.create_global_var(
name='attn_mask',
shape=[qlen, klen, 1, 1],
value=0.0,
dtype=data_type,
persistable=True)
elif attn_type == 'bi':
attn_mask = None
else:
raise ValueError('Unsupported attention type: {}'.format(attn_type))
# data mask: input mask & perm mask
if input_mask is not None and perm_mask is not None:
data_mask = fluid.layers.unsqueeze(input_mask, [0]) + perm_mask
elif input_mask is not None and perm_mask is None:
data_mask = fluid.layers.unsqueeze(input_mask, [0])
elif input_mask is None and perm_mask is not None:
data_mask = perm_mask
else:
data_mask = None
if data_mask is not None:
# all mems can be attended to
mems_mask = fluid.layers.zeros(
shape=[data_mask.shape[0], mlen, 1], dtype='float32')
mems_mask = fluid.layers.expand(mems_mask, [1, 1, bsz])
data_mask = fluid.layers.concat([mems_mask, data_mask], 1)
if attn_mask is None:
attn_mask = fluid.layers.unsqueeze(data_mask, [-1])
else:
attn_mask += fluid.layers.unsqueeze(data_mask, [-1])
if attn_mask is not None:
attn_mask = fluid.layers.cast(attn_mask > 0, dtype=data_type)
if attn_mask is not None:
non_tgt_mask = fluid.layers.diag(
np.array([-1] * qlen).astype(data_type))
non_tgt_mask = fluid.layers.concat(
[fluid.layers.zeros(
[qlen, mlen], dtype=data_type), non_tgt_mask],
axis=-1)
attn_mask = fluid.layers.expand(attn_mask, [qlen, 1, 1, 1])
non_tgt_mask = fluid.layers.unsqueeze(non_tgt_mask, axes=[2, 3])
non_tgt_mask = fluid.layers.expand(non_tgt_mask, [1, 1, bsz, 1])
non_tgt_mask = fluid.layers.cast(
(attn_mask + non_tgt_mask) > 0, dtype=data_type)
non_tgt_mask.stop_gradient = True
else:
non_tgt_mask = None
if untie_r:
r_w_bias = fluid.layers.create_parameter(
shape=[n_layer, n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_w_bias', initializer=initializer),
is_bias=True)
r_w_bias = [
fluid.layers.slice(
r_w_bias, axes=[0], starts=[i], ends=[i + 1])
for i in range(n_layer)
]
r_w_bias = [
fluid.layers.squeeze(
r_w_bias[i], axes=[0]) for i in range(n_layer)
]
r_r_bias = fluid.layers.create_parameter(
shape=[n_layer, n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_r_bias', initializer=initializer),
is_bias=True)
r_r_bias = [
fluid.layers.slice(
r_r_bias, axes=[0], starts=[i], ends=[i + 1])
for i in range(n_layer)
]
r_r_bias = [
fluid.layers.squeeze(
r_r_bias[i], axes=[0]) for i in range(n_layer)
]
else:
r_w_bias = fluid.layers.create_parameter(
shape=[n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_w_bias', initializer=initializer),
is_bias=True)
r_r_bias = fluid.layers.create_parameter(
shape=[n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_r_bias', initializer=initializer),
is_bias=True)
lookup_table = fluid.layers.create_parameter(
shape=[n_token, d_model],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_word_embedding', initializer=initializer),
is_bias=False)
word_emb_k = fluid.layers.embedding(
input=inp_k,
size=[n_token, d_model],
dtype=data_type,
param_attr=fluid.ParamAttr(
name=name + '_word_embedding', initializer=initializer))
if inp_q is not None:
pass
output_h = fluid.layers.dropout(
word_emb_k,
dropout_prob=dropout,
dropout_implementation="upscale_in_train")
if inp_q is not None:
pass
if seg_id is not None:
if untie_r:
r_s_bias = fluid.layers.create_parameter(
shape=[n_layer, n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_s_bias', initializer=initializer),
is_bias=True)
r_s_bias = [
fluid.layers.slice(
r_s_bias, axes=[0], starts=[i], ends=[i + 1])
for i in range(n_layer)
]
r_s_bias = [
fluid.layers.squeeze(
r_s_bias[i], axes=[0]) for i in range(n_layer)
]
else:
r_s_bias = fluid.layers.create_parameter(
shape=[n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_s_bias', initializer=initializer),
is_bias=True)
seg_embed = fluid.layers.create_parameter(
shape=[n_layer, 2, n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_seg_embed', initializer=initializer))
seg_embed = [
fluid.layers.slice(
seg_embed, axes=[0], starts=[i], ends=[i + 1])
for i in range(n_layer)
]
seg_embed = [
fluid.layers.squeeze(
seg_embed[i], axes=[0]) for i in range(n_layer)
]
        # Convert `seg_id` to a one-hot `seg_mat`
# seg_id: [bsz, qlen, 1]
mem_pad = fluid.layers.fill_constant_batch_size_like(
input=seg_id, shape=[-1, mlen], value=0, dtype='int64')
# cat_ids: [bsz, klen, 1]
cat_ids = fluid.layers.concat(input=[mem_pad, seg_id], axis=1)
seg_id = fluid.layers.stack([seg_id] * klen, axis=2)
cat_ids = fluid.layers.stack([cat_ids] * qlen, axis=2)
cat_ids = fluid.layers.transpose(cat_ids, perm=[0, 2, 1])
# seg_mat: [bsz, qlen, klen]
seg_mat = fluid.layers.cast(
fluid.layers.logical_not(fluid.layers.equal(seg_id, cat_ids)),
dtype='int64')
seg_mat = fluid.layers.transpose(seg_mat, perm=[1, 2, 0])
seg_mat = fluid.layers.unsqueeze(seg_mat, [-1])
seg_mat = fluid.layers.one_hot(seg_mat, 2)
seg_mat.stop_gradient = True
else:
seg_mat = None
pos_emb = relative_positional_encoding(
qlen,
klen,
d_model,
clamp_len,
attn_type,
bi_data,
bsz=bsz,
dtype=data_type)
pos_emb = fluid.layers.dropout(
pos_emb, dropout, dropout_implementation="upscale_in_train")
pos_emb.stop_gradient = True
##### Attention layers
if mems is None:
mems = [None] * n_layer
for i in range(n_layer):
# cache new mems
#new_mems.append(_cache_mem(output_h, mems[i], mem_len, reuse_len))
# segment bias
if seg_id is None:
r_s_bias_i = None
seg_embed_i = None
else:
r_s_bias_i = r_s_bias if not untie_r else r_s_bias[i]
seg_embed_i = seg_embed[i]
if inp_q is not None:
pass
else:
output_h = rel_multihead_attn(
h=output_h,
r=pos_emb,
r_w_bias=r_w_bias if not untie_r else r_w_bias[i],
r_r_bias=r_r_bias if not untie_r else r_r_bias[i],
seg_mat=seg_mat,
r_s_bias=r_s_bias_i,
seg_embed=seg_embed_i,
attn_mask=non_tgt_mask,
mems=mems[i],
d_model=d_model,
n_head=n_head,
d_head=d_head,
dropout=dropout,
dropatt=dropatt,
initializer=initializer,
name=name + '_layer_{}'.format(i))
if inp_q is not None:
pass
output_h = positionwise_ffn(
inp=output_h,
d_model=d_model,
d_inner=d_inner,
dropout_prob=dropout,
param_initializer=initializer,
act_type=ff_activation,
name=name + '_layer_{}_ff'.format(i))
    # `inp_q` (the query stream for two-stream pretraining) is not ported in
    # this fine-tuning implementation, so only the content stream is output.
    output = fluid.layers.dropout(
        output_h, dropout, dropout_implementation="upscale_in_train")
new_mems = None
return output, new_mems, lookup_table
def lm_loss(hidden,
target,
n_token,
d_model,
initializer,
lookup_table=None,
tie_weight=False,
bi_data=True):
if tie_weight:
assert lookup_table is not None, \
'lookup_table cannot be None for tie_weight'
softmax_w = lookup_table
else:
softmax_w = fluid.layers.create_parameter(
shape=[n_token, d_model],
dtype=hidden.dtype,
attr=fluid.ParamAttr(
name='model_loss_weight', initializer=initializer),
is_bias=False)
softmax_b = fluid.layers.create_parameter(
shape=[n_token],
dtype=hidden.dtype,
attr=fluid.ParamAttr(
name='model_lm_loss_bias', initializer=initializer),
is_bias=False)
logits = fluid.layers.matmul(
x=hidden, y=softmax_w, transpose_y=True) + softmax_b
    loss = fluid.layers.softmax_with_cross_entropy(
        logits=logits, label=target)
return loss
def summarize_sequence(summary_type,
hidden,
d_model,
n_head,
d_head,
dropout,
dropatt,
input_mask,
initializer,
scope=None,
reuse=None,
use_proj=True,
name=''):
"""
    Different classification tasks may or may not share the same parameters
to summarize the sequence features.
If shared, one can keep the `scope` to the default value `None`.
Otherwise, one should specify a different `scope` for each task.
"""
if summary_type == 'last':
summary = hidden[-1]
elif summary_type == 'first':
summary = hidden[0]
elif summary_type == 'mean':
summary = fluid.layers.reduce_mean(hidden, axis=0)
    elif summary_type == 'attn':
        # The attention-pooling summary from the original TF implementation
        # (a learned query attending over `hidden` via multihead_attn) has
        # not been ported to Paddle Fluid.
        raise NotImplementedError('summary_type "attn" is not supported yet')
else:
raise ValueError('Unsupported summary type {}'.format(summary_type))
# use another projection as in BERT
if use_proj:
summary = fluid.layers.fc(input=summary,
size=d_model,
act='tanh',
param_attr=fluid.ParamAttr(
name=name + '_summary_weight',
initializer=initializer),
bias_attr=name + '_summary_bias')
summary = fluid.layers.dropout(
summary,
dropout_prob=dropout,
dropout_implementation="upscale_in_train")
return summary
def classification_loss(hidden,
labels,
n_class,
initializer,
name,
reuse=None,
return_logits=False):
"""
Different classification tasks should use different parameter names to ensure
different dense layers (parameters) are used to produce the logits.
An exception will be in transfer learning, where one hopes to transfer
the classification weights.
"""
logits = fluid.layers.fc(input=hidden,
size=n_class,
param_attr=fluid.ParamAttr(
name=name + '_logit_weight',
initializer=initializer),
bias_attr=name + '_logit_bias')
one_hot_target = fluid.layers.one_hot(labels, depth=n_class)
loss = -1.0 * fluid.layers.reduce_sum(
log_softmax(logits) * one_hot_target, dim=-1)
if return_logits:
return loss, logits
return loss
def regression_loss(hidden, labels, initializer, name, return_logits=False):
logits = fluid.layers.fc(input=hidden,
size=1,
param_attr=fluid.ParamAttr(
name=name + '_logits_weight',
initializer=initializer),
bias_attr=name + '_logits_bias')
loss = fluid.layers.square(logits - labels)
if return_logits:
return loss, logits
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import numpy as np
import paddle.fluid as fluid
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
lr_layer_decay_rate=1.0,
scheduler='linear_warmup_decay'):
scheduled_lr = None
if scheduler == 'noam_decay':
        if warmup_steps > 0:
            scheduled_lr = fluid.layers.learning_rate_scheduler.noam_decay(
                1 / (warmup_steps * (learning_rate**2)), warmup_steps)
        else:
            print(
                "WARNING: noam decay should have positive warmup steps, using "
                "constant learning rate instead!")
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
if lr_layer_decay_rate != 1.0:
n_layer = 0
for param in fluid.default_main_program().block(0).all_parameters():
m = re.search(r"model_transformer_layer_(\d+?)_", param.name)
if not m: continue
n_layer = max(n_layer, int(m.group(1)) + 1)
for param in fluid.default_main_program().block(0).all_parameters():
for l in range(n_layer):
if "model_transformer_layer_{}_".format(l) in param.name:
param.optimize_attr[
'learning_rate'] = lr_layer_decay_rate**(
n_layer - 1 - l)
print("Apply lr decay {:.4f} to layer-{} grad of {}".format(
param.optimize_attr['learning_rate'], l, param.name))
break
def exclude_from_weight_decay(param):
        # strip the fp16 master-weight suffix, if present (str.rstrip would
        # strip a character set, not the suffix)
        name = re.sub(r"\.master$", "", param.name)
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
param_list = dict()
if weight_decay > 0:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr
# coding=utf-8
"""this file is a copy of https://github.com/zihangdai/xlnet"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import unicodedata
import six
from functools import partial
SPIECE_UNDERLINE = '▁'
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, unicode):
return text.encode("utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def print_(*args):
new_args = []
for arg in args:
if isinstance(arg, list):
s = [printable_text(i) for i in arg]
s = ' '.join(s)
new_args.append(s)
else:
new_args.append(printable_text(arg))
print(*new_args)
def preprocess_text(inputs, lower=False, remove_space=True, keep_accents=False):
if remove_space:
outputs = ' '.join(inputs.strip().split())
else:
outputs = inputs
outputs = outputs.replace("``", '"').replace("''", '"')
if six.PY2 and isinstance(outputs, str):
outputs = outputs.decode('utf-8')
if not keep_accents:
outputs = unicodedata.normalize('NFKD', outputs)
outputs = ''.join([c for c in outputs if not unicodedata.combining(c)])
if lower:
outputs = outputs.lower()
return outputs
def encode_pieces(sp_model, text, return_unicode=True, sample=False):
# return_unicode is used only for py2
# note(zhiliny): in some systems, sentencepiece only accepts str for py2
if six.PY2 and isinstance(text, unicode):
text = text.encode('utf-8')
if not sample:
pieces = sp_model.EncodeAsPieces(text)
else:
pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
new_pieces = []
for piece in pieces:
if len(piece) > 1 and piece[-1] == ',' and piece[-2].isdigit():
cur_pieces = sp_model.EncodeAsPieces(piece[:-1].replace(
SPIECE_UNDERLINE, ''))
if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][
0] == SPIECE_UNDERLINE:
if len(cur_pieces[0]) == 1:
cur_pieces = cur_pieces[1:]
else:
cur_pieces[0] = cur_pieces[0][1:]
cur_pieces.append(piece[-1])
new_pieces.extend(cur_pieces)
else:
new_pieces.append(piece)
# note(zhiliny): convert back to unicode for py2
if six.PY2 and return_unicode:
ret_pieces = []
for piece in new_pieces:
if isinstance(piece, str):
piece = piece.decode('utf-8')
ret_pieces.append(piece)
new_pieces = ret_pieces
return new_pieces
def encode_ids(sp_model, text, sample=False):
pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
ids = [sp_model.PieceToId(piece) for piece in pieces]
return ids
if __name__ == '__main__':
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('sp10m.uncased.v3.model')
print_(u'I was born in 2000, and this is falsé.')
print_(u'ORIGINAL',
sp.EncodeAsPieces(u'I was born in 2000, and this is falsé.'))
print_(u'OURS',
encode_pieces(sp, u'I was born in 2000, and this is falsé.'))
print(encode_ids(sp, u'I was born in 2000, and this is falsé.'))
print_('')
prepro_func = partial(preprocess_text, lower=True)
print_(prepro_func('I was born in 2000, and this is falsé.'))
print_('ORIGINAL',
sp.EncodeAsPieces(
prepro_func('I was born in 2000, and this is falsé.')))
print_('OURS',
encode_pieces(sp,
prepro_func('I was born in 2000, and this is falsé.')))
print(encode_ids(sp, prepro_func('I was born in 2000, and this is falsé.')))
print_('')
print_('I was born in 2000, and this is falsé.')
print_('ORIGINAL',
sp.EncodeAsPieces('I was born in 2000, and this is falsé.'))
print_('OURS', encode_pieces(sp, 'I was born in 2000, and this is falsé.'))
print(encode_ids(sp, 'I was born in 2000, and this is falsé.'))
print_('')
print_('I was born in 92000, and this is falsé.')
print_('ORIGINAL',
sp.EncodeAsPieces('I was born in 92000, and this is falsé.'))
print_('OURS', encode_pieces(sp, 'I was born in 92000, and this is falsé.'))
print(encode_ids(sp, 'I was born in 92000, and this is falsé.'))
"""this file is adapted from https://github.com/zihangdai/xlnet"""
import io
import os
import types
import csv
import numpy as np
import sentencepiece as spm
from classifier_utils import PaddingInputExample
from classifier_utils import convert_single_example
from prepro_utils import preprocess_text, encode_ids
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def __init__(self, args):
self.data_dir = args.data_dir
self.max_seq_length = args.max_seq_length
self.uncased = args.uncased
np.random.seed(args.random_seed)
sp = spm.SentencePieceProcessor()
sp.Load(args.spiece_model_file)
def tokenize_fn(text):
text = preprocess_text(text, lower=self.uncased)
return encode_ids(sp, text)
self.tokenize_fn = tokenize_fn
self.current_train_example = -1
self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
self.current_train_epoch = -1
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
def convert_example(self, index, example, labels, max_seq_length,
tokenize_fn):
"""Converts a single `InputExample` into a single `InputFeatures`."""
feature = convert_single_example(index, example, labels, max_seq_length,
tokenize_fn)
return feature
def generate_instance(self, feature):
"""
generate instance with given feature
Args:
feature: InputFeatures(object). A single set of features of data.
"""
        return [
            feature.input_ids, feature.input_mask, feature.segment_ids,
            feature.label_id, feature.is_real_example
        ]
def prepare_batch_data(self, batch_data, is_regression):
"""Generate numpy tensors"""
input_ids = np.expand_dims(
np.array([inst[0] for inst in batch_data]).astype('int64'), axis=-1)
input_mask = np.array(
[inst[1] for inst in batch_data]).astype('float32')
segment_ids = np.array([inst[2] for inst in batch_data]).astype('int64')
labels = np.expand_dims(
np.array([inst[3] for inst in batch_data]).astype(
'int64' if not is_regression else 'float32'),
axis=-1)
is_real_example = np.array(
[inst[4] for inst in batch_data]).astype('int64')
return [input_ids, input_mask, segment_ids, labels, is_real_example]
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with io.open(input_file, "r", encoding="utf8") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
if len(line) == 0: continue
lines.append(line)
return lines
def get_num_examples(self, phase):
"""Get number of examples for train, dev or test."""
if phase not in ['train', 'dev', 'test']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
return self.num_examples[phase]
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
def data_generator(self,
batch_size,
is_regression,
phase='train',
epoch=1,
dev_count=1,
shuffle=True):
"""
Generate data for train, dev or test.
Args:
batch_size: int. The batch size of generated data.
phase: string. The phase for which to generate data.
            epoch: int. Total epochs to generate data.
shuffle: bool. Whether to shuffle examples.
"""
if phase == 'train':
examples = self.get_train_examples(self.data_dir)
self.num_examples['train'] = len(examples)
elif phase == 'dev':
examples = self.get_dev_examples(self.data_dir)
self.num_examples['dev'] = len(examples)
elif phase == 'test':
examples = self.get_test_examples(self.data_dir)
self.num_examples['test'] = len(examples)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
def instance_reader():
label_list = self.get_labels() if not is_regression else None
for epoch_index in range(epoch):
if shuffle:
np.random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
for (index, example) in enumerate(examples):
if phase == 'train':
self.current_train_example = index + 1
feature = convert_single_example(index, example, label_list,
self.max_seq_length,
self.tokenize_fn)
instance = [
feature.input_ids, feature.input_mask,
feature.segment_ids, feature.label_id,
feature.is_real_example
]
yield instance
def batch_reader(reader, batch_size):
batch = []
for instance in reader():
if len(batch) < batch_size:
batch.append(instance)
else:
yield batch
batch = [instance]
if len(batch) > 0:
yield batch
def wrapper():
all_dev_batches = []
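            # buffer dev_count batches so each device receives one batch per
            # step; a trailing remainder smaller than dev_count is dropped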
for batch_data in batch_reader(instance_reader, batch_size):
batch_data = self.prepare_batch_data(batch_data, is_regression)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
return wrapper
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
            tokens_b.pop()
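# A quick illustration of the heuristic above:
#   a, b = list(range(6)), list(range(3))
#   _truncate_seq_pair(a, b, 7)
#   # -> a == [0, 1, 2, 3], b == [0, 1, 2]  (the longer side is popped first)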
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
class GLUEProcessor(DataProcessor):
def __init__(self, args):
super(GLUEProcessor, self).__init__(args)
self.train_file = "train.tsv"
self.dev_file = "dev.tsv"
self.test_file = "test.tsv"
self.label_column = None
self.text_a_column = None
self.text_b_column = None
self.contains_header = True
self.test_text_a_column = None
self.test_text_b_column = None
self.test_contains_header = True
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.train_file)), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.dev_file)), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
if self.test_text_a_column is None:
self.test_text_a_column = self.text_a_column
if self.test_text_b_column is None:
self.test_text_b_column = self.text_b_column
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.test_file)), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0 and self.contains_header and set_type != "test":
continue
if i == 0 and self.test_contains_header and set_type == "test":
continue
guid = "%s-%s" % (set_type, i)
a_column = (self.text_a_column
if set_type != "test" else self.test_text_a_column)
b_column = (self.text_b_column
if set_type != "test" else self.test_text_b_column)
# there are some incomplete lines in QNLI
if len(line) <= a_column:
                print('Incomplete line, ignored.')
continue
text_a = line[a_column]
if b_column is not None:
if len(line) <= b_column:
                    print('Incomplete line, ignored.')
continue
text_b = line[b_column]
else:
text_b = None
if set_type == "test":
label = self.get_labels()[0]
else:
if len(line) <= self.label_column:
                    print('Incomplete line, ignored.')
continue
label = line[self.label_column]
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class Yelp5Processor(DataProcessor):
def __init__(self, args):
super(Yelp5Processor, self).__init__(args)
def get_train_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "train.csv"))
def get_dev_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "test.csv"))
def get_labels(self):
"""See base class."""
return ["1", "2", "3", "4", "5"]
def _create_examples(self, input_file):
"""Creates examples for the training and dev sets."""
examples = []
        with io.open(input_file, "r", encoding="utf8") as f:
reader = csv.reader(f)
for i, line in enumerate(reader):
label = line[0]
text_a = line[1].replace('""', '"').replace('\\"', '"')
examples.append(
InputExample(
guid=str(i), text_a=text_a, text_b=None, label=label))
return examples
class ImdbProcessor(DataProcessor):
def __init__(self, args):
super(ImdbProcessor, self).__init__(args)
def get_labels(self):
return ["neg", "pos"]
def get_train_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "train"))
def get_dev_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "test"))
def _create_examples(self, data_dir):
examples = []
for label in ["neg", "pos"]:
cur_dir = os.path.join(data_dir, label)
for filename in os.listdir(cur_dir):
if not filename.endswith("txt"): continue
path = os.path.join(cur_dir, filename)
with io.open(path, 'r', encoding='utf8') as f:
text = f.read().strip().replace("<br />", " ")
examples.append(
InputExample(
guid="unused_id", text_a=text, text_b=None,
label=label))
return examples
class MnliMatchedProcessor(GLUEProcessor):
def __init__(self, args):
super(MnliMatchedProcessor, self).__init__(args)
self.dev_file = "dev_matched.tsv"
self.test_file = "test_matched.tsv"
self.label_column = -1
self.text_a_column = 8
self.text_b_column = 9
def get_labels(self):
return ["contradiction", "entailment", "neutral"]
class MnliMismatchedProcessor(MnliMatchedProcessor):
def __init__(self, args):
super(MnliMismatchedProcessor, self).__init__(args)
self.dev_file = "dev_mismatched.tsv"
self.test_file = "test_mismatched.tsv"
class StsbProcessor(GLUEProcessor):
def __init__(self, args):
super(StsbProcessor, self).__init__(args)
self.label_column = 9
self.text_a_column = 7
self.text_b_column = 8
def get_labels(self):
return [0.0]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0 and self.contains_header and set_type != "test":
continue
if i == 0 and self.test_contains_header and set_type == "test":
continue
guid = "%s-%s" % (set_type, i)
a_column = (self.text_a_column
if set_type != "test" else self.test_text_a_column)
b_column = (self.text_b_column
if set_type != "test" else self.test_text_b_column)
# there are some incomplete lines in QNLI
if len(line) <= a_column:
                print('Incomplete line, ignored.')
continue
text_a = line[a_column]
if b_column is not None:
if len(line) <= b_column:
                    print('Incomplete line, ignored.')
continue
text_b = line[b_column]
else:
text_b = None
if set_type == "test":
label = self.get_labels()[0]
else:
if len(line) <= self.label_column:
                    print('Incomplete line, ignored.')
continue
label = float(line[self.label_column])
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
if __name__ == '__main__':
pass
# coding=utf-8
"""This file is adapted from https://github.com/zihangdai/xlnet"""
import io
import six
import sys
import math
import json
import random
import collections
import gc
import numpy as np
sys.path.append('.')
import squad_utils
from data_utils import SEP_ID, CLS_ID, VOCAB_SIZE
import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids, encode_pieces, printable_text
SPIECE_UNDERLINE = u'▁'
SEG_ID_P = 0
SEG_ID_Q = 1
SEG_ID_CLS = 2
SEG_ID_PAD = 3
class SquadExample(object):
"""A single training/test example for simple sequence classification.
    For examples without an answer, the start position is -1.
"""
def __init__(self,
qas_id,
question_text,
paragraph_text,
orig_answer_text=None,
start_position=None,
is_impossible=False):
self.qas_id = qas_id
self.question_text = question_text
self.paragraph_text = paragraph_text
self.orig_answer_text = orig_answer_text
self.start_position = start_position
self.is_impossible = is_impossible
def __str__(self):
return self.__repr__()
def __repr__(self):
s = ""
s += "qas_id: %s" % (printable_text(self.qas_id))
s += ", question_text: %s" % (printable_text(self.question_text))
s += ", paragraph_text: [%s]" % (" ".join(self.paragraph_text))
        if self.start_position:
            s += ", start_position: %d" % (self.start_position)
        s += ", is_impossible: %r" % (self.is_impossible)
return s
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
unique_id,
example_index,
doc_span_index,
tok_start_to_orig_index,
tok_end_to_orig_index,
token_is_max_context,
input_ids,
input_mask,
p_mask,
segment_ids,
paragraph_len,
cls_index,
start_position=None,
end_position=None,
is_impossible=None):
self.unique_id = unique_id
self.example_index = example_index
self.doc_span_index = doc_span_index
self.tok_start_to_orig_index = tok_start_to_orig_index
self.tok_end_to_orig_index = tok_end_to_orig_index
self.token_is_max_context = token_is_max_context
self.input_ids = input_ids
self.input_mask = input_mask
self.p_mask = p_mask
self.segment_ids = segment_ids
self.paragraph_len = paragraph_len
self.cls_index = cls_index
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
def read_squad_examples(input_file, is_training):
"""Read a SQuAD json file into a list of SquadExample."""
with io.open(input_file, "r", encoding="utf8") as reader:
input_data = json.load(reader)["data"]
examples = []
for entry in input_data:
for paragraph in entry["paragraphs"]:
paragraph_text = paragraph["context"]
for qa in paragraph["qas"]:
qas_id = qa["id"]
question_text = qa["question"]
start_position = None
orig_answer_text = None
is_impossible = False
if is_training:
is_impossible = qa["is_impossible"]
if (len(qa["answers"]) != 1) and (not is_impossible):
raise ValueError(
"For training, each question should have exactly 1 answer."
)
if not is_impossible:
answer = qa["answers"][0]
orig_answer_text = answer["text"]
start_position = answer["answer_start"]
else:
start_position = -1
orig_answer_text = ""
example = SquadExample(
qas_id=qas_id,
question_text=question_text,
paragraph_text=paragraph_text,
orig_answer_text=orig_answer_text,
start_position=start_position,
is_impossible=is_impossible)
examples.append(example)
return examples
def _convert_index(index, pos, M=None, is_start=True):
if index[pos] is not None:
return index[pos]
N = len(index)
rear = pos
while rear < N - 1 and index[rear] is None:
rear += 1
front = pos
while front > 0 and index[front] is None:
front -= 1
assert index[front] is not None or index[rear] is not None
if index[front] is None:
if index[rear] >= 1:
if is_start:
return 0
else:
return index[rear] - 1
return index[rear]
if index[rear] is None:
if M is not None and index[front] < M - 1:
if is_start:
return index[front] + 1
else:
return M - 1
return index[front]
if is_start:
if index[rear] > index[front] + 1:
return index[front] + 1
else:
return index[rear]
else:
if index[rear] > index[front] + 1:
return index[rear] - 1
else:
return index[front]
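# A sketch of _convert_index: a position mapped to None falls back to its
# nearest mapped neighbor, biased by is_start. For example:
#   index = [0, None, None, 5]
#   _convert_index(index, 1, is_start=True)   -> 1  (index[front] + 1)
#   _convert_index(index, 1, is_start=False)  -> 4  (index[rear] - 1)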
def convert_examples_to_features(examples, sp_model, max_seq_length, doc_stride,
max_query_length, is_training, uncased):
"""Loads a data file into a list of `InputBatch`s."""
cnt_pos, cnt_neg = 0, 0
unique_id = 1000000000
max_N, max_M = 1024, 1024
f = np.zeros((max_N, max_M), dtype=np.float32)
for (example_index, example) in enumerate(examples):
if example_index % 100 == 0:
print('Converting {}/{} pos {} neg {}'.format(
example_index, len(examples), cnt_pos, cnt_neg))
query_tokens = encode_ids(
sp_model, preprocess_text(
example.question_text, lower=uncased))
if len(query_tokens) > max_query_length:
query_tokens = query_tokens[0:max_query_length]
paragraph_text = example.paragraph_text
para_tokens = encode_pieces(
sp_model, preprocess_text(
example.paragraph_text, lower=uncased))
chartok_to_tok_index = []
tok_start_to_chartok_index = []
tok_end_to_chartok_index = []
char_cnt = 0
for i, token in enumerate(para_tokens):
chartok_to_tok_index.extend([i] * len(token))
tok_start_to_chartok_index.append(char_cnt)
char_cnt += len(token)
tok_end_to_chartok_index.append(char_cnt - 1)
tok_cat_text = ''.join(para_tokens).replace(SPIECE_UNDERLINE, ' ')
N, M = len(paragraph_text), len(tok_cat_text)
if N > max_N or M > max_M:
max_N = max(N, max_N)
max_M = max(M, max_M)
f = np.zeros((max_N, max_M), dtype=np.float32)
gc.collect()
g = {}
def _lcs_match(max_dist):
f.fill(0)
g.clear()
            ### longest common subsequence
# f[i, j] = max(f[i - 1, j], f[i, j - 1], f[i - 1, j - 1] + match(i, j))
for i in range(N):
# note(zhiliny):
# unlike standard LCS, this is specifically optimized for the setting
# because the mismatch between sentence pieces and original text will
# be small
for j in range(i - max_dist, i + max_dist):
if j >= M or j < 0: continue
if i > 0:
g[(i, j)] = 0
f[i, j] = f[i - 1, j]
if j > 0 and f[i, j - 1] > f[i, j]:
g[(i, j)] = 1
f[i, j] = f[i, j - 1]
f_prev = f[i - 1, j - 1] if i > 0 and j > 0 else 0
if (preprocess_text(
paragraph_text[i], lower=uncased,
remove_space=False) == tok_cat_text[j] and
f_prev + 1 > f[i, j]):
g[(i, j)] = 2
f[i, j] = f_prev + 1
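        # run the banded LCS once; if fewer than 80% of the original characters
        # were matched, double the band width and retry (range(2) below allows
        # at most one retry)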
max_dist = abs(N - M) + 5
for _ in range(2):
_lcs_match(max_dist)
if f[N - 1, M - 1] > 0.8 * N: break
max_dist *= 2
orig_to_chartok_index = [None] * N
chartok_to_orig_index = [None] * M
i, j = N - 1, M - 1
while i >= 0 and j >= 0:
if (i, j) not in g: break
if g[(i, j)] == 2:
orig_to_chartok_index[i] = j
chartok_to_orig_index[j] = i
i, j = i - 1, j - 1
elif g[(i, j)] == 1:
j = j - 1
else:
i = i - 1
if all(v is None
for v in orig_to_chartok_index) or f[N - 1, M - 1] < 0.8 * N:
print('MISMATCH DETECTED!')
continue
tok_start_to_orig_index = []
tok_end_to_orig_index = []
for i in range(len(para_tokens)):
start_chartok_pos = tok_start_to_chartok_index[i]
end_chartok_pos = tok_end_to_chartok_index[i]
start_orig_pos = _convert_index(
chartok_to_orig_index, start_chartok_pos, N, is_start=True)
end_orig_pos = _convert_index(
chartok_to_orig_index, end_chartok_pos, N, is_start=False)
tok_start_to_orig_index.append(start_orig_pos)
tok_end_to_orig_index.append(end_orig_pos)
if not is_training:
tok_start_position = tok_end_position = None
if is_training and example.is_impossible:
tok_start_position = -1
tok_end_position = -1
if is_training and not example.is_impossible:
start_position = example.start_position
end_position = start_position + len(example.orig_answer_text) - 1
start_chartok_pos = _convert_index(
orig_to_chartok_index, start_position, is_start=True)
tok_start_position = chartok_to_tok_index[start_chartok_pos]
end_chartok_pos = _convert_index(
orig_to_chartok_index, end_position, is_start=False)
tok_end_position = chartok_to_tok_index[end_chartok_pos]
assert tok_start_position <= tok_end_position
def _piece_to_id(x):
if six.PY2 and isinstance(x, unicode):
x = x.encode('utf-8')
return sp_model.PieceToId(x)
all_doc_tokens = list(map(_piece_to_id, para_tokens))
# The -3 accounts for [CLS], [SEP] and [SEP]
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
# We can have documents that are longer than the maximum sequence length.
# To deal with this we do a sliding window approach, where we take chunks
        # of up to our max length with a stride of `doc_stride`.
_DocSpan = collections.namedtuple( # pylint: disable=invalid-name
"DocSpan", ["start", "length"])
doc_spans = []
start_offset = 0
while start_offset < len(all_doc_tokens):
length = len(all_doc_tokens) - start_offset
if length > max_tokens_for_doc:
length = max_tokens_for_doc
doc_spans.append(_DocSpan(start=start_offset, length=length))
if start_offset + length == len(all_doc_tokens):
break
start_offset += min(length, doc_stride)
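        # e.g. with 10 doc tokens, max_tokens_for_doc=5 and doc_stride=3 the
        # loop above yields spans (0,5), (3,5), (6,4)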
for (doc_span_index, doc_span) in enumerate(doc_spans):
tokens = []
token_is_max_context = {}
segment_ids = []
p_mask = []
cur_tok_start_to_orig_index = []
cur_tok_end_to_orig_index = []
for i in range(doc_span.length):
split_token_index = doc_span.start + i
cur_tok_start_to_orig_index.append(tok_start_to_orig_index[
split_token_index])
cur_tok_end_to_orig_index.append(tok_end_to_orig_index[
split_token_index])
is_max_context = _check_is_max_context(
doc_spans, doc_span_index, split_token_index)
token_is_max_context[len(tokens)] = is_max_context
tokens.append(all_doc_tokens[split_token_index])
segment_ids.append(SEG_ID_P)
p_mask.append(0)
paragraph_len = len(tokens)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_P)
p_mask.append(1)
# note(zhiliny): we put P before Q
# because during pretraining, B is always shorter than A
for token in query_tokens:
tokens.append(token)
segment_ids.append(SEG_ID_Q)
p_mask.append(1)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_Q)
p_mask.append(1)
cls_index = len(segment_ids)
tokens.append(CLS_ID)
segment_ids.append(SEG_ID_CLS)
p_mask.append(0)
input_ids = tokens
# The mask has 0 for real tokens and 1 for padding tokens. Only real
# tokens are attended to.
input_mask = [0] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(1)
segment_ids.append(SEG_ID_PAD)
p_mask.append(1)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(p_mask) == max_seq_length
span_is_impossible = example.is_impossible
start_position = None
end_position = None
if is_training and not span_is_impossible:
# For training, if our document chunk does not contain an annotation
# we throw it out, since there is nothing to predict.
doc_start = doc_span.start
doc_end = doc_span.start + doc_span.length - 1
out_of_span = False
if not (tok_start_position >= doc_start and
tok_end_position <= doc_end):
out_of_span = True
if out_of_span:
# continue
start_position = 0
end_position = 0
span_is_impossible = True
else:
# note(zhiliny): we put P before Q, so doc_offset should be zero.
# doc_offset = len(query_tokens) + 2
doc_offset = 0
start_position = tok_start_position - doc_start + doc_offset
end_position = tok_end_position - doc_start + doc_offset
if is_training and span_is_impossible:
start_position = cls_index
end_position = cls_index
if example_index < 0:
print("*** Example ***")
print("unique_id: %s" % (unique_id))
print("example_index: %s" % (example_index))
print("doc_span_index: %s" % (doc_span_index))
print("tok_start_to_orig_index: %s" %
" ".join([str(x) for x in cur_tok_start_to_orig_index]))
print("tok_end_to_orig_index: %s" %
" ".join([str(x) for x in cur_tok_end_to_orig_index]))
print("token_is_max_context: %s" % " ".join([
"%d:%s" % (x, y)
for (x, y) in six.iteritems(token_is_max_context)
]))
print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
print("segment_ids: %s" %
" ".join([str(x) for x in segment_ids]))
if is_training and span_is_impossible:
print("impossible example span")
if is_training and not span_is_impossible:
pieces = [
sp_model.IdToPiece(token)
for token in tokens[start_position:(end_position + 1)]
]
answer_text = sp_model.DecodePieces(pieces)
print("start_position: %d" % (start_position))
print("end_position: %d" % (end_position))
print("answer: %s" % (printable_text(answer_text)))
# note(zhiliny): With multi processing,
# the example_index is actually the index within the current process
            # therefore we set example_index=None to avoid it being used downstream.
# The current code does not use example_index of training data.
if is_training:
feat_example_index = None
else:
feat_example_index = example_index
feature = InputFeatures(
unique_id=unique_id,
example_index=feat_example_index,
doc_span_index=doc_span_index,
tok_start_to_orig_index=cur_tok_start_to_orig_index,
tok_end_to_orig_index=cur_tok_end_to_orig_index,
token_is_max_context=token_is_max_context,
input_ids=input_ids,
input_mask=input_mask,
p_mask=p_mask,
segment_ids=segment_ids,
paragraph_len=paragraph_len,
cls_index=cls_index,
start_position=start_position,
end_position=end_position,
is_impossible=span_is_impossible)
unique_id += 1
if span_is_impossible:
cnt_neg += 1
else:
cnt_pos += 1
yield feature
print("Total number of instances: {} = pos {} neg {}".format(
cnt_pos + cnt_neg, cnt_pos, cnt_neg))
def _check_is_max_context(doc_spans, cur_span_index, position):
"""Check if this is the 'max context' doc span for the token."""
# Because of the sliding window approach taken to scoring documents, a single
# token can appear in multiple documents. E.g.
# Doc: the man went to the store and bought a gallon of milk
# Span A: the man went to the
# Span B: to the store and bought
# Span C: and bought a gallon of
# ...
#
# Now the word 'bought' will have two scores from spans B and C. We only
# want to consider the score with "maximum context", which we define as
# the *minimum* of its left and right context (the *sum* of left and
# right context will always be the same, of course).
#
# In the example the maximum context for 'bought' would be span C since
# it has 1 left context and 3 right context, while span B has 4 left context
# and 0 right context.
best_score = None
best_span_index = None
for (span_index, doc_span) in enumerate(doc_spans):
end = doc_span.start + doc_span.length - 1
if position < doc_span.start:
continue
if position > end:
continue
num_left_context = position - doc_span.start
num_right_context = end - position
score = min(num_left_context,
num_right_context) + 0.01 * doc_span.length
if best_score is None or score > best_score:
best_score = score
best_span_index = span_index
return cur_span_index == best_span_index
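# A runnable check of the max-context rule above, on hypothetical toy spans:
def _demo_check_is_max_context():
    Span = collections.namedtuple("DocSpan", ["start", "length"])
    toy_spans = [Span(0, 5), Span(2, 5), Span(4, 5)]
    # token at absolute position 6 appears in toy_spans[1] (0 tokens of right
    # context) and toy_spans[2] (2 tokens on each side), so toy_spans[2] wins
    assert not _check_is_max_context(toy_spans, 1, 6)
    assert _check_is_max_context(toy_spans, 2, 6)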
class DataProcessor(object):
def __init__(self, spiece_model_file, uncased, max_seq_length, doc_stride,
max_query_length):
self._sp_model = spm.SentencePieceProcessor()
self._sp_model.Load(spiece_model_file)
self._uncased = uncased
self._max_seq_length = max_seq_length
self._doc_stride = doc_stride
self._max_query_length = max_query_length
self.current_train_example = -1
self.num_train_examples = -1
self.current_train_epoch = -1
self.train_examples = None
self.predict_examples = None
self.num_examples = {'train': -1, 'predict': -1}
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
def get_examples(self, data_path, is_training):
examples = read_squad_examples(
input_file=data_path, is_training=is_training)
return examples
def get_num_examples(self, phase):
if phase not in ['train', 'predict']:
raise ValueError(
"Unknown phase, which should be in ['train', 'predict'].")
return self.num_examples[phase]
def get_features(self, examples, is_training):
features = convert_examples_to_features(
examples=examples,
sp_model=self._sp_model,
max_seq_length=self._max_seq_length,
doc_stride=self._doc_stride,
max_query_length=self._max_query_length,
is_training=is_training,
uncased=self._uncased)
return features
def data_generator(self,
data_path,
batch_size,
phase='train',
shuffle=False,
dev_count=1,
epoch=1):
if phase == 'train':
self.train_examples = self.get_examples(data_path, is_training=True)
examples = self.train_examples
self.num_examples['train'] = len(self.train_examples)
elif phase == 'predict':
self.predict_examples = self.get_examples(
data_path, is_training=False)
examples = self.predict_examples
self.num_examples['predict'] = len(self.predict_examples)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'predict'].")
def batch_reader(features, batch_size):
batch = []
for (index, feature) in enumerate(features):
if phase == 'train':
self.current_train_example = index + 1
labels = [feature.unique_id
] if feature.start_position is None else [
feature.start_position, feature.end_position,
feature.is_impossible
]
example = [
feature.input_ids, feature.segment_ids, feature.input_mask,
feature.cls_index, feature.p_mask
] + labels
to_append = len(batch) < batch_size
if to_append:
batch.append(example)
else:
yield batch
batch = [example]
if len(batch) > 0:
yield batch
def prepare_batch_data(insts):
"""Generate numpy tensors"""
input_ids = np.expand_dims(
np.array([inst[0] for inst in insts]).astype('int64'), axis=-1)
segment_ids = np.array([inst[1] for inst in insts]).astype('int64')
input_mask = np.array([inst[2] for inst in insts]).astype('float32')
cls_index = np.expand_dims(
np.array([inst[3] for inst in insts]).astype('int64'), axis=-1)
p_mask = np.array([inst[4] for inst in insts]).astype('float32')
ret_list = [input_ids, segment_ids, input_mask, cls_index, p_mask]
if phase == 'train':
start_positions = np.expand_dims(
np.array([inst[5] for inst in insts]).astype('int64'),
axis=-1)
end_positions = np.expand_dims(
np.array([inst[6] for inst in insts]).astype('int64'),
axis=-1)
is_impossible = np.expand_dims(
np.array([inst[7] for inst in insts]).astype('float32'),
axis=-1)
ret_list += [start_positions, end_positions, is_impossible]
else:
unique_ids = np.expand_dims(
np.array([inst[5] for inst in insts]).astype('int64'),
axis=-1)
ret_list += [unique_ids]
return ret_list
def wrapper():
for epoch_index in range(epoch):
if shuffle:
random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
features = self.get_features(examples, is_training=True)
else:
features = self.get_features(examples, is_training=False)
all_dev_batches = []
for batch_insts in batch_reader(features, batch_size):
batch_data = prepare_batch_data(batch_insts)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
return wrapper
_PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name
"PrelimPrediction", [
"feature_index", "start_index", "end_index", "start_log_prob",
"end_log_prob"
])
_NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name
"NbestPrediction", ["text", "start_log_prob", "end_log_prob"])
def write_predictions(all_examples, all_features, all_results, n_best_size,
max_answer_length, output_prediction_file,
output_nbest_file, output_null_log_odds_file, orig_data,
args):
"""Write final predictions to the json file and log-odds of null if needed."""
print("Writing predictions to: %s" % (output_prediction_file))
# tf.logging.info("Writing nbest to: %s" % (output_nbest_file))
example_index_to_features = collections.defaultdict(list)
for feature in all_features:
example_index_to_features[feature.example_index].append(feature)
unique_id_to_result = {}
for result in all_results:
unique_id_to_result[result.unique_id] = result
all_predictions = collections.OrderedDict()
all_nbest_json = collections.OrderedDict()
scores_diff_json = collections.OrderedDict()
for (example_index, example) in enumerate(all_examples):
features = example_index_to_features[example_index]
prelim_predictions = []
# keep track of the minimum score of null start+end of position 0
score_null = 1000000 # large and positive
for (feature_index, feature) in enumerate(features):
result = unique_id_to_result[feature.unique_id]
cur_null_score = result.cls_logits
# if we could have irrelevant answers, get the min score of irrelevant
score_null = min(score_null, cur_null_score)
for i in range(args.start_n_top):
for j in range(args.end_n_top):
start_log_prob = result.start_top_log_probs[i]
start_index = result.start_top_index[i]
j_index = i * args.end_n_top + j
end_log_prob = result.end_top_log_probs[j_index]
end_index = result.end_top_index[j_index]
# We could hypothetically create invalid predictions, e.g., predict
# that the start of the span is in the question. We throw out all
# invalid predictions.
if start_index >= feature.paragraph_len - 1:
continue
if end_index >= feature.paragraph_len - 1:
continue
if not feature.token_is_max_context.get(start_index, False):
continue
if end_index < start_index:
continue
length = end_index - start_index + 1
if length > max_answer_length:
continue
prelim_predictions.append(
_PrelimPrediction(
feature_index=feature_index,
start_index=start_index,
end_index=end_index,
start_log_prob=start_log_prob,
end_log_prob=end_log_prob))
prelim_predictions = sorted(
prelim_predictions,
key=lambda x: (x.start_log_prob + x.end_log_prob),
reverse=True)
seen_predictions = {}
nbest = []
for pred in prelim_predictions:
if len(nbest) >= n_best_size:
break
feature = features[pred.feature_index]
tok_start_to_orig_index = feature.tok_start_to_orig_index
tok_end_to_orig_index = feature.tok_end_to_orig_index
start_orig_pos = tok_start_to_orig_index[pred.start_index]
end_orig_pos = tok_end_to_orig_index[pred.end_index]
paragraph_text = example.paragraph_text
final_text = paragraph_text[start_orig_pos:end_orig_pos + 1].strip()
if final_text in seen_predictions:
continue
seen_predictions[final_text] = True
nbest.append(
_NbestPrediction(
text=final_text,
start_log_prob=pred.start_log_prob,
end_log_prob=pred.end_log_prob))
# In very rare edge cases we could have no valid predictions. So we
# just create a nonce prediction in this case to avoid failure.
if not nbest:
nbest.append(
_NbestPrediction(
text="", start_log_prob=-1e6, end_log_prob=-1e6))
total_scores = []
best_non_null_entry = None
for entry in nbest:
total_scores.append(entry.start_log_prob + entry.end_log_prob)
if not best_non_null_entry:
best_non_null_entry = entry
probs = _compute_softmax(total_scores)
nbest_json = []
for (i, entry) in enumerate(nbest):
output = collections.OrderedDict()
output["text"] = entry.text
output["probability"] = probs[i]
output["start_log_prob"] = entry.start_log_prob
output["end_log_prob"] = entry.end_log_prob
nbest_json.append(output)
assert len(nbest_json) >= 1
assert best_non_null_entry is not None
score_diff = score_null
scores_diff_json[example.qas_id] = score_diff
# note(zhiliny): always predict best_non_null_entry
# and the evaluation script will search for the best threshold
all_predictions[example.qas_id] = best_non_null_entry.text
all_nbest_json[example.qas_id] = nbest_json
with io.open(output_prediction_file, "w", encoding="utf8") as writer:
writer.write(json.dumps(all_predictions, indent=4) + u"\n")
with io.open(output_nbest_file, "w", encoding="utf8") as writer:
writer.write(json.dumps(all_nbest_json, indent=4) + u"\n")
with io.open(output_null_log_odds_file, "w", encoding="utf8") as writer:
writer.write(json.dumps(scores_diff_json, indent=4) + u"\n")
qid_to_has_ans = squad_utils.make_qid_to_has_ans(orig_data)
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
exact_raw, f1_raw = squad_utils.get_raw_scores(orig_data, all_predictions)
out_eval = {}
squad_utils.find_all_best_thresh_v2(out_eval, all_predictions, exact_raw,
f1_raw, scores_diff_json,
qid_to_has_ans)
return out_eval
def _get_best_indexes(logits, n_best_size):
"""Get the n-best logits from a list."""
index_and_score = sorted(
enumerate(logits), key=lambda x: x[1], reverse=True)
best_indexes = []
for i in range(len(index_and_score)):
if i >= n_best_size:
break
best_indexes.append(index_and_score[i][0])
return best_indexes
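# e.g. _get_best_indexes([0.1, 0.5, 0.3], n_best_size=2) -> [1, 2]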
def _compute_softmax(scores):
"""Compute softmax probability over raw logits."""
if not scores:
return []
max_score = None
for score in scores:
if max_score is None or score > max_score:
max_score = score
exp_scores = []
total_sum = 0.0
for score in scores:
x = math.exp(score - max_score)
exp_scores.append(x)
total_sum += x
probs = []
for score in exp_scores:
probs.append(score / total_sum)
return probs
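# e.g. _compute_softmax([1.0, 2.0, 3.0]) -> roughly [0.0900, 0.2447, 0.6652];
# subtracting max_score first keeps math.exp stable for large logits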
if __name__ == '__main__':
processor = DataProcessor(
spiece_model_file="xlnet_cased_L-24_H-1024_A-16/spiece.model",
uncased=False,
max_seq_length=512,
doc_stride=128,
max_query_length=64)
train_data_generator = processor.data_generator(
data_path="squad_v2.0/dev-v2.0.json",
batch_size=32,
phase='predict',
shuffle=True,
dev_count=1,
epoch=1)
for (index, sample) in enumerate(train_data_generator()):
if index < 10:
print("index:", index)
for tensor in sample:
print(tensor.shape)
else:
break
#for (index, example) in enumerate(train_examples):
# if index < 5:
# print(example)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fine-tuning on regression/classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import sys
if six.PY2:
reload(sys)
sys.setdefaultencoding('utf8')
import os
import time
import json
import argparse
import numpy as np
import subprocess
import multiprocessing
from scipy.stats import pearsonr
import paddle
import paddle.fluid as fluid
import reader.cls as reader
from model.xlnet import XLNetConfig
from model.classifier import create_model
from optimization import optimization
from utils.args import ArgumentGroup, print_arguments, check_cuda
from utils.init import init_pretraining_params, init_checkpoint
from utils import dist_utils  # assumed module path; provides prepare_for_multi_process used below
num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("model_config_path", str, None, "Path to the json file for bert model config.")
model_g.add_arg("dropout", float, 0.1, "Dropout rate.")
model_g.add_arg("dropatt", float, 0.1, "Attention dropout rate.")
model_g.add_arg("clamp_len", int, -1, "Clamp length.")
model_g.add_arg("summary_type", str, "last",
"Method used to summarize a sequence into a vector.", choices=['last'])
model_g.add_arg("use_summ_proj", bool, True,
"Whether to use projection for summarizing sequences.")
model_g.add_arg("spiece_model_file", str, None, "Sentence Piece model path.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
init_g = ArgumentGroup(parser, "init", "parameter initialization options.")
init_g.add_arg("init", str, "normal", "Initialization method.", choices=["normal", "uniform"])
init_g.add_arg("init_std", str, 0.02, "Initialization std when init is normal.")
init_g.add_arg("init_range", str, 0.1, "Initialization std when init is uniform.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 1000, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("lr_layer_decay_rate", float, 1.0, "Top layer: lr[L] = args.learning_rate. "
"Lower layers: lr[l-1] = lr[l] * lr_layer_decay_rate.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("train_batch_size", int, 8, "Total examples' number in batch for training.")
train_g.add_arg("eval_batch_size", int, 128, "Total examples' number in batch for development.")
train_g.add_arg("predict_batch_size", int, 128, "Total examples' number in batch for prediction.")
train_g.add_arg("train_steps", int, 1000, "The total steps for training.")
train_g.add_arg("warmup_steps", int, 1000, "The steps for warmup.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("data_dir", str, None, "Path to training data.")
data_g.add_arg("predict_dir", str, None, "Path to write predict results.")
data_g.add_arg("predict_threshold", float, 0.0, "Threshold for binary prediction.")
data_g.add_arg("max_seq_length", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("uncased", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("shuffle", bool, True, "")
run_type_g.add_arg("task_name", str, None,
"The name of task to perform fine-tuning, should be in {'xnli', 'mnli', 'cola', 'mrpc'}.")
run_type_g.add_arg("is_regression", str, None, "Whether it's a regression task.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_eval", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_predict", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("eval_split", str, "dev", "Could be dev or test")
parser.add_argument("--enable_ce", action='store_true', help="The flag indicating whether to run the task for continuous evaluation.")
args = parser.parse_args()
# yapf: enable.
def evaluate(exe, predict_program, test_data_loader, fetch_list, eval_phase, num_examples):
test_data_loader.start()
total_cost, total_num_seqs = [], []
all_logits, all_labels = [], []
time_begin = time.time()
total_steps = int(num_examples / args.eval_batch_size)
steps = 0
while True:
try:
np_loss, np_num_seqs, np_logits, np_labels = exe.run(program=predict_program,
fetch_list=fetch_list)
total_cost.extend(np_loss * np_num_seqs)
total_num_seqs.extend(np_num_seqs)
all_logits.extend(np_logits)
all_labels.extend(np_labels)
            if steps % max(1, total_steps // 10) == 0:
print("Evaluation [{}/{}]".format(steps, total_steps))
steps += 1
except fluid.core.EOFException:
test_data_loader.reset()
break
all_logits = np.array(all_logits)
all_labels = np.array(all_labels)
if args.is_regression:
key = "eval_pearsonr"
eval_result, _ = pearsonr(all_logits, all_labels)
else:
key = "eval_accuracy"
pred = np.argmax(all_logits, axis=1).reshape(all_labels.shape)
eval_result = np.sum(pred == all_labels) / float(all_labels.size)
time_end = time.time()
print("[%s evaluation] ave loss: %f, %s: %f, elapsed time: %f s" %
(eval_phase, np.sum(total_cost) / np.sum(total_num_seqs), key, eval_result,
time_end - time_begin))
def predict(exe, predict_program, test_data_loader, task_name, label_list, fetch_list):
test_data_loader.start()
pred_cnt = 0
predict_results = []
with open(os.path.join(args.predict_dir, "{}.tsv".format(
task_name)), "w") as fout:
fout.write("index\tprediction\n")
while True:
try:
np_logits = exe.run(program=predict_program,
fetch_list=fetch_list)
for result in np_logits[0]:
if pred_cnt % 1000 == 0:
print("Predicting submission for example: {}".format(
pred_cnt))
logits = [float(x) for x in result.flat]
predict_results.append(logits)
if len(logits) == 1:
label_out = logits[0]
elif len(logits) == 2:
if logits[1] - logits[0] > args.predict_threshold:
label_out = label_list[1]
else:
label_out = label_list[0]
elif len(logits) > 2:
max_index = np.argmax(np.array(logits, dtype=np.float32))
label_out = label_list[max_index]
else:
raise NotImplementedError
fout.write("{}\t{}\n".format(pred_cnt, label_out))
pred_cnt += 1
except fluid.core.EOFException:
test_data_loader.reset()
break
predict_json_path = os.path.join(args.predict_dir, "{}.logits.json".format(
task_name))
with open(predict_json_path, "w") as fp:
json.dump(predict_results, fp, indent=4)
def get_device_num():
    # NOTE(zcd): for multi-process training, each process uses one GPU card.
    if num_trainers > 1: return 1
visible_device = os.environ.get('CUDA_VISIBLE_DEVICES', None)
if visible_device:
device_num = len(visible_device.split(','))
else:
device_num = subprocess.check_output(['nvidia-smi','-L']).decode().count('\n')
return device_num
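# Example (assumption): with CUDA_VISIBLE_DEVICES="0,1,2,3" and a single
# trainer process, get_device_num() returns 4; with num_trainers > 1, each
# process drives exactly one card, so it returns 1.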
def main(args):
if not (args.do_train or args.do_eval or args.do_predict):
raise ValueError("For args `do_train`, `do_eval` and `do_predict`, at "
"least one of them must be True.")
    if args.do_predict:
        if not args.predict_dir:
            raise ValueError("args 'predict_dir' should be given when doing predict")
        if not os.path.exists(args.predict_dir):
            os.makedirs(args.predict_dir)
xlnet_config = XLNetConfig(args.model_config_path)
xlnet_config.print_config()
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
dev_count = get_device_num()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
task_name = args.task_name.lower()
processors = {
"mnli_matched": reader.MnliMatchedProcessor,
"mnli_mismatched": reader.MnliMismatchedProcessor,
'sts-b': reader.StsbProcessor,
'imdb': reader.ImdbProcessor,
"yelp5": reader.Yelp5Processor
}
processor = processors[task_name](args)
label_list = processor.get_labels() if not args.is_regression else None
num_labels = len(label_list) if label_list is not None else None
train_program = fluid.Program()
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
train_program.random_seed = args.random_seed
if args.do_train:
# NOTE: If num_trainers > 1, the shuffle_seed must be set, because
# the order of batch data generated by reader
# must be the same in the respective processes.
shuffle_seed = 1 if num_trainers > 1 else None
train_data_generator = processor.data_generator(
batch_size=args.train_batch_size,
is_regression=args.is_regression,
phase='train',
epoch=args.epoch,
dev_count=dev_count,
shuffle=args.shuffle)
num_train_examples = processor.get_num_examples(phase='train')
print("Device count: %d" % dev_count)
print("Max num of epoches: %d" % args.epoch)
print("Num of train examples: %d" % num_train_examples)
print("Num of train steps: %d" % args.train_steps)
print("Num of warmup steps: %d" % args.warmup_steps)
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_data_loader, loss, logits, num_seqs, label_ids = create_model(
args,
xlnet_config=xlnet_config,
n_class=num_labels)
scheduled_lr = optimization(
loss=loss,
warmup_steps=args.warmup_steps,
num_train_steps=args.train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
lr_layer_decay_rate=args.lr_layer_decay_rate,
scheduler=args.lr_scheduler)
if args.do_eval:
dev_prog = fluid.Program()
with fluid.program_guard(dev_prog, startup_prog):
with fluid.unique_name.guard():
dev_data_loader, loss, logits, num_seqs, label_ids = create_model(
args,
xlnet_config=xlnet_config,
n_class=num_labels)
dev_prog = dev_prog.clone(for_test=True)
dev_data_loader.set_batch_generator(
processor.data_generator(
batch_size=args.eval_batch_size,
is_regression=args.is_regression,
phase=args.eval_split,
epoch=1,
dev_count=1,
shuffle=False), place)
if args.do_predict:
predict_prog = fluid.Program()
with fluid.program_guard(predict_prog, startup_prog):
with fluid.unique_name.guard():
predict_data_loader, loss, logits, num_seqs, label_ids = create_model(
args,
xlnet_config=xlnet_config,
n_class=num_labels)
predict_prog = predict_prog.clone(for_test=True)
predict_data_loader.set_batch_generator(
processor.data_generator(
batch_size=args.predict_batch_size,
is_regression=args.is_regression,
phase=args.eval_split,
epoch=1,
dev_count=1,
shuffle=False), place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog)
elif args.do_eval or args.do_predict:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.use_experimental_executor = args.use_fast_executor
exec_strategy.num_threads = dev_count
build_strategy = fluid.BuildStrategy()
if args.use_cuda and num_trainers > 1:
assert shuffle_seed is not None
dist_utils.prepare_for_multi_process(exe, build_strategy, train_program)
train_data_generator = fluid.contrib.reader.distributed_batch_reader(
train_data_generator)
train_compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
train_data_loader.set_batch_generator(train_data_generator, place)
if args.do_train:
train_data_loader.start()
steps = 0
total_cost, total_num_seqs, total_time = [], [], 0.0
throughput = []
ce_info = []
while steps < args.train_steps:
try:
time_begin = time.time()
steps += 1
if steps % args.skip_steps == 0:
fetch_list = [loss.name, scheduled_lr.name, num_seqs.name]
else:
fetch_list = []
outputs = exe.run(train_compiled_program, fetch_list=fetch_list)
time_end = time.time()
used_time = time_end - time_begin
total_time += used_time
if steps % args.skip_steps == 0:
np_loss, np_lr, np_num_seqs = outputs
total_cost.extend(np_loss * np_num_seqs)
total_num_seqs.extend(np_num_seqs)
if args.verbose:
verbose = "train data_loader queue size: %d, " % train_data_loader.queue.size(
)
verbose += "learning rate: %f" % np_lr[0]
print(verbose)
current_example, current_epoch = processor.get_train_progress(
)
log_record = "epoch: {}, progress: {}/{}, step: {}, ave loss: {}".format(
current_epoch, current_example, num_train_examples,
steps, np.sum(total_cost) / np.sum(total_num_seqs))
ce_info.append([np.sum(total_cost) / np.sum(total_num_seqs), used_time])
if steps > 0 :
throughput.append( args.skip_steps / total_time)
log_record = log_record + ", speed: %f steps/s" % (args.skip_steps / total_time)
print(log_record)
else:
print(log_record)
total_cost, total_num_seqs, total_time = [], [], 0.0
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
print("Average throughtput: %s" % (np.average(throughput)))
throughput = []
# evaluate dev set
if args.do_eval:
evaluate(exe, dev_prog, dev_data_loader,
[loss.name, num_seqs.name, logits.name, label_ids.name],
args.eval_split, processor.get_num_examples(phase=args.eval_split))
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_data_loader.reset()
break
if args.enable_ce:
        card_num = get_device_num()
ce_cost = 0
ce_acc = 0
ce_time = 0
        try:
            # each ce_info entry is [ave_loss, used_time]; accuracy is not
            # tracked during training, so ce_acc keeps its default value
            ce_cost = ce_info[-2][0]
            ce_time = ce_info[-2][1]
        except (IndexError, TypeError):
            print("ce info error")
print("kpis\ttrain_duration_%s_card%s\t%s" %
(args.task_name, card_num, ce_time))
print("kpis\ttrain_cost_%s_card%s\t%f" %
(args.task_name, card_num, ce_cost))
print("kpis\ttrain_acc_%s_card%s\t%f" %
(args.task_name, card_num, ce_acc))
# final eval on dev set
if args.do_eval:
evaluate(exe, dev_prog, dev_data_loader,
                 [loss.name, num_seqs.name, logits.name, label_ids.name], args.eval_split,
processor.get_num_examples(phase=args.eval_split))
# final eval on test set
if args.do_predict:
predict(exe, predict_prog, predict_data_loader, task_name, label_list, [logits.name])
if __name__ == '__main__':
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fine-tuning on SQuAD."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import sys
if six.PY2:
reload(sys)
sys.setdefaultencoding('utf8')
import io
import argparse
import collections
import multiprocessing
import os
import time
import numpy as np
import json
import paddle
import paddle.fluid as fluid
from reader.squad import DataProcessor, write_predictions
from model.xlnet import XLNetConfig, XLNetModel
from utils.args import ArgumentGroup, print_arguments
from optimization import optimization
from utils.init import init_pretraining_params, init_checkpoint
from modeling import log_softmax
if six.PY2:
import cPickle as pickle
else:
import pickle
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("model_config_path", str, None, "Path to the json file for xlnet model config.")
model_g.add_arg("dropout", float, 0.1, "Dropout rate.")
model_g.add_arg("dropatt", float, 0.1, "Attention dropout rate.")
model_g.add_arg("clamp_len", int, -1, "Clamp length.")
model_g.add_arg("summary_type", str, "last", "Method used to summarize a sequence into a vector.",
choices=['last'])
model_g.add_arg("spiece_model_file", str, None, "Sentence Piece model path.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
# Parameter initialization
init_g = ArgumentGroup(parser, "init", "parameter initialization options.")
init_g.add_arg("init", str, "normal", "Initialization method.", choices=["normal", "uniform"])
init_g.add_arg("init_std", str, 0.02, "Initialization std when init is normal.")
init_g.add_arg("init_range", str, 0.1, "Initialization std when init is uniform.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("adam_epsilon", float, 1e-6, "Adam epsilon.")
train_g.add_arg("lr_layer_decay_rate", float, 0.75, "Top layer: lr[L] = args.learning_rate. "
"Lower layers: lr[l-1] = lr[l] * lr_layer_decay_rate.")
train_g.add_arg("train_batch_size", int, 12, "Total examples' number in batch for training.")
train_g.add_arg("train_steps", int, 1000, "The total steps for training.")
train_g.add_arg("warmup_steps", int, 1000, "The steps for warmup.")
train_g.add_arg("save_steps", int, 1000, "The steps interval to save checkpoints.")
predict_g = ArgumentGroup(parser, "prediction", "prediction options.")
predict_g.add_arg("predict_batch_size", int, 12, "Total examples' number in batch for training.")
predict_g.add_arg("start_n_top", int, 5, "Beam size for span start.")
predict_g.add_arg("end_n_top", int, 5, "Beam size for span end.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("train_file", str, None, "SQuAD json for training. E.g., train-v1.1.json.")
data_g.add_arg("predict_file", str, None, "SQuAD json for predictions. E.g. dev-v1.1.json or test-v1.1.json.")
data_g.add_arg("max_seq_length", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("max_query_length", int, 64, "Max query length.")
data_g.add_arg("max_answer_length", int, 64, "Max answer length.")
data_g.add_arg("uncased", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("doc_stride", int, 128,
"When splitting up a long document into chunks, how much stride to take between chunks.")
data_g.add_arg("n_best_size", int, 5,
"The total number of n-best predictions to generate in the nbest_predictions.json output file.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_predict", bool, True, "Whether to perform prediction.")
args = parser.parse_args()
# yapf: enable
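# Example invocation (a sketch only -- ${SQUAD_DIR} and ${LARGE_DIR} are
# placeholders for your SQuAD data directory and the extracted pre-trained
# model directory):
#
#   CUDA_VISIBLE_DEVICES=0,1,2,3 python run_squad.py \
#       --train_file=${SQUAD_DIR}/train-v2.0.json \
#       --predict_file=${SQUAD_DIR}/dev-v2.0.json \
#       --spiece_model_file=${LARGE_DIR}/spiece.model \
#       --model_config_path=${LARGE_DIR}/xlnet_config.json \
#       --init_pretraining_params=${LARGE_DIR}/params \
#       --checkpoints=exp/squad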
def get_qa_outputs(xlnet_config, features, is_training=False):
# (qlen, batch size)
input_ids = features['input_ids']
cls_index = features['cls_index']
segment_ids = features['segment_ids']
input_mask = features['input_mask']
p_mask = features['p_mask']
inp = fluid.layers.transpose(input_ids, perm=[1, 0, 2])
inp_mask = fluid.layers.transpose(input_mask, perm=[1, 0])
cls_index = fluid.layers.reshape(cls_index, shape=[-1, 1])
seq_len = inp.shape[0]
xlnet = XLNetModel(
input_ids=inp,
seg_ids=segment_ids,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
output = xlnet.get_sequence_output()
initializer = xlnet.get_initializer()
return_dict = {}
# logit of the start position
start_logits = fluid.layers.fc(
input=output,
num_flatten_dims=2,
size=1,
param_attr=fluid.ParamAttr(name='start_logits_fc_weight', initializer=initializer),
bias_attr='start_logits_fc_bias')
start_logits = fluid.layers.transpose(fluid.layers.squeeze(start_logits, [-1]), [1, 0])
start_logits_masked = start_logits * (1 - p_mask) - 1e30 * p_mask
start_log_probs = log_softmax(start_logits_masked)
# logit of the end position
if is_training:
start_positions = features['start_positions']
start_index = fluid.layers.one_hot(start_positions, depth=args.max_seq_length)
# lbh,bl->bh
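# The matmul below implements this einsum as a gather: transpose the
# sequence output to [batch, hidden, len], one-hot encode the gold start
# position to [batch, len, 1], and matmul to pick out the hidden state
# at the start position for every example in the batch.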
trans_out = fluid.layers.transpose(output, perm=[1, 2, 0])
start_index = fluid.layers.unsqueeze(start_index, axes=[2])
start_features = fluid.layers.matmul(x=trans_out, y=start_index)
start_features = fluid.layers.unsqueeze(start_features, axes=[0])
start_features = fluid.layers.squeeze(start_features, axes=[3])
start_features = fluid.layers.expand(start_features, [seq_len, 1, 1])
end_logits = fluid.layers.fc(
input=fluid.layers.concat([output, start_features], axis=-1),
num_flatten_dims=2,
size=xlnet_config['d_model'],
param_attr=fluid.ParamAttr(name="end_logits_fc_0_weight",initializer=initializer),
bias_attr="end_logits_fc_0_bias",
act='tanh')
end_logits = fluid.layers.layer_norm(end_logits,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name='end_logits_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name='end_logits_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)),
begin_norm_axis=len(end_logits.shape)-1)
end_logits = fluid.layers.fc(
input=end_logits,
num_flatten_dims=2,
size=1,
param_attr=fluid.ParamAttr(name='end_logits_fc_1_weight', initializer=initializer),
bias_attr='end_logits_fc_1_bias')
end_logits = fluid.layers.transpose(fluid.layers.squeeze(end_logits, [-1]), [1, 0])
end_logits_masked = end_logits * (1 - p_mask) - 1e30 * p_mask
end_log_probs = log_softmax(end_logits_masked)
else:
start_top_log_probs, start_top_index = fluid.layers.topk(start_log_probs, k=args.start_n_top)
start_top_index = fluid.layers.unsqueeze(start_top_index, [-1])
start_index = fluid.layers.one_hot(start_top_index, seq_len)
# lbh,bkl->bkh
trans_out = fluid.layers.transpose(output, perm=[1, 2, 0])
trans_start_index = fluid.layers.transpose(start_index, [0, 2, 1])
start_features = fluid.layers.matmul(x=trans_out, y=trans_start_index)
start_features = fluid.layers.transpose(start_features, [0, 2, 1])
end_input = fluid.layers.expand(fluid.layers.unsqueeze(output, [2]), [1, 1, args.start_n_top, 1])
start_features = fluid.layers.expand(fluid.layers.unsqueeze(start_features, [0]), [seq_len, 1, 1, 1])
end_input = fluid.layers.concat([end_input, start_features], axis=-1)
end_logits = fluid.layers.fc(end_input, size=xlnet_config['d_model'],
num_flatten_dims=3,
param_attr=fluid.ParamAttr(name="end_logits_fc_0_weight", initializer=initializer),
bias_attr="end_logits_fc_0_bias",
act='tanh')
end_logits = fluid.layers.layer_norm(end_logits,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name='end_logits_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name='end_logits_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)),
begin_norm_axis=len(end_logits.shape)-1)
end_logits = fluid.layers.fc(
input=end_logits,
num_flatten_dims=3,
size=1,
param_attr=fluid.ParamAttr(name='end_logits_fc_1_weight', initializer=initializer),
bias_attr='end_logits_fc_1_bias')
end_logits = fluid.layers.reshape(end_logits, [seq_len, -1, args.start_n_top])
end_logits = fluid.layers.transpose(end_logits, [1, 2, 0])
p_mask = fluid.layers.stack([p_mask]*args.start_n_top, axis=1)
end_logits_masked = end_logits * (1 - p_mask) - 1e30 * p_mask
end_log_probs = log_softmax(end_logits_masked)
end_top_log_probs, end_top_index = fluid.layers.topk(end_log_probs, k=args.end_n_top)
end_top_log_probs = fluid.layers.reshape(
end_top_log_probs,
[-1, args.start_n_top * args.end_n_top])
end_top_index = fluid.layers.reshape(
end_top_index,
[-1, args.start_n_top * args.end_n_top])
if is_training:
return_dict["start_log_probs"] = start_log_probs
return_dict["end_log_probs"] = end_log_probs
else:
return_dict["start_top_log_probs"] = start_top_log_probs
return_dict["start_top_index"] = start_top_index
return_dict["end_top_log_probs"] = end_top_log_probs
return_dict["end_top_index"] = end_top_index
cls_index = fluid.layers.one_hot(cls_index, seq_len)
cls_index = fluid.layers.unsqueeze(cls_index, axes=[2])
cls_feature = fluid.layers.matmul(x=trans_out, y=cls_index)
start_p = fluid.layers.softmax(start_logits_masked)
start_p = fluid.layers.unsqueeze(start_p, axes=[2])
start_feature = fluid.layers.matmul(x=trans_out, y=start_p)
ans_feature = fluid.layers.concat([start_feature, cls_feature], axis=1)
ans_feature = fluid.layers.fc(
input=ans_feature,
size=xlnet_config['d_model'],
act='tanh',
param_attr=fluid.ParamAttr(initializer=initializer, name="answer_class_fc_0_weight"),
bias_attr="answer_class_fc_0_bias")
ans_feature = fluid.layers.dropout(ans_feature, args.dropout)
cls_logits = fluid.layers.fc(
ans_feature,
size=1,
param_attr=fluid.ParamAttr(name='answer_class_fc_1_weight', initializer=initializer),
bias_attr=False)
cls_logits = fluid.layers.squeeze(cls_logits, axes=[-1])
return_dict["cls_logits"] = cls_logits
return return_dict
def create_model(xlnet_config, is_training=False):
if is_training:
input_fields = {
'names': ['input_ids', 'segment_ids', 'input_mask', 'cls_index', 'p_mask',
'start_positions', 'end_positions', 'is_impossible'],
'shapes': [[None, args.max_seq_length, 1], [None, args.max_seq_length],
[None, args.max_seq_length], [None, 1],
[None, args.max_seq_length], [None, 1], [None, 1], [None, 1]],
'dtypes': [
'int64', 'int64', 'float32', 'int64',
'float32', 'int64', 'int64', 'float32'],
'lod_levels': [0, 0, 0, 0, 0, 0, 0, 0]
}
else:
input_fields = {
'names': ['input_ids', 'segment_ids', 'input_mask', 'cls_index', 'p_mask', 'unique_ids'],
'shapes': [[None, args.max_seq_length, 1], [None, args.max_seq_length],
[None, args.max_seq_length], [None, 1], [None, args.max_seq_length], [None, 1]],
'dtypes': [
'int64', 'int64', 'float32', 'int64', 'float32', 'int64'],
'lod_levels': [0, 0, 0, 0, 0, 0],
}
inputs = [fluid.layers.data(name=input_fields['names'][i],
shape=input_fields['shapes'][i],
dtype=input_fields['dtypes'][i],
lod_level=input_fields['lod_levels'][i]) for i in range(len(input_fields['names']))]
data_loader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=50, iterable=False)
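# iterable=False means the loader feeds data inside the program itself:
# callers attach a generator via set_batch_generator(), call start()
# before running, and catch fluid.core.EOFException to detect the end of
# an epoch (see the train/predict loops below).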
if is_training:
(input_ids, segment_ids, input_mask, cls_index, p_mask, start_positions,
end_positions, is_impossible) = inputs
else:
(input_ids, segment_ids, input_mask, cls_index, p_mask, unique_ids) = inputs
features = {'input_ids': input_ids, 'segment_ids': segment_ids, 'input_mask': input_mask, 'cls_index': cls_index, 'p_mask':p_mask}
if is_training:
features['start_positions'] = start_positions
features['end_positions'] = end_positions
features['is_impossible'] = is_impossible
else:
features['unique_ids'] = unique_ids
outputs = get_qa_outputs(xlnet_config, features, is_training=is_training)
if not is_training:
predictions = {
"unique_ids": features["unique_ids"],
"start_top_index": outputs["start_top_index"],
"start_top_log_probs": outputs["start_top_log_probs"],
"end_top_index": outputs["end_top_index"],
"end_top_log_probs": outputs["end_top_log_probs"],
"cls_logits": outputs["cls_logits"]
}
return data_loader, predictions
seq_len = input_ids.shape[1]
def compute_loss(log_probs, positions):
one_hot_positions = fluid.layers.one_hot(positions, depth=seq_len)
loss = -1 * fluid.layers.reduce_sum(one_hot_positions * log_probs, dim=-1)
loss = fluid.layers.reduce_mean(loss)
return loss
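# Toy example of the span loss above: with seq_len = 4, a gold position
# of 1, and log_probs = [-2.3, -0.1, -3.0, -4.6], the one-hot mask
# selects -0.1, so the per-example loss is 0.1 -- the negative
# log-likelihood of the gold position.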
start_loss = compute_loss(
outputs["start_log_probs"], features["start_positions"])
end_loss = compute_loss(
outputs["end_log_probs"], features["end_positions"])
total_loss = (start_loss + end_loss) * 0.5
cls_logits = outputs["cls_logits"]
is_impossible = fluid.layers.reshape(features["is_impossible"], [-1])
regression_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
label=is_impossible, x=cls_logits)
regression_loss = fluid.layers.reduce_mean(regression_loss)
total_loss += regression_loss * 0.5
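# The final objective is the mean of the start/end span losses plus a
# 0.5-weighted sigmoid cross-entropy on is_impossible, training the
# answerability classifier jointly with span extraction.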
return data_loader, total_loss
RawResult = collections.namedtuple("RawResult",
["unique_id", "start_top_log_probs", "start_top_index",
"end_top_log_probs", "end_top_index", "cls_logits"])
def predict(test_exe, test_program, test_data_loader, fetch_list, processor, name):
if not os.path.exists(args.checkpoints):
os.makedirs(args.checkpoints)
output_prediction_file = os.path.join(args.checkpoints, name + "predictions.json")
output_nbest_file = os.path.join(args.checkpoints, name + "nbest_predictions.json")
output_null_log_odds_file = os.path.join(args.checkpoints, name + "null_odds.json")
test_data_loader.start()
all_results = []
time_begin = time.time()
while True:
try:
outputs = test_exe.run(
fetch_list=fetch_list,
program=test_program)
(np_unique_ids, np_start_top_log_probs, np_start_top_index,
np_end_top_log_probs, np_end_top_index, np_cls_logits) = outputs[0:6]
for idx in range(np_unique_ids.shape[0]):
if len(all_results) % 1000 == 0:
print("Processing example: %d" % len(all_results))
unique_id = int(np_unique_ids[idx])
start_top_log_probs = [float(x) for x in np_start_top_log_probs[idx].flat]
start_top_index = [int(x) for x in np_start_top_index[idx].flat]
end_top_log_probs = [float(x) for x in np_end_top_log_probs[idx].flat]
end_top_index = [int(x) for x in np_end_top_index[idx].flat]
cls_logits = float(np_cls_logits[idx].flat[0])
all_results.append(
RawResult(
unique_id=unique_id,
start_top_log_probs=start_top_log_probs,
start_top_index=start_top_index,
end_top_log_probs=end_top_log_probs,
end_top_index=end_top_index,
cls_logits=cls_logits))
except fluid.core.EOFException:
test_data_loader.reset()
break
time_end = time.time()
with io.open(args.predict_file, "r", encoding="utf8") as f:
orig_data = json.load(f)["data"]
features = processor.get_features(
processor.predict_examples, is_training=False)
ret = write_predictions(processor.predict_examples, features, all_results,
args.n_best_size, args.max_answer_length,
output_prediction_file,
output_nbest_file, output_null_log_odds_file,
orig_data, args)
# Log current result
print("=" * 80)
log_str = "Result | "
for key, val in ret.items():
log_str += "{} {} | ".format(key, val)
print(log_str)
print("=" * 80)
def train(args):
if not (args.do_train or args.do_predict):
raise ValueError("For args `do_train` and `do_predict`, at "
"least one of them must be True.")
xlnet_config = XLNetConfig(args.model_config_path)
xlnet_config.print_config()
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
processor = DataProcessor(
spiece_model_file=args.spiece_model_file,
uncased=args.uncased,
max_seq_length=args.max_seq_length,
doc_stride=args.doc_stride,
max_query_length=args.max_query_length)
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = processor.data_generator(
data_path=args.train_file,
batch_size=args.train_batch_size,
phase='train',
shuffle=True,
dev_count=dev_count,
epoch=args.epoch)
num_train_examples = processor.get_num_examples(phase='train')
print("Device count: %d" % dev_count)
print("Max num of epoches: %d" % args.epoch)
print("Num of train examples: %d" % num_train_examples)
print("Num of train steps: %d" % args.train_steps)
print("Num of warmup steps: %d" % args.warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_data_loader, loss = create_model(
xlnet_config=xlnet_config,
is_training=True)
scheduled_lr = optimization(
loss=loss,
warmup_steps=args.warmup_steps,
num_train_steps=args.train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
lr_layer_decay_rate=args.lr_layer_decay_rate,
scheduler=args.lr_scheduler)
if args.do_predict:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_data_loader, predictions = create_model(
xlnet_config=xlnet_config,
is_training=False)
test_prog = test_prog.clone(for_test=True)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog)
elif args.do_predict:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing prediction!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.use_experimental_executor = args.use_fast_executor
exec_strategy.num_threads = dev_count
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
build_strategy = fluid.BuildStrategy()
# These two flags must be set in this model for correctness
build_strategy.fuse_all_optimizer_ops = True
build_strategy.enable_inplace = False
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=loss.name,
exec_strategy=exec_strategy,
build_strategy=build_strategy,
main_program=train_program)
train_data_loader.set_batch_generator(train_data_generator, place)
train_data_loader.start()
steps = 0
total_cost = []
time_begin = time.time()
print("Begin to train model ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
while steps < args.train_steps:
try:
steps += 1
if steps % args.skip_steps == 0:
fetch_list = [loss.name, scheduled_lr.name]
else:
fetch_list = []
outputs = train_exe.run(fetch_list=fetch_list)
if steps % args.skip_steps == 0:
np_loss, np_lr = outputs
total_cost.extend(np_loss)
if args.verbose:
verbose = "train data_loader queue size: %d, " % train_data_loader.queue.size(
)
verbose += "learning rate: %f " % np_lr[0]
print(verbose)
time_end = time.time()
used_time = time_end - time_begin
current_example, epoch = processor.get_train_progress()
print("epoch: %d, progress: %d/%d, step: %d, loss: %f, "
"speed: %f steps/s" %
(epoch, current_example, num_train_examples, steps,
np.mean(total_cost),
args.skip_steps / used_time))
total_cost = []
time_begin = time.time()
if steps % args.save_steps == 0 or steps == args.train_steps:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps) + "_final")
fluid.io.save_persistables(exe, save_path, train_program)
train_data_loader.reset()
break
print("Finish model training ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
if args.do_predict:
print("Begin to do prediction ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
test_data_loader.set_batch_generator(
processor.data_generator(
data_path=args.predict_file,
batch_size=args.predict_batch_size,
phase='predict',
shuffle=False,
dev_count=1,
epoch=1), place)
predict(exe, test_prog, test_data_loader, [predictions['unique_ids'].name, predictions['start_top_log_probs'].name,
predictions['start_top_index'].name, predictions['end_top_log_probs'].name, predictions['end_top_index'].name,
predictions['cls_logits'].name
], processor, name='')
print("Finish prediction ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
if __name__ == '__main__':
print_arguments(args)
train(args)
"""this file is adapted from https://github.com/zihangdai/xlnet"""
import io
import argparse
import collections
import json
import numpy as np
import os
import re
import string
import sys
OPTS = None
def parse_args():
parser = argparse.ArgumentParser(
'Official evaluation script for SQuAD version 2.0.')
parser.add_argument(
'data_file', metavar='data.json', help='Input data JSON file.')
parser.add_argument(
'pred_file', metavar='pred.json', help='Model predictions.')
parser.add_argument(
'--out-file',
'-o',
metavar='eval.json',
help='Write accuracy metrics to file (default is stdout).')
parser.add_argument(
'--na-prob-file',
'-n',
metavar='na_prob.json',
help='Model estimates of probability of no answer.')
parser.add_argument(
'--na-prob-thresh',
'-t',
type=float,
default=1.0,
help='Predict "" if no-answer probability exceeds this (default = 1.0).')
parser.add_argument(
'--out-image-dir',
'-p',
metavar='out_images',
default=None,
help='Save precision-recall curves to directory.')
parser.add_argument('--verbose', '-v', action='store_true')
if len(sys.argv) == 1:
parser.print_help()
sys.exit(1)
return parser.parse_args()
def make_qid_to_has_ans(dataset):
qid_to_has_ans = {}
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
qid_to_has_ans[qa['id']] = bool(qa['answers'])
return qid_to_has_ans
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
return re.sub(regex, ' ', text)
def white_space_fix(text):
return ' '.join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
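# e.g. normalize_answer("The  Norman Conquest!") -> "norman conquest"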
def get_tokens(s):
if not s: return []
return normalize_answer(s).split()
def compute_exact(a_gold, a_pred):
return int(normalize_answer(a_gold) == normalize_answer(a_pred))
def compute_f1(a_gold, a_pred):
gold_toks = get_tokens(a_gold)
pred_toks = get_tokens(a_pred)
common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
num_same = sum(common.values())
if len(gold_toks) == 0 or len(pred_toks) == 0:
# If either is no-answer, then F1 is 1 if they agree, 0 otherwise
return int(gold_toks == pred_toks)
if num_same == 0:
return 0
precision = 1.0 * num_same / len(pred_toks)
recall = 1.0 * num_same / len(gold_toks)
f1 = (2 * precision * recall) / (precision + recall)
return f1
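# e.g. compute_f1("new york city", "york city"): 2 shared tokens give
# precision 2/2, recall 2/3, and F1 = 2 * (1.0 * 2/3) / (1.0 + 2/3) = 0.8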
def get_raw_scores(dataset, preds):
exact_scores = {}
f1_scores = {}
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
qid = qa['id']
gold_answers = [
a['text'] for a in qa['answers']
if normalize_answer(a['text'])
]
if not gold_answers:
# For unanswerable questions, the only correct answer is the empty string
gold_answers = ['']
if qid not in preds:
print('Missing prediction for %s' % qid)
continue
a_pred = preds[qid]
# Take max over all gold answers
exact_scores[qid] = max(
compute_exact(a, a_pred) for a in gold_answers)
f1_scores[qid] = max(
compute_f1(a, a_pred) for a in gold_answers)
return exact_scores, f1_scores
def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
new_scores = {}
for qid, s in scores.items():
pred_na = na_probs[qid] > na_prob_thresh
if pred_na:
new_scores[qid] = float(not qid_to_has_ans[qid])
else:
new_scores[qid] = s
return new_scores
def make_eval_dict(exact_scores, f1_scores, qid_list=None):
if not qid_list:
total = len(exact_scores)
return collections.OrderedDict([
('exact', 100.0 * sum(exact_scores.values()) / total),
('f1', 100.0 * sum(f1_scores.values()) / total),
('total', total),
])
else:
total = len(qid_list)
return collections.OrderedDict([
('exact', 100.0 * sum(exact_scores[k] for k in qid_list) / total),
('f1', 100.0 * sum(f1_scores[k] for k in qid_list) / total),
('total', total),
])
def merge_eval(main_eval, new_eval, prefix):
for k in new_eval:
main_eval['%s_%s' % (prefix, k)] = new_eval[k]
def plot_pr_curve(precisions, recalls, out_image, title):
plt.step(recalls, precisions, color='b', alpha=0.2, where='post')
plt.fill_between(recalls, precisions, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0, 1.05])
plt.ylim([0.0, 1.05])
plt.title(title)
plt.savefig(out_image)
plt.clf()
def make_precision_recall_eval(scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=None,
title=None):
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
true_pos = 0.0
cur_p = 1.0
cur_r = 0.0
precisions = [1.0]
recalls = [0.0]
avg_prec = 0.0
for i, qid in enumerate(qid_list):
if qid_to_has_ans[qid]:
true_pos += scores[qid]
cur_p = true_pos / float(i + 1)
cur_r = true_pos / float(num_true_pos)
if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i + 1]]:
# i.e., if we can put a threshold after this point
avg_prec += cur_p * (cur_r - recalls[-1])
precisions.append(cur_p)
recalls.append(cur_r)
if out_image:
plot_pr_curve(precisions, recalls, out_image, title)
return {'ap': 100.0 * avg_prec}
def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs,
qid_to_has_ans, out_image_dir):
if out_image_dir and not os.path.exists(out_image_dir):
os.makedirs(out_image_dir)
num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
if num_true_pos == 0:
return
pr_exact = make_precision_recall_eval(
exact_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_exact.png'),
title='Precision-Recall curve for Exact Match score')
pr_f1 = make_precision_recall_eval(
f1_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_f1.png'),
title='Precision-Recall curve for F1 score')
oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
pr_oracle = make_precision_recall_eval(
oracle_scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_oracle.png'),
title='Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)')
merge_eval(main_eval, pr_exact, 'pr_exact')
merge_eval(main_eval, pr_f1, 'pr_f1')
merge_eval(main_eval, pr_oracle, 'pr_oracle')
def histogram_na_prob(na_probs, qid_list, image_dir, name):
if not qid_list:
return
x = [na_probs[k] for k in qid_list]
weights = np.ones_like(x) / float(len(x))
plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
plt.xlabel('Model probability of no-answer')
plt.ylabel('Proportion of dataset')
plt.title('Histogram of no-answer probability: %s' % name)
plt.savefig(os.path.join(image_dir, 'na_prob_hist_%s.png' % name))
plt.clf()
def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores: continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
return 100.0 * best_score / len(scores), best_thresh
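# The sweep above starts with the threshold below every na_prob (predict
# no-answer everywhere, scoring one point per truly unanswerable question),
# then moves it past one question at a time in ascending na_prob order and
# keeps the threshold value that maximizes the accumulated score.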
def find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores: continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
has_ans_score, has_ans_cnt = 0, 0
for qid in qid_list:
if not qid_to_has_ans[qid]: continue
has_ans_cnt += 1
if qid not in scores: continue
has_ans_score += scores[qid]
return (100.0 * best_score / len(scores), best_thresh,
1.0 * has_ans_score / has_ans_cnt)
def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans):
best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs,
qid_to_has_ans)
best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs,
qid_to_has_ans)
main_eval['best_exact'] = best_exact
main_eval['best_exact_thresh'] = exact_thresh
main_eval['best_f1'] = best_f1
main_eval['best_f1_thresh'] = f1_thresh
def find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans):
best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(
preds, exact_raw, na_probs, qid_to_has_ans)
best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(
preds, f1_raw, na_probs, qid_to_has_ans)
main_eval['best_exact'] = best_exact
main_eval['best_exact_thresh'] = exact_thresh
main_eval['best_f1'] = best_f1
main_eval['best_f1_thresh'] = f1_thresh
main_eval['has_ans_exact'] = has_ans_exact
main_eval['has_ans_f1'] = has_ans_f1
def main():
with io.open(OPTS.data_file, encoding='utf8') as f:
dataset_json = json.load(f)
dataset = dataset_json['data']
with io.open(OPTS.pred_file, encoding='utf8') as f:
preds = json.load(f)
new_orig_data = []
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
if qa['id'] in preds:
new_para = {'qas': [qa]}
new_article = {'paragraphs': [new_para]}
new_orig_data.append(new_article)
dataset = new_orig_data
if OPTS.na_prob_file:
with io.open(OPTS.na_prob_file, encoding='utf8') as f:
na_probs = json.load(f)
else:
na_probs = {k: 0.0 for k in preds}
qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
exact_raw, f1_raw = get_raw_scores(dataset, preds)
exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans,
OPTS.na_prob_thresh)
f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans,
OPTS.na_prob_thresh)
out_eval = make_eval_dict(exact_thresh, f1_thresh)
if has_ans_qids:
has_ans_eval = make_eval_dict(
exact_thresh, f1_thresh, qid_list=has_ans_qids)
merge_eval(out_eval, has_ans_eval, 'HasAns')
if no_ans_qids:
no_ans_eval = make_eval_dict(
exact_thresh, f1_thresh, qid_list=no_ans_qids)
merge_eval(out_eval, no_ans_eval, 'NoAns')
if OPTS.na_prob_file:
find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans)
if OPTS.na_prob_file and OPTS.out_image_dir:
run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs,
qid_to_has_ans, OPTS.out_image_dir)
histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, 'hasAns')
histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, 'noAns')
if OPTS.out_file:
with io.open(OPTS.out_file, 'w', encoding='utf8') as f:
json.dump(out_eval, f)
else:
print(json.dumps(out_eval, indent=2))
if __name__ == '__main__':
OPTS = parse_args()
if OPTS.out_image_dir:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import sys
import argparse
import paddle.fluid as fluid
def str2bool(v):
# argparse cannot parse strings like "True"/"False" into Python booleans
# directly, so map the common truthy spellings ourselves
return v.lower() in ("true", "t", "1")
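# e.g. str2bool("True") -> True, str2bool("t") -> True, str2bool("0") -> False;
# any string outside ("true", "t", "1") maps to False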
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def check_cuda(use_cuda, err = \
"\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n"
):
try:
if use_cuda and not fluid.is_compiled_with_cuda():
print(err)
sys.exit(1)
except Exception:
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def cast_fp32_to_fp16(exe, main_program):
print("Cast parameters to float16 data format.")
for param in main_program.global_block().all_parameters():
if not param.name.endswith(".master"):
param_t = fluid.global_scope().find_var(param.name).get_tensor()
data = np.array(param_t)
if param.name.find("layer_norm") == -1 and param.name.find(
"embedding") == -1:
print("shkip params", param.name)
param_t.set(np.float16(data).view(np.uint16), exe.place)
master_param_var = fluid.global_scope().find_var(param.name +
".master")
if master_param_var is not None:
master_param_var.get_tensor().set(data, exe.place)
def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False):
assert os.path.exists(
init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
print("Load checkpoint from {}".format(init_checkpoint_path))
def existed_persistables(var):
if not fluid.io.is_persistable(var):
return False
if os.path.exists(os.path.join(init_checkpoint_path, var.name)):
print("INIT %s" % var.name)
return True
else:
print("SKIP %s" % var.name)
return False
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persistables)
if use_fp16:
cast_fp32_to_fp16(exe, main_program)
def init_pretraining_params(exe,
pretraining_params_path,
main_program,
use_fp16=False):
assert os.path.exists(pretraining_params_path
), "[%s] can't be found." % pretraining_params_path
print("Load pretraining parameters from {}.".format(
pretraining_params_path))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
if os.path.exists(os.path.join(pretraining_params_path, var.name)):
print("INIT %s" % var.name)
return True
else:
print("SKIP %s" % var.name)
return False
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
if use_fp16:
cast_fp32_to_fp16(exe, main_program)