Unverified commit ad6a16c5 authored by Yibing Liu, committed by GitHub

Add lang representation model XLNet (#3830)

Parent 6da62e65
[中文版](README_cn.md)
This project is an implementation of [XLNet](https://github.com/zihangdai/xlnet) on Paddle Fluid. It currently supports fine-tuning on all downstream tasks, including natural language inference, question answering (SQuAD), etc.
There are many differences between XLNet and [BERT](../BERT). XLNet adopts a novel model, [Transformer-XL](https://arxiv.org/abs/1901.02860), as the backbone of language representation and uses permutation language modeling as its pre-training objective. XLNet also involves much more data in the pre-training stage. As a result, XLNet achieves SOTA results on several NLP tasks.
For more details, please refer to the research paper
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
## Installation
This project requires Paddle Fluid 1.6.0 or later; please follow the [installation guide](https://www.paddlepaddle.org.cn/start) to install it.
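To confirm that the installed version meets this requirement, here is a minimal check (the optional `install_check` self-test ships with recent 1.x releases):
```python
import paddle
import paddle.fluid as fluid

# The version string should be 1.6.0 or later for this project.
print(paddle.__version__)

# Optional self-test that Fluid can run on this machine.
fluid.install_check.run_check()
```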
## Pre-trained models
Two pre-trained models converted from the official release are available
| Model | Layers | Hidden size | Heads |
| :------| :------: | :------: |:------: |
| [XLNet-Large, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz)| 24 | 1024 | 16 |
| [XLNet-Base, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz)| 12 | 768 | 12 |
Each compressed package contains one subdirectory and two files:
- `params`: a directory containing all converted parameters, one file per parameter.
- `spiece.model`: a [Sentence Piece](https://github.com/google/sentencepiece) model used for (de)tokenization.
- `xlnet_config.json`: a config file which specifies the hyperparameters of the model.
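To sanity-check an extracted package, the tokenizer and config can be loaded directly; a minimal sketch, assuming the `sentencepiece` Python package is installed and the XLNet-Large archive has been extracted into the working directory:
```python
import json
import sentencepiece as spm

model_dir = "xlnet_cased_L-24_H-1024_A-16"

# Load the SentencePiece model used for (de)tokenization.
sp = spm.SentencePieceProcessor()
sp.Load("%s/spiece.model" % model_dir)
print(sp.EncodeAsPieces("XLNet on Paddle Fluid"))

# Inspect the model hyperparameters.
with open("%s/xlnet_config.json" % model_dir) as f:
    config = json.load(f)
print(config["n_layer"], config["d_model"], config["n_head"])
```
The `n_layer`/`d_model`/`n_head` values should match the table above (24/1024/16 for XLNet-Large).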
## Fine-tuning with XLNet
We provide scripts for fine-tuning XLNet on NLP tasks with multiple GPUs. Their correctness has been verified: all experiments on V100 GPUs achieve the same performance as officially reported (mostly obtained on TPUs). In the following, we assume that the two pre-trained models have been downloaded and extracted.
### Text regression/classification
The fine-tuning of regression/classification tasks can be performed via the script `run_classifier.py`, which contains examples for standard single-document classification, single-document regression, and document-pair classification. The two examples below, one for regression and one for classification, can be run as follows.
#### (1) STS-B: sentence pair relevance regression
- Download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to some directory `$GLUE_DIR`
- **Note**: You may encounter the error `ImportError: No module named request` when running the script under Python 2.x, because the Python 2 `urllib` module has no `request` submodule. This can be resolved by replacing all occurrences of `urllib.request` with `urllib`, or by switching to a Python 3.x environment.
- Perform fine-tuning on 4 V100 GPUs with XLNet-Large
```shell
export GLUE_DIR=glue_data
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=sts-b \
--data_dir=${GLUE_DIR}/STS-B \
--checkpoints=exp/sts-b \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--init_pretraining_params=${LARGE_DIR}/params \
--max_seq_length=128 \
--train_batch_size=8 \
--learning_rate=5e-5 \
--predict_dir=exp/sts-b-pred \
--skip_steps=10 \
--train_steps=1200 \
--warmup_steps=120 \
--save_steps=600 \
--is_regression=True
```
This configuration does not require much GPU memory; four V100 (or comparable) GPUs with 16GB each are enough.
When fine-tuning finishes, the evaluation on the dev set reports the average loss and the Pearson correlation coefficient:
```
[dev evaluation] ave loss: 0.383523, eval_pearsonr: 0.916912, elapsed time: 21.804057 s
```
The expected `eval_pearsonr` is `91.3+`, as quoted from the official repository, and this experiment reproduces that performance.
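For reference, `eval_pearsonr` is the standard Pearson correlation between the predicted and gold similarity scores on the dev set; a minimal illustration with hypothetical numbers:
```python
import numpy as np

# Hypothetical predicted and gold STS-B similarity scores.
preds = np.array([4.2, 1.1, 3.8, 0.5, 2.9])
labels = np.array([4.0, 1.5, 3.5, 0.0, 3.2])

# Pearson correlation coefficient, as reported by eval_pearsonr.
pearsonr = np.corrcoef(preds, labels)[0, 1]
print("pearson r: %.6f" % pearsonr)
```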
#### (2) IMDB: movie review sentiment classification
- Download and unpack the IMDB dataset by running
```shell
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
- Perform fine-tuning with XLNet-Large on 8 V100 GPUs (32GB) by running
```shell
export IMDB_DIR=aclImdb
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=imdb \
--checkpoints=exp/imdb \
--init_pretraining_params=${LARGE_DIR}/params \
--data_dir=${IMDB_DIR} \
--predict_dir=predict_imdb_1028 \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--max_seq_length=512 \
--train_batch_size=4 \
--eval_batch_size=8 \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
--save_steps=500
```
The expected accuracy is `96.2+`; here is an example of the evaluation result
```
[dev evaluation] ave loss: 0.220047, eval_accuracy: 0.963480, elapsed time: 2799.974465 s
```
Fine-tuning for other NLP regression/classification tasks can be carried out in a similar way.
### SQuAD 2.0
- Download the SQuAD 2.0 data and put it in the `data/squad2.0` directory
```shell
mkdir -p data/squad2.0
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
- Perform fine-tuning by running the script `run_squad.py` on 6 V100 GPUs (32GB)
```shell
SQUAD_DIR=data/squad2.0
INIT_CKPT_DIR=xlnet_cased_L-24_H-1024_A-16
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
python run_squad.py \
--model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
--spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
--init_checkpoint=${INIT_CKPT_DIR}/params \
--train_file=${SQUAD_DIR}/train-v2.0.json \
--predict_file=${SQUAD_DIR}/dev-v2.0.json \
--uncased=False \
--checkpoints=squad_2.0_0828 \
--max_seq_length=512 \
--do_train=True \
--do_predict=True \
--skip_steps=100 \
--epoch=200 \
--dropout=0.1 \
--dropatt=0.1 \
--train_batch_size=4 \
--predict_batch_size=3 \
--learning_rate=2e-5 \
--save_steps=1000 \
--train_steps=12000 \
--warmup_steps=1000 \
--verbose=True
```
The final evaluation result after fine-tuning should look like
```
================================================================================
Result | best_f1 88.0893932758 | best_exact_thresh -2.07637166977 | best_exact 85.5049271456 | has_ans_f1 0.940979062625 | has_ans_exact 0.880566801619 | best_f1_thresh -2.07337403297 |
================================================================================
```
### Use your own data
Please refer to the data-format guidelines of GLUE/SQuAD if you want to use your own data for fine-tuning; a sketch of a custom processor is given below.
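For a GLUE-style TSV dataset, one possible starting point is to subclass the `GLUEProcessor` defined in this project's data reader (a hedged sketch: the class name, column indices, and import path below are illustrative, not this project's exact API, and should be adapted to your task and to the repository's actual module layout):
```python
from reader import GLUEProcessor  # illustrative import path; adjust to the actual module


class MyTaskProcessor(GLUEProcessor):
    """Hypothetical processor for a single-sentence classification task."""

    def __init__(self, args):
        super(MyTaskProcessor, self).__init__(args)
        self.train_file = "train.tsv"  # files expected under --data_dir
        self.dev_file = "dev.tsv"
        self.test_file = "test.tsv"
        self.label_column = 0          # column index of the label
        self.text_a_column = 1         # column index of the sentence
        self.text_b_column = None      # set this for sentence-pair tasks

    def get_labels(self):
        return ["0", "1"]              # the label set of your task
```
The new processor then needs to be wired up to a `--task_name` in `run_classifier.py`; the exact registration mechanism is not shown in the part of the diff reproduced here.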
## Acknowledgement
We thank the authors of XLNet for their distinguished work!
[ENGLISH](README.md)
该项目是 [XLNet](https://github.com/zihangdai/xlnet) 基于 Paddle Fluid 的实现,目前支持所有下游任务的 fine-tuning,包括自然语言推断任务和阅读理解任务 (SQuAD 2.0) 等。
XLNet 与 [BERT](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleLARK/BERT) 有着许多的不同,XLNet 利用一个全新的模型 [Transformer-XL](https://arxiv.org/abs/1901.02860) 作为语义表示的骨架, 将置换语言模型的建模作为优化目标,同时在预训练阶段也利用了更多的数据。 最终,XLNet 在多个 NLP 任务上达到了 SOTA 的效果。
更多的细节,请参考学术论文
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
## 安装
该项目要求 Paddle Fluid 1.6.0 及以上版本,请参考 [安装指南](https://www.paddlepaddle.org.cn/start) 进行安装。
## Pre-trained models
这里提供了从官方开源模型转换而来的两个预训练模型供下载
| Model | Layers | Hidden size | Heads |
| :------| :------: | :------: |:------: |
| [XLNet-Large, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz)| 24 | 1024 | 16 |
| [XLNet-Base, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz)| 12 | 768 | 12 |
每个压缩包都包含了一个子文件夹和两个文件:
- `params`: 由转换后的参数构成的文件夹,每个参数对应一个文件
- `spiece.model`: [Sentence Piece](https://github.com/google/sentencepiece) 模型,用于文本的(反)tokenization
- `xlnet_config.json`: 配置文件,指定了模型的超参数
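为快速检查解压后的文件,可以直接加载分词模型和配置文件(以下为一个最小示例,假定已安装 `sentencepiece`,且 XLNet-Large 压缩包已解压到当前目录):
```python
import json
import sentencepiece as spm

model_dir = "xlnet_cased_L-24_H-1024_A-16"

# 加载用于 (反) tokenization 的 SentencePiece 模型
sp = spm.SentencePieceProcessor()
sp.Load("%s/spiece.model" % model_dir)
print(sp.EncodeAsPieces("XLNet on Paddle Fluid"))

# 查看模型超参数
with open("%s/xlnet_config.json" % model_dir) as f:
    config = json.load(f)
print(config["n_layer"], config["d_model"], config["n_head"])
```
其中 `n_layer`/`d_model`/`n_head` 应与上表一致(XLNet-Large 为 24/1024/16)。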
## 利用 XLNet 进行 Fine-tuning
我们提供了利用 XLNet 在多卡 GPU 上为自然语言处理任务进行 fine-tuning 的脚本。这些脚本的正确性已得到验证:基于 V100 GPU 的实验均能达到官方报告的效果(官方结果主要基于 TPU)。在下面的说明中,我们假定以上两个预训练模型已下载并解压好。
### 文本回归/分类任务
文本回归和分类任务的 fine-tuning 可以通过运行脚本 `run_classifier.py` 来进行,其中包含了单文本分类、单文本回归、文本对分类等示例。下面的两个例子,一个用于演示回归任务,另一个用于演示分类任务,可以按以下方式进行 fine-tuning。
#### (1) STS-B: 句子对相关性回归
- 通过运行 [脚本](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) 下载 [GLUE 数据集](https://gluebenchmark.com/tasks), 并解压到某个文件夹 $GLUE_DIR。
- **请注意**: 在 Python 2.x 环境下运行这个脚本,可能会遇到报错 `ImportError: No module named request` , 这是因为模块 `urllib` 不包含子模块 `request`. 这个问题可以通过将脚本中的代码 `urllib.request` 全部替换为 `urllib`,或者在 Python 3.x 环境下运行予以解决。
- 使用 XLNet-Large 在 4 卡 V100 GPU 上进行 fine-tuning
```shell
export GLUE_DIR=glue_data
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=sts-b \
--data_dir=${GLUE_DIR}/STS-B \
--checkpoints=exp/sts-b \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--init_pretraining_params=${LARGE_DIR}/params \
--max_seq_length=128 \
--train_batch_size=8 \
--learning_rate=5e-5 \
--predict_dir=exp/sts-b-pred \
--skip_steps=10 \
--train_steps=1200 \
--warmup_steps=120 \
--save_steps=600 \
--is_regression=True
```
该配置不需要特别大的 GPU 显存,16GB 的 4 卡 V100 (或其它 GPU)即可运行。
在 fine-tuning 结束后,会得到在 dev 数据集上的评估结果,包括平均损失和皮尔逊相关系数
```
[dev evaluation] ave loss: 0.383523, eval_pearsonr: 0.916912, elapsed time: 21.804057 s
```
按官方实现的说法,预期的 `eval_pearsonr` 为 `91.3+`,该实验应该能复现这个结果。
#### (2) IMDB: 电影评论情感分类
- 下载和解压 IMDB 数据集
```shell
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
- 使用 XLNet-Large 在 8 卡 V100 GPU (32GB) 上进行 fine-tuning
```shell
export IMDB_DIR=aclImdb
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=imdb \
--checkpoints=exp/imdb \
--init_pretraining_params=${LARGE_DIR}/params \
--data_dir=${IMDB_DIR} \
--predict_dir=predict_imdb_1028 \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--max_seq_length=512 \
--train_batch_size=4 \
--eval_batch_size=8 \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
--save_steps=500
```
期望的准确率是 `96.2+`, 以下是评估结果的一个样例
```
[dev evaluation] ave loss: 0.220047, eval_accuracy: 0.963480, elapsed time: 2799.974465 s
```
其它 NLP 回归/分类任务的 fine-tuning 可以通过同样的方式进行。
### SQuAD 2.0
- 下载 SQuAD 2.0 数据集并将其放入 `data/squad2.0` 目录中
```shell
mkdir -p data/squad2.0
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
- 在 6 卡 V100 GPU (32GB) 上运行脚本 `run_squad.py`
```shell
SQUAD_DIR=data/squad2.0
INIT_CKPT_DIR=xlnet_cased_L-24_H-1024_A-16
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
python run_squad.py \
--model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
--spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
--init_checkpoint=${INIT_CKPT_DIR}/params \
--train_file=${SQUAD_DIR}/train-v2.0.json \
--predict_file=${SQUAD_DIR}/dev-v2.0.json \
--uncased=False \
--checkpoints=squad_2.0_0828 \
--max_seq_length=512 \
--do_train=True \
--do_predict=True \
--skip_steps=100 \
--epoch=200 \
--dropout=0.1 \
--dropatt=0.1 \
--train_batch_size=4 \
--predict_batch_size=3 \
--learning_rate=2e-5 \
--save_steps=1000 \
--train_steps=12000 \
--warmup_steps=1000 \
--verbose=True
```
运行结束后的评测结果如下所示
```
================================================================================
Result | best_f1 88.0893932758 | best_exact_thresh -2.07637166977 | best_exact 85.5049271456 | has_ans_f1 0.940979062625 | has_ans_exact 0.880566801619 | best_f1_thresh -2.07337403297 |
================================================================================
```
### 使用自定义数据
如需使用自定义数据进行 fine-tuning,请参考 GLUE/SQuAD 的数据格式说明;下面给出了一个自定义 processor 的示意。
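对于 GLUE 风格的 TSV 数据,一个可行的起点是继承本项目数据读取代码中定义的 `GLUEProcessor`(以下仅为示意:类名、列下标和 import 路径都是举例,请根据实际任务和仓库中的模块位置调整):
```python
from reader import GLUEProcessor  # 示例 import 路径,请按仓库实际位置调整


class MyTaskProcessor(GLUEProcessor):
    """示意用的单句分类任务 processor。"""

    def __init__(self, args):
        super(MyTaskProcessor, self).__init__(args)
        self.train_file = "train.tsv"  # --data_dir 下的数据文件
        self.dev_file = "dev.tsv"
        self.test_file = "test.tsv"
        self.label_column = 0          # label 所在列
        self.text_a_column = 1         # 句子所在列
        self.text_b_column = None      # 句对任务时再设置

    def get_labels(self):
        return ["0", "1"]              # 任务的 label 集合
```
新的 processor 还需要在 `run_classifier.py` 中对应一个 `--task_name`;具体的注册方式未包含在此处展示的 diff 中。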
## 致谢
我们向 XLNet 的作者们所做的杰出工作致以谢意!
"""this file is a copy of https://github.com/zihangdai/xlnet"""
import re
import numpy as np
from data_utils import SEP_ID, CLS_ID
SEG_ID_A = 0
SEG_ID_B = 1
SEG_ID_CLS = 2
SEG_ID_SEP = 3
SEG_ID_PAD = 4
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
batches could cause silent errors.
"""
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenize_fn):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[1] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
if label_list is not None:
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenize_fn(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenize_fn(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for two [SEP] & one [CLS] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for one [SEP] & one [CLS] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[:max_seq_length - 2]
tokens = []
segment_ids = []
for token in tokens_a:
tokens.append(token)
segment_ids.append(SEG_ID_A)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_A)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(SEG_ID_B)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_B)
tokens.append(CLS_ID)
segment_ids.append(SEG_ID_CLS)
input_ids = tokens
# The mask has 0 for real tokens and 1 for padding tokens. Only real
# tokens are attended to.
input_mask = [0] * len(input_ids)
# Zero-pad up to the sequence length.
if len(input_ids) < max_seq_length:
delta_len = max_seq_length - len(input_ids)
input_ids = [0] * delta_len + input_ids
input_mask = [1] * delta_len + input_mask
segment_ids = [SEG_ID_PAD] * delta_len + segment_ids
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
if label_list is not None:
label_id = label_map[example.label]
else:
label_id = example.label
if ex_index < 1:
print("*** Example ***")
print("guid: %s" % (example.guid))
print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
print("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
print("label: {} (id = {})".format(example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
"""this file is a copy of https://github.com/zihangdai/xlnet"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
special_symbols = {
"<unk>": 0,
"<s>": 1,
"</s>": 2,
"<cls>": 3,
"<sep>": 4,
"<pad>": 5,
"<mask>": 6,
"<eod>": 7,
"<eop>": 8,
}
VOCAB_SIZE = 32000
UNK_ID = special_symbols["<unk>"]
CLS_ID = special_symbols["<cls>"]
SEP_ID = special_symbols["<sep>"]
MASK_ID = special_symbols["<mask>"]
EOD_ID = special_symbols["<eod>"]
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import paddle.fluid as fluid
import modeling
from model.xlnet import XLNetModel, _get_initializer
def get_regression_loss(args, xlnet_config, features, is_training=False):
"""Loss for downstream regression tasks."""
inp = fluid.layers.transpose(features["input_ids"], [1, 0, 2])
seg_id = features["segment_ids"]
inp_mask = fluid.layers.transpose(features["input_mask"], [1, 0])
label = features["label_ids"]
xlnet_model = XLNetModel(
input_ids=inp,
seg_ids=seg_id,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
summary = xlnet_model.get_pooled_out(args.summary_type, args.use_summ_proj)
per_example_loss, logits = modeling.regression_loss(
hidden=summary,
labels=label,
initializer=_get_initializer(args),
name="model_regression_{}".format(args.task_name.lower()),
return_logits=True)
total_loss = fluid.layers.reduce_mean(per_example_loss)
return total_loss, per_example_loss, logits
def get_classification_loss(args,
xlnet_config,
features,
n_class,
is_training=True):
"""Loss for downstream classification tasks."""
inp = fluid.layers.transpose(features["input_ids"], [1, 0, 2])
seg_id = features["segment_ids"]
inp_mask = fluid.layers.transpose(features["input_mask"], [1, 0])
label = features["label_ids"]
xlnet_model = XLNetModel(
input_ids=inp,
seg_ids=seg_id,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
summary = xlnet_model.get_pooled_out(args.summary_type, args.use_summ_proj)
per_example_loss, logits = modeling.classification_loss(
hidden=summary,
labels=label,
n_class=n_class,
initializer=xlnet_model.get_initializer(),
name="model_classification_{}".format(args.task_name),
return_logits=True)
total_loss = fluid.layers.reduce_mean(per_example_loss)
return total_loss, per_example_loss, logits
def create_model(args, xlnet_config, n_class, is_training=False):
label_ids_type = 'int64' if n_class else 'float32'
input_fields = {
'names': [
'input_ids', 'input_mask', 'segment_ids', 'label_ids',
'is_real_example'
],
'shapes': [[-1, args.max_seq_length, 1], [-1, args.max_seq_length],
[-1, args.max_seq_length], [-1, 1], [-1, 1]],
'dtypes': ['int64', 'float32', 'int64', label_ids_type, 'int64'],
'lod_levels': [0, 0, 0, 0, 0],
}
inputs = [
fluid.layers.data(
name=input_fields['names'][i],
shape=input_fields['shapes'][i],
dtype=input_fields['dtypes'][i],
lod_level=input_fields['lod_levels'][i])
for i in range(len(input_fields['names']))
]
(input_ids, input_mask, segment_ids, label_ids, is_real_example) = inputs
data_loader = fluid.io.DataLoader.from_generator(
feed_list=inputs, capacity=50, iterable=False)
features = collections.OrderedDict()
features["input_ids"] = input_ids
features["input_mask"] = input_mask
features["segment_ids"] = segment_ids
features["label_ids"] = label_ids
features["is_real_example"] = is_real_example
if args.is_regression:
(total_loss, per_example_loss, logits) = get_regression_loss(
args, xlnet_config, features, is_training)
else:
(total_loss, per_example_loss, logits) = get_classification_loss(
args, xlnet_config, features, n_class, is_training)
num_seqs = fluid.layers.fill_constant_batch_size_like(
input=label_ids, shape=[-1, 1], value=1, dtype="int64")
num_seqs = fluid.layers.reduce_sum(num_seqs)
return data_loader, total_loss, logits, num_seqs, label_ids
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""XLNet model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import json
import numpy as np
import paddle.fluid as fluid
import modeling
def _get_initializer(args):
if args.init == "uniform":
param_initializer = fluid.initializer.Uniform(
low=-args.init_range, high=args.init_range)
elif args.init == "normal":
param_initializer = fluid.initializer.Normal(scale=args.init_std)
else:
raise ValueError("Initializer {} not supported".format(args.init))
return param_initializer
def init_attn_mask(args, place):
"""create causal attention mask."""
qlen = args.max_seq_length
mlen = 0 if 'mem_len' not in args else args.mem_len
same_length = False if 'same_length' not in args else args.same_length
dtype = 'float16' if args.use_fp16 else 'float32'
attn_mask = np.ones([qlen, qlen], dtype=dtype)
mask_u = np.triu(attn_mask)
mask_dia = np.diag(np.diag(attn_mask))
attn_mask_pad = np.zeros([qlen, mlen], dtype=dtype)
attn_mask = np.concatenate([attn_mask_pad, mask_u - mask_dia], 1)
if same_length:
# mask_l is the lower triangle (with diagonal) of the original [qlen, qlen] ones matrix.
mask_l = np.tril(np.ones([qlen, qlen], dtype=dtype))
attn_mask = np.concatenate(
[attn_mask[:, :qlen] + mask_l - mask_dia, attn_mask[:, qlen:]], 1)
attn_mask = attn_mask[:, :, None, None]
attn_mask_t = fluid.global_scope().find_var("attn_mask").get_tensor()
attn_mask_t.set(attn_mask, place)
class XLNetConfig(object):
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing xlnet model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def has_key(self, key):
return self._config_dict.has_key(key)
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class XLNetModel(object):
def __init__(self,
xlnet_config,
input_ids,
seg_ids,
input_mask,
args,
mems=None,
perm_mask=None,
target_mapping=None,
inp_q=None):
self._tie_weight = True
self._d_head = xlnet_config['d_head']
self._d_inner = xlnet_config['d_inner']
self._d_model = xlnet_config['d_model']
self._ff_activation = xlnet_config['ff_activation']
self._n_head = xlnet_config['n_head']
self._n_layer = xlnet_config['n_layer']
self._n_token = xlnet_config['n_token']
self._untie_r = xlnet_config['untie_r']
self._xlnet_config = xlnet_config
self._dropout = args.dropout
self._dropatt = args.dropatt
self._mem_len = None if 'mem_len' not in args else args.mem_len
self._reuse_len = None if 'reuse_len' not in args else args.reuse_len
self._bi_data = False if 'bi_data' not in args else args.bi_data
self._clamp_len = args.clamp_len
self._same_length = False if 'same_length' not in args else args.same_length
# Initialize all weigths by the specified initializer, and all biases
# will be initialized by constant zero by default.
self._param_initializer = _get_initializer(args)
self.input_mask = input_mask
tfm_args = dict(
n_token=self._n_token,
initializer=self._param_initializer,
attn_type="bi",
n_layer=self._n_layer,
d_model=self._d_model,
n_head=self._n_head,
d_head=self._d_head,
d_inner=self._d_inner,
ff_activation=self._ff_activation,
untie_r=self._untie_r,
use_bfloat16=False,
dropout=self._dropout,
dropatt=self._dropatt,
mem_len=self._mem_len,
reuse_len=self._reuse_len,
bi_data=self._bi_data,
clamp_len=self._clamp_len,
same_length=self._same_length,
name='model_transformer')
input_args = dict(
inp_k=input_ids,
seg_id=seg_ids,
input_mask=input_mask,
mems=mems,
perm_mask=perm_mask,
target_mapping=target_mapping,
inp_q=inp_q)
tfm_args.update(input_args)
self.output, self.new_mems, self.lookup_table = modeling.transformer_xl(
**tfm_args)
#self._build_model(input_ids, sentence_ids, input_mask)
def get_initializer(self):
return self._param_initializer
def get_debug_ret(self):
return self.debug_ret
def get_sequence_output(self):
return self.output
def get_pooled_out(self, summary_type, use_summ_proj=True):
"""
Args:
summary_type: str, "last", "first", "mean", or "attn". The method
to pool the input to get a vector representation.
use_summ_proj: bool, whether to use a linear projection during pooling.
Returns:
float32 Tensor in shape [bsz, d_model], the pooled representation.
"""
summary = modeling.summarize_sequence(
summary_type=summary_type,
hidden=self.output,
d_model=self._d_model,
n_head=self._n_head,
d_head=self._d_head,
dropout=self._dropout,
dropatt=self._dropatt,
input_mask=self.input_mask,
initializer=self._param_initializer,
use_proj=use_summ_proj,
name='model_sequnece_summary')
return summary
(This diff is collapsed.)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import numpy as np
import paddle.fluid as fluid
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
lr_layer_decay_rate=1.0,
scheduler='linear_warmup_decay'):
scheduled_lr = None
if scheduler == 'noam_decay':
if warmup_steps > 0:
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(learning_rate ** 2)),
warmup_steps)
else:
print(
"WARNING: noam decay should have positive warmup steps, using "
"constant learning rate instead!")
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
if lr_layer_decay_rate != 1.0:
n_layer = 0
for param in fluid.default_main_program().block(0).all_parameters():
m = re.search(r"model_transformer_layer_(\d+?)_", param.name)
if not m: continue
n_layer = max(n_layer, int(m.group(1)) + 1)
for param in fluid.default_main_program().block(0).all_parameters():
for l in range(n_layer):
if "model_transformer_layer_{}_".format(l) in param.name:
param.optimize_attr[
'learning_rate'] = lr_layer_decay_rate**(
n_layer - 1 - l)
print("Apply lr decay {:.4f} to layer-{} grad of {}".format(
param.optimize_attr['learning_rate'], l, param.name))
break
def exclude_from_weight_decay(param):
name = param.name[:-len(".master")] if param.name.endswith(".master") else param.name
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
param_list = dict()
if weight_decay > 0:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr
# coding=utf-8
"""this file is a copy of https://github.com/zihangdai/xlnet"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import unicodedata
import six
from functools import partial
SPIECE_UNDERLINE = '▁'
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, unicode):
return text.encode("utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def print_(*args):
new_args = []
for arg in args:
if isinstance(arg, list):
s = [printable_text(i) for i in arg]
s = ' '.join(s)
new_args.append(s)
else:
new_args.append(printable_text(arg))
print(*new_args)
def preprocess_text(inputs, lower=False, remove_space=True, keep_accents=False):
if remove_space:
outputs = ' '.join(inputs.strip().split())
else:
outputs = inputs
outputs = outputs.replace("``", '"').replace("''", '"')
if six.PY2 and isinstance(outputs, str):
outputs = outputs.decode('utf-8')
if not keep_accents:
outputs = unicodedata.normalize('NFKD', outputs)
outputs = ''.join([c for c in outputs if not unicodedata.combining(c)])
if lower:
outputs = outputs.lower()
return outputs
def encode_pieces(sp_model, text, return_unicode=True, sample=False):
# return_unicode is used only for py2
# note(zhiliny): in some systems, sentencepiece only accepts str for py2
if six.PY2 and isinstance(text, unicode):
text = text.encode('utf-8')
if not sample:
pieces = sp_model.EncodeAsPieces(text)
else:
pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
new_pieces = []
for piece in pieces:
if len(piece) > 1 and piece[-1] == ',' and piece[-2].isdigit():
cur_pieces = sp_model.EncodeAsPieces(piece[:-1].replace(
SPIECE_UNDERLINE, ''))
if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][
0] == SPIECE_UNDERLINE:
if len(cur_pieces[0]) == 1:
cur_pieces = cur_pieces[1:]
else:
cur_pieces[0] = cur_pieces[0][1:]
cur_pieces.append(piece[-1])
new_pieces.extend(cur_pieces)
else:
new_pieces.append(piece)
# note(zhiliny): convert back to unicode for py2
if six.PY2 and return_unicode:
ret_pieces = []
for piece in new_pieces:
if isinstance(piece, str):
piece = piece.decode('utf-8')
ret_pieces.append(piece)
new_pieces = ret_pieces
return new_pieces
def encode_ids(sp_model, text, sample=False):
pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
ids = [sp_model.PieceToId(piece) for piece in pieces]
return ids
if __name__ == '__main__':
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('sp10m.uncased.v3.model')
print_(u'I was born in 2000, and this is falsé.')
print_(u'ORIGINAL',
sp.EncodeAsPieces(u'I was born in 2000, and this is falsé.'))
print_(u'OURS',
encode_pieces(sp, u'I was born in 2000, and this is falsé.'))
print(encode_ids(sp, u'I was born in 2000, and this is falsé.'))
print_('')
prepro_func = partial(preprocess_text, lower=True)
print_(prepro_func('I was born in 2000, and this is falsé.'))
print_('ORIGINAL',
sp.EncodeAsPieces(
prepro_func('I was born in 2000, and this is falsé.')))
print_('OURS',
encode_pieces(sp,
prepro_func('I was born in 2000, and this is falsé.')))
print(encode_ids(sp, prepro_func('I was born in 2000, and this is falsé.')))
print_('')
print_('I was born in 2000, and this is falsé.')
print_('ORIGINAL',
sp.EncodeAsPieces('I was born in 2000, and this is falsé.'))
print_('OURS', encode_pieces(sp, 'I was born in 2000, and this is falsé.'))
print(encode_ids(sp, 'I was born in 2000, and this is falsé.'))
print_('')
print_('I was born in 92000, and this is falsé.')
print_('ORIGINAL',
sp.EncodeAsPieces('I was born in 92000, and this is falsé.'))
print_('OURS', encode_pieces(sp, 'I was born in 92000, and this is falsé.'))
print(encode_ids(sp, 'I was born in 92000, and this is falsé.'))
"""this file is adapted from https://github.com/zihangdai/xlnet"""
import io
import os
import types
import csv
import numpy as np
import sentencepiece as spm
from classifier_utils import PaddingInputExample
from classifier_utils import convert_single_example
from prepro_utils import preprocess_text, encode_ids
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def __init__(self, args):
self.data_dir = args.data_dir
self.max_seq_length = args.max_seq_length
self.uncased = args.uncased
np.random.seed(args.random_seed)
sp = spm.SentencePieceProcessor()
sp.Load(args.spiece_model_file)
def tokenize_fn(text):
text = preprocess_text(text, lower=self.uncased)
return encode_ids(sp, text)
self.tokenize_fn = tokenize_fn
self.current_train_example = -1
self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
self.current_train_epoch = -1
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
def convert_example(self, index, example, labels, max_seq_length,
tokenize_fn):
"""Converts a single `InputExample` into a single `InputFeatures`."""
feature = convert_single_example(index, example, labels, max_seq_length,
tokenize_fn)
return feature
def generate_instance(self, feature):
"""
generate instance with given feature
Args:
feature: InputFeatures(object). A single set of features of data.
"""
return [
feature.input_ids, feature.input_mask, feature.segment_ids,
feature.label_id, feature.is_real_example
]
def prepare_batch_data(self, batch_data, is_regression):
"""Generate numpy tensors"""
input_ids = np.expand_dims(
np.array([inst[0] for inst in batch_data]).astype('int64'), axis=-1)
input_mask = np.array(
[inst[1] for inst in batch_data]).astype('float32')
segment_ids = np.array([inst[2] for inst in batch_data]).astype('int64')
labels = np.expand_dims(
np.array([inst[3] for inst in batch_data]).astype(
'int64' if not is_regression else 'float32'),
axis=-1)
is_real_example = np.array(
[inst[4] for inst in batch_data]).astype('int64')
return [input_ids, input_mask, segment_ids, labels, is_real_example]
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with io.open(input_file, "r", encoding="utf8") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
if len(line) == 0: continue
lines.append(line)
return lines
def get_num_examples(self, phase):
"""Get number of examples for train, dev or test."""
if phase not in ['train', 'dev', 'test']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
return self.num_examples[phase]
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
def data_generator(self,
batch_size,
is_regression,
phase='train',
epoch=1,
dev_count=1,
shuffle=True):
"""
Generate data for train, dev or test.
Args:
batch_size: int. The batch size of generated data.
phase: string. The phase for which to generate data.
epoch: int. Total epoches to generate data.
shuffle: bool. Whether to shuffle examples.
"""
if phase == 'train':
examples = self.get_train_examples(self.data_dir)
self.num_examples['train'] = len(examples)
elif phase == 'dev':
examples = self.get_dev_examples(self.data_dir)
self.num_examples['dev'] = len(examples)
elif phase == 'test':
examples = self.get_test_examples(self.data_dir)
self.num_examples['test'] = len(examples)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
def instance_reader():
label_list = self.get_labels() if not is_regression else None
for epoch_index in range(epoch):
if shuffle:
np.random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
for (index, example) in enumerate(examples):
if phase == 'train':
self.current_train_example = index + 1
feature = convert_single_example(index, example, label_list,
self.max_seq_length,
self.tokenize_fn)
instance = [
feature.input_ids, feature.input_mask,
feature.segment_ids, feature.label_id,
feature.is_real_example
]
yield instance
def batch_reader(reader, batch_size):
batch = []
for instance in reader():
if len(batch) < batch_size:
batch.append(instance)
else:
yield batch
batch = [instance]
if len(batch) > 0:
yield batch
def wrapper():
all_dev_batches = []
for batch_data in batch_reader(instance_reader, batch_size):
batch_data = self.prepare_batch_data(batch_data, is_regression)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
return wrapper
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
class GLUEProcessor(DataProcessor):
def __init__(self, args):
super(GLUEProcessor, self).__init__(args)
self.train_file = "train.tsv"
self.dev_file = "dev.tsv"
self.test_file = "test.tsv"
self.label_column = None
self.text_a_column = None
self.text_b_column = None
self.contains_header = True
self.test_text_a_column = None
self.test_text_b_column = None
self.test_contains_header = True
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.train_file)), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.dev_file)), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
if self.test_text_a_column is None:
self.test_text_a_column = self.text_a_column
if self.test_text_b_column is None:
self.test_text_b_column = self.text_b_column
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.test_file)), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0 and self.contains_header and set_type != "test":
continue
if i == 0 and self.test_contains_header and set_type == "test":
continue
guid = "%s-%s" % (set_type, i)
a_column = (self.text_a_column
if set_type != "test" else self.test_text_a_column)
b_column = (self.text_b_column
if set_type != "test" else self.test_text_b_column)
# there are some incomplete lines in QNLI
if len(line) <= a_column:
print('WARNING: Incomplete line, ignored.')
continue
text_a = line[a_column]
if b_column is not None:
if len(line) <= b_column:
print('WARNING: Incomplete line, ignored.')
continue
text_b = line[b_column]
else:
text_b = None
if set_type == "test":
label = self.get_labels()[0]
else:
if len(line) <= self.label_column:
print('WARNING: Incomplete line, ignored.')
continue
label = line[self.label_column]
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class Yelp5Processor(DataProcessor):
def __init__(self, args):
super(Yelp5Processor, self).__init__(args)
def get_train_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "train.csv"))
def get_dev_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "test.csv"))
def get_labels(self):
"""See base class."""
return ["1", "2", "3", "4", "5"]
def _create_examples(self, input_file):
"""Creates examples for the training and dev sets."""
examples = []
with io.open(input_file, 'r', encoding='utf8') as f:
reader = csv.reader(f)
for i, line in enumerate(reader):
label = line[0]
text_a = line[1].replace('""', '"').replace('\\"', '"')
examples.append(
InputExample(
guid=str(i), text_a=text_a, text_b=None, label=label))
return examples
class ImdbProcessor(DataProcessor):
def __init__(self, args):
super(ImdbProcessor, self).__init__(args)
def get_labels(self):
return ["neg", "pos"]
def get_train_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "train"))
def get_dev_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "test"))
def _create_examples(self, data_dir):
examples = []
for label in ["neg", "pos"]:
cur_dir = os.path.join(data_dir, label)
for filename in os.listdir(cur_dir):
if not filename.endswith("txt"): continue
path = os.path.join(cur_dir, filename)
with io.open(path, 'r', encoding='utf8') as f:
text = f.read().strip().replace("<br />", " ")
examples.append(
InputExample(
guid="unused_id", text_a=text, text_b=None,
label=label))
return examples
class MnliMatchedProcessor(GLUEProcessor):
def __init__(self, args):
super(MnliMatchedProcessor, self).__init__(args)
self.dev_file = "dev_matched.tsv"
self.test_file = "test_matched.tsv"
self.label_column = -1
self.text_a_column = 8
self.text_b_column = 9
def get_labels(self):
return ["contradiction", "entailment", "neutral"]
class MnliMismatchedProcessor(MnliMatchedProcessor):
def __init__(self, args):
super(MnliMismatchedProcessor, self).__init__(args)
self.dev_file = "dev_mismatched.tsv"
self.test_file = "test_mismatched.tsv"
class StsbProcessor(GLUEProcessor):
def __init__(self, args):
super(StsbProcessor, self).__init__(args)
self.label_column = 9
self.text_a_column = 7
self.text_b_column = 8
def get_labels(self):
return [0.0]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0 and self.contains_header and set_type != "test":
continue
if i == 0 and self.test_contains_header and set_type == "test":
continue
guid = "%s-%s" % (set_type, i)
a_column = (self.text_a_column
if set_type != "test" else self.test_text_a_column)
b_column = (self.text_b_column
if set_type != "test" else self.test_text_b_column)
# there are some incomplete lines in QNLI
if len(line) <= a_column:
print('WARNING: Incomplete line, ignored.')
continue
text_a = line[a_column]
if b_column is not None:
if len(line) <= b_column:
print('WARNING: Incomplete line, ignored.')
continue
text_b = line[b_column]
else:
text_b = None
if set_type == "test":
label = self.get_labels()[0]
else:
if len(line) <= self.label_column:
print('WARNING: Incomplete line, ignored.')
continue
label = float(line[self.label_column])
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
if __name__ == '__main__':
pass
(This diff is collapsed.)
(This diff is collapsed.)
(This diff is collapsed.)
"""this file is adapted from https://github.com/zihangdai/xlnet"""
import io
import argparse
import collections
import json
import numpy as np
import os
import re
import string
import sys
OPTS = None
def parse_args():
parser = argparse.ArgumentParser(
'Official evaluation script for SQuAD version 2.0.')
parser.add_argument(
'data_file', metavar='data.json', help='Input data JSON file.')
parser.add_argument(
'pred_file', metavar='pred.json', help='Model predictions.')
parser.add_argument(
'--out-file',
'-o',
metavar='eval.json',
help='Write accuracy metrics to file (default is stdout).')
parser.add_argument(
'--na-prob-file',
'-n',
metavar='na_prob.json',
help='Model estimates of probability of no answer.')
parser.add_argument(
'--na-prob-thresh',
'-t',
type=float,
default=1.0,
help='Predict "" if no-answer probability exceeds this (default = 1.0).')
parser.add_argument(
'--out-image-dir',
'-p',
metavar='out_images',
default=None,
help='Save precision-recall curves to directory.')
parser.add_argument('--verbose', '-v', action='store_true')
if len(sys.argv) == 1:
parser.print_help()
sys.exit(1)
return parser.parse_args()
def make_qid_to_has_ans(dataset):
qid_to_has_ans = {}
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
qid_to_has_ans[qa['id']] = bool(qa['answers'])
return qid_to_has_ans
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
return re.sub(regex, ' ', text)
def white_space_fix(text):
return ' '.join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
def get_tokens(s):
if not s: return []
return normalize_answer(s).split()
def compute_exact(a_gold, a_pred):
return int(normalize_answer(a_gold) == normalize_answer(a_pred))
def compute_f1(a_gold, a_pred):
gold_toks = get_tokens(a_gold)
pred_toks = get_tokens(a_pred)
common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
num_same = sum(common.values())
if len(gold_toks) == 0 or len(pred_toks) == 0:
# If either is no-answer, then F1 is 1 if they agree, 0 otherwise
return int(gold_toks == pred_toks)
if num_same == 0:
return 0
precision = 1.0 * num_same / len(pred_toks)
recall = 1.0 * num_same / len(gold_toks)
f1 = (2 * precision * recall) / (precision + recall)
return f1
def get_raw_scores(dataset, preds):
exact_scores = {}
f1_scores = {}
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
qid = qa['id']
gold_answers = [
a['text'] for a in qa['answers']
if normalize_answer(a['text'])
]
if not gold_answers:
# For unanswerable questions, only correct answer is empty string
gold_answers = ['']
if qid not in preds:
print('Missing prediction for %s' % qid)
continue
a_pred = preds[qid]
# Take max over all gold answers
exact_scores[qid] = max(
compute_exact(a, a_pred) for a in gold_answers)
f1_scores[qid] = max(
compute_f1(a, a_pred) for a in gold_answers)
return exact_scores, f1_scores
def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
new_scores = {}
for qid, s in scores.items():
pred_na = na_probs[qid] > na_prob_thresh
if pred_na:
new_scores[qid] = float(not qid_to_has_ans[qid])
else:
new_scores[qid] = s
return new_scores
def make_eval_dict(exact_scores, f1_scores, qid_list=None):
if not qid_list:
total = len(exact_scores)
return collections.OrderedDict([
('exact', 100.0 * sum(exact_scores.values()) / total),
('f1', 100.0 * sum(f1_scores.values()) / total),
('total', total),
])
else:
total = len(qid_list)
return collections.OrderedDict([
('exact', 100.0 * sum(exact_scores[k] for k in qid_list) / total),
('f1', 100.0 * sum(f1_scores[k] for k in qid_list) / total),
('total', total),
])
def merge_eval(main_eval, new_eval, prefix):
for k in new_eval:
main_eval['%s_%s' % (prefix, k)] = new_eval[k]
def plot_pr_curve(precisions, recalls, out_image, title):
plt.step(recalls, precisions, color='b', alpha=0.2, where='post')
plt.fill_between(recalls, precisions, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0, 1.05])
plt.ylim([0.0, 1.05])
plt.title(title)
plt.savefig(out_image)
plt.clf()
def make_precision_recall_eval(scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=None,
title=None):
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
true_pos = 0.0
cur_p = 1.0
cur_r = 0.0
precisions = [1.0]
recalls = [0.0]
avg_prec = 0.0
for i, qid in enumerate(qid_list):
if qid_to_has_ans[qid]:
true_pos += scores[qid]
cur_p = true_pos / float(i + 1)
cur_r = true_pos / float(num_true_pos)
if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i + 1]]:
# i.e., if we can put a threshold after this point
avg_prec += cur_p * (cur_r - recalls[-1])
precisions.append(cur_p)
recalls.append(cur_r)
if out_image:
plot_pr_curve(precisions, recalls, out_image, title)
return {'ap': 100.0 * avg_prec}
def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs,
qid_to_has_ans, out_image_dir):
if out_image_dir and not os.path.exists(out_image_dir):
os.makedirs(out_image_dir)
num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
if num_true_pos == 0:
return
pr_exact = make_precision_recall_eval(
exact_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_exact.png'),
title='Precision-Recall curve for Exact Match score')
pr_f1 = make_precision_recall_eval(
f1_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_f1.png'),
title='Precision-Recall curve for F1 score')
oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
pr_oracle = make_precision_recall_eval(
oracle_scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_oracle.png'),
title='Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)')
merge_eval(main_eval, pr_exact, 'pr_exact')
merge_eval(main_eval, pr_f1, 'pr_f1')
merge_eval(main_eval, pr_oracle, 'pr_oracle')
def histogram_na_prob(na_probs, qid_list, image_dir, name):
if not qid_list:
return
x = [na_probs[k] for k in qid_list]
weights = np.ones_like(x) / float(len(x))
plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
plt.xlabel('Model probability of no-answer')
plt.ylabel('Proportion of dataset')
plt.title('Histogram of no-answer probability: %s' % name)
plt.savefig(os.path.join(image_dir, 'na_prob_hist_%s.png' % name))
plt.clf()
def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores: continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
return 100.0 * best_score / len(scores), best_thresh
def find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores: continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
has_ans_score, has_ans_cnt = 0, 0
for qid in qid_list:
if not qid_to_has_ans[qid]: continue
has_ans_cnt += 1
if qid not in scores: continue
has_ans_score += scores[qid]
return 100.0 * best_score / len(
scores), best_thresh, 1.0 * has_ans_score / has_ans_cnt
def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans):
best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs,
qid_to_has_ans)
best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs,
qid_to_has_ans)
main_eval['best_exact'] = best_exact
main_eval['best_exact_thresh'] = exact_thresh
main_eval['best_f1'] = best_f1
main_eval['best_f1_thresh'] = f1_thresh
def find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans):
best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(
preds, exact_raw, na_probs, qid_to_has_ans)
best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(
preds, f1_raw, na_probs, qid_to_has_ans)
main_eval['best_exact'] = best_exact
main_eval['best_exact_thresh'] = exact_thresh
main_eval['best_f1'] = best_f1
main_eval['best_f1_thresh'] = f1_thresh
main_eval['has_ans_exact'] = has_ans_exact
main_eval['has_ans_f1'] = has_ans_f1
def main():
with io.open(OPTS.data_file, encoding='utf8') as f:
dataset_json = json.load(f)
dataset = dataset_json['data']
with io.open(OPTS.pred_file, encoding='utf8') as f:
preds = json.load(f)
new_orig_data = []
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
if qa['id'] in preds:
new_para = {'qas': [qa]}
new_article = {'paragraphs': [new_para]}
new_orig_data.append(new_article)
dataset = new_orig_data
if OPTS.na_prob_file:
with io.open(OPTS.na_prob_file, encoding='utf8') as f:
na_probs = json.load(f)
else:
na_probs = {k: 0.0 for k in preds}
qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
exact_raw, f1_raw = get_raw_scores(dataset, preds)
exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans,
OPTS.na_prob_thresh)
f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans,
OPTS.na_prob_thresh)
out_eval = make_eval_dict(exact_thresh, f1_thresh)
if has_ans_qids:
has_ans_eval = make_eval_dict(
exact_thresh, f1_thresh, qid_list=has_ans_qids)
merge_eval(out_eval, has_ans_eval, 'HasAns')
if no_ans_qids:
no_ans_eval = make_eval_dict(
exact_thresh, f1_thresh, qid_list=no_ans_qids)
merge_eval(out_eval, no_ans_eval, 'NoAns')
if OPTS.na_prob_file:
find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans)
if OPTS.na_prob_file and OPTS.out_image_dir:
run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs,
qid_to_has_ans, OPTS.out_image_dir)
histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, 'hasAns')
histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, 'noAns')
if OPTS.out_file:
with io.open(OPTS.out_file, 'w', encoding='utf8') as f:
json.dump(out_eval, f)
else:
print(json.dumps(out_eval, indent=2))
if __name__ == '__main__':
OPTS = parse_args()
if OPTS.out_image_dir:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import six
import argparse
import paddle.fluid as fluid
def str2bool(v):
# argparse cannot parse strings like "True"/"False" into Python booleans directly
return v.lower() in ("true", "t", "1")
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def check_cuda(use_cuda, err = \
"\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n"
):
try:
if use_cuda == True and fluid.is_compiled_with_cuda() == False:
print(err)
sys.exit(1)
except Exception as e:
pass
(This diff is collapsed.)