Release ERNIE 2.0

ERNIE 2.0 is a continual pre-training framework for language understanding in which pre-training tasks can be incrementally built and learned through multi-task learning

5c3b8cd3 · tianxin · 03504515
89 changed files
--- a/.gitignore
+++ b/.gitignore
 *.pyc
+*.un~
--- a/.metas/ernie2.0_arch.png
+++ b/.metas/ernie2.0_arch.png
--- a/.metas/ernie2.0_model.png
+++ b/.metas/ernie2.0_model.png
--- a/ERNIE/.run_ce.sh
+++ b/ERNIE/.run_ce.sh
--- a/ERNIE/README.md
+++ b/ERNIE/README.md
+
+<div align="center">
+    <h1>
+        <font color="red">
+        The ERNIE project has moved <a href="../README.zh.md">here</a>
+        </font>
+    </h1>
+</div>
+
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+&nbsp;
+
+
 ## ERNIE: **E**nhanced **R**epresentation through k**N**owledge **I**nt**E**gration

 **** **2019-04-10 update**: update ERNIE_stable-1.0.1.tar.gz, which packages the model parameters, the config ernie_config.json, and vocab.txt ****
@@ -170,7 +197,7 @@ nlpcc-dbqa is a task held in 2016 by the International Conference on Natural Language Processing and Chinese Computing (NLPCC)
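 | [Model](https://ernie.bj.bcebos.com/ERNIE_stable.tgz) | Contains the pre-trained model parameters |
 | [Model (with config file and vocabulary)](https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz)) | Contains the pre-trained model parameters, the vocabulary vocab.txt, and the model config ernie_config.json |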

-2) [Task data download](https://ernie.bj.bcebos.com/task_data.tgz)
+2) [Task data download](https://ernie.bj.bcebos.com/task_data_zh.tgz)

 ### Installation
 This project depends on Paddle Fluid 1.3.1; please refer to the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it.

--- a/README.md
+++ b/README.md
-# LARK
+English | [简体中文](./README.zh.md)

-**LA**nguage **R**epresentations **K**it, currently includes
+## ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
+
+
+  * [Continual Pre-training Framework for Language Understanding](#continual-pre-training-framework-for-language-understanding)
+  * [Pre-training Tasks](#pre-training-tasks)
+     * [Word-aware Tasks](#word-aware-tasks)
+        * [Knowledge Masking Task](#knowledge-masking-task)
+        * [Capitalization Prediction Task](#capitalization-prediction-task)
+        * [Token-Document Relation Prediction Task](#token-document-relation-prediction-task)
+     * [Structure-aware Tasks](#structure-aware-tasks)
+        * [Sentence Reordering Task](#sentence-reordering-task)
+        * [Sentence Distance Task](#sentence-distance-task)
+     * [Semantic-aware Tasks](#semantic-aware-tasks)
+        * [Discourse Relation Task](#discourse-relation-task)
+        * [IR Relevance Task](#ir-relevance-task)
+  * [ERNIE 1.0: <strong>E</strong>nhanced <strong>R</strong>epresentation through k<strong>N</strong>owledge <strong>I</strong>nt<strong>E</strong>gration](#ernie-10-enhanced-representation-through-knowledge-integration)
+  * [Results on English Datasets](#results-on-english-datasets)
+  * [Results on Chinese Datasets](#results-on-chinese-datasets)
+
+
+### Continual Pre-training Framework for Language Understanding
+
+**[ERNIE 2.0](https://arxiv.org/abs/1907.12412v1) is a continual pre-training framework for language understanding** in which pre-training tasks can be incrementally built and learned through multi-task learning. In this framework, different customized tasks can be introduced incrementally at any time. For example, tasks such as named entity prediction, discourse relation recognition, and sentence order prediction are used to help the model learn language representations.
+
+![ernie2.0_arch](.metas/ernie2.0_arch.png)
+
+We compare the [ERNIE 2.0 model](https://arxiv.org/abs/1907.12412v1) with existing SOTA pre-training models on the authoritative English benchmark GLUE and on 9 popular Chinese datasets. The results show that ERNIE 2.0 outperforms BERT and XLNet on 7 GLUE tasks and outperforms BERT on all 9 Chinese NLP tasks. Specifically, on the GLUE datasets ERNIE 2.0 outperforms BERT and XLNet on almost all English tasks, for both the base and the large model. Furthermore, the ERNIE 2.0 large model achieves the best performance and sets new state-of-the-art results on these Chinese NLP tasks.
+
+### Pre-training Tasks
+
+We construct several tasks to capture different aspects of information in the training corpora:
+
+- **Word-aware Tasks**: to capture lexical information
+- **Structure-aware Tasks**: to capture syntactic information
+- **Semantic-aware Tasks**: to capture semantic signals
+
+![ernie2.0_model](.metas/ernie2.0_model.png)
+
+
+#### Word-aware Tasks
+
+##### Knowledge Masking Task
+
+- [ERNIE 1.0](https://arxiv.org/abs/1904.09223) introduced phrase and named entity masking strategies to help the model learn the dependency information in both local contexts and global contexts.
+
+##### Capitalization Prediction Task
+
+- Capitalized words usually carry specific semantic value compared to other words in a sentence, so we add a task to predict whether a word is capitalized or not.
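
As a rough illustration of how training labels for such a task could be derived, the sketch below marks a token as capitalized when its first character is uppercase. The function name `capitalization_labels` and this labeling rule are assumptions made for illustration; the README does not specify how ERNIE 2.0 builds these labels.

```python
def capitalization_labels(tokens):
    """Return 1 for tokens whose first character is uppercase, else 0.

    Hypothetical helper: the actual ERNIE 2.0 labeling rule is not given in the README.
    """
    return [1 if tok[:1].isupper() else 0 for tok in tokens]


tokens = "Harry Potter is a series of fantasy novels written by J. K. Rowling".split()
for token, label in zip(tokens, capitalization_labels(tokens)):
    print(token, label)
```

Each token then becomes one binary classification target alongside the other pre-training objectives.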
+
+##### Token-Document Relation Prediction Task
+
+- A task to predict whether the token in a segment appears in other segments of the original document.
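
A minimal sketch of how such labels might be computed is shown below, assuming a document has already been split into tokenized segments. The helper name `token_document_labels` and the exact-match comparison of tokens are illustrative assumptions, not the project's implementation.

```python
def token_document_labels(segments, index):
    """Label each token of segments[index]: 1 if it also occurs in another segment, else 0.

    Hypothetical helper; `segments` is a list of token lists from one document.
    """
    others = set()
    for i, segment in enumerate(segments):
        if i != index:
            others.update(segment)
    return [1 if tok in others else 0 for tok in segments[index]]


document = [
    ["ernie", "is", "a", "model"],
    ["the", "model", "learns"],
    ["tasks", "help", "ernie"],
]
print(token_document_labels(document, 0))
```

Tokens that recur across segments tend to be topical for the document, which is presumably why predicting them is useful as a training signal.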
+
+#### Structure-aware Tasks
+
+##### Sentence Reordering Task
+
+- This task tries to learn the relationships among sentences: a given paragraph is randomly split into 1 to m segments, the segments are permuted, and the model must reorganize them, which is framed as a standard classification task.
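
The sketch below shows one way to turn this description into training examples. It assumes one class per permutation of 1 to m segments (so the number of classes is 1! + 2! + ... + m!) and contiguous, non-empty segments. The function names and these choices are illustrative assumptions rather than the released implementation.

```python
import itertools
import random


def build_permutation_classes(m):
    """Map every permutation of n segments (1 <= n <= m) to a class id (assumed encoding)."""
    classes = {}
    for n in range(1, m + 1):
        for perm in itertools.permutations(range(n)):
            classes[perm] = len(classes)
    return classes


def make_reordering_example(sentences, m, classes, rng):
    """Split sentences into n contiguous segments, shuffle them, return (shuffled, label)."""
    n = rng.randint(1, min(m, len(sentences)))
    # Pick n - 1 distinct cut points so that every segment is non-empty.
    cuts = sorted(rng.sample(range(1, len(sentences)), n - 1))
    bounds = [0] + cuts + [len(sentences)]
    segments = [sentences[a:b] for a, b in zip(bounds, bounds[1:])]
    order = list(range(n))
    rng.shuffle(order)
    shuffled = [segments[i] for i in order]
    # The label identifies which permutation was applied to the segments.
    return shuffled, classes[tuple(order)]


classes = build_permutation_classes(3)
rng = random.Random(0)
shuffled, label = make_reordering_example(["s1", "s2", "s3", "s4"], 3, classes, rng)
print(len(classes), shuffled, label)
```

With m = 3 this yields 1 + 2 + 6 = 9 classes, so the classifier head stays small as long as m is small.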
+
+##### Sentence Distance Task
+
+- This task handles the distance between sentences as a 3-class classification problem.
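
One plausible reading of the three classes, based on our understanding of the ERNIE 2.0 paper and labeled here as an assumption, is: 0 for adjacent sentences in the same document, 1 for non-adjacent sentences in the same document, and 2 for sentences from different documents. The helper below is a hypothetical sketch of that labeling.

```python
def sentence_distance_label(doc_a, idx_a, doc_b, idx_b):
    """Return the assumed 3-class distance label for two sentences.

    doc_a / doc_b are document ids, idx_a / idx_b are sentence positions.
    0: adjacent in the same document, 1: same document but not adjacent,
    2: different documents. This class mapping is an assumption.
    """
    if doc_a != doc_b:
        return 2
    return 0 if abs(idx_a - idx_b) == 1 else 1


print(sentence_distance_label("doc1", 3, "doc1", 4))
print(sentence_distance_label("doc1", 0, "doc1", 5))
print(sentence_distance_label("doc1", 0, "doc2", 1))
```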
+
+#### Semantic-aware Tasks
+
+##### Discourse Relation Task
+
+- A task that tries to predict the semantic or rhetorical relation between two sentences.
+
+##### IR Relevance Task
+
+- A 3-class classification task which predicts the relationship between a query and a title.
+
+
+### ERNIE 1.0: **E**nhanced **R**epresentation through k**N**owledge **I**nt**E**gration
+
+**[ERNIE 1.0](https://arxiv.org/abs/1904.09223)** is a new unsupervised language representation learning method enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking. Inspired by the masking strategy of BERT ([Devlin et al., 2018](https://arxiv.org/abs/1810.04805)), **ERNIE** introduces phrase masking and named entity masking and predicts the whole masked phrases or named entities. The phrase-level strategy masks a whole phrase, which is a group of words that functions as a conceptual unit. The entity-level strategy masks named entities, including persons, locations, organizations, products, etc., which can be denoted with proper names.
+
+**Example**:
+
+**Harry Potter is a series of fantasy novels written by J. K. Rowling**
+
+
+
+```- Learned by BERT: [mask] Potter is a series [mask] fantasy novels [mask] by J. [mask] Rowling```
+
+```- Learned by ERNIE: Harry Potter is a series of [mask] [mask] written by [mask] [mask] [mask]```
+
+
+
+In the example sentence above, BERT can identify "K." through the locally co-occurring words J., K., and Rowling, but the model fails to learn any knowledge related to the entity "J. K. Rowling". ERNIE, however, can extrapolate the relationship between Harry Potter and J. K. Rowling by analyzing implicit knowledge of words and entities, and infer that Harry Potter is a novel written by J. K. Rowling.
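
The difference between the two strategies can be sketched as masking isolated tokens versus masking whole spans. In the snippet below, the helper `mask_spans` and the hand-picked span positions are assumptions chosen to reproduce the example; a real pipeline would obtain phrase and entity boundaries from a chunker or named entity recognizer and sample the masked positions randomly.

```python
def mask_spans(tokens, spans, mask="[mask]"):
    """Replace every token inside the given (start, end) spans with the mask symbol."""
    masked = list(tokens)
    for start, end in spans:
        for i in range(start, end):
            masked[i] = mask
    return masked


tokens = "Harry Potter is a series of fantasy novels written by J. K. Rowling".split()

# Token-level masking: isolated positions, as in the BERT line of the example.
bert_style = mask_spans(tokens, [(0, 1), (5, 6), (8, 9), (11, 12)])
# Phrase- and entity-level masking: whole units, as in the ERNIE line of the example.
ernie_style = mask_spans(tokens, [(6, 8), (10, 13)])

print(" ".join(bert_style))
print(" ".join(ernie_style))
```

Because the entity "J. K. Rowling" is hidden as a unit, the model cannot recover it from its own neighboring tokens and has to rely on the rest of the sentence.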
+
+Integrating both phrase information and named entity information enables the model to obtain better language representations compared to BERT. ERNIE is trained on multi-source data and knowledge collected from encyclopedia articles, news, and forum dialogues, which improves its performance in context-based knowledge reasoning.
+
+## Release Notes
+
+- July 30, 2019: release ERNIE 2.0
+- Apr 10, 2019: update ERNIE_stable-1.0.1.tar.gz, update config and vocab
+- Mar 18, 2019: update ERNIE_stable.tgz
+- Mar 15, 2019: release ERNIE 1.0

- [BERT](./BERT): Bidirectional Encoder Representation from Transformers
- [ERNIE](./ERNIE): Enhanced Representation from kNowledge IntEgration
- [ELMo](./ELMo): Embeddings from Language Models

 ## Communication

- [Github Issues](https://github.com/PaddlePaddle/LARK/issues): bug reports, feature requests, install issues, usage issues, etc.
+- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
 - QQ discussion group: 760439550 (ERNIE discussion group).
 - [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.

-And more is on the way.
+
+## Results
+
+### Results on English Datasets
+
+The English version of ERNIE 2.0 is evaluated on the [GLUE benchmark](https://gluebenchmark.com/), which includes 10 datasets and 11 test sets covering tasks such as natural language inference (e.g., MNLI), sentiment analysis (e.g., SST-2), and coreference resolution (e.g., WNLI). We compare single-model ERNIE 2.0 with XLNet and BERT on the GLUE dev set, using the results reported in the [XLNet paper (Z. Yang et al.)](https://arxiv.org/abs/1906.08237), and with BERT on the GLUE test set, using the [open leaderboard](https://gluebenchmark.com/leaderboard).
+
+
+
+#### Single Model Results on GLUE-Dev
+
+| <strong>Dataset</strong> | <strong>CoLA</strong> | <strong>SST-2</strong> | <strong>MRPC</strong> | <strong>STS-B</strong> | <strong>QQP</strong> | <strong>MNLI-m</strong> | <strong>QNLI</strong> | <strong>RTE</strong> |
+| --------------------- | --------------------- | ---------------------- | --------------------- | ---------------------- | -------------------- | ----------------------- | --------------------- | -------------------- |
+| **metric**            | **matthews corr.**    | **acc**                | **acc**          | **pearson corr.**      | **acc**              | **acc**                 | **acc**               | **acc**              |
+| **BERT Large**        | 60.6                  | 93.2                   | 88.0                  | 90.0                   | 91.3                 | 86.6                    | 92.3                  | 70.4                 |
+| **XLNet Large**       | 63.6          | 95.6   | 89.2   | 91.8    | 91.8  | 89.8   | 93.9   | 83.8   |
+| **ERNIE 2.0 Large**   | 65.4<br/>(**+4.8,+1.8**)   | 96.0<br/>(**+2.8,+0.4**)    | 89.7<br/>(**+1.7,+0.5**)   | 92.3<br/>(**+2.3,+0.5**)    | 92.5<br/>(**+1.2,+0.7**)  | 89.1<br/>(**+2.5,-0.7**)     | 94.3<br/>(**+2.0,+0.4**)   | 85.2<br/>(**+14.8,+1.4**) |
+
+
+
+We use single-task dev results in the table.
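
The bracketed pairs in the ERNIE 2.0 Large row are the differences to BERT Large and XLNet Large, in that order. The short check below recomputes them from the scores in the table; the dictionary and function names are only illustrative.

```python
# Dev-set scores copied from the table above.
bert = {"CoLA": 60.6, "SST-2": 93.2, "MRPC": 88.0, "STS-B": 90.0,
        "QQP": 91.3, "MNLI-m": 86.6, "QNLI": 92.3, "RTE": 70.4}
xlnet = {"CoLA": 63.6, "SST-2": 95.6, "MRPC": 89.2, "STS-B": 91.8,
         "QQP": 91.8, "MNLI-m": 89.8, "QNLI": 93.9, "RTE": 83.8}
ernie = {"CoLA": 65.4, "SST-2": 96.0, "MRPC": 89.7, "STS-B": 92.3,
         "QQP": 92.5, "MNLI-m": 89.1, "QNLI": 94.3, "RTE": 85.2}


def deltas(model, *baselines):
    """Return, per task, the score difference between model and each baseline."""
    return {task: tuple(round(score - base[task], 1) for base in baselines)
            for task, score in model.items()}


diffs = deltas(ernie, bert, xlnet)
for task, (vs_bert, vs_xlnet) in diffs.items():
    print(f"{task}: {vs_bert:+.1f} vs BERT, {vs_xlnet:+.1f} vs XLNet")

wins_over_xlnet = sum(1 for _, vs_xlnet in diffs.values() if vs_xlnet > 0)
print("tasks where ERNIE 2.0 Large beats XLNet Large:", wins_over_xlnet)
```

MNLI-m is the only listed task where XLNet Large scores higher, which matches the claim of 7 GLUE tasks made earlier in the README.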
+
+
+
+#### Single Model Results on GLUE-Test
+
+| <strong>Dataset</strong>                | -                          | <strong>CoLA</strong> | <strong>SST-2</strong> | <strong>MRPC</strong>         | <strong>STS-B</strong>        | <strong>QQP</strong>          | <strong>MNLI-m</strong> | <strong>MNLI-mm</strong> | <strong>QNLI</strong> | <strong>RTE</strong> | <strong>WNLI</strong> | <strong>AX</strong> |
+| ------------------- | -------------------------- | --------------------- | ---------------------- | ----------------------------- | ----------------------------- | ----------------------------- | ----------------------- | ------------------------ | --------------------- | -------------------- | --------------------- | ------------------- |
+| **Metric**          | **<strong>score</strong>** | **matthews corr.**    | **acc**                | **f1-score/acc**              | **spearman/pearson corr.**    | **f1-score/acc**              | **acc**                 | **acc**                  | **acc**               | **acc**              | **acc**               | **matthews corr.**  |
+| **BERT Base**       | 78.3                       | 52.1                  | 93.5                   | 88.9/84.8                     | 85.8/87.1                     | 71.2/89.2                     | 84.6                    | 83.4                     | 90.5                  | 66.4                 | 65.1                  | 34.2                |
+| **ERNIE 2.0 Base**  | 80.6<br/>(**+2.3**)        | 55.2<br/>(**+3.1**)   | 95.0<br/>(**+1.5**)    | 89.9/86.1<br/>(**+1.0/+1.3**) | 86.5/87.6<br/>(**+0.7/+0.5**) | 73.2/89.8<br/>(**+2.0/+0.6**) | 86.1<br/>(**+1.5**)     | 85.5<br/>(**+2.1**)      | 92.9<br/>(**+2.4**)   | 74.8<br/>(**+8.4**)  | 65.1                  | 37.4<br/>(**+3.2**) |
+| **BERT Large**      | 80.5                       | 60.5                  | 94.9                   | 89.3/85.4                     | 86.5/87.6                     | 72.1/89.3                     | 86.7                    | 85.9                     | 92.7                  | 70.1                 | 65.1                  | 39.6                |
+| **ERNIE 2.0 Large** | 83.6<br/>(**+3.1**)        | 63.5<br/>(**+3.0**)   | 95.6<br/>(**+0.7**)    | 90.2/87.4<br/>(**+0.9/+2.0**) | 90.6/91.2<br/>(**+4.1/+3.6**) | 73.8/90.1<br/>(**+1.7/+0.8**) | 88.7<br/>(**+2.0**)     | 88.8<br/>(**+2.9**)      | 94.6<br/>(**+1.9**)   | 80.2<br/>(**+10.1**) | 67.8<br/>(**+2.7**)   | 48.0<br/>(**+8.4**) |
+
+
+
+Because XLNet has not published single-model test results on GLUE, we only compare ERNIE 2.0 with BERT here.
+
+### Results on Chinese Datasets
+
+#### Results on Natural Language Inference
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>XNLI</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>acc</strong></center>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td>
+        <strong>BERT Base
+          <br></strong></td>
+      <td>78.1</td>
+      <td>77.2</td>
+    </tr>
+    <tr>
+      <td>
+        <strong>ERNIE 1.0 Base
+          <br></strong></td>
+      <td>79.9 <span>(<strong>+1.8</strong>)</span></td>
+      <td>78.4 <span>(<strong>+1.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td>
+        <strong>ERNIE 2.0 Base
+          <br></strong></td>
+      <td>81.2 <span>(<strong>+3.1</strong>)</span></td>
+      <td>79.7 <span>(<strong>+2.5</strong>)</span></td>
+    </tr>
+    <tr>
+      <td>
+        <strong>ERNIE 2.0 Large
+          <br></strong></td>
+      <td>82.6 <span>(<strong>+4.5</strong>)</span></td>
+      <td>81.0 <span>(<strong>+3.8</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+ - **XNLI**
+
+```text
+XNLI is a natural language inference dataset in 15 languages. It was jointly built by Facebook and New York University. We use the Chinese data of XNLI to evaluate the language understanding ability of our model. [url: https://github.com/facebookresearch/XNLI]
+```
+
+
+
+#### Results on Machine Reading Comprehension
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>DuReader</strong></center></th>
+      <th colspan="2"><center><strong>CMRC2018</strong></center></th>
+      <th colspan="4"><strong>DRCD</strong></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="1">
+        <center><strong>em</strong></center>
+        <br></td>
+      <td colspan="1">
+        <strong>f1-score</strong>
+        <br></td>
+      <td colspan="1">
+        <strong>em</strong>
+        <br></td>
+      <td colspan="1">
+        <strong>f1-score</strong>
+        <strong></strong>
+        <br></td>
+      <td colspan="2">
+        <strong>em</strong>
+        <br></td>
+      <td colspan="2">
+        <strong>f1-score</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="2" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="2" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>59.5</td>
+      <td>73.1</td>
+      <td>66.3</td>
+      <td>85.9</td>
+      <td>85.7</td>
+      <td>84.9</td>
+      <td>91.6</td>
+      <td>90.9</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>57.9 <span>(<strong>-1.6</strong>)</span></td>
+      <td>72.1 <span>(<strong>-1.0</strong>)</span></td>
+      <td>65.1 <span>(<strong>-1.2</strong>)</span></td>
+      <td>85.1 <span>(<strong>-0.8</strong>)</span></td>
+      <td>84.6 <span>(<strong>-1.1</strong>)</span></td>
+      <td>84.0 <span>(<strong>-0.9</strong>)</span></td>
+      <td>90.9 <span>(<strong>-0.7</strong>)</span></td>
+      <td>90.5 <span>(<strong>-0.4</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>61.3 <span>(<strong>+1.8</strong>)</span></td>
+      <td>74.9 <span>(<strong>+1.8</strong>)</span></td>
+      <td>69.1 <span>(<strong>+2.8</strong>)</span></td>
+      <td>88.6 <span>(<strong>+2.7</strong>)</span></td>
+      <td>88.5 <span>(<strong>+2.8</strong>)</span></td>
+      <td>88.0 <span>(<strong>+3.1</strong>)</span></td>
+      <td>93.8 <span>(<strong>+2.2</strong>)</span></td>
+      <td>93.4 <span>(<strong>+2.5</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>64.2 <span>(<strong>+4.7</strong>)</span></td>
+      <td>77.3 <span>(<strong>+4.2</strong>)</span></td>
+      <td>71.5 <span>(<strong>+5.2</strong>)</span></td>
+      <td>89.9 <span>(<strong>+4.0</strong>)</span></td>
+      <td>89.7 <span>(<strong>+4.0</strong>)</span></td>
+      <td>89.0 <span>(<strong>+4.1</strong>)</span></td>
+      <td>94.7 <span>(<strong>+3.1</strong>)</span></td>
+      <td>94.2 <span>(<strong>+3.3</strong>)</span></td>
+    </tr>
+
+
+  </tbody>
+</table>
+
+*\*The extractive single-document subset of DuReader dataset is an internal data set*
+
+*\*The DRCD dataset is converted from Traditional Chinese to Simplified Chinese based on tool: https://github.com/skydark/nstools/tree/master/zhtools*
+
+\* *The pre-training data of ERNIE 1.0 Base does not contain instances whose length exceeds 128, whereas the other models are pre-trained with instances of length 512. This causes the poorer performance of ERNIE 1.0 Base on long-text tasks, so we released [ERNIE 1.0 Base(max-len-512)](https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz) on July 29th, 2019.*
+
+
+
+ - **DuReader**
+
+```text
+DuReader is a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, which is designed to address real-world MRC. This dataset was released in ACL2018 (He et al., 2018) by Baidu. In this dataset, questions and documents are based on Baidu Search and Baidu Zhidao, answers are manually generated.
+Our experiment was carried out on an extractive single-document subset of DuReader. The training set contained 15,763 documents and questions, and the validation set contained 1628 documents and questions. The goal was to extract continuous fragments from documents as answers. [url: https://arxiv.org/pdf/1711.05073.pdf]
+```
+
+ - **CMRC2018**
+
+```text
+CMRC2018 is an evaluation of Chinese extractive reading comprehension hosted by the Chinese Information Processing Society of China (CIPS-CL). [url: https://github.com/ymcui/cmrc2018]
+```
+
+ - **DRCD**
+
+```text
+DRCD is an open domain Traditional Chinese machine reading comprehension (MRC) dataset released by Delta Research Center. We translate this dataset to Simplified Chinese for our experiment. [url: https://github.com/DRCKnowledgeTeam/DRCD]
+```
+
+
+
+#### Results on Named Entity Recognition
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>MSRA-NER(SIGHAN2006)</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>f1-score</strong></center>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>94.0</td>
+      <td>92.6</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>95.0 <span>(<strong>+1.0</strong>)</span></td>
+      <td>93.8 <span>(<strong>+1.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>95.2 <span>(<strong>+1.2</strong>)</span></td>
+      <td>93.8 <span>(<strong>+1.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>96.3 <span>(<strong>+2.3</strong>)</span></td>
+      <td>95.0 <span>(<strong>+2.4</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+ - **MSRA-NER(SIGHAN2006)**
+
+```text
+MSRA-NER(SIGHAN2006) dataset is released by MSRA for recognizing the names of people, locations and organizations in text.
+```
+
+#### Results on Sentiment Analysis Task
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>ChnSentiCorp</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>acc</strong></center>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>94.6</td>
+      <td>94.3</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>95.2 <span>(<strong>+0.6</strong>)</span></td>
+      <td>95.4 <span>(<strong>+1.1</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>95.7 <span>(<strong>+1.1</strong>)</span></td>
+      <td>95.5 <span>(<strong>+1.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>96.1 <span>(<strong>+1.5</strong>)</span></td>
+      <td>95.8 <span>(<strong>+1.5</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+ - **ChnSentiCorp**
+
+```text
+ChnSentiCorp is a sentiment analysis dataset consisting of reviews on online shopping of hotels, notebooks and books.
+```
+
+#### Results on Question Answering Task
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="4"><center><strong>NLPCC2016-DBQA</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>mrr</strong></center>
+        <br></td>
+      <td colspan="2">
+        <center><strong>f1-score</strong></center>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>94.7</td>
+      <td>94.6</td>
+      <td>80.7</td>
+      <td>80.8</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>95.0 <span>(<strong>+0.3</strong>)</span></td>
+      <td>95.1 <span>(<strong>+0.5</strong>)</span></td>
+      <td>82.3 <span>(<strong>+1.6</strong>)</span></td>
+      <td>82.7 <span>(<strong>+1.9</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>95.7 <span>(<strong>+1.0</strong>)</span></td>
+      <td>95.7 <span>(<strong>+1.1</strong>)</span></td>
+      <td>84.7 <span>(<strong>+4.0</strong>)</span></td>
+      <td>85.3 <span>(<strong>+4.5</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>95.9 <span>(<strong>+1.2</strong>)</span></td>
+      <td>95.8 <span>(<strong>+1.2</strong>)</span></td>
+      <td>85.3 <span>(<strong>+4.6</strong>)</span></td>
+      <td>85.8 <span>(<strong>+5.0</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+ - **NLPCC2016-DBQA**
+
+```text
+NLPCC2016-DBQA is a sub-task of the NLPCC-ICCPOL 2016 Shared Task, hosted by NLPCC (Natural Language Processing and Chinese Computing). The task is to select documents from the candidates to answer the questions. [url: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf]
+```
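
The `mrr` column in the table above is mean reciprocal rank. As a reminder of how this metric is usually computed, here is a minimal sketch; the input layout (one list of 0/1 relevance flags per question, ordered by model score) and the function name are assumptions, not the evaluation script used for these results.

```python
def mean_reciprocal_rank(ranked_relevance):
    """Average of 1 / rank of the first relevant candidate for each question.

    ranked_relevance: one list per question, candidates sorted by descending
    model score, with 1 marking a correct candidate. Questions without any
    correct candidate contribute 0.
    """
    total = 0.0
    for labels in ranked_relevance:
        for rank, is_relevant in enumerate(labels, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


# First correct candidate at rank 2, rank 1, and never.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0], [0, 0, 0]]))
```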
+
+#### Results on Semantic Similarity
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>LCQMC</strong></center></th>
+      <th colspan="2"><center><strong>BQ Corpus</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>acc</strong></center></td>
+      <td colspan="2">
+        <center><strong>acc</strong></center></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>88.8</td>
+      <td>87.0</td>
+      <td>85.9</td>
+      <td>84.8</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>89.7 <span>(<strong>+0.9</strong>)</span></td>
+      <td>87.4 <span>(<strong>+0.4</strong>)</span></td>
+      <td>86.1 <span>(<strong>+0.2</strong>)</span></td>
+      <td>84.8</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>90.9 <span>(<strong>+2.1</strong>)</span></td>
+      <td>87.9 <span>(<strong>+0.9</strong>)</span></td>
+      <td>86.4 <span>(<strong>+0.5</strong>)</span></td>
+      <td>85.0 <span>(<strong>+0.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>90.9 <span>(<strong>+2.1</strong>)</span></td>
+      <td>87.9 <span>(<strong>+0.9</strong>)</span></td>
+      <td>86.5 <span>(<strong>+0.6</strong>)</span></td>
+      <td>85.2 <span>(<strong>+0.4</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+*\* You can apply to the dataset owners for LCQMC and BQ Corpus. For LCQMC: http://icrc.hitsz.edu.cn/info/1037/1146.htm; for BQ Corpus: http://icrc.hitsz.edu.cn/Article/show/175.html*
+
+ - **LCQMC**
+
+```text
+LCQMC is a Chinese question semantic matching corpus published in COLING2018. [url: http://aclweb.org/anthology/C18-1166]
+```
+
+ - **BQ Corpus**
+
+```text
+BQ Corpus(Bank Question corpus) is a Chinese corpus for sentence semantic equivalence identification. This dataset was published in EMNLP 2018. [url: https://www.aclweb.org/anthology/D18-1536]
+```
+
+
+## Usage
+  * [Install PaddlePaddle](#install-paddlepaddle)
+  * [Pre-trained Models &amp; Datasets](#pre-trained-models--datasets)
+     * [Models](#models)
+     * [Datasets](#datasets)
+        * [English Datasets](#english-datasets)
+        * [Chinese Datasets](#chinese-datasets)
+  * [Fine-tuning](#fine-tuning)
+     * [Batchsize and GPU Settings](#batchsize-and-gpu-settings)
+     * [Classification](#classification)
+        * [Single Sentence Classification Tasks](#single-sentence-classification-tasks)
+        * [Sentence Pair Classification Tasks](#sentence-pair-classification-tasks)
+     * [Sequence Labeling](#sequence-labeling)
+        * [Named Entity Recognition](#named-entity-recognition)
+     * [Machine Reading Comprehension](#machine-reading-comprehension)
+  * [Pre-training with ERNIE 1.0](#pre-training-with-ernie-10)
+     * [Data Preprocessing](#data-preprocessing)
+     * [PreTrain ERNIE1.0](#pretrain-ernie10)
+  * [FAQ](#faq)
+     * [FAQ1: How to get sentence/tokens embedding of ERNIE?](#faq1-how-to-get-sentencetokens-embedding-of-ernie)
+     * [FAQ2: How to predict on new data with Fine-tuning model?](#faq2-how-to-predict-on-new-data-with-fine-tuning-model)
+     * [FAQ3: Is the  argument batch_size for one GPU card or for all GPU cards?](#faq3-is-the--argument-batch_size-for-one-gpu-card-or-for-all-gpu-cards)
+     * [FAQ4: Can not find library: libcudnn.so. Please try to add the lib path to LD_LIBRARY_PATH.](#faq4-can-not-find-library-libcudnnso-please-try-to-add-the-lib-path-to-ld_library_path)
+     * [FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.](#faq5-can-not-find-library-libncclso-please-try-to-add-the-lib-path-to-ld_library_path)
+
+
+## Install PaddlePaddle
+
+This code base has been tested with Paddle Fluid 1.5.1 under Python 2.
+
+**\*Important\*** After installing Paddle Fluid, remember to update LD_LIBRARY_PATH for CUDA, cuDNN, and NCCL2. For more information, see [here](http://en.paddlepaddle.org/documentation/docs/en/1.5/beginners_guide/index_en.html) and [here](http://en.paddlepaddle.org/documentation/docs/en/1.5/beginners_guide/install/install_Ubuntu_en.html). If you encounter errors, see the FAQ at the end of this document.
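
Before launching training, it can help to confirm that the cuDNN and NCCL shared libraries are reachable through LD_LIBRARY_PATH. The following is a small sketch of such a check; `find_library` is a hypothetical helper written for this example, and it only inspects the directories listed in LD_LIBRARY_PATH, not the other locations the dynamic loader may search.

```python
import glob
import os


def find_library(prefix, search_path=None):
    """Return files whose names start with prefix in an LD_LIBRARY_PATH-style string.

    Hypothetical helper: defaults to the current LD_LIBRARY_PATH when
    search_path is not given.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory:
            hits.extend(sorted(glob.glob(os.path.join(directory, prefix + "*"))))
    return hits


for lib in ("libcudnn.so", "libnccl.so"):
    found = find_library(lib)
    print(lib, "->", found if found else "not found on LD_LIBRARY_PATH")
```

If either library is reported as not found, adding its directory to LD_LIBRARY_PATH is the remedy suggested by the error messages quoted in FAQ4 and FAQ5.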
+
+For beginners with PaddlePaddle, the following documentation explains how to install it:
+
+> - [Installation Manuals](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/install/index_en.html): Installation on Ubuntu/CentOS/Windows/MacOS is supported.
+
+If you already have some deep learning knowledge but are trying PaddlePaddle for the first time, the following model-building examples will speed up your learning:
+
+> - [Programming with Fluid](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/programming_guide/programming_guide_en.html): Core concepts and basic usage of Fluid.
+> - [Deep Learning Basics](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/index_en.html): This section covers various fields of fundamental deep learning knowledge, such as image classification, customized recommendation, and machine translation, with examples implemented in Fluid.
+
+For more information about PaddlePaddle, please refer to [PaddlePaddle Github](https://github.com/PaddlePaddle/Paddle) or the [Official Website](https://www.paddlepaddle.org.cn/).
+
+
+
+## Pre-trained Models & Datasets
+
+### Models
+
+| Model                                              | Description                                                 |
+| :------------------------------------------------- | :----------------------------------------------------------- |
+| [ERNIE 1.0 Base for Chinese](https://ernie.bj.bcebos.com/ERNIE_stable.tgz)                    | with params |
+| [ERNIE 1.0 Base for Chinese](https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz)       | with params, config and vocabs|
+| [ERNIE 1.0 Base for Chinese(max-len-512)](https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz)    | with params, config and vocabs|
+| [ERNIE 2.0 Base for English](https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz)   | with params, config and vocabs |
+| [ERNIE 2.0 Large for English](https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz) | with params, config and vocabs |
+
+### Datasets
+
+#### English Datasets
+
+Download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it into a directory `${TASK_DATA_PATH}`.
+
+After the dataset is downloaded, run `sh ./script/en_glue/preprocess/cvt.sh $TASK_DATA_PATH` to convert the data into the format used for training. If everything goes well, a folder named `glue_data_processed` will be created containing all the converted data.
+
+#### Chinese Datasets
+
+You can download the Chinese datasets from [here](https://ernie.bj.bcebos.com/task_data_zh.tgz).
+
+
+
+## Fine-tuning
+
+### Batchsize and GPU Settings
+
+In our experiments, we found that batch size matters and that the best value differs from task to task. To make our results easier to reproduce, we list the batch size and number of GPU cards used for each task:
+
+| Dataset      | Batch Size      | GPU                 |
+| ------------ | --------------- | ------------------- |
+| CoLA         | 32 / 64(base)   | 1                   |
+| SST-2        | 64 / 256(base)  | 8                   |
+| STS-B        | 128             | 8                   |
+| QQP          | 256             | 8                   |
+| MNLI         | 256 / 512(base) | 8                   |
+| QNLI         | 256             | 8                   |
+| RTE          | 16 / 4(base)    | 1                   |
+| MRPC         | 16 / 32(base)   | 2                   |
+| WNLI         | 8               | 1                   |
+| XNLI         | 65536 (tokens) | 8                   |
+| CMRC2018     | 64              | 8 (large) / 4(base) |
+| DRCD         | 64              | 8 (large) / 4(base) |
+| MSRA-NER(SIGHAN2006)     | 16              | 1                   |
+| ChnSentiCorp | 24              | 1                   |
+| LCQMC        | 32              | 1                   |
+| BQ Corpus    | 64              | 1                   |
+| NLPCC2016-DBQA         | 64              | 8                   |
+
+\* *For MNLI and QNLI, we used 32GB V100 GPUs; for the other tasks, we used 22GB P40 GPUs.*
+
+### Classification
+
+#### Single Sentence Classification Tasks
+
+The code for classification/regression fine-tuning is in `run_classifier.py`. We also provide a shell script for each task, including the best hyperparameters.
+
+Take the English task `SST-2` and the Chinese task `ChnSentiCorp` as examples.
+
+Step 1: Download the model and unpack it into `${MODEL_PATH}`. If everything goes well, there should be a folder named `params` in `${MODEL_PATH}`.
+
+Step 2: Download the dataset and unpack it into `${TASK_DATA_PATH}`. For English tasks, there should be 9 folders named `CoLA`, `MNLI`, `MRPC`, `QNLI`, `QQP`, `RTE`, `SST-2`, `STS-B`, `WNLI`; for Chinese tasks, there should be 5 folders named `lcqmc`, `xnli`, `msra-ner`, `chnsenticorp`, `nlpcc-dbqa` in `${TASK_DATA_PATH}`.
+
+Step 3: Follow the instructions below for your task type to start training.
+
+ Take `SST-2` as an example. The path of its training set should be `${TASK_DATA_PATH}/SST-2/train.tsv`, and the data should be in tsv format with 2 fields: `text_a  label`. Here is some example data:
+
+ ```
+label  text_a
+...
+0   hide new secretions from the parental units
+0   contains no wit , only labored gags
+1   that loves its characters and communicates something rather beautiful about human nature
+0   remains utterly satisfied to remain the same throughout
+0   on the worst revenge-of-the-nerds clichés the filmmakers could dredge up
+0   that 's far too tragic to merit such superficial treatment
+1   demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop .
+1   of saucy
+...
+ ```
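
A minimal sketch of reading such a file, assuming the fields are tab-separated and the first row is a header, as in the sample above. The helper name `read_tsv` and the in-memory sample are illustrative and are not part of the repository.

```python
import csv
import io


def read_tsv(handle):
    """Yield one dict per data row, keyed by the field names in the header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Two rows copied from the sample above. Real data would be read from
# ${TASK_DATA_PATH}/SST-2/train.tsv opened with encoding="utf-8".
sample = io.StringIO(
    "label\ttext_a\n"
    "0\tcontains no wit , only labored gags\n"
    "1\tof saucy\n"
)
rows = list(read_tsv(sample))
print(rows[0]["label"], rows[0]["text_a"])
```

Keying each row by the header names means the reader does not depend on the order in which `label` and `text_a` appear in the file.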
+
+
+
+Before running the scripts, set the following environment variables:
+
+```
+export TASK_DATA_PATH=(the value of ${TASK_DATA_PATH} mentioned above)
+export MODEL_PATH=(the value of ${MODEL_PATH} mentioned above)
+```
+
+
+
+Run `sh script/en_glue/ernie_large/SST-2/task.sh` for fine-tuning. Some logs are shown below:
+
+```
+epoch: 3, progress: 22456/67349, step: 3500, ave loss: 0.015862, ave acc: 0.984375, speed: 1.328810 steps/s
+[dev evaluation] ave loss: 0.174793, acc:0.957569, data_num: 872, elapsed time: 15.314256 s file: ./data/dev.tsv, epoch: 3, steps: 3500
+testing ./data/test.tsv, save to output/test_out.tsv
+```
+
+
+
+Similarly, for the Chinese task `ChnSentiCorp`, after setting the environment variables, run `sh script/zh_task/ernie_base/run_ChnSentiCorp.sh`. Some logs are shown below:
+
+```
+[dev evaluation] ave loss: 0.303819, acc:0.943333, data_num: 1200, elapsed time: 16.280898 s, file: ./task_data/chnsenticorp/dev.tsv, epoch: 9, steps: 4001
+[dev evaluation] ave loss: 0.228482, acc:0.958333, data_num: 1200, elapsed time: 16.023091 s, file: ./task_data/chnsenticorp/test.tsv, epoch: 9, steps: 4001
+```
+
+
+
+#### Sentence Pair Classification Tasks
+
+Take `RTE` as an example. The data should be in tsv format with 3 fields: `text_a    text_b   label`. Here is some example data:
+```
+text_a  text_b  label
+Oil prices fall back as Yukos oil threat lifted   Oil prices rise.    0
+No Weapons of Mass Destruction Found in Iraq Yet.   Weapons of Mass Destruction Found in Iraq.  0
+Iran is said to give up al Qaeda members.   Iran hands over al Qaeda members.   1
+Sani-Seat can offset the rising cost of paper products  The cost of paper is rising.    1
+```
+
+The path of its training set should be `${TASK_DATA_PATH}/RTE/train.tsv`.
+
+Before running the scripts, set the environment variables as before:
+
+```
+export TASK_DATA_PATH=(the value of ${TASK_DATA_PATH} mentioned above)
+export MODEL_PATH=(the value of ${MODEL_PATH} mentioned above)
+```
+
+Run `sh script/en_glue/ernie_large/RTE/task.sh` for fine-tuning. Some logs are shown below:
+
+```
+epoch: 4, progress: 2489/2490, step: 760, ave loss: 0.000729, ave acc: 1.000000, speed: 1.221889 steps/s
+train pyreader queue size: 9, learning rate: 0.000000
+epoch: 4, progress: 2489/2490, step: 770, ave loss: 0.000833, ave acc: 1.000000, speed: 1.246080 steps/s
+train pyreader queue size: 0, learning rate: 0.000000
+epoch: 4, progress: 2489/2490, step: 780, ave loss: 0.000786, ave acc: 1.000000, speed: 1.265365 steps/s
+validation result of dataset ./data/dev.tsv:
+[dev evaluation] ave loss: 0.898279, acc:0.851986, data_num: 277, elapsed time: 6.425834 s file: ./data/dev.tsv, epoch: 4, steps: 781
+testing ./data/test.tsv, save to output/test_out.5.2019-07-23-15-25-06.tsv.4.781
+```
+
+
+
+
+### Sequence Labeling
+
+#### Named Entity Recognition
+
+ Take `MSRA-NER(SIGHAN2006)` as an example. The data should be in tsv format with 2 fields: `text_a  label`. Here is some example data:
+ ```
+text_a  label
+在 这 里 恕 弟 不 恭 之 罪 ， 敢 在 尊 前 一 诤 ： 前 人 论 书 ， 每 曰 “ 字 字 有 来 历 ， 笔 笔 有 出 处 ” ， 细 读 公 字 ， 何 尝 跳 出 前 人 藩 篱 ， 自 隶 变 而 后 ， 直 至 明 季 ， 兄 有 何 新 出 ？    O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
+相 比 之 下 ， 青 岛 海 牛 队 和 广 州 松 日 队 的 雨 中 之 战 虽 然 也 是 0 ∶ 0 ， 但 乏 善 可 陈 。   O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG I-ORG I-ORG I-ORG O O O O O O O O O O O O O O O O O O O
+理 由 多 多 ， 最 无 奈 的 却 是 ： 5 月 恰 逢 双 重 考 试 ， 她 攻 读 的 博 士 学 位 论 文 要 通 考 ； 她 任 教 的 两 所 学 校 ， 也 要 在 这 段 时 日 大 考 。    O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
+ ```
+
+Again, remember to set the environment variables as above, then run `sh script/zh_task/ernie_base/run_msra_ner.sh` for fine-tuning. Some logs are shown below:
+
+```
+[dev evaluation] f1: 0.951949, precision: 0.944636, recall: 0.959376, elapsed time: 19.156693 s
+[test evaluation] f1: 0.937390, precision: 0.925988, recall: 0.949077, elapsed time: 36.565929 s
+```
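
As a quick sanity check on logs like these, the reported `f1` can be recomputed from the reported precision and recall. The sketch below assumes `f1` is the usual harmonic mean of the two; the function name `f1_score` is illustrative and is not taken from the repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the [dev evaluation] and
# [test evaluation] log lines above.
dev_f1 = f1_score(0.944636, 0.959376)
test_f1 = f1_score(0.925988, 0.949077)
print(f"{dev_f1:.6f} {test_f1:.6f}")
```

Under that assumption, both values agree with the logged `f1` figures up to rounding.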
+
+### Machine Reading Comprehension
+
+
+ Take `DRCD` as an example. First, convert the data into SQuAD format:
+ ```
+{
+  "version": "1.3",
+  "data": [
+    {
+      "paragraphs": [
+        {
+          "id": "1001-11",
+          "context": "广州是京广铁路、广深铁路、广茂铁路、广梅汕铁路的终点站。2009年末，武广客运专线投入运营，多单元列车覆盖980公里的路程，最高时速可达350公里/小时。2011年1月7日，广珠城际铁路投入运营，平均时速可达200公里/小时。广州铁路、长途汽车和渡轮直达香港，广九直通车从广州东站开出，直达香港九龙红磡站，总长度约182公里，车程在两小时内。繁忙的长途汽车每年会从城市中的不同载客点把旅客接载至香港。在珠江靠市中心的北航道有渡轮线路，用于近江居民直接渡江而无需乘坐公交或步行过桥。南沙码头和莲花山码头间每天都有高速双体船往返，渡轮也开往香港中国客运码头和港澳码头。",
+          "qas": [
+            {
+              "question": "广珠城际铁路平均每小时可以走多远？",
+              "id": "1001-11-1",
+              "answers": [
+                {
+                  "text": "200公里",
+                  "answer_start": 104,
+                  "id": "1"
+                }
+              ]
+            }
+          ]
+        }
+      ],
+      "id": "1001",
+      "title": "广州"
+    }
+  ]
+}
+ ```
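
After converting a dataset into this layout, it can help to confirm that every `answer_start` really points at its answer text. The following is a minimal sketch, assuming `answer_start` is a 0-based character offset into `context` (the usual SQuAD convention). The function name `misaligned_answers` and the toy record are hypothetical.

```python
import json


def misaligned_answers(squad):
    """Return ids of questions whose answer text is not found at answer_start."""
    bad = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        bad.append(qa["id"])
    return bad


# Hypothetical toy record with the same layout as the DRCD sample above.
toy = json.loads("""
{
  "version": "1.3",
  "data": [{
    "id": "t1", "title": "toy",
    "paragraphs": [{
      "id": "t1-1",
      "context": "The train runs at 200 km per hour.",
      "qas": [
        {"id": "t1-1-1", "question": "How fast is the train?",
         "answers": [{"id": "1", "text": "200 km", "answer_start": 18}]},
        {"id": "t1-1-2", "question": "How fast is the train?",
         "answers": [{"id": "1", "text": "200 km", "answer_start": 17}]}
      ]
    }]
  }]
}
""")
print(misaligned_answers(toy))
```

Offsets are especially worth re-checking after any text normalization, such as a script conversion, since that can shift character positions.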
+
+Again, remember to set the environment variables as above, then run `sh script/zh_task/ernie_base/run_drcd.sh` for fine-tuning. Some logs are shown below:
+
+```
+[dev evaluation] em: 88.450624, f1: 93.749887, avg: 91.100255, question_num: 3524
+[test evaluation] em: 88.061838, f1: 93.520152, avg: 90.790995, question_num: 3493
+```
+
+
+## Pre-training with ERNIE 1.0
+
+### Data Preprocessing
+
+We construct the training dataset based on [Baidu Baike](https://en.wikipedia.org/wiki/Baidu_Baike), [Baidu Knows(Baidu Zhidao)](https://en.wikipedia.org/wiki/Baidu_Knows), [Baidu Tieba](https://en.wikipedia.org/wiki/Baidu_Tieba) for Chinese version ERNIE, and [Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Database_download), [Reddit](https://en.wikipedia.org/wiki/Reddit), [BookCorpus](https://github.com/soskek/bookcorpus) for English version ERNIE.
+
+For the Chinese dataset, we use an internal Baidu word segmentation tool to label the corpora at different granularities, such as character, word, and entity. We then tokenize with the `CharTokenizer` class in [`tokenization.py`](tokenization.py) to obtain word boundaries. Finally, the words are mapped to ids according to the vocabulary [`config/vocab.txt`](config/vocab.txt). During training, we randomly mask words based on the boundary information.
+
+Here are some training instances after processing (they can be found in [`data/demo_train_set.gz`](./data/demo_train_set.gz) and [`data/demo_valid_set.gz`](./data/demo_valid_set.gz)). Each line corresponds to one training instance:
+
+```
+1 1048 492 1333 1361 1051 326 2508 5 1803 1827 98 164 133 2777 2696 983 121 4 19 9 634 551 844 85 14 2476 1895 33 13 983 121 23 7 1093 24 46 660 12043 2 1263 6 328 33 121 126 398 276 315 5 63 44 35 25 12043 2;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55;-1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 -1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 -1;0
+```
+
+Each instance consists of 5 fields joined by `;` on one line: `token_ids; sentence_type_ids; position_ids; seg_labels; next_sentence_label`. In the `seg_labels` field, 0 marks the beginning of a word, 1 marks a non-initial position within a word, and -1 is a placeholder whose corresponding token is `CLS` or `SEP`.
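
The line layout can be illustrated with a small parser. This is a sketch under the assumptions stated in the text (five `;`-joined fields, space-separated integers, and `seg_labels` values of 0, 1, or -1). The names `parse_instance` and `word_spans` are hypothetical, and the sample line is made up rather than taken from the demo set.

```python
FIELD_NAMES = ("token_ids", "sentence_type_ids", "position_ids", "seg_labels")


def parse_instance(line):
    """Split one `;`-joined training line into integer lists plus the label."""
    parts = line.strip().split(";")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    instance = {name: [int(x) for x in part.split()]
                for name, part in zip(FIELD_NAMES, parts[:4])}
    instance["next_sentence_label"] = int(parts[4])
    if len({len(instance[name]) for name in FIELD_NAMES}) != 1:
        raise ValueError("per-token fields differ in length")
    return instance


def word_spans(seg_labels):
    """Group positions into words: 0 starts a word, 1 continues it, -1 ends it."""
    spans, current = [], None
    for index, label in enumerate(seg_labels):
        if label == 0:
            current = [index]
            spans.append(current)
        elif label == 1 and current is not None:
            current.append(index)
        else:
            current = None
    return spans


# Shortened, made-up line in the same layout as the instance shown above.
line = "1 1048 492 1333 2;0 0 0 0 0;0 1 2 3 4;-1 0 1 0 -1;0"
instance = parse_instance(line)
print(word_spans(instance["seg_labels"]))
```

Spans recovered this way are the units that word-level masking would hide together, instead of masking single positions independently.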
+
+### PreTrain ERNIE 1.0
+
+The entry point for pre-training is [`script/zh_task/pretrain.sh`](./script/zh_task/pretrain.sh). Before running it, remember to add the CUDA, cuDNN, and NCCL2 library paths to the environment variable LD_LIBRARY_PATH.
+
+Run `sh script/zh_task/pretrain.sh` to start pre-training with the default parameters.
+
+Here are some logs from pre-training, including the learning rate, epoch, step, loss, and training speed. The information is printed at the interval given by the command-line parameter `--validation_steps`.
+
+```
+current learning_rate:0.000001
+epoch: 1, progress: 1/1, step: 30, loss: 10.540648, ppl: 19106.925781, next_sent_acc: 0.625000, speed: 0.849662 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
+feed_queue size 70
+current learning_rate:0.000001
+epoch: 1, progress: 1/1, step: 40, loss: 10.529287, ppl: 18056.654297, next_sent_acc: 0.531250, speed: 0.849549 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
+feed_queue size 70
+current learning_rate:0.000001
+epoch: 1, progress: 1/1, step: 50, loss: 10.360563, ppl: 16398.287109, next_sent_acc: 0.625000, speed: 0.843776 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
+```
+
+
+
+## FAQ
+
+### FAQ1: How to get sentence/tokens embedding of ERNIE?
+
+Run `ernie_encoder.py` to get both the sentence embedding and the token embeddings. The input data format should be the same as described in the [Fine-tuning](#fine-tuning) chapter.
+
+Here is an example that gets the sentence embedding and the token embeddings for the LCQMC dev set:
+
+```
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0
+
+python -u ernie_encoder.py \
+                   --use_cuda true \
+                   --batch_size 32 \
+                   --output_dir "./test" \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --data_set ${TASK_DATA_PATH}/lcqmc/dev.tsv \
+                   --vocab_path ${MODEL_PATH}/vocab.txt \
+                   --max_seq_len 128 \
+                   --ernie_config_path ${MODEL_PATH}/ernie_config.json
+```
+
+When the script finishes, `cls_emb.npy` and `top_layer_emb.npy` will be generated in the `test` folder, holding the sentence embedding and the token embeddings respectively.
+
+
+
+### FAQ2: How to predict on new data with Fine-tuning model?
+
+Take classification tasks for example, here is the script for batch prediction:
+
+```
+python -u predict_classifier.py \
+       --use_cuda true \
+       --batch_size 32 \
+       --vocab_path ${MODEL_PATH}/vocab.txt \
+       --init_checkpoint "./checkpoints/step_100" \
+       --do_lower_case true \
+       --max_seq_len 128 \
+       --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+       --do_predict true \
+       --predict_set ${TASK_DATA_PATH}/lcqmc/test.tsv \
+       --num_labels 2
+```
+
+The argument `init_checkpoint` is the path of the model, `predict_set` is the path of the test file, and `num_labels` is the number of target labels.
+
+**Note**: `predict_set` should be a tsv file with two fields named `text_a` and `text_b` (optional).
+
+
+
+### FAQ3: Is the  argument batch_size for one GPU card or for all GPU cards?
+
+For one GPU card.
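
Since `batch_size` is per card, the number of samples consumed in one step grows with the number of cards. The sketch below assumes plain data-parallel training in which every card processes its own batch at each step; the function name `global_batch_size` and the numbers are illustrative only.

```python
def global_batch_size(per_card_batch_size, num_cards):
    """Samples consumed per step across all cards under data parallelism."""
    if per_card_batch_size <= 0 or num_cards <= 0:
        raise ValueError("both arguments must be positive")
    return per_card_batch_size * num_cards


# Illustrative numbers only, not a recommended setting.
print(global_batch_size(32, 8))
```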
+
+
+
+### FAQ4: Can not find library: libcudnn.so. Please try to add the lib path to LD_LIBRARY_PATH.
+
+Add the cuDNN library path to LD_LIBRARY_PATH, e.g.: `export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v[your cudnn version]/cuda/lib64`
+
+
+
+### FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.
+
+Download [NCCL2](https://developer.nvidia.com/nccl/nccl-download), and export the library path to LD_LIBRARY_PATH, e.g.: `export LD_LIBRARY_PATH=/home/work/nccl/lib`
--- a/README.zh.md
+++ b/README.zh.md
+[English](./README.md) | 简体中文
+
+## ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
+
+
+  * [Continual Pre-training Framework for Language Understanding](#可持续学习语义理解框架)
+  * [Pre-training Tasks](#pre-training-任务)
+     * [Word-aware Tasks](#word-aware-tasks)
+        * [Knowledge Masking Task](#knowledge-masking-task)
+        * [Capitalization Prediction Task](#capitalization-prediction-task)
+        * [Token-Document Relation Prediction Task](#token-document-relation-prediction-task)
+     * [Structure-aware Tasks](#structure-aware-tasks)
+        * [Sentence Reordering Task](#sentence-reordering-task)
+        * [Sentence Distance Task](#sentence-distance-task)
+     * [Semantic-aware Tasks](#semantic-aware-tasks)
+        * [Discourse Relation Task](#discourse-relation-task)
+        * [IR Relevance Task](#ir-relevance-task)
+  * [ERNIE 1.0: <strong>E</strong>nhanced <strong>R</strong>epresentation through k<strong>N</strong>owledge <strong>I</strong>nt<strong>E</strong>gration](#ernie-10-enhanced-representation-through-knowledge-integration)
+  * [Results on Chinese Datasets](#中文效果验证)
+  * [Results on English Datasets](#英文效果验证)
+
+
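+### <a name="可持续学习语义理解框架"></a>Continual Pre-training Framework for Language Understanding
+
+**[ERNIE 2.0](https://arxiv.org/abs/1907.12412v1)** is a continual pre-training framework for language understanding in which pre-training tasks are built incrementally and learned through multi-task learning. In **[ERNIE 2.0](https://arxiv.org/abs/1907.12412v1)**, newly constructed pre-training task types can be added to the training framework seamlessly, so the model keeps learning language understanding continually. With newly added semantic tasks such as entity prediction, sentence causal-relation judgment, and reconstruction of the sentence structure of an article, the **[ERNIE 2.0](https://arxiv.org/abs/1907.12412v1)** pre-trained language understanding model captures lexical, syntactic, and semantic information from the training data, greatly strengthening its general-purpose semantic representations.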
+
+![ernie2.0_arch](.metas/ernie2.0_arch.png)
+
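+We compare **ERNIE 2.0** with existing SOTA pre-trained models on **9 Chinese datasets** and on the **English GLUE benchmark**. The results show that **ERNIE 2.0** outperforms **BERT** and **XLNet** on almost all English tasks, achieving the best results on 7 GLUE tasks; on Chinese, **ERNIE 2.0** outperforms **BERT** on all 9 Chinese NLP tasks.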
+
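+### <a name="pre-training-任务"></a>Pre-training Tasks
+
+For ERNIE 2.0, we construct several pre-training tasks that aim to better capture the information contained in the training corpus at 3 levels:
+
+- **Word-aware Tasks**: learn lexical-level information
+- **Structure-aware Tasks**: learn syntactic-level information
+- **Semantic-aware Tasks**: learn semantic-level information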
+
+![ernie2.0_model](.metas/ernie2.0_model.png)
+
+
+#### Word-aware Tasks
+
+##### Knowledge Masking Task
+
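+- The knowledge-enhanced masking strategy over phrases and named entities already introduced in [ERNIE 1.0](https://arxiv.org/abs/1904.09223). Compared with sub-word masking, this strategy better captures both the local and the global semantic information of the input.
+
+##### Capitalization Prediction Task
+
+- Capitalized English words (such as Apple) carry special semantic information, so for English pre-training we construct a classification task that learns whether a word is capitalized.
+
+##### Token-Document Relation Prediction Task
+
+- For a word that appears in a segment, predict whether it also appears in other segments of the original document.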
+
+#### Structure-aware Tasks
+
+##### Sentence Reordering Task
+
+- Given a paragraph (containing M segments), we randomly shuffle the order of the segments and use a classification task to predict which shuffled order was applied.
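
One way to picture such a task is to give every permutation of the M segments a class id and use the id of the applied permutation as the label. The sketch below is only an illustration under that assumption; the actual label scheme used in ERNIE 2.0 pre-training may differ, and the name `make_reordering_example` is hypothetical.

```python
import itertools
import random


def make_reordering_example(segments, rng):
    """Shuffle segments and return them with the index of the permutation used."""
    permutations = list(itertools.permutations(range(len(segments))))
    label = rng.randrange(len(permutations))
    shuffled = [segments[i] for i in permutations[label]]
    return shuffled, label, permutations


rng = random.Random(0)
segments = ["s0", "s1", "s2"]
shuffled, label, permutations = make_reordering_example(segments, rng)

# Knowing the class id is enough to put the segments back in order.
restored = [None] * len(segments)
for position, source_index in enumerate(permutations[label]):
    restored[source_index] = shuffled[position]
print(shuffled, label, restored)
```

The number of classes is M! in this scheme, so it is only practical for small M.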
+
+##### Sentence Distance Task
+
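+- A 3-class classification task judges the positional relationship between the sentences of a sentence pair, to better model semantic relatedness.
+
+#### Semantic-aware Tasks
+
+##### Discourse Relation Task
+
+- By judging the semantic and rhetorical relation between the sentences of a pair, the model better learns inter-sentence semantics.
+
+##### IR Relevance Task
+
+- Learn weakly supervised IR relevance signals to better model the relevance of sentence pairs.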
+
+
+### ERNIE 1.0: **E**nhanced **R**epresentation through k**N**owledge **I**nt**E**gration
+
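+**[ERNIE 1.0](https://arxiv.org/abs/1904.09223)** learns real-world semantic knowledge by modeling the words, entities, and entity relations in massive data. Whereas **BERT** learns from raw language signals, **ERNIE** directly models prior semantic knowledge units, which strengthens the model's semantic representations.
+
+Here is an example: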
+
+```Learnt by BERT ：哈 [mask] 滨是 [mask] 龙江的省会，[mask] 际冰 [mask] 文化名城。```
+
+```Learnt by ERNIE：[mask] [mask] [mask] 是黑龙江的省会，国际 [mask] [mask] 文化名城。```
+
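+In **BERT**, the character 『尔』 can be predicted from the local co-occurrence of 『哈』 and 『滨』 alone, so the model learns nothing about 『哈尔滨』 (Harbin) itself. **ERNIE**, by learning representations of words and entities, can model the relationship between 『哈尔滨』 (Harbin) and 『黑龙江』 (Heilongjiang), learning that Harbin is the capital of Heilongjiang and that Harbin is a city of ice and snow.
+
+As for training data, in addition to Chinese encyclopedia and news corpora, **ERNIE** also uses forum dialogue data. It models the Query-Response dialogue structure with **DLM** (Dialogue Language Model): dialogue pairs are taken as input, a Dialogue Embedding identifies the role of each speaker, and a Dialogue Response Loss is used to learn the implicit relations within a dialogue, further improving the model's semantic representations.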
+
+
+
+## Release Notes
+- 2019-07-30: released ERNIE 2.0
+- 2019-04-10: updated ERNIE_stable-1.0.1.tar.gz, which packages the model parameters, the config ernie_config.json, and vocab.txt
+- 2019-03-18: updated ERNIE_stable.tgz
+- 2019-03-15: released ERNIE 1.0
+
+## Communication
+
+- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
+- QQ discussion group: 760439550 (ERNIE discussion group).
+- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
+
+
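+## <a name="中文效果验证"></a>Results on Chinese Datasets
+
+We evaluate the ERNIE 2.0 Chinese model on 9 tasks: natural language inference (XNLI); machine reading comprehension (DRCD, DuReader, CMRC2018); named entity recognition (MSRA-NER (SIGHAN2006)); sentiment analysis (ChnSentiCorp); semantic similarity (BQ Corpus, LCQMC); and question answering (NLPCC2016-DBQA). The tasks and results are described in the sections below.
+
+
+
+### Natural Language Inference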
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>XNLI</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Evaluation</strong></p>
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>acc</strong></center>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td>
+        <strong>BERT Base
+          <br></strong></td>
+      <td>78.1</td>
+      <td>77.2</td>
+    </tr>
+    <tr>
+      <td>
+        <strong>ERNIE 1.0 Base
+          <br></strong></td>
+      <td>79.9 <span>(<strong>+1.8</strong>)</span></td>
+      <td>78.4 <span>(<strong>+1.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td>
+        <strong>ERNIE 2.0 Base
+          <br></strong></td>
+      <td>81.2 <span>(<strong>+3.1</strong>)</span></td>
+      <td>79.7 <span>(<strong>+2.5</strong>)</span></td>
+    </tr>
+    <tr>
+      <td>
+        <strong>ERNIE 2.0 Large
+          <br></strong></td>
+      <td>82.6 <span>(<strong>+4.5</strong>)</span></td>
+      <td>81.0 <span>(<strong>+3.8</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+ - **XNLI**
+
+```text
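+XNLI is a natural language inference dataset built jointly by researchers at Facebook and New York University, with data in 15 languages. We use its Chinese portion to evaluate the model's language understanding ability. [Link: https://github.com/facebookresearch/XNLI]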
+```
+
+### Machine Reading Comprehension
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>DuReader</strong></center></th>
+      <th colspan="2"><center><strong>CMRC2018</strong></center></th>
+      <th colspan="4"><strong>DRCD</strong></th></tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Evaluation</strong></p>
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="1">
+        <center><strong>em</strong></center>
+        <br></td>
+      <td colspan="1">
+        <strong>f1-score</strong>
+        <br></td>
+      <td colspan="1">
+        <strong>em</strong>
+        <br></td>
+      <td colspan="1">
+        <strong>f1-score</strong>
+        <strong></strong>
+        <br></td>
+      <td colspan="2">
+        <strong>em</strong>
+        <br></td>
+      <td colspan="2">
+        <strong>f1-score</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="2" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="2" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>59.5</td>
+      <td>73.1</td>
+      <td>66.3</td>
+      <td>85.9</td>
+      <td>85.7</td>
+      <td>84.9</td>
+      <td>91.6</td>
+      <td>90.9</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>57.9 <span>(<strong>-1.6</strong>)</span></td>
+      <td>72.1 <span>(<strong>-1.0</strong>)</span></td>
+      <td>65.1 <span>(<strong>-1.2</strong>)</span></td>
+      <td>85.1 <span>(<strong>-0.8</strong>)</span></td>
+      <td>84.6 <span>(<strong>-1.1</strong>)</span></td>
+      <td>84.0 <span>(<strong>-0.9</strong>)</span></td>
+      <td>90.9 <span>(<strong>-0.7</strong>)</span></td>
+      <td>90.5 <span>(<strong>-0.4</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>61.3 <span>(<strong>+1.8</strong>)</span></td>
+      <td>74.9 <span>(<strong>+1.8</strong>)</span></td>
+      <td>69.1 <span>(<strong>+2.8</strong>)</span></td>
+      <td>88.6 <span>(<strong>+2.7</strong>)</span></td>
+      <td>88.5 <span>(<strong>+2.8</strong>)</span></td>
+      <td>88.0 <span>(<strong>+3.1</strong>)</span></td>
+      <td>93.8 <span>(<strong>+2.2</strong>)</span></td>
+      <td>93.4 <span>(<strong>+2.5</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>64.2 <span>(<strong>+4.7</strong>)</span></td>
+      <td>77.3 <span>(<strong>+4.2</strong>)</span></td>
+      <td>71.5 <span>(<strong>+5.2</strong>)</span></td>
+      <td>89.9 <span>(<strong>+4.0</strong>)</span></td>
+      <td>89.7 <span>(<strong>+4.0</strong>)</span></td>
+      <td>89.0 <span>(<strong>+4.1</strong>)</span></td>
+      <td>94.7 <span>(<strong>+3.1</strong>)</span></td>
+      <td>94.2 <span>(<strong>+3.3</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
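+\* *The extractive, single-document subset of DuReader used in the experiments is an internal dataset.*
+
+\* *For the experiments, the DRCD data was converted from Traditional to Simplified Chinese; conversion tool: https://github.com/skydark/nstools/tree/master/zhtools*
+
+\* *ERNIE 1.0 was pre-trained with a sequence length of 128, whereas the other models were trained on sequences of length 512, which makes ERNIE 1.0 Base perform worse on long-text tasks. For this reason we released the [ERNIE 1.0 Base (max-len-512) model](https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz) (2019-07-29)*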
+
+ - **DuReader**
+
+```text
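+DuReader is a machine reading comprehension dataset released by Baidu at ACL 2018. All questions and source passages come from Baidu Search data and the Baidu Zhidao Q&A community, and the answers were compiled by humans. The experiments use the single-document, extractive subset of DuReader, whose training set contains 15763 documents and questions and whose dev set contains 1628 documents and questions. The goal is to extract a contiguous span from the passage as the answer. [Link: https://arxiv.org/pdf/1711.05073.pdf]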
+```
+
+ - **CMRC2018**
+
+```text
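+CMRC2018 is an evaluation organized by the Chinese Information Processing Society of China; its task is extractive reading comprehension. [Link: https://github.com/ymcui/cmrc2018]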
+```
+
+ - **DRCD**
+
+```text
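+DRCD is a Traditional Chinese reading comprehension dataset released by Delta Research Center. The goal is to extract a contiguous span from the passage as the answer. We convert it to Simplified Chinese before running the experiments. [Link: https://github.com/DRCKnowledgeTeam/DRCD]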
+```
+
+
+
+### Named Entity Recognition
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>MSRA-NER(SIGHAN2006)</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Evaluation</strong></p>
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>f1-score</strong></center>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>94.0</td>
+      <td>92.6</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>95.0 <span>(<strong>+1.0</strong>)</span></td>
+      <td>93.8 <span>(<strong>+1.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>95.2 <span>(<strong>+1.2</strong>)</span></td>
+      <td>93.8 <span>(<strong>+1.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>96.3 <span>(<strong>+2.3</strong>)</span></td>
+      <td>95.0 <span>(<strong>+2.4</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+ - **MSRA-NER(SIGHAN2006)**
+
+```text
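+The MSRA-NER(SIGHAN2006) dataset was released by Microsoft Research Asia. The goal is to recognize entities with specific meanings in text, including person names, place names, and organization names.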
+```
+
+
+
+### Sentiment Analysis
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>ChnSentiCorp</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Evaluation</strong></p>
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>acc</strong></center>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>94.6</td>
+      <td>94.3</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>95.2 <span>(<strong>+0.6</strong>)</span></td>
+      <td>95.4 <span>(<strong>+1.1</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>95.7 <span>(<strong>+1.1</strong>)</span></td>
+      <td>95.5 <span>(<strong>+1.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>96.1 <span>(<strong>+1.5</strong>)</span></td>
+      <td>95.8 <span>(<strong>+1.5</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+ - **ChnSentiCorp**
+
+```text
+ChnSentiCorp is a Chinese sentiment analysis dataset of online shopping reviews covering hotels, laptops, and books.
+```
+
+
+
+### Question Answering
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="4"><center><strong>NLPCC2016-DBQA</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Evaluation</strong></p>
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>mrr</strong></center>
+        <br></td>
+      <td colspan="2">
+        <center><strong>f1-score</strong></center>
+        <br></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>94.7</td>
+      <td>94.6</td>
+      <td>80.7</td>
+      <td>80.8</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>95.0 <span>(<strong>+0.3</strong>)</span></td>
+      <td>95.1 <span>(<strong>+0.5</strong>)</span></td>
+      <td>82.3 <span>(<strong>+1.6</strong>)</span></td>
+      <td>82.7 <span>(<strong>+1.9</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>95.7 <span>(<strong>+1.0</strong>)</span></td>
+      <td>95.7 <span>(<strong>+1.1</strong>)</span></td>
+      <td>84.7 <span>(<strong>+4.0</strong>)</span></td>
+      <td>85.3 <span>(<strong>+4.5</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>95.9 <span>(<strong>+1.2</strong>)</span></td>
+      <td>95.8 <span>(<strong>+1.2</strong>)</span></td>
+      <td>85.3 <span>(<strong>+4.6</strong>)</span></td>
+      <td>85.8 <span>(<strong>+5.0</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+ - **NLPCC2016-DBQA**
+
+```text
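+NLPCC2016-DBQA is an evaluation task organized in 2016 by NLPCC, the International Conference on Natural Language Processing and Chinese Computing. The goal is to select, from a set of candidates, the document that answers a question. [Link: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf]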
+```
+
+
+
+### Semantic Similarity
+
+<table>
+  <tbody>
+    <tr>
+      <th><strong>Dataset</strong>
+        <br></th>
+      <th colspan="2"><center><strong>LCQMC</strong></center></th>
+      <th colspan="2"><center><strong>BQ Corpus</strong></center></th>
+    </tr>
+    <tr>
+      <td rowspan="2">
+        <p>
+          <strong>Evaluation</strong></p>
+        <p>
+          <strong>Metric</strong>
+          <br></p>
+      </td>
+      <td colspan="2">
+        <center><strong>acc</strong></center></td>
+      <td colspan="2">
+        <center><strong>acc</strong></center></td>
+    </tr>
+    <tr>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>dev</strong>
+        <br></td>
+      <td colspan="1" width="">
+        <strong>test</strong>
+        <br></td>
+    </tr>
+    <tr>
+      <td><strong>BERT Base</strong></td>
+      <td>88.8</td>
+      <td>87.0</td>
+      <td>85.9</td>
+      <td>84.8</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 1.0 Base</strong></td>
+      <td>89.7 <span>(<strong>+0.9</strong>)</span></td>
+      <td>87.4 <span>(<strong>+0.4</strong>)</span></td>
+      <td>86.1 <span>(<strong>+0.2</strong>)</span></td>
+      <td>84.8</td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Base</strong></td>
+      <td>90.9 <span>(<strong>+2.1</strong>)</span></td>
+      <td>87.9 <span>(<strong>+0.9</strong>)</span></td>
+      <td>86.4 <span>(<strong>+0.5</strong>)</span></td>
+      <td>85.0 <span>(<strong>+0.2</strong>)</span></td>
+    </tr>
+    <tr>
+      <td><strong>ERNIE 2.0 Large</strong></td>
+      <td>90.9 <span>(<strong>+2.1</strong>)</span></td>
+      <td>87.9 <span>(<strong>+0.9</strong>)</span></td>
+      <td>86.5 <span>(<strong>+0.6</strong>)</span></td>
+      <td>85.2 <span>(<strong>+0.4</strong>)</span></td>
+    </tr>
+  </tbody>
+</table>
+
+\* *The LCQMC and BQ Corpus datasets must be requested from their authors. LCQMC request page: http://icrc.hitsz.edu.cn/info/1037/1146.htm, BQ Corpus request page: http://icrc.hitsz.edu.cn/Article/show/175.html*
+
+ - **LCQMC**
+
+```text
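+LCQMC is a semantic matching dataset released at COLING 2018, a leading international NLP conference. The goal is to decide whether two questions have the same meaning. [Link: http://aclweb.org/anthology/C18-1166]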
+```
+
+ - **BQ Corpus**
+
+```text
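+BQ Corpus is a semantic matching dataset for the banking domain, released at EMNLP 2018, a leading international NLP conference. The goal is to decide whether two questions have the same meaning. [Link: https://www.aclweb.org/anthology/D18-1536]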
+```
+
+
+
+##  英文效果验证
+
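+ERNIE 2.0 is evaluated on English using GLUE. The official GLUE site is https://gluebenchmark.com/ . The benchmark covers 10 datasets of different task types, with 11 test sets in total, and uses five kinds of metrics: accuracy, F1 score, Spearman correlation, Pearson correlation, and Matthews correlation. The GLUE leaderboard uses the average score over the datasets as the overall score and ranks systems by it.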
+
+
+
+
+### GLUE - 验证集结果
+
+| <strong>Dataset</strong> | <strong>CoLA</strong> | <strong>SST-2</strong> | <strong>MRPC</strong> | <strong>STS-B</strong> | <strong>QQP</strong>  | <strong>MNLI-m</strong> | <strong>QNLI</strong> | <strong>RTE</strong>  |
+| ----------- | ---- | ----- | ---- | ----- | ---- | ---- | ---- | ---- |
+| **Metric** | **matthews corr.** | **acc** | **acc** | **pearson corr.** | **acc** | **acc** | **acc** | **acc** |
+| **BERT Large** | 60.6 | 93.2  | 88.0 | 90.0  | 91.3 | 86.6 | 92.3 | 70.4 |
+| **XLNet Large** | 63.6 | 95.6 | 89.2 | 91.8 | 91.8 | 89.8 | 93.9 | 83.8 |
+| **ERNIE 2.0 Large** | 65.4<br/>(**+4.8,+1.8**) | 96.0<br/>(**+2.8,+0.4**) | 89.7<br/>(**+1.7,+0.5**) | 92.3<br/>(**+2.3,+0.5**) | 92.5<br/>(**+1.2,+0.7**) | 89.1<br/>(**+2.5,-0.7**) | 94.3<br/>(**+2.0,+0.4**) | 85.2<br/>(**+14.8,+1.4**) |
+
+We compare with BERT and XLNet using single-model results on the dev set.
+
+
+
+### GLUE - 测试集结果
+
+| <strong>Dataset</strong> | - | <strong>CoLA</strong> | <strong>SST-2</strong> | <strong>MRPC</strong> | <strong>STS-B</strong> | <strong>QQP</strong>  | <strong>MNLI-m</strong> | <strong>MNLI-mm</strong> | <strong>QNLI</strong> | <strong>RTE</strong>  | <strong>WNLI</strong> |<strong>AX</strong>|
+| ----------- | ----- | ---- | ----- | ---- | ----- | ---- | ------ | ------- | ---- | ---- | ---- | ---- |
+| **Metric** | **score** | **matthews corr.** | **acc** | **f1-score/acc** | **spearman/pearson corr.** | **f1-score/acc** | **acc** | **acc** |**acc**|**acc**|**acc**| **matthews corr.** |
+| **BERT Base** | 78.3  | 52.1 | 93.5  | 88.9/84.8 | 85.8/87.1 | 71.2/89.2 | 84.6   | 83.4    | 90.5 | 66.4 | 65.1 | 34.2 |
+| **ERNIE 2.0 Base** | 80.6<br/>(**+2.3**) | 55.2<br/>(**+3.1**) | 95.0<br/>(**+1.5**) | 89.9/86.1<br/>(**+1.0/+1.3**) | 86.5/87.6<br/>(**+0.7/+0.5**) | 73.2/89.8<br/>(**+2.0/+0.6**) | 86.1<br/>(**+1.5**) | 85.5<br/>(**+2.1**) | 92.9<br/>(**+2.4**) | 74.8<br/>(**+8.4**) | 65.1 | 37.4<br/>(**+3.2**) |
+| **BERT Large** | 80.5  | 60.5 | 94.9  | 89.3/85.4 | 86.5/87.6 | 72.1/89.3 | 86.7   | 85.9    | 92.7 | 70.1 | 65.1 | 39.6 |
+| **ERNIE 2.0 Large** | 83.6<br/>(**+3.1**) | 63.5<br/>(**+3.0**) | 95.6<br/>(**+0.7**) | 90.2/87.4<br/>(**+0.9/+2.0**) | 90.6/91.2<br/>(**+4.1/+3.6**) | 73.8/90.1<br/>(**+1.7/+0.8**) | 88.7<br/>(**+2.0**) | 88.8<br/>(**+2.9**) | 94.6<br/>(**+1.9**) | 80.2<br/>(**+10.1**) | 67.8<br/>(**+2.7**) | 48.0<br/>(**+8.4**) |
+
+
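+XLNet has not yet published single-model results on the GLUE test set, so we compare single-model results with BERT only. The table above shows the single-model results of ERNIE 2.0 on the GLUE test set.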
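
The overall score column can be checked by hand. The sketch below is one reading of the averaging rule, not an official GLUE implementation: it assumes that a task reporting two metrics contributes their mean, that MNLI-m and MNLI-mm count as a single task, and that the diagnostic set AX is excluded. Under those assumptions the BERT Base row of the test table averages to the reported 78.3.

```python
# Per-task test scores copied from the BERT Base row of the test-set table.
# Tasks that report two metrics are stored as tuples.
BERT_BASE = {
    "CoLA": 52.1,
    "SST-2": 93.5,
    "MRPC": (88.9, 84.8),
    "STS-B": (85.8, 87.1),
    "QQP": (71.2, 89.2),
    "MNLI": (84.6, 83.4),  # MNLI-m and MNLI-mm treated as one task (assumption)
    "QNLI": 90.5,
    "RTE": 66.4,
    "WNLI": 65.1,
}


def task_score(value):
    """Collapse a task's metrics into one number by averaging them."""
    if isinstance(value, tuple):
        return sum(value) / len(value)
    return value


def glue_score(scores):
    """Macro-average over tasks; AX is left out (assumption)."""
    per_task = [task_score(v) for v in scores.values()]
    return sum(per_task) / len(per_task)


print(round(glue_score(BERT_BASE), 1))  # 78.3
```

The same rule applied to the ERNIE 2.0 Base row gives 80.6, which also matches the table.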
+
+
+## 使用
+  * [PaddlePaddle 安装](#paddlepaddle安装)
+  * [模型&amp;数据](#模型数据)
+     * [预训练模型下载](#预训练模型下载)
+     * [数据下载](#数据下载)
+        * [中文数据](#中文数据)
+        * [英文数据](#英文数据)
+  * [Fine-tuning 任务](#fine-tuning-任务)
+     * [运行参数配置](#运行参数配置)
+     * [单句和句对分类任务](#单句和句对分类任务)
+        * [单句分类任务](#单句分类任务)
+        * [句对分类任务](#句对分类任务)
+     * [序列标注任务](#序列标注任务)
+        * [实体识别](#实体识别)
+     * [阅读理解任务](#阅读理解任务-1)
+  * [预训练 (ERNIE 1.0)](#预训练-ernie-10)
+     * [数据预处理](#数据预处理)
+     * [开始训练](#开始训练)
+  * [FAQ](#faq)
+     * [FAQ1: 如何获取输入句子/词经过 ERNIE 编码后的 Embedding 表示?](#faq1-如何获取输入句子词经过-ernie-编码后的-embedding-表示)
+     * [FAQ2: 如何利用 Fine-tuning 得到的模型对新数据进行批量预测？](#faq2-如何利用-fine-tuning-得到的模型对新数据进行批量预测)
+     * [FAQ3: 运行脚本中的batch size指的是单卡分配的数据量还是多卡的总数据量？](#faq3-运行脚本中的batch-size指的是单卡分配的数据量还是多卡的总数据量)
+     * [FAQ4: Can not find library: libcudnn.so. Please try to add the lib path to LD_LIBRARY_PATH.](#faq4-can-not-find-library-libcudnnso-please-try-to-add-the-lib-path-to-ld_library_path)
+     * [FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.](#faq5-can-not-find-library-libncclso-please-try-to-add-the-lib-path-to-ld_library_path)
+
+
+## PaddlePaddle安装
+
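+This project depends on Paddle Fluid 1.5. Please follow the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it.
+
+**[Important] After installation, add the paths of the CUDA, cuDNN, and NCCL2 shared libraries to the LD_LIBRARY_PATH environment variable; otherwise training will fail with errors about the missing libraries. For installation details, see [here](http://en.paddlepaddle.org/documentation/docs/zh/1.5/beginners_guide/quick_start_cn.html)**
+
+For more information about Paddle, such as modeling a real problem or building your own network, the official documentation below may help:
+
+> - [Basic concepts](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/basic_concept/index_cn.html): introduces the basic concepts of using Fluid
+> - [Preparing data](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/prepare_data/index_cn.html): describes the data types supported when training a network with Fluid and how data is fed
+> - [Configuring a simple network](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/configure_simple_model/index_cn.html): shows how to model a problem and build a network from Fluid operators
+> - [Training a neural network](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/index_cn.html): covers single-machine and multi-machine training with Fluid, and saving and loading model variables
+> - [Model evaluation and debugging](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/evaluation_and_debugging/index_cn.html): describes how to evaluate and debug models in Fluid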
+
+
+## 模型&数据
+
+### 预训练模型下载
+
+| Model | Description |
+| :------| :------ |
+| [ERNIE 1.0 Chinese Base model](https://ernie.bj.bcebos.com/ERNIE_stable.tgz) | Pre-trained model parameters |
+| [ERNIE 1.0 Chinese Base model (with config file and vocabulary)](https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz) | Pre-trained model parameters, vocabulary vocab.txt, and model config ernie_config.json |
+| [ERNIE 1.0 Chinese Base model (max_len=512)](https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz) | Pre-trained model parameters, vocabulary vocab.txt, and model config ernie_config.json |
+| [ERNIE 2.0 English Base model](https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz) | Pre-trained model parameters, vocabulary vocab.txt, and model config ernie_config.json |
+| [ERNIE 2.0 English Large model](https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz) | Pre-trained model parameters, vocabulary vocab.txt, and model config ernie_config.json |
+
+
+
+### 数据下载
+
+#### 中文数据
+
+ [Download link](https://ernie.bj.bcebos.com/task_data_zh.tgz)
+
+#### 英文数据
+
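+Because of dataset licensing, the English datasets cannot be provided here directly. To download the GLUE data, see the [GLUE home page](https://gluebenchmark.com/tasks) and the [download script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) provided by GLUE.
+
+Assuming all datasets are downloaded to `$GLUE_DATA`, run `sh ./script/en_glue/preprocess/cvt.sh $GLUE_DATA` once the download finishes to convert all of them. By default the converted data is written to the `./glue_data_processed/` folder.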
+
+
+
+## Fine-tuning 任务
+
+### 运行参数配置
+
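+In our experiments we found that the batch size used for a task affects its final result, so the exact settings we used are listed here. When running an experiment, pay attention to the number of local GPUs.
+
+In the Batch Size column below, *"(base)"* marks the value used to fine-tune the ERNIE Base model; where nothing is marked, ERNIE Large and ERNIE Base use the same batch size.
+
+| Task |   Batch Size   | GPUs |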
+| ------ | ---- | ------- |
+| CoLA | 32 / 64 (base) | 1 |
+| SST-2  | 64 / 256 (base) | 8 |
+| STS-B  | 128 | 8 |
+| QQP    | 256 | 8 |
+| MNLI   | 256 / 512 (base) | 8 |
+| QNLI   | 256 | 8 |
+| RTE    | 16 / 4 (base) | 1 |
+| MRPC   | 16 / 32 (base) | 2 |
+| WNLI | 8 | 1 |
+| XNLI | 65536 (tokens) | 8 |
+| CMRC2018 | 64 | 8 (large) / 4(base) |
+| DRCD | 64 | 8 (large) / 4(base) |
+| MSRA-NER(SIGHAN 2006) | 16 | 1 |
+| ChnSentiCorp | 24 | 1 |
+| LCQMC | 32 | 1 |
+| BQ Corpus | 64 | 1 |
+| NLPCC2016-DBQA | 64 | 8 |
+
+\* *MNLI and QNLI were run on V100 GPUs with 32 GB of memory. All other tasks were run on P40 GPUs with 22 GB.*
+
+
+
+### 单句和句对分类任务
+
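+The logic for classification and regression tasks is wrapped in `run_classifier.py`. To make the results above easy to reproduce, each task and its hyperparameters are packaged in a shell script for that task.
+
+Run examples are given below for the Chinese and English sentiment analysis tasks `ChnSentiCorp` and `SST-2`, and for `LCQMC`. Before running, download the corresponding pre-trained model from the links in the [模型&amp;数据](#模型数据) section.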
+
+
+
+#### 单句分类任务
+
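+Take the `ChnSentiCorp` sentiment classification dataset as an example of a single-sentence classification task. If the data is downloaded and extracted to `/home/task_data/`, that directory should contain a `chnsenticorp` folder, and the training data is at `/home/task_data/chnsenticorp/train.tsv`. The file is a tsv file with two columns, `label` and `text_a`. Sample data: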
+
+```
+label  text_a
+...
+0   当当网名不符实，订货多日不见送货，询问客服只会推托，只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。
+0   XP的驱动不好找！我的17号提的货，现在就降价了100元，而且还送杀毒软件！
+1   <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!
+...
+```
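
Since the file carries a header row naming its columns, a reader can look fields up by name rather than by position. The following is a minimal sketch of such a reader using only the standard library; `read_tsv` is a hypothetical helper and not the reader in this repository, and the two rows are made up for illustration.

```python
import csv
import io


def read_tsv(stream):
    """Yield one dict per data row, keyed by the column names in the header."""
    # QUOTE_NONE keeps quote characters inside the text untouched.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    for row in reader:
        if len(row) != len(header):
            continue  # skip malformed lines
        yield dict(zip(header, row))


# Hypothetical two-row file in the tab-separated layout described above.
sample = io.StringIO("label\ttext_a\n0\tthe plot never comes together\n1\ta warm , funny film\n")
rows = list(read_tsv(sample))
print(rows[0]["label"], rows[0]["text_a"])
```

Looking columns up by name means the same helper works whether `label` comes before or after `text_a`.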
+
+If the model is downloaded to `/home/model/`, that directory should contain a folder named `params`. Before running the task, set the environment variables:
+
+```
+export TASK_DATA_PATH=/home/task_data/
+export MODEL_PATH=/home/model/
+```
+
+Run `sh script/zh_task/ernie_base/run_ChnSentiCorp.sh` to start fine-tuning. When it finishes, it prints the evaluation results on the dev and test sets, as shown below:
+
+ ```
+[dev evaluation] ave loss: 0.303819, acc:0.943333, data_num: 1200, elapsed time: 16.280898 s, file: /home/task_data/chnsenticorp/dev.tsv, epoch: 9, steps: 4001
+[dev evaluation] ave loss: 0.228482, acc:0.958333, data_num: 1200, elapsed time: 16.023091 s, file: /home/task_data/chnsenticorp/test.tsv, epoch: 9, steps: 4001
+ ```
+
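+As a second example, take the English dataset `SST-2`, whose file format is similar to the Chinese one. Suppose the data converted as described in the [模型&amp;数据](#模型数据) section is in `/home/glue_data_processed/`, so the training data is at `/home/glue_data_processed/SST-2/train.tsv`. This file also has two columns, `label` and `text_a`. Sample data: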
+
+```
+label  text_a
+0   hide new secretions from the parental units
+0   contains no wit , only labored gags
+1   that loves its characters and communicates something rather beautiful about human nature
+0   remains utterly satisfied to remain the same throughout
+0   on the worst revenge-of-the-nerds clichés the filmmakers could dredge up
+0   that 's far too tragic to merit such superficial treatment
+1   demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop .
+1   of saucy
+```
+
+Again, set the environment variables before running:
+
+```
+export TASK_DATA_PATH=/home/glue_data_processed/
+export MODEL_PATH=/home/model/
+```
+
+Run `sh script/en_glue/ernie_large/SST-2/task.sh`, and you should see log output similar to the following:
+
+```
+epoch: 3, progress: 22456/67349, step: 3500, ave loss: 0.015862, ave acc: 0.984375, speed: 1.328810 steps/s
+[dev evaluation] ave loss: 0.174793, acc:0.957569, data_num: 872, elapsed time: 15.314256 s file: ./data/dev.tsv, epoch: 3, steps: 3500
+testing ./data/test.tsv, save to output/test_out.tsv
+```
+
+#### 句对分类任务
+
+Take the `LCQMC` semantic similarity task as an example of a sentence-pair classification task. The data is a tsv file with three columns: `text_a    text_b   label`. Sample data:
+```
+text_a  text_b  label
+开初婚未育证明怎么弄？  初婚未育情况证明怎么开？    1
+谁知道她是网络美女吗？  爱情这杯酒谁喝都会醉是什么歌    0
+这腰带是什么牌子    护腰带什么牌子好    0
+```
+Run `sh script/zh_task/ernie_base/run_lcqmc.sh` to start fine-tuning. When it finishes, it prints the evaluation results on the dev and test sets, as shown below:
+
+```
+[dev evaluation] ave loss: 0.299115, acc:0.900704, data_num: 8802, elapsed time: 32.327663 s, file: ./task_data/lcqmc/dev.tsv, epoch: 2, steps: 22387
+[dev evaluation] ave loss: 0.374148, acc:0.878080, data_num: 12500, elapsed time: 39.780520 s, file: ./task_data/lcqmc/test.tsv, epoch: 2, steps: 22387
+```
+
+
+
+### 序列标注任务
+
+#### 实体识别
+
+ Take `MSRA-NER(SIGHAN2006)` as an example. The data is a tsv file with two columns: `text_a  label`. Sample data:
+```
+text_a  label
+在 这 里 恕 弟 不 恭 之 罪 ， 敢 在 尊 前 一 诤 ： 前 人 论 书 ， 每 曰 “ 字 字 有 来 历 ， 笔 笔 有 出 处 ” ， 细 读 公 字 ， 何 尝 跳 出 前 人 藩 篱 ， 自 隶 变 而 后 ， 直 至 明 季 ， 兄 有 何 新 出 ？    O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
+相 比 之 下 ， 青 岛 海 牛 队 和 广 州 松 日 队 的 雨 中 之 战 虽 然 也 是 0 ∶ 0 ， 但 乏 善 可 陈 。   O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG I-ORG I-ORG I-ORG O O O O O O O O O O O O O O O O O O O
+理 由 多 多 ， 最 无 奈 的 却 是 ： 5 月 恰 逢 双 重 考 试 ， 她 攻 读 的 博 士 学 位 论 文 要 通 考 ； 她 任 教 的 两 所 学 校 ， 也 要 在 这 段 时 日 大 考 。    O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
+```
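
To make the tagging scheme concrete, the sketch below turns one such line into entity spans. It assumes the two fields are separated by a tab and that tokens and tags inside each field are separated by a single space, as the sample is displayed; the real files may use a different in-field separator, so it is exposed as the `sep` parameter. `parse_ner_line` and `bio_spans` are hypothetical helpers, not functions from this repository.

```python
def parse_ner_line(line, sep=" "):
    """Split one tsv line into (tokens, labels); sep is the in-field separator."""
    text_a, label = line.rstrip("\n").split("\t")
    tokens, labels = text_a.split(sep), label.split(sep)
    if len(tokens) != len(labels):
        raise ValueError("token/label count mismatch")
    return tokens, labels


def bio_spans(tokens, labels):
    """Collect (entity_type, text) pairs from B-/I-/O tags."""
    spans, current_type, current = [], None, []
    for token, tag in zip(tokens, labels):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current:
                spans.append((current_type, "".join(current)))
            current_type, current = tag[2:], [token]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # "O" or an inconsistent I- tag ends any open entity.
            if current:
                spans.append((current_type, "".join(current)))
            current_type, current = None, []
    if current:
        spans.append((current_type, "".join(current)))
    return spans


# Fragment of the second sample row above, with an explicit tab between fields.
line = "青 岛 海 牛 队 和 广 州 松 日 队\tB-ORG I-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG I-ORG I-ORG I-ORG"
tokens, labels = parse_ner_line(line)
print(bio_spans(tokens, labels))
```

Each entity begins at a `B-` tag and extends over the following `I-` tags of the same type, which is why the two team names come out as separate `ORG` spans.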
+
+Run `sh script/zh_task/ernie_base/run_msra_ner.sh` to start fine-tuning. When it finishes, it prints the evaluation results on the dev and test sets, as shown below:
+
+ ```
+[dev evaluation] f1: 0.951949, precision: 0.944636, recall: 0.959376, elapsed time: 19.156693 s
+[test evaluation] f1: 0.937390, precision: 0.925988, recall: 0.949077, elapsed time: 36.565929 s
+ ```
+
+
+### 阅读理解任务
+
+
+ Take `DRCD` as an example. First convert the data to SQuAD format:
+ ```
+{
+  "version": "1.3",
+  "data": [
+    {
+      "paragraphs": [
+        {
+          "id": "1001-11",
+          "context": "广州是京广铁路、广深铁路、广茂铁路、广梅汕铁路的终点站。2009年末，武广客运专线投入运营，多单元列车覆盖980公里的路程，最高时速可达350公里/小时。2011年1月7日，广珠城际铁路投入运营，平均时速可达200公里/小时。广州铁路、长途汽车和渡轮直达香港，广九直通车从广州东站开出，直达香港九龙红磡站，总长度约182公里，车程在两小时内。繁忙的长途汽车每年会从城市中的不同载客点把旅客接载至香港。在珠江靠市中心的北航道有渡轮线路，用于近江居民直接渡江而无需乘坐公交或步行过桥。南沙码头和莲花山码头间每天都有高速双体船往返，渡轮也开往香港中国客运码头和港澳码头。",
+          "qas": [
+            {
+              "question": "广珠城际铁路平均每小时可以走多远？",
+              "id": "1001-11-1",
+              "answers": [
+                {
+                  "text": "200公里",
+                  "answer_start": 104,
+                  "id": "1"
+                }
+              ]
+            }
+          ]
+        }
+      ],
+      "id": "1001",
+      "title": "广州"
+    }
+  ]
+}
+ ```
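
After converting data to this layout, it can help to confirm that every `answer_start` really points at the answer text. The sketch below assumes `answer_start` is a character offset into `context`, which is the usual SQuAD convention. `check_answer_offsets` is a hypothetical helper, and the miniature record is invented for illustration with the same keys as the sample above.

```python
import json


def check_answer_offsets(squad):
    """Return ids of questions whose answer text is not found at answer_start."""
    bad = []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    end = start + len(ans["text"])
                    if context[start:end] != ans["text"]:
                        bad.append(qa["id"])
    return bad


# Hypothetical miniature record, not taken from DRCD.
doc = json.loads("""
{"version": "1.3", "data": [{"id": "1", "title": "t", "paragraphs": [{"id": "1-1",
  "context": "The train averages 200 km per hour.",
  "qas": [{"id": "1-1-1", "question": "How fast is the train?",
           "answers": [{"id": "1", "text": "200 km", "answer_start": 19}]}]}]}]}
""")
print(check_answer_offsets(doc))  # []
```

An empty list means every answer was found at its stated offset; any id returned points at a record whose offset needs to be recomputed.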
+
+Run `sh script/zh_task/ernie_base/run_drcd.sh` to start fine-tuning. When it finishes, it prints the evaluation results on the dev and test sets, as shown below:
+
+ ```
+[dev evaluation] em: 88.450624, f1: 93.749887, avg: 91.100255, question_num: 3524
+[test evaluation] em: 88.061838, f1: 93.520152, avg: 90.790995, question_num: 3493
+ ```
+
+
+## 预训练 (ERNIE 1.0)
+
+### 数据预处理
+
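+Sentence pairs with contextual relations are built from encyclopedia, news, and forum dialogue data. A Baidu-internal lexical analysis tool segments the sentence pairs at several granularities (characters, words, and entities). The CharTokenizer in [`tokenization.py`](tokenization.py) then tokenizes the segmented data, producing plain-text token sequences and segmentation boundaries. The plain-text data is mapped to ids with the vocabulary [`config/vocab.txt`](config/vocab.txt). During training, consecutive tokens are randomly masked according to the segmentation boundaries.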
+
+We provide part of the id-encoded training data, [`data/demo_train_set.gz`](./data/demo_train_set.gz), and validation data, [`data/demo_valid_set.gz`](./data/demo_valid_set.gz). Each line is one training sample, for example:
+
+```
+1 1048 492 1333 1361 1051 326 2508 5 1803 1827 98 164 133 2777 2696 983 121 4 19 9 634 551 844 85 14 2476 1895 33 13 983 121 23 7 1093 24 46 660 12043 2 1263 6 328 33 121 126 398 276 315 5 63 44 35 25 12043 2;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55;-1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 -1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 -1;0
+```
+
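+Each sample consists of 5 fields separated by '`;`', in the format `token_ids; sentence_type_ids; position_ids; seg_labels; next_sentence_label`. `seg_labels` carries word-boundary information: 0 marks the first token of a word, 1 marks a non-initial token, and -1 is a placeholder for the `CLS` or `SEP` token.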
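
The five-field layout can be read back with a few lines of Python. This is a minimal sketch, assuming `;` separates the fields and spaces separate the integers inside each field, as in the example line. `parse_pretrain_sample` is a hypothetical helper rather than the reader used by the pre-training script, and the sample line is shortened and made up.

```python
def parse_pretrain_sample(line):
    """Split one id-encoded sample into its five ';'-separated fields."""
    fields = line.strip().split(";")
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    # The first four fields hold one integer per token.
    token_ids, sent_ids, pos_ids, seg_labels = (
        [int(x) for x in f.split()] for f in fields[:4])
    if not (len(token_ids) == len(sent_ids) == len(pos_ids) == len(seg_labels)):
        raise ValueError("per-token fields differ in length")
    return {
        "token_ids": token_ids,
        "sentence_type_ids": sent_ids,
        "position_ids": pos_ids,
        "seg_labels": seg_labels,
        "next_sentence_label": int(fields[4]),
    }


# Shortened, hypothetical sample; real lines are much longer.
sample = parse_pretrain_sample(
    "1 1048 492 2 1263 6 2;0 0 0 0 1 1 1;0 1 2 3 4 5 6;-1 0 1 -1 0 0 -1;0")
print(len(sample["token_ids"]), sample["next_sentence_label"])  # 7 0
```

The length check catches truncated lines early, since the four per-token fields must line up position by position.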
+
+### 开始训练
+
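+The launch script for pre-training is [`script/zh_task/pretrain.sh`](./script/zh_task/pretrain.sh).
+Before starting, add the paths of the CUDA, cuDNN, and NCCL2 shared libraries to the LD_LIBRARY_PATH environment variable. Then run `sh script/zh_task/pretrain.sh` to start pre-training on the demo data with the default parameters.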
+
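+During pre-training, the log reports the current learning rate, the number of epochs over the training data, the total number of steps so far, the training loss, the training speed, and so on. Depending on the `--validation_steps ${N}` setting, the model's metrics on the validation set are printed every N steps: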
+
+```
+current learning_rate:0.000001
+epoch: 1, progress: 1/1, step: 30, loss: 10.540648, ppl: 19106.925781, next_sent_acc: 0.625000, speed: 0.849662 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
+feed_queue size 70
+current learning_rate:0.000001
+epoch: 1, progress: 1/1, step: 40, loss: 10.529287, ppl: 18056.654297, next_sent_acc: 0.531250, speed: 0.849549 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
+feed_queue size 70
+current learning_rate:0.000001
+epoch: 1, progress: 1/1, step: 50, loss: 10.360563, ppl: 16398.287109, next_sent_acc: 0.625000, speed: 0.843776 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
+```
+
+To train on your own data, adjust the parameters following the [`script/zh_task/pretrain.sh`](./script/zh_task/pretrain.sh) script.
+
+## FAQ
+
+### FAQ1: 如何获取输入句子/词经过 ERNIE 编码后的 Embedding 表示?
+
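+`ernie_encoder.py` extracts the embedding of an input sentence and the embedding of each token in it. The input format is the same as the training data format of the fine-tuning tasks described in the [Fine-tuning 任务](#fine-tuning-任务) section. For example, to get the sentence and token embeddings for the LCQMC dev set: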
+
+```
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0
+
+python -u ernie_encoder.py \
+                   --use_cuda true \
+                   --batch_size 32 \
+                   --output_dir "./test" \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --data_set ${TASK_DATA_PATH}/lcqmc/dev.tsv \
+                   --vocab_path ${MODEL_PATH}/vocab.txt \
+                   --max_seq_len 128 \
+                   --ernie_config_path ${MODEL_PATH}/ernie_config.json
+```
+
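+When the script finishes, it writes two files to the `test` directory under the current path: `cls_emb.npy` with the sentence embeddings and `top_layer_emb.npy` with the token embeddings. In practice, change the data path, the output path for the embeddings, and the other settings in the example script as needed.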
+
+
+### FAQ2: 如何利用 Fine-tuning 得到的模型对新数据进行批量预测？
+
+Taking classification as an example, we provide a script for batch prediction. Usage:
+
+```
+python -u predict_classifier.py \
+       --use_cuda true \
+       --batch_size 32 \
+       --vocab_path ${MODEL_PATH}/vocab.txt \
+       --init_checkpoint "./checkpoints/step_100" \
+       --do_lower_case true \
+       --max_seq_len 128 \
+       --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+       --do_predict true \
+       --predict_set ${TASK_DATA_PATH}/lcqmc/test.tsv \
+       --num_labels 2
+```
+
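+In practice, use `init_checkpoint` to specify the model used for prediction, `predict_set` to specify the data file to predict on, and `num_labels` to set the number of classes.
+
+**Note**: the `predict_set` file is a tsv file with one column (`text_a`) or two columns (`text_a` and the optional `text_b`).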
+
+
+
+### FAQ3: 运行脚本中的batch size指的是单卡分配的数据量还是多卡的总数据量？
+
+The amount of data assigned to a single GPU.
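
As a worked example of what this answer implies, the sketch below computes the number of examples consumed per step across all cards. It assumes plain data-parallel training in which every card receives its own batch at each step. Both helpers are hypothetical and are not how the training scripts compute their step counts; the numbers are illustrative.

```python
def global_batch_size(per_card_batch_size, num_cards):
    """Examples consumed per training step across all cards."""
    if per_card_batch_size <= 0 or num_cards <= 0:
        raise ValueError("both arguments must be positive")
    return per_card_batch_size * num_cards


def steps_per_epoch(num_examples, per_card_batch_size, num_cards):
    """Steps needed to see every example once, counting the last partial batch."""
    total = global_batch_size(per_card_batch_size, num_cards)
    return -(-num_examples // total)  # ceiling division


print(global_batch_size(32, 8))       # 256
print(steps_per_epoch(67349, 32, 8))  # 264
```

With a per-card batch size of 32 on 8 cards, each step covers 256 examples, so changing the number of cards changes the effective batch size even when the script argument stays the same.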
+
+
+
+### FAQ4: Can not find library: libcudnn.so. Please try to add the lib path to LD_LIBRARY_PATH.
+
+Add the path of the cuDNN library to LD_LIBRARY_PATH, for example `export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v[your cudnn version]/cuda/lib64`
+
+
+
+### FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.
+
+First download [NCCL](https://developer.nvidia.com/nccl/nccl-download), then add the path of the NCCL library to LD_LIBRARY_PATH, for example `export LD_LIBRARY_PATH=/home/work/nccl/lib`
--- a/ERNIE/__init__.py
+++ b/ERNIE/__init__.py
--- a/ERNIE/_ce.py
+++ b/ERNIE/_ce.py
@@ -17,12 +17,12 @@ train_duration_card4_kpi = DurationKpi(
    'train_duration_card4', 0.02, 0, actived=True)

 tracking_kpis = [
-        train_loss_card1_kpi,
-        train_acc_card1_kpi,
-        train_duration_card1_kpi,
-        train_loss_card4_kpi,
-        train_acc_card4_kpi,
-        train_duration_card4_kpi,
+    train_loss_card1_kpi,
+    train_acc_card1_kpi,
+    train_duration_card1_kpi,
+    train_loss_card4_kpi,
+    train_acc_card4_kpi,
+    train_duration_card4_kpi,
 ]



--- a/ERNIE/batching.py
+++ b/ERNIE/batching.py
--- a/classify_infer.py
+++ b/classify_infer.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Inference by """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import time
+import argparse
+import numpy as np
+import multiprocessing
+
+# NOTE(paddle-dev): All of these flags should be
+# set before `import paddle`. Otherwise, it would
+# not take any effect.
+os.environ['FLAGS_eager_delete_tensor_gb'] = '0'  # enable gc
+
+import paddle.fluid as fluid
+from paddle.fluid.core import PaddleBuf
+from paddle.fluid.core import PaddleDType
+from paddle.fluid.core import PaddleTensor
+from paddle.fluid.core import AnalysisConfig
+from paddle.fluid.core import create_paddle_predictor
+
+from reader.task_reader import ClassifyReader
+from model.ernie import ErnieConfig
+from finetune.classifier import create_model
+
+from utils.args import ArgumentGroup, print_arguments
+from utils.init import init_pretraining_params
+
+# yapf: disable
+parser = argparse.ArgumentParser(__doc__)
+model_g = ArgumentGroup(parser, "model", "options to init, resume and save model.")
+model_g.add_arg("ernie_config_path",            str,  None,  "Path to the json file for ERNIE model config.")
+model_g.add_arg("init_checkpoint",              str,  None,  "Init checkpoint to resume training from.")
+model_g.add_arg("save_inference_model_path",    str,  "inference_model",  "If set, save the inference model to this path.")
+model_g.add_arg("use_fp16",                     bool, False, "Whether to resume parameters from fp16 checkpoint.")
+model_g.add_arg("num_labels",                   int,  2,     "num labels for classify")
+
+data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options.")
+data_g.add_arg("predict_set",         str,  None,  "Predict set file")
+data_g.add_arg("vocab_path",          str,  None,  "Vocabulary path.")
+data_g.add_arg("label_map_config",    str,  None,  "Label_map_config json file.")
+data_g.add_arg("max_seq_len",         int,  128,   "Number of words of the longest sequence.")
+data_g.add_arg("batch_size",          int,  32,    "Total examples' number in batch for training. see also --in_tokens.")
+data_g.add_arg("do_lower_case",       bool, True,
+               "Whether to lower case the input text. Should be True for uncased models and False for cased models.")
+
+run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
+run_type_g.add_arg("use_cuda",          bool,   True,  "If set, use GPU for training.")
+run_type_g.add_arg("do_prediction",     bool,   True,  "Whether to do prediction on test set.")
+
+args = parser.parse_args()
+# yapf: enable.
+
+def main(args):
+    ernie_config = ErnieConfig(args.ernie_config_path)
+    ernie_config.print_config()
+
+    reader = ClassifyReader(
+        vocab_path=args.vocab_path,
+        label_map_config=args.label_map_config,
+        max_seq_len=args.max_seq_len,
+        do_lower_case=args.do_lower_case,
+        in_tokens=False,
+        is_inference=True)
+
+    predict_prog = fluid.Program()
+    predict_startup = fluid.Program()
+    with fluid.program_guard(predict_prog, predict_startup):
+        with fluid.unique_name.guard():
+            predict_pyreader, probs, feed_target_names = create_model(
+                args,
+                pyreader_name='predict_reader',
+                ernie_config=ernie_config,
+                is_classify=True,
+                is_prediction=True)
+
+    predict_prog = predict_prog.clone(for_test=True)
+
+    if args.use_cuda:
+        place = fluid.CUDAPlace(0)
+        dev_count = fluid.core.get_cuda_device_count()
+    else:
+        place = fluid.CPUPlace()
+        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
+
+    exe = fluid.Executor(place)
+    exe.run(predict_startup)
+
+    if args.init_checkpoint:
+        init_pretraining_params(exe, args.init_checkpoint, predict_prog)
+    else:
+        raise ValueError("args 'init_checkpoint' should be set for prediction!")
+
+    assert args.save_inference_model_path, "args save_inference_model_path should be set for prediction"
+    _, ckpt_dir = os.path.split(args.init_checkpoint.rstrip('/'))
+    dir_name = ckpt_dir + '_inference_model'
+    model_path = os.path.join(args.save_inference_model_path, dir_name)
+    print("save inference model to %s" % model_path)
+    fluid.io.save_inference_model(
+        model_path,
+        feed_target_names, [probs],
+        exe,
+        main_program=predict_prog)
+
+    # Set config
+    #config = AnalysisConfig(args.model_dir)
+    #config = AnalysisConfig(os.path.join(model_path, "__model__"), os.path.join(model_path, ""))
+    config = AnalysisConfig(model_path)
+    if not args.use_cuda:
+        print("disable gpu")
+        config.disable_gpu()
+
+    # Create PaddlePredictor
+    predictor = create_paddle_predictor(config)
+
+    predict_data_generator = reader.data_generator(
+        input_file=args.predict_set,
+        batch_size=args.batch_size,
+        epoch=1,
+        shuffle=False)
+
+    print("-------------- prediction results --------------")
+    np.set_printoptions(precision=4, suppress=True)
+    index = 0
+    total_time = 0
+    for sample in predict_data_generator():
+        src_ids    = sample[0]
+        sent_ids   = sample[1]
+        pos_ids    = sample[2]
+        task_ids   = sample[3]
+        input_mask = sample[4]
+
+        inputs = [array2tensor(ndarray) for ndarray in [src_ids, sent_ids, pos_ids, input_mask]]
+        begin_time = time.time()
+        outputs = predictor.run(inputs)
+        end_time = time.time()
+        total_time += end_time - begin_time
+
+        # parse outputs
+        output = outputs[0]
+        print(output.name)
+        output_data = output.data.float_data()
+        #assert len(output_data) == args.num_labels * args.batch_size
+        batch_result  = np.array(output_data).reshape((-1, args.num_labels))
+        for single_example_probs in batch_result:
+            print("{} example\t{}".format(index, single_example_probs))
+            index += 1
+    print("qps:{}\ttotal_time:{}\ttotal_example:{}\tbatch_size:{}".format(index/total_time, total_time, index, args.batch_size))
+
+
+def array2tensor(ndarray):
+    """ convert numpy array to PaddleTensor"""
+    assert isinstance(ndarray, np.ndarray), "input type must be np.ndarray"
+    tensor = PaddleTensor()
+    tensor.name = "data"
+    tensor.shape = ndarray.shape
+    if "float" in str(ndarray.dtype):
+        tensor.dtype = PaddleDType.FLOAT32
+    elif "int" in str(ndarray.dtype):
+        tensor.dtype = PaddleDType.INT64
+    else:
+        raise ValueError("{} type ndarray is unsupported".format(ndarray.dtype))
+
+    tensor.data = PaddleBuf(ndarray.flatten().tolist())
+    return tensor
+
+if __name__ == '__main__':
+    print_arguments(args)
+    main(args)
--- a/ERNIE/config/ernie_config.json
+++ b/ERNIE/config/ernie_config.json
--- a/ERNIE/config/vocab.txt
+++ b/ERNIE/config/vocab.txt
--- a/config/vocab_en.txt
+++ b/config/vocab_en.txt
--- a/ERNIE/data/demo_train_set.gz
+++ b/ERNIE/data/demo_train_set.gz
--- a/ERNIE/data/demo_valid_set.gz
+++ b/ERNIE/data/demo_valid_set.gz
--- a/ERNIE/data/train_filelist
+++ b/ERNIE/data/train_filelist
--- a/ERNIE/data/valid_filelist
+++ b/ERNIE/data/valid_filelist
--- a/ERNIE/ernie_encoder.py
+++ b/ERNIE/ernie_encoder.py
@@ -55,19 +55,21 @@ def create_model(args, pyreader_name, ernie_config):
    pyreader = fluid.layers.py_reader(
        capacity=50,
        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], [-1, 1]],
-        dtypes=['int64', 'int64', 'int64', 'float', 'int64'],
-        lod_levels=[0, 0, 0, 0, 0],
+                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
+                [-1, args.max_seq_len, 1], [-1, 1]],
+        dtypes=['int64', 'int64', 'int64', 'int64', 'float', 'int64'],
+        lod_levels=[0, 0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

-    (src_ids, sent_ids, pos_ids, input_mask,
+    (src_ids, sent_ids, pos_ids, task_ids, input_mask,
     seq_lens) = fluid.layers.read_file(pyreader)

    ernie = ErnieModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
+        task_ids=task_ids,
        input_mask=input_mask,
        config=ernie_config)

@@ -154,8 +156,8 @@ def main(args):
            cls_emb, unpad_top_layer_emb = exe.run(
                program=infer_program,
                fetch_list=[
-                    graph_vars["cls_embeddings"].name, graph_vars[
-                        "top_layer_embeddings"].name
+                    graph_vars["cls_embeddings"].name,
+                    graph_vars["top_layer_embeddings"].name
                ],
                return_numpy=False)
            # batch_size * embedding_size

--- a/ERNIE/finetune/__init__.py
+++ b/ERNIE/finetune/__init__.py
--- a/ERNIE/finetune/classifier.py
+++ b/ERNIE/finetune/classifier.py
@@ -20,30 +20,55 @@ from __future__ import print_function
 import time
 import numpy as np

+from scipy.stats import pearsonr, spearmanr
 from six.moves import xrange
 import paddle.fluid as fluid

 from model.ernie import ErnieModel


-def create_model(args, pyreader_name, ernie_config, is_prediction=False):
-    pyreader = fluid.layers.py_reader(
-        capacity=50,
-        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], [-1, 1],
-                [-1, 1]],
-        dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
-        lod_levels=[0, 0, 0, 0, 0, 0],
-        name=pyreader_name,
-        use_double_buffer=True)
-
-    (src_ids, sent_ids, pos_ids, input_mask, labels,
+def create_model(args,
+                 pyreader_name,
+                 ernie_config,
+                 is_prediction=False,
+                 task_name="",
+                 is_classify=False,
+                 is_regression=False,
+                 ernie_version="1.0"):
+    if is_classify:
+        pyreader = fluid.layers.py_reader(
+            capacity=50,
+            shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
+                    [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
+                    [-1, args.max_seq_len, 1], [-1, 1], [-1, 1]],
+            dtypes=[
+                'int64', 'int64', 'int64', 'int64', 'float32', 'int64', 'int64'
+            ],
+            lod_levels=[0, 0, 0, 0, 0, 0, 0],
+            name=task_name + "_" + pyreader_name,
+            use_double_buffer=True)
+    elif is_regression:
+        pyreader = fluid.layers.py_reader(
+            capacity=50,
+            shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
+                    [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
+                    [-1, args.max_seq_len, 1], [-1, 1], [-1, 1]],
+            dtypes=[
+                'int64', 'int64', 'int64', 'int64', 'float32', 'float32',
+                'int64'
+            ],
+            lod_levels=[0, 0, 0, 0, 0, 0, 0],
+            name=task_name + "_" + pyreader_name,
+            use_double_buffer=True)
+
+    (src_ids, sent_ids, pos_ids, task_ids, input_mask, labels,
     qids) = fluid.layers.read_file(pyreader)

    ernie = ErnieModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
+        task_ids=task_ids,
        input_mask=input_mask,
        config=ernie_config,
        use_fp16=args.use_fp16)
@@ -57,39 +82,50 @@ def create_model(args, pyreader_name, ernie_config, is_prediction=False):
        input=cls_feats,
        size=args.num_labels,
        param_attr=fluid.ParamAttr(
-            name="cls_out_w",
+            name=task_name + "_cls_out_w",
            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
        bias_attr=fluid.ParamAttr(
-            name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
+            name=task_name + "_cls_out_b",
+            initializer=fluid.initializer.Constant(0.)))

    if is_prediction:
        probs = fluid.layers.softmax(logits)
        feed_targets_name = [
            src_ids.name, sent_ids.name, pos_ids.name, input_mask.name
        ]
+        if ernie_version == "2.0":
+            feed_targets_name += [task_ids.name]
        return pyreader, probs, feed_targets_name

-    ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
-        logits=logits, label=labels, return_softmax=True)
-    loss = fluid.layers.mean(x=ce_loss)
-
-    if args.use_fp16 and args.loss_scaling > 1.0:
-        loss *= args.loss_scaling
-
+    assert is_classify != is_regression, 'is_classify or is_regression must be true and only one of them can be true'
    num_seqs = fluid.layers.create_tensor(dtype='int64')
-    accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
-
-    graph_vars = {
-        "loss": loss,
-        "probs": probs,
-        "accuracy": accuracy,
-        "labels": labels,
-        "num_seqs": num_seqs,
-        "qids": qids
-    }
-
-    for k, v in graph_vars.items():
-        v.persistable = True
+    if is_classify:
+        ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
+            logits=logits, label=labels, return_softmax=True)
+        loss = fluid.layers.mean(x=ce_loss)
+        accuracy = fluid.layers.accuracy(
+            input=probs, label=labels, total=num_seqs)
+        graph_vars = {
+            "loss": loss,
+            "probs": probs,
+            "accuracy": accuracy,
+            "labels": labels,
+            "num_seqs": num_seqs,
+            "qids": qids
+        }
+    elif is_regression:
+        cost = fluid.layers.square_error_cost(input=logits, label=labels)
+        loss = fluid.layers.mean(x=cost)
+        graph_vars = {
+            "loss": loss,
+            "probs": logits,
+            "labels": labels,
+            "num_seqs": num_seqs,
+            "qids": qids
+        }
+    else:
+        raise ValueError(
+            'unsupported fine-tune mode; only classify and regression are supported')

    return pyreader, graph_vars

@@ -144,7 +180,15 @@ def evaluate_map(preds):
    return total_map / qnum


-def evaluate(exe, test_program, test_pyreader, graph_vars, eval_phase):
+def evaluate_classify(exe,
+                      test_program,
+                      test_pyreader,
+                      graph_vars,
+                      eval_phase,
+                      use_multi_gpu_test=False,
+                      metric='simple_accuracy',
+                      is_classify=False,
+                      is_regression=False):
    train_fetch_list = [
        graph_vars["loss"].name, graph_vars["accuracy"].name,
        graph_vars["num_seqs"].name
@@ -161,7 +205,7 @@ def evaluate(exe, test_program, test_pyreader, graph_vars, eval_phase):

    test_pyreader.start()
    total_cost, total_acc, total_num_seqs, total_label_pos_num, total_pred_pos_num, total_correct_num = 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
-    qids, labels, scores = [], [], []
+    qids, labels, scores, preds = [], [], [], []
    time_begin = time.time()

    fetch_list = [
@@ -171,8 +215,12 @@ def evaluate(exe, test_program, test_pyreader, graph_vars, eval_phase):
    ]
    while True:
        try:
-            np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run(
-                program=test_program, fetch_list=fetch_list)
+            if use_multi_gpu_test:
+                np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run(
+                    fetch_list=fetch_list)
+            else:
+                np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run(
+                    program=test_program, fetch_list=fetch_list)
            total_cost += np.sum(np_loss * np_num_seqs)
            total_acc += np.sum(np_acc * np_num_seqs)
            total_num_seqs += np.sum(np_num_seqs)
@@ -182,6 +230,7 @@ def evaluate(exe, test_program, test_pyreader, graph_vars, eval_phase):
            qids.extend(np_qids.reshape(-1).tolist())
            scores.extend(np_probs[:, 1].reshape(-1).tolist())
            np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
+            preds.extend(np_preds)
            total_label_pos_num += np.sum(np_labels)
            total_pred_pos_num += np.sum(np_preds)
            total_correct_num += np.sum(np.dot(np_preds, np_labels))
@@ -189,25 +238,221 @@ def evaluate(exe, test_program, test_pyreader, graph_vars, eval_phase):
            test_pyreader.reset()
            break
    time_end = time.time()
+    cost = total_cost / total_num_seqs
+    elapsed_time = time_end - time_begin
+
+    evaluate_info = ""
+    if metric == 'acc_and_f1':
+        ret = acc_and_f1(preds, labels)
+        evaluate_info = "[%s evaluation] ave loss: %f, ave_acc: %f, f1: %f, data_num: %d, elapsed time: %f s" \
+            % (eval_phase, cost, ret['acc'], ret['f1'], total_num_seqs, elapsed_time)
+    elif metric == 'matthews_corrcoef':
+        ret = matthews_corrcoef(preds, labels)
+        evaluate_info = "[%s evaluation] ave loss: %f, matthews_corrcoef: %f, data_num: %d, elapsed time: %f s" \
+            % (eval_phase, cost, ret, total_num_seqs, elapsed_time)
+    elif metric == 'pearson_and_spearman':
+        ret = pearson_and_spearman(scores, labels)
+        evaluate_info = "[%s evaluation] ave loss: %f, pearson:%f, spearman:%f, corr:%f, data_num: %d, elapsed time: %f s" \
+            % (eval_phase, cost, ret['pearson'], ret['spearmanr'], ret['corr'], total_num_seqs, elapsed_time)
+    elif metric == 'simple_accuracy':
+        ret = simple_accuracy(preds, labels)
+        evaluate_info = "[%s evaluation] ave loss: %f, acc:%f, data_num: %d, elapsed time: %f s" \
+            % (eval_phase, cost, ret, total_num_seqs, elapsed_time)
+    elif metric == "acc_and_f1_and_mrr":
+        ret_a = acc_and_f1(preds, labels)
+        preds = sorted(
+            zip(qids, scores, labels), key=lambda elem: (elem[0], -elem[1]))
+        ret_b = evaluate_mrr(preds)
+        evaluate_info = "[%s evaluation] ave loss: %f, acc: %f, f1: %f, mrr: %f, data_num: %d, elapsed time: %f s" \
+            % (eval_phase, cost, ret_a['acc'], ret_a['f1'], ret_b, total_num_seqs, elapsed_time)
+    else:
+        raise ValueError('unsupported metric {}'.format(metric))
+    return evaluate_info
+

-    if len(qids) == 0:
-        print(
-            "[%s evaluation] ave loss: %f, ave acc: %f, data_num: %d, elapsed time: %f s"
-            % (eval_phase, total_cost / total_num_seqs, total_acc /
-               total_num_seqs, total_num_seqs, time_end - time_begin))
+def evaluate_regression(exe,
+                        test_program,
+                        test_pyreader,
+                        graph_vars,
+                        eval_phase,
+                        use_multi_gpu_test=False,
+                        metric='pearson_and_spearman'):
+
+    if eval_phase == "train":
+        train_fetch_list = [graph_vars["loss"].name]
+        if "learning_rate" in graph_vars:
+            train_fetch_list.append(graph_vars["learning_rate"].name)
+        outputs = exe.run(fetch_list=train_fetch_list)
+        ret = {"loss": np.mean(outputs[0])}
+        if "learning_rate" in graph_vars:
+            ret["learning_rate"] = float(outputs[1][0])
+        return ret
+
+    test_pyreader.start()
+    total_cost, total_num_seqs = 0.0, 0.0
+    qids, labels, scores = [], [], []
+
+    fetch_list = [
+        graph_vars["loss"].name, graph_vars["probs"].name,
+        graph_vars["labels"].name, graph_vars["qids"].name
+    ]
+
+    time_begin = time.time()
+    while True:
+        try:
+            if use_multi_gpu_test:
+                np_loss, np_probs, np_labels, np_qids = exe.run(
+                    fetch_list=fetch_list)
+            else:
+                np_loss, np_probs, np_labels, np_qids = exe.run(
+                    program=test_program, fetch_list=fetch_list)
+            labels.extend(np_labels.reshape((-1)).tolist())
+            if np_qids is None:
+                np_qids = np.array([])
+            qids.extend(np_qids.reshape(-1).tolist())
+            scores.extend(np_probs.reshape(-1).tolist())
+        except fluid.core.EOFException:
+            test_pyreader.reset()
+            break
+    time_end = time.time()
+
+    elapsed_time = time_end - time_begin
+
+    if metric == 'pearson_and_spearman':
+        ret = pearson_and_spearman(scores, labels)
+        evaluate_info = "[%s evaluation] ave loss: %f, pearson:%f, spearman:%f, corr:%f, elapsed time: %f s" \
+            % (eval_phase, 0.0, ret['pearson'], ret['spearmanr'], ret['corr'], elapsed_time)
+    else:
+        raise ValueError('unsupported metric {}'.format(metric))
+
+    return evaluate_info
+
+
+def evaluate(exe,
+             test_program,
+             test_pyreader,
+             graph_vars,
+             eval_phase,
+             use_multi_gpu_test=False,
+             metric='simple_accuracy',
+             is_classify=False,
+             is_regression=False):
+
+    if is_classify:
+        return evaluate_classify(
+            exe,
+            test_program,
+            test_pyreader,
+            graph_vars,
+            eval_phase,
+            use_multi_gpu_test=use_multi_gpu_test,
+            metric=metric)
    else:
-        r = total_correct_num / total_label_pos_num
-        p = total_correct_num / total_pred_pos_num
-        f = 2 * p * r / (p + r)
+        return evaluate_regression(
+            exe,
+            test_program,
+            test_pyreader,
+            graph_vars,
+            eval_phase,
+            use_multi_gpu_test=use_multi_gpu_test,
+            metric=metric)
+
+
+def matthews_corrcoef(preds, labels):
+    preds = np.array(preds)
+    labels = np.array(labels)
+    tp = np.sum((labels == 1) & (preds == 1))
+    tn = np.sum((labels == 0) & (preds == 0))
+    fp = np.sum((labels == 0) & (preds == 1))
+    fn = np.sum((labels == 1) & (preds == 0))
+
+    mcc = ((tp * tn) - (fp * fn)) / np.sqrt(
+        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
+    return mcc
+
+
+def f1_score(preds, labels):
+    preds = np.array(preds)
+    labels = np.array(labels)
+
+    tp = np.sum((labels == 1) & (preds == 1))
+    tn = np.sum((labels == 0) & (preds == 0))
+    fp = np.sum((labels == 0) & (preds == 1))
+    fn = np.sum((labels == 1) & (preds == 0))
+    p = tp / (tp + fp)
+    r = tp / (tp + fn)
+    f1 = (2 * p * r) / (p + r + 1e-8)
+    return f1
+
+
+def pearson_and_spearman(preds, labels):
+    preds = np.array(preds)
+    labels = np.array(labels)
+
+    pearson_corr = pearsonr(preds, labels)[0]
+    spearman_corr = spearmanr(preds, labels)[0]
+    return {
+        "pearson": pearson_corr,
+        "spearmanr": spearman_corr,
+        "corr": (pearson_corr + spearman_corr) / 2,
+    }

-        assert len(qids) == len(labels) == len(scores)
-        preds = sorted(
-            zip(qids, scores, labels), key=lambda elem: (elem[0], -elem[1]))
-        mrr = evaluate_mrr(preds)
-        map = evaluate_map(preds)
-
-        print(
-            "[%s evaluation] ave loss: %f, ave_acc: %f, mrr: %f, map: %f, p: %f, r: %f, f1: %f, data_num: %d, elapsed time: %f s"
-            % (eval_phase, total_cost / total_num_seqs,
-               total_acc / total_num_seqs, mrr, map, p, r, f, total_num_seqs,
-               time_end - time_begin))
+
+def acc_and_f1(preds, labels):
+    preds = np.array(preds)
+    labels = np.array(labels)
+
+    acc = simple_accuracy(preds, labels)
+    f1 = f1_score(preds, labels)
+    return {
+        "acc": acc,
+        "f1": f1,
+        "acc_and_f1": (acc + f1) / 2,
+    }
+
+
+def simple_accuracy(preds, labels):
+    preds = np.array(preds)
+    labels = np.array(labels)
+    return (preds == labels).mean()
+
+
+def predict(exe,
+            test_program,
+            test_pyreader,
+            graph_vars,
+            dev_count=1,
+            is_classify=False,
+            is_regression=False):
+    test_pyreader.start()
+    qids, scores, probs = [], [], []
+    preds = []
+
+    fetch_list = [graph_vars["probs"].name, graph_vars["qids"].name]
+
+    while True:
+        try:
+            if dev_count == 1:
+                np_probs, np_qids = exe.run(program=test_program,
+                                            fetch_list=fetch_list)
+            else:
+                np_probs, np_qids = exe.run(fetch_list=fetch_list)
+
+            if np_qids is None:
+                np_qids = np.array([])
+            qids.extend(np_qids.reshape(-1).tolist())
+            if is_classify:
+                np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
+                preds.extend(np_preds)
+            elif is_regression:
+                preds.extend(np_probs.reshape(-1))
+
+            probs.append(np_probs)
+
+        except fluid.core.EOFException:
+            test_pyreader.reset()
+            break
+
+    probs = np.concatenate(probs, axis=0).reshape([len(preds), -1])
+
+    return qids, preds, probs
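The `f1_score` and `matthews_corrcoef` helpers above divide by counts that can be zero, for example when a batch contains no positive predictions. The following is a minimal sketch, in plain Python rather than NumPy, of the same two metrics with explicit zero-denominator handling. The function names `confusion_counts`, `safe_f1` and `safe_mcc` are hypothetical and not part of this commit, and returning 0.0 for an undefined score is a convention chosen for this sketch.

```python
import math


def confusion_counts(preds, labels):
    """Count (tp, tn, fp, fn) for binary 0/1 predictions and labels."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return tp, tn, fp, fn


def safe_f1(preds, labels):
    """F1 = 2*tp / (2*tp + fp + fn), which equals 2pr / (p + r)."""
    tp, _, fp, fn = confusion_counts(preds, labels)
    denom = 2 * tp + fp + fn
    # Assumption of this sketch: an undefined F1 is reported as 0.0.
    return 2.0 * tp / denom if denom else 0.0


def safe_mcc(preds, labels):
    """Matthews correlation coefficient with a zero-denominator guard."""
    tp, tn, fp, fn = confusion_counts(preds, labels)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption of this sketch: an undefined MCC is reported as 0.0.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

With one true positive, one true negative, one false positive and one false negative, `safe_f1` gives 0.5 and `safe_mcc` gives 0.0, matching the formulas in the diff whenever their denominators are non-zero.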
--- a/finetune/mrc.py
+++ b/finetune/mrc.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Model for machine reading comprehension."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import time
+import numpy as np
+import os
+import math
+import json
+import collections
+import six
+
+from scipy.stats import pearsonr, spearmanr
+from six.moves import xrange
+import paddle.fluid as fluid
+
+from utils.cmrc2018_eval import eval_file
+from model.ernie import ErnieModel
+import tokenization
+
+
+def create_model(args, pyreader_name, ernie_config, is_training):
+    pyreader = fluid.layers.py_reader(
+        capacity=50,
+        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
+                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
+                [-1, args.max_seq_len, 1], [-1, 1], [-1, 1], [-1, 1]],
+        dtypes=[
+            'int64', 'int64', 'int64', 'int64', 'float32', 'int64', 'int64',
+            'int64'
+        ],
+        lod_levels=[0, 0, 0, 0, 0, 0, 0, 0],
+        name=pyreader_name,
+        use_double_buffer=True)
+    (src_ids, sent_ids, pos_ids, task_ids, input_mask, start_positions,
+     end_positions, unique_id) = fluid.layers.read_file(pyreader)
+
+    ernie = ErnieModel(
+        src_ids=src_ids,
+        position_ids=pos_ids,
+        sentence_ids=sent_ids,
+        task_ids=task_ids,
+        input_mask=input_mask,
+        config=ernie_config,
+        use_fp16=args.use_fp16)
+
+    enc_out = ernie.get_sequence_output()
+    enc_out = fluid.layers.dropout(
+        x=enc_out, dropout_prob=0.1, dropout_implementation="upscale_in_train")
+
+    logits = fluid.layers.fc(
+        input=enc_out,
+        size=2,
+        num_flatten_dims=2,
+        param_attr=fluid.ParamAttr(
+            name="cls_mrc_out_w",
+            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
+        bias_attr=fluid.ParamAttr(
+            name="cls_mrc_out_b", initializer=fluid.initializer.Constant(0.)))
+
+    logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])
+    start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)
+
+    batch_ones = fluid.layers.fill_constant_batch_size_like(
+        input=start_logits, dtype='int64', shape=[1], value=1)
+    num_seqs = fluid.layers.reduce_sum(input=batch_ones)
+
+    def compute_loss(logits, positions):
+        loss = fluid.layers.softmax_with_cross_entropy(
+            logits=logits, label=positions)
+        loss = fluid.layers.mean(x=loss)
+        return loss
+
+    start_loss = compute_loss(start_logits, start_positions)
+    end_loss = compute_loss(end_logits, end_positions)
+    loss = (start_loss + end_loss) / 2.0
+    if args.use_fp16 and args.loss_scaling > 1.0:
+        loss *= args.loss_scaling
+
+    graph_vars = {
+        "loss": loss,
+        "num_seqs": num_seqs,
+        "unique_id": unique_id,
+        "start_logits": start_logits,
+        "end_logits": end_logits
+    }
+
+    for k, v in graph_vars.items():
+        v.persistable = True
+
+    return pyreader, graph_vars
+
+
+def evaluate(exe,
+             test_program,
+             test_pyreader,
+             graph_vars,
+             eval_phase,
+             tag_num=None,
+             dev_count=1,
+             examples=None,
+             features=None,
+             args=None):
+    if eval_phase == "train":
+        train_fetch_list = [graph_vars["loss"].name]
+        if "learning_rate" in graph_vars:
+            train_fetch_list.append(graph_vars["learning_rate"].name)
+        outputs = exe.run(fetch_list=train_fetch_list)
+        ret = {"loss": np.mean(outputs[0])}
+        if "learning_rate" in graph_vars:
+            ret["learning_rate"] = float(outputs[1][0])
+        return ret
+
+    output_dir = args.checkpoints
+    if not os.path.exists(output_dir):
+        os.makedirs(output_dir)
+    output_prediction_file = os.path.join(output_dir,
+                                          eval_phase + "_predictions.json")
+    output_nbest_file = os.path.join(output_dir,
+                                     eval_phase + "_nbest_predictions.json")
+
+    RawResult = collections.namedtuple(
+        "RawResult", ["unique_id", "start_logits", "end_logits"])
+
+    test_pyreader.start()
+    all_results = []
+    time_begin = time.time()
+
+    fetch_list = [
+        graph_vars["unique_id"].name, graph_vars["start_logits"].name,
+        graph_vars["end_logits"].name, graph_vars["num_seqs"].name
+    ]
+    while True:
+        try:
+            np_unique_ids, np_start_logits, np_end_logits, np_num_seqs = exe.run(
+                program=test_program, fetch_list=fetch_list)
+            for idx in range(np_unique_ids.shape[0]):
+                if len(all_results) % 1000 == 0:
+                    print("Processing example: %d" % len(all_results))
+                unique_id = int(np_unique_ids[idx])
+                start_logits = [float(x) for x in np_start_logits[idx].flat]
+                end_logits = [float(x) for x in np_end_logits[idx].flat]
+                all_results.append(
+                    RawResult(
+                        unique_id=unique_id,
+                        start_logits=start_logits,
+                        end_logits=end_logits))
+
+        except fluid.core.EOFException:
+            test_pyreader.reset()
+            break
+
+    write_predictions(examples, features, all_results, args.n_best_size,
+                      args.max_answer_length, args.do_lower_case,
+                      output_prediction_file, output_nbest_file)
+
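+    if eval_phase.find("dev") != -1:
+        data_file = args.dev_set
+    elif eval_phase.find("test") != -1:
+        data_file = args.test_set
+    else:
+        # Without this branch data_file would be undefined below.
+        raise ValueError("unsupported eval_phase: %s" % eval_phase)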
+
+    em, f1, avg, total = eval_file(data_file, output_prediction_file)
+
+    time_end = time.time()
+    elapsed_time = time_end - time_begin
+
+    print(
+        "[%s evaluation] em: %f, f1: %f, avg: %f, questions: %d, elapsed time: %f"
+        % (eval_phase, em, f1, avg, total, elapsed_time))
+
+
+def write_predictions(all_examples, all_features, all_results, n_best_size,
+                      max_answer_length, do_lower_case, output_prediction_file,
+                      output_nbest_file):
+    """Write final predictions to the json file and log-odds of null if needed."""
+    print("Writing predictions to: %s" % (output_prediction_file))
+    print("Writing nbest to: %s" % (output_nbest_file))
+
+    example_index_to_features = collections.defaultdict(list)
+    for feature in all_features:
+        example_index_to_features[feature.example_index].append(feature)
+
+    unique_id_to_result = {}
+    for result in all_results:
+        unique_id_to_result[result.unique_id] = result
+
+    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name
+        "PrelimPrediction", [
+            "feature_index", "start_index", "end_index", "start_logit",
+            "end_logit"
+        ])
+
+    all_predictions = collections.OrderedDict()
+    all_nbest_json = collections.OrderedDict()
+
+    for (example_index, example) in enumerate(all_examples):
+        features = example_index_to_features[example_index]
+
+        prelim_predictions = []
+        # keep track of the minimum score of null start+end of position 0
+        for (feature_index, feature) in enumerate(features):
+            result = unique_id_to_result[feature.unique_id]
+            start_indexes = _get_best_indexes(result.start_logits, n_best_size)
+            end_indexes = _get_best_indexes(result.end_logits, n_best_size)
+
+            for start_index in start_indexes:
+                for end_index in end_indexes:
+                    # We could hypothetically create invalid predictions, e.g., predict
+                    # that the start of the span is in the question. We throw out all
+                    # invalid predictions.
+                    if start_index >= len(feature.tokens):
+                        continue
+                    if end_index >= len(feature.tokens):
+                        continue
+                    if start_index not in feature.token_to_orig_map:
+                        continue
+                    if end_index not in feature.token_to_orig_map:
+                        continue
+                    if not feature.token_is_max_context.get(start_index, False):
+                        continue
+                    if end_index < start_index:
+                        continue
+                    length = end_index - start_index + 1
+                    if length > max_answer_length:
+                        continue
+                    prelim_predictions.append(
+                        _PrelimPrediction(
+                            feature_index=feature_index,
+                            start_index=start_index,
+                            end_index=end_index,
+                            start_logit=result.start_logits[start_index],
+                            end_logit=result.end_logits[end_index]))
+
+        prelim_predictions = sorted(
+            prelim_predictions,
+            key=lambda x: (x.start_logit + x.end_logit),
+            reverse=True)
+
+        _NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name
+            "NbestPrediction", ["text", "start_logit", "end_logit"])
+
+        seen_predictions = {}
+        nbest = []
+        for pred in prelim_predictions:
+            if len(nbest) >= n_best_size:
+                break
+            feature = features[pred.feature_index]
+            if pred.start_index > 0:  # this is a non-null prediction
+                tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1
+                                                              )]
+                orig_doc_start = feature.token_to_orig_map[pred.start_index]
+                orig_doc_end = feature.token_to_orig_map[pred.end_index]
+                orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end +
+                                                                 1)]
+                tok_text = " ".join(tok_tokens)
+
+                # De-tokenize WordPieces that have been split off.
+                tok_text = tok_text.replace(" ##", "")
+                tok_text = tok_text.replace("##", "")
+
+                # Clean whitespace
+                tok_text = tok_text.strip()
+                tok_text = " ".join(tok_text.split())
+                orig_text = "".join(orig_tokens)
+
+                final_text = get_final_text(tok_text, orig_text, do_lower_case)
+                if final_text in seen_predictions:
+                    continue
+
+                seen_predictions[final_text] = True
+            else:
+                final_text = ""
+                seen_predictions[final_text] = True
+
+            nbest.append(
+                _NbestPrediction(
+                    text=final_text,
+                    start_logit=pred.start_logit,
+                    end_logit=pred.end_logit))
+
+        # In very rare edge cases we could have no valid predictions. So we
+        # just create a nonce prediction in this case to avoid failure.
+        if not nbest:
+            nbest.append(
+                _NbestPrediction(
+                    text="empty", start_logit=0.0, end_logit=0.0))
+
+        total_scores = []
+        best_non_null_entry = None
+        for entry in nbest:
+            total_scores.append(entry.start_logit + entry.end_logit)
+
+        probs = _compute_softmax(total_scores)
+
+        nbest_json = []
+        for (i, entry) in enumerate(nbest):
+            output = collections.OrderedDict()
+            output["text"] = entry.text
+            output["probability"] = probs[i]
+            output["start_logit"] = entry.start_logit
+            output["end_logit"] = entry.end_logit
+            nbest_json.append(output)
+
+        assert len(nbest_json) >= 1
+
+        all_predictions[example.qas_id] = nbest_json[0]["text"]
+        all_nbest_json[example.qas_id] = nbest_json
+
+    with open(output_prediction_file, "w") as writer:
+        writer.write(json.dumps(all_predictions, indent=4) + "\n")
+
+    with open(output_nbest_file, "w") as writer:
+        writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
+
+
+def get_final_text(pred_text, orig_text, do_lower_case):
+    """Project the tokenized prediction back to the original text."""
+
+    # When we created the data, we kept track of the alignment between original
+    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So
+    # now `orig_text` contains the span of our original text corresponding to the
+    # span that we predicted.
+    #
+    # However, `orig_text` may contain extra characters that we don't want in
+    # our prediction.
+    #
+    # For example, let's say:
+    #   pred_text = steve smith
+    #   orig_text = Steve Smith's
+    #
+    # We don't want to return `orig_text` because it contains the extra "'s".
+    #
+    # We don't want to return `pred_text` because it's already been normalized
+    # (the SQuAD eval script also does punctuation stripping/lower casing but
+    # our tokenizer does additional normalization like stripping accent
+    # characters).
+    #
+    # What we really want to return is "Steve Smith".
+    #
+    # Therefore, we have to apply a semi-complicated alignment heuristic between
+    # `pred_text` and `orig_text` to get a character-to-character alignment. This
+    # can fail in certain cases in which case we just return `orig_text`.
+
+    def _strip_spaces(text):
+        ns_chars = []
+        ns_to_s_map = collections.OrderedDict()
+        for (i, c) in enumerate(text):
+            if c == " ":
+                continue
+            ns_to_s_map[len(ns_chars)] = i
+            ns_chars.append(c)
+        ns_text = "".join(ns_chars)
+        return (ns_text, ns_to_s_map)
+
+    # We first tokenize `orig_text`, strip whitespace from the result
+    # and `pred_text`, and check if they are the same length. If they are
+    # NOT the same length, the heuristic has failed. If they are the same
+    # length, we assume the characters are one-to-one aligned.
+    tokenizer = tokenization.BasicTokenizer(do_lower_case=do_lower_case)
+
+    tok_text = " ".join(tokenizer.tokenize(orig_text))
+
+    start_position = tok_text.find(pred_text)
+    if start_position == -1:
+        return orig_text
+    end_position = start_position + len(pred_text) - 1
+
+    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)
+    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)
+
+    if len(orig_ns_text) != len(tok_ns_text):
+        return orig_text
+
+    # We then project the characters in `pred_text` back to `orig_text` using
+    # the character-to-character alignment.
+    tok_s_to_ns_map = {}
+    for (i, tok_index) in six.iteritems(tok_ns_to_s_map):
+        tok_s_to_ns_map[tok_index] = i
+
+    orig_start_position = None
+    if start_position in tok_s_to_ns_map:
+        ns_start_position = tok_s_to_ns_map[start_position]
+        if ns_start_position in orig_ns_to_s_map:
+            orig_start_position = orig_ns_to_s_map[ns_start_position]
+
+    if orig_start_position is None:
+        return orig_text
+
+    orig_end_position = None
+    if end_position in tok_s_to_ns_map:
+        ns_end_position = tok_s_to_ns_map[end_position]
+        if ns_end_position in orig_ns_to_s_map:
+            orig_end_position = orig_ns_to_s_map[ns_end_position]
+
+    if orig_end_position is None:
+        return orig_text
+
+    output_text = orig_text[orig_start_position:(orig_end_position + 1)]
+    return output_text
+
+
+def _get_best_indexes(logits, n_best_size):
+    """Get the n-best logits from a list."""
+    index_and_score = sorted(
+        enumerate(logits), key=lambda x: x[1], reverse=True)
+
+    best_indexes = []
+    for i in range(len(index_and_score)):
+        if i >= n_best_size:
+            break
+        best_indexes.append(index_and_score[i][0])
+    return best_indexes
+
+
+def _compute_softmax(scores):
+    """Compute softmax probability over raw logits."""
+    if not scores:
+        return []
+
+    max_score = None
+    for score in scores:
+        if max_score is None or score > max_score:
+            max_score = score
+
+    exp_scores = []
+    total_sum = 0.0
+    for score in scores:
+        x = math.exp(score - max_score)
+        exp_scores.append(x)
+        total_sum += x
+
+    probs = []
+    for score in exp_scores:
+        probs.append(score / total_sum)
+    return probs
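The span search in `write_predictions` combines `_get_best_indexes` and `_compute_softmax`. It takes the top `n_best_size` start and end positions, discards pairs that are out of order or too long, ranks the rest by the sum of their logits, and normalizes the scores. The sketch below illustrates only that core in plain Python. It deliberately omits the `token_to_orig_map` and `token_is_max_context` checks and the text de-duplication, and the names `top_k_indexes`, `softmax` and `best_spans` are hypothetical, not functions from this file.

```python
import math


def top_k_indexes(logits, k):
    """Indexes of the k largest logits, best first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    if not scores:
        return []
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_spans(start_logits, end_logits, n_best_size, max_answer_length):
    """Return up to n_best_size (start, end, probability) tuples, best first."""
    candidates = []
    for s in top_k_indexes(start_logits, n_best_size):
        for e in top_k_indexes(end_logits, n_best_size):
            # Drop spans that end before they start or exceed the length cap.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append((s, e, start_logits[s] + end_logits[e]))
    candidates.sort(key=lambda c: c[2], reverse=True)
    candidates = candidates[:n_best_size]
    probs = softmax([c[2] for c in candidates])
    return [(s, e, p) for (s, e, _), p in zip(candidates, probs)]
```

For example, `best_spans([0.1, 2.0, 0.3, 0.0], [0.0, 0.5, 3.0, 0.2], 2, 3)` ranks the span from token 1 to token 2 first, because it has the largest combined logit among the valid pairs.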
--- a/ERNIE/finetune/sequence_label.py
+++ b/ERNIE/finetune/sequence_label.py
@@ -35,24 +35,29 @@ def create_model(args, pyreader_name, ernie_config, is_prediction=False):
        capacity=50,
        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, 1]],
-        dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
-        lod_levels=[0, 0, 0, 0, 0, 0],
+                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], [-1, 1]],
+        dtypes=[
+            'int64', 'int64', 'int64', 'int64', 'float32', 'int64', 'int64'
+        ],
+        lod_levels=[0, 0, 0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

-    (src_ids, sent_ids, pos_ids, input_mask, labels,
+    (src_ids, sent_ids, pos_ids, task_ids, input_mask, labels,
     seq_lens) = fluid.layers.read_file(pyreader)

    ernie = ErnieModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
+        task_ids=task_ids,
        input_mask=input_mask,
        config=ernie_config,
        use_fp16=args.use_fp16)

    enc_out = ernie.get_sequence_output()
+    enc_out = fluid.layers.dropout(
+        x=enc_out, dropout_prob=0.1, dropout_implementation="upscale_in_train")
    logits = fluid.layers.fc(
        input=enc_out,
        size=args.num_labels,
@@ -75,6 +80,8 @@ def create_model(args, pyreader_name, ernie_config, is_prediction=False):
            logits, axis=2),
        label=labels,
        return_softmax=True)
+    input_mask = fluid.layers.flatten(input_mask, axis=2)
+    ce_loss = ce_loss * input_mask
    loss = fluid.layers.mean(x=ce_loss)

    if args.use_fp16 and args.loss_scaling > 1.0:
@@ -218,15 +225,15 @@ def evaluate(exe,
        num_label, num_infer, num_correct = chunk_eval(
            np_labels, np_infers, np_lens, tag_num, dev_count)
        precision, recall, f1 = calculate_f1(num_label, num_infer, num_correct)
-        outputs = {
+        rets = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "loss": np.mean(np_loss)
        }
        if "learning_rate" in graph_vars:
-            outputs["lr"] = float(outputs[4][0])
-        return outputs
+            rets["lr"] = float(outputs[4][0])
+        return rets

    else:
        total_label, total_infer, total_correct = 0.0, 0.0, 0.0
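Note on the `create_model` hunk above: the per-token cross-entropy is multiplied by `input_mask` before `fluid.layers.mean`, so padded positions contribute zero loss. A minimal pure-Python sketch of that arithmetic follows. The name `masked_mean_loss` is hypothetical and the code is not Paddle code. It assumes the mean is taken over every position, padded ones included, which is what a plain mean over the masked tensor would do.

```python
def masked_mean_loss(token_losses, input_mask):
    """Zero out padded positions, then average over ALL positions.

    token_losses and input_mask are nested lists of shape [batch, seq_len];
    input_mask holds 1 for real tokens and 0 for padding.
    """
    total = 0.0
    count = 0
    for row_loss, row_mask in zip(token_losses, input_mask):
        for loss, mask in zip(row_loss, row_mask):
            total += loss * mask  # padded positions add nothing
            count += 1            # but they still count in the denominator
    return total / count


losses = [[1.0, 2.0, 5.0], [3.0, 7.0, 9.0]]
mask = [[1, 1, 0], [1, 0, 0]]
result = masked_mean_loss(losses, mask)
```

Under this assumption the loss is smaller than the average over real tokens alone; dividing by the sum of the mask instead would give the per-token average.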

--- a/ERNIE/finetune_args.py
+++ b/ERNIE/finetune_args.py
@@ -32,6 +32,10 @@ model_g.add_arg("init_pretraining_params",  str,  None,
                 "arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
 model_g.add_arg("checkpoints",              str,  "checkpoints",  "Path to save checkpoints.")

+model_g.add_arg("is_classify",    bool, True,  "is_classify")
+model_g.add_arg("is_regression",  bool, False, "is_regression")
+model_g.add_arg("task_id",           int,    0,       "task id")
+
 train_g = ArgumentGroup(parser, "training", "training options.")
 train_g.add_arg("epoch",             int,    3,       "Number of epochs for fine-tuning.")
 train_g.add_arg("learning_rate",     float,  5e-5,    "Learning rate used to train with warmup.")
@@ -45,26 +49,39 @@ train_g.add_arg("validation_steps",  int,    1000,    "The steps interval to eva
 train_g.add_arg("use_fp16",          bool,   False,   "Whether to use fp16 mixed precision training.")
 train_g.add_arg("loss_scaling",      float,  1.0,
                "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
+train_g.add_arg("test_save",            str,    "test_result",       "test_save")
+train_g.add_arg("metric",               str,    "simple_accuracy",   "metric")

 log_g = ArgumentGroup(parser,     "logging", "logging related.")
 log_g.add_arg("skip_steps",          int,    10,    "The steps interval to print loss.")
 log_g.add_arg("verbose",             bool,   False, "Whether to output verbose log.")

 data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
+data_g.add_arg("tokenizer",           str, "FullTokenizer",
+              "ATTENTION: the input must be split into words separated by spaces when using SentencepieceTokenizer or WordsegTokenizer")
 data_g.add_arg("train_set",           str,  None,  "Path to training data.")
 data_g.add_arg("test_set",            str,  None,  "Path to test data.")
 data_g.add_arg("dev_set",             str,  None,  "Path to validation data.")
 data_g.add_arg("vocab_path",          str,  None,  "Vocabulary path.")
 data_g.add_arg("max_seq_len",         int,  512,   "Number of words of the longest sequence.")
 data_g.add_arg("batch_size",          int,  32,    "Total number of examples in a batch for training. See also --in_tokens.")
+data_g.add_arg("predict_batch_size",  int,  None,    "Total number of examples in a batch for prediction. See also --in_tokens.")
 data_g.add_arg("in_tokens",           bool, False,
              "If set, the batch size will be the maximum number of tokens in one batch. "
              "Otherwise, it will be the maximum number of examples in one batch.")
 data_g.add_arg("do_lower_case",       bool, True,
               "Whether to lower case the input text. Should be True for uncased models and False for cased models.")
-data_g.add_arg("random_seed",         int,  0,     "Random seed.")
+data_g.add_arg("random_seed",         int,  None,     "Random seed.")
 data_g.add_arg("label_map_config",    str,  None,  "label_map_path.")
 data_g.add_arg("num_labels",          int,  2,     "label number")
+data_g.add_arg("diagnostic",          str,  None,  "GLUE Diagnostic Dataset")
+data_g.add_arg("diagnostic_save",     str,  None,  "GLUE Diagnostic save f")
+data_g.add_arg("max_query_length",          int,   64,    "Max query length.")
+data_g.add_arg("max_answer_length",         int,   100,    "Max answer length.")
+data_g.add_arg("doc_stride",                int,   128,
+               "When splitting up a long document into chunks, how much stride to take between chunks.")
+data_g.add_arg("n_best_size",               int,   20,
+               "The total number of n-best predictions to generate in the nbest_predictions.json output file.")

 run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
 run_type_g.add_arg("use_cuda",                     bool,   True,  "If set, use GPU for training.")
@@ -73,8 +90,10 @@ run_type_g.add_arg("num_iteration_per_drop_scope", int,    10,    "Iteration int
 run_type_g.add_arg("do_train",                     bool,   True,  "Whether to perform training.")
 run_type_g.add_arg("do_val",                       bool,   True,  "Whether to perform evaluation on dev data set.")
 run_type_g.add_arg("do_test",                      bool,   True,  "Whether to perform evaluation on test data set.")
+run_type_g.add_arg("use_multi_gpu_test",           bool,   False, "Whether to perform evaluation using multiple gpu cards")
 run_type_g.add_arg("metrics",                      bool,   True,  "Whether to perform evaluation on test data set.")
 run_type_g.add_arg("shuffle",                      bool,   True,  "")
+run_type_g.add_arg("for_cn",                       bool,   True,  "Whether the model is trained for Chinese (True) or for other languages (False).")

 parser.add_argument("--enable_ce", action='store_true', help="The flag indicating whether to run the task for continuous evaluation.")
 # yapf: enable
--- a/ERNIE/model/__init__.py
+++ b/ERNIE/model/__init__.py
--- a/model/ernie.py
+++ b/model/ernie.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Ernie model."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import json
+
+import six
+import paddle.fluid as fluid
+
+from model.transformer_encoder import encoder, pre_process_layer
+
+
+class ErnieConfig(object):
+    def __init__(self, config_path):
+        self._config_dict = self._parse(config_path)
+
+    def _parse(self, config_path):
+        try:
+            with open(config_path) as json_file:
+                config_dict = json.load(json_file)
+        except Exception:
+            raise IOError("Error in parsing Ernie model config file '%s'" %
+                          config_path)
+        else:
+            return config_dict
+
+    def __getitem__(self, key):
+        return self._config_dict.get(key, None)
+
+    def print_config(self):
+        for arg, value in sorted(six.iteritems(self._config_dict)):
+            print('%s: %s' % (arg, value))
+        print('------------------------------------------------')
+
+
+class ErnieModel(object):
+    def __init__(self,
+                 src_ids,
+                 position_ids,
+                 sentence_ids,
+                 task_ids,
+                 input_mask,
+                 config,
+                 weight_sharing=True,
+                 use_fp16=False):
+
+        self._emb_size = config['hidden_size']
+        self._n_layer = config['num_hidden_layers']
+        self._n_head = config['num_attention_heads']
+        self._voc_size = config['vocab_size']
+        self._max_position_seq_len = config['max_position_embeddings']
+        if config['sent_type_vocab_size']:
+            self._sent_types = config['sent_type_vocab_size']
+        else:
+            self._sent_types = config['type_vocab_size']
+
+        self._use_task_id = config['use_task_id']
+        if self._use_task_id:
+            self._task_types = config['task_type_vocab_size']
+        self._hidden_act = config['hidden_act']
+        self._prepostprocess_dropout = config['hidden_dropout_prob']
+        self._attention_dropout = config['attention_probs_dropout_prob']
+        self._weight_sharing = weight_sharing
+
+        self._word_emb_name = "word_embedding"
+        self._pos_emb_name = "pos_embedding"
+        self._sent_emb_name = "sent_embedding"
+        self._task_emb_name = "task_embedding"
+        self._dtype = "float16" if use_fp16 else "float32"
+        self._emb_dtype = "float32"
+
+        # Initialize all weights by truncated normal initializer, and all biases
+        # will be initialized by constant zero by default.
+        self._param_initializer = fluid.initializer.TruncatedNormal(
+            scale=config['initializer_range'])
+
+        self._build_model(src_ids, position_ids, sentence_ids, task_ids,
+                          input_mask)
+
+    def _build_model(self, src_ids, position_ids, sentence_ids, task_ids,
+                     input_mask):
+        # padding id in vocabulary must be set to 0
+        emb_out = fluid.layers.embedding(
+            input=src_ids,
+            size=[self._voc_size, self._emb_size],
+            dtype=self._emb_dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._word_emb_name, initializer=self._param_initializer),
+            is_sparse=False)
+
+        position_emb_out = fluid.layers.embedding(
+            input=position_ids,
+            size=[self._max_position_seq_len, self._emb_size],
+            dtype=self._emb_dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._pos_emb_name, initializer=self._param_initializer))
+
+        sent_emb_out = fluid.layers.embedding(
+            sentence_ids,
+            size=[self._sent_types, self._emb_size],
+            dtype=self._emb_dtype,
+            param_attr=fluid.ParamAttr(
+                name=self._sent_emb_name, initializer=self._param_initializer))
+
+        emb_out = emb_out + position_emb_out
+        emb_out = emb_out + sent_emb_out
+
+        if self._use_task_id:
+            task_emb_out = fluid.layers.embedding(
+                task_ids,
+                size=[self._task_types, self._emb_size],
+                dtype=self._emb_dtype,
+                param_attr=fluid.ParamAttr(
+                    name=self._task_emb_name,
+                    initializer=self._param_initializer))
+
+            emb_out = emb_out + task_emb_out
+
+        emb_out = pre_process_layer(
+            emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
+
+        if self._dtype == "float16":
+            emb_out = fluid.layers.cast(x=emb_out, dtype=self._dtype)
+            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
+        self_attn_mask = fluid.layers.matmul(
+            x=input_mask, y=input_mask, transpose_y=True)
+
+        self_attn_mask = fluid.layers.scale(
+            x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
+        n_head_self_attn_mask = fluid.layers.stack(
+            x=[self_attn_mask] * self._n_head, axis=1)
+        n_head_self_attn_mask.stop_gradient = True
+
+        self._enc_out = encoder(
+            enc_input=emb_out,
+            attn_bias=n_head_self_attn_mask,
+            n_layer=self._n_layer,
+            n_head=self._n_head,
+            d_key=self._emb_size // self._n_head,
+            d_value=self._emb_size // self._n_head,
+            d_model=self._emb_size,
+            d_inner_hid=self._emb_size * 4,
+            prepostprocess_dropout=self._prepostprocess_dropout,
+            attention_dropout=self._attention_dropout,
+            relu_dropout=0,
+            hidden_act=self._hidden_act,
+            preprocess_cmd="",
+            postprocess_cmd="dan",
+            param_initializer=self._param_initializer,
+            name='encoder')
+
+    def get_sequence_output(self):
+        return self._enc_out
+
+    def get_pooled_output(self):
+        """Get the first feature of each sequence for classification"""
+        next_sent_feat = fluid.layers.slice(
+            input=self._enc_out, axes=[1], starts=[0], ends=[1])
+        if self._dtype == "float16":
+            next_sent_feat = fluid.layers.cast(
+                x=next_sent_feat, dtype=self._emb_dtype)
+        next_sent_feat = fluid.layers.fc(
+            input=next_sent_feat,
+            size=self._emb_size,
+            act="tanh",
+            param_attr=fluid.ParamAttr(
+                name="pooled_fc.w_0", initializer=self._param_initializer),
+            bias_attr="pooled_fc.b_0")
+        return next_sent_feat
+
+    def get_lm_output(self, mask_label, mask_pos):
+        """Get the loss & accuracy for pretraining"""
+
+        mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
+
+        # extract the first token feature in each sentence
+        self.next_sent_feat = self.get_pooled_output()
+        reshaped_emb_out = fluid.layers.reshape(
+            x=self._enc_out, shape=[-1, self._emb_size])
+        # extract masked tokens' feature
+        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
+        if self._dtype == "float16":
+            mask_feat = fluid.layers.cast(x=mask_feat, dtype=self._emb_dtype)
+
+        # transform: fc
+        mask_trans_feat = fluid.layers.fc(
+            input=mask_feat,
+            size=self._emb_size,
+            act=self._hidden_act,
+            param_attr=fluid.ParamAttr(
+                name='mask_lm_trans_fc.w_0',
+                initializer=self._param_initializer),
+            bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
+
+        # transform: layer norm 
+        mask_trans_feat = fluid.layers.layer_norm(
+            mask_trans_feat,
+            begin_norm_axis=len(mask_trans_feat.shape) - 1,
+            param_attr=fluid.ParamAttr(
+                name='mask_lm_trans_layer_norm_scale',
+                initializer=fluid.initializer.Constant(1.)),
+            bias_attr=fluid.ParamAttr(
+                name='mask_lm_trans_layer_norm_bias',
+                initializer=fluid.initializer.Constant(1.)))
+        # transform: layer norm 
+        #mask_trans_feat = pre_process_layer(
+        #    mask_trans_feat, 'n', name='mask_lm_trans')
+
+        mask_lm_out_bias_attr = fluid.ParamAttr(
+            name="mask_lm_out_fc.b_0",
+            initializer=fluid.initializer.Constant(value=0.0))
+        if self._weight_sharing:
+            fc_out = fluid.layers.matmul(
+                x=mask_trans_feat,
+                y=fluid.default_main_program().global_block().var(
+                    self._word_emb_name),
+                transpose_y=True)
+            fc_out += fluid.layers.create_parameter(
+                shape=[self._voc_size],
+                dtype=self._emb_dtype,
+                attr=mask_lm_out_bias_attr,
+                is_bias=True)
+
+        else:
+            fc_out = fluid.layers.fc(input=mask_trans_feat,
+                                     size=self._voc_size,
+                                     param_attr=fluid.ParamAttr(
+                                         name="mask_lm_out_fc.w_0",
+                                         initializer=self._param_initializer),
+                                     bias_attr=mask_lm_out_bias_attr)
+
+        mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
+            logits=fc_out, label=mask_label)
+        mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
+
+        return mean_mask_lm_loss
+
+    def get_task_output(self, task, task_labels):
+        task_fc_out = fluid.layers.fc(input=self.next_sent_feat,
+                                      size=task["num_labels"],
+                                      param_attr=fluid.ParamAttr(
+                                          name=task["task_name"] + "_fc.w_0",
+                                          initializer=self._param_initializer),
+                                      bias_attr=task["task_name"] + "_fc.b_0")
+        task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(
+            logits=task_fc_out, label=task_labels, return_softmax=True)
+        task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
+        mean_task_loss = fluid.layers.mean(task_loss)
+        return mean_task_loss, task_acc
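Note on `_build_model` above: the attention bias is derived from `input_mask` by taking the outer product of the mask with itself and then applying `fluid.layers.scale` with `bias_after_scale=False`, which computes `scale * (x + bias)`. The sketch below reproduces that arithmetic for a single sequence in plain Python. The function name `attention_bias` is hypothetical, and the per-head stacking is left out.

```python
def attention_bias(mask, scale=10000.0, bias=-1.0):
    """Return a [seq_len, seq_len] additive attention bias.

    mask is a list of 0/1 flags (1 = real token, 0 = padding).
    A pair of real tokens gets 0.0; any pair that involves a padded
    position gets -scale, which suppresses it after softmax.
    """
    return [[scale * (a * b + bias) for b in mask] for a in mask]


bias_matrix = attention_bias([1, 1, 0])
```

For the mask `[1, 1, 0]`, the top-left 2x2 block is zero and every entry in the last row or last column is -10000.0.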
--- a/ERNIE/model/ernie.py
+++ b/ERNIE/model/ernie.py
--- a/ERNIE/model/transformer_encoder.py
+++ b/ERNIE/model/transformer_encoder.py
--- a/ERNIE/optimization.py
+++ b/ERNIE/optimization.py
@@ -59,7 +59,12 @@ def optimization(loss,
                 weight_decay,
                 scheduler='linear_warmup_decay',
                 use_fp16=False,
-                 loss_scaling=1.0):
+                 use_dynamic_loss_scaling=False,
+                 init_loss_scaling=1.0,
+                 incr_every_n_steps=1000,
+                 decr_every_n_nan_or_inf=2,
+                 incr_ratio=2.0,
+                 decr_ratio=0.8):
    if warmup_steps > 0:
        if scheduler == 'noam_decay':
            scheduled_lr = fluid.layers.learning_rate_scheduler\
@@ -73,16 +78,18 @@ def optimization(loss,
                             "'noam_decay' or 'linear_warmup_decay'")
        optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
    else:
-        optimizer = fluid.optimizer.Adam(learning_rate=learning_rate)
-        scheduled_lr = learning_rate
-
-    clip_norm_thres = 1.0
-    # When using mixed precision training, scale the gradient clip threshold
-    # by loss_scaling
-    if use_fp16 and loss_scaling > 1.0:
-        clip_norm_thres *= loss_scaling
+        scheduled_lr = fluid.layers.create_global_var(
+            name=fluid.unique_name.generate("learning_rate"),
+            shape=[1],
+            value=learning_rate,
+            dtype='float32',
+            persistable=True)
+        optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
+        optimizer._learning_rate_map[fluid.default_main_program(
+        )] = scheduled_lr
+
    fluid.clip.set_gradient_clip(
-        clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm_thres))
+        clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))

    def exclude_from_weight_decay(name):
        if name.find("layer_norm") > -1:
@@ -95,8 +102,17 @@ def optimization(loss,

    param_list = dict()

+    loss_scaling = fluid.layers.create_global_var(
+        name=fluid.unique_name.generate("loss_scaling"),
+        shape=[1],
+        value=init_loss_scaling,
+        dtype='float32',
+        persistable=True)
+
    if use_fp16:
+        loss *= loss_scaling
        param_grads = optimizer.backward(loss)
+
        master_param_grads = create_master_params_grads(
            param_grads, train_program, startup_prog, loss_scaling)

@@ -104,6 +120,11 @@ def optimization(loss,
            param_list[param.name] = param * 1.0
            param_list[param.name].stop_gradient = True

+        if use_dynamic_loss_scaling:
+            apply_dynamic_loss_scaling(
+                loss_scaling, master_param_grads, incr_every_n_steps,
+                decr_every_n_nan_or_inf, incr_ratio, decr_ratio)
+
        optimizer.apply_gradients(master_param_grads)

        if weight_decay > 0:
@@ -136,4 +157,4 @@ def optimization(loss,
                        param.name] * weight_decay * scheduled_lr
                    fluid.layers.assign(output=param, input=updated_param)

-    return scheduled_lr
+    return scheduled_lr, loss_scaling
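Note on the new `optimization` arguments above: the body of `apply_dynamic_loss_scaling` is not shown in this section, so the following is only a sketch of the usual dynamic loss scaling scheme that these parameter names suggest. It assumes the scale grows by `incr_ratio` after `incr_every_n_steps` consecutive steps with finite gradients and shrinks by `decr_ratio` after `decr_every_n_nan_or_inf` consecutive steps with NaN or Inf gradients. The class `LossScaler` is hypothetical and is not part of the repository.

```python
class LossScaler(object):
    """Toy dynamic loss scaler (assumed behaviour, not the Paddle code)."""

    def __init__(self,
                 init_loss_scaling=1.0,
                 incr_every_n_steps=1000,
                 decr_every_n_nan_or_inf=2,
                 incr_ratio=2.0,
                 decr_ratio=0.8):
        self.scale = init_loss_scaling
        self.incr_every_n_steps = incr_every_n_steps
        self.decr_every_n_nan_or_inf = decr_every_n_nan_or_inf
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self.good_steps = 0  # consecutive steps with finite gradients
        self.bad_steps = 0   # consecutive steps with NaN/Inf gradients

    def update(self, grads_are_finite):
        """Advance one training step and return the new scale."""
        if grads_are_finite:
            self.good_steps += 1
            self.bad_steps = 0
            if self.good_steps == self.incr_every_n_steps:
                self.scale *= self.incr_ratio
                self.good_steps = 0
        else:
            self.bad_steps += 1
            self.good_steps = 0
            if self.bad_steps == self.decr_every_n_nan_or_inf:
                self.scale *= self.decr_ratio
                self.bad_steps = 0
        return self.scale


scaler = LossScaler(init_loss_scaling=128.0, incr_every_n_steps=3)
after_good = [scaler.update(True) for _ in range(3)][-1]
after_bad = [scaler.update(False) for _ in range(2)][-1]
```

With a fixed gradient clip norm of 1.0, as in the hunk above, the gradients would need to be unscaled before clipping; this sketch does not model that step.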
--- a/ERNIE/predict_classifier.py
+++ b/ERNIE/predict_classifier.py
@@ -46,6 +46,7 @@ model_g.add_arg("init_checkpoint",              str,  None,  "Init checkpoint to
 model_g.add_arg("save_inference_model_path",    str,  "inference_model",  "If set, save the inference model to this path.")
 model_g.add_arg("use_fp16",                     bool, False, "Whether to resume parameters from fp16 checkpoint.")
 model_g.add_arg("num_labels",                   int,  2,     "num labels for classify")
+model_g.add_arg("ernie_version",                str,  "1.0", "ernie_version")

 data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options.")
 data_g.add_arg("predict_set",         str,  None,  "Predict set file")
@@ -83,7 +84,9 @@ def main(args):
                args,
                pyreader_name='predict_reader',
                ernie_config=ernie_config,
-                is_prediction=True)
+                is_classify=True,
+                is_prediction=True,
+                ernie_version=args.ernie_version)

    predict_prog = predict_prog.clone(for_test=True)

@@ -122,6 +125,8 @@ def main(args):
    sent_ids = feed_target_names[1]
    pos_ids = feed_target_names[2]
    input_mask = feed_target_names[3]
+    if args.ernie_version == "2.0":
+        task_ids = feed_target_names[4]

    predict_data_generator = reader.data_generator(
        input_file=args.predict_set,
@@ -136,14 +141,28 @@ def main(args):
        src_ids_data = sample[0]
        sent_ids_data = sample[1]
        pos_ids_data = sample[2]
-        input_mask_data = sample[3]
-        output = exe.run(
-            infer_program,
-            feed={src_ids: src_ids_data,
-                  sent_ids: sent_ids_data,
-                  pos_ids: pos_ids_data,
-                  input_mask: input_mask_data},
-            fetch_list=probs)
+        task_ids_data = sample[3]
+        input_mask_data = sample[4]
+        if args.ernie_version == "1.0":
+            output = exe.run(
+                infer_program,
+                feed={src_ids: src_ids_data,
+                      sent_ids: sent_ids_data,
+                      pos_ids: pos_ids_data,
+                      input_mask: input_mask_data},
+                fetch_list=probs)
+        elif args.ernie_version == "2.0":
+            output = exe.run(
+                infer_program,
+                feed={src_ids: src_ids_data,
+                      sent_ids: sent_ids_data,
+                      pos_ids: pos_ids_data,
+                      task_ids: task_ids_data,
+                      input_mask: input_mask_data},
+                fetch_list=probs)
+        else:
+            raise ValueError("ernie_version must be 1.0 or 2.0")
+
        for single_result in output[0]:
            print("example_index:{}\t{}".format(index, single_result))
            index += 1
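Note on the `predict_classifier.py` hunk above: the two `exe.run` calls differ only in whether `task_ids` is fed. One way to avoid the duplication is to build the feed dictionary first. The helper below is a hypothetical sketch, not part of the repository. It assumes the same ordering as the diff: feed target names are `[src_ids, sent_ids, pos_ids, input_mask, task_ids]` and each sample is `[src_ids, sent_ids, pos_ids, task_ids, input_mask]`.

```python
def build_feed(feed_target_names, sample, ernie_version):
    """Map feed target names to sample arrays for ERNIE 1.0 or 2.0."""
    if ernie_version not in ("1.0", "2.0"):
        raise ValueError("ernie_version must be 1.0 or 2.0")

    feed = {
        feed_target_names[0]: sample[0],  # src_ids
        feed_target_names[1]: sample[1],  # sent_ids
        feed_target_names[2]: sample[2],  # pos_ids
        feed_target_names[3]: sample[4],  # input_mask
    }
    if ernie_version == "2.0":
        # Only ERNIE 2.0 inference models take a task id input.
        feed[feed_target_names[4]] = sample[3]
    return feed


names = ["src", "sent", "pos", "mask", "task"]
sample = ["S", "T", "P", "K", "M"]
feed_v1 = build_feed(names, sample, "1.0")
feed_v2 = build_feed(names, sample, "2.0")
```

The result could then be passed as `feed=` to a single `exe.run` call.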

--- a/ERNIE/pretrain_args.py
+++ b/ERNIE/pretrain_args.py
--- a/ERNIE/reader/__init__.py
+++ b/ERNIE/reader/__init__.py
--- a/ERNIE/reader/pretraining.py
+++ b/ERNIE/reader/pretraining.py
--- a/ERNIE/reader/task_reader.py
+++ b/ERNIE/reader/task_reader.py
@@ -12,8 +12,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import os
 import csv
 import json
+import random
 import numpy as np
 from collections import namedtuple

@@ -29,7 +31,12 @@ class BaseReader(object):
                 do_lower_case=True,
                 in_tokens=False,
                 is_inference=False,
-                 random_seed=None):
+                 random_seed=None,
+                 tokenizer="FullTokenizer",
+                 is_classify=True,
+                 is_regression=False,
+                 for_cn=True,
+                 task_id=0):
        self.max_seq_len = max_seq_len
        self.tokenizer = tokenization.FullTokenizer(
            vocab_file=vocab_path, do_lower_case=do_lower_case)
@@ -39,9 +46,13 @@ class BaseReader(object):
        self.sep_id = self.vocab["[SEP]"]
        self.in_tokens = in_tokens
        self.is_inference = is_inference
+        self.for_cn = for_cn
+        self.task_id = task_id

        np.random.seed(random_seed)

+        self.is_classify = is_classify
+        self.is_regression = is_regression
        self.current_example = 0
        self.current_epoch = 0
        self.num_examples = 0
@@ -91,7 +102,14 @@ class BaseReader(object):
        text_a = tokenization.convert_to_unicode(example.text_a)
        tokens_a = tokenizer.tokenize(text_a)
        tokens_b = None
-        if "text_b" in example._fields:
+
+        has_text_b = False
+        if isinstance(example, dict):
+            has_text_b = "text_b" in example.keys()
+        else:
+            has_text_b = "text_b" in example._fields
+
+        if has_text_b:
            text_b = tokenization.convert_to_unicode(example.text_b)
            tokens_b = tokenizer.tokenize(text_b)

@@ -202,11 +220,13 @@ class BaseReader(object):
                       input_file,
                       batch_size,
                       epoch,
+                       dev_count=1,
                       shuffle=True,
                       phase=None):
        examples = self._read_tsv(input_file)

        def wrapper():
+            all_dev_batches = []
            for epoch_index in range(epoch):
                if phase == "train":
                    self.current_example = 0
@@ -216,7 +236,12 @@ class BaseReader(object):

                for batch_data in self._prepare_batch_data(
                        examples, batch_size, phase=phase):
-                    yield batch_data
+                    if len(all_dev_batches) < dev_count:
+                        all_dev_batches.append(batch_data)
+                    if len(all_dev_batches) == dev_count:
+                        for batch in all_dev_batches:
+                            yield batch
+                        all_dev_batches = []

        return wrapper
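Note on the `data_generator` hunk above: batches are now buffered and released only in groups of `dev_count`, presumably so that every device receives a batch in each step. The generator below is a simplified sketch of that buffering over a flat stream of batches. The name `group_for_devices` is hypothetical.

```python
def group_for_devices(batches, dev_count):
    """Yield batches in complete groups of dev_count.

    Batches left over at the end that do not fill a whole group
    are never yielded, matching the buffering logic in the diff.
    """
    pending = []
    for batch in batches:
        pending.append(batch)
        if len(pending) == dev_count:
            for ready in pending:
                yield ready
            pending = []


two_devices = list(group_for_devices([1, 2, 3, 4, 5], dev_count=2))
one_device = list(group_for_devices([1, 2, 3], dev_count=1))
```

With two devices and five batches, the fifth batch is dropped. Evaluation code that needs every example may want to handle that remainder separately.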

@@ -236,7 +261,10 @@ class ClassifyReader(BaseReader):
            for line in reader:
                for index, text in enumerate(line):
                    if index in text_indices:
-                        line[index] = text.replace(' ', '')
+                        if self.for_cn:
+                            line[index] = text.replace(' ', '')
+                        else:
+                            line[index] = text
                example = Example(*line)
                examples.append(example)
            return examples
@@ -248,10 +276,14 @@ class ClassifyReader(BaseReader):

        if not self.is_inference:
            batch_labels = [record.label_id for record in batch_records]
-            batch_labels = np.array(batch_labels).astype("int64").reshape(
-                [-1, 1])
+            if self.is_classify:
+                batch_labels = np.array(batch_labels).astype("int64").reshape(
+                    [-1, 1])
+            elif self.is_regression:
+                batch_labels = np.array(batch_labels).astype("float32").reshape(
+                    [-1, 1])

-            if batch_records[0].qid is not None:
+            if batch_records[0].qid:
                batch_qids = [record.qid for record in batch_records]
                batch_qids = np.array(batch_qids).astype("int64").reshape(
                    [-1, 1])
@@ -265,10 +297,12 @@ class ClassifyReader(BaseReader):
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)
+        padded_task_ids = np.ones_like(
+            padded_token_ids, dtype="int64") * self.task_id

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
-            input_mask
+            padded_task_ids, input_mask
        ]
        if not self.is_inference:
            return_list += [batch_labels, batch_qids]
@@ -295,10 +329,12 @@ class SequenceLabelReader(BaseReader):
            batch_position_ids, pad_idx=self.pad_id)
        padded_label_ids = pad_batch_data(
            batch_label_ids, pad_idx=len(self.label_map) - 1)
+        padded_task_ids = np.ones_like(
+            padded_token_ids, dtype="int64") * self.task_id

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
-            input_mask, padded_label_ids, batch_seq_lens
+            padded_task_ids, input_mask, padded_label_ids, batch_seq_lens
        ]
        return return_list

@@ -367,14 +403,370 @@ class ExtractEmbeddingReader(BaseReader):
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)
+        padded_task_ids = np.ones_like(
+            padded_token_ids, dtype="int64") * self.task_id
+
+        return_list = [
+            padded_token_ids, padded_text_type_ids, padded_position_ids,
+            padded_task_ids, input_mask, seq_lens
+        ]
+
+        return return_list
+
+
+class MRCReader(BaseReader):
+    def __init__(self,
+                 vocab_path,
+                 label_map_config=None,
+                 max_seq_len=512,
+                 do_lower_case=True,
+                 in_tokens=False,
+                 random_seed=None,
+                 tokenizer="FullTokenizer",
+                 is_classify=True,
+                 is_regression=False,
+                 for_cn=True,
+                 task_id=0,
+                 doc_stride=128,
+                 max_query_length=64):
+        self.max_seq_len = max_seq_len
+        self.tokenizer = tokenization.FullTokenizer(
+            vocab_file=vocab_path, do_lower_case=do_lower_case)
+        self.vocab = self.tokenizer.vocab
+        self.pad_id = self.vocab["[PAD]"]
+        self.cls_id = self.vocab["[CLS]"]
+        self.sep_id = self.vocab["[SEP]"]
+        self.in_tokens = in_tokens
+        self.for_cn = for_cn
+        self.task_id = task_id
+        self.doc_stride = doc_stride
+        self.max_query_length = max_query_length
+        self.examples = {}
+        self.features = {}
+
+        if random_seed is not None:
+            np.random.seed(random_seed)
+
+        self.current_example = 0
+        self.current_epoch = 0
+        self.num_examples = 0
+
+    def _read_json(self, input_file, is_training):
+        examples = []
+        with open(input_file, "r") as f:
+            input_data = json.load(f)["data"]
+            for entry in input_data:
+                for paragraph in entry["paragraphs"]:
+                    paragraph_text = paragraph["context"]
+                    for qa in paragraph["qas"]:
+                        qas_id = qa["id"]
+                        question_text = qa["question"]
+                        start_pos = None
+                        end_pos = None
+                        orig_answer_text = None
+
+                        if is_training:
+                            if len(qa["answers"]) != 1:
+                                raise ValueError(
+                                    "For training, each question should have exactly 1 answer."
+                                )
+
+                            answer = qa["answers"][0]
+                            orig_answer_text = answer["text"]
+                            answer_offset = answer["answer_start"]
+                            answer_length = len(orig_answer_text)
+                            doc_tokens = [
+                                paragraph_text[:answer_offset],
+                                paragraph_text[answer_offset:answer_offset +
+                                               answer_length],
+                                paragraph_text[answer_offset + answer_length:]
+                            ]
+
+                            start_pos = 1
+                            end_pos = 1
+
+                            actual_text = " ".join(doc_tokens[start_pos:(end_pos
+                                                                         + 1)])
+                            if actual_text.find(orig_answer_text) == -1:
+                                print("Could not find answer: '%s' vs. '%s'" %
+                                      (actual_text, orig_answer_text))
+                                continue
+                        else:
+                            doc_tokens = tokenization.tokenize_chinese_chars(
+                                paragraph_text)
+
+                        Example = namedtuple('Example', [
+                            'qas_id', 'question_text', 'doc_tokens',
+                            'orig_answer_text', 'start_position', 'end_position'
+                        ])
+
+                        example = Example(
+                            qas_id=qas_id,
+                            question_text=question_text,
+                            doc_tokens=doc_tokens,
+                            orig_answer_text=orig_answer_text,
+                            start_position=start_pos,
+                            end_position=end_pos)
+                        examples.append(example)
+
+        return examples
+
+    def _improve_answer_span(self, doc_tokens, input_start, input_end,
+                             tokenizer, orig_answer_text):
+        tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
+
+        for new_start in range(input_start, input_end + 1):
+            for new_end in range(input_end, new_start - 1, -1):
+                text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
+                if text_span == tok_answer_text:
+                    return (new_start, new_end)
+
+        return (input_start, input_end)
+
+    def _check_is_max_context(self, doc_spans, cur_span_index, position):
+        best_score = None
+        best_span_index = None
+        for (span_index, doc_span) in enumerate(doc_spans):
+            end = doc_span.start + doc_span.length - 1
+            if position < doc_span.start:
+                continue
+            if position > end:
+                continue
+            num_left_context = position - doc_span.start
+            num_right_context = end - position
+            score = min(num_left_context,
+                        num_right_context) + 0.01 * doc_span.length
+            if best_score is None or score > best_score:
+                best_score = score
+                best_span_index = span_index
+
+        return cur_span_index == best_span_index
+
+    def _convert_example_to_feature(self, examples, max_seq_length, tokenizer,
+                                    is_training):
+        Feature = namedtuple("Feature", [
+            "unique_id", "example_index", "doc_span_index", "tokens",
+            "token_to_orig_map", "token_is_max_context", "token_ids",
+            "position_ids", "text_type_ids", "start_position", "end_position"
+        ])
+        features = []
+        unique_id = 1000000000
+
+        for (example_index, example) in enumerate(examples):
+            query_tokens = tokenizer.tokenize(example.question_text)
+            if len(query_tokens) > self.max_query_length:
+                query_tokens = query_tokens[0:self.max_query_length]
+            tok_to_orig_index = []
+            orig_to_tok_index = []
+            all_doc_tokens = []
+            for (i, token) in enumerate(example.doc_tokens):
+                orig_to_tok_index.append(len(all_doc_tokens))
+                sub_tokens = tokenizer.tokenize(token)
+                for sub_token in sub_tokens:
+                    tok_to_orig_index.append(i)
+                    all_doc_tokens.append(sub_token)
+
+            tok_start_position = None
+            tok_end_position = None
+            if is_training:
+                tok_start_position = orig_to_tok_index[example.start_position]
+                if example.end_position < len(example.doc_tokens) - 1:
+                    tok_end_position = orig_to_tok_index[example.end_position +
+                                                         1] - 1
+                else:
+                    tok_end_position = len(all_doc_tokens) - 1
+                (tok_start_position,
+                 tok_end_position) = self._improve_answer_span(
+                     all_doc_tokens, tok_start_position, tok_end_position,
+                     tokenizer, example.orig_answer_text)
+
+            max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
+            _DocSpan = namedtuple("DocSpan", ["start", "length"])
+            doc_spans = []
+            start_offset = 0
+            while start_offset < len(all_doc_tokens):
+                length = len(all_doc_tokens) - start_offset
+                if length > max_tokens_for_doc:
+                    length = max_tokens_for_doc
+                doc_spans.append(_DocSpan(start=start_offset, length=length))
+                if start_offset + length == len(all_doc_tokens):
+                    break
+                start_offset += min(length, self.doc_stride)
+
+            for (doc_span_index, doc_span) in enumerate(doc_spans):
+                tokens = []
+                token_to_orig_map = {}
+                token_is_max_context = {}
+                text_type_ids = []
+                tokens.append("[CLS]")
+                text_type_ids.append(0)
+                for token in query_tokens:
+                    tokens.append(token)
+                    text_type_ids.append(0)
+                tokens.append("[SEP]")
+                text_type_ids.append(0)
+
+                for i in range(doc_span.length):
+                    split_token_index = doc_span.start + i
+                    token_to_orig_map[len(tokens)] = tok_to_orig_index[
+                        split_token_index]
+
+                    is_max_context = self._check_is_max_context(
+                        doc_spans, doc_span_index, split_token_index)
+                    token_is_max_context[len(tokens)] = is_max_context
+                    tokens.append(all_doc_tokens[split_token_index])
+                    text_type_ids.append(1)
+                tokens.append("[SEP]")
+                text_type_ids.append(1)
+
+                token_ids = tokenizer.convert_tokens_to_ids(tokens)
+                position_ids = list(range(len(token_ids)))
+                start_position = None
+                end_position = None
+                if is_training:
+                    doc_start = doc_span.start
+                    doc_end = doc_span.start + doc_span.length - 1
+                    out_of_span = False
+                    if not (tok_start_position >= doc_start and
+                            tok_end_position <= doc_end):
+                        out_of_span = True
+                    if out_of_span:
+                        start_position = 0
+                        end_position = 0
+                    else:
+                        doc_offset = len(query_tokens) + 2
+                        start_position = tok_start_position - doc_start + doc_offset
+                        end_position = tok_end_position - doc_start + doc_offset
+
+                feature = Feature(
+                    unique_id=unique_id,
+                    example_index=example_index,
+                    doc_span_index=doc_span_index,
+                    tokens=tokens,
+                    token_to_orig_map=token_to_orig_map,
+                    token_is_max_context=token_is_max_context,
+                    token_ids=token_ids,
+                    position_ids=position_ids,
+                    text_type_ids=text_type_ids,
+                    start_position=start_position,
+                    end_position=end_position)
+                features.append(feature)
+
+                unique_id += 1
+
+        return features
+
+    def _prepare_batch_data(self, records, batch_size, phase=None):
+        """generate batch records"""
+        batch_records, max_len = [], 0
+
+        for index, record in enumerate(records):
+            if phase == "train":
+                self.current_example = index
+            max_len = max(max_len, len(record.token_ids))
+            if self.in_tokens:
+                to_append = (len(batch_records) + 1) * max_len <= batch_size
+            else:
+                to_append = len(batch_records) < batch_size
+            if to_append:
+                batch_records.append(record)
+            else:
+                yield self._pad_batch_records(batch_records, phase == "train")
+                batch_records, max_len = [record], len(record.token_ids)
+
+        if batch_records:
+            yield self._pad_batch_records(batch_records, phase == "train")
+
+    def _pad_batch_records(self, batch_records, is_training):
+        batch_token_ids = [record.token_ids for record in batch_records]
+        batch_text_type_ids = [record.text_type_ids for record in batch_records]
+        batch_position_ids = [record.position_ids for record in batch_records]
+        if is_training:
+            batch_start_position = [
+                record.start_position for record in batch_records
+            ]
+            batch_end_position = [
+                record.end_position for record in batch_records
+            ]
+            batch_start_position = np.array(batch_start_position).astype(
+                "int64").reshape([-1, 1])
+            batch_end_position = np.array(batch_end_position).astype(
+                "int64").reshape([-1, 1])
+
+        else:
+            batch_size = len(batch_token_ids)
+            batch_start_position = np.zeros(
+                shape=[batch_size, 1], dtype="int64")
+            batch_end_position = np.zeros(shape=[batch_size, 1], dtype="int64")
+
+        batch_unique_ids = [record.unique_id for record in batch_records]
+        batch_unique_ids = np.array(batch_unique_ids).astype("int64").reshape(
+            [-1, 1])
+
+        # padding
+        padded_token_ids, input_mask = pad_batch_data(
+            batch_token_ids, pad_idx=self.pad_id, return_input_mask=True)
+        padded_text_type_ids = pad_batch_data(
+            batch_text_type_ids, pad_idx=self.pad_id)
+        padded_position_ids = pad_batch_data(
+            batch_position_ids, pad_idx=self.pad_id)
+        padded_task_ids = np.ones_like(
+            padded_token_ids, dtype="int64") * self.task_id

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
-            input_mask, seq_lens
+            padded_task_ids, input_mask, batch_start_position,
+            batch_end_position, batch_unique_ids
        ]

        return return_list

+    def get_num_examples(self, phase):
+        return len(self.features[phase])
+
+    def get_features(self, phase):
+        return self.features[phase]
+
+    def get_examples(self, phase):
+        return self.examples[phase]
+
+    def data_generator(self,
+                       input_file,
+                       batch_size,
+                       epoch,
+                       dev_count=1,
+                       shuffle=True,
+                       phase=None):
+
+        examples = self.examples.get(phase, None)
+        features = self.features.get(phase, None)
+        if not examples:
+            examples = self._read_json(input_file, phase == "train")
+            features = self._convert_example_to_feature(
+                examples, self.max_seq_len, self.tokenizer, phase == "train")
+            self.examples[phase] = examples
+            self.features[phase] = features
+
+        def wrapper():
+            all_dev_batches = []
+            for epoch_index in range(epoch):
+                if phase == "train":
+                    self.current_example = 0
+                    self.current_epoch = epoch_index
+                if phase == "train" and shuffle:
+                    np.random.shuffle(features)
+
+                for batch_data in self._prepare_batch_data(
+                        features, batch_size, phase=phase):
+                    if len(all_dev_batches) < dev_count:
+                        all_dev_batches.append(batch_data)
+                    if len(all_dev_batches) == dev_count:
+                        for batch in all_dev_batches:
+                            yield batch
+                        all_dev_batches = []
+
+        return wrapper
+

 if __name__ == '__main__':
    pass
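
Review note: the sliding-window logic in `_convert_example_to_feature` and `_check_is_max_context` above is easier to follow in isolation. The following is a minimal sketch, not the code from this commit. The helper names `make_doc_spans` and `is_max_context` and the toy sizes are assumptions chosen for illustration; the scoring rule mirrors the one in the diff.

```python
from collections import namedtuple

DocSpan = namedtuple("DocSpan", ["start", "length"])


def make_doc_spans(num_tokens, max_tokens_for_doc, doc_stride):
    """Split a document of num_tokens tokens into overlapping windows."""
    spans = []
    start = 0
    while start < num_tokens:
        length = min(num_tokens - start, max_tokens_for_doc)
        spans.append(DocSpan(start=start, length=length))
        if start + length == num_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans


def is_max_context(spans, cur_span_index, position):
    """True if spans[cur_span_index] gives `position` the most context."""
    best_score, best_index = None, None
    for index, span in enumerate(spans):
        end = span.start + span.length - 1
        if position < span.start or position > end:
            continue  # this window does not contain the token
        left = position - span.start
        right = end - position
        # Prefer windows where the token is far from both edges;
        # the small length term breaks ties in favour of longer windows.
        score = min(left, right) + 0.01 * span.length
        if best_score is None or score > best_score:
            best_score, best_index = score, index
    return cur_span_index == best_index


# Hypothetical toy document: 10 tokens, windows of 6, stride 4.
spans = make_doc_spans(num_tokens=10, max_tokens_for_doc=6, doc_stride=4)
print(spans)
print(is_max_context(spans, 0, 4), is_max_context(spans, 1, 4))
```

Tokens 4 and 5 fall in both windows. Each is credited only to the window in which it sits further from an edge, so overlapping predictions are not counted twice.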
--- a/run_classifier.py
+++ b/run_classifier.py
+#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Finetuning on classification tasks."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import time
+import multiprocessing
+
+# NOTE(paddle-dev): All of these flags should be
+# set before `import paddle`. Otherwise, it would
+# not take any effect.
+os.environ['FLAGS_eager_delete_tensor_gb'] = '0'  # enable gc
+
+import paddle.fluid as fluid
+
+import reader.task_reader as task_reader
+from model.ernie import ErnieConfig
+from finetune.classifier import create_model, evaluate, predict
+from optimization import optimization
+from utils.args import print_arguments, check_cuda
+from utils.init import init_pretraining_params, init_checkpoint
+from utils.cards import get_cards
+from finetune_args import parser
+
+args = parser.parse_args()
+
+
+def main(args):
+    ernie_config = ErnieConfig(args.ernie_config_path)
+    ernie_config.print_config()
+
+    if args.use_cuda:
+        place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
+        dev_count = fluid.core.get_cuda_device_count()
+    else:
+        place = fluid.CPUPlace()
+        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
+    exe = fluid.Executor(place)
+
+    reader = task_reader.ClassifyReader(
+        vocab_path=args.vocab_path,
+        label_map_config=args.label_map_config,
+        max_seq_len=args.max_seq_len,
+        do_lower_case=args.do_lower_case,
+        in_tokens=args.in_tokens,
+        random_seed=args.random_seed,
+        tokenizer=args.tokenizer,
+        is_classify=args.is_classify,
+        is_regression=args.is_regression,
+        for_cn=args.for_cn,
+        task_id=args.task_id)
+
+    if not (args.do_train or args.do_val or args.do_test):
+        raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
+                         "least one of them must be True.")
+
+    if args.do_test:
+        assert args.test_save is not None
+    startup_prog = fluid.Program()
+    if args.random_seed is not None:
+        startup_prog.random_seed = args.random_seed
+
+    if args.predict_batch_size is None:
+        args.predict_batch_size = args.batch_size
+    if args.do_train:
+        train_data_generator = reader.data_generator(
+            input_file=args.train_set,
+            batch_size=args.batch_size,
+            epoch=args.epoch,
+            dev_count=dev_count,
+            shuffle=True,
+            phase="train")
+
+        num_train_examples = reader.get_num_examples(args.train_set)
+
+        if args.in_tokens:
+            max_train_steps = args.epoch * num_train_examples // (
+                args.batch_size // args.max_seq_len) // dev_count
+        else:
+            max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
+
+        warmup_steps = int(max_train_steps * args.warmup_proportion)
+        print("Device count: %d" % dev_count)
+        print("Num train examples: %d" % num_train_examples)
+        print("Max train steps: %d" % max_train_steps)
+        print("Num warmup steps: %d" % warmup_steps)
+
+        train_program = fluid.Program()
+        if args.random_seed is not None and args.enable_ce:
+            train_program.random_seed = args.random_seed
+
+        with fluid.program_guard(train_program, startup_prog):
+            with fluid.unique_name.guard():
+                train_pyreader, graph_vars = create_model(
+                    args,
+                    pyreader_name='train_reader',
+                    ernie_config=ernie_config,
+                    is_classify=args.is_classify,
+                    is_regression=args.is_regression)
+                scheduled_lr, loss_scaling = optimization(
+                    loss=graph_vars["loss"],
+                    warmup_steps=warmup_steps,
+                    num_train_steps=max_train_steps,
+                    learning_rate=args.learning_rate,
+                    train_program=train_program,
+                    startup_prog=startup_prog,
+                    weight_decay=args.weight_decay,
+                    scheduler=args.lr_scheduler,
+                    use_fp16=args.use_fp16)
+
+        if args.verbose:
+            if args.in_tokens:
+                lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
+                    program=train_program,
+                    batch_size=args.batch_size // args.max_seq_len)
+            else:
+                lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
+                    program=train_program, batch_size=args.batch_size)
+            print("Theoretical memory usage in training: %.3f - %.3f %s" %
+                  (lower_mem, upper_mem, unit))
+
+    if args.do_val or args.do_test:
+        test_prog = fluid.Program()
+        with fluid.program_guard(test_prog, startup_prog):
+            with fluid.unique_name.guard():
+                test_pyreader, graph_vars = create_model(
+                    args,
+                    pyreader_name='test_reader',
+                    ernie_config=ernie_config,
+                    is_classify=args.is_classify,
+                    is_regression=args.is_regression)
+
+        test_prog = test_prog.clone(for_test=True)
+    nccl2_num_trainers = 1
+    nccl2_trainer_id = 0
+    exe.run(startup_prog)
+
+    if args.do_train:
+        if args.init_checkpoint and args.init_pretraining_params:
+            print(
+                "WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
+                "both are set! Only arg 'init_checkpoint' is made valid.")
+        if args.init_checkpoint:
+            init_checkpoint(
+                exe,
+                args.init_checkpoint,
+                main_program=startup_prog,
+                use_fp16=args.use_fp16)
+        elif args.init_pretraining_params:
+            init_pretraining_params(
+                exe,
+                args.init_pretraining_params,
+                main_program=startup_prog,
+                use_fp16=args.use_fp16)
+    elif args.do_val or args.do_test:
+        if not args.init_checkpoint:
+            raise ValueError("args 'init_checkpoint' should be set if "
+                             "only doing validation or testing!")
+        init_checkpoint(
+            exe,
+            args.init_checkpoint,
+            main_program=startup_prog,
+            use_fp16=args.use_fp16)
+
+    if args.do_train:
+        exec_strategy = fluid.ExecutionStrategy()
+        if args.use_fast_executor:
+            exec_strategy.use_experimental_executor = True
+        exec_strategy.num_threads = dev_count
+        exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
+
+        train_exe = fluid.ParallelExecutor(
+            use_cuda=args.use_cuda,
+            loss_name=graph_vars["loss"].name,
+            exec_strategy=exec_strategy,
+            main_program=train_program,
+            num_trainers=nccl2_num_trainers,
+            trainer_id=nccl2_trainer_id)
+
+        train_pyreader.decorate_tensor_provider(train_data_generator)
+    else:
+        train_exe = None
+
+    test_exe = exe
+    if args.do_val or args.do_test:
+        if args.use_multi_gpu_test:
+            test_exe = fluid.ParallelExecutor(
+                use_cuda=args.use_cuda,
+                main_program=test_prog,
+                share_vars_from=train_exe)
+
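+    # Defaults so the final evaluation below also works when `do_train` is off.
+    steps = 0
+    current_epoch = 0
+
+    if args.do_train: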
+        train_pyreader.start()
+        steps = 0
+        if warmup_steps > 0:
+            graph_vars["learning_rate"] = scheduled_lr
+
+        ce_info = []
+        time_begin = time.time()
+        last_epoch = 0
+        current_epoch = 0
+        while True:
+            try:
+                steps += 1
+                if steps % args.skip_steps != 0:
+                    train_exe.run(fetch_list=[])
+                else:
+                    outputs = evaluate(
+                        train_exe,
+                        train_program,
+                        train_pyreader,
+                        graph_vars,
+                        "train",
+                        metric=args.metric,
+                        is_classify=args.is_classify,
+                        is_regression=args.is_regression)
+
+                    if args.verbose:
+                        verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size(
+                        )
+                        verbose += "learning rate: %f" % (
+                            outputs["learning_rate"]
+                            if warmup_steps > 0 else args.learning_rate)
+                        print(verbose)
+
+                    current_example, current_epoch = reader.get_train_progress()
+                    time_end = time.time()
+                    used_time = time_end - time_begin
+
+                    if args.is_classify:
+                        print(
+                            "epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
+                            "ave acc: %f, speed: %f steps/s" %
+                            (current_epoch, current_example, num_train_examples,
+                             steps, outputs["loss"], outputs["accuracy"],
+                             args.skip_steps / used_time))
+                        ce_info.append(
+                            [outputs["loss"], outputs["accuracy"], used_time])
+                    if args.is_regression:
+                        print(
+                            "epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
+                            " speed: %f steps/s" %
+                            (current_epoch, current_example, num_train_examples,
+                             steps, outputs["loss"],
+                             args.skip_steps / used_time))
+                    time_begin = time.time()
+
+                if steps % args.save_steps == 0:
+                    save_path = os.path.join(args.checkpoints,
+                                             "step_" + str(steps))
+                    fluid.io.save_persistables(exe, save_path, train_program)
+
+                if steps % args.validation_steps == 0 or last_epoch != current_epoch:
+                    # evaluate dev set
+                    if args.do_val:
+                        evaluate_wrapper(args, reader, exe, test_prog,
+                                         test_pyreader, graph_vars,
+                                         current_epoch, steps)
+
+                    if args.do_test:
+                        predict_wrapper(args, reader, exe, test_prog,
+                                        test_pyreader, graph_vars,
+                                        current_epoch, steps)
+
+                if last_epoch != current_epoch:
+                    last_epoch = current_epoch
+
+            except fluid.core.EOFException:
+                save_path = os.path.join(args.checkpoints, "step_" + str(steps))
+                fluid.io.save_persistables(exe, save_path, train_program)
+                train_pyreader.reset()
+                break
+        if args.enable_ce:
+            card_num = get_cards()
+            ce_loss = 0
+            ce_acc = 0
+            ce_time = 0
+            try:
+                ce_loss = ce_info[-2][0]
+                ce_acc = ce_info[-2][1]
+                ce_time = ce_info[-2][2]
+            except IndexError:
+                print("ce info error")
+            print("kpis\ttrain_duration_card%s\t%s" % (card_num, ce_time))
+            print("kpis\ttrain_loss_card%s\t%f" % (card_num, ce_loss))
+            print("kpis\ttrain_acc_card%s\t%f" % (card_num, ce_acc))
+
+    # final eval on dev set
+    if args.do_val:
+        evaluate_wrapper(args, reader, exe, test_prog, test_pyreader,
+                         graph_vars, current_epoch, steps)
+
+    # final eval on test set
+    if args.do_test:
+        predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
+                        current_epoch, steps)
+
+    # final eval on diagnostic, hack for glue-ax
+    if args.diagnostic:
+        test_pyreader.decorate_tensor_provider(
+            reader.data_generator(
+                args.diagnostic,
+                batch_size=args.batch_size,
+                epoch=1,
+                dev_count=1,
+                shuffle=False))
+
+        print("Final diagnostic")
+        qids, preds, probs = predict(
+            test_exe,
+            test_prog,
+            test_pyreader,
+            graph_vars,
+            is_classify=args.is_classify,
+            is_regression=args.is_regression)
+        assert len(qids) == len(preds), '{} v.s. {}'.format(
+            len(qids), len(preds))
+        with open(args.diagnostic_save, 'w') as f:
+            for qid, s, p in zip(qids, preds, probs):
+                f.write('{}\t{}\t{}\n'.format(qid, s, p))
+
+        print("Done final diagnostic, saving to {}".format(
+            args.diagnostic_save))
+
+
+def evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
+                     epoch, steps):
+    # evaluate dev set
+    for ds in args.dev_set.split(','):
+        test_pyreader.decorate_tensor_provider(
+            reader.data_generator(
+                ds,
+                batch_size=args.predict_batch_size,
+                epoch=1,
+                dev_count=1,
+                shuffle=False))
+        print("validation result of dataset {}:".format(ds))
+        evaluate_info = evaluate(
+            exe,
+            test_prog,
+            test_pyreader,
+            graph_vars,
+            "dev",
+            metric=args.metric,
+            is_classify=args.is_classify,
+            is_regression=args.is_regression)
+        print(evaluate_info + ', file: {}, epoch: {}, steps: {}'.format(
+            ds, epoch, steps))
+
+
+def predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
+                    epoch, steps):
+    test_sets = args.test_set.split(',')
+    save_dirs = args.test_save.split(',')
+    assert len(test_sets) == len(save_dirs)
+
+    for test_f, save_f in zip(test_sets, save_dirs):
+        test_pyreader.decorate_tensor_provider(
+            reader.data_generator(
+                test_f,
+                batch_size=args.predict_batch_size,
+                epoch=1,
+                dev_count=1,
+                shuffle=False))
+
+        save_path = save_f + '.' + str(epoch) + '.' + str(steps)
+        print("testing {}, save to {}".format(test_f, save_path))
+        qids, preds, probs = predict(
+            exe,
+            test_prog,
+            test_pyreader,
+            graph_vars,
+            is_classify=args.is_classify,
+            is_regression=args.is_regression)
+
+        save_dir = os.path.dirname(save_path)
+        if save_dir and not os.path.exists(save_dir):
+            os.makedirs(save_dir)
+
+        with open(save_path, 'w') as f:
+            for qid, s, p in zip(qids, preds, probs):
+                f.write('{}\t{}\t{}\n'.format(qid, s, p))
+
+
+if __name__ == '__main__':
+    print_arguments(args)
+    check_cuda(args.use_cuda)
+    main(args)
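
Review note: the step arithmetic in `main` depends on whether `batch_size` counts examples or tokens (`in_tokens`). The sketch below restates that calculation as a standalone function so the two modes can be compared. The function name `compute_train_steps` and the sample numbers are assumptions for illustration only; the formula follows the diff above.

```python
def compute_train_steps(num_examples, epoch, batch_size, dev_count,
                        max_seq_len, in_tokens, warmup_proportion):
    """Return (max_train_steps, warmup_steps) as computed in main()."""
    if in_tokens:
        # batch_size is a token budget, so convert it to a sequence count.
        examples_per_batch = batch_size // max_seq_len
    else:
        examples_per_batch = batch_size
    max_train_steps = epoch * num_examples // examples_per_batch // dev_count
    warmup_steps = int(max_train_steps * warmup_proportion)
    return max_train_steps, warmup_steps


# Hypothetical settings: 10000 examples, 3 epochs, 4 devices.
by_examples = compute_train_steps(
    num_examples=10000, epoch=3, batch_size=32, dev_count=4,
    max_seq_len=128, in_tokens=False, warmup_proportion=0.1)
by_tokens = compute_train_steps(
    num_examples=10000, epoch=3, batch_size=8192, dev_count=4,
    max_seq_len=128, in_tokens=True, warmup_proportion=0.1)
print(by_examples, by_tokens)
```

With `in_tokens` enabled, a `batch_size` smaller than `max_seq_len` would make the divisor zero, so the token budget should be at least one full sequence.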
--- a/ERNIE/run_classifier.py
+++ b/ERNIE/run_classifier.py
@@ -21,15 +21,19 @@ import os
 import time
 import multiprocessing

+# NOTE(paddle-dev): All of these flags should be
+# set before `import paddle`. Otherwise, it would
+# not take any effect.
+os.environ['FLAGS_eager_delete_tensor_gb'] = '0'  # enable gc
+
 import paddle.fluid as fluid

 import reader.task_reader as task_reader
 from model.ernie import ErnieConfig
-from finetune.classifier import create_model, evaluate
+from finetune.mrc import create_model, evaluate
 from optimization import optimization
-from utils.args import print_arguments, check_cuda
+from utils.args import print_arguments
 from utils.init import init_pretraining_params, init_checkpoint
-from utils.cards import get_cards
 from finetune_args import parser

 args = parser.parse_args()
@@ -47,13 +51,20 @@ def main(args):
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

-    reader = task_reader.ClassifyReader(
+    reader = task_reader.MRCReader(
        vocab_path=args.vocab_path,
        label_map_config=args.label_map_config,
        max_seq_len=args.max_seq_len,
        do_lower_case=args.do_lower_case,
        in_tokens=args.in_tokens,
-        random_seed=args.random_seed)
+        random_seed=args.random_seed,
+        tokenizer=args.tokenizer,
+        is_classify=args.is_classify,
+        is_regression=args.is_regression,
+        for_cn=args.for_cn,
+        task_id=args.task_id,
+        doc_stride=args.doc_stride,
+        max_query_length=args.max_query_length)

    if not (args.do_train or args.do_val or args.do_test):
        raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
@@ -63,15 +74,18 @@ def main(args):
    if args.random_seed is not None:
        startup_prog.random_seed = args.random_seed

+    if args.predict_batch_size is None:
+        args.predict_batch_size = args.batch_size
    if args.do_train:
        train_data_generator = reader.data_generator(
            input_file=args.train_set,
            batch_size=args.batch_size,
            epoch=args.epoch,
-            shuffle=args.shuffle,
+            dev_count=dev_count,
+            shuffle=True,
            phase="train")

-        num_train_examples = reader.get_num_examples(args.train_set)
+        num_train_examples = reader.get_num_examples("train")

        if args.in_tokens:
            max_train_steps = args.epoch * num_train_examples // (
@@ -86,16 +100,15 @@ def main(args):
        print("Num warmup steps: %d" % warmup_steps)

        train_program = fluid.Program()
-        if args.random_seed is not None and args.enable_ce:
-            train_program.random_seed = args.random_seed

        with fluid.program_guard(train_program, startup_prog):
            with fluid.unique_name.guard():
                train_pyreader, graph_vars = create_model(
                    args,
                    pyreader_name='train_reader',
-                    ernie_config=ernie_config)
-                scheduled_lr = optimization(
+                    ernie_config=ernie_config,
+                    is_training=True)
+                scheduled_lr, loss_scaling = optimization(
                    loss=graph_vars["loss"],
                    warmup_steps=warmup_steps,
                    num_train_steps=max_train_steps,
@@ -104,17 +117,15 @@ def main(args):
                    startup_prog=startup_prog,
                    weight_decay=args.weight_decay,
                    scheduler=args.lr_scheduler,
-                    use_fp16=args.use_fp16,
-                    loss_scaling=args.loss_scaling)
-
+                    use_fp16=args.use_fp16)
+                """
                fluid.memory_optimize(
                    input_program=train_program,
                    skip_opt_set=[
                        graph_vars["loss"].name,
-                        graph_vars["probs"].name,
-                        graph_vars["accuracy"].name,
                        graph_vars["num_seqs"].name,
                    ])
+                """

        if args.verbose:
            if args.in_tokens:
@@ -131,13 +142,16 @@ def main(args):
        test_prog = fluid.Program()
        with fluid.program_guard(test_prog, startup_prog):
            with fluid.unique_name.guard():
-                test_pyreader, graph_vars = create_model(
+                test_pyreader, test_graph_vars = create_model(
                    args,
                    pyreader_name='test_reader',
-                    ernie_config=ernie_config)
+                    ernie_config=ernie_config,
+                    is_training=False)

        test_prog = test_prog.clone(for_test=True)

+    nccl2_num_trainers = 1
+    nccl2_trainer_id = 0
    exe.run(startup_prog)

    if args.do_train:
@@ -178,7 +192,9 @@ def main(args):
            use_cuda=args.use_cuda,
            loss_name=graph_vars["loss"].name,
            exec_strategy=exec_strategy,
-            main_program=train_program)
+            main_program=train_program,
+            num_trainers=nccl2_num_trainers,
+            trainer_id=nccl2_trainer_id)

        train_pyreader.decorate_tensor_provider(train_data_generator)
    else:
@@ -190,7 +206,6 @@ def main(args):
        if warmup_steps > 0:
            graph_vars["learning_rate"] = scheduled_lr

-        ce_info = []
        time_begin = time.time()
        while True:
            try:
@@ -213,11 +228,9 @@ def main(args):
                    time_end = time.time()
                    used_time = time_end - time_begin
                    print("epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
-                          "ave acc: %f, speed: %f steps/s" %
+                          "speed: %f steps/s" %
                          (current_epoch, current_example, num_train_examples,
-                           steps, outputs["loss"], outputs["accuracy"],
-                           args.skip_steps / used_time))
-                    ce_info.append([outputs["loss"], outputs["accuracy"], used_time])
+                           steps, outputs["loss"], args.skip_steps / used_time))
                    time_begin = time.time()

                if steps % args.save_steps == 0:
@@ -226,74 +239,95 @@ def main(args):
                    fluid.io.save_persistables(exe, save_path, train_program)

                if steps % args.validation_steps == 0:
-                    # evaluate dev set
                    if args.do_val:
                        test_pyreader.decorate_tensor_provider(
                            reader.data_generator(
                                args.dev_set,
                                batch_size=args.batch_size,
                                epoch=1,
-                                shuffle=False))
-                        evaluate(exe, test_prog, test_pyreader, graph_vars,
-                                 "dev")
-                    # evaluate test set
+                                dev_count=1,
+                                shuffle=False,
+                                phase="dev"))
+                        evaluate(
+                            exe,
+                            test_prog,
+                            test_pyreader,
+                            test_graph_vars,
+                            str(steps) + "_dev",
+                            examples=reader.get_examples("dev"),
+                            features=reader.get_features("dev"),
+                            args=args)
+
                    if args.do_test:
                        test_pyreader.decorate_tensor_provider(
                            reader.data_generator(
                                args.test_set,
                                batch_size=args.batch_size,
                                epoch=1,
-                                shuffle=False))
-                        evaluate(exe, test_prog, test_pyreader, graph_vars,
-                                 "test")
+                                dev_count=1,
+                                shuffle=False,
+                                phase="test"))
+                        evaluate(
+                            exe,
+                            test_prog,
+                            test_pyreader,
+                            test_graph_vars,
+                            str(steps) + "_test",
+                            examples=reader.get_examples("test"),
+                            features=reader.get_features("test"),
+                            args=args)
+
            except fluid.core.EOFException:
                save_path = os.path.join(args.checkpoints, "step_" + str(steps))
                fluid.io.save_persistables(exe, save_path, train_program)
                train_pyreader.reset()
                break
-        if args.enable_ce:
-            card_num = get_cards()
-            ce_loss = 0
-            ce_acc = 0
-            ce_time = 0
-            try:
-                ce_loss = ce_info[-2][0]
-                ce_acc = ce_info[-2][1]
-                ce_time = ce_info[-2][2]
-            except:
-                print("ce info error")
-            print("kpis\ttrain_duration_card%s\t%s" %
-                (card_num, ce_time))
-            print("kpis\ttrain_loss_card%s\t%f" %
-                (card_num, ce_loss))
-            print("kpis\ttrain_acc_card%s\t%f" %
-                (card_num, ce_acc))
-

    # final eval on dev set
    if args.do_val:
+        print("Final validation result:")
        test_pyreader.decorate_tensor_provider(
            reader.data_generator(
                args.dev_set,
                batch_size=args.batch_size,
                epoch=1,
-                shuffle=False))
-        print("Final validation result:")
-        evaluate(exe, test_prog, test_pyreader, graph_vars, "dev")
+                dev_count=1,
+                shuffle=False,
+                phase="dev"))
+        evaluate(
+            exe,
+            test_prog,
+            test_pyreader,
+            test_graph_vars,
+            "dev",
+            examples=reader.get_examples("dev"),
+            features=reader.get_features("dev"),
+            args=args)

    # final eval on test set
    if args.do_test:
+        print("Final test result:")
        test_pyreader.decorate_tensor_provider(
            reader.data_generator(
                args.test_set,
                batch_size=args.batch_size,
                epoch=1,
-                shuffle=False))
-        print("Final test result:")
-        evaluate(exe, test_prog, test_pyreader, graph_vars, "test")
+                dev_count=1,
+                shuffle=False,
+                phase="test"))
+        evaluate(
+            exe,
+            test_prog,
+            test_pyreader,
+            test_graph_vars,
+            "test",
+            examples=reader.get_examples("test"),
+            features=reader.get_features("test"),
+            args=args)


 if __name__ == '__main__':
-    print_arguments(args)
-    check_cuda(args.use_cuda)
-    main(args)
+    # Run main() once inside a fresh scope.
+    scope = fluid.core.Scope()
+    with fluid.scope_guard(scope):
+        main(args)
--- a/ERNIE/run_sequence_labeling.py
+++ b/ERNIE/run_sequence_labeling.py
@@ -21,6 +21,11 @@ import os
 import time
 import multiprocessing

+# NOTE(paddle-dev): All of these flags must be
+# set before `import paddle`. Otherwise, they
+# will not take effect.
+os.environ['FLAGS_eager_delete_tensor_gb'] = '0'  # enable gc
+
 import paddle.fluid as fluid

 import reader.task_reader as task_reader
@@ -52,7 +57,8 @@ def main(args):
        max_seq_len=args.max_seq_len,
        do_lower_case=args.do_lower_case,
        in_tokens=args.in_tokens,
-        random_seed=args.random_seed)
+        random_seed=args.random_seed,
+        task_id=args.task_id)

    if not (args.do_train or args.do_val or args.do_test):
        raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
@@ -92,7 +98,7 @@ def main(args):
                    args,
                    pyreader_name='train_reader',
                    ernie_config=ernie_config)
-                scheduled_lr = optimization(
+                scheduled_lr, loss_scaling = optimization(
                    loss=graph_vars["loss"],
                    warmup_steps=warmup_steps,
                    num_train_steps=max_train_steps,
@@ -101,8 +107,7 @@ def main(args):
                    startup_prog=startup_prog,
                    weight_decay=args.weight_decay,
                    scheduler=args.lr_scheduler,
-                    use_fp16=args.use_fp16,
-                    loss_scaling=args.loss_scaling)
+                    use_fp16=args.use_fp16)

                fluid.memory_optimize(
                    input_program=train_program,

--- a/script/en_glue/ernie_base/CoLA/task.sh
+++ b/script/en_glue/ernie_base/CoLA/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_sync_nccl_allreduce=1
+export FLAGS_eager_delete_tensor_gb=0.0
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0
+fi
+
+
+mkdir -p log/
+
+timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+lr=3e-5
+batch_size=64
+epoch=3
+
+for i in {1..5};do
+python -u run_classifier.py                                                          \
+       --use_cuda true                                                               \
+       --for_cn  False                                                               \
+       --use_fast_executor ${e_executor:-"true"}                                     \
+       --tokenizer ${TOKENIZER:-"FullTokenizer"}                                     \
+       --use_fp16 ${USE_FP16:-"false"}                                               \
+       --do_train true                                                               \
+       --do_val true                                                                 \
+       --do_test true                                                                \
+       --batch_size $batch_size                                                      \
+       --init_pretraining_params ${MODEL_PATH}/params                                \
+       --verbose true                                                                \
+       --train_set ${TASK_DATA_PATH}/CoLA/train.tsv                                  \
+       --dev_set   ${TASK_DATA_PATH}/CoLA/dev.tsv                                    \
+       --test_set  ${TASK_DATA_PATH}/CoLA/test.tsv                                   \
+       --vocab_path script/en_glue/ernie_base/vocab.txt                              \
+       --checkpoints ./checkpoints                                                   \
+       --save_steps 1000                                                             \
+       --weight_decay  0.0                                                           \
+       --warmup_proportion 0.1                                                       \
+       --validation_steps 1000000000                                                 \
+       --epoch $epoch                                                                \
+       --max_seq_len 128                                                             \
+       --ernie_config_path script/en_glue/ernie_base/ernie_config.json               \
+       --learning_rate $lr                                                           \
+       --skip_steps 10                                                               \
+       --num_iteration_per_drop_scope 1                                              \
+       --num_labels 2                                                                \
+       --metric 'matthews_corrcoef'                                                  \
+       --test_save output/test_out.$i.$lr.$batch_size.$epoch.$timestamp.tsv          \
+       --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.$timestamp.log  \
+
+
+done
--- a/script/en_glue/ernie_base/MNLI/task.sh
+++ b/script/en_glue/ernie_base/MNLI/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+mkdir -p log/
+
+lr=3e-5
+batch_size=64
+epoch=3
+
+for i in {1..5};do
+
+timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+python -u run_classifier.py                                                             \
+       --use_cuda true                                                                  \
+       --use_fast_executor ${e_executor:-"true"}                                        \
+       --tokenizer ${TOKENIZER:-"FullTokenizer"}                                        \
+       --use_fp16 ${USE_FP16:-"false"}                                                  \
+       --do_train true                                                                  \
+       --do_val true                                                                    \
+       --do_test true                                                                   \
+       --batch_size $batch_size                                                         \
+       --init_pretraining_params ${MODEL_PATH}/params                                   \
+       --verbose true                                                                   \
+       --train_set ${TASK_DATA_PATH}/MNLI/train.tsv                                     \
+       --dev_set   ${TASK_DATA_PATH}/MNLI/m/dev.tsv,${TASK_DATA_PATH}/MNLI/mm/dev.tsv   \
+       --test_set  ${TASK_DATA_PATH}/MNLI/m/test.tsv,${TASK_DATA_PATH}/MNLI/mm/test.tsv \
+       --vocab_path script/en_glue/ernie_base/vocab.txt                                 \
+       --checkpoints ./checkpoints                                                      \
+       --save_steps 25000                                                               \
+       --weight_decay 0.0                                                               \
+       --warmup_proportion 0.1                                                          \
+       --validation_steps 1000000000000                                                 \
+       --epoch $epoch                                                                   \
+       --max_seq_len 128                                                                \
+       --ernie_config_path script/en_glue/ernie_base/ernie_config.json                  \
+       --learning_rate $lr                                                              \
+       --skip_steps 10                                                                  \
+       --num_iteration_per_drop_scope 1                                                 \
+       --num_labels 3                                                                   \
+       --for_cn False                                                                   \
+       --test_save output/test_out.$i.m.tsv,output/test_out.$i.mm.tsv                   \
+       --diagnostic ${TASK_DATA_PATH}/diagnostic.tsv                                    \
+       --diagnostic_save output/test_out.$i.$lr.$batch_size.$epoch.$timestamp.m.diagnostic.tsv \
+       --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.$timestamp.log            \
+
+done
--- a/script/en_glue/ernie_base/MRPC/task.sh
+++ b/script/en_glue/ernie_base/MRPC/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1
+fi
+
+mkdir -p log/
+
+lr=3e-5
+batch_size=16
+epoch=4
+
+for i in {1..5};do
+
+    timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+    python -u run_classifier.py                                              \
+           --use_cuda true                                                   \
+           --for_cn  False                                                   \
+           --use_fast_executor ${e_executor:-"true"}                         \
+           --tokenizer ${TOKENIZER:-"FullTokenizer"}                         \
+           --use_fp16 ${USE_FP16:-"false"}                                   \
+           --do_train true                                                   \
+           --do_val true                                                     \
+           --do_test true                                                    \
+           --batch_size 16                                                   \
+           --init_pretraining_params ${MODEL_PATH}/params                    \
+           --verbose true                                                    \
+           --train_set ${TASK_DATA_PATH}/MRPC/train.tsv                      \
+           --dev_set   ${TASK_DATA_PATH}/MRPC/dev.tsv                        \
+           --test_set  ${TASK_DATA_PATH}/MRPC/test.tsv                       \
+           --vocab_path script/en_glue/ernie_base/vocab.txt                  \
+           --checkpoints ./checkpoints                                       \
+           --save_steps 1000                                                 \
+           --weight_decay  0.0                                               \
+           --warmup_proportion 0.1                                           \
+           --validation_steps 1000000                                        \
+           --epoch 4                                                         \
+           --max_seq_len 128                                                 \
+           --ernie_config_path script/en_glue/ernie_base/ernie_config.json   \
+           --learning_rate 3e-5                                              \
+           --skip_steps 10                                                   \
+           --num_iteration_per_drop_scope 1                                  \
+           --num_labels 2                                                    \
+           --metric 'acc_and_f1'                                             \
+           --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv         \
+           --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.log \
+
+done
--- a/script/en_glue/ernie_base/QNLI/task.sh
+++ b/script/en_glue/ernie_base/QNLI/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+
+mkdir -p log/
+
+lr=2e-5
+batch_size=32
+epoch=4
+
+for i in {1..5};do
+
+    timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+    python -u run_classifier.py                                                \
+           --use_cuda true                                                     \
+           --for_cn False                                                      \
+           --use_fast_executor ${e_executor:-"true"}                           \
+           --tokenizer ${TOKENIZER:-"FullTokenizer"}                           \
+           --use_fp16 ${USE_FP16:-"false"}                                     \
+           --do_train true                                                     \
+           --do_val true                                                       \
+           --do_test true                                                      \
+           --batch_size $batch_size                                            \
+           --init_pretraining_params ${MODEL_PATH}/params                      \
+           --verbose true                                                      \
+           --train_set ${TASK_DATA_PATH}/QNLI/train.tsv                        \
+           --dev_set   ${TASK_DATA_PATH}/QNLI/dev.tsv                          \
+           --test_set  ${TASK_DATA_PATH}/QNLI/test.tsv                         \
+           --vocab_path script/en_glue/ernie_base/vocab.txt                    \
+           --checkpoints ./checkpoints                                         \
+           --save_steps 30000                                                  \
+           --weight_decay  0.0                                                 \
+           --warmup_proportion 0.1                                             \
+           --validation_steps 1000000000                                       \
+           --epoch $epoch                                                      \
+           --max_seq_len 128                                                   \
+           --ernie_config_path script/en_glue/ernie_base/ernie_config.json     \
+           --learning_rate $lr                                                 \
+           --skip_steps 10                                                     \
+           --num_iteration_per_drop_scope 1                                    \
+           --num_labels 2                                                      \
+           --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv           \
+           --random_seed  1 2>&1 | tee log/job.$i.$lr.$batch_size.$epoch.log   \
+
+done
+
--- a/script/en_glue/ernie_base/QQP/task.sh
+++ b/script/en_glue/ernie_base/QQP/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+
+mkdir -p log/
+
+lr=3e-5
+batch_size=32
+epoch=3
+
+for i in {1..1};do
+
+  timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+  python -u run_classifier.py                                                      \
+       --for_cn False                                                              \
+       --ernie_config_path script/en_glue/ernie_base/ernie_config.json             \
+       --validation_steps 1000000000000                                            \
+       --use_cuda true                                                             \
+       --use_fast_executor ${e_executor:-"true"}                                   \
+       --tokenizer ${TOKENIZER:-"FullTokenizer"}                                   \
+       --use_fp16 ${USE_FP16:-"false"}                                             \
+       --do_train true                                                             \
+       --do_val true                                                               \
+       --do_test true                                                              \
+       --batch_size $batch_size                                                    \
+       --init_pretraining_params ${MODEL_PATH}/params                              \
+       --verbose true                                                              \
+       --train_set ${TASK_DATA_PATH}/QQP/train.tsv                                 \
+       --dev_set   ${TASK_DATA_PATH}/QQP/dev.tsv                                   \
+       --test_set  ${TASK_DATA_PATH}/QQP/test.tsv                                  \
+       --vocab_path script/en_glue/ernie_base/vocab.txt                            \
+       --checkpoints ./checkpoints                                                 \
+       --save_steps 30000                                                          \
+       --weight_decay  0.0                                                         \
+       --warmup_proportion 0.1                                                     \
+       --epoch $epoch                                                              \
+       --max_seq_len 128                                                           \
+       --learning_rate $lr                                                         \
+       --skip_steps 10                                                             \
+       --num_iteration_per_drop_scope 1                                            \
+       --num_labels 2                                                              \
+       --metric 'acc_and_f1'                                                       \
+       --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv                   \
+       --random_seed  1 2>&1 | tee log/job.$i.$lr.$batch_size.$epoch.log           \
+
+done
--- a/script/en_glue/ernie_base/RTE/task.sh
+++ b/script/en_glue/ernie_base/RTE/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0
+fi
+
+mkdir -p log/
+
+for i in {1..5};do
+
+    timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+    python -u run_classifier.py                                                \
+               --use_cuda true                                                 \
+               --for_cn False                                                  \
+               --use_fast_executor ${e_executor:-"true"}                       \
+               --tokenizer ${TOKENIZER:-"FullTokenizer"}                       \
+               --use_fp16 ${USE_FP16:-"false"}                                 \
+               --do_train true                                                 \
+               --do_val true                                                   \
+               --do_test true                                                  \
+               --batch_size 4                                                  \
+               --init_pretraining_params ${MODEL_PATH}/params                  \
+               --verbose true                                                  \
+               --train_set ${TASK_DATA_PATH}/RTE/train.tsv                     \
+               --dev_set   ${TASK_DATA_PATH}/RTE/dev.tsv                       \
+               --test_set  ${TASK_DATA_PATH}/RTE/test.tsv                      \
+               --vocab_path script/en_glue/ernie_base/vocab.txt                \
+               --checkpoints ./checkpoints                                     \
+               --save_steps 1000                                               \
+               --weight_decay  0.0                                             \
+               --warmup_proportion 0.1                                         \
+               --validation_steps 1000000000                                   \
+               --epoch 4                                                       \
+               --max_seq_len 128                                               \
+               --ernie_config_path script/en_glue/ernie_base/ernie_config.json \
+               --learning_rate 2e-5                                            \
+               --skip_steps 10                                                 \
+               --num_iteration_per_drop_scope 1                                \
+               --num_labels 2                                                  \
+               --test_save output/test_out.$i.tsv                              \
+               --random_seed 1 2>&1 | tee  log/job.$i.$timestamp.log           \
+
+done
--- a/script/en_glue/ernie_base/SST-2/task.sh
+++ b/script/en_glue/ernie_base/SST-2/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+mkdir -p log/
+
+lr=2e-5
+batch_size=32
+epoch=4
+
+for i in {1..5};do
+
+ timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+ python -u run_classifier.py                                                       \
+      --for_cn  False                                                              \
+      --use_cuda true                                                              \
+      --use_fast_executor ${e_executor:-"true"}                                    \
+      --tokenizer ${TOKENIZER:-"FullTokenizer"}                                    \
+      --use_fp16 ${USE_FP16:-"false"}                                              \
+      --do_train true                                                              \
+      --do_val true                                                                \
+      --do_test true                                                               \
+      --batch_size $batch_size                                                     \
+      --init_pretraining_params ${MODEL_PATH}/params                               \
+      --verbose true                                                               \
+      --train_set ${TASK_DATA_PATH}/SST-2/train.tsv                                \
+      --dev_set   ${TASK_DATA_PATH}/SST-2/dev.tsv                                  \
+      --test_set  ${TASK_DATA_PATH}/SST-2/test.tsv                                 \
+      --vocab_path script/en_glue/ernie_base/vocab.txt                             \
+      --checkpoints ./checkpoints                                                  \
+      --save_steps 10000                                                           \
+      --weight_decay  0.0                                                          \
+      --warmup_proportion 0.1                                                      \
+      --validation_steps 800000000000                                              \
+      --epoch $epoch                                                               \
+      --max_seq_len 128                                                            \
+      --ernie_config_path script/en_glue/ernie_base/ernie_config.json              \
+      --learning_rate $lr                                                          \
+      --skip_steps 10                                                              \
+      --num_iteration_per_drop_scope 1                                             \
+      --num_labels 2                                                               \
+      --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv                    \
+      --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.$timestamp.log \
+
+done
+
--- a/script/en_glue/ernie_base/STS-B/task.sh
+++ b/script/en_glue/ernie_base/STS-B/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+mkdir -p log/
+
+lr=5e-5
+batch_size=16
+epoch=3
+
+for i in {1..5};do
+
+timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+python -u run_classifier.py                                                         \
+       --use_cuda true                                                              \
+       --for_cn  False                                                              \
+       --use_fast_executor ${e_executor:-"true"}                                    \
+       --tokenizer ${TOKENIZER:-"FullTokenizer"}                                    \
+       --use_fp16 ${USE_FP16:-"false"}                                              \
+       --do_train true                                                              \
+       --do_val true                                                                \
+       --do_test true                                                               \
+       --batch_size $batch_size                                                     \
+       --init_pretraining_params ${MODEL_PATH}/params                               \
+       --verbose true                                                               \
+       --train_set ${TASK_DATA_PATH}/STS-B/train.tsv                                \
+       --dev_set   ${TASK_DATA_PATH}/STS-B/dev.tsv                                  \
+       --test_set  ${TASK_DATA_PATH}/STS-B/test.tsv                                 \
+       --vocab_path script/en_glue/ernie_base/vocab.txt                             \
+       --checkpoints ./checkpoints                                                  \
+       --save_steps 1000                                                            \
+       --weight_decay  0.0                                                          \
+       --warmup_proportion 0.1                                                      \
+       --validation_steps 100000000000                                              \
+       --epoch $epoch                                                               \
+       --max_seq_len 128                                                            \
+       --ernie_config_path script/en_glue/ernie_base/ernie_config.json              \
+       --learning_rate $lr                                                          \
+       --skip_steps 10                                                              \
+       --num_iteration_per_drop_scope 1                                             \
+       --num_labels 1                                                               \
+       --is_classify false                                                          \
+       --is_regression true                                                         \
+       --metric 'pearson_and_spearman'                                              \
+       --test_save output/test_out.$i.tsv                                           \
+       --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.$timestamp.log \
+
+done
--- a/script/en_glue/ernie_base/WNLI/task.sh
+++ b/script/en_glue/ernie_base/WNLI/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0
+fi
+
+
+mkdir -p log/
+
+lr=2e-5
+batch_size=8
+epoch=4
+
+
+for i in {1..5};do
+
+   python -u run_classifier.py                                            \
+       --for_cn False                                                     \
+       --use_cuda true                                                    \
+       --use_fast_executor ${e_executor:-"true"}                          \
+       --tokenizer ${TOKENIZER:-"FullTokenizer"}                          \
+       --use_fp16 ${USE_FP16:-"false"}                                    \
+       --do_train true                                                    \
+       --do_val true                                                      \
+       --do_test true                                                     \
+       --batch_size $batch_size                                           \
+       --init_pretraining_params ${MODEL_PATH}/params                     \
+       --verbose true                                                     \
+       --train_set ${TASK_DATA_PATH}/WNLI/train.tsv                       \
+       --dev_set   ${TASK_DATA_PATH}/WNLI/dev.tsv                         \
+       --test_set  ${TASK_DATA_PATH}/WNLI/test.tsv                        \
+       --vocab_path script/en_glue/ernie_base/vocab.txt                   \
+       --checkpoints ./checkpoints                                        \
+       --save_steps 1000                                                  \
+       --weight_decay  0.0                                                \
+       --warmup_proportion 0.1                                            \
+       --validation_steps 1000000                                         \
+       --epoch $epoch                                                     \
+       --max_seq_len 512                                                  \
+       --ernie_config_path script/en_glue/ernie_base/ernie_config.json    \
+       --learning_rate $lr                                                \
+       --skip_steps 10                                                    \
+       --num_iteration_per_drop_scope 1                                   \
+       --num_labels 2                                                     \
+       --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv          \
+       --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.log  \
+
+done
--- a/script/en_glue/ernie_base/ernie_config.json
+++ b/script/en_glue/ernie_base/ernie_config.json
+{
+  "attention_probs_dropout_prob": 0.1, 
+  "hidden_act": "gelu", 
+  "hidden_dropout_prob": 0.1, 
+  "hidden_size": 768, 
+  "initializer_range": 0.02, 
+  "max_position_embeddings": 512, 
+  "num_attention_heads": 12, 
+  "num_hidden_layers": 12, 
+  "sent_type_vocab_size": 4, 
+  "task_type_vocab_size": 16, 
+  "vocab_size": 30522
+}
--- a/script/en_glue/ernie_base/vocab.txt
+++ b/script/en_glue/ernie_base/vocab.txt
--- a/script/en_glue/ernie_large/CoLA/task.sh
+++ b/script/en_glue/ernie_large/CoLA/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_sync_nccl_allreduce=1
+export FLAGS_eager_delete_tensor_gb=0.0
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0
+fi
+
+mkdir -p log/
+
+lr=3e-5
+batch_size=32
+epoch=5
+
+for i in {1..5};do
+
+    timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+    python -u run_classifier.py                                              \
+           --use_cuda true                                                   \
+           --for_cn  False                                                   \
+           --use_fast_executor ${e_executor:-"true"}                         \
+           --tokenizer ${TOKENIZER:-"FullTokenizer"}                         \
+           --use_fp16 ${USE_FP16:-"false"}                                   \
+           --do_train true                                                   \
+           --do_val true                                                     \
+           --do_test true                                                    \
+           --batch_size $batch_size                                          \
+           --init_pretraining_params ${MODEL_PATH}/params                    \
+           --verbose true                                                    \
+           --train_set ${TASK_DATA_PATH}/CoLA/train.tsv                      \
+           --dev_set   ${TASK_DATA_PATH}/CoLA/dev.tsv                        \
+           --test_set  ${TASK_DATA_PATH}/CoLA/test.tsv                       \
+           --vocab_path script/en_glue/ernie_large/vocab.txt                 \
+           --checkpoints ./checkpoints                                       \
+           --save_steps 1000                                                 \
+           --weight_decay  0.0                                               \
+           --warmup_proportion 0.1                                           \
+           --validation_steps 1000000000                                     \
+           --epoch $epoch                                                    \
+           --max_seq_len 128                                                 \
+           --ernie_config_path script/en_glue/ernie_large/ernie_config.json  \
+           --learning_rate $lr                                               \
+           --skip_steps 10                                                   \
+           --num_iteration_per_drop_scope 1                                  \
+           --num_labels 2                                                    \
+           --metric 'matthews_corrcoef'                                      \
+           --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv         \
+           --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.log \
+
+done
+
+
--- a/script/en_glue/ernie_large/MNLI/task.sh
+++ b/script/en_glue/ernie_large/MNLI/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+mkdir -p log/
+
+for i in {1..5};do
+
+    timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+    python -u run_classifier.py                                                             \
+           --use_cuda true                                                                  \
+           --use_fast_executor ${e_executor:-"true"}                                        \
+           --tokenizer ${TOKENIZER:-"FullTokenizer"}                                        \
+           --use_fp16 ${USE_FP16:-"false"}                                                  \
+           --do_train true                                                                  \
+           --do_val true                                                                    \
+           --do_test true                                                                   \
+           --batch_size 32                                                                  \
+           --init_pretraining_params ${MODEL_PATH}/params                                   \
+           --verbose true                                                                   \
+           --train_set ${TASK_DATA_PATH}/MNLI/train.tsv                                     \
+           --dev_set   ${TASK_DATA_PATH}/MNLI/m/dev.tsv,${TASK_DATA_PATH}/MNLI/mm/dev.tsv   \
+           --test_set  ${TASK_DATA_PATH}/MNLI/m/test.tsv,${TASK_DATA_PATH}/MNLI/mm/test.tsv \
+           --vocab_path script/en_glue/ernie_large/vocab.txt                                \
+           --checkpoints ./checkpoints                                                      \
+           --save_steps 25000                                                               \
+           --weight_decay 0.0                                                               \
+           --warmup_proportion 0.1                                                          \
+           --validation_steps 1000000000000                                                 \
+           --epoch 3                                                                        \
+           --max_seq_len 128                                                                \
+           --ernie_config_path script/en_glue/ernie_large/ernie_config.json                 \
+           --learning_rate 3e-5                                                             \
+           --skip_steps 500                                                                 \
+           --num_iteration_per_drop_scope 1                                                 \
+           --num_labels 3                                                                   \
+           --for_cn False                                                                   \
+           --test_save output/test_out.$i.m.tsv,output/test_out.$i.mm.tsv                   \
+           --diagnostic ${TASK_DATA_PATH}/MNLI/diagnostic.tsv                               \
+           --diagnostic_save output/test_out.$i.m.diagnostic.tsv                            \
+           --random_seed 1 2>&1 | tee  log/job.$i.$timestamp.log                            \
+
+done
--- a/script/en_glue/ernie_large/MRPC/task.sh
+++ b/script/en_glue/ernie_large/MRPC/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1
+fi
+
+mkdir -p log/
+
+
+lr=3e-5
+batch_size=8
+epoch=4
+
+for i in {1..5};do
+
+    timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+    python -u run_classifier.py                                              \
+           --use_cuda true                                                   \
+           --for_cn  False                                                   \
+           --use_fast_executor ${e_executor:-"true"}                         \
+           --tokenizer ${TOKENIZER:-"FullTokenizer"}                         \
+           --use_fp16 ${USE_FP16:-"false"}                                   \
+           --do_train true                                                   \
+           --do_val true                                                     \
+           --do_test true                                                    \
+           --batch_size $batch_size                                          \
+           --init_pretraining_params ${MODEL_PATH}/params                    \
+           --verbose true                                                    \
+           --train_set ${TASK_DATA_PATH}/MRPC/train.tsv                      \
+           --dev_set   ${TASK_DATA_PATH}/MRPC/dev.tsv                        \
+           --test_set  ${TASK_DATA_PATH}/MRPC/test.tsv                       \
+           --vocab_path script/en_glue/ernie_large/vocab.txt                 \
+           --checkpoints ./checkpoints                                       \
+           --save_steps 1000                                                 \
+           --weight_decay  0.0                                               \
+           --warmup_proportion 0.1                                           \
+           --validation_steps 1000000                                        \
+           --epoch $epoch                                                    \
+           --max_seq_len 128                                                 \
+           --ernie_config_path script/en_glue/ernie_large/ernie_config.json  \
+           --learning_rate $lr                                               \
+           --skip_steps 10                                                   \
+           --num_iteration_per_drop_scope 1                                  \
+           --num_labels 2                                                    \
+           --metric 'acc_and_f1'                                             \
+           --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv         \
+           --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.log \
+
+done
--- a/script/en_glue/ernie_large/QNLI/task.sh
+++ b/script/en_glue/ernie_large/QNLI/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+mkdir -p log/
+
+lr=2e-5
+batch_size=32
+epoch=4
+
+for i in {1..5};do
+
+    timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+    python -u run_classifier.py                                            \
+           --use_cuda true                                                 \
+           --for_cn False                                                  \
+           --use_fast_executor ${e_executor:-"true"}                       \
+           --tokenizer ${TOKENIZER:-"FullTokenizer"}                       \
+           --use_fp16 ${USE_FP16:-"false"}                                 \
+           --do_train true                                                 \
+           --do_val true                                                   \
+           --do_test true                                                  \
+           --batch_size $batch_size                                        \
+           --init_pretraining_params ${MODEL_PATH}/params                  \
+           --verbose true                                                  \
+           --train_set ${TASK_DATA_PATH}/QNLI/train.tsv                    \
+           --dev_set   ${TASK_DATA_PATH}/QNLI/dev.tsv                      \
+           --test_set  ${TASK_DATA_PATH}/QNLI/test.tsv                     \
+           --vocab_path script/en_glue/ernie_large/vocab.txt               \
+           --checkpoints ./checkpoints                                     \
+           --save_steps 30000                                              \
+           --weight_decay  0.0                                             \
+           --warmup_proportion 0.1                                         \
+           --validation_steps 1000000000                                   \
+           --epoch $epoch                                                  \
+           --max_seq_len 128                                               \
+           --ernie_config_path script/en_glue/ernie_large/ernie_config.json \
+           --learning_rate $lr                                             \
+           --skip_steps 500                                                \
+           --num_iteration_per_drop_scope 1                                \
+           --num_labels 2                                                  \
+           --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv       \
+           --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.log \
+
+done
+
--- a/script/en_glue/ernie_large/QQP/task.sh
+++ b/script/en_glue/ernie_large/QQP/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+mkdir -p log/
+
+for i in {1..5};do
+
+  timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+  python -u run_classifier.py                                                      \
+       --for_cn False                                                              \
+       --ernie_config_path script/en_glue/ernie_large/ernie_config.json            \
+       --validation_steps 1000000000000                                            \
+       --use_cuda true                                                             \
+       --use_fast_executor ${e_executor:-"true"}                                   \
+       --tokenizer ${TOKENIZER:-"FullTokenizer"}                                   \
+       --use_fp16 ${USE_FP16:-"false"}                                             \
+       --do_train true                                                             \
+       --do_val true                                                               \
+       --do_test true                                                              \
+       --batch_size 32                                                             \
+       --init_pretraining_params ${MODEL_PATH}/params                              \
+       --verbose true                                                              \
+       --train_set ${TASK_DATA_PATH}/QQP/train.tsv                                 \
+       --dev_set   ${TASK_DATA_PATH}/QQP/dev.tsv                                   \
+       --test_set  ${TASK_DATA_PATH}/QQP/test.tsv                                  \
+       --vocab_path script/en_glue/ernie_large/vocab.txt                           \
+       --checkpoints ./checkpoints                                                 \
+       --save_steps 30000                                                          \
+       --weight_decay  0.0                                                         \
+       --warmup_proportion 0.1                                                     \
+       --epoch 3                                                                   \
+       --max_seq_len 128                                                           \
+       --learning_rate 5e-5                                                        \
+       --skip_steps 500                                                            \
+       --num_iteration_per_drop_scope 1                                            \
+       --num_labels 2                                                              \
+       --metric 'acc_and_f1'                                                       \
+       --test_save output/test_out.$i.$timestamp.tsv                               \
+       --random_seed 1 2>&1 | tee  log/job.$i.$timestamp.log                       \
+
+done
--- a/script/en_glue/ernie_large/RTE/task.sh
+++ b/script/en_glue/ernie_large/RTE/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0
+fi
+
+mkdir -p log/
+
+
+for i in {1..5};do
+    timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+
+    python -u run_classifier.py                                             \
+               --use_cuda true                                              \
+               --for_cn False                                               \
+               --use_fast_executor ${e_executor:-"true"}                    \
+               --tokenizer ${TOKENIZER:-"FullTokenizer"}                    \
+               --use_fp16 ${USE_FP16:-"false"}                              \
+               --do_train true                                              \
+               --do_val true                                                \
+               --do_test true                                               \
+               --batch_size 16                                              \
+               --init_pretraining_params ${MODEL_PATH}/params               \
+               --verbose true                                               \
+               --train_set ${TASK_DATA_PATH}/RTE/train.tsv                  \
+               --dev_set   ${TASK_DATA_PATH}/RTE/dev.tsv                    \
+               --test_set  ${TASK_DATA_PATH}/RTE/test.tsv                   \
+               --vocab_path script/en_glue/ernie_large/vocab.txt            \
+               --checkpoints ./checkpoints                                  \
+               --save_steps 1000                                            \
+               --weight_decay  0.0                                          \
+               --warmup_proportion 0.1                                      \
+               --validation_steps 2000000000000                             \
+               --epoch 5                                                    \
+               --max_seq_len 128                                            \
+               --ernie_config_path script/en_glue/ernie_large/ernie_config.json \
+               --learning_rate 3e-5                                         \
+               --skip_steps 10                                              \
+               --num_iteration_per_drop_scope 1                             \
+               --num_labels 2                                               \
+               --test_save output/test_out.$i.$timestamp.tsv                \
+               --random_seed 1 2>&1 | tee  log/job.$i.$timestamp.log        \
+
+done
+
--- a/script/en_glue/ernie_large/SST-2/task.sh
+++ b/script/en_glue/ernie_large/SST-2/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+mkdir -p log/
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+lr=2e-5
+batch_size=8
+epoch=4
+
+
+for i in {1..5};do
+
+ python -u run_classifier.py                                          \
+      --for_cn  False                                                 \
+      --use_cuda true                                                 \
+      --use_fast_executor ${e_executor:-"true"}                       \
+      --tokenizer ${TOKENIZER:-"FullTokenizer"}                       \
+      --use_fp16 ${USE_FP16:-"false"}                                 \
+      --do_train true                                                 \
+      --do_val true                                                   \
+      --do_test true                                                  \
+      --batch_size $batch_size                                        \
+      --init_pretraining_params ${MODEL_PATH}/params                  \
+      --verbose true                                                  \
+      --train_set ${TASK_DATA_PATH}/SST-2/train.tsv                   \
+      --dev_set   ${TASK_DATA_PATH}/SST-2/dev.tsv                     \
+      --test_set  ${TASK_DATA_PATH}/SST-2/test.tsv                    \
+      --vocab_path script/en_glue/ernie_large/vocab.txt               \
+      --checkpoints ./checkpoints                                     \
+      --save_steps 10000                                              \
+      --weight_decay  0.0                                             \
+      --warmup_proportion 0.1                                         \
+      --validation_steps 100000000000                                 \
+      --epoch $epoch                                                  \
+      --max_seq_len 128                                               \
+      --ernie_config_path script/en_glue/ernie_large/ernie_config.json \
+      --learning_rate $lr                                             \
+      --skip_steps 500                                                \
+      --num_iteration_per_drop_scope 1                                \
+      --num_labels 2                                                  \
+      --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv       \
+      --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.log \
+
+done
+
--- a/script/en_glue/ernie_large/STS-B/task.sh
+++ b/script/en_glue/ernie_large/STS-B/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+fi
+
+
+mkdir -p log/
+
+for i in {1..5};do
+
+timestamp=`date "+%Y-%m-%d-%H-%M-%S"`
+python -u run_classifier.py                                             \
+       --use_cuda true                                                  \
+       --for_cn  False                                                  \
+       --use_fast_executor ${e_executor:-"true"}                        \
+       --tokenizer ${TOKENIZER:-"FullTokenizer"}                        \
+       --use_fp16 ${USE_FP16:-"false"}                                  \
+       --do_train true                                                  \
+       --do_val true                                                    \
+       --do_test true                                                   \
+       --batch_size 16                                                  \
+       --init_pretraining_params ${MODEL_PATH}/params                   \
+       --verbose true                                                   \
+       --train_set ${TASK_DATA_PATH}/STS-B/train.tsv                    \
+       --dev_set   ${TASK_DATA_PATH}/STS-B/dev.tsv                      \
+       --test_set  ${TASK_DATA_PATH}/STS-B/test.tsv                     \
+       --vocab_path script/en_glue/ernie_large/vocab.txt                \
+       --checkpoints ./checkpoints                                      \
+       --save_steps 1000                                                \
+       --weight_decay  0.0                                              \
+       --warmup_proportion 0.1                                          \
+       --validation_steps 100000000000                                  \
+       --epoch 3                                                        \
+       --max_seq_len 128                                                \
+       --ernie_config_path script/en_glue/ernie_large/ernie_config.json \
+       --learning_rate 5e-5                                             \
+       --skip_steps 10                                                  \
+       --num_iteration_per_drop_scope 1                                 \
+       --num_labels 1                                                   \
+       --is_classify false                                              \
+       --is_regression true                                             \
+       --metric 'pearson_and_spearman'                                  \
+       --test_save output/test_out.$i.tsv                               \
+       --random_seed 1 2>&1 | tee  log/job.$i.$timestamp.log            \
+
+done
--- a/script/en_glue/ernie_large/WNLI/task.sh
+++ b/script/en_glue/ernie_large/WNLI/task.sh
+#!/bin/bash
+
+R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+
+if [[ -f ./model_conf ]];then
+    source ./model_conf
+else
+    export CUDA_VISIBLE_DEVICES=0
+fi
+
+
+mkdir -p log/
+
+lr=2e-5
+batch_size=8
+epoch=4
+
+for i in {1..5};do
+
+python -u run_classifier.py                                                \
+       --for_cn False                                                      \
+       --use_cuda true                                                     \
+       --use_fast_executor ${e_executor:-"true"}                           \
+       --tokenizer ${TOKENIZER:-"FullTokenizer"}                           \
+       --use_fp16 ${USE_FP16:-"false"}                                     \
+       --do_train true                                                     \
+       --do_val true                                                       \
+       --do_test true                                                      \
+       --batch_size $batch_size                                            \
+       --init_pretraining_params ${MODEL_PATH}/params                      \
+       --verbose true                                                      \
+       --train_set ${TASK_DATA_PATH}/WNLI/train.tsv                        \
+       --dev_set   ${TASK_DATA_PATH}/WNLI/dev.tsv                          \
+       --test_set  ${TASK_DATA_PATH}/WNLI/test.tsv                         \
+       --vocab_path script/en_glue/ernie_large/vocab.txt                   \
+       --checkpoints ./checkpoints                                         \
+       --save_steps 1000                                                   \
+       --weight_decay  0.0                                                 \
+       --warmup_proportion 0.1                                             \
+       --validation_steps 1000000                                          \
+       --epoch $epoch                                                      \
+       --max_seq_len 512                                                   \
+       --ernie_config_path script/en_glue/ernie_large/ernie_config.json    \
+       --learning_rate $lr                                                 \
+       --skip_steps 10                                                     \
+       --num_iteration_per_drop_scope 1                                    \
+       --num_labels 2                                                      \
+       --test_save output/test_out.$i.$lr.$batch_size.$epoch.tsv           \
+       --random_seed 1 2>&1 | tee  log/job.$i.$lr.$batch_size.$epoch.log   \
+
+done
--- a/script/en_glue/ernie_large/ernie_config.json
+++ b/script/en_glue/ernie_large/ernie_config.json
+{
+  "attention_probs_dropout_prob": 0.1, 
+  "hidden_act": "gelu", 
+  "hidden_dropout_prob": 0.1, 
+  "hidden_size": 1024, 
+  "initializer_range": 0.02, 
+  "max_position_embeddings": 512, 
+  "num_attention_heads": 16, 
+  "num_hidden_layers": 24, 
+  "sent_type_vocab_size": 4, 
+  "task_type_vocab_size": 16, 
+  "vocab_size": 30522
+}
--- a/script/en_glue/ernie_large/vocab.txt
+++ b/script/en_glue/ernie_large/vocab.txt
--- a/script/en_glue/preprocess/cvt.sh
+++ b/script/en_glue/preprocess/cvt.sh
+#!/bin/bash
+set -ex
+R_DIR=`dirname $0`; MY_DIR=`cd $R_DIR;pwd`; 
+
+INPUT=$1
+
+if [[ ! -d ./glue_data_processed/ ]];then
+    mkdir ./glue_data_processed/
+fi
+
+
+### CoLA
+mkdir -p ./glue_data_processed/CoLA
+cat $INPUT/CoLA/train.tsv | awk -F"\t"  '{if(NR==1){print "label\ttext_a"} else {print $2"\t"$4}}' > ./glue_data_processed/CoLA/train.tsv
+cat $INPUT/CoLA/dev.tsv   | awk -F"\t"  '{if(NR==1){print "label\ttext_a"} else {print $2"\t"$4}}' > ./glue_data_processed/CoLA/dev.tsv
+cat $INPUT/CoLA/test.tsv  | awk -F"\t"  '{if(NR==1){print "qid\ttext_a\tlabel"}   else {print $0"\t-1"}}'       > ./glue_data_processed/CoLA/test.tsv
+
+### SST-2
+mkdir -p ./glue_data_processed/SST-2
+cat $INPUT/SST-2/train.tsv | awk -F"\t"    '{if(NR==1){print "label\ttext_a"}  else if(NF==2) {print $2"\t"$1}}' > ./glue_data_processed/SST-2/train.tsv
+cat $INPUT/SST-2/dev.tsv   | awk -F"\t"    '{if(NR==1){print "label\ttext_a"}  else if(NF==2) {print $2"\t"$1}}' > ./glue_data_processed/SST-2/dev.tsv
+cat $INPUT/SST-2/test.tsv  | awk -F"\t"    '{if(NR==1){print "qid\ttext_a\tlabel"}    else {print $0"\t-1"}}'    > ./glue_data_processed/SST-2/test.tsv
+
+### MRPC
+mkdir -p ./glue_data_processed/MRPC
+cat $INPUT/MRPC/train.tsv | awk -F"\t" '{if(NR==1){print "text_a\ttext_b\tlabel"} else{print $4"\t"$5"\t"$1}}' > ./glue_data_processed/MRPC/train.tsv
+cat $INPUT/MRPC/dev.tsv   | awk -F"\t" '{if(NR==1){print "text_a\ttext_b\tlabel"} else{print $4"\t"$5"\t"$1}}' > ./glue_data_processed/MRPC/dev.tsv
+cat $INPUT/MRPC/test.tsv  | awk -F"\t" '{if(NR==1){print "qid\ttext_a\ttext_b\tlabel"}   else{print $1"\t"$4"\t"$5"\t-1"}}' > ./glue_data_processed/MRPC/test.tsv
+
+### STS-B
+mkdir -p ./glue_data_processed/STS-B
+cat $INPUT/STS-B/train.tsv | awk -F"\t" '{if(NR==1){print "text_a\ttext_b\tlabel"} else{print $8"\t"$9"\t"$10}}' > ./glue_data_processed/STS-B/train.tsv
+cat $INPUT/STS-B/dev.tsv   | awk -F"\t" '{if(NR==1){print "text_a\ttext_b\tlabel"} else{print $8"\t"$9"\t"$10}}' > ./glue_data_processed/STS-B/dev.tsv
+cat $INPUT/STS-B/test.tsv  | awk -F"\t" '{if(NR==1){print "qid\ttext_a\ttext_b\tlabel"}   else{print $1"\t"$8"\t"$9"\t-1"}}'  > ./glue_data_processed/STS-B/test.tsv
+
+### QQP
+mkdir -p ./glue_data_processed/QQP
+cat $INPUT/QQP/train.tsv | awk -F"\t"  '{if(NR==1){print "text_a\ttext_b\tlabel"} else if($6!="") {print $4"\t"$5"\t"$6}}' > ./glue_data_processed/QQP/train.tsv
+cat $INPUT/QQP/dev.tsv   | awk -F"\t"  '{if(NR==1){print "text_a\ttext_b\tlabel"} else if($6!="") {print $4"\t"$5"\t"$6}}' > ./glue_data_processed/QQP/dev.tsv
+cat $INPUT/QQP/test.tsv  | awk -F"\t"  '{if(NR==1){print "qid\ttext_a\ttext_b\tlabel"}   else {print $0"\t-1"}}'           > ./glue_data_processed/QQP/test.tsv
+
+### MNLI
+mkdir -p ./glue_data_processed/MNLI
+cat $INPUT/MNLI/train.tsv            | python $MY_DIR/mnli.py > ./glue_data_processed/MNLI/train.tsv
+
+mkdir -p ./glue_data_processed/MNLI/m
+cat $INPUT/MNLI/dev_matched.tsv      | python $MY_DIR/mnli.py > ./glue_data_processed/MNLI/m/dev.tsv
+cat $INPUT/MNLI/test_matched.tsv     | python $MY_DIR/mnli.py > ./glue_data_processed/MNLI/m/test.tsv
+
+mkdir -p ./glue_data_processed/MNLI/mm
+cat $INPUT/MNLI/dev_mismatched.tsv   | python $MY_DIR/mnli.py  > ./glue_data_processed/MNLI/mm/dev.tsv
+cat $INPUT/MNLI/test_mismatched.tsv  | python $MY_DIR/mnli.py > ./glue_data_processed/MNLI/mm/test.tsv
+
+### QNLI
+mkdir -p ./glue_data_processed/QNLI
+cat $INPUT/QNLI/train.tsv | python $MY_DIR/qnli.py > ./glue_data_processed/QNLI/train.tsv
+cat $INPUT/QNLI/dev.tsv   | python $MY_DIR/qnli.py > ./glue_data_processed/QNLI/dev.tsv
+cat $INPUT/QNLI/test.tsv  | python $MY_DIR/qnli.py > ./glue_data_processed/QNLI/test.tsv
+
+### RTE
+mkdir -p ./glue_data_processed/RTE
+cat $INPUT/RTE/train.tsv | python $MY_DIR/qnli.py > ./glue_data_processed/RTE/train.tsv
+cat $INPUT/RTE/dev.tsv   | python $MY_DIR/qnli.py > ./glue_data_processed/RTE/dev.tsv
+cat $INPUT/RTE/test.tsv  | python $MY_DIR/qnli.py > ./glue_data_processed/RTE/test.tsv
+
+### WNLI
+mkdir -p ./glue_data_processed/WNLI
+cat $INPUT/WNLI/train.tsv | awk -F"\t"  '{if(NR==1){print "text_a\ttext_b\tlabel"} else {print $2"\t"$3"\t"$4}}' > ./glue_data_processed/WNLI/train.tsv
+cat $INPUT/WNLI/dev.tsv   | awk -F"\t"  '{if(NR==1){print "text_a\ttext_b\tlabel"} else {print $2"\t"$3"\t"$4}}' > ./glue_data_processed/WNLI/dev.tsv
+cat $INPUT/WNLI/test.tsv  | awk -F"\t"  '{if(NR==1){print "qid\ttext_a\ttext_b\tlabel"}   else {print $1"\t"$2"\t"$3"\t-1"}}' > ./glue_data_processed/WNLI/test.tsv
+
+### Diagnostics
+cat $INPUT/diagnostic/diagnostic.tsv | awk -F"\t"  '{if(NR==1){print "qid\ttext_a\ttext_b\tlabel"} else {print $0"\t-1"}}'         > ./glue_data_processed/MNLI/diagnostic.tsv
+
+
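The awk one-liners in `cvt.sh` all follow one pattern: the first input line is replaced by a new header, and every later line is reduced to a few selected columns. As a rough illustration only, the CoLA train/dev conversion could be written in Python as below. The function name `convert_cola_lines` and the sample rows are hypothetical and are not part of the commit.

```python
def convert_cola_lines(lines):
    """Mimic the CoLA awk command: line 1 becomes the header,
    later lines keep column 2 (label) and column 4 (sentence)."""
    out = []
    for line_no, line in enumerate(lines, start=1):
        if line_no == 1:
            # awk replaces the first input line with the new header
            out.append("label\ttext_a")
            continue
        cols = line.rstrip("\n").split("\t")
        # awk yields empty strings for missing fields; pad to match that
        cols += [""] * (4 - len(cols))
        out.append("{}\t{}".format(cols[1], cols[3]))
    return out


# Hypothetical sample input, for illustration only
sample = [
    "source\tlabel\tnote\tsentence\n",
    "gj04\t1\t\tThe book was read.\n",
]
for row in convert_cola_lines(sample):
    print(row)
```

Note that, like the awk command, this sketch discards whatever is on the first input line, whether or not it is a header.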
--- a/script/en_glue/preprocess/mnli.py
+++ b/script/en_glue/preprocess/mnli.py
+import sys
+
+mapping = {
+    'contradiction': 0,
+    'neutral': 1,
+    'entailment': 2,
+}
+
+i = 0
+for line in sys.stdin:
+    arr = line.strip().split('\t')
+
+    if len(arr) == 12:
+        if i == 0:
+            i += 1
+            print('text_a\ttext_b\tlabel')
+            continue
+        print("{}\t{}\t{}".format(arr[8], arr[9], mapping[arr[11]]))
+    elif len(arr) == 16:
+        if i == 0:
+            i += 1
+            print('text_a\ttext_b\tlabel')
+            continue
+        s1 = arr[8]
+        s2 = arr[9]
+        s3 = arr[15]
+        print("{}\t{}\t{}".format(s1, s2, mapping[s3]))
+    else:
+        if i == 0:
+            i += 1
+            print('qid\ttext_a\ttext_b\tlabel')
+            continue
+        qid = arr[0]
+        s1 = arr[8]
+        s2 = arr[9]
+        print("{}\t{}\t{}\t-1".format(qid, s1, s2))
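`mnli.py` chooses its output format from the number of tab-separated columns: 12 columns (label in column 12), 16 columns (label in column 16), or anything else (treated as unlabeled test data with a `qid`). The sketch below restates that dispatch as a pure function so it can be checked in isolation. The name `convert_mnli_row` is hypothetical, and header handling is deliberately left out.

```python
MAPPING = {"contradiction": 0, "neutral": 1, "entailment": 2}


def convert_mnli_row(arr):
    """Return the output line for one already-split, non-header MNLI row."""
    if len(arr) == 12:
        # 12-column rows carry the label in the last column
        return "{}\t{}\t{}".format(arr[8], arr[9], MAPPING[arr[11]])
    if len(arr) == 16:
        # 16-column rows also carry the label in the last column
        return "{}\t{}\t{}".format(arr[8], arr[9], MAPPING[arr[15]])
    # any other width is treated as an unlabeled row: emit qid and a -1 label
    return "{}\t{}\t{}\t-1".format(arr[0], arr[8], arr[9])


# Hypothetical rows; only the columns that the script reads are meaningful
labeled = ["x"] * 12
labeled[8], labeled[9], labeled[11] = "A premise", "A hypothesis", "neutral"
unlabeled = ["7"] + ["x"] * 9
unlabeled[8], unlabeled[9] = "A premise", "A hypothesis"

print(convert_mnli_row(labeled))
print(convert_mnli_row(unlabeled))
```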
--- a/script/en_glue/preprocess/qnli.py
+++ b/script/en_glue/preprocess/qnli.py
+import sys
+
+mapping = {'entailment': 1, 'not_entailment': 0}
+
+i = 0
+for line in sys.stdin:
+    arr = line.strip().split('\t')
+    s1 = arr[1]
+    s2 = arr[2]
+    if len(arr) == 4:
+        if i == 0:
+            i += 1
+            print('text_a\ttext_b\tlabel')
+            continue
+        s3 = arr[3]
+        print("{}\t{}\t{}".format(s1, s2, mapping[s3]))
+    else:
+        if i == 0:
+            i += 1
+            print('qid\ttext_a\ttext_b\tlabel')
+            continue
+        print("{}\t{}\t{}\t-1".format(arr[0], s1, s2))
--- a/script/zh_task/ernie_base/run_ChnSentiCorp.sh
+++ b/script/zh_task/ernie_base/run_ChnSentiCorp.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0
+
+python -u run_classifier.py \
+                   --use_cuda true \
+                   --verbose true \
+                   --do_train true \
+                   --do_val true \
+                   --do_test false \
+                   --batch_size 24 \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --train_set ${TASK_DATA_PATH}/chnsenticorp/train.tsv \
+                   --dev_set ${TASK_DATA_PATH}/chnsenticorp/dev.tsv,${TASK_DATA_PATH}/chnsenticorp/test.tsv \
+                   --vocab_path ${MODEL_PATH}/vocab.txt \
+                   --checkpoints ./checkpoints \
+                   --save_steps 1000 \
+                   --weight_decay  0.01 \
+                   --warmup_proportion 0.0 \
+                   --validation_steps 100 \
+                   --epoch 10 \
+                   --max_seq_len 256 \
+                   --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                   --learning_rate 5e-5 \
+                   --skip_steps 10 \
+                   --num_iteration_per_drop_scope 1 \
+                   --num_labels 2 \
+                   --random_seed 1
--- a/script/zh_task/ernie_base/run_bq.sh
+++ b/script/zh_task/ernie_base/run_bq.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0
+
+python -u ./run_classifier.py \
+                     --use_cuda true \
+                     --verbose true \
+                     --do_train true \
+                     --do_val true \
+                     --do_test true \
+                     --batch_size 64 \
+                     --task_id 2 \
+                     --init_pretraining_params ${MODEL_PATH}/params \
+                     --train_set ${TASK_DATA_PATH}/bq/train.tsv \
+                     --dev_set ${TASK_DATA_PATH}/bq/dev.tsv,${TASK_DATA_PATH}/bq/test.tsv \
+                     --vocab_path ${MODEL_PATH}/vocab.txt \
+                     --checkpoints ./checkpoints \
+                     --save_steps 1000 \
+                     --weight_decay  0.01 \
+                     --warmup_proportion 0.0 \
+                     --validation_steps 100 \
+                     --epoch 3 \
+                     --max_seq_len 128 \
+                     --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                     --learning_rate 3e-5 \
+                     --skip_steps 10 \
+                     --num_iteration_per_drop_scope 1 \
+                     --num_labels 2 \
+                     --random_seed 1
--- a/script/zh_task/ernie_base/run_cmrc2018.sh
+++ b/script/zh_task/ernie_base/run_cmrc2018.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3
+
+python -u run_mrc.py --use_cuda true \
+                    --batch_size 16 \
+                    --in_tokens false \
+                    --use_fast_executor true \
+                    --checkpoints ./checkpoints \
+                    --vocab_path ${MODEL_PATH}/vocab.txt  \
+                    --do_train true \
+                    --do_val true \
+                    --do_test false \
+                    --verbose true \
+                    --save_steps 1000 \
+                    --validation_steps 100 \
+                    --warmup_proportion 0.0 \
+                    --weight_decay  0.01 \
+                    --epoch 2 \
+                    --max_seq_len 512 \
+                    --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                    --do_lower_case true \
+                    --doc_stride 128 \
+                    --train_set ${TASK_DATA_PATH}/cmrc2018/train.json \
+                    --dev_set ${TASK_DATA_PATH}/cmrc2018/dev.json \
+                    --learning_rate 3e-5 \
+                    --num_iteration_per_drop_scope 1 \
+                    --init_pretraining_params ${MODEL_PATH}/params \
+                    --skip_steps 10
--- a/script/zh_task/ernie_base/run_dbqa.sh
+++ b/script/zh_task/ernie_base/run_dbqa.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+python -u run_classifier.py \
+                   --use_cuda true \
+                   --verbose true \
+                   --do_train true \
+                   --do_val true \
+                   --do_test false \
+                   --batch_size 8 \
+                   --metric "acc_and_f1_and_mrr" \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --train_set ${TASK_DATA_PATH}/nlpcc-dbqa/train.tsv \
+                   --dev_set ${TASK_DATA_PATH}/nlpcc-dbqa/dev.tsv,${TASK_DATA_PATH}/nlpcc-dbqa/test.tsv \
+                   --use_multi_gpu_test true \
+                   --vocab_path ${MODEL_PATH}/vocab.txt \
+                   --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                   --checkpoints "./checkpoints" \
+                   --save_steps 1000 \
+                   --weight_decay  0.01 \
+                   --warmup_proportion 0.0 \
+                   --validation_steps 1000 \
+                   --epoch 3 \
+                   --max_seq_len 512 \
+                   --learning_rate 2e-5 \
+                   --skip_steps 10 \
+                   --num_iteration_per_drop_scope 1 \
+                   --num_labels 2 \
+                   --random_seed 1
--- a/script/zh_task/ernie_base/run_drcd.sh
+++ b/script/zh_task/ernie_base/run_drcd.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3
+
+python -u run_mrc.py --use_cuda true \
+                    --batch_size 16 \
+                    --in_tokens false \
+                    --use_fast_executor true \
+                    --checkpoints ./checkpoints \
+                    --vocab_path ${MODEL_PATH}/vocab.txt  \
+                    --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                    --do_train true \
+                    --do_val true \
+                    --do_test true \
+                    --verbose true \
+                    --save_steps 1000 \
+                    --validation_steps 100 \
+                    --warmup_proportion 0.0 \
+                    --weight_decay  0.01 \
+                    --epoch 2 \
+                    --max_seq_len 512 \
+                    --do_lower_case true \
+                    --doc_stride 128 \
+                    --train_set ${TASK_DATA_PATH}/drcd/train.json \
+                    --dev_set ${TASK_DATA_PATH}/drcd/dev.json \
+                    --test_set ${TASK_DATA_PATH}/drcd/test.json \
+                    --learning_rate 5e-5 \
+                    --num_iteration_per_drop_scope 1 \
+                    --init_pretraining_params ${MODEL_PATH}/params \
+                    --skip_steps 10
--- a/script/zh_task/ernie_base/run_lcqmc.sh
+++ b/script/zh_task/ernie_base/run_lcqmc.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0
+
+python -u run_classifier.py \
+                   --use_cuda true \
+                   --verbose true \
+                   --do_train true \
+                   --do_val true \
+                   --do_test false \
+                   --batch_size 32 \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --train_set ${TASK_DATA_PATH}/lcqmc/train.tsv \
+                   --dev_set ${TASK_DATA_PATH}/lcqmc/dev.tsv,${TASK_DATA_PATH}/lcqmc/test.tsv \
+                   --vocab_path ${MODEL_PATH}/vocab.txt \
+                   --checkpoints ./checkpoints \
+                   --save_steps 1000 \
+                   --weight_decay  0.0 \
+                   --warmup_proportion 0.0 \
+                   --validation_steps 100 \
+                   --epoch 3 \
+                   --max_seq_len 128 \
+                   --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                   --learning_rate 2e-5 \
+                   --skip_steps 10 \
+                   --num_iteration_per_drop_scope 1 \
+                   --num_labels 2 \
+                   --random_seed 1
--- a/script/zh_task/ernie_base/run_msra_ner.sh
+++ b/script/zh_task/ernie_base/run_msra_ner.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0
+
+python -u run_sequence_labeling.py \
+                   --use_cuda true \
+                   --do_train true \
+                   --do_val true \
+                   --do_test true \
+                   --batch_size 16 \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --num_labels 7 \
+                   --label_map_config ${TASK_DATA_PATH}/msra_ner/label_map.json \
+                   --train_set ${TASK_DATA_PATH}/msra_ner/train.tsv \
+                   --dev_set ${TASK_DATA_PATH}/msra_ner/dev.tsv \
+                   --test_set ${TASK_DATA_PATH}/msra_ner/test.tsv \
+                   --vocab_path ${MODEL_PATH}/vocab.txt \
+                   --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                   --checkpoints ./checkpoints \
+                   --save_steps 100000 \
+                   --weight_decay  0.01 \
+                   --warmup_proportion 0.0 \
+                   --validation_steps 100 \
+                   --epoch 6 \
+                   --max_seq_len 256 \
+                   --learning_rate 5e-5 \
+                   --skip_steps 10 \
+                   --num_iteration_per_drop_scope 1 \
+                   --random_seed 1
--- a/script/zh_task/ernie_base/run_thuc.sh
+++ b/script/zh_task/ernie_base/run_thuc.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+python -u run_classifier.py \
+                   --use_cuda true \
+                   --do_train true \
+                   --do_val true \
+                   --do_test true \
+                   --verbose true \
+                   --batch_size 8 \
+                   --task_id 2 \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --train_set ${TASK_DATA_PATH}/thuc/train.tsv \
+                   --dev_set ${TASK_DATA_PATH}/thuc/dev.tsv,${TASK_DATA_PATH}/thuc/test.tsv \
+                   --vocab_path ${MODEL_PATH}/vocab.txt \
+                   --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                   --checkpoints ./checkpoints \
+                   --save_steps 1000 \
+                   --weight_decay  0.01 \
+                   --warmup_proportion 0.0 \
+                   --validation_steps 100 \
+                   --epoch 3 \
+                   --max_seq_len 512 \
+                   --learning_rate 3e-5 \
+                   --skip_steps 10 \
+                   --num_iteration_per_drop_scope 1 \
+                   --num_labels 10 \
+                   --random_seed 1
--- a/script/zh_task/ernie_base/run_xnli.sh
+++ b/script/zh_task/ernie_base/run_xnli.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+python -u run_classifier.py \
+                   --use_cuda true \
+                   --do_train true \
+                   --do_val true \
+                   --do_test false \
+                   --verbose true \
+                   --batch_size 8192 \
+                   --in_tokens true \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --train_set ${TASK_DATA_PATH}/xnli/train.tsv \
+                   --dev_set ${TASK_DATA_PATH}/xnli/dev.tsv,${TASK_DATA_PATH}/xnli/test.tsv \
+                   --vocab_path ${MODEL_PATH}/vocab.txt \
+                   --label_map ${TASK_DATA_PATH}/xnli/label_map.json \
+                   --ernie_config_path ${MODEL_PATH}/ernie_config.json \
+                   --checkpoints ./checkpoints \
+                   --save_steps 1000 \
+                   --weight_decay  0.01 \
+                   --warmup_proportion 0.0 \
+                   --validation_steps 25 \
+                   --epoch 3 \
+                   --max_seq_len 512 \
+                   --learning_rate 1e-4 \
+                   --skip_steps 10 \
+                   --num_iteration_per_drop_scope 1 \
+                   --num_labels 3 \
+                   --random_seed 1
--- a/ERNIE/script/run_ChnSentiCorp.sh
+++ b/ERNIE/script/run_ChnSentiCorp.sh
@@ -12,8 +12,7 @@ python -u run_classifier.py \
                   --batch_size 24 \
                   --init_pretraining_params ${MODEL_PATH}/params \
                   --train_set ${TASK_DATA_PATH}/chnsenticorp/train.tsv \
-                   --dev_set ${TASK_DATA_PATH}/chnsenticorp/dev.tsv \
-                   --test_set ${TASK_DATA_PATH}/chnsenticorp/test.tsv \
+                   --dev_set ${TASK_DATA_PATH}/chnsenticorp/dev.tsv,${TASK_DATA_PATH}/chnsenticorp/test.tsv \
                   --vocab_path config/vocab.txt \
                   --checkpoints ./checkpoints \
                   --save_steps 1000 \
@@ -23,7 +22,7 @@ python -u run_classifier.py \
                   --epoch 10 \
                   --max_seq_len 256 \
                   --ernie_config_path config/ernie_config.json \
-                   --learning_rate 5e-5 \
+                   --learning_rate 1e-5 \
                   --skip_steps 10 \
                   --num_iteration_per_drop_scope 1 \
                   --num_labels 2 \

--- a/script/zh_task/ernie_large/run_bq.sh
+++ b/script/zh_task/ernie_large/run_bq.sh
+set -eux
+
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0
+
+python -u ./run_classifier.py \
+                     --use_cuda true \
+                     --verbose true \
+                     --do_train true \
+                     --do_val true \
+                     --do_test true \
+                     --batch_size 64 \
+                     --init_pretraining_params ${MODEL_PATH}/params \
+                     --train_set ${TASK_DATA_PATH}/bq/train.tsv \
+                     --dev_set ${TASK_DATA_PATH}/bq/dev.tsv,${TASK_DATA_PATH}/bq/test.tsv \
+                     --vocab_path config/vocab.txt \
+                     --checkpoints ./checkpoints \
+                     --save_steps 1000 \
+                     --weight_decay  0.01 \
+                     --warmup_proportion 0.0 \
+                     --validation_steps 100 \
+                     --epoch 3 \
+                     --max_seq_len 128 \
+                     --ernie_config_path config/ernie_config.json \
+                     --learning_rate 1.5e-5 \
+                     --skip_steps 10 \
+                     --num_iteration_per_drop_scope 1 \
+                     --num_labels 2 \
+                     --random_seed 1 &>job.log
--- a/script/zh_task/ernie_large/run_cmrc2018.sh
+++ b/script/zh_task/ernie_large/run_cmrc2018.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+python -u run_mrc.py --use_cuda true \
+                    --batch_size 8 \
+                    --in_tokens false \
+                    --use_fast_executor true \
+                    --checkpoints ./checkpoints \
+                    --vocab_path ./config/vocab.txt  \
+                    --do_train true \
+                    --do_val true \
+                    --do_test false \
+                    --verbose true \
+                    --save_steps 1000 \
+                    --validation_steps 100 \
+                    --warmup_proportion 0.0 \
+                    --weight_decay  0.01 \
+                    --epoch 2 \
+                    --max_seq_len 512 \
+                    --ernie_config_path ./config/ernie_config.json \
+                    --do_lower_case true \
+                    --doc_stride 128 \
+                    --train_set ${TASK_DATA_PATH}/cmrc2018/train.json \
+                    --dev_set ${TASK_DATA_PATH}/cmrc2018/dev.json \
+                    --learning_rate 3e-5 \
+                    --num_iteration_per_drop_scope 1 \
+                    --init_pretraining_params ${MODEL_PATH}/params \
+                    --skip_steps 10
--- a/ERNIE/script/run_dbqa.sh
+++ b/ERNIE/script/run_dbqa.sh
@@ -10,10 +10,11 @@ python -u run_classifier.py \
                   --do_val true \
                   --do_test true \
                   --batch_size 8 \
+                   --metric "acc_and_f1_and_mrr" \
                   --init_pretraining_params ${MODEL_PATH}/params \
                   --train_set ${TASK_DATA_PATH}/nlpcc-dbqa/train.tsv \
-                   --dev_set ${TASK_DATA_PATH}/nlpcc-dbqa/dev.tsv \
-                   --test_set ${TASK_DATA_PATH}/nlpcc-dbqa/test.tsv \
+                   --dev_set ${TASK_DATA_PATH}/nlpcc-dbqa/dev.tsv,${TASK_DATA_PATH}/nlpcc-dbqa/test.tsv \
+                   --use_multi_gpu_test true \
                   --vocab_path config/vocab.txt \
                   --ernie_config_path config/ernie_config.json \
                   --checkpoints "./checkpoints" \
@@ -23,7 +24,7 @@ python -u run_classifier.py \
                   --validation_steps 1000 \
                   --epoch 3 \
                   --max_seq_len 512 \
-                   --learning_rate 2e-5 \
+                   --learning_rate 1e-5 \
                   --skip_steps 10 \
                   --num_iteration_per_drop_scope 1 \
                   --num_labels 2 \

--- a/script/zh_task/ernie_large/run_drcd.sh
+++ b/script/zh_task/ernie_large/run_drcd.sh
+set -eux
+
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+python -u run_mrc.py --use_cuda true \
+                    --batch_size 8 \
+                    --in_tokens false \
+                    --use_fast_executor true \
+                    --checkpoints ./checkpoints \
+                    --vocab_path ./config/vocab.txt  \
+                    --ernie_config_path ./config/ernie_config.json \
+                    --do_train true \
+                    --do_val true \
+                    --do_test true \
+                    --verbose true \
+                    --save_steps 1000 \
+                    --validation_steps 100 \
+                    --warmup_proportion 0.0 \
+                    --weight_decay  0.01 \
+                    --epoch 2 \
+                    --max_seq_len 512 \
+                    --do_lower_case true \
+                    --doc_stride 128 \
+                    --train_set ${TASK_DATA_PATH}/drcd/train.json \
+                    --dev_set ${TASK_DATA_PATH}/drcd/dev.json \
+                    --test_set ${TASK_DATA_PATH}/drcd/test.json \
+                    --learning_rate 3e-5 \
+                    --num_iteration_per_drop_scope 1 \
+                    --init_pretraining_params ${MODEL_PATH}/params \
+                    --skip_steps 10
+
--- a/ERNIE/script/run_lcqmc.sh
+++ b/ERNIE/script/run_lcqmc.sh
@@ -12,8 +12,7 @@ python -u run_classifier.py \
                   --batch_size 32 \
                   --init_pretraining_params ${MODEL_PATH}/params \
                   --train_set ${TASK_DATA_PATH}/lcqmc/train.tsv \
-                   --dev_set ${TASK_DATA_PATH}/lcqmc/dev.tsv \
-                   --test_set ${TASK_DATA_PATH}/lcqmc/test.tsv \
+                   --dev_set ${TASK_DATA_PATH}/lcqmc/dev.tsv,${TASK_DATA_PATH}/lcqmc/test.tsv \
                   --vocab_path config/vocab.txt \
                   --checkpoints ./checkpoints \
                   --save_steps 1000 \
@@ -23,7 +22,7 @@ python -u run_classifier.py \
                   --epoch 3 \
                   --max_seq_len 128 \
                   --ernie_config_path config/ernie_config.json \
-                   --learning_rate 2e-5 \
+                   --learning_rate 5e-6 \
                   --skip_steps 10 \
                   --num_iteration_per_drop_scope 1 \
                   --num_labels 2 \

--- a/ERNIE/script/run_msra_ner.sh
+++ b/ERNIE/script/run_msra_ner.sh
@@ -22,9 +22,9 @@ python -u run_sequence_labeling.py \
                   --weight_decay  0.01 \
                   --warmup_proportion 0.0 \
                   --validation_steps 100 \
-                   --epoch 3 \
+                   --epoch 6 \
                   --max_seq_len 256 \
-                   --learning_rate 5e-5 \
+                   --learning_rate 1e-5 \
                   --skip_steps 10 \
                   --num_iteration_per_drop_scope 1 \
                   --random_seed 1
--- a/script/zh_task/ernie_large/run_thuc.sh
+++ b/script/zh_task/ernie_large/run_thuc.sh
+set -eux
+
+export FLAGS_sync_nccl_allreduce=1
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+python -u run_classifier.py \
+                   --use_cuda true \
+                   --do_train true \
+                   --do_val true \
+                   --do_test true \
+                   --verbose true \
+                   --batch_size 8 \
+                   --init_pretraining_params ${MODEL_PATH}/params \
+                   --train_set ${TASK_DATA_PATH}/thuc/train.tsv \
+                   --dev_set ${TASK_DATA_PATH}/thuc/dev.tsv,${TASK_DATA_PATH}/thuc/test.tsv \
+                   --vocab_path config/vocab.txt \
+                   --ernie_config_path config/ernie_config.json \
+                   --checkpoints ./checkpoints \
+                   --save_steps 1000 \
+                   --weight_decay  0.01 \
+                   --warmup_proportion 0.0 \
+                   --validation_steps 100 \
+                   --epoch 3 \
+                   --max_seq_len 512 \
+                   --learning_rate 1.5e-5 \
+                   --skip_steps 10 \
+                   --num_iteration_per_drop_scope 1 \
+                   --num_labels 10 \
+                   --random_seed 1
--- a/ERNIE/script/run_xnli.sh
+++ b/ERNIE/script/run_xnli.sh
@@ -13,8 +13,7 @@ python -u run_classifier.py \
                   --in_tokens true \
                   --init_pretraining_params ${MODEL_PATH}/params \
                   --train_set ${TASK_DATA_PATH}/xnli/train.tsv \
-                   --dev_set ${TASK_DATA_PATH}/xnli/dev.tsv \
-                   --test_set ${TASK_DATA_PATH}/xnli/test.tsv \
+                   --dev_set ${TASK_DATA_PATH}/xnli/dev.tsv,${TASK_DATA_PATH}/xnli/test.tsv \
                   --vocab_path config/vocab.txt \
                   --label_map ${TASK_DATA_PATH}/xnli/label_map.json \
                   --ernie_config_path config/ernie_config.json \
@@ -25,7 +24,7 @@ python -u run_classifier.py \
                   --validation_steps 25 \
                   --epoch 3 \
                   --max_seq_len 512 \
-                   --learning_rate 1e-4 \
+                   --learning_rate 4e-5 \
                   --skip_steps 10 \
                   --num_iteration_per_drop_scope 1 \
                   --num_labels 3 \

--- a/ERNIE/script/pretrain.sh
+++ b/ERNIE/script/pretrain.sh
 set -eux

+export FLAGS_eager_delete_tensor_gb=0
 export FLAGS_sync_nccl_allreduce=1
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7


--- a/ERNIE/tokenization.py
+++ b/ERNIE/tokenization.py
@@ -368,3 +368,46 @@ def _is_punctuation(char):
    if cat.startswith("P"):
        return True
    return False
+
+
+def tokenize_chinese_chars(text):
+    """Splits text into a list where each CJK character is its own element."""
+
+    def _is_chinese_char(cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as are Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and are handled
+        # like all of the other languages.
+        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
+            (cp >= 0x3400 and cp <= 0x4DBF) or  #
+            (cp >= 0x20000 and cp <= 0x2A6DF) or  #
+            (cp >= 0x2A700 and cp <= 0x2B73F) or  #
+            (cp >= 0x2B740 and cp <= 0x2B81F) or  #
+            (cp >= 0x2B820 and cp <= 0x2CEAF) or
+            (cp >= 0xF900 and cp <= 0xFAFF) or  #
+            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
+            return True
+
+        return False
+
+    output = []
+    buff = ""
+    for char in text:
+        cp = ord(char)
+        if _is_chinese_char(cp):
+            if buff != "":
+                output.append(buff)
+                buff = ""
+            output.append(char)
+        else:
+            buff += char
+
+    if buff != "":
+        output.append(buff)
+
+    return output
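To make the behaviour of `tokenize_chinese_chars` concrete, here is a condensed restatement of the same loop, assuming Python 3 strings. The name `split_cjk` and the `CJK_RANGES` tuple are illustrative only; the ranges are copied from `_is_chinese_char` in the diff above.

```python
# Code point ranges checked by _is_chinese_char in the diff
CJK_RANGES = (
    (0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF),
    (0x2A700, 0x2B73F), (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
)


def split_cjk(text):
    """Return a list in which each CJK character stands alone and
    runs of other characters (including spaces) stay together."""
    output, buff = [], ""
    for char in text:
        if any(lo <= ord(char) <= hi for lo, hi in CJK_RANGES):
            if buff:
                # flush the pending non-CJK run before the CJK character
                output.append(buff)
                buff = ""
            output.append(char)
        else:
            buff += char
    if buff:
        output.append(buff)
    return output


print(split_cjk("ERNIE 2.0 发布 now"))
# → ['ERNIE 2.0 ', '发', '布', ' now']
```

As the output shows, whitespace is not stripped: it remains inside the non-CJK segments, so callers that need clean tokens would have to strip or re-tokenize those segments themselves.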
--- a/ERNIE/train.py
+++ b/ERNIE/train.py
@@ -25,7 +25,7 @@ import numpy as np
 import paddle.fluid as fluid

 from reader.pretraining import ErnieDataReader
-from model.ernie import ErnieModel, ErnieConfig
+from model.ernie_v1 import ErnieModel, ErnieConfig
 from optimization import optimization
 from utils.args import print_arguments, check_cuda
 from utils.init import init_checkpoint, init_pretraining_params
@@ -171,7 +171,7 @@ def train(args):
        with fluid.unique_name.guard():
            train_pyreader, next_sent_acc, mask_lm_loss, total_loss = create_model(
                pyreader_name='train_reader', ernie_config=ernie_config)
-            scheduled_lr = optimization(
+            scheduled_lr, loss_scaling = optimization(
                loss=total_loss,
                warmup_steps=args.warmup_steps,
                num_train_steps=args.num_train_steps,
@@ -180,8 +180,7 @@ def train(args):
                startup_prog=startup_prog,
                weight_decay=args.weight_decay,
                scheduler=args.lr_scheduler,
-                use_fp16=args.use_fp16,
-                loss_scaling=args.loss_scaling)
+                use_fp16=args.use_fp16)

            fluid.memory_optimize(
                input_program=train_program,

--- a/ERNIE/utils/__init__.py
+++ b/ERNIE/utils/__init__.py
--- a/ERNIE/utils/args.py
+++ b/ERNIE/utils/args.py
--- a/ERNIE/utils/cards.py
+++ b/ERNIE/utils/cards.py
@@ -12,9 +12,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-
 import os

+
 def get_cards():
    """
    get gpu cards number
@@ -24,5 +24,3 @@ def get_cards():
    if cards != '':
        num = len(cards.split(","))
    return num
-
-
--- a/utils/cmrc2018_eval.py
+++ b/utils/cmrc2018_eval.py
+# -*- coding: utf-8 -*-
+'''
+Evaluation script for CMRC 2018
+version: v5
+Note:
+v5: formatted output, added usage description
+v4: fixed segmentation issues
+'''
+from __future__ import print_function
+from collections import Counter, OrderedDict
+import string
+import re
+import argparse
+import json
+import sys
+reload(sys)
+sys.setdefaultencoding('utf8')
+import nltk
+
+
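+# split mixed Chinese/English text: each Chinese character or listed
+# punctuation mark is its own token; other runs go through nltk.word_tokenize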
+def mixed_segmentation(in_str, rm_punc=False):
+    in_str = str(in_str).decode('utf-8').lower().strip()
+    segs_out = []
+    temp_str = ""
+    sp_char = [
+        '-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', '，', '。', '：',
+        '？', '！', '“', '”', '；', '’', '《', '》', '……', '·', '、', '「', '」', '（',
+        '）', '－', '～', '『', '』'
+    ]
+    for char in in_str:
+        if rm_punc and char in sp_char:
+            continue
+        if re.search(ur'[\u4e00-\u9fa5]', char) or char in sp_char:
+            if temp_str != "":
+                ss = nltk.word_tokenize(temp_str)
+                segs_out.extend(ss)
+                temp_str = ""
+            segs_out.append(char)
+        else:
+            temp_str += char
+
+    # tokenize the trailing non-Chinese run, if any
+    if temp_str != "":
+        ss = nltk.word_tokenize(temp_str)
+        segs_out.extend(ss)
+
+    return segs_out
+
+
+# remove punctuation
+def remove_punctuation(in_str):
+    in_str = str(in_str).decode('utf-8').lower().strip()
+    sp_char = [
+        '-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', '，', '。', '：',
+        '？', '！', '“', '”', '；', '’', '《', '》', '……', '·', '、', '「', '」', '（',
+        '）', '－', '～', '『', '』'
+    ]
+    out_segs = []
+    for char in in_str:
+        if char in sp_char:
+            continue
+        else:
+            out_segs.append(char)
+    return ''.join(out_segs)
+
+
+# find the longest common (contiguous) substring of s1 and s2
+def find_lcs(s1, s2):
+    m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)]
+    mmax = 0
+    p = 0
+    for i in range(len(s1)):
+        for j in range(len(s2)):
+            if s1[i] == s2[j]:
+                m[i + 1][j + 1] = m[i][j] + 1
+                if m[i + 1][j + 1] > mmax:
+                    mmax = m[i + 1][j + 1]
+                    p = i + 1
+    return s1[p - mmax:p], mmax
+
+
+# compute F1 and EM over all questions; unanswered questions score zero
+def evaluate(ground_truth_file, prediction_file):
+    f1 = 0
+    em = 0
+    total_count = 0
+    skip_count = 0
+    for instances in ground_truth_file["data"]:
+        for instance in instances["paragraphs"]:
+            context_text = instance['context'].strip()
+            for qas in instance['qas']:
+                total_count += 1
+                query_id = qas['id'].strip()
+                query_text = qas['question'].strip()
+                answers = [ans["text"] for ans in qas["answers"]]
+
+                if query_id not in prediction_file:
+                    sys.stderr.write('Unanswered question: {}\n'.format(
+                        query_id))
+                    skip_count += 1
+                    continue
+
+                prediction = str(prediction_file[query_id])
+                f1 += calc_f1_score(answers, prediction)
+                em += calc_em_score(answers, prediction)
+
+    f1_score = 100.0 * f1 / total_count
+    em_score = 100.0 * em / total_count
+    return f1_score, em_score, total_count, skip_count
+
+
+def calc_f1_score(answers, prediction):
+    f1_scores = []
+    for ans in answers:
+        ans_segs = mixed_segmentation(ans, rm_punc=True)
+        prediction_segs = mixed_segmentation(prediction, rm_punc=True)
+        lcs, lcs_len = find_lcs(ans_segs, prediction_segs)
+        if lcs_len == 0:
+            f1_scores.append(0)
+            continue
+        precision = 1.0 * lcs_len / len(prediction_segs)
+        recall = 1.0 * lcs_len / len(ans_segs)
+        f1 = (2 * precision * recall) / (precision + recall)
+        f1_scores.append(f1)
+    return max(f1_scores)
+
+
+def calc_em_score(answers, prediction):
+    em = 0
+    for ans in answers:
+        ans_ = remove_punctuation(ans)
+        prediction_ = remove_punctuation(prediction)
+        if ans_ == prediction_:
+            em = 1
+            break
+    return em
+
+
+def eval_file(dataset_file, prediction_file):
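+    # use context managers so both file handles are closed after loading
+    with open(dataset_file, 'rb') as f:
+        ground_truth_file = json.load(f)
+    with open(prediction_file, 'rb') as f:
+        prediction_file = json.load(f)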
+    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
+    AVG = (EM + F1) * 0.5
+    return EM, F1, AVG, TOTAL
+
+
+if __name__ == '__main__':
+    EM, F1, AVG, TOTAL = eval_file(sys.argv[1], sys.argv[2])
+    print(EM)
+    print(F1)
+    print(TOTAL)
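The F1 computation in `calc_f1_score` can be illustrated with a small Python 3 sketch. It assumes the inputs are already segmented into token lists (the script above does this with `mixed_segmentation`), and the names `longest_common_substring_len` and `lcs_f1` are hypothetical, not part of the script. Precision is the longest common substring length over the prediction length, recall is the same length over the answer length, and F1 is their harmonic mean.

```python
def longest_common_substring_len(s1, s2):
    """Length of the longest contiguous run of tokens shared by s1 and s2."""
    best = 0
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0] * (len(s2) + 1)
        for j, b in enumerate(s2):
            if a == b:
                # Extend the common run that ended at the previous token pair.
                cur[j + 1] = prev[j] + 1
                best = max(best, cur[j + 1])
        prev = cur
    return best


def lcs_f1(answer_segs, prediction_segs):
    """F1 from the longest common substring, for one answer/prediction pair."""
    lcs_len = longest_common_substring_len(answer_segs, prediction_segs)
    if lcs_len == 0:
        return 0.0
    precision = lcs_len / len(prediction_segs)
    recall = lcs_len / len(answer_segs)
    return 2 * precision * recall / (precision + recall)


# Shared run is ['b', 'c']: precision 2/3, recall 2/4.
print(lcs_f1(list("abcd"), list("bcx")))
```

With several reference answers, the script takes the maximum of these per-answer scores.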
--- a/ERNIE/utils/fp16.py
+++ b/ERNIE/utils/fp16.py
--- a/ERNIE/utils/init.py
+++ b/ERNIE/utils/init.py