Unverified commit a948f14d authored by oyjxer, committed by GitHub

add ernie-m code (#770)

Co-authored-by: pangchao04 <pangchao04@baidu.com>
Parent 2641a12a
English | [简体中文](./README_zh.md)
## ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora
- [Framework](#framework)
- [Pre-trained Models](#Pre-trained-Models)
- [Fine-tuning Tasks](#Fine-tuning-Tasks)
* [Cross-lingual Natural Language Inference](#Cross-lingual-Natural-Language-Inference)
* [Named Entity Recognition](#Named-Entity-Recognition)
* [Cross-lingual Question Answering](#Cross-lingual-Question-Answering)
* [Cross-lingual Paraphrase Identification](#Cross-lingual-Paraphrase-Identification)
* [Cross-lingual Sentence Retrieval](#Cross-lingual-Sentence-Retrieval)
- [Usage](#Usage)
    * [Install PaddlePaddle](#Install-PaddlePaddle)
* [Fine-tuning](#Fine-tuning)
- [Citation](#Citation)
For a technical description of the algorithm, please see our paper:
>[_**ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora**_](https://arxiv.org/pdf/2012.15674.pdf)
>
>Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
>
>Preprint December 2020
>
>Accepted by **EMNLP-2021**
![ERNIE-M](https://img.shields.io/badge/Pretraining-Multilingual%20Language%20-green) ![paper](https://img.shields.io/badge/Paper-EMNLP2021-yellow)
---
**ERNIE-M is a multilingual language model**. We propose a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate **back-translation** into the pre-training process. We **generate pseudo-parallel sentence pairs on a monolingual corpus** to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.
## Framework
We propose two novel methods to align the representations of multiple languages:
- **Cross-Attention Masked Language Modeling (CAMLM)**: in CAMLM, the model learns multilingual semantic representations by restoring the MASK tokens in the input sentences, relying only on the sentence in the other language rather than on the masked sentence's own context.
- **Back-Translation Masked Language Modeling (BTMLM)**: we use BTMLM to train our model to generate pseudo-parallel sentences from monolingual sentences. The generated pairs are then used as model input to further align cross-lingual semantics, thus enhancing the multilingual representation. A toy illustration of the CAMLM attention constraint follows the figure below.
![framework](.meta/framework.png)
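One simple way to picture the CAMLM constraint is as an attention mask: positions in the masked sentence may attend only to the sentence in the other language, so a masked token must be restored from cross-lingual context alone. Below is a minimal NumPy sketch of such a mask (illustrative only, not the released training code; the sentence lengths are made up):

```python
import numpy as np

# toy token counts for a parallel pair <source, target>
src_len, tgt_len = 5, 6
total = src_len + tgt_len

# attn_mask[i, j] == 1 means position i may attend to position j
attn_mask = np.ones((total, total), dtype="float32")

# CAMLM-style restriction: source positions cannot see the source
# sentence itself, so masked source tokens must be restored from
# the target sentence only
attn_mask[:src_len, :src_len] = 0.0

# each source position now attends to exactly tgt_len positions
print(attn_mask[:src_len].sum(axis=1))
```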
## Pre-trained Models
We release the checkpoints for the **ERNIE-M _base_** and **ERNIE-M _large_** models; a short download sketch follows the links below.
- [**ERNIE-M _base_**](http://bj.bcebos.com/wenxin-models/model-ernie-m-base.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-M _large_**](http://bj.bcebos.com/wenxin-models/model-ernie-m-large.tar.gz) (_24-layer, 1024-hidden, 16-heads_)
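For instance, the base checkpoint can be downloaded and unpacked with a few lines of Python (a minimal sketch; the `./pretrained` target directory is just an example):

```python
import tarfile
import urllib.request

URL = "http://bj.bcebos.com/wenxin-models/model-ernie-m-base.tar.gz"

# fetch the archive and extract it next to the fine-tuning scripts
urllib.request.urlretrieve(URL, "model-ernie-m-base.tar.gz")
with tarfile.open("model-ernie-m-base.tar.gz", "r:gz") as tar:
    tar.extractall("./pretrained")
```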
## Fine-tuning Tasks
We compare the performance of [ERNIE-M](https://arxiv.org/pdf/2012.15674.pdf) with existing SOTA pre-trained models (such as [XLM](https://arxiv.org/pdf/1901.07291.pdf), [Unicoder](https://arxiv.org/pdf/1909.00964.pdf), [XLM-R](https://arxiv.org/pdf/1911.02116.pdf), [INFOXLM](https://arxiv.org/pdf/2007.07834.pdf), [VECO](https://arxiv.org/pdf/2010.16046.pdf) and [mBERT](https://arxiv.org/pdf/1810.04805.pdf)) on cross-lingual downstream tasks, including natural language inference (**_XNLI_**), named entity recognition (**_CoNLL_**), question answering (**_MLQA_**), paraphrase identification (**_PAWS-X_**) and sentence retrieval (**_Tatoeba_**).
### Cross-lingual Natural Language Inference
- [XNLI](https://arxiv.org/pdf/1809.05053.pdf)
| Model | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cross-lingual Transfer | | | | | | | | | | | | | | | | |
| XLM | 85.0 | 78.7 | 78.9 | 77.8 | 76.6 | 77.4 | 75.3 | 72.5 | 73.1 | 76.1 | 73.2 | 76.5 | 69.6 | 68.4 | 67.3 | 75.1 |
| Unicoder | 85.1 | 79.0 | 79.4 | 77.8 | 77.2 | 77.2 | 76.3 | 72.8 | 73.5 | 76.4 | 73.6 | 76.2 | 69.4 | 69.7 | 66.7 | 75.4 |
| XLM-R | 85.8 | 79.7 | 80.7 | 78.7 | 77.5 | 79.6 | 78.1 | 74.2 | 73.8 | 76.5 | 74.6 | 76.7 | 72.4 | 66.5 | 68.3 | 76.2 |
| INFOXLM | **86.4** | **80.6** | 80.8 | 78.9 | 77.8 | 78.9 | 77.6 | 75.6 | 74.0 | 77.0 | 73.7 | 76.7 | 72.0 | 66.4 | 67.1 | 76.2 |
| **ERNIE-M** | 85.5 | 80.1 | **81.2** | **79.2** | **79.1** | **80.4** | **78.1** | **76.8** | **76.3** | **78.3** | **75.8** | **77.4** | **72.9** | **69.5** | **68.8** | **77.3** |
| XLM-R Large | 89.1 | 84.1 | 85.1 | 83.9 | 82.9 | 84.0 | 81.2 | 79.6 | 79.8 | 80.8 | 78.1 | 80.2 | 76.9 | 73.9 | 73.8 | 80.9 |
| INFOXLM Large | **89.7** | 84.5 | 85.5 | 84.1 | 83.4 | 84.2 | 81.3 | 80.9 | 80.4 | 80.8 | 78.9 | 80.9 | 77.9 | 74.8 | 73.7 | 81.4 |
| VECO Large | 88.2 | 79.2 | 83.1 | 82.9 | 81.2 | 84.2 | 82.8 | 76.2 | 80.3 | 74.3 | 77.0 | 78.4 | 71.3 | **80.4** | **79.1** | 79.9 |
| **ERNIE-M Large** | 89.3 | **85.1** | **85.7** | **84.4** | **83.7** | **84.5** | 82.0 | **81.2** | **81.2** | **81.9** | **79.2** | **81.0** | **78.6** | 76.2 | 75.4 | **82.0** |
| Translate-Train-All | | | | | | | | | | | | | | | | |
| XLM | 85.0 | 80.8 | 81.3 | 80.3 | 79.1 | 80.9 | 78.3 | 75.6 | 77.6 | 78.5 | 76.0 | 79.5 | 72.9 | 72.8 | 68.5 | 77.8 |
| Unicoder | 85.6 | 81.1 | 82.3 | 80.9 | 79.5 | 81.4 | 79.7 | 76.8 | 78.2 | 77.9 | 77.1 | 80.5 | 73.4 | 73.8 | 69.6 | 78.5 |
| XLM-R | 85.4 | 81.4 | 82.2 | 80.3 | 80.4 | 81.3 | 79.7 | 78.6 | 77.3 | 79.7 | 77.9 | 80.2 | 76.1 | 73.1 | 73.0 | 79.1 |
| INFOXLM | 86.1 | 82.0 | 82.8 | 81.8 | 80.9 | 82.0 | 80.2 | 79.0 | 78.8 | 80.5 | 78.3 | 80.5 | 77.4 | 73.0 | 71.6 | 79.7 |
| **ERNIE-M** | **86.2** | **82.5** | **83.8** | **82.6** | **82.4** | **83.4** | **80.2** | **80.6** | **80.5** | **81.1** | **79.2** | **80.5** | **77.7** | **75.0** | **73.3** | **80.6** |
| XLM-R Large | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | **83.7** | **81.6** | 78.0 | 78.1 | 83.6 |
| VECO Large | 88.9 | 82.4 | 86.0 | 84.7 | 85.3 | 86.2 | **85.8** | 80.1 | 83.0 | 77.2 | 80.9 | 82.8 | 75.3 | **83.1** | **83.0** | 83.0 |
| **ERNIE-M Large** | **89.5** | **86.5** | **86.9** | **86.1** | **86.0** | **86.8** | 84.1 | **83.8** | **84.1** | **84.5** | **82.1** | 83.5 | 81.1 | 79.4 | 77.9 | **84.2** |
### Named Entity Recognition
- [CoNLL](https://arxiv.org/pdf/cs/0306050.pdf)
| Model | en | nl | es | de | Avg |
| --- | --- | --- | --- | --- | --- |
| Fine-tune on the English dataset | | | | | |
| mBERT | 91.97 | 77.57 | 74.96 | 69.56 | 78.52 |
| XLM-R | 92.25 | **78.08** | 76.53 | **69.60** | 79.11 |
| **ERNIE-M** | **92.78** | 78.01 | **79.37** | 68.08 | **79.56** |
| XLM-R Large | 92.92 | 80.80 | 78.64 | 71.40 | 80.94 |
| **ERNIE-M Large** | **93.28** | **81.45** | **78.83** | **72.99** | **81.64** |
| Fine-tune on all datasets | | | | | |
| XLM-R | 91.08 | 89.09 | 87.28 | 83.17 | 87.66 |
| **ERNIE-M** | **93.04** | **91.73** | **88.33** | **84.20** | **89.32** |
| XLM-R Large | 92.00 | 91.60 | **89.52** | 84.60 | 89.43 |
| **ERNIE-M Large** | **94.01** | **93.81** | 89.23 | **86.20** | **90.81** |
### Cross-lingual Question Answering
- [MLQA](https://arxiv.org/pdf/1910.07475.pdf)
| Model | en | es | de | ar | hi | vi | zh | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 77.7/65.2 | 64.3/46.6 | 57.9/44.3 | 45.7/29.8 | 43.8/29.7 | 57.1/38.6 | 57.5/37.3 | 57.7/41.6 |
| XLM | 74.9/62.4 | 68.0/49.8 | 62.2/47.6 | 54.8/36.3 | 48.8/27.3 | 61.4/41.8 | 61.1/39.6 | 61.6/43.5 |
| XLM-R | 77.1/64.6 | 67.4/49.6 | 60.9/46.7 | 54.9/36.6 | 59.4/42.9 | 64.5/44.7 | 61.8/39.3 | 63.7/46.3 |
| INFOXLM | 81.3/68.2 | 69.9/51.9 | 64.2/49.6 | 60.1/40.9 | 65.0/47.5 | 70.0/48.6 | 64.7/**41.2** | 67.9/49.7 |
| ERNIE-M | **81.6**/**68.5** | **70.9**/**52.6** | **65.8**/**50.7** | **61.8**/**41.9** | **65.4**/**47.5** | **70.0**/**49.2** | **65.6**/41.0 | **68.7**/**50.2** |
| XLM-R Large | 80.6/67.8 | 74.1/56.0 | 68.5/53.6 | 63.1/43.5 | 62.9/51.6 | 71.3/50.9 | 68.0/45.4 | 70.7/52.7 |
| INFOXLM Large | **84.5**/**71.6** | **75.1**/**57.3** | **71.2**/**56.2** | **67.6**/**47.6** | 72.5/54.2 | **75.2**/**54.1** | 69.2/45.4 | 73.6/55.2 |
| ERNIE-M Large | 84.4/71.5 | 74.8/56.6 | 70.8/55.9 | 67.4/47.2 | **72.6**/**54.7** | 75.0/53.7 | **71.1**/**47.5** | **73.7**/**55.3** |
### Cross-lingual Paraphrase Identification
- [PAWS-X](https://arxiv.org/pdf/1908.11828.pdf)
| Model | en | de | es | fr | ja | ko | zh | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cross-lingual Transfer | | | | | | | | |
| mBERT | 94.0 | 85.7 | 87.4 | 87.0 | 73.0 | 69.6 | 77.0 | 81.9 |
| XLM | 94.0 | 85.9 | 88.3 | 87.4 | 69.3 | 64.8 | 76.5 | 80.9 |
| MMTE | 93.1 | 85.1 | 87.2 | 86.9 | 72.0 | 69.2 | 75.9 | 81.3 |
| XLM-R Large | 94.7 | 89.7 | 90.1 | 90.4 | 78.7 | 79.0 | 82.3 | 86.4 |
| VECO Large | **96.2** | 91.3 | 91.4 | 92.0 | 81.8 | 82.9 | 85.1 | 88.7 |
| ERNIE-M Large | 96.0 | **91.9** | **91.4** | **92.2** | **83.9** | **84.5** | **86.9** | **89.5** |
| Translate-Train-All | | | | | | | | |
| VECO Large | 96.4 | 93.0 | 93.0 | 93.5 | 87.2 | 86.8 | 87.9 | 91.1 |
| ERNIE-M Large | **96.5** | **93.5** | **93.3** | **93.8** | **87.9** | **88.4** | **89.2** | **91.8** |
### Cross-lingual Sentence Retrieval
- [Tatoeba](https://arxiv.org/pdf/2003.11080.pdf)
| Model | Avg |
| --- | --- |
| XLM-R Large | 75.2 |
| VECO Large | 86.9 |
| ERNIE-M Large | **87.9** |
| ERNIE-M Large* | 93.3 |
\* indicates the results after fine-tuning.
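Tatoeba-style retrieval is scored by nearest-neighbor search over sentence embeddings. The sketch below shows that scoring step, assuming two row-aligned `cls_emb.npy` files (one per language) such as those produced by the embedding-extraction script in this repo; the directory names are illustrative:

```python
import numpy as np

src = np.load("embeddings_src/cls_emb.npy")  # shape: (n, hidden)
tgt = np.load("embeddings_tgt/cls_emb.npy")  # shape: (n, hidden), row-aligned with src

# L2-normalize so the dot product becomes cosine similarity
src = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)

nearest = np.argmax(src @ tgt.T, axis=1)  # best target match per source sentence
accuracy = float((nearest == np.arange(len(src))).mean())
print("top-1 retrieval accuracy: %.4f" % accuracy)
```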
## Usage
### Install PaddlePaddle
This codebase has been tested with PaddlePaddle (version >= 2.0) under Python 3. The other dependencies of ERNIE-M are listed in `requirements.txt`; you can install them with
```shell
pip install -r requirements.txt
```
### Fine-tuning
We release the fine-tuning code for natural language inference, named entity recognition, question answering and paraphrase identification. For example, you can fine-tune the ERNIE-M large model on the XNLI dataset with
```shell
sh scripts/large/xnli_cross_lingual_transfer.sh # Cross-lingual Transfer
sh scripts/large/xnli_translate-train_all.sh # Translate-Train-All
```
The training log and evaluation results are written to `log/job.log.0`.
**Notice**: the actual total batch size equals `configured batch size * number of GPUs used`; for example, a configured batch size of 16 on 8 GPUs gives a total batch size of 128.
## Citation
You can cite our paper as follows:
```
@inproceedings{ouyang-etal-2021-ernie,
title = "{ERNIE}-{M}: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora",
author = "Ouyang, Xuan and
Wang, Shuohuan and
Pang, Chao and
Sun, Yu and
Tian, Hao and
Wu, Hua and
Wang, Haifeng",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.3",
pages = "27--38",
abstract = "Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose Ernie-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that Ernie-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks. The codes and pre-trained models will be made publicly available.",
}
```
[English](./README.md) | 简体中文
## ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora
- [Framework](#Framework)
- [Pre-trained Models](#Pre-trained-Models)
- [Downstream Tasks](#Downstream-Tasks)
  * [Natural Language Inference](#Natural-Language-Inference)
  * [Named Entity Recognition](#Named-Entity-Recognition)
  * [Reading Comprehension](#Reading-Comprehension)
  * [Semantic Similarity](#Semantic-Similarity)
  * [Cross-lingual Retrieval](#Cross-lingual-Retrieval)
- [Usage](#Usage)
  * [Install PaddlePaddle](#Install-PaddlePaddle)
  * [Fine-tuning](#Fine-tuning)
- [Citation](#Citation)
For a detailed description of the algorithm, please see our paper:
>[_**ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora**_](https://arxiv.org/pdf/2012.15674.pdf)
>
>Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
>
>Preprint December 2020
>
>Accepted by **EMNLP-2021**
![ERNIE-M](https://img.shields.io/badge/预训练-多语言-green) ![paper](https://img.shields.io/badge/论文-EMNLP2021-yellow)
---
**ERNIE-M is a pre-training and fine-tuning framework for multilingual modeling.** To overcome the limit that the size of bilingual corpora places on multilingual models and to improve cross-lingual understanding, we propose ERNIE-M, a pre-trained model that learns semantic alignments between languages from monolingual corpora through back-translation. It delivers significant gains on five typical cross-lingual understanding tasks: cross-lingual natural language inference, semantic retrieval, semantic similarity, named entity recognition and reading comprehension.
## Framework
We propose two methods to model the alignment between languages:
- **Cross-Attention Masked Language Modeling (CAMLM)**: CAMLM captures cross-lingual alignment on a small amount of parallel data. The model has to restore the masked tokens through the sentence in the other language, without using the masked sentence's own context, which gives it an initial model of the alignment between languages.
- **Back-Translation Masked Language Modeling (BTMLM)**: BTMLM learns cross-lingual alignment from monolingual corpora via back-translation. CAMLM first generates pseudo-parallel sentences, which the model then learns from, so monolingual corpora can be exploited to better model semantic alignment.
![framework](.meta/framework.png)
## Pre-trained Models
We release the **ERNIE-M _base_** and **ERNIE-M _large_** multilingual models.
- [**ERNIE-M _base_**](http://bj.bcebos.com/wenxin-models/model-ernie-m-base.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-M _large_**](http://bj.bcebos.com/wenxin-models/model-ernie-m-large.tar.gz) (_24-layer, 1024-hidden, 16-heads_)
## Downstream Tasks
We validate the model on widely used datasets for natural language inference, named entity recognition, reading comprehension, semantic similarity and cross-lingual retrieval, and compare it with the current state-of-the-art models ([XLM](https://arxiv.org/pdf/1901.07291.pdf), [Unicoder](https://arxiv.org/pdf/1909.00964.pdf), [XLM-R](https://arxiv.org/pdf/1911.02116.pdf), [INFOXLM](https://arxiv.org/pdf/2007.07834.pdf), [VECO](https://arxiv.org/pdf/2010.16046.pdf), [mBERT](https://arxiv.org/pdf/1810.04805.pdf), etc.).
### Natural Language Inference
- [XNLI](https://arxiv.org/pdf/1809.05053.pdf)
| Model | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cross-lingual Transfer | | | | | | | | | | | | | | | | |
| XLM | 85.0 | 78.7 | 78.9 | 77.8 | 76.6 | 77.4 | 75.3 | 72.5 | 73.1 | 76.1 | 73.2 | 76.5 | 69.6 | 68.4 | 67.3 | 75.1 |
| Unicoder | 85.1 | 79.0 | 79.4 | 77.8 | 77.2 | 77.2 | 76.3 | 72.8 | 73.5 | 76.4 | 73.6 | 76.2 | 69.4 | 69.7 | 66.7 | 75.4 |
| XLM-R | 85.8 | 79.7 | 80.7 | 78.7 | 77.5 | 79.6 | 78.1 | 74.2 | 73.8 | 76.5 | 74.6 | 76.7 | 72.4 | 66.5 | 68.3 | 76.2 |
| INFOXLM | **86.4** | **80.6** | 80.8 | 78.9 | 77.8 | 78.9 | 77.6 | 75.6 | 74.0 | 77.0 | 73.7 | 76.7 | 72.0 | 66.4 | 67.1 | 76.2 |
| **ERNIE-M** | 85.5 | 80.1 | **81.2** | **79.2** | **79.1** | **80.4** | **78.1** | **76.8** | **76.3** | **78.3** | **75.8** | **77.4** | **72.9** | **69.5** | **68.8** | **77.3** |
| XLM-R Large | 89.1 | 84.1 | 85.1 | 83.9 | 82.9 | 84.0 | 81.2 | 79.6 | 79.8 | 80.8 | 78.1 | 80.2 | 76.9 | 73.9 | 73.8 | 80.9 |
| INFOXLM Large | **89.7** | 84.5 | 85.5 | 84.1 | 83.4 | 84.2 | 81.3 | 80.9 | 80.4 | 80.8 | 78.9 | 80.9 | 77.9 | 74.8 | 73.7 | 81.4 |
| VECO Large | 88.2 | 79.2 | 83.1 | 82.9 | 81.2 | 84.2 | 82.8 | 76.2 | 80.3 | 74.3 | 77.0 | 78.4 | 71.3 | **80.4** | **79.1** | 79.9 |
| **ERNIE-M Large** | 89.3 | **85.1** | **85.7** | **84.4** | **83.7** | **84.5** | 82.0 | **81.2** | **81.2** | **81.9** | **79.2** | **81.0** | **78.6** | 76.2 | 75.4 | **82.0** |
| Translate-Train-All | | | | | | | | | | | | | | | | |
| XLM | 85.0 | 80.8 | 81.3 | 80.3 | 79.1 | 80.9 | 78.3 | 75.6 | 77.6 | 78.5 | 76.0 | 79.5 | 72.9 | 72.8 | 68.5 | 77.8 |
| Unicoder | 85.6 | 81.1 | 82.3 | 80.9 | 79.5 | 81.4 | 79.7 | 76.8 | 78.2 | 77.9 | 77.1 | 80.5 | 73.4 | 73.8 | 69.6 | 78.5 |
| XLM-R | 85.4 | 81.4 | 82.2 | 80.3 | 80.4 | 81.3 | 79.7 | 78.6 | 77.3 | 79.7 | 77.9 | 80.2 | 76.1 | 73.1 | 73.0 | 79.1 |
| INFOXLM | 86.1 | 82.0 | 82.8 | 81.8 | 80.9 | 82.0 | 80.2 | 79.0 | 78.8 | 80.5 | 78.3 | 80.5 | 77.4 | 73.0 | 71.6 | 79.7 |
| **ERNIE-M** | **86.2** | **82.5** | **83.8** | **82.6** | **82.4** | **83.4** | **80.2** | **80.6** | **80.5** | **81.1** | **79.2** | **80.5** | **77.7** | **75.0** | **73.3** | **80.6** |
| XLM-R Large | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | **83.7** | **81.6** | 78.0 | 78.1 | 83.6 |
| VECO Large | 88.9 | 82.4 | 86.0 | 84.7 | 85.3 | 86.2 | **85.8** | 80.1 | 83.0 | 77.2 | 80.9 | 82.8 | 75.3 | **83.1** | **83.0** | 83.0 |
| **ERNIE-M Large** | **89.5** | **86.5** | **86.9** | **86.1** | **86.0** | **86.8** | 84.1 | **83.8** | **84.1** | **84.5** | **82.1** | 83.5 | 81.1 | 79.4 | 77.9 | **84.2** |
### Named Entity Recognition
- [CoNLL](https://arxiv.org/pdf/cs/0306050.pdf)
| Model | en | nl | es | de | Avg |
| --- | --- | --- | --- | --- | --- |
| Fine-tune on the English dataset | | | | | |
| mBERT | 91.97 | 77.57 | 74.96 | 69.56 | 78.52 |
| XLM-R | 92.25 | **78.08** | 76.53 | **69.60** | 79.11 |
| **ERNIE-M** | **92.78** | 78.01 | **79.37** | 68.08 | **79.56** |
| XLM-R Large | 92.92 | 80.80 | 78.64 | 71.40 | 80.94 |
| **ERNIE-M Large** | **93.28** | **81.45** | **78.83** | **72.99** | **81.64** |
| Fine-tune on all datasets | | | | | |
| XLM-R | 91.08 | 89.09 | 87.28 | 83.17 | 87.66 |
| **ERNIE-M** | **93.04** | **91.73** | **88.33** | **84.20** | **89.32** |
| XLM-R Large | 92.00 | 91.60 | **89.52** | 84.60 | 89.43 |
| **ERNIE-M Large** | **94.01** | **93.81** | 89.23 | **86.20** | **90.81** |
### Reading Comprehension
- [MLQA](https://arxiv.org/pdf/1910.07475.pdf)
| Model | en | es | de | ar | hi | vi | zh | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 77.7/65.2 | 64.3/46.6 | 57.9/44.3 | 45.7/29.8 | 43.8/29.7 | 57.1/38.6 | 57.5/37.3 | 57.7/41.6 |
| XLM | 74.9/62.4 | 68.0/49.8 | 62.2/47.6 | 54.8/36.3 | 48.8/27.3 | 61.4/41.8 | 61.1/39.6 | 61.6/43.5 |
| XLM-R | 77.1/64.6 | 67.4/49.6 | 60.9/46.7 | 54.9/36.6 | 59.4/42.9 | 64.5/44.7 | 61.8/39.3 | 63.7/46.3 |
| INFOXLM | 81.3/68.2 | 69.9/51.9 | 64.2/49.6 | 60.1/40.9 | 65.0/47.5 | 70.0/48.6 | 64.7/**41.2** | 67.9/49.7 |
| ERNIE-M | **81.6**/**68.5** | **70.9**/**52.6** | **65.8**/**50.7** | **61.8**/**41.9** | **65.4**/**47.5** | **70.0**/**49.2** | **65.6**/41.0 | **68.7**/**50.2** |
| XLM-R Large | 80.6/67.8 | 74.1/56.0 | 68.5/53.6 | 63.1/43.5 | 62.9/51.6 | 71.3/50.9 | 68.0/45.4 | 70.7/52.7 |
| INFOXLM Large | **84.5**/**71.6** | **75.1**/**57.3** | **71.2**/**56.2** | **67.6**/**47.6** | 72.5/54.2 | **75.2**/**54.1** | 69.2/45.4 | 73.6/55.2 |
| ERNIE-M Large | 84.4/71.5 | 74.8/56.6 | 70.8/55.9 | 67.4/47.2 | **72.6**/**54.7** | 75.0/53.7 | **71.1**/**47.5** | **73.7**/**55.3** |
### Semantic Similarity
- [PAWS-X](https://arxiv.org/pdf/1908.11828.pdf)
| Model | en | de | es | fr | ja | ko | zh | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cross-lingual Transfer | | | | | | | | |
| mBERT | 94.0 | 85.7 | 87.4 | 87.0 | 73.0 | 69.6 | 77.0 | 81.9 |
| XLM | 94.0 | 85.9 | 88.3 | 87.4 | 69.3 | 64.8 | 76.5 | 80.9 |
| MMTE | 93.1 | 85.1 | 87.2 | 86.9 | 72.0 | 69.2 | 75.9 | 81.3 |
| XLM-R Large | 94.7 | 89.7 | 90.1 | 90.4 | 78.7 | 79.0 | 82.3 | 86.4 |
| VECO Large | **96.2** | 91.3 | 91.4 | 92.0 | 81.8 | 82.9 | 85.1 | 88.7 |
| ERNIE-M Large | 96.0 | **91.9** | **91.4** | **92.2** | **83.9** | **84.5** | **86.9** | **89.5** |
| Translate-Train-All | | | | | | | | |
| VECO Large | 96.4 | 93.0 | 93.0 | 93.5 | 87.2 | 86.8 | 87.9 | 91.1 |
| ERNIE-M Large | **96.5** | **93.5** | **93.3** | **93.8** | **87.9** | **88.4** | **89.2** | **91.8** |
### Cross-lingual Retrieval
- [Tatoeba](https://arxiv.org/pdf/2003.11080.pdf)
| Model | Avg |
| --- | --- |
| XLM-R Large | 75.2 |
| VECO Large | 86.9 |
| ERNIE-M Large | **87.9** |
| ERNIE-M Large* | 93.3 |
\* indicates results after fine-tuning.
## Usage
### Install PaddlePaddle
Our code is based on PaddlePaddle (version >= 2.0), and Python 3 is recommended. The other dependencies of ERNIE-M are listed in `requirements.txt` and can be installed with:
```shell
pip install -r requirements.txt
```
### Fine-tuning
We release the fine-tuning code for natural language inference, named entity recognition and reading comprehension. Run the following scripts to launch the experiments:
```shell
sh scripts/large/xnli_cross_lingual_transfer.sh # Cross-lingual Transfer
sh scripts/large/xnli_translate-train_all.sh # Translate-Train-All
```
The fine-tuning hyper-parameters can be adjusted in the scripts above; the training and evaluation logs are written to `log/job.log.0`.
**Notice**: the actual batch size during training equals `configured batch size * number of GPUs`.
## Citation
You can cite our paper as follows:
```
@inproceedings{ouyang-etal-2021-ernie,
title = "{ERNIE}-{M}: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora",
author = "Ouyang, Xuan and
Wang, Shuohuan and
Pang, Chao and
Sun, Yu and
Tian, Hao and
Wu, Hua and
Wang, Haifeng",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.3",
pages = "27--38",
abstract = "Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose Ernie-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that Ernie-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks. The codes and pre-trained models will be made publicly available.",
}
```
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""extract embeddings from ERNIE encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import argparse
import numpy as np
import multiprocessing
import logging
import paddle
import paddle.fluid as fluid
import reader.task_reader as task_reader
from model.ernie import ErnieConfig, ErnieModel
from utils.args import ArgumentGroup, print_arguments, prepare_logger
from utils.init import init_pretraining_params
paddle.enable_static()
log = logging.getLogger()
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("output_dir", str, "embeddings", "path to save embeddings extracted by ernie_encoder.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("data_set", str, None, "Path to data for calculating ernie_embeddings.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("piece_model_path", str, None, "Piece model path")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
# yapf: enable
def create_model(args, pyreader_name, ernie_config):
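    """Build the input placeholders and DataLoader for embedding extraction
    and return graph vars exposing the pooled [CLS] feature and the unpadded
    top-layer token embeddings."""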
src_ids = fluid.layers.data(name='src_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
pos_ids = fluid.layers.data(name='pos_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
input_mask = fluid.layers.data(name='input_mask', shape=[-1, args.max_seq_len, 1], dtype='float32')
seq_lens = fluid.layers.data(name='seq_lens', shape=[-1], dtype='int64')
pyreader = fluid.io.DataLoader.from_generator(feed_list=[src_ids, pos_ids, input_mask, seq_lens],
capacity=70,
iterable=False)
ernie = ErnieModel(
src_ids=src_ids,
position_ids=pos_ids,
input_mask=input_mask,
config=ernie_config)
enc_out = ernie.get_sequence_output()
unpad_enc_out = fluid.layers.sequence_unpad(enc_out, length=seq_lens)
cls_feats = ernie.get_pooled_output()
    # set persistable = True so these vars survive memory-optimization passes
enc_out.persistable = True
unpad_enc_out.persistable = True
cls_feats.persistable = True
graph_vars = {
"cls_embeddings": cls_feats,
"top_layer_embeddings": unpad_enc_out,
}
return pyreader, graph_vars
def main(args):
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
reader = task_reader.ExtractEmbeddingReader(
vocab_path=args.vocab_path,
piece_model_path=args.piece_model_path,
max_seq_len=args.max_seq_len)
startup_prog = fluid.Program()
data_generator = reader.data_generator(
input_file=args.data_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False)
total_examples = reader.get_num_examples(args.data_set)
print("Device count: %d" % dev_count)
print("Total num examples: %d" % total_examples)
infer_program = fluid.Program()
with fluid.program_guard(infer_program, startup_prog):
with fluid.unique_name.guard():
pyreader, graph_vars = create_model(
args, pyreader_name='reader', ernie_config=ernie_config)
infer_program = infer_program.clone(for_test=True)
exe.run(startup_prog)
if args.init_pretraining_params:
init_pretraining_params(
exe, args.init_pretraining_params, main_program=startup_prog)
else:
        raise ValueError(
            "args 'init_pretraining_params' must be specified")
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = dev_count
pyreader.set_batch_generator(data_generator)
pyreader.start()
total_cls_emb = []
total_top_layer_emb = []
total_labels = []
while True:
try:
cls_emb, unpad_top_layer_emb = exe.run(
program=infer_program,
fetch_list=[
graph_vars["cls_embeddings"].name,
graph_vars["top_layer_embeddings"].name
],
return_numpy=False)
# batch_size * embedding_size
total_cls_emb.append(np.array(cls_emb))
total_top_layer_emb.append(np.array(unpad_top_layer_emb))
except fluid.core.EOFException:
break
total_cls_emb = np.concatenate(total_cls_emb)
total_top_layer_emb = np.concatenate(total_top_layer_emb)
if not os.path.exists(args.output_dir):
os.makedirs(args.output_dir)
else:
raise RuntimeError('output dir exists: %s' % args.output_dir)
with open(os.path.join(args.output_dir, "cls_emb.npy"),
"wb") as cls_emb_file:
np.save(cls_emb_file, total_cls_emb)
with open(os.path.join(args.output_dir, "top_layer_emb.npy"),
"wb") as top_layer_emb_file:
np.save(top_layer_emb_file, total_top_layer_emb)
if __name__ == '__main__':
prepare_logger(log)
args = parser.parse_args()
print_arguments(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import time
import json
import numpy as np
import logging
import paddle.fluid as fluid
from model.ernie import ErnieModel
log = logging.getLogger(__name__)
def create_model(args, ernie_config, is_prediction=False):
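    """Build the classification graph: pooled [CLS] feature -> dropout -> FC,
    trained with softmax cross-entropy. Returns prediction probabilities in
    prediction mode, otherwise the training graph vars (loss, accuracy, etc.)."""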
src_ids = fluid.layers.data(name="src_ids", shape=[-1, args.max_seq_len, 1], dtype="int64")
pos_ids = fluid.layers.data(name="pos_ids", shape=[-1, args.max_seq_len, 1], dtype="int64")
input_mask = fluid.layers.data(name="input_mask", shape=[-1, args.max_seq_len, 1], dtype="float32")
labels = fluid.layers.data(name="labels", shape=[-1, 1], dtype="int64")
qids = fluid.layers.data(name="qids", shape=[-1, 1], dtype="int64")
pyreader = fluid.io.DataLoader.from_generator(
feed_list=[src_ids, pos_ids, input_mask, labels, qids],
capacity=70,
iterable=False)
ernie = ErnieModel(
src_ids=src_ids,
position_ids=pos_ids,
input_mask=input_mask,
config=ernie_config,
use_fp16=args.use_fp16)
cls_feats = ernie.get_pooled_output()
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=cls_feats,
size=args.num_labels,
param_attr=fluid.ParamAttr(
name="cls_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
if is_prediction:
probs = fluid.layers.softmax(logits)
feed_targets_name = [
src_ids.name, pos_ids.name, input_mask.name
]
return pyreader, probs, feed_targets_name
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
num_seqs = fluid.layers.create_tensor(dtype='int64')
accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
graph_vars = {
"loss": loss,
"probs": probs,
"accuracy": accuracy,
"labels": labels,
"num_seqs": num_seqs,
"qids": qids
}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
def evaluate(exe,
test_program,
test_pyreader,
graph_vars,
eval_phase,
lang=None):
train_fetch_list = [
graph_vars["loss"].name,
graph_vars["accuracy"].name,
graph_vars["num_seqs"].name
]
if eval_phase == "train":
if "learning_rate" in graph_vars:
train_fetch_list.append(graph_vars["learning_rate"].name)
outputs = exe.run(fetch_list=train_fetch_list)
ret = {"loss": np.mean(outputs[0]), "accuracy": np.mean(outputs[1])}
if "learning_rate" in graph_vars:
ret["learning_rate"] = float(outputs[3][0])
return ret
fetch_list = [
graph_vars["loss"].name,
graph_vars["accuracy"].name,
graph_vars["probs"].name,
graph_vars["labels"].name,
graph_vars["num_seqs"].name,
graph_vars["qids"].name,
]
test_pyreader.start()
time_begin = time.time()
total_cost, total_num_seqs = 0.0, 0.0
qids, labels, preds = [], [], []
while True:
try:
np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run(
program=test_program, fetch_list=fetch_list)
total_cost += np.sum(np_loss * np_num_seqs)
total_num_seqs += np.sum(np_num_seqs)
if np_qids is None:
np_qids = np.array([])
qids.extend(np_qids.reshape(-1).tolist())
labels.extend(np_labels.reshape(-1).tolist())
np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
preds.extend(np_preds.tolist())
except fluid.core.EOFException:
test_pyreader.reset()
break
time_end = time.time()
cost = total_cost / total_num_seqs
elapsed_time = time_end - time_begin
acc = simple_accuracy(preds, labels)
evaluate_info = "[%s evaluation] ave loss: %f, %s acc: %f, data_num: %d, elapsed time: %f s" \
% (eval_phase, cost, lang, acc, total_num_seqs, elapsed_time)
return evaluate_info
def simple_accuracy(preds, labels):
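    """Mean exact-match accuracy between predicted and gold label ids."""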
preds = np.array(preds)
labels = np.array(labels)
return (preds == labels).mean()
def predict(exe,
test_program,
test_pyreader,
graph_vars):
test_pyreader.start()
qids, probs, preds = [], [], []
fetch_list = [
graph_vars["probs"].name,
graph_vars["qids"].name,
]
while True:
try:
np_probs, np_qids = exe.run(
program=test_program, fetch_list=fetch_list)
if np_qids is None:
np_qids = np.array([])
qids.extend(np_qids.reshape(-1).tolist())
np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
preds.extend(np_preds.tolist())
probs.append(np_probs)
except fluid.core.EOFException:
test_pyreader.reset()
break
probs = np.concatenate(probs, axis=0).reshape([len(preds), -1])
return qids, preds, probs
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import six
import time
import math
import json
import logging
import subprocess
import collections
from io import open
import numpy as np
import paddle.fluid as fluid
import reader.tokenization
from model.ernie import ErnieModel
from utils.eval_mlqa import mlqa_eval
log = logging.getLogger(__name__)
def create_model(args, ernie_config):
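    """Build the span-extraction graph: a per-token FC over the encoder output
    produces start/end logits, and the loss is the mean of the start- and
    end-position cross-entropy losses."""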
src_ids = fluid.layers.data(name="src_ids", shape=[-1, args.max_seq_len, 1], dtype="int64")
pos_ids = fluid.layers.data(name="pos_ids", shape=[-1, args.max_seq_len, 1], dtype="int64")
input_mask = fluid.layers.data(name="input_mask", shape=[-1, args.max_seq_len, 1], dtype="float32")
start_positions = fluid.layers.data(name="start_positions", shape=[-1, 1], dtype="int64")
end_positions = fluid.layers.data(name="end_positions", shape=[-1, 1], dtype="int64")
unique_id = fluid.layers.data(name="unique_id", shape=[-1, 1], dtype="int64")
labels = fluid.layers.data(name="labels", shape=[-1, 1], dtype="int64")
pyreader = fluid.io.DataLoader.from_generator(
feed_list=[src_ids, pos_ids, input_mask,
start_positions, end_positions, unique_id, labels],
capacity=70,
iterable=False)
ernie = ErnieModel(
src_ids=src_ids,
position_ids=pos_ids,
input_mask=input_mask,
config=ernie_config,
use_fp16=args.use_fp16)
enc_out = ernie.get_sequence_output()
enc_out = fluid.layers.dropout(
x=enc_out,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=enc_out,
size=2,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name="cls_mrc_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_mrc_out_b", initializer=fluid.initializer.Constant(0.)))
logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])
start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)
batch_ones = fluid.layers.fill_constant_batch_size_like(
input=start_logits, dtype='int64', shape=[1], value=1)
num_seqs = fluid.layers.reduce_sum(input=batch_ones)
def compute_loss(logits, positions):
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=positions)
loss = fluid.layers.mean(x=loss)
return loss
start_loss = compute_loss(start_logits, start_positions)
end_loss = compute_loss(end_logits, end_positions)
loss = (start_loss + end_loss) / 2.0
if args.use_fp16 and args.loss_scaling > 1.0:
loss *= args.loss_scaling
graph_vars = {
"loss": loss,
"num_seqs": num_seqs,
"unique_id": unique_id,
"start_logits": start_logits,
"end_logits": end_logits
}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
def write_result(output_path, eval_phase, trainer_id, results):
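    """Dump this trainer's raw results as json and create an empty `.finish`
    flag file so trainer 0 can tell that this shard is complete."""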
output_file = os.path.join(output_path,
"%s.output.%d" % (eval_phase, trainer_id))
with open(output_file, "w") as fp:
json.dump(results, fp)
finish_file = os.path.join(output_path,
"%s.finish.%d" % (eval_phase, trainer_id))
with open(finish_file, "w") as fp:
pass
def concat_result(output_path, eval_phase, dev_count, RawResult):
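    """Poll until every trainer's `.finish` flag exists, then merge the
    per-trainer json outputs into a single list of RawResult tuples."""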
all_results = []
while True:
ret = subprocess.check_output('find %s -maxdepth 3 -name "%s.finish.*"'
%(output_path, eval_phase), shell=True)
ret = bytes.decode(ret).strip("\n").split("\n")
if len(ret) != dev_count:
time.sleep(1)
continue
try:
for trainer_id in range(dev_count):
output_file = os.path.join(output_path,
"%s.output.%d" % (eval_phase, trainer_id))
with open(output_file, "r") as fp:
results = json.load(fp)
for result in results:
assert len(result) == 3
all_results.append(
RawResult(
unique_id=result[0],
start_logits=result[1],
end_logits=result[2]))
break
        except Exception as e:
            # a trainer may still be flushing its output file; reset the
            # partially merged results and retry on the next loop iteration
            all_results = []
            log.warning("failed to read result files, retrying: %s" % e)
return all_results
def evaluate(exe,
test_program,
test_pyreader,
graph_vars,
eval_phase,
examples=None,
features=None,
args=None,
trainer_id=0,
dev_count=1,
input_file=None,
output_path=None,
tokenizer=None,
version_2_with_negative=False):
if eval_phase == "train":
train_fetch_list = [
graph_vars["loss"].name
]
if "learning_rate" in graph_vars:
train_fetch_list.append(
graph_vars["learning_rate"].name
)
outputs = exe.run(
fetch_list=train_fetch_list
)
ret = {"loss": np.mean(outputs[0])}
if "learning_rate" in graph_vars:
ret["learning_rate"] = float(outputs[1][0])
return ret
if not os.path.exists(output_path):
os.makedirs(output_path)
output_prediction_file = os.path.join(output_path, eval_phase + "_predictions.json")
output_nbest_file = os.path.join(output_path, eval_phase + "_nbest_predictions.json")
output_null_odds_file = os.path.join(output_path, eval_phase + "_null_odds.json")
RawResult = collections.namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
all_results = []
test_pyreader.start()
time_begin = time.time()
fetch_list = [
graph_vars["unique_id"].name, graph_vars["start_logits"].name,
graph_vars["end_logits"].name, graph_vars["num_seqs"].name
]
while True:
try:
np_unique_ids, np_start_logits, np_end_logits, np_num_seqs = exe.run(
program=test_program, fetch_list=fetch_list)
for idx in range(np_unique_ids.shape[0]):
# if len(all_results) % 500 == 0:
# log.info("Processing example: %d" % len(all_results))
unique_id = int(np_unique_ids[idx])
start_logits = [float(x) for x in np_start_logits[idx].flat]
end_logits = [float(x) for x in np_end_logits[idx].flat]
all_results.append(
RawResult(
unique_id=unique_id,
start_logits=start_logits,
end_logits=end_logits))
except fluid.core.EOFException:
test_pyreader.reset()
break
if dev_count > 1:
write_result(output_path,
eval_phase,
trainer_id,
all_results)
if trainer_id == 0:
all_results = concat_result(output_path,
eval_phase,
dev_count,
RawResult)
if trainer_id == 0:
write_predictions(examples,
features,
all_results,
args.n_best_size,
args.max_answer_length,
args.do_lower_case,
output_prediction_file,
output_nbest_file,
output_null_odds_file,
tokenizer,
version_2_with_negative)
with open(input_file) as fp:
dataset_json = json.load(fp)
with open(output_prediction_file) as fp:
predictions = json.load(fp)
dataset = dataset_json['data']
lang = dataset_json['lang']
eval_out = mlqa_eval(dataset, predictions, lang)
time_end = time.time()
elapsed_time = time_end - time_begin
log.info("[%s evaluation] lang %s, em: %f, f1: %f, elapsed time: %f, eval file: %s"
% (eval_phase, lang, eval_out["exact_match"],
eval_out["f1"], elapsed_time, input_file))
def write_predictions(all_examples,
all_features,
all_results,
n_best_size,
max_answer_length,
do_lower_case,
output_prediction_file,
output_nbest_file,
output_null_odds_file,
tokenizer,
version_2_with_negative=True,
null_score_diff_threshold=0.0):
"""Write final predictions to the json file and log-odds of null if needed."""
log.info("Writing predictions to: %s" % (output_prediction_file))
log.info("Writing nbest to: %s" % (output_nbest_file))
example_index_to_features = collections.defaultdict(list)
for feature in all_features:
example_index_to_features[feature.example_index].append(feature)
unique_id_to_result = {}
for result in all_results:
unique_id_to_result[result.unique_id] = result
_PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name
"PrelimPrediction", [
"feature_index", "start_index", "end_index", "start_logit",
"end_logit"
])
all_predictions = collections.OrderedDict()
all_nbest_json = collections.OrderedDict()
scores_diff_json = collections.OrderedDict()
for (example_index, example) in enumerate(all_examples):
features = example_index_to_features[example_index]
prelim_predictions = []
score_null = 1000000 # large and positive
        min_null_feature_index = 0  # the paragraph slice with min null score
null_start_logit = 0 # the start logit at the slice with min null score
null_end_logit = 0 # the end logit at the slice with min null score
for (feature_index, feature) in enumerate(features):
result = unique_id_to_result[feature.unique_id]
start_indexes = _get_best_indexes(result.start_logits, n_best_size)
end_indexes = _get_best_indexes(result.end_logits, n_best_size)
if version_2_with_negative:
feature_null_score = result.start_logits[0] + result.end_logits[
0]
if feature_null_score < score_null:
score_null = feature_null_score
min_null_feature_index = feature_index
null_start_logit = result.start_logits[0]
null_end_logit = result.end_logits[0]
for start_index in start_indexes:
for end_index in end_indexes:
if start_index >= len(feature.tokens):
continue
if end_index >= len(feature.tokens):
continue
if start_index not in feature.token_to_orig_map:
continue
if end_index not in feature.token_to_orig_map:
continue
if not feature.token_is_max_context.get(start_index, False):
continue
if end_index < start_index:
continue
length = end_index - start_index + 1
if length > max_answer_length:
continue
prelim_predictions.append(
_PrelimPrediction(
feature_index=feature_index,
start_index=start_index,
end_index=end_index,
start_logit=result.start_logits[start_index],
end_logit=result.end_logits[end_index]))
if version_2_with_negative:
prelim_predictions.append(
_PrelimPrediction(
feature_index=min_null_feature_index,
start_index=0,
end_index=0,
start_logit=null_start_logit,
end_logit=null_end_logit))
prelim_predictions = sorted(
prelim_predictions,
key=lambda x: (x.start_logit + x.end_logit),
reverse=True)
_NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name
"NbestPrediction", ["text", "start_logit", "end_logit"])
seen_predictions = {}
nbest = []
for pred in prelim_predictions:
if len(nbest) >= n_best_size:
break
feature = features[pred.feature_index]
if pred.start_index > 0: # this is a non-null prediction
tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1
)]
orig_doc_start = feature.token_to_orig_map[pred.start_index]
orig_doc_end = feature.token_to_orig_map[pred.end_index]
orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end +
1)]
deal_tok_tokens = []
for tok_token in tok_tokens:
tok_token = str(tok_token)
tok_token = tok_token.replace("▁", " ", 1)
deal_tok_tokens.append(tok_token)
tok_text = "".join(deal_tok_tokens)
tok_text = tok_text.strip()
final_text = ""
seen_predictions[final_text] = True
else:
final_text = ""
seen_predictions[final_text] = True
nbest.append(
_NbestPrediction(
text=tok_text,
start_logit=pred.start_logit,
end_logit=pred.end_logit))
        # if we didn't include the empty option in the n-best, include it
if version_2_with_negative:
if "" not in seen_predictions:
nbest.append(
_NbestPrediction(
text="",
start_logit=null_start_logit,
end_logit=null_end_logit))
# In very rare edge cases we could have no valid predictions. So we
# just create a nonce prediction in this case to avoid failure.
if not nbest:
nbest.append(
_NbestPrediction(
text="empty", start_logit=0.0, end_logit=0.0))
assert len(nbest) >= 1
total_scores = []
best_non_null_entry = None
for entry in nbest:
total_scores.append(entry.start_logit + entry.end_logit)
if not best_non_null_entry:
if entry.text:
best_non_null_entry = entry
        if best_non_null_entry is None:
            log.warning("no non-null prediction found for qas_id %s" % example.qas_id)
probs = _compute_softmax(total_scores)
nbest_json = []
for (i, entry) in enumerate(nbest):
output = collections.OrderedDict()
output["text"] = entry.text
output["probability"] = probs[i]
output["start_logit"] = entry.start_logit
output["end_logit"] = entry.end_logit
nbest_json.append(output)
assert len(nbest_json) >= 1
if not version_2_with_negative:
all_predictions[example.qas_id] = nbest_json[0]["text"]
else:
try:
# predict "" iff the null score - the score of best non-null > threshold
score_diff = score_null - best_non_null_entry.start_logit - (
best_non_null_entry.end_logit)
scores_diff_json[example.qas_id] = score_diff
if score_diff > null_score_diff_threshold:
all_predictions[example.qas_id] = ""
else:
all_predictions[example.qas_id] = best_non_null_entry.text
            except Exception:
all_predictions[example.qas_id] = ""
scores_diff_json[example.qas_id] = 0
all_nbest_json[example.qas_id] = nbest_json
with open(output_prediction_file, "w") as fp:
json.dump(all_predictions, fp, indent=4)
with open(output_nbest_file, "w") as fp:
json.dump(all_nbest_json, fp, indent=4)
if version_2_with_negative:
with open(output_null_odds_file, "w") as fp:
json.dump(scores_diff_json, fp, indent=4)
def _get_best_indexes(logits, n_best_size):
"""Get the n-best logits from a list."""
index_and_score = sorted(
enumerate(logits), key=lambda x: x[1], reverse=True)
best_indexes = []
for i in range(len(index_and_score)):
if i >= n_best_size:
break
best_indexes.append(index_and_score[i][0])
return best_indexes
def _compute_softmax(scores):
"""Compute softmax probability over raw logits."""
if not scores:
return []
max_score = None
for score in scores:
if max_score is None or score > max_score:
max_score = score
exp_scores = []
total_sum = 0.0
for score in scores:
x = math.exp(score - max_score)
exp_scores.append(x)
total_sum += x
probs = []
for score in exp_scores:
probs.append(score / total_sum)
return probs
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import argparse
import numpy as np
import logging
import paddle.fluid as fluid
from six.moves import xrange
from model.ernie import ErnieModel
log = logging.getLogger(__name__)
def create_model(args, ernie_config, is_prediction=False):
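    """Build the sequence-labeling graph: a per-token FC over the encoder
    output, chunk_eval counters for span-level F1, and a token-level softmax
    cross-entropy loss masked by input_mask."""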
src_ids = fluid.layers.data(name='src_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
pos_ids = fluid.layers.data(name='pos_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
input_mask = fluid.layers.data(name='input_mask', shape=[-1, args.max_seq_len, 1], dtype='float32')
labels = fluid.layers.data(name='labels', shape=[-1, args.max_seq_len, 1], dtype='int64')
seq_lens = fluid.layers.data(name='seq_lens', shape=[-1], dtype='int64')
pyreader = fluid.io.DataLoader.from_generator(feed_list=[src_ids, pos_ids, input_mask, labels, seq_lens],
capacity=70,
iterable=False)
ernie = ErnieModel(
src_ids=src_ids,
position_ids=pos_ids,
input_mask=input_mask,
config=ernie_config,
use_fp16=args.use_fp16)
enc_out = ernie.get_sequence_output()
# enc_out = fluid.layers.dropout(
# x=enc_out,
# dropout_prob=0.1,
# dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=enc_out,
size=args.num_labels,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name="cls_seq_label_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_seq_label_out_b",
initializer=fluid.initializer.Constant(0.)))
infers = fluid.layers.argmax(logits, axis=2)
ret_infers = fluid.layers.reshape(x=infers, shape=[-1, 1])
lod_labels = fluid.layers.sequence_unpad(labels, seq_lens)
lod_infers = fluid.layers.sequence_unpad(infers, seq_lens)
(_, _, _, num_infer, num_label, num_correct) = fluid.layers.chunk_eval(
input=lod_infers,
label=lod_labels,
chunk_scheme=args.chunk_scheme,
num_chunk_types=((args.num_labels-1)//(len(args.chunk_scheme)-1)))
labels = fluid.layers.flatten(labels, axis=2)
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=fluid.layers.flatten(
logits, axis=2),
label=labels,
return_softmax=True)
input_mask = fluid.layers.flatten(input_mask, axis=2)
ce_loss = ce_loss * input_mask
loss = fluid.layers.mean(x=ce_loss)
graph_vars = {
"inputs": src_ids,
"loss": loss,
"probs": probs,
"seqlen": seq_lens,
"num_infer": num_infer,
"num_label": num_label,
"num_correct": num_correct,
}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
def calculate_f1(num_label, num_infer, num_correct):
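    """Micro-averaged precision, recall and F1 from chunk counts."""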
num_infer = np.sum(num_infer)
num_label = np.sum(num_label)
num_correct = np.sum(num_correct)
if num_infer == 0:
precision = 0.0
else:
precision = num_correct * 1.0 / num_infer
if num_label == 0:
recall = 0.0
else:
recall = num_correct * 1.0 / num_label
if num_correct == 0:
f1 = 0.0
else:
f1 = 2 * precision * recall / (precision + recall)
return precision, recall, f1
def evaluate(exe,
program,
pyreader,
graph_vars,
tag_num,
lang=None):
fetch_list = [
graph_vars["num_infer"].name, graph_vars["num_label"].name,
graph_vars["num_correct"].name
]
total_label, total_infer, total_correct = 0.0, 0.0, 0.0
time_begin = time.time()
pyreader.start()
while True:
try:
np_num_infer, np_num_label, np_num_correct = exe.run(program=program,
fetch_list=fetch_list)
total_infer += np.sum(np_num_infer)
total_label += np.sum(np_num_label)
total_correct += np.sum(np_num_correct)
except fluid.core.EOFException:
pyreader.reset()
break
precision, recall, f1 = calculate_f1(total_label, total_infer,
total_correct)
time_end = time.time()
return \
"[%s evaluation] f1: %f, precision: %f, recall: %f, elapsed time: %f s" \
% (lang, f1, precision, recall, time_end - time_begin)
def chunk_predict(np_inputs, np_probs, np_lens, dev_count=1):
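    """Unpad each sequence (dropping the first and last special tokens) and
    return (token ids, predicted tags, per-token probabilities) per example."""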
inputs = np_inputs.reshape([-1]).astype(np.int32)
probs = np_probs.reshape([-1, np_probs.shape[-1]])
all_lens = np_lens.reshape([dev_count, -1]).astype(np.int32).tolist()
base_index = 0
out = []
for dev_index in xrange(dev_count):
lens = all_lens[dev_index]
max_len = 0
for l in lens:
max_len = max(max_len, l)
for i in xrange(len(lens)):
seq_st = base_index + i * max_len + 1
seq_en = seq_st + (lens[i] - 2)
prob = probs[seq_st:seq_en, :]
infers = np.argmax(prob, -1)
out.append((
inputs[seq_st:seq_en].tolist(),
infers.tolist(),
prob.tolist()))
base_index += max_len * len(lens)
return out
def predict(exe,
test_program,
test_pyreader,
graph_vars,
dev_count=1):
fetch_list = [
graph_vars["inputs"].name,
graph_vars["probs"].name,
graph_vars["seqlen"].name,
]
test_pyreader.start()
res = []
while True:
try:
inputs, probs, np_lens = exe.run(program=test_program,
fetch_list=fetch_list)
r = chunk_predict(inputs, probs, np_lens, dev_count)
res += r
except fluid.core.EOFException:
test_pyreader.reset()
break
return res
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import sys
import subprocess
import os
import six
import copy
import argparse
import time
import logging
from utils.args import ArgumentGroup, print_arguments, prepare_logger
from utils.finetune_args import parser as worker_parser
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
multip_g = ArgumentGroup(parser, "multiprocessing",
"start paddle training using multi-processing mode.")
multip_g.add_arg("node_ips", str, None,
"paddle trainer ips")
multip_g.add_arg("node_id", int, 0,
"the trainer id of the node for multi-node distributed training.")
multip_g.add_arg("print_config", bool, True,
"print the config of multi-processing mode.")
multip_g.add_arg("current_node_ip", str, None,
"the ip of current node.")
multip_g.add_arg("split_log_path", str, "log",
"log path for each trainer.")
multip_g.add_arg("log_prefix", str, "",
"the prefix name of job log.")
multip_g.add_arg("nproc_per_node", int, 8,
"the number of process to use on each node.")
multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7",
"the gpus selected to use.")
multip_g.add_arg("training_script", str, None, "the program/script to be lauched "
"in parallel followed by all the arguments", positional_arg=True)
multip_g.add_arg("training_script_args", str, None,
"training script args", positional_arg=True, nargs=argparse.REMAINDER)
# yapf: enable
log = logging.getLogger()
def start_procs(args):
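    """Spawn one training subprocess per `nproc_per_node`, setting the
    PADDLE_* distributed-launch environment variables for each trainer and
    waiting for all of them to finish."""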
procs = []
log_fns = []
default_env = os.environ.copy()
node_id = args.node_id
node_ips = [x.strip() for x in args.node_ips.split(',')]
current_ip = args.current_node_ip
if args.current_node_ip is None:
assert len(node_ips) == 1
current_ip = node_ips[0]
log.info(current_ip)
num_nodes = len(node_ips)
selected_gpus = [x.strip() for x in args.selected_gpus.split(',')]
selected_gpu_num = len(selected_gpus)
all_trainer_endpoints = ""
for ip in node_ips:
for i in range(args.nproc_per_node):
if all_trainer_endpoints != "":
all_trainer_endpoints += ","
all_trainer_endpoints += "%s:617%d" % (ip, i)
nranks = num_nodes * args.nproc_per_node
gpus_per_proc = args.nproc_per_node % selected_gpu_num
if gpus_per_proc == 0:
gpus_per_proc = selected_gpu_num // args.nproc_per_node
else:
gpus_per_proc = selected_gpu_num // args.nproc_per_node + 1
selected_gpus_per_proc = [selected_gpus[i:i + gpus_per_proc] for i in range(0, len(selected_gpus), gpus_per_proc)]
if args.print_config:
log.info("all_trainer_endpoints: %s"
", node_id: %s"
", current_ip: %s"
", num_nodes: %s"
", node_ips: %s"
", gpus_per_proc: %s"
", selected_gpus_per_proc: %s"
", nranks: %s" % (
all_trainer_endpoints,
node_id,
current_ip,
num_nodes,
node_ips,
gpus_per_proc,
selected_gpus_per_proc,
nranks))
current_env = copy.copy(default_env)
procs = []
cmds = []
log_fns = []
for i in range(0, args.nproc_per_node):
trainer_id = node_id * args.nproc_per_node + i
assert current_ip is not None
current_env.update({
"FLAGS_selected_gpus": "%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]),
"PADDLE_TRAINER_ID" : "%d" % trainer_id,
"PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i),
"PADDLE_TRAINERS_NUM": "%d" % nranks,
"PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints,
"PADDLE_NODES_NUM": "%d" % num_nodes
})
try:
idx = args.training_script_args.index('--is_distributed')
args.training_script_args[idx + 1] = 'true'
except ValueError:
args.training_script_args += ['--is_distributed', 'true']
cmd = [sys.executable, "-u",
args.training_script] + args.training_script_args
cmds.append(cmd)
if args.split_log_path:
logdir = "%s/%sjob.log.%d" % (args.split_log_path, args.log_prefix, trainer_id)
try:
os.mkdir(os.path.dirname(logdir))
except OSError:
pass
fn = open(logdir, "a")
log_fns.append(fn)
process = subprocess.Popen(cmd, env=current_env, stdout=fn, stderr=fn)
log.info('subprocess launched, check log at %s' % logdir)
else:
process = subprocess.Popen(cmd, env=current_env)
log.info('subprocess launched')
procs.append(process)
try:
for i in range(len(procs)):
proc = procs[i]
proc.wait()
if len(log_fns) > 0:
log_fns[i].close()
if proc.returncode != 0:
raise subprocess.CalledProcessError(returncode=procs[i].returncode,
cmd=cmds[i])
else:
log.info("proc %d finsh" % i)
except KeyboardInterrupt as e:
for p in procs:
log.info('killing %s' % p)
p.terminate()
def main(args):
if args.print_config:
print_arguments(args)
start_procs(args)
if __name__ == "__main__":
prepare_logger(log)
    launch_args = parser.parse_args()
    finetuning_args = worker_parser.parse_args(
        launch_args.training_script_args)
    init_path = finetuning_args.init_pretraining_params
    log.info("init model: %s" % init_path)
    main(launch_args)
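# A minimal sketch (hypothetical values) of how this launcher derives trainer
# endpoints and the per-process GPU split, e.g. 2 nodes x 4 processes with
# 8 visible GPUs per node:
#
#   node_ips = ["10.0.0.1", "10.0.0.2"]
#   nproc_per_node = 4
#   selected_gpus = ["0", "1", "2", "3", "4", "5", "6", "7"]
#   endpoints = ",".join("%s:617%d" % (ip, i)
#                        for ip in node_ips for i in range(nproc_per_node))
#   gpus_per_proc = len(selected_gpus) // nproc_per_node   # -> 2 GPUs/worker
#   # endpoints: "10.0.0.1:6170,...,10.0.0.2:6173"; the launcher exports the
#   # matching PADDLE_* environment variables to each subprocess.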
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ernie model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import json
import six
import logging
import paddle.fluid as fluid
from io import open
from paddle.fluid.layers import core
from model.transformer_encoder import encoder, pre_process_layer
log = logging.getLogger(__name__)
class ErnieConfig(object):
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path, 'r', encoding='utf8') as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing Ernie model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict.get(key, None)
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
class ErnieModel(object):
def __init__(self,
src_ids,
position_ids,
input_mask,
config,
weight_sharing=True,
use_fp16=False):
self._emb_size = config['hidden_size']
self._n_layer = config['num_hidden_layers']
self._n_head = config['num_attention_heads']
self._voc_size = config['vocab_size']
self._max_position_seq_len = config['max_position_embeddings']
self._hidden_act = config['hidden_act']
self._prepostprocess_dropout = config['hidden_dropout_prob']
self._attention_dropout = config['attention_probs_dropout_prob']
self._weight_sharing = weight_sharing
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._dtype = core.VarDesc.VarType.FP16 if use_fp16 else core.VarDesc.VarType.FP32
self._emb_dtype = core.VarDesc.VarType.FP32
        # Initialize all weights with a truncated normal initializer; all
        # biases are initialized to zero by default.
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
self._build_model(src_ids, position_ids, input_mask)
def _build_model(self, src_ids, position_ids, input_mask):
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
position_emb_out = fluid.layers.embedding(
input=position_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
emb_out = emb_out + position_emb_out
emb_out = pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
if self._dtype == core.VarDesc.VarType.FP16:
emb_out = fluid.layers.cast(x=emb_out, dtype=self._dtype)
input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
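        # Build the additive attention bias: matmul(mask, mask^T) is 1 for
        # valid token pairs and 0 where either position is padding; scaling by
        # 10000 with bias -1 maps this to 0 / -10000, which is added to the
        # attention logits so padded positions vanish after softmax.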
self_attn_mask = fluid.layers.matmul(
x=input_mask, y=input_mask, transpose_y=True)
self_attn_mask = fluid.layers.scale(
x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder(
enc_input=emb_out,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
name='encoder')
if self._dtype == core.VarDesc.VarType.FP16:
self._enc_out = fluid.layers.cast(
x=self._enc_out, dtype=self._emb_dtype)
def get_sequence_output(self):
return self._enc_out
def get_pooled_output(self):
"""Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(
input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(
name="pooled_fc.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc.b_0")
return next_sent_feat
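# A minimal usage sketch (hypothetical config path and shapes; assumes the
# static-graph fluid API used throughout this repo):
#
#   import paddle
#   paddle.enable_static()
#   config = ErnieConfig("ernie_m_config.json")
#   src_ids = fluid.data("src_ids", shape=[-1, 128, 1], dtype="int64")
#   pos_ids = fluid.data("pos_ids", shape=[-1, 128, 1], dtype="int64")
#   mask = fluid.data("input_mask", shape=[-1, 128, 1], dtype="float32")
#   ernie = ErnieModel(src_ids, pos_ids, mask, config)
#   seq_out = ernie.get_sequence_output()   # [batch, seq_len, hidden]
#   pooled = ernie.get_pooled_output()      # [batch, hidden], tanh fc on [CLS]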
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import paddle.fluid as fluid
from utils.fp16 import create_master_params_grads, master_param_to_train_param, apply_dynamic_loss_scaling
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
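# A pure-python sketch of the schedule above (assumed example values):
# linear warmup to the peak LR over warmup_steps, then linear decay
# (polynomial_decay with power=1) to 0 at num_train_steps.
#
#   def lr_at(step, peak=3e-5, warmup_steps=1000, num_train_steps=10000):
#       if step < warmup_steps:
#           return peak * step / warmup_steps
#       return peak * max(0.0, 1.0 - float(step) / num_train_steps)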
def layerwise_lr_decay_function(param, param_last, learning_rate, decay_rate, n_layers):
delta = param - param_last
if "encoder_layer" in param.name and param.name.index("encoder_layer")==0:
layer = int(param.name.split("_")[2])
ratio = decay_rate ** (n_layers - layer)
param_update = param + (ratio - 1) * delta
elif "embedding" in param.name:
ratio = decay_rate ** (n_layers + 1)
param_update = param + (ratio - 1) * delta
else:
param_update = None
return param_update
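# Effective per-parameter LR multipliers implied above, for a 12-layer model
# with decay_rate=0.8 (a sketch; ratio = decay_rate ** (n_layers - layer)):
#   encoder_layer_11 -> 0.8       encoder_layer_0 -> 0.8**12 ~ 0.069
#   embeddings       -> 0.8**13 ~ 0.055
# i.e. lower layers and embeddings move more slowly than the top layers.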
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
use_fp16=False,
use_dynamic_loss_scaling=False,
init_loss_scaling=1.0,
incr_every_n_steps=1000,
decr_every_n_nan_or_inf=2,
incr_ratio=2.0,
decr_ratio=0.8,
layerwise_lr_decay=0.8,
n_layers=12):
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(learning_rate ** 2)),
warmup_steps)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr, epsilon=1e-6)
else:
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr, epsilon=1e-6)
optimizer._learning_rate_map[fluid.default_main_program(
)] = scheduled_lr
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
def exclude_from_weight_decay(name):
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
param_list = dict()
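    # param_list snapshots pre-update parameter values (param * 1.0 makes a
    # stop-gradient copy) so that weight decay and layer-wise LR decay below
    # are applied relative to the values before the optimizer step.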
loss_scaling = fluid.layers.create_global_var(
name=fluid.unique_name.generate("loss_scaling"),
shape=[1],
value=init_loss_scaling,
dtype='float32',
persistable=True)
if use_fp16:
loss *= loss_scaling
param_grads = optimizer.backward(loss)
master_param_grads = create_master_params_grads(
param_grads, train_program, startup_prog, loss_scaling)
for param, _ in master_param_grads:
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
if use_dynamic_loss_scaling:
apply_dynamic_loss_scaling(
loss_scaling, master_param_grads, incr_every_n_steps,
decr_every_n_nan_or_inf, incr_ratio, decr_ratio)
optimizer.apply_gradients(master_param_grads)
if weight_decay > 0:
for param, grad in master_param_grads:
                # strip an exact ".master" suffix (rstrip(".master") would
                # strip any trailing run of those characters)
                name = param.name[:-len(".master")] if param.name.endswith(".master") else param.name
                if exclude_from_weight_decay(name):
                    continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
master_param_to_train_param(master_param_grads, param_grads,
train_program)
else:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if layerwise_lr_decay > 0:
for param, grad in param_grads:
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("layer_decay"):
param_update = layerwise_lr_decay_function(param, param_list[param.name], scheduled_lr, layerwise_lr_decay, n_layers)
                    if param_update is not None:
fluid.layers.assign(output=param, input=param_update)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr, loss_scaling
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
        Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
        so that it becomes one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
    Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
            out = out + prev_out if prev_out is not None else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
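# The process_cmd string is interpreted character by character above:
# "a" = add residual, "n" = layer normalization, "d" = dropout. ERNIE-M's
# encoder (see ErnieModel) is built with preprocess_cmd="" and
# postprocess_cmd="dan", i.e. a post-LN transformer block.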
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def pad_batch_data(insts,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype("int64").reshape([-1])]
return return_list if len(return_list) > 1 else return_list[0]
if __name__ == "__main__":
pass
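# A minimal usage sketch of pad_batch_data on a toy batch:
#
#   ids, pos, mask = pad_batch_data(
#       [[5, 6, 7], [8, 9]], pad_idx=0,
#       return_pos=True, return_input_mask=True)
#   # ids.shape == (2, 3, 1); pos pads with pad_idx; mask is float32 with
#   # mask[1, 2, 0] == 0.0 marking the padded position of the second example.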
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import re
import unicodedata
import six
from six.moves import range
import sentencepiece as spm
SPIECE_UNDERLINE = u"▁".encode("utf-8")
def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
"""Checks whether the casing config is consistent with the checkpoint name."""
# The casing has to be passed in by the user and there is no explicit check
# as to whether it matches the checkpoint. The casing information probably
# should have been stored in the bert_config.json file, but it's not, so
# we have to heuristically detect it to validate.
if not init_checkpoint:
return
m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt",
six.ensure_str(init_checkpoint))
if m is None:
return
model_name = m.group(1)
lower_models = [
"uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
"multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
]
cased_models = [
"cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
"multi_cased_L-12_H-768_A-12"
]
is_bad_config = False
if model_name in lower_models and not do_lower_case:
is_bad_config = True
actual_flag = "False"
case_name = "lowercased"
opposite_flag = "True"
if model_name in cased_models and do_lower_case:
is_bad_config = True
actual_flag = "True"
case_name = "cased"
opposite_flag = "False"
if is_bad_config:
raise ValueError(
"You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
"However, `%s` seems to be a %s model, so you "
"should pass in `--do_lower_case=%s` so that the fine-tuning matches "
"how the model was pre-training. If this error is wrong, please "
"just comment out this check." % (actual_flag, init_checkpoint,
model_name, case_name, opposite_flag))
def clean_text(text):
"""Performs invalid character removal and whitespace cleanup on text."""
text = text.replace(u"“", u'"')\
.replace(u'”', u'"')\
.replace(u'‘', "'")\
.replace(u'’', u"'")\
.replace(u'—', u'-')
output = []
for char in text:
if _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
def preprocess_text(inputs, remove_space=True, lower=False):
"""preprocess data by removing extra space and normalize data."""
outputs = inputs
if remove_space:
outputs = " ".join(inputs.strip().split())
if six.PY2 and isinstance(outputs, str):
try:
outputs = six.ensure_text(outputs, "utf-8")
except UnicodeDecodeError:
outputs = six.ensure_text(outputs, "latin-1")
outputs = unicodedata.normalize("NFKD", outputs)
outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
if lower:
outputs = outputs.lower()
return outputs
def encode_pieces(sp_model, text, return_unicode=True, sample=False):
"""turn sentences into word pieces."""
    # (liujiaxiang) added for ernie-albert: normalize curly quotes and the
    # em-dash, which otherwise produce too many UNK pieces
text = clean_text(text)
if six.PY2 and isinstance(text, six.text_type):
text = six.ensure_binary(text, "utf-8")
if not sample:
pieces = sp_model.EncodeAsPieces(text)
else:
pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
new_pieces = []
for piece in pieces:
piece = printable_text(piece)
if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
cur_pieces = sp_model.EncodeAsPieces(
six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
if len(cur_pieces[0]) == 1:
cur_pieces = cur_pieces[1:]
else:
cur_pieces[0] = cur_pieces[0][1:]
cur_pieces.append(piece[-1])
new_pieces.extend(cur_pieces)
else:
new_pieces.append(piece)
# note(zhiliny): convert back to unicode for py2
if six.PY2 and return_unicode:
ret_pieces = []
for piece in new_pieces:
if isinstance(piece, str):
piece = six.ensure_text(piece, "utf-8")
ret_pieces.append(piece)
new_pieces = ret_pieces
return new_pieces
def encode_ids(sp_model, text, sample=False):
pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
ids = [sp_model.PieceToId(piece) for piece in pieces]
return ids
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return six.ensure_text(text, "utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return six.ensure_text(text, "utf-8", "ignore")
elif isinstance(text, six.text_type):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return six.ensure_text(text, "utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, six.text_type):
return six.ensure_binary(text, "utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
#def load_vocab(vocab_file):
# """Loads a vocabulary file into a dictionary."""
# vocab = collections.OrderedDict()
# with tf.gfile.GFile(vocab_file, "r") as reader:
# while True:
# token = convert_to_unicode(reader.readline())
# if not token:
# break
# token = token.strip().split()[0]
# if token not in vocab:
# vocab[token] = len(vocab)
# return vocab
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
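# Expected vocab file format (one entry per line): either "token\tindex" or a
# bare token, in which case the line number is used as the index, e.g.
#   [PAD]\t0
#   [CLS]\t1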
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a piece of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True, model_file='./30k-clean.model'):
self.vocab = None
self.sp_model = None
if model_file:
self.sp_model = spm.SentencePieceProcessor()
#tf.logging.info("loading sentence piece model")
self.sp_model.Load(model_file)
            # Note(mingdachen): for the purpose of a consistent API, we are
# generating a vocabulary for the sentence piece tokenizer.
#self.vocab = {self.sp_model.IdToPiece(i): i for i
# in range(self.sp_model.GetPieceSize())}
self.vocab = load_vocab(vocab_file)
else:
#self.vocab = load_vocab(vocab_file)
#self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
#self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
# (liujiaxiang) comment useless code for a better diff code
raise ValueError('albert use spm by default')
self.inv_vocab = {v: k for k, v in self.vocab.items()}
def tokenize(self, text):
if self.sp_model:
split_tokens = encode_pieces(self.sp_model, text, return_unicode=False)
else:
#split_tokens = []
#for token in self.basic_tokenizer.tokenize(text):
# for sub_token in self.wordpiece_tokenizer.tokenize(token):
# split_tokens.append(sub_token)
# (liujiaxiang) comment useless code for a better diff code
raise ValueError('albert use spm by default')
return split_tokens
def tokenize_for_pretrain(self, tok_list):
import tok as tok_protocol
text = " ".join([t.token for t in tok_list])
#split_tokens = encode_pieces(self.sp_model, text, return_unicode=True)
split_tokens = encode_pieces(self.sp_model, text, return_unicode=False)
ids = self.convert_tokens_to_ids(split_tokens)
# +1 for head _ : 'hello world' -> ['_hello', '_world']
if not (len(preprocess_text(''.join(split_tokens))) == len(text) + 1):
return None
if len(split_tokens) != len(ids):
return None
sent_piece_tokens = []
i = 0
position_to_nth = self.inverse_index_str("_" + text)
for t, id in zip(split_tokens, ids):
t = t.decode('utf8')
nth = position_to_nth[i]
token = tok_list[nth]
tok = tok_protocol.Tok()
tok.token = t
tok.id = id
tok.bio = token.bio
tok.origin = token.origin
tok.appear = token.appear
i += len(t)
sent_piece_tokens.append(tok)
return sent_piece_tokens
def inverse_index_str(self, s):
nth_tok = 0
position_to_nth = {}
for i, c in enumerate(s):
if c == " ":
nth_tok += 1
position_to_nth[i] = nth_tok
return position_to_nth
# def convert_tokens_to_ids(self, tokens):
# if self.sp_model:
# #tf.logging.info("using sentence piece tokenzier.")
# return [self.sp_model.PieceToId(
# printable_text(token)) for token in tokens]
# else:
# return convert_by_vocab(self.vocab, tokens)
def convert_tokens_to_ids(self, tokens):
tokens_out = []
for i in tokens:
item = i
if item in self.vocab:
tokens_out.append(self.vocab[item])
else:
tokens_out.append(self.vocab['[UNK]'])
return tokens_out
def convert_ids_to_tokens(self, ids):
if self.sp_model:
#tf.logging.info("using sentence piece tokenzier.")
return [self.sp_model.IdToPiece(id_) for id_ in ids]
else:
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
        # like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="<unk>", max_input_chars_per_word=200):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + six.ensure_str(substr)
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat in ("Cc", "Cf"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
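# A minimal usage sketch (hypothetical file names): ERNIE-M tokenization is
# sentencepiece-based, so FullTokenizer needs both the id vocabulary and the
# trained sentencepiece model shipped with the checkpoint.
#
#   tokenizer = FullTokenizer(vocab_file="vocab.txt",
#                             model_file="30k-clean.model")
#   pieces = tokenizer.tokenize("ERNIE-M aligns cross-lingual semantics.")
#   ids = tokenizer.convert_tokens_to_ids(pieces)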
six==1.16.0
sentencepiece==0.1.96
numpy==1.21.4
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import json
import logging
from io import open
import paddle
import paddle.fluid as fluid
import reader.task_reader as task_reader
from model.ernie import ErnieConfig
from model.optimization import optimization
from finetune.classifier import create_model, evaluate, predict
from utils.args import print_arguments, check_cuda, prepare_logger
from utils.init import init_pretraining_params, init_checkpoint
from utils.finetune_args import parser
paddle.enable_static()
args = parser.parse_args()
log = logging.getLogger()
def main(args):
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
if args.use_cuda:
dev_list = fluid.cuda_places()
place = dev_list[0]
else:
place = fluid.CPUPlace()
reader = task_reader.ClassifyReader(
vocab_path=args.vocab_path,
piece_model_path=args.piece_model_path,
max_seq_len=args.max_seq_len,
in_tokens=args.in_tokens,
tokenizer=args.tokenizer,
label_map_config=args.label_map_config,
random_seed=args.random_seed
)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
if args.do_test:
assert args.test_save is not None
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=trainers_num,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples(args.train_set)
if args.in_tokens:
if args.batch_size < args.max_seq_len:
                raise ValueError('if in_tokens=True, batch_size should be greater than max_seq_len, got batch_size:%d seqlen:%d' % (args.batch_size, args.max_seq_len))
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // trainers_num
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // trainers_num
warmup_steps = int(max_train_steps * args.warmup_proportion)
log.info("Trainer count: %d" % trainers_num)
log.info("Num train examples: %d" % num_train_examples)
log.info("Max train steps: %d" % max_train_steps)
log.info("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = create_model(
args,
ernie_config=ernie_config)
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
incr_every_n_steps=args.incr_every_n_steps,
decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
incr_ratio=args.incr_ratio,
decr_ratio=args.decr_ratio,
layerwise_lr_decay=args.layerwise_lr_decay,
n_layers=ernie_config["num_hidden_layers"])
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
log.info("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, graph_vars = create_model(
args,
ernie_config=ernie_config)
test_prog = test_prog.clone(for_test=True)
log.info("args.is_distributed: {}".format(args.is_distributed))
    nccl2_trainer_id = 0
    nccl2_num_trainers = 1
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
        nccl2_trainer_id = trainer_id
        nccl2_num_trainers = trainers_num
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
log.warning(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.do_val or args.do_test:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = nccl2_num_trainers
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator)
if args.do_train:
train_pyreader.start()
steps = 0
if warmup_steps > 0:
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
last_epoch = 0
current_epoch = 0
while True:
try:
steps += 1
if steps % args.skip_steps != 0:
train_exe.run(fetch_list=[])
else:
outputs = evaluate(
train_exe,
train_program,
train_pyreader,
graph_vars,
"train")
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size(
)
verbose += "learning rate: %f" % (
outputs["learning_rate"]
if warmup_steps > 0 else args.learning_rate)
log.info(verbose)
current_example, current_epoch = reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
log.info(
"epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
"ave acc: %f, speed: %f steps/s" %
(current_epoch, current_example, num_train_examples,
steps, outputs["loss"], outputs["accuracy"],
args.skip_steps / used_time))
time_begin = time.time()
if nccl2_trainer_id == 0:
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0 or last_epoch != current_epoch:
# evaluate dev set
if args.do_val:
evaluate_wrapper(args, reader, exe, test_prog,
test_pyreader, graph_vars,
current_epoch, steps)
if args.do_test:
predict_wrapper(args, reader, exe, test_prog,
test_pyreader, graph_vars,
current_epoch, steps)
if last_epoch != current_epoch:
last_epoch = current_epoch
except fluid.core.EOFException:
if nccl2_trainer_id == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
if nccl2_trainer_id == 0:
# final eval on dev set
if args.do_val:
evaluate_wrapper(args, reader, exe, test_prog, test_pyreader,
graph_vars, current_epoch, steps)
# final eval on test set
if args.do_test:
predict_wrapper(args, reader, exe, test_prog, test_pyreader,
graph_vars, current_epoch, steps)
def evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps):
# evaluate dev set
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for ds in args.dev_set.split(','):
for lang in json.load(open(args.lang_map_config, "r")):
test_pyreader.set_batch_generator(
reader.data_generator(
ds % lang,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False,
phase="dev"))
log.info("validation result of dataset {}:".format(ds % lang))
evaluate_info = evaluate(
exe,
test_prog,
test_pyreader,
graph_vars,
"dev",
lang=lang)
log.info(evaluate_info + ' file: {}, epoch: {}, steps: {}'.format(ds % lang, epoch, steps))
def predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps):
test_sets = args.test_set.split(',')
save_dirs = args.test_save.split(',')
assert len(test_sets) == len(save_dirs)
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for test_f, save_f in zip(test_sets, save_dirs):
for lang in json.load(open(args.lang_map_config, "r")):
test_pyreader.set_batch_generator(
reader.data_generator(
test_f % lang,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False,
phase="test"))
save_path = save_f + '.' + str(epoch) + '.' + str(steps)
log.info("testing {}, save to {}".format(test_f % lang, save_path % lang))
qids, preds, probs = predict(
exe,
test_prog,
test_pyreader,
graph_vars)
save_dir = os.path.dirname(save_path % lang)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
with open(save_path % lang, 'w') as f:
if len(qids) == 0:
for pred, prob in zip(preds, probs):
f.write('{}\t{}\n'.format(pred, prob))
else:
for qid, pred, prob in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(qid, pred, prob))
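# Dataset-path convention used by evaluate_wrapper/predict_wrapper above
# (hypothetical paths): dev_set/test_set entries contain a '%s' placeholder
# that is filled with each language code listed in lang_map_config, e.g.
#   --dev_set "data/xnli/dev-%s.tsv"   with lang_map_config listing "en", "zh", ...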
if __name__ == '__main__':
prepare_logger(log)
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import json
import logging
from io import open
import paddle
import paddle.fluid as fluid
import reader.task_reader as task_reader
from model.ernie import ErnieConfig
from model.optimization import optimization
from utils.init import init_pretraining_params, init_checkpoint
from utils.args import print_arguments, check_cuda, prepare_logger
from finetune.mrc import create_model, evaluate
from utils.finetune_args import parser
paddle.enable_static()
args = parser.parse_args()
log = logging.getLogger()
def main(args):
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
if args.use_cuda:
dev_list = fluid.cuda_places()
place = dev_list[0]
else:
place = fluid.CPUPlace()
reader = task_reader.MRCReader(
vocab_path=args.vocab_path,
piece_model_path=args.piece_model_path,
max_seq_len=args.max_seq_len,
in_tokens=args.in_tokens,
tokenizer=args.tokenizer,
label_map_config=args.label_map_config,
doc_stride=args.doc_stride,
max_query_length=args.max_query_length,
random_seed=args.random_seed)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
if args.do_test:
assert args.test_save is not None
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=trainers_num,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples("train")
if args.in_tokens:
if args.batch_size < args.max_seq_len:
                raise ValueError('if in_tokens=True, batch_size should be greater than max_seq_len, got batch_size:%d seqlen:%d' % (args.batch_size, args.max_seq_len))
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // trainers_num
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // trainers_num
warmup_steps = int(max_train_steps * args.warmup_proportion)
log.info("Trainer count: %d" % trainers_num)
log.info("Num train examples: %d" % num_train_examples)
log.info("Max train steps: %d" % max_train_steps)
log.info("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = create_model(
args,
ernie_config=ernie_config)
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
incr_every_n_steps=args.incr_every_n_steps,
decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
incr_ratio=args.incr_ratio,
decr_ratio=args.decr_ratio,
layerwise_lr_decay=args.layerwise_lr_decay,
n_layers=ernie_config["num_hidden_layers"])
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
log.info("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_graph_vars = create_model(
args,
ernie_config=ernie_config)
test_prog = test_prog.clone(for_test=True)
log.info("args.is_distributed: {}".format(args.is_distributed))
nccl2_num_trainers = 1
nccl2_trainer_id = 0
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
            log.warning(
                "WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
                "both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.do_val or args.do_test:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = nccl2_num_trainers
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator)
if args.do_train:
train_pyreader.start()
steps = 0
if warmup_steps > 0:
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
last_epoch = 0
current_epoch = 0
while True:
try:
steps += 1
if steps % args.skip_steps != 0:
train_exe.run(fetch_list=[])
else:
outputs = evaluate(
train_exe,
train_program,
train_pyreader,
graph_vars,
"train")
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size(
)
verbose += "learning rate: %f" % (
outputs["learning_rate"]
if warmup_steps > 0 else args.learning_rate)
log.info(verbose)
current_example, current_epoch = reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
log.info(
"epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
"speed: %f steps/s" %
(current_epoch, current_example, num_train_examples,
steps, outputs["loss"], args.skip_steps / used_time))
time_begin = time.time()
if nccl2_trainer_id == 0:
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if nccl2_trainer_id < args.use_gpu_num_in_test:
if steps % args.validation_steps == 0 or last_epoch != current_epoch:
if args.do_val:
evaluate_wrapper(args, reader, exe, test_prog,
test_pyreader, test_graph_vars,
current_epoch, steps, nccl2_trainer_id)
if args.do_test: # need to change for output
predict_wrapper(args, reader, exe, test_prog,
test_pyreader, test_graph_vars,
current_epoch, steps, nccl2_trainer_id)
if last_epoch != current_epoch:
last_epoch = current_epoch
except fluid.core.EOFException:
if nccl2_trainer_id == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
if nccl2_trainer_id < args.use_gpu_num_in_test:
# final eval on dev set
if args.do_val:
evaluate_wrapper(args, reader, exe, test_prog, test_pyreader,
test_graph_vars, current_epoch, steps, nccl2_trainer_id)
# final eval on test set
if args.do_test:  # need to change for output
predict_wrapper(args, reader, exe, test_prog, test_pyreader,
test_graph_vars, current_epoch, steps, nccl2_trainer_id)
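# Run evaluation for every dev-set pattern and every language in lang_map_config;
# the '%s' placeholder in each file path is filled with the language code.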
def evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps, trainer_id):
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for ds in args.dev_set.split(','):
for lang in json.load(open(args.lang_map_config, "r")):
test_pyreader.set_batch_generator(
reader.data_generator(
ds % lang,
batch_size=batch_size,
epoch=1,
dev_count=args.use_gpu_num_in_test,
shuffle=False,
phase="dev"))
save_path = "./tmpout/" + os.path.basename(ds) + '.' + str(epoch) + '.' + str(steps)
evaluate(
exe,
test_prog,
test_pyreader,
graph_vars,
"dev",
examples=reader.get_examples("dev"),
features=reader.get_features("dev"),
args=args,
trainer_id=trainer_id,
dev_count=args.use_gpu_num_in_test,
input_file=ds % lang,
output_path=save_path % lang,
tokenizer=reader.tokenizer)
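# Run prediction on every test-set pattern for every language and write the
# results under the matching test_save path.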
def predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps, trainer_id):
test_sets = args.test_set.split(',')
save_dirs = args.test_save.split(',')
assert len(test_sets) == len(save_dirs)
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for test_f, save_f in zip(test_sets, save_dirs):
for lang in json.load(open(args.lang_map_config, "r")):
test_pyreader.set_batch_generator(
reader.data_generator(
test_f % lang,
batch_size=batch_size,
epoch=1,
dev_count=args.use_gpu_num_in_test,
shuffle=False,
phase="test"))
save_path = save_f + '.' + str(epoch) + '.' + str(steps)
evaluate(exe,
test_prog,
test_pyreader,
graph_vars,
"test",
examples=reader.get_examples("test"),
features=reader.get_features("test"),
args=args,
trainer_id=trainer_id,
dev_count=args.use_gpu_num_in_test,
input_file=test_f % lang,
output_path=save_path % lang,
tokenizer=reader.tokenizer)
if __name__ == '__main__':
prepare_logger(log)
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import six
import time
import json
import logging
from io import open
import paddle
import paddle.fluid as fluid
import reader.task_reader as task_reader
from model.ernie import ErnieConfig
from model.optimization import optimization
from utils.init import init_pretraining_params, init_checkpoint
from utils.args import print_arguments, check_cuda, prepare_logger
from finetune.sequence_label import create_model, evaluate, predict, calculate_f1
from utils.finetune_args import parser
paddle.enable_static()
args = parser.parse_args()
log = logging.getLogger()
def main(args):
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
if args.use_cuda:
dev_list = fluid.cuda_places()
place = dev_list[0]
else:
place = fluid.CPUPlace()
reader = task_reader.SequenceLabelReader(
vocab_path=args.vocab_path,
piece_model_path=args.piece_model_path,
max_seq_len=args.max_seq_len,
in_tokens=args.in_tokens,
tokenizer=args.tokenizer,
label_map_config=args.label_map_config,
random_seed=args.random_seed
)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=trainers_num,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples(args.train_set)
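# When in_tokens is true, batch_size counts tokens rather than examples, so the
# effective number of sequences per batch is batch_size // max_seq_len.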
if args.in_tokens:
if args.batch_size < args.max_seq_len:
raise ValueError('if in_tokens=True, batch_size should be greater than max_seq_len, got batch_size:%d seqlen:%d' % (args.batch_size, args.max_seq_len))
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // trainers_num
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // trainers_num
warmup_steps = int(max_train_steps * args.warmup_proportion)
log.info("Trainer count: %d" % trainers_num)
log.info("Num train examples: %d" % num_train_examples)
log.info("Max train steps: %d" % max_train_steps)
log.info("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = create_model(
args,
ernie_config=ernie_config)
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
incr_every_n_steps=args.incr_every_n_steps,
decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
incr_ratio=args.incr_ratio,
decr_ratio=args.decr_ratio,
layerwise_lr_decay=args.layerwise_lr_decay,
n_layers=ernie_config["num_hidden_layers"])
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
log.info("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, graph_vars = create_model(
args,
ernie_config=ernie_config)
test_prog = test_prog.clone(for_test=True)
log.info("args.is_distributed: {}".format(args.is_distributed))
nccl2_num_trainers = 1
nccl2_trainer_id = 0
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
log.info(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"are both set! Only arg 'init_checkpoint' will be used.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.do_val or args.do_test:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = nccl2_num_trainers
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator)
if args.do_train:
train_pyreader.start()
steps = 0
if warmup_steps > 0:
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
last_epoch = 0
current_epoch = 0
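# Main training loop: run until the data generator raises EOFException; every
# skip_steps steps fetch the chunk counters and report loss and span-level F1.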
while True:
try:
steps += 1
if steps % args.skip_steps != 0:
train_exe.run(fetch_list=[])
else:
fetch_list = [
graph_vars["num_infer"].name, graph_vars["num_label"].name,
graph_vars["num_correct"].name,
graph_vars["loss"].name,
graph_vars['learning_rate'].name,
]
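# num_infer/num_label/num_correct are chunk counts (predicted, gold, and
# correctly matched) used below to compute precision, recall and F1.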
out = train_exe.run(fetch_list=fetch_list)
num_infer, num_label, num_correct, np_loss, np_lr = out
lr = float(np_lr[0])
loss = np_loss.mean()
precision, recall, f1 = calculate_f1(num_label, num_infer, num_correct)
if args.verbose:
log.info("train pyreader queue size: %d, learning rate: %f" % (train_pyreader.queue.size(),
lr if warmup_steps > 0 else args.learning_rate))
current_example, current_epoch = reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
log.info("epoch: %d, progress: %d/%d, step: %d, loss: %f, "
"f1: %f, precision: %f, recall: %f, speed: %f steps/s"
% (current_epoch, current_example, num_train_examples,
steps, loss, f1, precision, recall,
args.skip_steps / used_time))
time_begin = time.time()
if nccl2_trainer_id == 0 and steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if nccl2_trainer_id == 0 and (steps % args.validation_steps == 0 or last_epoch != current_epoch):
# evaluate dev set
if args.do_val:
evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
current_epoch, steps)
# evaluate test set
if args.do_test:
predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
current_epoch, steps)
if last_epoch != current_epoch:
last_epoch = current_epoch
except fluid.core.EOFException:
if nccl2_trainer_id == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
# final eval on dev set
if nccl2_trainer_id == 0 and args.do_val:
if not args.do_train:
current_example, current_epoch = reader.get_train_progress()
evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
current_epoch, 'final')
if nccl2_trainer_id == 0 and args.do_test:
if not args.do_train:
current_example, current_epoch = reader.get_train_progress()
predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
current_epoch, 'final')
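# Single-card evaluation: iterate over the dev-set patterns and the languages in
# lang_map_config, filling the '%s' placeholder with each language code.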
def evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps):
# evaluate dev set
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for ds in args.dev_set.split(','):  # single-card eval
for lang in json.load(open(args.lang_map_config, "r")):
test_pyreader.set_batch_generator(
reader.data_generator(
ds % lang,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False,
phase="dev"))
log.info("validation result of dataset {}:".format(ds % lang))
info = evaluate(exe,
test_prog,
test_pyreader,
graph_vars,
args.num_labels,
lang=lang)
log.info(info + ', file: {}, epoch: {}, steps: {}'.format(
ds % lang, epoch, steps))
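# Predict on every test-set pattern per language and dump token-level
# predictions to the matching test_save path.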
def predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps):
test_sets = args.test_set.split(',')
save_dirs = args.test_save.split(',')
assert len(test_sets) == len(save_dirs), 'number of test_sets & test_save does not match, got %d vs %d' % (len(test_sets), len(save_dirs))
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for test_f, save_f in zip(test_sets, save_dirs):
for lang in json.load(open(args.lang_map_config, "r")):
test_pyreader.set_batch_generator(reader.data_generator(
test_f % lang,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False,
phase="test"))
save_path = save_f + '.' + str(epoch) + '.' + str(steps)
log.info("testing {}, save to {}".format(test_f % lang, save_path % lang))
res = predict(exe,
test_prog,
test_pyreader,
graph_vars)
save_dir = os.path.dirname(save_path)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
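# Invert the label map (id -> label string) and write one tab-separated line per
# example: tokens, labels, and the probability assigned to each label.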
rev_label_map = {v: k for k, v in six.iteritems(reader.label_map)}
with open(save_path % lang, 'w', encoding='utf8') as f:
for i, s, p in res:
i = ' '.join(reader.tokenizer.convert_ids_to_tokens(i))
p = ' '.join(['%.5f' % pp[ss] for ss, pp in zip(s, p)])
s = ' '.join([rev_label_map[ss] for ss in s])
f.write('{}\t{}\t{}\n'.format(i, s, p))
if __name__ == '__main__':
prepare_logger(log)
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
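# Fine-tune ERNIE-M base for CoNLL NER, training on English data only (zero-shot
# cross-lingual transfer). lang_map_config points to a JSON file whose keys are the
# language codes used to fill the '%s' placeholders in dev/test paths; a hypothetical
# example: {"en": 0, "de": 1, "es": 2, "nl": 3}.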
export FLAGS_eager_delete_tensor_gb=0
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
distributed_args="--node_ips $(hostname -i) --node_id 0 --current_node_ip $(hostname -i) --log_prefix conll_ --nproc_per_node 1 --selected_gpus 0"
python -u ./launch.py ${distributed_args} \
./run_sequence_labeling.py --use_cuda true \
--is_distributed true \
--do_train true \
--do_val true \
--do_test false \
--in_tokens false \
--batch_size 8 \
--train_set ./data/conll/train/train.en.tsv \
--dev_set ./data/conll/dev/dev.%s.tsv,./data/conll/test/test.%s.tsv \
--test_set ./data/conll/test/test.%s.tsv \
--test_save ./output/conll/test/test.%s.tsv \
--label_map_config ./data/conll/label_map.json \
--lang_map_config ./data/conll/lang_map.json \
--vocab_path ./configs/vocab.txt \
--piece_model_path ./configs/piece.model \
--ernie_config_path ./configs/base/ernie_config.json \
--init_pretraining_params "./configs/base/params" \
--checkpoints ./checkpoints \
--save_steps 10000 \
--weight_decay 0.01 \
--layerwise_lr_decay 0.8 \
--warmup_proportion 0.1 \
--use_fp16 false \
--validation_steps 100 \
--epoch 10 \
--max_seq_len 512 \
--learning_rate 4e-4 \
--skip_steps 10 \
--num_iteration_per_drop_scope 1 \
--num_labels 9 \
--random_seed 1
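# Same CoNLL NER recipe, but training on train.all.tsv (the concatenation of all
# languages' training data) with a slightly lower learning rate.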
export FLAGS_eager_delete_tensor_gb=0
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
distributed_args="--node_ips $(hostname -i) --node_id 0 --current_node_ip $(hostname -i) --log_prefix conll_ --nproc_per_node 1 --selected_gpus 0"
python -u ./launch.py ${distributed_args} \
./run_sequence_labeling.py --use_cuda true \
--is_distributed true \
--do_train true \
--do_val true \
--do_test false \
--in_tokens false \
--batch_size 8 \
--train_set ./data/conll/train/train.all.tsv \
--dev_set ./data/conll/dev/dev.%s.tsv,./data/conll/test/test.%s.tsv \
--test_set ./data/conll/test/test.%s.tsv \
--test_save ./output/conll/test/test.%s.tsv \
--label_map_config ./data/conll/label_map.json \
--lang_map_config ./data/conll/lang_map.json \
--vocab_path ./configs/vocab.txt \
--piece_model_path ./configs/piece.model \
--ernie_config_path ./configs/base/ernie_config.json \
--init_pretraining_params "./configs/base/params" \
--checkpoints ./checkpoints \
--save_steps 10000 \
--weight_decay 0.01 \
--layerwise_lr_decay 0.8 \
--warmup_proportion 0.1 \
--use_fp16 false \
--validation_steps 100 \
--epoch 10 \
--max_seq_len 512 \
--learning_rate 3e-4 \
--skip_steps 10 \
--num_iteration_per_drop_scope 1 \
--num_labels 9 \
--random_seed 1
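# Encode sentences with ERNIE-M base and write sentence vectors for the given
# data_set; presumably used for the cross-lingual sentence retrieval (Tatoeba) task.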
export FLAGS_eager_delete_tensor_gb=0
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0
python -u ./encode_vector.py --use_cuda true \
--batch_size 16 \
--data_set ./data/xnli/test/test.en.tsv \
--output_dir ./output/xnli/test/test.en.tsv \
--vocab_path ./configs/vocab.txt \
--piece_model_path ./configs/piece.model \
--ernie_config_path ./configs/base/ernie_config.json \
--init_pretraining_params "./configs/base/params" \
--max_seq_len 256
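# Fine-tune ERNIE-M base for cross-lingual QA (MLQA): train on the English training
# set, evaluate on all languages listed in lang_map_config.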
export FLAGS_eager_delete_tensor_gb=0
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
distributed_args="--node_ips $(hostname -i) --node_id 0 --current_node_ip $(hostname -i) --log_prefix mlqa_ --nproc_per_node 2 --selected_gpus 0,1"
python -u ./launch.py ${distributed_args} \
./run_mrc.py --use_cuda true \
--is_distributed true \
--do_train true \
--do_val true \
--do_test false \
--in_tokens false \
--batch_size 16 \
--train_set ./data/mlqa/train/train.en.json \
--dev_set ./data/mlqa/dev/dev.%s.json,./data/mlqa/test/test.%s.json \
--test_set ./data/mlqa/test/test.%s.json \
--test_save ./output/mlqa/test/test.%s.json \
--lang_map_config ./data/mlqa/lang_map.json \
--vocab_path ./configs/vocab.txt \
--piece_model_path ./configs/piece.model \
--ernie_config_path ./configs/base/ernie_config.json \
--init_pretraining_params "./configs/base/params" \
--checkpoints ./checkpoints \
--save_steps 10000 \
--weight_decay 0.0 \
--layerwise_lr_decay 0.8 \
--warmup_proportion 0.1 \
--use_fp16 false \
--validation_steps 100 \
--epoch 2 \
--doc_stride 128 \
--max_seq_len 384 \
--learning_rate 3e-4 \
--skip_steps 10 \
--num_iteration_per_drop_scope 1 \
--random_seed 1
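# Same sentence-encoding step as above, but with the ERNIE-M large config and parameters.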
export FLAGS_eager_delete_tensor_gb=0
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0
python -u ./encode_vector.py --use_cuda true \
--batch_size 16 \
--data_set ./data/xnli/test/test.en.tsv \
--output_dir ./output/xnli/test/test.en.tsv \
--vocab_path ./configs/vocab.txt \
--piece_model_path ./configs/piece.model \
--ernie_config_path ./configs/large/ernie_config.json \
--init_pretraining_params "./configs/large/params" \
--max_seq_len 256