Unverified commit 01f4302b authored by nbcc, committed by GitHub

Revert "add ernie-doc to ernie develop"

Parent 4b1b4ee3
# Virtualenv
/.venv/
/venv/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
# C extensions
*.so
# Distribution / packaging
/bin/
/build/
/develop-eggs/
/dist/
/eggs/
/lib/
/lib64/
/output/
/parts/
/sdist/
/var/
/*.egg-info/
/.installed.cfg
/*.egg
/.eggs
# AUTHORS and ChangeLog will be generated while packaging
/AUTHORS
/ChangeLog
# BCloud / BuildSubmitter
/build_submitter.*
/logger_client_log
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
.tox/
.coverage
.cache
.pytest_cache
nosetests.xml
coverage.xml
# Translations
*.mo
# Sphinx documentation
/docs/_build/
# user-defined
ernie_doc/output
ernie_doc/log
ernie_doc/data/imdb/*.txt
ernie_doc/data/imdb.debug
ernie_doc/tmpout
ernie_doc/py37
English | [简体中文](./README_zh.md)
## _ERNIE-Doc_: A Retrospective Long-Document Modeling Transformer
- [Framework](#framework)
- [Pre-trained Models](#Pre-trained-Models)
- [Fine-tuning Tasks](#Fine-tuning-Tasks)
* [Language Modeling](#Language-Modeling)
* [Long-Text Classification](#Long-Text-Classification)
* [Question Answering](#Question-Answering)
* [Information Extraction](#Information-Extraction)
* [Semantic Matching](#Semantic-Matching)
- [Usage](#Usage)
* [Install Paddle](#Install-PaddlePaddle)
* [Fine-tuning](#Fine-tuning)
- [Citation](#Citation)
For a technical description of the algorithm, please see our paper:
>[_**ERNIE-Doc: A Retrospective Long-Document Modeling Transformer**_](https://arxiv.org/abs/2012.15688)
>
>Siyu Ding\*, Junyuan Shang\*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>
>Preprint December 2020
>
>Accepted by **ACL-2021**
![ERNIE-Doc](https://img.shields.io/badge/Pretraining-Long%20Document%20Modeling-green) ![paper](https://img.shields.io/badge/Paper-ACL2021-yellow)
---
**ERNIE-Doc is a document-level language pretraining model**. Two well-designed techniques, the **retrospective feed mechanism** and the **enhanced recurrence mechanism**, give ERNIE-Doc a much longer effective context length and enable it to capture the contextual information of a complete document. ERNIE-Doc improves the state-of-the-art perplexity on WikiText-103 language modeling to 16.8. Moreover, it outperforms competitive pretraining models by a large margin on most language understanding tasks, such as text classification, question answering, information extraction and semantic matching.
## Framework
We proposed three novel methods to enhance the long document modeling ability of Transformers:
- **Retrospective Feed Mechanism**: Inspired by the human reading behavior of skimming a document first and then looking back upon it attentively, we design a retrospective feed mechanism in which segments from a document are fed twice as input. As a result, each segment in the retrospective phase could explicitly fuse the semantic information of the entire document learned in the skimming phase, which prevents context fragmentation.
- **Enhanced Recurrence Mechanism**, a drop-in replacement for Recurrence Transformers (e.g., Transformer-XL), obtained by changing the shifting-one-layer-downwards recurrence to same-layer recurrence. In this manner, the maximum effective context length can be expanded, and past higher-level representations can be exploited to enrich future lower-level representations. (A schematic sketch contrasting the two recurrence schemes follows this list.)
- **Segment-reordering Objective**, a document-aware task of predicting the correct order of the permuted set of segments of a document, to model the relationship among segments directly. This allows ERNIE-Doc to build full document representations for prediction.
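The difference between the two recurrence schemes can be summarized in a few lines. The sketch below is purely schematic and is not the released implementation; the function names, the toy `hiddens` structure, and the indexing convention are illustrative assumptions.

```python
# Schematic sketch only: which past hidden state supplies the memory when
# layer `n` of segment `tau` is computed. `hiddens[tau][n]` is a stand-in for
# the hidden states of layer n produced on segment tau.

def transformer_xl_memory(hiddens, tau, n):
    # Recurrence Transformers (e.g., Transformer-XL): layer n attends to the
    # previous segment's *lower* layer n-1, so the effective context grows by
    # only one segment per layer.
    return hiddens[tau - 1][n - 1]

def enhanced_recurrence_memory(hiddens, tau, n):
    # ERNIE-Doc: same-layer recurrence. Layer n reuses the previous segment's
    # layer-n states, so high-level representations of the distant past stay
    # reachable and the maximum effective context length is expanded.
    return hiddens[tau - 1][n]

def retrospective_feed(segments):
    # Retrospective feed mechanism: every segment is fed twice, first in a
    # skimming phase over the whole document and then in a retrospective
    # phase that can fuse the document-level information gathered earlier.
    return list(segments) + list(segments)

# Toy demo with labelled placeholders instead of real tensors.
hiddens = {1: {1: "h(seg1, layer1)", 2: "h(seg1, layer2)"}}
print(transformer_xl_memory(hiddens, tau=2, n=2))       # h(seg1, layer1)
print(enhanced_recurrence_memory(hiddens, tau=2, n=2))  # h(seg1, layer2)
print(retrospective_feed(["seg1", "seg2"]))             # skim pass + retrospective pass
```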
![framework](.meta/framework.png)
Illustrations of ERNIE-Doc and Recurrence Transformers, where models with three layers take as input a long document which is sliced into four segments.
## Pre-trained Models
We release checkpoints for the **ERNIE-Doc _base_en/zh_** and **ERNIE-Doc _large_en_** models.
- [**ERNIE-Doc _base_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _base_zh_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _large_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz) (_24-layer, 1024-hidden, 16-heads_)
## Fine-tuning Tasks
We compare the performance of [ERNIE-Doc](https://arxiv.org/abs/2012.15688) with the existing SOTA pre-training models (such as [Longformer](https://arxiv.org/abs/2004.05150), [BigBird](https://arxiv.org/abs/2007.14062), [ETC](https://arxiv.org/abs/2004.08483) and [ERNIE2.0](https://arxiv.org/abs/1907.12412)) for language modeling (**_WikiText-103_**) and document-level natural language understanding tasks, including long-text classification (**_IMDB_**, **_HYP_**, **_THUCNews_**, **_IFLYTEK_**), question answering (**_TriviaQA_**, **_HotpotQA_**, **_DRCD_**, **_CMRC2018_**, **_DuReader_**, **_C3_**), information extraction (**_OpenKPE_**) and semantic matching (**_CAIL2019-SCM_**).
### Language Modeling
- [WikiText-103](https://arxiv.org/abs/1609.07843)
| Model | Param. | PPL |
|--------------------------|:--------:|:------:|
| _Results of base models_ | | |
| LSTM | - | 48.7 |
| LSTM+Neural cache | - | 40.8 |
| GCNN-14 | - | 37.2 |
| QRNN | 151M | 33.0 |
| Transformer-XL Base | 151M | 24.0 |
| SegaTransformer-XL Base | 151M | 22.5 |
| **ERNIE-Doc** Base | 151M | **21.0** |
| _Results of large models_ | | |
| Adaptive Input | 247M | 18.7 |
| Transformer-XL Large | 247M | 18.3 |
| Compressive Transformer | 247M | 17.1 |
| SegaTransformer-XL Large | 247M | 17.1 |
| **ERNIE-Doc** Large | 247M | **16.8** |
### Long-Text Classification
- [IMDB reviews](http://ai.stanford.edu/~amaas/data/sentiment/index.html)
| Models | Acc. | F1 |
|-----------------|:----:|:----:|
| RoBERTa | 95.3 | 95.0 |
| Longformer | 95.7 | - |
| BigBird | - | 95.2 |
| **ERNIE-Doc** Base | **96.1** | **96.1** |
| XLNet-Large | 96.8 | - |
| **ERNIE-Doc** Large | **97.1** | **97.1** |
- [Hyperpartisan News Detection](https://pan.webis.de/semeval19/semeval19-web/)
| Models | F1 |
|-----------------|:----:|
| RoBERTa | 87.8 |
| Longformer | 94.8 |
| BigBird | 92.2 |
| **ERNIE-Doc** Base | **96.3** |
| **ERNIE-Doc** Large | **96.6** |
- [THUCNews (THU)](http://thuctc.thunlp.org/) and [IFLYTEK (IFK)](https://arxiv.org/abs/2004.05986)
| Models | THU Dev (Acc.) | THU Test (Acc.) | IFK Dev (Acc.) |
|-----------------|:--------:|:--------:|:--------:|
| BERT | 97.7 | 97.3 | 60.3 |
| BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
| RoBERTa-wwm-ext | - | - | 60.3 |
| ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
| ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
| **ERNIE-Doc** | **98.3** | **97.7** | **62.4** |
### Question Answering
- [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) on dev-set
| Models | F1 |
|-----------------|:----:|
| RoBERTa | 74.3 |
| Longformer | 75.2 |
| BigBird | 79.5 |
| **ERNIE-Doc** Base | **80.1** |
| Longformer Large | 77.8 |
| BigBird Large | - |
| **ERNIE-Doc** Large | **82.5** |
- [HotpotQA](https://hotpotqa.github.io/) on dev-set
| Models | Span-F1 | Supp.-F1 | Joint-F1 |
|-----------------|:----:|:----:|:----:|
| RoBERTa | 73.5 | 83.4 | 63.5 |
| Longformer | 74.3 | 84.4 | 64.4 |
| BigBird | 75.5 | **87.1** | 67.8 |
| **ERNIE-Doc** Base | **79.4** | 86.3 | **70.5** |
| Longformer Large | 81.0 | 85.8 | 71.4 |
| BigBird Large | 81.3 | **89.4** | - |
| **ERNIE-Doc** Large | **82.2** | 87.6 | **73.7** |
- [DRCD](https://arxiv.org/abs/1806.00920), [CMRC2018](https://arxiv.org/abs/1810.07366), [DuReader](https://arxiv.org/abs/1711.05073), [C3](https://arxiv.org/abs/1904.09679)
| Models | DRCD dev (EM/F1) | DRCD test (EM/F1) | CMRC2018 dev (EM/F1) | DuReader dev (EM/F1) | C3 dev (Acc.) | C3 test (Acc.) |
|-----------------|---------------|---------------|---------------|---------------|----------|----------|
| BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
| BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
| RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
| MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
| XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
| ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
| ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
| **ERNIE-Doc** | **90.5/95.2** | **90.5/95.1** | **76.1/91.6** | **65.8/77.9** | **76.5** | **76.5** |
### Information Extraction
- [Open Domain Web Keyphrase Extraction](https://www.aclweb.org/anthology/D19-1521/)
| Models | F1@1 | F1@3 | F1@5 |
|-----------|:----:|:----:|:----:|
| BLING-KPE | 26.7 | 29.2 | 20.9 |
| JointKPE | 39.1 | 39.8 | 33.8 |
| ETC | - | 40.2 | - |
| ERNIE-Doc | **40.2** | **40.5** | **34.4** |
### Semantic Matching
- [CAIL2019-SCM](https://arxiv.org/abs/1911.08962)
| Models | Dev (Acc.) | Test (Acc.) |
|-----------|:-------------:|:-------------:|
| BERT | 61.9 | 67.3 |
| ERNIE 2.0 | 64.9 | 67.9 |
| ERNIE-Doc | **65.6** | **68.8** |
## Usage
### Install PaddlePaddle
This code base has been tested with PaddlePaddle (version >= 1.8) under Python 3. The other dependencies of ERNIE-Doc are listed in `requirements.txt`; you can install them with
```script
pip install -r requirements.txt
```
### Fine-tuning
We release the fine-tuning code for English and Chinese classification tasks and Chinese question answering tasks. For example, you can fine-tune the **ERNIE-Doc** base model on the IMDB, IFLYTEK and DuReader datasets by running
```shell
sh script/run_imdb.sh
sh script/run_iflytek.sh
sh script/run_dureader.sh
```
[Preprocessing code for IMDB dataset](./ernie_doc/data/imdb/README.md)
The log of training and the evaluation results are in `log/job.log.0`.
**Notice**: The actual total batch size is equal to `configured batch size * number of used gpus` (for example, a configured batch size of 8 on 4 GPUs gives a total batch size of 32).
## Citation
You can cite the paper as below:
```
@article{ding2020ernie,
title={ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```
[English](./README.md) | 简体中文
## _ERNIE-Doc_: A Retrospective Long-Document Modeling Transformer
- [模型框架](#模型框架)
- [预训练模型](#预训练模型)
- [下游任务](#下游任务)
* [语言建模](#语言建模)
* [篇章级分类](#篇章级分类)
* [阅读理解](#阅读理解)
* [信息抽取](#信息抽取)
* [语义匹配](#语义匹配)
- [使用说明](#使用说明)
* [安装飞桨](#安装飞桨)
* [运行微调](#运行微调)
- [引用](#引用)
关于算法的详细描述,请参见我们的论文:
>[_**ERNIE-Doc: A Retrospective Long-Document Modeling Transformer**_](https://arxiv.org/abs/2012.15688)
>
>Siyu Ding\*, Junyuan Shang\*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>
>Preprint December 2020
>
>Accepted by **ACL-2021**
![ERNIE-Doc](https://img.shields.io/badge/预训练-长文本建模-green) ![paper](https://img.shields.io/badge/论文-ACL2021-yellow)
---
**ERNIE-Doc 是面向篇章级长文本建模的预训练-微调框架**,ERNIE-Doc 受到人类先粗读后精读的阅读方式启发,提出了**回顾式建模机制**和**增强记忆机制**,突破了 Transformer 在文本长度上的建模瓶颈。ERNIE-Doc 在业界首次实现了全篇章级无限长文本的双向建模,在包括阅读理解、信息抽取、篇章分类、语言模型在内的13个权威中英文长文本语言理解任务上取得了SOTA效果。
## 模型框架
我们提出了三种方法解决长文本建模问题:
- **回顾式建模机制(Retrospective Feed Mechanism)**: 通过将文本片段重复输入两次,使得在回顾阶段的每一个文本片段可以双向建模并利用在粗读阶段获取到的全篇章语义信息。
- **增强记忆机制(Enhanced Recurrence Mechanism)**: 通过将 Recurrence Transformer 模型(Transformer-XL 等)的循环方式改进为同层循环,可建模无限长文本。
- **段落重排序目标(Segment-reordering Objective)**: 该预训练目标通过让模型学习篇章级本文段落间的关系,使得模型可以对篇章整体信息进行建模。
下图展示了ERNIE-Doc 与Recurrence Transformer在3层网络,4个片段输入情况下的建模方式与建模长度的对比。
![framework](.meta/framework.png)
## 预训练模型
我们发布了 **ERNIE-Doc _base_** 中英文模型和 **ERNIE-Doc _large_** 英文模型。
- [**ERNIE-Doc _base_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _base_zh_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _large_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz) (_24-layer, 1024-hidden, 16-heads_)
## 下游任务
我们在语言建模、篇章级分类、阅读理解以及信息抽取等任务上选取了广泛使用的数据集进行模型效果验证,并且与当前效果最优的模型([Longformer](https://arxiv.org/abs/2004.05150)、[BigBird](https://arxiv.org/abs/2007.14062)、[ETC](https://arxiv.org/abs/2004.08483)、[ERNIE2.0](https://arxiv.org/abs/1907.12412) 等)进行对比。
### 语言建模
- [WikiText-103](https://arxiv.org/abs/1609.07843)
| 模型 | Param. | PPL |
|--------------------------|:--------:|:------:|
| _Results of base models_ | | |
| LSTM | - | 48.7 |
| LSTM+Neural cache | - | 40.8 |
| GCNN-14 | - | 37.2 |
| QRNN | 151M | 33.0 |
| Transformer-XL Base | 151M | 24.0 |
| SegaTransformer-XL Base | 151M | 22.5 |
| **ERNIE-Doc** Base | 151M | **21.0** |
| _Results of large models_ | | |
| Adaptive Input | 247M | 18.7 |
| Transformer-XL Large | 247M | 18.3 |
| Compressive Transformer | 247M | 17.1 |
| SegaTransformer-XL Large | 247M | 17.1 |
| **ERNIE-Doc** Large | 247M | **16.8** |
### 篇章级分类
- [IMDB reviews](http://ai.stanford.edu/~amaas/data/sentiment/index.html)
| 模型 | Acc. | F1 |
|-----------------|:----:|:----:|
| RoBERTa | 95.3 | 95.0 |
| Longformer | 95.7 | - |
| BigBird | - | 95.2 |
| **ERNIE-Doc** Base | **96.1** | **96.1** |
| XLNet-Large | 96.8 | - |
| **ERNIE-Doc** Large | **97.1** | **97.1** |
- [Hyperpartisan News Detection](https://pan.webis.de/semeval19/semeval19-web/)
| 模型 | F1 |
|-----------------|:----:|
| RoBERTa | 87.8 |
| Longformer | 94.8 |
| BigBird | 92.2 |
| **ERNIE-Doc** Base | **96.3** |
| **ERNIE-Doc** Large | **96.6** |
- [THUCNews(THU)](http://thuctc.thunlp.org/)、[IFLYTEK(IFK)](https://arxiv.org/abs/2004.05986)
| 模型 | THU Dev (Acc.) | THU Test (Acc.) | IFK Dev (Acc.) |
|-----------------|:--------:|:--------:|:--------:|
| BERT | 97.7 | 97.3 | 60.3 |
| BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
| RoBERTa-wwm-ext | - | - | 60.3 |
| ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
| ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
| **ERNIE-Doc** | **98.3** | **97.7** | **62.4** |
### 阅读理解
- [TriviaQA](http://nlp.cs.washington.edu/triviaqa/)验证集效果
| 模型 | F1 |
|-----------------|:----:|
| RoBERTa | 74.3 |
| Longformer | 75.2 |
| BigBird | 79.5 |
| **ERNIE-Doc** Base | **80.1** |
| Longformer Large | 77.8 |
| BigBird Large | - |
| **ERNIE-Doc** Large | **82.5** |
- [HotpotQA](https://hotpotqa.github.io/)验证集效果
| 模型 | Span-F1 | Supp.-F1 | Joint-F1 |
|-----------------|:----:|:----:|:----:|
| RoBERTa | 73.5 | 83.4 | 63.5 |
| Longformer | 74.3 | 84.4 | 64.4 |
| BigBird | 75.5 | **87.1** | 67.8 |
| **ERNIE-Doc** Base | **79.4** | 86.3 | **70.5** |
| Longformer Large | 81.0 | 85.8 | 71.4 |
| BigBird Large | 81.3 | **89.4** | - |
| **ERNIE-Doc** Large | **82.2** | 87.6 | **73.7** |
- [DRCD](https://arxiv.org/abs/1806.00920)、[CMRC2018](https://arxiv.org/abs/1810.07366)、[DuReader](https://arxiv.org/abs/1711.05073)、[C3](https://arxiv.org/abs/1904.09679)
| 模型 | DRCD dev (EM/F1) | DRCD test (EM/F1) | CMRC2018 dev (EM/F1) | DuReader dev (EM/F1) | C3 dev (Acc.) | C3 test (Acc.) |
|-----------------|---------------|---------------|---------------|---------------|----------|----------|
| BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
| BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
| RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
| MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
| XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
| ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
| ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
| **ERNIE-Doc** | **90.5/95.2** | **90.5/95.1** | **76.1/91.6** | **65.8/77.9** | **76.5** | **76.5** |
### 信息抽取
- [Open Domain Web Keyphrase Extraction (OpenKP)](https://www.aclweb.org/anthology/D19-1521/)
| 模型 | F1@1 | F1@3 | F1@5 |
|-----------|:----:|:----:|:----:|
| BLING-KPE | 26.7 | 29.2 | 20.9 |
| JointKPE | 39.1 | 39.8 | 33.8 |
| ETC | - | 40.2 | - |
| ERNIE-Doc | **40.2** | **40.5** | **34.4** |
### 语义匹配
- [CAIL2019-SCM](https://arxiv.org/abs/1911.08962)
| 模型 | Dev (Acc.) | Test (Acc.) |
|-----------|:-------------:|:-------------:|
| BERT | 61.9 | 67.3 |
| ERNIE 2.0 | 64.9 | 67.9 |
| ERNIE-Doc | **65.6** | **68.8** |
## 使用说明
### 安装飞桨
我们的代码基于 Paddle(version>=1.8),推荐使用python3运行。 ERNIE-Doc 依赖的其他模块也列举在 `requirements.txt`,可以通过下面的指令安装:
```script
pip install -r requirements.txt
```
### 运行微调
我们开源了中英文分类任务以及中文阅读理解任务的微调代码,运行以下脚本即可进行实验:
```shell
sh script/run_imdb.sh # 英文分类任务
sh script/run_iflytek.sh # 中文分类任务
sh script/run_dureader.sh # 中文阅读理解任务
```
[imdb数据处理说明](./ernie_doc/data/imdb/README.md)
具体微调参数均可在上述脚本中进行修改,训练和评估的日志保存在 `log/job.log.0` 中。
**注意**: 训练时实际的总 batch size 等于 `配置的 batch size * GPU 卡数`(例如配置 batch size 为 8、使用 4 卡时,总 batch size 为 32)。
## 引用
可以按下面的格式引用我们的论文:
```
@article{ding2020ernie,
title={ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```
{
"attention_probs_dropout_prob": 0.0,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 768,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"sent_type_vocab_size": 4,
"task_type_vocab_size": 3,
"vocab_size": 50265,
"memory_len": 128,
"epsilon": 1e-12
}
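Each released checkpoint ships a JSON config like the one above. As a hedged illustration (the unpacked archive layout and file path below are assumptions), such a file can be loaded with the `ErnieConfig` helper defined in `model/static/ernie.py` further down in this change:

```python
from model.static.ernie import ErnieConfig

# Hypothetical path inside an unpacked checkpoint archive.
config = ErnieConfig("model-ernie-doc-base-en/ernie_config.json")
config.print_config()           # prints every hyper-parameter key/value
print(config["memory_len"])     # 128: length of the recurrence memory
```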
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"sent_type_vocab_size": 4,
"task_type_vocab_size": 3,
"vocab_size": 28000,
"memory_len": 128,
"epsilon": 1e-12
}
[
{
"prob": 1,
"data_func": "multi_sent_sorted",
"task_name": "multi_sent_sorted",
"valid_filelist": "./package/valid_filelist_multi_sort_sorted",
"train_filelist": "./package/train_filelist_multi_sort_sorted",
"loss_weight": 1.0,
"num_labels": 33,
"lm_weight": 1.0
}
]
{
"attention_probs_dropout_prob": 0.0,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 1024,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"sent_type_vocab_size": 4,
"task_type_vocab_size": 3,
"vocab_size": 50265,
"memory_len": 128,
"epsilon": 1e-12
}
## 下载官方数据
http://ai.stanford.edu/~amaas/data/sentiment/index.html
## 运行预处理脚本
```shell
python multi_files_to_one.py
```
生成train.txt与test.txt文件至该文件夹下
import logging
import os
import numpy as np
from collections import namedtuple
from tqdm import tqdm
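# Flattens the IMDB `pos/` and `neg/` review folders (aclImdb layout) into a
# single tab-separated file with the header "qid\tlabel\tscore\ttext_a".
# Labels are 0 for negative and 1 for positive; the rating score is parsed
# from the `<id>_<score>.txt` file name.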
def read_files(dir_path):
"""
:param dir_path
"""
examples = []
Example = namedtuple('Example', ['qid', 'text_a', 'label', 'score'])
def _read_files(dir_p, label):
logging.info('loading data from %s' % dir_p)
data_files = os.listdir(dir_p)
desc = "loading " + dir_p
for f_idx, data_file in tqdm(enumerate(data_files), desc=desc):
file_path = os.path.join(dir_p, data_file)
qid, score = data_file.split('_')
score = score.split('.')[0]
with open(file_path, 'r') as f:
doc = []
for line in f:
line = line.strip().replace('<br /><br />', ' ')
doc.append(line)
doc_text = ' '.join(doc)
example = Example(
qid=len(examples)+1,
text_a=doc_text,
label=label,
score=score
)
examples.append(example)
neg_dir = os.path.join(dir_path, 'neg')
pos_dir = os.path.join(dir_path, 'pos')
_read_files(neg_dir, label=0)
_read_files(pos_dir, label=1)
logging.info('loading data finished')
return examples
def write_to_one(dir, o_file_name):
    examples = read_files(dir)
    logging.info('ex nums:%d' % (len(examples)))
    with open(o_file_name, 'w') as fout:
        fout.write("qid\tlabel\tscore\ttext_a\n")
        for ex in examples:
try:
fout.write("{}\t{}\t{}\t{}\n".format(ex.qid, ex.label, ex.score, ex.text_a.replace('\t', '')))
except Exception as e:
print(ex.qid, ex.text_a, ex.label, ex.score)
raise e
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    write_to_one("train", 'train.txt')
    write_to_one("test", "test.txt")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import numpy as np
import collections
from collections import namedtuple
import paddle.fluid as fluid
from model.static.ernie import ErnieDocModel
from utils.multi_process_eval import MultiProcessEvalForErnieDoc
from utils.metrics import Acc
def create_model(args, ernie_config, mem_len=128, is_infer=False):
"""create model for classifier"""
shapes = [[-1, args.max_seq_len, 1], [-1, 2 * args.max_seq_len + mem_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, 1], [-1, 1], [-1, 1], []]
dtypes = ['int64', 'int64', 'int64', 'float32', 'int64', 'int64', 'int64', 'int64']
names = ["src_ids", "pos_ids", "task_ids", "input_mask", "labels", "qids", "gather_idx", "need_cal_loss"]
inputs = []
for shape, dtype, name in zip(shapes, dtypes, names):
inputs.append(fluid.layers.data(name=name, shape=shape, dtype=dtype, append_batch_size=False))
src_ids, pos_ids, task_ids, input_mask, labels, qids, \
gather_idx, need_cal_loss = inputs
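    # Note: `gather_idx` selects which instances in the batch should contribute
    # to the loss/metrics at this step, and `need_cal_loss` is a scalar 0/1
    # flag that zeroes the loss for feeds that should be skipped (this reading
    # is inferred from how the two tensors are used below).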
pyreader = fluid.io.DataLoader.from_generator(
feed_list=inputs,
capacity=70, iterable=False)
ernie_doc = ErnieDocModel(
src_ids=src_ids,
position_ids=pos_ids,
task_ids=task_ids,
input_mask=input_mask,
config=ernie_config,
number_instance=args.batch_size,
rel_pos_params_sharing=args.rel_pos_params_sharing,
use_vars=args.use_vars)
mems, new_mems = ernie_doc.get_mem_output()
cls_feats = ernie_doc.get_pooled_output()
checkpoints = ernie_doc.get_checkpoints()
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=cls_feats,
size=args.num_labels,
param_attr=fluid.ParamAttr(
name="cls_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
if is_infer:
probs = fluid.layers.softmax(logits)
feed_targets_name = [
src_ids.name, pos_ids.name, task_ids.name, input_mask.name
]
return pyreader, probs, feed_targets_name
# filter
qids, logits, labels = list(map(lambda x: fluid.layers.gather(x, gather_idx), [qids, logits, labels]))
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
num_seqs = fluid.layers.create_tensor(dtype='int32')
accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
loss, num_seqs, accuracy = list(map(lambda x: x * need_cal_loss, [loss, num_seqs, accuracy]))
graph_vars = collections.OrderedDict()
fetch_names = ['loss', 'accuracy', 'probs', 'labels', 'num_seqs', 'qids', 'need_cal_loss']
fetch_vars = [loss, accuracy, probs, labels, num_seqs, qids, need_cal_loss]
for name, var in zip(fetch_names, fetch_vars):
graph_vars[name] = var
for k, v in graph_vars.items():
v.persistable = True
mems_vars = {'mems': mems, 'new_mems': new_mems}
return pyreader, graph_vars, checkpoints, mems_vars
def evaluate(exe,
program,
pyreader,
graph_vars,
mems_vars,
tower_mems_np,
phase,
steps=None,
trainers_id=None,
trainers_num=None,
scheduled_lr=None,
use_vars=False):
"""evaluate interface"""
fetch_names = [k for k, v in graph_vars.items()]
fetch_list = [v for k, v in graph_vars.items()]
if phase == "train":
fetch_names += ['scheduled_lr']
fetch_list += [scheduled_lr]
if not use_vars:
feed_dict = {}
for m, m_np in zip(mems_vars['mems'], tower_mems_np):
feed_dict[m.name] = m_np
fetch_list += mems_vars['new_mems']
fetch_names += [m.name for m in mems_vars['new_mems']]
if phase == "train":
if use_vars:
outputs = exe.run(fetch_list=fetch_list, program=program, use_program_cache=True)
else:
outputs = exe.run(feed=feed_dict, fetch_list=fetch_list, program=program, use_program_cache=True)
tower_mems_np = outputs[-len(mems_vars['new_mems']):]
outputs_dict = {}
for var_name, output_var in zip(fetch_names, outputs):
outputs_dict[var_name] = output_var
ret = {"loss": np.mean(outputs_dict['loss']),
"accuracy": np.mean(outputs_dict['accuracy']),
"learning_rate": np.mean(outputs_dict['scheduled_lr']),
"tower_mems_np": tower_mems_np}
return ret
if phase == "eval" or phase == "test":
pyreader.start()
qids, labels, scores = [], [], []
time_begin = time.time()
all_results = []
total_cost, total_num_seqs= 0.0, 0.0
RawResult = namedtuple("RawResult", ["unique_id", "prob", "label"])
while True:
try:
if use_vars:
outputs = exe.run(
program=program, fetch_list=fetch_list, use_program_cache=True)
else:
feed_dict = {}
for m, m_np in zip(mems_vars['mems'], tower_mems_np):
feed_dict[m.name] = m_np
outputs = exe.run(feed=feed_dict, fetch_list=fetch_list, program=program, use_program_cache=True)
tower_mems_np = outputs[-len(mems_vars['new_mems']):]
outputs = outputs[:-len(mems_vars['new_mems'])]
np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids, np_need_cal_loss = outputs
if int(np_need_cal_loss) == 1:
total_cost += np.sum(np_loss * np_num_seqs)
total_num_seqs += np.sum(np_num_seqs)
for idx in range(np_qids.shape[0]):
if len(all_results) % 1000 == 0 and len(all_results):
print("processining example: %d" % len(all_results))
qid_each = int(np_qids[idx])
probs_each = [float(x) for x in np_probs[idx].flat]
label_each = int(np_labels[idx])
all_results.append(
RawResult(
unique_id=qid_each,
prob=probs_each,
label=label_each))
except fluid.core.EOFException:
pyreader.reset()
break
time_end = time.time()
output_path = "./tmpout"
mul_pro_test = MultiProcessEvalForErnieDoc(output_path, phase, trainers_num, trainers_id)
is_print = True
if mul_pro_test.dev_count > 1:
is_print = False
mul_pro_test.write_result(all_results)
if trainers_id == 0:
is_print = True
all_results = mul_pro_test.concat_result(RawResult)
if is_print:
num_seqs, all_labels, all_probs = mul_pro_test.write_predictions(all_results)
acc_func = Acc()
accuracy = acc_func.eval([all_probs, all_labels])
time_cost = time_end - time_begin
print("[%d_%s evaluation] ave loss: %f, ave acc: %f, data_num: %d, elapsed time: %f s"
% (steps, phase, total_cost / total_num_seqs, accuracy, num_seqs, time_cost))
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for MRC."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import numpy as np
from collections import namedtuple
import paddle.fluid as fluid
from model.static.ernie import ErnieDocModel
from utils.metrics import EM_AND_F1
from reader.tokenization import BasicTokenizer
from utils.multi_process_eval import MultiProcessEvalForMrc
def create_model(args, ernie_config, mem_len=128, is_infer=False):
"""create model for mrc"""
shapes = [[-1, args.max_seq_len, 1], [-1, 2 * args.max_seq_len + mem_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, 1], [-1, 1], [-1, 1], [-1, 1], []]
dtypes = ['int64', 'int64', 'int64', 'float32', 'int64', 'int64', 'int64', 'int64', 'int64']
names = ["src_ids", "pos_ids", "task_ids", "input_mask", "start_positions", \
"end_positions", "qids", "gather_idx", "need_cal_loss"]
inputs = []
for shape, dtype, name in zip(shapes, dtypes, names):
inputs.append(fluid.layers.data(name=name, shape=shape, dtype=dtype, append_batch_size=False))
src_ids, pos_ids, task_ids, input_mask, start_positions, \
end_positions, qids, gather_idx, need_cal_loss = inputs
pyreader = fluid.io.DataLoader.from_generator(
feed_list=inputs,
capacity=70, iterable=False)
ernie_doc = ErnieDocModel(
src_ids=src_ids,
position_ids=pos_ids,
task_ids=task_ids,
input_mask=input_mask,
config=ernie_config,
number_instance=args.batch_size,
rel_pos_params_sharing=args.rel_pos_params_sharing,
use_vars=args.use_vars)
enc_out = ernie_doc.get_sequence_output()
checkpoints = ernie_doc.get_checkpoints()
mems, new_mems = ernie_doc.get_mem_output()
enc_out = fluid.layers.dropout(
x=enc_out,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=enc_out,
size=2,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name="cls_mrc_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_mrc_out_b", initializer=fluid.initializer.Constant(0.)))
if is_infer:
probs = fluid.layers.softmax(logits)
feed_targets_name = [
src_ids.name, pos_ids.name, task_ids.name, input_mask.name
]
return pyreader, probs, feed_targets_name
logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])
start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)
filter_output = list(map(lambda x: fluid.layers.gather(x, gather_idx), \
[qids, start_logits, end_logits, start_positions, end_positions]))
qids, start_logits, end_logits, start_positions, end_positions = filter_output
def compute_loss(logits, positions):
"""compute loss"""
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=positions)
loss = fluid.layers.mean(x=loss)
return loss
start_loss = compute_loss(start_logits, start_positions)
end_loss = compute_loss(end_logits, end_positions)
loss = (start_loss + end_loss) / 2.0
loss *= need_cal_loss
mems_vars = {'mems': mems, 'new_mems': new_mems}
graph_vars = {
"loss": loss,
"qids": qids,
"start_logits": start_logits,
"end_logits": end_logits,
"need_cal_loss": need_cal_loss
}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars, checkpoints, mems_vars
def evaluate(exe,
program,
pyreader,
graph_vars,
mems_vars,
tower_mems_np,
phase,
steps=None,
trainers_id=None,
trainers_num=None,
scheduled_lr=None,
use_vars=False,
examples=None,
features=None,
args=None):
"""evaluate interface"""
fetch_names = [k for k, v in graph_vars.items()]
fetch_list = [v for k, v in graph_vars.items()]
if phase == "train":
fetch_names += ['scheduled_lr']
fetch_list += [scheduled_lr]
if not use_vars:
feed_dict = {}
for m, m_np in zip(mems_vars['mems'], tower_mems_np):
feed_dict[m.name] = m_np
fetch_list += mems_vars['new_mems']
fetch_names += [m.name for m in mems_vars['new_mems']]
if phase == "train":
if use_vars:
outputs = exe.run(fetch_list=fetch_list, program=program, use_program_cache=True)
else:
outputs = exe.run(feed=feed_dict, fetch_list=fetch_list, program=program, use_program_cache=True)
tower_mems_np = outputs[-len(mems_vars['new_mems']):]
outputs_dict = {}
for var_name, output_var in zip(fetch_names, outputs):
outputs_dict[var_name] = output_var
ret = {"loss": np.mean(outputs_dict['loss']),
"learning_rate": np.mean(outputs_dict['scheduled_lr']),
"tower_mems_np": tower_mems_np}
return ret
if phase == "eval" or phase == "test":
output_dir = args.checkpoints
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_prediction_file = os.path.join(output_dir, phase + "_predictions.json")
output_nbest_file = os.path.join(output_dir, phase + "_nbest_predictions.json")
RawResult = namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
pyreader.start()
all_results = []
time_begin = time.time()
while True:
try:
if use_vars:
outputs = exe.run(
program=program, fetch_list=fetch_list, use_program_cache=True)
else:
feed_dict = {}
for m, m_np in zip(mems_vars['mems'], tower_mems_np):
feed_dict[m.name] = m_np
outputs = exe.run(feed=feed_dict, fetch_list=fetch_list, program=program, use_program_cache=True)
tower_mems_np = outputs[-len(mems_vars['new_mems']):]
outputs = outputs[:-len(mems_vars['new_mems'])]
np_loss, np_qids, np_start_logits, np_end_logits, np_need_cal_loss = outputs
if int(np_need_cal_loss) == 1:
for idx in range(np_qids.shape[0]):
if len(all_results) % 1000 == 0:
print("Processing example: %d" % len(all_results))
qid_each = int(np_qids[idx])
start_logits_each = [float(x) for x in np_start_logits[idx].flat]
end_logits_each = [float(x) for x in np_end_logits[idx].flat]
all_results.append(
RawResult(
unique_id=qid_each,
start_logits=start_logits_each,
end_logits=end_logits_each))
except fluid.core.EOFException:
pyreader.reset()
break
time_end = time.time()
output_path = "./tmpout"
tokenizer = BasicTokenizer(do_lower_case=args.do_lower_case)
mul_pro_test = MultiProcessEvalForMrc(output_path, phase, trainers_num,
trainers_id, tokenizer)
is_print = True
if mul_pro_test.dev_count > 1:
is_print = False
mul_pro_test.write_result(all_results)
if trainers_id == 0:
is_print = True
all_results = mul_pro_test.concat_result(RawResult)
if is_print:
mul_pro_test.write_predictions(examples,
features,
all_results,
args.n_best_size,
args.max_answer_length,
args.do_lower_case,
mul_pro_test.output_prediction_file,
mul_pro_test.output_nbest_file)
if phase == "eval":
data_file = args.dev_set
elif phase == "test":
data_file = args.test_set
elapsed_time = time_end - time_begin
em_and_f1 = EM_AND_F1()
em, f1, avg, total = em_and_f1.eval_file(data_file, mul_pro_test.output_prediction_file)
print("[%d_%s evaluation] em: %f, f1: %f, avg: %f, questions: %d, elapsed time: %f"
% (steps, phase, em, f1, avg, total, elapsed_time))
import sys
import subprocess
import os
import six
import copy
import argparse
from utils.args import ArgumentGroup, print_arguments
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
multip_g = ArgumentGroup(parser, "multiprocessing",
"start paddle training using multi-processing mode.")
multip_g.add_arg("node_ips", str, None,
"paddle trainer ips")
multip_g.add_arg("node_id", int, None,
"the trainer id of the node for multi-node distributed training.")
multip_g.add_arg("print_config", bool, True,
"print the config of multi-processing mode.")
multip_g.add_arg("current_node_ip", str, None,
"the ip of current node.")
multip_g.add_arg("split_log_path", str, "log",
"log path for each trainer.")
multip_g.add_arg("nproc_per_node", int, 8,
"the number of process to use on each node.")
multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7",
"the gpus selected to use.")
multip_g.add_arg("training_script", str, None, "the program/script to be lauched "
"in parallel followed by all the arguments", positional_arg=True)
multip_g.add_arg("training_script_args", str, None,
"training script args", positional_arg=True, nargs=argparse.REMAINDER)
# yapf: enable
def start_procs(args):
procs = []
log_fns = []
default_env = os.environ.copy()
node_id = args.node_id
node_ips = [x.strip() for x in args.node_ips.split(',')]
current_ip = args.current_node_ip
num_nodes = len(node_ips)
selected_gpus = [x.strip() for x in args.selected_gpus.split(',')]
selected_gpu_num = len(selected_gpus)
all_trainer_endpoints = ""
for ip in node_ips:
for i in range(args.nproc_per_node):
if all_trainer_endpoints != "":
all_trainer_endpoints += ","
all_trainer_endpoints += "%s:617%d" % (ip, i)
nranks = num_nodes * args.nproc_per_node
    gpus_per_proc = selected_gpu_num % args.nproc_per_node
if gpus_per_proc == 0:
gpus_per_proc = selected_gpu_num // args.nproc_per_node
else:
gpus_per_proc = selected_gpu_num // args.nproc_per_node + 1
selected_gpus_per_proc = [selected_gpus[i:i + gpus_per_proc] for i in range(0, len(selected_gpus), gpus_per_proc)]
if args.print_config:
print("all_trainer_endpoints: ", all_trainer_endpoints,
", node_id: ", node_id,
", current_ip: ", current_ip,
", num_nodes: ", num_nodes,
", node_ips: ", node_ips,
", gpus_per_proc: ", gpus_per_proc,
", selected_gpus_per_proc: ", selected_gpus_per_proc,
", nranks: ", nranks)
current_env = copy.copy(default_env)
procs = []
log_fns = []
for i in range(0, args.nproc_per_node):
trainer_id = node_id * args.nproc_per_node + i
current_env.update({
"FLAGS_selected_gpus": "%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]),
"PADDLE_TRAINER_ID" : "%d" % trainer_id,
"PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i),
"PADDLE_TRAINERS_NUM": "%d" % nranks,
"PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints,
"PADDLE_NODES_NUM": "%d" % num_nodes
})
cmd = [sys.executable, "-u",
args.training_script] + args.training_script_args
if args.split_log_path:
fn = open("%s/job.log.%d" % (args.split_log_path, trainer_id), "w")
log_fns.append(fn)
process = subprocess.Popen(cmd, env=current_env, stdout=fn, stderr=fn)
else:
process = subprocess.Popen(cmd, env=current_env)
procs.append(process)
for i in range(len(procs)):
try:
procs[i].communicate()
procs[i].terminate()
if len(log_fns) > 0:
log_fns[i].close()
except:
raise subprocess.CalledProcessError(returncode=procs[i].returncode,
cmd=procs[i].args)
def main(args):
if args.print_config:
print_arguments(args)
start_procs(args)
if __name__ == "__main__":
args = parser.parse_args()
main(args)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.collective import fleet
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter(
)
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
def exclude_from_weight_decay(name):
"""exclude_from_weight_decay"""
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
def layer_decay(param, param_last, learning_rate, decay_rate, n_layers):
"""layerwise learning rate decay"""
delta = param - param_last
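    # `delta` is the update the optimizer just applied to this parameter;
    # rescaling it by `ratio` below (param_last + ratio * delta) is equivalent
    # to shrinking the effective learning rate of lower layers by powers of
    # `decay_rate`.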
if "encoder_layer" in param.name and param.name.index("encoder_layer")==0:
print(param.name)
layer = int(param.name.split("_")[2])
ratio = decay_rate ** (n_layers + 1 - layer)
ratio = decay_rate ** (n_layers - layer)
param_update = param + (ratio - 1) * delta
elif "embedding" in param.name:
ratio = decay_rate ** (n_layers + 2)
ratio = decay_rate ** (n_layers + 1)
param_update = param + (ratio - 1) * delta
else:
param_update = None
return param_update
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
use_amp=False,
init_loss_scaling=32768,
layer_decay_rate=0,
n_layers=12,
dist_strategy=None):
"""optimization"""
grad_clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(learning_rate ** 2)),
warmup_steps)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
else:
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr, epsilon=1e-06, grad_clip=grad_clip)
optimizer._learning_rate_map[fluid.default_main_program()] = scheduled_lr
loss_scaling = fluid.layers.create_global_var(
name=fluid.unique_name.generate("loss_scaling"),
shape=[1],
value=init_loss_scaling,
dtype='float32',
persistable=True)
param_list = dict()
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
if dist_strategy:
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
loss = fluid.layers.mean(loss)
_, param_grads = optimizer.minimize(loss)
if use_amp:
loss_scaling = optimizer._optimizer.get_loss_scaling()
if layer_decay_rate > 0:
for param, grad in param_grads:
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("layer_decay"):
param_decay = layer_decay(param, param_list[param.name], \
scheduled_lr, layer_decay_rate, n_layers)
if param_decay:
fluid.layers.assign(output=param, input=param_decay)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr, loss_scaling
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ernie Doc model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import six
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from model.static.transformer_encoder import encoder, pre_process_layer
class ErnieConfig(object):
"""ErnieConfig"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing Ernie model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class ErnieDocModel(object):
def __init__(self,
src_ids,
position_ids,
task_ids,
input_mask,
config,
number_instance,
weight_sharing=True,
rel_pos_params_sharing=False,
use_vars=False):
"""
Fundamental pretrained Ernie Doc model
"""
self._emb_size = config['hidden_size']
self._n_layer = config['num_hidden_layers']
self._n_head = config['num_attention_heads']
self._voc_size = config['vocab_size']
self._max_position_seq_len = config['max_position_embeddings']
self._task_types = config['task_type_vocab_size']
self._hidden_act = config['hidden_act']
self._memory_len = config["memory_len"]
self._epsilon = config["epsilon"]
self._prepostprocess_dropout = config['hidden_dropout_prob']
self._attention_dropout = config['attention_probs_dropout_prob']
self._number_instance = number_instance
self._weight_sharing = weight_sharing
self._rel_pos_params_sharing = rel_pos_params_sharing
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._task_emb_name = "task_embedding"
self._emb_dtype = "float32"
self._encoder_checkpints = []
self._batch_size = layers.slice(layers.shape(src_ids), axes=[0], starts=[0], ends=[1])
        # Initialize all weights by truncated normal initializer, and all biases
# will be initialized by constant zero by default.
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
self._use_vars = use_vars
self._init_memories()
self._build_model(src_ids, position_ids, task_ids, input_mask)
def _init_memories(self):
"""Initialize memories"""
self.memories = []
for i in range(self._n_layer):
if self._memory_len:
if self._use_vars:
self.memories.append(layers.create_global_var(
shape=[self._number_instance, self._memory_len, self._emb_size],
value=0.0,
dtype=self._emb_dtype,
persistable=True,
force_cpu=False,
name="memory_%d" % i))
else:
self.memories.append(layers.data(
name="memory_%d" % i,
shape=[-1, self._memory_len, self._emb_size],
dtype=self._emb_dtype,
append_batch_size=False))
else:
self.memories.append([None])
def _build_model(self, src_ids, position_ids, task_ids, input_mask):
"""Build Ernie Doc Model"""
# padding id in vocabulary must be set to 0
word_emb = layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
pos_emb = layers.embedding(
input=position_ids,
size=[self._max_position_seq_len * 2 + self._memory_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
task_ids = layers.concat([
layers.zeros(
shape=[self._batch_size, self._memory_len, 1],
dtype="int64") + task_ids[0, 0, 0],
task_ids], axis=1)
task_ids.stop_gradient = True
task_emb = layers.embedding(
task_ids,
size=[self._task_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._task_emb_name, initializer=self._param_initializer))
word_emb = pre_process_layer(
word_emb, 'nd', self._prepostprocess_dropout, name='pre_encoder_emb')
pos_emb = pre_process_layer(
pos_emb, 'nd', self._prepostprocess_dropout, name='pre_encoder_r_pos')
task_emb = pre_process_layer(
task_emb, 'nd', self._prepostprocess_dropout, name="pre_encoder_r_task")
data_mask = layers.concat([
layers.ones(
shape=[self._batch_size, self._memory_len, 1],
dtype=input_mask.dtype),
input_mask], axis=1)
data_mask.stop_gradient = True
self_attn_mask = layers.matmul(
x=input_mask, y=data_mask, transpose_y=True)
self_attn_mask = layers.scale(
x=self_attn_mask, scale=1000000000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out, self._new_mems, self._checkpoints = encoder(
enc_input=word_emb,
memories=self.memories,
rel_pos=pos_emb,
rel_task=task_emb,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
memory_len=self._memory_len,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
epsilon=self._epsilon,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
rel_pos_params_sharing=self._rel_pos_params_sharing,
name='encoder',
use_vars=self._use_vars)
def get_sequence_output(self):
return self._enc_out
def get_checkpoints(self):
return self._checkpoints
def get_mem_output(self):
return self.memories, self._new_mems
def get_pooled_output(self):
"""Get the last feature of each sequence for classification"""
next_sent_feat = layers.slice(
input=self._enc_out, axes=[1], starts=[-1], ends=[self._max_position_seq_len])
next_sent_feat = layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(
name="pooled_fc.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc.b_0")
return next_sent_feat
def get_pretrained_output(self,
mask_label,
mask_pos,
need_cal_loss=True,
reorder_labels=None,
reorder_chose_idx=None,
reorder_need_cal_loss=False):
"""Get the loss & accuracy for pretraining"""
reshaped_emb_out = fluid.layers.reshape(
x=self._enc_out, shape=[-1, self._emb_size])
# extract masked tokens' feature
mask_feat = layers.gather(input=reshaped_emb_out,
index=layers.cast(mask_pos, dtype="int32"))
# transform: fc
mask_trans_feat = layers.fc(
input=mask_feat,
size=self._emb_size,
act=self._hidden_act,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_fc.w_0',
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = layers.layer_norm(
mask_trans_feat,
begin_norm_axis=len(mask_trans_feat.shape) - 1,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_bias',
initializer=fluid.initializer.Constant(1.)))
mask_lm_out_bias_attr = fluid.ParamAttr(
name="mask_lm_out_fc.b_0",
initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing:
fc_out = layers.matmul(
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(
self._word_emb_name),
transpose_y=True)
fc_out += layers.create_parameter(
shape=[self._voc_size],
dtype=self._emb_dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
else:
fc_out = layers.fc(
input=mask_trans_feat,
size=self._voc_size,
param_attr=fluid.ParamAttr(
name="mask_lm_out_fc.w_0",
initializer=self._param_initializer),
bias_attr=mask_lm_out_bias_attr)
mlm_loss = layers.softmax_with_cross_entropy(
logits=fc_out, label=mask_label)
mean_mlm_loss = layers.mean(mlm_loss) * need_cal_loss
# extract the first token feature in each sentence
self.next_sent_feat = self.get_pooled_output()
next_sent_feat_filter = layers.gather(
input=self.next_sent_feat,
index=reorder_chose_idx)
reorder_fc_out = layers.fc(
input=next_sent_feat_filter,
size=33,
param_attr=fluid.ParamAttr(
name="multi_sent_sorted" + "_fc.w_0", initializer=self._param_initializer),
bias_attr="multi_sent_sorted" + "_fc.b_0")
reorder_loss, reorder_softmax = layers.softmax_with_cross_entropy(
logits=reorder_fc_out, label=reorder_labels, return_softmax=True)
reorder_acc = fluid.layers.accuracy(
input=reorder_softmax, label=reorder_labels)
mean_reorder_loss = fluid.layers.mean(reorder_loss) * reorder_need_cal_loss
total_loss = mean_mlm_loss + mean_reorder_loss
reorder_acc *= reorder_need_cal_loss
return total_loss, mean_mlm_loss, reorder_acc
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def _cache_mem(curr_out, prev_mem, mem_len=None, use_vars=False):
"""generate new memories for next step"""
if mem_len is None or mem_len == 0:
return None
else:
if prev_mem is None:
new_mem = curr_out[:, -mem_len:, :]
else:
new_mem = layers.concat([prev_mem, curr_out], 1)[:, -mem_len:, :]
new_mem.stop_gradient = True
if use_vars:
layers.assign(new_mem, prev_mem)
return new_mem
def multi_head_attention(queries,
keys,
values,
rel_pos,
rel_task,
memory,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
r_w_bias=None,
r_r_bias=None,
r_t_bias=None,
dropout_rate=0.,
cache=None,
param_initializer=None,
rel_pos_params_sharing=False,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
    computing the softmax activation to mask certain selected positions so that
    they will not be considered in attention weights.
"""
if memory is not None and len(memory.shape) > 1:
cat = fluid.layers.concat([memory, queries], 1)
else:
cat = queries
keys, values = cat, cat
if not (len(queries.shape) == len(keys.shape) == len(values.shape) \
== len(rel_pos.shape) == len(rel_task.shape)== 3):
raise ValueError(
"Inputs: quries, keys, values, rel_pos and rel_task should all be 3-D tensors.")
if rel_pos_params_sharing:
assert (r_w_bias and r_r_bias and r_t_bias) is not None, \
"the rel pos bias can not be None when sharing the relative position params"
else:
r_w_bias, r_r_bias, r_t_bias = \
list(map(lambda x: layers.create_parameter(
shape=[n_head * d_key],
dtype="float32",
name=name + "_" + x,
default_initializer=param_initializer),
["r_w_bias", "r_r_bias", "r_t_bias"]))
def __compute_qkv(queries, keys, values, rel_pos, rel_task, n_head, d_key, d_value):
"""
        Add linear projection to queries, keys, values, positions and tasks.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
r = layers.fc(input=rel_pos,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_pos_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_pos_fc.b_0')
t = layers.fc(input=rel_task,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_task_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_task_fc.b_0')
return q, k, v, r, t
def __split_heads(x, n_head, add_bias=None):
"""
        Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
if add_bias:
reshaped = reshaped + add_bias
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def __rel_shift(x, klen):
"""return relative shift"""
x_shape = x.shape
INT_MAX=10000000
x = layers.reshape(x, [x_shape[0], x_shape[1], x_shape[3], x_shape[2]])
x = layers.slice(x, [0, 1, 2, 3], [0, 0, 1, 0], [INT_MAX, INT_MAX, INT_MAX, INT_MAX])
x = layers.reshape(x, [x_shape[0], x_shape[1], x_shape[2], x_shape[3] - 1])
x = layers.slice(x, [0, 1, 2, 3], [0, 0, 0, 0], [INT_MAX, INT_MAX, INT_MAX, klen])
return x
def __scaled_dot_product_attention(q, k, v, r, t, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
q_w, q_r, q_t = list(map(lambda x: layers.scale(x=x, scale=d_key ** -0.5), q))
score_w = layers.matmul(x=q_w, y=k, transpose_y=True)
score_r = layers.matmul(x=q_r, y=r, transpose_y=True)
score_r = __rel_shift(score_r, k.shape[2])
score_t = layers.matmul(x=q_t, y=t, transpose_y=True)
score = score_w + score_r + score_t
if attn_bias is not None:
score += attn_bias
weights = layers.softmax(score, use_cudnn=True)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v, r, t = __compute_qkv(queries, keys, values, rel_pos, rel_task, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q_w, q_r, q_t = list(map(lambda x: layers.elementwise_add(q, x, 2), [r_w_bias, r_r_bias, r_t_bias]))
q_w, q_r, q_t = list(map(lambda x: __split_heads(x, n_head), [q_w, q_r, q_t]))
k, v, r, t = list(map(lambda x: __split_heads(x, n_head), [k, v, r, t]))
ctx_multiheads = __scaled_dot_product_attention([q_w, q_r, q_t], \
k, v, r, t, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with an activation
(hidden_act) in between, applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out,
out,
process_cmd,
dropout_rate=0.,
epsilon=1e-5,
name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)),
epsilon=epsilon)
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
rel_pos,
rel_task,
memory,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
r_w_bias,
r_r_bias,
r_t_bias,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
epsilon=1e-5,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
rel_pos_params_sharing=False,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention followed by
position-wise feed-forward networks; both components are accompanied
by the post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
epsilon=epsilon,
name=name + '_pre_att'),
None,
None,
rel_pos,
rel_task,
memory,
attn_bias,
d_key,
d_value,
d_model,
n_head,
r_w_bias,
r_r_bias,
r_t_bias,
attention_dropout,
param_initializer=param_initializer,
rel_pos_params_sharing=rel_pos_params_sharing,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
epsilon=epsilon,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
epsilon=epsilon,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
epsilon=epsilon,
name=name + '_post_ffn'), ffd_output
def encoder(enc_input,
memories,
rel_pos,
rel_task,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
memory_len,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
epsilon=1e-5,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
rel_pos_params_sharing=False,
name='',
use_vars=False):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
r_w_bias, r_r_bias, r_t_bias = None, None, None
if rel_pos_params_sharing:
r_w_bias, r_r_bias, r_t_bias = \
list(map(lambda x: layers.create_parameter(
shape=[n_head * d_key],
dtype="float32",
name=name + "_" + x,
default_initializer=param_initializer),
["r_w_bias", "r_r_bias", "r_t_bias"]))
checkpoints = []
_new_mems = []
for i in range(n_layer):
enc_input, cp = encoder_layer(
enc_input,
rel_pos,
rel_task,
memories[i],
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
r_w_bias,
r_r_bias,
r_t_bias,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
epsilon,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
rel_pos_params_sharing=rel_pos_params_sharing,
name=name + '_layer_' + str(i))
checkpoints.append(cp.name)
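# Same-layer recurrence: the memory used by layer i on the next segment is
# derived (via _cache_mem) from layer i's own output on the current segment.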
new_mem = _cache_mem(enc_input, memories[i], memory_len, use_vars=use_vars)
if not use_vars:
_new_mems.append(new_mem)
enc_output = pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
epsilon,
name="post_encoder")
return enc_output, _new_mems, checkpoints[:-1]
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
import paddle
import numpy as np
def get_related_pos(insts,
seq_len,
memory_len=128):
"""generate relative postion ids"""
beg = seq_len + seq_len + memory_len
r_position = [list(range(beg - 1, seq_len - 1, -1)) + \
list(range(0, seq_len)) for i in range(len(insts))]
return np.array(r_position).astype('int64').reshape([len(insts), beg, 1])
def pad_batch_data(insts,
insts_data_type="int64",
pad_idx=0,
final_cls=False,
pad_max_len=None,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
if pad_max_len:
max_len = pad_max_len
else:
max_len = max(len(inst) for inst in insts)
# Any token included in the dict can be used to pad, since the paddings' loss
# will be masked out by weights and have no effect on parameter gradients.
# token ids
if final_cls:
inst_data = np.array(
[inst[:-1] + list([pad_idx] * (max_len - len(inst))) + [inst[-1]] for inst in insts])
else:
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype(insts_data_type).reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
if final_cls:
input_mask_data = np.array([[1] * len(inst[:-1]) + [0] *
(max_len - len(inst)) + [1] for inst in insts])
else:
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
if paddle.__version__[:3] <= '1.5':
seq_lens_type = [-1, 1]
else:
seq_lens_type = [-1]
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype("int64").reshape(seq_lens_type)]
return return_list if len(return_list) > 1 else return_list[0]
# -*- coding: utf-8 -*-
"""
Byte pair encoding utilities from GPT-2.
Original source: https://github.com/openai/gpt-2/blob/master/src/encoder.py
Original license: MIT
"""
try:
from functools import lru_cache
except ImportError:
from backports.functools_lru_cache import lru_cache
import json
import six
import regex as re
@lru_cache()
def bytes_to_unicode():
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
if six.PY2:
bs = list(range(ord("!".decode('utf8')), ord("~".decode('utf8'))+1))+list(range(ord("¡".decode('utf8')),ord("¬".decode('utf8'))+1))+list(range(ord("®".decode('utf8')), ord("ÿ".decode('utf8'))+1))
else:
bs = (
list(range(ord("!"), ord("~") + 1))
+ list(range(ord("¡"), ord("¬") + 1))
+ list(range(ord("®"), ord("ÿ") + 1))
)
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
if six.PY2:
cs = [unichr(n) for n in cs]
else:
cs = [chr(n) for n in cs]
return dict(zip(bs, cs))
def get_pairs(word):
"""Return set of symbol pairs in a word.
Word is represented as tuple of symbols (symbols being variable-length strings).
"""
pairs = set()
prev_char = word[0]
for char in word[1:]:
pairs.add((prev_char, char))
prev_char = char
return pairs
class Encoder(object):
def __init__(self, encoder, bpe_merges, errors='replace', special_tokens=["[SEP]", "[p]", "[q]", "[/q]"]):
self.encoder = encoder
self.decoder = {v:k for k,v in self.encoder.items()}
self.errors = errors # how to handle errors in decoding
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
self.cache = {}
self.re = re
self.special_tokens = special_tokens
# Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = self.re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
def bpe(self, token):
if token in self.cache:
return self.cache[token]
word = tuple(token)
pairs = get_pairs(word)
if not pairs:
return token
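# Greedy BPE merging: repeatedly pick the pair with the lowest merge rank
# present in the word and merge every occurrence of it, until no remaining
# pair appears in the learned merge table (bpe_ranks).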
while True:
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
except ValueError:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word)-1 and word[i+1] == second:
new_word.append(first+second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
word = ' '.join(word)
self.cache[token] = word
return word
def tokenize(self, text):
tokens = text.split(' ')
sub_tokens = []
for token_i, token in enumerate(tokens):
if self.is_special_token(token):
if token_i == 0:
sub_tokens.extend([token])
else:
sub_tokens.extend([" " + token])
else:
if token_i == 0:
sub_tokens.extend(self.re.findall(self.pat, token))
else:
sub_tokens.extend(self.re.findall(self.pat, " " + token))
return sub_tokens
def tokenize_old(self, text):
return self.re.findall(self.pat, text)
def is_special_token(self, tok):
if isinstance(tok, int):
return False
res = False
for t in self.special_tokens:
if tok.strip() == t:
res = True
break
return res
def tokenize_bpe(self, token):
if self.is_special_token(token):
return [token.strip()] # remove space for convert_to_ids
else:
if six.PY2:
token = ''.join(self.byte_encoder[ord(b)] for b in token.encode('utf-8'))
else:
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
return [self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' ')]
def encode(self, text):
bpe_tokens = []
for token in self.tokenize(text):
bpe_tokens.extend(self.tokenize_bpe(token))
return bpe_tokens
def decode(self, tokens):
pre_token_i = 0
texts = []
for token_i, token in enumerate(tokens):
if self.is_special_token(token):
# process the tokens accumulated before token_i
if token_i - pre_token_i > 0:
text = ''.join([self.decoder[int(tok)] for tok in tokens[pre_token_i:token_i]])
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
texts.append(text)
# texts.append(token)
if token_i == 0:
texts.append(token) # in the beginning, there is no space before special tokens
else:
texts.extend([" ", token]) # in middle sentence, there must be a space before special tokens
pre_token_i = token_i + 1
if pre_token_i < len(tokens):
text = ''.join([self.decoder[int(tok)] for tok in tokens[pre_token_i:]])
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
texts.append(text)
return ''.join(texts)
def get_encoder(encoder_json_path, vocab_bpe_path):
with open(encoder_json_path, 'r') as f:
encoder = json.load(f)
with open(vocab_bpe_path, 'r') as f:
bpe_data = f.read()
if six.PY2:
bpe_data = bpe_data.decode('utf8')
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
return Encoder(
encoder=encoder,
bpe_merges=bpe_merges,
)
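# Illustrative usage (the paths follow the defaults used by BPETokenizer and
# are an assumption about your local config layout):
#   encoder = get_encoder("./configs/encoder.json", "./configs/vocab.bpe")
#   bpe_ids = encoder.encode("ERNIE-Doc models long documents")
#   text = encoder.decode(bpe_ids)  # roughly round-trips the input text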
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
# Copyright 2021 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import unicodedata
import six
import sentencepiece as sp
from reader import gpt2_bpe_utils
from reader.vocabulary import Vocabulary
from nltk import tokenize
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids_include_unk(vocab, tokens, unk_token="[UNK]"):
output = []
for token in tokens:
if token in vocab:
output.append(vocab[token])
else:
output.append(vocab[unk_token])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a peice of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class CharTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in text.lower().split(" "):
for sub_token in self.tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class SentencepieceTokenizer(object):
"""Runs SentencePiece tokenziation."""
def __init__(self, vocab_file, do_lower_case=True, unk_token="[UNK]"):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.do_lower_case = do_lower_case
self.tokenizer = sp.SentencePieceProcessor()
self.tokenizer.Load(vocab_file + ".model")
self.sp_unk_token = "<unk>"
self.unk_token = unk_token
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
Returns:
A list of wordpiece tokens.
"""
text = text.lower() if self.do_lower_case else text
text = convert_to_unicode(text.replace("\1", " "))
tokens = self.tokenizer.EncodeAsPieces(text)
output_tokens = []
for token in tokens:
if token == self.sp_unk_token:
token = self.unk_token
if token in self.vocab:
output_tokens.append(token)
else:
output_tokens.append(self.unk_token)
return output_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class WordsegTokenizer(object):
"""Runs Wordseg tokenziation."""
def __init__(self, vocab_file, do_lower_case=True, unk_token="[UNK]",
split_token="\1"):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.tokenizer = sp.SentencePieceProcessor()
self.tokenizer.Load(vocab_file + ".model")
self.do_lower_case = do_lower_case
self.unk_token = unk_token
self.split_token = split_token
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
Returns:
A list of wordpiece tokens.
"""
text = text.lower() if self.do_lower_case else text
text = convert_to_unicode(text)
output_tokens = []
for token in text.split(self.split_token):
if token in self.vocab:
output_tokens.append(token)
else:
sp_tokens = self.tokenizer.EncodeAsPieces(token)
for sp_token in sp_tokens:
if sp_token in self.vocab:
output_tokens.append(sp_token)
return output_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
class BPETokenizer(object):
""" Runs bpe tokenize """
def __init__(self, vocab_file, encoder_json_path="./configs/encoder.json", vocab_bpe_path="./configs/vocab.bpe", params=None):
self.vocabulary = Vocabulary(vocab_file, unk_token="[UNK]")
self.encoder = gpt2_bpe_utils.get_encoder(encoder_json_path, vocab_bpe_path)
def tokenize(self, text, is_sentencepiece=True):
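# Normalizes whitespace, optionally splits the text into sentences with
# nltk's sent_tokenize, BPE-encodes each piece, and returns the BPE ids as
# strings; convert_tokens_to_ids then maps these string ids to model vocab
# ids via the vocabulary file.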
text = convert_to_unicode(text)
text = " ".join(text.split()) # remove duplicate whitespace
if is_sentencepiece:
sents = tokenize.sent_tokenize(text)
bpe_ids = sum([self.encoder.encode(sent) for sent in sents], [])
else:
bpe_ids = self.encoder.encode(text)
tokens = [str(bpe_id) for bpe_id in bpe_ids]
return tokens
def convert_tokens_to_ids(self, tokens):
"""
:param tokens:
:return:
"""
return self.vocabulary.convert_tokens_to_ids(tokens)
def convert_ids_to_tokens(self, ids):
"""
:param ids:
:return:
"""
return self.vocabulary.convert_ids_to_tokens(ids)
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
def tokenize_chinese_chars(text):
"""Adds whitespace around any CJK character."""
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
output = []
buff = ""
for char in text:
cp = ord(char)
if _is_chinese_char(cp):
if buff != "":
output.append(buff)
buff = ""
output.append(char)
else:
buff += char
if buff != "":
output.append(buff)
return output
# -*- coding: utf-8 -*-
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
:py:class:`Vocabulary`
"""
import collections
import unicodedata
import six
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
class Vocabulary(object):
"""Vocabulary"""
def __init__(self, vocab_path, unk_token):
"""
:param vocab_path:
:param unk_token:
"""
if not vocab_path:
raise ValueError("vocab_path can't be None")
self.vocab_path = vocab_path
self.unk_token = unk_token
self.vocab_dict, self.id_dict = self.load_vocab()
self.vocab_size = len(self.id_dict)
def load_vocab(self):
"""
:return:
"""
vocab_dict = collections.OrderedDict()
id_dict = collections.OrderedDict()
file_vocab = open(self.vocab_path)
for num, line in enumerate(file_vocab):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
if len(items) == 2:
index = items[1]
else:
index = num
token = token.strip()
vocab_dict[token] = int(index)
id_dict[int(index)] = token
return vocab_dict, id_dict
def add_reserve_id(self):
"""
:return:
"""
pass
def convert_tokens_to_ids(self, tokens):
"""
:param tokens:
:return:
"""
UNK = self.vocab_dict[self.unk_token]
output = [self.vocab_dict.get(item, UNK) for item in tokens]
return output
def convert_ids_to_tokens(self, ids):
"""
:param ids:
:return:
"""
output = []
for item in ids:
output.append(self.id_dict.get(item, self.unk_token))
return output
def get_vocab_size(self):
"""
:return:
"""
return len(self.id_dict)
def covert_id_to_token(self, id):
"""
:param id:
:return: token
"""
return self.id_dict.get(id, self.unk_token)
def covert_token_to_id(self, token):
"""
:param token:
:return: id
"""
UNK = self.vocab_dict[self.unk_token]
return self.vocab_dict.get(token, UNK)
nltk==3.6.2
regex==2021.4.4
six==1.15.0
tqdm==4.54.1
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy
import reader.task_reader as task_reader
from model.optimization import optimization
from model.static.ernie import ErnieConfig, ErnieDocModel
from finetune.classifier import create_model, evaluate
from utils.args import print_arguments
from utils.init import init_model
from utils.finetune_args import parser
paddle.enable_static()
args = parser.parse_args()
def main(args):
"""main function"""
print("""finetuning start""")
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
memory_len = ernie_config["memory_len"]
d_dim = ernie_config["hidden_size"]
n_layers = ernie_config["num_hidden_layers"]
print("args.is_distributed:", args.is_distributed)
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 4 if args.use_amp else 2
exec_strategy.num_iteration_per_drop_scope = min(1, args.skip_steps)
if args.is_distributed:
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
fleet.init(role)
trainer_id = fleet.worker_index()
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = fleet.worker_endpoints()
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} trainer_id:{}"
.format(worker_endpoints, trainers_num, current_endpoint, trainer_id))
dist_strategy = DistributedStrategy()
dist_strategy.exec_strategy = exec_strategy
dist_strategy.remove_unnecessary_lock = False
dist_strategy.fuse_all_reduce_ops = False
dist_strategy.nccl_comm_num = 1
if args.use_amp:
dist_strategy.use_amp = True
dist_strategy.amp_loss_scaling = args.init_loss_scaling
if args.use_recompute:
dist_strategy.forward_recompute = True
dist_strategy.enable_sequential_execution=True
else:
dist_strategy=None
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed:
gpus = os.getenv("FLAGS_selected_gpus").split(",")
gpu_id = int(gpus[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = fleet.worker_num() if args.is_distributed else gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
trainer_id = fleet.worker_index()
print("Device count %d, trainer_id:%d" % (dev_count, trainer_id))
print('args.vocab_path', args.vocab_path)
reader = task_reader.ClassifyReader(
trainer_id=fleet.worker_index(),
trainer_num=dev_count,
vocab_path=args.vocab_path,
memory_len=memory_len,
repeat_input=args.repeat_input,
train_all=args.train_all,
eval_all=args.eval_all,
label_map_config=args.label_map_config,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=args.in_tokens,
random_seed=args.random_seed,
tokenizer=args.tokenizer,
is_zh=args.is_zh)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples(args.train_set)
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d, trainer_id: %d" % (dev_count, trainer_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars, checkpoints, train_mems_vars = create_model(
args,
ernie_config=ernie_config,
mem_len=memory_len)
if args.use_recompute:
dist_strategy.recompute_checkpoints = checkpoints
scheduled_lr, loss_scaling = optimization(
loss=graph_vars['loss'],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_amp=args.use_amp,
init_loss_scaling=args.init_loss_scaling,
layer_decay_rate=args.layer_decay_ratio,
n_layers=n_layers,
dist_strategy=dist_strategy)
origin_train_program = train_program
if args.is_distributed:
train_program = fleet.main_program
origin_train_program = fleet._origin_program
# add data pe
# exec_strategy = fluid.ExecutionStrategy()
# exec_strategy.num_threads = 1
# exec_strategy.num_iteration_per_drop_scope = 10000
# build_strategy = fluid.BuildStrategy()
# train_program_dp = fluid.CompiledProgram(train_program).with_data_parallel(
# loss_name=graph_vars["loss"].name,
# exec_strategy=exec_strategy,
# build_strategy=build_strategy)
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_graph_vars, _, test_mems_vars = create_model(
args,
ernie_config=ernie_config,
mem_len=memory_len)
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, startup_prog)
train_exe = exe if args.do_train else None
steps = 0
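# Recurrence memories for ERNIE-Doc: one zero-initialized array per layer
# (init_memory below); in the feed/fetch path (use_vars=False) the updated
# per-layer memories returned by evaluate() are fed back in for the next step.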
def init_memory():
return [np.zeros([args.batch_size, memory_len, d_dim], dtype="float32")
for _ in range(n_layers)]
if args.do_train:
train_pyreader.set_batch_generator(train_data_generator)
train_pyreader.start()
time_begin = time.time()
tower_mems_np = init_memory()
while True:
try:
steps += 1
if steps % args.skip_steps == 0:
outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np,
"train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
tower_mems_np = outputs['tower_mems_np']
time_end = time.time()
used_time = time_end - time_begin
current_example, current_epoch = reader.get_train_progress()
print("train pyreader queue size: %d, " % train_pyreader.queue.size())
print("epoch: %d, worker_index: %d, progress: %d/%d, step: %d, ave loss: %f, "
"ave acc: %f, time cost: %f, speed: %f steps/s, learning_rate: %f" %
(current_epoch, trainer_id, current_example, num_train_examples,
steps, outputs["loss"], outputs["accuracy"],
used_time, args.skip_steps / used_time, outputs["learning_rate"]))
time_begin = time.time()
else:
if args.use_vars:
# train_exe.run(fetch_list=[])
train_exe.run(program=train_program, use_program_cache=True)
else:
outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np,
"train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
tower_mems_np = outputs['tower_mems_np']
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="eval"))
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(),
"eval", steps, trainer_id, dev_count, use_vars=args.use_vars)
# evaluate test set
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="test"))
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(),
"test", steps, trainer_id, dev_count, use_vars=args.use_vars)
# save model
if trainer_id == 0 and steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, origin_train_program)
except fluid.core.EOFException:
if trainer_id == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, origin_train_program)
train_pyreader.reset()
break
# final eval on dev set
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="eval"))
print("Final validation result:")
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(),
"eval", steps, trainer_id, dev_count, use_vars=args.use_vars)
# final eval on test set
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="test"))
print("Final test result:")
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(),
"test", steps, trainer_id, dev_count, use_vars=args.use_vars)
if __name__ == '__main__':
print_arguments(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on mrc tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy
import reader.task_reader as task_reader
from model.optimization import optimization
from model.static.ernie import ErnieConfig, ErnieDocModel
from finetune.mrc import create_model, evaluate
from utils.args import print_arguments
from utils.init import init_model
from utils.finetune_args import parser
paddle.enable_static()
args = parser.parse_args()
def main(args):
"""main function"""
print("""finetuning start""")
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
memory_len = ernie_config["memory_len"]
d_dim = ernie_config["hidden_size"]
n_layers = ernie_config["num_hidden_layers"]
print("args.is_distributed:", args.is_distributed)
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 4 if args.use_amp else 2
exec_strategy.num_iteration_per_drop_scope = min(1, args.skip_steps)
if args.is_distributed:
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
fleet.init(role)
trainer_id = fleet.worker_index()
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = fleet.worker_endpoints()
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} trainer_id:{}"
.format(worker_endpoints, trainers_num, current_endpoint, trainer_id))
dist_strategy = DistributedStrategy()
dist_strategy.exec_strategy = exec_strategy
dist_strategy.remove_unnecessary_lock = False
dist_strategy.fuse_all_reduce_ops = False
dist_strategy.nccl_comm_num = 1
if args.use_amp:
dist_strategy.use_amp = True
dist_strategy.amp_loss_scaling = args.init_loss_scaling
if args.use_recompute:
dist_strategy.forward_recompute = True
dist_strategy.enable_sequential_execution=True
else:
dist_strategy=None
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed:
gpus = os.getenv("FLAGS_selected_gpus").split(",")
gpu_id = int(gpus[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = fleet.worker_num() if args.is_distributed else gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
trainer_id = fleet.worker_index()
print("Device count %d, trainer_id:%d" % (dev_count, trainer_id))
print('args.vocab_path', args.vocab_path)
reader = task_reader.MRCReader(
trainer_id=fleet.worker_index(),
trainer_num=dev_count,
vocab_path=args.vocab_path,
memory_len=memory_len,
repeat_input=args.repeat_input,
train_all=args.train_all,
eval_all=args.eval_all,
label_map_config=args.label_map_config,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=args.in_tokens,
random_seed=args.random_seed,
tokenizer=args.tokenizer,
is_zh=args.is_zh,
for_cn=args.for_cn,
doc_stride=args.doc_stride,
max_query_length=args.max_query_length)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples("train", args.train_set)
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d, trainer_id: %d" % (dev_count, trainer_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars, checkpoints, train_mems_vars = create_model(
args,
ernie_config=ernie_config,
mem_len=memory_len)
if args.use_recompute:
dist_strategy.recompute_checkpoints = checkpoints
scheduled_lr, loss_scaling = optimization(
loss=graph_vars['loss'],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_amp=args.use_amp,
init_loss_scaling=args.init_loss_scaling,
layer_decay_rate=args.layer_decay_ratio,
n_layers=n_layers,
dist_strategy=dist_strategy)
origin_train_program = train_program
if args.is_distributed:
train_program = fleet.main_program
origin_train_program = fleet._origin_program
# add data pe
# exec_strategy = fluid.ExecutionStrategy()
# exec_strategy.num_threads = 1
# exec_strategy.num_iteration_per_drop_scope = 10000
# build_strategy = fluid.BuildStrategy()
# train_program_dp = fluid.CompiledProgram(train_program).with_data_parallel(
# loss_name=graph_vars["loss"].name,
# exec_strategy=exec_strategy,
# build_strategy=build_strategy)
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_graph_vars, _, test_mems_vars = create_model(
args,
ernie_config=ernie_config,
mem_len=memory_len)
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, startup_prog)
train_exe = exe if args.do_train else None
steps = 0
def init_memory():
return [np.zeros([args.batch_size, memory_len, d_dim], dtype="float32")
for _ in range(n_layers)]
if args.do_train:
train_pyreader.set_batch_generator(train_data_generator)
train_pyreader.start()
time_begin = time.time()
tower_mems_np = init_memory()
while True:
try:
steps += 1
if steps % args.skip_steps == 0:
outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np, "train", steps, trainer_id,
dev_count, scheduled_lr, use_vars=args.use_vars)
tower_mems_np = outputs['tower_mems_np']
time_end = time.time()
used_time = time_end - time_begin
current_example, current_epoch = reader.get_train_progress()
print("train pyreader queue size: %d, " % train_pyreader.queue.size())
print("epoch: %d, worker_index: %d, progress: %d/%d, step: %d, ave loss: %f, "
"time cost: %f, speed: %f steps/s, learning_rate: %f" %
(current_epoch, trainer_id, current_example, num_train_examples,
steps, outputs["loss"], used_time, args.skip_steps / used_time,
outputs["learning_rate"]))
time_begin = time.time()
else:
if args.use_vars:
# train_exe.run(fetch_list=[])
train_exe.run(program=train_program, use_program_cache=True)
else:
outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np, "train", steps, trainer_id,
dev_count, scheduled_lr, use_vars=args.use_vars)
tower_mems_np = outputs['tower_mems_np']
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="eval"))
num_dev_examples = reader.get_num_examples("eval", args.dev_set)
print("the example number of dev file is {}".format(num_dev_examples))
evaluate(exe, test_prog, test_pyreader, test_graph_vars, test_mems_vars,
init_memory(), "eval", steps, trainer_id, dev_count,
use_vars=args.use_vars, examples=reader.get_examples("eval"),
features=reader.get_features("eval"), args=args)
# evaluate test set
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="test"))
num_test_examples = reader.get_num_examples("test", args.test_set)
print("the example number of test file is {}".format(num_test_examples))
evaluate(exe, test_prog, test_pyreader, test_graph_vars, test_mems_vars,
init_memory(), "test", steps, trainer_id, dev_count,
use_vars=args.use_vars, examples=reader.get_examples("test"),
features=reader.get_features("test"), args=args)
# save model
if trainer_id == 0 and steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, origin_train_program)
except fluid.core.EOFException:
if trainer_id == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, origin_train_program)
train_pyreader.reset()
break
# final eval on dev set
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="eval"))
print("Final validation result:")
num_dev_examples = reader.get_num_examples("eval", args.dev_set)
print("the example number of dev file is {}".format(num_dev_examples))
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(), "eval", steps,
trainer_id, dev_count, use_vars=args.use_vars,
examples=reader.get_examples("eval"),
features=reader.get_features("eval"), args=args)
# final eval on test set
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="test"))
print("Final test result:")
num_test_examples = reader.get_num_examples("test", args.test_set)
print("the example number of test file is {}".format(num_test_examples))
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(), "test", steps,
trainer_id, dev_count, use_vars=args.use_vars,
examples=reader.get_examples("test"),
features=reader.get_features("test"), args=args)
if __name__ == '__main__':
print_arguments(args)
main(args)
#!/bin/sh
set -eux
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fraction_of_gpu_memory_to_use=0.98
export FLAGS_sync_nccl_allreduce=1
# task
finetuning_task="dureader"
task_data_path="./data/finetune/task_data/${finetuning_task}/"
# model setup
is_zh="True"
use_vars="False"
use_amp="False"
train_all="Fasle"
eval_all="False"
use_vars="False"
use_recompute="False"
rel_pos_params_sharing="False"
lr_scheduler="linear_warmup_decay"
vocab_path="./configs/base/zh/vocab.txt"
config_path="./configs/base/zh/ernie_config.json"
init_model_checkpoint=""
init_model_pretraining=""
# args setup
epoch=5
warmup=0.1
max_len=512
lr_rate=2.75e-4
batch_size=16
weight_decay=0.01
save_steps=10000
validation_steps=100
layer_decay_ratio=0.8
init_loss_scaling=32768
PADDLE_TRAINERS=`hostname -i`
PADDLE_TRAINER_ID="0"
POD_IP=`hostname -i`
selected_gpus="0,1,2,3"
if [[ $finetuning_task == "cmrc2018" ]];then
do_test=false
elif [[ $finetuning_task == "drcd" ]];then
do_test=true
elif [[ $finetuning_task == "dureader" ]];then
do_test=false
fi
mkdir -p log
distributed_args="--node_ips ${PADDLE_TRAINERS} --node_id ${PADDLE_TRAINER_ID} --current_node_ip ${POD_IP} --nproc_per_node 4 --selected_gpus ${selected_gpus}"
python -u ./lanch.py ${distributed_args} \
run_mrc.py --use_cuda true\
--is_distributed true \
--batch_size ${batch_size} \
--in_tokens false\
--use_fast_executor ${e_executor:-"true"} \
--checkpoints ./output \
--vocab_path ${vocab_path} \
--do_train true \
--do_val true \
--do_test ${do_test} \
--save_steps ${save_steps:-"10000"} \
--validation_steps ${validation_steps:-"100"} \
--warmup_proportion ${warmup} \
--weight_decay ${weight_decay} \
--epoch ${epoch} \
--max_seq_len ${max_len} \
--ernie_config_path ${config_path} \
--do_lower_case true \
--doc_stride 128 \
--train_set ${task_data_path}/train.json \
--dev_set ${task_data_path}/dev.json \
--test_set ${task_data_path}/test.json \
--learning_rate ${lr_rate} \
--num_iteration_per_drop_scope 1 \
--lr_scheduler linear_warmup_decay \
--layer_decay_ratio ${layer_decay_ratio:-"0.8"} \
--is_zh ${is_zh:-"True"} \
--repeat_input ${repeat_input:-"False"} \
--train_all ${train_all:-"False"} \
--eval_all ${eval_all:-"False"} \
--use_vars ${use_vars:-"False"} \
--init_checkpoint ${init_model_checkpoint:-""} \
--init_pretraining_params ${init_model_pretraining:-""} \
--init_loss_scaling ${init_loss_scaling:-32768} \
--use_recompute ${use_recompute:-"False"} \
--skip_steps 10 1>log/0_${finetuning_task}_job.log 2>&1
set -eux
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fraction_of_gpu_memory_to_use=0.98
export FLAGS_sync_nccl_allreduce=1
# task
task_data_path="./data/finetune/task_data/"
task_name="iflytek"
# model setup
is_zh="True"
repeat_input="False"
train_all="Fasle"
eval_all="False"
use_vars="False"
use_amp="False"
use_recompute="False"
rel_pos_params_sharing="False"
lr_scheduler="linear_warmup_decay"
vocab_path="./configs/base/zh/vocab.txt"
config_path="./configs/base/zh/ernie_config.json"
init_model_checkpoint=""
init_model_pretraining=""
# args setup
epoch=5
warmup=0.1
max_len=512
lr_rate=1.5e-4
batch_size=16
weight_decay=0.01
num_labels=119
save_steps=10000
validation_steps=100
layer_decay_ratio=0.8
init_loss_scaling=32768
PADDLE_TRAINERS=`hostname -i`
PADDLE_TRAINER_ID="0"
POD_IP=`hostname -i`
selected_gpus="0"
mkdir -p log
distributed_args="--node_ips ${PADDLE_TRAINERS} --node_id ${PADDLE_TRAINER_ID} --current_node_ip ${POD_IP} --nproc_per_node 1 --selected_gpus ${selected_gpus}"
python -u ./lanch.py ${distributed_args} \
./run_classifier.py --use_cuda true \
--is_distributed true \
--use_fast_executor ${e_executor:-"true"} \
--tokenizer ${TOKENIZER:-"FullTokenizer"} \
--use_amp ${use_amp:-"false"} \
--do_train true \
--do_val true \
--do_test false \
--batch_size ${batch_size} \
--init_checkpoint ${init_model_checkpoint:-""} \
--init_pretraining_params ${init_model_pretraining:-""} \
--label_map_config "" \
--train_set ${task_data_path}/${task_name}/train/1 \
--dev_set ${task_data_path}/${task_name}/dev/1 \
--test_set ${task_data_path}/${task_name}/test/1 \
--vocab_path ${vocab_path} \
--checkpoints ./output \
--save_steps ${save_steps} \
--weight_decay ${weight_decay} \
--warmup_proportion ${warmup} \
--validation_steps ${validation_steps} \
--epoch ${epoch} \
--max_seq_len ${max_len} \
--ernie_config_path ${config_path} \
--learning_rate ${lr_rate} \
--skip_steps 10 \
--num_iteration_per_drop_scope 10 \
--layer_decay_ratio ${layer_decay_ratio:-"0.8"} \
--num_labels ${num_labels} \
--is_zh ${is_zh:-"True"} \
--repeat_input ${repeat_input:-"False"} \
--train_all ${train_all:-"False"} \
--eval_all ${eval_all:-"False"} \
--use_vars ${use_vars:-"False"} \
--init_loss_scaling ${init_loss_scaling:-32768} \
--use_recompute ${use_recompute:-"False"} \
--random_seed 1
#!/bin/sh
set -eux
export BASE_PATH="$PWD"
export PATH="${BASE_PATH}/py37/bin/:$PATH"
export PYTHONPATH="${BASE_PATH}/py37/"
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fraction_of_gpu_memory_to_use=0.98
export FLAGS_sync_nccl_allreduce=1
export TRAINER_PORTS_NUM='8'
export PADDLE_CURRENT_ENDPOINT=`hostname -i`
export PADDLE_TRAINERS_NUM='1'
export POD_IP=`hostname -i`
export PADDLE_TRAINER_COUNT='8'
# task
task_data_path="./data/imdb/"
task_name="imdb"
# model setup
is_zh="False"
use_vars="False"
use_amp="False"
use_recompute="False"
rel_pos_params_sharing="False"
lr_scheduler="linear_warmup_decay"
vocab_path="./configs/base/en/vocab.txt"
config_path="./configs/base/en/ernie_config.json"
init_model_checkpoint=""
init_model_pretraining=""
# args setup
max_len=512
lr_rate=7e-5
batch_size=16
weight_decay=0.01
warmup=0.1
epoch=3
num_labels=2
save_steps=100000
validation_steps=700
layer_decay_ratio=1
init_loss_scaling=32768
PADDLE_TRAINERS=`hostname -i`
PADDLE_TRAINER_ID="0"
POD_IP=`hostname -i`
selected_gpus="0,1"
distributed_args="--node_ips ${PADDLE_TRAINERS} --node_id ${PADDLE_TRAINER_ID} --current_node_ip ${POD_IP} --nproc_per_node 2 --selected_gpus ${selected_gpus}"
python -u ./lanch.py ${distributed_args} \
./run_classifier.py --use_cuda true \
--is_distributed true \
--use_fast_executor ${e_executor:-"true"} \
--tokenizer ${TOKENIZER:-"BPETokenizer"} \
--use_amp ${use_amp:-"false"} \
--do_train true \
--do_val true \
--do_test false \
--batch_size ${batch_size} \
--init_checkpoint ${init_model_checkpoint:-""} \
--init_pretraining_params ${init_model_pretraining:-""} \
--train_set ${task_data_path}/train.txt \
--dev_set ${task_data_path}/test.txt \
--test_set ${task_data_path}/test.txt \
--vocab_path ${vocab_path} \
--checkpoints ./output \
--save_steps ${save_steps} \
--weight_decay ${weight_decay} \
--warmup_proportion ${warmup} \
--validation_steps ${validation_steps} \
--epoch ${epoch} \
--max_seq_len ${max_len} \
--ernie_config_path ${config_path} \
--learning_rate ${lr_rate} \
--skip_steps 10 \
--num_iteration_per_drop_scope 10 \
--layer_decay_ratio ${layer_decay_ratio:-"0.8"} \
--num_labels ${num_labels} \
--is_zh ${is_zh:-"True"} \
--use_vars ${use_vars:-"False"} \
--init_loss_scaling ${init_loss_scaling:-32768} \
--use_recompute ${use_recompute:-"False"} \
--random_seed 1
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import argparse
def str2bool(v):
"""str2bool"""
# argparse cannot parse strings like "True" or "False" into Python
# booleans directly, so do the conversion by hand
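# e.g. "True", "t" and "1" map to True; any other string (including "False") maps to False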
return v.lower() in ("true", "t", "1")
class ArgumentGroup(object):
"""ArgumentGroup"""
def __init__(self, parser, title, des):
"""init"""
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, positional_arg=False, **kwargs):
"""add_arg"""
prefix = "" if positional_arg else "--"
type = str2bool if type == bool else type
self._group.add_argument(
prefix + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
"""print arguments"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import argparse
from utils.args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("rel_pos_params_sharing", bool, False, "If set, share u and v")
model_g.add_arg("is_zh", bool, True, "If true, use chinese data-loader")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_amp", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, False, "Whether to use dynamic loss scaling.")
train_g.add_arg("init_loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_amp is enabled.")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
"Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 5.0,
"The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8,
"The less-than-one-multiplier to use when decreasing.")
train_g.add_arg("use_recompute", bool, False, "Whether to use recompute")
train_g.add_arg("layer_decay_ratio", float, 0.8, "Set the layerwise learning rate decay ratio")
train_g.add_arg("weight_sharing", bool, True, "If set, share weights between word embedding and masked lm.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("tokenizer", str, "FullTokenizer",
"ATTENTION: the INPUT must be splited by Word with blank while using SentencepieceTokenizer or WordsegTokenizer")
data_g.add_arg("train_set", str, None, "Path to training data.")
data_g.add_arg("test_set", str, None, "Path to test data.")
data_g.add_arg("dev_set", str, None, "Path to validation data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("in_tokens", bool, False,
"If set, the batch size will be the maximum number of tokens in one batch. "
"Otherwise, it will be the maximum number of examples in one batch.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
data_g.add_arg("label_map_config", str, None, "label_map_path.")
data_g.add_arg("num_labels", int, 2, "label number")
data_g.add_arg("repeat_input", bool, False, "Whether to repeat the input sample")
data_g.add_arg("train_all", bool, False, "Whether to train all samples when repeat input")
data_g.add_arg("eval_all", bool, False, "Whether to eval all samples when repeat input")
data_g.add_arg("max_query_length", int, 64, "Max query length.")
data_g.add_arg("max_answer_length", int, 100, "Max answer length.")
data_g.add_arg("doc_stride", int, 128,
"When splitting up a long document into chunks, how much stride to take between chunks.")
data_g.add_arg("n_best_size", int, 20,
"The total number of n-best predictions to generate in the nbest_predictions.json output file.")
data_g.add_arg("use_vars", bool, True, "set for faster training, memory will not be in feed and fetch list")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("metrics", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("for_cn", bool, True, "model train for cn or for other langs.")
run_type_g.add_arg("stream_job", str, None, "if not None, then stream finetuning task by job id.")
# yapf: enable
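# note: bool-typed args above are parsed through utils.args.str2bool, so pass e.g. --use_amp true / --use_amp false on the command line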
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def cast_fp32_to_fp16(exe, main_program):
"""cast fp32 to fp16"""
print("Cast parameters to float16 data format.")
for param in main_program.global_block().all_parameters():
if not param.name.endswith(".master"):
#load fp16
param_t = fluid.global_scope().find_var(param.name).get_tensor()
data = np.array(param_t)
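# only parameters named "encoder_layer*" (excluding layer_norm) are cast below; the uint16 view keeps the fp16 bit pattern when setting the tensor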
if param.name.startswith("encoder_layer") \
and "layer_norm" not in param.name:
print(param.name)
param_t.set(np.float16(data).view(np.uint16), exe.place)
#load fp32
master_param_var = fluid.global_scope().find_var(param.name +
".master")
if master_param_var is not None:
master_param_var.get_tensor().set(data, exe.place)
def init_checkpoint(exe, init_checkpoint_path, main_program, use_amp=False):
"""init_checkpoint"""
assert os.path.exists(
init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
def existed_persitables(var):
if not fluid.io.is_persistable(var):
return False
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
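# load every persistable variable (e.g. weights and optimizer state) that has a file in the checkpoint directory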
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persitables)
print("Load model from {}".format(init_checkpoint_path))
if use_amp:
cast_fp32_to_fp16(exe, main_program)
def init_pretraining_params(exe, pretraining_params_path, main_program, use_amp=False):
"""init_pretraining_params"""
assert os.path.exists(pretraining_params_path
), "[%s] cann't be found." % pretraining_params_path
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
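# only model Parameters are restored here; optimizer state is not loaded from pretraining params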
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
print("Load pretraining parameters from {}.".format(
pretraining_params_path))
if use_amp:
cast_fp32_to_fp16(exe, main_program)
def init_model(args, exe, startup_prog):
"""init model"""
init_func, init_path = None, None
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
print("init checkpoint from ", args.init_checkpoint)
init_func = init_checkpoint
init_path = args.init_checkpoint
elif args.init_pretraining_params:
print("init pretraining params from ", args.init_pretraining_params)
init_func = init_pretraining_params
init_path = args.init_pretraining_params
elif args.do_val or args.do_test:
init_path = args.init_checkpoint or args.init_pretraining_params
if not init_path:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_func = init_checkpoint
print("init pretraining params from ", args.init_checkpoint if args.init_checkpoint else args.init_pretraining_params)
if init_path:
init_func(exe, init_path, main_program=startup_prog, use_amp=args.use_amp)
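# sketch of the expected call order: run the startup program on the executor first, then init_model(args, exe, startup_prog) restores the configured checkpoint or pretraining params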