English | [简体中文](./README_zh.md)

## _ERNIE-Doc_: A Retrospective Long-Document Modeling Transformer

- [Framework](#framework)
- [Pre-trained Models](#Pre-trained-Models)
- [Fine-tuning Tasks](#Fine-tuning-Tasks)
  * [Language Modeling](#Language-Modeling)
  * [Long-Text Classification](#Long-Text-Classification)
  * [Question Answering](#Question-Answering)
  * [Information Extraction](#Information-Extraction)
  * [Semantic Matching](#Semantic-Matching)
- [Usage](#Usage)
  * [Install PaddlePaddle](#Install-PaddlePaddle)
  * [Fine-tuning](#Fine-tuning)
- [Citation](#Citation)

For a technical description of the algorithm, please see our paper:

>[_**ERNIE-Doc: A Retrospective Long-Document Modeling Transformer**_](https://arxiv.org/abs/2012.15688)
>
>Siyu Ding\*, Junyuan Shang\*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\*: equal contribution)
>
>Preprint December 2020
>
>Accepted by **ACL-2021**

![ERNIE-Doc](https://img.shields.io/badge/Pretraining-Long%20Document%20Modeling-green) ![paper](https://img.shields.io/badge/Paper-ACL2021-yellow)

---

**ERNIE-Doc is a document-level language pretraining model.** Two well-designed techniques, the **retrospective feed mechanism** and the **enhanced recurrence mechanism**, give ERNIE-Doc a much longer effective context length and enable it to capture the contextual information of a complete document. ERNIE-Doc improves the state-of-the-art language modeling result on WikiText-103 to a perplexity of 16.8. Moreover, it outperforms competitive pretraining models by a large margin on most language understanding tasks, such as text classification, question answering, information extraction and semantic matching.

## Framework

We propose three novel methods to enhance the long-document modeling ability of Transformers (a minimal sketch of the two recurrence schemes is given after the figure below):

- **Retrospective Feed Mechanism**: Inspired by the human reading behavior of skimming a document first and then looking back at it attentively, we design a retrospective feed mechanism in which the segments of a document are fed twice as input. As a result, each segment in the retrospective phase can explicitly fuse the semantic information of the entire document learned in the skimming phase, which prevents context fragmentation.

- **Enhanced Recurrence Mechanism**: A drop-in replacement for the recurrence of Recurrence Transformers (such as Transformer-XL), changing the shifting-one-layer-downwards recurrence to same-layer recurrence. In this manner, the maximum effective context length is expanded, and past higher-level representations can be exploited to enrich future lower-level representations.

- **Segment-Reordering Objective**: A document-aware pretraining task of predicting the correct order of the permuted segments of a document, which models the relationship among segments directly. This allows ERNIE-Doc to build full document representations for prediction.

![framework](.meta/framework.png)

Illustrations of ERNIE-Doc and Recurrence Transformers, where models with three layers take as input a long document sliced into four segments.
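The following is a minimal, self-contained sketch (plain NumPy, not the released Paddle implementation; `toy_layer`, `run_segments` and the averaging "attention" are illustrative stand-ins) contrasting the shifting-one-layer-downwards recurrence of Recurrence Transformers with the same-layer enhanced recurrence used by ERNIE-Doc. Feeding the segment list twice loosely mimics the skimming and retrospective phases of the retrospective feed mechanism.

```python
import numpy as np

def toy_layer(x, memory):
    """Stand-in for one Transformer layer: 'attends' over [memory; x] by averaging."""
    context = x if memory is None else np.concatenate([memory, x], axis=0)
    return x + context.mean(axis=0, keepdims=True)

def run_segments(segments, n_layers=3, same_layer_memory=True):
    """Process a document segment by segment with a per-layer memory cache.

    same_layer_memory=False: Recurrence-Transformer style (e.g. Transformer-XL);
        layer l attends to the previous segment's layer l-1 states (the memory
        shifts one layer downwards), so the effective context grows slowly.
    same_layer_memory=True:  enhanced recurrence; layer l attends to the previous
        segment's layer l states, so past higher-level representations can enrich
        future lower-level ones and the maximum effective context length expands.
    """
    memories = [None] * (n_layers + 1)      # memories[l]: cached outputs of layer l
    for seg in segments:
        hiddens = [seg]                     # hiddens[0] is the segment embedding
        for l in range(1, n_layers + 1):
            mem = memories[l] if same_layer_memory else memories[l - 1]
            hiddens.append(toy_layer(hiddens[l - 1], mem))
        memories = list(hiddens)            # cache this segment's states for the next one
    return hiddens[-1]                      # top-layer states of the last segment

# A document sliced into four segments of 4 tokens with hidden size 8.
doc = [np.random.rand(4, 8) for _ in range(4)]
# Feeding the segments twice loosely mimics the skim + retrospective phases.
out = run_segments(doc + doc, n_layers=3, same_layer_memory=True)
print(out.shape)  # (4, 8)
```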
## Pre-trained Models

We release checkpoints for the **ERNIE-Doc _base_en/zh_** and **ERNIE-Doc _large_en_** models.

- [**ERNIE-Doc _base_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _base_zh_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _large_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz) (_24-layer, 1024-hidden, 16-heads_)

## Fine-tuning Tasks

We compare the performance of [ERNIE-Doc](https://arxiv.org/abs/2012.15688) with existing SOTA pretraining models (such as [Longformer](https://arxiv.org/abs/2004.05150), [BigBird](https://arxiv.org/abs/2007.14062), [ETC](https://arxiv.org/abs/2004.08483) and [ERNIE 2.0](https://arxiv.org/abs/1907.12412)) on language modeling (**_WikiText-103_**) and document-level natural language understanding tasks, including long-text classification (**_IMDB_**, **_HYP_**, **_THUCNews_**, **_IFLYTEK_**), question answering (**_TriviaQA_**, **_HotpotQA_**, **_DRCD_**, **_CMRC2018_**, **_DuReader_**, **_C3_**), information extraction (**_OpenKPE_**) and semantic matching (**_CAIL2019-SCM_**).

### Language Modeling

- [WikiText-103](https://arxiv.org/abs/1609.07843)

| Model | Param. | PPL |
|--------------------------|:--------:|:------:|
| _Results of base models_ | | |
| LSTM | - | 48.7 |
| LSTM+Neural cache | - | 40.8 |
| GCNN-14 | - | 37.2 |
| QRNN | 151M | 33.0 |
| Transformer-XL Base | 151M | 24.0 |
| SegaTransformer-XL Base | 151M | 22.5 |
| **ERNIE-Doc** Base | 151M | **21.0** |
| _Results of large models_ | | |
| Adaptive Input | 247M | 18.7 |
| Transformer-XL Large | 247M | 18.3 |
| Compressive Transformer | 247M | 17.1 |
| SegaTransformer-XL Large | 247M | 17.1 |
| **ERNIE-Doc** Large | 247M | **16.8** |

### Long-Text Classification

- [IMDB reviews](http://ai.stanford.edu/~amaas/data/sentiment/index.html)

| Models | Acc. | F1 |
|-----------------|:----:|:----:|
| RoBERTa | 95.3 | 95.0 |
| Longformer | 95.7 | - |
| BigBird | - | 95.2 |
| **ERNIE-Doc** Base | **96.1** | **96.1** |
| XLNet-Large | 96.8 | - |
| **ERNIE-Doc** Large | **97.1** | **97.1** |

- [Hyperpartisan News Detection](https://pan.webis.de/semeval19/semeval19-web/)

| Models | F1 |
|-----------------|:----:|
| RoBERTa | 87.8 |
| Longformer | 94.8 |
| BigBird | 92.2 |
| **ERNIE-Doc** Base | **96.3** |
| **ERNIE-Doc** Large | **96.6** |

- [THUCNews (THU)](http://thuctc.thunlp.org/), [IFLYTEK (IFK)](https://arxiv.org/abs/2004.05986)

| Models | THU Dev (Acc.) | THU Test (Acc.) | IFK Dev (Acc.) |
|-----------------|:--------:|:--------:|:--------:|
| BERT | 97.7 | 97.3 | 60.3 |
| BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
| RoBERTa-wwm-ext | - | - | 60.3 |
| ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
| ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
| **ERNIE-Doc** | **98.3** | **97.7** | **62.4** |

### Question Answering

- [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) on dev-set

| Models | F1 |
|-----------------|:----:|
| RoBERTa | 74.3 |
| Longformer | 75.2 |
| BigBird | 79.5 |
| **ERNIE-Doc** Base | **80.1** |
| Longformer Large | 77.8 |
| BigBird Large | - |
| **ERNIE-Doc** Large | **82.5** |

- [HotpotQA](https://hotpotqa.github.io/) on dev-set

| Models | Span-F1 | Supp.-F1 | Joint-F1 |
|-----------------|:----:|:----:|:----:|
| RoBERTa | 73.5 | 83.4 | 63.5 |
| Longformer | 74.3 | 84.4 | 64.4 |
| BigBird | 75.5 | **87.1** | 67.8 |
| **ERNIE-Doc** Base | **79.4** | 86.3 | **70.5** |
| Longformer Large | 81.0 | 85.8 | 71.4 |
| BigBird Large | 81.3 | **89.4** | - |
| **ERNIE-Doc** Large | **82.2** | 87.6 | **73.7** |

- [DRCD](https://arxiv.org/abs/1806.00920), [CMRC2018](https://arxiv.org/abs/1810.07366), [DuReader](https://arxiv.org/abs/1711.05073), [C3](https://arxiv.org/abs/1904.09679)

| Models | DRCD dev (EM/F1) | DRCD test (EM/F1) | CMRC2018 dev (EM/F1) | DuReader dev (EM/F1) | C3 dev (Acc.) | C3 test (Acc.) |
|-----------------|---------------|---------------|---------------|---------------|----------|----------|
| BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
| BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
| RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
| MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
| XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
| ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
| ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
| **ERNIE-Doc** | **90.5/95.2** | **90.5/95.1** | **76.1/91.6** | **65.8/77.9** | **76.5** | **76.5** |

### Information Extraction

- [Open Domain Web Keyphrase Extraction](https://www.aclweb.org/anthology/D19-1521/)

| Models | F1@1 | F1@3 | F1@5 |
|-----------|:----:|:----:|:----:|
| BLING-KPE | 26.7 | 29.2 | 20.9 |
| JointKPE | 39.1 | 39.8 | 33.8 |
| ETC | - | 40.2 | - |
| ERNIE-Doc | **40.2** | **40.5** | **34.4** |

### Semantic Matching

- [CAIL2019-SCM](https://arxiv.org/abs/1911.08962)

| Models | Dev (Acc.) | Test (Acc.) |
|-----------|:-------------:|:-------------:|
| BERT | 61.9 | 67.3 |
| ERNIE 2.0 | 64.9 | 67.9 |
| ERNIE-Doc | **65.6** | **68.8** |

## Usage

### Install PaddlePaddle

This code base has been tested with PaddlePaddle (version >= 1.8) under Python 3. The other dependencies of ERNIE-Doc are listed in `requirements.txt`; you can install them with

```script
pip install -r requirements.txt
```

### Fine-tuning

We release the fine-tuning code for English and Chinese classification tasks and for Chinese question answering tasks. For example, you can fine-tune the **ERNIE-Doc** base model on the IMDB, IFLYTEK and DuReader datasets with

```shell
sh script/run_imdb.sh
sh script/run_iflytek.sh
sh script/run_dureader.sh
```

[Preprocessing code for the IMDB dataset](./ernie_doc/data/imdb/README.md)

The training log and the evaluation results are written to `log/job.log.0`.

**Notice**: The actual total batch size equals `configured batch size * number of used GPUs`; a short illustration is given below.
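For instance, assuming a per-GPU batch size of 8 configured in one of the `run_*.sh` scripts and 4 visible GPUs (both values are placeholders, not the defaults of any particular script), the effective batch size used for training works out as follows:

```python
# Illustration of the notice above; the numbers are placeholders only.
configured_batch_size = 8   # batch size set in the fine-tuning script (per GPU)
num_gpus = 4                # GPUs assigned to the job

effective_batch_size = configured_batch_size * num_gpus
print(effective_batch_size)  # 32 -- the batch size that matters when reproducing results
```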
## Citation

You can cite the paper as below:

```
@article{ding2020ernie,
  title={ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},
  author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2012.15688},
  year={2020}
}
```