English | [简体中文](./README.zh.md)
## _ERNIE-GEN_: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation
- [Proposed Generation Framework](#proposed-generation-framework)
- [Pre-trained Models](#pre-trained-models)
- [Fine-tuning on Downstream Tasks](#fine-tuning-on-downstream-tasks)
  * [Abstractive Summarization](#abstractive-summarization)
  * [Question Generation](#question-generation)
  * [Generative Dialogue Response](#generative-dialogue-response)
  * [Generative Question Answering](#generative-question-answering)
- [Usage](#usage)
  * [Install PaddlePaddle](#install-paddlepaddle)
  * [Fine-tuning](#fine-tuning)
  * [Employ Dynamic Computation Graph](#employ-dynamic-computation-graph)
  * [The ERNIE 1.0 is available](#the-ernie-10-is-avaliable-for-chinese-generation-tasks)
- [Citation](#citation)
For technical description of the algorithm, please see our paper:
>[_**ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation**_](https://arxiv.org/abs/2001.11314.pdf)
>Dongling Xiao\*, Han Zhang\*, Yukun Li, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>Preprint January 2020
>Accepted by **IJCAI-2020**
![ERNIE-GEN](https://img.shields.io/badge/Pretraining-Generation-green) ![Gigaword](https://img.shields.io/badge/Abstractive%20Summarization-Gigaword-yellow) ![CNN/Daily Mail](https://img.shields.io/badge/Abstractive%20Summarization-CNN/Daily%20Mail-blue) ![SQuAD](https://img.shields.io/badge/Question%20Generation-SQuAD-green) ![Persona-Chat](https://img.shields.io/badge/Dialogue%20Response-Persona--Chat-yellowgreen) ![CoQA](https://img.shields.io/badge/Generative%20Question%20Answering-CoQA-orange)
---
**[ERNIE-GEN](https://arxiv.org/abs/2001.11314.pdf) is a multi-flow language generation framework for both pre-training and fine-tuning.** We propose a novel **span-by-span generation** pre-training task that enables the model to **generate a semantically complete span** at each step rather than a single word, since entities and phrases in human writing are organized coherently. An **infilling generation mechanism** and a **noise-aware generation method** are incorporated into both pre-training and fine-tuning to alleviate **the problem of exposure bias**. In the pre-training phase, ERNIE-GEN adopts a **multi-granularity target fragments sampling** strategy that forces the decoder to rely more on the encoder representations than on the previously generated words, which enhances the correlation between encoder and decoder.
## Proposed Generation Framework
We construct three novel methods to enhance the language generation ability:
- **Span-by-span Generation Pre-training Task**: enables the model to generate a semantically complete span at each step rather than a single word.
- **Infilling Generation and Noise-aware Generation**: alleviate the problem of exposure bias.
- **Multi-Granularity Target Fragments**: enhance the correlation between encoder and decoder during pre-training.
Specifically, the span-by-span generation task and the word-by-word generation task, both based on the infilling generation mechanism, are implemented by a carefully designed **Multi-Flow Attention** architecture, as shown below.
![multi-flow-attention](.meta/multi-flow-attention.png)
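As a rough illustration of the noise-aware generation idea listed above, the sketch below corrupts a target token sequence by replacing some tokens with random vocabulary ids, so that a model trained on it sees an imperfect decoding history. The function name `add_target_noise`, the uniform sampling, and the `vocab_size` and `noise_rate` values are assumptions made for this example; this is not the released implementation, which may, for instance, treat special tokens differently.

```python
import random


def add_target_noise(target_ids, vocab_size, noise_rate, rng=None):
    """Return a copy of target_ids with random replacements.

    Each token is independently replaced by a uniformly sampled
    vocabulary id with probability noise_rate (hypothetical scheme).
    """
    rng = rng or random.Random()
    noisy = []
    for token_id in target_ids:
        if rng.random() < noise_rate:
            # Replace the token with an arbitrary vocabulary id.
            noisy.append(rng.randrange(vocab_size))
        else:
            # Keep the original token.
            noisy.append(token_id)
    return noisy


# Illustrative ids only; the training labels stay the clean sequence,
# while the noisy copy is what would be fed back in as history.
clean_target = [101, 2023, 2003, 1037, 7099, 102]
noisy_target = add_target_noise(
    clean_target, vocab_size=30522, noise_rate=0.5, rng=random.Random(0)
)
print(noisy_target)
```

The input list is left unchanged, so the clean sequence remains available as the prediction target.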
## Pre-trained Models
We release checkpoints for the **ERNIE-GEN _base_** and **ERNIE-GEN _large_** models, both pre-trained on English Wikipedia and [BookCorpus](https://arxiv.org/abs/1506.06724) (16GB in total). In addition, **ERNIE-GEN _large_** pre-trained on the 160GB corpus (used by [RoBERTa](https://arxiv.org/abs/1907.11692) and [BART](https://arxiv.org/abs/1910.13461)) is available as well.
- [**ERNIE-GEN _base_**](https://ernie.bj.bcebos.com/ernie_gen_base.tgz) (_lowercased | 12-layer, 768-hidden, 12-heads, 110M parameters_)
- [**ERNIE-GEN _large_**](https://ernie.bj.bcebos.com/ernie_gen_large.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
- [**ERNIE-GEN _large with 160G_**](https://ernie.bj.bcebos.com/ernie_gen_large_160g.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
## Fine-tuning on Downstream Tasks
We compare the performance of [ERNIE-GEN](https://arxiv.org/pdf/2001.11314.pdf) with existing SOTA pre-training models for natural language generation ([UniLM](https://arxiv.org/abs/1905.03197), [MASS](https://arxiv.org/abs/1905.02450), [PEGASUS](https://arxiv.org/abs/1912.08777), [BART](https://arxiv.org/abs/1910.13461) and [T5](https://arxiv.org/abs/1910.10683)) on 5 generation tasks, including abstractive summarization (**_Gigaword_** and **_CNN/Daily Mail_**), question generation (**_SQuAD_**), dialogue generation (**_Persona-Chat_**) and generative question answering (**_CoQA_**).
### Abstractive Summarization
- _**Gigaword**_
The results on Gigaword-10k (10K examples of Gigaword) are presented as follows:
| Model | Data / Params | Rouge-1 | Rouge-2 | Rouge-L |
| :-------------------------------------------------------- | :----------------------------: | :----------------------: | :----------------------: | :----------------------: |
| UniLM | 16G / 340M | 34.21 | 15.28 | 31.54 |
| **ERNIE-GEN** _base_ | 16G / 110M | 33.75 | 15.23 | 31.35 |
| **ERNIE-GEN** _large_ | 16G / 340M | 35.05 | 16.10 | 32.50 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **35.51** | **16.79** | **33.23** |
The results on Gigaword are presented as follows:
| Model | Data / Params | Rouge-1 | Rouge-2 | Rouge-L |
| :-------------------------------------------------------- | :----------------------------: | :----------------------: | :----------------------: | :----------------------: |
| MASS | 18G / 160M | 38.73 | 19.71 | 35.96 |
| BERTSHARE | 16G / 110M | 38.13 | 19.81 | 35.62 |
| UniLM | 16G / 340M | 38.45 | 19.45 | 35.75 |
| PEGASUS (_C4_) | 750G / 568M | 38.75 | 19.96 | 36.14 |
| PEGASUS (_HugeNews_) | 3.8T / 568M | 39.12 | 19.86 | 36.24 |
| **ERNIE-GEN** _base_ | 16G / 110M | 38.83 | 20.04 | 36.20 |
| **ERNIE-GEN** _large_ | 16G / 340M | 39.25 | 20.25 | 36.53 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **39.46** | **20.34** | **36.74** |
We preprocess the raw Gigaword dataset following UniLM; the preprocessed data is available at [Gigaword](https://ernie.bj.bcebos.com/gigaword.tgz).
- _**CNN/Daily Mail**_
The results on CNN/Daily Mail are presented as follows:
| Model | Data / Params | Rouge-1 | Rouge-2 | Rouge-L |
| :-------------------------------------------------------- | :-----------: | :----------------------: | :----------------------: | :----------------------: |
| MASS | 18G / 160M | 42.12 | 19.50 | 39.01 |
| UniLM | 16G / 340M | 43.33 | 20.21 | 40.51 |
| T5 _large_ | 750G / 340M | 42.50 | 20.68 | 39.75 |
| T5 _xlarge_ | 750G / 11B | 43.52 | **21.55** | 40.69 |
| BART | 160G / 400M | 44.16 | 21.28 | 40.90 |
| PEGASUS (_C4_) | 750G / 568M | 43.90 | 21.20 | 40.76 |
| PEGASUS (_HugeNews_) | 3.8T / 568M | 44.17 | 21.47 | 41.11 |
| **ERNIE-GEN** _base_ | 16G / 110M | 42.30 | 19.92 | 39.68 |
| **ERNIE-GEN** _large_ | 16G / 340M | 44.02 | 21.17 | 41.26 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **44.31** | 21.35 | **41.60** |
We preprocess the raw CNN/Daily Mail dataset following UniLM; the preprocessed data is available at [CNN/Daily Mail](https://ernie.bj.bcebos.com/cnndm.tgz).
### Question Generation
- _**SQuAD**_
The results on the [SQuAD 1.1](https://arxiv.org/abs/1806.03822) dataset following the data split in [[Du et al., 2017]](https://arxiv.org/pdf/1705.00106.pdf) are presented as follows:
| Model | BLEU-4 | METEOR | Rouge-L |
| :----------------------------------------------------------- | :----------------------: | :----------------------: | :----------------------: |
| [SemQG](https://arxiv.org/abs/1909.06356) | 18.37 | 22.65 | 46.68 |
| UniLM _large_ (beam size=1) | 22.12 | 25.06 | 51.07 |
| **ERNIE-GEN** _base_ (beam size=1) | 22.28 | 25.13 | 50.38 |
| **ERNIE-GEN** _large_ (beam size=1) | 24.03 | 26.31 | 52.36 |
| **ERNIE-GEN** _large_ (beam size=5) | 25.40 | **26.92** | 52.84 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **25.41** | 26.77 | **52.91** |
The results following the reversed dev-test data split in [[Zhao et al., 2018]](https://www.aclweb.org/anthology/D18-1424/) are presented as follows:
| Model | BLEU-4 | METEOR | Rouge-L |
| :----------------------------------------------------------- | :----------------------: | :----------------------: | :----------------------: |
| SemQG | 20.76 | 24.20 | 48.91 |
| UniLM _large_ (beam size=1) | 23.75 | 25.61 | 52.04 |
| **ERNIE-GEN** _base_ (beam size=1) | 23.52 | 25.61 | 51.45 |
| **ERNIE-GEN** _large_ (beam size=1) | 25.57 | 26.89 | 53.31 |
| **ERNIE-GEN** _large_ (beam size=5) | 26.95 | **27.57** | 53.77 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **27.05** | 27.43 | **53.83** |
*_Note that we also report results with a larger beam size of 5._
The preprocessed data for the question generation task can be downloaded from [SQuAD](https://ernie.bj.bcebos.com/squad_qg.tgz).
### Generative Dialogue Response
- _**Persona-Chat**_
A comparison with current state-of-the-art results on the multi-turn conversation task ([Persona-Chat](https://arxiv.org/abs/1801.07243)) is presented as follows:
| Model | BLEU-1 | BLEU-2 | Distinct-1 | Distinct-2 |
| :-------------------------------------------------------- | :---------------------: | :---------------------: | :-------------------------: | :---------------------------: |
| [LIC](https://arxiv.org/abs/1910.07931) | 40.5 | 32.0 | 0.019 | 0.113 |
| [PLATO](https://arxiv.org/abs/1910.07931) | 45.8 | 35.7 | 0.012 | 0.064 |
| PLATO _w/o latent_ | 40.6 | 31.5 | 0.021 | 0.121 |
| **ERNIE-GEN** _large_ | **46.8** | **36.4** | **0.023** | **0.168** |
The training data can be downloaded from [Persona-Chat](https://ernie.bj.bcebos.com/persona_chat.tgz).
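For readers unfamiliar with the Distinct-1 and Distinct-2 columns above, the following is a minimal sketch of the commonly used corpus-level definition: the number of unique n-grams divided by the total number of n-grams across all generated responses. The function name `distinct_n` and the whitespace-tokenized toy responses are assumptions for illustration; the evaluation script behind the reported numbers may tokenize or aggregate differently.

```python
def distinct_n(responses, n):
    """Corpus-level Distinct-n: unique n-grams / total n-grams.

    responses is a list of token lists, one per generated response.
    """
    total = 0
    unique = set()
    for tokens in responses:
        # Slide a window of size n over each response.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # float() keeps true division under both Python 2 and Python 3.
    return float(len(unique)) / total if total else 0.0


# Toy responses, tokenized by whitespace for the example.
responses = [
    "i like tea".split(),
    "i like coffee".split(),
]
print(distinct_n(responses, 1))
print(distinct_n(responses, 2))
```

Higher values indicate that the generated responses repeat fewer words or phrases.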
### Generative Question Answering
- _**CoQA**_
Results on the development set of the CoQA task are presented as follows:
| Model | F1-score |
| :-------------------------------------------------------- | :------: |
| [Seq2Seq](https://arxiv.org/abs/1910.07931) | 27.5 |
| [PGNet](https://arxiv.org/abs/1910.07931) | 45.4 |
| UniLM _large_ | 82.5 |
| **ERNIE-GEN** _large_ | **84.5** |
We preprocess the raw [CoQA](https://arxiv.org/abs/1808.07042) dataset; the preprocessed data is available at [CoQA-preprocessed](https://ernie.bj.bcebos.com/coqa.tgz).
Finally, we also compare with a concurrent work, [ProphetNet](https://arxiv.org/abs/2001.04063). The fine-tuning results on Gigaword, CNN/Daily Mail and SQuAD are reported as follows:
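The F1-score reported for CoQA measures word overlap between a generated answer and a reference answer. The sketch below shows a simplified token-level F1 for a single prediction and reference pair; the function name `token_f1` and the lowercase, whitespace-based normalization are assumptions for this example. The official CoQA evaluation applies its own answer normalization and averages over multiple human references, so this sketch will not reproduce the numbers in the table.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Simplified token-level F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = float(overlap) / len(pred_tokens)
    recall = float(overlap) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the white cat", "a white cat"))
```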
- _**Abstractive Summarization**_
| Model / Task | Data / Params | Gigaword |CNN/Daily Mail|
| :-------------------------------------------------------- | :----------------------------: | :----------------------: | :----------------------: |
| Metric | - | Rouge-1 / Rouge-2 / Rouge-L |Rouge-1 / Rouge-2 / Rouge-L|
| **ProphetNet** _large_ (160G) | 160G / 340M | **39.51** / **20.42** / 36.69 |44.20 / 21.17 / 41.30|
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | 39.46 / 20.34 / **36.74** |**44.31** / **21.35** / **41.60**|
- _**Question Generation**_
| Model | Data / Params | BLEU-4 / METEOR / Rouge-L |BLEU-4 / METEOR / Rouge-L|
| :-------------------------------------------------------- | :----------------------------: | :----------------------: |:----------------------: |
| Data split | - | Original |Reversed dev-test|
| **ProphetNet** _large_ (16G) | 16G / 340M | 25.01 / 26.83 / 52.57 |26.72 / **27.64** / **53.79** |
| **ERNIE-GEN** _large_ (16G) | 16G / 340M | **25.40** / **26.92** / **52.84** |**26.95** / 27.57 / 53.77|
## Usage
### Install PaddlePaddle
This code base has been tested with Paddle Fluid 1.7 and Python 2.7. The other dependencies of ERNIE-GEN are listed in `requirements.txt`; you can install them with:
```script
pip install -r requirements.txt
```
### Fine-tuning
Make sure `LD_LIBRARY_PATH` includes the CUDA, cuDNN and NCCL2 library paths before running ERNIE-GEN. We have put the parameter configurations for the above downstream tasks in `configs/`, so you can run fine-tuning directly from these configuration files. For example, you can fine-tune the ERNIE-GEN base model on Gigaword with:
```script
MODEL="base" # base or large or large_160g
TASK="gigaword" # cnndm, coqa, gigaword, squad_qg or persona-chat
sh run_seq2seq.sh ./configs/${MODEL}/${TASK}_conf
```
The training log and the evaluation results are written to `log/job.log.0`. To fine-tune on your own task data, refer to the data format we provide when preparing your data.
Our fine-tuning experiments were carried out on 8 NVIDIA V100 (32GB) GPUs. If you run out of GPU memory, reduce the batch size in the corresponding configuration file.
**NOTICE:** The actual total batch size is equal to `configured batch size * number of used gpus`.
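The relationship in the notice above matters when you run on fewer GPUs than the 8 used in our experiments. As a small worked example, the sketch below computes the total batch size and the per-GPU value needed to keep that total unchanged on a different number of GPUs. The helper names and the numbers 16, 8 and 4 are hypothetical; check the actual batch size in the configuration file you use.

```python
def total_batch_size(configured_batch_size, num_gpus):
    """Total batch size = configured batch size * number of used GPUs."""
    return configured_batch_size * num_gpus


def configured_size_for(target_total, num_gpus):
    """Configured (per-GPU) batch size that yields target_total."""
    if target_total % num_gpus != 0:
        raise ValueError("target_total must be divisible by num_gpus")
    return target_total // num_gpus


# Hypothetical values: 16 configured on 8 GPUs gives a total of 128.
total = total_batch_size(16, 8)
# Keeping the same total on 4 GPUs requires configuring 32 per GPU.
print(total, configured_size_for(total, 4))
```

If the larger per-GPU value does not fit in memory, the total batch size will be smaller than in our experiments and other hyperparameters may need retuning.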
### Employ Dynamic Computation Graph
The dynamic-graph version of the ERNIE-GEN code is more concise and flexible. Please refer to [ERNIE-GEN Dygraph](https://github.com/PaddlePaddle/ERNIE/tree/develop/experimental/seq2seq) for details on how to use it.
### The ERNIE 1.0 is avaliable for Chinese Generation Tasks
The ERNIE-GEN code is compatible with the [ERNIE 1.0](https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz) model. After specifying the model and data parameters in the configuration file, you can use ERNIE 1.0 to fine-tune on Chinese generation tasks.
## Citation
You can cite the paper as below:
```
@article{xiao2020ernie-gen,
title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation},
author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2001.11314},
year={2020}
}
```