Unverified commit da820fc1, authored by Z zhanghan, committed by GitHub

Merge pull request #487 from zhanghan1992/repro

Update README
......@@ -43,11 +43,11 @@ Specifically, the span-by-span generation task and word-by-word generation task
## Pre-trained Models
We release the checkpoints for **ERNIE-GEN _base_** model and **ERNIE-GEN _large_** model which are both pre-trained on English Wikipedia and [BookCorpus](https://arxiv.org/abs/1506.06724) (totally 16GB). Besides, **ERNIE-GEN _large_** pre-trained on the 160GB corpus (used by [RoBERTa](https://arxiv.org/abs/1907.11692) and [BART](https://arxiv.org/abs/1910.13461)) is available as well.
We release the checkpoints for **ERNIE-GEN _base_** model and **ERNIE-GEN _large_** model which are both pre-trained on English Wikipedia and [BookCorpus](https://arxiv.org/abs/1506.06724) (totally 16GB). Besides, **ERNIE-GEN _large_** pre-trained on the 430GB corpus (see [ERNIE-GEN Appendix A.1](https://arxiv.org/abs/2001.11314) for the description of the corpus) is available as well.
- [**ERNIE-GEN _base_**](https://ernie.bj.bcebos.com/ernie_gen_base.tgz) (_lowercased | 12-layer, 768-hidden, 12-heads, 110M parameters_)
- [**ERNIE-GEN _large_**](https://ernie.bj.bcebos.com/ernie_gen_large.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
- [**ERNIE-GEN _large with 160G_**](https://ernie.bj.bcebos.com/ernie_gen_large_160g.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
- [**ERNIE-GEN _large with 430G_**](https://ernie.bj.bcebos.com/ernie_gen_large_430g.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
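Each checkpoint above can be fetched and unpacked with standard tools (a minimal sketch, not part of the release scripts; the extracted layout is inferred from the configuration files later in this diff and is not guaranteed):
```script
# Hedged example: download and unpack the released base checkpoint.
# The archive is expected to extract into a directory such as ernie_gen_base/
# containing vocab.txt, ernie_config.json and params/ (layout inferred, not guaranteed).
wget https://ernie.bj.bcebos.com/ernie_gen_base.tgz
tar -xzf ernie_gen_base.tgz
```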
## Fine-tuning on Downstream Tasks
......@@ -65,7 +65,7 @@ The results on Gigaword-10k (10K examples of Gigaword) are presented as follows:
| UniLM | 16G / 340M | 34.21 | 15.28 | 31.54 |
| **ERNIE-GEN** _base_ | 16G / 110M | 33.75 | 15.23 | 31.35 |
| **ERNIE-GEN** _large_ | 16G / 340M | 35.05 | 16.10 | 32.50 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **35.51** | **16.79** | **33.23** |
| **ERNIE-GEN** _large_ (430G) | 430G / 340M | **35.51** | **16.79** | **33.23** |
The results on Gigaword are presented as follows:
......@@ -78,7 +78,7 @@ The results on Gigaword are presented as follows:
| PEGASUS (_HugeNews_) | 3.8T / 568M | 39.12 | 19.86 | 36.24 |
| **ERNIE-GEN** _base_ | 16G / 110M | 38.83 | 20.04 | 36.20 |
| **ERNIE-GEN** _large_ | 16G / 340M | 39.25 | 20.25 | 36.53 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **39.46** | **20.34** | **36.74** |
| **ERNIE-GEN** _large_ (430G) | 430G / 340M | **39.46** | **20.34** | **36.74** |
We preprocess the raw Gigaword dataset following UniLM; the preprocessed data is available at [Gigaword](https://ernie.bj.bcebos.com/gigaword.tgz).
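The archive can be placed under the `datasets/` directory referenced by the task configuration files (a minimal sketch; the exact target path is an assumption based on the `data_path` values in the configs shown later in this diff):
```script
# Hedged example: fetch the preprocessed Gigaword data and unpack it under ./datasets/.
# The target directory is an assumption; check the data_path in the corresponding task config.
wget https://ernie.bj.bcebos.com/gigaword.tgz
mkdir -p datasets && tar -xzf gigaword.tgz -C datasets/
```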
......@@ -97,7 +97,7 @@ The results on CNN/Daily Mail are presented as follows:
| PEGASUS (_HugeNews_) | 3.8T / 568M | 44.17 | 21.47 | 41.11 |
| **ERNIE-GEN** _base_ | 16G / 110M | 42.30 | 19.92 | 39.68 |
| **ERNIE-GEN** _large_ | 16G / 340M | 44.02 | 21.17 | 41.26 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **44.31** | 21.35 | **41.60** |
| **ERNIE-GEN** _large_ (430G) | 430G / 340M | **44.31** | 21.35 | **41.60** |
We preprocess the raw CNN/Daily Mail dataset following UniLM; the preprocessed data is available at [CNN/Daily Mail](https://ernie.bj.bcebos.com/cnndm.tgz).
......@@ -114,7 +114,7 @@ The results on the [SQuAD 1.1](https://arxiv.org/abs/1806.03822) dataset followi
| **ERNIE-GEN** _base_ (beam size=1) | 22.28 | 25.13 | 50.38 |
| **ERNIE-GEN** _large_ (beam size=1) | 24.03 | 26.31 | 52.36 |
| **ERNIE-GEN** _large_ (beam size=5) | 25.40 | **26.92** | 52.84 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **25.41** | 26.77 | **52.91** |
| **ERNIE-GEN** _large_ (beam size=5) + (430G) | **25.41** | 26.77 | **52.91** |
The results following the reversed dev-test data split in [[Zhao et al., 2018]](https://www.aclweb.org/anthology/D18-1424/) are presented as follows:
......@@ -125,7 +125,7 @@ The results following the reversed dev-test data split in [[Zhao et al., 2018]](
| **ERNIE-GEN** _base_ (beam size=1) | 23.52 | 25.61 | 51.45 |
| **ERNIE-GEN** _large_ (beam size=1) | 25.57 | 26.89 | 53.31 |
| **ERNIE-GEN** _large_ (beam size=5) | 26.95 | **27.57** | 53.77 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **27.05** | 27.43 | **53.83** |
| **ERNIE-GEN** _large_ (beam size=5) + (430G) | **27.05** | 27.43 | **53.83** |
*_Note that we also report results with a larger beam size of 5._
......@@ -161,24 +161,6 @@ Results of development set on CoQA task is presented as follows:
We preprocess the raw [CoQA](https://arxiv.org/abs/1808.07042) dataset; the preprocessed data is available at [CoQA-preprocessed](https://ernie.bj.bcebos.com/coqa.tgz).
Finally, we also compare with the concurrent work [ProphetNet](https://arxiv.org/abs/2001.04063); the fine-tuning results on Gigaword, CNN/Daily Mail and SQuAD are reported as follows:
- _**Abstractive Summarization**_
| Model / Task | <strong>Data / Params</strong> | <strong>Gigaword</strong> |<strong>CNN/Daily Mail</strong>|
| :-------------------------------------------------------- | :----------------------------: | :----------------------: | :----------------------: |
| Metric | - | <strong>Rouge-1 / Rouge-2 / Rouge-L</strong> |<strong>Rouge-1 / Rouge-2 / Rouge-L</strong>|
| **ProphetNet** _large_ (160G) | 160G / 340M | **39.51** / **20.42** / 36.69 |44.20 / 21.17 / 41.30|
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | 39.46 / 20.34 / **36.74** |**44.31** / **21.35** / **41.60**|
- _**Question Generation**_
| Model | <strong>Data / Params</strong> | <strong>BLEU-4 / METEOR / Rouge-L</strong> |<strong>BLEU-4 / METEOR / Rouge-L</strong>|
| :-------------------------------------------------------- | :----------------------------: | :----------------------: |:----------------------: |
| Data split | - | <strong>Original</strong> |<strong>Reversed dev-test</strong>|
| **ProphetNet** _large_ (16G) | 16G / 340M | 25.01 / 26.83 / 52.57 |26.72 / **27.64** / **53.79** |
| **ERNIE-GEN** _large_ (16G) | 16G / 340M | **25.40** / **26.92** / **52.84** |**26.95** / 27.57 / **53.77**|
## Usage
### Install PaddlePaddle
......@@ -191,7 +173,7 @@ pip install -r requirements.txt
### Fine-tuning
Please add the CUDA, cuDNN and NCCL2 library paths to LD_LIBRARY_PATH before running ERNIE-GEN. The parameter configurations for the above downstream tasks are provided in `config/`, so fine-tuning can be launched directly from these configuration files. For example, you can fine-tune the ERNIE-GEN base model on Gigaword by
```script
MODEL="base" # base or large or large_160g
MODEL="base" # base or large or large_430g
TASK="gigaword" # cnndm, coqa, gigaword, squad_qg or persona-chat
sh run_seq2seq.sh ./configs/${MODEL}/${TASK}_conf
```
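In practice, updating LD_LIBRARY_PATH usually amounts to exporting the library directories of the local CUDA, cuDNN and NCCL2 installations before launching the script (a minimal sketch; the paths below are placeholders and depend on your environment):
```script
# Example only: point these at your actual CUDA / cuDNN / NCCL2 installations.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl/lib:$LD_LIBRARY_PATH
```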
......
......@@ -43,11 +43,11 @@
## Pre-trained Models
We release the **ERNIE-GEN _base_** and **ERNIE-GEN _large_** models, both pre-trained on English Wikipedia and BookCorpus (16GB in total). We also release an **ERNIE-GEN _large_** model pre-trained on a 160GB corpus, the same corpus used to pre-train [RoBERTa](https://arxiv.org/abs/1907.11692) and [BART](https://arxiv.org/abs/1910.13461).
We release the **ERNIE-GEN _base_** and **ERNIE-GEN _large_** models, both pre-trained on English Wikipedia and BookCorpus (16GB in total). We also release an **ERNIE-GEN _large_** model pre-trained on a 430GB corpus (see [ERNIE-GEN Appendix A.1](https://arxiv.org/abs/2001.11314) for a description of the corpus).
- [**ERNIE-GEN _base_**](https://ernie.bj.bcebos.com/ernie_gen_base.tgz) (_lowercased | 12-layer, 768-hidden, 12-heads, 110M parameters_)
- [**ERNIE-GEN _large_**](https://ernie.bj.bcebos.com/ernie_gen_large.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
- [**ERNIE-GEN _large with 160G_**](https://ernie.bj.bcebos.com/ernie_gen_large_160g.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
- [**ERNIE-GEN _large with 430G_**](https://ernie.bj.bcebos.com/ernie_gen_large_430g.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
## Fine-tuning Tasks
......@@ -65,7 +65,7 @@
| UniLM | 16G / 340M | 34.21 | 15.28 | 31.54 |
| **ERNIE-GEN** _base_ | 16G / 110M | 33.75 | 15.23 | 31.35 |
| **ERNIE-GEN** _large_ | 16G / 340M | 35.05 | 16.10 | 32.50 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **35.51** | **16.79** | **33.23** |
| **ERNIE-GEN** _large_ (430G) | 430G / 340M | **35.51** | **16.79** | **33.23** |
Results on Gigaword:
......@@ -78,7 +78,7 @@
| PEGASUS (_HugeNews_) | 3.8T / 568M | 39.12 | 19.86 | 36.24 |
| **ERNIE-GEN** _base_ | 16G / 110M | 38.83 | 20.04 | 36.20 |
| **ERNIE-GEN** _large_ | 16G / 340M | 39.25 | 20.25 | 36.53 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **39.46** | **20.34** | **36.74** |
| **ERNIE-GEN** _large_ (430G) | 430G / 340M | **39.46** | **20.34** | **36.74** |
We preprocessed the data following UniLM; download link: [Gigaword](https://ernie.bj.bcebos.com/gigaword.tgz).
......@@ -97,7 +97,7 @@
| PEGASUS (_HugeNews_) | 3.8T / 568M | 44.17 | 21.47 | 41.11 |
| **ERNIE-GEN** _base_ | 16G / 110M | 42.30 | 19.92 | 39.68 |
| **ERNIE-GEN** _large_ | 16G / 340M | 44.02 | 21.17 | 41.26 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **44.31** | 21.35 | **41.60** |
| **ERNIE-GEN** _large_ (430G) | 430G / 340M | **44.31** | 21.35 | **41.60** |
We preprocessed the data following UniLM; download link: [CNN/Daily Mail](https://ernie.bj.bcebos.com/cnndm.tgz).
......@@ -114,7 +114,7 @@
| **ERNIE-GEN** _base_ (beam size=1) | 22.28 | 25.13 | 50.38 |
| **ERNIE-GEN** _large_ (beam size=1) | 24.03 | 26.31 | 52.36 |
| **ERNIE-GEN** _large_ (beam size=5) | 25.40 | **26.92** | 52.84 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **25.41** | 26.77 | **52.91** |
| **ERNIE-GEN** _large_ (beam size=5) + (430G) | **25.41** | 26.77 | **52.91** |
Results with the dev and test sets reversed, following [[Zhao et al., 2018]](https://www.aclweb.org/anthology/D18-1424/):
......@@ -125,7 +125,7 @@
| **ERNIE-GEN** _base_ (beam size=1) | 23.52 | 25.61 | 51.45 |
| **ERNIE-GEN** _large_ (beam size=1) | 25.57 | 26.89 | 53.31 |
| **ERNIE-GEN** _large_ (beam size=5) | 26.95 | **27.57** | 53.77 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **27.05** | 27.43 | **53.83** |
| **ERNIE-GEN** _large_ (beam size=5) + (430G) | **27.05** | 27.43 | **53.83** |
*_We also report results with the beam size increased to 5._
......@@ -159,23 +159,6 @@
We preprocessed the raw CoQA dataset; download link: [CoQA](https://ernie.bj.bcebos.com/coqa.tgz).
In addition, we compare with the concurrent work [ProphetNet](https://arxiv.org/abs/2001.04063) on the Gigaword, CNN/Daily Mail and SQuAD datasets:
- _**Abstractive Summarization**_
| Model / Task | <strong>Data / Params</strong> | <strong>Gigaword</strong> |<strong>CNN/Daily Mail</strong>|
| :-------------------------------------------------------- | :------------------------------: | :----------------------: | :----------------------: |
| Metric | - | <strong>Rouge-1 / Rouge-2 / Rouge-L</strong> |<strong>Rouge-1 / Rouge-2 / Rouge-L</strong>|
| ProphetNet _large_ (160G) | 160G / 340M | **39.51** / **20.42** / 36.69 |44.20 / 21.17 / 41.30|
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | 39.46 / 20.34 / **36.74** |**44.31** / **21.35** / **41.60**|
- _**Question Generation**_
| Model | <strong>Data / Params</strong> | <strong>BLEU-4 / METEOR / Rouge-L</strong> |<strong>BLEU-4 / METEOR / Rouge-L</strong>|
| :-------------------------------------------------------- | :------------------------------: | :----------------------: |:----------------------: |
| Data split | - | <strong>Original</strong> |<strong>Reversed dev-test</strong>|
| ProphetNet _large_ (16G) | 16G / 340M | 25.01 / 26.83 / 52.57 |26.72 / **27.64** / **53.79** |
| **ERNIE-GEN** _large_ (16G) | 16G / 340M | **25.40** / **26.92** / **52.84** |**26.95** / 27.57 / **53.77**|
## Usage
......@@ -189,7 +172,7 @@ pip install -r requirements.txt
### Fine-tuning
Before running ERNIE-GEN, add the dynamic library paths of CUDA, cuDNN and NCCL2 to LD_LIBRARY_PATH. The parameter configuration files for the downstream tasks are provided in `config/`, so fine-tuning can be launched directly from these files. For example, you can fine-tune the ERNIE-GEN base model on Gigaword with:
```script
MODEL="base" # base or large or large_160g
MODEL="base" # base or large or large_430g
TASK="gigaword" # cnndm, coqa, gigaword, squad_qg or persona-chat
sh run_seq2seq.sh ./configs/${MODEL}/${TASK}_conf
```
......
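# Fine-tuning configuration for the CoQA conversational QA task with the ERNIE-GEN large checkpoint
# (task identified from the data_path and init_model values below).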
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#for multi-turn dialog/qa
task_type="dialog"
role_type_size=3
turn_type_size=16
#input
max_src_len=480
max_tgt_len=32
tokenized_input="true"
continuous_position="true"
batch_size=4
in_tokens="false"
#tgt_type_id=1
#decode
do_decode="true"
max_dec_len=30
beam_size=3
length_penalty=0.0
use_multi_gpu_test="true"
#train
epoch=10
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1e-5
#noise
random_noise="false"
noise_prob=0.5
#dataset
data_path="./datasets/coqa/"
train_set="train.tsv"
dev_set="dev.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/coqa/eval.sh"
eval_mertrics="f1"
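# Fine-tuning configuration for the Persona-Chat dialogue task with the ERNIE-GEN large checkpoint
# (task identified from the data_path and init_model values below).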
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#for multi-turn dialog/qa
task_type="dialog"
role_type_size=3
turn_type_size=16
#input
max_src_len=472
max_tgt_len=40
tokenized_input="true"
continuous_position="true"
batch_size=8
in_tokens="false"
#decode
do_decode="true"
max_dec_len=32
beam_size=10
length_penalty=1.3
use_multi_gpu_test="true"
#train
epoch=30
weight_decay=0.01
label_smooth=0.0
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1e-4
#noise
random_noise="false"
noise_prob=0.0
#dataset
data_path="./datasets/persona_chat/"
train_set="train.tsv"
dev_set="dev.2k.tsv"
pred_set="test.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="true"
do_decode="true"
#evaluate
eval_script="sh ./eval/tasks/persona_chat/eval.sh"
eval_mertrics="bleu_1,bleu_2,distinct_1,distinct_2"
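# The configuration diffs below only update the checkpoint directory from ernie_gen_large_160g
# to ernie_gen_large_430g; the surrounding settings (e.g. max_src_len) appear as unchanged context.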
#load model
vocab_path="ernie_gen_large_160g/vocab.txt"
config_path="ernie_gen_large_160g/ernie_config.json"
init_model="ernie_gen_large_160g/params"
vocab_path="ernie_gen_large_430g/vocab.txt"
config_path="ernie_gen_large_430g/ernie_config.json"
init_model="ernie_gen_large_430g/params"
#input
max_src_len=640
......
#load model
vocab_path="ernie_gen_large_160g/vocab.txt"
config_path="ernie_gen_large_160g/ernie_config.json"
init_model="ernie_gen_large_160g/params"
vocab_path="ernie_gen_large_430g/vocab.txt"
config_path="ernie_gen_large_430g/ernie_config.json"
init_model="ernie_gen_large_430g/params"
#input
max_src_len=192
......
#load model
vocab_path="ernie_gen_large_160g/vocab.txt"
config_path="ernie_gen_large_160g/ernie_config.json"
init_model="ernie_gen_large_160g/params"
vocab_path="ernie_gen_large_430g/vocab.txt"
config_path="ernie_gen_large_430g/ernie_config.json"
init_model="ernie_gen_large_430g/params"
#input
max_src_len=192
......
#load model
vocab_path="ernie_gen_large_160g/vocab.txt"
config_path="ernie_gen_large_160g/ernie_config.json"
init_model="ernie_gen_large_160g/params"
vocab_path="ernie_gen_large_430g/vocab.txt"
config_path="ernie_gen_large_430g/ernie_config.json"
init_model="ernie_gen_large_430g/params"
#input
max_src_len=512
......