English | [简体中文](./README.zh.md)

`Note`: the *ERNIE-Gram* model has been officially released [here](#3-download-pretrained-models-optional). Our reproduction code is available on the [repro branch](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gram).


## _ERNIE-Gram_: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding

![ERNIE-Gram](.meta/ernie-gram.jpeg)

- [Framework](#ernie-gram-framework)
- [Quick Tour](#quick-tour)
- [Setup](#setup)
    * [Install PaddlePaddle](#1-install-paddlepaddle)
    * [Install ERNIE Kit](#2-install-ernie-kit)
    * [Download pre-trained models](#3-download-pretrained-models-optional)
    * [Download datasets](#4-download-datasets)
- [Fine-tuning](#fine-tuning)
- [Citation](#citation)

### ERNIE-Gram Framework

Since **ERNIE 1.0**, Baidu researchers have incorporated **knowledge-enhanced representation learning** into pre-training, masking consecutive words, phrases, named entities, and other semantic knowledge units so the model learns better representations. Building on this line of work, we propose **ERNIE-Gram**, an explicitly n-gram masked language model that enhances the integration of coarse-grained information into pre-training. In **ERNIE-Gram**, **n-grams** are masked and predicted directly using **explicit** n-gram identities rather than contiguous sequences of tokens.
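To make the contrast concrete, here is a toy sketch (illustrative only, not the actual pre-training code): a sampled n-gram is collapsed into a single mask slot and predicted as one unit, its n-gram identity, rather than token by token.

```python
import random

def explicit_ngram_mask(tokens, max_n=3, mask_prob=0.15, seed=0):
    """Toy illustration of explicit n-gram masking (not the real
    ERNIE-Gram code): each selected n-gram becomes ONE [MASK] slot and
    is predicted as a single unit -- its n-gram identity."""
    rng = random.Random(seed)
    masked, targets = [], []
    i = 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            n = rng.randint(1, max_n)               # sample an n-gram length
            targets.append(tuple(tokens[i:i + n]))  # the identity to predict
            masked.append('[MASK]')                 # one slot for the whole n-gram
            i += n
        else:
            masked.append(tokens[i])
            i += 1
    return masked, targets

tokens = 'ernie gram masks whole n grams rather than single tokens'.split()
print(explicit_ngram_mask(tokens))
```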

In downstream tasks, **ERNIE-Gram** uses a `bert-style` fine-tuning approach, so it keeps the same parameter size and computational complexity as BERT.

We pre-train **ERNIE-Gram** on `English` and `Chinese` text corpora and fine-tune it on `19` downstream tasks. Experimental results show that **ERNIE-Gram** outperforms previous pre-training models such as *XLNet* and *RoBERTa* by a large margin, and achieves results comparable to state-of-the-art methods.

The **ERNIE-Gram** paper has been accepted to **NAACL-HLT 2021**; for more details, please see the [paper](https://arxiv.org/abs/2010.12148).

### Quick Tour

```shell
mkdir -p data
cd data
wget https://ernie-github.cdn.bcebos.com/data-xnli.tar.gz
tar xf data-xnli.tar.gz
cd ..
# fine-tuning demo for the NLI task (XNLI)
sh run_cls.sh task_configs/xnli_conf
```
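For programmatic use, here is a minimal dygraph sketch in the style of this repo's demos; the `ernie-gram` model name is taken from the abbreviation table in [Section 3](#3-download-pretrained-models-optional) and is an assumption here:

```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

# downloads the pretrained model on first use; 'ernie-gram' is the
# abbreviation from the table in section 3 (assumed to be registered)
model = ErnieModel.from_pretrained('ernie-gram')
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-gram')

ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))  # add a batch dimension
pooled, encoded = model(ids)               # eager (dygraph) execution
print(pooled.numpy())
```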

### Setup

##### 1. Install PaddlePaddle

This repo requires PaddlePaddle 2.0.0+; please see [here](https://www.paddlepaddle.org.cn/install/quick) for installation instructions.

##### 2. Install ERNIE Kit

```shell
git clone https://github.com/PaddlePaddle/ERNIE.git --depth 1
cd ERNIE
pip install -r requirements.txt
pip install -e .
```
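A quick smoke test (our suggestion, not part of the official setup) to confirm the install worked:

```python
# these imports should succeed after `pip install -e .`
from ernie.modeling_ernie import ErnieModel
from ernie.tokenizing_ernie import ErnieTokenizer
print('ERNIE kit is importable')
```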

##### 3. Download pretrained models (optional)

| Model | Description | Abbreviation |
| :------------------------------------------------- | :----------------------------------------------------------- |:-----------|
| [ERNIE-Gram Base for Chinese](https://ernie-github.cdn.bcebos.com/model-ernie-gram-zh.1.tar.gz) | Layer:12, Hidden:768, Heads:12 | ernie-gram|
| [ERNIE-Gram Base for English](https://ernie-github.cdn.bcebos.com/model-ernie-gram-en.1.tar.gz) | Layer:12, Hidden:768, Heads:12 | ernie-gram-en |
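
If you download an archive manually, `from_pretrained` should also accept a local directory; a hedged sketch, since the extracted folder name may differ:

```python
from ernie.modeling_ernie import ErnieModel

# assumes model-ernie-gram-zh.1.tar.gz was extracted into this folder;
# adjust the path to wherever the archive actually unpacks
model = ErnieModel.from_pretrained('./model-ernie-gram-zh.1')
```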

##### 4. Download datasets

**English Datasets**

Download the [GLUE datasets](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e).

The `--data_dir` option in the following section assumes a directory tree like this:

```shell
data/xnli
├── dev
│   └── 1
├── test
│   └── 1
└── train
    └── 1
```

See the [demo data](https://ernie-github.cdn.bcebos.com/data-mnli-m.tar.gz) for the MNLI task.

### Fine-tuning

Try eager execution with the `dygraph` demo scripts (a minimal sketch follows the hyperparameter note below):

  - [Natural Language Inference](./demo/finetune_classifier_distributed.py)
  - [Sentiment Analysis](./demo/finetune_sentiment_analysis.py)
  - [Semantic Similarity](./demo/finetune_classifier.py)
  - [Named Entity Recognition (NER)](./demo/finetune_ner.py)
  - [Machine Reading Comprehension](./demo/finetune_mrc.py)


**Recommended hyperparameters:**

 - See the **ERNIE-Gram** paper, [Appendix B.1-B.4](https://arxiv.org/abs/2010.12148)
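
As a starting point, here is a minimal classification fine-tuning sketch in the style of the dygraph demos above; the `ernie-gram` name, `num_labels=3` (NLI-style), and the toy batch are placeholders, so see [finetune_classifier.py](./demo/finetune_classifier.py) for the full pipeline:

```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification

# 'ernie-gram' and num_labels=3 (entailment/neutral/contradiction) are
# assumptions for an XNLI-style setup
model = ErnieModelForSequenceClassification.from_pretrained('ernie-gram', num_labels=3)
tokenizer = ErnieTokenizer.from_pretrained('ernie-gram')
optimizer = P.optimizer.Adam(learning_rate=3e-5, parameters=model.parameters())

# toy batch of one premise/hypothesis pair; a real run iterates the
# XNLI data downloaded in the Quick Tour
ids, sids = tokenizer.encode('the premise', 'the hypothesis')
ids = P.to_tensor(np.expand_dims(ids, 0))
sids = P.to_tensor(np.expand_dims(sids, 0))
label = P.to_tensor(np.array([0], dtype='int64'))

loss, logits = model(ids, sids, labels=label)  # forward returns loss with labels
loss.backward()
optimizer.step()
optimizer.clear_grad()
```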

To fully reproduce the paper results, please check out the `repro` branch of this repo: [ernie-gram](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gram).

### Citation

```
@article{xiao2020ernie,
  title={ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding},
  author={Xiao, Dongling and Li, Yu-Kun and Zhang, Han and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2010.12148},
  year={2020}
}
```

### Communication

- [ERNIE homepage](https://wenxin.baidu.com/)
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- QQ discussion group: 760439550 (ERNIE discussion group).
- QQ discussion group: 958422639 (ERNIE discussion group-v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.