English|[简体中文](./README.zh.md)

`Note`: the *ERNIE-Gram* model has been officially released [here](#3-download-pretrained-models-optional). Our reproduction code will be released to the [repro branch](https://github.com/PaddlePaddle/ERNIE/tree/repro) soon.


## _ERNIE-Gram_: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding

![ERNIE-Gram](.meta/ernie-gram.jpeg)

- [Framework](#ernie-gram-framework)
- [Quick Tour](#quick-tour)
- [Setup](#setup)
    * [Install PaddlePaddle](#1-install-paddlepaddle)
    * [Install ERNIE Kit](#2-install-ernie-kit)
    * [Download pre-trained models](#3-download-pretrained-models-optional)
    * [Download datasets](#4-download-datasets)
- [Fine-tuning](#fine-tuning)
- [Citation](#citation)

### ERNIE-Gram Framework

Since **ERNIE 1.0**, Baidu researchers have introduced **knowledge-enhanced representation learning** into pre-training, masking consecutive words, phrases, named entities, and other semantic knowledge units so that the model learns better representations. Furthermore, we propose **ERNIE-Gram**, an explicitly n-gram masked language model that enhances the integration of coarse-grained information into pre-training. In **ERNIE-Gram**, **n-grams** are masked and predicted directly using **explicit** n-gram identities rather than contiguous sequences of tokens.
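
To make the contrast concrete, here is a toy sketch of explicit n-gram masking (plain Python with a hypothetical two-entry n-gram table; an illustration only, not the actual pre-training code):

```python
# Toy example: mask a whole n-gram with ONE symbol and predict ONE identity,
# instead of masking n tokens and predicting them one by one (BERT-style).
tokens = ['the', 'new', 'york', 'times', 'reported', 'it']
ngram_vocab = {('new', 'york', 'times'): 0, ('reported',): 1}  # hypothetical table

def mask_ngram(tokens, ngram, mask_symbol='[MASK]'):
    n = len(ngram)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == ngram:
            masked = tokens[:i] + [mask_symbol] + tokens[i + n:]
            return masked, ngram_vocab[ngram]  # one target for the whole n-gram
    raise ValueError('n-gram not found in the sentence')

masked, target = mask_ngram(tokens, ('new', 'york', 'times'))
print(masked)  # ['the', '[MASK]', 'reported', 'it']
print(target)  # 0 -- the explicit identity of the masked trigram
```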

In downstream tasks, **ERNIE-Gram** uses a `BERT-style` fine-tuning approach, thus keeping the same parameter size and computational complexity.

We pre-train **ERNIE-Gram** on `English` and `Chinese` text corpora and fine-tune it on `19` downstream tasks. Experimental results show that **ERNIE-Gram** outperforms previous pre-training models like *XLNet* and *RoBERTa* by a large margin, and achieves results comparable to state-of-the-art methods.

The **ERNIE-Gram** paper has been accepted to **NAACL-HLT 2021**; for more details, please see the [paper](https://arxiv.org/abs/2010.12148).

### Quick Tour

A minimal sketch (the `ernie-gram` model name follows the abbreviation table in [Setup](#3-download-pretrained-models-optional)):

```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

tokenizer = ErnieTokenizer.from_pretrained('ernie-gram')
model = ErnieModel.from_pretrained('ernie-gram')  # downloads the model on first use
model.eval()

ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))    # add a `batch` dimension
pooled, encoded = model(ids)                 # eager execution
print(pooled.numpy())                        # convert results to numpy
```

### Setup

##### 1. Install PaddlePaddle

This repo requires PaddlePaddle 2.0.0+; please see [here](https://www.paddlepaddle.org.cn/install/quick) for installation instructions.
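
A quick way to verify the installation (a minimal check using PaddlePaddle's built-in self-test):

```python
import paddle

paddle.utils.run_check()    # runs PaddlePaddle's built-in sanity check
print(paddle.__version__)   # should print 2.0.0 or later
```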

##### 2. Install ERNIE Kit

```shell
git clone https://github.com/PaddlePaddle/ERNIE.git --depth 1
cd ERNIE
pip install -r requirements.txt
pip install -e .
```
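
A minimal smoke test that the ERNIE kit is importable after installation:

```python
# If these imports succeed, the ERNIE kit and its dependencies are in place.
from ernie.modeling_ernie import ErnieModel
from ernie.tokenizing_ernie import ErnieTokenizer
```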

##### 3. Download pretrained models (optional)

| Model                                              | Description                                                  | Abbreviation |
| :------------------------------------------------- | :----------------------------------------------------------- |:-----------|
| [ERNIE-Gram Base for Chinese](https://ernie-github.cdn.bcebos.com/model-ernie-gram-zh.1.tar.gz) | Layer:12, Hidden:768, Heads:12 | ernie-gram|
| [ERNIE-Gram Base for English](https://ernie-github.cdn.bcebos.com/model-ernie-gram-en.1.tar.gz) | Layer:12, Hidden:768, Heads:12 | ernie-gram-en |
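`from_pretrained` also accepts a local directory, so a manually downloaded archive can be used offline (a sketch; the extraction path below is hypothetical):

```python
from ernie.modeling_ernie import ErnieModel
from ernie.tokenizing_ernie import ErnieTokenizer

# e.g. after extracting model-ernie-gram-zh.1.tar.gz to ./ernie-gram-zh
model = ErnieModel.from_pretrained('./ernie-gram-zh')
tokenizer = ErnieTokenizer.from_pretrained('./ernie-gram-zh')
```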

##### 4. Download datasets

**English Datasets**

Download the [GLUE datasets](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e).

The `--data_dir` option used in the following section assumes a directory tree like this:

```shell
data/xnli
├── dev
│   └── 1
├── test
│   └── 1
└── train
    └── 1
```

See the [demo](https://ernie-github.cdn.bcebos.com/data-mnli-m.tar.gz) data for the MNLI task.

### Fine-tuning

Try eager execution with the `dygraph` demo models (a schematic sketch of the underlying fine-tuning loop follows the list):

  - [Natural Language Inference](./demo/finetune_classifier_distributed.py)
  - [Sentiment Analysis](./demo/finetune_sentiment_analysis.py)
  - [Semantic Similarity](./demo/finetune_classifier.py)
  - [Named Entity Recognition (NER)](./demo/finetune_ner.py)
  - [Machine Reading Comprehension](./demo/finetune_mrc.py)
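
The demo scripts above wrap data loading and, for some tasks, distributed training; the core loop they build on looks roughly like this (a schematic sketch with toy data, not the demo code itself; the `num_labels` argument and `labels` keyword follow the ERNIE kit's classification API):

```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification

# Toy two-class data; the demo scripts read real datasets from --data_dir.
data = [('what a great movie', 1), ('a total waste of time', 0)]

tokenizer = ErnieTokenizer.from_pretrained('ernie-gram')
model = ErnieModelForSequenceClassification.from_pretrained('ernie-gram', num_labels=2)
opt = P.optimizer.Adam(learning_rate=1e-5, parameters=model.parameters())

for text, label in data:
    ids, _ = tokenizer.encode(text)
    ids = P.to_tensor(np.expand_dims(ids, 0))              # [1, seq_len]
    label = P.to_tensor(np.array([label], dtype='int64'))  # [1]
    loss, logits = model(ids, labels=label)                # loss and class logits
    loss.backward()
    opt.step()
    opt.clear_grad()
```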


**Recommended hyperparameters:**

 - See the **ERNIE-Gram** paper, [Appendix B.1-B.4](https://arxiv.org/abs/2010.12148)

To fully reproduce the paper results, please check out the `repro` branch of this repo.

### Citation

```
@article{xiao2020ernie,
  title={ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding},
  author={Xiao, Dongling and Li, Yu-Kun and Zhang, Han and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2010.12148},
  year={2020}
}
```

### Communication

- [ERNIE homepage](https://wenxin.baidu.com/)
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- QQ discussion group: 760439550 (ERNIE discussion group).
- QQ discussion group: 958422639 (ERNIE discussion group-v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.