
# PaddlePALM
PaddlePALM (PArallel Learning from Multi-tasks) is a flexible, general and easy-to-use framework for large-scale NLP pretraining and multi-task learning. PALM is a high-level framework aiming at **fast** development of **high-performance** NLP models.
PaddlePALM also provides state-of-the-art general-purpose architectures (BERT, ERNIE, RoBERTa, ...) as built-in model backbones. We have decoupled the model backbone, dataset reader and task output layers, so you can easily replace any of these components with other candidates with quite minor changes to your code. In addition, PaddlePALM supports customized development of any component, e.g., backbone, task head, reader and optimizer, which gives developers high flexibility to adapt to complicated NLP scenarios.
With PaddlePALM, it is easy to efficiently explore robust learning of reading comprehension models with multiple auxiliary tasks, and the produced model, [D-Net](), achieved **the 1st place** in the [EMNLP2019 MRQA](https://mrqa.github.io) track.
<p align="center">
<img src="https://tva1.sinaimg.cn/large/006tNbRwly1gbjkuuwrmlj30hs0hzdh2.jpg" alt="Sample" width="300" height="333">
<p align="center">
<em>MRQA2019 Leaderboard</em>
</p>
</p>
Beyond the research scope, PaddlePALM has been applied in **Baidu Search Engine** to achieve more accurate user query understanding and answer mining, which demonstrates the high reliability and performance of PaddlePALM.
#### Features:
- **Easy-to-use:** with PALM, a typical NLP task can be achieved in *8 steps*. Moreover, the model backbone, dataset reader and task output layers have been decoupled, which allows the replacement of any component with other candidates with quite minor changes to your code.
- **Multi-task Learning friendly:** *6 steps* to achieve multi-task learning for prepared tasks.
- **Large-scale and Pre-training friendly:** automatically utilizes multiple GPUs (if available) to accelerate training and inference. Only minor code changes are required for distributed training on clusters.
- **Popular NLP Backbones and Pre-trained models:** multiple state-of-the-art general-purpose model architectures and pretrained models (e.g., BERT, ERNIE, RoBERTa, ...) are built in.
- **Easy to Customize:** supports customized development of any component (e.g., backbone, task head, reader and optimizer) with reuse of pre-defined ones, which gives developers high flexibility and efficiency to adapt to diverse NLP scenarios.
You can easily reproduce the following competitive results with minimal code, covering most NLP tasks such as classification, matching, sequence labeling, reading comprehension, dialogue understanding and so on. More details can be found in `examples`.
<table>
<tbody>
</table>
## Package Overview
| module | illustration |
| ------ | ------------ |
## Installation
PaddlePALM supports python2 and python3, Linux and Windows, and CPU and GPU. The preferred way to install PaddlePALM is via `pip`. Just run the following command in your shell:
```bash
pip install paddlepalm
```

Alternatively, install from source:

```bash
git clone https://github.com/PaddlePaddle/PALM.git
cd PALM && python setup.py install
```
- Python >= 2.7
- CUDA >= 9.0
- cuDNN >= 7.0
- PaddlePaddle >= 1.7.0 (see the [installation guide](http://www.paddlepaddle.org/#quick-start))
### Downloading pretrain models
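For instance, a minimal sketch of listing and fetching pretrained models, assuming a `paddlepalm.downloader` utility with `ls` and `download` functions (the item name and target path below are illustrative):

```python
from paddlepalm import downloader

downloader.ls('pretrain')  # prints the available pretrain items (model names)
downloader.download('pretrain', 'ERNIE-v1-zh-base', './pretrain_models')
```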
7. fit the prepared reader and data (from step 1) to the trainer with the `trainer.fit_reader` method.
8. load a pretrained model with `trainer.load_pretrain`, or a checkpoint with `trainer.load_ckpt`, or neither to train from scratch, then start training with `trainer.train`.
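Putting the whole 8-step workflow together for a classification task, a minimal sketch follows. Steps 1~6 are elided in this excerpt, so their constructor names and arguments below are assumptions modeled on the API style of this README (only `fit_reader`, `load_pretrain`, `load_ckpt` and `train` are named above); paths and hyperparameters are illustrative.

```python
import json
import paddlepalm as palm

# --- Steps 1-6 (illustrative assumptions): reader, backbone, head, trainer ---
max_seqlen, batch_size, num_classes, lr = 128, 32, 2, 5e-5
reader = palm.reader.ClassifyReader('./pretrain/ernie-zh-base/vocab.txt', max_seqlen)
reader.load_data('./data/train.tsv', batch_size)

config = json.load(open('./pretrain/ernie-zh-base/ernie_config.json'))
ernie = palm.backbone.ERNIE.from_config(config)
reader.register_with(ernie)

cls_head = palm.head.Classify(num_classes, input_dim=config['hidden_size'])

trainer = palm.Trainer('senti_cls')
loss_var = trainer.build_forward(ernie, cls_head)
trainer.build_backward(optimizer=palm.optimizer.Adam(loss_var, lr))

# --- Step 7: fit the prepared reader and data to the trainer ---
trainer.fit_reader(reader)

# --- Step 8: load a pretrained model (or a checkpoint, or neither), then train ---
trainer.load_pretrain('./pretrain/ernie-zh-base/params')
trainer.train(print_steps=10)
```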
For more implementation details, see the following demos:
- [Sentiment Classification]()
- [Quora Question Pairs matching]()
- [Tagging]()
- [SQuAD machine Reading Comprehension]()
### set saver
To save models/checkpoints and logs during training, just call the `trainer.set_saver` method. More implementation details can be found [here]().
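For instance, a one-line sketch (the `save_path` and `save_steps` parameter names are assumptions for illustration):

```python
# Save a checkpoint under ./outputs every 1000 training steps (illustrative values).
trainer.set_saver(save_path='./outputs/ckpt', save_steps=1000)
```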
### do prediction
To do prediction/evaluation after a training stage, just create another three instances of reader, backbone and head with `phase='predict'` (repeating steps 1~4 above). Then run prediction with the trainer's `predict` method (no need to create another trainer). More implementation details can be found [here]().
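A brief sketch, continuing from the quick-start sketch above (so `palm`, `config`, `max_seqlen`, `batch_size`, `num_classes` and `trainer` are already defined); `phase='predict'` and the reuse of the trainer follow this README, while `build_predict_forward` and the constructor arguments are assumptions:

```python
# Steps 1~4 again, this time in predict mode.
predict_reader = palm.reader.ClassifyReader('./pretrain/ernie-zh-base/vocab.txt',
                                            max_seqlen, phase='predict')
predict_reader.load_data('./data/test.tsv', batch_size)
pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
predict_reader.register_with(pred_ernie)
pred_head = palm.head.Classify(num_classes, input_dim=config['hidden_size'],
                               phase='predict')

# Reuse the trainer built during training; no second trainer is needed.
trainer.build_predict_forward(pred_ernie, pred_head)
trainer.fit_reader(predict_reader, phase='predict')
trainer.predict(print_steps=20)
```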
### multi-task learning
To run in multi-task learning mode:
1. repeatedly create components (i.e., reader, backbone and head) for each task, following steps 1~5 above.
The save/load and predict operations of a multi_head_trainer are the same as those of a trainer.
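As an illustration only, here is a hedged sketch of wiring two task trainers into a multi-head trainer, continuing the naming from the quick-start sketch above. The `MultiHeadTrainer` constructor is an assumption inferred from the `multi_head_trainer` referenced below, not a verified API listing:

```python
# One trainer per task, each wrapping its own reader/backbone/head
# (component creation elided; see steps 1~5 above).
trainer_intent = palm.Trainer('intent_recognition')
trainer_slot = palm.Trainer('slot_filling')

# Wrap the per-task trainers; during training the multi-head trainer
# alternates among tasks, while save/load/predict behave like a trainer.
mh_trainer = palm.MultiHeadTrainer([trainer_intent, trainer_slot])
# ... build the forward/backward graphs and fit the readers as in the
# single-task flow, then train:
mh_trainer.train(print_steps=10)
```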
For more implementation details with `multi_head_trainer`, see
- [ATIS: joint training of dialogue intent recognition and slot filling]()
- [MRQA: learning reading comprehension with a masked language model as auxiliary task]()
## License
## Example 1: Classification
This task is a sentiment analysis task. The following sections detail model preparation, dataset preparation, and how to run the task.
### Step 1: Prepare Pre-trained Model & Dataset
#### Pre-trained Model
This example uses [ernie-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api) as the pre-trained model.
Make sure you have downloaded the required pre-trained model to the current folder.
#### Dataset
This example uses [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all), a Chinese sentiment analysis dataset.
Download dataset:
```shell
python download.py
```
If everything goes well, there will be a folder named `data/` created with all the data files in it.

The dataset file (for training) should have 2 fields, `text_a` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:
```
label text_a
0 当当网名不符实,订货多日不见送货,询问客服只会推托,只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。
1 <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!
```
## Example 2: Matching
This task is a sentence pair matching task. The following sections detail model preparation, dataset preparation, and how to run the task with PaddlePALM.
### Step 1: Prepare Pre-trained Model & Dataset
#### Download Pre-trained Model
This example uses [ernie-en-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api) as the pre-trained model.
Make sure you have downloaded the required pre-trained model to the current folder.
#### Dataset
This example takes the [Quora Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset as the testbed for matching.
Download dataset:
```shell
python download.py
python process.py data/quora_duplicate_questions.tsv data/train.tsv data/test.tsv
```
If everything goes well, there will be a folder named `data/` created with all the converted data files in it.

The dataset file (for training) should have 3 fields, `text_a`, `text_b` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:
```
text_a text_b label
...
```
## Example 5: Prediction
This example demonstrates how to directly run prediction with PaddlePALM. You can initialize the model from a checkpoint, from a pre-trained model, or with random initialization. Here we reuse the task and data from example 1, so repeat step 1 of example 1 to prepare the pre-trained model and data.
### Step 1: Prepare Pre-trained Model & Dataset
#### Pre-trained Model
This example uses [ernie-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api) as the pre-trained model.
Make sure you have downloaded the required pre-trained model to the current folder.
#### Dataset
This task uses the `chnsenticorp` dataset.
Download dataset:
```shell
python download.py
```
If everything goes well, there will be a folder named `data/` created with all the data files in it.

The dataset file should have 2 fields, `label` and `text_a`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:
```
label text_a
0 当当网名不符实,订货多日不见送货,询问客服只会推托,只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。
0 XP的驱动不好找!我的17号提的货,现在就降价了100元,而且还送杀毒软件!
1 <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!
```
### Step 2: Predict
The code used to perform this task is in `run.py`. After you have prepared the pre-trained model and the dataset required for the task, run:
```shell
python run.py
```
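Inside `run.py`, the model can be initialized in any of the three ways mentioned above before prediction. A minimal sketch (paths are illustrative; `load_ckpt` and `load_pretrain` are the trainer methods named earlier in this README):

```python
# Pick exactly one of the following before calling trainer.predict(...):
trainer.load_ckpt('./outputs/ckpt')                       # 1) from a checkpoint
trainer.load_pretrain('./pretrain/ernie-zh-base/params')  # 2) from a pre-trained model
# 3) or neither, keeping randomly initialized parameters
```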