
Merge pull request #803 from webYFDT/ernie-kit-open-v1.0

Ernie kit open v1.0
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
English|[简体中文](./README.zh.md)
![./.metas/ERNIE_milestone.png](./.metas/ERNIE_milestone_20210519_en.png)
**Reminder: This repo has been refactored; for paper reproduction or backward compatibility, please check out the [repro branch](https://github.com/PaddlePaddle/ERNIE/tree/repro)**
ERNIE 2.0 is a continual pre-training framework for language understanding in which pre-training tasks can be incrementally built and learned through multi-task learning.
ERNIE 2.0 provides a strong foundation for nearly every NLP task: text classification, ranking, NER, machine reading comprehension, text generation, and so on.
[\[more information\]](https://wenxin.baidu.com/)
# News
- Dec.03.2021:
    - [`ERNIE-M`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-m) models are **available** now!
- May.20.2021:
    - [`ERNIE-Doc`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-doc), [`ERNIE-Gram`](./ernie-gram/), and [`ERNIE-ViL`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil) models are **available** now!
    - `ERNIE-UNIMO` has been released [here](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-unimo).
- Dec.29.2020:
    - Pre-train and fine-tune ERNIE with [PaddlePaddle v2.0](https://github.com/PaddlePaddle/Paddle/tree/release/2.0-rc).
    - New AMP (automatic mixed precision) feature for every demo in this repo.
    - Introduced `gradient accumulation`: run `ERNIE-large` with only 8 GB of GPU memory.
- Sept.24.2020:
    - We have announced [`ERNIE-ViL`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil)!
        - **Knowledge-enhanced** joint representations for vision-language tasks.
        - Constructs three **Scene Graph Prediction** tasks using structured knowledge.
        - State-of-the-art performance on 5 downstream tasks; 1st place on the [VCR leaderboard](https://visualcommonsense.com/leaderboard/).
- May.20.2020:
    - Try ERNIE in `dygraph` mode, with:
        - Eager execution with `paddle.fluid.dygraph`.
        - Distributed training.
        - Easy deployment.
        - NLP tutorials on AIStudio.
        - Backward compatibility for old-style checkpoints.
    - [`ERNIE-GEN`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen) is **available** now!
        - The **state-of-the-art** pre-trained model for generation tasks, accepted by `IJCAI-2020`.
        - A novel **span-by-span generation pre-training task**.
        - An **infilling generation** mechanism and a **noise-aware generation** method.
        - Implemented with a carefully designed **Multi-Flow Attention** architecture.
        - You can `download` all models, including `base/large/large-430G`.
- Apr.30.2020: Released [ERNIESage](https://github.com/PaddlePaddle/PGL/tree/master/examples/erniesage), a novel graph neural network model using ERNIE as its aggregator, implemented with [PGL](https://github.com/PaddlePaddle/PGL).
- Mar.27.2020: [Champion on 5 SemEval2020 subtasks](https://www.jiqizhixin.com/articles/2020-03-27-8)
- Dec.26.2019: [1st place on the GLUE leaderboard](https://www.technologyreview.com/2019/12/26/131372/ai-baidu-ernie-google-bert-natural-language-glue/)
- Nov.6.2019: [Introducing ERNIE-tiny](https://www.jiqizhixin.com/articles/2019-11-06-9)
- Jul.7.2019: [Introducing ERNIE 2.0](https://www.jiqizhixin.com/articles/2019-07-31-10)
- Mar.16.2019: [Introducing ERNIE 1.0](https://www.jiqizhixin.com/articles/2019-03-16-3)
# Table of contents
* [Tutorials](#tutorials)
* [Setup](#setup)
* [Fine-tuning](#fine-tuning)
* [Pre-training with ERNIE 1.0](#pre-training-with-ernie-10)
* [Online inference](#online-inference)
* [Distillation](#distillation)
# Quick Tour
```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel
model = ErnieModel.from_pretrained('ernie-1.0') # Try to get pretrained model from server, make sure you have network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0)) # insert extra `batch` dimension
pooled, encoded = model(ids) # eager execution
print(pooled.numpy()) # convert results to numpy
```
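As a quick sanity check on the two outputs (a hedged note: the shapes below assume the `ernie-1.0` base model, whose hidden size is 768; `pooled` is the sentence-level vector and `encoded` holds the token-level vectors):

```python
print(pooled.shape)   # [1, 768]: sentence-level representation
print(encoded.shape)  # [1, seq_len, 768]: one vector per input token
```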
# Tutorials
Don't have a GPU? Try ERNIE on [AIStudio](https://aistudio.baidu.com/aistudio/index)!
(please choose the latest version of each tutorial and apply for a GPU environment)
1. [ERNIE for beginners](https://aistudio.baidu.com/studio/edu/group/quick/join/314947)
1. [Sentiment analysis](https://aistudio.baidu.com/aistudio/projectdetail/427482)
2. [Cloze test](https://aistudio.baidu.com/aistudio/projectdetail/433491)
3. [Knowledge distillation](https://aistudio.baidu.com/aistudio/projectdetail/439460)
4. [Ask ERNIE](https://aistudio.baidu.com/aistudio/projectdetail/456443)
5. [Loading old-style checkpoints](https://aistudio.baidu.com/aistudio/projectdetail/493415)
# Setup
##### 1. install PaddlePaddle
This repo requires PaddlePaddle 1.7.0+; see [here](https://www.paddlepaddle.org.cn/install/quick) for installation instructions.
##### 2. install ernie
```script
pip install paddle-ernie
```
or
```shell
git clone https://github.com/PaddlePaddle/ERNIE.git --depth 1
cd ERNIE
pip install -r requirements.txt
pip install -e .
```
##### 3. download pretrained models (optional)
| Model | Description |abbreviation|
| :------------------------------------------------- | :----------------------------------------------------------- |:-----------|
| [ERNIE 1.0 Base for Chinese](https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz) | L12H768A12 |ernie-1.0|
| [ERNIE Tiny](https://ernie-github.cdn.bcebos.com/model-ernie_tiny.1.tar.gz) | L3H1024A16 |ernie-tiny|
| [ERNIE 2.0 Base for English](https://ernie-github.cdn.bcebos.com/model-ernie2.0-en.1.tar.gz) | L12H768A12 |ernie-2.0-en|
| [ERNIE 2.0 Large for English](https://ernie-github.cdn.bcebos.com/model-ernie2.0-large-en.1.tar.gz) | L24H1024A16 |ernie-2.0-large-en|
| [ERNIE Gen base for English](https://ernie-github.cdn.bcebos.com/model-ernie-gen-base-en.1.tar.gz) | L12H768A12 |ernie-gen-base-en|
| [ERNIE Gen Large for English](https://ernie-github.cdn.bcebos.com/model-ernie-gen-large-en.1.tar.gz)| L24H1024A16 | ernie-gen-large-en |
| [ERNIE Gen Large 430G for English](https://ernie-github.cdn.bcebos.com/model-ernie-gen-large-430g-en.1.tar.gz)| Layer:24, Hidden:1024, Heads:16 + 430G pretrain corpus | ernie-gen-large-430g-en |
| [ERNIE Doc Base for Chinese](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz)| L12H768A12 | ernie-doc-base-zh |
| [ERNIE Doc Base for English](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz)| L12H768A12 | ernie-doc-base-en |
| [ERNIE Doc Large for English](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz)| L24H1024A16 | ernie-doc-large-en |
| [ERNIE Gram Base for Chinese](https://ernie-github.cdn.bcebos.com/model-ernie-gram-zh.1.tar.gz) | L12H768A12 | ernie-gram-zh |
| [ERNIE Gram Base for English](https://ernie-github.cdn.bcebos.com/model-ernie-gram-en.1.tar.gz) | L12H768A12 | ernie-gram-en |
##### 4. download datasets
**English Datasets**
Download the [GLUE datasets](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
The `--data_dir` option in the following sections assumes a directory tree like this:
```shell
data/xnli
├── dev
│   └── 1
├── test
│   └── 1
└── train
└── 1
```
See the [demo](https://ernie-github.cdn.bcebos.com/data-mnli-m.tar.gz) data for the MNLI task.
**Chinese Datasets**
| Datasets|Description|
|:--------|:----------|
| [XNLI](https://ernie-github.cdn.bcebos.com/data-xnli.tar.gz) |XNLI is a natural language inference dataset in 15 languages. It was jointly built by Facebook and New York University. We use Chinese data of XNLI to evaluate language understanding ability of our model. [url](https://github.com/facebookresearch/XNLI)|
| [ChnSentiCorp](https://ernie-github.cdn.bcebos.com/data-chnsenticorp.tar.gz) |ChnSentiCorp is a sentiment analysis dataset consisting of reviews on online shopping of hotels, notebooks and books.|
| [MSRA-NER](https://ernie-github.cdn.bcebos.com/data-msra_ner.tar.gz) |MSRA-NER (SIGHAN2006) dataset is released by MSRA for recognizing the names of people, locations and organizations in text.|
| [NLPCC2016-DBQA](https://ernie-github.cdn.bcebos.com/data-dbqa.tar.gz) |NLPCC2016-DBQA is a sub-task of the NLPCC-ICCPOL 2016 Shared Task, hosted by NLPCC (Natural Language Processing and Chinese Computing). The task is to select documents from a candidate set that answer the given questions. [url](http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf)|
|[CMRC2018](https://ernie-github.cdn.bcebos.com/data-cmrc2018.tar.gz)|CMRC2018 is an evaluation of Chinese extractive reading comprehension hosted by the Chinese Information Processing Society of China (CIPS-CL). [url](https://github.com/ymcui/cmrc2018)|
# Fine-tuning
- Try eager execution with the `dygraph` model:
```script
python3 ./demo/finetune_classifier.py \
--from_pretrained ernie-1.0 \
--data_dir ./data/xnli
```
- Specify `--use_amp` to activate AMP training.
- `--bsz` denotes the global batch size for one optimization step; `--micro_bsz` denotes the maximum batch size for each GPU device.
If `--micro_bsz < --bsz`, gradient accumulation is activated automatically; e.g. on a single GPU, `--bsz 64 --micro_bsz 16` accumulates gradients over 4 micro-batches per optimization step.
- Distributed fine-tuning
`paddle.distributed.launch` is a process manager; we use it to launch a Python process on every available GPU device.
In distributed training, `max_steps` is used as the stopping criterion rather than `epoch`, to prevent deadlock.
You can calculate `max_steps` as `EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH`; a rough worked example follows.
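Assuming 3 epochs over the roughly 393k-example MNLI training set with a total batch size of 256 (illustrative numbers only):

```python
EPOCH, NUM_TRAIN_EXAMPLES, TOTAL_BATCH = 3, 393_000, 256
max_steps = EPOCH * NUM_TRAIN_EXAMPLES // TOTAL_BATCH  # = 4605
```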
Also note that we shard the training data by device ID to prevent overfitting.
Demo:
(make sure you have more than 2 GPUs;
online model download does not work under `paddle.distributed.launch`,
so run single-card fine-tuning first to fetch the pretrained model, or download and extract one manually from [here](#section-pretrained-models)):
```script
python3 -m paddle.distributed.launch \
./demo/finetune_classifier_distributed.py \
--data_dir data/mnli \
--max_steps 10000 \
--from_pretrained ernie-2.0-en
```
Many other demo scripts:
1. [Sentiment Analysis](./demo/finetune_sentiment_analysis.py)
1. [Semantic Similarity](./demo/finetune_classifier.py)
1. [Named Entity Recognition (NER)](./demo/finetune_ner.py)
1. [Machine Reading Comprehension](./demo/finetune_mrc.py)
1. [Text generation](./demo/seq2seq/README.md)
1. [Text classification with `paddle.static` API](./demo/finetune_classifier_static.py)
**Recommended hyperparameters:**
|tasks|batch size|learning rate|
|--|--|--|
| CoLA | 32 / 64 (base) | 3e-5 |
| SST-2 | 64 / 256 (base) | 2e-5 |
| STS-B | 128 | 5e-5 |
| QQP | 256 | 3e-5(base)/5e-5(large) |
| MNLI | 256 / 512 (base)| 3e-5 |
| QNLI | 256 | 2e-5 |
| RTE | 16 / 4 (base) | 2e-5(base)/3e-5(large) |
| MRPC | 16 / 32 (base) | 3e-5 |
| WNLI | 8 | 2e-5 |
| XNLI | 512 | 1e-4(base)/4e-5(large) |
| CMRC2018 | 64 | 3e-5 |
| DRCD | 64 | 5e-5(base)/3e-5(large) |
| MSRA-NER(SIGHAN2006) | 16 | 5e-5(base)/1e-5(large) |
| ChnSentiCorp | 24 | 5e-5(base)/1e-5(large) |
| LCQMC | 32 | 2e-5(base)/5e-6(large) |
| NLPCC2016-DBQA| 64 | 2e-5(base)/1e-5(large) |
| VCR | 64 | 2e-5(base)/2e-5(large) |
# Pre-training with ERNIE 1.0
See [here](./demo/pretrain/README.md).
# Online inference
If `--inference_model_dir` is passed to `finetune_classifier_dygraph.py`,
a deployable model will be generated at the end of fine-tuning, ready to serve.
For details about online inference, see the [C++ inference API](./inference/README.md),
or start a multi-GPU inference server with a few lines of code:
```shell
python -m propeller.tools.start_server -m /path/to/saved/inference_model -p 8881
```
and call the server just like calling local function (python3 only):
```python
import numpy as np
from propeller.service.client import InferenceClient
from ernie.tokenizing_ernie import ErnieTokenizer
client = InferenceClient('tcp://localhost:8881')
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, sids = tokenizer.encode('hello world')
ids = np.expand_dims(ids, 0)
sids = np.expand_dims(sids, 0)
result = client(ids, sids)
```
A pre-made `inference model` for ernie-1.0 can be downloaded [here](https://ernie.bj.bcebos.com/ernie1.0_zh_inference_model.tar.gz).
It can be used for feature-based fine-tuning or feature extraction.
# Distillation
Knowledge distillation is a good way to compress and accelerate ERNIE.
For details about distillation, see [here](./demo/distill/README.md)
# Citation
### ERNIE 1.0
```
@article{sun2019ernie,
title={Ernie: Enhanced representation through knowledge integration},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Chen, Xuyi and Zhang, Han and Tian, Xin and Zhu, Danxiang and Tian, Hao and Wu, Hua},
journal={arXiv preprint arXiv:1904.09223},
year={2019}
}
```
### ERNIE 2.0
```
@article{sun2019ernie20,
title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:1907.12412},
year={2019}
}
```
### ERNIE-GEN
```
@article{xiao2020ernie-gen,
title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation},
author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2001.11314},
year={2020}
}
```
### ERNIE-ViL
```
@article{yu2020ernie,
title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2006.16934},
year={2020}
}
```
### ERNIE-Gram
```
@article{xiao2020ernie,
title={ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding},
author={Xiao, Dongling and Li, Yu-Kun and Zhang, Han and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2010.12148},
year={2020}
}
```
### ERNIE-Doc
```
@article{ding2020ernie,
title={ERNIE-DOC: The Retrospective Long-Document Modeling Transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```
### ERNIE-UNIMO
```
@article{li2020unimo,
title={UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning},
author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15409},
year={2020}
}
```
### ERNIE-M
```
@article{ouyang2020ernie,
title={Ernie-m: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora},
author={Ouyang, Xuan and Wang, Shuohuan and Pang, Chao and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15674},
year={2020}
}
```
For full reproduction of paper results, please check out the `repro` branch of this repo.
### Communication
- [ERNIE homepage](https://wenxin.baidu.com/)
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- QQ discussion group: 760439550 (ERNIE discussion group).
- QQ discussion group: 958422639 (ERNIE discussion group-v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
README.zh.md
# ![ERNIE_milestone_20210519_zh](./ERNIE_milestone_20210519_zh.png)
ERNIE is Baidu's knowledge-enhanced continual-learning framework for semantic understanding. It combines large-scale pre-training data with rich multi-source knowledge and, through continual learning, keeps absorbing lexical, structural, and semantic knowledge from massive text corpora, so the model keeps improving. ERNIE has achieved SOTA results on more than 40 typical NLP tasks and won over ten first places in authoritative international benchmarks such as GLUE, VCR, XTREME, and SemEval. In 2020, ERNIE received the Outstanding Science and Technology Achievement Award of the Chinese Association for Artificial Intelligence and the SAIL Award, the top honor of the World Artificial Intelligence Conference; the technology was covered on the official website of MIT Technology Review, and the related innovations have been published at top academic venues including AAAI, ACL, NAACL, and IJCAI. ERNIE is widely deployed in industry, e.g. search engines, news recommendation, advertising systems, voice interaction, and intelligent customer service.
Reminder: the legacy ERNIE code has moved to the repro branch. We recommend developing with the newly upgraded ERNIE kit, which unifies dynamic and static graphs. You are also welcome to try [EasyDL](https://ai.baidu.com/easydl/pro) and [BML](https://ai.baidu.com/bml/app/overview) for richer features.
[Learn more](https://wenxin.baidu.com/)
# Open-Source Roadmap
- 2022.5.20:
    - Newly open-sourced ERNIE 3.0 series of pre-trained models:
        - ERNIE 3.0 Base, a 110M-parameter general-purpose model
        - ERNIE 3.0 XBase, a 250M-parameter heavyweight general-purpose model
        - ERNIE 3.0 Medium, a 24M-parameter lightweight general-purpose model
    - New speech-semantics model ERNIE-SAT (link to be added)
    - New ERNIE-Gen (Chinese) pre-trained model, supporting the mainstream generation tasks: summarization, question generation, dialogue, and question answering
    - The Wenxin ERNIE development kit with unified dynamic/static graphs: built on PaddlePaddle's dynamic-graph capability, it supports dynamic-graph training of ERNIE models; change a single configuration parameter before training starts to switch between dynamic and static training.
        - Wraps the NLP development workflow, from text preprocessing and pre-trained models to network construction, model evaluation, and deployment, behind a uniform interface.
        - Supports common NLP tasks: text classification, text matching, sequence labeling, information extraction, text generation, data distillation, etc.
        - Provides data preprocessing tools for data cleaning, data augmentation, tokenization, format conversion, case conversion, and more.
- 2021.12.3:
    - The multilingual pre-trained model `ERNIE-M` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-m)
- 2021.5.20:
    - Four newly open-sourced ERNIE pre-trained models:
        - The multi-granularity linguistic knowledge model `ERNIE-Gram` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/blob/develop/ernie-gram)
        - The long-text bidirectional modeling model `ERNIE-Doc` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-doc)
        - The scene-graph-enhanced cross-modal model `ERNIE-ViL` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil)
        - The unified language-vision model `ERNIE-UNIMO` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-unimo)
- 2020.9.24:
    - `ERNIE-ViL` released! ([link](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil))
        - A knowledge-enhanced pre-training framework for vision-language, the first to introduce structured knowledge into vision-language pre-training.
        - Uses scene-graph knowledge to construct object, attribute, and relation prediction tasks that capture fine-grained cross-modal semantic alignment.
        - Best results on five vision-language downstream tasks; 1st place on the [VCR leaderboard](https://visualcommonsense.com/).
- 2020.5.20:
    - `ERNIE-GEN` officially open-sourced! ([link](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen))
        - The strongest pre-trained model for text generation, with the work accepted by `IJCAI-2020`.
        - Extends ERNIE pre-training to text generation for the first time, with best results on several typical tasks.
        - All models reported in the paper are available for download (including [base/large/large-430G](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen/README.zh.md#预训练模型)).
        - First to add a span-by-span generation task during pre-training, so the model generates a semantically complete span at each step.
        - Proposes an infilling generation mechanism and a noise-aware mechanism to mitigate exposure bias.
        - Implemented with a carefully designed Multi-Flow Attention architecture.
- 2020.4.30: Released [ERNIESage](https://github.com/PaddlePaddle/PGL/tree/master/examples/erniesage), a novel graph neural network model using ERNIE as its aggregator, implemented with [PGL](https://github.com/PaddlePaddle/PGL).
- 2020.3.27: [Champion on 5 SemEval2020 subtasks](https://www.jiqizhixin.com/articles/2020-03-27-8)
- 2019.12.26: [1st place on the GLUE leaderboard](https://www.technologyreview.com/2019/12/26/131372/ai-baidu-ernie-google-bert-natural-language-glue/)
- 2019.11.6: Released [ERNIE Tiny](https://www.jiqizhixin.com/articles/2019-11-06-9)
- 2019.7.7: Released [ERNIE 2.0](https://www.jiqizhixin.com/articles/2019-07-31-10)
- 2019.3.16: Released [ERNIE 1.0](https://www.jiqizhixin.com/articles/2019-03-16-3)
# Installation
1. Install the environment dependencies: [environment setup](./readme_env.md)
2. Install the ERNIE kit
```plain
git clone https://github.com/PaddlePaddle/ERNIE.git
```
# Quick start: training with Wenxin ERNIE large models
- Using ERNIE 2.0 as the pre-trained model, the preparation involves:
    - downloading the model
    - preparing the data
    - configuring the training JSON file
    - launching training
    - configuring the prediction JSON file
    - launching prediction
- We walk through a text classification task as a quick introduction to the ERNIE large models.
## Download the model
- We use the ERNIE 2.0 pre-trained model for the text classification task.
- Download and set up the ERNIE 2.0 pre-trained model:
```plain
# download the ernie_2.0 model
# enter the models_hub directory
cd ./wenxin_appzoo/models_hub
# run the download script
sh download_ernie_2.0_base_ch.sh
```
## Prepare the data
- The data directory of every Wenxin task ships with sample data that works out of the box, so you can quickly get familiar with the toolkit.
- Data for the text classification task:
```shell
# enter the text classification task folder
cd ./wenxin_appzoo/tasks/text_classification/
# list the bundled datasets
ls ./data
```
- Note: the sample data only demonstrates the format; replace it with real data when actually training a model.
## Configure the training JSON file
- The preset JSON files live under the ./examples/ directory. The configuration for training with the ERNIE 2.0 pre-trained model is ./examples/cls_ernie_fc_ch.json, which wires up the data, the model, and the training procedure (an illustrative fragment follows below).
```shell
# inspect the config for training a text classifier with ERNIE 2.0
cat ./examples/cls_ernie_fc_ch.json
```
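For orientation only, the reader section of this config follows the same shape as the prediction config shown later in this README; the data path below is a placeholder, and the real file contains further sections (model, trainer, and so on):

```json
{
  "dataset_reader": {"train_reader": {"config": {"data_path": "./data/train_data"}}}
}
```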
## Launch training
- With the dataset in place and cls_ernie_fc_ch.json configured, you can run the training command.
- The single-GPU command is `python run_trainer.py`; as shown below, it trains the ERNIE-based Chinese text classification model locally on the training set.
```shell
# ERNIE Chinese text classification model
# trains a preset network driven by the JSON config ./examples/cls_ernie_fc_ch.json
python run_trainer.py --param_path ./examples/cls_ernie_fc_ch.json
```
- The multi-GPU command is:
```plain
fleetrun --gpus=x,y run_trainer.py --param_path ./examples/cls_ernie_fc_ch.json
```
- Training logs are saved to **./log/test.log**.
- Model files produced during and after training are saved under **./output/** by default: **save_inference_model/** holds the model files used for prediction, and **save_checkpoint/** holds the model files used for warm starts.
## Configure the prediction JSON file
- The preset JSON files live under the ./examples/ directory; the configuration for running prediction with a model trained from ERNIE 2.0 is ./examples/cls_ernie_fc_ch_infer.json.
- Mainly adjust the input path of the prediction model, the input path of the file to predict, and the output path for the prediction results in ./examples/cls_ernie_fc_ch_infer.json, as follows:
```json
{
"dataset_reader":{"train_reader":{"config":{"data_path":"./data/predict_data"}}},
"inference":{"inference_model_path":"./output/cls_ernie_fc_ch/save_inference_model/inference_step_251",
"output_path": "./output/predict_result.txt"}
}
```
## Launch prediction
- Run run_infer.py with the corresponding configuration file:
```plain
python run_infer.py --param_path ./examples/cls_ernie_fc_ch_infer.json
```
- The prediction output is saved to ./output/predict_result.txt.
# Pre-trained models
- For an introduction to the pre-trained models, see the [model overview](readme_model.md).
- To download a pre-trained model, enter the ./wenxin_appzoo/models_hub directory, e.g.:
```plain
# enter the pre-trained model download directory
cd ./wenxin_appzoo/models_hub
# download the ERNIE 3.0 base model
sh download_ernie3.0_base_ch.sh
```
- For more open-source models, see [Research](./Research/README.md).
# Model evaluation
[Model evaluation](readme_score.md)
# Datasets
[CLUE datasets](https://www.cluebenchmarks.com/)
[DuIE 2.0 dataset](https://www.luge.ai/#/luge/dataDetail?id=5)
[MSRA_NER dataset](https://ernie-github.cdn.bcebos.com/data-msra_ner.tar.gz)
# 应用场景
文本分类([文本分类](./nlp-ernie/wenxin_appzoo/wenxin_appzoo/tasks/text_classification/README.md)
文本匹配([文本匹配](./nlp-ernie/wenxin_appzoo/wenxin_appzoo/tasks/text_matching/README.md)
系列标注([序列标注](./nlp-ernie/wenxin_appzoo/wenxin_appzoo/tasks/sequence_labeling/README.md)
信息抽取([信息抽取](./nlp-ernie/wenxin_appzoo/wenxin_appzoo/tasks/information_extraction_many_to_many/README.md)
文本生成([文本生成](./nlp-ernie/wenxin_appzoo/wenxin_appzoo/tasks/text_generation/README.md)
数据蒸馏([数据蒸馏](./nlp-ernie/wenxin_appzoo/wenxin_appzoo/tasks/data_distillation/README.md)
工具使用([工具使用](./nlp-ernie/wenxin_appzoo/wenxin_appzoo/tools/README.md)
# Citation
### ERNIE 1.0
```
@article{sun2019ernie,
title={Ernie: Enhanced representation through knowledge integration},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Chen, Xuyi and Zhang, Han and Tian, Xin and Zhu, Danxiang and Tian, Hao and Wu, Hua},
journal={arXiv preprint arXiv:1904.09223},
year={2019}
}
```
### ERNIE 2.0
```
@inproceedings{sun2020ernie,
title={Ernie 2.0: A continual pre-training framework for language understanding},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={34},
number={05},
pages={8968--8975},
year={2020}
}
```
### ERNIE-GEN
```
@article{xiao2020ernie,
title={Ernie-gen: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation},
author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2001.11314},
year={2020}
}
```
### ERNIE-ViL
```
@article{yu2020ernie,
title={Ernie-vil: Knowledge enhanced vision-language representations through scene graph},
author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2006.16934},
year={2020}
}
```
### ERNIE-Gram
```
@article{xiao2020ernie,
title={ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding},
author={Xiao, Dongling and Li, Yu-Kun and Zhang, Han and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2010.12148},
year={2020}
}
```
### ERNIE-Doc
```
@article{ding2020ernie,
title={ERNIE-Doc: A retrospective long-document modeling transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```
### ERNIE-UNIMO
```
@article{li2020unimo,
title={Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning},
author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15409},
year={2020}
}
```
### ERNIE-M
```
@article{ouyang2020ernie,
title={Ernie-m: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora},
author={Ouyang, Xuan and Wang, Shuohuan and Pang, Chao and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15674},
year={2020}
}
```
[English](./README.en.md)|简体中文
![./.metas/ERNIE_milestone.png](./.metas/ERNIE_milestone_20210519_zh.png)
ERNIE is Baidu's knowledge-enhanced continual-learning framework for semantic understanding. It combines large-scale pre-training data with rich multi-source knowledge and, through continual learning, keeps absorbing lexical, structural, and semantic knowledge from massive text corpora, so the model keeps improving. ERNIE has achieved SOTA results on more than 40 typical NLP tasks and won over ten first places in authoritative international benchmarks such as GLUE, VCR, XTREME, and SemEval. In 2020, ERNIE received the Outstanding Science and Technology Achievement Award of the Chinese Association for Artificial Intelligence and the SAIL Award, the top honor of the World Artificial Intelligence Conference; the technology was covered on the official website of MIT Technology Review, and the related innovations have been published at top academic venues including AAAI, ACL, NAACL, and IJCAI. ERNIE is widely deployed in industry, e.g. search engines, news recommendation, advertising systems, voice interaction, and intelligent customer service.
**Reminder: the legacy ERNIE code has moved to the [repro branch](https://github.com/PaddlePaddle/ERNIE/tree/repro). We recommend developing with the newly upgraded ERNIE kit, which unifies dynamic and static graphs. You are also welcome to try [EasyDL](https://ai.baidu.com/easydl/pro) for richer features (e.g. ERNIE 2.0, ERNIE 2.1, and domain-specific ERNIE models).**
[Learn more](https://wenxin.baidu.com/)
# News
- 2021.12.3:
    - The multilingual pre-trained model `ERNIE-M` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-m)
- 2021.5.20:
    - Four newly open-sourced ERNIE pre-trained models:
        - The multi-granularity linguistic knowledge model `ERNIE-Gram` is [open-sourced](./ernie-gram/)
        - The long-text bidirectional modeling model `ERNIE-Doc` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-doc)
        - The scene-graph-enhanced cross-modal model `ERNIE-ViL` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil)
        - The unified language-vision model `ERNIE-UNIMO` is [open-sourced](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-unimo)
- 2020.12.29:
    - The `ERNIE` open-source toolkit is fully upgraded to [PaddlePaddle v2.0](https://github.com/PaddlePaddle/Paddle/tree/release/2.0-rc)
    - All demo tutorials now use AMP (mixed-precision training), with an average speed-up of 2.3x.
    - `Gradient accumulation` introduced: `ERNIE-large` runs with as little as 8 GB of GPU memory.
- 2020.9.24:
    - `ERNIE-ViL` released! ([link](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil))
        - A knowledge-enhanced pre-training framework for vision-language, the first to introduce structured knowledge into vision-language pre-training.
        - Uses scene-graph knowledge to construct object, attribute, and relation prediction tasks that capture fine-grained cross-modal semantic alignment.
        - Best results on five vision-language downstream tasks; 1st place on the [VCR leaderboard](https://visualcommonsense.com/).
- 2020.5.20:
    - Try ERNIE implemented with the `dynamic graph`:
        - Eager execution: what you see is what you get.
        - Large-scale distributed training.
        - Easy deployment.
        - Learn NLP quickly through the AIStudio tutorials.
        - Backward compatible with old-style checkpoints.
    - `ERNIE-GEN` officially open-sourced! ([link](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen))
        - The strongest pre-trained model for text generation, with the work accepted by `IJCAI-2020`.
        - Extends ERNIE pre-training to text generation for the first time, with best results on several typical tasks.
        - All models reported in the paper are available for download (including [`base/large/large-430G`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen/README.zh.md#预训练模型)).
        - First to add a span-by-span generation task during pre-training, so the model generates a semantically complete span at each step.
        - Proposes an infilling generation mechanism and a noise-aware mechanism to mitigate exposure bias.
        - Implemented with a carefully designed Multi-Flow Attention architecture.
- 2020.4.30: Released [ERNIESage](https://github.com/PaddlePaddle/PGL/tree/master/examples/erniesage), a novel graph neural network model using ERNIE as its aggregator, implemented with [PGL](https://github.com/PaddlePaddle/PGL).
- 2020.3.27: [Champion on 5 SemEval2020 subtasks](https://www.jiqizhixin.com/articles/2020-03-27-8)
- 2019.12.26: [1st place on the GLUE leaderboard](https://www.technologyreview.com/2019/12/26/131372/ai-baidu-ernie-google-bert-natural-language-glue/)
- 2019.11.6: Released [ERNIE Tiny](https://www.jiqizhixin.com/articles/2019-11-06-9)
- 2019.7.7: Released [ERNIE 2.0](https://www.jiqizhixin.com/articles/2019-07-31-10)
- 2019.3.16: Released [ERNIE 1.0](https://www.jiqizhixin.com/articles/2019-03-16-3)
# Table of contents
* [Tutorials](#tutorials)
* [Setup](#setup)
* [Supported NLP tasks](#supported-nlp-tasks)
* [Pre-training (ERNIE 1.0)](#pre-training-ernie-10)
* [Online inference](#online-inference)
* [Distillation](#distillation)
# Quick Tour
```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel
model = ErnieModel.from_pretrained('ernie-1.0') # Try to get pretrained model from server, make sure you have network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0)) # insert extra `batch` dimension
pooled, encoded = model(ids) # eager execution
print(pooled.numpy()) # convert results to numpy
```
# Tutorials
Don't have a GPU at hand? Try ERNIE on [AIStudio](https://aistudio.baidu.com/aistudio/index)!
(please choose the latest version of each tutorial and apply for a GPU environment)
1. [ERNIE for beginners](https://aistudio.baidu.com/studio/edu/group/quick/join/314947)
1. [Sentiment analysis](https://aistudio.baidu.com/aistudio/projectdetail/427482)
2. [Cloze test](https://aistudio.baidu.com/aistudio/projectdetail/433491)
3. [Knowledge distillation](https://aistudio.baidu.com/aistudio/projectdetail/439460)
4. [Ask ERNIE](https://aistudio.baidu.com/aistudio/projectdetail/456443)
5. [Loading old-style checkpoints](https://aistudio.baidu.com/aistudio/projectdetail/493415)
6. [Writing poetry with ERNIE](https://aistudio.baidu.com/aistudio/projectdetail/502844)
# Setup
##### 1. Install PaddlePaddle
This project requires PaddlePaddle 1.7.0+; see [here](https://www.paddlepaddle.org.cn/install/quick) for installation instructions.
##### 2. Install the ERNIE kit
```script
pip install paddle-ernie
```
or
```shell
git clone https://github.com/PaddlePaddle/ERNIE.git --depth 1
cd ERNIE
pip install -r requirements.txt
pip install -e .
```
`propeller` is a high-level framework for model training that bundles common NLP pre- and post-processing steps. You can import `propeller` by adding the root of this repo to `PYTHONPATH`:
```shell
export PYTHONPATH=$PWD:$PYTHONPATH
```
##### 3. Download pre-trained models (optional)<a name="section-pretrained-models"></a>
| Model | Details |Download tag|
| :------------------------------------------------- |:------------------------------------------------------------------------- |:-------|
| [ERNIE 1.0 Base for Chinese](https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz) | Layer:12, Hidden:768, Heads:12 |ernie-1.0|
| [ERNIE Tiny](https://ernie-github.cdn.bcebos.com/model-ernie_tiny.1.tar.gz) | Layer:3, Hidden:1024, Heads:16 |ernie-tiny|
| [ERNIE 2.0 Base for English](https://ernie-github.cdn.bcebos.com/model-ernie2.0-en.1.tar.gz) | Layer:12, Hidden:768, Heads:12 |ernie-2.0-en|
| [ERNIE 2.0 Large for English](https://ernie-github.cdn.bcebos.com/model-ernie2.0-large-en.1.tar.gz) | Layer:24, Hidden:1024, Heads:16 |ernie-2.0-large-en|
| [ERNIE Gen Base for English](https://ernie-github.cdn.bcebos.com/model-ernie-gen-base-en.1.tar.gz) | Layer:12, Hidden:768, Heads:12 |ernie-gen-base-en|
| [ERNIE Gen Large for English](https://ernie-github.cdn.bcebos.com/model-ernie-gen-large-en.1.tar.gz)| Layer:24, Hidden:1024, Heads:16 |ernie-gen-large-en|
| [ERNIE Gen Large 430G for English](https://ernie-github.cdn.bcebos.com/model-ernie-gen-large-430g-en.1.tar.gz)| Layer:24, Hidden:1024, Heads:16 + an extra 430 GB pre-training corpus | ernie-gen-large-430g-en |
| [ERNIE Doc Base for Chinese](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz)| Layer:12, Hidden:768, Heads:12 |ernie-doc-base-zh|
| [ERNIE Doc Base for English](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz)| Layer:12, Hidden:768, Heads:12 |ernie-doc-base-en|
| [ERNIE Doc Large for English](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz)| Layer:24, Hidden:1024, Heads:16 |ernie-doc-large-en|
| [ERNIE Gram Base for Chinese](https://ernie-github.cdn.bcebos.com/model-ernie-gram-zh.1.tar.gz)| Layer:12, Hidden:768, Heads:12 |ernie-gram-zh|
| [ERNIE Gram Base for English](https://ernie-github.cdn.bcebos.com/model-ernie-gram-en.1.tar.gz)| Layer:12, Hidden:768, Heads:12 |ernie-gram-en|
##### 4. Download datasets
**English datasets**
Run [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) to download the [GLUE datasets](https://gluebenchmark.com/tasks).
Please organize the data directory as follows so the later demo tutorials can find it (pass the path to the training scripts via `--data_dir`):
```shell
data/xnli
├── dev
│   └── 1
├── test
│   └── 1
└── train
└── 1
```
See the [demo](https://ernie-github.cdn.bcebos.com/data-mnli-m.tar.gz) data (test and training sets for the MNLI task).
**Chinese datasets**
| Dataset|Description|
|:--------|:----------|
| [XNLI](https://ernie-github.cdn.bcebos.com/data-xnli.tar.gz) |XNLI is a natural language inference dataset in 15 languages, jointly built by researchers from Facebook and New York University. We use its Chinese data to evaluate the language understanding ability of our models. [link](https://github.com/facebookresearch/XNLI)|
| [ChnSentiCorp](https://ernie-github.cdn.bcebos.com/data-chnsenticorp.tar.gz) |ChnSentiCorp is a Chinese sentiment analysis dataset consisting of online shopping reviews of hotels, notebooks, and books.|
| [MSRA-NER](https://ernie-github.cdn.bcebos.com/data-msra_ner.tar.gz) |MSRA-NER (SIGHAN2006) is released by Microsoft Research Asia for recognizing entities with specific meanings in text, including the names of people, locations, and organizations.|
| [NLPCC2016-DBQA](https://ernie-github.cdn.bcebos.com/data-dbqa.tar.gz) |NLPCC2016-DBQA is an evaluation task held in 2016 by NLPCC, the international conference on Natural Language Processing and Chinese Computing; the goal is to select documents from the candidates that answer the questions. [link](http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf)|
|[CMRC2018](https://ernie-github.cdn.bcebos.com/data-cmrc2018.tar.gz)|CMRC2018 is an extractive reading comprehension evaluation held by the Chinese Information Processing Society of China. [link](https://github.com/ymcui/cmrc2018)
# Supported NLP tasks
- Fine-tune with the `dynamic graph` model:
```script
python3 ./demo/finetune_classifier.py \
--from_pretrained ernie-1.0 \
--data_dir ./data/xnli
```
- Add `--use_amp` to enable AMP (please enable AMP only on devices with `TensorCore` support).
- `--bsz` sets the global batch size (the number of samples the model sees in one optimization step); `--micro_bsz` sets the number of samples fed to each GPU card.
If `--bsz > --micro_bsz`, the script enables gradient accumulation automatically.
- Distributed fine-tuning
`paddle.distributed.launch` is a process manager; we use it to start one Python process per GPU card, together with the environment variables required for distributed training.
In distributed training we use `max_steps` as the stopping criterion rather than `epoch`, to avoid deadlocks between processes.
You can compute the required `max_steps` as `EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH`.
Also note that the training set must be sharded across processes, to avoid the overfitting caused by every process training on the same data.
Demo script (make sure you have more than two GPU cards; online model download does not work under `paddle.distributed.launch`,
so first fetch the pretrained model via a single-card fine-tune run, or download and extract it manually from [here](#section-pretrained-models)):
```script
python3 -m paddle.distributed.launch \
./demo/finetune_classifier_distributed.py \
--data_dir data/mnli \
--max_steps 10000 \
--from_pretrained ernie-2.0-en
```
More demo scripts:
1. [Sentiment analysis](./demo/finetune_sentiment_analysis.py)
1. [Semantic matching](./demo/finetune_classifier.py)
1. [Named Entity Recognition (NER)](./demo/finetune_ner.py)
1. [Machine reading comprehension](./demo/finetune_mrc.py) (requires a multi-GPU environment; see the "Distributed fine-tuning" section above)
1. [Text summarization](./demo/seq2seq/README.md)
1. [Text classification with the static graph](./demo/finetune_classifier_static.py)
**Recommended hyperparameters:**
|Task|batch size|learning rate|
|--|--|--|
| CoLA | 32 / 64 (base) | 3e-5 |
| SST-2 | 64 / 256 (base) | 2e-5 |
| STS-B | 128 | 5e-5 |
| QQP | 256 | 3e-5(base)/5e-5(large) |
| MNLI | 256 / 512 (base)| 3e-5 |
| QNLI | 256 | 2e-5 |
| RTE | 16 / 4 (base) | 2e-5(base)/3e-5(large) |
| MRPC | 16 / 32 (base) | 3e-5 |
| WNLI | 8 | 2e-5 |
| XNLI | 512 | 1e-4(base)/4e-5(large) |
| CMRC2018 | 64 | 3e-5 |
| DRCD | 64 | 5e-5(base)/3e-5(large) |
| MSRA-NER(SIGHAN2006) | 16 | 5e-5(base)/1e-5(large) |
| ChnSentiCorp | 24 | 5e-5(base)/1e-5(large) |
| LCQMC | 32 | 2e-5(base)/5e-6(large) |
| NLPCC2016-DBQA| 64 | 2e-5(base)/1e-5(large) |
| VCR | 64 | 2e-5(base)/2e-5(large) |
# Pre-training (ERNIE 1.0)
See [here](./demo/pretrain/README.md).
# Online inference
If `--inference_model_dir` is passed to `finetune_classifier.py`, the fine-tuning script will serialize your model and produce an `inference_model` that can be deployed for online prediction directly.
For implementation details of online prediction in production, see the [C++ inference API](./inference/README.md).
Alternatively, you can use `propeller` to start a multi-GPU prediction service (a GPU environment is required) by running:
```shell
python -m propeller.tools.start_server -m /path/to/saved/inference_model -p 8881
```
This starts the prediction service; then call it from Python just like a local function (python3 only):
```python
import numpy as np
from propeller.service.client import InferenceClient
from ernie.tokenizing_ernie import ErnieTokenizer
client = InferenceClient('tcp://localhost:8881')
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, sids = tokenizer.encode('hello world')
ids = np.expand_dims(ids, 0)
sids = np.expand_dims(sids, 0)
result = client(ids, sids)
```
You can also download a pre-built `inference_model` of the ernie-1.0 base model from [here](https://ernie.bj.bcebos.com/ernie1.0_zh_inference_model.tar.gz).
The model has not been fine-tuned; it is typically used for feature-based fine-tuning with a task network on top, or as a text feature extractor.
Because this model was produced by the old API, append an extra dimension to the input tensors when issuing client requests:
```python
ids = np.expand_dims(ids, -1)  # ids.shape == [BATCH, SEQLEN, 1]
```
# Distillation
Knowledge distillation is an effective way to compress and accelerate ERNIE; for implementation details, see [here](./demo/distill/README.md).
# 文献引用
### ERNIE 1.0
```
@article{sun2019ernie,
title={Ernie: Enhanced representation through knowledge integration},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Chen, Xuyi and Zhang, Han and Tian, Xin and Zhu, Danxiang and Tian, Hao and Wu, Hua},
journal={arXiv preprint arXiv:1904.09223},
year={2019}
}
```
### ERNIE 2.0
```
@article{sun2019ernie20,
title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:1907.12412},
year={2019}
}
```
### ERNIE-GEN
```
@article{xiao2020ernie-gen,
title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation},
author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2001.11314},
year={2020}
}
```
### ERNIE-ViL
```
@article{yu2020ernie,
title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2006.16934},
year={2020}
}
```
### ERNIE-Gram
```
@article{xiao2020ernie,
title={ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding},
author={Xiao, Dongling and Li, Yu-Kun and Zhang, Han and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2010.12148},
year={2020}
}
```
### ERNIE-Doc
```
@article{ding2020ernie,
title={ERNIE-DOC: The Retrospective Long-Document Modeling Transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```
### ERNIE-UNIMO
```
@article{li2020unimo,
title={UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning},
author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15409},
year={2020}
}
```
### ERNIE-M
```
@article{ouyang2020ernie,
title={Ernie-m: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora},
author={Ouyang, Xuan and Wang, Shuohuan and Pang, Chao and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15674},
year={2020}
}
```
To reproduce all experiments in the papers, please check out the `repro` branch of this repo.
### Communication
- [ERNIE homepage](https://wenxin.baidu.com/)
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- QQ discussion group: 760439550 (ERNIE discussion group).
- QQ discussion group: 958422639 (ERNIE discussion group-v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
* [ERNIE Slim data distillation](#ernie-slim-data-distillation)
* [Three steps of ERNIE data distillation](#three-steps-of-ernie-data-distillation)
* [Data augmentation](#data-augmentation)
* [Tutorial](#tutorial)
* [Benchmarks](#benchmarks)
* [Case #1: the user provides unlabeled data](#case1)
* [Case #2: no unlabeled data provided](#case2)
# ERNIE Slim data distillation
Behind ERNIE's strong semantic understanding lies an equally strong demand for compute to train and serve a model of this scale. Many industrial scenarios have strict performance requirements, and without effective compression the model cannot be used in practice.
![ernie_distill](../../.metas/ernie_distill.png)
Therefore, as illustrated above, we built the **ERNIE Slim data distillation system** on top of [data distillation](https://arxiv.org/pdf/1712.04440.pdf). Using data as the bridge, it transfers the knowledge of the ERNIE model into a small model, achieving a prediction speed-up of up to a thousand times at a small cost in accuracy.
### Three steps of ERNIE data distillation
- **Step 1.** Fine-tune ERNIE on the labeled input data to obtain the Teacher Model.
- **Step 2.** Use the ERNIE Service to predict on the following unlabeled data (a minimal sketch follows this list):
    1. large-scale unlabeled data provided by the user, drawn from the same source as the labeled data;
    2. augmented versions of the labeled data (the augmentation strategies are described in the next section);
    3. a mixture of the unlabeled and the augmented data at a chosen ratio.
- **Step 3.** Train the Student Model on the data from Step 2.
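A minimal sketch of Step 2, using only APIs that appear elsewhere in this repo's demos; the teacher checkpoint path is an assumption, and in the real pipeline this step is handled by `./distill/distill.py`:

```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification

# load the fine-tuned teacher from Step 1 (assumed checkpoint path)
teacher = ErnieModelForSequenceClassification.from_pretrained('ernie-1.0', num_labels=2)
teacher.set_state_dict(P.load('./teacher_model.bin'))
teacher.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

def pseudo_label(text):
    # let the teacher assign a hard label to one unlabeled sentence
    ids, _ = tokenizer.encode(text)
    ids = P.to_tensor(np.expand_dims(ids, 0))
    _, logits = teacher(ids)
    return int(logits.argmax(-1).numpy())
```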
### Data augmentation
We currently use three [data augmentation strategies](https://arxiv.org/pdf/1903.12136.pdf), which can be mixed at a task-specific ratio (a sketch of two of them follows this list):
1. Noising: each word of the original sample is replaced with the "UNK" label with some probability (e.g. 0.1).
2. Same-POS replacement: each word of the original sample is replaced, with some probability (e.g. 0.1), by a random word of the same part of speech from the dataset.
3. N-sampling: a span of length m is cut from a random position of the original sample to form a new sample, where m is a random value between 0 and the original sample length.
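A minimal sketch of strategies 1 and 3, assuming whitespace-tokenized input (the function names are illustrative, not part of the toolkit; strategy 2 additionally needs a POS-tagged vocabulary):

```python
import random

def add_noise(tokens, p=0.1):
    # strategy 1: replace each token with the "UNK" label with probability p
    return [t if random.random() >= p else 'UNK' for t in tokens]

def n_sampling(tokens):
    # strategy 3: cut a span of random length m at a random position
    m = random.randint(1, len(tokens))
    start = random.randint(0, len(tokens) - m)
    return tokens[start:start + m]
```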
# Tutorial
We built an augmented ChnSentiCorp dataset with the three strategies above: the augmented data is 10x the original training data (96,000 lines) and can be downloaded [here](https://ernie-github.cdn.bcebos.com/data-chnsenticorp-distill.tar.gz). Then run the following script to start distillation:
```shell
python ./distill/distill.py
```
# Benchmarks
We distinguish two practical settings:
### Case #1: the user provides unlabeled data<a name="case1"></a>
|Model | Low-quality comment detection [classification \| ACC] | Chinese sentiment [classification \| ACC] |Question detection [classification \| ACC]|Search QA matching [matching \| positive/negative pair order]|
|---|---|---|---|---|
|ERNIE-Finetune | 90.6% | 96.2% | 97.5% | 4.25 |
|Non-ERNIE baseline (BOW)| 80.8% | 94.7% | 93.0% | 1.83 |
|**+ data distillation** | 87.2% | 95.8% | 96.3% | 3.30 |
### Case #2: no unlabeled data provided (data generated via augmentation)<a name="case2"></a>
|Model |ChnSentiCorp |
|---|---|
|ERNIE-Finetune |95.4% |
|Non-ERNIE baseline (BOW)|90.1%|
|**+ data distillation** |91.4%|
|Non-ERNIE baseline (LSTM)|91.2%|
|**+ data distillation**|93.9%|
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import os
import numpy as np
from sklearn.metrics import f1_score
import paddle as P
from paddle.nn import functional as F
import propeller.paddle as propeller
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
# This example uses the chnsenticorp Chinese sentiment task as a demonstration;
# the unlabeled data needed for distillation was generated beforehand via data augmentation.
#
# Download the data and place it under ./chnsenticorp-data/
# Each line has 3 columns: raw text; space-tokenized text; sentiment label
# The first column is the input to ERNIE; the second is the input to the BoW model
# The precomputed BoW vocabulary is at ./chnsenticorp-data/vocab.bow.txt
# Hyperparameters for fine-tuning the teacher model
DATA_DIR = './chnsenticorp-data/'
SEQLEN = 256
BATCH = 32
EPOCH = 10
LR = 5e-5
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
student_vocab = {
i.strip(): l
for l, i in enumerate(
open(
os.path.join(DATA_DIR, 'vocab.bow.txt'), encoding='utf8')
.readlines())
}
def space_tokenizer(i):
return i.decode('utf8').split()
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_a_student',
unk_id=student_vocab['[UNK]'],
vocab_dict=student_vocab,
tokenizer=space_tokenizer),
propeller.data.LabelColumn(
'label', vocab_dict={
b"0": 0,
b"1": 1,
}),
])
def map_fn(seg_a, seg_a_student, label):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=SEQLEN)
sentence, segments = tokenizer.build_for_ernie(seg_a)
return seg_a_student, sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(DATA_DIR, 'train/'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(BATCH)
train_ds_unlabel = feature_column.build_dataset('train-da', data_dir=os.path.join(DATA_DIR, 'train-data-augmented/'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(BATCH)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(DATA_DIR, 'dev/'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(BATCH,)
shapes = ([-1, SEQLEN], [-1, SEQLEN], [-1, SEQLEN], [-1])
types = ('int64', 'int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
train_ds_unlabel.data_shapes = shapes
train_ds_unlabel.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
place = P.CUDAPlace(0)
def evaluate_teacher(model, dataset):
all_pred, all_label = [], []
with P.no_grad():
model.eval()
for step, (ids_student, ids, _, labels) in enumerate(
P.io.DataLoader(
dataset, places=place, batch_size=None)):
_, logits = model(ids)
pred = logits.argmax(-1)
all_pred.extend(pred.numpy())
all_label.extend(labels.numpy())
f1 = f1_score(all_label, all_pred, average='macro')
model.train()
return f1
teacher_model = ErnieModelForSequenceClassification.from_pretrained(
'ernie-1.0', num_labels=2)
teacher_model.train()
if not os.path.exists('./teacher_model.bin'):
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
LR,
get_warmup_and_linear_decay(9600 * EPOCH / BATCH,
9600 * EPOCH * 0.1 / BATCH))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=teacher_model.parameters(),
weight_decay=0.01,
grad_clip=g_clip)
for epoch in range(EPOCH):
for step, (ids_student, ids, sids, labels) in enumerate(
P.io.DataLoader(
train_ds, places=place, batch_size=None)):
loss, logits = teacher_model(ids, labels=labels)
loss.backward()
opt.step()
lr_scheduler.step()
teacher_model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
if step % 100 == 0:
f1 = evaluate_teacher(teacher_model, dev_ds)
print('teacher f1: %.5f' % f1)
    P.save(teacher_model.state_dict(), './teacher_model.bin')
else:
state_dict = P.load('./teacher_model.bin')
teacher_model.set_state_dict(state_dict)
f1 = evaluate_teacher(teacher_model, dev_ds)
print('teacher f1: %.5f' % f1)
# Hyperparameters for fine-tuning the student model
SEQLEN = 256
BATCH = 32
EPOCH = 10
LR = 1e-4
def evaluate_student(model, dataset):
all_pred, all_label = [], []
with P.no_grad():
model.eval()
for step, (ids_student, ids, _, labels) in enumerate(
P.io.DataLoader(
dataset, places=place, batch_size=None)):
_, logits = model(ids_student)
pred = logits.argmax(-1)
all_pred.extend(pred.numpy())
all_label.extend(labels.numpy())
f1 = f1_score(all_label, all_pred, average='macro')
model.train()
return f1
class BOW(P.nn.Layer):
def __init__(self):
super().__init__()
self.emb = P.nn.Embedding(len(student_vocab), 128, padding_idx=0)
self.fc = P.nn.Linear(128, 2)
def forward(self, ids, labels=None):
embbed = self.emb(ids)
pad_mask = (ids != 0).cast('float32').unsqueeze(-1)
embbed = (embbed * pad_mask).sum(1)
embbed = F.softsign(embbed)
logits = self.fc(embbed)
if labels is not None:
if len(labels.shape) == 1:
labels = labels.reshape([-1, 1])
loss = F.cross_entropy(logits, labels).mean()
else:
loss = None
return loss, logits
class CNN(P.nn.Layer):
    def __init__(self):
        super().__init__()
        self.emb = P.nn.Embedding(30002, 128, padding_idx=0)
        # paddle.nn.Conv2D has no `act` argument; ReLU is applied explicitly in forward
        self.cnn = P.nn.Conv2D(128, 128, (1, 3), padding=(0, 1))
        self.pool = P.nn.MaxPool2D((1, 3), stride=1, padding=(0, 1))
        self.fc = P.nn.Linear(128, 2)

    def forward(self, ids, labels=None):
        embbed = self.emb(ids)
        #d_batch, d_seqlen = ids.shape
        hidden = embbed
        hidden = hidden.transpose([0, 2, 1]).unsqueeze(2)  # change to NCHW: [B, 128, 1, seqlen]
        hidden = F.relu(self.cnn(hidden))
        hidden = self.pool(hidden).squeeze(2).transpose([0, 2, 1])
        pad_mask = (ids != 0).cast('float32').unsqueeze(-1)
        # mask out padding, sum-pool over the sequence, then squash with softsign
        hidden = F.softsign((hidden * pad_mask).sum(1))
        logits = self.fc(hidden)
        if labels is not None:
            if len(labels.shape) == 1:
                labels = labels.reshape([-1, 1])
            loss = F.cross_entropy(logits, labels).mean()
        else:
            loss = None
        return loss, logits
def KL(pred, target):
    # distillation objective: KL divergence between teacher and student output
    # distributions; paddle's F.kl_div expects the first argument in log space
    # and the second as probabilities
    pred = F.log_softmax(pred)
    target = F.softmax(target)
    loss = F.kl_div(pred, target)
    return loss
teacher_model.eval()
model = BOW()
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
LR,
get_warmup_and_linear_decay(9600 * EPOCH / BATCH,
9600 * EPOCH * 0.1 / BATCH))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=0.01,
grad_clip=g_clip)
model.train()
for epoch in range(EPOCH - 1):
for step, (
ids_student, ids, sids, label
) in enumerate(P.io.DataLoader(
train_ds, places=place, batch_size=None)):
with P.no_grad():
_, logits_t = teacher_model(ids, sids)  # logits from the teacher model
_, logits_s = model(ids_student)  # logits from the student model
loss_ce, _ = model(ids_student, labels=label)
loss_kd = KL(logits_s, logits_t.detach())  # KL divergence measures the distance between the two distributions
loss = loss_ce + loss_kd
loss.backward()
opt.step()
lr_scheduler.step()
model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
f1 = evaluate_student(model, dev_ds)
print('student f1 %.5f' % f1)
# finally, run one more epoch with hard labels to consolidate the result
for step, (
ids_student, ids, sids, label
) in enumerate(P.io.DataLoader(
train_ds, places=place, batch_size=None)):
loss, _ = model(ids_student, labels=label)
loss.backward()
opt.step()
model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
f1 = evaluate_student(model, dev_ds)
print('final f1 %.5f' % f1)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from functools import reduce, partial
from visualdl import LogWriter
import numpy as np
import argparse
from pathlib import Path
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
help='max sentence length, should not be greater than 512')
parser.add_argument(
'--bsz',
type=int,
default=128,
help='global batch size for each optimizer step')
parser.add_argument(
'--micro_bsz',
type=int,
default=32,
help='batch size for each device; if `--bsz` > `--micro_bsz` * num_device, gradient accumulation will be performed'
)
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='data directory includes train / develop data')
parser.add_argument(
'--use_lr_decay',
action='store_true',
help='if set, learning rate will decay to zero at `max_steps`')
parser.add_argument(
'--warmup_proportion',
type=float,
default=0.1,
help='if use_lr_decay is set, '
'learning rate will rise to `lr` at `warmup_proportion` * `max_steps` and decay to 0 at `max_steps`'
)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--inference_model_dir',
type=Path,
default=None,
help='inference model output directory')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--max_steps',
type=int,
default=None,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--init_checkpoint',
type=str,
default=None,
help='checkpoint to warm start from')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
if args.bsz > args.micro_bsz:
assert args.bsz % args.micro_bsz == 0, 'cannot perform gradient accumulate with bsz:%d micro_bsz:%d' % (
args.bsz, args.micro_bsz)
acc_step = args.bsz // args.micro_bsz
log.info(
'performing gradient accumulate: global_bsz:%d, micro_bsz:%d, accumulate_steps:%d'
% (args.bsz, args.micro_bsz, acc_step))
args.bsz = args.micro_bsz
else:
acc_step = 1
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_b',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label',
vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
place = P.CUDAPlace(0)
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd = P.load(str(args.init_checkpoint))
model.set_state_dict(sd)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclude_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
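# Parameters matched by this pattern (LayerNorm scales/biases and bias
# variables, whose Paddle names end in `b_0`) are excluded from weight decay
# via `apply_decay_param_fun` below -- a common practice when finetuning
# transformer models.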
if args.use_lr_decay:
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(
args.max_steps, int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n),
grad_clip=g_clip)
else:
lr_scheduler = None
opt = P.optimizer.AdamW(
args.lr,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n),
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
step, inter_step = 0, 0
with LogWriter(
logdir=str(create_if_not_exists(args.save_dir / 'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for ids, sids, label in P.io.DataLoader(
train_ds, places=P.CUDAPlace(0), batch_size=None):
inter_step += 1
loss, _ = model(ids, sids, labels=label)
loss /= acc_step
loss = scaler.scale(loss)
loss.backward()
if inter_step % acc_step != 0:
continue
step += 1
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler and lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr() if args.use_lr_decay else args.lr
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l,
_lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
acc = []
with P.no_grad():
model.eval()
for ids, sids, label in P.io.DataLoader(
dev_ds, places=P.CUDAPlace(0),
batch_size=None):
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
if args.save_dir is not None:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
if args.inference_model_dir is not None:
class InferenceModel(ErnieModelForSequenceClassification):
def forward(self, ids, sids):
_, logits = super(InferenceModel, self).forward(ids, sids)
return logits
model.__class__ = InferenceModel
log.debug('saving inference model')
src_placeholder = P.zeros([2, 2], dtype='int64')
sent_placeholder = P.zeros([2, 2], dtype='int64')
_, static = P.jit.TracedLayer.trace(
model, inputs=[src_placeholder, sent_placeholder])
static.save_inference_model(str(args.inference_model_dir))
#class InferenceModel(ErnieModelForSequenceClassification):
# @P.jit.to_static
# def forward(self, ids, sids):
# _, logits = super(InferenceModel, self).forward(ids, sids, labels=None)
# return logits
#model.__class__ = InferenceModel
#src_placeholder = P.zeros([2, 2], dtype='int64')
#sent_placeholder = P.zeros([2, 2], dtype='int64')
#P.jit.save(model, args.inference_model_dir, input_var=[src_placeholder, sent_placeholder])
log.debug('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import time
import logging
import json
import re
from random import random
from functools import reduce, partial
import numpy as np
#from visualdl import LogWriter
from pathlib import Path
import paddle as P
from propeller import log
import propeller.paddle as propeller
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
parser = propeller.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='data directory includes train / develop data')
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--init_checkpoint',
type=str,
default=None,
help='checkpoint to warm start from')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
env = P.distributed.ParallelEnv()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_b',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label', vocab_dict={
b"0": 0,
b"1": 1,
b"2": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'),
shuffle=True, repeat=True, use_gz=False, shard=True) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'),
shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
P.distributed.init_parallel_env()
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd = P.load(str(args.init_checkpoint))
model.set_state_dict(sd)
model = P.DataParallel(model)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclude_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps,
int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n),
weight_decay=args.wd,
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
step = 0
create_if_not_exists(args.save_dir)
#with LogWriter(logdir=str(create_if_not_exists(args.save_dir / 'vdl-%d' % env.dev_id))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for ids, sids, label in P.io.DataLoader(
train_ds, places=P.CUDAPlace(env.dev_id), batch_size=None):
step += 1
loss, _ = model(ids, sids, labels=label)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
# do logging
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
#log_writer.add_scalar('loss', _l, step=step)
#log_writer.add_scalar('lr', _lr, step=step)
# do saving
if step % 100 == 0 and env.dev_id == 0:
acc = []
with P.no_grad():
model.eval()
for d in P.io.DataLoader(
dev_ds, places=P.CUDAPlace(env.dev_id),
batch_size=None):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
#log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
# exit
if step > args.max_steps:
break
if args.save_dir is not None and env.dev_id == 0:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
log.debug('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
import os
import re
import time
import logging
from random import random
import json
from functools import reduce, partial
import numpy as np
import multiprocessing
import tempfile
import paddle as P
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from demo.optimization import optimization
#import utils.data
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
def model_fn(features, mode, params, run_config):
ernie = ErnieModelForSequenceClassification(params, name='')
if mode is not propeller.RunMode.TRAIN:
ernie.eval()
else:
ernie.train()
metrics, loss, train_hooks = None, None, None
if mode is propeller.RunMode.PREDICT:
src_ids, sent_ids = features
_, logits = ernie(src_ids, sent_ids)
predictions = [logits, ]
else:
src_ids, sent_ids, labels = features
if mode is propeller.RunMode.EVAL:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
pred = logits.argmax(axis=1)
acc = propeller.metrics.Acc(labels, pred)
metrics = {'acc': acc}
predictions = [pred]
train_hooks = None
else:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
lr_step_hook, loss_scale_coef = optimization(
loss=loss,
warmup_steps=int(run_config.max_steps *
params['warmup_proportion']),
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=P.static.default_main_program(),
startup_prog=P.static.default_startup_program(),
use_fp16=args.use_amp,
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay", )
scheduled_lr = P.static.default_main_program().global_block().var(
'learning_rate_0')
propeller.summary.scalar('lr', scheduled_lr)
predictions = [logits, ]
train_hooks = [lr_step_hook]
return propeller.ModelSpec(
loss=loss,
mode=mode,
metrics=metrics,
predictions=predictions,
train_hooks=train_hooks)
if __name__ == '__main__':
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--do_predict', action='store_true')
parser.add_argument('--max_seqlen', type=int, default=128)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=str, required=True)
parser.add_argument('--warm_start_from', type=str)
parser.add_argument('--epoch', type=int, default=3)
parser.add_argument('--use_amp', action='store_true')
args = parser.parse_args()
P.enable_static()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' %
args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(
param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=32,
num_labels=3,
warmup_proportion=0.1,
learning_rate=5e-5,
weight_decay=0.01,
use_task_id=False,
use_fp16=args.use_amp)
hparams = default_hparams.join(propeller.HParams(
**hparams_config_file)).join(hparams_cli)
default_run_config = dict(
max_steps=args.epoch * 390000 // hparams.batch_size,
save_steps=1000,
log_steps=10,
max_ckpt=1,
skip_steps=0,
model_dir=tempfile.mkdtemp(),
eval_steps=100)
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
unk_id = tokenizer.vocab['[UNK]']
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
if not args.do_predict:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'title',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'comment',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label',
vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(
seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
#label = np.expand_dims(label, -1) #
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=True, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
test_ds.data_shapes = shapes
test_ds.data_types = types
varname_to_warmstart = re.compile(
r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$'
)
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(param_path, v.name)),
from_dir=param_path,
)
best_exporter = propeller.train.exporter.BestExporter(
os.path.join(run_config.model_dir, 'best'),
cmp_fn=lambda old, new: new['dev']['acc'] > old['dev']['acc'])
propeller.train.train_and_eval(
model_class_or_model_fn=model_fn,
params=hparams,
run_config=run_config,
train_dataset=train_ds,
eval_dataset={'dev': dev_ds,
'test': test_ds},
warm_start_setting=ws,
exporters=[best_exporter])
print('dev_acc3\t%.5f\ntest_acc3\t%.5f' %
(best_exporter._best['dev']['acc'],
best_exporter._best['test']['acc']))
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'title',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'comment',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
])
def map_fn(seg_a, seg_b):
seg_a, seg_b = tokenizer.truncate(
seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(hparams.batch_size)
predict_ds.data_shapes = shapes[:-1]
predict_ds.data_types = types[:-1]
est = propeller.Learner(model_fn, run_config, hparams)
for res, in est.predict(predict_ds, ckpt=-1):
print('%d\t%.5f\t%.5f\t%.5f' %
(np.argmax(res), res[0], res[1], res[2]))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import re
import time
import logging
import json
from pathlib import Path
from random import random
from tqdm import tqdm
from functools import reduce, partial
import pickle
import argparse
from functools import partial
from io import open
import numpy as np
import logging
import paddle as P
from propeller import log
import propeller.paddle as propeller
from ernie.modeling_ernie import ErnieModel, ErnieModelForQuestionAnswering
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.mrc import mrc_reader
from demo.mrc import mrc_metrics
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
def evaluate(model, ds, all_examples, all_features, tokenizer, args):
dev_file = json.loads(open(args.dev_file, encoding='utf8').read())
with P.no_grad():
log.debug('start eval')
model.eval()
all_res = []
for step, (uids, token_ids, token_type_ids, _, __) in enumerate(
P.io.DataLoader(
ds, places=P.CUDAPlace(env.dev_id), batch_size=None)):
_, start_logits, end_logits = model(token_ids, token_type_ids)
res = [
mrc_metrics.RawResult(
unique_id=u, start_logits=s, end_logits=e)
for u, s, e in zip(uids.numpy(),
start_logits.numpy(), end_logits.numpy())
]
all_res += res
open('all_res', 'wb').write(pickle.dumps(all_res))
all_pred, all_nbests = mrc_metrics.make_results(
tokenizer,
all_examples,
all_features,
all_res,
n_best_size=args.n_best_size,
max_answer_length=args.max_answer_length,
do_lower_case=tokenizer.lower)
f1, em, _, __ = mrc_metrics.evaluate(dev_file, all_pred)
model.train()
log.debug('done eval')
return f1, em
def train(model, train_dataset, dev_dataset, dev_examples, dev_features,
tokenizer, args):
model = P.DataParallel(model)
max_steps = len(train_features) * args.epoch // args.bsz
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(max_steps,
int(args.warmup_proportion * max_steps)))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
grad_clip=g_clip)
train_dataset = train_dataset \
.cache_shuffle_shard(env.nranks, env.dev_id, drop_last=True) \
.padded_batch(args.bsz)
log.debug('init training with args: %s' % repr(args))
scaler = P.amp.GradScaler(enable=args.use_amp)
create_if_not_exists(args.save_dir)
with P.amp.auto_cast(enable=args.use_amp):
for step, (_, token_ids, token_type_ids, start_pos,
end_pos) in enumerate(
P.io.DataLoader(
train_dataset,
places=P.CUDAPlace(env.dev_id),
batch_size=None)):
loss, _, __ = model(
token_ids,
token_type_ids,
start_pos=start_pos,
end_pos=end_pos)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if env.dev_id == 0 and step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
if env.dev_id == 0 and step % 100 == 0:
f1, em = evaluate(model, dev_dataset, dev_examples,
dev_features, tokenizer, args)
log.debug('[step %d] eval result: f1 %.5f em %.5f' %
(step, f1, em))
if env.dev_id == 0 and args.save_dir is not None:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
if step > max_steps:
break
if __name__ == "__main__":
parser = argparse.ArgumentParser('MRC model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=512,
help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=8, help='batchsize')
parser.add_argument('--epoch', type=int, default=2, help='epoch')
parser.add_argument(
'--train_file',
type=str,
required=True,
help='path to the training data file')
parser.add_argument(
'--dev_file',
type=str,
required=True,
help='path to the development data file')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=3e-5, help='learning rate')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--n_best_size', type=int, default=20, help='nbest prediction to keep')
parser.add_argument(
'--max_answer_length', type=int, default=100, help='max answer span')
parser.add_argument(
'--wd',
type=float,
default=0.01,
help='weight decay, aka L2 regularizer')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
env = P.distributed.ParallelEnv()
P.distributed.init_parallel_env()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
if not os.path.exists(args.train_file):
raise RuntimeError('input data not found at %s' % args.train_file)
if not os.path.exists(args.dev_file):
raise RuntimeError('input data not found at %s' % args.dev_file)
log.info('making train/dev data...')
train_examples = mrc_reader.read_files(args.train_file, is_training=True)
train_features = mrc_reader.convert_example_to_features(
train_examples, args.max_seqlen, tokenizer, is_training=True)
dev_examples = mrc_reader.read_files(args.dev_file, is_training=False)
dev_features = mrc_reader.convert_example_to_features(
dev_examples, args.max_seqlen, tokenizer, is_training=False)
log.info('train examples: %d, features: %d' %
(len(train_examples), len(train_features)))
def map_fn(unique_id, example_index, doc_span_index, tokens,
token_to_orig_map, token_is_max_context, token_ids,
position_ids, text_type_ids, start_position, end_position):
if start_position is None:
start_position = 0
if end_position is None:
end_position = 0
return np.array(unique_id), np.array(token_ids), np.array(
text_type_ids), np.array(start_position), np.array(end_position)
train_dataset = propeller.data.Dataset.from_list(train_features).map(
map_fn)
dev_dataset = propeller.data.Dataset.from_list(dev_features).map(
map_fn).padded_batch(args.bsz)
model = ErnieModelForQuestionAnswering.from_pretrained(
args.from_pretrained, name='')
train(model, train_dataset, dev_dataset, dev_examples, dev_features,
tokenizer, args)
if env.dev_id == 0:
f1, em = evaluate(model, dev_dataset, dev_examples, dev_features,
tokenizer, args)
log.debug('final eval result: f1 %.5f em %.5f' % (f1, em))
if env.dev_id == 0 and args.save_dir is not None:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import six
import json
from random import random
from tqdm import tqdm
from collections import OrderedDict
from functools import reduce, partial
from pathlib import Path
from visualdl import LogWriter
import numpy as np
import multiprocessing
import pickle
from sklearn.metrics import f1_score
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification, ErnieModelForTokenClassification
from ernie.tokenizing_ernie import ErnieTokenizer
#from ernie.optimization import AdamW, LinearDecay
parser = propeller.ArgumentParser('NER model with ERNIE')
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--bsz', type=int, default=32)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--epoch', type=int, default=6)
parser.add_argument(
'--warmup_proportion',
type=float,
default=0.1,
help='if use_lr_decay is set, '
'learning rate will rise to `lr` at `warmup_proportion` * `max_steps` and decay to 0 at `max_steps`'
)
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE, used in learning rate scheduler'
)
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
parser.add_argument('--from_pretrained', type=Path, required=True)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
def tokenizer_func(inputs):
ret = inputs.split(b'\2')
tokens, orig_pos = [], []
for i, r in enumerate(ret):
t = tokenizer.tokenize(r)
for tt in t:
tokens.append(tt)
orig_pos.append(i)
assert len(tokens) == len(orig_pos)
return tokens + orig_pos
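# Tokens and their original (pre-tokenization) positions are packed into one
# flat list because a TextColumn yields a single sequence per field; `before()`
# below recovers the two halves with `np.split(seg, 2)` so labels can be
# aligned to sub-word tokens.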
def tokenizer_func_for_label(inputs):
return inputs.split(b'\2')
feature_map = {
b"B-PER": 0,
b"I-PER": 1,
b"B-ORG": 2,
b"I-ORG": 3,
b"B-LOC": 4,
b"I-LOC": 5,
b"O": 6,
}
other_tag_id = feature_map[b'O']
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'text_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer_func), propeller.data.TextColumn(
'label',
unk_id=other_tag_id,
vocab_dict=feature_map,
tokenizer=tokenizer_func_for_label, )
])
def before(seg, label):
seg, orig_pos = np.split(seg, 2)
aligned_label = label[orig_pos]
seg, _ = tokenizer.truncate(seg, [], args.max_seqlen)
aligned_label, _ = tokenizer.truncate(aligned_label, [], args.max_seqlen)
orig_pos, _ = tokenizer.truncate(orig_pos, [], args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(
seg
) #utils.data.build_1_pair(seg, max_seqlen=args.max_seqlen, cls_id=cls_id, sep_id=sep_id)
aligned_label = np.concatenate([[0], aligned_label, [0]], 0)
orig_pos = np.concatenate([[0], orig_pos, [0]])
assert len(aligned_label) == len(sentence) == len(orig_pos), (
len(aligned_label), len(sentence), len(orig_pos))  # aligned
return sentence, segments, aligned_label, label, orig_pos
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0, 0, -100, other_tag_id + 1, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0, 0, -100, other_tag_id + 1, 0))
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0, 0, -100, other_tag_id + 1, 0))
def evaluate(model, dataset):
model.eval()
with P.no_grad():
chunkf1 = propeller.metrics.ChunkF1(None, None, None, len(feature_map))
for step, (ids, sids, aligned_label, label, orig_pos
) in enumerate(P.io.DataLoader(
dataset, batch_size=None)):
loss, logits = model(ids, sids)
#print('\n'.join(map(str, logits.numpy().tolist())))
assert orig_pos.shape[0] == logits.shape[0] == ids.shape[
0] == label.shape[0]
for pos, lo, la, id in zip(orig_pos.numpy(),
logits.numpy(),
label.numpy(), ids.numpy()):
_dic = OrderedDict()
assert len(pos) == len(lo) == len(id)
for _pos, _lo, _id in zip(pos, lo, id):
if _id > tokenizer.mask_id: # [MASK] is the largest special token
_dic.setdefault(_pos, []).append(_lo)
merged_lo = np.array(
[np.array(l).mean(0) for _, l in six.iteritems(_dic)])
merged_preds = np.argmax(merged_lo, -1)
la = la[np.where(la != (other_tag_id + 1))] #remove pad
if len(la) > len(merged_preds):
log.warning(
'accuracy loss due to truncation: label len:%d, truncate to %d'
% (len(la), len(merged_preds)))
merged_preds = np.pad(merged_preds,
[0, len(la) - len(merged_preds)],
mode='constant',
constant_values=7)
else:
assert len(la) == len(
merged_preds
), 'expect label == prediction, got %d vs %d' % (
la.shape, merged_preds.shape)
chunkf1.update((merged_preds, la, np.array(len(la))))
#f1 = f1_score(np.concatenate(all_label), np.concatenate(all_pred), average='macro')
f1 = chunkf1.eval()
model.train()
return f1
model = ErnieModelForTokenClassification.from_pretrained(
args.from_pretrained,
num_labels=len(feature_map),
name='',
has_pooler=False)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclude_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps,
int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n),
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
with LogWriter(
logdir=str(create_if_not_exists(args.save_dir / 'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for step, (
ids, sids, aligned_label, label, orig_pos
) in enumerate(P.io.DataLoader(
train_ds, batch_size=None)):
loss, logits = model(ids, sids, labels=aligned_label)
#loss, logits = model(ids, sids, labels=aligned_label, loss_weights=P.cast(ids != 0, 'float32'))
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l,
_lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
f1 = evaluate(model, dev_ds)
log.debug('eval f1: %.5f' % f1)
log_writer.add_scalar('eval/f1', f1, step=step)
if args.save_dir is not None:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
f1 = evaluate(model, dev_ds)
log.debug('final eval f1: %.5f' % f1)
log_writer.add_scalar('eval/f1', f1, step=step)
if args.save_dir is not None:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
from pathlib import Path
from visualdl import LogWriter
import numpy as np
import argparse
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='data directory includes train / develop data')
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--eval', action='store_true')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--init_checkpoint',
type=str,
default=None,
help='checkpoint to warm start from')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if not args.eval:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label'),
])
def map_fn(seg_a, label):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(
args.max_steps, int(args.warmup_proportion * args.max_steps)))
param_name_to_exclude_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n),
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
with LogWriter(logdir=str(create_if_not_exists(args.save_dir /
'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for step, d in enumerate(
P.io.DataLoader(
train_ds, places=P.CUDAPlace(0), batch_size=None)):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (
step, _l, _lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
acc = []
with P.no_grad():
model.eval()
for d in P.io.DataLoader(
dev_ds,
places=P.CUDAPlace(0),
batch_size=None):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(),
str(args.save_dir / 'ckpt.bin'))
if args.save_dir is not None:
P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
])
sd = P.load(str(args.init_checkpoint))
model.set_dict(sd)
model.eval()
def map_fn(seg_a):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(args.bsz)
for step, (ids, sids) in enumerate(
P.io.DataLoader(
predict_ds, places=P.CUDAPlace(0), batch_size=None)):
_, logits = model(ids, sids)
pred = logits.numpy().argmax(-1)
print('\n'.join(map(str, pred.tolist())))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import sys
import argparse
import logging
from functools import partial
from io import open
open = partial(open, encoding='utf-8')
import json
from collections import namedtuple
log = logging.getLogger(__name__)
Example = namedtuple('Example', [
'qas_id', 'question_text', 'doc_tokens', 'orig_answer_text',
'start_position', 'end_position'
])
Feature = namedtuple("Feature", [
"unique_id", "example_index", "doc_span_index", "tokens",
"token_to_orig_map", "token_is_max_context", "token_ids", "position_ids",
"text_type_ids", "start_position", "end_position"
])
def _tokenize_chinese_chars(text):
"""Adds whitespace around any CJK character."""
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
output = []
buff = ""
for char in text:
cp = ord(char)
if _is_chinese_char(cp):
if buff != "":
output.append(buff)
buff = ""
output.append(char)
else:
buff += char
if buff != "":
output.append(buff)
return output
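# Example: _tokenize_chinese_chars('ERNIE是模型') -> ['ERNIE', '是', '模', '型'];
# runs of non-CJK characters stay grouped while each CJK character becomes its
# own piece.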
def _check_is_max_context(doc_spans, cur_span_index, position):
"""chech is max context"""
best_score = None
best_span_index = None
for (span_index, doc_span) in enumerate(doc_spans):
end = doc_span.start + doc_span.length - 1
if position < doc_span.start:
continue
if position > end:
continue
num_left_context = position - doc_span.start
num_right_context = end - position
score = min(num_left_context,
num_right_context) + 0.01 * doc_span.length
if best_score is None or score > best_score:
best_score = score
best_span_index = span_index
return cur_span_index == best_span_index
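# A token that falls into several overlapping doc spans is scored per span by
# min(left context, right context) + 0.01 * span length, and only the
# best-scoring span "owns" it. E.g. with spans [0, 128) and [64, 192), token
# 100 has left/right context 100/27 in the first span but 36/91 in the second,
# so the second span wins.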
def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
orig_answer_text):
"""improve answer span"""
tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
for new_start in range(input_start, input_end + 1):
for new_end in range(input_end, new_start - 1, -1):
text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
if text_span == tok_answer_text:
return (new_start, new_end)
return (input_start, input_end)
def read_files(input_file, is_training):
"""read file"""
examples = []
with open(input_file, "r") as f:
input_data = json.load(f)["data"]
for entry in input_data:
for paragraph in entry["paragraphs"]:
paragraph_text = paragraph["context"]
for qa in paragraph["qas"]:
qas_id = qa["id"]
question_text = qa["question"]
start_pos = None
end_pos = None
orig_answer_text = None
if is_training:
if len(qa["answers"]) != 1:
raise ValueError(
"For training, each question should have exactly 1 answer."
)
answer = qa["answers"][0]
orig_answer_text = answer["text"]
answer_offset = answer["answer_start"]
answer_length = len(orig_answer_text)
doc_tokens = [
paragraph_text[:answer_offset], paragraph_text[
answer_offset:answer_offset + answer_length],
paragraph_text[answer_offset + answer_length:]
]
start_pos = 1
end_pos = 1
actual_text = " ".join(doc_tokens[start_pos:(end_pos +
1)])
if actual_text.find(orig_answer_text) == -1:
log.info("Could not find answer: '%s' vs. '%s'",
actual_text, orig_answer_text)
continue
else:
doc_tokens = _tokenize_chinese_chars(paragraph_text)
example = Example(
qas_id=qas_id,
question_text=question_text,
doc_tokens=doc_tokens,
orig_answer_text=orig_answer_text,
start_position=start_pos,
end_position=end_pos)
examples.append(example)
return examples
def convert_example_to_features(examples,
max_seq_length,
tokenizer,
is_training,
doc_stride=128,
max_query_length=64):
"""convert example to feature"""
features = []
unique_id = 1000000000
for (example_index, example) in enumerate(examples):
query_tokens = tokenizer.tokenize(example.question_text)
if len(query_tokens) > max_query_length:
query_tokens = query_tokens[0:max_query_length]
tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens):
orig_to_tok_index.append(len(all_doc_tokens))
sub_tokens = tokenizer.tokenize(token)
for sub_token in sub_tokens:
tok_to_orig_index.append(i)
all_doc_tokens.append(sub_token)
#log.info(orig_to_tok_index, example.start_position)
tok_start_position = None
tok_end_position = None
if is_training:
tok_start_position = orig_to_tok_index[example.start_position]
if example.end_position < len(example.doc_tokens) - 1:
tok_end_position = orig_to_tok_index[example.end_position +
1] - 1
else:
tok_end_position = len(all_doc_tokens) - 1
(tok_start_position, tok_end_position) = _improve_answer_span(
all_doc_tokens, tok_start_position, tok_end_position,
tokenizer, example.orig_answer_text)
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
_DocSpan = namedtuple("DocSpan", ["start", "length"])
doc_spans = []
start_offset = 0
while start_offset < len(all_doc_tokens):
length = len(all_doc_tokens) - start_offset
if length > max_tokens_for_doc:
length = max_tokens_for_doc
doc_spans.append(_DocSpan(start=start_offset, length=length))
if start_offset + length == len(all_doc_tokens):
break
start_offset += min(length, doc_stride)
for (doc_span_index, doc_span) in enumerate(doc_spans):
tokens = []
token_to_orig_map = {}
token_is_max_context = {}
text_type_ids = []
tokens.append("[CLS]")
text_type_ids.append(0)
for token in query_tokens:
tokens.append(token)
text_type_ids.append(0)
tokens.append("[SEP]")
text_type_ids.append(0)
for i in range(doc_span.length):
split_token_index = doc_span.start + i
token_to_orig_map[len(tokens)] = tok_to_orig_index[
split_token_index]
is_max_context = _check_is_max_context(
doc_spans, doc_span_index, split_token_index)
token_is_max_context[len(tokens)] = is_max_context
tokens.append(all_doc_tokens[split_token_index])
text_type_ids.append(1)
tokens.append("[SEP]")
text_type_ids.append(1)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
position_ids = list(range(len(token_ids)))
start_position = None
end_position = None
if is_training:
doc_start = doc_span.start
doc_end = doc_span.start + doc_span.length - 1
out_of_span = False
if not (tok_start_position >= doc_start and
tok_end_position <= doc_end):
out_of_span = True
if out_of_span:
start_position = 0
end_position = 0
else:
doc_offset = len(query_tokens) + 2
start_position = tok_start_position - doc_start + doc_offset
end_position = tok_end_position - doc_start + doc_offset
feature = Feature(
unique_id=unique_id,
example_index=example_index,
doc_span_index=doc_span_index,
tokens=tokens,
token_to_orig_map=token_to_orig_map,
token_is_max_context=token_is_max_context,
token_ids=token_ids,
position_ids=position_ids,
text_type_ids=text_type_ids,
start_position=start_position,
end_position=end_position)
features.append(feature)
unique_id += 1
return features
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='main')
parser.add_argument("--input", type=str, default=None)
args = parser.parse_args()
from ernie.tokenizing_ernie import ErnieTokenizer
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
examples = read_files(args.input, True)
features = convert_example_to_features(examples, 512, tokenizer, True)
log.debug(len(examples))
log.debug(len(features))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import logging
import re
import numpy as np
import paddle as P
import paddle.distributed.fleet as fleet
from propeller.paddle.train.hooks import RunHook
log = logging.getLogger(__name__)
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
def optimization(
loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
use_fp16=False, ):
"""do backword for static"""
def exclude_from_weight_decay(param):
name = re.sub(r'\.master$', '', param)  # strip the '.master' suffix (str.rstrip strips characters, not a suffix)
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
g_clip = P.nn.ClipGradByGlobalNorm(1.0)
lr_scheduler = P.optimizer.lr.LambdaDecay(
learning_rate,
get_warmup_and_linear_decay(num_train_steps, warmup_steps))
optimizer = P.optimizer.AdamW(
learning_rate=lr_scheduler,
weight_decay=weight_decay,
grad_clip=g_clip,
apply_decay_param_fun=lambda n: not exclude_from_weight_decay(n))
if use_fp16:
log.info('AMP activated')
if weight_decay > 0.:
raise ValueError(
'paddle amp will ignore `weight_decay`, see https://github.com/PaddlePaddle/Paddle/issues/29794'
)
#amp_list = P.fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
# custom_white_list=['softmax', 'layer_norm', 'gelu'])
optimizer = P.fluid.contrib.mixed_precision.decorate(
optimizer, init_loss_scaling=2**15, use_dynamic_loss_scaling=True)
_, param_grads = optimizer.minimize(loss)
loss_scaling = P.static.default_main_program().global_block().var(
'loss_scaling_0')
else:
_, param_grads = optimizer.minimize(loss)
loss_scaling = None
class LRStepHook(RunHook):
def after_run(self, _, __):
lr_scheduler.step()
log.debug('lr step: %.5f' % lr_scheduler.get_lr())
return LRStepHook(), loss_scaling
# Distributed Pretrain
Only the **mask word** strategy from [Ernie1.0](https://arxiv.org/pdf/1904.09223.pdf) is illustrated in this section.
1. Make pretrain data
We use documents from multiple data sources (e.g. Wikipedia) for pretraining.
Input text should be segmented with spaces (even in Chinese; this segmentation is used for *mask word*).
Each line corresponds to a *sentence*.
An empty line indicates the end of a document.
Example:
> 数学 是 利用 符号语言 研究 数量 、 结构 、 变化 以及 空间 等 概念 的 一门 学科 , 从 某种 角度看 属于 形式 科学 的 一种 。
> 数学 透过 抽象化 和 逻辑推理 的 使用 , 由 计数 、 计算 、 量度 和 对 物体 形状 及 运动 的 观察 而 产生 。
> 数学家 们 拓展 这些 概念 , 为了 公式化 新 的 猜想 以及 从 选定 的 公理 及 定义 中 建立 起 严谨 推导 出 的 定理 。
> 基础 数学 的 知识 与 运用 总是 个人 与 团体 生活 中 不可或缺 的 一环 。
> 对 数学 基本概念 的 完善 , 早 在 古埃及 、 美索不达米亚 及 古印度 内 的 古代 数学 文本 便 可观 见 , 而 在 古希腊 那里 有 更为 严谨 的 处理 。
> 从 那时 开始 , 数学 的 发展 便 持续 不断 地 小幅 进展 , 至 16 世纪 的 文艺复兴 时期 , 因为 新 的 科学 发现 和 数学 革新 两者 的 交互 , 致使 数学 的 加速 发展 , 直至 今日 。
>
> 云外镜 ( ) 是 一种 能 反映 遥远 地方 影像 的 镜子 , 就 好比 现在 的 电视 , 或是 吉卜赛人 占卜 用 的 水晶球 一样 。
> 它 属于 付丧神 的 一种 , 是 镜子 历经 百年 后 幻化 而成 的 妖怪 , 又名 镜 妖 。
> 也 有人 说云 外镜 是 狸 妖 幻化 而成 的 , 当狸 妖 的 肚子 胀大 , 像 电视 的 映像管 一样 发光 时 , 就 可以 自由 地 显现出 远方 的 情景 。
> 著名 的 妖怪 绘师 鸟 山石 燕曾 记载 云外镜 经常 容易 跟 照妖镜 搞混 , 因为 照妖镜 可以 映照 出 肉眼 看不见 的 妖怪 , 这点 与 云外 镜会 映照 出 怪异 的 脸孔 是 有些 相似 。
> 据说 在 阴历 八月 十五日 的 夜晚 , 在 水晶 盆内 注满 水 , 将 镜子 平 放在 水面 , 若 是 映照 出 妖怪 的 模样 , 就 表示 这 面 镜子 里 住 著 妖怪 。
Make the pretrain data with:
```script
python3 ./demo/pretrain/make_pretrain_data.py input_file output_file.gz --vocab /path/to/ernie1.0/vocab.txt
```
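Each record in the resulting `output_file.gz` is a length-prefixed, serialized protobuf `SequenceExample` (see `write_gz` in `make_pretrain_data.py`). As a minimal sketch, assuming the `struct.pack('i%ds')` layout used there, the records can be read back like this (`read_records` is our name, not part of the repo):
```python
import gzip
import struct

def read_records(path):
    """Yield serialized SequenceExample byte strings from a length-prefixed .gz file."""
    with gzip.open(path, 'rb') as f:
        while True:
            head = f.read(4)  # 4-byte record length, struct format 'i'
            if len(head) < 4:
                break
            length, = struct.unpack('i', head)
            yield f.read(length)  # raw protobuf payload
```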
2. run distributed pretrain
```script
python3 -m paddle.distributed.launch \
./demo/pretrain/pretrain_dygraph.py \
--data_dir "data/*.gz" \
--from_pretrained /path/to/ernie1.0_pretrain_dir/
```
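Note: in NCCL distributed mode the pretrain script shards the input `.gz` files across workers, so supply at least as many `.gz` files as there are GPUs; with fewer files it aborts with a "not enough train file to shard" error.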
import sys
import argparse
import struct
import random as r
import re
import gzip
import logging
from itertools import accumulate
from functools import reduce, partial, wraps
from propeller import log
from propeller.paddle.data import feature_pb2, example_pb2
#from data_util import RawtextColumn
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
def gen_segs(segment_piece):
if len(segment_piece) == 0:
return []
else:
return [min(segment_piece)] * len(segment_piece)
white_space_pat = re.compile(r'\S+')  # matches runs of non-whitespace, i.e. space-separated tokens
def segment(inputs, inputs_segment):
    ret = [r.span() for r in white_space_pat.finditer(inputs)]
    ret = [(inputs[s:e], gen_segs(inputs_segment[s:e])) for (s, e) in ret]
    return ret
def tokenize(sen, seg_info):
"""
char tokenizer (wordpiece english)
normed txt(space seperated or not) => list of word-piece
"""
sen = sen.lower()
res_word, res_segments = [], []
for match in pat.finditer(sen):
words, pos = _wordpiece(
match.group(0), vocab=vocab_set, unk_token='[UNK]')
start_of_word = match.span()[0]
for w, p in zip(words, pos):
res_word.append(w)
res_segments.append(
gen_segs(seg_info[p[0] + start_of_word:p[1] + start_of_word]))
return res_word, res_segments
def parse_txt(line):
if len(line) == 0:
return []
line = line.decode('utf8')
ret_line, ret_seginfo = [], []
for l, i in segment(line, list(range(len(line)))):
for ll, ii in zip(*tokenize(l, i)):
ret_line.append(ll)
ret_seginfo.append(ii)
if args.check and r.random() < 0.005:
print('****', file=sys.stderr)
print(line, file=sys.stderr)
print('|'.join(ret_line), file=sys.stderr)
print(ret_seginfo, file=sys.stderr)
print('****', file=sys.stderr)
ret_line = [vocab.get(r, vocab['[UNK]']) for r in ret_line]
ret_seginfo = [[-1] if i == [] else i
for i in ret_seginfo] #for sentence piece only
ret_seginfo = [min(i) for i in ret_seginfo]
return ret_line, ret_seginfo
def build_example(slots):
txt, seginfo = slots
txt_fe_list = feature_pb2.FeatureList(feature=[
feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=t))
for t in txt
])
segsinfo_fe_list = feature_pb2.FeatureList(feature=[
feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=s))
for s in seginfo
])
assert len(txt_fe_list.feature) == len(
segsinfo_fe_list.feature), 'txt[%d] and seginfo[%d] size not match' % (
len(txt_fe_list.feature), len(segsinfo_fe_list.feature))
features = {
'txt': txt_fe_list,
'segs': segsinfo_fe_list,
}
ex = example_pb2.SequenceExample(feature_lists=feature_pb2.FeatureLists(
feature_list=features))
return ex
def write_gz(serialized, to_file):
l = len(serialized)
packed_data = struct.pack('i%ds' % l, l, serialized)
to_file.write(packed_data)
def build_bb(from_file, to_file):
slots = []
for i, line in enumerate(from_file):
line = line.strip()
if args.verbose and i % 10000 == 0:
log.debug(i)
if len(line) == 0:
if len(slots) != 0:
transposed_slots = list(zip(*slots))
ex = build_example(transposed_slots)
write_gz(ex.SerializeToString(), to_file)
slots = []
continue
parsed_line = parse_txt(line)
slots.append(parsed_line)
if len(slots) != 0:
transposed_slots = list(zip(*slots))
ex = build_example(transposed_slots)
write_gz(ex.SerializeToString(), to_file)
slots = []
if __name__ == '__main__':
parser = argparse.ArgumentParser('Pretrain Data Maker')
parser.add_argument('src', type=str)
parser.add_argument('tgt', type=str)
parser.add_argument('--vocab', type=str, required=True)
parser.add_argument('-v', '--verbose', action='store_true')
parser.add_argument('-c', '--check', action='store_true')
args = parser.parse_args()
log.setLevel(logging.DEBUG)
from ernie.tokenizing_ernie import _wordpiece
pat = re.compile(r'([a-zA-Z0-9]+|\S)')
vocab = {
j.strip().split(b'\t')[0].decode('utf8'): i
for i, j in enumerate(open(args.vocab, 'rb'))
}
vocab_set = set(vocab.keys())
    with open(args.src, 'rb') as from_file, gzip.open(args.tgt, 'wb') as to_file:
        log.info('making gz from bb %s ==> %s' % (args.src, args.tgt))
        build_bb(from_file, to_file)
    log.info('done: %s' % args.tgt)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
from __future__ import absolute_import
from __future__ import unicode_literals
import sys
import io
import os
import time
import numpy as np
import re
import logging
import six
from glob import glob
from pathlib import Path
from functools import reduce, partial
import itertools
import paddle as P
import sentencepiece as spm
import json
from tqdm import tqdm
import random as r
from ernie.modeling_ernie import ErnieModelForPretraining
from ernie.tokenizing_ernie import ErnieTokenizer
#from ernie.optimization import AdamW, LinearDecay
import propeller as propeller_base
import propeller.paddle as propeller
from propeller.paddle.data import Dataset
from propeller import log
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
if six.PY3:
from itertools import accumulate
else:
import operator
def accumulate(iterable, func=operator.add, initial=None):
'Return running totals'
# accumulate([1,2,3,4,5]) --> 1 3 6 10 15
# accumulate([1,2,3,4,5], initial=100) --> 100 101 103 106 110 115
# accumulate([1,2,3,4,5], operator.mul) --> 1 2 6 24 120
it = iter(iterable)
total = initial
if initial is None:
try:
total = next(it)
except StopIteration:
return
yield total
for element in it:
total = func(total, element)
yield total
def truncate_sentence(seq, from_length, to_length):
random_begin = np.random.randint(
0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin:random_begin + to_length]
def build_pair(seg_a, seg_b, max_seqlen, vocab):
#log.debug('pair %s \n %s' % (seg_a, seg_b))
cls_id = vocab['[CLS]']
sep_id = vocab['[SEP]']
a_len = len(seg_a)
b_len = len(seg_b)
ml = max_seqlen - 3
half_ml = ml // 2
if a_len > b_len:
a_len_truncated, b_len_truncated = np.maximum(
half_ml, ml - b_len), np.minimum(half_ml, b_len)
else:
a_len_truncated, b_len_truncated = np.minimum(
half_ml, a_len), np.maximum(half_ml, ml - a_len)
seg_a = truncate_sentence(seg_a, a_len, a_len_truncated)
seg_b = truncate_sentence(seg_b, b_len, b_len_truncated)
seg_a_txt, seg_a_info = seg_a[:, 0], seg_a[:, 1]
seg_b_txt, seg_b_info = seg_b[:, 0], seg_b[:, 1]
    token_type_a = np.zeros_like(seg_a_txt, dtype=np.int64)
    token_type_b = np.ones_like(seg_b_txt, dtype=np.int64)
sen_emb = np.concatenate(
[[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
info_emb = np.concatenate([[-1], seg_a_info, [-1], seg_b_info, [-1]], 0)
token_type_emb = np.concatenate(
[[0], token_type_a, [0], token_type_b, [1]], 0)
return sen_emb, info_emb, token_type_emb
def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
pad_id = vocab['[PAD]']
mask_id = vocab['[MASK]']
shape = sentence.shape
batch_size, seqlen = shape
invalid_pos = np.where(seg_info == -1)
    seg_info += 1  # shift by one so there are no more -1s
seg_info_flatten = seg_info.reshape([-1])
seg_info_incr = seg_info_flatten - np.roll(seg_info_flatten, shift=1)
seg_info = np.add.accumulate(
np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info[invalid_pos] = -1
u_seginfo = np.array([i for i in np.unique(seg_info) if i != -1])
np.random.shuffle(u_seginfo)
sample_num = max(1, int(len(u_seginfo) * mask_rate))
u_seginfo = u_seginfo[:sample_num]
mask = reduce(np.logical_or, [seg_info == i for i in u_seginfo])
mask[:, 0] = False # ignore CLS head
rand = np.random.rand(*shape)
    choose_original = rand < 0.1  # 10%: keep the original token
    choose_random_id = (0.1 < rand) & (rand < 0.2)  # 10%: replace with a random token
    choose_mask_id = 0.2 < rand  # 80%: replace with [MASK]
random_id = np.random.randint(1, vocab_size, size=shape)
replace_id = mask_id * choose_mask_id + \
random_id * choose_random_id + \
sentence * choose_original
mask_pos = np.where(mask)
#mask_pos_flatten = list(map(lambda idx: idx[0] * seqlen + idx[1], zip(*mask_pos))) #transpose
mask_label = sentence[mask_pos]
sentence[mask_pos] = replace_id[mask_pos] #overwrite
#log.debug(mask_pos_flatten)
return sentence, np.stack(mask_pos, -1), mask_label
def make_pretrain_dataset(name, dir, vocab, args):
gz_files = glob(dir)
if not gz_files:
        raise ValueError('train data not found in %s' % dir)
log.info('read from %s' % '\n'.join(gz_files))
max_input_seqlen = args.max_seqlen
max_pretrain_seqlen = lambda: max_input_seqlen if r.random() > 0.15 else r.randint(1, max_input_seqlen) # short sentence rate
def _parse_gz(record_str): # function that takes python_str as input
ex = propeller_base.data.example_pb2.SequenceExample()
ex.ParseFromString(record_str)
doc = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['txt'].feature
]
doc_seg = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['segs'].feature
]
return doc, doc_seg
def bb_to_segments(filename):
ds = Dataset.from_record_file(filename).map(_parse_gz)
def gen():
buf, size = [], 0
iterator = iter(ds)
while 1:
doc, doc_seg = next(iterator)
for line, line_seg in zip(doc, doc_seg):
#line = np.array(sp_model.SampleEncodeAsIds(line, -1, 0.1), dtype=np.int64) # 0.1 means large variance on sentence piece result
if len(line) == 0:
continue
                    line = np.array(line)
line_seg = np.array(line_seg)
size += len(line)
buf.append(np.stack([line, line_seg]).transpose())
if size > max_input_seqlen:
yield buf,
buf, size = [], 0
if len(buf) != 0:
yield buf,
buf, size = [], 0
return Dataset.from_generator_func(gen)
def sample_negative(dataset):
def gen():
iterator = iter(dataset)
while True:
chunk_a, = next(iterator)
#chunk_b, = next(iterator)
seqlen = max_pretrain_seqlen()
seqlen_a = r.randint(1, seqlen)
seqlen_b = seqlen - seqlen_a
len_a = list(accumulate([len(c) for c in chunk_a]))
buf_a = [c for c, l in zip(chunk_a, len_a)
if l < seqlen_a] #always take the first one
buf_b = [
c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen
]
if r.random() < 0.5: #pos or neg
label = np.int64(1)
else:
label = np.int64(0)
buf_a, buf_b = buf_b, buf_a
if not (len(buf_a) and len(buf_b)):
continue
a = np.concatenate(buf_a)
b = np.concatenate(buf_b)
#log.debug(a)
#log.debug(b)
sample, seg_info, token_type = build_pair(
a, b, args.max_seqlen,
vocab) #negative sample might exceed max seqlen
yield sample, seg_info, token_type, label
ds = propeller.data.Dataset.from_generator_func(gen)
return ds
def after(sentence, seg_info, segments, label):
batch_size, seqlen = sentence.shape
sentence, mask_pos, mlm_label = apply_mask(sentence, seg_info,
args.mask_rate,
len(vocab), vocab)
ra = r.random()
if ra < args.check:
print('***')
print('\n'.join([
str(j) + '\t' + '|'.join(map(str, i))
for i, j in zip(sentence.tolist(), label)
]))
print('***')
print('\n'.join(
['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('***')
print('|'.join(map(str, mlm_label.tolist())))
print('***')
return sentence, segments, mlm_label, mask_pos, label
# pretrain pipeline
dataset = Dataset.from_list(gz_files)
if propeller.train.distribution.status.mode == propeller.train.distribution.DistributionMode.NCCL:
log.info('Apply sharding in distribution env')
if len(gz_files) < propeller.train.distribution.status.num_replica:
raise ValueError(
'not enough train file to shard: # of train files: %d, # of workers %d'
% (len(gz_files),
propeller.train.distribution.status.num_replica))
dataset = dataset.shard(env.nranks, env.dev_id)
dataset = dataset.repeat().shuffle(buffer_size=len(gz_files))
dataset = dataset.interleave(
map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(
buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = sample_negative(dataset)
dataset = dataset.padded_batch(args.bsz, (0, 0, 0, 0)).map(after)
dataset.name = name
return dataset
if __name__ == '__main__':
if six.PY3:
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument(
'--max_seqlen',
type=int,
default=256,
        help='max sequence length; sentences from pretrain data are packed up to this length'
)
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='protobuf pretrain data directory')
parser.add_argument(
'--mask_rate',
type=float,
default=0.15,
        help='probability of an input token to be masked')
parser.add_argument(
'--check', type=float, default=0., help='probability of debug info')
parser.add_argument(
'--warmup_steps', type=int, default=10000, help='warmups steps')
parser.add_argument(
        '--max_steps', type=int, default=1000000, help='max pretrain steps')
parser.add_argument('--lr', type=float, default=1e-4, help='learning_rate')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
        help='pretrained model dir')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output_dir')
parser.add_argument(
'--wd',
type=float,
default=0.01,
help='weight decay, aka L2 regularizer')
parser.add_argument('--bsz', type=int, default=50)
parser.add_argument(
'--use_amp',
action='store_true',
        help='only activate AMP (auto mixed precision acceleration) on TensorCore-compatible devices'
)
args = parser.parse_args()
P.distributed.init_parallel_env()
env = P.distributed.ParallelEnv()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
train_ds = make_pretrain_dataset(
'train', args.data_dir, vocab=tokenizer.vocab, args=args)
model = ErnieModelForPretraining.from_pretrained(args.from_pretrained)
    param_name_to_exclude_from_weight_decay = re.compile(
        r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps, args.warmup_steps))
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
opt = P.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
        apply_decay_param_fun=lambda n: param_name_to_exclude_from_weight_decay.match(n),
weight_decay=args.wd,
grad_clip=g_clip)
model = P.DataParallel(model)
scaler = P.amp.GradScaler(enable=args.use_amp)
create_if_not_exists(args.save_dir)
with P.amp.auto_cast(args.use_amp):
for step, samples in enumerate(
P.io.DataLoader(
train_ds, places=P.CUDAPlace(env.dev_id), batch_size=0)):
(src_ids, sent_ids, mlm_label, mask_pos, nsp_label) = samples
loss, mlmloss, nsploss = model(
src_ids,
sent_ids,
labels=mlm_label,
mlm_pos=mask_pos,
nsp_labels=nsp_label)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
if step % 1000 == 0 and env.dev_id == 0:
                log.debug('saving...')
                P.save(model.state_dict(), str(args.save_dir / 'ckpt.bin'))
if step > args.max_steps:
break
log.info('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
from __future__ import absolute_import
from __future__ import unicode_literals
import sys
import io
import os
import time
import numpy as np
import re
import logging
import six
from glob import glob
from pathlib import Path
from functools import reduce, partial
import itertools
import paddle as P
import json
from tqdm import tqdm
import random as r
from ernie.modeling_ernie import ErnieModelForPretraining
from ernie.tokenizing_ernie import ErnieTokenizer
from demo.optimization import optimization
import propeller.paddle as propeller
import propeller as propeller_base
from propeller.paddle.data import Dataset
from propeller import log
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
if six.PY3:
from itertools import accumulate
else:
import operator
def accumulate(iterable, func=operator.add, initial=None):
'Return running totals'
# accumulate([1,2,3,4,5]) --> 1 3 6 10 15
# accumulate([1,2,3,4,5], initial=100) --> 100 101 103 106 110 115
# accumulate([1,2,3,4,5], operator.mul) --> 1 2 6 24 120
it = iter(iterable)
total = initial
if initial is None:
try:
total = next(it)
except StopIteration:
return
yield total
for element in it:
total = func(total, element)
yield total
def ernie_pretrain_model_fn(features, mode, params, run_config):
"""propeller Model wraper for paddle-ERNIE """
src_ids, sent_ids, mlm_label, mask_pos, nsp_label = features
ernie = ErnieModelForPretraining(params, name='')
total_loss, mlm_loss, nsp_loss = ernie(
src_ids,
sent_ids,
labels=mlm_label,
mlm_pos=mask_pos,
nsp_labels=nsp_label)
metrics = None
inf_spec = None
propeller.summary.scalar('loss', total_loss)
propeller.summary.scalar('nsp-loss', nsp_loss)
propeller.summary.scalar('mlm-loss', mlm_loss)
lr_step_hook, loss_scale_coef = optimization(
loss=total_loss,
warmup_steps=params['warmup_steps'],
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=P.static.default_main_program(),
startup_prog=P.static.default_startup_program(),
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay",
use_fp16=args.use_amp, )
scheduled_lr = P.static.default_main_program().global_block().var(
'learning_rate_0')
propeller.summary.scalar('lr', scheduled_lr)
if args.use_amp:
propeller.summary.scalar('loss_scaling', loss_scale_coef)
pred = [total_loss]
return propeller.ModelSpec(
loss=total_loss,
mode=mode,
metrics=metrics,
predictions=pred,
train_hooks=[lr_step_hook])
def truncate_sentence(seq, from_length, to_length):
random_begin = np.random.randint(
0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin:random_begin + to_length]
def build_pair(seg_a, seg_b, max_seqlen, vocab):
#log.debug('pair %s \n %s' % (seg_a, seg_b))
cls_id = vocab['[CLS]']
sep_id = vocab['[SEP]']
a_len = len(seg_a)
b_len = len(seg_b)
ml = max_seqlen - 3
half_ml = ml // 2
if a_len > b_len:
a_len_truncated, b_len_truncated = np.maximum(
half_ml, ml - b_len), np.minimum(half_ml, b_len)
else:
a_len_truncated, b_len_truncated = np.minimum(
half_ml, a_len), np.maximum(half_ml, ml - a_len)
seg_a = truncate_sentence(seg_a, a_len, a_len_truncated)
seg_b = truncate_sentence(seg_b, b_len, b_len_truncated)
seg_a_txt, seg_a_info = seg_a[:, 0], seg_a[:, 1]
seg_b_txt, seg_b_info = seg_b[:, 0], seg_b[:, 1]
    token_type_a = np.zeros_like(seg_a_txt, dtype=np.int64)
    token_type_b = np.ones_like(seg_b_txt, dtype=np.int64)
sen_emb = np.concatenate(
[[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
info_emb = np.concatenate([[-1], seg_a_info, [-1], seg_b_info, [-1]], 0)
token_type_emb = np.concatenate(
[[0], token_type_a, [0], token_type_b, [1]], 0)
return sen_emb, info_emb, token_type_emb
def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
pad_id = vocab['[PAD]']
mask_id = vocab['[MASK]']
shape = sentence.shape
batch_size, seqlen = shape
invalid_pos = np.where(seg_info == -1)
    seg_info += 1  # shift by one so there are no more -1s
seg_info_flatten = seg_info.reshape([-1])
seg_info_incr = seg_info_flatten - np.roll(seg_info_flatten, shift=1)
seg_info = np.add.accumulate(
np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info[invalid_pos] = -1
u_seginfo = np.array([i for i in np.unique(seg_info) if i != -1])
np.random.shuffle(u_seginfo)
sample_num = max(1, int(len(u_seginfo) * mask_rate))
u_seginfo = u_seginfo[:sample_num]
mask = reduce(np.logical_or, [seg_info == i for i in u_seginfo])
mask[:, 0] = False # ignore CLS head
rand = np.random.rand(*shape)
    choose_original = rand < 0.1  # 10%: keep the original token
    choose_random_id = (0.1 < rand) & (rand < 0.2)  # 10%: replace with a random token
    choose_mask_id = 0.2 < rand  # 80%: replace with [MASK]
random_id = np.random.randint(1, vocab_size, size=shape)
replace_id = mask_id * choose_mask_id + \
random_id * choose_random_id + \
sentence * choose_original
mask_pos = np.where(mask)
#mask_pos_flatten = list(map(lambda idx: idx[0] * seqlen + idx[1], zip(*mask_pos))) #transpose
mask_label = sentence[mask_pos]
sentence[mask_pos] = replace_id[mask_pos] #overwrite
#log.debug(mask_pos_flatten)
return sentence, np.stack(mask_pos, -1), mask_label
def make_pretrain_dataset(name, dir, vocab, hparams, args):
gz_files = glob(dir)
if not gz_files:
raise ValueError('train data not found in %s' % dir)
log.info('read from %s' % '\n'.join(gz_files))
max_input_seqlen = args.max_seqlen
max_pretrain_seqlen = lambda: max_input_seqlen if r.random() > 0.15 else r.randint(1, max_input_seqlen) # short sentence rate
def _parse_gz(record_str): # function that takes python_str as input
ex = propeller_base.data.example_pb2.SequenceExample()
ex.ParseFromString(record_str)
doc = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['txt'].feature
]
doc_seg = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['segs'].feature
]
return doc, doc_seg
def bb_to_segments(filename):
ds = Dataset.from_record_file(filename).map(_parse_gz)
def gen():
buf, size = [], 0
iterator = iter(ds)
while 1:
doc, doc_seg = next(iterator)
for line, line_seg in zip(doc, doc_seg):
#line = np.array(sp_model.SampleEncodeAsIds(line, -1, 0.1), dtype=np.int64) # 0.1 means large variance on sentence piece result
if len(line) == 0:
continue
                    line = np.array(line)
line_seg = np.array(line_seg)
size += len(line)
buf.append(np.stack([line, line_seg]).transpose())
if size > max_input_seqlen:
yield buf,
buf, size = [], 0
if len(buf) != 0:
yield buf,
buf, size = [], 0
return Dataset.from_generator_func(gen)
def sample_negative(dataset):
def gen():
iterator = iter(dataset)
while True:
chunk_a, = next(iterator)
#chunk_b, = next(iterator)
seqlen = max_pretrain_seqlen()
seqlen_a = r.randint(1, seqlen)
seqlen_b = seqlen - seqlen_a
len_a = list(accumulate([len(c) for c in chunk_a]))
buf_a = [c for c, l in zip(chunk_a, len_a)
if l < seqlen_a] #always take the first one
buf_b = [
c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen
]
if r.random() < 0.5: #pos or neg
label = np.int64(1)
else:
label = np.int64(0)
buf_a, buf_b = buf_b, buf_a
if not (len(buf_a) and len(buf_b)):
continue
a = np.concatenate(buf_a)
b = np.concatenate(buf_b)
#log.debug(a)
#log.debug(b)
sample, seg_info, token_type = build_pair(
a, b, args.max_seqlen,
vocab) #negative sample might exceed max seqlen
yield sample, seg_info, token_type, label
ds = propeller.data.Dataset.from_generator_func(gen)
return ds
def after(sentence, seg_info, segments, label):
batch_size, seqlen = sentence.shape
sentence, mask_pos, mlm_label = apply_mask(
sentence, seg_info, args.mask_rate, hparams.vocab_size, vocab)
ra = r.random()
if ra < args.check:
print('***')
print('\n'.join([
str(j) + '\t' + '|'.join(map(str, i))
for i, j in zip(sentence.tolist(), label)
]))
print('***')
print('\n'.join(
['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('***')
print('|'.join(map(str, mlm_label.tolist())))
print('***')
return sentence, segments, mlm_label, mask_pos, label
# pretrain pipeline
dataset = Dataset.from_list(gz_files)
if propeller.train.distribution.status.mode == propeller.train.distribution.DistributionMode.NCCL:
log.info('Apply sharding in distribution env')
dataset = dataset.shard(
propeller.train.distribution.status.num_replica,
propeller.train.distribution.status.replica_id)
dataset = dataset.repeat().shuffle(buffer_size=len(gz_files))
dataset = dataset.interleave(
map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(
buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = sample_negative(dataset)
dataset = dataset.padded_batch(hparams.batch_size, (0, 0, 0, 0)).map(after)
dataset.name = name
return dataset
if __name__ == '__main__':
if six.PY3:
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=Path, default=None)
parser.add_argument('--use_amp', action='store_true')
parser.add_argument('--mask_rate', type=float, default=0.15)
parser.add_argument('--check', type=float, default=0.)
args = parser.parse_args()
P.enable_static()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' %
args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(
param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=50,
warmup_steps=10000,
learning_rate=1e-4,
weight_decay=0.01, )
hparams = default_hparams.join(propeller.HParams(
**hparams_config_file)).join(hparams_cli)
default_run_config = dict(
max_steps=1000000,
save_steps=10000,
log_steps=10,
max_ckpt=3,
skip_steps=0,
eval_steps=-1)
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
train_ds = make_pretrain_dataset(
'train',
args.data_dir,
vocab=tokenizer.vocab,
hparams=hparams,
args=args)
seq_shape = [-1, args.max_seqlen]
ints_shape = [-1, ]
shapes = (seq_shape, seq_shape, ints_shape, [-1, 2], ints_shape)
types = ('int64', 'int64', 'int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
ws = None
#varname_to_warmstart = re.compile(r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$')
varname_to_warmstart = re.compile(r'.*')
if args.from_pretrained is not None:
warm_start_dir = os.path.join(args.from_pretrained, 'params')
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(warm_start_dir, v.name)),
from_dir=warm_start_dir
)
ernie_learner = propeller.Learner(
ernie_pretrain_model_fn,
run_config,
params=hparams,
warm_start_setting=ws)
ernie_learner.train(train_ds)
# ERNIE-GEN
[ERNIE-GEN](https://arxiv.org/pdf/2001.11314.pdf) is a multi-flow language generation framework for both pre-training and fine-tuning.
Only the finetuning strategy is illustrated in this section.
## Finetune
We use the abstractive summarization task CNN/DailyMail to illustrate the usage of ERNIE-GEN; you can download the preprocessed finetune data from [here](https://ernie-github.cdn.bcebos.com/data-cnndm.tar.gz).
To start finetuning ERNIE-GEN, run:
```script
python3 -m paddle.distributed.launch \
--log_dir ./log \
./demo/seq2seq/finetune_seq2seq_dygraph.py \
--from_pretrained ernie-gen-base-en \
--data_dir ./data/cnndm \
--save_dir ./model_cnndm \
--label_smooth 0.1 \
--use_random_noice \
--noise_prob 0.7 \
--predict_output_dir ./pred \
--max_steps $((287113*30/64))
```
Note that you need more than 2 GPUs to run the finetuning.
During multi-GPU finetuning, `max_steps` is used as the stopping criterion rather than `epoch`, to prevent deadlock.
We simply calculate `max_steps` as `EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH`.
This demo script saves a finetuned model at `--save_dir`, runs multi-GPU prediction every `--eval_steps`, and saves the prediction results to `--predict_output_dir`.
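For the command above this works out to `287113 * 30 / 64 ≈ 134,584` steps (287,113 CNN/DailyMail training examples, 30 epochs, total batch size 64), which is exactly the `$((287113*30/64))` expression passed to `--max_steps`.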
### Evaluation
While finetuning, a series of prediction files is generated.
First you need to sort and join all the files with:
```shell
sort -t$'\t' -k1n ./pred/pred.step60000.* |awk -F"\t" '{print $2}'> final_prediction
```
Then use `./eval_cnndm/cnndm_eval.sh` to calculate all metrics
(`pyrouge` is required to evaluate CNN/Daily Mail.)
```shell
sh cnndm_eval.sh final_prediction ./data/cnndm/dev.summary
```
### Inference
To run beam search decoding after you have a finetuned model, try:
```shell
cat one_column_source_text | python3 demo/seq2seq/decode.py \
--from_pretrained ./ernie_gen_large \
--save_dir ./model_cnndm \
--bsz 8
```
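`decode.py` reads one source document per line from stdin (via `build_dataset_from_stdin`) and prints one decoded hypothesis per line on stdout, so the outputs line up with the input text line by line.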
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import sys
import io
import re
import argparse
import logging
import json
import numpy as np
from pathlib import Path
from collections import namedtuple
import paddle as P
from paddle.nn import functional as F
from ernie.modeling_ernie import ErnieModel, ErnieModelForPretraining, ErnieModelForGeneration
from ernie.modeling_ernie import _build_linear, _build_ln, append_name
from ernie.tokenizing_ernie import ErnieTokenizer
from propeller import log
import propeller.paddle as propeller
@np.vectorize
def rev_lookup(i):
return rev_dict[i]
def gen_bias(encoder_inputs, decoder_inputs, step):
decoder_bsz, decoder_seqlen = decoder_inputs.shape[:2]
attn_bias = P.reshape(
P.arange(
0, decoder_seqlen, 1, dtype='float32') + 1, [1, -1, 1])
decoder_bias = P.cast(
(P.matmul(
attn_bias, 1. / attn_bias, transpose_y=True) >= 1.),
'float32') #[1, 1, decoderlen, decoderlen]
encoder_bias = P.unsqueeze(
P.cast(P.ones_like(encoder_inputs), 'float32'),
[1]) #[bsz, 1, encoderlen]
encoder_bias = P.tile(
encoder_bias, [1, decoder_seqlen, 1]) #[bsz,decoderlen, encoderlen]
decoder_bias = P.tile(decoder_bias,
[decoder_bsz, 1, 1]) #[bsz, decoderlen, decoderlen]
if step > 0:
bias = P.concat([
encoder_bias, P.ones([decoder_bsz, decoder_seqlen, step],
'float32'), decoder_bias
], -1)
else:
bias = P.concat([encoder_bias, decoder_bias], -1)
return bias
#def make_data(tokenizer, inputs, max_encode_len):
# all_ids, all_sids = [], []
# for i in inputs:
# q_ids, q_sids = tokenizer.build_for_ernie(
# np.array(
# tokenizer.convert_tokens_to_ids(i.split(' '))[: max_encode_len-2],
# dtype=np.int64
# )
# )
# all_ids.append(q_ids)
# all_sids.append(q_sids)
# ml = max(map(len, all_ids))
# all_ids = [np.pad(i, [0, ml-len(i)], mode='constant')for i in all_ids]
# all_sids = [np.pad(i, [0, ml-len(i)], mode='constant')for i in all_sids]
# all_ids = np.stack(all_ids, 0)
# all_sids = np.stack(all_sids, 0)
# return all_ids, all_sids
def greedy_search_infilling(model,
q_ids,
q_sids,
sos_id,
eos_id,
attn_id,
max_encode_len=640,
max_decode_len=100,
tgt_type_id=3):
model.eval()
with P.no_grad():
#log.debug(q_ids.numpy().tolist())
_, logits, info = model(q_ids, q_sids)
gen_ids = P.argmax(logits, -1)
d_batch, d_seqlen = q_ids.shape
seqlen = P.cast(q_ids != 0, 'int64').sum(1, keepdim=True)
log.debug(seqlen.numpy())
log.debug(d_seqlen)
        has_stopped = np.zeros([d_batch], dtype=bool)
gen_seq_len = np.zeros([d_batch], dtype=np.int64)
output_ids = []
past_cache = info['caches']
cls_ids = P.ones([d_batch], dtype='int64') * sos_id
attn_ids = P.ones([d_batch], dtype='int64') * attn_id
ids = P.stack([cls_ids, attn_ids], -1)
for step in range(max_decode_len):
log.debug('decode step %d' % step)
bias = gen_bias(q_ids, ids, step)
pos_ids = P.to_tensor(
np.tile(
np.array(
[[step, step + 1]], dtype=np.int64), [d_batch, 1]))
pos_ids += seqlen
_, logits, info = model(
ids,
P.ones_like(ids) * tgt_type_id,
pos_ids=pos_ids,
attn_bias=bias,
past_cache=past_cache)
gen_ids = P.argmax(logits, -1)
past_cached_k, past_cached_v = past_cache
cached_k, cached_v = info['caches']
cached_k = [
P.concat([pk, k[:, :1, :]], 1)
for pk, k in zip(past_cached_k, cached_k)
] # concat cached
cached_v = [
P.concat([pv, v[:, :1, :]], 1)
for pv, v in zip(past_cached_v, cached_v)
]
past_cache = (cached_k, cached_v)
gen_ids = gen_ids[:, 1]
ids = P.stack([gen_ids, attn_ids], 1)
gen_ids = gen_ids.numpy()
            has_stopped |= (gen_ids == eos_id).astype(bool)
gen_seq_len += (1 - has_stopped.astype(np.int64))
output_ids.append(gen_ids.tolist())
if has_stopped.all():
#log.debug('exit because all done')
break
#if step == 1: break
output_ids = np.array(output_ids).transpose([1, 0])
return output_ids
BeamSearchState = namedtuple('BeamSearchState',
['log_probs', 'lengths', 'finished'])
BeamSearchOutput = namedtuple('BeamSearchOutput',
['scores', 'predicted_ids', 'beam_parent_ids'])
def log_softmax(x):
e_x = np.exp(x - np.max(x))
return np.log(e_x / e_x.sum())
def mask_prob(p, onehot_eos, finished):
is_finished = P.cast(P.reshape(finished, [-1, 1]) != 0, 'float32')
p = is_finished * (1. - P.cast(onehot_eos, 'float32')) * -9999. + (
1. - is_finished) * p
return p
def hyp_score(log_probs, length, length_penalty):
    # GNMT-style length normalization: lp = ((5 + len) / 6) ** alpha
    lp = P.pow((5. + P.cast(length, 'float32')) / 6., length_penalty)
    return log_probs / lp
def beam_search_step(state, logits, eos_id, beam_width, is_first_step,
length_penalty):
"""logits.shape == [B*W, V]"""
_, vocab_size = logits.shape
bsz, beam_width = state.log_probs.shape
onehot_eos = P.cast(
F.one_hot(P.ones([1], 'int64') * eos_id, vocab_size), 'int64') #[1, V]
probs = P.log(F.softmax(logits)) #[B*W, V]
probs = mask_prob(probs, onehot_eos, state.finished) #[B*W, V]
allprobs = P.reshape(state.log_probs, [-1, 1]) + probs #[B*W, V]
not_finished = 1 - P.reshape(state.finished, [-1, 1]) #[B*W,1]
not_eos = 1 - onehot_eos
length_to_add = not_finished * not_eos #[B*W,V]
alllen = P.reshape(state.lengths, [-1, 1]) + length_to_add
allprobs = P.reshape(allprobs, [-1, beam_width * vocab_size])
alllen = P.reshape(alllen, [-1, beam_width * vocab_size])
allscore = hyp_score(allprobs, alllen, length_penalty)
if is_first_step:
allscore = P.reshape(
allscore,
            [bsz, beam_width, -1])[:, 0, :]  # first step only considers beam 0
scores, idx = P.topk(allscore, k=beam_width) #[B, W]
next_beam_id = idx // vocab_size #[B, W]
next_word_id = idx % vocab_size
gather_idx = P.concat(
[P.nonzero(idx != -1)[:, :1], P.reshape(idx, [-1, 1])], 1)
next_probs = P.reshape(P.gather_nd(allprobs, gather_idx), idx.shape)
next_len = P.reshape(P.gather_nd(alllen, gather_idx), idx.shape)
gather_idx = P.concat([
P.nonzero(next_beam_id != -1)[:, :1], P.reshape(next_beam_id, [-1, 1])
], 1)
next_finished = P.reshape(
P.gather_nd(state.finished, gather_idx), state.finished.
shape) #[gather new beam state according to new beam id]
#log.debug(gather_idx.numpy())
#log.debug(state.finished.numpy())
#log.debug(next_finished.numpy())
next_finished += P.cast(next_word_id == eos_id, 'int64')
next_finished = P.cast(next_finished > 0, 'int64')
#log.debug(next_word_id.numpy())
#log.debug(next_beam_id.numpy())
next_state = BeamSearchState(
log_probs=next_probs, lengths=next_len, finished=next_finished)
output = BeamSearchOutput(
scores=scores,
predicted_ids=next_word_id,
beam_parent_ids=next_beam_id)
return output, next_state
def beam_search_infilling(model,
q_ids,
q_sids,
sos_id,
eos_id,
attn_id,
max_encode_len=640,
max_decode_len=100,
beam_width=5,
tgt_type_id=3,
length_penalty=1.0):
model.eval()
with P.no_grad():
#log.debug(q_ids.numpy().tolist())
_, __, info = model(q_ids, q_sids)
d_batch, d_seqlen = q_ids.shape
state = BeamSearchState(
log_probs=P.zeros([d_batch, beam_width], 'float32'),
lengths=P.zeros([d_batch, beam_width], 'int64'),
finished=P.zeros([d_batch, beam_width], 'int64'))
outputs = []
def reorder_(t, parent_id):
"""reorder cache according to parent beam id"""
gather_idx = P.nonzero(
parent_id != -1)[:, 0] * beam_width + P.reshape(parent_id,
[-1])
t = P.gather(t, gather_idx)
return t
def tile_(t, times):
_shapes = list(t.shape[1:])
ret = P.reshape(
P.tile(
P.unsqueeze(t, [1]), [
1,
times,
] + [1, ] * len(_shapes)), [-1, ] + _shapes)
return ret
cached_k, cached_v = info['caches']
cached_k = [tile_(k, beam_width) for k in cached_k]
cached_v = [tile_(v, beam_width) for v in cached_v]
past_cache = (cached_k, cached_v)
q_ids = tile_(q_ids, beam_width)
seqlen = P.cast(q_ids != 0, 'int64').sum(1, keepdim=True)
#log.debug(q_ids.shape)
cls_ids = P.ones([d_batch * beam_width], dtype='int64') * sos_id
attn_ids = P.ones(
[d_batch * beam_width], dtype='int64') * attn_id # SOS
ids = P.stack([cls_ids, attn_ids], -1)
for step in range(max_decode_len):
#log.debug('decode step %d' % step)
bias = gen_bias(q_ids, ids, step)
pos_ids = P.to_tensor(
np.tile(
np.array(
[[step, step + 1]], dtype=np.int64),
[d_batch * beam_width, 1]))
pos_ids += seqlen
_, logits, info = model(
ids,
P.ones_like(ids) * tgt_type_id,
pos_ids=pos_ids,
attn_bias=bias,
past_cache=past_cache)
output, state = beam_search_step(
state,
logits[:, 1],
eos_id=eos_id,
beam_width=beam_width,
is_first_step=(step == 0),
length_penalty=length_penalty)
outputs.append(output)
past_cached_k, past_cached_v = past_cache
cached_k, cached_v = info['caches']
cached_k = [
reorder_(
P.concat([pk, k[:, :1, :]], 1), output.beam_parent_ids)
for pk, k in zip(past_cached_k, cached_k)
] # concat cached
cached_v = [
reorder_(
P.concat([pv, v[:, :1, :]], 1), output.beam_parent_ids)
for pv, v in zip(past_cached_v, cached_v)
]
past_cache = (cached_k, cached_v)
pred_ids_flatten = P.reshape(output.predicted_ids,
[d_batch * beam_width])
ids = P.stack([pred_ids_flatten, attn_ids], 1)
if state.finished.numpy().all():
#log.debug('exit because all done')
break
#if step == 1: break
final_ids = P.stack([o.predicted_ids for o in outputs], 0)
final_parent_ids = P.stack([o.beam_parent_ids for o in outputs], 0)
final_ids = P.fluid.layers.gather_tree(
final_ids, final_parent_ids)[:, :, 0] #pick best beam
final_ids = P.transpose(
P.reshape(final_ids, [-1, d_batch * 1]), [1, 0])
return final_ids
en_pattern = re.compile(r'^[a-zA-Z0-9]*$')
def post_process(token):
    """Undo wordpiece: strip the '##' continuation prefix and put a leading
    space before standalone english/number tokens."""
    if token.startswith('##'):
        ret = token[2:]
    else:
        if en_pattern.match(token):
            ret = ' ' + token
        else:
            ret = token
    return ret
if __name__ == '__main__':
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
parser = argparse.ArgumentParser('seq2seq model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument('--bsz', type=int, default=8, help='batchsize')
parser.add_argument('--max_encode_len', type=int, default=640)
parser.add_argument('--max_decode_len', type=int, default=120)
parser.add_argument('--tgt_type_id', type=int, default=3)
parser.add_argument('--beam_width', type=int, default=5)
parser.add_argument(
'--attn_token',
type=str,
default='[ATTN]',
        help='if [ATTN] is not in the vocab, you can specify [MASK] as the attn-token')
parser.add_argument('--length_penalty', type=float, default=1.0)
parser.add_argument(
'--save_dir', type=str, required=True, help='model dir to be loaded')
args = parser.parse_args()
env = P.distributed.ParallelEnv()
ernie = ErnieModelForGeneration.from_pretrained(
args.from_pretrained, name='')
tokenizer = ErnieTokenizer.from_pretrained(
args.from_pretrained, mask_token=None)
rev_dict = {v: k for k, v in tokenizer.vocab.items()}
    rev_dict[tokenizer.pad_id] = ''  # replace [PAD]
    rev_dict[tokenizer.unk_id] = ''  # replace [UNK]
sd = P.load(str(args.save_dir))
ernie.set_state_dict(sd)
def map_fn(src_ids):
src_ids = src_ids[:args.max_encode_len]
src_ids, src_sids = tokenizer.build_for_ernie(src_ids)
return (src_ids, src_sids)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
])
dataset = feature_column.build_dataset_from_stdin('predict').map(
map_fn).padded_batch(args.bsz)
for step, (encoder_ids, encoder_sids) in enumerate(dataset):
#result_ids = greedy_search_infilling(ernie, P.to_tensor(encoder_ids), P.to_tensor(encoder_sids),
# eos_id=tokenizer.sep_id,
# sos_id=tokenizer.cls_id,
# attn_id=tokenizer.vocab[args.attn_id],
# max_decode_len=args.max_decode_len,
# max_encode_len=args.max_encode_len,
# beam_width=args.beam_width,
# tgt_type_id=args.tgt_type_id)
result_ids = beam_search_infilling(
ernie,
P.to_tensor(encoder_ids),
P.to_tensor(encoder_sids),
eos_id=tokenizer.sep_id,
sos_id=tokenizer.cls_id,
attn_id=tokenizer.vocab[args.attn_token],
max_decode_len=args.max_decode_len,
max_encode_len=args.max_encode_len,
beam_width=args.beam_width,
length_penalty=args.length_penalty,
tgt_type_id=args.tgt_type_id)
output_str = rev_lookup(result_ids.numpy())
for ostr in output_str.tolist():
if '[SEP]' in ostr:
ostr = ostr[:ostr.index('[SEP]')]
ostr = ''.join(map(post_process, ostr))
ostr = ostr.strip()
print(ostr)
set -x
(( $# != 2 )) && echo "Usage: predict_file label_file" && exit -1
PRED=$1
PREFIX=$2
python pyrouge_set_rouge_path.py `pwd`/file2rouge/
python cnndm/eval.py --pred ${PRED} \
--gold ${PREFIX} --trunc_len 100 --perl
=head1 NAME
XML::DOM::AttDef - A single XML attribute definition in an ATTLIST in XML::DOM
=head1 DESCRIPTION
XML::DOM::AttDef extends L<XML::DOM::Node>, but is not part of the DOM Level 1
specification.
Each object of this class represents one attribute definition in an AttlistDecl.
=head2 METHODS
=over 4
=item getName
Returns the attribute name.
=item getDefault
Returns the default value, or undef.
=item isFixed
Whether the attribute value is fixed (see #FIXED keyword.)
=item isRequired
Whether the attribute value is required (see #REQUIRED keyword.)
=item isImplied
Whether the attribute value is implied (see #IMPLIED keyword.)
=back