Commit bf483c0f authored by Zeyu Chen

reorganize legacy files

Parent 0687ceb6
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pycharm
.DS_Store
.idea/
FETCH_HEAD
# vscode
.vscode
- repo: https://github.com/PaddlePaddle/mirrors-yapf.git
sha: 0d79c0c469bab64f7229c9aca2b1186ef47f0e37
hooks:
- id: yapf
files: \.py$
- repo: https://github.com/pre-commit/pre-commit-hooks
sha: a11d9314b22d8f8c7556443875b731ef05965464
hooks:
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
files: (?!.*paddle)^.*$
- id: end-of-file-fixer
files: \.md$
- id: trailing-whitespace
files: \.md$
- repo: https://github.com/Lucas-C/pre-commit-hooks
sha: v1.0.1
hooks:
- id: forbid-crlf
files: \.md$
- id: remove-crlf
files: \.md$
- id: forbid-tabs
files: \.md$
- id: remove-tabs
files: \.md$
- repo: local
hooks:
- id: clang-format
name: clang-format
description: Format files with ClangFormat
entry: bash .clang_format.hook -i
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|cuh|proto)$
[style]
based_on_style = pep8
column_limit = 80
Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# PaddleNLP

[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://github.com/PaddlePaddle/models)
![License](https://img.shields.io/badge/license-Apache%202-blue.svg)
![python version](https://img.shields.io/badge/python-3.6+-orange.svg)
![support os](https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-yellow.svg)
## Introduction
![PaddleNLP_overview](./appendix/PaddleNLP_overview.png)

PaddleNLP aims to accelerate NLP applications with a powerful model zoo, easy-to-use APIs, and detailed tutorials. It is also the NLP best practice for the PaddlePaddle 2.0 API system.
**TODO:** Add an architecture chart for PaddleNLP
## Features

* **Rich and Powerful Model Zoo**
  - Our Model Zoo covers mainstream NLP applications, including Lexical Analysis, Syntactic Parsing, Machine Translation, Text Classification, Text Generation, Text Matching, General Dialogue, Question Answering, and more.
* **Easy-to-use API**
  - The API is fully integrated with the PaddlePaddle high-level API system. It minimizes the number of user actions required for common use cases such as data loading, text pre-processing, training, and evaluation, so you can work on text problems more productively.
* **High Performance and Large-scale Training**
  - We provide a highly optimized distributed training implementation for BERT with the Fleet API that can fully utilize GPU clusters for large-scale model pre-training. Please refer to our [benchmark](./benchmark/bert) for more information.
* **Detailed Tutorials and Industrial Practices**
  - We offer detailed, interactive notebook tutorials that show you the best practices of PaddlePaddle 2.0.

## Installation

### Prerequisites

* python >= 3.6
* paddlepaddle >= 2.0.0-rc1

```
pip install paddlenlp
```
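A quick sanity check for the installation (a minimal sketch; it assumes the installed package exposes a `__version__` attribute):

```python
# Confirm that paddlenlp is importable and report its version.
import paddlenlp
print(paddlenlp.__version__)
```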
## Quick Start

### Quick Dataset Loading

```python
dataset = paddlenlp.datasets.ChnSentiCorp(split="train")
```

### Reusable Text Embedding

```python
wordemb = paddlenlp.embedding.SkipGram("Text8")
wordemb("language")
>>> [1.0, 2.0, 3.0, ...., 5.0, 6.0]
```

### High Quality Chinese Pre-trained Model

```python
from paddlenlp.transformer import ErnieModel
ernie = ErnieModel.from_pretrained("ernie-1.0-chinese")
sequence_output, pooled_output = ernie.forward(input_ids, segment_ids)
```

## Tutorials

List our notebook tutorials based on AI Studio.

## Community

* SIG for Pretrained Model Contribution
* SIG for Dataset Integration

## FAQ

## License

PaddleNLP is provided under the [Apache-2.0 license](./LICENSE).

**PaddleNLP** is an open-source project of NLP tools, algorithms, models, and data built on the [PaddlePaddle](http://www.paddlepaddle.org/) deep learning framework. More than a decade of Baidu's deep accumulation in NLP powers PaddleNLP. With PaddleNLP, you get:

- **Rich and comprehensive NLP task support:**
  - PaddleNLP supports applications at multiple granularities and in multiple scenarios, covering basic NLP techniques such as [word segmentation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), [part-of-speech tagging](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), and [named entity recognition](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), as well as core NLP techniques such as [text classification](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/sentiment_classification), [text similarity](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/similarity_net), [semantic representation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/pretrain_language_models), and [text generation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/seq2seq). PaddleNLP also provides the core techniques, tool components, models, and pre-trained parameters for common large-scale NLP application systems, such as [reading comprehension](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_reading_comprehension), [dialogue systems](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/dialogue_system), and [machine translation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_translation), so you can move freely across the NLP field.
- **Stable, reliable NLP models with powerful pre-trained parameters:**
  - PaddleNLP integrates the NLP tools and models widely used inside Baidu, giving you stable and reliable algorithmic solutions. Pre-trained parameters learned from tens of billions of data samples, together with a rich collection of pre-trained models, help you easily improve model quality and inject strong momentum into your NLP business.
- **Continuous improvement and technical support for building NLP applications from scratch:**
  - PaddleNLP provides continuous technical support and model and algorithm updates to safeguard your NLP business.

Quick Installation
-------

### Dependencies

This project requires Python 2.7 and Paddle Fluid 1.3.1 or later; please refer to the [installation guide](http://www.paddlepaddle.org/#quick-start) to install PaddlePaddle. Note that Windows GPU environments are not supported yet; to run in a Windows GPU environment, replace [fluid.ParallelExecutor](http://paddlepaddle.org/documentation/docs/zh/1.4/api_cn/fluid_cn.html#parallelexecutor) in the example code with [fluid.Executor](http://paddlepaddle.org/documentation/docs/zh/1.4/api_cn/fluid_cn.html#executor).

### Workflow

- Clone the repository:

```shell
git clone https://github.com/PaddlePaddle/models.git
```

- Enter a specific subdirectory to browse the code and run a task (e.g., sentiment classification):

```shell
cd models/PaddleNLP/sentiment_classification
```

Find Your NLP Solution
-------

| Task | Project/Directory | Description |
| :--: | :--: | :--: |
| **Chinese word segmentation**, **POS tagging**, **NER** :fire: | [LAC](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis) | LAC, short for Lexical Analysis of Chinese, is a Chinese processing tool widely used inside Baidu, covering common Chinese processing tasks such as word segmentation, POS tagging, and named entity recognition. |
| **Word embeddings (word2vec)** | [word2vec](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/word2vec) | Distributed training of Chinese word embeddings on a single machine with multiple GPUs or across machines; supports mainstream embedding models (skip-gram, cbow, etc.) and lets you quickly train embeddings on custom data. |
| **Language model** | [Language_model](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/language_model) | A classic RNN-based neural language model. |
| **Sentiment classification** :fire: | [Senta](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/sentiment_classification), [EmotionDetection](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/emotion_detection) | Senta (Sentiment Classification) and EmotionDetection provide sentiment analysis models for *general scenarios* and *human-machine dialogue scenarios*, respectively. |
| **Text similarity** :fire: | [SimNet](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/similarity_net) | SimNet (Similarity Net) offers efficient and reliable text similarity tools and pre-trained models. |
| **Semantic representation** :fire: | [pretrain_language_models](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/pretrain_language_models) | Integrates popular Chinese and English pre-trained models such as ELMo, BERT, ERNIE 1.0, ERNIE 2.0, and XLNet. |
| **Text generation** | [seq2seq](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/PaddleTextGEN) | A series of classic text generation models, such as vanilla seq2seq, seq2seq with attention, and variational seq2seq. |
| **Reading comprehension** | [machine_reading_comprehension](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_reading_comprehension) | Paddle Machine Reading Comprehension collects Baidu's models, tools, and open-source data for reading comprehension, including DuReader (Baidu's large-scale Chinese reading comprehension dataset built from real search behavior), KT-Net (a knowledge-enhanced model that formerly ranked first on SQuAD and ReCoRD), and D-Net (a pretrain-finetune framework that took first place in the EMNLP 2019 MRQA shared task). |
| **Dialogue systems** | [dialogue_system](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/dialogue_system) | Includes: 1) DGU (Dialogue General Understanding), covering common dialogue tasks such as context-response matching for **retrieval-based chatbots** and **intent detection**, **slot filling**, and **state tracking** for **task-oriented dialogue systems**, with the best results on six public datasets; <br/> 2) knowledge-driven dialogue, Baidu's open-source knowledge-grounded open-domain dialogue dataset, published at ACL 2019; <br/> 3) ADEM (Auto Dialogue Evaluation Model), which automatically evaluates the response quality of different dialogue generation models. |
| **Machine translation** | [machine_translation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_translation) | Paddle Machine Translation, a classic Transformer-based machine translation model. |
| **Other frontier work** | [Research](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/Research) | Baidu's latest open-source research work. |

Directory Structure
------

```text
.
├── Research                          # Collection of Baidu NLP research work
├── machine_translation               # Machine translation code, data, and pre-trained models
├── dialogue_system                   # Dialogue system code, data, and pre-trained models
├── machine_reading_comprehension     # Reading comprehension code, data, and pre-trained models
├── pretrain_language_models          # Language representation toolbox
├── language_model                    # Language models
├── lexical_analysis                  # LAC lexical analysis
├── shared_modules/models             # Shared networks
│   ├── __init__.py
│   ├── classification
│   ├── dialogue_model_toolkit
│   ├── language_model
│   ├── matching
│   ├── neural_machine_translation
│   ├── reading_comprehension
│   ├── representation
│   ├── sequence_labeling
│   └── transformer_encoder.py
├── shared_modules/preprocess         # Shared text preprocessing tools
│   ├── __init__.py
│   ├── ernie
│   ├── padding.py
│   └── tokenizer
├── sentiment_classification          # Senta text sentiment analysis
├── similarity_net                    # SimNet short-text semantic matching
```

Except for `models` and `preprocess`, which hold the shared model collection and the shared data preprocessing pipeline, every other directory is an independent task that you can enter and run directly.

Contact Us
------

Scan the QR code below to join our QQ group and get technical support from Baidu right away:

![Paddle_QQ](./appendix/Paddle_QQ.jpg)
# BERT Benchmark with Fleet API

First, set up the runtime environment:

```shell
export PYTHONPATH=/home/fangzeyang/PaddleNLP
export DATA_DIR=/home/fangzeyang/bert_data/wikicorpus_en
```

## Pre-training for NLP Tasks

1. For single-machine multi-GPU or multi-machine multi-GPU training, use the following command:

```shell
unset CUDA_VISIBLE_DEVICES
fleetrun --gpus 0,1,2,3 ./run_pretrain.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --max_predictions_per_seq 20 \
    --batch_size 32 \
    --learning_rate 1e-4 \
    --weight_decay 1e-2 \
    --adam_epsilon 1e-6 \
    --warmup_steps 10000 \
    --input_dir $DATA_DIR \
    --output_dir ./tmp2/ \
    --logging_steps 1 \
    --save_steps 20000 \
    --max_steps 1000000
```

2. For single-machine single-GPU training, use the following command:

```shell
export CUDA_VISIBLE_DEVICES=0
python ./run_pretrain_single.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --max_predictions_per_seq 20 \
    --batch_size 32 \
    --learning_rate 1e-4 \
    --weight_decay 1e-2 \
    --adam_epsilon 1e-6 \
    --warmup_steps 10000 \
    --input_dir $DATA_DIR \
    --output_dir ./tmp2/ \
    --logging_steps 1 \
    --save_steps 20000 \
    --max_steps 1000000 \
    --use_amp True \
    --enable_addto True
```
## Fine-tuning NLP Tasks

After pre-training the BERT model, you can use the pre-trained parameters for fine-tuning on a specific NLP task. The following shows how to fine-tune a classification task with an open-source pre-trained model.

### Sentence and Sentence-Pair Classification

Taking the GLUE/SST-2 task as an example, fine-tuning is launched as follows (`paddlenlp` must already be installed or discoverable via `PYTHONPATH`):

```shell
export CUDA_VISIBLE_DEVICES=0
export TASK_NAME=SST-2
python -u ./run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --max_seq_length 128 \
    --batch_size 64 \
    --learning_rate 2e-5 \
    --num_train_epochs 3 \
    --logging_steps 1 \
    --save_steps 500 \
    --output_dir ./tmp/$TASK_NAME/
```

The arguments are:
- `model_type`: the model type; currently only BERT is supported.
- `model_name_or_path`: a model of a specific configuration, together with its pre-trained weights and the tokenizer used during pre-training. If the model files are stored locally, a directory path can be given here instead.
- `task_name`: the task to fine-tune on.
- `max_seq_length`: the maximum sequence length; longer inputs are truncated.
- `batch_size`: the number of samples **per card** in each iteration.
- `learning_rate`: the base learning rate, which is multiplied by the factor produced by the learning rate scheduler to obtain the current learning rate (see the sketch after this list).
- `num_train_epochs`: the number of training epochs.
- `logging_steps`: the logging interval.
- `save_steps`: the interval for saving and evaluating the model.
- `output_dir`: the directory where models are saved.
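The warmup-then-linear-decay factor that `run_glue.py` builds with `paddle.optimizer.lr.LambdaDecay` can be sketched as a plain function (illustration only; the function name is ours, not part of the script):

```python
def lr_factor(current_step, num_warmup_steps, num_training_steps):
    # Ramp linearly from 0 to 1 during warmup ...
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    # ... then decay linearly from 1 back to 0 over the remaining steps.
    return max(0.0, float(num_training_steps - current_step) /
               float(max(1, num_training_steps - num_warmup_steps)))

# The learning rate actually used at each step is
# args.learning_rate * lr_factor(step, args.warmup_steps, num_training_steps).
```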
Single-GPU fine-tuning with the commands above gives the following results on the dev sets:

| Task  | Metric   | Result |
|-------|----------|--------|
| SST-2 | Accuracy | 92.76  |
| QNLI  | Accuracy | 91.73  |
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import h5py
import numpy as np
import paddle
from paddle.io import DataLoader, Dataset
from paddlenlp.data import Tuple, Stack
def create_pretraining_dataset(input_file,
max_pred_length,
args,
data_holders,
worker_init=None,
places=None):
train_data = PretrainingDataset(
input_file=input_file, max_pred_length=max_pred_length)
train_batch_sampler = paddle.io.DistributedBatchSampler(
train_data, batch_size=args.batch_size, shuffle=True)
def _collate_data(data, stack_fn=Stack()):
num_fields = len(data[0])
out = [None] * num_fields
# input_ids, segment_ids, input_mask, masked_lm_positions,
# masked_lm_labels, next_sentence_labels, mask_token_num
for i in (0, 1, 2, 5):
out[i] = stack_fn([x[i] for x in data])
batch_size, seq_length = out[0].shape
size = num_mask = sum(len(x[3]) for x in data)
# Padding for divisibility by 8 for fp16 or int8 usage
if size % 8 != 0:
size += 8 - (size % 8)
# masked_lm_positions
# Organize as a 1D tensor for gather or use gather_nd
out[3] = np.full(size, 0, dtype=np.int64)
# masked_lm_labels
out[4] = np.full([size, 1], -1, dtype=np.int64)
mask_token_num = 0
for i, x in enumerate(data):
for j, pos in enumerate(x[3]):
out[3][mask_token_num] = i * seq_length + pos
out[4][mask_token_num] = x[4][j]
mask_token_num += 1
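        # Example: with seq_length=128, a mask at position 5 of sample i=1
        # flattens to index 1 * 128 + 5 = 133, so a single 1D gather over the
        # flattened [batch_size * seq_length, hidden] activations selects
        # every masked position at once.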
# mask_token_num
out.append(np.asarray([mask_token_num], dtype=np.float32))
return out
train_data_loader = DataLoader(
dataset=train_data,
places=places,
feed_list=data_holders,
batch_sampler=train_batch_sampler,
collate_fn=_collate_data,
num_workers=0,
worker_init_fn=worker_init,
return_list=False)
return train_data_loader, input_file
def create_data_holder(args):
input_ids = paddle.static.data(
name="input_ids", shape=[-1, -1], dtype="int64")
segment_ids = paddle.static.data(
name="segment_ids", shape=[-1, -1], dtype="int64")
input_mask = paddle.static.data(
name="input_mask", shape=[-1, 1, 1, -1], dtype="float32")
masked_lm_positions = paddle.static.data(
name="masked_lm_positions", shape=[-1], dtype="int64")
masked_lm_labels = paddle.static.data(
name="masked_lm_labels", shape=[-1, 1], dtype="int64")
next_sentence_labels = paddle.static.data(
name="next_sentence_labels", shape=[-1, 1], dtype="int64")
masked_lm_scale = paddle.static.data(
name="masked_lm_scale", shape=[-1, 1], dtype="float32")
return [
input_ids, segment_ids, input_mask, masked_lm_positions,
masked_lm_labels, next_sentence_labels, masked_lm_scale
]
class PretrainingDataset(Dataset):
def __init__(self, input_file, max_pred_length):
self.input_file = input_file
self.max_pred_length = max_pred_length
f = h5py.File(input_file, "r")
keys = [
'input_ids', 'input_mask', 'segment_ids', 'masked_lm_positions',
'masked_lm_ids', 'next_sentence_labels'
]
self.inputs = [np.asarray(f[key][:]) for key in keys]
f.close()
def __len__(self):
'Denotes the total number of samples'
return len(self.inputs[0])
def __getitem__(self, index):
[
input_ids, input_mask, segment_ids, masked_lm_positions,
masked_lm_ids, next_sentence_labels
] = [
input[index].astype(np.int64)
if indice < 5 else np.asarray(input[index].astype(np.int64))
for indice, input in enumerate(self.inputs)
]
# TODO: whether to use reversed mask by changing 1s and 0s to be
# consistent with nv bert
input_mask = (1 - np.reshape(
input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e9
index = self.max_pred_length
# store number of masked tokens in index
        # Note: numpy.nonzero returns a tuple of index arrays (one per
        # dimension), unlike torch.nonzero, which returns a single tensor.
padded_mask_indices = (masked_lm_positions == 0).nonzero()[0]
if len(padded_mask_indices) != 0:
index = padded_mask_indices[0].item()
mask_token_num = index
else:
index = 0
mask_token_num = 0
# masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64)
# masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index]
masked_lm_labels = masked_lm_ids[:index]
masked_lm_positions = masked_lm_positions[:index]
# softmax_with_cross_entropy enforce last dim size equal 1
masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1)
next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1)
return [
input_ids, segment_ids, input_mask, masked_lm_positions,
masked_lm_labels, next_sentence_labels
]
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
import random
import time
from functools import partial
import numpy as np
import paddle
from paddle.io import DataLoader
from paddlenlp.datasets import GlueQNLI, GlueSST2
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.data.sampler import SamplerHelper
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)
TASK_CLASSES = {
"qnli": (GlueQNLI, paddle.metric.Accuracy), # (dataset, metric)
"sst-2": (GlueSST2, paddle.metric.Accuracy),
}
MODEL_CLASSES = {"bert": (BertForSequenceClassification, BertTokenizer), }
def parse_args():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " +
", ".join(TASK_CLASSES.keys()), )
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()), )
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])), )
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.", )
parser.add_argument(
"--batch_size",
default=8,
type=int,
help="Batch size per GPU/CPU for training.", )
parser.add_argument(
"--learning_rate",
default=5e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument(
"--weight_decay",
default=0.0,
type=float,
help="Weight decay if we apply some.")
parser.add_argument(
"--adam_epsilon",
default=1e-8,
type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument(
"--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs",
default=3,
type=int,
help="Total number of training epochs to perform.", )
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument(
"--warmup_steps",
default=0,
type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument(
"--logging_steps",
type=int,
default=500,
help="Log every X updates steps.")
parser.add_argument(
"--save_steps",
type=int,
default=500,
help="Save checkpoint every X updates steps.")
parser.add_argument(
"--seed", type=int, default=42, help="Random seed for initialization")
args = parser.parse_args()
return args
def create_data_holder():
input_ids = paddle.static.data(
name="input_ids", shape=[-1, -1], dtype="int64")
segment_ids = paddle.static.data(
name="segment_ids", shape=[-1, -1], dtype="int64")
label = paddle.static.data(name="label", shape=[-1, 1], dtype="int64")
return [input_ids, segment_ids, label]
def reset_program_state_dict(model, state_dict, pretrained_state_dict):
reset_state_dict = {}
scale = model.initializer_range if hasattr(model, "initializer_range")\
else model.bert.config["initializer_range"]
for n, p in state_dict.items():
if n not in pretrained_state_dict:
dtype_str = "float32"
if str(p.dtype) == "VarType.FP64":
dtype_str = "float64"
reset_state_dict[p.name] = np.random.normal(
loc=0.0, scale=scale, size=p.shape).astype(dtype_str)
else:
reset_state_dict[p.name] = pretrained_state_dict[n]
return reset_state_dict
def set_seed(args):
random.seed(args.seed + paddle.distributed.get_rank())
np.random.seed(args.seed + paddle.distributed.get_rank())
paddle.seed(args.seed + paddle.distributed.get_rank())
def evaluate(exe, metric, loss, correct, dev_program, data_loader):
metric.reset()
for batch in data_loader:
loss_return, correct_return = exe.run(dev_program, feed=batch, \
fetch_list=[loss, correct])
metric.update(correct_return)
accuracy = metric.accumulate()
print("eval loss: %f, accuracy: %f" % (loss_return, accuracy))
def convert_example(example,
tokenizer,
label_list,
max_seq_length=512,
is_test=False):
"""convert a glue example into necessary features"""
def _truncate_seqs(seqs, max_seq_length):
if len(seqs) == 1: # single sentence
# Account for [CLS] and [SEP] with "- 2"
seqs[0] = seqs[0][0:(max_seq_length - 2)]
else: # sentence pair
# Account for [CLS], [SEP], [SEP] with "- 3"
tokens_a, tokens_b = seqs
max_seq_length -= 3
while True: # truncate with longest_first strategy
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_seq_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
return seqs
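    # Example: with max_seq_length=8, a sentence pair reserves 3 special
    # tokens ([CLS], [SEP], [SEP]), leaving a budget of 5; token lists of
    # lengths (6, 4) are popped from the longer side first and end up with
    # lengths (3, 2).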
def _concat_seqs(seqs, separators, seq_mask=0, separator_mask=1):
concat = sum((seq + sep for sep, seq in zip(separators, seqs)), [])
segment_ids = sum(
([i] * (len(seq) + len(sep))
for i, (sep, seq) in enumerate(zip(separators, seqs))), [])
if isinstance(seq_mask, int):
seq_mask = [[seq_mask] * len(seq) for seq in seqs]
if isinstance(separator_mask, int):
separator_mask = [[separator_mask] * len(sep) for sep in separators]
p_mask = sum((s_mask + mask
for sep, seq, s_mask, mask in zip(
separators, seqs, seq_mask, separator_mask)), [])
return concat, segment_ids, p_mask
if not is_test:
# `label_list == None` is for regression task
label_dtype = "int64" if label_list else "float32"
# get the label
label = example[-1]
example = example[:-1]
#create label maps if classification task
if label_list:
label_map = {}
for (i, l) in enumerate(label_list):
label_map[l] = i
label = label_map[label]
label = [label]
#label = np.array([label], dtype=label_dtype)
# tokenize raw text
tokens_raw = [tokenizer(l) for l in example]
# truncate to the truncate_length,
tokens_trun = _truncate_seqs(tokens_raw, max_seq_length)
# concate the sequences with special tokens
tokens_trun[0] = [tokenizer.cls_token] + tokens_trun[0]
tokens, segment_ids, _ = _concat_seqs(tokens_trun, [[tokenizer.sep_token]] *
len(tokens_trun))
# convert the token to ids
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
# input_mask = [1] * len(input_ids)
if not is_test:
return input_ids, segment_ids, label
else:
return input_ids, segment_ids
def do_train(args):
    # Set up the paddle execution environment
paddle.enable_static()
place = paddle.CUDAPlace(0)
set_seed(args)
# Create the main_program for the training and dev_program for the validation
main_program = paddle.static.default_main_program()
startup_program = paddle.static.default_startup_program()
dev_program = paddle.static.Program()
# Get the configuration of tokenizer and model
args.task_name = args.task_name.lower()
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
dataset_class, metric_class = TASK_CLASSES[args.task_name]
# Create the tokenizer and dataset
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
train_dataset, dev_dataset = dataset_class.get_datasets(["train", "dev"])
trans_func = partial(
convert_example,
tokenizer=tokenizer,
label_list=train_dataset.get_labels(),
max_seq_length=args.max_seq_length)
train_dataset = train_dataset.apply(trans_func, lazy=True)
batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=tokenizer.pad_token_id), # input
Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment
Stack(dtype="int64" if train_dataset.get_labels() else "float32") # label
): [data for i, data in enumerate(fn(samples))]
train_batch_sampler = paddle.io.BatchSampler(
train_dataset, batch_size=args.batch_size, shuffle=True)
dev_dataset = dev_dataset.apply(trans_func, lazy=True)
dev_batch_sampler = paddle.io.BatchSampler(
dev_dataset, batch_size=args.batch_size, shuffle=False)
feed_list_name = []
# Define the input data and create the train/dev data_loader
with paddle.static.program_guard(main_program, startup_program):
[input_ids, segment_ids, labels] = create_data_holder()
train_data_loader = DataLoader(
dataset=train_dataset,
feed_list=[input_ids, segment_ids, labels],
batch_sampler=train_batch_sampler,
collate_fn=batchify_fn,
num_workers=0,
return_list=False)
dev_data_loader = DataLoader(
dataset=dev_dataset,
feed_list=[input_ids, segment_ids, labels],
batch_sampler=dev_batch_sampler,
collate_fn=batchify_fn,
num_workers=0,
return_list=False)
# Create the training-forward program, and clone it for the validation
with paddle.static.program_guard(main_program, startup_program):
model, pretrained_state_dict = model_class.from_pretrained(
args.model_name_or_path,
num_classes=len(train_dataset.get_labels()))
loss_fct = paddle.nn.loss.CrossEntropyLoss(
) if train_dataset.get_labels() else paddle.nn.loss.MSELoss()
logits = model(input_ids, segment_ids)
loss = loss_fct(logits, labels)
dev_program = main_program.clone(for_test=True)
# Create the training-backward program, this pass will not be
# executed in the validation
with paddle.static.program_guard(main_program, startup_program):
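        # The LambdaDecay factor ramps linearly from 0 to 1 over warmup_steps,
        # then decays linearly back to 0 over the remaining training steps;
        # it multiplies args.learning_rate at each step.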
lr_scheduler = paddle.optimizer.lr.LambdaDecay(
args.learning_rate,
lambda current_step, num_warmup_steps=args.warmup_steps,
num_training_steps=args.max_steps if args.max_steps > 0 else
(len(train_data_loader) * args.num_train_epochs): float(
current_step) / float(max(1, num_warmup_steps))
if current_step < num_warmup_steps else max(
0.0,
float(num_training_steps - current_step) / float(
max(1, num_training_steps - num_warmup_steps))))
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
epsilon=args.adam_epsilon,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in [
p.name for n, p in model.named_parameters()
if not any(nd in n for nd in ["bias", "norm"])
])
optimizer.minimize(loss)
# Create the metric pass for the validation
with paddle.static.program_guard(dev_program, startup_program):
metric = metric_class()
correct = metric.compute(logits, labels)
    # Initialize the fine-tuning parameters: load the parameters from the
    # pre-trained model, and initialize any parameter that is not in the
    # pre-trained model from a normal distribution.
exe = paddle.static.Executor(place)
exe.run(startup_program)
state_dict = model.state_dict()
reset_state_dict = reset_program_state_dict(model, state_dict,
pretrained_state_dict)
paddle.static.set_program_state(main_program, reset_state_dict)
global_step = 0
tic_train = time.time()
for epoch in range(args.num_train_epochs):
for step, batch in enumerate(train_data_loader):
global_step += 1
loss_return = exe.run(main_program, feed=batch, fetch_list=[loss])
if global_step % args.logging_steps == 0:
logger.info(
"global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
% (global_step, epoch, step, loss_return[0],
args.logging_steps / (time.time() - tic_train)))
tic_train = time.time()
lr_scheduler.step()
if global_step % args.save_steps == 0:
# Validation pass, record the loss and metric
evaluate(exe, metric, loss, correct, dev_program,
dev_data_loader)
output_dir = os.path.join(args.output_dir,
"model_%d" % global_step)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
paddle.fluid.io.save_params(exe, output_dir)
tokenizer.save_pretrained(output_dir)
if __name__ == "__main__":
args = parse_args()
do_train(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import random
import time
import h5py
from functools import partial
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import paddle
import paddle.distributed.fleet as fleet
from paddle.io import DataLoader, Dataset
from paddlenlp.transformers import BertForPretraining, BertModel, BertPretrainingCriterion
from paddlenlp.transformers import BertTokenizer
from data import create_data_holder, create_pretraining_dataset
MODEL_CLASSES = {"bert": (BertForPretraining, BertTokenizer)}
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()), )
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])), )
parser.add_argument(
"--input_dir",
default=None,
type=str,
required=True,
help="The input directory where the data will be read from.", )
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument(
"--max_predictions_per_seq",
default=80,
type=int,
help="The maximum total of masked tokens in input sequence")
parser.add_argument(
"--batch_size",
default=8,
type=int,
help="Batch size per GPU/CPU for training.", )
parser.add_argument(
"--learning_rate",
default=5e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument(
"--weight_decay",
default=0.0,
type=float,
help="Weight decay if we apply some.")
parser.add_argument(
"--adam_epsilon",
default=1e-8,
type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument(
"--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument(
"--warmup_steps",
default=0,
type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument(
"--logging_steps",
type=int,
default=500,
help="Log every X updates steps.")
parser.add_argument(
"--save_steps",
type=int,
default=500,
help="Save checkpoint every X updates steps.")
parser.add_argument(
"--seed", type=int, default=42, help="Random seed for initialization")
args = parser.parse_args()
return args
def select_dataset_file_for_each_worker(files, f_start_id, worker_num,
worker_index):
num_files = len(files)
if worker_num > num_files:
remainder = worker_num % num_files
data_file = files[(
f_start_id * worker_num + worker_index + remainder * f_start_id) %
num_files]
else:
data_file = files[(f_start_id * worker_num + worker_index) % num_files]
return data_file
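    # Example: with 8 workers and 3 shard files at f_start_id=0, worker i
    # reads files[i % 3]; the remainder term shifts the assignment as
    # f_start_id grows, so shards rotate across workers over time.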
def reset_program_state_dict(model, state_dict):
scale = model.initializer_range if hasattr(model, "initializer_range")\
else model.bert.config["initializer_range"]
new_state_dict = dict()
for n, p in state_dict.items():
if "layer_norm" not in p.name:
dtype_str = "float32"
if str(p.dtype) == "VarType.FP64":
dtype_str = "float64"
new_state_dict[p.name] = np.random.normal(
loc=0.0, scale=scale, size=p.shape).astype(dtype_str)
return new_state_dict
class WorkerInitObj(object):
def __init__(self, seed):
self.seed = seed
def __call__(self, id):
np.random.seed(seed=self.seed + id)
random.seed(self.seed + id)
def set_seed(seed):
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
def do_train(args):
    # Initialize the paddle and paddle fleet execution environment
paddle.enable_static()
place = paddle.CUDAPlace(int(os.environ.get('FLAGS_selected_gpus', 0)))
fleet.init(is_collective=True)
# Create the random seed for the worker
set_seed(args.seed)
worker_init = WorkerInitObj(args.seed + fleet.worker_index())
# Define the input data in the static mode
data_holders = create_data_holder(args)
[
input_ids, segment_ids, input_mask, masked_lm_positions,
masked_lm_labels, next_sentence_labels, masked_lm_scale
] = data_holders
# Define the model structure in static mode
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
model = BertForPretraining(
BertModel(**model_class.pretrained_init_configuration[
args.model_name_or_path]))
criterion = BertPretrainingCriterion(model.bert.config["vocab_size"])
prediction_scores, seq_relationship_score = model(
input_ids=input_ids,
token_type_ids=segment_ids,
attention_mask=input_mask,
masked_positions=masked_lm_positions)
loss = criterion(prediction_scores, seq_relationship_score,
masked_lm_labels, next_sentence_labels, masked_lm_scale)
    # Define the dynamic learning rate scheduler and optimizer. Note that this
    # script requires --max_steps > 0: the fallback below refers to
    # train_data_loader and args.num_train_epochs, neither of which is defined
    # at this point.
lr_scheduler = paddle.optimizer.lr.LambdaDecay(
args.learning_rate,
lambda current_step, num_warmup_steps=args.warmup_steps,
num_training_steps=args.max_steps if args.max_steps > 0 else
(len(train_data_loader) * args.num_train_epochs): float(
current_step) / float(max(1, num_warmup_steps))
if current_step < num_warmup_steps else max(
0.0,
float(num_training_steps - current_step) / float(
max(1, num_training_steps - num_warmup_steps))))
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
epsilon=args.adam_epsilon,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in [
p.name for n, p in model.named_parameters()
if not any(nd in n for nd in ["bias", "norm"])
])
# Use the fleet api to compile the distributed optimizer
strategy = fleet.DistributedStrategy()
optimizer = fleet.distributed_optimizer(optimizer, strategy=strategy)
optimizer.minimize(loss)
# Define the Executor for running the static model
exe = paddle.static.Executor(place)
exe.run(paddle.static.default_startup_program())
state_dict = model.state_dict()
# Use the state dict to update the parameter
reset_state_dict = reset_program_state_dict(model, state_dict)
paddle.static.set_program_state(paddle.static.default_main_program(),
reset_state_dict)
pool = ThreadPoolExecutor(1)
global_step = 0
tic_train = time.time()
worker_num = fleet.worker_num()
worker_index = fleet.worker_index()
epoch = 0
while True:
files = [
os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)
if os.path.isfile(os.path.join(args.input_dir, f)) and "training" in
f
]
files.sort()
num_files = len(files)
random.Random(args.seed + epoch).shuffle(files)
f_start_id = 0
# Select one file for each worker and create the DataLoader for the file
data_file = select_dataset_file_for_each_worker(
files, f_start_id, worker_num, worker_index)
train_data_loader, _ = create_pretraining_dataset(
data_file, args.max_predictions_per_seq, args, data_holders,
worker_init, paddle.static.cuda_places())
for f_id in range(f_start_id + 1, len(files)):
data_file = select_dataset_file_for_each_worker(
files, f_id, worker_num, worker_index)
dataset_future = pool.submit(create_pretraining_dataset, data_file,
args.max_predictions_per_seq, args,
data_holders, worker_init,
paddle.static.cuda_places())
for step, batch in enumerate(train_data_loader):
global_step += 1
loss_return = exe.run(paddle.static.default_main_program(),\
feed=batch,
fetch_list=[loss])
            # In the 2.0 API, lr_scheduler.step() must be called explicitly to
            # update the learning rate.
lr_scheduler.step()
if global_step % args.logging_steps == 0:
time_cost = time.time() - tic_train
print(
"global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s, ips :%.2f sequences/s"
% (global_step, epoch, step, loss_return[0],
args.logging_steps / time_cost,
args.logging_steps * args.batch_size / time_cost))
tic_train = time.time()
if global_step % args.save_steps == 0:
if worker_index == 0:
output_dir = os.path.join(args.output_dir,
"model_%d" % global_step)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
                    # TODO(fangzeyang): Update save_params to the paddle.static API
paddle.fluid.io.save_params(exe, output_dir)
tokenizer.save_pretrained(output_dir)
if global_step >= args.max_steps:
del train_data_loader
return
del train_data_loader
train_data_loader, data_file = dataset_future.result(timeout=None)
epoch += 1
if __name__ == "__main__":
args = parse_args()
do_train(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import collections
import itertools
import os
import random
import time
import h5py
from functools import partial
import numpy as np
import distutils.util
import paddle
from paddle.io import DataLoader, Dataset
from paddlenlp.transformers import BertForPretraining, BertModel, BertPretrainingCriterion
from paddlenlp.transformers import BertTokenizer
from data import create_data_holder, create_pretraining_dataset
MODEL_CLASSES = {"bert": (BertForPretraining, BertTokenizer)}
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()), )
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])), )
parser.add_argument(
"--input_dir",
default=None,
type=str,
required=True,
help="The input directory where the data will be read from.", )
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument(
"--max_predictions_per_seq",
default=80,
type=int,
help="The maximum total of masked tokens in input sequence")
parser.add_argument(
"--batch_size",
default=8,
type=int,
help="Batch size per GPU/CPU for training.", )
parser.add_argument(
"--learning_rate",
default=5e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument(
"--weight_decay",
default=0.0,
type=float,
help="Weight decay if we apply some.")
parser.add_argument(
"--adam_epsilon",
default=1e-8,
type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument(
"--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument(
"--warmup_steps",
default=0,
type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument(
"--logging_steps",
type=int,
default=500,
help="Log every X updates steps.")
parser.add_argument(
"--save_steps",
type=int,
default=500,
help="Save checkpoint every X updates steps.")
parser.add_argument(
"--seed", type=int, default=42, help="Random seed for initialization")
parser.add_argument(
"--use_amp",
type=distutils.util.strtobool,
default=False,
help="Enable mixed precision training.")
parser.add_argument(
"--enable_addto",
type=distutils.util.strtobool,
default=False,
help="Whether to enable the addto strategy for gradient accumulation or not. This is only used for AMP training."
)
parser.add_argument(
"--scale_loss",
type=float,
default=1.0,
help="The value of scale_loss for fp16.")
parser.add_argument(
"--use_dynamic_loss_scaling",
type=distutils.util.strtobool,
default=True,
help="Whether to use dynamic loss scaling.")
args = parser.parse_args()
return args
def construct_compiled_program(main_program, loss):
exec_strategy = paddle.static.ExecutionStrategy()
exec_strategy.num_threads = 1
exec_strategy.num_iteration_per_drop_scope = 10000
build_strategy = paddle.static.BuildStrategy()
build_strategy.enable_addto = args.enable_addto
main_program = paddle.static.CompiledProgram(
main_program).with_data_parallel(
loss_name=loss.name,
exec_strategy=exec_strategy,
build_strategy=build_strategy)
return main_program
def reset_program_state_dict(model, state_dict):
scale = model.initializer_range if hasattr(model, "initializer_range")\
else model.bert.config["initializer_range"]
new_state_dict = dict()
for n, p in state_dict.items():
if "layer_norm" not in p.name:
dtype_str = "float32"
if str(p.dtype) == "VarType.FP64":
dtype_str = "float64"
new_state_dict[p.name] = np.random.normal(
loc=0.0, scale=scale, size=p.shape).astype(dtype_str)
return new_state_dict
def set_seed(seed):
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
def do_train(args):
    # Initialize the paddle execution environment
paddle.enable_static()
place = paddle.CUDAPlace(0)
# Set the random seed
set_seed(args.seed)
# Define the input data in the static mode
main_program = paddle.static.default_main_program()
startup_program = paddle.static.default_startup_program()
data_holders = create_data_holder(args)
[
input_ids, segment_ids, input_mask, masked_lm_positions,
masked_lm_labels, next_sentence_labels, masked_lm_scale
] = data_holders
# Define the model structure in static mode
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
config = model_class.pretrained_init_configuration[args.model_name_or_path]
if config["vocab_size"] % 8 != 0:
config["vocab_size"] += 8 - (config["vocab_size"] % 8)
model = BertForPretraining(BertModel(**config))
criterion = BertPretrainingCriterion(model.bert.config["vocab_size"])
prediction_scores, seq_relationship_score = model(
input_ids=input_ids,
token_type_ids=segment_ids,
attention_mask=input_mask,
masked_positions=masked_lm_positions)
loss = criterion(prediction_scores, seq_relationship_score,
masked_lm_labels, next_sentence_labels, masked_lm_scale)
    # Define the dynamic learning rate scheduler and optimizer. Note that this
    # script requires --max_steps > 0: the fallback below refers to
    # train_data_loader and args.num_train_epochs, neither of which is defined
    # at this point.
lr_scheduler = paddle.optimizer.lr.LambdaDecay(
args.learning_rate,
lambda current_step, num_warmup_steps=args.warmup_steps,
num_training_steps=args.max_steps if args.max_steps > 0 else
(len(train_data_loader) * args.num_train_epochs): float(
current_step) / float(max(1, num_warmup_steps))
if current_step < num_warmup_steps else max(
0.0,
float(num_training_steps - current_step) / float(
max(1, num_training_steps - num_warmup_steps))))
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
epsilon=args.adam_epsilon,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in [
p.name for n, p in model.named_parameters()
if not any(nd in n for nd in ["bias", "norm"])
])
if args.use_amp:
amp_list = paddle.fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
custom_white_list=['layer_norm', 'softmax'])
optimizer = paddle.fluid.contrib.mixed_precision.decorate(
optimizer,
amp_list,
init_loss_scaling=args.scale_loss,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling)
optimizer.minimize(loss)
# Define the Executor for running the static model
exe = paddle.static.Executor(place)
exe.run(startup_program)
state_dict = model.state_dict()
# Use the state dict to update the parameter
reset_state_dict = reset_program_state_dict(model, state_dict)
paddle.static.set_program_state(main_program, reset_state_dict)
# Construct the compiled program
main_program = construct_compiled_program(main_program, loss)
global_step = 0
tic_train = time.time()
epoch = 0
while True:
files = [
os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)
if os.path.isfile(os.path.join(args.input_dir, f)) and "training" in
f
]
files.sort()
random.Random(args.seed + epoch).shuffle(files)
for f_id in range(0, len(files)):
train_data_loader, _ = create_pretraining_dataset(
files[f_id], args.max_predictions_per_seq, args, data_holders)
for step, batch in enumerate(train_data_loader):
global_step += 1
                loss_return = exe.run(main_program,
                                      feed=batch,
                                      fetch_list=[loss])
                # In the 2.0 API, the scheduler must be stepped manually to update the learning rate
lr_scheduler.step()
if global_step % args.logging_steps == 0:
time_cost = time.time() - tic_train
print(
"global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s, ips :%.2f sequences/s"
% (global_step, epoch, step, loss_return[0],
args.logging_steps / time_cost,
args.logging_steps * args.batch_size / time_cost))
tic_train = time.time()
if global_step % args.save_steps == 0:
output_dir = os.path.join(args.output_dir,
"model_%d" % global_step)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
                    # TODO(fangzeyang): Update the save_params to paddle.static
paddle.fluid.io.save_params(exe, output_dir)
tokenizer.save_pretrained(output_dir)
if global_step >= args.max_steps:
del train_data_loader
return
del train_data_loader
epoch += 1
if __name__ == "__main__":
args = parse_args()
do_train(args)
# Transformer Benchmark with Fleet API
### Static Graph
For single-machine multi-card training, use the following command:
``` shell
cd static/
export CUDA_VISIBLE_DEVICES=0
python3 train.py
```
### Dynamic Graph
For single-machine single-card training, use the following command:
``` shell
cd dygraph/
export CUDA_VISIBLE_DEVICES=0
python3 train.py
```
For single-machine multi-card training, use the following command:
``` shell
cd dygraph/
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python3 -m paddle.distributed.launch --selected_gpus=0,1,2,3,4,5,6,7 train.py
```
# The frequency to save trained models when training.
save_step: 10000
# The frequency to fetch and print output when training.
print_step: 10
# Path of the checkpoint, to resume the previous training
init_from_checkpoint: ""
# Path of the pretrain model, to better solve the current task
init_from_pretrain_model: ""
# Path of trained parameter, to make prediction
init_from_params: "./trained_models/step_final/"
# The directory for saving model
save_model: "trained_models"
# The directory for saving inference model.
inference_model_dir: "infer_model"
# Set seed for CE or debug
random_seed: None
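# (The training scripts parse this value with eval(), so Python literals
# such as None or an integer both work here.)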
# The pattern to match training data files.
training_file: "../gen_data/wmt14_ende_data_bpe/train.tok.clean.bpe.33708.en-de"
# The pattern to match validation data files.
validation_file: "../gen_data/wmt14_ende_data_bpe/newstest2013.tok.bpe.33708.en-de"
# The pattern to match test data files.
predict_file: "../gen_data/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en-de"
# The file to output the translation results of predict_file to.
output_file: "predict.txt"
# The path of vocabulary file of source language.
src_vocab_fpath: "../gen_data/wmt14_ende_data_bpe/vocab_all.bpe.33708"
# The path of vocabulary file of target language.
trg_vocab_fpath: "../gen_data/wmt14_ende_data_bpe/vocab_all.bpe.33708"
# The <bos>, <eos> and <unk> tokens in the dictionary.
special_token: ["<s>", "<e>", "<unk>"]
# Whether to use cuda
use_gpu: True
# Args for reader, see reader.py for details
token_delimiter: " "
use_token_batch: True
pool_size: 200000
sort_type: "global"
shuffle: False
shuffle_batch: False
batch_size: 4096
infer_batch_size: 16
# Hyperparameters for training:
# The number of epochs for training
epoch: 30
# The hyperparameters for the Adam optimizer.
# This static learning_rate will be multiplied with the learning rate
# derived by the LearningRateScheduler to get the final learning rate.
learning_rate: 2.0
beta1: 0.9
beta2: 0.997
eps: 1e-9
# The parameters for learning rate scheduling.
warmup_steps: 4000
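# With NoamDecay, the effective rate at step s is
#   learning_rate * d_model^-0.5 * min(s^-0.5, s * warmup_steps^-1.5),
# i.e. linear warmup for warmup_steps followed by inverse-sqrt decay.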
# The weight used to mix up the ground-truth distribution and the fixed
# uniform distribution in label smoothing when training.
# Set this as zero if label smoothing is not wanted.
label_smooth_eps: 0.1
# Hyperparameters for generation:
# The parameters for beam search.
beam_size: 5
max_out_len: 1024
# The number of decoded sentences to output.
n_best: 1
# Hyperparameters for the model:
# The following five vocabulary-related configurations will be set
# automatically according to the passed vocabulary path and special tokens.
# Size of source word dictionary.
src_vocab_size: 10000
# Size of target word dictionary.
trg_vocab_size: 10000
# Index for <bos> token
bos_idx: 0
# Index for <eos> token
eos_idx: 1
# Index for <unk> token
unk_idx: 2
# Max length of sequences deciding the size of position encoding table.
max_length: 1024
# The dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
d_model: 1024
# Size of the hidden layer in position-wise feed-forward networks.
d_inner_hid: 4096
# Number of heads used in multi-head attention.
n_head: 16
# Number of sub-layers to be stacked in the encoder and decoder.
n_layer: 6
# Dropout rates.
dropout: 0.1
# The flag indicating whether to share embedding and softmax weights.
# Vocabularies in source and target should be same for weight sharing.
weight_sharing: True
max_iter: None
import sys
import os
import argparse
import logging
import numpy as np
import paddle
import yaml
from attrdict import AttrDict
from pprint import pprint
from paddlenlp.transformers import InferTransformerModel, position_encoding_init
sys.path.append("../")
import reader
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--config",
default="../configs/transformer.big.yaml",
type=str,
help="Path of the config file. ")
args = parser.parse_args()
return args
def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
"""
Post-process the decoded sequence.
"""
eos_pos = len(seq) - 1
for i, idx in enumerate(seq):
if idx == eos_idx:
eos_pos = i
break
seq = [
idx for idx in seq[:eos_pos + 1]
if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
]
return seq
def do_predict(args):
if args.use_gpu:
place = "gpu:0"
else:
place = "cpu"
paddle.set_device(place)
# Define data loader
(test_loader,
test_steps_fn), trg_idx2word = reader.create_infer_loader(args)
# Define model
transformer = InferTransformerModel(
src_vocab_size=args.src_vocab_size,
trg_vocab_size=args.trg_vocab_size,
max_length=args.max_length + 1,
n_layer=args.n_layer,
n_head=args.n_head,
d_model=args.d_model,
d_inner_hid=args.d_inner_hid,
dropout=args.dropout,
weight_sharing=args.weight_sharing,
bos_id=args.bos_idx,
eos_id=args.eos_idx,
beam_size=args.beam_size,
max_out_len=args.max_out_len)
# Load the trained model
assert args.init_from_params, (
"Please set init_from_params to load the infer model.")
model_dict = paddle.load(
os.path.join(args.init_from_params, "transformer.pdparams"))
    # To avoid sequence lengths beyond those seen in training, reset the
    # size of the position encoding table to max_length
model_dict["encoder.pos_encoder.weight"] = position_encoding_init(
args.max_length + 1, args.d_model)
model_dict["decoder.pos_encoder.weight"] = position_encoding_init(
args.max_length + 1, args.d_model)
transformer.load_dict(model_dict)
# Set evaluate mode
transformer.eval()
    with open(args.output_file, "w") as f:
        for (src_word, ) in test_loader:
            finished_seq = transformer(src_word=src_word)
            finished_seq = finished_seq.numpy().transpose([0, 2, 1])
            for ins in finished_seq:
                for beam_idx, beam in enumerate(ins):
                    if beam_idx >= args.n_best:
                        break
                    id_list = post_process_seq(beam, args.bos_idx,
                                               args.eos_idx)
                    word_list = [trg_idx2word[id] for id in id_list]
                    sequence = " ".join(word_list) + "\n"
                    f.write(sequence)
if __name__ == "__main__":
ARGS = parse_args()
yaml_file = ARGS.config
with open(yaml_file, 'rt') as f:
args = AttrDict(yaml.safe_load(f))
pprint(args)
do_predict(args)
import os
import time
import sys
import logging
import argparse
import numpy as np
import yaml
from attrdict import AttrDict
from pprint import pprint
import paddle
import paddle.distributed as dist
from paddlenlp.transformers import TransformerModel, CrossEntropyCriterion, position_encoding_init
sys.path.append("../")
import reader
from utils.record import AverageStatistical
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--config",
default="../configs/transformer.big.yaml",
type=str,
help="Path of the config file. ")
args = parser.parse_args()
return args
def do_train(args):
if args.use_gpu:
rank = dist.get_rank()
trainer_count = dist.get_world_size()
else:
rank = 0
trainer_count = 1
if trainer_count > 1:
dist.init_parallel_env()
# Set seed for CE
random_seed = eval(str(args.random_seed))
if random_seed is not None:
paddle.seed(random_seed)
# Define data loader
(train_loader, train_steps_fn), (eval_loader,
eval_steps_fn) = reader.create_data_loader(
args, trainer_count, rank)
# Define model
transformer = TransformerModel(
src_vocab_size=args.src_vocab_size,
trg_vocab_size=args.trg_vocab_size,
max_length=args.max_length + 1,
n_layer=args.n_layer,
n_head=args.n_head,
d_model=args.d_model,
d_inner_hid=args.d_inner_hid,
dropout=args.dropout,
weight_sharing=args.weight_sharing,
bos_id=args.bos_idx,
eos_id=args.eos_idx)
# Define loss
criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx)
scheduler = paddle.optimizer.lr.NoamDecay(
args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)
# Define optimizer
optimizer = paddle.optimizer.Adam(
learning_rate=scheduler,
beta1=args.beta1,
beta2=args.beta2,
epsilon=float(args.eps),
parameters=transformer.parameters())
# Init from some checkpoint, to resume the previous training
if args.init_from_checkpoint:
model_dict = paddle.load(
os.path.join(args.init_from_checkpoint, "transformer.pdparams"))
opt_dict = paddle.load(
os.path.join(args.init_from_checkpoint, "transformer.pdopt"))
transformer.set_state_dict(model_dict)
optimizer.set_state_dict(opt_dict)
print("loaded from checkpoint.")
# Init from some pretrain models, to better solve the current task
if args.init_from_pretrain_model:
model_dict = paddle.load(
os.path.join(args.init_from_pretrain_model, "transformer.pdparams"))
transformer.set_state_dict(model_dict)
print("loaded from pre-trained model.")
if trainer_count > 1:
transformer = paddle.DataParallel(transformer)
# The best cross-entropy value with label smoothing
loss_normalizer = -(
(1. - args.label_smooth_eps) * np.log(
(1. - args.label_smooth_eps)) + args.label_smooth_eps *
np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20))
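    # This is the entropy of the label-smoothed target distribution: the
    # smallest cross-entropy a model can reach is
    # -((1 - eps) * log(1 - eps) + eps * log(eps / (V - 1))), so subtracting
    # it from the raw loss gives the "normalized loss" reported in the logs.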
step_idx = 0
# For benchmark
reader_cost_avg = AverageStatistical()
batch_cost_avg = AverageStatistical()
batch_ips_avg = AverageStatistical()
# Train loop
for pass_id in range(args.epoch):
epoch_start = time.time()
batch_id = 0
batch_start = time.time()
for input_data in train_loader:
            # NOTE: Used for benchmarking; max_iter defaults to None.
if args.max_iter and step_idx == args.max_iter:
return
train_reader_cost = time.time() - batch_start
(src_word, trg_word, lbl_word) = input_data
logits = transformer(src_word=src_word, trg_word=trg_word)
sum_cost, avg_cost, token_num = criterion(logits, lbl_word)
avg_cost.backward()
optimizer.step()
optimizer.clear_grad()
tokens_per_cards = token_num.numpy()
train_batch_cost = time.time() - batch_start
reader_cost_avg.record(train_reader_cost)
batch_cost_avg.record(train_batch_cost)
batch_ips_avg.record(train_batch_cost, tokens_per_cards)
            # NOTE: For benchmark, loss information on all cards will be printed.
if step_idx % args.print_step == 0:
total_avg_cost = avg_cost.numpy()
if step_idx == 0:
logger.info(
"step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
"normalized loss: %f, ppl: %f " %
(step_idx, pass_id, batch_id, total_avg_cost,
total_avg_cost - loss_normalizer,
np.exp([min(total_avg_cost, 100)])))
else:
train_avg_batch_cost = args.print_step / batch_cost_avg.get_total_time(
)
logger.info(
"step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
"normalized loss: %f, ppl: %f, avg_speed: %.2f step/sec, "
"batch_cost: %.5f sec, reader_cost: %.5f sec, tokens: %d, "
"ips: %.5f words/sec" %
(step_idx, pass_id, batch_id, total_avg_cost,
total_avg_cost - loss_normalizer,
np.exp([min(total_avg_cost, 100)]),
train_avg_batch_cost, batch_cost_avg.get_average(),
reader_cost_avg.get_average(),
batch_ips_avg.get_total_cnt(),
batch_ips_avg.get_average_per_sec()))
reader_cost_avg.reset()
batch_cost_avg.reset()
batch_ips_avg.reset()
if step_idx % args.save_step == 0 and step_idx != 0:
# Validation
if args.validation_file:
transformer.eval()
total_sum_cost = 0
total_token_num = 0
with paddle.no_grad():
for input_data in eval_loader:
(src_word, trg_word, lbl_word) = input_data
logits = transformer(
src_word=src_word, trg_word=trg_word)
sum_cost, avg_cost, token_num = criterion(logits,
lbl_word)
total_sum_cost += sum_cost.numpy()
total_token_num += token_num.numpy()
total_avg_cost = total_sum_cost / total_token_num
logger.info("validation, step_idx: %d, avg loss: %f, "
"normalized loss: %f, ppl: %f" %
(step_idx, total_avg_cost,
total_avg_cost - loss_normalizer,
np.exp([min(total_avg_cost, 100)])))
transformer.train()
if args.save_model and rank == 0:
model_dir = os.path.join(args.save_model,
"step_" + str(step_idx))
if not os.path.exists(model_dir):
os.makedirs(model_dir)
paddle.save(transformer.state_dict(),
os.path.join(model_dir, "transformer.pdparams"))
paddle.save(optimizer.state_dict(),
os.path.join(model_dir, "transformer.pdopt"))
batch_id += 1
step_idx += 1
scheduler.step()
batch_start = time.time()
train_epoch_cost = time.time() - epoch_start
logger.info("train epoch: %d, epoch_cost: %.5f s" %
(pass_id, train_epoch_cost))
if args.save_model and rank == 0:
model_dir = os.path.join(args.save_model, "step_final")
if not os.path.exists(model_dir):
os.makedirs(model_dir)
paddle.save(transformer.state_dict(),
os.path.join(model_dir, "transformer.pdparams"))
paddle.save(optimizer.state_dict(),
os.path.join(model_dir, "transformer.pdopt"))
if __name__ == "__main__":
ARGS = parse_args()
yaml_file = ARGS.config
with open(yaml_file, 'rt') as f:
args = AttrDict(yaml.safe_load(f))
pprint(args)
do_train(args)
#! /usr/bin/env bash
set -e
OUTPUT_DIR=$PWD/gen_data
###############################################################################
# change these variables for other WMT data
###############################################################################
OUTPUT_DIR_DATA="${OUTPUT_DIR}/wmt14_ende_data"
OUTPUT_DIR_BPE_DATA="${OUTPUT_DIR}/wmt14_ende_data_bpe"
LANG1="en"
LANG2="de"
# each of TRAIN_DATA: data_url data_file_lang1 data_file_lang2
TRAIN_DATA=(
'http://statmt.org/wmt13/training-parallel-europarl-v7.tgz'
'europarl-v7.de-en.en' 'europarl-v7.de-en.de'
'http://statmt.org/wmt13/training-parallel-commoncrawl.tgz'
'commoncrawl.de-en.en' 'commoncrawl.de-en.de'
'http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz'
'news-commentary-v12.de-en.en' 'news-commentary-v12.de-en.de'
)
# each of DEV_TEST_DATA: data_url data_file_lang1 data_file_lang2
# source & reference
DEV_TEST_DATA=(
'http://data.statmt.org/wmt17/translation-task/dev.tgz'
'newstest2013-ref.de.sgm' 'newstest2013-src.en.sgm'
'http://statmt.org/wmt14/test-full.tgz'
'newstest2014-deen-ref.en.sgm' 'newstest2014-deen-src.de.sgm'
)
###############################################################################
###############################################################################
# change these variables for other WMT data
###############################################################################
# OUTPUT_DIR_DATA="${OUTPUT_DIR}/wmt14_enfr_data"
# OUTPUT_DIR_BPE_DATA="${OUTPUT_DIR}/wmt14_enfr_data_bpe"
# LANG1="en"
# LANG2="fr"
# # each of TRAIN_DATA: ata_url data_tgz data_file
# TRAIN_DATA=(
# 'http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz'
# 'commoncrawl.fr-en.en' 'commoncrawl.fr-en.fr'
# 'http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz'
# 'training/europarl-v7.fr-en.en' 'training/europarl-v7.fr-en.fr'
# 'http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz'
# 'training/news-commentary-v9.fr-en.en' 'training/news-commentary-v9.fr-en.fr'
# 'http://www.statmt.org/wmt10/training-giga-fren.tar'
# 'giga-fren.release2.fixed.en.*' 'giga-fren.release2.fixed.fr.*'
# 'http://www.statmt.org/wmt13/training-parallel-un.tgz'
# 'un/undoc.2000.fr-en.en' 'un/undoc.2000.fr-en.fr'
# )
# # each of DEV_TEST_DATA: data_url data_tgz data_file_lang1 data_file_lang2
# DEV_TEST_DATA=(
# 'http://data.statmt.org/wmt16/translation-task/dev.tgz'
# '.*/newstest201[45]-fren-ref.en.sgm' '.*/newstest201[45]-fren-src.fr.sgm'
# 'http://data.statmt.org/wmt16/translation-task/test.tgz'
# '.*/newstest2016-fren-ref.en.sgm' '.*/newstest2016-fren-src.fr.sgm'
# )
###############################################################################
mkdir -p $OUTPUT_DIR_DATA $OUTPUT_DIR_BPE_DATA
# Extract training data
for ((i=0;i<${#TRAIN_DATA[@]};i+=3)); do
data_url=${TRAIN_DATA[i]}
data_tgz=${data_url##*/} # training-parallel-commoncrawl.tgz
data=${data_tgz%.*} # training-parallel-commoncrawl
data_lang1=${TRAIN_DATA[i+1]}
data_lang2=${TRAIN_DATA[i+2]}
if [ ! -e ${OUTPUT_DIR_DATA}/${data_tgz} ]; then
echo "Download "${data_url}
echo "Dir "${OUTPUT_DIR_DATA}/${data_tgz}
wget -O ${OUTPUT_DIR_DATA}/${data_tgz} ${data_url}
fi
if [ ! -d ${OUTPUT_DIR_DATA}/${data} ]; then
echo "Extract "${data_tgz}
mkdir -p ${OUTPUT_DIR_DATA}/${data}
tar_type=${data_tgz:0-3}
if [ ${tar_type} == "tar" ]; then
tar -xvf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data}
else
tar -xvzf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data}
fi
fi
# concatenate all training data
for data_lang in $data_lang1 $data_lang2; do
for f in `find ${OUTPUT_DIR_DATA}/${data} -regex ".*/${data_lang}"`; do
data_dir=`dirname $f`
data_file=`basename $f`
f_base=${f%.*}
f_ext=${f##*.}
if [ $f_ext == "gz" ]; then
gunzip $f
l=${f_base##*.}
f_base=${f_base%.*}
else
l=${f_ext}
fi
if [ $i -eq 0 ]; then
cat ${f_base}.$l > ${OUTPUT_DIR_DATA}/train.$l
else
cat ${f_base}.$l >> ${OUTPUT_DIR_DATA}/train.$l
fi
done
done
done
# Clone mosesdecoder
if [ ! -d ${OUTPUT_DIR}/mosesdecoder ]; then
echo "Cloning moses for data processing"
git clone https://github.com/moses-smt/mosesdecoder.git ${OUTPUT_DIR}/mosesdecoder
fi
# Extract develop and test data
dev_test_data=""
for ((i=0;i<${#DEV_TEST_DATA[@]};i+=3)); do
data_url=${DEV_TEST_DATA[i]}
data_tgz=${data_url##*/} # training-parallel-commoncrawl.tgz
data=${data_tgz%.*} # training-parallel-commoncrawl
data_lang1=${DEV_TEST_DATA[i+1]}
data_lang2=${DEV_TEST_DATA[i+2]}
if [ ! -e ${OUTPUT_DIR_DATA}/${data_tgz} ]; then
echo "Download "${data_url}
wget -O ${OUTPUT_DIR_DATA}/${data_tgz} ${data_url}
fi
if [ ! -d ${OUTPUT_DIR_DATA}/${data} ]; then
echo "Extract "${data_tgz}
mkdir -p ${OUTPUT_DIR_DATA}/${data}
tar_type=${data_tgz:0-3}
if [ ${tar_type} == "tar" ]; then
tar -xvf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data}
else
tar -xvzf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data}
fi
fi
for data_lang in $data_lang1 $data_lang2; do
for f in `find ${OUTPUT_DIR_DATA}/${data} -regex ".*/${data_lang}"`; do
echo "input-from-sgm"
data_dir=`dirname $f`
data_file=`basename $f`
data_out=`echo ${data_file} | cut -d '-' -f 1` # newstest2016
l=`echo ${data_file} | cut -d '.' -f 2` # en
dev_test_data="${dev_test_data}\|${data_out}" # to make regexp
if [ ! -e ${OUTPUT_DIR_DATA}/${data_out}.$l ]; then
${OUTPUT_DIR}/mosesdecoder/scripts/ems/support/input-from-sgm.perl \
< $f > ${OUTPUT_DIR_DATA}/${data_out}.$l
fi
done
done
done
# Tokenize data
for l in ${LANG1} ${LANG2}; do
for f in `ls ${OUTPUT_DIR_DATA}/*.$l | grep "\(train\|newstest2013\)\.$l$"`; do
f_base=${f%.*} # dir/train dir/newstest2013
f_out=$f_base.tok.$l
f_tmp=$f_base.tmp.$l
if [ ! -e $f_out ]; then
echo "Tokenize "$f
cat $f | \
${OUTPUT_DIR}/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl $l | \
${OUTPUT_DIR}/mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl | \
${OUTPUT_DIR}/mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $l -threads 8 >> $f_out
echo $f_out
fi
done
done
for l in ${LANG1} ${LANG2}; do
for f in `ls ${OUTPUT_DIR_DATA}/*.$l | grep "\(newstest2014\)\.$l$"`; do
f_base=${f%.*} # dir/newstest2014
f_out=$f_base.tok.$l
if [ ! -e $f_out ]; then
echo "Tokenize "$f
cat $f | \
${OUTPUT_DIR}/mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $l -threads 8 >> $f_out
echo $f_out
fi
done
done
# Clean data
for f in ${OUTPUT_DIR_DATA}/train.${LANG1} ${OUTPUT_DIR_DATA}/train.tok.${LANG1}; do
f_base=${f%.*} # dir/train dir/train.tok
f_out=${f_base}.clean
if [ ! -e $f_out.${LANG1} ] && [ ! -e $f_out.${LANG2} ]; then
echo "Clean "${f_base}
${OUTPUT_DIR}/mosesdecoder/scripts/training/clean-corpus-n.perl $f_base ${LANG1} ${LANG2} ${f_out} 1 256
fi
done
python -m pip install subword-nmt
# Generate BPE data and vocabulary
for num_operations in 33708; do
if [ ! -e ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations} ]; then
echo "Learn BPE with ${num_operations} merge operations"
cat ${OUTPUT_DIR_DATA}/train.tok.clean.${LANG1} ${OUTPUT_DIR_DATA}/train.tok.clean.${LANG2} | \
subword-nmt learn-bpe -s $num_operations > ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations}
fi
for l in ${LANG1} ${LANG2}; do
for f in `ls ${OUTPUT_DIR_DATA}/*.$l | grep "\(train${dev_test_data}\)\.tok\(\.clean\)\?\.$l$"`; do
f_base=${f%.*} # dir/train.tok dir/train.tok.clean dir/newstest2016.tok
f_base=${f_base##*/} # train.tok train.tok.clean newstest2016.tok
f_out=${OUTPUT_DIR_BPE_DATA}/${f_base}.bpe.${num_operations}.$l
if [ ! -e $f_out ]; then
echo "Apply BPE to "$f
subword-nmt apply-bpe -c ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations} < $f > $f_out
fi
done
done
if [ ! -e ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations} ]; then
echo "Create vocabulary for BPE data"
cat ${OUTPUT_DIR_BPE_DATA}/train.tok.clean.bpe.${num_operations}.${LANG1} ${OUTPUT_DIR_BPE_DATA}/train.tok.clean.bpe.${num_operations}.${LANG2} | \
subword-nmt get-vocab | cut -f1 -d ' ' > ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations}
fi
done
# Adapt to the reader
for f in ${OUTPUT_DIR_BPE_DATA}/*.bpe.${num_operations}.${LANG1}; do
f_base=${f%.*} # dir/train.tok.clean.bpe.32000 dir/newstest2016.tok.bpe.32000
f_out=${f_base}.${LANG1}-${LANG2}
if [ ! -e $f_out ]; then
paste -d '\t' $f_base.${LANG1} $f_base.${LANG2} > $f_out
fi
done
if [ ! -e ${OUTPUT_DIR_BPE_DATA}/vocab_all.bpe.${num_operations} ]; then
sed '1i\<s>\n<e>\n<unk>' ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations} > ${OUTPUT_DIR_BPE_DATA}/vocab_all.bpe.${num_operations}
fi
echo "All done."
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import sys
import os
import io
import itertools
from functools import partial
import numpy as np
from paddle.io import BatchSampler, DataLoader, Dataset
from paddlenlp.data import Pad
def create_infer_loader(args):
dataset = TransformerDataset(
fpattern=args.predict_file,
src_vocab_fpath=args.src_vocab_fpath,
trg_vocab_fpath=args.trg_vocab_fpath,
token_delimiter=args.token_delimiter,
start_mark=args.special_token[0],
end_mark=args.special_token[1],
unk_mark=args.special_token[2])
args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \
args.unk_idx = dataset.get_vocab_summary()
trg_idx2word = TransformerDataset.load_dict(
dict_path=args.trg_vocab_fpath, reverse=True)
batch_sampler = TransformerBatchSampler(
dataset=dataset,
use_token_batch=False,
batch_size=args.infer_batch_size,
max_length=args.max_length)
data_loader = DataLoader(
dataset=dataset,
batch_sampler=batch_sampler,
collate_fn=partial(
prepare_infer_input,
bos_idx=args.bos_idx,
eos_idx=args.eos_idx,
pad_idx=args.eos_idx),
num_workers=0,
return_list=True)
data_loaders = (data_loader, batch_sampler.__len__)
return data_loaders, trg_idx2word
def create_data_loader(args, world_size=1, rank=0):
data_loaders = [(None, None)] * 2
data_files = [args.training_file, args.validation_file
] if args.validation_file else [args.training_file]
for i, data_file in enumerate(data_files):
dataset = TransformerDataset(
fpattern=data_file,
src_vocab_fpath=args.src_vocab_fpath,
trg_vocab_fpath=args.trg_vocab_fpath,
token_delimiter=args.token_delimiter,
start_mark=args.special_token[0],
end_mark=args.special_token[1],
unk_mark=args.special_token[2])
args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \
args.unk_idx = dataset.get_vocab_summary()
batch_sampler = TransformerBatchSampler(
dataset=dataset,
batch_size=args.batch_size,
pool_size=args.pool_size,
sort_type=args.sort_type,
shuffle=args.shuffle,
shuffle_batch=args.shuffle_batch,
use_token_batch=args.use_token_batch,
max_length=args.max_length,
distribute_mode=True if i == 0 else False,
world_size=world_size,
rank=rank)
data_loader = DataLoader(
dataset=dataset,
batch_sampler=batch_sampler,
collate_fn=partial(
prepare_train_input,
bos_idx=args.bos_idx,
eos_idx=args.eos_idx,
pad_idx=args.bos_idx),
num_workers=0,
return_list=True)
data_loaders[i] = (data_loader, batch_sampler.__len__)
return data_loaders
def prepare_train_input(insts, bos_idx, eos_idx, pad_idx):
"""
Put all padded data needed by training into a list.
"""
word_pad = Pad(pad_idx)
src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
trg_word = word_pad([[bos_idx] + inst[1] for inst in insts])
lbl_word = np.expand_dims(
word_pad([inst[1] + [eos_idx] for inst in insts]), axis=2)
data_inputs = [src_word, trg_word, lbl_word]
return data_inputs
def prepare_infer_input(insts, bos_idx, eos_idx, pad_idx):
"""
Put all padded data needed by beam search decoder into a list.
"""
word_pad = Pad(pad_idx)
src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
return [src_word, ]
class SortType(object):
GLOBAL = 'global'
POOL = 'pool'
NONE = "none"
class Converter(object):
def __init__(self, vocab, beg, end, unk, delimiter, add_beg, add_end):
self._vocab = vocab
self._beg = beg
self._end = end
self._unk = unk
self._delimiter = delimiter
self._add_beg = add_beg
self._add_end = add_end
def __call__(self, sentence):
return ([self._beg] if self._add_beg else []) + [
self._vocab.get(w, self._unk)
for w in sentence.split(self._delimiter)
] + ([self._end] if self._add_end else [])
class ComposedConverter(object):
def __init__(self, converters):
self._converters = converters
def __call__(self, fields):
return [
converter(field)
for field, converter in zip(fields, self._converters)
]
class SentenceBatchCreator(object):
def __init__(self, batch_size):
self.batch = []
self._batch_size = batch_size
def append(self, info):
self.batch.append(info)
if len(self.batch) == self._batch_size:
tmp = self.batch
self.batch = []
return tmp
class TokenBatchCreator(object):
def __init__(self, batch_size):
self.batch = []
self.max_len = -1
self._batch_size = batch_size
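    # The batch "cost" is approximated as max_len * num_sentences (the padded
    # token count); append() emits the batch once adding one more sample would
    # push this estimate past _batch_size tokens.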
def append(self, info):
cur_len = info.max_len
max_len = max(self.max_len, cur_len)
if max_len * (len(self.batch) + 1) > self._batch_size:
result = self.batch
self.batch = [info]
self.max_len = cur_len
return result
else:
self.max_len = max_len
self.batch.append(info)
class SampleInfo(object):
def __init__(self, i, lens):
self.i = i
# take bos and eos into account
self.min_len = min(lens[0] + 1, lens[1] + 1)
self.max_len = max(lens[0] + 1, lens[1] + 1)
self.src_len = lens[0]
self.trg_len = lens[1]
class MinMaxFilter(object):
def __init__(self, max_len, min_len, underlying_creator):
self._min_len = min_len
self._max_len = max_len
self._creator = underlying_creator
def append(self, info):
if info.max_len > self._max_len or info.min_len < self._min_len:
return
else:
return self._creator.append(info)
@property
def batch(self):
return self._creator.batch
class TransformerDataset(Dataset):
def __init__(self,
src_vocab_fpath,
trg_vocab_fpath,
fpattern,
field_delimiter="\t",
token_delimiter=" ",
start_mark="<s>",
end_mark="<e>",
unk_mark="<unk>",
trg_fpattern=None):
self._src_vocab = self.load_dict(src_vocab_fpath)
self._trg_vocab = self.load_dict(trg_vocab_fpath)
self._bos_idx = self._src_vocab[start_mark]
self._eos_idx = self._src_vocab[end_mark]
self._unk_idx = self._src_vocab[unk_mark]
self._field_delimiter = field_delimiter
self._token_delimiter = token_delimiter
self.load_src_trg_ids(fpattern, trg_fpattern)
def load_src_trg_ids(self, fpattern, trg_fpattern=None):
src_converter = Converter(
vocab=self._src_vocab,
beg=self._bos_idx,
end=self._eos_idx,
unk=self._unk_idx,
delimiter=self._token_delimiter,
add_beg=False,
add_end=False)
trg_converter = Converter(
vocab=self._trg_vocab,
beg=self._bos_idx,
end=self._eos_idx,
unk=self._unk_idx,
delimiter=self._token_delimiter,
add_beg=False,
add_end=False)
converters = ComposedConverter([src_converter, trg_converter])
self._src_seq_ids = []
self._trg_seq_ids = []
self._sample_infos = []
slots = [self._src_seq_ids, self._trg_seq_ids]
for i, line in enumerate(self._load_lines(fpattern, trg_fpattern)):
lens = []
for field, slot in zip(converters(line), slots):
slot.append(field)
lens.append(len(field))
self._sample_infos.append(SampleInfo(i, lens))
def _load_lines(self, fpattern, trg_fpattern=None):
fpaths = glob.glob(fpattern)
        fpaths = sorted(fpaths)  # TODO: Add custom sort
assert len(fpaths) > 0, "no matching file to the provided data path"
(f_mode, f_encoding, endl) = ("r", "utf8", "\n")
if trg_fpattern is None:
for fpath in fpaths:
with io.open(fpath, f_mode, encoding=f_encoding) as f:
for line in f:
fields = line.strip(endl).split(self._field_delimiter)
yield fields
else:
# separated source and target language data files
# assume we can get aligned data by sort the two language files
trg_fpaths = glob.glob(trg_fpattern)
trg_fpaths = sorted(trg_fpaths)
            assert len(fpaths) == len(
                trg_fpaths
            ), "the number of source language data files must match " \
               "that of the target language"
for fpath, trg_fpath in zip(fpaths, trg_fpaths):
with io.open(fpath, f_mode, encoding=f_encoding) as f:
with io.open(
trg_fpath, f_mode, encoding=f_encoding) as trg_f:
for line in zip(f, trg_f):
fields = [field.strip(endl) for field in line]
yield fields
@staticmethod
def load_dict(dict_path, reverse=False):
word_dict = {}
(f_mode, f_encoding, endl) = ("r", "utf8", "\n")
with io.open(dict_path, f_mode, encoding=f_encoding) as fdict:
for idx, line in enumerate(fdict):
if reverse:
word_dict[idx] = line.strip(endl)
else:
word_dict[line.strip(endl)] = idx
return word_dict
def get_vocab_summary(self):
return len(self._src_vocab), len(
self._trg_vocab), self._bos_idx, self._eos_idx, self._unk_idx
def __getitem__(self, idx):
return (self._src_seq_ids[idx], self._trg_seq_ids[idx]
) if self._trg_seq_ids else self._src_seq_ids[idx]
def __len__(self):
return len(self._sample_infos)
class TransformerBatchSampler(BatchSampler):
def __init__(self,
dataset,
batch_size,
pool_size=10000,
sort_type=SortType.NONE,
min_length=0,
max_length=100,
shuffle=False,
shuffle_batch=False,
use_token_batch=False,
clip_last_batch=False,
distribute_mode=True,
seed=0,
world_size=1,
rank=0):
for arg, value in locals().items():
if arg != "self":
setattr(self, "_" + arg, value)
self._random = np.random
self._random.seed(seed)
# for multi-devices
self._distribute_mode = distribute_mode
self._nranks = world_size
self._local_rank = rank
def __iter__(self):
# global sort or global shuffle
if self._sort_type == SortType.GLOBAL:
infos = sorted(self._dataset._sample_infos, key=lambda x: x.trg_len)
infos = sorted(infos, key=lambda x: x.src_len)
else:
if self._shuffle:
infos = self._dataset._sample_infos
self._random.shuffle(infos)
else:
infos = self._dataset._sample_infos
if self._sort_type == SortType.POOL:
reverse = True
for i in range(0, len(infos), self._pool_size):
# to avoid placing short next to long sentences
reverse = not reverse
infos[i:i + self._pool_size] = sorted(
infos[i:i + self._pool_size],
key=lambda x: x.max_len,
reverse=reverse)
batches = []
batch_creator = TokenBatchCreator(
        batch_creator = TokenBatchCreator(
            self._batch_size
        ) if self._use_token_batch else SentenceBatchCreator(
            self._batch_size * self._nranks)
batch_creator = MinMaxFilter(self._max_length, self._min_length,
batch_creator)
for info in infos:
batch = batch_creator.append(info)
if batch is not None:
batches.append(batch)
if not self._clip_last_batch and len(batch_creator.batch) != 0:
batches.append(batch_creator.batch)
if self._shuffle_batch:
self._random.shuffle(batches)
if not self._use_token_batch:
            # When producing batches according to sequence number, to ensure
            # that neighboring batches, which will be fed and run in parallel,
            # have similar lengths (and thus similar computational cost) after
            # shuffling, we shuffle them as a whole and split them here.
batches = [[
batch[self._batch_size * i:self._batch_size * (i + 1)]
for i in range(self._nranks)
] for batch in batches]
batches = list(itertools.chain.from_iterable(batches))
self.batch_number = (len(batches) + self._nranks - 1) // self._nranks
# for multi-device
for batch_id, batch in enumerate(batches):
if not self._distribute_mode or (
batch_id % self._nranks == self._local_rank):
batch_indices = [info.i for info in batch]
yield batch_indices
if self._distribute_mode and len(batches) % self._nranks != 0:
if self._local_rank >= len(batches) % self._nranks:
# use previous data to pad
yield batch_indices
def __len__(self):
        if hasattr(self, "batch_number"):
return self.batch_number
if not self._use_token_batch:
batch_number = (
len(self._dataset) + self._batch_size * self._nranks - 1) // (
self._batch_size * self._nranks)
else:
# for uncertain batch number, the actual value is self.batch_number
batch_number = sys.maxsize
return batch_number
import os
import time
import sys
import argparse
import logging
import numpy as np
import yaml
from attrdict import AttrDict
from pprint import pprint
import paddle
import paddle.distributed as dist
from paddlenlp.transformers import TransformerModel, CrossEntropyCriterion, position_encoding_init
sys.path.append("../")
import reader
from utils.record import AverageStatistical
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--config",
default="../configs/transformer.big.yaml",
type=str,
help="Path of the config file. ")
args = parser.parse_args()
return args
def batch_creator(loader, trainer_count):
batch = []
for data in loader:
batch.append(data)
if len(batch) == trainer_count:
yield batch
batch = []
# DO NOT drop last.
if len(batch) > 0:
while len(batch) < trainer_count:
batch.append(batch[-1])
yield batch
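# For example, with trainer_count=2 and batches [a, b, c] the generator above
# yields [a, b] and then [c, c], duplicating the final batch rather than
# dropping it, so every device always receives a feed.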
def do_train(args):
paddle.enable_static()
if args.use_gpu:
trainer_count = len(os.environ['CUDA_VISIBLE_DEVICES'].split(','))
place = paddle.set_device("gpu:0")
else:
trainer_count = int(os.environ['CPU_NUM'])
place = paddle.set_device("cpu")
# Set seed for CE
random_seed = eval(str(args.random_seed))
if random_seed is not None:
paddle.seed(random_seed)
# Define data loader
# NOTE: To guarantee all data is involved, use world_size=1 and rank=0.
(train_loader, train_steps_fn), (
eval_loader, eval_steps_fn) = reader.create_data_loader(args)
train_program = paddle.static.Program()
startup_program = paddle.static.Program()
with paddle.static.program_guard(train_program, startup_program):
src_word = paddle.static.data(
name="src_word", shape=[None, None], dtype="int64")
trg_word = paddle.static.data(
name="trg_word", shape=[None, None], dtype="int64")
lbl_word = paddle.static.data(
name="lbl_word", shape=[None, None, 1], dtype="int64")
# Define model
transformer = TransformerModel(
src_vocab_size=args.src_vocab_size,
trg_vocab_size=args.trg_vocab_size,
max_length=args.max_length + 1,
n_layer=args.n_layer,
n_head=args.n_head,
d_model=args.d_model,
d_inner_hid=args.d_inner_hid,
dropout=args.dropout,
weight_sharing=args.weight_sharing,
bos_id=args.bos_idx,
eos_id=args.eos_idx)
# Define loss
criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx)
logits = transformer(src_word=src_word, trg_word=trg_word)
sum_cost, avg_cost, token_num = criterion(logits, lbl_word)
scheduler = paddle.optimizer.lr.NoamDecay(
args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)
# Define optimizer
optimizer = paddle.optimizer.Adam(
learning_rate=scheduler,
beta1=args.beta1,
beta2=args.beta2,
epsilon=float(args.eps),
parameters=transformer.parameters())
optimizer.minimize(avg_cost)
exe = paddle.static.Executor(place)
exe.run(startup_program)
build_strategy = paddle.static.BuildStrategy()
build_strategy.enable_inplace = True
exec_strategy = paddle.static.ExecutionStrategy()
compiled_train_program = paddle.static.CompiledProgram(
train_program).with_data_parallel(
loss_name=avg_cost.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
# the best cross-entropy value with label smoothing
loss_normalizer = -(
(1. - args.label_smooth_eps) * np.log(
(1. - args.label_smooth_eps)) + args.label_smooth_eps *
np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20))
step_idx = 0
# For benchmark
reader_cost_avg = AverageStatistical()
batch_cost_avg = AverageStatistical()
batch_ips_avg = AverageStatistical()
for pass_id in range(args.epoch):
batch_id = 0
batch_start = time.time()
pass_start_time = batch_start
for data in batch_creator(train_loader, trainer_count):
            # NOTE: Used for benchmarking; max_iter defaults to None.
if args.max_iter and step_idx == args.max_iter:
return
train_reader_cost = time.time() - batch_start
outs = exe.run(compiled_train_program,
feed=[{
'src_word': data[i][0],
'trg_word': data[i][1],
'lbl_word': data[i][2],
} for i in range(trainer_count)],
fetch_list=[sum_cost.name, token_num.name])
scheduler.step()
train_batch_cost = time.time() - batch_start
reader_cost_avg.record(train_reader_cost)
batch_cost_avg.record(train_batch_cost)
batch_ips_avg.record(train_batch_cost, np.asarray(outs[1]).sum())
if step_idx % args.print_step == 0:
                sum_cost_val, token_num_val = np.array(outs[0]), np.array(
                    outs[1])
# Sum the cost from multi-devices
total_sum_cost = sum_cost_val.sum()
total_token_num = token_num_val.sum()
total_avg_cost = total_sum_cost / total_token_num
if step_idx == 0:
logging.info(
"step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
"normalized loss: %f, ppl: %f" %
(step_idx, pass_id, batch_id, total_avg_cost,
total_avg_cost - loss_normalizer,
np.exp([min(total_avg_cost, 100)])))
else:
train_avg_batch_cost = args.print_step / batch_cost_avg.get_total_time(
)
logging.info(
"step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
"normalized loss: %f, ppl: %f, avg_speed: %.2f step/s, "
"batch_cost: %.5f sec, reader_cost: %.5f sec, tokens: %d, "
"ips: %.5f words/sec" %
(step_idx, pass_id, batch_id, total_avg_cost,
total_avg_cost - loss_normalizer,
np.exp([min(total_avg_cost, 100)]),
train_avg_batch_cost, batch_cost_avg.get_average(),
reader_cost_avg.get_average(),
batch_ips_avg.get_total_cnt(),
batch_ips_avg.get_average_per_sec()))
reader_cost_avg.reset()
batch_cost_avg.reset()
batch_ips_avg.reset()
if step_idx % args.save_step == 0 and step_idx != 0:
if args.save_model:
model_path = os.path.join(
args.save_model, "step_" + str(step_idx), "transformer")
                    paddle.static.save(train_program, model_path)
batch_id += 1
step_idx += 1
batch_start = time.time()
paddle.disable_static()
if __name__ == "__main__":
ARGS = parse_args()
yaml_file = ARGS.config
with open(yaml_file, 'rt') as f:
args = AttrDict(yaml.safe_load(f))
pprint(args)
do_train(args)
import paddle
import paddle.fluid as fluid
import paddle.distributed as dist
def all_gather_tokens(data):
"""Gathers num of tokens from all nodes.
`data` should be a tensor of num of tokens.
"""
if dist.get_world_size() < 2:
return data
if not hasattr(all_gather_tokens,
'_in_buffer') or all_gather_tokens._in_buffer is None:
all_gather_tokens._in_buffer = data
all_gather_tokens._out_buffers = []
in_buffer = all_gather_tokens._in_buffer
out_buffers = all_gather_tokens._out_buffers
dist.all_gather(out_buffers, in_buffer)
return paddle.add_n(out_buffers)
class AverageStatistical(object):
def __init__(self):
self.reset()
def reset(self):
self.total_cnt = 0
self.time = 0
def record(self, val, cnt=1):
self.time += val
self.total_cnt += cnt
def get_average(self):
if self.total_cnt == 0:
return 0
return self.time / self.total_cnt
def get_average_per_sec(self):
if self.time == 0.0:
return 0.0
return float(self.total_cnt) / self.time
def get_total_cnt(self):
return self.total_cnt
def get_total_time(self):
return self.time
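# Usage sketch (hypothetical numbers):
#   stat = AverageStatistical()
#   stat.record(0.5, cnt=128)      # 128 tokens processed in 0.5s
#   stat.get_average_per_sec()     # -> 256.0 tokens/sec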
# PaddleNLP docs
Holds the materials for readthedocs.
# PaddleNLP Model Zoo
Examples are still a work in progress...
# BERT with PaddleNLP
[BERT](https://arxiv.org/abs/1810.04805) is a general-purpose semantic representation model with strong transferability. It uses the [Transformer](https://arxiv.org/abs/1706.03762) as its basic network component, with bidirectional `Masked Language Model` and `Next Sentence Prediction` as training objectives. Pre-training yields general semantic representations which, combined with a simple output layer, are applied to downstream NLP tasks and achieve SOTA results on multiple tasks. This project is an open-source implementation of BERT on Paddle 2.0.
### Release Highlights
1) Dynamic-graph BERT model with Fine-tuning support, validated on the GLUE SST-2 task.
2) Pre-training support.
## Fine-tuning for NLP Tasks
After BERT pre-training is complete, the pre-trained parameters can be used for Fine-tuning on specific NLP tasks. The following uses an open-source pre-trained model to show how to Fine-tune a classification task.
### Single-Sentence and Sentence-Pair Classification Tasks
Taking the GLUE/SST-2 task as an example, Fine-tuning is launched as follows (`paddlenlp` must already be installed or findable via `PYTHONPATH`):
```shell
export CUDA_VISIBLE_DEVICES=0,1
export TASK_NAME=SST-2
python -u ./run_glue.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--task_name $TASK_NAME \
--max_seq_length 128 \
--batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--logging_steps 1 \
--save_steps 500 \
--output_dir ./tmp/$TASK_NAME/ \
--n_gpu 1 \
```
The parameters are explained as follows:
- `model_type`: the model type; currently only BERT is supported.
- `model_name_or_path`: a model with a specific configuration, together with its pre-trained weights and the tokenizer used during pre-training. If the model files are stored locally, a directory path can also be given here.
- `task_name`: the Fine-tuning task.
- `max_seq_length`: the maximum sentence length; longer sequences are truncated.
- `batch_size`: the number of samples **per card** in each iteration.
- `learning_rate`: the base learning rate, which is multiplied by the value produced by the learning rate scheduler to obtain the current learning rate.
- `num_train_epochs`: the number of training epochs.
- `logging_steps`: the logging interval.
- `save_steps`: the interval for saving and evaluating the model.
- `output_dir`: the directory where models are saved.
- `n_gpu`: the number of GPU cards to use. Set it to the desired number for multi-card training; if 0, the CPU is used.
The training process prints logs such as the following, according to the `logging_steps` and `save_steps` settings:
```
global step 996, epoch: 0, batch: 996, loss: 0.248909, speed: 5.07 step/s
global step 997, epoch: 0, batch: 997, loss: 0.113216, speed: 4.53 step/s
global step 998, epoch: 0, batch: 998, loss: 0.218075, speed: 4.55 step/s
global step 999, epoch: 0, batch: 999, loss: 0.133626, speed: 4.51 step/s
global step 1000, epoch: 0, batch: 1000, loss: 0.187652, speed: 4.45 step/s
eval loss: 0.083172, accu: 0.920872
```
Running the above command for single-card Fine-tuning gives the following results on the dev set:
| Task | Metric | Result |
|-------|------------------------------|-------------|
| SST-2 | Accuracy | 92.88 |
| QNLI | Accuracy | 91.67 |
## Pre-training
```shell
export CUDA_VISIBLE_DEVICES=0,1
export DATA_DIR=/guosheng/nv-bert/DeepLearningExamples/PyTorch/LanguageModeling/BERT/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/
python -u ./run_pretrain.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--max_predictions_per_seq 20 \
--batch_size 32 \
--learning_rate 1e-4 \
--weight_decay 1e-2 \
--adam_epsilon 1e-6 \
--warmup_steps 10000 \
--num_train_epochs 1e5 \
--input_dir $DATA_DIR \
--output_dir ./tmp2/ \
--logging_steps 1 \
--save_steps 20000 \
--max_steps 1000000 \
--n_gpu 2
```
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import collections
import itertools
import os
import random
import time
from functools import partial
import numpy as np
import paddle
from paddle.io import DataLoader
from paddlenlp.datasets.dataset import *
from paddlenlp.datasets.glue import *
from paddlenlp.data import *
from paddlenlp.data.sampler import SamplerHelper
from paddlenlp.transformers.model_bert import *
from paddlenlp.transformers.tokenizer_bert import BertTokenizer
from run_glue import convert_example, TASK_CLASSES
MODEL_CLASSES = {"bert": (BertForSequenceClassification, BertTokenizer), }
def parse_args():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " +
", ".join(TASK_CLASSES.keys()), )
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()), )
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])), )
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.", )
parser.add_argument(
"--batch_size",
default=8,
type=int,
help="Batch size per GPU/CPU for prediction.", )
parser.add_argument(
"--eager_run", type=eval, default=True, help="Use dygraph mode.")
parser.add_argument(
"--use_gpu", type=eval, default=True, help="Whether to use gpu.")
args = parser.parse_args()
return args
def do_predict(args):
    if not args.eager_run:
        paddle.enable_static()
    paddle.set_device("gpu" if args.use_gpu else "cpu")
args.task_name = args.task_name.lower()
dataset_class, _ = TASK_CLASSES[args.task_name]
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
test_dataset = dataset_class.get_datasets(["test"])
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
trans_func = partial(
convert_example,
tokenizer=tokenizer,
label_list=test_dataset.get_labels(),
max_seq_length=args.max_seq_length,
is_test=True)
test_dataset = test_dataset.apply(trans_func, lazy=True)
test_batch_sampler = paddle.io.BatchSampler(
test_dataset, batch_size=args.batch_size, shuffle=False)
batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), # input
Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), # segment
Stack(), # length
): fn(samples)[:2]
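    # Tuple applies one batchify function per field; the trailing [:2] keeps
    # only the padded input and segment ids and drops the stacked lengths,
    # which the model forward pass does not consume.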
test_data_loader = DataLoader(
dataset=test_dataset,
batch_sampler=test_batch_sampler,
collate_fn=batchify_fn,
num_workers=0,
return_list=True)
model = model_class.from_pretrained(args.model_name_or_path)
model.eval()
for batch in test_data_loader:
input_ids, segment_ids = batch
logits = model(input_ids, segment_ids)
        for i, rs in enumerate(paddle.argmax(logits, axis=-1).numpy()):
            print(i, rs)
if __name__ == "__main__":
args = parse_args()
    do_predict(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
import random
import time
from functools import partial
import numpy as np
import paddle
from paddle.io import DataLoader
from paddlenlp.datasets import GlueQNLI, GlueSST2
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.data.sampler import SamplerHelper
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)
TASK_CLASSES = {
"qnli": (GlueQNLI, paddle.metric.Accuracy), # (dataset, metric)
"sst-2": (GlueSST2, paddle.metric.Accuracy),
}
MODEL_CLASSES = {"bert": (BertForSequenceClassification, BertTokenizer), }
def parse_args():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " +
", ".join(TASK_CLASSES.keys()), )
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()), )
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])), )
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.", )
parser.add_argument(
"--batch_size",
default=8,
type=int,
help="Batch size per GPU/CPU for training.", )
parser.add_argument(
"--learning_rate",
default=5e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument(
"--weight_decay",
default=0.0,
type=float,
help="Weight decay if we apply some.")
parser.add_argument(
"--adam_epsilon",
default=1e-8,
type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument(
"--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs",
default=3,
type=int,
help="Total number of training epochs to perform.", )
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument(
"--warmup_steps",
default=0,
type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument(
"--logging_steps",
type=int,
default=500,
help="Log every X updates steps.")
parser.add_argument(
"--save_steps",
type=int,
default=500,
help="Save checkpoint every X updates steps.")
parser.add_argument(
"--seed", type=int, default=42, help="random seed for initialization")
parser.add_argument(
"--eager_run", type=eval, default=True, help="Use dygraph mode.")
parser.add_argument(
"--n_gpu",
type=int,
default=1,
help="number of gpus to use, 0 for cpu.")
args = parser.parse_args()
return args
def set_seed(args):
random.seed(args.seed + paddle.distributed.get_rank())
np.random.seed(args.seed + paddle.distributed.get_rank())
paddle.seed(args.seed + paddle.distributed.get_rank())
def evaluate(model, loss_fct, metric, data_loader):
model.eval()
metric.reset()
for batch in data_loader:
input_ids, segment_ids, labels = batch
logits = model(input_ids, segment_ids)
loss = loss_fct(logits, labels)
correct = metric.compute(logits, labels)
metric.update(correct)
accu = metric.accumulate()
print("eval loss: %f, accu: %f" % (loss.numpy(), accu))
model.train()
def convert_example(example,
tokenizer,
label_list,
max_seq_length=512,
is_test=False):
"""convert a glue example into necessary features"""
def _truncate_seqs(seqs, max_seq_length):
if len(seqs) == 1: # single sentence
# Account for [CLS] and [SEP] with "- 2"
seqs[0] = seqs[0][0:(max_seq_length - 2)]
else: # sentence pair
# Account for [CLS], [SEP], [SEP] with "- 3"
tokens_a, tokens_b = seqs
max_seq_length -= 3
while True: # truncate with longest_first strategy
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_seq_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
return seqs
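    # e.g. with max_seq_length=8 and pair lengths (6, 4), tokens are popped
    # from the longer sequence first until the total fits 8 - 3 = 5,
    # leaving lengths (3, 2).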
def _concat_seqs(seqs, separators, seq_mask=0, separator_mask=1):
concat = sum((seq + sep for sep, seq in zip(separators, seqs)), [])
segment_ids = sum(
([i] * (len(seq) + len(sep))
for i, (sep, seq) in enumerate(zip(separators, seqs))), [])
if isinstance(seq_mask, int):
seq_mask = [[seq_mask] * len(seq) for seq in seqs]
if isinstance(separator_mask, int):
separator_mask = [[separator_mask] * len(sep) for sep in separators]
p_mask = sum((s_mask + mask
for sep, seq, s_mask, mask in zip(
separators, seqs, seq_mask, separator_mask)), [])
return concat, segment_ids, p_mask
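    # e.g. seqs=[["a", "b"]] with separators=[["[SEP]"]] gives
    # concat ["a", "b", "[SEP]"], segment_ids [0, 0, 0], p_mask [0, 0, 1].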
if not is_test:
# `label_list == None` is for regression task
label_dtype = "int64" if label_list else "float32"
# get the label
label = example[-1]
example = example[:-1]
        # create a label map for classification tasks
if label_list:
label_map = {}
for (i, l) in enumerate(label_list):
label_map[l] = i
label = label_map[label]
label = np.array([label], dtype=label_dtype)
# tokenize raw text
tokens_raw = [tokenizer(l) for l in example]
    # truncate to max_seq_length
    tokens_trun = _truncate_seqs(tokens_raw, max_seq_length)
    # concatenate the sequences with special tokens
tokens_trun[0] = [tokenizer.cls_token] + tokens_trun[0]
tokens, segment_ids, _ = _concat_seqs(tokens_trun, [[tokenizer.sep_token]] *
len(tokens_trun))
# convert the token to ids
input_ids = tokenizer.convert_tokens_to_ids(tokens)
valid_length = len(input_ids)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
# input_mask = [1] * len(input_ids)
if not is_test:
return input_ids, segment_ids, valid_length, label
else:
return input_ids, segment_ids, valid_length
def do_train(args):
    if not args.eager_run:
        paddle.enable_static()
paddle.set_device("gpu" if args.n_gpu else "cpu")
if paddle.distributed.get_world_size() > 1:
paddle.distributed.init_parallel_env()
set_seed(args)
args.task_name = args.task_name.lower()
dataset_class, metric_class = TASK_CLASSES[args.task_name]
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
train_ds, dev_ds = dataset_class.get_datasets(['train', 'dev'])
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
trans_func = partial(
convert_example,
tokenizer=tokenizer,
label_list=train_ds.get_labels(),
max_seq_length=args.max_seq_length)
train_ds = train_ds.apply(trans_func, lazy=True)
# train_batch_sampler = SamplerHelper(train_ds).shuffle().batch(
# batch_size=args.batch_size).shard()
train_batch_sampler = paddle.io.DistributedBatchSampler(
train_ds, batch_size=args.batch_size, shuffle=True)
batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=tokenizer.pad_token_id), # input
Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment
Stack(), # length
Stack(dtype="int64" if train_ds.get_labels() else "float32") # label
): [data for i, data in enumerate(fn(samples)) if i != 2]
train_data_loader = DataLoader(
dataset=train_ds,
batch_sampler=train_batch_sampler,
collate_fn=batchify_fn,
num_workers=0,
return_list=True)
dev_ds = dev_ds.apply(trans_func, lazy=True)
# dev_batch_sampler = SamplerHelper(dev_ds).batch(
# batch_size=args.batch_size)
dev_batch_sampler = paddle.io.BatchSampler(
dev_ds, batch_size=args.batch_size, shuffle=False)
dev_data_loader = DataLoader(
dataset=dev_ds,
batch_sampler=dev_batch_sampler,
collate_fn=batchify_fn,
num_workers=0,
return_list=True)
model = model_class.from_pretrained(
args.model_name_or_path, num_classes=len(train_ds.get_labels()))
if paddle.distributed.get_world_size() > 1:
model = paddle.DataParallel(model)
lr_scheduler = paddle.optimizer.lr.LambdaDecay(
args.learning_rate,
lambda current_step, num_warmup_steps=args.warmup_steps,
num_training_steps=args.max_steps if args.max_steps > 0 else
(len(train_data_loader) * args.num_train_epochs): float(
current_step) / float(max(1, num_warmup_steps))
if current_step < num_warmup_steps else max(
0.0,
float(num_training_steps - current_step) / float(
max(1, num_training_steps - num_warmup_steps))))
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
epsilon=args.adam_epsilon,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in [
p.name for n, p in model.named_parameters()
if not any(nd in n for nd in ["bias", "norm"])
])
loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.get_labels(
) else paddle.nn.loss.MSELoss()
metric = metric_class()
### TODO: use hapi
# trainer = paddle.hapi.Model(model)
# trainer.prepare(optimizer, loss_fct, paddle.metric.Accuracy())
# trainer.fit(train_data_loader,
# dev_data_loader,
# log_freq=args.logging_steps,
# epochs=args.num_train_epochs,
# save_dir=args.output_dir)
global_step = 0
tic_train = time.time()
for epoch in range(args.num_train_epochs):
for step, batch in enumerate(train_data_loader):
global_step += 1
input_ids, segment_ids, labels = batch
logits = model(input_ids, segment_ids)
loss = loss_fct(logits, labels)
if global_step % args.logging_steps == 0:
if (not args.n_gpu > 1) or paddle.distributed.get_rank() == 0:
logger.info(
"global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
% (global_step, epoch, step, loss,
args.logging_steps / (time.time() - tic_train)))
tic_train = time.time()
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.clear_gradients()
if global_step % args.save_steps == 0:
evaluate(model, loss_fct, metric, dev_data_loader)
if (not args.n_gpu > 1) or paddle.distributed.get_rank() == 0:
output_dir = os.path.join(args.output_dir,
"model_%d" % global_step)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
                # TODO: find a better way to get the inner model of DataParallel
model_to_save = model._layers if isinstance(
model, paddle.DataParallel) else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
if __name__ == "__main__":
args = parse_args()
if args.n_gpu > 1:
paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
else:
do_train(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import collections
import itertools
import logging
import os
import random
import time
import h5py
from functools import partial
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import paddle
import paddle.distributed as dist
from paddle.io import DataLoader, Dataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import BertForPretraining, BertModel, BertPretrainingCriterion
from paddlenlp.transformers import BertTokenizer
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)
MODEL_CLASSES = {"bert": (BertForPretraining, BertTokenizer), }
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()), )
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])), )
parser.add_argument(
"--input_dir",
default=None,
type=str,
required=True,
help="The input directory where the data will be read from.", )
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument(
"--max_predictions_per_seq",
default=80,
type=int,
help="The maximum total of masked tokens in input sequence")
parser.add_argument(
"--batch_size",
default=8,
type=int,
help="Batch size per GPU/CPU for training.", )
parser.add_argument(
"--learning_rate",
default=5e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument(
"--weight_decay",
default=0.0,
type=float,
help="Weight decay if we apply some.")
parser.add_argument(
"--adam_epsilon",
default=1e-8,
type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument(
"--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs",
default=3,
type=int,
help="Total number of training epochs to perform.", )
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument(
"--warmup_steps",
default=0,
type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument(
"--logging_steps",
type=int,
default=500,
help="Log every X updates steps.")
parser.add_argument(
"--save_steps",
type=int,
default=500,
help="Save checkpoint every X updates steps.")
parser.add_argument(
"--seed", type=int, default=42, help="random seed for initialization")
parser.add_argument(
"--eager_run", type=eval, default=True, help="Use dygraph mode.")
parser.add_argument(
"--n_gpu",
type=int,
default=1,
help="number of gpus to use, 0 for cpu.")
args = parser.parse_args()
return args
def set_seed(args):
random.seed(args.seed + paddle.distributed.get_rank())
np.random.seed(args.seed + paddle.distributed.get_rank())
paddle.seed(args.seed + paddle.distributed.get_rank())
class WorkerInitObj(object):
def __init__(self, seed):
self.seed = seed
def __call__(self, id):
np.random.seed(seed=self.seed + id)
random.seed(self.seed + id)
def create_pretraining_dataset(input_file, max_pred_length, shared_list, args,
worker_init):
train_data = PretrainingDataset(
input_file=input_file, max_pred_length=max_pred_length)
# files have been sharded, no need to dispatch again
train_batch_sampler = paddle.io.BatchSampler(
train_data, batch_size=args.batch_size, shuffle=True)
    # DataLoader cannot be pickled because it holds a device place object.
    # If it could be pickled, a global function (instead of a lambda) and a
    # ProcessPoolExecutor (instead of ThreadPoolExecutor) could be used to prefetch.
def _collate_data(data, stack_fn=Stack()):
num_fields = len(data[0])
out = [None] * num_fields
# input_ids, segment_ids, input_mask, masked_lm_positions,
# masked_lm_labels, next_sentence_labels, mask_token_num
for i in (0, 1, 2, 5):
out[i] = stack_fn([x[i] for x in data])
batch_size, seq_length = out[0].shape
size = num_mask = sum(len(x[3]) for x in data)
# Padding for divisibility by 8 for fp16 or int8 usage
if size % 8 != 0:
size += 8 - (size % 8)
# masked_lm_positions
# Organize as a 1D tensor for gather or use gather_nd
out[3] = np.full(size, 0, dtype=np.int64)
# masked_lm_labels
out[4] = np.full([size, 1], -1, dtype=np.int64)
mask_token_num = 0
for i, x in enumerate(data):
for j, pos in enumerate(x[3]):
out[3][mask_token_num] = i * seq_length + pos
out[4][mask_token_num] = x[4][j]
mask_token_num += 1
# mask_token_num
out.append(np.asarray([mask_token_num], dtype=np.float32))
return out
train_data_loader = DataLoader(
dataset=train_data,
batch_sampler=train_batch_sampler,
collate_fn=_collate_data,
num_workers=0,
worker_init_fn=worker_init,
return_list=True)
return train_data_loader, input_file
class PretrainingDataset(Dataset):
def __init__(self, input_file, max_pred_length):
self.input_file = input_file
self.max_pred_length = max_pred_length
f = h5py.File(input_file, "r")
keys = [
'input_ids', 'input_mask', 'segment_ids', 'masked_lm_positions',
'masked_lm_ids', 'next_sentence_labels'
]
self.inputs = [np.asarray(f[key][:]) for key in keys]
f.close()
def __len__(self):
'Denotes the total number of samples'
return len(self.inputs[0])
def __getitem__(self, index):
[
input_ids, input_mask, segment_ids, masked_lm_positions,
masked_lm_ids, next_sentence_labels
] = [
input[index].astype(np.int64)
if indice < 5 else np.asarray(input[index].astype(np.int64))
for indice, input in enumerate(self.inputs)
]
# TODO: whether to use reversed mask by changing 1s and 0s to be
# consistent with nv bert
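        # Convert the 0/1 padding mask into an additive attention mask of
        # shape [1, 1, seq_len]: 0 for real tokens, -1e9 for padding positions.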
input_mask = (1 - np.reshape(
input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e9
index = self.max_pred_length
# store number of masked tokens in index
        # torch.nonzero and numpy.nonzero organize their outputs differently
        # (a single tensor vs. a tuple of index arrays)
padded_mask_indices = (masked_lm_positions == 0).nonzero()[0]
if len(padded_mask_indices) != 0:
index = padded_mask_indices[0].item()
mask_token_num = index
else:
index = 0
mask_token_num = 0
# masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64)
# masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index]
masked_lm_labels = masked_lm_ids[:index]
masked_lm_positions = masked_lm_positions[:index]
        # softmax_with_cross_entropy requires the label's last dimension to be 1
masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1)
next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1)
return [
input_ids, segment_ids, input_mask, masked_lm_positions,
masked_lm_labels, next_sentence_labels
]
def do_train(args):
    if not args.eager_run:
        paddle.enable_static()
paddle.set_device("gpu" if args.n_gpu else "cpu")
if paddle.distributed.get_world_size() > 1:
paddle.distributed.init_parallel_env()
set_seed(args)
worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank())
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
model = BertForPretraining(
BertModel(**model_class.pretrained_init_configuration[
args.model_name_or_path]))
criterion = BertPretrainingCriterion(
getattr(model, BertForPretraining.base_model_prefix).config[
"vocab_size"])
if paddle.distributed.get_world_size() > 1:
model = paddle.DataParallel(model)
    # If the default last_epoch is used, the lr of the first iteration is 0.
    # Use `last_epoch = 0` to be consistent with nv bert.
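    # NOTE: args.max_steps must be > 0 here; the fallback based on
    # len(train_data_loader) would raise a NameError, because
    # train_data_loader is only created later, inside the epoch loop.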
lr_scheduler = paddle.optimizer.lr.LambdaDecay(
args.learning_rate,
lambda current_step, num_warmup_steps=args.warmup_steps,
num_training_steps=args.max_steps if args.max_steps > 0 else
(len(train_data_loader) * args.num_train_epochs): float(
current_step) / float(max(1, num_warmup_steps))
if current_step < num_warmup_steps else max(
0.0,
float(num_training_steps - current_step) / float(
max(1, num_training_steps - num_warmup_steps))),
last_epoch=0)
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
epsilon=args.adam_epsilon,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in [
p.name for n, p in model.named_parameters()
if not any(nd in n for nd in ["bias", "norm"])
])
pool = ThreadPoolExecutor(1)
global_step = 0
tic_train = time.time()
for epoch in range(args.num_train_epochs):
files = [
os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)
if os.path.isfile(os.path.join(args.input_dir, f)) and "training" in
f
]
files.sort()
num_files = len(files)
random.Random(args.seed + epoch).shuffle(files)
f_start_id = 0
shared_file_list = {}
if paddle.distributed.get_world_size() > num_files:
remainder = paddle.distributed.get_world_size() % num_files
data_file = files[(
f_start_id * paddle.distributed.get_world_size() +
paddle.distributed.get_rank() + remainder * f_start_id) %
num_files]
else:
data_file = files[(f_start_id * paddle.distributed.get_world_size()
+ paddle.distributed.get_rank()) % num_files]
previous_file = data_file
train_data_loader, _ = create_pretraining_dataset(
data_file, args.max_predictions_per_seq, shared_file_list, args,
worker_init)
for f_id in range(f_start_id + 1, len(files)):
if paddle.distributed.get_world_size() > num_files:
data_file = files[(
f_id * paddle.distributed.get_world_size() +
paddle.distributed.get_rank() + remainder * f_id) %
num_files]
else:
data_file = files[(f_id * paddle.distributed.get_world_size() +
paddle.distributed.get_rank()) % num_files]
previous_file = data_file
dataset_future = pool.submit(create_pretraining_dataset, data_file,
args.max_predictions_per_seq,
shared_file_list, args, worker_init)
for step, batch in enumerate(train_data_loader):
global_step += 1
(input_ids, segment_ids, input_mask, masked_lm_positions,
masked_lm_labels, next_sentence_labels,
masked_lm_scale) = batch
prediction_scores, seq_relationship_score = model(
input_ids=input_ids,
token_type_ids=segment_ids,
attention_mask=input_mask,
masked_positions=masked_lm_positions)
loss = criterion(prediction_scores, seq_relationship_score,
masked_lm_labels, next_sentence_labels,
masked_lm_scale)
if global_step % args.logging_steps == 0:
if (not args.n_gpu > 1
) or paddle.distributed.get_rank() == 0:
logger.info(
"global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
% (global_step, epoch, step, loss,
args.logging_steps / (time.time() - tic_train)))
tic_train = time.time()
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.clear_gradients()
if global_step % args.save_steps == 0:
if (not args.n_gpu > 1
) or paddle.distributed.get_rank() == 0:
output_dir = os.path.join(args.output_dir,
"model_%d" % global_step)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
                    # TODO: find a better way to get the inner model of DataParallel
model_to_save = model._layers if isinstance(
model, paddle.DataParallel) else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
paddle.save(
optimizer.state_dict(),
os.path.join(output_dir, "model_state.pdopt"))
if global_step >= args.max_steps:
del train_data_loader
return
del train_data_loader
train_data_loader, data_file = dataset_future.result(timeout=None)
if __name__ == "__main__":
args = parse_args()
if args.n_gpu > 1:
paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
else:
do_train(args)
# Dialogue System
## Dialogue General Understanding
## PLATO
# Dialogue General Understanding (DGU)
## Model Overview
Dialogue systems often have to solve a wide variety of tasks as application scenarios change. This diversity of tasks (intent detection, slot filling, act detection, state tracking, and so on), together with the scarcity of in-domain training data, poses great difficulties and challenges for dialogue-system research and applications. Advancing dialogue systems therefore calls for a general-purpose dialogue understanding model. To this end, we provide a BERT-based Dialogue General Understanding model (DGU: DialogueGeneralUnderstanding). Experiments show that a base model (BERT) combined with common learning paradigms can match or even surpass the best in-domain models on almost all dialogue understanding tasks, demonstrating the great potential of learning a single general dialogue understanding model.
The DGU model covers six tasks, all trained and evaluated on public datasets with Paddle 2.0, as detailed below:
```
DRS: Dialogue Response Selection on the UDC (Ubuntu Corpus V1) dataset;
DST: Dialogue State Tracking on the DSTC2 (Dialog State Tracking Challenge 2) dataset;
DSF: Dialogue Slot Filling on the ATIS (Airline Travel Information System) dataset;
DID: Dialogue Intent Detection on the ATIS (Airline Travel Information System) dataset;
MRDA: Dialogue Act Detection on the MRDAC (Meeting Recorder Dialogue Act Corpus) dataset;
SwDA: Dialogue Act Detection on the SwDAC (Switchboard Dialogue Act Corpus) dataset;
```
## Results
The six DGU tasks are evaluated on their test sets, each with its own metric; the results are as follows:
<table border="1">
<tr><th style="text-align:center">Task</th><th style="text-align:center">Metric</th><th style="text-align:center">DGU</th></tr>
<tr align="center"><td rowspan="3" style="vertical-align:middle;">DRS</td><td>R1@10</td><td>81.04%</td></tr>
<tr align="center"><td>R2@10</td><td>89.85%</td></tr>
<tr align="center"><td>R5@10</td><td>97.59%</td></tr>
<tr align="center"><td>DST</td><td>Joint_Acc</td><td>90.43%</td></tr>
<tr align="center"><td>DSF</td><td>F1_Micro</td><td>97.98%</td></tr>
<tr align="center"><td>DID</td><td>Acc</td><td>97.42%</td></tr>
<tr align="center"><td>MRDA</td><td>Acc</td><td>90.94%</td></tr>
<tr align="center"><td>SwDA</td><td>Acc</td><td>80.61%</td></tr>
</table>
**NOTE:** All results above were obtained by training and evaluating with the default configuration on a single GPU; to reproduce them, use the default configuration on a single GPU.
## Quick Start
### Installation
* Installing PaddlePaddle
This project requires PaddlePaddle 2.0 or later; see the [installation guide](http://www.paddlepaddle.org/#quick-start) for instructions.
* Installing PaddleNLP
```shell
pip install paddlenlp
```
* Environment requirements
Python 3.6+ is required; for other environment details, see the [installation notes](https://www.paddlepaddle.org.cn/install/quick/zh/2.0rc-linux-docker) for PaddlePaddle.
### Code Structure
The main files of this project are organized as follows:
```text
.
├── args.py      # command-line argument configuration
├── data.py      # data loading
├── main.py      # main entry point for training and evaluation
├── metric.py    # model evaluation metrics
└── README.md    # this document
```
### Data Preparation
After downloading and extracting the dataset archive, the DGU_datasets directory contains six subdirectories, one per task, each holding a training set train.txt, a development set dev.txt, and a test set test.txt.
```shell
wget https://paddlenlp.bj.bcebos.com/datasets/DGU_datasets.tar.gz
tar -zxf DGU_datasets.tar.gz
```
Directory layout of DGU_datasets:
```text
DGU_datasets/
├── did
│   ├── dev.txt
│   ├── map_tag_intent_id.txt
│   ├── test.txt
│   └── train.txt
├── drs
│   ├── dev.txt
│   ├── dev.txt-small
│   ├── test.txt
│   └── train.txt
├── dsf
│   ├── dev.txt
│   ├── map_tag_slot_id.txt
│   ├── test.txt
│   └── train.txt
├── dst
│   ├── dev.txt
│   ├── map_tag_id.txt
│   ├── test.txt
│   └── train.txt
├── mrda
│   ├── dev.txt
│   ├── map_tag_id.txt
│   ├── test.txt
│   └── train.txt
└── swda
├── dev.txt
├── map_tag_id.txt
├── test.txt
└── train.txt
```
Each line of the data consists of multiple columns separated by "\t". The detailed formats are described below; a minimal parsing sketch follows the block.
```
drs: a label, the multi-turn conversation conv, and the response
format: label \t conv1 \t conv2 \t conv3 \t ... \t response
dst: a conversation id, the current-turn QA pair (joined with \1), and the dialogue state sequence state_list (states separated by spaces)
format: conversation_id \t question \1 answer \t state1 state2 state3 ...
dsf: the conversation content and a label sequence label_list (labels separated by spaces); each label corresponds one-to-one to a word of the conversation content
format: conversation_content \t label1 label2 label3 ...
did: a label and the conversation content
format: label \t conversation_content
mrda: a conversation id, a label, the caller, and the conversation content
format: conversation_id \t label \t caller \t conversation_content
swda: a conversation id, a label, the caller, and the conversation content
format: conversation_id \t label \t caller \t conversation_content
```
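A minimal parsing sketch for two of the formats above (the function and field names are illustrative assumptions, not identifiers from this project):
```python
def parse_did_line(line):
    """did format: label \t conversation_content"""
    label, conversation_content = line.rstrip("\n").split("\t", 1)
    return {"label": label, "conversation": conversation_content}


def parse_swda_line(line):
    """swda format: conversation_id \t label \t caller \t conversation_content"""
    conversation_id, label, caller, content = line.rstrip("\n").split("\t", 3)
    return {"id": conversation_id, "label": label,
            "caller": caller, "conversation": content}


print(parse_did_line("flight\ti want to fly from boston to denver"))
```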
**NOTE:** The datasets above come from the [Paddle 1.8 static-graph version](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleNLP/dialogue_system/dialogue_general_understanding) and were converted from the corresponding open-source datasets. The conversion scripts are not yet included in this project; see the [Paddle 1.8 static-graph version](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleNLP/dialogue_system/dialogue_general_understanding) for details.
### Model Training
Run the following commands to train on the training set (train.txt) and validate on the development set (dev.txt); after training finishes, the model is evaluated on the test set (test.txt).
```shell
export CUDA_VISIBLE_DEVICES=0,1
# GPU launch; n_gpu sets the number of GPUs used for training (one or more cards). By default, training, validation, and evaluation are all performed.
python -u main.py --task_name=drs --data_dir=./DGU_datasets/drs --output_dir=./checkpoints/drs --n_gpu=2
# For evaluation only, set do_train to False; init_from_ckpt must then be specified
# python -u main.py --task_name=drs --data_dir=./DGU_datasets/drs --do_train=False --init_from_ckpt=./checkpoints/drs/best
```
The parameters above mean:
* task_name: the task name; one of drs, dst, dsf, did, mrda, or swda.
* data_dir: path to the training data.
* output_dir: directory where trained models are saved.
* n_gpu: number of GPUs used for training; defaults to 1.
* do_train: whether to train; defaults to `True`.
* init_from_ckpt: path from which model parameters are restored.
For the other optional parameters and their default values, see `args.py`.
The program automatically trains, validates, and evaluates, and saves checkpoints to the specified `output_dir` during training.
For example:
```text
checkpoints/
├── 1000.pdopt
├── 1000.pdparams
├── 2000.pdopt
├── 2000.pdparams
├── ...
├── best.pdopt
└── best.pdparams
```
**NOTE:** To resume training, init_from_ckpt only needs the checkpoint file name prefix, without the extension. For example, with `--init_from_ckpt=checkpoints/1000` the program automatically loads the model parameters `checkpoints/1000.pdparams` and the optimizer state `checkpoints/1000.pdopt`, as in the sketch below.
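A minimal sketch of this resumption logic, assuming Paddle 2.0's `paddle.load`/`set_state_dict` APIs as used by `main.py` (the checkpoint prefix is hypothetical):
```python
import paddle
from paddlenlp.transformers import BertForSequenceClassification

# Hypothetical checkpoint prefix following the --init_from_ckpt convention.
ckpt_prefix = "checkpoints/1000"

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5, parameters=model.parameters())

# Restore model parameters and optimizer state from the checkpoint pair.
model.set_state_dict(paddle.load(ckpt_prefix + ".pdparams"))
optimizer.set_state_dict(paddle.load(ckpt_prefix + ".pdopt"))
```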
import argparse
def parse_args():
parser = argparse.ArgumentParser(__doc__)
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train.")
parser.add_argument(
"--model_name_or_path",
default='bert-base-uncased',
type=str,
help="Path to pre-trained bert model or shortcut name.")
parser.add_argument(
"--output_dir",
default=None,
type=str,
help="The output directory where the checkpoints will be saved.")
parser.add_argument(
"--data_dir",
default=None,
type=str,
help="The directory where the dataset will be load.")
parser.add_argument(
"--init_from_ckpt",
default=None,
type=str,
help="The path of checkpoint to be loaded.")
parser.add_argument(
"--max_seq_len",
default=None,
type=int,
help="The maximum total input sequence length after tokenization for trainng. "
"Sequences longer than this will be truncated, sequences shorter will be padded."
)
parser.add_argument(
"--test_max_seq_len",
default=None,
type=int,
help="The maximum total input sequence length after tokenization for testing. "
"Sequences longer than this will be truncated, sequences shorter will be padded."
)
parser.add_argument(
"--batch_size",
default=None,
type=int,
help="Batch size per GPU/CPU for training.")
parser.add_argument(
"--test_batch_size",
default=None,
type=int,
help="Batch size per GPU/CPU for testing.")
parser.add_argument(
"--learning_rate",
default=None,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument(
"--weight_decay",
default=0.01,
type=float,
help="Weight decay if we apply some.")
parser.add_argument(
"--epochs",
default=None,
type=int,
help="Total number of training epochs to perform.")
parser.add_argument(
"--logging_steps",
default=None,
type=int,
help="Log every X updates steps.")
parser.add_argument(
"--save_steps",
default=None,
type=int,
help="Save checkpoint every X updates steps.")
parser.add_argument(
"--seed", default=42, type=int, help="Random seed for initialization.")
parser.add_argument(
"--n_gpu",
default=1,
type=int,
help="The number of gpus to use, 0 for cpu.")
    parser.add_argument(
        "--warmup_proportion",
        default=0.1,
        type=float,
        help="The proportion of training steps used for linear warmup.")
parser.add_argument(
'--max_grad_norm',
default=1.0,
type=float,
help='The max value of grad norm.')
    parser.add_argument(
        "--do_train", default=True, type=eval, help="Whether to run training.")
    parser.add_argument(
        "--do_eval", default=True, type=eval, help="Whether to run evaluation.")
    parser.add_argument(
        "--do_test", default=True, type=eval, help="Whether to run testing.")
args = parser.parse_args()
return args
def set_default_args(args):
args.task_name = args.task_name.lower()
if args.task_name == 'drs':
if not args.save_steps:
args.save_steps = 1000
if not args.logging_steps:
args.logging_steps = 100
if not args.epochs:
args.epochs = 2
if not args.max_seq_len:
args.max_seq_len = 210
if not args.test_batch_size:
args.test_batch_size = 100
elif args.task_name == 'dst':
if not args.save_steps:
args.save_steps = 400
if not args.logging_steps:
args.logging_steps = 20
if not args.epochs:
args.epochs = 40
if not args.learning_rate:
args.learning_rate = 5e-5
if not args.max_seq_len:
args.max_seq_len = 256
if not args.test_max_seq_len:
args.test_max_seq_len = 512
elif args.task_name == 'dsf':
if not args.save_steps:
args.save_steps = 100
if not args.logging_steps:
args.logging_steps = 10
if not args.epochs:
args.epochs = 50
elif args.task_name == 'did':
if not args.save_steps:
args.save_steps = 100
if not args.logging_steps:
args.logging_steps = 10
if not args.epochs:
args.epochs = 20
elif args.task_name == 'mrda':
if not args.save_steps:
args.save_steps = 500
if not args.logging_steps:
args.logging_steps = 200
if not args.epochs:
args.epochs = 7
elif args.task_name == 'swda':
if not args.save_steps:
args.save_steps = 500
if not args.logging_steps:
args.logging_steps = 200
if not args.epochs:
args.epochs = 3
else:
        raise ValueError('Unsupported task: %s.' % args.task_name)
if not args.data_dir:
args.data_dir = './DGU_datasets/' + args.task_name
if not args.output_dir:
args.output_dir = './checkpoints/' + args.task_name
if not args.learning_rate:
args.learning_rate = 2e-5
if not args.batch_size:
args.batch_size = 32
if not args.test_batch_size:
args.test_batch_size = args.batch_size
if not args.max_seq_len:
args.max_seq_len = 128
if not args.test_max_seq_len:
args.test_max_seq_len = args.max_seq_len
import os
import random
import time
import numpy as np
from functools import partial
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.distributed as dist
from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler
from paddle.optimizer.lr import LambdaDecay
from paddle.optimizer import AdamW
from paddle.metric import Accuracy
from paddlenlp.datasets import MapDatasetWrapper
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import BertTokenizer, BertForSequenceClassification, BertForTokenClassification
from args import parse_args, set_default_args
import data
import metric
TASK_CLASSES = {
'drs': (data.UDCv1, metric.RecallAtK),
'dst': (data.DSTC2, metric.JointAccuracy),
'dsf': (data.ATIS_DSF, metric.F1Score),
'did': (data.ATIS_DID, Accuracy),
'mrda': (data.MRDA, Accuracy),
'swda': (data.SwDA, Accuracy)
}
def set_seed(seed):
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
def load_ckpt(args, model, optimizer=None):
if args.init_from_ckpt:
params_state_dict = paddle.load(args.init_from_ckpt + '.pdparams')
model.set_state_dict(params_state_dict)
if optimizer:
opt_state_dict = paddle.load(args.init_from_ckpt + '.pdopt')
optimizer.set_state_dict(opt_state_dict)
print('Loaded checkpoint from %s' % args.init_from_ckpt)
def save_ckpt(model, optimizer, output_dir, name):
params_path = os.path.join(output_dir, '{}.pdparams'.format(name))
opt_path = os.path.join(output_dir, '{}.pdopt'.format(name))
paddle.save(model.state_dict(), params_path)
paddle.save(optimizer.state_dict(), opt_path)
def compute_lr_factor(current_step, warmup_steps, max_train_steps):
if current_step < warmup_steps:
factor = float(current_step) / warmup_steps
else:
factor = 1 - float(current_step) / max_train_steps
return factor
class DGULossFunction(nn.Layer):
def __init__(self, task_name):
super(DGULossFunction, self).__init__()
self.task_name = task_name
self.loss_fn = self.get_loss_fn()
def get_loss_fn(self):
if self.task_name in ['drs', 'dsf', 'did', 'mrda', 'swda']:
return F.softmax_with_cross_entropy
elif self.task_name == 'dst':
return nn.BCEWithLogitsLoss(reduction='sum')
def forward(self, logits, labels):
if self.task_name in ['drs', 'did', 'mrda', 'swda']:
loss = self.loss_fn(logits, labels)
loss = paddle.mean(loss)
elif self.task_name == 'dst':
loss = self.loss_fn(logits, paddle.cast(labels, dtype=logits.dtype))
elif self.task_name == 'dsf':
labels = paddle.unsqueeze(labels, axis=-1)
loss = self.loss_fn(logits, labels)
loss = paddle.mean(loss)
return loss
def print_logs(args, step, logits, labels, loss, total_time, metric):
if args.task_name in ['drs', 'did', 'mrda', 'swda']:
if args.task_name == 'drs':
metric = Accuracy()
metric.reset()
correct = metric.compute(logits, labels)
metric.update(correct)
acc = metric.accumulate()
print('step %d - loss: %.4f - acc: %.4f - %.3fs/step' %
(step, loss, acc, total_time / args.logging_steps))
elif args.task_name == 'dst':
metric.reset()
metric.update(logits, labels)
joint_acc = metric.accumulate()
print('step %d - loss: %.4f - joint_acc: %.4f - %.3fs/step' %
(step, loss, joint_acc, total_time / args.logging_steps))
elif args.task_name == 'dsf':
metric.reset()
metric.update(logits, labels)
f1_micro = metric.accumulate()
print('step %d - loss: %.4f - f1_micro: %.4f - %.3fs/step' %
(step, loss, f1_micro, total_time / args.logging_steps))
def train(args, model, train_data_loader, dev_data_loader, metric, rank):
num_examples = len(train_data_loader) * args.batch_size * args.n_gpu
max_train_steps = args.epochs * len(train_data_loader)
warmup_steps = int(max_train_steps * args.warmup_proportion)
if rank == 0:
print("Num train examples: %d" % num_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
factor_fn = partial(
compute_lr_factor,
warmup_steps=warmup_steps,
max_train_steps=max_train_steps)
lr_scheduler = LambdaDecay(args.learning_rate, factor_fn)
    # Exclude bias and norm parameters from weight decay, and clip gradients
    # by global norm.
    optimizer = AdamW(
        learning_rate=lr_scheduler,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in [
            p.name for n, p in model.named_parameters()
            if not any(nd in n for nd in ['bias', 'norm'])],
        grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm))
loss_fn = DGULossFunction(args.task_name)
load_ckpt(args, model, optimizer)
step = 0
best_metric = 0.0
total_time = 0.0
for epoch in range(args.epochs):
if rank == 0:
print('\nEpoch %d/%d' % (epoch + 1, args.epochs))
batch_start_time = time.time()
for batch in train_data_loader:
step += 1
input_ids, segment_ids, labels = batch
logits = model(input_ids, segment_ids)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.clear_gradients()
total_time += (time.time() - batch_start_time)
if rank == 0:
if step % args.logging_steps == 0:
print_logs(args, step, logits, labels, loss, total_time,
metric)
total_time = 0.0
if step % args.save_steps == 0 or step == max_train_steps:
save_ckpt(model, optimizer, args.output_dir, step)
if args.do_eval:
print('\nEval begin...')
metric_out = evaluation(args, model, dev_data_loader,
metric)
if metric_out > best_metric:
best_metric = metric_out
save_ckpt(model, optimizer, args.output_dir, 'best')
print('Best model, step: %d\n' % step)
batch_start_time = time.time()
def evaluation(args, model, data_loader, metric):
model.eval()
metric.reset()
for batch in data_loader:
input_ids, segment_ids, labels = batch
logits = model(input_ids, segment_ids)
if args.task_name in ['did', 'mrda', 'swda']:
correct = metric.compute(logits, labels)
metric.update(correct)
else:
metric.update(logits, labels)
model.train()
metric_out = metric.accumulate()
print('Total samples: %d' % (len(data_loader) * args.test_batch_size))
if args.task_name == 'drs':
print('R1@10: %.4f - R2@10: %.4f - R5@10: %.4f\n' %
(metric_out[0], metric_out[1], metric_out[2]))
return metric_out[0]
elif args.task_name == 'dst':
print('Joint_acc: %.4f\n' % metric_out)
return metric_out
elif args.task_name == 'dsf':
print('F1_micro: %.4f\n' % metric_out)
return metric_out
elif args.task_name in ['did', 'mrda', 'swda']:
print('Acc: %.4f\n' % metric_out)
return metric_out
def create_data_loader(args, dataset_class, trans_func, batchify_fn, mode):
dataset = dataset_class(args.data_dir, mode)
dataset = MapDatasetWrapper(dataset).apply(trans_func, lazy=True)
if mode == 'train':
batch_sampler = DistributedBatchSampler(
dataset, batch_size=args.batch_size, shuffle=True)
else:
batch_sampler = BatchSampler(
dataset, batch_size=args.test_batch_size, shuffle=False)
data_loader = DataLoader(
dataset,
batch_sampler=batch_sampler,
collate_fn=batchify_fn,
return_list=True)
return data_loader
def main(args):
paddle.set_device('gpu' if args.n_gpu else 'cpu')
world_size = dist.get_world_size()
rank = dist.get_rank()
if world_size > 1 and args.do_train:
dist.init_parallel_env()
set_seed(args.seed)
dataset_class, metric_class = TASK_CLASSES[args.task_name]
tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)
trans_func = partial(
dataset_class.convert_example,
tokenizer=tokenizer,
max_seq_length=args.max_seq_len)
test_trans_func = partial(
dataset_class.convert_example,
tokenizer=tokenizer,
max_seq_length=args.test_max_seq_len)
metric = metric_class()
if args.task_name in ('drs', 'dst', 'did', 'mrda', 'swda'):
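        # Pad input ids and segment ids to the batch's longest sequence and
        # stack the classification labels as int64.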
batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=tokenizer.pad_token_id), # input
Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment
Stack(dtype='int64') # label
): fn(samples)
model = BertForSequenceClassification.from_pretrained(
args.model_name_or_path, num_classes=dataset_class.num_classes())
elif args.task_name == 'dsf':
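        # Slot filling uses token-level labels, so labels are padded like the
        # inputs to preserve token/label alignment.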
batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=tokenizer.pad_token_id), # input
Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment
Pad(axis=0, pad_val=0, dtype='int64') # label
): fn(samples)
model = BertForTokenClassification.from_pretrained(
args.model_name_or_path,
num_classes=dataset_class.num_classes(),
dropout=0.0)
if world_size > 1 and args.do_train:
model = paddle.DataParallel(model)
if args.do_train:
train_data_loader = create_data_loader(args, dataset_class, trans_func,
batchify_fn, 'train')
if args.do_eval:
dev_data_loader = create_data_loader(
args, dataset_class, test_trans_func, batchify_fn, 'dev')
else:
dev_data_loader = None
train(args, model, train_data_loader, dev_data_loader, metric, rank)
if args.do_test:
if rank == 0:
test_data_loader = create_data_loader(
args, dataset_class, test_trans_func, batchify_fn, 'test')
if args.do_train:
# If do_eval=True, use best model to evaluate the test data.
# Otherwise, use final model to evaluate the test data.
if args.do_eval:
args.init_from_ckpt = os.path.join(args.output_dir, 'best')
load_ckpt(args, model)
else:
if not args.init_from_ckpt:
raise ValueError('"init_from_ckpt" should be set.')
load_ckpt(args, model)
print('\nTest begin...')
evaluation(args, model, test_data_loader, metric)
def print_args(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).items()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
if __name__ == '__main__':
args = parse_args()
set_default_args(args)
print_args(args)
if args.n_gpu > 1:
dist.spawn(main, args=(args, ), nprocs=args.n_gpu)
else:
main(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import collections
import itertools
import os
import sys
import hashlib
import random
import time
from functools import partial
import numpy as np
import paddle
from paddle.io import DataLoader
from paddlenlp.datasets.dataset import *
from paddlenlp.datasets.glue import *
from paddlenlp.data import *
from paddlenlp.data.sampler import SamplerHelper
from paddlenlp.transformers import ElectraForSequenceClassification, ElectraTokenizer
from run_glue import convert_example, TASK_CLASSES
MODEL_CLASSES = {
"electra": (ElectraForSequenceClassification, ElectraTokenizer),
}
def do_predict(args):
paddle.set_device("gpu" if args.use_gpu else "cpu")
args.task_name = args.task_name.lower()
dataset_class, _ = TASK_CLASSES[args.task_name]
args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
test_dataset = dataset_class.get_datasets(["test"])
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
trans_func = partial(
convert_example,
tokenizer=tokenizer,
label_list=test_dataset.get_labels(),
max_seq_length=args.max_seq_length,
is_test=True)
test_dataset = test_dataset.apply(trans_func, lazy=True)
test_batch_sampler = paddle.io.BatchSampler(
test_dataset, batch_size=args.batch_size, shuffle=False)
batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), # input
Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), # segment
Stack(), # length
): fn(samples)[:2]
test_data_loader = DataLoader(
dataset=test_dataset,
batch_sampler=test_batch_sampler,
collate_fn=batchify_fn,
num_workers=0,
return_list=True)
# for debug
model = model_class.from_pretrained(args.model_name_or_path)
return_dict = model.return_dict
model.eval()
for batch in test_data_loader:
input_ids, segment_ids = batch
model_output = model(input_ids=input_ids, token_type_ids=segment_ids)
if not return_dict:
logits = model_output[0]
else:
logits = model_output.logits
#print("logits.shape is : %s" % logits.shape)
for i, rs in enumerate(paddle.argmax(logits, -1).numpy()):
print("data : %s, predict : %s" % (input_ids[i], rs))
def print_arguments(args):
"""print arguments"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).items()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " +
", ".join(TASK_CLASSES.keys()), )
parser.add_argument(
"--model_type",
default="electra",
type=str,
required=False,
help="Model type selected in the list: " +
", ".join(MODEL_CLASSES.keys()), )
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(
sum([
list(classes[-1].pretrained_init_configuration.keys())
for classes in MODEL_CLASSES.values()
], [])), )
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.", )
parser.add_argument(
"--batch_size",
default=8,
type=int,
help="Batch size per GPU/CPU for prediction.", )
parser.add_argument(
"--use_gpu", type=eval, default=True, help="Whether to use gpu.")
args, unparsed = parser.parse_known_args()
print_arguments(args)
    do_predict(args)
# Language Model
## RNN-LM (xiaopeng)
## ELMo (moyuan)