Unverified commit 01f4302b, authored by nbcc, committed by GitHub

Revert "add ernie-doc to ernie develop"

Parent 4b1b4ee3
# Virtualenv
/.venv/
/venv/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
# C extensions
*.so
# Distribution / packaging
/bin/
/build/
/develop-eggs/
/dist/
/eggs/
/lib/
/lib64/
/output/
/parts/
/sdist/
/var/
/*.egg-info/
/.installed.cfg
/*.egg
/.eggs
# AUTHORS and ChangeLog will be generated while packaging
/AUTHORS
/ChangeLog
# BCloud / BuildSubmitter
/build_submitter.*
/logger_client_log
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
.tox/
.coverage
.cache
.pytest_cache
nosetests.xml
coverage.xml
# Translations
*.mo
# Sphinx documentation
/docs/_build/
# user-defined
ernie_doc/output
ernie_doc/log
ernie_doc/data/imdb/*.txt
ernie_doc/data/imdb.debug
ernie_doc/tmpout
ernie_doc/py37
English | [简体中文](./README_zh.md)
## _ERNIE-Doc_: A Retrospective Long-Document Modeling Transformer
- [Framework](#framework)
- [Pre-trained Models](#Pre-trained-Models)
- [Fine-tuning Tasks](#Fine-tuning-Tasks)
* [Language Modeling](#Language-Modeling)
* [Long-Text Classification](#Long-Text-Classification)
* [Question Answering](#Question-Answering)
* [Information Extraction](#Information-Extraction)
* [Semantic Matching](#Semantic-Matching)
- [Usage](#Usage)
* [Install Paddle](#Install-PaddlePaddle)
* [Fine-tuning](#Fine-tuning)
- [Citation](#Citation)
For technical description of the algorithm, please see our paper:
>[_**ERNIE-Doc: A Retrospective Long-Document Modeling Transformer**_](https://arxiv.org/abs/2012.15688)
>
>Siyu Ding\*, Junyuan Shang\*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>
>Preprint December 2020
>
>Accepted by **ACL-2021**
![ERNIE-Doc](https://img.shields.io/badge/Pretraining-Long%20Document%20Modeling-green) ![paper](https://img.shields.io/badge/Paper-ACL2021-yellow)
---
**ERNIE-Doc is a document-level language pretraining model**. Two well-designed techniques, the **retrospective feed mechanism** and the **enhanced recurrence mechanism**, give ERNIE-Doc a much longer effective context length and enable it to capture the contextual information of a complete document. ERNIE-Doc improves the state-of-the-art perplexity on WikiText-103 language modeling to 16.8. Moreover, it outperforms competitive pretraining models by a large margin on most language understanding tasks, such as text classification, question answering, information extraction and semantic matching.
## Framework
We proposed three novel methods to enhance the long document modeling ability of Transformers:
- **Retrospective Feed Mechanism**: Inspired by the human reading behavior of skimming a document first and then looking back upon it attentively, we design a retrospective feed mechanism in which segments from a document are fed twice as input. As a result, each segment in the retrospective phase could explicitly fuse the semantic information of the entire document learned in the skimming phase, which prevents context fragmentation.
- **Enhanced Recurrence Mechanism**, a drop-in replacement for Recurrence Transformers (such as Transformer-XL) that changes the shifting-one-layer-downwards recurrence to same-layer recurrence. In this manner, the maximum effective context length can be expanded, and past higher-level representations can be exploited to enrich future lower-level representations (a minimal sketch of the memory update follows the figure below).
- **Segment-reordering Objective**, a document-aware task of predicting the correct order of the permuted set of segments of a document, to model the relationship among segments directly. This allows ERNIE-Doc to build full document representations for prediction.
![framework](.meta/framework.png)
Illustrations of ERNIE-Doc and Recurrence Transformers, where models with three layers take as input a long document which is sliced into four segments.
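To make the enhanced recurrence mechanism concrete, below is a minimal sketch (plain NumPy, not the training code) of the per-layer memory update: the cache for layer *l* at the next segment is built from the hidden states of the *same* layer *l* for the current segment, mirroring the `_cache_mem` helper in the transformer encoder code later in this diff; `mem_len=128` matches the `memory_len` in the released configs.
```python
import numpy as np

def update_memory(prev_mem, curr_hidden, mem_len=128):
    """Same-layer memory update of the enhanced recurrence mechanism.

    prev_mem:    [batch, mem_len, hidden] cache of this layer from previous segments (or None)
    curr_hidden: [batch, seq_len, hidden] hidden states of this layer for the current segment
    """
    if prev_mem is None:
        return curr_hidden[:, -mem_len:, :]
    # Concatenate along the time axis and keep only the last mem_len steps;
    # in the real model the result is detached (stop_gradient=True).
    return np.concatenate([prev_mem, curr_hidden], axis=1)[:, -mem_len:, :]
```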
## Pre-trained Models
We release checkpoints for the **ERNIE-Doc _base_en/zh_** and **ERNIE-Doc _large_en_** models.
- [**ERNIE-Doc _base_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _base_zh_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _large_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz) (_24-layer, 1024-hidden, 16-heads_)
## Fine-tuning Tasks
We compare the performance of [ERNIE-Doc](https://arxiv.org/abs/2012.15688) with the existing SOTA pre-training models (such as [Longformer](https://arxiv.org/abs/2004.05150), [BigBird](https://arxiv.org/abs/2007.14062), [ETC](https://arxiv.org/abs/2004.08483) and [ERNIE2.0](https://arxiv.org/abs/1907.12412)) for language modeling (**_WikiText-103_**) and document-level natural language understanding tasks, including long-text classification (**_IMDB_**, **_HYP_**, **_THUCNews_**, **_IFLYTEK_**), question answering (**_TriviaQA_**, **_HotpotQA_**, **_DRCD_**, **_CMRC2018_**, **_DuReader_**, **_C3_**), information extraction (**_OpenKPE_**) and semantic matching (**_CAIL2019-SCM_**).
### Language Modeling
- [WikiText-103](https://arxiv.org/abs/1609.07843)
| Model | Param. | PPL |
|--------------------------|:--------:|:------:|
| _Results of base models_ | | |
| LSTM | - | 48.7 |
| LSTM+Neural cache | - | 40.8 |
| GCNN-14 | - | 37.2 |
| QRNN | 151M | 33.0 |
| Transformer-XL Base | 151M | 24.0 |
| SegaTransformer-XL Base | 151M | 22.5 |
| **ERNIE-Doc** Base | 151M | **21.0** |
| _Results of large models_ | | |
| Adaptive Input | 247M | 18.7 |
| Transformer-XL Large | 247M | 18.3 |
| Compressive Transformer | 247M | 17.1 |
| SegaTransformer-XL Large | 247M | 17.1 |
| **ERNIE-Doc** Large | 247M | **16.8** |
### Long-Text Classification
- [IMDB reviews](http://ai.stanford.edu/~amaas/data/sentiment/index.html)
| Models | Acc. | F1 |
|-----------------|:----:|:----:|
| RoBERTa | 95.3 | 95.0 |
| Longformer | 95.7 | - |
| BigBird | - | 95.2 |
| **ERNIE-Doc** Base | **96.1** | **96.1** |
| XLNet-Large | 96.8 | - |
| **ERNIE-Doc** Large | **97.1** | **97.1** |
- [Hyperpartisan News Detection](https://pan.webis.de/semeval19/semeval19-web/)
| Models | F1 |
|-----------------|:----:|
| RoBERTa | 87.8 |
| Longformer | 94.8 |
| BigBird | 92.2 |
| **ERNIE-Doc** Base | **96.3** |
| **ERNIE-Doc** Large | **96.6** |
- [THUCNews (THU)](http://thuctc.thunlp.org/), [IFLYTEK (IFK)](https://arxiv.org/abs/2004.05986)
| Models | THU Acc. (Dev) | THU Acc. (Test) | IFK Acc. (Dev) |
|-----------------|:--------:|:--------:|:--------:|
| BERT | 97.7 | 97.3 | 60.3 |
| BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
| RoBERTa-wwm-ext | - | - | 60.3 |
| ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
| ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
| **ERNIE-Doc** | **98.3** | **97.7** | **62.4** |
### Question Answering
- [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) results on the dev set
| Models | F1 |
|-----------------|:----:|
| RoBERTa | 74.3 |
| Longformer | 75.2 |
| BigBird | 79.5 |
| **ERNIE-Doc** Base | **80.1** |
| Longformer Large | 77.8 |
| BigBird Large | - |
| **ERNIE-Doc** Large | **82.5** |
- [HotpotQA](https://hotpotqa.github.io/) results on the dev set
| Models | Span-F1 | Supp.-F1 | Joint-F1 |
|-----------------|:----:|:----:|:----:|
| RoBERTa | 73.5 | 83.4 | 63.5 |
| Longformer | 74.3 | 84.4 | 64.4 |
| BigBird | 75.5 | **87.1** | 67.8 |
| **ERNIE-Doc** Base | **79.4** | 86.3 | **70.5** |
| Longformer Large | 81.0 | 85.8 | 71.4 |
| BigBird Large | 81.3 | **89.4** | - |
| **ERNIE-Doc** Large | **82.2** | 87.6 | **73.7** |
- [DRCD](https://arxiv.org/abs/1806.00920), [CMRC2018](https://arxiv.org/abs/1810.07366), [DuReader](https://arxiv.org/abs/1711.05073), [C3](https://arxiv.org/abs/1904.09679)
| Models | DRCD dev (EM/F1) | DRCD test (EM/F1) | CMRC2018 dev (EM/F1) | DuReader dev (EM/F1) | C3 dev (Acc.) | C3 test (Acc.) |
|-----------------|---------------|---------------|---------------|---------------|----------|----------|
| BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
| BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
| RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
| MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
| XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
| ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
| ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
| **ERNIE-Doc** | **90.5/95.2** | **90.5/95.1** | **76.1/91.6** | **65.8/77.9** | **76.5** | **76.5** |
### Information Extraction
- [Open Domain Web Keyphrase Extraction](https://www.aclweb.org/anthology/D19-1521/)
| Models | F1@1 | F1@3 | F1@5 |
|-----------|:----:|:----:|:----:|
| BLING-KPE | 26.7 | 29.2 | 20.9 |
| JointKPE | 39.1 | 39.8 | 33.8 |
| ETC | - | 40.2 | - |
| ERNIE-Doc | **40.2** | **40.5** | **34.4** |
### Semantic Matching
- [CAIL2019-SCM](https://arxiv.org/abs/1911.08962)
| Models | Dev (Acc.) | Test (Acc.) |
|-----------|:-------------:|:-------------:|
| BERT | 61.9 | 67.3 |
| ERNIE 2.0 | 64.9 | 67.9 |
| ERNIE-Doc | **65.6** | **68.8** |
## Usage
### Install PaddlePaddle
This code base has been tested with PaddlePaddle (version >= 1.8) under Python 3. The other dependencies of ERNIE-Doc are listed in `requirements.txt`; you can install them with
```script
pip install -r requirements.txt
```
### Fine-tuning
We release the fine-tuning code for English and Chinese classification tasks and for Chinese question answering tasks. For example, you can fine-tune the **ERNIE-Doc** base model on the IMDB, IFLYTEK and DuReader datasets with
```shell
sh script/run_imdb.sh
sh script/run_iflytek.sh
sh script/run_dureader.sh
```
[Preprocessing code for IMDB dataset](./ernie_doc/data/imdb/README.md)
The training log and evaluation results are written to `log/job.log.0`.
**Notice**: The actual total batch size is equal to `configured batch size * number of GPUs used`.
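For example, running with a configured batch size of 8 on 4 GPUs gives an actual total batch size of 32.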
## Citation
You can cite the paper as below:
```
@article{ding2020ernie,
title={ERNIE-DOC: The Retrospective Long-Document Modeling Transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```
[English](./README.md) | 简体中文
## _ERNIE-Doc_: A Retrospective Long-Document Modeling Transformer
- [Framework](#framework)
- [Pre-trained Models](#Pre-trained-Models)
- [Fine-tuning Tasks](#Fine-tuning-Tasks)
  * [Language Modeling](#Language-Modeling)
  * [Long-Text Classification](#Long-Text-Classification)
  * [Question Answering](#Question-Answering)
  * [Information Extraction](#Information-Extraction)
  * [Semantic Matching](#Semantic-Matching)
- [Usage](#Usage)
  * [Install PaddlePaddle](#Install-PaddlePaddle)
  * [Fine-tuning](#Fine-tuning)
- [Citation](#Citation)
For a technical description of the algorithm, please see our paper:
>[_**ERNIE-Doc: A Retrospective Long-Document Modeling Transformer**_](https://arxiv.org/abs/2012.15688)
>
>Siyu Ding\*, Junyuan Shang\*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>
>Preprint December 2020
>
>Accepted by **ACL-2021**
![ERNIE-Doc](https://img.shields.io/badge/预训练-长文本建模-green) ![paper](https://img.shields.io/badge/论文-ACL2021-yellow)
---
**ERNIE-Doc is a pretraining/fine-tuning framework for document-level long-text modeling.** Inspired by the human habit of skimming a document first and then reading it carefully, ERNIE-Doc introduces a **retrospective feed mechanism** and an **enhanced recurrence mechanism**, breaking through the text-length bottleneck of Transformers. ERNIE-Doc is the first model in the industry to achieve bidirectional modeling over full documents of unbounded length, and it achieves SOTA results on 13 authoritative Chinese and English long-text language understanding tasks, covering question answering, information extraction, document classification and language modeling.
## Framework
We propose three methods to address long-text modeling:
- **Retrospective Feed Mechanism**: each text segment is fed to the model twice, so that in the retrospective phase every segment can be modeled bidirectionally while exploiting the whole-document semantics gathered in the skimming phase.
- **Enhanced Recurrence Mechanism**: by changing Recurrence Transformers (e.g., Transformer-XL) to same-layer recurrence, text of unbounded length can be modeled.
- **Segment-reordering Objective**: this pretraining objective asks the model to predict the correct order of a document's segments, so that the model learns the relationships between segments and can model the document as a whole.
The figure below compares the modeling scheme and effective modeling length of ERNIE-Doc and Recurrence Transformers, with 3-layer networks and 4 input segments.
![framework](.meta/framework.png)
## Pre-trained Models
We release the **ERNIE-Doc _base_** Chinese and English models and the **ERNIE-Doc _large_** English model.
- [**ERNIE-Doc _base_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _base_zh_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _large_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz) (_24-layer, 1024-hidden, 16-heads_)
## Fine-tuning Tasks
We evaluate ERNIE-Doc on widely used datasets for language modeling, long-text classification, question answering (reading comprehension) and information extraction, and compare it with the current state-of-the-art models ([Longformer](https://arxiv.org/abs/2004.05150), [BigBird](https://arxiv.org/abs/2007.14062), [ETC](https://arxiv.org/abs/2004.08483), [ERNIE 2.0](https://arxiv.org/abs/1907.12412), etc.).
### Language Modeling
- [WikiText-103](https://arxiv.org/abs/1609.07843)
| Model | Param. | PPL |
|--------------------------|:--------:|:------:|
| _Results of base models_ | | |
| LSTM | - | 48.7 |
| LSTM+Neural cache | - | 40.8 |
| GCNN-14 | - | 37.2 |
| QRNN | 151M | 33.0 |
| Transformer-XL Base | 151M | 24.0 |
| SegaTransformer-XL Base | 151M | 22.5 |
| **ERNIE-Doc** Base | 151M | **21.0** |
| _Results of large models_ | | |
| Adaptive Input | 247M | 18.7 |
| Transformer-XL Large | 247M | 18.3 |
| Compressive Transformer | 247M | 17.1 |
| SegaTransformer-XL Large | 247M | 17.1 |
| **ERNIE-Doc** Large | 247M | **16.8** |
### Long-Text Classification
- [IMDB reviews](http://ai.stanford.edu/~amaas/data/sentiment/index.html)
| Models | Acc. | F1 |
|-----------------|:----:|:----:|
| RoBERTa | 95.3 | 95.0 |
| Longformer | 95.7 | - |
| BigBird | - | 95.2 |
| **ERNIE-Doc** Base | **96.1** | **96.1** |
| XLNet-Large | 96.8 | - |
| **ERNIE-Doc** Large | **97.1** | **97.1** |
- [Hyperpartisan News Detection](https://pan.webis.de/semeval19/semeval19-web/)
| Models | F1 |
|-----------------|:----:|
| RoBERTa | 87.8 |
| Longformer | 94.8 |
| BigBird | 92.2 |
| **ERNIE-Doc** Base | **96.3** |
| **ERNIE-Doc** Large | **96.6** |
- [THUCNews (THU)](http://thuctc.thunlp.org/), [IFLYTEK (IFK)](https://arxiv.org/abs/2004.05986)
| Models | THU Acc. (Dev) | THU Acc. (Test) | IFK Acc. (Dev) |
|-----------------|:--------:|:--------:|:--------:|
| BERT | 97.7 | 97.3 | 60.3 |
| BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
| RoBERTa-wwm-ext | - | - | 60.3 |
| ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
| ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
| **ERNIE-Doc** | **98.3** | **97.7** | **62.4** |
### Question Answering
- [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) results on the dev set
| Models | F1 |
|-----------------|:----:|
| RoBERTa | 74.3 |
| Longformer | 75.2 |
| BigBird | 79.5 |
| **ERNIE-Doc** Base | **80.1** |
| Longformer Large | 77.8 |
| BigBird Large | - |
| **ERNIE-Doc** Large | **82.5** |
- [HotpotQA](https://hotpotqa.github.io/) results on the dev set
| Models | Span-F1 | Supp.-F1 | Joint-F1 |
|-----------------|:----:|:----:|:----:|
| RoBERTa | 73.5 | 83.4 | 63.5 |
| Longformer | 74.3 | 84.4 | 64.4 |
| BigBird | 75.5 | **87.1** | 67.8 |
| **ERNIE-Doc** Base | **79.4** | 86.3 | **70.5** |
| Longformer Large | 81.0 | 85.8 | 71.4 |
| BigBird Large | 81.3 | **89.4** | - |
| **ERNIE-Doc** Large | **82.2** | 87.6 | **73.7** |
- [DRCD](https://arxiv.org/abs/1806.00920), [CMRC2018](https://arxiv.org/abs/1810.07366), [DuReader](https://arxiv.org/abs/1711.05073), [C3](https://arxiv.org/abs/1904.09679)
| Models | DRCD dev (EM/F1) | DRCD test (EM/F1) | CMRC2018 dev (EM/F1) | DuReader dev (EM/F1) | C3 dev (Acc.) | C3 test (Acc.) |
|-----------------|---------------|---------------|---------------|---------------|----------|----------|
| BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
| BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
| RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
| MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
| XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
| ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
| ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
| **ERNIE-Doc** | **90.5/95.2** | **90.5/95.1** | **76.1/91.6** | **65.8/77.9** | **76.5** | **76.5** |
### Information Extraction
- [Open Domain Web Keyphrase Extraction (OpenKP)](https://www.aclweb.org/anthology/D19-1521/)
| Models | F1@1 | F1@3 | F1@5 |
|-----------|:----:|:----:|:----:|
| BLING-KPE | 26.7 | 29.2 | 20.9 |
| JointKPE | 39.1 | 39.8 | 33.8 |
| ETC | - | 40.2 | - |
| ERNIE-Doc | **40.2** | **40.5** | **34.4** |
### Semantic Matching
- [CAIL2019-SCM](https://arxiv.org/abs/1911.08962)
| Models | Dev (Acc.) | Test (Acc.) |
|-----------|:-------------:|:-------------:|
| BERT | 61.9 | 67.3 |
| ERNIE 2.0 | 64.9 | 67.9 |
| ERNIE-Doc | **65.6** | **68.8** |
## Usage
### Install PaddlePaddle
Our code has been tested with PaddlePaddle (version >= 1.8) and Python 3. The other dependencies of ERNIE-Doc are listed in `requirements.txt` and can be installed with:
```script
pip install -r requirements.txt
```
### Fine-tuning
We release the fine-tuning code for English and Chinese classification tasks and for Chinese reading comprehension tasks; run the following scripts to reproduce the experiments:
```shell
sh script/run_imdb.sh      # English classification task
sh script/run_iflytek.sh   # Chinese classification task
sh script/run_dureader.sh  # Chinese reading comprehension task
```
[Preprocessing instructions for the IMDB dataset](./ernie_doc/data/imdb/README.md)
The fine-tuning hyperparameters can be changed in the scripts above; the training and evaluation logs are written to `log/job.log.0`.
**Notice**: the actual batch size during training equals `configured batch size * number of GPUs`.
## Citation
You can cite the paper as below:
```
@article{ding2020ernie,
title={ERNIE-DOC: The Retrospective Long-Document Modeling Transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```
{
"attention_probs_dropout_prob": 0.0,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 768,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"sent_type_vocab_size": 4,
"task_type_vocab_size": 3,
"vocab_size": 50265,
"memory_len": 128,
"epsilon": 1e-12
}
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"sent_type_vocab_size": 4,
"task_type_vocab_size": 3,
"vocab_size": 28000,
"memory_len": 128,
"epsilon": 1e-12
}
[
{
"prob": 1,
"data_func": "multi_sent_sorted",
"task_name": "multi_sent_sorted",
"valid_filelist": "./package/valid_filelist_multi_sort_sorted",
"train_filelist": "./package/train_filelist_multi_sort_sorted",
"loss_weight": 1.0,
"num_labels": 33,
"lm_weight": 1.0
}
]
{
"attention_probs_dropout_prob": 0.0,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 1024,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"sent_type_vocab_size": 4,
"task_type_vocab_size": 3,
"vocab_size": 50265,
"memory_len": 128,
"epsilon": 1e-12
}
## Download the official data
http://ai.stanford.edu/~amaas/data/sentiment/index.html
## Run the preprocessing script
```python
python multi_files_to_one.py
```
This generates train.txt and test.txt in this folder.
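Each line of the generated files is tab-separated with the header `qid\tlabel\tscore\ttext_a`; `label` is 0 for negative and 1 for positive reviews, and `score` is the star rating parsed from the original review file name (see `multi_files_to_one.py` below).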
import logging
import os
import numpy as np
from collections import namedtuple
from tqdm import tqdm

# Configure logging so the progress messages below are printed to the console.
logging.basicConfig(level=logging.INFO)
def read_files(dir_path):
"""
:param dir_path
"""
examples = []
Example = namedtuple('Example', ['qid', 'text_a', 'label', 'score'])
def _read_files(dir_p, label):
logging.info('loading data from %s' % dir_p)
data_files = os.listdir(dir_p)
desc = "loading " + dir_p
for f_idx, data_file in tqdm(enumerate(data_files), desc=desc):
file_path = os.path.join(dir_p, data_file)
qid, score = data_file.split('_')
score = score.split('.')[0]
with open(file_path, 'r') as f:
doc = []
for line in f:
line = line.strip().replace('<br /><br />', ' ')
doc.append(line)
doc_text = ' '.join(doc)
example = Example(
qid=len(examples)+1,
text_a=doc_text,
label=label,
score=score
)
examples.append(example)
neg_dir = os.path.join(dir_path, 'neg')
pos_dir = os.path.join(dir_path, 'pos')
_read_files(neg_dir, label=0)
_read_files(pos_dir, label=1)
logging.info('loading data finished')
return examples
def write_to_one(dir, o_file_name):
    examples = read_files(dir)
    logging.info('ex nums:%d' % (len(examples)))
    with open(o_file_name, 'w') as fout:
        fout.write("qid\tlabel\tscore\ttext_a\n")
        for ex in examples:
try:
fout.write("{}\t{}\t{}\t{}\n".format(ex.qid, ex.label, ex.score, ex.text_a.replace('\t', '')))
except Exception as e:
print(ex.qid, ex.text_a, ex.label, ex.score)
raise e
if __name__ == "__main__":
write_to_one("train", 'train.txt')
write_to_one("test", "test.txt")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import numpy as np
import collections
from collections import namedtuple
import paddle.fluid as fluid
from model.static.ernie import ErnieDocModel
from utils.multi_process_eval import MultiProcessEvalForErnieDoc
from utils.metrics import Acc
def create_model(args, ernie_config, mem_len=128, is_infer=False):
"""create model for classifier"""
shapes = [[-1, args.max_seq_len, 1], [-1, 2 * args.max_seq_len + mem_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, 1], [-1, 1], [-1, 1], []]
dtypes = ['int64', 'int64', 'int64', 'float32', 'int64', 'int64', 'int64', 'int64']
names = ["src_ids", "pos_ids", "task_ids", "input_mask", "labels", "qids", "gather_idx", "need_cal_loss"]
inputs = []
for shape, dtype, name in zip(shapes, dtypes, names):
inputs.append(fluid.layers.data(name=name, shape=shape, dtype=dtype, append_batch_size=False))
src_ids, pos_ids, task_ids, input_mask, labels, qids, \
gather_idx, need_cal_loss = inputs
pyreader = fluid.io.DataLoader.from_generator(
feed_list=inputs,
capacity=70, iterable=False)
ernie_doc = ErnieDocModel(
src_ids=src_ids,
position_ids=pos_ids,
task_ids=task_ids,
input_mask=input_mask,
config=ernie_config,
number_instance=args.batch_size,
rel_pos_params_sharing=args.rel_pos_params_sharing,
use_vars=args.use_vars)
mems, new_mems = ernie_doc.get_mem_output()
cls_feats = ernie_doc.get_pooled_output()
checkpoints = ernie_doc.get_checkpoints()
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=cls_feats,
size=args.num_labels,
param_attr=fluid.ParamAttr(
name="cls_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
if is_infer:
probs = fluid.layers.softmax(logits)
feed_targets_name = [
src_ids.name, pos_ids.name, task_ids.name, input_mask.name
]
return pyreader, probs, feed_targets_name
# filter
qids, logits, labels = list(map(lambda x: fluid.layers.gather(x, gather_idx), [qids, logits, labels]))
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
num_seqs = fluid.layers.create_tensor(dtype='int32')
accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
loss, num_seqs, accuracy = list(map(lambda x: x * need_cal_loss, [loss, num_seqs, accuracy]))
graph_vars = collections.OrderedDict()
fetch_names = ['loss', 'accuracy', 'probs', 'labels', 'num_seqs', 'qids', 'need_cal_loss']
fetch_vars = [loss, accuracy, probs, labels, num_seqs, qids, need_cal_loss]
for name, var in zip(fetch_names, fetch_vars):
graph_vars[name] = var
for k, v in graph_vars.items():
v.persistable = True
mems_vars = {'mems': mems, 'new_mems': new_mems}
return pyreader, graph_vars, checkpoints, mems_vars
def evaluate(exe,
program,
pyreader,
graph_vars,
mems_vars,
tower_mems_np,
phase,
steps=None,
trainers_id=None,
trainers_num=None,
scheduled_lr=None,
use_vars=False):
"""evaluate interface"""
fetch_names = [k for k, v in graph_vars.items()]
fetch_list = [v for k, v in graph_vars.items()]
if phase == "train":
fetch_names += ['scheduled_lr']
fetch_list += [scheduled_lr]
if not use_vars:
feed_dict = {}
for m, m_np in zip(mems_vars['mems'], tower_mems_np):
feed_dict[m.name] = m_np
fetch_list += mems_vars['new_mems']
fetch_names += [m.name for m in mems_vars['new_mems']]
if phase == "train":
if use_vars:
outputs = exe.run(fetch_list=fetch_list, program=program, use_program_cache=True)
else:
outputs = exe.run(feed=feed_dict, fetch_list=fetch_list, program=program, use_program_cache=True)
tower_mems_np = outputs[-len(mems_vars['new_mems']):]
outputs_dict = {}
for var_name, output_var in zip(fetch_names, outputs):
outputs_dict[var_name] = output_var
ret = {"loss": np.mean(outputs_dict['loss']),
"accuracy": np.mean(outputs_dict['accuracy']),
"learning_rate": np.mean(outputs_dict['scheduled_lr']),
"tower_mems_np": tower_mems_np}
return ret
if phase == "eval" or phase == "test":
pyreader.start()
qids, labels, scores = [], [], []
time_begin = time.time()
all_results = []
total_cost, total_num_seqs= 0.0, 0.0
RawResult = namedtuple("RawResult", ["unique_id", "prob", "label"])
while True:
try:
if use_vars:
outputs = exe.run(
program=program, fetch_list=fetch_list, use_program_cache=True)
else:
feed_dict = {}
for m, m_np in zip(mems_vars['mems'], tower_mems_np):
feed_dict[m.name] = m_np
outputs = exe.run(feed=feed_dict, fetch_list=fetch_list, program=program, use_program_cache=True)
tower_mems_np = outputs[-len(mems_vars['new_mems']):]
outputs = outputs[:-len(mems_vars['new_mems'])]
np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids, np_need_cal_loss = outputs
if int(np_need_cal_loss) == 1:
total_cost += np.sum(np_loss * np_num_seqs)
total_num_seqs += np.sum(np_num_seqs)
for idx in range(np_qids.shape[0]):
if len(all_results) % 1000 == 0 and len(all_results):
print("processining example: %d" % len(all_results))
qid_each = int(np_qids[idx])
probs_each = [float(x) for x in np_probs[idx].flat]
label_each = int(np_labels[idx])
all_results.append(
RawResult(
unique_id=qid_each,
prob=probs_each,
label=label_each))
except fluid.core.EOFException:
pyreader.reset()
break
time_end = time.time()
output_path = "./tmpout"
mul_pro_test = MultiProcessEvalForErnieDoc(output_path, phase, trainers_num, trainers_id)
is_print = True
if mul_pro_test.dev_count > 1:
is_print = False
mul_pro_test.write_result(all_results)
if trainers_id == 0:
is_print = True
all_results = mul_pro_test.concat_result(RawResult)
if is_print:
num_seqs, all_labels, all_probs = mul_pro_test.write_predictions(all_results)
acc_func = Acc()
accuracy = acc_func.eval([all_probs, all_labels])
time_cost = time_end - time_begin
print("[%d_%s evaluation] ave loss: %f, ave acc: %f, data_num: %d, elapsed time: %f s"
% (steps, phase, total_cost / total_num_seqs, accuracy, num_seqs, time_cost))
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for MRC."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import numpy as np
from collections import namedtuple
import paddle.fluid as fluid
from model.static.ernie import ErnieDocModel
from utils.metrics import EM_AND_F1
from reader.tokenization import BasicTokenizer
from utils.multi_process_eval import MultiProcessEvalForMrc
def create_model(args, ernie_config, mem_len=128, is_infer=False):
"""create model for mrc"""
shapes = [[-1, args.max_seq_len, 1], [-1, 2 * args.max_seq_len + mem_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, 1], [-1, 1], [-1, 1], [-1, 1], []]
dtypes = ['int64', 'int64', 'int64', 'float32', 'int64', 'int64', 'int64', 'int64', 'int64']
names = ["src_ids", "pos_ids", "task_ids", "input_mask", "start_positions", \
"end_positions", "qids", "gather_idx", "need_cal_loss"]
inputs = []
for shape, dtype, name in zip(shapes, dtypes, names):
inputs.append(fluid.layers.data(name=name, shape=shape, dtype=dtype, append_batch_size=False))
src_ids, pos_ids, task_ids, input_mask, start_positions, \
end_positions, qids, gather_idx, need_cal_loss = inputs
pyreader = fluid.io.DataLoader.from_generator(
feed_list=inputs,
capacity=70, iterable=False)
ernie_doc = ErnieDocModel(
src_ids=src_ids,
position_ids=pos_ids,
task_ids=task_ids,
input_mask=input_mask,
config=ernie_config,
number_instance=args.batch_size,
rel_pos_params_sharing=args.rel_pos_params_sharing,
use_vars=args.use_vars)
enc_out = ernie_doc.get_sequence_output()
checkpoints = ernie_doc.get_checkpoints()
mems, new_mems = ernie_doc.get_mem_output()
enc_out = fluid.layers.dropout(
x=enc_out,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=enc_out,
size=2,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name="cls_mrc_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_mrc_out_b", initializer=fluid.initializer.Constant(0.)))
if is_infer:
probs = fluid.layers.softmax(logits)
feed_targets_name = [
src_ids.name, pos_ids.name, task_ids.name, input_mask.name
]
return pyreader, probs, feed_targets_name
logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])
start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)
filter_output = list(map(lambda x: fluid.layers.gather(x, gather_idx), \
[qids, start_logits, end_logits, start_positions, end_positions]))
qids, start_logits, end_logits, start_positions, end_positions = filter_output
def compute_loss(logits, positions):
"""compute loss"""
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=positions)
loss = fluid.layers.mean(x=loss)
return loss
start_loss = compute_loss(start_logits, start_positions)
end_loss = compute_loss(end_logits, end_positions)
loss = (start_loss + end_loss) / 2.0
loss *= need_cal_loss
mems_vars = {'mems': mems, 'new_mems': new_mems}
graph_vars = {
"loss": loss,
"qids": qids,
"start_logits": start_logits,
"end_logits": end_logits,
"need_cal_loss": need_cal_loss
}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars, checkpoints, mems_vars
def evaluate(exe,
program,
pyreader,
graph_vars,
mems_vars,
tower_mems_np,
phase,
steps=None,
trainers_id=None,
trainers_num=None,
scheduled_lr=None,
use_vars=False,
examples=None,
features=None,
args=None):
"""evaluate interface"""
fetch_names = [k for k, v in graph_vars.items()]
fetch_list = [v for k, v in graph_vars.items()]
if phase == "train":
fetch_names += ['scheduled_lr']
fetch_list += [scheduled_lr]
if not use_vars:
feed_dict = {}
for m, m_np in zip(mems_vars['mems'], tower_mems_np):
feed_dict[m.name] = m_np
fetch_list += mems_vars['new_mems']
fetch_names += [m.name for m in mems_vars['new_mems']]
if phase == "train":
if use_vars:
outputs = exe.run(fetch_list=fetch_list, program=program, use_program_cache=True)
else:
outputs = exe.run(feed=feed_dict, fetch_list=fetch_list, program=program, use_program_cache=True)
tower_mems_np = outputs[-len(mems_vars['new_mems']):]
outputs_dict = {}
for var_name, output_var in zip(fetch_names, outputs):
outputs_dict[var_name] = output_var
ret = {"loss": np.mean(outputs_dict['loss']),
"learning_rate": np.mean(outputs_dict['scheduled_lr']),
"tower_mems_np": tower_mems_np}
return ret
if phase == "eval" or phase == "test":
output_dir = args.checkpoints
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_prediction_file = os.path.join(output_dir, phase + "_predictions.json")
output_nbest_file = os.path.join(output_dir, phase + "_nbest_predictions.json")
RawResult = namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
pyreader.start()
all_results = []
time_begin = time.time()
while True:
try:
if use_vars:
outputs = exe.run(
program=program, fetch_list=fetch_list, use_program_cache=True)
else:
feed_dict = {}
for m, m_np in zip(mems_vars['mems'], tower_mems_np):
feed_dict[m.name] = m_np
outputs = exe.run(feed=feed_dict, fetch_list=fetch_list, program=program, use_program_cache=True)
tower_mems_np = outputs[-len(mems_vars['new_mems']):]
outputs = outputs[:-len(mems_vars['new_mems'])]
np_loss, np_qids, np_start_logits, np_end_logits, np_need_cal_loss = outputs
if int(np_need_cal_loss) == 1:
for idx in range(np_qids.shape[0]):
if len(all_results) % 1000 == 0:
print("Processing example: %d" % len(all_results))
qid_each = int(np_qids[idx])
start_logits_each = [float(x) for x in np_start_logits[idx].flat]
end_logits_each = [float(x) for x in np_end_logits[idx].flat]
all_results.append(
RawResult(
unique_id=qid_each,
start_logits=start_logits_each,
end_logits=end_logits_each))
except fluid.core.EOFException:
pyreader.reset()
break
time_end = time.time()
output_path = "./tmpout"
tokenizer = BasicTokenizer(do_lower_case=args.do_lower_case)
mul_pro_test = MultiProcessEvalForMrc(output_path, phase, trainers_num,
trainers_id, tokenizer)
is_print = True
if mul_pro_test.dev_count > 1:
is_print = False
mul_pro_test.write_result(all_results)
if trainers_id == 0:
is_print = True
all_results = mul_pro_test.concat_result(RawResult)
if is_print:
mul_pro_test.write_predictions(examples,
features,
all_results,
args.n_best_size,
args.max_answer_length,
args.do_lower_case,
mul_pro_test.output_prediction_file,
mul_pro_test.output_nbest_file)
if phase == "eval":
data_file = args.dev_set
elif phase == "test":
data_file = args.test_set
elapsed_time = time_end - time_begin
em_and_f1 = EM_AND_F1()
em, f1, avg, total = em_and_f1.eval_file(data_file, mul_pro_test.output_prediction_file)
print("[%d_%s evaluation] em: %f, f1: %f, avg: %f, questions: %d, elapsed time: %f"
% (steps, phase, em, f1, avg, total, elapsed_time))
import sys
import subprocess
import os
import six
import copy
import argparse
from utils.args import ArgumentGroup, print_arguments
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
multip_g = ArgumentGroup(parser, "multiprocessing",
"start paddle training using multi-processing mode.")
multip_g.add_arg("node_ips", str, None,
"paddle trainer ips")
multip_g.add_arg("node_id", int, None,
"the trainer id of the node for multi-node distributed training.")
multip_g.add_arg("print_config", bool, True,
"print the config of multi-processing mode.")
multip_g.add_arg("current_node_ip", str, None,
"the ip of current node.")
multip_g.add_arg("split_log_path", str, "log",
"log path for each trainer.")
multip_g.add_arg("nproc_per_node", int, 8,
"the number of process to use on each node.")
multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7",
"the gpus selected to use.")
multip_g.add_arg("training_script", str, None, "the program/script to be lauched "
"in parallel followed by all the arguments", positional_arg=True)
multip_g.add_arg("training_script_args", str, None,
"training script args", positional_arg=True, nargs=argparse.REMAINDER)
# yapf: enable
def start_procs(args):
procs = []
log_fns = []
default_env = os.environ.copy()
node_id = args.node_id
node_ips = [x.strip() for x in args.node_ips.split(',')]
current_ip = args.current_node_ip
num_nodes = len(node_ips)
selected_gpus = [x.strip() for x in args.selected_gpus.split(',')]
selected_gpu_num = len(selected_gpus)
all_trainer_endpoints = ""
for ip in node_ips:
for i in range(args.nproc_per_node):
if all_trainer_endpoints != "":
all_trainer_endpoints += ","
all_trainer_endpoints += "%s:617%d" % (ip, i)
nranks = num_nodes * args.nproc_per_node
gpus_per_proc = args.nproc_per_node % selected_gpu_num
if gpus_per_proc == 0:
gpus_per_proc = selected_gpu_num // args.nproc_per_node
else:
gpus_per_proc = selected_gpu_num // args.nproc_per_node + 1
selected_gpus_per_proc = [selected_gpus[i:i + gpus_per_proc] for i in range(0, len(selected_gpus), gpus_per_proc)]
if args.print_config:
print("all_trainer_endpoints: ", all_trainer_endpoints,
", node_id: ", node_id,
", current_ip: ", current_ip,
", num_nodes: ", num_nodes,
", node_ips: ", node_ips,
", gpus_per_proc: ", gpus_per_proc,
", selected_gpus_per_proc: ", selected_gpus_per_proc,
", nranks: ", nranks)
current_env = copy.copy(default_env)
procs = []
log_fns = []
for i in range(0, args.nproc_per_node):
trainer_id = node_id * args.nproc_per_node + i
current_env.update({
"FLAGS_selected_gpus": "%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]),
"PADDLE_TRAINER_ID" : "%d" % trainer_id,
"PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i),
"PADDLE_TRAINERS_NUM": "%d" % nranks,
"PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints,
"PADDLE_NODES_NUM": "%d" % num_nodes
})
cmd = [sys.executable, "-u",
args.training_script] + args.training_script_args
if args.split_log_path:
fn = open("%s/job.log.%d" % (args.split_log_path, trainer_id), "w")
log_fns.append(fn)
process = subprocess.Popen(cmd, env=current_env, stdout=fn, stderr=fn)
else:
process = subprocess.Popen(cmd, env=current_env)
procs.append(process)
for i in range(len(procs)):
try:
procs[i].communicate()
procs[i].terminate()
if len(log_fns) > 0:
log_fns[i].close()
except:
raise subprocess.CalledProcessError(returncode=procs[i].returncode,
cmd=procs[i].args)
def main(args):
if args.print_config:
print_arguments(args)
start_procs(args)
if __name__ == "__main__":
args = parser.parse_args()
main(args)
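# Illustrative invocation (hypothetical paths/arguments, not part of the
# original script): a single-node run with 8 workers could look like
#   python <this_launcher>.py --node_ips 127.0.0.1 --node_id 0 \
#       --current_node_ip 127.0.0.1 --nproc_per_node 8 \
#       --selected_gpus 0,1,2,3,4,5,6,7 <training_script>.py <script args>
# where <training_script>.py stands for the actual fine-tuning entry point.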
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.collective import fleet
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter(
)
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
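# In plain terms (illustrative, assuming power=1.0 as configured above), the
# schedule produced by linear_warmup_decay is:
#   lr(step) = learning_rate * step / warmup_steps                         for step < warmup_steps
#   lr(step) = learning_rate * (1 - min(step, T) / T),  T = num_train_steps   otherwise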
def exclude_from_weight_decay(name):
"""exclude_from_weight_decay"""
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
def layer_decay(param, param_last, learning_rate, decay_rate, n_layers):
"""layerwise learning rate decay"""
delta = param - param_last
if "encoder_layer" in param.name and param.name.index("encoder_layer")==0:
print(param.name)
layer = int(param.name.split("_")[2])
ratio = decay_rate ** (n_layers + 1 - layer)
ratio = decay_rate ** (n_layers - layer)
param_update = param + (ratio - 1) * delta
elif "embedding" in param.name:
ratio = decay_rate ** (n_layers + 2)
ratio = decay_rate ** (n_layers + 1)
param_update = param + (ratio - 1) * delta
else:
param_update = None
return param_update
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
use_amp=False,
init_loss_scaling=32768,
layer_decay_rate=0,
n_layers=12,
dist_strategy=None):
"""optimization"""
grad_clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(learning_rate ** 2)),
warmup_steps)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
else:
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr, epsilon=1e-06, grad_clip=grad_clip)
optimizer._learning_rate_map[fluid.default_main_program()] = scheduled_lr
loss_scaling = fluid.layers.create_global_var(
name=fluid.unique_name.generate("loss_scaling"),
shape=[1],
value=init_loss_scaling,
dtype='float32',
persistable=True)
param_list = dict()
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
if dist_strategy:
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
loss = fluid.layers.mean(loss)
_, param_grads = optimizer.minimize(loss)
if use_amp:
loss_scaling = optimizer._optimizer.get_loss_scaling()
if layer_decay_rate > 0:
for param, grad in param_grads:
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("layer_decay"):
param_decay = layer_decay(param, param_list[param.name], \
scheduled_lr, layer_decay_rate, n_layers)
if param_decay:
fluid.layers.assign(output=param, input=param_decay)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr, loss_scaling
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ernie Doc model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import six
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from model.static.transformer_encoder import encoder, pre_process_layer
class ErnieConfig(object):
"""ErnieConfig"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing Ernie model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
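# Illustrative usage: the *_config.json files shipped with the released
# checkpoints (hidden_size, num_hidden_layers, memory_len, ...) are parsed by
# this class, e.g.
#   config = ErnieConfig("ernie_config.json")   # hypothetical path
#   config.print_config()
#   hidden_size = config["hidden_size"]         # 768 for base, 1024 for large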
class ErnieDocModel(object):
def __init__(self,
src_ids,
position_ids,
task_ids,
input_mask,
config,
number_instance,
weight_sharing=True,
rel_pos_params_sharing=False,
use_vars=False):
"""
Fundamental pretrained Ernie Doc model
"""
self._emb_size = config['hidden_size']
self._n_layer = config['num_hidden_layers']
self._n_head = config['num_attention_heads']
self._voc_size = config['vocab_size']
self._max_position_seq_len = config['max_position_embeddings']
self._task_types = config['task_type_vocab_size']
self._hidden_act = config['hidden_act']
self._memory_len = config["memory_len"]
self._epsilon = config["epsilon"]
self._prepostprocess_dropout = config['hidden_dropout_prob']
self._attention_dropout = config['attention_probs_dropout_prob']
self._number_instance = number_instance
self._weight_sharing = weight_sharing
self._rel_pos_params_sharing = rel_pos_params_sharing
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._task_emb_name = "task_embedding"
self._emb_dtype = "float32"
self._encoder_checkpints = []
self._batch_size = layers.slice(layers.shape(src_ids), axes=[0], starts=[0], ends=[1])
        # Initialize all weights with a truncated normal initializer; all biases
        # are initialized to constant zero by default.
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
self._use_vars = use_vars
self._init_memories()
self._build_model(src_ids, position_ids, task_ids, input_mask)
def _init_memories(self):
"""Initialize memories"""
self.memories = []
for i in range(self._n_layer):
if self._memory_len:
if self._use_vars:
self.memories.append(layers.create_global_var(
shape=[self._number_instance, self._memory_len, self._emb_size],
value=0.0,
dtype=self._emb_dtype,
persistable=True,
force_cpu=False,
name="memory_%d" % i))
else:
self.memories.append(layers.data(
name="memory_%d" % i,
shape=[-1, self._memory_len, self._emb_size],
dtype=self._emb_dtype,
append_batch_size=False))
else:
self.memories.append([None])
def _build_model(self, src_ids, position_ids, task_ids, input_mask):
"""Build Ernie Doc Model"""
# padding id in vocabulary must be set to 0
word_emb = layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
pos_emb = layers.embedding(
input=position_ids,
size=[self._max_position_seq_len * 2 + self._memory_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
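        # Prepend memory_len entries to the task ids (below) and to the input
        # mask so that the cached memory positions carry the same task id as the
        # current segment and remain visible to attention.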
task_ids = layers.concat([
layers.zeros(
shape=[self._batch_size, self._memory_len, 1],
dtype="int64") + task_ids[0, 0, 0],
task_ids], axis=1)
task_ids.stop_gradient = True
task_emb = layers.embedding(
task_ids,
size=[self._task_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._task_emb_name, initializer=self._param_initializer))
word_emb = pre_process_layer(
word_emb, 'nd', self._prepostprocess_dropout, name='pre_encoder_emb')
pos_emb = pre_process_layer(
pos_emb, 'nd', self._prepostprocess_dropout, name='pre_encoder_r_pos')
task_emb = pre_process_layer(
task_emb, 'nd', self._prepostprocess_dropout, name="pre_encoder_r_task")
data_mask = layers.concat([
layers.ones(
shape=[self._batch_size, self._memory_len, 1],
dtype=input_mask.dtype),
input_mask], axis=1)
data_mask.stop_gradient = True
self_attn_mask = layers.matmul(
x=input_mask, y=data_mask, transpose_y=True)
self_attn_mask = layers.scale(
x=self_attn_mask, scale=1000000000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out, self._new_mems, self._checkpoints = encoder(
enc_input=word_emb,
memories=self.memories,
rel_pos=pos_emb,
rel_task=task_emb,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
memory_len=self._memory_len,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
epsilon=self._epsilon,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
rel_pos_params_sharing=self._rel_pos_params_sharing,
name='encoder',
use_vars=self._use_vars)
def get_sequence_output(self):
return self._enc_out
def get_checkpoints(self):
return self._checkpoints
def get_mem_output(self):
return self.memories, self._new_mems
def get_pooled_output(self):
"""Get the last feature of each sequence for classification"""
next_sent_feat = layers.slice(
input=self._enc_out, axes=[1], starts=[-1], ends=[self._max_position_seq_len])
next_sent_feat = layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(
name="pooled_fc.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc.b_0")
return next_sent_feat
def get_pretrained_output(self,
mask_label,
mask_pos,
need_cal_loss=True,
reorder_labels=None,
reorder_chose_idx=None,
reorder_need_cal_loss=False):
"""Get the loss & accuracy for pretraining"""
reshaped_emb_out = fluid.layers.reshape(
x=self._enc_out, shape=[-1, self._emb_size])
# extract masked tokens' feature
mask_feat = layers.gather(input=reshaped_emb_out,
index=layers.cast(mask_pos, dtype="int32"))
# transform: fc
mask_trans_feat = layers.fc(
input=mask_feat,
size=self._emb_size,
act=self._hidden_act,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_fc.w_0',
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = layers.layer_norm(
mask_trans_feat,
begin_norm_axis=len(mask_trans_feat.shape) - 1,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_bias',
initializer=fluid.initializer.Constant(1.)))
mask_lm_out_bias_attr = fluid.ParamAttr(
name="mask_lm_out_fc.b_0",
initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing:
fc_out = layers.matmul(
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(
self._word_emb_name),
transpose_y=True)
fc_out += layers.create_parameter(
shape=[self._voc_size],
dtype=self._emb_dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
else:
fc_out = layers.fc(
input=mask_trans_feat,
size=self._voc_size,
param_attr=fluid.ParamAttr(
name="mask_lm_out_fc.w_0",
initializer=self._param_initializer),
bias_attr=mask_lm_out_bias_attr)
mlm_loss = layers.softmax_with_cross_entropy(
logits=fc_out, label=mask_label)
mean_mlm_loss = layers.mean(mlm_loss) * need_cal_loss
        # extract the last token's feature of each sequence (see get_pooled_output)
self.next_sent_feat = self.get_pooled_output()
next_sent_feat_filter = layers.gather(
input=self.next_sent_feat,
index=reorder_chose_idx)
reorder_fc_out = layers.fc(
input=next_sent_feat_filter,
size=33,
param_attr=fluid.ParamAttr(
name="multi_sent_sorted" + "_fc.w_0", initializer=self._param_initializer),
bias_attr="multi_sent_sorted" + "_fc.b_0")
reorder_loss, reorder_softmax = layers.softmax_with_cross_entropy(
logits=reorder_fc_out, label=reorder_labels, return_softmax=True)
reorder_acc = fluid.layers.accuracy(
input=reorder_softmax, label=reorder_labels)
mean_reorder_loss = fluid.layers.mean(reorder_loss) * reorder_need_cal_loss
total_loss = mean_mlm_loss + mean_reorder_loss
reorder_acc *= reorder_need_cal_loss
return total_loss, mean_mlm_loss, reorder_acc
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def _cache_mem(curr_out, prev_mem, mem_len=None, use_vars=False):
"""generate new memories for next step"""
if mem_len is None or mem_len == 0:
return None
else:
if prev_mem is None:
new_mem = curr_out[:, -mem_len:, :]
else:
new_mem = layers.concat([prev_mem, curr_out], 1)[:, -mem_len:, :]
new_mem.stop_gradient = True
if use_vars:
layers.assign(new_mem, prev_mem)
return new_mem
def multi_head_attention(queries,
keys,
values,
rel_pos,
rel_task,
memory,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
r_w_bias=None,
r_r_bias=None,
r_t_bias=None,
dropout_rate=0.,
cache=None,
param_initializer=None,
rel_pos_params_sharing=False,
name='multi_head_att'):
"""
    Multi-Head Attention. Note that attn_bias is added to the logits before
    computing the softmax activation, to mask selected positions so that they
    are not considered in the attention weights.
"""
if memory is not None and len(memory.shape) > 1:
cat = fluid.layers.concat([memory, queries], 1)
else:
cat = queries
keys, values = cat, cat
if not (len(queries.shape) == len(keys.shape) == len(values.shape) \
== len(rel_pos.shape) == len(rel_task.shape)== 3):
raise ValueError(
"Inputs: quries, keys, values, rel_pos and rel_task should all be 3-D tensors.")
if rel_pos_params_sharing:
assert (r_w_bias and r_r_bias and r_t_bias) is not None, \
"the rel pos bias can not be None when sharing the relative position params"
else:
r_w_bias, r_r_bias, r_t_bias = \
list(map(lambda x: layers.create_parameter(
shape=[n_head * d_key],
dtype="float32",
name=name + "_" + x,
default_initializer=param_initializer),
["r_w_bias", "r_r_bias", "r_t_bias"]))
def __compute_qkv(queries, keys, values, rel_pos, rel_task, n_head, d_key, d_value):
"""
        Add linear projections to queries, keys, values, positions and tasks.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
r = layers.fc(input=rel_pos,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_pos_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_pos_fc.b_0')
t = layers.fc(input=rel_task,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_task_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_task_fc.b_0')
return q, k, v, r, t
def __split_heads(x, n_head, add_bias=None):
"""
        Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
if add_bias:
reshaped = reshaped + add_bias
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def __rel_shift(x, klen):
"""return relative shift"""
x_shape = x.shape
INT_MAX=10000000
x = layers.reshape(x, [x_shape[0], x_shape[1], x_shape[3], x_shape[2]])
x = layers.slice(x, [0, 1, 2, 3], [0, 0, 1, 0], [INT_MAX, INT_MAX, INT_MAX, INT_MAX])
x = layers.reshape(x, [x_shape[0], x_shape[1], x_shape[2], x_shape[3] - 1])
x = layers.slice(x, [0, 1, 2, 3], [0, 0, 0, 0], [INT_MAX, INT_MAX, INT_MAX, klen])
return x
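    # Illustrative sketch (not executed): how __rel_shift realigns relative-position
    # scores, assuming qlen=2 queries scored against plen=4 relative positions and
    # klen=2 keys. Reshaping [[a0, a1, a2, a3], [b0, b1, b2, b3]] to [4, 2], dropping
    # the first row, reshaping back to [2, 3] and keeping the first klen columns gives
    # [[a2, a3], [b1, b2]]: row i keeps columns [qlen - i, qlen - i + klen) of the
    # input, so each query row is shifted by one position relative to the previous
    # row, which is the Transformer-XL relative-shift trick without explicit padding.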
def __scaled_dot_product_attention(q, k, v, r, t, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
q_w, q_r, q_t = list(map(lambda x: layers.scale(x=x, scale=d_key ** -0.5), q))
score_w = layers.matmul(x=q_w, y=k, transpose_y=True)
score_r = layers.matmul(x=q_r, y=r, transpose_y=True)
score_r = __rel_shift(score_r, k.shape[2])
score_t = layers.matmul(x=q_t, y=t, transpose_y=True)
score = score_w + score_r + score_t
if attn_bias is not None:
score += attn_bias
weights = layers.softmax(score, use_cudnn=True)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v, r, t = __compute_qkv(queries, keys, values, rel_pos, rel_task, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q_w, q_r, q_t = list(map(lambda x: layers.elementwise_add(q, x, 2), [r_w_bias, r_r_bias, r_t_bias]))
q_w, q_r, q_t = list(map(lambda x: __split_heads(x, n_head), [q_w, q_r, q_t]))
k, v, r, t = list(map(lambda x: __split_heads(x, n_head), [k, v, r, t]))
ctx_multiheads = __scaled_dot_product_attention([q_w, q_r, q_t], \
k, v, r, t, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
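# Reading aid for the attention above (a summary, not extra computation):
#
#     score = ((q + r_w_bias) k^T
#              + rel_shift((q + r_r_bias) r^T)
#              + (q + r_t_bias) t^T) / sqrt(d_key) + attn_bias
#     out   = softmax(score) v
#
# i.e. the Transformer-XL content and relative-position terms plus an additional
# relative-task term; the 1/sqrt(d_key) scaling is applied to the query projections
# before the matmuls, which is equivalent.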
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
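# Equivalently, the block above computes FFN(x) = fc_1(dropout(act(fc_0(x)))), where
# fc_0 projects d_hid -> d_inner_hid with the given activation and fc_1 projects back
# d_inner_hid -> d_hid; dropout sits between the two projections.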
def pre_post_process_layer(prev_out,
out,
process_cmd,
dropout_rate=0.,
epsilon=1e-5,
name=''):
"""
    Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
            out = out + prev_out if prev_out is not None else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)),
epsilon=epsilon)
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
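# With the defaults used below (preprocess_cmd="n", postprocess_cmd="da"), each
# sub-layer therefore follows the pre-LN pattern:
#
#     y = x + dropout(sublayer(layer_norm(x)))
#
# "n" normalizes the input before the sub-layer, while "d" then "a" apply dropout to
# the sub-layer output and add the residual connection afterwards.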
def encoder_layer(enc_input,
rel_pos,
rel_task,
memory,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
r_w_bias,
r_r_bias,
r_t_bias,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
epsilon=1e-5,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
rel_pos_params_sharing=False,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
        This module consists of a multi-head (self) attention followed by
        position-wise feed-forward networks, and both components are accompanied
        by the post_process_layer to add residual connection, layer normalization
        and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
epsilon=epsilon,
name=name + '_pre_att'),
None,
None,
rel_pos,
rel_task,
memory,
attn_bias,
d_key,
d_value,
d_model,
n_head,
r_w_bias,
r_r_bias,
r_t_bias,
attention_dropout,
param_initializer=param_initializer,
rel_pos_params_sharing=rel_pos_params_sharing,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
epsilon=epsilon,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
epsilon=epsilon,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
epsilon=epsilon,
name=name + '_post_ffn'), ffd_output
def encoder(enc_input,
memories,
rel_pos,
rel_task,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
memory_len,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
epsilon=1e-5,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
rel_pos_params_sharing=False,
name='',
use_vars=False):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
r_w_bias, r_r_bias, r_t_bias = None, None, None
if rel_pos_params_sharing:
r_w_bias, r_r_bias, r_t_bias = \
list(map(lambda x: layers.create_parameter(
shape=[n_head * d_key],
dtype="float32",
name=name + "_" + x,
default_initializer=param_initializer),
["r_w_bias", "r_r_bias", "r_t_bias"]))
checkpoints = []
_new_mems = []
for i in range(n_layer):
enc_input, cp = encoder_layer(
enc_input,
rel_pos,
rel_task,
memories[i],
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
r_w_bias,
r_r_bias,
r_t_bias,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
epsilon,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
rel_pos_params_sharing=rel_pos_params_sharing,
name=name + '_layer_' + str(i))
checkpoints.append(cp.name)
new_mem = _cache_mem(enc_input, memories[i], memory_len, use_vars=use_vars)
if not use_vars:
_new_mems.append(new_mem)
enc_output = pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
epsilon,
name="post_encoder")
return enc_output, _new_mems, checkpoints[:-1]
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
import paddle
import numpy as np
def get_related_pos(insts,
seq_len,
memory_len=128):
"""generate relative postion ids"""
beg = seq_len + seq_len + memory_len
r_position = [list(range(beg - 1, seq_len - 1, -1)) + \
list(range(0, seq_len)) for i in range(len(insts))]
return np.array(r_position).astype('int64').reshape([len(insts), beg, 1])
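# Worked example (illustrative only): for seq_len=3 and memory_len=2,
# beg = 3 + 3 + 2 = 8 and every instance gets the same relative position ids
#     [7, 6, 5, 4, 3, 0, 1, 2]
# (seq_len + memory_len descending ids followed by seq_len ascending ids),
# returned as an int64 array of shape [len(insts), 8, 1].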
def pad_batch_data(insts,
insts_data_type="int64",
pad_idx=0,
final_cls=False,
pad_max_len=None,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
if pad_max_len:
max_len = pad_max_len
else:
max_len = max(len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and have no effect on parameter gradients.
    # token ids
if final_cls:
inst_data = np.array(
[inst[:-1] + list([pad_idx] * (max_len - len(inst))) + [inst[-1]] for inst in insts])
else:
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype(insts_data_type).reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
if final_cls:
input_mask_data = np.array([[1] * len(inst[:-1]) + [0] *
(max_len - len(inst)) + [1] for inst in insts])
else:
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
if paddle.__version__[:3] <= '1.5':
seq_lens_type = [-1, 1]
else:
seq_lens_type = [-1]
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype("int64").reshape(seq_lens_type)]
return return_list if len(return_list) > 1 else return_list[0]
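# Minimal usage sketch (values chosen only for illustration):
#     pad_batch_data([[5, 6, 7], [8, 9]], pad_idx=0, return_input_mask=True)
# pads to max_len=3 and returns token ids [[5, 6, 7], [8, 9, 0]] with shape [2, 3, 1]
# plus a float32 mask [[1, 1, 1], [1, 1, 0]] with shape [2, 3, 1]; with final_cls=True
# the last token of each instance (the trailing [CLS]) is kept in the final slot and
# padding is inserted before it instead.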
# -*- coding: utf-8 -*-
"""
Byte pair encoding utilities from GPT-2.
Original source: https://github.com/openai/gpt-2/blob/master/src/encoder.py
Original license: MIT
"""
#from functools import lru_cache
try:
from functools import lru_cache
except ImportError:
from backports.functools_lru_cache import lru_cache
import json
import six
import regex as re
@lru_cache()
def bytes_to_unicode():
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
if six.PY2:
bs = list(range(ord("!".decode('utf8')), ord("~".decode('utf8'))+1))+list(range(ord("¡".decode('utf8')),ord("¬".decode('utf8'))+1))+list(range(ord("®".decode('utf8')), ord("ÿ".decode('utf8'))+1))
else:
bs = (
list(range(ord("!"), ord("~") + 1))
+ list(range(ord("¡"), ord("¬") + 1))
+ list(range(ord("®"), ord("ÿ") + 1))
)
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
if six.PY2:
cs = [unichr(n) for n in cs]
else:
cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))
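# Example of the resulting mapping (a property of the table above, not extra code):
# printable bytes such as ord('!') = 33 ... ord('~') = 126 map to themselves, while the
# remaining bytes are remapped to code points 256 + n in order of discovery, e.g. the
# space byte 0x20 maps to chr(288) = 'Ġ', which is why GPT-2 BPE tokens for
# space-prefixed words start with 'Ġ'.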
def get_pairs(word):
"""Return set of symbol pairs in a word.
Word is represented as tuple of symbols (symbols being variable-length strings).
"""
pairs = set()
prev_char = word[0]
for char in word[1:]:
pairs.add((prev_char, char))
prev_char = char
return pairs
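# For example, get_pairs(('l', 'o', 'w', 'e', 'r')) returns
# {('l', 'o'), ('o', 'w'), ('w', 'e'), ('e', 'r')}.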
class Encoder(object):
def __init__(self, encoder, bpe_merges, errors='replace', special_tokens=["[SEP]", "[p]", "[q]", "[/q]"]):
self.encoder = encoder
self.decoder = {v:k for k,v in self.encoder.items()}
self.errors = errors # how to handle errors in decoding
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
self.cache = {}
self.re = re
self.special_tokens = special_tokens
        # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = self.re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
def bpe(self, token):
if token in self.cache:
return self.cache[token]
word = tuple(token)
pairs = get_pairs(word)
if not pairs:
return token
while True:
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
                except ValueError:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word)-1 and word[i+1] == second:
new_word.append(first+second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
word = ' '.join(word)
self.cache[token] = word
return word
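    # Illustrative merge step (ranks are hypothetical): if word = ('h', 'e', 'l', 'l', 'o')
    # and ('l', 'l') is the lowest-ranked pair present in bpe_ranks, one pass of the loop
    # above rewrites it to ('h', 'e', 'll', 'o'); merging repeats until no remaining pair
    # has a rank, and the space-joined result is cached per input token.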
def tokenize(self, text):
tokens = text.split(' ')
sub_tokens = []
for token_i, token in enumerate(tokens):
if self.is_special_token(token):
if token_i == 0:
sub_tokens.extend([token])
else:
sub_tokens.extend([" " + token])
else:
if token_i == 0:
sub_tokens.extend(self.re.findall(self.pat, token))
else:
sub_tokens.extend(self.re.findall(self.pat, " " + token))
return sub_tokens
def tokenize_old(self, text):
return self.re.findall(self.pat, text)
def is_special_token(self, tok):
if isinstance(tok, int):
return False
res = False
        for t in self.special_tokens:
            if tok.strip() == t:
                res = True
                break
return res
def tokenize_bpe(self, token):
if self.is_special_token(token):
return [token.strip()] # remove space for convert_to_ids
else:
if six.PY2:
token = ''.join(self.byte_encoder[ord(b)] for b in token.encode('utf-8'))
else:
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
return [self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' ')]
def encode(self, text):
bpe_tokens = []
for token in self.tokenize(text):
bpe_tokens.extend(self.tokenize_bpe(token))
return bpe_tokens
def decode(self, tokens):
pre_token_i = 0
texts = []
for token_i, token in enumerate(tokens):
if self.is_special_token(token):
                # process tokens before token_i
if token_i - pre_token_i > 0:
text = ''.join([self.decoder[int(tok)] for tok in tokens[pre_token_i:token_i]])
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
texts.append(text)
# texts.append(token)
if token_i == 0:
texts.append(token) # in the beginning, there is no space before special tokens
else:
texts.extend([" ", token]) # in middle sentence, there must be a space before special tokens
pre_token_i = token_i + 1
if pre_token_i < len(tokens):
text = ''.join([self.decoder[int(tok)] for tok in tokens[pre_token_i:]])
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
texts.append(text)
return ''.join(texts)
def get_encoder(encoder_json_path, vocab_bpe_path):
with open(encoder_json_path, 'r') as f:
encoder = json.load(f)
with open(vocab_bpe_path, 'r') as f:
bpe_data = f.read()
if six.PY2:
bpe_data = bpe_data.decode('utf8')
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
return Encoder(
encoder=encoder,
bpe_merges=bpe_merges,
)
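# Minimal usage sketch (the paths are assumptions, mirroring the defaults used by
# BPETokenizer further below):
#     enc = get_encoder("./configs/encoder.json", "./configs/vocab.bpe")
#     ids = enc.encode("ERNIE-Doc handles long documents")
#     text = enc.decode(ids)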
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import csv
import json
import logging
import random
import numpy as np
from collections import namedtuple
from reader import tokenization
from reader.batching import pad_batch_data, get_related_pos
class BaseReader(object):
def __init__(self,
trainer_id,
trainer_num,
vocab_path,
memory_len=128,
repeat_input=False,
train_all=False,
eval_all=False,
label_map_config=None,
max_seq_len=512,
do_lower_case=True,
in_tokens=False,
random_seed=None,
tokenizer="FullTokenizer",
is_zh=True):
self.is_zh = is_zh
self.trainer_id = trainer_id
self.trainer_num = trainer_num
self.max_seq_len = max_seq_len
self.memory_len = memory_len
if tokenizer == 'BPETokenizer':
self.tokenizer = getattr(tokenization, tokenizer)(vocab_file=vocab_path)
self.vocab = self.tokenizer.vocabulary.vocab_dict
else:
self.tokenizer = getattr(tokenization, tokenizer)(
vocab_file=vocab_path, do_lower_case=do_lower_case)
self.vocab = self.tokenizer.vocab
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.in_tokens = in_tokens
self.repeat_input = repeat_input
self.train_all = train_all
self.eval_all = eval_all
if random_seed is None:
random_seed = 12345
self.rng = random.Random(random_seed)
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
if label_map_config:
with open(label_map_config) as f:
self.label_map = json.load(f)
else:
self.label_map = None
def cnt_list(self, inp):
"""cnt_list"""
cnt = 0
for lit in inp:
if lit:
cnt += 1
return cnt
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_example, self.current_epoch
def get_num_examples(self, input_file):
"""get number of example"""
examples = self._read_tsv(os.path.join(input_file))
for example in examples:
self.num_examples += len(self._convert_to_instance(example, "train"))
return self.num_examples
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
headers = next(reader)
Example = namedtuple('Example', headers)
examples = []
for line in reader:
example = Example(*line)
examples.append(example)
return examples
def _prepare_batch_data(self, examples, batch_size, phase=None):
"""generate batch records"""
batch_records, max_len, gather_idx = [], 0, []
for index, example in enumerate(examples):
if phase == "train":
self.current_example += 1
if example.cal_loss == 1:
gather_idx.append(index % batch_size)
max_len = max(max_len, len(example.src_ids))
if self.in_tokens:
to_append = (len(batch_records) + 1) * max_len <= batch_size
else:
to_append = len(batch_records) < batch_size
if to_append:
batch_records.append(example)
else:
yield self._pad_batch_records(batch_records, gather_idx, phase)
batch_records, max_len = [example], len(example.src_ids)
gather_idx = [index % batch_size] if example.cal_loss == 1 else []
yield self._pad_batch_records(batch_records, gather_idx, phase)
# if self.trainer_num == 1 or (phase != "train" and batch_records):
# while len(batch_records) < batch_size:
# batch_records.append(batch_records[-1]._replace(cal_loss=0))
# yield self._pad_batch_records(batch_records, gather_idx, phase)
def _get_samples(self, pre_list, batch_size, is_last=False):
"""get samples"""
if is_last:
len_doc = [len(doc) for doc in pre_list]
max_len_idx = len_doc.index(max(len_doc))
dirty_sample = pre_list[max_len_idx][-1]._replace(cal_loss=0)
for sample_list in pre_list:
sample_list.extend([dirty_sample] * (max(len_doc) - len(sample_list)))
samples = []
min_len = min([len(doc) for doc in pre_list])
for cnt in range(min_len):
for idx in range(batch_size * self.trainer_num):
sample = pre_list[idx][cnt]
samples.append(sample)
for idx in range(len(pre_list)):
pre_list[idx] = pre_list[idx][min_len:]
return samples
def _convert_to_instance(self, example, phase, qid=None):
"convert example to instance"
doc_spans = []
_DocSpan = namedtuple("DocSpan", ["start", "length"])
start_offset = 0
max_tokens_for_doc = self.max_seq_len - 2
tokens_a = self.tokenizer.tokenize(example.text_a)
while start_offset < len(tokens_a):
length = len(tokens_a) - start_offset
if length > max_tokens_for_doc:
length = max_tokens_for_doc
doc_spans.append(_DocSpan(start=start_offset, length=length))
if start_offset + length == len(tokens_a):
break
start_offset += min(length, self.memory_len)
features = []
Feature = namedtuple("Feature", ["src_ids", "label_id", "qid", "cal_loss"])
for (doc_span_index, doc_span) in enumerate(doc_spans):
tokens = tokens_a[doc_span.start: doc_span.start + doc_span.length] + ["[SEP]"] + ["[CLS]"]
token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
if "qid" in example._fields:
qid = example.qid
if phase != "infer":
if self.label_map:
label_id = self.label_map[example.label]
else:
label_id = example.label
else:
label_id = None
features.append(Feature(
src_ids=token_ids,
label_id=label_id,
qid=qid,
cal_loss=1))
# repeat
if self.repeat_input:
features_repeat = features
if not self.train_all:
if phase == "train":
features = list(map(lambda x: x._replace(cal_loss=0), features))
if not self.eval_all:
if phase == "eval" or phase == "test":
features = list(map(lambda x: x._replace(cal_loss=0), features))
features = features + features_repeat
return features
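    # Worked example (illustrative only): with max_seq_len=6 (so max_tokens_for_doc=4)
    # and memory_len=2, a 10-token text_a produces doc spans (start, length) of
    # (0, 4), (2, 4), (4, 4) and (6, 4), since each window advances by
    # min(length, memory_len); every span then gets "[SEP]" and "[CLS]" appended before
    # conversion to ids. With repeat_input=True the span list is fed twice, and the first
    # (skimming) pass is marked cal_loss=0 unless train_all / eval_all keeps its loss.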
def _create_instances(self, examples, batch_size, phase):
"""generate batch records"""
pre_batch_list = []
insert_idx = []
for index, example in enumerate(examples):
features = self._convert_to_instance(example, phase, index)
if self.cnt_list(pre_batch_list) < batch_size * self.trainer_num:
if insert_idx:
pre_batch_list[insert_idx[0]] = features
insert_idx.pop(0)
else:
pre_batch_list.append(features)
if self.cnt_list(pre_batch_list) == batch_size * self.trainer_num:
assert self.cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal"
assert not insert_idx, "the insert_idx must be null"
sample_batch = self._get_samples(pre_batch_list, batch_size)
for idx, lit in enumerate(pre_batch_list):
if not lit:
insert_idx.append(idx)
for batch_records in self._prepare_batch_data(sample_batch, batch_size, phase):
yield batch_records
if phase != "train":
if self.cnt_list(pre_batch_list):
pre_batch_list += [[] for _ in range(batch_size * self.trainer_num - self.cnt_list(pre_batch_list))]
sample_batch = self._get_samples(pre_batch_list, batch_size, is_last=True)
for batch_records in self._prepare_batch_data(sample_batch, batch_size, phase):
yield batch_records
def data_generator(self,
input_file,
batch_size,
epoch,
shuffle=True,
phase=None):
"""date generator"""
assert phase in ["train", "eval", "test", "infer"], "phase should be one of the four choices"
examples = self._read_tsv(input_file)
def wrapper():
all_dev_batches = []
for epoch_index in range(epoch):
if phase == "train":
self.current_example = 0
self.current_epoch = epoch_index
self.random_seed = epoch_index
if shuffle:
self.global_rng = np.random.RandomState(self.random_seed)
self.global_rng.shuffle(examples)
for batch_data in self._create_instances(
examples, batch_size, phase=phase):
if len(all_dev_batches) < self.trainer_num:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == self.trainer_num:
yield all_dev_batches[self.trainer_id]
all_dev_batches = []
if phase != "train":
#while len(all_dev_batches) < self.trainer_num:
# all_dev_batches.append(all_dev_batches[-1])
if self.trainer_id < len(all_dev_batches):
yield all_dev_batches[self.trainer_id]
return wrapper
class ClassifyReader(BaseReader):
"""Reader for classifier task"""
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
headers = next(reader)
text_indices = [
index for index, h in enumerate(headers) if h != "label"
]
Example = namedtuple('Example', headers)
examples = []
for line in reader:
if len(line) > len(headers):
                    print('[warning]: text_a may contain a tab, which will be used as the split character')
if self.is_zh:
for index, text in enumerate(line):
if index in text_indices:
line[index] = text.replace(' ', '')
example = Example(*line)
examples.append(example)
return examples
def _pad_batch_records(self, batch_records, gather_idx=None, phase=None):
"""padding batch records"""
batch_token_ids = [record.src_ids for record in batch_records]
        if batch_records[0].label_id is not None:
batch_labels = [record.label_id for record in batch_records]
batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1])
else:
batch_labels = np.array([]).astype("int64").reshape([-1, 1])
        if batch_records[-1].qid is not None:
batch_qids = [record.qid for record in batch_records]
batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1])
else:
batch_qids = np.array([]).astype("int64").reshape([-1, 1])
if gather_idx:
batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1])
need_cal_loss = np.array([1]).astype("int64")
else:
batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1])
need_cal_loss = np.array([0]).astype("int64")
# padding
padded_token_ids, input_mask = pad_batch_data(
batch_token_ids, pad_idx=self.pad_id, pad_max_len=self.max_seq_len, \
final_cls=True, return_input_mask=True)
padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64")
padded_position_ids = get_related_pos(padded_token_ids, \
self.max_seq_len, self.memory_len)
return_list = [
padded_token_ids, padded_position_ids, padded_task_ids,
input_mask, batch_labels, batch_qids, batch_gather_idx, need_cal_loss
]
return return_list
class MRCReader(BaseReader):
"""Reader for MRC tasks"""
def __init__(self,
trainer_id,
trainer_num,
vocab_path,
memory_len=128,
repeat_input=False,
train_all=False,
eval_all=False,
label_map_config=None,
max_seq_len=512,
do_lower_case=True,
in_tokens=False,
random_seed=None,
tokenizer="FullTokenizer",
is_zh=True,
for_cn=True,
doc_stride=128,
max_query_length=64):
        super(MRCReader, self).__init__(trainer_id,
                                        trainer_num,
                                        vocab_path,
                                        memory_len=memory_len,
                                        repeat_input=repeat_input,
                                        train_all=train_all,
                                        eval_all=eval_all,
                                        label_map_config=label_map_config,
                                        max_seq_len=max_seq_len,
                                        do_lower_case=do_lower_case,
                                        in_tokens=in_tokens,
                                        random_seed=random_seed,
                                        tokenizer=tokenizer,
                                        is_zh=is_zh)
self.for_cn = for_cn
self.doc_stride = doc_stride
self.max_query_length = max_query_length
self.examples = {}
self.features = {}
def get_num_examples(self, phase, data_path):
"""get number of example"""
examples, features = self._pre_process_data(phase, data_path)
return len(sum(self.features_all, []))
def get_features(self, phase):
"""get features"""
return self.features[phase]
def get_examples(self, phase):
"""get examples"""
return self.examples[phase]
def _read_tsv(self, input_file, is_training):
"""read file"""
examples = []
with open(input_file, "r") as f:
input_data = json.load(f)["data"]
for entry in input_data:
for paragraph in entry["paragraphs"]:
paragraph_text = paragraph["context"]
for qa in paragraph["qas"]:
qas_id = qa["id"]
question_text = qa["question"]
start_pos = None
end_pos = None
orig_answer_text = None
if is_training:
if len(qa["answers"]) != 1:
raise ValueError(
"For training, each question should have exactly 1 answer."
)
answer = qa["answers"][0]
orig_answer_text = answer["text"]
answer_offset = answer["answer_start"]
answer_length = len(orig_answer_text)
doc_tokens = [paragraph_text[:answer_offset],
paragraph_text[answer_offset: answer_offset + answer_length],
paragraph_text[answer_offset + answer_length:]]
start_pos = 1
end_pos = 1
actual_text = " ".join(doc_tokens[start_pos:(end_pos + 1)])
if actual_text.find(orig_answer_text) == -1:
logging.info("Could not find answer: '%s' vs. '%s'",
actual_text, orig_answer_text)
continue
else:
doc_tokens = tokenization.tokenize_chinese_chars(paragraph_text)
Example = namedtuple('Example',
['qas_id', 'question_text', 'doc_tokens', 'orig_answer_text',
'start_position', 'end_position'])
example = Example(
qas_id=qas_id,
question_text=question_text,
doc_tokens=doc_tokens,
orig_answer_text=orig_answer_text,
start_position=start_pos,
end_position=end_pos)
examples.append(example)
return examples
def _improve_answer_span(self, doc_tokens, input_start, input_end, tokenizer, orig_answer_text):
"""improve answer span"""
tok_answer_text = " ".join(self.tokenizer.tokenize(orig_answer_text))
for new_start in range(input_start, input_end + 1):
for new_end in range(input_end, new_start - 1, -1):
text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
if text_span == tok_answer_text:
return (new_start, new_end)
return (input_start, input_end)
def _check_is_max_context(self, doc_spans, cur_span_index, position):
"""chech is max context"""
best_score = None
best_span_index = None
for (span_index, doc_span) in enumerate(doc_spans):
end = doc_span.start + doc_span.length - 1
if position < doc_span.start:
continue
if position > end:
continue
num_left_context = position - doc_span.start
num_right_context = end - position
score = min(num_left_context,
num_right_context) + 0.01 * doc_span.length
if best_score is None or score > best_score:
best_score = score
best_span_index = span_index
return cur_span_index == best_span_index
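    # Worked example (illustrative only): a token at absolute position 5 inside a span
    # covering positions 2..7 (start=2, length=6) has 3 tokens of left context and
    # 2 of right context, so its score is min(3, 2) + 0.01 * 6 = 2.06; among all spans
    # containing the token, the one with the highest score is treated as its
    # maximum-context span.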
def _convert_example_to_feature(self, examples, max_seq_length, tokenizer, phase):
"""convert example to feature"""
Feature = namedtuple("Feature", ["qid", "example_index", "doc_span_index",
"tokens", "token_to_orig_map", "token_is_max_context",
"src_ids", "start_position", "end_position", "cal_loss"])
features = []
self.features_all = []
unique_id = 1000
is_training = phase == "train"
for (example_index, example) in enumerate(examples):
query_tokens = self.tokenizer.tokenize(example.question_text)
if len(query_tokens) > self.max_query_length:
query_tokens = query_tokens[0:self.max_query_length]
tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens):
orig_to_tok_index.append(len(all_doc_tokens))
sub_tokens = self.tokenizer.tokenize(token)
for sub_token in sub_tokens:
tok_to_orig_index.append(i)
all_doc_tokens.append(sub_token)
tok_start_position = None
tok_end_position = None
if is_training:
tok_start_position = orig_to_tok_index[example.start_position]
if example.end_position < len(example.doc_tokens) - 1:
tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
else:
tok_end_position = len(all_doc_tokens) - 1
(tok_start_position, tok_end_position) = self._improve_answer_span(
all_doc_tokens, tok_start_position, tok_end_position,
tokenizer, example.orig_answer_text)
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
_DocSpan = namedtuple("DocSpan", ["start", "length"])
doc_spans = []
start_offset = 0
while start_offset < len(all_doc_tokens):
length = len(all_doc_tokens) - start_offset
if length > max_tokens_for_doc:
length = max_tokens_for_doc
doc_spans.append(_DocSpan(start=start_offset, length=length))
if start_offset + length == len(all_doc_tokens):
break
start_offset += min(length, self.doc_stride)
features_each = []
for (doc_span_index, doc_span) in enumerate(doc_spans):
tokens = []
token_to_orig_map = {}
token_is_max_context = {}
tokens.append("[CLS]")
for i in range(doc_span.length):
split_token_index = doc_span.start + i
token_to_orig_map[len(tokens)] = tok_to_orig_index[
split_token_index]
is_max_context = self._check_is_max_context(
doc_spans, doc_span_index, split_token_index)
token_is_max_context[len(tokens)] = is_max_context
tokens.append(all_doc_tokens[split_token_index])
tokens.append("[SEP]")
for token in query_tokens:
tokens.append(token)
tokens.append("[SEP]")
token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
start_position = None
end_position = None
if is_training:
doc_start = doc_span.start
doc_end = doc_span.start + doc_span.length - 1
out_of_span = False
if not (tok_start_position >= doc_start and
tok_end_position <= doc_end):
out_of_span = True
if out_of_span:
start_position = 0
end_position = 0
else:
doc_offset = 1 #len(query_tokens) + 2
start_position = tok_start_position - doc_start + doc_offset
end_position = tok_end_position - doc_start + doc_offset
feature = Feature(
qid=unique_id,
example_index=example_index,
doc_span_index=doc_span_index,
tokens=tokens,
token_to_orig_map=token_to_orig_map,
token_is_max_context=token_is_max_context,
src_ids=token_ids,
start_position=start_position,
end_position=end_position,
cal_loss=1)
features.append(feature)
features_each.append(feature)
unique_id += 1
#repeat
if self.repeat_input:
features_each_repeat = features_each
if not self.train_all:
if phase == "train":
                        features_each = list(map(lambda x: x._replace(cal_loss=0), features_each))
if not self.eval_all:
if phase == "eval" or phase == "test":
                        features_each = list(map(lambda x: x._replace(cal_loss=0), features_each))
features_each += features_each_repeat
self.features_all.append(features_each)
return features
def _create_instances(self, records, batch_size, phase=None):
"""generate batch records"""
pre_batch_list = []
insert_idx = []
for idx, record in enumerate(records):
if self.cnt_list(pre_batch_list) < batch_size * self.trainer_num:
if insert_idx:
pre_batch_list[insert_idx[0]] = record
insert_idx.pop(0)
else:
pre_batch_list.append(record)
if self.cnt_list(pre_batch_list) == batch_size * self.trainer_num:
assert self.cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal"
assert not insert_idx, "the insert_idx must be null"
samples_batch = self._get_samples(pre_batch_list, batch_size)
for idx, lit in enumerate(pre_batch_list):
if not lit:
insert_idx.append(idx)
for batch_records in self._prepare_batch_data(samples_batch, batch_size, phase):
yield batch_records
if phase != "train":
if self.cnt_list(pre_batch_list):
pre_batch_list += [[] for _ in range(batch_size * self.trainer_num - self.cnt_list(pre_batch_list))]
samples_batch = self._get_samples(pre_batch_list, batch_size, is_last=True)
for batch_records in self._prepare_batch_data(samples_batch, batch_size, phase):
yield batch_records
def _pad_batch_records(self, batch_records, gather_idx=None, phase=None):
"""pad batch data"""
batch_token_ids = [record.src_ids for record in batch_records]
if phase == "train":
batch_start_position = [record.start_position for record in batch_records]
batch_end_position = [record.end_position for record in batch_records]
batch_start_position = np.array(batch_start_position).astype("int64").reshape([-1, 1])
batch_end_position = np.array(batch_end_position).astype("int64").reshape([-1, 1])
else:
batch_size = len(batch_token_ids)
batch_start_position = np.zeros(shape=[batch_size, 1], dtype="int64")
batch_end_position = np.zeros(shape=[batch_size, 1], dtype="int64")
batch_qids = [record.qid for record in batch_records]
batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1])
if gather_idx:
batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1])
need_cal_loss = np.array([1]).astype("int64")
else:
batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1])
need_cal_loss = np.array([0]).astype("int64")
# padding
padded_token_ids, input_mask = pad_batch_data(
batch_token_ids, pad_idx=self.pad_id, pad_max_len=self.max_seq_len, return_input_mask=True)
padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64")
padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_len, self.memory_len)
return_list = [
padded_token_ids, padded_position_ids, padded_task_ids, input_mask,
batch_start_position, batch_end_position, batch_qids, batch_gather_idx, need_cal_loss
]
return return_list
def _pre_process_data(self, phase, data_path):
"""preprocess data"""
        assert os.path.exists(data_path), "%s does not exist!" % data_path
examples = self._read_tsv(data_path, phase == "train")
features = self._convert_example_to_feature(examples, self.max_seq_len,
self.tokenizer, phase)
self.examples[phase] = examples
self.features[phase] = features
return examples, features
def data_generator(self,
input_file,
batch_size,
epoch,
shuffle=True,
phase=None):
"""data generate"""
assert phase in ["train", "eval", "test", "infer"], "phase should be one of the four choices"
examples = self.examples.get(phase, None)
features = self.features.get(phase, None)
if not examples:
examples, features = self._pre_process_data(phase, input_file)
def wrapper():
"""wrapper"""
all_dev_batches = []
for epoch_index in range(epoch):
if phase == "train":
self.current_example = 0
self.current_epoch = epoch_index
self.random_seed = epoch_index
if shuffle:
self.global_rng = np.random.RandomState(self.random_seed)
self.global_rng.shuffle(self.features_all)
for batch_data in self._create_instances(
self.features_all, batch_size, phase=phase):
if len(all_dev_batches) < self.trainer_num:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == self.trainer_num:
yield all_dev_batches[self.trainer_id]
all_dev_batches = []
if phase != "train":
if self.trainer_id < len(all_dev_batches):
yield all_dev_batches[self.trainer_id]
return wrapper
if __name__ == '__main__':
pass
# Copyright 2021 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import unicodedata
import six
import sentencepiece as sp
from reader import gpt2_bpe_utils
from reader.vocabulary import Vocabulary
from nltk import tokenize
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids_include_unk(vocab, tokens, unk_token="[UNK]"):
output = []
for token in tokens:
if token in vocab:
output.append(vocab[token])
else:
output.append(vocab[unk_token])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a peice of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class CharTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in text.lower().split(" "):
for sub_token in self.tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
        # like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class SentencepieceTokenizer(object):
"""Runs SentencePiece tokenziation."""
def __init__(self, vocab_file, do_lower_case=True, unk_token="[UNK]"):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.do_lower_case = do_lower_case
self.tokenizer = sp.SentencePieceProcessor()
self.tokenizer.Load(vocab_file + ".model")
self.sp_unk_token = "<unk>"
self.unk_token = unk_token
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
Returns:
A list of wordpiece tokens.
"""
text = text.lower() if self.do_lower_case else text
text = convert_to_unicode(text.replace("\1", " "))
tokens = self.tokenizer.EncodeAsPieces(text)
output_tokens = []
for token in tokens:
if token == self.sp_unk_token:
token = self.unk_token
if token in self.vocab:
output_tokens.append(token)
else:
output_tokens.append(self.unk_token)
return output_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class WordsegTokenizer(object):
"""Runs Wordseg tokenziation."""
def __init__(self, vocab_file, do_lower_case=True, unk_token="[UNK]",
split_token="\1"):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.tokenizer = sp.SentencePieceProcessor()
self.tokenizer.Load(vocab_file + ".model")
self.do_lower_case = do_lower_case
self.unk_token = unk_token
self.split_token = split_token
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
Returns:
A list of wordpiece tokens.
"""
text = text.lower() if self.do_lower_case else text
text = convert_to_unicode(text)
output_tokens = []
for token in text.split(self.split_token):
if token in self.vocab:
output_tokens.append(token)
else:
sp_tokens = self.tokenizer.EncodeAsPieces(token)
for sp_token in sp_tokens:
if sp_token in self.vocab:
output_tokens.append(sp_token)
return output_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
                already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
class BPETokenizer(object):
""" Runs bpe tokenize """
def __init__(self, vocab_file, encoder_json_path="./configs/encoder.json", vocab_bpe_path="./configs/vocab.bpe", params=None):
self.vocabulary = Vocabulary(vocab_file, unk_token="[UNK]")
self.encoder = gpt2_bpe_utils.get_encoder(encoder_json_path, vocab_bpe_path)
def tokenize(self, text, is_sentencepiece=True):
text = convert_to_unicode(text)
text = " ".join(text.split()) # remove duplicate whitespace
if is_sentencepiece:
sents = tokenize.sent_tokenize(text)
bpe_ids = sum([self.encoder.encode(sent) for sent in sents], [])
else:
bpe_ids = self.encoder.encode(text)
tokens = [str(bpe_id) for bpe_id in bpe_ids]
return tokens
def convert_tokens_to_ids(self, tokens):
"""
:param tokens:
:return:
"""
return self.vocabulary.convert_tokens_to_ids(tokens)
def convert_ids_to_tokens(self, ids):
"""
:param ids:
:return:
"""
return self.vocabulary.convert_ids_to_tokens(ids)
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
def tokenize_chinese_chars(text):
"""Adds whitespace around any CJK character."""
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
        # like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
output = []
buff = ""
for char in text:
cp = ord(char)
if _is_chinese_char(cp):
if buff != "":
output.append(buff)
buff = ""
output.append(char)
else:
buff += char
if buff != "":
output.append(buff)
return output
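# For example, tokenize_chinese_chars("我爱NLP") returns ["我", "爱", "NLP"]: each CJK
# character becomes its own item while the consecutive non-CJK characters stay grouped.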
# -*- coding: utf-8 -*-
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
:py:class:`Vocabulary`
"""
import collections
import unicodedata
import six
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
class Vocabulary(object):
"""Vocabulary"""
def __init__(self, vocab_path, unk_token):
"""
:param vocab_path:
:param unk_token:
"""
if not vocab_path:
raise ValueError("vocab_path can't be None")
self.vocab_path = vocab_path
self.unk_token = unk_token
self.vocab_dict, self.id_dict = self.load_vocab()
self.vocab_size = len(self.id_dict)
def load_vocab(self):
"""
:return:
"""
vocab_dict = collections.OrderedDict()
id_dict = collections.OrderedDict()
file_vocab = open(self.vocab_path)
for num, line in enumerate(file_vocab):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
if len(items) == 2:
index = items[1]
else:
index = num
token = token.strip()
vocab_dict[token] = int(index)
id_dict[index] = token
return vocab_dict, id_dict
def add_reserve_id(self):
"""
:return:
"""
pass
def convert_tokens_to_ids(self, tokens):
"""
:param tokens:
:return:
"""
UNK = self.vocab_dict[self.unk_token]
output = [self.vocab_dict.get(item, UNK) for item in tokens]
# for item in tokens:
# output.append(self.vocab_dict.get(item, UNK))
return output
def convert_ids_to_tokens(self, ids):
"""
:param ids:
:return:
"""
output = []
for item in ids:
output.append(self.id_dict.get(item, self.unk_token))
return output
def get_vocab_size(self):
"""
:return:
"""
return len(self.id_dict)
def covert_id_to_token(self, id):
"""
:param id:
:return: token
"""
return self.id_dict.get(id, self.unk_token)
def covert_token_to_id(self, token):
"""
:param token:
:return: id
"""
UNK = self.vocab_dict[self.unk_token]
return self.vocab_dict.get(token, UNK)
nltk==3.6.2
regex==2021.4.4
six==1.15.0
tqdm==4.54.1
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy
import reader.task_reader as task_reader
from model.optimization import optimization
from model.static.ernie import ErnieConfig, ErnieDocModel
from finetune.classifier import create_model, evaluate
from utils.args import print_arguments
from utils.init import init_model
from utils.finetune_args import parser
paddle.enable_static()
args = parser.parse_args()
def main(args):
"""main function"""
print("""finetuning start""")
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
memory_len = ernie_config["memory_len"]
d_dim = ernie_config["hidden_size"]
n_layers = ernie_config["num_hidden_layers"]
print("args.is_distributed:", args.is_distributed)
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 4 if args.use_amp else 2
exec_strategy.num_iteration_per_drop_scope = min(1, args.skip_steps)
if args.is_distributed:
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
fleet.init(role)
trainer_id = fleet.worker_index()
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = fleet.worker_endpoints()
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} trainer_id:{}"
.format(worker_endpoints, trainers_num, current_endpoint, trainer_id))
dist_strategy = DistributedStrategy()
dist_strategy.exec_strategy = exec_strategy
dist_strategy.remove_unnecessary_lock = False
dist_strategy.fuse_all_reduce_ops = False
dist_strategy.nccl_comm_num = 1
if args.use_amp:
dist_strategy.use_amp = True
dist_strategy.amp_loss_scaling = args.init_loss_scaling
if args.use_recompute:
dist_strategy.forward_recompute = True
dist_strategy.enable_sequential_execution=True
else:
dist_strategy=None
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed:
gpus = os.getenv("FLAGS_selected_gpus").split(",")
gpu_id = int(gpus[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = fleet.worker_num() if args.is_distributed else gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
trainer_id = fleet.worker_index()
print("Device count %d, trainer_id:%d" % (dev_count, trainer_id))
print('args.vocab_path', args.vocab_path)
reader = task_reader.ClassifyReader(
trainer_id=fleet.worker_index(),
trainer_num=dev_count,
vocab_path=args.vocab_path,
memory_len=memory_len,
repeat_input=args.repeat_input,
train_all=args.train_all,
eval_all=args.eval_all,
label_map_config=args.label_map_config,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=args.in_tokens,
random_seed=args.random_seed,
tokenizer=args.tokenizer,
is_zh=args.is_zh)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples(args.train_set)
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
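# Note: with --in_tokens, batch_size counts tokens rather than examples, so the number of
# examples consumed per step is roughly batch_size // max_seq_len; both branches then divide
# by dev_count because each device trains on its own shard of the data.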
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d, trainer_id: %d" % (dev_count, trainer_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars, checkpoints, train_mems_vars = create_model(
args,
ernie_config=ernie_config,
mem_len=memory_len)
if args.use_recompute:
dist_strategy.recompute_checkpoints = checkpoints
scheduled_lr, loss_scaling = optimization(
loss=graph_vars['loss'],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_amp=args.use_amp,
init_loss_scaling=args.init_loss_scaling,
layer_decay_rate=args.layer_decay_ratio,
n_layers=n_layers,
dist_strategy=dist_strategy)
origin_train_program = train_program
if args.is_distributed:
train_program = fleet.main_program
origin_train_program = fleet._origin_program
# add data pe
# exec_strategy = fluid.ExecutionStrategy()
# exec_strategy.num_threads = 1
# exec_strategy.num_iteration_per_drop_scope = 10000
# build_strategy = fluid.BuildStrategy()
# train_program_dp = fluid.CompiledProgram(train_program).with_data_parallel(
# loss_name=graph_vars["loss"].name,
# exec_strategy=exec_strategy,
# build_strategy=build_strategy)
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_graph_vars, _, test_mems_vars = create_model(
args,
ernie_config=ernie_config,
mem_len=memory_len)
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, startup_prog)
train_exe = exe if args.do_train else None
steps = 0
def init_memory():
return [np.zeros([args.batch_size, memory_len, d_dim], dtype="float32")
for _ in range(n_layers)]
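# init_memory() builds the initial recurrence memory: one zero tensor of shape
# [batch_size, memory_len, hidden_size] per transformer layer. It is threaded through
# training steps via tower_mems_np and re-initialized for every evaluation pass so that
# no document state leaks between phases.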
if args.do_train:
train_pyreader.set_batch_generator(train_data_generator)
train_pyreader.start()
time_begin = time.time()
tower_mems_np = init_memory()
while True:
try:
steps += 1
if steps % args.skip_steps == 0:
outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np,
"train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
tower_mems_np = outputs['tower_mems_np']
time_end = time.time()
used_time = time_end - time_begin
current_example, current_epoch = reader.get_train_progress()
print("train pyreader queue size: %d, " % train_pyreader.queue.size())
print("epoch: %d, worker_index: %d, progress: %d/%d, step: %d, ave loss: %f, "
"ave acc: %f, time cost: %f, speed: %f steps/s, learning_rate: %f" %
(current_epoch, trainer_id, current_example, num_train_examples,
steps, outputs["loss"], outputs["accuracy"],
used_time, args.skip_steps / used_time, outputs["learning_rate"]))
time_begin = time.time()
else:
if args.use_vars:
# train_exe.run(fetch_list=[])
train_exe.run(program=train_program, use_program_cache=True)
else:
outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np,
"train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
tower_mems_np = outputs['tower_mems_np']
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="eval"))
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(),
"eval", steps, trainer_id, dev_count, use_vars=args.use_vars)
# evaluate test set
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="test"))
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(),
"test", steps, trainer_id, dev_count, use_vars=args.use_vars)
# save model
if trainer_id == 0 and steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, origin_train_program)
except fluid.core.EOFException:
if trainer_id == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, origin_train_program)
train_pyreader.reset()
break
# final eval on dev set
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="eval"))
print("Final validation result:")
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(),
"eval", steps, trainer_id, dev_count, use_vars=args.use_vars)
# final eval on test set
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="test"))
print("Final test result:")
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(),
"test", steps, trainer_id, dev_count, use_vars=args.use_vars)
if __name__ == '__main__':
print_arguments(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on mrc tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy
import reader.task_reader as task_reader
from model.optimization import optimization
from model.static.ernie import ErnieConfig, ErnieDocModel
from finetune.mrc import create_model, evaluate
from utils.args import print_arguments
from utils.init import init_model
from utils.finetune_args import parser
paddle.enable_static()
args = parser.parse_args()
def main(args):
"""main function"""
print("""finetuning start""")
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
memory_len = ernie_config["memory_len"]
d_dim = ernie_config["hidden_size"]
n_layers = ernie_config["num_hidden_layers"]
print("args.is_distributed:", args.is_distributed)
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 4 if args.use_amp else 2
exec_strategy.num_iteration_per_drop_scope = min(1, args.skip_steps)
if args.is_distributed:
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
fleet.init(role)
trainer_id = fleet.worker_index()
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = fleet.worker_endpoints()
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} trainer_id:{}"
.format(worker_endpoints, trainers_num, current_endpoint, trainer_id))
dist_strategy = DistributedStrategy()
dist_strategy.exec_strategy = exec_strategy
dist_strategy.remove_unnecessary_lock = False
dist_strategy.fuse_all_reduce_ops = False
dist_strategy.nccl_comm_num = 1
if args.use_amp:
dist_strategy.use_amp = True
dist_strategy.amp_loss_scaling = args.init_loss_scaling
if args.use_recompute:
dist_strategy.forward_recompute = True
dist_strategy.enable_sequential_execution=True
else:
dist_strategy=None
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed:
gpus = os.getenv("FLAGS_selected_gpus").split(",")
gpu_id = int(gpus[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = fleet.worker_num() if args.is_distributed else gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
trainer_id = fleet.worker_index()
print("Device count %d, trainer_id:%d" % (dev_count, trainer_id))
print('args.vocab_path', args.vocab_path)
reader = task_reader.MRCReader(
trainer_id=fleet.worker_index(),
trainer_num=dev_count,
vocab_path=args.vocab_path,
memory_len=memory_len,
repeat_input=args.repeat_input,
train_all=args.train_all,
eval_all=args.eval_all,
label_map_config=args.label_map_config,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=args.in_tokens,
random_seed=args.random_seed,
tokenizer=args.tokenizer,
is_zh=args.is_zh,
for_cn=args.for_cn,
doc_stride=args.doc_stride,
max_query_length=args.max_query_length)
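# For MRC, long passages are split into overlapping chunks in the usual BERT-style
# sliding-window fashion: doc_stride is the token offset between consecutive chunks and
# max_query_length truncates the question before it is paired with each chunk.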
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples("train", args.train_set)
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d, trainer_id: %d" % (dev_count, trainer_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars, checkpoints, train_mems_vars = create_model(
args,
ernie_config=ernie_config,
mem_len=memory_len)
if args.use_recompute:
dist_strategy.recompute_checkpoints = checkpoints
scheduled_lr, loss_scaling = optimization(
loss=graph_vars['loss'],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_amp=args.use_amp,
init_loss_scaling=args.init_loss_scaling,
layer_decay_rate=args.layer_decay_ratio,
n_layers=n_layers,
dist_strategy=dist_strategy)
origin_train_program = train_program
if args.is_distributed:
train_program = fleet.main_program
origin_train_program = fleet._origin_program
# add data pe
# exec_strategy = fluid.ExecutionStrategy()
# exec_strategy.num_threads = 1
# exec_strategy.num_iteration_per_drop_scope = 10000
# build_strategy = fluid.BuildStrategy()
# train_program_dp = fluid.CompiledProgram(train_program).with_data_parallel(
# loss_name=graph_vars["loss"].name,
# exec_strategy=exec_strategy,
# build_strategy=build_strategy)
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_graph_vars, _, test_mems_vars = create_model(
args,
ernie_config=ernie_config,
mem_len=memory_len)
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, startup_prog)
train_exe = exe if args.do_train else None
steps = 0
def init_memory():
return [np.zeros([args.batch_size, memory_len, d_dim], dtype="float32")
for _ in range(n_layers)]
if args.do_train:
train_pyreader.set_batch_generator(train_data_generator)
train_pyreader.start()
time_begin = time.time()
tower_mems_np = init_memory()
while True:
try:
steps += 1
if steps % args.skip_steps == 0:
outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np, "train", steps, trainer_id,
dev_count, scheduled_lr, use_vars=args.use_vars)
tower_mems_np = outputs['tower_mems_np']
time_end = time.time()
used_time = time_end - time_begin
current_example, current_epoch = reader.get_train_progress()
print("train pyreader queue size: %d, " % train_pyreader.queue.size())
print("epoch: %d, worker_index: %d, progress: %d/%d, step: %d, ave loss: %f, "
"time cost: %f, speed: %f steps/s, learning_rate: %f" %
(current_epoch, trainer_id, current_example, num_train_examples,
steps, outputs["loss"], used_time, args.skip_steps / used_time,
outputs["learning_rate"]))
time_begin = time.time()
else:
if args.use_vars:
# train_exe.run(fetch_list=[])
train_exe.run(program=train_program, use_program_cache=True)
else:
outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np, "train", steps, trainer_id,
dev_count, scheduled_lr, use_vars=args.use_vars)
tower_mems_np = outputs['tower_mems_np']
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="eval"))
num_dev_examples = reader.get_num_examples("eval", args.dev_set)
print("the example number of dev file is {}".format(num_dev_examples))
evaluate(exe, test_prog, test_pyreader, test_graph_vars, test_mems_vars,
init_memory(), "eval", steps, trainer_id, dev_count,
use_vars=args.use_vars, examples=reader.get_examples("eval"),
features=reader.get_features("eval"), args=args)
# evaluate test set
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="test"))
num_test_examples = reader.get_num_examples("test", args.test_set)
print("the example number of test file is {}".format(num_test_examples))
evaluate(exe, test_prog, test_pyreader, test_graph_vars, test_mems_vars,
init_memory(), "test", steps, trainer_id, dev_count,
use_vars=args.use_vars, examples=reader.get_examples("test"),
features=reader.get_features("test"), args=args)
# save model
if trainer_id == 0 and steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, origin_train_program)
except fluid.core.EOFException:
if trainer_id == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, origin_train_program)
train_pyreader.reset()
break
# final eval on dev set
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="eval"))
print("Final validation result:")
num_dev_examples = reader.get_num_examples("eval", args.dev_set)
print("the example number of dev file is {}".format(num_dev_examples))
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(), "eval", steps,
trainer_id, dev_count, use_vars=args.use_vars,
examples=reader.get_examples("eval"),
features=reader.get_features("eval"), args=args)
# final eval on test set
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False,
phase="test"))
print("Final test result:")
num_test_examples = reader.get_num_examples("test", args.test_set)
print("the example number of test file is {}".format(num_test_examples))
evaluate(exe, test_prog, test_pyreader, test_graph_vars,
test_mems_vars, init_memory(), "test", steps,
trainer_id, dev_count, use_vars=args.use_vars,
examples=reader.get_examples("test"),
features=reader.get_features("test"), args=args)
if __name__ == '__main__':
print_arguments(args)
main(args)
#!/bin/bash
set -eux
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fraction_of_gpu_memory_to_use=0.98
export FLAGS_sync_nccl_allreduce=1
# task
finetuning_task="dureader"
task_data_path="./data/finetune/task_data/${finetuning_task}/"
# model setup
is_zh="True"
use_vars="False"
use_amp="False"
train_all="Fasle"
eval_all="False"
use_vars="False"
use_recompute="False"
rel_pos_params_sharing="False"
lr_scheduler="linear_warmup_decay"
vocab_path="./configs/base/zh/vocab.txt"
config_path="./configs/base/zh/ernie_config.json"
init_model_checkpoint=""
init_model_pretraining=""
# args setup
epoch=5
warmup=0.1
max_len=512
lr_rate=2.75e-4
batch_size=16
weight_decay=0.01
save_steps=10000
validation_steps=100
layer_decay_ratio=0.8
init_loss_scaling=32768
PADDLE_TRAINERS=`hostname -i`
PADDLE_TRAINER_ID="0"
POD_IP=`hostname -i`
selected_gpus="0,1,2,3"
if [[ $finetuning_task == "cmrc2018" ]];then
do_test=false
elif [[ $finetuning_task == "drcd" ]];then
do_test=true
elif [[ $finetuning_task == "dureader" ]];then
do_test=false
fi
mkdir -p log
distributed_args="--node_ips ${PADDLE_TRAINERS} --node_id ${PADDLE_TRAINER_ID} --current_node_ip ${POD_IP} --nproc_per_node 4 --selected_gpus ${selected_gpus}"
python -u ./lanch.py ${distributed_args} \
run_mrc.py --use_cuda true\
--is_distributed true \
--batch_size ${batch_size} \
--in_tokens false\
--use_fast_executor ${e_executor:-"true"} \
--checkpoints ./output \
--vocab_path ${vocab_path} \
--do_train true \
--do_val true \
--do_test ${do_test} \
--save_steps ${save_steps:-"10000"} \
--validation_steps ${validation_steps:-"100"} \
--warmup_proportion ${warmup} \
--weight_decay ${weight_decay} \
--epoch ${epoch} \
--max_seq_len ${max_len} \
--ernie_config_path ${config_path} \
--do_lower_case true \
--doc_stride 128 \
--train_set ${task_data_path}/train.json \
--dev_set ${task_data_path}/dev.json \
--test_set ${task_data_path}/test.json \
--learning_rate ${lr_rate} \
--num_iteration_per_drop_scope 1 \
--lr_scheduler linear_warmup_decay \
--layer_decay_ratio ${layer_decay_ratio:-"0.8"} \
--is_zh ${is_zh:-"True"} \
--repeat_input ${repeat_input:-"False"} \
--train_all ${train_all:-"False"} \
--eval_all ${eval_all:-"False"} \
--use_vars ${use_vars:-"False"} \
--init_checkpoint ${init_model_checkpoint:-""} \
--init_pretraining_params ${init_model_pretraining:-""} \
--init_loss_scaling ${init_loss_scaling:-32768} \
--use_recompute ${use_recompute:-"False"} \
--skip_steps 10 1>log/0_${finetuning_task}_job.log 2>&1
set -eux
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fraction_of_gpu_memory_to_use=0.98
export FLAGS_sync_nccl_allreduce=1
# task
task_data_path="./data/finetune/task_data/"
task_name="iflytek"
# model setup
is_zh="True"
repeat_input="False"
train_all="Fasle"
eval_all="False"
use_vars="False"
use_amp="False"
use_recompute="False"
rel_pos_params_sharing="False"
lr_scheduler="linear_warmup_decay"
vocab_path="./configs/base/zh/vocab.txt"
config_path="./configs/base/zh/ernie_config.json"
init_model_checkpoint=""
init_model_pretraining=""
# args setup
epoch=5
warmup=0.1
max_len=512
lr_rate=1.5e-4
batch_size=16
weight_decay=0.01
num_labels=119
save_steps=10000
validation_steps=100
layer_decay_ratio=0.8
init_loss_scaling=32768
PADDLE_TRAINERS=`hostname -i`
PADDLE_TRAINER_ID="0"
POD_IP=`hostname -i`
selected_gpus="0"
mkdir -p log
distributed_args="--node_ips ${PADDLE_TRAINERS} --node_id ${PADDLE_TRAINER_ID} --current_node_ip ${POD_IP} --nproc_per_node 1 --selected_gpus ${selected_gpus}"
python -u ./lanch.py ${distributed_args} \
./run_classifier.py --use_cuda true \
--is_distributed true \
--use_fast_executor ${e_executor:-"true"} \
--tokenizer ${TOKENIZER:-"FullTokenizer"} \
--use_amp ${use_amp:-"false"} \
--do_train true \
--do_val true \
--do_test false \
--batch_size ${batch_size} \
--init_checkpoint ${init_model_checkpoint:-""} \
--init_pretraining_params ${init_model_pretraining:-""} \
--label_map_config "" \
--train_set ${task_data_path}/${task_name}/train/1 \
--dev_set ${task_data_path}/${task_name}/dev/1 \
--test_set ${task_data_path}/${task_name}/test/1 \
--vocab_path ${vocab_path} \
--checkpoints ./output \
--save_steps ${save_steps} \
--weight_decay ${weight_decay} \
--warmup_proportion ${warmup} \
--validation_steps ${validation_steps} \
--epoch ${epoch} \
--max_seq_len ${max_len} \
--ernie_config_path ${config_path} \
--learning_rate ${lr_rate} \
--skip_steps 10 \
--num_iteration_per_drop_scope 10 \
--layer_decay_ratio ${layer_decay_ratio:-"0.8"} \
--num_labels ${num_labels} \
--is_zh ${is_zh:-"True"} \
--repeat_input ${repeat_input:-"False"} \
--train_all ${train_all:-"False"} \
--eval_all ${eval_all:-"False"} \
--use_vars ${use_vars:-"False"} \
--init_loss_scaling ${init_loss_scaling:-32768} \
--use_recompute ${use_recompute:-"False"} \
--random_seed 1
set -eux
export BASE_PATH="$PWD"
export PATH="${BASE_PATH}/py37/bin/:$PATH"
export PYTHONPATH="${BASE_PATH}/py37/"
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fraction_of_gpu_memory_to_use=0.98
export FLAGS_sync_nccl_allreduce=1
export TRAINER_PORTS_NUM='8'
export PADDLE_CURRENT_ENDPOINT=`hostname -i`
export PADDLE_TRAINERS_NUM='1'
export POD_IP=`hostname -i`
export PADDLE_TRAINER_COUNT='8'
# task
task_data_path="./data/imdb/"
task_name="imdb"
# model setup
is_zh="False"
use_vars="False"
use_amp="False"
use_recompute="False"
rel_pos_params_sharing="False"
lr_scheduler="linear_warmup_decay"
vocab_path="./configs/base/en/vocab.txt"
config_path="./configs/base/en/ernie_config.json"
init_model_checkpoint=""
init_model_pretraining=""
# args setup
max_len=512
lr_rate=7e-5
batch_size=16
weight_decay=0.01
warmup=0.1
epoch=3
num_labels=2
save_steps=100000
validation_steps=700
layer_decay_ratio=1
init_loss_scaling=32768
PADDLE_TRAINERS=`hostname -i`
PADDLE_TRAINER_ID="0"
POD_IP=`hostname -i`
selected_gpus="0,1"
distributed_args="--node_ips ${PADDLE_TRAINERS} --node_id ${PADDLE_TRAINER_ID} --current_node_ip ${POD_IP} --nproc_per_node 2 --selected_gpus ${selected_gpus}"
python -u ./lanch.py ${distributed_args} \
./run_classifier.py --use_cuda true \
--is_distributed true \
--use_fast_executor ${e_executor:-"true"} \
--tokenizer ${TOKENIZER:-"BPETokenizer"} \
--use_amp ${use_amp:-"false"} \
--do_train true \
--do_val true \
--do_test false \
--batch_size ${batch_size} \
--init_checkpoint ${init_model_checkpoint:-""} \
--init_pretraining_params ${init_model_pretraining:-""} \
--train_set ${task_data_path}/train.txt \
--dev_set ${task_data_path}/test.txt \
--test_set ${task_data_path}/test.txt \
--vocab_path ${vocab_path} \
--checkpoints ./output \
--save_steps ${save_steps} \
--weight_decay ${weight_decay} \
--warmup_proportion ${warmup} \
--validation_steps ${validation_steps} \
--epoch ${epoch} \
--max_seq_len ${max_len} \
--ernie_config_path ${config_path} \
--learning_rate ${lr_rate} \
--skip_steps 10 \
--num_iteration_per_drop_scope 10 \
--layer_decay_ratio ${layer_decay_ratio:-"0.8"} \
--num_labels ${num_labels} \
--is_zh ${is_zh:-"True"} \
--use_vars ${use_vars:-"False"} \
--init_loss_scaling ${init_loss_scaling:-32768} \
--use_recompute ${use_recompute:-"False"} \
--random_seed 1
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import argparse
def str2bool(v):
"""str2bool"""
# argparse does not support parsing "True"/"False" strings as Python
# booleans directly, hence this helper
return v.lower() in ("true", "t", "1")
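# e.g. str2bool("True") -> True, str2bool("false") -> False, str2bool("1") -> True;
# anything else (including "0" or "no") falls through to False.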
class ArgumentGroup(object):
"""ArgumentGroup"""
def __init__(self, parser, title, des):
"""init"""
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, positional_arg=False, **kwargs):
"""add_arg"""
prefix = "" if positional_arg else "--"
type = str2bool if type == bool else type
self._group.add_argument(
prefix + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
"""print arguments"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
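# A minimal usage sketch (the group and argument names below are illustrative only):
#   parser = argparse.ArgumentParser(__doc__)
#   demo_g = ArgumentGroup(parser, "demo", "demo options.")
#   demo_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
#   args = parser.parse_args()
#   print_arguments(args)
# bool-typed args are routed through str2bool, so "--use_cuda false" parses as expected.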
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import argparse
from utils.args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("rel_pos_params_sharing", bool, False, "If set, share u and v")
model_g.add_arg("is_zh", bool, True, "If true, use chinese data-loader")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_amp", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, False, "Whether to use dynamic loss scaling.")
train_g.add_arg("init_loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_amp is enabled.")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
"Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 5.0,
"The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8,
"The less-than-one-multiplier to use when decreasing.")
train_g.add_arg("use_recompute", bool, False, "Whether to use recompute")
train_g.add_arg("layer_decay_ratio", float, 0.8, "Set the layerwise learning rate decay ratio")
train_g.add_arg("weight_sharing", bool, True, "If set, share weights between word embedding and masked lm.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("tokenizer", str, "FullTokenizer",
"ATTENTION: the INPUT must be splited by Word with blank while using SentencepieceTokenizer or WordsegTokenizer")
data_g.add_arg("train_set", str, None, "Path to training data.")
data_g.add_arg("test_set", str, None, "Path to test data.")
data_g.add_arg("dev_set", str, None, "Path to validation data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("in_tokens", bool, False,
"If set, the batch size will be the maximum number of tokens in one batch. "
"Otherwise, it will be the maximum number of examples in one batch.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
data_g.add_arg("label_map_config", str, None, "label_map_path.")
data_g.add_arg("num_labels", int, 2, "label number")
data_g.add_arg("repeat_input", bool, False, "Whether to repeat the input sample")
data_g.add_arg("train_all", bool, False, "Whether to train all samples when repeat input")
data_g.add_arg("eval_all", bool, False, "Whether to eval all samples when repeat input")
data_g.add_arg("max_query_length", int, 64, "Max query length.")
data_g.add_arg("max_answer_length", int, 100, "Max answer length.")
data_g.add_arg("doc_stride", int, 128,
"When splitting up a long document into chunks, how much stride to take between chunks.")
data_g.add_arg("n_best_size", int, 20,
"The total number of n-best predictions to generate in the nbest_predictions.json output file.")
data_g.add_arg("use_vars", bool, True, "set for faster training, memory will not be in feed and fetch list")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("metrics", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("for_cn", bool, True, "model train for cn or for other langs.")
run_type_g.add_arg("stream_job", str, None, "if not None, then stream finetuning task by job id.")
# yapf: enable
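# The fine-tuning entry points consume this parser directly, e.g.:
#   from utils.finetune_args import parser
#   args = parser.parse_args()   # as done in run_classifier.py / run_mrc.py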
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def cast_fp32_to_fp16(exe, main_program):
"""cast fp32 to fp16"""
print("Cast parameters to float16 data format.")
for param in main_program.global_block().all_parameters():
if not param.name.endswith(".master"):
#load fp16
param_t = fluid.global_scope().find_var(param.name).get_tensor()
data = np.array(param_t)
if param.name.startswith("encoder_layer") \
and "layer_norm" not in param.name:
print(param.name)
param_t.set(np.float16(data).view(np.uint16), exe.place)
#load fp32
master_param_var = fluid.global_scope().find_var(param.name +
".master")
if master_param_var is not None:
master_param_var.get_tensor().set(data, exe.place)
def init_checkpoint(exe, init_checkpoint_path, main_program, use_amp=False):
"""init_checkpoint"""
assert os.path.exists(
init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
def existed_persitables(var):
if not fluid.io.is_persistable(var):
return False
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persistables)
print("Load model from {}".format(init_checkpoint_path))
if use_amp:
cast_fp32_to_fp16(exe, main_program)
def init_pretraining_params(exe, pretraining_params_path, main_program, use_amp=False):
"""init_pretraining_params"""
assert os.path.exists(pretraining_params_path
), "[%s] cann't be found." % pretraining_params_path
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
print("Load pretraining parameters from {}.".format(
pretraining_params_path))
if use_amp:
cast_fp32_to_fp16(exe, main_program)
def init_model(args, exe, startup_prog):
"""init model"""
init_func, init_path = None, None
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
print("init checkpoint from ", args.init_checkpoint)
init_func = init_checkpoint
init_path = args.init_checkpoint
elif args.init_pretraining_params:
print("init pretraining params from ", args.init_pretraining_params)
init_func = init_pretraining_params
init_path = args.init_pretraining_params
elif args.do_val or args.do_test:
init_path = args.init_checkpoint or args.init_pretraining_params
if not init_path:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_func = init_checkpoint
print("init pretraining params from ", args.init_checkpoint if args.init_checkpoint else args.init_pretraining_params)
if init_path:
init_func(exe, init_path, main_program=startup_prog, use_amp=args.use_amp)
# -*- coding: utf-8 -*-
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
metrics
"""
from __future__ import print_function
import numpy as np
from sklearn import metrics
import re
import json
import sys
import six
import imp
imp.reload(sys)
if six.PY2:
sys.setdefaultencoding("utf-8")
import nltk
class Acc(object):
"""Acc"""
def eval(self, run_value):
predict, label = run_value
if isinstance(predict, list):
tmp_arr = []
for one_batch in predict:
for pre in one_batch:
tmp_arr.append(np.argmax(pre))
else:
tmp_arr = []
for pre in predict:
tmp_arr.append(np.argmax(pre))
predict_arr = np.array(tmp_arr)
if isinstance(label, list):
tmp_arr = []
for one_batch in label:
batch_arr = [one_label for one_label in one_batch]
tmp_arr.extend(batch_arr)
label_arr = np.array(tmp_arr)
else:
label_arr = np.array(label.flatten())
score = metrics.accuracy_score(label_arr, predict_arr)
return score
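# Acc.eval flattens (possibly batched) logits, takes an argmax per example, and scores the
# result against the flattened labels with sklearn's accuracy_score.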
class EM_AND_F1(object):
# segmentation for mixed Chinese and English text
def _mixed_segmentation(self, in_str, rm_punc=False):
"""mixed_segmentation"""
if six.PY2:
in_str = str(in_str).decode('utf-8')
in_str = in_str.lower().strip()
segs_out = []
temp_str = ""
sp_char = ['-',':','_','*','^','/','\\','~','`','+','=',
',','。',':','?','!','“','”',';','’','《','》','……','·','、',
'「','」','(',')','-','~','『','』']
for char in in_str:
if rm_punc and char in sp_char:
continue
pattern = six.u(r'[\u4e00-\u9fa5]')
if re.search(pattern, char) or char in sp_char:
if temp_str != "":
ss = nltk.word_tokenize(temp_str)
segs_out.extend(ss)
temp_str = ""
segs_out.append(char)
else:
temp_str += char
#handling last part
if temp_str != "":
ss = nltk.word_tokenize(temp_str)
segs_out.extend(ss)
return segs_out
# remove punctuation
def _remove_punctuation(self, in_str):
"""remove_punctuation"""
if six.PY2:
in_str = str(in_str).decode('utf-8')
in_str = in_str.lower().strip()
sp_char = ['-',':','_','*','^','/','\\','~','`','+','=',
',','。',':','?','!','“','”',';','’','《','》','……','·','、',
'「','」','(',')','-','~','『','』']
out_segs = []
for char in in_str:
if char in sp_char:
continue
else:
out_segs.append(char)
return ''.join(out_segs)
# find the longest common substring
def _find_lcs(self, s1, s2):
m = [[0 for i in range(len(s2)+1)] for j in range(len(s1)+1)]
mmax = 0
p = 0
for i in range(len(s1)):
for j in range(len(s2)):
if s1[i] == s2[j]:
m[i+1][j+1] = m[i][j]+1
if m[i+1][j+1] > mmax:
mmax=m[i+1][j+1]
p=i+1
return s1[p-mmax:p], mmax
def _evaluate(self, ground_truth_file, prediction_file):
f1 = 0
em = 0
total_count = 0
skip_count = 0
for instances in ground_truth_file["data"]:
for instance in instances["paragraphs"]:
context_text = instance['context'].strip()
for qas in instance['qas']:
total_count += 1
query_id = qas['id'].strip()
query_text = qas['question'].strip()
answers = [ans["text"] for ans in qas["answers"]]
if query_id not in prediction_file:
sys.stderr.write('Unanswered question: {}\n'.format(query_id))
skip_count += 1
continue
prediction = str(prediction_file[query_id])
f1 += self._calc_f1_score(answers, prediction)
em += self._calc_em_score(answers, prediction)
f1_score = 100.0 * f1 / total_count
em_score = 100.0 * em / total_count
return f1_score, em_score, total_count, skip_count
def _calc_f1_score(self, answers, prediction):
f1_scores = []
for ans in answers:
ans_segs = self._mixed_segmentation(ans, rm_punc=True)
prediction_segs = self._mixed_segmentation(prediction, rm_punc=True)
lcs, lcs_len = self._find_lcs(ans_segs, prediction_segs)
if lcs_len == 0:
f1_scores.append(0)
continue
precision = 1.0*lcs_len/len(prediction_segs)
recall = 1.0*lcs_len/len(ans_segs)
f1 = (2*precision*recall)/(precision+recall)
f1_scores.append(f1)
return max(f1_scores)
def _calc_em_score(self, answers, prediction):
em = 0
for ans in answers:
ans_ = self._remove_punctuation(ans)
prediction_ = self._remove_punctuation(prediction)
if ans_ == prediction_:
em = 1
break
return em
def eval_file(self, dataset_file, prediction_file):
ground_truth_file = json.load(open(dataset_file, 'rb'))
prediction_file = json.load(open(prediction_file, 'rb'))
F1, EM, TOTAL, SKIP = self._evaluate(ground_truth_file, prediction_file)
AVG = (EM+F1)*0.5
return EM, F1, AVG, TOTAL
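# A minimal usage sketch (the file names are placeholders):
#   metric = EM_AND_F1()
#   em, f1, avg, total = metric.eval_file("dev.json", "eval_predictions.json")
# F1 is computed over the longest common substring of mixed Chinese/English segments,
# EM is exact match after punctuation removal, and AVG = (EM + F1) * 0.5.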
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""multi process to test"""
import os
import json
import time
import six
import math
import subprocess
#import commands
import collections
import logging
import numpy as np
class MultiProcessEvalForMrc(object):
"""multi process test for mrc tasks"""
def __init__(self, output_path, eval_phase, dev_count, gpu_id, tokenizer):
self.output_path = output_path
self.eval_phase = eval_phase
self.dev_count = dev_count
self.gpu_id = gpu_id
self.tokenizer = tokenizer
if not os.path.exists(self.output_path):
os.makedirs(self.output_path)
if not os.path.exists("./output"):
os.makedirs('./output')
self.output_prediction_file = os.path.join('./output', self.eval_phase + "_predictions.json")
self.output_nbest_file = os.path.join('./output', self.eval_phase + "_nbest_predictions.json")
def write_result(self, all_results):
"""write result to hard disk"""
outfile = self.output_path + "/" + self.eval_phase
outfile_part = outfile + ".part" + str(self.gpu_id)
writer = open(outfile_part, "w")
save_dict = json.dumps(all_results)
writer.write(save_dict)
writer.close()
tmp_writer = open(self.output_path + "/" + self.eval_phase + "_dec_finish." + str(self.gpu_id), "w")
tmp_writer.close()
def concat_result(self, RawResult):
"""read result from hard disk and concat them"""
outfile = self.output_path + "/" + self.eval_phase
all_results_read = []
while True:
ret = subprocess.check_output(['find', self.output_path, '-maxdepth', '1', '-name',
self.eval_phase + '_dec_finish.*'])
#_, ret = commands.getstatusoutput('find ' + self.output_path + \
# ' -maxdepth 1 -name ' + self.eval_phase + '"_dec_finish.*"')
if six.PY3:
ret = ret.decode()
ret = ret.rstrip().split("\n")
if len(ret) != self.dev_count:
time.sleep(1)
continue
for dev_cnt in range(self.dev_count):
fin_read = open(outfile + ".part" + str(dev_cnt), "rb")
cur_rawresult = json.loads(fin_read.read())
for tp in cur_rawresult:
assert len(tp) == 3
all_results_read.append(
RawResult(
unique_id=tp[0],
start_logits=tp[1],
end_logits=tp[2]))
#subprocess.check_output(["rm ", outfile + ".part*"])
#subprocess.check_output(["rm ", self.output_path + "/" + self.eval_phase + "_dec_finish.*"])
#commands.getstatusoutput("rm " + outfile + ".part*")
#commands.getstatusoutput("rm " + self.output_path + "/" + self.eval_phase + "_dec_finish.*")
os.system("rm " + outfile + "*.part*")
os.system("rm " + self.output_path + "/" + self.eval_phase + "_dec_finish.*")
break
return all_results_read
def write_predictions(self, all_examples, all_features, all_results, n_best_size,
max_answer_length, do_lower_case, output_prediction_file,
output_nbest_file):
"""Write final predictions to the json file and log-odds of null if needed."""
logging.info("Writing predictions to: %s" % (output_prediction_file))
logging.info("Writing nbest to: %s" % (output_nbest_file))
example_index_to_features = collections.defaultdict(list)
for feature in all_features:
example_index_to_features[feature.example_index].append(feature)
unique_id_to_result = {}
for result in all_results:
unique_id_to_result[result.unique_id] = result
_PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name
"PrelimPrediction", [
"feature_index", "start_index", "end_index", "start_logit",
"end_logit"
])
all_predictions = collections.OrderedDict()
all_nbest_json = collections.OrderedDict()
for (example_index, example) in enumerate(all_examples):
features = example_index_to_features[example_index]
prelim_predictions = []
# keep track of the minimum score of null start+end of position 0
for (feature_index, feature) in enumerate(features):
result = unique_id_to_result[feature.qid]
start_indexes = self._get_best_indexes(result.start_logits, n_best_size)
end_indexes = self._get_best_indexes(result.end_logits, n_best_size)
for start_index in start_indexes:
for end_index in end_indexes:
# We could hypothetically create invalid predictions, e.g., predict
# that the start of the span is in the question. We throw out all
# invalid predictions.
if start_index >= len(feature.tokens):
continue
if end_index >= len(feature.tokens):
continue
if start_index not in feature.token_to_orig_map:
continue
if end_index not in feature.token_to_orig_map:
continue
if not feature.token_is_max_context.get(start_index, False):
continue
if end_index < start_index:
continue
length = end_index - start_index + 1
if length > max_answer_length:
continue
prelim_predictions.append(
_PrelimPrediction(
feature_index=feature_index,
start_index=start_index,
end_index=end_index,
start_logit=result.start_logits[start_index],
end_logit=result.end_logits[end_index]))
prelim_predictions = sorted(
prelim_predictions,
key=lambda x: (x.start_logit + x.end_logit),
reverse=True)
_NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name
"NbestPrediction", ["text", "start_logit", "end_logit"])
seen_predictions = {}
nbest = []
for pred in prelim_predictions:
if len(nbest) >= n_best_size:
break
feature = features[pred.feature_index]
if pred.start_index > 0: # this is a non-null prediction
tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1
)]
orig_doc_start = feature.token_to_orig_map[pred.start_index]
orig_doc_end = feature.token_to_orig_map[pred.end_index]
orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end +
1)]
tok_text = " ".join(tok_tokens)
# De-tokenize WordPieces that have been split off.
tok_text = tok_text.replace(" ##", "")
tok_text = tok_text.replace("##", "")
# Clean whitespace
tok_text = tok_text.strip()
tok_text = " ".join(tok_text.split())
orig_text = "".join(orig_tokens)
final_text = self.get_final_text(tok_text, orig_text, do_lower_case)
if final_text in seen_predictions:
continue
seen_predictions[final_text] = True
else:
final_text = ""
seen_predictions[final_text] = True
nbest.append(
_NbestPrediction(
text=final_text,
start_logit=pred.start_logit,
end_logit=pred.end_logit))
# In very rare edge cases we could have no valid predictions. So we
# just create a nonce prediction in this case to avoid failure.
if not nbest:
nbest.append(
_NbestPrediction(
text="empty", start_logit=0.0, end_logit=0.0))
total_scores = []
best_non_null_entry = None
for entry in nbest:
total_scores.append(entry.start_logit + entry.end_logit)
probs = self._compute_softmax(total_scores)
nbest_json = []
for (i, entry) in enumerate(nbest):
output = collections.OrderedDict()
output["text"] = entry.text
output["probability"] = probs[i]
output["start_logit"] = entry.start_logit
output["end_logit"] = entry.end_logit
nbest_json.append(output)
assert len(nbest_json) >= 1
all_predictions[example.qas_id] = nbest_json[0]["text"]
all_nbest_json[example.qas_id] = nbest_json
with open(output_prediction_file, "w") as writer:
writer.write(json.dumps(all_predictions, indent=4) + "\n")
with open(output_nbest_file, "w") as writer:
writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
def get_final_text(self, pred_text, orig_text, do_lower_case):
"""Project the tokenized prediction back to the original text."""
# When we created the data, we kept track of the alignment between original
# (whitespace tokenized) tokens and our WordPiece tokenized tokens. So
# now `orig_text` contains the span of our original text corresponding to the
# span that we predicted.
#
# However, `orig_text` may contain extra characters that we don't want in
# our prediction.
#
# For example, let's say:
# pred_text = steve smith
# orig_text = Steve Smith's
#
# We don't want to return `orig_text` because it contains the extra "'s".
#
# We don't want to return `pred_text` because it's already been normalized
# (the SQuAD eval script also does punctuation stripping/lower casing but
# our tokenizer does additional normalization like stripping accent
# characters).
#
# What we really want to return is "Steve Smith".
#
# Therefore, we have to apply a semi-complicated alignment heuristic between
# `pred_text` and `orig_text` to get a character-to-character alignment. This
# can fail in certain cases in which case we just return `orig_text`.
def _strip_spaces(text):
ns_chars = []
ns_to_s_map = collections.OrderedDict()
for (i, c) in enumerate(text):
if c == " ":
continue
ns_to_s_map[len(ns_chars)] = i
ns_chars.append(c)
ns_text = "".join(ns_chars)
return (ns_text, ns_to_s_map)
# We first tokenize `orig_text`, strip whitespace from the result
# and `pred_text`, and check if they are the same length. If they are
# NOT the same length, the heuristic has failed. If they are the same
# length, we assume the characters are one-to-one aligned.
tok_text = " ".join(self.tokenizer.tokenize(orig_text))
start_position = tok_text.find(pred_text)
if start_position == -1:
return orig_text
end_position = start_position + len(pred_text) - 1
(orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)
(tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)
if len(orig_ns_text) != len(tok_ns_text):
return orig_text
# We then project the characters in `pred_text` back to `orig_text` using
# the character-to-character alignment.
tok_s_to_ns_map = {}
for (i, tok_index) in six.iteritems(tok_ns_to_s_map):
tok_s_to_ns_map[tok_index] = i
orig_start_position = None
if start_position in tok_s_to_ns_map:
ns_start_position = tok_s_to_ns_map[start_position]
if ns_start_position in orig_ns_to_s_map:
orig_start_position = orig_ns_to_s_map[ns_start_position]
if orig_start_position is None:
return orig_text
orig_end_position = None
if end_position in tok_s_to_ns_map:
ns_end_position = tok_s_to_ns_map[end_position]
if ns_end_position in orig_ns_to_s_map:
orig_end_position = orig_ns_to_s_map[ns_end_position]
if orig_end_position is None:
return orig_text
output_text = orig_text[orig_start_position:(orig_end_position + 1)]
return output_text
def _get_best_indexes(self, logits, n_best_size):
"""Get the n-best logits from a list."""
index_and_score = sorted(
enumerate(logits), key=lambda x: x[1], reverse=True)
best_indexes = []
for i in range(len(index_and_score)):
if i >= n_best_size:
break
best_indexes.append(index_and_score[i][0])
return best_indexes
def _compute_softmax(self, scores):
"""Compute softmax probability over raw logits."""
if not scores:
return []
max_score = None
for score in scores:
if max_score is None or score > max_score:
max_score = score
exp_scores = []
total_sum = 0.0
for score in scores:
x = math.exp(score - max_score)
exp_scores.append(x)
total_sum += x
probs = []
for score in exp_scores:
probs.append(score / total_sum)
return probs
class MultiProcessEvalForErnieDoc(object):
"""multi process test for mrc tasks"""
def __init__(self, output_path, eval_phase, dev_count, gpu_id):
self.output_path = output_path
self.eval_phase = eval_phase
self.dev_count = dev_count
self.gpu_id = gpu_id
if not os.path.exists(self.output_path):
os.makedirs(self.output_path)
if not os.path.exists("./output"):
os.makedirs('./output')
self.output_prediction_file = os.path.join('./output', self.eval_phase + "_predictions.json")
self.output_nbest_file = os.path.join('./output', self.eval_phase + "_nbest_predictions.json")
def write_result(self, all_results):
"""write result to hard disk"""
outfile = self.output_path + "/" + self.eval_phase
outfile_part = outfile + ".part" + str(self.gpu_id)
writer = open(outfile_part, "w")
save_dict = json.dumps(all_results)
writer.write(save_dict)
writer.close()
tmp_writer = open(self.output_path + "/" + self.eval_phase + "_dec_finish." + str(self.gpu_id), "w")
tmp_writer.close()
def concat_result(self, RawResult):
"""read result from hard disk and concat them"""
outfile = self.output_path + "/" + self.eval_phase
all_results_read = []
while True:
ret = subprocess.check_output(['find', self.output_path, '-maxdepth', '1', '-name',
self.eval_phase + '_dec_finish.*'])
#_, ret = commands.getstatusoutput('find ' + self.output_path + \
# ' -maxdepth 1 -name ' + self.eval_phase + '"_dec_finish.*"')
if six.PY3:
ret = ret.decode()
ret = ret.rstrip().split("\n")
if len(ret) != self.dev_count:
time.sleep(1)
continue
for dev_cnt in range(self.dev_count):
fin_read = open(outfile + ".part" + str(dev_cnt), "rb")
cur_rawresult = json.loads(fin_read.read())
for tp in cur_rawresult:
assert len(tp) == 3
all_results_read.append(
RawResult(
unique_id=tp[0],
prob=tp[1],
label=tp[2]))
#subprocess.check_output(["rm ", outfile + ".part*"])
#subprocess.check_output(["rm ", self.output_path + "/" + self.eval_phase + "_dec_finish.*"])
#commands.getstatusoutput("rm " + outfile + ".part*")
#commands.getstatusoutput("rm " + self.output_path + "/" + self.eval_phase + "_dec_finish.*")
os.system("rm " + outfile + "*.part*")
os.system("rm " + self.output_path + "/" + self.eval_phase + "_dec_finish.*")
break
return all_results_read
def write_predictions(self, all_results):
"""Write final predictions to the json file and log-odds of null if needed."""
unique_id_to_result = collections.defaultdict(list)
for result in all_results:
unique_id_to_result[result.unique_id].append(result)
print("data num: %d" % (len(unique_id_to_result)))
all_probs = []
all_labels = []
for key, value in unique_id_to_result.items():
prob_for_one_sample = [result.prob for result in value]
label_for_one_sample = [result.label for result in value]
assert len(set(label_for_one_sample)) == 1
#prob_emb = np.sum(np.array(prob_for_one_sample), axis=0).tolist()
prob_emb = np.mean(np.array(prob_for_one_sample), axis=0).tolist()
all_probs.append(prob_emb)
all_labels.append(list(set(label_for_one_sample)))
assert len(all_labels) == len(all_probs)
all_labels = np.array(all_labels).astype("float32")
all_probs = np.array(all_probs).astype("float32")
return len(unique_id_to_result), all_labels, all_probs
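# For ERNIE-Doc classification, each document is fed as several segments that share one
# unique_id; the segment-level probabilities are averaged (np.mean above) into a single
# document-level prediction, and the per-segment labels are asserted to agree.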