Unverified · Commit e34627d7 · Authored by Yibing Liu · Committed by GitHub

Add lang representation model XLNet (#3831)

Parent 43e4ec71
[中文版](README_cn.md)
This project is the implementation of [XLNet](https://github.com/zihangdai/xlnet) on Paddle Fluid, currently supporting fine-tuning on all downstream tasks, including natural language inference, question answering (SQuAD), etc.
There are many differences between XLNet and [BERT](../BERT). XLNet takes advantage of a novel architecture, [Transformer-XL](https://arxiv.org/abs/1901.02860), as the backbone of language representation, and adopts permutation language modeling as its optimization objective. XLNet was also pre-trained on much more data. As a result, it achieved SOTA results on several NLP tasks.
For more details, please refer to the research paper
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
## Installation
This project requires Paddle Fluid version 1.6.0 or later; please follow the [installation guide](https://www.paddlepaddle.org.cn/start) to install it.
## Pre-trained models
Two pre-trained models converted from the official release are available:
| Model | Layers | Hidden size | Heads |
| :------| :------: | :------: |:------: |
| [XLNet-Large, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz)| 24 | 1024 | 16 |
| [XLNet-Base, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz)| 12 | 768 | 12 |
Each compressed package contains one subdirectory and two files:
- `params`: a directory containing all converted parameters, one file per parameter.
- `spiece.model`: a [Sentence Piece](https://github.com/google/sentencepiece) model used for (de)tokenization.
- `xlnet_config.json`: a config file which specifies the hyperparameters of the model.
## Fine-tuning with XLNet
We provide scripts for fine-tuning XLNet on NLP tasks with multiple GPUs. Their correctness has been verified: all experiments on V100 GPUs achieve the same performance as officially reported (mainly on TPUs). In the following, we assume that the two pre-trained models have been downloaded and extracted.
### Text regression/classification
Fine-tuning for regression/classification can be performed via the script `run_classifier.py`, which contains examples for standard single-document classification, single-document regression, and document-pair classification. The two examples below, one for regression and one for classification, proceed as follows.
#### (1) STS-B: sentence pair relevance regression
- Download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to some directory `$GLUE_DIR`
- **Note**: You may encounter the error `ImportError: No module named request` when running the script under Python 2.x; this is because in Python 2 the module `urllib` has no submodule `request`. It can be resolved by replacing all occurrences of `urllib.request` with `urllib`, or by switching to a Python 3.x environment. A compatible import guard is sketched below.
- Perform fine-tuning on 4 V100 GPUs with XLNet-Large
```
export GLUE_DIR=glue_data
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=sts-b \
--data_dir=${GLUE_DIR}/STS-B \
--checkpoints=exp/sts-b \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--init_pretraining_params=${LARGE_DIR}/params \
--max_seq_length=128 \
--train_batch_size=8 \
--learning_rate=5e-5 \
--predict_dir=exp/sts-b-pred \
--skip_steps=10 \
--train_steps=1200 \
--warmup_steps=120 \
--save_steps=600 \
--is_regression=True
```
This configuration does not require much GPU memory; four V100 (or comparable) GPUs with 16GB each are sufficient.
When fine-tuning finishes, the evaluation on the dev set reports the average loss and the Pearson correlation coefficient:
```
[dev evaluation] ave loss: 0.383523, eval_pearsonr: 0.916912, elapsed time: 21.804057 s
```
The expected `eval_pearsonr` is `91.3+`, as quoted from the official repository, and this experiment reproduces that performance.
#### (2) IMDB: movie review sentiment classification
- Download and unpack the IMDB dataset by running
```shell
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
- Perform fine-tuning with XLNet-Large on 8 V100 GPUs (32GB) by running
```shell
export IMDB_DIR=aclImdb
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=imdb \
--checkpoints=exp/imdb \
--init_pretraining_params=${LARGE_DIR}/params \
--data_dir=${IMDB_DIR} \
--predict_dir=predict_imdb_1028 \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--max_seq_length=512 \
--train_batch_size=4 \
--eval_batch_size=8 \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
  --save_steps=500
```
The expected accuracy is `96.2+`; here is an example evaluation result
```
[dev evaluation] ave loss: 0.220047, eval_accuracy: 0.963480, elapsed time: 2799.974465 s
```
Fine-tuning on other NLP regression/classification tasks can be carried out in a similar way.
### SQuAD 2.0
- Download the SQuAD 2.0 data and put it in the `data/squad2.0` directory
```
mkdir -p data/squad2.0
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
- Perform fine-tuning on 6 V100 GPUs (32GB) by running the script `run_squad.py`
```
SQUAD_DIR=data/squad2.0
INIT_CKPT_DIR=xlnet_cased_L-24_H-1024_A-16
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
python run_squad.py \
--model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
--spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
--init_checkpoint=${INIT_CKPT_DIR}/params \
--train_file=${SQUAD_DIR}/train-v2.0.json \
--predict_file=${SQUAD_DIR}/dev-v2.0.json \
--uncased=False \
--checkpoints squad_2.0_0828 \
--max_seq_length=512 \
--do_train=True \
--do_predict=True \
--skip_steps=100 \
--epoch 200 \
--dropout=0.1 \
--dropatt=0.1 \
--train_batch_size=4 \
--predict_batch_size=3 \
--learning_rate=2e-5 \
--save_steps=1000 \
--train_steps=12000 \
--warmup_steps=1000 \
  --verbose=True
```
The final evaluation result after fine-tuning should look like
```
================================================================================
Result | best_f1 88.0893932758 | best_exact_thresh -2.07637166977 | best_exact 85.5049271456 | has_ans_f1 0.940979062625 | has_ans_exact 0.880566801619 | best_f1_thresh -2.07337403297 |
================================================================================
```
### Use your own data
Please refer to the data-format guidelines of GLUE/SQuAD if you want to use your own data for fine-tuning.
## Acknowledgement
We thank the authors of XLNet for their distinguished work!
"""this file is a copy of https://github.com/zihangdai/xlnet"""
import re
import numpy as np
from data_utils import SEP_ID, CLS_ID
SEG_ID_A = 0
SEG_ID_B = 1
SEG_ID_CLS = 2
SEG_ID_SEP = 3
SEG_ID_PAD = 4
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
    batches could cause silent errors.
"""
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenize_fn):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[1] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
if label_list is not None:
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenize_fn(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenize_fn(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for two [SEP] & one [CLS] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for one [SEP] & one [CLS] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[:max_seq_length - 2]
tokens = []
segment_ids = []
for token in tokens_a:
tokens.append(token)
segment_ids.append(SEG_ID_A)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_A)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(SEG_ID_B)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_B)
tokens.append(CLS_ID)
segment_ids.append(SEG_ID_CLS)
input_ids = tokens
# The mask has 0 for real tokens and 1 for padding tokens. Only real
# tokens are attended to.
input_mask = [0] * len(input_ids)
# Zero-pad up to the sequence length.
if len(input_ids) < max_seq_length:
delta_len = max_seq_length - len(input_ids)
input_ids = [0] * delta_len + input_ids
input_mask = [1] * delta_len + input_mask
segment_ids = [SEG_ID_PAD] * delta_len + segment_ids
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
if label_list is not None:
label_id = label_map[example.label]
else:
label_id = example.label
if ex_index < 1:
print("*** Example ***")
print("guid: %s" % (example.guid))
print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
print("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
print("label: {} (id = {})".format(example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
"""this file is a copy of https://github.com/zihangdai/xlnet"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
special_symbols = {
"<unk>": 0,
"<s>": 1,
"</s>": 2,
"<cls>": 3,
"<sep>": 4,
"<pad>": 5,
"<mask>": 6,
"<eod>": 7,
"<eop>": 8,
}
VOCAB_SIZE = 32000
UNK_ID = special_symbols["<unk>"]
CLS_ID = special_symbols["<cls>"]
SEP_ID = special_symbols["<sep>"]
MASK_ID = special_symbols["<mask>"]
EOD_ID = special_symbols["<eod>"]
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import paddle.fluid as fluid
import modeling
from model.xlnet import XLNetModel, _get_initializer
def get_regression_loss(args, xlnet_config, features, is_training=False):
"""Loss for downstream regression tasks."""
inp = fluid.layers.transpose(features["input_ids"], [1, 0, 2])
seg_id = features["segment_ids"]
inp_mask = fluid.layers.transpose(features["input_mask"], [1, 0])
label = features["label_ids"]
xlnet_model = XLNetModel(
input_ids=inp,
seg_ids=seg_id,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
summary = xlnet_model.get_pooled_out(args.summary_type, args.use_summ_proj)
per_example_loss, logits = modeling.regression_loss(
hidden=summary,
labels=label,
initializer=_get_initializer(args),
name="model_regression_{}".format(args.task_name.lower()),
return_logits=True)
total_loss = fluid.layers.reduce_mean(per_example_loss)
return total_loss, per_example_loss, logits
def get_classification_loss(args,
xlnet_config,
features,
n_class,
is_training=True):
"""Loss for downstream classification tasks."""
inp = fluid.layers.transpose(features["input_ids"], [1, 0, 2])
seg_id = features["segment_ids"]
inp_mask = fluid.layers.transpose(features["input_mask"], [1, 0])
label = features["label_ids"]
xlnet_model = XLNetModel(
input_ids=inp,
seg_ids=seg_id,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
summary = xlnet_model.get_pooled_out(args.summary_type, args.use_summ_proj)
per_example_loss, logits = modeling.classification_loss(
hidden=summary,
labels=label,
n_class=n_class,
initializer=xlnet_model.get_initializer(),
name="model_classification_{}".format(args.task_name),
return_logits=True)
total_loss = fluid.layers.reduce_mean(per_example_loss)
return total_loss, per_example_loss, logits
def create_model(args, xlnet_config, n_class, is_training=False):
label_ids_type = 'int64' if n_class else 'float32'
input_fields = {
'names': [
'input_ids', 'input_mask', 'segment_ids', 'label_ids',
'is_real_example'
],
'shapes': [[-1, args.max_seq_length, 1], [-1, args.max_seq_length],
[-1, args.max_seq_length], [-1, 1], [-1, 1]],
        # one dtype / lod level per input field
        'dtypes': ['int64', 'float32', 'int64', label_ids_type, 'int64'],
        'lod_levels': [0, 0, 0, 0, 0],
}
inputs = [
fluid.layers.data(
name=input_fields['names'][i],
shape=input_fields['shapes'][i],
dtype=input_fields['dtypes'][i],
lod_level=input_fields['lod_levels'][i])
for i in range(len(input_fields['names']))
]
(input_ids, input_mask, segment_ids, label_ids, is_real_example) = inputs
data_loader = fluid.io.DataLoader.from_generator(
feed_list=inputs, capacity=50, iterable=False)
features = collections.OrderedDict()
features["input_ids"] = input_ids
features["input_mask"] = input_mask
features["segment_ids"] = segment_ids
features["label_ids"] = label_ids
features["is_real_example"] = is_real_example
if args.is_regression:
(total_loss, per_example_loss, logits) = get_regression_loss(
args, xlnet_config, features, is_training)
else:
(total_loss, per_example_loss, logits) = get_classification_loss(
args, xlnet_config, features, n_class, is_training)
num_seqs = fluid.layers.fill_constant_batch_size_like(
input=label_ids, shape=[-1, 1], value=1, dtype="int64")
num_seqs = fluid.layers.reduce_sum(num_seqs)
return data_loader, total_loss, logits, num_seqs, label_ids
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""XLNet model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import json
import numpy as np
import paddle.fluid as fluid
import modeling
def _get_initializer(args):
if args.init == "uniform":
param_initializer = fluid.initializer.Uniform(
low=-args.init_range, high=args.init_range)
elif args.init == "normal":
param_initializer = fluid.initializer.Normal(scale=args.init_std)
else:
raise ValueError("Initializer {} not supported".format(args.init))
return param_initializer
def init_attn_mask(args, place):
"""create causal attention mask."""
qlen = args.max_seq_length
mlen = 0 if 'mem_len' not in args else args.mem_len
same_length = False if 'same_length' not in args else args.same_length
dtype = 'float16' if args.use_fp16 else 'float32'
attn_mask = np.ones([qlen, qlen], dtype=dtype)
mask_u = np.triu(attn_mask)
mask_dia = np.diag(np.diag(attn_mask))
attn_mask_pad = np.zeros([qlen, mlen], dtype=dtype)
attn_mask = np.concatenate([attn_mask_pad, mask_u - mask_dia], 1)
if same_length:
mask_l = np.tril(attn_mask)
attn_mask = np.concatenate(
            [attn_mask[:, :qlen] + mask_l - mask_dia, attn_mask[:, qlen:]], 1)
attn_mask = attn_mask[:, :, None, None]
attn_mask_t = fluid.global_scope().find_var("attn_mask").get_tensor()
attn_mask_t.set(attn_mask, place)
class XLNetConfig(object):
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing xlnet model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
    def has_key(self, key):
        return key in self._config_dict
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class XLNetModel(object):
def __init__(self,
xlnet_config,
input_ids,
seg_ids,
input_mask,
args,
mems=None,
perm_mask=None,
target_mapping=None,
inp_q=None):
self._tie_weight = True
self._d_head = xlnet_config['d_head']
self._d_inner = xlnet_config['d_inner']
self._d_model = xlnet_config['d_model']
self._ff_activation = xlnet_config['ff_activation']
self._n_head = xlnet_config['n_head']
self._n_layer = xlnet_config['n_layer']
self._n_token = xlnet_config['n_token']
self._untie_r = xlnet_config['untie_r']
self._xlnet_config = xlnet_config
self._dropout = args.dropout
self._dropatt = args.dropatt
self._mem_len = None if 'mem_len' not in args else args.mem_len
self._reuse_len = None if 'reuse_len' not in args else args.reuse_len
self._bi_data = False if 'bi_data' not in args else args.bi_data
self._clamp_len = args.clamp_len
self._same_length = False if 'same_length' not in args else args.same_length
# Initialize all weigths by the specified initializer, and all biases
# will be initialized by constant zero by default.
self._param_initializer = _get_initializer(args)
self.input_mask = input_mask
tfm_args = dict(
n_token=self._n_token,
initializer=self._param_initializer,
attn_type="bi",
n_layer=self._n_layer,
d_model=self._d_model,
n_head=self._n_head,
d_head=self._d_head,
d_inner=self._d_inner,
ff_activation=self._ff_activation,
untie_r=self._untie_r,
use_bfloat16=False,
dropout=self._dropout,
dropatt=self._dropatt,
mem_len=self._mem_len,
reuse_len=self._reuse_len,
bi_data=self._bi_data,
clamp_len=self._clamp_len,
same_length=self._same_length,
name='model_transformer')
input_args = dict(
inp_k=input_ids,
seg_id=seg_ids,
input_mask=input_mask,
mems=mems,
perm_mask=perm_mask,
target_mapping=target_mapping,
inp_q=inp_q)
tfm_args.update(input_args)
self.output, self.new_mems, self.lookup_table = modeling.transformer_xl(
**tfm_args)
#self._build_model(input_ids, sentence_ids, input_mask)
def get_initializer(self):
return self._param_initializer
def get_debug_ret(self):
return self.debug_ret
def get_sequence_output(self):
return self.output
def get_pooled_out(self, summary_type, use_summ_proj=True):
"""
Args:
summary_type: str, "last", "first", "mean", or "attn". The method
to pool the input to get a vector representation.
use_summ_proj: bool, whether to use a linear projection during pooling.
Returns:
float32 Tensor in shape [bsz, d_model], the pooled representation.
"""
summary = modeling.summarize_sequence(
summary_type=summary_type,
hidden=self.output,
d_model=self._d_model,
n_head=self._n_head,
d_head=self._d_head,
dropout=self._dropout,
dropatt=self._dropatt,
input_mask=self.input_mask,
initializer=self._param_initializer,
use_proj=use_summ_proj,
name='model_sequnece_summary')
return summary
import re
import numpy as np
import paddle.fluid as fluid
import collections
def log_softmax(logits, axis=-1):
logsoftmax = logits - fluid.layers.log(
fluid.layers.reduce_sum(fluid.layers.exp(logits), axis))
return logsoftmax
def einsum4x4(equation, x, y):
idx_x, idx_y, idx_z = re.split(",|->", equation)
repeated_idx = list(set(idx_x + idx_y) - set(idx_z))
unique_idx_x = list(set(idx_x) - set(idx_y))
unique_idx_y = list(set(idx_y) - set(idx_x))
common_idx = list(set(idx_x) & set(idx_y) - set(repeated_idx))
new_idx_x = common_idx + unique_idx_x + repeated_idx
new_idx_y = common_idx + unique_idx_y + repeated_idx
new_idx_z = common_idx + unique_idx_x + unique_idx_y
perm_x = [idx_x.index(i) for i in new_idx_x]
perm_y = [idx_y.index(i) for i in new_idx_y]
perm_z = [new_idx_z.index(i) for i in idx_z]
x = fluid.layers.transpose(x, perm=perm_x)
y = fluid.layers.transpose(y, perm=perm_y)
z = fluid.layers.matmul(x=x, y=y, transpose_y=True)
z = fluid.layers.transpose(z, perm=perm_z)
return z
def positional_embedding(pos_seq, inv_freq, bsz=None):
pos_seq = fluid.layers.reshape(pos_seq, [-1, 1])
inv_freq = fluid.layers.reshape(inv_freq, [1, -1])
sinusoid_inp = fluid.layers.matmul(pos_seq, inv_freq)
pos_emb = fluid.layers.concat(
input=[fluid.layers.sin(sinusoid_inp), fluid.layers.cos(sinusoid_inp)],
axis=-1)
pos_emb = fluid.layers.unsqueeze(pos_emb, [1])
if bsz is not None:
pos_emb = fluid.layers.expand(pos_emb, [1, bsz, 1])
return pos_emb
def positionwise_ffn(inp,
d_model,
d_inner,
dropout_prob,
param_initializer=None,
act_type='relu',
name='ff'):
"""Position-wise Feed-forward Network."""
if act_type not in ['relu', 'gelu']:
raise ValueError('Unsupported activation type {}'.format(act_type))
output = fluid.layers.fc(input=inp,
size=d_inner,
act=act_type,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_layer_1_weight',
initializer=param_initializer),
bias_attr=name + '_layer_1_bias')
output = fluid.layers.dropout(
output,
dropout_prob=dropout_prob,
dropout_implementation="upscale_in_train",
is_test=False)
output = fluid.layers.fc(output,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_layer_2_weight',
initializer=param_initializer),
bias_attr=name + '_layer_2_bias')
output = fluid.layers.dropout(
output,
dropout_prob=dropout_prob,
dropout_implementation="upscale_in_train",
is_test=False)
output = fluid.layers.layer_norm(
output + inp,
begin_norm_axis=len(output.shape) - 1,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
return output
def head_projection(h, d_model, n_head, d_head, param_initializer, name=''):
"""Project hidden states to a specific head with a 4D-shape."""
proj_weight = fluid.layers.create_parameter(
shape=[d_model, n_head, d_head],
dtype=h.dtype,
attr=fluid.ParamAttr(
name=name + '_weight', initializer=param_initializer),
is_bias=False)
# ibh,hnd->ibnd
head = fluid.layers.mul(x=h,
y=proj_weight,
x_num_col_dims=2,
y_num_col_dims=1)
return head
def post_attention(h,
attn_vec,
d_model,
n_head,
d_head,
dropout,
param_initializer,
residual=True,
name=''):
"""Post-attention processing."""
# post-attention projection (back to `d_model`)
proj_o = fluid.layers.create_parameter(
shape=[d_model, n_head, d_head],
dtype=h.dtype,
attr=fluid.ParamAttr(
name=name + '_o_weight', initializer=param_initializer),
is_bias=False)
# ibnd,hnd->ibh
proj_o = fluid.layers.transpose(proj_o, perm=[1, 2, 0])
attn_out = fluid.layers.mul(x=attn_vec,
y=proj_o,
x_num_col_dims=2,
y_num_col_dims=2)
attn_out = fluid.layers.dropout(
attn_out,
dropout_prob=dropout,
dropout_implementation="upscale_in_train",
is_test=False)
if residual:
output = fluid.layers.layer_norm(
attn_out + h,
begin_norm_axis=len(attn_out.shape) - 1,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
else:
output = fluid.layers.layer_norm(
attn_out,
begin_norm_axis=len(attn_out.shape) - 1,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
return output
def abs_attn_core(q_head, k_head, v_head, attn_mask, dropatt, scale):
"""Core absolute positional attention operations."""
attn_score = einsum4x4('ibnd,jbnd->ijbn', q_head, k_head)
attn_score *= scale
if attn_mask is not None:
attn_score = attn_score - 1e30 * attn_mask
# attention probability
attn_prob = fluid.layers.softmax(attn_score, axis=1)
attn_prob = fluid.layers.dropout(
attn_prob,
dropout_prob=dropatt,
dropout_implementation="upscale_in_train",
is_test=False)
# attention output
attn_vec = einsum4x4('ijbn,jbnd->ibnd', attn_prob, v_head)
return attn_vec
def rel_attn_core(q_head, k_head_h, v_head_h, k_head_r, seg_embed, seg_mat,
r_w_bias, r_r_bias, r_s_bias, attn_mask, dropatt, scale,
name):
"""Core relative positional attention operations."""
## content based attention score
ac = einsum4x4('ibnd,jbnd->ijbn',
fluid.layers.elementwise_add(q_head, r_w_bias, 2), k_head_h)
# position based attention score
bd = einsum4x4('ibnd,jbnd->ijbn',
fluid.layers.elementwise_add(q_head, r_r_bias, 2), k_head_r)
bd = rel_shift(bd, klen=ac.shape[1])
# segment based attention score
if seg_mat is None:
ef = 0
else:
seg_embed = fluid.layers.stack([seg_embed] * q_head.shape[0], axis=0)
ef = einsum4x4('ibnd,isnd->ibns',
fluid.layers.elementwise_add(q_head, r_s_bias, 2),
seg_embed)
ef = einsum4x4('ijbs,ibns->ijbn', seg_mat, ef)
attn_score = (ac + bd + ef) * scale
if attn_mask is not None:
# attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask
attn_score = attn_score - 1e30 * attn_mask
# attention probability
attn_prob = fluid.layers.softmax(attn_score, axis=1)
attn_prob = fluid.layers.dropout(
attn_prob, dropatt, dropout_implementation="upscale_in_train")
# attention output
attn_vec = einsum4x4('ijbn,jbnd->ibnd', attn_prob, v_head_h)
return attn_vec
def rel_shift(x, klen=-1):
"""perform relative shift to form the relative attention score."""
x_size = x.shape
x = fluid.layers.reshape(x, [x_size[1], x_size[0], x_size[2], x_size[3]])
x = fluid.layers.slice(x, axes=[0], starts=[1], ends=[x_size[1]])
x = fluid.layers.reshape(x,
[x_size[0], x_size[1] - 1, x_size[2], x_size[3]])
x = fluid.layers.slice(x, axes=[1], starts=[0], ends=[klen])
return x
def _cache_mem(curr_out, prev_mem, mem_len, reuse_len=None):
"""cache hidden states into memory."""
if mem_len is None or mem_len == 0:
return None
else:
if reuse_len is not None and reuse_len > 0:
curr_out = curr_out[:reuse_len]
if prev_mem is None:
new_mem = curr_out[-mem_len:]
else:
            new_mem = fluid.layers.concat([prev_mem, curr_out], 0)[-mem_len:]
new_mem.stop_gradient = True
return new_mem
def relative_positional_encoding(qlen,
klen,
d_model,
clamp_len,
attn_type,
bi_data,
bsz=None,
dtype=None):
"""create relative positional encoding."""
freq_seq = fluid.layers.range(0, d_model, 2.0, 'float32')
if dtype is not None and dtype != 'float32':
        freq_seq = fluid.layers.cast(freq_seq, dtype=dtype)
inv_freq = 1 / (10000**(freq_seq / d_model))
if attn_type == 'bi':
beg, end = klen, -qlen
elif attn_type == 'uni':
beg, end = klen, -1
else:
raise ValueError('Unknown `attn_type` {}.'.format(attn_type))
if bi_data:
fwd_pos_seq = fluid.layers.range(beg, end, -1.0, 'float32')
bwd_pos_seq = fluid.layers.range(-beg, -end, 1.0, 'float32')
if dtype is not None and dtype != 'float32':
fwd_pos_seq = fluid.layers.cast(fwd_pos_seq, dtype='float32')
bwd_pos_seq = fluid.layers.cast(bwd_pos_seq, dtype='float32')
if clamp_len > 0:
fwd_pos_seq = fluid.layers.clip(fwd_pos_seq, -clamp_len, clamp_len)
bwd_pos_seq = fluid.layers.clip(bwd_pos_seq, -clamp_len, clamp_len)
if bsz is not None:
# With bi_data, the batch size should be divisible by 2.
assert bsz % 2 == 0
fwd_pos_emb = positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)
bwd_pos_emb = positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)
else:
fwd_pos_emb = positional_embedding(fwd_pos_seq, inv_freq)
bwd_pos_emb = positional_embedding(bwd_pos_seq, inv_freq)
pos_emb = fluid.layers.concat([fwd_pos_emb, bwd_pos_emb], axis=1)
else:
fwd_pos_seq = fluid.layers.range(beg, end, -1.0, 'float32')
if dtype is not None and dtype != 'float32':
fwd_pos_seq = fluid.layers.cast(fwd_pos_seq, dtype=dtype)
if clamp_len > 0:
fwd_pos_seq = fluid.layers.clip(fwd_pos_seq, -clamp_len, clamp_len)
pos_emb = positional_embedding(fwd_pos_seq, inv_freq, bsz)
fluid.layers.reshape(pos_emb, [2 * qlen, -1, d_model], inplace=True)
return pos_emb
def multihead_attn(q,
k,
v,
attn_mask,
d_model,
n_head,
d_head,
dropout,
dropatt,
is_training,
kernel_initializer,
                   residual=True,
                   name='abs_attn'):
    """Standard multi-head attention with absolute positional embedding.

    `is_training` is kept for signature parity with the TF original; dropout
    in the helpers below always runs in upscale_in_train mode.
    """
    scale = 1 / (d_head**0.5)
    # attention heads
    q_head = head_projection(q, d_model, n_head, d_head, kernel_initializer,
                             name + '_q')
    k_head = head_projection(k, d_model, n_head, d_head, kernel_initializer,
                             name + '_k')
    v_head = head_projection(v, d_model, n_head, d_head, kernel_initializer,
                             name + '_v')
    # attention vector
    attn_vec = abs_attn_core(q_head, k_head, v_head, attn_mask, dropatt,
                             scale)
    # post-attention projection and residual/layer norm
    output = post_attention(v, attn_vec, d_model, n_head, d_head, dropout,
                            kernel_initializer, residual, name=name)
    return output
def rel_multihead_attn(h,
r,
r_w_bias,
r_r_bias,
seg_mat,
r_s_bias,
seg_embed,
attn_mask,
mems,
d_model,
n_head,
d_head,
dropout,
dropatt,
initializer,
name=''):
"""Multi-head attention with relative positional encoding."""
scale = 1 / (d_head**0.5)
if mems is not None and len(mems.shape) > 1:
cat = fluid.layers.concat([mems, h], 0)
else:
cat = h
# content heads
q_head_h = head_projection(
h, d_model, n_head, d_head, initializer, name=name + '_rel_attn_q')
k_head_h = head_projection(
cat, d_model, n_head, d_head, initializer, name=name + '_rel_attn_k')
v_head_h = head_projection(
cat, d_model, n_head, d_head, initializer, name=name + '_rel_attn_v')
# positional heads
k_head_r = head_projection(
r, d_model, n_head, d_head, initializer, name=name + '_rel_attn_r')
# core attention ops
attn_vec = rel_attn_core(q_head_h, k_head_h, v_head_h, k_head_r, seg_embed,
seg_mat, r_w_bias, r_r_bias, r_s_bias, attn_mask,
dropatt, scale, name)
# post processing
output = post_attention(
h,
attn_vec,
d_model,
n_head,
d_head,
dropout,
initializer,
name=name + '_rel_attn')
return output
def transformer_xl(inp_k,
n_token,
n_layer,
d_model,
n_head,
d_head,
d_inner,
dropout,
dropatt,
attn_type,
bi_data,
initializer,
mem_len=None,
inp_q=None,
mems=None,
same_length=False,
clamp_len=-1,
untie_r=False,
input_mask=None,
perm_mask=None,
seg_id=None,
reuse_len=None,
ff_activation='relu',
target_mapping=None,
use_fp16=False,
name='',
**kwargs):
"""
Defines a Transformer-XL computation graph with additional
support for XLNet.
Args:
inp_k: int64 Tensor in shape [len, bsz], the input token IDs.
seg_id: int64 Tensor in shape [len, bsz], the input segment IDs.
input_mask: float32 Tensor in shape [len, bsz], the input mask.
0 for real tokens and 1 for padding.
mems: a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
from previous batches. The length of the list equals n_layer.
If None, no memory is used.
perm_mask: float32 Tensor in shape [len, len, bsz].
        If perm_mask[i, j, k] = 0, i attends to j in batch k;
if perm_mask[i, j, k] = 1, i does not attend to j in batch k.
If None, each position attends to all the others.
target_mapping: float32 Tensor in shape [num_predict, len, bsz].
        If target_mapping[i, j, k] = 1, the i-th prediction in batch k is
on the j-th token.
Only used during pretraining for partial prediction.
Set to None during finetuning.
inp_q: float32 Tensor in shape [len, bsz].
1 for tokens with losses and 0 for tokens without losses.
Only used during pretraining for two-stream attention.
Set to None during finetuning.
n_layer: int, the number of layers.
d_model: int, the hidden size.
n_head: int, the number of attention heads.
d_head: int, the dimension size of each attention head.
d_inner: int, the hidden size in feed-forward layers.
ff_activation: str, "relu" or "gelu".
untie_r: bool, whether to untie the biases in attention.
n_token: int, the vocab size.
is_training: bool, whether in training mode.
use_tpu: bool, whether TPUs are used.
      use_fp16: bool, whether to use float16 instead of float32.
dropout: float, dropout rate.
dropatt: float, dropout rate on attention probabilities.
init: str, the initialization scheme, either "normal" or "uniform".
init_range: float, initialize the parameters with a uniform distribution
in [-init_range, init_range]. Only effective when init="uniform".
init_std: float, initialize the parameters with a normal distribution
with mean 0 and stddev init_std. Only effective when init="normal".
mem_len: int, the number of tokens to cache.
      reuse_len: int, the number of tokens in the current batch to be cached
and reused in the future.
bi_data: bool, whether to use bidirectional input pipeline.
Usually set to True during pretraining and False during finetuning.
clamp_len: int, clamp all relative distances larger than clamp_len.
-1 means no clamping.
same_length: bool, whether to use the same attention length for each token.
summary_type: str, "last", "first", "mean", or "attn". The method
to pool the input to get a vector representation.
"""
print('memory input {}'.format(mems))
data_type = "float16" if use_fp16 else "float32"
print('Use float type {}'.format(data_type))
qlen = inp_k.shape[0]
mlen = mems[0].shape[0] if mems is not None else 0
klen = mlen + qlen
bsz = fluid.layers.slice(
fluid.layers.shape(inp_k), axes=[0], starts=[1], ends=[2])
##### Attention mask
# causal attention mask
if attn_type == 'uni':
attn_mask = fluid.layers.create_global_var(
name='attn_mask',
shape=[qlen, klen, 1, 1],
value=0.0,
dtype=data_type,
persistable=True)
elif attn_type == 'bi':
attn_mask = None
else:
raise ValueError('Unsupported attention type: {}'.format(attn_type))
# data mask: input mask & perm mask
if input_mask is not None and perm_mask is not None:
data_mask = fluid.layers.unsqueeze(input_mask, [0]) + perm_mask
elif input_mask is not None and perm_mask is None:
data_mask = fluid.layers.unsqueeze(input_mask, [0])
elif input_mask is None and perm_mask is not None:
data_mask = perm_mask
else:
data_mask = None
if data_mask is not None:
# all mems can be attended to
mems_mask = fluid.layers.zeros(
shape=[data_mask.shape[0], mlen, 1], dtype='float32')
mems_mask = fluid.layers.expand(mems_mask, [1, 1, bsz])
data_mask = fluid.layers.concat([mems_mask, data_mask], 1)
if attn_mask is None:
attn_mask = fluid.layers.unsqueeze(data_mask, [-1])
else:
attn_mask += fluid.layers.unsqueeze(data_mask, [-1])
if attn_mask is not None:
attn_mask = fluid.layers.cast(attn_mask > 0, dtype=data_type)
if attn_mask is not None:
non_tgt_mask = fluid.layers.diag(
np.array([-1] * qlen).astype(data_type))
non_tgt_mask = fluid.layers.concat(
[fluid.layers.zeros(
[qlen, mlen], dtype=data_type), non_tgt_mask],
axis=-1)
attn_mask = fluid.layers.expand(attn_mask, [qlen, 1, 1, 1])
non_tgt_mask = fluid.layers.unsqueeze(non_tgt_mask, axes=[2, 3])
non_tgt_mask = fluid.layers.expand(non_tgt_mask, [1, 1, bsz, 1])
non_tgt_mask = fluid.layers.cast(
(attn_mask + non_tgt_mask) > 0, dtype=data_type)
non_tgt_mask.stop_gradient = True
else:
non_tgt_mask = None
if untie_r:
r_w_bias = fluid.layers.create_parameter(
shape=[n_layer, n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_w_bias', initializer=initializer),
is_bias=True)
r_w_bias = [
fluid.layers.slice(
r_w_bias, axes=[0], starts=[i], ends=[i + 1])
for i in range(n_layer)
]
r_w_bias = [
fluid.layers.squeeze(
r_w_bias[i], axes=[0]) for i in range(n_layer)
]
r_r_bias = fluid.layers.create_parameter(
shape=[n_layer, n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_r_bias', initializer=initializer),
is_bias=True)
r_r_bias = [
fluid.layers.slice(
r_r_bias, axes=[0], starts=[i], ends=[i + 1])
for i in range(n_layer)
]
r_r_bias = [
fluid.layers.squeeze(
r_r_bias[i], axes=[0]) for i in range(n_layer)
]
else:
r_w_bias = fluid.layers.create_parameter(
shape=[n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_w_bias', initializer=initializer),
is_bias=True)
r_r_bias = fluid.layers.create_parameter(
shape=[n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_r_bias', initializer=initializer),
is_bias=True)
lookup_table = fluid.layers.create_parameter(
shape=[n_token, d_model],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_word_embedding', initializer=initializer),
is_bias=False)
word_emb_k = fluid.layers.embedding(
input=inp_k,
size=[n_token, d_model],
dtype=data_type,
param_attr=fluid.ParamAttr(
name=name + '_word_embedding', initializer=initializer))
if inp_q is not None:
pass
output_h = fluid.layers.dropout(
word_emb_k,
dropout_prob=dropout,
dropout_implementation="upscale_in_train")
if inp_q is not None:
pass
if seg_id is not None:
if untie_r:
r_s_bias = fluid.layers.create_parameter(
shape=[n_layer, n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_s_bias', initializer=initializer),
is_bias=True)
r_s_bias = [
fluid.layers.slice(
r_s_bias, axes=[0], starts=[i], ends=[i + 1])
for i in range(n_layer)
]
r_s_bias = [
fluid.layers.squeeze(
r_s_bias[i], axes=[0]) for i in range(n_layer)
]
else:
r_s_bias = fluid.layers.create_parameter(
shape=[n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_r_s_bias', initializer=initializer),
is_bias=True)
seg_embed = fluid.layers.create_parameter(
shape=[n_layer, 2, n_head, d_head],
dtype=data_type,
attr=fluid.ParamAttr(
name=name + '_seg_embed', initializer=initializer))
seg_embed = [
fluid.layers.slice(
seg_embed, axes=[0], starts=[i], ends=[i + 1])
for i in range(n_layer)
]
seg_embed = [
fluid.layers.squeeze(
seg_embed[i], axes=[0]) for i in range(n_layer)
]
        # Convert `seg_id` to a one-hot `seg_mat`
# seg_id: [bsz, qlen, 1]
mem_pad = fluid.layers.fill_constant_batch_size_like(
input=seg_id, shape=[-1, mlen], value=0, dtype='int64')
# cat_ids: [bsz, klen, 1]
cat_ids = fluid.layers.concat(input=[mem_pad, seg_id], axis=1)
seg_id = fluid.layers.stack([seg_id] * klen, axis=2)
cat_ids = fluid.layers.stack([cat_ids] * qlen, axis=2)
cat_ids = fluid.layers.transpose(cat_ids, perm=[0, 2, 1])
# seg_mat: [bsz, qlen, klen]
seg_mat = fluid.layers.cast(
fluid.layers.logical_not(fluid.layers.equal(seg_id, cat_ids)),
dtype='int64')
seg_mat = fluid.layers.transpose(seg_mat, perm=[1, 2, 0])
seg_mat = fluid.layers.unsqueeze(seg_mat, [-1])
seg_mat = fluid.layers.one_hot(seg_mat, 2)
seg_mat.stop_gradient = True
else:
seg_mat = None
pos_emb = relative_positional_encoding(
qlen,
klen,
d_model,
clamp_len,
attn_type,
bi_data,
bsz=bsz,
dtype=data_type)
pos_emb = fluid.layers.dropout(
pos_emb, dropout, dropout_implementation="upscale_in_train")
pos_emb.stop_gradient = True
##### Attention layers
if mems is None:
mems = [None] * n_layer
for i in range(n_layer):
# cache new mems
#new_mems.append(_cache_mem(output_h, mems[i], mem_len, reuse_len))
# segment bias
if seg_id is None:
r_s_bias_i = None
seg_embed_i = None
else:
r_s_bias_i = r_s_bias if not untie_r else r_s_bias[i]
seg_embed_i = seg_embed[i]
if inp_q is not None:
pass
else:
output_h = rel_multihead_attn(
h=output_h,
r=pos_emb,
r_w_bias=r_w_bias if not untie_r else r_w_bias[i],
r_r_bias=r_r_bias if not untie_r else r_r_bias[i],
seg_mat=seg_mat,
r_s_bias=r_s_bias_i,
seg_embed=seg_embed_i,
attn_mask=non_tgt_mask,
mems=mems[i],
d_model=d_model,
n_head=n_head,
d_head=d_head,
dropout=dropout,
dropatt=dropatt,
initializer=initializer,
name=name + '_layer_{}'.format(i))
if inp_q is not None:
pass
output_h = positionwise_ffn(
inp=output_h,
d_model=d_model,
d_inner=d_inner,
dropout_prob=dropout,
param_initializer=initializer,
act_type=ff_activation,
name=name + '_layer_{}_ff'.format(i))
    # `inp_q` (the query stream for two-stream pretraining) is not ported in
    # this fine-tuning implementation, so only the content stream is output.
    output = fluid.layers.dropout(
        output_h, dropout, dropout_implementation="upscale_in_train")
new_mems = None
return output, new_mems, lookup_table
def lm_loss(hidden,
target,
n_token,
d_model,
initializer,
lookup_table=None,
tie_weight=False,
bi_data=True):
if tie_weight:
assert lookup_table is not None, \
'lookup_table cannot be None for tie_weight'
softmax_w = lookup_table
else:
softmax_w = fluid.layers.create_parameter(
shape=[n_token, d_model],
dtype=hidden.dtype,
attr=fluid.ParamAttr(
name='model_loss_weight', initializer=initializer),
is_bias=False)
softmax_b = fluid.layers.create_parameter(
shape=[n_token],
dtype=hidden.dtype,
attr=fluid.ParamAttr(
name='model_lm_loss_bias', initializer=initializer),
is_bias=False)
logits = fluid.layers.matmul(
x=hidden, y=softmax_w, transpose_y=True) + softmax_b
    loss = fluid.layers.softmax_with_cross_entropy(
        logits=logits, label=target)
return loss
def summarize_sequence(summary_type,
hidden,
d_model,
n_head,
d_head,
dropout,
dropatt,
input_mask,
initializer,
scope=None,
reuse=None,
use_proj=True,
name=''):
"""
    Different classification tasks may or may not share the same parameters
to summarize the sequence features.
If shared, one can keep the `scope` to the default value `None`.
Otherwise, one should specify a different `scope` for each task.
"""
if summary_type == 'last':
summary = hidden[-1]
elif summary_type == 'first':
summary = hidden[0]
elif summary_type == 'mean':
summary = fluid.layers.reduce_mean(hidden, axis=0)
    elif summary_type == 'attn':
        # The attention-pooling summary from the original TF implementation
        # (a learned query attending over `hidden` via multihead_attn) has
        # not been ported to Paddle Fluid.
        raise NotImplementedError('summary_type "attn" is not supported yet')
else:
raise ValueError('Unsupported summary type {}'.format(summary_type))
# use another projection as in BERT
if use_proj:
summary = fluid.layers.fc(input=summary,
size=d_model,
act='tanh',
param_attr=fluid.ParamAttr(
name=name + '_summary_weight',
initializer=initializer),
bias_attr=name + '_summary_bias')
summary = fluid.layers.dropout(
summary,
dropout_prob=dropout,
dropout_implementation="upscale_in_train")
return summary
def classification_loss(hidden,
labels,
n_class,
initializer,
name,
reuse=None,
return_logits=False):
"""
Different classification tasks should use different parameter names to ensure
different dense layers (parameters) are used to produce the logits.
An exception will be in transfer learning, where one hopes to transfer
the classification weights.
"""
logits = fluid.layers.fc(input=hidden,
size=n_class,
param_attr=fluid.ParamAttr(
name=name + '_logit_weight',
initializer=initializer),
bias_attr=name + '_logit_bias')
one_hot_target = fluid.layers.one_hot(labels, depth=n_class)
loss = -1.0 * fluid.layers.reduce_sum(
log_softmax(logits) * one_hot_target, dim=-1)
if return_logits:
return loss, logits
return loss
def regression_loss(hidden, labels, initializer, name, return_logits=False):
logits = fluid.layers.fc(input=hidden,
size=1,
param_attr=fluid.ParamAttr(
name=name + '_logits_weight',
initializer=initializer),
bias_attr=name + '_logits_bias')
loss = fluid.layers.square(logits - labels)
if return_logits:
return loss, logits
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import numpy as np
import paddle.fluid as fluid
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
lr_layer_decay_rate=1.0,
scheduler='linear_warmup_decay'):
scheduled_lr = None
if scheduler == 'noam_decay':
        if warmup_steps > 0:
            scheduled_lr = fluid.layers.learning_rate_scheduler.noam_decay(
                1 / (warmup_steps * (learning_rate**2)), warmup_steps)
        else:
            print(
                "WARNING: noam decay should have positive warmup steps, using "
                "constant learning rate instead!")
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
if lr_layer_decay_rate != 1.0:
n_layer = 0
for param in fluid.default_main_program().block(0).all_parameters():
m = re.search(r"model_transformer_layer_(\d+?)_", param.name)
if not m: continue
n_layer = max(n_layer, int(m.group(1)) + 1)
for param in fluid.default_main_program().block(0).all_parameters():
for l in range(n_layer):
if "model_transformer_layer_{}_".format(l) in param.name:
param.optimize_attr[
'learning_rate'] = lr_layer_decay_rate**(
n_layer - 1 - l)
print("Apply lr decay {:.4f} to layer-{} grad of {}".format(
param.optimize_attr['learning_rate'], l, param.name))
break
def exclude_from_weight_decay(param):
        # strip the fp16 master-weight suffix, if present (str.rstrip would
        # strip a character set, not the suffix)
        name = re.sub(r"\.master$", "", param.name)
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
param_list = dict()
if weight_decay > 0:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr
# coding=utf-8
"""this file is a copy of https://github.com/zihangdai/xlnet"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import unicodedata
import six
from functools import partial
SPIECE_UNDERLINE = '▁'
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, unicode):
return text.encode("utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def print_(*args):
new_args = []
for arg in args:
if isinstance(arg, list):
s = [printable_text(i) for i in arg]
s = ' '.join(s)
new_args.append(s)
else:
new_args.append(printable_text(arg))
print(*new_args)
def preprocess_text(inputs, lower=False, remove_space=True, keep_accents=False):
if remove_space:
outputs = ' '.join(inputs.strip().split())
else:
outputs = inputs
outputs = outputs.replace("``", '"').replace("''", '"')
if six.PY2 and isinstance(outputs, str):
outputs = outputs.decode('utf-8')
if not keep_accents:
outputs = unicodedata.normalize('NFKD', outputs)
outputs = ''.join([c for c in outputs if not unicodedata.combining(c)])
if lower:
outputs = outputs.lower()
return outputs
def encode_pieces(sp_model, text, return_unicode=True, sample=False):
# return_unicode is used only for py2
# note(zhiliny): in some systems, sentencepiece only accepts str for py2
if six.PY2 and isinstance(text, unicode):
text = text.encode('utf-8')
if not sample:
pieces = sp_model.EncodeAsPieces(text)
else:
pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
new_pieces = []
for piece in pieces:
if len(piece) > 1 and piece[-1] == ',' and piece[-2].isdigit():
cur_pieces = sp_model.EncodeAsPieces(piece[:-1].replace(
SPIECE_UNDERLINE, ''))
if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][
0] == SPIECE_UNDERLINE:
if len(cur_pieces[0]) == 1:
cur_pieces = cur_pieces[1:]
else:
cur_pieces[0] = cur_pieces[0][1:]
cur_pieces.append(piece[-1])
new_pieces.extend(cur_pieces)
else:
new_pieces.append(piece)
# note(zhiliny): convert back to unicode for py2
if six.PY2 and return_unicode:
ret_pieces = []
for piece in new_pieces:
if isinstance(piece, str):
piece = piece.decode('utf-8')
ret_pieces.append(piece)
new_pieces = ret_pieces
return new_pieces
def encode_ids(sp_model, text, sample=False):
pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
ids = [sp_model.PieceToId(piece) for piece in pieces]
return ids
if __name__ == '__main__':
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('sp10m.uncased.v3.model')
print_(u'I was born in 2000, and this is falsé.')
print_(u'ORIGINAL',
sp.EncodeAsPieces(u'I was born in 2000, and this is falsé.'))
print_(u'OURS',
encode_pieces(sp, u'I was born in 2000, and this is falsé.'))
print(encode_ids(sp, u'I was born in 2000, and this is falsé.'))
print_('')
prepro_func = partial(preprocess_text, lower=True)
print_(prepro_func('I was born in 2000, and this is falsé.'))
print_('ORIGINAL',
sp.EncodeAsPieces(
prepro_func('I was born in 2000, and this is falsé.')))
print_('OURS',
encode_pieces(sp,
prepro_func('I was born in 2000, and this is falsé.')))
print(encode_ids(sp, prepro_func('I was born in 2000, and this is falsé.')))
print_('')
print_('I was born in 2000, and this is falsé.')
print_('ORIGINAL',
sp.EncodeAsPieces('I was born in 2000, and this is falsé.'))
print_('OURS', encode_pieces(sp, 'I was born in 2000, and this is falsé.'))
print(encode_ids(sp, 'I was born in 2000, and this is falsé.'))
print_('')
print_('I was born in 92000, and this is falsé.')
print_('ORIGINAL',
sp.EncodeAsPieces('I was born in 92000, and this is falsé.'))
print_('OURS', encode_pieces(sp, 'I was born in 92000, and this is falsé.'))
print(encode_ids(sp, 'I was born in 92000, and this is falsé.'))
"""this file is adapted from https://github.com/zihangdai/xlnet"""
import io
import os
import types
import csv
import numpy as np
import sentencepiece as spm
from classifier_utils import PaddingInputExample
from classifier_utils import convert_single_example
from prepro_utils import preprocess_text, encode_ids
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def __init__(self, args):
self.data_dir = args.data_dir
self.max_seq_length = args.max_seq_length
self.uncased = args.uncased
np.random.seed(args.random_seed)
sp = spm.SentencePieceProcessor()
sp.Load(args.spiece_model_file)
def tokenize_fn(text):
text = preprocess_text(text, lower=self.uncased)
return encode_ids(sp, text)
self.tokenize_fn = tokenize_fn
self.current_train_example = -1
self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
self.current_train_epoch = -1
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
def convert_example(self, index, example, labels, max_seq_length,
tokenize_fn):
"""Converts a single `InputExample` into a single `InputFeatures`."""
feature = convert_single_example(index, example, labels, max_seq_length,
tokenize_fn)
return feature
def generate_instance(self, feature):
"""
generate instance with given feature
Args:
feature: InputFeatures(object). A single set of features of data.
"""
        return [
            feature.input_ids, feature.input_mask, feature.segment_ids,
            feature.label_id, feature.is_real_example
        ]
def prepare_batch_data(self, batch_data, is_regression):
"""Generate numpy tensors"""
input_ids = np.expand_dims(
np.array([inst[0] for inst in batch_data]).astype('int64'), axis=-1)
input_mask = np.array(
[inst[1] for inst in batch_data]).astype('float32')
segment_ids = np.array([inst[2] for inst in batch_data]).astype('int64')
labels = np.expand_dims(
np.array([inst[3] for inst in batch_data]).astype(
'int64' if not is_regression else 'float32'),
axis=-1)
is_real_example = np.array(
[inst[4] for inst in batch_data]).astype('int64')
return [input_ids, input_mask, segment_ids, labels, is_real_example]
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with io.open(input_file, "r", encoding="utf8") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
if len(line) == 0: continue
lines.append(line)
return lines
def get_num_examples(self, phase):
"""Get number of examples for train, dev or test."""
if phase not in ['train', 'dev', 'test']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
return self.num_examples[phase]
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
def data_generator(self,
batch_size,
is_regression,
phase='train',
epoch=1,
dev_count=1,
shuffle=True):
"""
Generate data for train, dev or test.
Args:
batch_size: int. The batch size of generated data.
phase: string. The phase for which to generate data.
            epoch: int. Total epochs to generate data.
shuffle: bool. Whether to shuffle examples.
"""
if phase == 'train':
examples = self.get_train_examples(self.data_dir)
self.num_examples['train'] = len(examples)
elif phase == 'dev':
examples = self.get_dev_examples(self.data_dir)
self.num_examples['dev'] = len(examples)
elif phase == 'test':
examples = self.get_test_examples(self.data_dir)
self.num_examples['test'] = len(examples)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
def instance_reader():
label_list = self.get_labels() if not is_regression else None
for epoch_index in range(epoch):
if shuffle:
np.random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
for (index, example) in enumerate(examples):
if phase == 'train':
self.current_train_example = index + 1
feature = convert_single_example(index, example, label_list,
self.max_seq_length,
self.tokenize_fn)
instance = [
feature.input_ids, feature.input_mask,
feature.segment_ids, feature.label_id,
feature.is_real_example
]
yield instance
def batch_reader(reader, batch_size):
batch = []
for instance in reader():
if len(batch) < batch_size:
batch.append(instance)
else:
yield batch
batch = [instance]
if len(batch) > 0:
yield batch
def wrapper():
all_dev_batches = []
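            # buffer dev_count batches so each device receives one batch per
            # step; a trailing remainder smaller than dev_count is dropped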
for batch_data in batch_reader(instance_reader, batch_size):
batch_data = self.prepare_batch_data(batch_data, is_regression)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
return wrapper
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
            tokens_b.pop()
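# A quick illustration of the heuristic above:
#   a, b = list(range(6)), list(range(3))
#   _truncate_seq_pair(a, b, 7)
#   # -> a == [0, 1, 2, 3], b == [0, 1, 2]  (the longer side is popped first)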
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
class GLUEProcessor(DataProcessor):
def __init__(self, args):
super(GLUEProcessor, self).__init__(args)
self.train_file = "train.tsv"
self.dev_file = "dev.tsv"
self.test_file = "test.tsv"
self.label_column = None
self.text_a_column = None
self.text_b_column = None
self.contains_header = True
self.test_text_a_column = None
self.test_text_b_column = None
self.test_contains_header = True
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.train_file)), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.dev_file)), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
if self.test_text_a_column is None:
self.test_text_a_column = self.text_a_column
if self.test_text_b_column is None:
self.test_text_b_column = self.text_b_column
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.test_file)), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0 and self.contains_header and set_type != "test":
continue
if i == 0 and self.test_contains_header and set_type == "test":
continue
guid = "%s-%s" % (set_type, i)
a_column = (self.text_a_column
if set_type != "test" else self.test_text_a_column)
b_column = (self.text_b_column
if set_type != "test" else self.test_text_b_column)
# there are some incomplete lines in QNLI
if len(line) <= a_column:
                print('Incomplete line, ignored.')
continue
text_a = line[a_column]
if b_column is not None:
if len(line) <= b_column:
                    print('Incomplete line, ignored.')
continue
text_b = line[b_column]
else:
text_b = None
if set_type == "test":
label = self.get_labels()[0]
else:
if len(line) <= self.label_column:
                    print('Incomplete line, ignored.')
continue
label = line[self.label_column]
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class Yelp5Processor(DataProcessor):
def __init__(self, args):
super(Yelp5Processor, self).__init__(args)
def get_train_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "train.csv"))
def get_dev_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "test.csv"))
def get_labels(self):
"""See base class."""
return ["1", "2", "3", "4", "5"]
def _create_examples(self, input_file):
"""Creates examples for the training and dev sets."""
examples = []
        with io.open(input_file, "r", encoding="utf8") as f:
reader = csv.reader(f)
for i, line in enumerate(reader):
label = line[0]
text_a = line[1].replace('""', '"').replace('\\"', '"')
examples.append(
InputExample(
guid=str(i), text_a=text_a, text_b=None, label=label))
return examples
class ImdbProcessor(DataProcessor):
def __init__(self, args):
super(ImdbProcessor, self).__init__(args)
def get_labels(self):
return ["neg", "pos"]
def get_train_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "train"))
def get_dev_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "test"))
def _create_examples(self, data_dir):
examples = []
for label in ["neg", "pos"]:
cur_dir = os.path.join(data_dir, label)
for filename in os.listdir(cur_dir):
if not filename.endswith("txt"): continue
path = os.path.join(cur_dir, filename)
with io.open(path, 'r', encoding='utf8') as f:
text = f.read().strip().replace("<br />", " ")
examples.append(
InputExample(
guid="unused_id", text_a=text, text_b=None,
label=label))
return examples
class MnliMatchedProcessor(GLUEProcessor):
def __init__(self, args):
super(MnliMatchedProcessor, self).__init__(args)
self.dev_file = "dev_matched.tsv"
self.test_file = "test_matched.tsv"
self.label_column = -1
self.text_a_column = 8
self.text_b_column = 9
def get_labels(self):
return ["contradiction", "entailment", "neutral"]
class MnliMismatchedProcessor(MnliMatchedProcessor):
def __init__(self, args):
super(MnliMismatchedProcessor, self).__init__(args)
self.dev_file = "dev_mismatched.tsv"
self.test_file = "test_mismatched.tsv"
class StsbProcessor(GLUEProcessor):
def __init__(self, args):
super(StsbProcessor, self).__init__(args)
self.label_column = 9
self.text_a_column = 7
self.text_b_column = 8
def get_labels(self):
return [0.0]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0 and self.contains_header and set_type != "test":
continue
if i == 0 and self.test_contains_header and set_type == "test":
continue
guid = "%s-%s" % (set_type, i)
a_column = (self.text_a_column
if set_type != "test" else self.test_text_a_column)
b_column = (self.text_b_column
if set_type != "test" else self.test_text_b_column)
# there are some incomplete lines in QNLI
if len(line) <= a_column:
                print('Incomplete line, ignored.')
continue
text_a = line[a_column]
if b_column is not None:
if len(line) <= b_column:
                    print('Incomplete line, ignored.')
continue
text_b = line[b_column]
else:
text_b = None
if set_type == "test":
label = self.get_labels()[0]
else:
if len(line) <= self.label_column:
                    print('Incomplete line, ignored.')
continue
label = float(line[self.label_column])
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
if __name__ == '__main__':
pass
# coding=utf-8
"""This file is adapted from https://github.com/zihangdai/xlnet"""
import io
import six
import sys
import math
import json
import random
import collections
import gc
import numpy as np
sys.path.append('.')
import squad_utils
from data_utils import SEP_ID, CLS_ID, VOCAB_SIZE
import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids, encode_pieces, printable_text
SPIECE_UNDERLINE = u'▁'
SEG_ID_P = 0
SEG_ID_Q = 1
SEG_ID_CLS = 2
SEG_ID_PAD = 3
class SquadExample(object):
"""A single training/test example for simple sequence classification.
    For examples without an answer, the start position is -1.
"""
def __init__(self,
qas_id,
question_text,
paragraph_text,
orig_answer_text=None,
start_position=None,
is_impossible=False):
self.qas_id = qas_id
self.question_text = question_text
self.paragraph_text = paragraph_text
self.orig_answer_text = orig_answer_text
self.start_position = start_position
self.is_impossible = is_impossible
def __str__(self):
return self.__repr__()
def __repr__(self):
s = ""
s += "qas_id: %s" % (printable_text(self.qas_id))
s += ", question_text: %s" % (printable_text(self.question_text))
s += ", paragraph_text: [%s]" % (" ".join(self.paragraph_text))
        if self.start_position:
            s += ", start_position: %d" % (self.start_position)
        s += ", is_impossible: %r" % (self.is_impossible)
return s
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
unique_id,
example_index,
doc_span_index,
tok_start_to_orig_index,
tok_end_to_orig_index,
token_is_max_context,
input_ids,
input_mask,
p_mask,
segment_ids,
paragraph_len,
cls_index,
start_position=None,
end_position=None,
is_impossible=None):
self.unique_id = unique_id
self.example_index = example_index
self.doc_span_index = doc_span_index
self.tok_start_to_orig_index = tok_start_to_orig_index
self.tok_end_to_orig_index = tok_end_to_orig_index
self.token_is_max_context = token_is_max_context
self.input_ids = input_ids
self.input_mask = input_mask
self.p_mask = p_mask
self.segment_ids = segment_ids
self.paragraph_len = paragraph_len
self.cls_index = cls_index
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
def read_squad_examples(input_file, is_training):
"""Read a SQuAD json file into a list of SquadExample."""
with io.open(input_file, "r", encoding="utf8") as reader:
input_data = json.load(reader)["data"]
examples = []
for entry in input_data:
for paragraph in entry["paragraphs"]:
paragraph_text = paragraph["context"]
for qa in paragraph["qas"]:
qas_id = qa["id"]
question_text = qa["question"]
start_position = None
orig_answer_text = None
is_impossible = False
if is_training:
is_impossible = qa["is_impossible"]
if (len(qa["answers"]) != 1) and (not is_impossible):
raise ValueError(
"For training, each question should have exactly 1 answer."
)
if not is_impossible:
answer = qa["answers"][0]
orig_answer_text = answer["text"]
start_position = answer["answer_start"]
else:
start_position = -1
orig_answer_text = ""
example = SquadExample(
qas_id=qas_id,
question_text=question_text,
paragraph_text=paragraph_text,
orig_answer_text=orig_answer_text,
start_position=start_position,
is_impossible=is_impossible)
examples.append(example)
return examples
def _convert_index(index, pos, M=None, is_start=True):
if index[pos] is not None:
return index[pos]
N = len(index)
rear = pos
while rear < N - 1 and index[rear] is None:
rear += 1
front = pos
while front > 0 and index[front] is None:
front -= 1
assert index[front] is not None or index[rear] is not None
if index[front] is None:
if index[rear] >= 1:
if is_start:
return 0
else:
return index[rear] - 1
return index[rear]
if index[rear] is None:
if M is not None and index[front] < M - 1:
if is_start:
return index[front] + 1
else:
return M - 1
return index[front]
if is_start:
if index[rear] > index[front] + 1:
return index[front] + 1
else:
return index[rear]
else:
if index[rear] > index[front] + 1:
return index[rear] - 1
else:
return index[front]
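# A sketch of _convert_index: a position mapped to None falls back to its
# nearest mapped neighbor, biased by is_start. For example:
#   index = [0, None, None, 5]
#   _convert_index(index, 1, is_start=True)   -> 1  (index[front] + 1)
#   _convert_index(index, 1, is_start=False)  -> 4  (index[rear] - 1)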
def convert_examples_to_features(examples, sp_model, max_seq_length, doc_stride,
max_query_length, is_training, uncased):
"""Loads a data file into a list of `InputBatch`s."""
cnt_pos, cnt_neg = 0, 0
unique_id = 1000000000
max_N, max_M = 1024, 1024
f = np.zeros((max_N, max_M), dtype=np.float32)
for (example_index, example) in enumerate(examples):
if example_index % 100 == 0:
print('Converting {}/{} pos {} neg {}'.format(
example_index, len(examples), cnt_pos, cnt_neg))
query_tokens = encode_ids(
sp_model, preprocess_text(
example.question_text, lower=uncased))
if len(query_tokens) > max_query_length:
query_tokens = query_tokens[0:max_query_length]
paragraph_text = example.paragraph_text
para_tokens = encode_pieces(
sp_model, preprocess_text(
example.paragraph_text, lower=uncased))
chartok_to_tok_index = []
tok_start_to_chartok_index = []
tok_end_to_chartok_index = []
char_cnt = 0
for i, token in enumerate(para_tokens):
chartok_to_tok_index.extend([i] * len(token))
tok_start_to_chartok_index.append(char_cnt)
char_cnt += len(token)
tok_end_to_chartok_index.append(char_cnt - 1)
tok_cat_text = ''.join(para_tokens).replace(SPIECE_UNDERLINE, ' ')
N, M = len(paragraph_text), len(tok_cat_text)
if N > max_N or M > max_M:
max_N = max(N, max_N)
max_M = max(M, max_M)
f = np.zeros((max_N, max_M), dtype=np.float32)
gc.collect()
g = {}
def _lcs_match(max_dist):
f.fill(0)
g.clear()
            ### longest common subsequence
# f[i, j] = max(f[i - 1, j], f[i, j - 1], f[i - 1, j - 1] + match(i, j))
for i in range(N):
# note(zhiliny):
# unlike standard LCS, this is specifically optimized for the setting
# because the mismatch between sentence pieces and original text will
# be small
for j in range(i - max_dist, i + max_dist):
if j >= M or j < 0: continue
if i > 0:
g[(i, j)] = 0
f[i, j] = f[i - 1, j]
if j > 0 and f[i, j - 1] > f[i, j]:
g[(i, j)] = 1
f[i, j] = f[i, j - 1]
f_prev = f[i - 1, j - 1] if i > 0 and j > 0 else 0
if (preprocess_text(
paragraph_text[i], lower=uncased,
remove_space=False) == tok_cat_text[j] and
f_prev + 1 > f[i, j]):
g[(i, j)] = 2
f[i, j] = f_prev + 1
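        # run the banded LCS once; if fewer than 80% of the original characters
        # were matched, double the band width and retry (range(2) below allows
        # at most one retry)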
max_dist = abs(N - M) + 5
for _ in range(2):
_lcs_match(max_dist)
if f[N - 1, M - 1] > 0.8 * N: break
max_dist *= 2
orig_to_chartok_index = [None] * N
chartok_to_orig_index = [None] * M
i, j = N - 1, M - 1
while i >= 0 and j >= 0:
if (i, j) not in g: break
if g[(i, j)] == 2:
orig_to_chartok_index[i] = j
chartok_to_orig_index[j] = i
i, j = i - 1, j - 1
elif g[(i, j)] == 1:
j = j - 1
else:
i = i - 1
if all(v is None
for v in orig_to_chartok_index) or f[N - 1, M - 1] < 0.8 * N:
print('MISMATCH DETECTED!')
continue
tok_start_to_orig_index = []
tok_end_to_orig_index = []
for i in range(len(para_tokens)):
start_chartok_pos = tok_start_to_chartok_index[i]
end_chartok_pos = tok_end_to_chartok_index[i]
start_orig_pos = _convert_index(
chartok_to_orig_index, start_chartok_pos, N, is_start=True)
end_orig_pos = _convert_index(
chartok_to_orig_index, end_chartok_pos, N, is_start=False)
tok_start_to_orig_index.append(start_orig_pos)
tok_end_to_orig_index.append(end_orig_pos)
if not is_training:
tok_start_position = tok_end_position = None
if is_training and example.is_impossible:
tok_start_position = -1
tok_end_position = -1
if is_training and not example.is_impossible:
start_position = example.start_position
end_position = start_position + len(example.orig_answer_text) - 1
start_chartok_pos = _convert_index(
orig_to_chartok_index, start_position, is_start=True)
tok_start_position = chartok_to_tok_index[start_chartok_pos]
end_chartok_pos = _convert_index(
orig_to_chartok_index, end_position, is_start=False)
tok_end_position = chartok_to_tok_index[end_chartok_pos]
assert tok_start_position <= tok_end_position
def _piece_to_id(x):
if six.PY2 and isinstance(x, unicode):
x = x.encode('utf-8')
return sp_model.PieceToId(x)
all_doc_tokens = list(map(_piece_to_id, para_tokens))
# The -3 accounts for [CLS], [SEP] and [SEP]
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
# We can have documents that are longer than the maximum sequence length.
# To deal with this we do a sliding window approach, where we take chunks
        # of up to our max length with a stride of `doc_stride`.
_DocSpan = collections.namedtuple( # pylint: disable=invalid-name
"DocSpan", ["start", "length"])
doc_spans = []
start_offset = 0
while start_offset < len(all_doc_tokens):
length = len(all_doc_tokens) - start_offset
if length > max_tokens_for_doc:
length = max_tokens_for_doc
doc_spans.append(_DocSpan(start=start_offset, length=length))
if start_offset + length == len(all_doc_tokens):
break
start_offset += min(length, doc_stride)
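        # e.g. with 10 doc tokens, max_tokens_for_doc=5 and doc_stride=3 the
        # loop above yields spans (0,5), (3,5), (6,4)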
for (doc_span_index, doc_span) in enumerate(doc_spans):
tokens = []
token_is_max_context = {}
segment_ids = []
p_mask = []
cur_tok_start_to_orig_index = []
cur_tok_end_to_orig_index = []
for i in range(doc_span.length):
split_token_index = doc_span.start + i
cur_tok_start_to_orig_index.append(tok_start_to_orig_index[
split_token_index])
cur_tok_end_to_orig_index.append(tok_end_to_orig_index[
split_token_index])
is_max_context = _check_is_max_context(
doc_spans, doc_span_index, split_token_index)
token_is_max_context[len(tokens)] = is_max_context
tokens.append(all_doc_tokens[split_token_index])
segment_ids.append(SEG_ID_P)
p_mask.append(0)
paragraph_len = len(tokens)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_P)
p_mask.append(1)
# note(zhiliny): we put P before Q
# because during pretraining, B is always shorter than A
for token in query_tokens:
tokens.append(token)
segment_ids.append(SEG_ID_Q)
p_mask.append(1)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_Q)
p_mask.append(1)
cls_index = len(segment_ids)
tokens.append(CLS_ID)
segment_ids.append(SEG_ID_CLS)
p_mask.append(0)
input_ids = tokens
# The mask has 0 for real tokens and 1 for padding tokens. Only real
# tokens are attended to.
input_mask = [0] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(1)
segment_ids.append(SEG_ID_PAD)
p_mask.append(1)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(p_mask) == max_seq_length
span_is_impossible = example.is_impossible
start_position = None
end_position = None
if is_training and not span_is_impossible:
# For training, if our document chunk does not contain an annotation
# we throw it out, since there is nothing to predict.
doc_start = doc_span.start
doc_end = doc_span.start + doc_span.length - 1
out_of_span = False
if not (tok_start_position >= doc_start and
tok_end_position <= doc_end):
out_of_span = True
if out_of_span:
# continue
start_position = 0
end_position = 0
span_is_impossible = True
else:
# note(zhiliny): we put P before Q, so doc_offset should be zero.
# doc_offset = len(query_tokens) + 2
doc_offset = 0
start_position = tok_start_position - doc_start + doc_offset
end_position = tok_end_position - doc_start + doc_offset
if is_training and span_is_impossible:
start_position = cls_index
end_position = cls_index
if example_index < 0:
print("*** Example ***")
print("unique_id: %s" % (unique_id))
print("example_index: %s" % (example_index))
print("doc_span_index: %s" % (doc_span_index))
print("tok_start_to_orig_index: %s" %
" ".join([str(x) for x in cur_tok_start_to_orig_index]))
print("tok_end_to_orig_index: %s" %
" ".join([str(x) for x in cur_tok_end_to_orig_index]))
print("token_is_max_context: %s" % " ".join([
"%d:%s" % (x, y)
for (x, y) in six.iteritems(token_is_max_context)
]))
print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
print("segment_ids: %s" %
" ".join([str(x) for x in segment_ids]))
if is_training and span_is_impossible:
print("impossible example span")
if is_training and not span_is_impossible:
pieces = [
sp_model.IdToPiece(token)
for token in tokens[start_position:(end_position + 1)]
]
answer_text = sp_model.DecodePieces(pieces)
print("start_position: %d" % (start_position))
print("end_position: %d" % (end_position))
print("answer: %s" % (printable_text(answer_text)))
# note(zhiliny): With multi processing,
# the example_index is actually the index within the current process
            # therefore we set example_index=None to avoid it being used downstream.
# The current code does not use example_index of training data.
if is_training:
feat_example_index = None
else:
feat_example_index = example_index
feature = InputFeatures(
unique_id=unique_id,
example_index=feat_example_index,
doc_span_index=doc_span_index,
tok_start_to_orig_index=cur_tok_start_to_orig_index,
tok_end_to_orig_index=cur_tok_end_to_orig_index,
token_is_max_context=token_is_max_context,
input_ids=input_ids,
input_mask=input_mask,
p_mask=p_mask,
segment_ids=segment_ids,
paragraph_len=paragraph_len,
cls_index=cls_index,
start_position=start_position,
end_position=end_position,
is_impossible=span_is_impossible)
unique_id += 1
if span_is_impossible:
cnt_neg += 1
else:
cnt_pos += 1
yield feature
print("Total number of instances: {} = pos {} neg {}".format(
cnt_pos + cnt_neg, cnt_pos, cnt_neg))
def _check_is_max_context(doc_spans, cur_span_index, position):
"""Check if this is the 'max context' doc span for the token."""
# Because of the sliding window approach taken to scoring documents, a single
# token can appear in multiple documents. E.g.
# Doc: the man went to the store and bought a gallon of milk
# Span A: the man went to the
# Span B: to the store and bought
# Span C: and bought a gallon of
# ...
#
# Now the word 'bought' will have two scores from spans B and C. We only
# want to consider the score with "maximum context", which we define as
# the *minimum* of its left and right context (the *sum* of left and
# right context will always be the same, of course).
#
# In the example the maximum context for 'bought' would be span C since
# it has 1 left context and 3 right context, while span B has 4 left context
# and 0 right context.
best_score = None
best_span_index = None
for (span_index, doc_span) in enumerate(doc_spans):
end = doc_span.start + doc_span.length - 1
if position < doc_span.start:
continue
if position > end:
continue
num_left_context = position - doc_span.start
num_right_context = end - position
score = min(num_left_context,
num_right_context) + 0.01 * doc_span.length
if best_score is None or score > best_score:
best_score = score
best_span_index = span_index
return cur_span_index == best_span_index
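# A runnable check of the max-context rule above, on hypothetical toy spans:
def _demo_check_is_max_context():
    Span = collections.namedtuple("DocSpan", ["start", "length"])
    toy_spans = [Span(0, 5), Span(2, 5), Span(4, 5)]
    # token at absolute position 6 appears in toy_spans[1] (0 tokens of right
    # context) and toy_spans[2] (2 tokens on each side), so toy_spans[2] wins
    assert not _check_is_max_context(toy_spans, 1, 6)
    assert _check_is_max_context(toy_spans, 2, 6)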
class DataProcessor(object):
def __init__(self, spiece_model_file, uncased, max_seq_length, doc_stride,
max_query_length):
self._sp_model = spm.SentencePieceProcessor()
self._sp_model.Load(spiece_model_file)
self._uncased = uncased
self._max_seq_length = max_seq_length
self._doc_stride = doc_stride
self._max_query_length = max_query_length
self.current_train_example = -1
self.num_train_examples = -1
self.current_train_epoch = -1
self.train_examples = None
self.predict_examples = None
self.num_examples = {'train': -1, 'predict': -1}
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
def get_examples(self, data_path, is_training):
examples = read_squad_examples(
input_file=data_path, is_training=is_training)
return examples
def get_num_examples(self, phase):
if phase not in ['train', 'predict']:
raise ValueError(
"Unknown phase, which should be in ['train', 'predict'].")
return self.num_examples[phase]
def get_features(self, examples, is_training):
features = convert_examples_to_features(
examples=examples,
sp_model=self._sp_model,
max_seq_length=self._max_seq_length,
doc_stride=self._doc_stride,
max_query_length=self._max_query_length,
is_training=is_training,
uncased=self._uncased)
return features
def data_generator(self,
data_path,
batch_size,
phase='train',
shuffle=False,
dev_count=1,
epoch=1):
if phase == 'train':
self.train_examples = self.get_examples(data_path, is_training=True)
examples = self.train_examples
self.num_examples['train'] = len(self.train_examples)
elif phase == 'predict':
self.predict_examples = self.get_examples(
data_path, is_training=False)
examples = self.predict_examples
self.num_examples['predict'] = len(self.predict_examples)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'predict'].")
def batch_reader(features, batch_size):
batch = []
for (index, feature) in enumerate(features):
if phase == 'train':
self.current_train_example = index + 1
labels = [feature.unique_id
] if feature.start_position is None else [
feature.start_position, feature.end_position,
feature.is_impossible
]
example = [
feature.input_ids, feature.segment_ids, feature.input_mask,
feature.cls_index, feature.p_mask
] + labels
to_append = len(batch) < batch_size
if to_append:
batch.append(example)
else:
yield batch
batch = [example]
if len(batch) > 0:
yield batch
def prepare_batch_data(insts):
"""Generate numpy tensors"""
input_ids = np.expand_dims(
np.array([inst[0] for inst in insts]).astype('int64'), axis=-1)
segment_ids = np.array([inst[1] for inst in insts]).astype('int64')
input_mask = np.array([inst[2] for inst in insts]).astype('float32')
cls_index = np.expand_dims(
np.array([inst[3] for inst in insts]).astype('int64'), axis=-1)
p_mask = np.array([inst[4] for inst in insts]).astype('float32')
ret_list = [input_ids, segment_ids, input_mask, cls_index, p_mask]
if phase == 'train':
start_positions = np.expand_dims(
np.array([inst[5] for inst in insts]).astype('int64'),
axis=-1)
end_positions = np.expand_dims(
np.array([inst[6] for inst in insts]).astype('int64'),
axis=-1)
is_impossible = np.expand_dims(
np.array([inst[7] for inst in insts]).astype('float32'),
axis=-1)
ret_list += [start_positions, end_positions, is_impossible]
else:
unique_ids = np.expand_dims(
np.array([inst[5] for inst in insts]).astype('int64'),
axis=-1)
ret_list += [unique_ids]
return ret_list
def wrapper():
for epoch_index in range(epoch):
if shuffle:
random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
features = self.get_features(examples, is_training=True)
else:
features = self.get_features(examples, is_training=False)
all_dev_batches = []
for batch_insts in batch_reader(features, batch_size):
batch_data = prepare_batch_data(batch_insts)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
return wrapper
_PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name
"PrelimPrediction", [
"feature_index", "start_index", "end_index", "start_log_prob",
"end_log_prob"
])
_NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name
"NbestPrediction", ["text", "start_log_prob", "end_log_prob"])
def write_predictions(all_examples, all_features, all_results, n_best_size,
max_answer_length, output_prediction_file,
output_nbest_file, output_null_log_odds_file, orig_data,
args):
"""Write final predictions to the json file and log-odds of null if needed."""
print("Writing predictions to: %s" % (output_prediction_file))
# tf.logging.info("Writing nbest to: %s" % (output_nbest_file))
example_index_to_features = collections.defaultdict(list)
for feature in all_features:
example_index_to_features[feature.example_index].append(feature)
unique_id_to_result = {}
for result in all_results:
unique_id_to_result[result.unique_id] = result
all_predictions = collections.OrderedDict()
all_nbest_json = collections.OrderedDict()
scores_diff_json = collections.OrderedDict()
for (example_index, example) in enumerate(all_examples):
features = example_index_to_features[example_index]
prelim_predictions = []
# keep track of the minimum score of null start+end of position 0
score_null = 1000000 # large and positive
for (feature_index, feature) in enumerate(features):
result = unique_id_to_result[feature.unique_id]
cur_null_score = result.cls_logits
# if we could have irrelevant answers, get the min score of irrelevant
score_null = min(score_null, cur_null_score)
for i in range(args.start_n_top):
for j in range(args.end_n_top):
start_log_prob = result.start_top_log_probs[i]
start_index = result.start_top_index[i]
j_index = i * args.end_n_top + j
end_log_prob = result.end_top_log_probs[j_index]
end_index = result.end_top_index[j_index]
# We could hypothetically create invalid predictions, e.g., predict
# that the start of the span is in the question. We throw out all
# invalid predictions.
if start_index >= feature.paragraph_len - 1:
continue
if end_index >= feature.paragraph_len - 1:
continue
if not feature.token_is_max_context.get(start_index, False):
continue
if end_index < start_index:
continue
length = end_index - start_index + 1
if length > max_answer_length:
continue
prelim_predictions.append(
_PrelimPrediction(
feature_index=feature_index,
start_index=start_index,
end_index=end_index,
start_log_prob=start_log_prob,
end_log_prob=end_log_prob))
prelim_predictions = sorted(
prelim_predictions,
key=lambda x: (x.start_log_prob + x.end_log_prob),
reverse=True)
seen_predictions = {}
nbest = []
for pred in prelim_predictions:
if len(nbest) >= n_best_size:
break
feature = features[pred.feature_index]
tok_start_to_orig_index = feature.tok_start_to_orig_index
tok_end_to_orig_index = feature.tok_end_to_orig_index
start_orig_pos = tok_start_to_orig_index[pred.start_index]
end_orig_pos = tok_end_to_orig_index[pred.end_index]
paragraph_text = example.paragraph_text
final_text = paragraph_text[start_orig_pos:end_orig_pos + 1].strip()
if final_text in seen_predictions:
continue
seen_predictions[final_text] = True
nbest.append(
_NbestPrediction(
text=final_text,
start_log_prob=pred.start_log_prob,
end_log_prob=pred.end_log_prob))
# In very rare edge cases we could have no valid predictions. So we
# just create a nonce prediction in this case to avoid failure.
if not nbest:
nbest.append(
_NbestPrediction(
text="", start_log_prob=-1e6, end_log_prob=-1e6))
total_scores = []
best_non_null_entry = None
for entry in nbest:
total_scores.append(entry.start_log_prob + entry.end_log_prob)
if not best_non_null_entry:
best_non_null_entry = entry
probs = _compute_softmax(total_scores)
nbest_json = []
for (i, entry) in enumerate(nbest):
output = collections.OrderedDict()
output["text"] = entry.text
output["probability"] = probs[i]
output["start_log_prob"] = entry.start_log_prob
output["end_log_prob"] = entry.end_log_prob
nbest_json.append(output)
assert len(nbest_json) >= 1
assert best_non_null_entry is not None
score_diff = score_null
scores_diff_json[example.qas_id] = score_diff
# note(zhiliny): always predict best_non_null_entry
# and the evaluation script will search for the best threshold
all_predictions[example.qas_id] = best_non_null_entry.text
all_nbest_json[example.qas_id] = nbest_json
with io.open(output_prediction_file, "w", encoding="utf8") as writer:
writer.write(json.dumps(all_predictions, indent=4) + u"\n")
with io.open(output_nbest_file, "w", encoding="utf8") as writer:
writer.write(json.dumps(all_nbest_json, indent=4) + u"\n")
with io.open(output_null_log_odds_file, "w", encoding="utf8") as writer:
writer.write(json.dumps(scores_diff_json, indent=4) + u"\n")
qid_to_has_ans = squad_utils.make_qid_to_has_ans(orig_data)
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
exact_raw, f1_raw = squad_utils.get_raw_scores(orig_data, all_predictions)
out_eval = {}
squad_utils.find_all_best_thresh_v2(out_eval, all_predictions, exact_raw,
f1_raw, scores_diff_json,
qid_to_has_ans)
return out_eval
def _get_best_indexes(logits, n_best_size):
"""Get the n-best logits from a list."""
index_and_score = sorted(
enumerate(logits), key=lambda x: x[1], reverse=True)
best_indexes = []
for i in range(len(index_and_score)):
if i >= n_best_size:
break
best_indexes.append(index_and_score[i][0])
return best_indexes
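# e.g. _get_best_indexes([0.1, 0.5, 0.3], n_best_size=2) -> [1, 2]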
def _compute_softmax(scores):
"""Compute softmax probability over raw logits."""
if not scores:
return []
max_score = None
for score in scores:
if max_score is None or score > max_score:
max_score = score
exp_scores = []
total_sum = 0.0
for score in scores:
x = math.exp(score - max_score)
exp_scores.append(x)
total_sum += x
probs = []
for score in exp_scores:
probs.append(score / total_sum)
return probs
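# e.g. _compute_softmax([1.0, 2.0, 3.0]) -> roughly [0.0900, 0.2447, 0.6652];
# subtracting max_score first keeps math.exp stable for large logits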
if __name__ == '__main__':
processor = DataProcessor(
spiece_model_file="xlnet_cased_L-24_H-1024_A-16/spiece.model",
uncased=False,
max_seq_length=512,
doc_stride=128,
max_query_length=64)
train_data_generator = processor.data_generator(
data_path="squad_v2.0/dev-v2.0.json",
batch_size=32,
phase='predict',
shuffle=True,
dev_count=1,
epoch=1)
for (index, sample) in enumerate(train_data_generator()):
if index < 10:
print("index:", index)
for tensor in sample:
print(tensor.shape)
else:
break
#for (index, example) in enumerate(train_examples):
# if index < 5:
# print(example)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fine-tuning on regression/classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import sys
if six.PY2:
reload(sys)
sys.setdefaultencoding('utf8')
import os
import time
import json
import argparse
import numpy as np
import subprocess
import multiprocessing
from scipy.stats import pearsonr
import paddle
import paddle.fluid as fluid
import reader.cls as reader
from model.xlnet import XLNetConfig
from model.classifier import create_model
from optimization import optimization
from utils.args import ArgumentGroup, print_arguments, check_cuda
from utils.init import init_pretraining_params, init_checkpoint
from utils import dist_utils  # assumed module path; provides prepare_for_multi_process used below
num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("model_config_path", str, None, "Path to the json file for bert model config.")
model_g.add_arg("dropout", float, 0.1, "Dropout rate.")
model_g.add_arg("dropatt", float, 0.1, "Attention dropout rate.")
model_g.add_arg("clamp_len", int, -1, "Clamp length.")
model_g.add_arg("summary_type", str, "last",
"Method used to summarize a sequence into a vector.", choices=['last'])
model_g.add_arg("use_summ_proj", bool, True,
"Whether to use projection for summarizing sequences.")
model_g.add_arg("spiece_model_file", str, None, "Sentence Piece model path.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
init_g = ArgumentGroup(parser, "init", "parameter initialization options.")
init_g.add_arg("init", str, "normal", "Initialization method.", choices=["normal", "uniform"])
init_g.add_arg("init_std", str, 0.02, "Initialization std when init is normal.")
init_g.add_arg("init_range", str, 0.1, "Initialization std when init is uniform.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 1000, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("lr_layer_decay_rate", float, 1.0, "Top layer: lr[L] = args.learning_rate. "
"Lower layers: lr[l-1] = lr[l] * lr_layer_decay_rate.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("train_batch_size", int, 8, "Total examples' number in batch for training.")
train_g.add_arg("eval_batch_size", int, 128, "Total examples' number in batch for development.")
train_g.add_arg("predict_batch_size", int, 128, "Total examples' number in batch for prediction.")
train_g.add_arg("train_steps", int, 1000, "The total steps for training.")
train_g.add_arg("warmup_steps", int, 1000, "The steps for warmup.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("data_dir", str, None, "Path to training data.")
data_g.add_arg("predict_dir", str, None, "Path to write predict results.")
data_g.add_arg("predict_threshold", float, 0.0, "Threshold for binary prediction.")
data_g.add_arg("max_seq_length", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("uncased", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("shuffle", bool, True, "")
run_type_g.add_arg("task_name", str, None,
"The name of task to perform fine-tuning, should be in {'xnli', 'mnli', 'cola', 'mrpc'}.")
run_type_g.add_arg("is_regression", str, None, "Whether it's a regression task.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_eval", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_predict", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("eval_split", str, "dev", "Could be dev or test")
parser.add_argument("--enable_ce", action='store_true', help="The flag indicating whether to run the task for continuous evaluation.")
args = parser.parse_args()
# yapf: enable.
def evaluate(exe, predict_program, test_data_loader, fetch_list, eval_phase, num_examples):
test_data_loader.start()
total_cost, total_num_seqs = [], []
all_logits, all_labels = [], []
time_begin = time.time()
total_steps = int(num_examples / args.eval_batch_size)
steps = 0
while True:
try:
np_loss, np_num_seqs, np_logits, np_labels = exe.run(program=predict_program,
fetch_list=fetch_list)
total_cost.extend(np_loss * np_num_seqs)
total_num_seqs.extend(np_num_seqs)
all_logits.extend(np_logits)
all_labels.extend(np_labels)
            if steps % max(1, total_steps // 10) == 0:
print("Evaluation [{}/{}]".format(steps, total_steps))
steps += 1
except fluid.core.EOFException:
test_data_loader.reset()
break
all_logits = np.array(all_logits)
all_labels = np.array(all_labels)
if args.is_regression:
key = "eval_pearsonr"
eval_result, _ = pearsonr(all_logits, all_labels)
else:
key = "eval_accuracy"
pred = np.argmax(all_logits, axis=1).reshape(all_labels.shape)
eval_result = np.sum(pred == all_labels) / float(all_labels.size)
time_end = time.time()
print("[%s evaluation] ave loss: %f, %s: %f, elapsed time: %f s" %
(eval_phase, np.sum(total_cost) / np.sum(total_num_seqs), key, eval_result,
time_end - time_begin))
def predict(exe, predict_program, test_data_loader, task_name, label_list, fetch_list):
test_data_loader.start()
pred_cnt = 0
predict_results = []
with open(os.path.join(args.predict_dir, "{}.tsv".format(
task_name)), "w") as fout:
fout.write("index\tprediction\n")
while True:
try:
np_logits = exe.run(program=predict_program,
fetch_list=fetch_list)
for result in np_logits[0]:
if pred_cnt % 1000 == 0:
print("Predicting submission for example: {}".format(
pred_cnt))
logits = [float(x) for x in result.flat]
predict_results.append(logits)
if len(logits) == 1:
label_out = logits[0]
elif len(logits) == 2:
if logits[1] - logits[0] > args.predict_threshold:
label_out = label_list[1]
else:
label_out = label_list[0]
elif len(logits) > 2:
max_index = np.argmax(np.array(logits, dtype=np.float32))
label_out = label_list[max_index]
else:
raise NotImplementedError
fout.write("{}\t{}\n".format(pred_cnt, label_out))
pred_cnt += 1
except fluid.core.EOFException:
test_data_loader.reset()
break
predict_json_path = os.path.join(args.predict_dir, "{}.logits.json".format(
task_name))
with open(predict_json_path, "w") as fp:
json.dump(predict_results, fp, indent=4)
def get_device_num():
    # NOTE(zcd): for multi-process training, each process uses one GPU card.
    if num_trainers > 1: return 1
visible_device = os.environ.get('CUDA_VISIBLE_DEVICES', None)
if visible_device:
device_num = len(visible_device.split(','))
else:
device_num = subprocess.check_output(['nvidia-smi','-L']).decode().count('\n')
return device_num
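# Example (assumption): with CUDA_VISIBLE_DEVICES="0,1,2,3" and a single
# trainer process, get_device_num() returns 4; with num_trainers > 1, each
# process drives exactly one card, so it returns 1.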
def main(args):
if not (args.do_train or args.do_eval or args.do_predict):
raise ValueError("For args `do_train`, `do_eval` and `do_predict`, at "
"least one of them must be True.")
    if args.do_predict:
        if not args.predict_dir:
            raise ValueError("args 'predict_dir' should be given when doing predict")
        if not os.path.exists(args.predict_dir):
            os.makedirs(args.predict_dir)
xlnet_config = XLNetConfig(args.model_config_path)
xlnet_config.print_config()
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
dev_count = get_device_num()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
task_name = args.task_name.lower()
processors = {
"mnli_matched": reader.MnliMatchedProcessor,
"mnli_mismatched": reader.MnliMismatchedProcessor,
'sts-b': reader.StsbProcessor,
'imdb': reader.ImdbProcessor,
"yelp5": reader.Yelp5Processor
}
processor = processors[task_name](args)
label_list = processor.get_labels() if not args.is_regression else None
num_labels = len(label_list) if label_list is not None else None
train_program = fluid.Program()
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
train_program.random_seed = args.random_seed
if args.do_train:
# NOTE: If num_trainers > 1, the shuffle_seed must be set, because
# the order of batch data generated by reader
# must be the same in the respective processes.
shuffle_seed = 1 if num_trainers > 1 else None
train_data_generator = processor.data_generator(
batch_size=args.train_batch_size,
is_regression=args.is_regression,
phase='train',
epoch=args.epoch,
dev_count=dev_count,
shuffle=args.shuffle)
num_train_examples = processor.get_num_examples(phase='train')
print("Device count: %d" % dev_count)
print("Max num of epoches: %d" % args.epoch)
print("Num of train examples: %d" % num_train_examples)
print("Num of train steps: %d" % args.train_steps)
print("Num of warmup steps: %d" % args.warmup_steps)
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_data_loader, loss, logits, num_seqs, label_ids = create_model(
args,
xlnet_config=xlnet_config,
n_class=num_labels)
scheduled_lr = optimization(
loss=loss,
warmup_steps=args.warmup_steps,
num_train_steps=args.train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
lr_layer_decay_rate=args.lr_layer_decay_rate,
scheduler=args.lr_scheduler)
if args.do_eval:
dev_prog = fluid.Program()
with fluid.program_guard(dev_prog, startup_prog):
with fluid.unique_name.guard():
dev_data_loader, loss, logits, num_seqs, label_ids = create_model(
args,
xlnet_config=xlnet_config,
n_class=num_labels)
dev_prog = dev_prog.clone(for_test=True)
dev_data_loader.set_batch_generator(
processor.data_generator(
batch_size=args.eval_batch_size,
is_regression=args.is_regression,
phase=args.eval_split,
epoch=1,
dev_count=1,
shuffle=False), place)
if args.do_predict:
predict_prog = fluid.Program()
with fluid.program_guard(predict_prog, startup_prog):
with fluid.unique_name.guard():
predict_data_loader, loss, logits, num_seqs, label_ids = create_model(
args,
xlnet_config=xlnet_config,
n_class=num_labels)
predict_prog = predict_prog.clone(for_test=True)
predict_data_loader.set_batch_generator(
processor.data_generator(
batch_size=args.predict_batch_size,
is_regression=args.is_regression,
phase=args.eval_split,
epoch=1,
dev_count=1,
shuffle=False), place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog)
elif args.do_eval or args.do_predict:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.use_experimental_executor = args.use_fast_executor
exec_strategy.num_threads = dev_count
build_strategy = fluid.BuildStrategy()
if args.use_cuda and num_trainers > 1:
assert shuffle_seed is not None
dist_utils.prepare_for_multi_process(exe, build_strategy, train_program)
train_data_generator = fluid.contrib.reader.distributed_batch_reader(
train_data_generator)
train_compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
train_data_loader.set_batch_generator(train_data_generator, place)
if args.do_train:
train_data_loader.start()
steps = 0
total_cost, total_num_seqs, total_time = [], [], 0.0
throughput = []
ce_info = []
while steps < args.train_steps:
try:
time_begin = time.time()
steps += 1
if steps % args.skip_steps == 0:
fetch_list = [loss.name, scheduled_lr.name, num_seqs.name]
else:
fetch_list = []
outputs = exe.run(train_compiled_program, fetch_list=fetch_list)
time_end = time.time()
used_time = time_end - time_begin
total_time += used_time
if steps % args.skip_steps == 0:
np_loss, np_lr, np_num_seqs = outputs
total_cost.extend(np_loss * np_num_seqs)
total_num_seqs.extend(np_num_seqs)
if args.verbose:
verbose = "train data_loader queue size: %d, " % train_data_loader.queue.size(
)
verbose += "learning rate: %f" % np_lr[0]
print(verbose)
current_example, current_epoch = processor.get_train_progress(
)
log_record = "epoch: {}, progress: {}/{}, step: {}, ave loss: {}".format(
current_epoch, current_example, num_train_examples,
steps, np.sum(total_cost) / np.sum(total_num_seqs))
ce_info.append([np.sum(total_cost) / np.sum(total_num_seqs), used_time])
if steps > 0 :
throughput.append( args.skip_steps / total_time)
log_record = log_record + ", speed: %f steps/s" % (args.skip_steps / total_time)
print(log_record)
else:
print(log_record)
total_cost, total_num_seqs, total_time = [], [], 0.0
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
print("Average throughtput: %s" % (np.average(throughput)))
throughput = []
# evaluate dev set
if args.do_eval:
evaluate(exe, dev_prog, dev_data_loader,
[loss.name, num_seqs.name, logits.name, label_ids.name],
args.eval_split, processor.get_num_examples(phase=args.eval_split))
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_data_loader.reset()
break
if args.enable_ce:
        card_num = get_device_num()
ce_cost = 0
ce_acc = 0
ce_time = 0
        try:
            # each ce_info entry is [ave_loss, used_time]; accuracy is not
            # tracked during training, so ce_acc keeps its default value
            ce_cost = ce_info[-2][0]
            ce_time = ce_info[-2][1]
        except (IndexError, TypeError):
            print("ce info error")
print("kpis\ttrain_duration_%s_card%s\t%s" %
(args.task_name, card_num, ce_time))
print("kpis\ttrain_cost_%s_card%s\t%f" %
(args.task_name, card_num, ce_cost))
print("kpis\ttrain_acc_%s_card%s\t%f" %
(args.task_name, card_num, ce_acc))
# final eval on dev set
if args.do_eval:
evaluate(exe, dev_prog, dev_data_loader,
                 [loss.name, num_seqs.name, logits.name, label_ids.name], args.eval_split,
processor.get_num_examples(phase=args.eval_split))
# final eval on test set
if args.do_predict:
predict(exe, predict_prog, predict_data_loader, task_name, label_list, [logits.name])
if __name__ == '__main__':
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fine-tuning on SQuAD."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import sys
if six.PY2:
reload(sys)
sys.setdefaultencoding('utf8')
import io
import argparse
import collections
import multiprocessing
import os
import time
import numpy as np
import json
import paddle
import paddle.fluid as fluid
from reader.squad import DataProcessor, write_predictions
from model.xlnet import XLNetConfig, XLNetModel
from utils.args import ArgumentGroup, print_arguments
from optimization import optimization
from utils.init import init_pretraining_params, init_checkpoint
from modeling import log_softmax
if six.PY2:
import cPickle as pickle
else:
import pickle
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("model_config_path", str, None, "Path to the json file for xlnet model config.")
model_g.add_arg("dropout", float, 0.1, "Dropout rate.")
model_g.add_arg("dropatt", float, 0.1, "Attention dropout rate.")
model_g.add_arg("clamp_len", int, -1, "Clamp length.")
model_g.add_arg("summary_type", str, "last", "Method used to summarize a sequence into a vector.",
choices=['last'])
model_g.add_arg("spiece_model_file", str, None, "Sentence Piece model path.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
# Parameter initialization
init_g = ArgumentGroup(parser, "init", "parameter initialization options.")
init_g.add_arg("init", str, "normal", "Initialization method.", choices=["normal", "uniform"])
init_g.add_arg("init_std", str, 0.02, "Initialization std when init is normal.")
init_g.add_arg("init_range", str, 0.1, "Initialization std when init is uniform.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("adam_epsilon", float, 1e-6, "Adam epsilon.")
train_g.add_arg("lr_layer_decay_rate", float, 0.75, "Top layer: lr[L] = args.learning_rate. "
"Lower layers: lr[l-1] = lr[l] * lr_layer_decay_rate.")
train_g.add_arg("train_batch_size", int, 12, "Total examples' number in batch for training.")
train_g.add_arg("train_steps", int, 1000, "The total steps for training.")
train_g.add_arg("warmup_steps", int, 1000, "The steps for warmup.")
train_g.add_arg("save_steps", int, 1000, "The steps interval to save checkpoints.")
predict_g = ArgumentGroup(parser, "prediction", "prediction options.")
predict_g.add_arg("predict_batch_size", int, 12, "Total examples' number in batch for training.")
predict_g.add_arg("start_n_top", int, 5, "Beam size for span start.")
predict_g.add_arg("end_n_top", int, 5, "Beam size for span end.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("train_file", str, None, "SQuAD json for training. E.g., train-v1.1.json.")
data_g.add_arg("predict_file", str, None, "SQuAD json for predictions. E.g. dev-v1.1.json or test-v1.1.json.")
data_g.add_arg("max_seq_length", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("max_query_length", int, 64, "Max query length.")
data_g.add_arg("max_answer_length", int, 64, "Max answer length.")
data_g.add_arg("uncased", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("doc_stride", int, 128,
"When splitting up a long document into chunks, how much stride to take between chunks.")
data_g.add_arg("n_best_size", int, 5,
"The total number of n-best predictions to generate in the nbest_predictions.json output file.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_predict", bool, True, "Whether to perform prediction.")
args = parser.parse_args()
# yapf: enable
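# Example invocation (a sketch only -- ${SQUAD_DIR} and ${LARGE_DIR} are
# placeholders for your SQuAD data directory and the extracted pre-trained
# model directory):
#
#   CUDA_VISIBLE_DEVICES=0,1,2,3 python run_squad.py \
#       --train_file=${SQUAD_DIR}/train-v2.0.json \
#       --predict_file=${SQUAD_DIR}/dev-v2.0.json \
#       --spiece_model_file=${LARGE_DIR}/spiece.model \
#       --model_config_path=${LARGE_DIR}/xlnet_config.json \
#       --init_pretraining_params=${LARGE_DIR}/params \
#       --checkpoints=exp/squad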
def get_qa_outputs(xlnet_config, features, is_training=False):
# (qlen, batch size)
input_ids = features['input_ids']
cls_index = features['cls_index']
segment_ids = features['segment_ids']
input_mask = features['input_mask']
p_mask = features['p_mask']
inp = fluid.layers.transpose(input_ids, perm=[1, 0, 2])
inp_mask = fluid.layers.transpose(input_mask, perm=[1, 0])
cls_index = fluid.layers.reshape(cls_index, shape=[-1, 1])
seq_len = inp.shape[0]
xlnet = XLNetModel(
input_ids=inp,
seg_ids=segment_ids,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
output = xlnet.get_sequence_output()
initializer = xlnet.get_initializer()
return_dict = {}
# logit of the start position
start_logits = fluid.layers.fc(
input=output,
num_flatten_dims=2,
size=1,
param_attr=fluid.ParamAttr(name='start_logits_fc_weight', initializer=initializer),
bias_attr='start_logits_fc_bias')
start_logits = fluid.layers.transpose(fluid.layers.squeeze(start_logits, [-1]), [1, 0])
start_logits_masked = start_logits * (1 - p_mask) - 1e30 * p_mask
start_log_probs = log_softmax(start_logits_masked)
# logit of the end position
if is_training:
start_positions = features['start_positions']
start_index = fluid.layers.one_hot(start_positions, depth=args.max_seq_length)
# lbh,bl->bh
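# The matmul below implements this einsum as a gather: transpose the
# sequence output to [batch, hidden, len], one-hot encode the gold start
# position to [batch, len, 1], and matmul to pick out the hidden state
# at the start position for every example in the batch.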
trans_out = fluid.layers.transpose(output, perm=[1, 2, 0])
start_index = fluid.layers.unsqueeze(start_index, axes=[2])
start_features = fluid.layers.matmul(x=trans_out, y=start_index)
start_features = fluid.layers.unsqueeze(start_features, axes=[0])
start_features = fluid.layers.squeeze(start_features, axes=[3])
start_features = fluid.layers.expand(start_features, [seq_len, 1, 1])
end_logits = fluid.layers.fc(
input=fluid.layers.concat([output, start_features], axis=-1),
num_flatten_dims=2,
size=xlnet_config['d_model'],
param_attr=fluid.ParamAttr(name="end_logits_fc_0_weight",initializer=initializer),
bias_attr="end_logits_fc_0_bias",
act='tanh')
end_logits = fluid.layers.layer_norm(end_logits,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name='end_logits_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name='end_logits_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)),
begin_norm_axis=len(end_logits.shape)-1)
end_logits = fluid.layers.fc(
input=end_logits,
num_flatten_dims=2,
size=1,
param_attr=fluid.ParamAttr(name='end_logits_fc_1_weight', initializer=initializer),
bias_attr='end_logits_fc_1_bias')
end_logits = fluid.layers.transpose(fluid.layers.squeeze(end_logits, [-1]), [1, 0])
end_logits_masked = end_logits * (1 - p_mask) - 1e30 * p_mask
end_log_probs = log_softmax(end_logits_masked)
else:
start_top_log_probs, start_top_index = fluid.layers.topk(start_log_probs, k=args.start_n_top)
start_top_index = fluid.layers.unsqueeze(start_top_index, [-1])
start_index = fluid.layers.one_hot(start_top_index, seq_len)
# lbh,bkl->bkh
trans_out = fluid.layers.transpose(output, perm=[1, 2, 0])
trans_start_index = fluid.layers.transpose(start_index, [0, 2, 1])
start_features = fluid.layers.matmul(x=trans_out, y=trans_start_index)
start_features = fluid.layers.transpose(start_features, [0, 2, 1])
end_input = fluid.layers.expand(fluid.layers.unsqueeze(output, [2]), [1, 1, args.start_n_top, 1])
start_features = fluid.layers.expand(fluid.layers.unsqueeze(start_features, [0]), [seq_len, 1, 1, 1])
end_input = fluid.layers.concat([end_input, start_features], axis=-1)
end_logits = fluid.layers.fc(end_input, size=xlnet_config['d_model'],
num_flatten_dims=3,
param_attr=fluid.ParamAttr(name="end_logits_fc_0_weight", initializer=initializer),
bias_attr="end_logits_fc_0_bias",
act='tanh')
end_logits = fluid.layers.layer_norm(end_logits,
epsilon=1e-12,
param_attr=fluid.ParamAttr(
name='end_logits_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name='end_logits_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)),
begin_norm_axis=len(end_logits.shape)-1)
end_logits = fluid.layers.fc(
input=end_logits,
num_flatten_dims=3,
size=1,
param_attr=fluid.ParamAttr(name='end_logits_fc_1_weight', initializer=initializer),
bias_attr='end_logits_fc_1_bias')
end_logits = fluid.layers.reshape(end_logits, [seq_len, -1, args.start_n_top])
end_logits = fluid.layers.transpose(end_logits, [1, 2, 0])
p_mask = fluid.layers.stack([p_mask]*args.start_n_top, axis=1)
end_logits_masked = end_logits * (1 - p_mask) - 1e30 * p_mask
end_log_probs = log_softmax(end_logits_masked)
end_top_log_probs, end_top_index = fluid.layers.topk(end_log_probs, k=args.end_n_top)
end_top_log_probs = fluid.layers.reshape(
end_top_log_probs,
[-1, args.start_n_top * args.end_n_top])
end_top_index = fluid.layers.reshape(
end_top_index,
[-1, args.start_n_top * args.end_n_top])
if is_training:
return_dict["start_log_probs"] = start_log_probs
return_dict["end_log_probs"] = end_log_probs
else:
return_dict["start_top_log_probs"] = start_top_log_probs
return_dict["start_top_index"] = start_top_index
return_dict["end_top_log_probs"] = end_top_log_probs
return_dict["end_top_index"] = end_top_index
cls_index = fluid.layers.one_hot(cls_index, seq_len)
cls_index = fluid.layers.unsqueeze(cls_index, axes=[2])
cls_feature = fluid.layers.matmul(x=trans_out, y=cls_index)
start_p = fluid.layers.softmax(start_logits_masked)
start_p = fluid.layers.unsqueeze(start_p, axes=[2])
start_feature = fluid.layers.matmul(x=trans_out, y=start_p)
ans_feature = fluid.layers.concat([start_feature, cls_feature], axis=1)
ans_feature = fluid.layers.fc(
input=ans_feature,
size=xlnet_config['d_model'],
act='tanh',
param_attr=fluid.ParamAttr(initializer=initializer, name="answer_class_fc_0_weight"),
bias_attr="answer_class_fc_0_bias")
ans_feature = fluid.layers.dropout(ans_feature, args.dropout)
cls_logits = fluid.layers.fc(
ans_feature,
size=1,
param_attr=fluid.ParamAttr(name='answer_class_fc_1_weight', initializer=initializer),
bias_attr=False)
cls_logits = fluid.layers.squeeze(cls_logits, axes=[-1])
return_dict["cls_logits"] = cls_logits
return return_dict
def create_model(xlnet_config, is_training=False):
if is_training:
input_fields = {
'names': ['input_ids', 'segment_ids', 'input_mask', 'cls_index', 'p_mask',
'start_positions', 'end_positions', 'is_impossible'],
'shapes': [[None, args.max_seq_length, 1], [None, args.max_seq_length],
[None, args.max_seq_length], [None, 1],
[None, args.max_seq_length], [None, 1], [None, 1], [None, 1]],
'dtypes': [
'int64', 'int64', 'float32', 'int64',
'float32', 'int64', 'int64', 'float32'],
'lod_levels': [0, 0, 0, 0, 0, 0, 0, 0]
}
else:
input_fields = {
'names': ['input_ids', 'segment_ids', 'input_mask', 'cls_index', 'p_mask', 'unique_ids'],
'shapes': [[None, args.max_seq_length, 1], [None, args.max_seq_length],
[None, args.max_seq_length], [None, 1], [None, args.max_seq_length], [None, 1]],
'dtypes': [
'int64', 'int64', 'float32', 'int64', 'float32', 'int64'],
'lod_levels': [0, 0, 0, 0, 0, 0],
}
inputs = [fluid.layers.data(name=input_fields['names'][i],
shape=input_fields['shapes'][i],
dtype=input_fields['dtypes'][i],
lod_level=input_fields['lod_levels'][i]) for i in range(len(input_fields['names']))]
data_loader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=50, iterable=False)
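# iterable=False means the loader feeds data inside the program itself:
# callers attach a generator via set_batch_generator(), call start()
# before running, and catch fluid.core.EOFException to detect the end of
# an epoch (see the train/predict loops below).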
if is_training:
(input_ids, segment_ids, input_mask, cls_index, p_mask, start_positions,
end_positions, is_impossible) = inputs
else:
(input_ids, segment_ids, input_mask, cls_index, p_mask, unique_ids) = inputs
features = {'input_ids': input_ids, 'segment_ids': segment_ids, 'input_mask': input_mask, 'cls_index': cls_index, 'p_mask':p_mask}
if is_training:
features['start_positions'] = start_positions
features['end_positions'] = end_positions
features['is_impossible'] = is_impossible
else:
features['unique_ids'] = unique_ids
outputs = get_qa_outputs(xlnet_config, features, is_training=is_training)
if not is_training:
predictions = {
"unique_ids": features["unique_ids"],
"start_top_index": outputs["start_top_index"],
"start_top_log_probs": outputs["start_top_log_probs"],
"end_top_index": outputs["end_top_index"],
"end_top_log_probs": outputs["end_top_log_probs"],
"cls_logits": outputs["cls_logits"]
}
return data_loader, predictions
seq_len = input_ids.shape[1]
def compute_loss(log_probs, positions):
one_hot_positions = fluid.layers.one_hot(positions, depth=seq_len)
loss = -1 * fluid.layers.reduce_sum(one_hot_positions * log_probs, dim=-1)
loss = fluid.layers.reduce_mean(loss)
return loss
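# Toy example of the span loss above: with seq_len = 4, a gold position
# of 1, and log_probs = [-2.3, -0.1, -3.0, -4.6], the one-hot mask
# selects -0.1, so the per-example loss is 0.1 -- the negative
# log-likelihood of the gold position.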
start_loss = compute_loss(
outputs["start_log_probs"], features["start_positions"])
end_loss = compute_loss(
outputs["end_log_probs"], features["end_positions"])
total_loss = (start_loss + end_loss) * 0.5
cls_logits = outputs["cls_logits"]
is_impossible = fluid.layers.reshape(features["is_impossible"], [-1])
regression_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
label=is_impossible, x=cls_logits)
regression_loss = fluid.layers.reduce_mean(regression_loss)
total_loss += regression_loss * 0.5
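# The final objective is the mean of the start/end span losses plus a
# 0.5-weighted sigmoid cross-entropy on is_impossible, training the
# answerability classifier jointly with span extraction.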
return data_loader, total_loss
RawResult = collections.namedtuple("RawResult",
["unique_id", "start_top_log_probs", "start_top_index",
"end_top_log_probs", "end_top_index", "cls_logits"])
def predict(test_exe, test_program, test_data_loader, fetch_list, processor, name):
if not os.path.exists(args.checkpoints):
os.makedirs(args.checkpoints)
output_prediction_file = os.path.join(args.checkpoints, name + "predictions.json")
output_nbest_file = os.path.join(args.checkpoints, name + "nbest_predictions.json")
output_null_log_odds_file = os.path.join(args.checkpoints, name + "null_odds.json")
test_data_loader.start()
all_results = []
time_begin = time.time()
while True:
try:
outputs = test_exe.run(
fetch_list=fetch_list,
program=test_program)
(np_unique_ids, np_start_top_log_probs, np_start_top_index,
np_end_top_log_probs, np_end_top_index, np_cls_logits) = outputs[0:6]
for idx in range(np_unique_ids.shape[0]):
if len(all_results) % 1000 == 0:
print("Processing example: %d" % len(all_results))
unique_id = int(np_unique_ids[idx])
start_top_log_probs = [float(x) for x in np_start_top_log_probs[idx].flat]
start_top_index = [int(x) for x in np_start_top_index[idx].flat]
end_top_log_probs = [float(x) for x in np_end_top_log_probs[idx].flat]
end_top_index = [int(x) for x in np_end_top_index[idx].flat]
cls_logits = float(np_cls_logits[idx].flat[0])
all_results.append(
RawResult(
unique_id=unique_id,
start_top_log_probs=start_top_log_probs,
start_top_index=start_top_index,
end_top_log_probs=end_top_log_probs,
end_top_index=end_top_index,
cls_logits=cls_logits))
except fluid.core.EOFException:
test_data_loader.reset()
break
time_end = time.time()
with io.open(args.predict_file, "r", encoding="utf8") as f:
orig_data = json.load(f)["data"]
features = processor.get_features(
processor.predict_examples, is_training=False)
ret = write_predictions(processor.predict_examples, features, all_results,
args.n_best_size, args.max_answer_length,
output_prediction_file,
output_nbest_file, output_null_log_odds_file,
orig_data, args)
# Log current result
print("=" * 80)
log_str = "Result | "
for key, val in ret.items():
log_str += "{} {} | ".format(key, val)
print(log_str)
print("=" * 80)
def train(args):
if not (args.do_train or args.do_predict):
raise ValueError("For args `do_train` and `do_predict`, at "
"least one of them must be True.")
xlnet_config = XLNetConfig(args.model_config_path)
xlnet_config.print_config()
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
processor = DataProcessor(
spiece_model_file=args.spiece_model_file,
uncased=args.uncased,
max_seq_length=args.max_seq_length,
doc_stride=args.doc_stride,
max_query_length=args.max_query_length)
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = processor.data_generator(
data_path=args.train_file,
batch_size=args.train_batch_size,
phase='train',
shuffle=True,
dev_count=dev_count,
epoch=args.epoch)
num_train_examples = processor.get_num_examples(phase='train')
print("Device count: %d" % dev_count)
print("Max num of epoches: %d" % args.epoch)
print("Num of train examples: %d" % num_train_examples)
print("Num of train steps: %d" % args.train_steps)
print("Num of warmup steps: %d" % args.warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_data_loader, loss = create_model(
xlnet_config=xlnet_config,
is_training=True)
scheduled_lr = optimization(
loss=loss,
warmup_steps=args.warmup_steps,
num_train_steps=args.train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
lr_layer_decay_rate=args.lr_layer_decay_rate,
scheduler=args.lr_scheduler)
if args.do_predict:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_data_loader, predictions = create_model(
xlnet_config=xlnet_config,
is_training=False)
test_prog = test_prog.clone(for_test=True)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog)
elif args.do_predict:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing prediction!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.use_experimental_executor = args.use_fast_executor
exec_strategy.num_threads = dev_count
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
build_strategy = fluid.BuildStrategy()
# These two flags must be set in this model for correctness
build_strategy.fuse_all_optimizer_ops = True
build_strategy.enable_inplace = False
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=loss.name,
exec_strategy=exec_strategy,
build_strategy=build_strategy,
main_program=train_program)
train_data_loader.set_batch_generator(train_data_generator, place)
train_data_loader.start()
steps = 0
total_cost = []
time_begin = time.time()
print("Begin to train model ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
while steps < args.train_steps:
try:
steps += 1
if steps % args.skip_steps == 0:
fetch_list = [loss.name, scheduled_lr.name]
else:
fetch_list = []
outputs = train_exe.run(fetch_list=fetch_list)
if steps % args.skip_steps == 0:
np_loss, np_lr = outputs
total_cost.extend(np_loss)
if args.verbose:
verbose = "train data_loader queue size: %d, " % train_data_loader.queue.size(
)
verbose += "learning rate: %f " % np_lr[0]
print(verbose)
time_end = time.time()
used_time = time_end - time_begin
current_example, epoch = processor.get_train_progress()
print("epoch: %d, progress: %d/%d, step: %d, loss: %f, "
"speed: %f steps/s" %
(epoch, current_example, num_train_examples, steps,
np.mean(total_cost),
args.skip_steps / used_time))
total_cost = []
time_begin = time.time()
if steps % args.save_steps == 0 or steps == args.train_steps:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps) + "_final")
fluid.io.save_persistables(exe, save_path, train_program)
train_data_loader.reset()
break
print("Finish model training ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
if args.do_predict:
print("Begin to do prediction ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
test_data_loader.set_batch_generator(
processor.data_generator(
data_path=args.predict_file,
batch_size=args.predict_batch_size,
phase='predict',
shuffle=False,
dev_count=1,
epoch=1), place)
predict(exe, test_prog, test_data_loader, [predictions['unique_ids'].name, predictions['start_top_log_probs'].name,
predictions['start_top_index'].name, predictions['end_top_log_probs'].name, predictions['end_top_index'].name,
predictions['cls_logits'].name
], processor, name='')
print("Finish prediction ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
if __name__ == '__main__':
print_arguments(args)
train(args)
"""this file is adapted from https://github.com/zihangdai/xlnet"""
import io
import argparse
import collections
import json
import numpy as np
import os
import re
import string
import sys
OPTS = None
def parse_args():
parser = argparse.ArgumentParser(
'Official evaluation script for SQuAD version 2.0.')
parser.add_argument(
'data_file', metavar='data.json', help='Input data JSON file.')
parser.add_argument(
'pred_file', metavar='pred.json', help='Model predictions.')
parser.add_argument(
'--out-file',
'-o',
metavar='eval.json',
help='Write accuracy metrics to file (default is stdout).')
parser.add_argument(
'--na-prob-file',
'-n',
metavar='na_prob.json',
help='Model estimates of probability of no answer.')
parser.add_argument(
'--na-prob-thresh',
'-t',
type=float,
default=1.0,
help='Predict "" if no-answer probability exceeds this (default = 1.0).')
parser.add_argument(
'--out-image-dir',
'-p',
metavar='out_images',
default=None,
help='Save precision-recall curves to directory.')
parser.add_argument('--verbose', '-v', action='store_true')
if len(sys.argv) == 1:
parser.print_help()
sys.exit(1)
return parser.parse_args()
def make_qid_to_has_ans(dataset):
qid_to_has_ans = {}
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
qid_to_has_ans[qa['id']] = bool(qa['answers'])
return qid_to_has_ans
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
return re.sub(regex, ' ', text)
def white_space_fix(text):
return ' '.join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
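# e.g. normalize_answer("The  Norman Conquest!") -> "norman conquest"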
def get_tokens(s):
if not s: return []
return normalize_answer(s).split()
def compute_exact(a_gold, a_pred):
return int(normalize_answer(a_gold) == normalize_answer(a_pred))
def compute_f1(a_gold, a_pred):
gold_toks = get_tokens(a_gold)
pred_toks = get_tokens(a_pred)
common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
num_same = sum(common.values())
if len(gold_toks) == 0 or len(pred_toks) == 0:
# If either is no-answer, then F1 is 1 if they agree, 0 otherwise
return int(gold_toks == pred_toks)
if num_same == 0:
return 0
precision = 1.0 * num_same / len(pred_toks)
recall = 1.0 * num_same / len(gold_toks)
f1 = (2 * precision * recall) / (precision + recall)
return f1
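# e.g. compute_f1("new york city", "york city"): 2 shared tokens give
# precision 2/2, recall 2/3, and F1 = 2 * (1.0 * 2/3) / (1.0 + 2/3) = 0.8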
def get_raw_scores(dataset, preds):
exact_scores = {}
f1_scores = {}
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
qid = qa['id']
gold_answers = [
a['text'] for a in qa['answers']
if normalize_answer(a['text'])
]
if not gold_answers:
# For unanswerable questions, the only correct answer is the empty string
gold_answers = ['']
if qid not in preds:
print('Missing prediction for %s' % qid)
continue
a_pred = preds[qid]
# Take max over all gold answers
exact_scores[qid] = max(
compute_exact(a, a_pred) for a in gold_answers)
f1_scores[qid] = max(
compute_f1(a, a_pred) for a in gold_answers)
return exact_scores, f1_scores
def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
new_scores = {}
for qid, s in scores.items():
pred_na = na_probs[qid] > na_prob_thresh
if pred_na:
new_scores[qid] = float(not qid_to_has_ans[qid])
else:
new_scores[qid] = s
return new_scores
def make_eval_dict(exact_scores, f1_scores, qid_list=None):
if not qid_list:
total = len(exact_scores)
return collections.OrderedDict([
('exact', 100.0 * sum(exact_scores.values()) / total),
('f1', 100.0 * sum(f1_scores.values()) / total),
('total', total),
])
else:
total = len(qid_list)
return collections.OrderedDict([
('exact', 100.0 * sum(exact_scores[k] for k in qid_list) / total),
('f1', 100.0 * sum(f1_scores[k] for k in qid_list) / total),
('total', total),
])
def merge_eval(main_eval, new_eval, prefix):
for k in new_eval:
main_eval['%s_%s' % (prefix, k)] = new_eval[k]
def plot_pr_curve(precisions, recalls, out_image, title):
plt.step(recalls, precisions, color='b', alpha=0.2, where='post')
plt.fill_between(recalls, precisions, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0, 1.05])
plt.ylim([0.0, 1.05])
plt.title(title)
plt.savefig(out_image)
plt.clf()
def make_precision_recall_eval(scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=None,
title=None):
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
true_pos = 0.0
cur_p = 1.0
cur_r = 0.0
precisions = [1.0]
recalls = [0.0]
avg_prec = 0.0
for i, qid in enumerate(qid_list):
if qid_to_has_ans[qid]:
true_pos += scores[qid]
cur_p = true_pos / float(i + 1)
cur_r = true_pos / float(num_true_pos)
if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i + 1]]:
# i.e., if we can put a threshold after this point
avg_prec += cur_p * (cur_r - recalls[-1])
precisions.append(cur_p)
recalls.append(cur_r)
if out_image:
plot_pr_curve(precisions, recalls, out_image, title)
return {'ap': 100.0 * avg_prec}
def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs,
qid_to_has_ans, out_image_dir):
if out_image_dir and not os.path.exists(out_image_dir):
os.makedirs(out_image_dir)
num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
if num_true_pos == 0:
return
pr_exact = make_precision_recall_eval(
exact_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_exact.png'),
title='Precision-Recall curve for Exact Match score')
pr_f1 = make_precision_recall_eval(
f1_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_f1.png'),
title='Precision-Recall curve for F1 score')
oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
pr_oracle = make_precision_recall_eval(
oracle_scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_oracle.png'),
title='Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)')
merge_eval(main_eval, pr_exact, 'pr_exact')
merge_eval(main_eval, pr_f1, 'pr_f1')
merge_eval(main_eval, pr_oracle, 'pr_oracle')
def histogram_na_prob(na_probs, qid_list, image_dir, name):
if not qid_list:
return
x = [na_probs[k] for k in qid_list]
weights = np.ones_like(x) / float(len(x))
plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
plt.xlabel('Model probability of no-answer')
plt.ylabel('Proportion of dataset')
plt.title('Histogram of no-answer probability: %s' % name)
plt.savefig(os.path.join(image_dir, 'na_prob_hist_%s.png' % name))
plt.clf()
def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores: continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
return 100.0 * best_score / len(scores), best_thresh
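# The sweep above starts with the threshold below every na_prob (predict
# no-answer everywhere, scoring one point per truly unanswerable question),
# then moves it past one question at a time in ascending na_prob order and
# keeps the threshold value that maximizes the accumulated score.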
def find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores: continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
has_ans_score, has_ans_cnt = 0, 0
for qid in qid_list:
if not qid_to_has_ans[qid]: continue
has_ans_cnt += 1
if qid not in scores: continue
has_ans_score += scores[qid]
return (100.0 * best_score / len(scores), best_thresh,
1.0 * has_ans_score / has_ans_cnt)
def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans):
best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs,
qid_to_has_ans)
best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs,
qid_to_has_ans)
main_eval['best_exact'] = best_exact
main_eval['best_exact_thresh'] = exact_thresh
main_eval['best_f1'] = best_f1
main_eval['best_f1_thresh'] = f1_thresh
def find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans):
best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(
preds, exact_raw, na_probs, qid_to_has_ans)
best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(
preds, f1_raw, na_probs, qid_to_has_ans)
main_eval['best_exact'] = best_exact
main_eval['best_exact_thresh'] = exact_thresh
main_eval['best_f1'] = best_f1
main_eval['best_f1_thresh'] = f1_thresh
main_eval['has_ans_exact'] = has_ans_exact
main_eval['has_ans_f1'] = has_ans_f1
def main():
with io.open(OPTS.data_file, encoding='utf8') as f:
dataset_json = json.load(f)
dataset = dataset_json['data']
with io.open(OPTS.pred_file, encoding='utf8') as f:
preds = json.load(f)
new_orig_data = []
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
if qa['id'] in preds:
new_para = {'qas': [qa]}
new_article = {'paragraphs': [new_para]}
new_orig_data.append(new_article)
dataset = new_orig_data
if OPTS.na_prob_file:
with io.open(OPTS.na_prob_file, encoding='utf8') as f:
na_probs = json.load(f)
else:
na_probs = {k: 0.0 for k in preds}
qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
exact_raw, f1_raw = get_raw_scores(dataset, preds)
exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans,
OPTS.na_prob_thresh)
f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans,
OPTS.na_prob_thresh)
out_eval = make_eval_dict(exact_thresh, f1_thresh)
if has_ans_qids:
has_ans_eval = make_eval_dict(
exact_thresh, f1_thresh, qid_list=has_ans_qids)
merge_eval(out_eval, has_ans_eval, 'HasAns')
if no_ans_qids:
no_ans_eval = make_eval_dict(
exact_thresh, f1_thresh, qid_list=no_ans_qids)
merge_eval(out_eval, no_ans_eval, 'NoAns')
if OPTS.na_prob_file:
find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans)
if OPTS.na_prob_file and OPTS.out_image_dir:
run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs,
qid_to_has_ans, OPTS.out_image_dir)
histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, 'hasAns')
histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, 'noAns')
if OPTS.out_file:
with io.open(OPTS.out_file, 'w', encoding='utf8') as f:
json.dump(out_eval, f)
else:
print(json.dumps(out_eval, indent=2))
if __name__ == '__main__':
OPTS = parse_args()
if OPTS.out_image_dir:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import sys
import argparse
import paddle.fluid as fluid
def str2bool(v):
# argparse cannot parse strings like "True"/"False" into Python booleans
# directly, so map the common truthy spellings ourselves
return v.lower() in ("true", "t", "1")
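# e.g. str2bool("True") -> True, str2bool("t") -> True, str2bool("0") -> False;
# any string outside ("true", "t", "1") maps to False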
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def check_cuda(use_cuda, err = \
"\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n"
):
try:
if use_cuda and not fluid.is_compiled_with_cuda():
print(err)
sys.exit(1)
except Exception:
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def cast_fp32_to_fp16(exe, main_program):
print("Cast parameters to float16 data format.")
for param in main_program.global_block().all_parameters():
if not param.name.endswith(".master"):
param_t = fluid.global_scope().find_var(param.name).get_tensor()
data = np.array(param_t)
if param.name.find("layer_norm") == -1 and param.name.find(
"embedding") == -1:
print("shkip params", param.name)
param_t.set(np.float16(data).view(np.uint16), exe.place)
master_param_var = fluid.global_scope().find_var(param.name +
".master")
if master_param_var is not None:
master_param_var.get_tensor().set(data, exe.place)
def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False):
assert os.path.exists(
init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
print("Load checkpoint from {}".format(init_checkpoint_path))
def existed_persistables(var):
if not fluid.io.is_persistable(var):
return False
if os.path.exists(os.path.join(init_checkpoint_path, var.name)):
print("INIT %s" % var.name)
return True
else:
print("SKIP %s" % var.name)
return False
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persistables)
if use_fp16:
cast_fp32_to_fp16(exe, main_program)
def init_pretraining_params(exe,
pretraining_params_path,
main_program,
use_fp16=False):
assert os.path.exists(pretraining_params_path
), "[%s] can't be found." % pretraining_params_path
print("Load pretraining parameters from {}.".format(
pretraining_params_path))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
if os.path.exists(os.path.join(pretraining_params_path, var.name)):
print("INIT %s" % var.name)
return True
else:
print("SKIP %s" % var.name)
return False
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
if use_fp16:
cast_fp32_to_fp16(exe, main_program)