Unverified commit ad6a16c5 authored by Yibing Liu, committed by GitHub

Add lang representation model XLNet (#3830)

Parent 6da62e65
[中文版](README_cn.md)
This project is an implementation of [XLNet](https://github.com/zihangdai/xlnet) on Paddle Fluid. It currently supports fine-tuning on all downstream tasks, including natural language inference, question answering (SQuAD), etc.
There are many differences between XLNet and [BERT](../BERT). XLNet adopts a novel model, [Transformer-XL](https://arxiv.org/abs/1901.02860), as the backbone of language representation and uses permutation language modeling as its pre-training objective. XLNet also involves much more data in the pre-training stage. As a result, XLNet achieves SOTA results on several NLP tasks.
For more details, please refer to the research paper
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
## Installation
This project requires Paddle Fluid 1.6.0 or later; please follow the [installation guide](https://www.paddlepaddle.org.cn/start) to install it.
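To confirm that the installed version meets this requirement, here is a minimal check (the optional `install_check` self-test ships with recent 1.x releases):
```python
import paddle
import paddle.fluid as fluid

# The version string should be 1.6.0 or later for this project.
print(paddle.__version__)

# Optional self-test that Fluid can run on this machine.
fluid.install_check.run_check()
```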
## Pre-trained models
Two pre-trained models converted from the official release are available
| Model | Layers | Hidden size | Heads |
| :------| :------: | :------: |:------: |
| [XLNet-Large, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz)| 24 | 1024 | 16 |
| [XLNet-Base, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz)| 12 | 768 | 12 |
Each compressed package contains one subdirectory and two files:
- `params`: a directory containing all converted parameters, one file per parameter.
- `spiece.model`: a [Sentence Piece](https://github.com/google/sentencepiece) model used for (de)tokenization.
- `xlnet_config.json`: a config file which specifies the hyperparameters of the model.
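To sanity-check an extracted package, the tokenizer and config can be loaded directly; a minimal sketch, assuming the `sentencepiece` Python package is installed and the XLNet-Large archive has been extracted into the working directory:
```python
import json
import sentencepiece as spm

model_dir = "xlnet_cased_L-24_H-1024_A-16"

# Load the SentencePiece model used for (de)tokenization.
sp = spm.SentencePieceProcessor()
sp.Load("%s/spiece.model" % model_dir)
print(sp.EncodeAsPieces("XLNet on Paddle Fluid"))

# Inspect the model hyperparameters.
with open("%s/xlnet_config.json" % model_dir) as f:
    config = json.load(f)
print(config["n_layer"], config["d_model"], config["n_head"])
```
The `n_layer`/`d_model`/`n_head` values should match the table above (24/1024/16 for XLNet-Large).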
## Fine-tuning with XLNet
We provide scripts for fine-tuning XLNet on NLP tasks with multiple GPUs. Their correctness has been verified: all experiments on V100 GPUs achieve the same performance as officially reported (mostly obtained on TPUs). In the following, we assume that the two pre-trained models have been downloaded and extracted.
### Text regression/classification
The fine-tuning of regression/classification tasks can be performed via the script `run_classifier.py`, which contains examples for standard single-document classification, single-document regression, and document-pair classification. The two examples below, one for regression and one for classification, can be run as follows.
#### (1) STS-B: sentence pair relevance regression
- Download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to some directory `$GLUE_DIR`
- **Note**: You may encounter the error `ImportError: No module named request` when running the script under Python 2.x, because the Python 2 `urllib` module has no `request` submodule. This can be resolved by replacing all occurrences of `urllib.request` with `urllib`, or by switching to a Python 3.x environment.
- Perform fine-tuning on 4 V100 GPUs with XLNet-Large
```shell
export GLUE_DIR=glue_data
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=sts-b \
--data_dir=${GLUE_DIR}/STS-B \
--checkpoints=exp/sts-b \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--init_pretraining_params=${LARGE_DIR}/params \
--max_seq_length=128 \
--train_batch_size=8 \
--learning_rate=5e-5 \
--predict_dir=exp/sts-b-pred \
--skip_steps=10 \
--train_steps=1200 \
--warmup_steps=120 \
--save_steps=600 \
--is_regression=True
```
This configuration does not require much GPU memory; four V100 (or comparable) GPUs with 16GB each are enough.
When fine-tuning finishes, the evaluation on the dev set reports the average loss and the Pearson correlation coefficient:
```
[dev evaluation] ave loss: 0.383523, eval_pearsonr: 0.916912, elapsed time: 21.804057 s
```
The expected `eval_pearsonr` is `91.3+`, as quoted from the official repository, and this experiment reproduces that performance.
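For reference, `eval_pearsonr` is the standard Pearson correlation between the predicted and gold similarity scores on the dev set; a minimal illustration with hypothetical numbers:
```python
import numpy as np

# Hypothetical predicted and gold STS-B similarity scores.
preds = np.array([4.2, 1.1, 3.8, 0.5, 2.9])
labels = np.array([4.0, 1.5, 3.5, 0.0, 3.2])

# Pearson correlation coefficient, as reported by eval_pearsonr.
pearsonr = np.corrcoef(preds, labels)[0, 1]
print("pearson r: %.6f" % pearsonr)
```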
#### (2) IMDB: movie review sentiment classification
- Download and unpack the IMDB dataset by running
```shell
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
- Perform fine-tuning with XLNet-Large on 8 V100 GPUs (32GB) by running
```shell
export IMDB_DIR=aclImdb
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=imdb \
--checkpoints=exp/imdb \
--init_pretraining_params=${LARGE_DIR}/params \
--data_dir=${IMDB_DIR} \
--predict_dir=predict_imdb_1028 \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--max_seq_length=512 \
--train_batch_size=4 \
--eval_batch_size=8 \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
--save_steps=500
```
The expected accuracy is `96.2+`; here is an example of the evaluation result
```
[dev evaluation] ave loss: 0.220047, eval_accuracy: 0.963480, elapsed time: 2799.974465 s
```
Fine-tuning for other NLP regression/classification tasks can be carried out in a similar way.
### SQuAD 2.0
- Download the SQuAD 2.0 data and put it in the `data/squad2.0` directory
```shell
mkdir -p data/squad2.0
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
- Perform fine-tuning by running the script `run_squad.py` on 6 V100 GPUs (32GB)
```shell
SQUAD_DIR=data/squad2.0
INIT_CKPT_DIR=xlnet_cased_L-24_H-1024_A-16
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
python run_squad.py \
--model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
--spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
--init_checkpoint=${INIT_CKPT_DIR}/params \
--train_file=${SQUAD_DIR}/train-v2.0.json \
--predict_file=${SQUAD_DIR}/dev-v2.0.json \
--uncased=False \
--checkpoints=squad_2.0_0828 \
--max_seq_length=512 \
--do_train=True \
--do_predict=True \
--skip_steps=100 \
--epoch=200 \
--dropout=0.1 \
--dropatt=0.1 \
--train_batch_size=4 \
--predict_batch_size=3 \
--learning_rate=2e-5 \
--save_steps=1000 \
--train_steps=12000 \
--warmup_steps=1000 \
--verbose=True
```
The final evaluation result after fine-tuning should look like
```
================================================================================
Result | best_f1 88.0893932758 | best_exact_thresh -2.07637166977 | best_exact 85.5049271456 | has_ans_f1 0.940979062625 | has_ans_exact 0.880566801619 | best_f1_thresh -2.07337403297 |
================================================================================
```
### Use your own data
Please refer to the data-format guidelines of GLUE/SQuAD if you want to use your own data for fine-tuning; a sketch of a custom processor is given below.
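For a GLUE-style TSV dataset, one possible starting point is to subclass the `GLUEProcessor` defined in this project's data reader (a hedged sketch: the class name, column indices, and import path below are illustrative, not this project's exact API, and should be adapted to your task and to the repository's actual module layout):
```python
from reader import GLUEProcessor  # illustrative import path; adjust to the actual module


class MyTaskProcessor(GLUEProcessor):
    """Hypothetical processor for a single-sentence classification task."""

    def __init__(self, args):
        super(MyTaskProcessor, self).__init__(args)
        self.train_file = "train.tsv"  # files expected under --data_dir
        self.dev_file = "dev.tsv"
        self.test_file = "test.tsv"
        self.label_column = 0          # column index of the label
        self.text_a_column = 1         # column index of the sentence
        self.text_b_column = None      # set this for sentence-pair tasks

    def get_labels(self):
        return ["0", "1"]              # the label set of your task
```
The new processor then needs to be wired up to a `--task_name` in `run_classifier.py`; the exact registration mechanism is not shown in the part of the diff reproduced here.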
## Acknowledgement
We thank the authors of XLNet for their distinguished work!
[ENGLISH](README.md)
该项目是 [XLNet](https://github.com/zihangdai/xlnet) 基于 Paddle Fluid 的实现,目前支持所有下游任务的 fine-tuning,包括自然语言推断任务和阅读理解任务 (SQuAD 2.0) 等。
XLNet 与 [BERT](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleLARK/BERT) 有着许多的不同,XLNet 利用一个全新的模型 [Transformer-XL](https://arxiv.org/abs/1901.02860) 作为语义表示的骨架, 将置换语言模型的建模作为优化目标,同时在预训练阶段也利用了更多的数据。 最终,XLNet 在多个 NLP 任务上达到了 SOTA 的效果。
更多的细节,请参考学术论文
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
## 安装
该项目要求 Paddle Fluid 1.6.0 及以上版本,请参考 [安装指南](https://www.paddlepaddle.org.cn/start) 进行安装。
## Pre-trained models
这里提供了从官方开源模型转换而来的两个预训练模型供下载
| Model | Layers | Hidden size | Heads |
| :------| :------: | :------: |:------: |
| [XLNet-Large, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz)| 24 | 1024 | 16 |
| [XLNet-Base, Cased](https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz)| 12 | 768 | 12 |
每个压缩包都包含了一个子文件夹和两个文件:
- `params`: 由转换后的参数构成的文件夹,每个参数对应一个文件
- `spiece.model`: [Sentence Piece](https://github.com/google/sentencepiece) 模型,用于文本的(反)tokenization
- `xlnet_config.json`: 配置文件,指定了模型的超参数
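为快速检查解压后的文件,可以直接加载分词模型和配置文件(以下为一个最小示例,假定已安装 `sentencepiece`,且 XLNet-Large 压缩包已解压到当前目录):
```python
import json
import sentencepiece as spm

model_dir = "xlnet_cased_L-24_H-1024_A-16"

# 加载用于 (反) tokenization 的 SentencePiece 模型
sp = spm.SentencePieceProcessor()
sp.Load("%s/spiece.model" % model_dir)
print(sp.EncodeAsPieces("XLNet on Paddle Fluid"))

# 查看模型超参数
with open("%s/xlnet_config.json" % model_dir) as f:
    config = json.load(f)
print(config["n_layer"], config["d_model"], config["n_head"])
```
其中 `n_layer`/`d_model`/`n_head` 应与上表一致(XLNet-Large 为 24/1024/16)。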
## 利用 XLNet 进行 Fine-tuning
我们提供了利用 XLNet 在多卡 GPU 上为自然语言处理任务进行 fine-tuning 的脚本。这些脚本的正确性已得到验证:基于 V100 GPU 的实验均能达到官方报告的效果(官方结果主要基于 TPU)。在下面的说明中,我们假定以上两个预训练模型已下载并解压好。
### 文本回归/分类任务
文本回归和分类任务的 fine-tuning 可以通过运行脚本 `run_classifier.py` 来进行,其中包含了单文本分类、单文本回归、文本对分类等示例。下面的两个例子,一个用于演示回归任务,另一个用于演示分类任务,可以按以下方式进行 fine-tuning。
#### (1) STS-B: 句子对相关性回归
- 通过运行 [脚本](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) 下载 [GLUE 数据集](https://gluebenchmark.com/tasks), 并解压到某个文件夹 $GLUE_DIR。
- **请注意**: 在 Python 2.x 环境下运行这个脚本,可能会遇到报错 `ImportError: No module named request` , 这是因为模块 `urllib` 不包含子模块 `request`. 这个问题可以通过将脚本中的代码 `urllib.request` 全部替换为 `urllib`,或者在 Python 3.x 环境下运行予以解决。
- 使用 XLNet-Large 在 4 卡 V100 GPU 上进行 fine-tuning
```shell
export GLUE_DIR=glue_data
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=sts-b \
--data_dir=${GLUE_DIR}/STS-B \
--checkpoints=exp/sts-b \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--init_pretraining_params=${LARGE_DIR}/params \
--max_seq_length=128 \
--train_batch_size=8 \
--learning_rate=5e-5 \
--predict_dir=exp/sts-b-pred \
--skip_steps=10 \
--train_steps=1200 \
--warmup_steps=120 \
--save_steps=600 \
--is_regression=True
```
该配置不需要特别大的 GPU 显存,16GB 的 4 卡 V100 (或其它 GPU)即可运行。
在 fine-tuning 结束后,会得到在 dev 数据集上的评估结果,包括平均损失和皮尔逊相关系数
```
[dev evaluation] ave loss: 0.383523, eval_pearsonr: 0.916912, elapsed time: 21.804057 s
```
按官方实现的说法,预期的 `eval_pearsonr` 为 `91.3+`,该实验应该能复现这个结果。
#### (2) IMDB: 电影评论情感分类
- 下载和解压 IMDB 数据集
```shell
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
- 使用 XLNet-Large 在 8 卡 V100 GPU (32GB) 上进行 fine-tuning
```shell
export IMDB_DIR=aclImdb
export LARGE_DIR=xlnet_cased_L-24_H-1024_A-16
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_classifier.py \
--do_train=True \
--do_eval=True \
--do_predict=True \
--task_name=imdb \
--checkpoints=exp/imdb \
--init_pretraining_params=${LARGE_DIR}/params \
--data_dir=${IMDB_DIR} \
--predict_dir=predict_imdb_1028 \
--uncased=False \
--spiece_model_file=${LARGE_DIR}/spiece.model \
--model_config_path=${LARGE_DIR}/xlnet_config.json \
--max_seq_length=512 \
--train_batch_size=4 \
--eval_batch_size=8 \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
--save_steps=500
```
期望的准确率是 `96.2+`, 以下是评估结果的一个样例
```
[dev evaluation] ave loss: 0.220047, eval_accuracy: 0.963480, elapsed time: 2799.974465 s
```
其它 NLP 回归/分类任务的 fine-tuning 可以通过同样的方式进行。
### SQuAD 2.0
- 下载 SQuAD 2.0 数据集并将其放入 `data/squad2.0` 目录中
```shell
mkdir -p data/squad2.0
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget -P data/squad2.0 https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
- 在 6 卡 V100 GPU (32GB) 上运行脚本 `run_squad.py`
```shell
SQUAD_DIR=data/squad2.0
INIT_CKPT_DIR=xlnet_cased_L-24_H-1024_A-16
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
python run_squad.py \
--model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
--spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
--init_checkpoint=${INIT_CKPT_DIR}/params \
--train_file=${SQUAD_DIR}/train-v2.0.json \
--predict_file=${SQUAD_DIR}/dev-v2.0.json \
--uncased=False \
--checkpoints=squad_2.0_0828 \
--max_seq_length=512 \
--do_train=True \
--do_predict=True \
--skip_steps=100 \
--epoch=200 \
--dropout=0.1 \
--dropatt=0.1 \
--train_batch_size=4 \
--predict_batch_size=3 \
--learning_rate=2e-5 \
--save_steps=1000 \
--train_steps=12000 \
--warmup_steps=1000 \
--verbose=True
```
运行结束后的评测结果如下所示
```
================================================================================
Result | best_f1 88.0893932758 | best_exact_thresh -2.07637166977 | best_exact 85.5049271456 | has_ans_f1 0.940979062625 | has_ans_exact 0.880566801619 | best_f1_thresh -2.07337403297 |
================================================================================
```
### 使用自定义数据
如需使用自定义数据进行 fine-tuning,请参考 GLUE/SQuAD 的数据格式说明;下面给出了一个自定义 processor 的示意。
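对于 GLUE 风格的 TSV 数据,一个可行的起点是继承本项目数据读取代码中定义的 `GLUEProcessor`(以下仅为示意:类名、列下标和 import 路径都是举例,请根据实际任务和仓库中的模块位置调整):
```python
from reader import GLUEProcessor  # 示例 import 路径,请按仓库实际位置调整


class MyTaskProcessor(GLUEProcessor):
    """示意用的单句分类任务 processor。"""

    def __init__(self, args):
        super(MyTaskProcessor, self).__init__(args)
        self.train_file = "train.tsv"  # --data_dir 下的数据文件
        self.dev_file = "dev.tsv"
        self.test_file = "test.tsv"
        self.label_column = 0          # label 所在列
        self.text_a_column = 1         # 句子所在列
        self.text_b_column = None      # 句对任务时再设置

    def get_labels(self):
        return ["0", "1"]              # 任务的 label 集合
```
新的 processor 还需要在 `run_classifier.py` 中对应一个 `--task_name`;具体的注册方式未包含在此处展示的 diff 中。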
## 致谢
我们向 XLNet 的作者们所做的杰出工作致以谢意!
"""this file is a copy of https://github.com/zihangdai/xlnet"""
import re
import numpy as np
from data_utils import SEP_ID, CLS_ID
SEG_ID_A = 0
SEG_ID_B = 1
SEG_ID_CLS = 2
SEG_ID_SEP = 3
SEG_ID_PAD = 4
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
batches could cause silent errors.
"""
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenize_fn):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[1] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
if label_list is not None:
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenize_fn(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenize_fn(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for two [SEP] & one [CLS] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for one [SEP] & one [CLS] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[:max_seq_length - 2]
tokens = []
segment_ids = []
for token in tokens_a:
tokens.append(token)
segment_ids.append(SEG_ID_A)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_A)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(SEG_ID_B)
tokens.append(SEP_ID)
segment_ids.append(SEG_ID_B)
tokens.append(CLS_ID)
segment_ids.append(SEG_ID_CLS)
input_ids = tokens
# The mask has 0 for real tokens and 1 for padding tokens. Only real
# tokens are attended to.
input_mask = [0] * len(input_ids)
# Zero-pad up to the sequence length.
if len(input_ids) < max_seq_length:
delta_len = max_seq_length - len(input_ids)
input_ids = [0] * delta_len + input_ids
input_mask = [1] * delta_len + input_mask
segment_ids = [SEG_ID_PAD] * delta_len + segment_ids
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
if label_list is not None:
label_id = label_map[example.label]
else:
label_id = example.label
if ex_index < 1:
print("*** Example ***")
print("guid: %s" % (example.guid))
print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
print("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
print("label: {} (id = {})".format(example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
"""this file is a copy of https://github.com/zihangdai/xlnet"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
special_symbols = {
"<unk>": 0,
"<s>": 1,
"</s>": 2,
"<cls>": 3,
"<sep>": 4,
"<pad>": 5,
"<mask>": 6,
"<eod>": 7,
"<eop>": 8,
}
VOCAB_SIZE = 32000
UNK_ID = special_symbols["<unk>"]
CLS_ID = special_symbols["<cls>"]
SEP_ID = special_symbols["<sep>"]
MASK_ID = special_symbols["<mask>"]
EOD_ID = special_symbols["<eod>"]
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import paddle.fluid as fluid
import modeling
from model.xlnet import XLNetModel, _get_initializer
def get_regression_loss(args, xlnet_config, features, is_training=False):
"""Loss for downstream regression tasks."""
inp = fluid.layers.transpose(features["input_ids"], [1, 0, 2])
seg_id = features["segment_ids"]
inp_mask = fluid.layers.transpose(features["input_mask"], [1, 0])
label = features["label_ids"]
xlnet_model = XLNetModel(
input_ids=inp,
seg_ids=seg_id,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
summary = xlnet_model.get_pooled_out(args.summary_type, args.use_summ_proj)
per_example_loss, logits = modeling.regression_loss(
hidden=summary,
labels=label,
initializer=_get_initializer(args),
name="model_regression_{}".format(args.task_name.lower()),
return_logits=True)
total_loss = fluid.layers.reduce_mean(per_example_loss)
return total_loss, per_example_loss, logits
def get_classification_loss(args,
xlnet_config,
features,
n_class,
is_training=True):
"""Loss for downstream classification tasks."""
inp = fluid.layers.transpose(features["input_ids"], [1, 0, 2])
seg_id = features["segment_ids"]
inp_mask = fluid.layers.transpose(features["input_mask"], [1, 0])
label = features["label_ids"]
xlnet_model = XLNetModel(
input_ids=inp,
seg_ids=seg_id,
input_mask=inp_mask,
xlnet_config=xlnet_config,
args=args)
summary = xlnet_model.get_pooled_out(args.summary_type, args.use_summ_proj)
per_example_loss, logits = modeling.classification_loss(
hidden=summary,
labels=label,
n_class=n_class,
initializer=xlnet_model.get_initializer(),
name="model_classification_{}".format(args.task_name),
return_logits=True)
total_loss = fluid.layers.reduce_mean(per_example_loss)
return total_loss, per_example_loss, logits
def create_model(args, xlnet_config, n_class, is_training=False):
label_ids_type = 'int64' if n_class else 'float32'
input_fields = {
'names': [
'input_ids', 'input_mask', 'segment_ids', 'label_ids',
'is_real_example'
],
'shapes': [[-1, args.max_seq_length, 1], [-1, args.max_seq_length],
[-1, args.max_seq_length], [-1, 1], [-1, 1]],
'dtypes': ['int64', 'float32', 'int64', label_ids_type, 'int64'],
'lod_levels': [0, 0, 0, 0, 0],
}
inputs = [
fluid.layers.data(
name=input_fields['names'][i],
shape=input_fields['shapes'][i],
dtype=input_fields['dtypes'][i],
lod_level=input_fields['lod_levels'][i])
for i in range(len(input_fields['names']))
]
(input_ids, input_mask, segment_ids, label_ids, is_real_example) = inputs
data_loader = fluid.io.DataLoader.from_generator(
feed_list=inputs, capacity=50, iterable=False)
features = collections.OrderedDict()
features["input_ids"] = input_ids
features["input_mask"] = input_mask
features["segment_ids"] = segment_ids
features["label_ids"] = label_ids
features["is_real_example"] = is_real_example
if args.is_regression:
(total_loss, per_example_loss, logits) = get_regression_loss(
args, xlnet_config, features, is_training)
else:
(total_loss, per_example_loss, logits) = get_classification_loss(
args, xlnet_config, features, n_class, is_training)
num_seqs = fluid.layers.fill_constant_batch_size_like(
input=label_ids, shape=[-1, 1], value=1, dtype="int64")
num_seqs = fluid.layers.reduce_sum(num_seqs)
return data_loader, total_loss, logits, num_seqs, label_ids
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""XLNet model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import json
import numpy as np
import paddle.fluid as fluid
import modeling
def _get_initializer(args):
if args.init == "uniform":
param_initializer = fluid.initializer.Uniform(
low=-args.init_range, high=args.init_range)
elif args.init == "normal":
param_initializer = fluid.initializer.Normal(scale=args.init_std)
else:
raise ValueError("Initializer {} not supported".format(args.init))
return param_initializer
def init_attn_mask(args, place):
"""create causal attention mask."""
qlen = args.max_seq_length
mlen = 0 if 'mem_len' not in args else args.mem_len
same_length = False if 'same_length' not in args else args.same_length
dtype = 'float16' if args.use_fp16 else 'float32'
attn_mask = np.ones([qlen, qlen], dtype=dtype)
mask_u = np.triu(attn_mask)
mask_dia = np.diag(np.diag(attn_mask))
attn_mask_pad = np.zeros([qlen, mlen], dtype=dtype)
attn_mask = np.concatenate([attn_mask_pad, mask_u - mask_dia], 1)
if same_length:
# mask_l is the lower triangle (with diagonal) of the original [qlen, qlen] ones matrix.
mask_l = np.tril(np.ones([qlen, qlen], dtype=dtype))
attn_mask = np.concatenate(
[attn_mask[:, :qlen] + mask_l - mask_dia, attn_mask[:, qlen:]], 1)
attn_mask = attn_mask[:, :, None, None]
attn_mask_t = fluid.global_scope().find_var("attn_mask").get_tensor()
attn_mask_t.set(attn_mask, place)
class XLNetConfig(object):
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing xlnet model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def has_key(self, key):
return self._config_dict.has_key(key)
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class XLNetModel(object):
def __init__(self,
xlnet_config,
input_ids,
seg_ids,
input_mask,
args,
mems=None,
perm_mask=None,
target_mapping=None,
inp_q=None):
self._tie_weight = True
self._d_head = xlnet_config['d_head']
self._d_inner = xlnet_config['d_inner']
self._d_model = xlnet_config['d_model']
self._ff_activation = xlnet_config['ff_activation']
self._n_head = xlnet_config['n_head']
self._n_layer = xlnet_config['n_layer']
self._n_token = xlnet_config['n_token']
self._untie_r = xlnet_config['untie_r']
self._xlnet_config = xlnet_config
self._dropout = args.dropout
self._dropatt = args.dropatt
self._mem_len = None if 'mem_len' not in args else args.mem_len
self._reuse_len = None if 'reuse_len' not in args else args.reuse_len
self._bi_data = False if 'bi_data' not in args else args.bi_data
self._clamp_len = args.clamp_len
self._same_length = False if 'same_length' not in args else args.same_length
# Initialize all weigths by the specified initializer, and all biases
# will be initialized by constant zero by default.
self._param_initializer = _get_initializer(args)
self.input_mask = input_mask
tfm_args = dict(
n_token=self._n_token,
initializer=self._param_initializer,
attn_type="bi",
n_layer=self._n_layer,
d_model=self._d_model,
n_head=self._n_head,
d_head=self._d_head,
d_inner=self._d_inner,
ff_activation=self._ff_activation,
untie_r=self._untie_r,
use_bfloat16=False,
dropout=self._dropout,
dropatt=self._dropatt,
mem_len=self._mem_len,
reuse_len=self._reuse_len,
bi_data=self._bi_data,
clamp_len=self._clamp_len,
same_length=self._same_length,
name='model_transformer')
input_args = dict(
inp_k=input_ids,
seg_id=seg_ids,
input_mask=input_mask,
mems=mems,
perm_mask=perm_mask,
target_mapping=target_mapping,
inp_q=inp_q)
tfm_args.update(input_args)
self.output, self.new_mems, self.lookup_table = modeling.transformer_xl(
**tfm_args)
#self._build_model(input_ids, sentence_ids, input_mask)
def get_initializer(self):
return self._param_initializer
def get_debug_ret(self):
return self.debug_ret
def get_sequence_output(self):
return self.output
def get_pooled_out(self, summary_type, use_summ_proj=True):
"""
Args:
summary_type: str, "last", "first", "mean", or "attn". The method
to pool the input to get a vector representation.
use_summ_proj: bool, whether to use a linear projection during pooling.
Returns:
float32 Tensor in shape [bsz, d_model], the pooled representation.
"""
summary = modeling.summarize_sequence(
summary_type=summary_type,
hidden=self.output,
d_model=self._d_model,
n_head=self._n_head,
d_head=self._d_head,
dropout=self._dropout,
dropatt=self._dropatt,
input_mask=self.input_mask,
initializer=self._param_initializer,
use_proj=use_summ_proj,
name='model_sequnece_summary')
return summary
(This diff is collapsed.)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import numpy as np
import paddle.fluid as fluid
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
lr_layer_decay_rate=1.0,
scheduler='linear_warmup_decay'):
scheduled_lr = None
if scheduler == 'noam_decay':
if warmup_steps > 0:
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(learning_rate ** 2)),
warmup_steps)
else:
print(
"WARNING: noam decay should have positive warmup steps, using "
"constant learning rate instead!")
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
if lr_layer_decay_rate != 1.0:
n_layer = 0
for param in fluid.default_main_program().block(0).all_parameters():
m = re.search(r"model_transformer_layer_(\d+?)_", param.name)
if not m: continue
n_layer = max(n_layer, int(m.group(1)) + 1)
for param in fluid.default_main_program().block(0).all_parameters():
for l in range(n_layer):
if "model_transformer_layer_{}_".format(l) in param.name:
param.optimize_attr[
'learning_rate'] = lr_layer_decay_rate**(
n_layer - 1 - l)
print("Apply lr decay {:.4f} to layer-{} grad of {}".format(
param.optimize_attr['learning_rate'], l, param.name))
break
def exclude_from_weight_decay(param):
name = param.name[:-len(".master")] if param.name.endswith(".master") else param.name
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
param_list = dict()
if weight_decay > 0:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr
# coding=utf-8
"""this file is a copy of https://github.com/zihangdai/xlnet"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import unicodedata
import six
from functools import partial
SPIECE_UNDERLINE = '▁'
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, unicode):
return text.encode("utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def print_(*args):
new_args = []
for arg in args:
if isinstance(arg, list):
s = [printable_text(i) for i in arg]
s = ' '.join(s)
new_args.append(s)
else:
new_args.append(printable_text(arg))
print(*new_args)
def preprocess_text(inputs, lower=False, remove_space=True, keep_accents=False):
if remove_space:
outputs = ' '.join(inputs.strip().split())
else:
outputs = inputs
outputs = outputs.replace("``", '"').replace("''", '"')
if six.PY2 and isinstance(outputs, str):
outputs = outputs.decode('utf-8')
if not keep_accents:
outputs = unicodedata.normalize('NFKD', outputs)
outputs = ''.join([c for c in outputs if not unicodedata.combining(c)])
if lower:
outputs = outputs.lower()
return outputs
def encode_pieces(sp_model, text, return_unicode=True, sample=False):
# return_unicode is used only for py2
# note(zhiliny): in some systems, sentencepiece only accepts str for py2
if six.PY2 and isinstance(text, unicode):
text = text.encode('utf-8')
if not sample:
pieces = sp_model.EncodeAsPieces(text)
else:
pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
new_pieces = []
for piece in pieces:
if len(piece) > 1 and piece[-1] == ',' and piece[-2].isdigit():
cur_pieces = sp_model.EncodeAsPieces(piece[:-1].replace(
SPIECE_UNDERLINE, ''))
if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][
0] == SPIECE_UNDERLINE:
if len(cur_pieces[0]) == 1:
cur_pieces = cur_pieces[1:]
else:
cur_pieces[0] = cur_pieces[0][1:]
cur_pieces.append(piece[-1])
new_pieces.extend(cur_pieces)
else:
new_pieces.append(piece)
# note(zhiliny): convert back to unicode for py2
if six.PY2 and return_unicode:
ret_pieces = []
for piece in new_pieces:
if isinstance(piece, str):
piece = piece.decode('utf-8')
ret_pieces.append(piece)
new_pieces = ret_pieces
return new_pieces
def encode_ids(sp_model, text, sample=False):
pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
ids = [sp_model.PieceToId(piece) for piece in pieces]
return ids
if __name__ == '__main__':
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('sp10m.uncased.v3.model')
print_(u'I was born in 2000, and this is falsé.')
print_(u'ORIGINAL',
sp.EncodeAsPieces(u'I was born in 2000, and this is falsé.'))
print_(u'OURS',
encode_pieces(sp, u'I was born in 2000, and this is falsé.'))
print(encode_ids(sp, u'I was born in 2000, and this is falsé.'))
print_('')
prepro_func = partial(preprocess_text, lower=True)
print_(prepro_func('I was born in 2000, and this is falsé.'))
print_('ORIGINAL',
sp.EncodeAsPieces(
prepro_func('I was born in 2000, and this is falsé.')))
print_('OURS',
encode_pieces(sp,
prepro_func('I was born in 2000, and this is falsé.')))
print(encode_ids(sp, prepro_func('I was born in 2000, and this is falsé.')))
print_('')
print_('I was born in 2000, and this is falsé.')
print_('ORIGINAL',
sp.EncodeAsPieces('I was born in 2000, and this is falsé.'))
print_('OURS', encode_pieces(sp, 'I was born in 2000, and this is falsé.'))
print(encode_ids(sp, 'I was born in 2000, and this is falsé.'))
print_('')
print_('I was born in 92000, and this is falsé.')
print_('ORIGINAL',
sp.EncodeAsPieces('I was born in 92000, and this is falsé.'))
print_('OURS', encode_pieces(sp, 'I was born in 92000, and this is falsé.'))
print(encode_ids(sp, 'I was born in 92000, and this is falsé.'))
"""this file is adapted from https://github.com/zihangdai/xlnet"""
import io
import os
import types
import csv
import numpy as np
import sentencepiece as spm
from classifier_utils import PaddingInputExample
from classifier_utils import convert_single_example
from prepro_utils import preprocess_text, encode_ids
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def __init__(self, args):
self.data_dir = args.data_dir
self.max_seq_length = args.max_seq_length
self.uncased = args.uncased
np.random.seed(args.random_seed)
sp = spm.SentencePieceProcessor()
sp.Load(args.spiece_model_file)
def tokenize_fn(text):
text = preprocess_text(text, lower=self.uncased)
return encode_ids(sp, text)
self.tokenize_fn = tokenize_fn
self.current_train_example = -1
self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
self.current_train_epoch = -1
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
def convert_example(self, index, example, labels, max_seq_length,
tokenize_fn):
"""Converts a single `InputExample` into a single `InputFeatures`."""
feature = convert_single_example(index, example, labels, max_seq_length,
tokenize_fn)
return feature
def generate_instance(self, feature):
"""
generate instance with given feature
Args:
feature: InputFeatures(object). A single set of features of data.
"""
return [
feature.input_ids, feature.input_mask, feature.segment_ids,
feature.label_id, feature.is_real_example
]
def prepare_batch_data(self, batch_data, is_regression):
"""Generate numpy tensors"""
input_ids = np.expand_dims(
np.array([inst[0] for inst in batch_data]).astype('int64'), axis=-1)
input_mask = np.array(
[inst[1] for inst in batch_data]).astype('float32')
segment_ids = np.array([inst[2] for inst in batch_data]).astype('int64')
labels = np.expand_dims(
np.array([inst[3] for inst in batch_data]).astype(
'int64' if not is_regression else 'float32'),
axis=-1)
is_real_example = np.array(
[inst[4] for inst in batch_data]).astype('int64')
return [input_ids, input_mask, segment_ids, labels, is_real_example]
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with io.open(input_file, "r", encoding="utf8") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
if len(line) == 0: continue
lines.append(line)
return lines
def get_num_examples(self, phase):
"""Get number of examples for train, dev or test."""
if phase not in ['train', 'dev', 'test']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
return self.num_examples[phase]
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
def data_generator(self,
batch_size,
is_regression,
phase='train',
epoch=1,
dev_count=1,
shuffle=True):
"""
Generate data for train, dev or test.
Args:
batch_size: int. The batch size of generated data.
phase: string. The phase for which to generate data.
epoch: int. Total epoches to generate data.
shuffle: bool. Whether to shuffle examples.
"""
if phase == 'train':
examples = self.get_train_examples(self.data_dir)
self.num_examples['train'] = len(examples)
elif phase == 'dev':
examples = self.get_dev_examples(self.data_dir)
self.num_examples['dev'] = len(examples)
elif phase == 'test':
examples = self.get_test_examples(self.data_dir)
self.num_examples['test'] = len(examples)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
def instance_reader():
label_list = self.get_labels() if not is_regression else None
for epoch_index in range(epoch):
if shuffle:
np.random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
for (index, example) in enumerate(examples):
if phase == 'train':
self.current_train_example = index + 1
feature = convert_single_example(index, example, label_list,
self.max_seq_length,
self.tokenize_fn)
instance = [
feature.input_ids, feature.input_mask,
feature.segment_ids, feature.label_id,
feature.is_real_example
]
yield instance
def batch_reader(reader, batch_size):
batch = []
for instance in reader():
if len(batch) < batch_size:
batch.append(instance)
else:
yield batch
batch = [instance]
if len(batch) > 0:
yield batch
def wrapper():
all_dev_batches = []
for batch_data in batch_reader(instance_reader, batch_size):
batch_data = self.prepare_batch_data(batch_data, is_regression)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
return wrapper
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
class GLUEProcessor(DataProcessor):
def __init__(self, args):
super(GLUEProcessor, self).__init__(args)
self.train_file = "train.tsv"
self.dev_file = "dev.tsv"
self.test_file = "test.tsv"
self.label_column = None
self.text_a_column = None
self.text_b_column = None
self.contains_header = True
self.test_text_a_column = None
self.test_text_b_column = None
self.test_contains_header = True
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.train_file)), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.dev_file)), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
if self.test_text_a_column is None:
self.test_text_a_column = self.text_a_column
if self.test_text_b_column is None:
self.test_text_b_column = self.text_b_column
return self._create_examples(
self._read_tsv(os.path.join(data_dir, self.test_file)), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0 and self.contains_header and set_type != "test":
continue
if i == 0 and self.test_contains_header and set_type == "test":
continue
guid = "%s-%s" % (set_type, i)
a_column = (self.text_a_column
if set_type != "test" else self.test_text_a_column)
b_column = (self.text_b_column
if set_type != "test" else self.test_text_b_column)
# there are some incomplete lines in QNLI
if len(line) <= a_column:
print('WARNING: Incomplete line, ignored.')
continue
text_a = line[a_column]
if b_column is not None:
if len(line) <= b_column:
print('WARNING: Incomplete line, ignored.')
continue
text_b = line[b_column]
else:
text_b = None
if set_type == "test":
label = self.get_labels()[0]
else:
if len(line) <= self.label_column:
print('WARNING: Incomplete line, ignored.')
continue
label = line[self.label_column]
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class Yelp5Processor(DataProcessor):
def __init__(self, args):
super(Yelp5Processor, self).__init__(args)
def get_train_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "train.csv"))
def get_dev_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "test.csv"))
def get_labels(self):
"""See base class."""
return ["1", "2", "3", "4", "5"]
def _create_examples(self, input_file):
"""Creates examples for the training and dev sets."""
examples = []
with io.open(input_file, 'r', encoding='utf8') as f:
reader = csv.reader(f)
for i, line in enumerate(reader):
label = line[0]
text_a = line[1].replace('""', '"').replace('\\"', '"')
examples.append(
InputExample(
guid=str(i), text_a=text_a, text_b=None, label=label))
return examples
class ImdbProcessor(DataProcessor):
def __init__(self, args):
super(ImdbProcessor, self).__init__(args)
def get_labels(self):
return ["neg", "pos"]
def get_train_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "train"))
def get_dev_examples(self, data_dir):
return self._create_examples(os.path.join(data_dir, "test"))
def _create_examples(self, data_dir):
examples = []
for label in ["neg", "pos"]:
cur_dir = os.path.join(data_dir, label)
for filename in os.listdir(cur_dir):
if not filename.endswith("txt"): continue
path = os.path.join(cur_dir, filename)
with io.open(path, 'r', encoding='utf8') as f:
text = f.read().strip().replace("<br />", " ")
examples.append(
InputExample(
guid="unused_id", text_a=text, text_b=None,
label=label))
return examples
class MnliMatchedProcessor(GLUEProcessor):
def __init__(self, args):
super(MnliMatchedProcessor, self).__init__(args)
self.dev_file = "dev_matched.tsv"
self.test_file = "test_matched.tsv"
self.label_column = -1
self.text_a_column = 8
self.text_b_column = 9
def get_labels(self):
return ["contradiction", "entailment", "neutral"]
class MnliMismatchedProcessor(MnliMatchedProcessor):
def __init__(self, args):
super(MnliMismatchedProcessor, self).__init__(args)
self.dev_file = "dev_mismatched.tsv"
self.test_file = "test_mismatched.tsv"
class StsbProcessor(GLUEProcessor):
def __init__(self, args):
super(StsbProcessor, self).__init__(args)
self.label_column = 9
self.text_a_column = 7
self.text_b_column = 8
def get_labels(self):
return [0.0]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0 and self.contains_header and set_type != "test":
continue
if i == 0 and self.test_contains_header and set_type == "test":
continue
guid = "%s-%s" % (set_type, i)
a_column = (self.text_a_column
if set_type != "test" else self.test_text_a_column)
b_column = (self.text_b_column
if set_type != "test" else self.test_text_b_column)
# there are some incomplete lines in QNLI
if len(line) <= a_column:
print('WARNING: Incomplete line, ignored.')
continue
text_a = line[a_column]
if b_column is not None:
if len(line) <= b_column:
print('WARNING: Incomplete line, ignored.')
continue
text_b = line[b_column]
else:
text_b = None
if set_type == "test":
label = self.get_labels()[0]
else:
if len(line) <= self.label_column:
print('WARNING: Incomplete line, ignored.')
continue
label = float(line[self.label_column])
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
if __name__ == '__main__':
pass
(This diff is collapsed.)
(This diff is collapsed.)
(This diff is collapsed.)
"""this file is adapted from https://github.com/zihangdai/xlnet"""
import io
import argparse
import collections
import json
import numpy as np
import os
import re
import string
import sys
OPTS = None
def parse_args():
parser = argparse.ArgumentParser(
'Official evaluation script for SQuAD version 2.0.')
parser.add_argument(
'data_file', metavar='data.json', help='Input data JSON file.')
parser.add_argument(
'pred_file', metavar='pred.json', help='Model predictions.')
parser.add_argument(
'--out-file',
'-o',
metavar='eval.json',
help='Write accuracy metrics to file (default is stdout).')
parser.add_argument(
'--na-prob-file',
'-n',
metavar='na_prob.json',
help='Model estimates of probability of no answer.')
parser.add_argument(
'--na-prob-thresh',
'-t',
type=float,
default=1.0,
help='Predict "" if no-answer probability exceeds this (default = 1.0).')
parser.add_argument(
'--out-image-dir',
'-p',
metavar='out_images',
default=None,
help='Save precision-recall curves to directory.')
parser.add_argument('--verbose', '-v', action='store_true')
if len(sys.argv) == 1:
parser.print_help()
sys.exit(1)
return parser.parse_args()
def make_qid_to_has_ans(dataset):
qid_to_has_ans = {}
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
qid_to_has_ans[qa['id']] = bool(qa['answers'])
return qid_to_has_ans
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
return re.sub(regex, ' ', text)
def white_space_fix(text):
return ' '.join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
def get_tokens(s):
if not s: return []
return normalize_answer(s).split()
def compute_exact(a_gold, a_pred):
return int(normalize_answer(a_gold) == normalize_answer(a_pred))
def compute_f1(a_gold, a_pred):
gold_toks = get_tokens(a_gold)
pred_toks = get_tokens(a_pred)
common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
num_same = sum(common.values())
if len(gold_toks) == 0 or len(pred_toks) == 0:
# If either is no-answer, then F1 is 1 if they agree, 0 otherwise
return int(gold_toks == pred_toks)
if num_same == 0:
return 0
precision = 1.0 * num_same / len(pred_toks)
recall = 1.0 * num_same / len(gold_toks)
f1 = (2 * precision * recall) / (precision + recall)
return f1
def get_raw_scores(dataset, preds):
exact_scores = {}
f1_scores = {}
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
qid = qa['id']
gold_answers = [
a['text'] for a in qa['answers']
if normalize_answer(a['text'])
]
if not gold_answers:
# For unanswerable questions, only correct answer is empty string
gold_answers = ['']
if qid not in preds:
print('Missing prediction for %s' % qid)
continue
a_pred = preds[qid]
# Take max over all gold answers
exact_scores[qid] = max(
compute_exact(a, a_pred) for a in gold_answers)
f1_scores[qid] = max(
compute_f1(a, a_pred) for a in gold_answers)
return exact_scores, f1_scores
def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
new_scores = {}
for qid, s in scores.items():
pred_na = na_probs[qid] > na_prob_thresh
if pred_na:
new_scores[qid] = float(not qid_to_has_ans[qid])
else:
new_scores[qid] = s
return new_scores
def make_eval_dict(exact_scores, f1_scores, qid_list=None):
if not qid_list:
total = len(exact_scores)
return collections.OrderedDict([
('exact', 100.0 * sum(exact_scores.values()) / total),
('f1', 100.0 * sum(f1_scores.values()) / total),
('total', total),
])
else:
total = len(qid_list)
return collections.OrderedDict([
('exact', 100.0 * sum(exact_scores[k] for k in qid_list) / total),
('f1', 100.0 * sum(f1_scores[k] for k in qid_list) / total),
('total', total),
])
def merge_eval(main_eval, new_eval, prefix):
for k in new_eval:
main_eval['%s_%s' % (prefix, k)] = new_eval[k]
def plot_pr_curve(precisions, recalls, out_image, title):
plt.step(recalls, precisions, color='b', alpha=0.2, where='post')
plt.fill_between(recalls, precisions, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0, 1.05])
plt.ylim([0.0, 1.05])
plt.title(title)
plt.savefig(out_image)
plt.clf()
def make_precision_recall_eval(scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=None,
title=None):
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
true_pos = 0.0
cur_p = 1.0
cur_r = 0.0
precisions = [1.0]
recalls = [0.0]
avg_prec = 0.0
for i, qid in enumerate(qid_list):
if qid_to_has_ans[qid]:
true_pos += scores[qid]
cur_p = true_pos / float(i + 1)
cur_r = true_pos / float(num_true_pos)
if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i + 1]]:
# i.e., if we can put a threshold after this point
avg_prec += cur_p * (cur_r - recalls[-1])
precisions.append(cur_p)
recalls.append(cur_r)
if out_image:
plot_pr_curve(precisions, recalls, out_image, title)
return {'ap': 100.0 * avg_prec}
def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs,
qid_to_has_ans, out_image_dir):
if out_image_dir and not os.path.exists(out_image_dir):
os.makedirs(out_image_dir)
num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
if num_true_pos == 0:
return
pr_exact = make_precision_recall_eval(
exact_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_exact.png'),
title='Precision-Recall curve for Exact Match score')
pr_f1 = make_precision_recall_eval(
f1_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_f1.png'),
title='Precision-Recall curve for F1 score')
oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
pr_oracle = make_precision_recall_eval(
oracle_scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, 'pr_oracle.png'),
title='Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)')
merge_eval(main_eval, pr_exact, 'pr_exact')
merge_eval(main_eval, pr_f1, 'pr_f1')
merge_eval(main_eval, pr_oracle, 'pr_oracle')
def histogram_na_prob(na_probs, qid_list, image_dir, name):
if not qid_list:
return
x = [na_probs[k] for k in qid_list]
weights = np.ones_like(x) / float(len(x))
plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
plt.xlabel('Model probability of no-answer')
plt.ylabel('Proportion of dataset')
plt.title('Histogram of no-answer probability: %s' % name)
plt.savefig(os.path.join(image_dir, 'na_prob_hist_%s.png' % name))
plt.clf()
def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores: continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
return 100.0 * best_score / len(scores), best_thresh
def find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores: continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
has_ans_score, has_ans_cnt = 0, 0
for qid in qid_list:
if not qid_to_has_ans[qid]: continue
has_ans_cnt += 1
if qid not in scores: continue
has_ans_score += scores[qid]
return 100.0 * best_score / len(
scores), best_thresh, 1.0 * has_ans_score / has_ans_cnt
def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans):
best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs,
qid_to_has_ans)
best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs,
qid_to_has_ans)
main_eval['best_exact'] = best_exact
main_eval['best_exact_thresh'] = exact_thresh
main_eval['best_f1'] = best_f1
main_eval['best_f1_thresh'] = f1_thresh
def find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans):
best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(
preds, exact_raw, na_probs, qid_to_has_ans)
best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(
preds, f1_raw, na_probs, qid_to_has_ans)
main_eval['best_exact'] = best_exact
main_eval['best_exact_thresh'] = exact_thresh
main_eval['best_f1'] = best_f1
main_eval['best_f1_thresh'] = f1_thresh
main_eval['has_ans_exact'] = has_ans_exact
main_eval['has_ans_f1'] = has_ans_f1
def main():
with io.open(OPTS.data_file, encoding='utf8') as f:
dataset_json = json.load(f)
dataset = dataset_json['data']
with io.open(OPTS.pred_file, encoding='utf8') as f:
preds = json.load(f)
new_orig_data = []
for article in dataset:
for p in article['paragraphs']:
for qa in p['qas']:
if qa['id'] in preds:
new_para = {'qas': [qa]}
new_article = {'paragraphs': [new_para]}
new_orig_data.append(new_article)
dataset = new_orig_data
if OPTS.na_prob_file:
with io.open(OPTS.na_prob_file, encoding='utf8') as f:
na_probs = json.load(f)
else:
na_probs = {k: 0.0 for k in preds}
qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
exact_raw, f1_raw = get_raw_scores(dataset, preds)
exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans,
OPTS.na_prob_thresh)
f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans,
OPTS.na_prob_thresh)
out_eval = make_eval_dict(exact_thresh, f1_thresh)
if has_ans_qids:
has_ans_eval = make_eval_dict(
exact_thresh, f1_thresh, qid_list=has_ans_qids)
merge_eval(out_eval, has_ans_eval, 'HasAns')
if no_ans_qids:
no_ans_eval = make_eval_dict(
exact_thresh, f1_thresh, qid_list=no_ans_qids)
merge_eval(out_eval, no_ans_eval, 'NoAns')
if OPTS.na_prob_file:
find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs,
qid_to_has_ans)
if OPTS.na_prob_file and OPTS.out_image_dir:
run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs,
qid_to_has_ans, OPTS.out_image_dir)
histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, 'hasAns')
histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, 'noAns')
if OPTS.out_file:
with io.open(OPTS.out_file, 'w', encoding='utf8') as f:
json.dump(out_eval, f)
else:
print(json.dumps(out_eval, indent=2))
if __name__ == '__main__':
OPTS = parse_args()
if OPTS.out_image_dir:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import six
import argparse
import paddle.fluid as fluid
def str2bool(v):
# argparse cannot parse strings like "True"/"False" into Python booleans directly
return v.lower() in ("true", "t", "1")
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def check_cuda(use_cuda, err = \
"\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n"
):
try:
if use_cuda == True and fluid.is_compiled_with_cuda() == False:
print(err)
sys.exit(1)
except Exception as e:
pass
(This diff is collapsed.)