Unverified commit 4fa5b715, authored by Yibing Liu, committed by GitHub

Add more research projects (#2611)

Parent commit: 8d63308d
Data
=====
This dataset accompanies our paper "ARNOR: Attention Regularization based Noise Reduction for Distant Supervision Relation Classification". The test set is intended for sentence-level evaluation.
The original data comes from the dataset in the paper "CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases". It is a distant supervision dataset built from the NYT (New York Times) corpus, and its test set is annotated by humans. However, the number of positive instances in that test set is small, so we revised it and annotated more test data based on it.
In a data file, each line is a JSON string. The content looks like this:

    {
        "sentText": "The source sentence text",
        "relationMentions": [
            {
                "em1Text": "The first entity in relation",
                "em2Text": "The second entity in relation",
                "label": "Relation label",
                "is_noise": false  # only occurs in the test set
            },
            ...
        ],
        "entityMentions": [
            {
                "text": "Entity words",
                "label": "Entity type",
                ...
            },
            ...
        ]
        ...
    }
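
A minimal sketch of reading this format, assuming each line of a real data file is one valid JSON object (the `#` annotation shown above is explanatory and would not parse as JSON). The helper name `iter_relation_mentions` and the sample record are hypothetical, not part of the released dataset or code.

```python
import io
import json


def iter_relation_mentions(lines):
    """Yield (sentence, em1, em2, label, is_noise) for every relation mention.

    `lines` is any iterable of JSON strings, e.g. an open data file.
    `is_noise` is None when the field is absent (it only occurs in the test set).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for mention in record.get("relationMentions", []):
            yield (record["sentText"], mention["em1Text"], mention["em2Text"],
                   mention["label"], mention.get("is_noise"))


# Hypothetical one-line sample in the format described above.
sample = io.StringIO(
    u'{"sentText": "Barack Obama was born in Hawaii .", '
    u'"relationMentions": [{"em1Text": "Barack Obama", "em2Text": "Hawaii", '
    u'"label": "/people/person/place_of_birth", "is_noise": false}], '
    u'"entityMentions": [{"text": "Barack Obama", "label": "PERSON"}]}\n')
mentions = list(iter_relation_mentions(sample))
```

For a real file, the same helper could be called on the object returned by `open("test.json")`.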
Data version 1.0.0
=====
This version of the dataset is the original one used in our paper. It includes four files: train.json, test.json, dev_part.json, and test_part.json, where dev_part.json and test_part.json are drawn from test.json. The dataset can be downloaded here: https://baidu-nlp.bj.bcebos.com/arnor_dataset-1.0.0.tar.gz
Data version 2.0.0
=====
More test data is coming soon.
# Multi-Perspective Models
This model won first place in SemEval-2019 Task 9 Subtask A: Suggestion Mining from Online Reviews and Forums.
See more information about SemEval 2019: [http://alt.qcri.org/semeval2019/](http://alt.qcri.org/semeval2019/)
## 1. Introduction
This paper describes our system for Task 9 of SemEval-2019. The task focuses on suggestion mining and aims to classify given sentences into suggestion and non-suggestion classes, in domain-specific and cross-domain training settings respectively. We propose a multi-perspective architecture that learns representations with several classical models, including Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), and Feed Forward Attention (FFA). To leverage the semantics distributed in large amounts of unlabeled data, we also adopt the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model as an encoder to produce sentence and word representations. The proposed architecture is applied to both subtasks and achieves an F1-score of 0.7812 on Subtask A and 0.8579 on Subtask B. We won first and second place in the two subtasks, respectively, in the final competition.
## 2. Quick Start
### Installation
This project depends on Python 2.7 and paddle.fluid 1.3.2. Please follow the [quick start](http://www.paddlepaddle.org/#quick-start) guide to install them.
### Data Preparation
- Download the competition's data
```
# Download the competition's data
cd ./data && git clone https://github.com/Semeval2019Task9/Subtask-A.git
cd ../
```
- Download BERT and pre-trained model
```
# Download BERT code
git clone https://github.com/PaddlePaddle/LARK && mv LARK/BERT ./
# Download BERT pre-trained model
wget https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz
tar zxf uncased_L-24_H-1024_A-16.tar.gz -C ./
```
### Train
Use this command to start training:
```
# run training script
sh train.sh
```
The trained models are saved to ./output.
### Ensemble & Evaluation
Use this command to evaluate the ensemble result:
```
# run evaluation
python evaluation.py \
./data/Subtask-A/SubtaskA_EvaluationData_labeled.csv \
./probs/prob_raw.txt \
./probs/prob_cnn.txt \
./probs/prob_gru.txt \
./probs/prob_ffa.txt
```
Because the dataset is small, training results may fluctuate; try re-training several times.
## 3. Advanced
### Task Introduction
[SemEval-2019 Task 9](https://www.aclweb.org/anthology/S19-2151) presents the pilot SemEval task on suggestion mining. The task consists of Subtasks A and B, with labeled data created from a feedback forum and hotel reviews, respectively. Examples:
|Source |Sentence |Label|
|------| ------|------|
|Hotel reviews |Be sure to specify a room at the back of the hotel. |suggestion|
|Hotel reviews |The point is, don’t advertise the service if there are caveats that go with it.|non-suggestion|
|Suggestion forum| Why not let us have several pages that we can put tiles on and name whatever we want to |suggestion|
|Suggestion forum| It fails with a uninformative message indicating deployment failed.|non-suggestion|
### Model Introduction
The model's framework is shown in Figure 1:
<p align="center">
<img src="data/mpm.png"/> <br />
<b>Figure 1: An overall framework and pipeline of our system for suggestion mining</b>
</p>
As shown in Figure 1, our model architecture consists of two modules: a universal encoding module that serves as either a sentence or a word encoder, and a task-specific module used for suggestion classification. To fully exploit the information generated by the encoder, we stack a series of different task-specific modules on top of the encoder, each reflecting a different perspective. Intuitively, we could use the sentence encoding directly to make a classification. Going beyond that, since language is in essence time-series information, GRU cells (the time perspective) can also be applied to model the sequence state and learn the structure relevant to the suggestion mining task. Similarly, a CNN (the spatial perspective) can be used to mimic an n-gram model. Moreover, we introduce a convenient attention mechanism, FFA (Raffel and Ellis, 2015), which automatically learns a combination of the most important features. Finally, we ensemble these models with a voting strategy to produce the system's final prediction.
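
To make the FFA perspective concrete, here is a minimal NumPy sketch of feed-forward attention pooling in its textbook form: score each time step, normalize the scores with a softmax over time, and take the weighted sum of the hidden states. The function name, shapes, and random inputs are illustrative assumptions, and the sketch may differ in detail from the Paddle implementation in this repository.

```python
import numpy as np


def feed_forward_attention(hidden, seq_len, w, b):
    """Pool a padded sequence of hidden states into one vector.

    hidden:  [max_len, dim] encoder outputs for one sentence (padded).
    seq_len: number of real (non-padding) tokens.
    w, b:    parameters of the scoring layer, shape [dim] and a scalar.
    """
    states = hidden[:seq_len]                 # drop padding positions
    scores = np.tanh(states.dot(w) + b)       # e_t = tanh(w . h_t + b)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over time steps
    return weights.dot(states), weights       # c = sum_t alpha_t * h_t


rng = np.random.RandomState(0)
hidden = rng.randn(6, 4)     # 6 positions, 4-dim states; the last 2 are padding
context, weights = feed_forward_attention(hidden, 4, rng.randn(4), 0.0)
```

The pooled vector `context` plays the role of the sentence feature that is passed to the classification layer.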
### Result
| Models | CV f1-score | test score |
| ----- | ----- | ------ |
| BERT-Large-Logistic | 0.8522 (±0.0213) | 0.7697 |
| BERT-Large-Conv | 0.8520 (±0.0231) | 0.7800 |
| BERT-Large-FFA | 0.8516 (±0.0307) | 0.7722 |
| BERT-Large-GRU | 0.8503 (±0.0275) | 0.7725 |
| Ensemble | – | 0.7812 |
## 4. Others
If you use this library in your research project, please cite the paper "OleNet at SemEval-2019 Task 9: BERT based Multi-Perspective Models for Suggestion Mining".
### Citation
```
@inproceedings{BaiduMPM,
title={OleNet at SemEval-2019 Task 9: BERT based Multi-Perspective Models for Suggestion Mining},
author={Jiaxiang Liu and Shuohuan Wang and Yu Sun},
booktitle={Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019)},
year={2019}
}
```
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
    """
    Add mask for batch_tokens, return out, mask_label, mask_pos;
    Note: mask_pos refers to positions in batch_tokens after padding;
    """
    max_len = max([len(sent) for sent in batch_tokens])
    mask_label = []
    mask_pos = []
    prob_mask = np.random.rand(total_token_num)
    # Note: the first token is [CLS], so [low=1]
    replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)
    pre_sent_len = 0
    prob_index = 0
    for sent_index, sent in enumerate(batch_tokens):
        mask_flag = False
        prob_index += pre_sent_len
        for token_index, token in enumerate(sent):
            prob = prob_mask[prob_index + token_index]
            if prob > 0.15:
                continue
            elif 0.03 < prob <= 0.15:
                # mask
                if token != SEP and token != CLS:
                    mask_label.append(sent[token_index])
                    sent[token_index] = MASK
                    mask_flag = True
                    mask_pos.append(sent_index * max_len + token_index)
            elif 0.015 < prob <= 0.03:
                # random replace
                if token != SEP and token != CLS:
                    mask_label.append(sent[token_index])
                    sent[token_index] = replace_ids[prob_index + token_index]
                    mask_flag = True
                    mask_pos.append(sent_index * max_len + token_index)
            else:
                # keep the original token
                if token != SEP and token != CLS:
                    mask_label.append(sent[token_index])
                    mask_pos.append(sent_index * max_len + token_index)

        pre_sent_len = len(sent)

        # ensure that at least one word in each sentence is masked
        while not mask_flag:
            token_index = int(np.random.randint(1, high=len(sent) - 1, size=1))
            if sent[token_index] != SEP and sent[token_index] != CLS:
                mask_label.append(sent[token_index])
                sent[token_index] = MASK
                mask_flag = True
                mask_pos.append(sent_index * max_len + token_index)

    mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
    mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
    return batch_tokens, mask_label, mask_pos
def prepare_batch_data(insts,
                       total_token_num,
                       voc_size=0,
                       pad_id=None,
                       cls_id=None,
                       sep_id=None,
                       mask_id=None,
                       return_input_mask=True,
                       return_max_len=True,
                       return_num_token=False):
    """
    1. generate Tensor of data
    2. generate Tensor of position
    3. generate self attention mask, [shape: batch_size * max_len * max_len]
    """
    batch_src_ids = [inst[0] for inst in insts]
    batch_sent_ids = [inst[1] for inst in insts]
    batch_pos_ids = [inst[2] for inst in insts]
    seq_len = np.array(
        [[len(inst[0])] for inst in insts]).astype("int64").reshape([-1, 1])
    labels_list = []
    # compatible with squad, whose example includes start/end positions,
    # or unique id
    for i in range(3, len(insts[0]), 1):
        labels = [inst[i] for inst in insts]
        labels = np.array(labels).astype("int64").reshape([-1, 1])
        labels_list.append(labels)

    # First step: do mask without padding
    if mask_id >= 0:
        out, mask_label, mask_pos = mask(
            batch_src_ids,
            total_token_num,
            vocab_size=voc_size,
            CLS=cls_id,
            SEP=sep_id,
            MASK=mask_id)
    else:
        out = batch_src_ids

    # Second step: padding
    src_id, self_input_mask = pad_batch_data(
        out, pad_idx=pad_id, return_input_mask=True)
    pos_id = pad_batch_data(
        batch_pos_ids,
        pad_idx=pad_id,
        return_pos=False,
        return_input_mask=False)
    sent_id = pad_batch_data(
        batch_sent_ids,
        pad_idx=pad_id,
        return_pos=False,
        return_input_mask=False)

    if mask_id >= 0:
        return_list = [
            src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
        ] + labels_list
    else:
        return_list = [src_id, pos_id, sent_id, self_input_mask, seq_len
                       ] + labels_list

    return return_list if len(return_list) > 1 else return_list[0]
def pad_batch_data(insts,
                   pad_idx=0,
                   return_pos=False,
                   return_input_mask=False,
                   return_max_len=False,
                   return_num_token=False,
                   return_seq_len=False):
    """
    Pad the instances to the max sequence length in batch, and generate the
    corresponding position data and input mask.
    """
    return_list = []
    max_len = max(len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and make no effect on parameter gradients.
    inst_data = np.array([
        list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts
    ])
    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]

    # position data
    if return_pos:
        inst_pos = np.array([
            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
            for inst in insts
        ])
        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array(
            [[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts])
        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
        return_list += [input_mask_data.astype("float32")]

    if return_max_len:
        return_list += [max_len]

    if return_num_token:
        num_token = 0
        for inst in insts:
            num_token += len(inst)
        return_list += [num_token]

    if return_seq_len:
        seq_len = np.array([[len(inst)] for inst in insts])
        return_list += [seq_len.astype("int64").reshape([-1, 1])]

    return return_list if len(return_list) > 1 else return_list[0]


if __name__ == "__main__":
    pass
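
The padding step above can be illustrated with a small stand-alone sketch. It only reproduces the two outputs the classifier relies on (padded ids and the 0/1 input mask); the helper name `pad_with_mask` and the token ids are hypothetical, chosen for illustration.

```python
import numpy as np


def pad_with_mask(insts, pad_idx=0):
    """Pad token-id lists to the batch max length and build a 0/1 input mask."""
    max_len = max(len(inst) for inst in insts)
    ids = np.array(
        [list(inst) + [pad_idx] * (max_len - len(inst)) for inst in insts],
        dtype="int64").reshape([-1, max_len, 1])
    mask = np.array(
        [[1.0] * len(inst) + [0.0] * (max_len - len(inst)) for inst in insts],
        dtype="float32").reshape([-1, max_len, 1])
    return ids, mask


# Hypothetical token ids; 101 and 102 stand in for [CLS] and [SEP].
ids, mask = pad_with_mask([[101, 7, 8, 102], [101, 9, 102]])
```

Both arrays have shape `[batch_size, max_len, 1]`, matching the trailing singleton dimension that `pad_batch_data` produces.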
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import numpy as np
import paddle.fluid as fluid
sys.path.append("./BERT")
from model.bert import BertModel
def create_model(args,
                 pyreader_name,
                 bert_config,
                 num_labels,
                 is_prediction=False):
    """
    define fine-tuning model
    """
    if args.binary:
        pyreader = fluid.layers.py_reader(
            capacity=50,
            shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
                    [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
                    [-1, 1], [-1, 1]],
            dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
            lod_levels=[0, 0, 0, 0, 0, 0],
            name=pyreader_name,
            use_double_buffer=True)

    (src_ids, pos_ids, sent_ids, input_mask, seq_len,
     labels) = fluid.layers.read_file(pyreader)

    bert = BertModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
        input_mask=input_mask,
        config=bert_config,
        use_fp16=args.use_fp16)

    if args.sub_model_type == 'raw':
        cls_feats = bert.get_pooled_output()

    elif args.sub_model_type == 'cnn':
        bert_seq_out = bert.get_sequence_output()
        bert_seq_out = fluid.layers.sequence_unpad(bert_seq_out, seq_len)
        cnn_hidden_size = 100
        convs = []
        for h in [3, 4, 5]:
            conv_feats = fluid.layers.sequence_conv(
                input=bert_seq_out, num_filters=cnn_hidden_size, filter_size=h)
            conv_feats = fluid.layers.batch_norm(input=conv_feats, act="relu")
            conv_feats = fluid.layers.sequence_pool(
                input=conv_feats, pool_type='max')
            convs.append(conv_feats)

        cls_feats = fluid.layers.concat(input=convs, axis=1)

    elif args.sub_model_type == 'gru':
        bert_seq_out = bert.get_sequence_output()
        bert_seq_out = fluid.layers.sequence_unpad(bert_seq_out, seq_len)
        gru_hidden_size = 1024
        gru_input = fluid.layers.fc(input=bert_seq_out,
                                    size=gru_hidden_size * 3)
        gru_forward = fluid.layers.dynamic_gru(
            input=gru_input, size=gru_hidden_size, is_reverse=False)
        gru_backward = fluid.layers.dynamic_gru(
            input=gru_input, size=gru_hidden_size, is_reverse=True)
        gru_output = fluid.layers.concat([gru_forward, gru_backward], axis=1)
        cls_feats = fluid.layers.sequence_pool(
            input=gru_output, pool_type='max')

    elif args.sub_model_type == 'ffa':
        bert_seq_out = bert.get_sequence_output()
        attn = fluid.layers.fc(input=bert_seq_out,
                               num_flatten_dims=2,
                               size=1,
                               act='tanh')
        attn = fluid.layers.softmax(attn)
        weighted_input = bert_seq_out * attn
        weighted_input = fluid.layers.sequence_unpad(weighted_input, seq_len)
        cls_feats = fluid.layers.sequence_pool(weighted_input, pool_type='sum')

    else:
        raise NotImplementedError("%s is not implemented!" %
                                  args.sub_model_type)

    cls_feats = fluid.layers.dropout(
        x=cls_feats,
        dropout_prob=0.1,
        dropout_implementation="upscale_in_train")

    logits = fluid.layers.fc(
        input=cls_feats,
        size=num_labels,
        param_attr=fluid.ParamAttr(
            name="cls_out_w",
            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
        bias_attr=fluid.ParamAttr(
            name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
    probs = fluid.layers.softmax(logits)

    if is_prediction:
        feed_targets_name = [
            src_ids.name, pos_ids.name, sent_ids.name, input_mask.name
        ]
        return pyreader, probs, feed_targets_name

    ce_loss = fluid.layers.softmax_with_cross_entropy(
        logits=logits, label=labels)
    loss = fluid.layers.mean(x=ce_loss)

    if args.use_fp16 and args.loss_scaling > 1.0:
        loss *= args.loss_scaling

    num_seqs = fluid.layers.create_tensor(dtype='int64')
    accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)

    return (pyreader, loss, probs, accuracy, labels, num_seqs)
should
be
Please
please
add
Allow
could
Add
make
need
Make
like
Provide
for
to
needs
support
a
fix
allow
provide
Feedly
suggest
Create
want
so
option
would
remove
maybe
API
feedly
us
give
integration
google
least
nice
better
as
back
feeds
wish
control
games
by
Should
helpful
also
function
d
bring
must
XAML
might
Can
custom
default
RSS
if
history
Let
author
let
in
developers
lock
from
engine
too
Include
Could
useful
textbox
feed
allowing
can
Get
feedback
extend
attributes
without
user
Enable
Would
or
property
into
functionality
specific
APP
Change
love
possibility
ALL
Give
Remove
ability
more
display
including
enable
Dialog
Twitter
improve
mark
If
UWP
single
information
consider
concept
clock
multi
performance
minute
suggestion
Update
reset
OneDrive
through
keyboard
specified
Bring
And
net
really
wanted
tools
So
include
these
service
articles
Adding
Maybe
life
controller
screenshots
manifest
making
Project
users
with
filters
email
straight
Why
think
optional
bar
trust
needed
Have
APIs
full
based
3rd
unless
greatly
e
great
Use
enabled
Center
Allowing
Preview
see
string
adding
individual
events
downloading
ru
we
re
window
everyone
priority
percentage
OPML
method
Download
ShowAsync
export
cool
Cortana
localization
your
case
per
opinion
"""script for ensemble and evaluation."""
import os
import sys
import csv
import numpy as np
from sklearn.metrics import f1_score
label_file = sys.argv[1]
prob_file_1 = sys.argv[2]
prob_file_2 = sys.argv[3]
prob_file_3 = sys.argv[4]
prob_file_4 = sys.argv[5]
def get_labels(input_file):
    """
    Get the true labels from the labels file.
    """
    lines = []
    with open(input_file, "r") as f:
        for line in csv.reader(f, delimiter=','):
            lines.append(int(line[2]))
    return lines


def get_probs(input_file):
    """
    Get probabilities from the input file, one float per line.
    """
    with open(input_file) as f:
        return [float(i.strip('\n')) for i in f]


def get_pred(probs, threshold=0.5):
    """
    Get 0/1 predictions from probabilities.
    """
    pred = []
    for p in probs:
        if p >= threshold:
            pred.append(1)
        else:
            pred.append(0)
    return pred


def vote(pred_list):
    """
    Get the majority-vote result from a list of per-model predictions.
    """
    pred_list = np.array(pred_list).transpose()
    preds = []
    for p in pred_list:
        counts = np.bincount(p)
        preds.append(np.argmax(counts))
    return preds


def cal_f1(preds, labels):
    """
    Calculate the F1 score.
    """
    return f1_score(np.array(labels), np.array(preds))


labels = get_labels(label_file)
file_list = [prob_file_1, prob_file_2, prob_file_3, prob_file_4]
pred_list = []
for f in file_list:
    pred_list.append(get_pred(get_probs(f)))
pred_ensemble = vote(pred_list)
print("all model ensemble(vote) f1: %.5f " % cal_f1(pred_ensemble, labels))
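
The voting step can be tried in isolation with the sketch below. It mirrors the `bincount`/`argmax` logic of `vote` above on made-up predictions from four hypothetical models; the prediction values are illustrative only and are not results from the paper.

```python
import numpy as np


def vote(pred_list):
    """Majority vote over per-model 0/1 predictions (one row per model)."""
    per_example = np.array(pred_list).transpose()
    return [int(np.argmax(np.bincount(p, minlength=2))) for p in per_example]


# Four hypothetical models, three examples.
preds = [
    [1, 0, 1],   # model 1
    [1, 0, 0],   # model 2
    [1, 1, 1],   # model 3
    [0, 0, 0],   # model 4
]
result = vote(preds)   # [1, 0, 0]
```

Note the third example: with an even number of models, a 2-2 tie resolves to label 0, because `np.argmax` returns the index of the first maximum.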
""" module for data reader """
import os
import sys
import re
import types
import csv
import random
import numpy as np
from batching import prepare_batch_data
sys.path.append('./BERT')
import tokenization
class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def __init__(self,
                 data_dir,
                 vocab_path,
                 max_seq_len,
                 do_lower_case,
                 in_tokens,
                 random_seed=None):
        self.data_dir = data_dir
        self.max_seq_len = max_seq_len
        self.tokenizer = tokenization.FullTokenizer(
            vocab_file=vocab_path, do_lower_case=do_lower_case)
        self.vocab = self.tokenizer.vocab
        self.in_tokens = in_tokens

        np.random.seed(random_seed)

        self.current_train_example = -1
        self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
        self.current_train_epoch = -1

    def get_train_examples(self, data_dir, drop_keyword):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for prediction."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    def convert_example(self, index, example, labels, max_seq_len, tokenizer):
        """Converts a single `InputExample` into a single `InputFeatures`."""
        feature = convert_single_example(index, example, labels, max_seq_len,
                                         tokenizer)
        return feature

    def generate_instance(self, feature):
        """
        generate instance with given feature

        Args:
            feature: InputFeatures(object). A single set of features of data.
        """
        input_pos = list(range(len(feature.input_ids)))
        return [
            feature.input_ids, feature.segment_ids, input_pos, feature.label_id
        ]

    def generate_batch_data(self,
                            batch_data,
                            total_token_num,
                            voc_size=-1,
                            mask_id=-1,
                            return_input_mask=True,
                            return_max_len=False,
                            return_num_token=False):
        """Generate batch data."""
        return prepare_batch_data(
            batch_data,
            total_token_num,
            voc_size=-1,
            pad_id=self.vocab["[PAD]"],
            cls_id=self.vocab["[CLS]"],
            sep_id=self.vocab["[SEP]"],
            mask_id=-1,
            return_input_mask=True,
            return_max_len=False,
            return_num_token=False)

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                lines.append(line)
            return lines

    def get_num_examples(self, phase):
        """Get number of examples for train, dev or test."""
        if phase not in ['train', 'dev', 'test']:
            raise ValueError(
                "Unknown phase, which should be in ['train', 'dev', 'test'].")
        return self.num_examples[phase]

    def get_train_progress(self):
        """Gets progress for training phase."""
        return self.current_train_example, self.current_train_epoch
    def data_generator_for_kfold(self,
                                 examples,
                                 batch_size,
                                 phase='train',
                                 epoch=1,
                                 dev_count=1,
                                 shuffle=True):
        """
        Generate data for train, dev or test.

        Args:
            examples: list. Train, dev or test data.
            batch_size: int. The batch size of generated data.
            phase: string. The phase for which to generate data.
            epoch: int. Total epochs to generate data.
            shuffle: bool. Whether to shuffle examples.
        """
        if phase == 'train':
            self.num_examples['train'] = len(examples)
        elif phase == 'dev':
            self.num_examples['dev'] = len(examples)
        elif phase == 'test':
            self.num_examples['test'] = len(examples)
        else:
            raise ValueError(
                "Unknown phase, which should be in ['train', 'dev', 'test'].")

        def instance_reader():
            """Process a single example and return it."""
            for epoch_index in range(epoch):
                if shuffle:
                    np.random.shuffle(examples)
                if phase == 'train':
                    self.current_train_epoch = epoch_index
                for (index, example) in enumerate(examples):
                    if phase == 'train':
                        self.current_train_example = index + 1
                    feature = self.convert_example(
                        index, example,
                        self.get_labels(), self.max_seq_len, self.tokenizer)
                    instance = self.generate_instance(feature)
                    yield instance

        def batch_reader(reader, batch_size, in_tokens):
            """Generate batch data and return."""
            batch, total_token_num, max_len = [], 0, 0
            for instance in reader():
                token_ids, sent_ids, pos_ids, label = instance[:4]
                max_len = max(max_len, len(token_ids))
                if in_tokens:
                    to_append = (len(batch) + 1) * max_len <= batch_size
                else:
                    to_append = len(batch) < batch_size
                if to_append:
                    batch.append(instance)
                    total_token_num += len(token_ids)
                else:
                    yield batch, total_token_num
                    batch, total_token_num, max_len = [instance], len(
                        token_ids), len(token_ids)

            if len(batch) > 0:
                yield batch, total_token_num

        def wrapper():
            """Data wrapper."""
            all_dev_batches = []
            for batch_data, total_token_num in batch_reader(
                    instance_reader, batch_size, self.in_tokens):
                batch_data = self.generate_batch_data(
                    batch_data,
                    total_token_num,
                    voc_size=-1,
                    mask_id=-1,
                    return_input_mask=True,
                    return_max_len=False,
                    return_num_token=False)
                if len(all_dev_batches) < dev_count:
                    all_dev_batches.append(batch_data)

                if len(all_dev_batches) == dev_count:
                    for batch in all_dev_batches:
                        yield batch
                    all_dev_batches = []

        return wrapper
    def data_generator(self,
                       batch_size,
                       phase='train',
                       epoch=1,
                       dev_count=1,
                       shuffle=True,
                       drop_keyword=False):
        """
        Generate data for train, dev or test.

        Args:
            batch_size: int. The batch size of generated data.
            phase: string. The phase for which to generate data.
            epoch: int. Total epochs to generate data.
            shuffle: bool. Whether to shuffle examples.
        """
        if phase == 'train':
            examples = self.get_train_examples(
                self.data_dir, drop_keyword=drop_keyword)
            self.num_examples['train'] = len(examples)
        elif phase == 'dev':
            examples = self.get_dev_examples(self.data_dir)
            self.num_examples['dev'] = len(examples)
        elif phase == 'test':
            examples = self.get_test_examples(self.data_dir)
            self.num_examples['test'] = len(examples)
        else:
            raise ValueError(
                "Unknown phase, which should be in ['train', 'dev', 'test'].")

        def instance_reader():
            """Process a single example and return it."""
            for epoch_index in range(epoch):
                if shuffle:
                    np.random.shuffle(examples)
                if phase == 'train':
                    self.current_train_epoch = epoch_index
                for (index, example) in enumerate(examples):
                    if phase == 'train':
                        self.current_train_example = index + 1
                    feature = self.convert_example(
                        index, example,
                        self.get_labels(), self.max_seq_len, self.tokenizer)
                    instance = self.generate_instance(feature)
                    yield instance

        def batch_reader(reader, batch_size, in_tokens):
            """Generate batch data and return."""
            batch, total_token_num, max_len = [], 0, 0
            for instance in reader():
                token_ids, sent_ids, pos_ids, label = instance[:4]
                max_len = max(max_len, len(token_ids))
                if in_tokens:
                    to_append = (len(batch) + 1) * max_len <= batch_size
                else:
                    to_append = len(batch) < batch_size
                if to_append:
                    batch.append(instance)
                    total_token_num += len(token_ids)
                else:
                    yield batch, total_token_num
                    batch, total_token_num, max_len = [instance], len(
                        token_ids), len(token_ids)

            if len(batch) > 0:
                yield batch, total_token_num

        def wrapper():
            """Data wrapper."""
            all_dev_batches = []
            for batch_data, total_token_num in batch_reader(
                    instance_reader, batch_size, self.in_tokens):
                batch_data = self.generate_batch_data(
                    batch_data,
                    total_token_num,
                    voc_size=-1,
                    mask_id=-1,
                    return_input_mask=True,
                    return_max_len=False,
                    return_num_token=False)
                if len(all_dev_batches) < dev_count:
                    all_dev_batches.append(batch_data)

                if len(all_dev_batches) == dev_count:
                    for batch in all_dev_batches:
                        yield batch
                    all_dev_batches = []

        return wrapper
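
The two batching modes of `batch_reader` (count-based, and token-budget-based when `in_tokens` is set) can be seen in a reduced, stand-alone sketch. The function name `batch_by_budget` and the toy instances are hypothetical; the sketch keeps only the grouping rule and omits token counting and tensor generation.

```python
def batch_by_budget(instances, batch_size, in_tokens):
    """Group token-id lists into batches.

    in_tokens=False: at most `batch_size` instances per batch.
    in_tokens=True:  the padded size (instances * longest length) stays
                     within `batch_size` tokens.
    """
    batch, max_len = [], 0
    for token_ids in instances:
        max_len = max(max_len, len(token_ids))
        if in_tokens:
            fits = (len(batch) + 1) * max_len <= batch_size
        else:
            fits = len(batch) < batch_size
        if fits:
            batch.append(token_ids)
        else:
            yield batch
            batch, max_len = [token_ids], len(token_ids)
    if batch:
        yield batch


# Hypothetical instances of lengths 3, 2, 4 and 1.
data = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
by_count = list(batch_by_budget(data, 2, in_tokens=False))
by_tokens = list(batch_by_budget(data, 6, in_tokens=True))
```

With a budget of 6 tokens, the length-4 instance ends up alone in its batch, because adding any second instance would pad the batch to 8 tokens.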
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.

        Args:
          guid: Unique id for the example.
          text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
class SemevalTask9Processor(DataProcessor):
    """Processor for Semeval Task9 data set."""

    def get_train_examples(self, data_dir, header=False, drop_keyword=False):
        lines = self._read_csv(data_dir + '/V1.4_Training.csv')
        examples = []
        if drop_keyword:
            keywords = [
                line.strip() for line in open(data_dir + '/../keywords')
            ]
        for i, line in enumerate(lines):
            if i == 0 and header:
                continue
            guid = line[0]
            text_a = tokenization.convert_to_unicode(line[1])
            text_a = clean_str(text_a)
            if drop_keyword:
                new_tokens = []
                for w in text_a.split(' '):
                    if w in keywords and random.random() > 0.8:
                        continue
                    new_tokens.append(w)
                text_a = ' '.join(new_tokens)
            text_b = None
            label = line[2]
            examples.append(
                InputExample(
                    guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    def get_dev_examples(self, data_dir, header=True):
        lines = self._read_csv(data_dir + '/SubtaskA_Trial_Test_Labeled.csv')
        examples = []
        for i, line in enumerate(lines):
            if i == 0 and header:
                continue
            guid = line[0]
            text_a = clean_str(line[1])
            text_a = tokenization.convert_to_unicode(text_a)
            text_b = None
            label = line[2]
            examples.append(
                InputExample(
                    guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    def get_test_examples(self, data_dir, header=False):
        lines = self._read_csv(data_dir +
                               '/SubtaskA_EvaluationData_labeled.csv')
        examples = []
        for i, line in enumerate(lines):
            if i == 0 and header:
                continue
            guid = line[0]
            text_a = clean_str(line[1])
            text_a = tokenization.convert_to_unicode(text_a)
            text_b = None
            label = line[2]
            examples.append(
                InputExample(
                    guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    @classmethod
    def _read_csv(cls, input_file):
        """Reads a comma separated value file."""
        readers = csv.reader(open(input_file, "r"), delimiter=',')
        lines = []
        for line in readers:
            lines.append(line)
        return lines
def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = string.strip('\n').replace('\n', ' ')
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string


def convert_single_example_to_unicode(guid, single_example):
    """Convert single example to unicode."""
    text_a = tokenization.convert_to_unicode(single_example[0])
    text_b = tokenization.convert_to_unicode(single_example[1])
    label = tokenization.convert_to_unicode(single_example[2])
    return InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
label_id = label_map[example.label]
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer):
"""Convert a set of `InputExample`s to a list of `InputFeatures`."""
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
print("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
features.append(feature)
return features
if __name__ == '__main__':
pass
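The token and segment-id layout built by `convert_single_example` can be sketched without the BERT tokenizer. This is a minimal illustration, assuming tokens are already split; the function name `build_inputs` is hypothetical, and the pair truncation (trim the longer side first) is an assumption about what `_truncate_seq_pair` does, since that helper is not shown here.

```python
def build_inputs(tokens_a, tokens_b=None, max_seq_length=16):
    """Assemble BERT-style tokens, segment ids and input mask for one example."""
    tokens_a = list(tokens_a)
    tokens_b = list(tokens_b) if tokens_b else None

    if tokens_b:
        # Reserve 3 slots for [CLS], [SEP], [SEP].
        # Assumption: trim the longer sequence one token at a time.
        while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
            longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
            longer.pop()
    else:
        # Reserve 2 slots for [CLS] and [SEP].
        tokens_a = tokens_a[:max_seq_length - 2]

    # First segment (type id 0): [CLS] tokens_a [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    # Optional second segment (type id 1): tokens_b [SEP]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    # 1 marks a real token; padding (added later, per batch) would be 0.
    input_mask = [1] * len(tokens)
    return tokens, segment_ids, input_mask


if __name__ == "__main__":
    print(build_inputs("the dog is hairy .".split()))
    print(build_inputs(["is", "this"], ["no", "it"]))
```

For the single-sentence task in this reader, `text_b` is always `None`, so only the first branch of the layout is exercised.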
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import time
import argparse
import numpy as np
import multiprocessing
import paddle
import paddle.fluid as fluid
from classifier import create_model
import reader
sys.path.append("./BERT")
from model.bert import BertConfig
from optimization import optimization
from utils.args import ArgumentGroup, print_arguments
from utils.init import init_pretraining_params, init_checkpoint
import scipy
from sklearn.model_selection import KFold, StratifiedKFold
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("bert_config_path", str, None, "Path to the json file for bert model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params from which fine-tuning is performed. If the "
"arg 'init_checkpoint' has been set, this argument is ignored.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epochs for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("data_dir", str, None, "Path to training data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest sequence.")
data_g.add_arg("batch_size", int, 32, "Total number of examples in a batch for training. See also --in_tokens.")
data_g.add_arg("in_tokens", bool, False,
"If set, the batch size will be the maximum number of tokens in one batch. "
"Otherwise, it will be the maximum number of examples in one batch.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "The iteration interval at which temporary variables are cleaned up.")
run_type_g.add_arg("task_name", str, "sem",
"The name of the task to fine-tune on; only 'sem' is registered in this script.")
run_type_g.add_arg("sub_model_type", str, "raw",
"The type of sub model to use, should be in {'raw', 'cnn', 'gru', 'ffa'}.")
run_type_g.add_arg("ksplit", int, -1,
"If ksplit > 0, use k-fold training.")
run_type_g.add_arg("drop_keyword", bool, False,
"Whether to drop keywords for data augmentation.")
run_type_g.add_arg("kfold_type", str, "normal",
"The type of kfold should be in {'normal', 'stratified'}")
run_type_g.add_arg("binary", bool, True, "Whether the task is binary classification.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
args = parser.parse_args()
# yapf: enable
def evaluate(exe, test_program, test_pyreader, fetch_list, eval_phase):
    """
    evaluation for dev and test dataset.
    """
    test_pyreader.start()
    total_cost, total_acc, total_num_seqs = 0.0, 0.0, 0.0
    total_label_pos_num, total_pred_pos_num, total_correct_num = 0.0, 0.0, 0.0
    qids, labels, scores = [], [], []
    time_begin = time.time()
    while True:
        try:
            np_loss, np_acc, np_probs, np_labels, np_num_seqs = exe.run(
                program=test_program, fetch_list=fetch_list)
            total_cost += np.sum(np_loss * np_num_seqs)
            total_acc += np.sum(np_acc * np_num_seqs)
            total_num_seqs += np.sum(np_num_seqs)
            labels.extend(np_labels.reshape((-1)).tolist())
            scores.extend(np_probs[:, 1].reshape(-1).tolist())
            np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
            total_label_pos_num += np.sum(np_labels)
            total_pred_pos_num += np.sum(np_preds)
            total_correct_num += np.sum(np.dot(np_preds, np_labels))
        except fluid.core.EOFException:
            test_pyreader.reset()
            break
    time_end = time.time()
    # Guard against empty denominators (e.g. no positive predictions at all).
    r = total_correct_num / total_label_pos_num if total_label_pos_num > 0 else 0.0
    p = total_correct_num / total_pred_pos_num if total_pred_pos_num > 0 else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    print(
        "[%s evaluation] ave loss: %f, ave_acc: %f, p: %f, r: %f, f1: %f, data_num: %d, elapsed time: %f s"
        % (eval_phase, total_cost / total_num_seqs,
           total_acc / total_num_seqs, p, r, f, total_num_seqs,
           time_end - time_begin))
def predict(exe, test_program, test_pyreader, fetch_list, eval_phase, output_file):
    """
    predict function
    """
    test_pyreader.start()
    qids, scores = [], []
    time_begin = time.time()
    while True:
        try:
            np_probs, np_num_seqs = exe.run(
                program=test_program, fetch_list=fetch_list)
            scores.extend(np_probs[:, 1].reshape(-1).tolist())
        except fluid.core.EOFException:
            test_pyreader.reset()
            break
    time_end = time.time()
    with open(output_file, 'w') as w:
        for prob in scores:
            w.write(str(prob) + '\n')
def train_kfold(args):
    """
    main program for training kfold.
    """
    task_name = args.task_name.lower()
    processors = {
        'sem': reader.SemevalTask9Processor,
    }
    processor = processors[task_name](data_dir=args.data_dir,
                                      vocab_path=args.vocab_path,
                                      max_seq_len=args.max_seq_len,
                                      do_lower_case=args.do_lower_case,
                                      in_tokens=args.in_tokens,
                                      random_seed=args.random_seed)
    if not (args.do_train or args.do_val or args.do_test):
        raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
                         "least one of them must be True.")
    train_examples = processor.get_train_examples(args.data_dir, drop_keyword=args.drop_keyword)
    test_examples = processor.get_test_examples(args.data_dir)
    if args.kfold_type == 'normal':
        kf = KFold(n_splits=args.ksplit, shuffle=True, random_state=args.random_seed)
        kf_iter = kf.split(train_examples)
    elif args.kfold_type == 'stratified':
        kf = StratifiedKFold(n_splits=args.ksplit, shuffle=True, random_state=args.random_seed)
        train_labels = [e.label for e in train_examples]
        kf_iter = kf.split(train_examples, train_labels)
    else:
        raise NotImplementedError("%s is not implemented" % args.kfold_type)
    for fold, (train_idx, val_idx) in enumerate(kf_iter):
        print("==================== fold %d ===================" % fold)
        train_fold = np.array(train_examples)[train_idx]
        dev_fold = np.array(train_examples)[val_idx]
        test_examples = np.array(test_examples)
        kfold_program(args, processor, train_fold, dev_fold, test_examples, str(fold))
def kfold_program(args, processor, train_examples, dev_examples, test_examples, fold):
"""
training program for kfold.
"""
bert_config = BertConfig(args.bert_config_path)
bert_config.print_config()
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
num_labels = len(processor.get_labels())
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = processor.data_generator_for_kfold(
examples=train_examples,
batch_size=args.batch_size,
phase='train',
epoch=args.epoch,
dev_count=dev_count,
shuffle=True)
num_train_examples = processor.get_num_examples(phase='train')
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d" % dev_count)
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, loss, probs, accuracy, labels, num_seqs = create_model(
args,
pyreader_name=fold + 'train_reader',
bert_config=bert_config,
num_labels=num_labels)
scheduled_lr = optimization(
loss=loss,
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
loss_scaling=args.loss_scaling)
fluid.memory_optimize(
input_program=train_program,
skip_opt_set=[
loss.name, probs.name, accuracy.name, num_seqs.name
])
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
print("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, loss, probs, accuracy, labels, num_seqs = create_model(
args,
pyreader_name=fold + 'test_reader',
bert_config=bert_config,
num_labels=num_labels)
test_prog = test_prog.clone(for_test=True)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.do_val or args.do_test:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if "
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.use_experimental_executor = args.use_fast_executor
exec_strategy.num_threads = dev_count
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=loss.name,
exec_strategy=exec_strategy,
main_program=train_program)
train_pyreader.decorate_tensor_provider(train_data_generator)
else:
train_exe = None
if args.do_val or args.do_test:
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_prog,
share_vars_from=train_exe)
if args.do_train:
train_pyreader.start()
steps = 0
total_cost, total_acc, total_num_seqs = [], [], []
time_begin = time.time()
while True:
try:
steps += 1
if steps % args.skip_steps == 0:
if warmup_steps <= 0:
fetch_list = [loss.name, accuracy.name, num_seqs.name]
else:
fetch_list = [
loss.name, accuracy.name, scheduled_lr.name,
num_seqs.name
]
else:
fetch_list = []
outputs = train_exe.run(fetch_list=fetch_list)
if steps % args.skip_steps == 0:
if warmup_steps <= 0:
np_loss, np_acc, np_num_seqs = outputs
else:
np_loss, np_acc, np_lr, np_num_seqs = outputs
total_cost.extend(np_loss * np_num_seqs)
total_acc.extend(np_acc * np_num_seqs)
total_num_seqs.extend(np_num_seqs)
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size(
)
verbose += "learning rate: %f" % (
np_lr[0]
if warmup_steps > 0 else args.learning_rate)
print(verbose)
current_example, current_epoch = processor.get_train_progress(
)
time_end = time.time()
used_time = time_end - time_begin
print("epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
"ave acc: %f, speed: %f steps/s" %
(current_epoch, current_example, num_train_examples,
steps, np.sum(total_cost) / np.sum(total_num_seqs),
np.sum(total_acc) / np.sum(total_num_seqs),
args.skip_steps / used_time))
total_cost, total_acc, total_num_seqs = [], [], []
time_begin = time.time()
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints, fold,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.decorate_tensor_provider(
processor.data_generator_for_kfold(
examples=dev_examples,
batch_size=args.batch_size,
phase='dev',
epoch=1,
dev_count=1,
shuffle=False))
evaluate(exe, test_prog, test_pyreader,
[loss.name, accuracy.name, probs.name, labels.name, num_seqs.name],
"dev")
# evaluate test set
if args.do_test:
test_pyreader.decorate_tensor_provider(
processor.data_generator_for_kfold(
examples=test_examples,
batch_size=args.batch_size,
phase='test',
epoch=1,
dev_count=1,
shuffle=False))
evaluate(exe, test_prog, test_pyreader,
[loss.name, accuracy.name, probs.name, labels.name, num_seqs.name],
"test")
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints, fold, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
# final eval on dev set
if args.do_val:
test_pyreader.decorate_tensor_provider(
processor.data_generator_for_kfold(
examples=dev_examples,
batch_size=args.batch_size, phase='dev', epoch=1, dev_count=1,
shuffle=False))
print("Final validation result:")
evaluate(exe, test_prog, test_pyreader,
[loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], "dev")
# final eval on test set
if args.do_test:
test_pyreader.decorate_tensor_provider(
processor.data_generator_for_kfold(
examples=test_examples,
batch_size=args.batch_size,
phase='test',
epoch=1,
dev_count=1,
shuffle=False))
print("Final test result:")
evaluate(exe, test_prog, test_pyreader,
[loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], "test")
exe.close()
def train_single(args):
"""
training program.
"""
bert_config = BertConfig(args.bert_config_path)
bert_config.print_config()
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
task_name = args.task_name.lower()
processors = {
'sem': reader.SemevalTask9Processor,
}
processor = processors[task_name](data_dir=args.data_dir,
vocab_path=args.vocab_path,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=args.in_tokens,
random_seed=args.random_seed)
num_labels = len(processor.get_labels())
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = processor.data_generator(
batch_size=args.batch_size,
phase='train',
epoch=args.epoch,
dev_count=dev_count,
shuffle=True,
drop_keyword=args.drop_keyword)
num_train_examples = processor.get_num_examples(phase='train')
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d" % dev_count)
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, loss, probs, accuracy, labels, num_seqs = create_model(
args,
pyreader_name='train_reader',
bert_config=bert_config,
num_labels=num_labels)
scheduled_lr = optimization(
loss=loss,
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
loss_scaling=args.loss_scaling)
fluid.memory_optimize(
input_program=train_program,
skip_opt_set=[
loss.name, probs.name, accuracy.name, num_seqs.name
])
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
print("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, loss, probs, accuracy, labels, num_seqs = create_model(
args,
pyreader_name='test_reader',
bert_config=bert_config,
num_labels=num_labels)
test_prog = test_prog.clone(for_test=True)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.do_val or args.do_test:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if "
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.use_experimental_executor = args.use_fast_executor
exec_strategy.num_threads = dev_count
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=loss.name,
exec_strategy=exec_strategy,
main_program=train_program)
train_pyreader.decorate_tensor_provider(train_data_generator)
else:
train_exe = None
if args.do_val or args.do_test:
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_prog,
share_vars_from=train_exe)
if args.do_train:
train_pyreader.start()
steps = 0
total_cost, total_acc, total_num_seqs = [], [], []
time_begin = time.time()
while True:
try:
steps += 1
if steps % args.skip_steps == 0:
if warmup_steps <= 0:
fetch_list = [loss.name, accuracy.name, num_seqs.name]
else:
fetch_list = [
loss.name, accuracy.name, scheduled_lr.name,
num_seqs.name
]
else:
fetch_list = []
outputs = train_exe.run(fetch_list=fetch_list)
if steps % args.skip_steps == 0:
if warmup_steps <= 0:
np_loss, np_acc, np_num_seqs = outputs
else:
np_loss, np_acc, np_lr, np_num_seqs = outputs
total_cost.extend(np_loss * np_num_seqs)
total_acc.extend(np_acc * np_num_seqs)
total_num_seqs.extend(np_num_seqs)
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size(
)
verbose += "learning rate: %f" % (
np_lr[0]
if warmup_steps > 0 else args.learning_rate)
print(verbose)
current_example, current_epoch = processor.get_train_progress(
)
time_end = time.time()
used_time = time_end - time_begin
print("epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
"ave acc: %f, speed: %f steps/s" %
(current_epoch, current_example, num_train_examples,
steps, np.sum(total_cost) / np.sum(total_num_seqs),
np.sum(total_acc) / np.sum(total_num_seqs),
args.skip_steps / used_time))
total_cost, total_acc, total_num_seqs = [], [], []
time_begin = time.time()
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.decorate_tensor_provider(
processor.data_generator(
batch_size=args.batch_size,
phase='dev',
epoch=1,
dev_count=1,
shuffle=False))
evaluate(exe, test_prog, test_pyreader,
[loss.name, accuracy.name, probs.name, labels.name, num_seqs.name],
"dev")
# evaluate test set
if args.do_test:
test_pyreader.decorate_tensor_provider(
processor.data_generator(
batch_size=args.batch_size,
phase='test',
epoch=1,
dev_count=1,
shuffle=False))
evaluate(exe, test_prog, test_pyreader,
[loss.name, accuracy.name, probs.name, labels.name, num_seqs.name],
"test")
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
# final eval on dev set
if args.do_val:
test_pyreader.decorate_tensor_provider(
processor.data_generator(
batch_size=args.batch_size, phase='dev', epoch=1, dev_count=1,
shuffle=False))
print("Final validation result:")
evaluate(exe, test_prog, test_pyreader,
[loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], "dev")
# final eval on test set
if args.do_test:
test_pyreader.decorate_tensor_provider(
processor.data_generator(
batch_size=args.batch_size,
phase='test',
epoch=1,
dev_count=1,
shuffle=False))
print("Final test result:")
predict(exe, test_prog, test_pyreader,
[probs.name, num_seqs.name], "test", args.checkpoints + '/prob.txt')
if __name__ == '__main__':
    print_arguments(args)
    if args.ksplit <= 0:
        train_single(args)
    else:
        train_kfold(args)
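The precision, recall and F1 printed by `evaluate` are derived from three accumulated counts: correctly predicted positives, predicted positives, and labeled positives. A minimal pure-Python sketch of that arithmetic follows. The function name `precision_recall_f1` is hypothetical, and falling back to 0.0 when a denominator is zero is an assumption made here for robustness.

```python
def precision_recall_f1(preds, labels):
    """Binary precision, recall and F1 for 0/1 predictions and 0/1 gold labels."""
    # Same three counts that evaluate() accumulates batch by batch.
    correct = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    pred_pos = sum(1 for p in preds if p == 1)
    label_pos = sum(1 for y in labels if y == 1)

    # Assumption: report 0.0 instead of dividing by zero.
    precision = correct / pred_pos if pred_pos > 0 else 0.0
    recall = correct / label_pos if label_pos > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # One true positive, one false positive, one false negative.
    print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

Only the positive class ("1", i.e. suggestion) is scored, which matches the way the script sums predictions and labels directly.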
#!/bin/sh
export CUDA_VISIBLE_DEVICES=0
output_dir=./output
prob_dir=./probs
bert_dir=./uncased_L-24_H-1024_A-16
mkdir -p "$output_dir"
mkdir -p "$prob_dir"
for model_type in raw cnn gru ffa
do
    python run_classifier.py \
        --bert_config_path ${bert_dir}/bert_config.json \
        --checkpoints ${output_dir}/bert_large_${model_type} \
        --init_pretraining_params ${bert_dir}/params \
        --data_dir ./data/Subtask-A \
        --vocab_path ${bert_dir}/vocab.txt \
        --task_name sem \
        --sub_model_type ${model_type} \
        --max_seq_len 128 \
        --batch_size 32 \
        --random_seed 777 \
        --save_steps 200 \
        --validation_steps 200 \
        --drop_keyword True
    mv ${output_dir}/bert_large_${model_type}/prob.txt ${prob_dir}/prob_${model_type}.txt
done
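With the flags above (`--batch_size 32`, plus the defaults `--epoch 3` and `--warmup_proportion 0.1`), the "Max train steps" and "Num warmup steps" values printed by run_classifier.py follow simple integer arithmetic. The sketch below reproduces it; the function name `schedule_steps` and the example count of 8500 are hypothetical, chosen only for illustration.

```python
def schedule_steps(num_examples, epoch, batch_size, dev_count,
                   warmup_proportion, in_tokens=False, max_seq_len=128):
    """Return (max_train_steps, warmup_steps) as computed before training."""
    if in_tokens:
        # batch_size counts tokens, so convert it to examples per batch.
        examples_per_batch = batch_size // max_seq_len
    else:
        examples_per_batch = batch_size
    # Floor division is applied left to right, as in the training script.
    max_train_steps = epoch * num_examples // examples_per_batch // dev_count
    warmup_steps = int(max_train_steps * warmup_proportion)
    return max_train_steps, warmup_steps


if __name__ == "__main__":
    # Hypothetical dataset size; one device, as with CUDA_VISIBLE_DEVICES=0.
    print(schedule_steps(num_examples=8500, epoch=3, batch_size=32,
                         dev_count=1, warmup_proportion=0.1))
```

Under these assumed numbers the schedule has 796 training steps, of which the first 79 are linear warmup.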
## PaddleNLP for Research
This directory collects advanced, high-quality NLP research papers.
The code and datasets are fully open, so that researchers can quickly get up to speed on frontier directions in NLP.