Unverified commit 4fa5b715, authored by Yibing Liu, committed via GitHub

Add more research projects (#2611)

Parent 8d63308d
Data
=====
This dataset accompanies our paper: ARNOR: Attention Regularization based Noise Reduction for Distant Supervision Relation Classification. The test set is intended for sentence-level evaluation.
The original data comes from the dataset in the paper CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases. It is a distant-supervision dataset built from the NYT (New York Times) corpus, and its test set is annotated by humans. However, the number of positive instances in that test set is small, so we revised it and annotated additional test data.
In each data file, every line is a JSON string with the following structure:
{
    "sentText": "The source sentence text",
    "relationMentions": [
        {
            "em1Text": "The first entity in the relation",
            "em2Text": "The second entity in the relation",
            "label": "Relation label",
            "is_noise": false  # appears only in the test set
        },
        ...
    ],
    "entityMentions": [
        {
            "text": "Entity words",
            "label": "Entity type",
            ...
        },
        ...
    ],
    ...
}
Data version 1.0.0
=====
This version of the dataset is the original one used in our paper. It includes four files: train.json, test.json, dev_part.json, and test_part.json, where dev_part.json and test_part.json are split from test.json. The dataset can be downloaded here: https://baidu-nlp.bj.bcebos.com/arnor_dataset-1.0.0.tar.gz
Data version 2.0.0
=====
More test data is coming soon.
# Multi-Perspective Models
This model won first place in SemEval 2019 Task 9 Subtask A: Suggestion Mining from Online Reviews and Forums.
See more information about SemEval 2019: [http://alt.qcri.org/semeval2019/](http://alt.qcri.org/semeval2019/)
## 1. Introduction
This work describes our system submitted to Task 9 of SemEval-2019. The task focuses on suggestion mining: it aims to classify given sentences into suggestion and non-suggestion classes, in a domain-specific and a cross-domain training setting respectively. We propose a multi-perspective architecture for learning representations using different classical models, including Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), and Feed-Forward Attention (FFA). To leverage the semantics distributed in large amounts of unlabeled data, we also adopt the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model as an encoder to produce sentence and word representations. The proposed architecture is applied to both subtasks, achieving an F1-score of 0.7812 on subtask A and 0.8579 on subtask B, which won first and second place respectively in the final competition.
## 2. Quick Start
### Installation
This project requires Python 2.7 and PaddlePaddle Fluid 1.3.2; please follow the [quick start](http://www.paddlepaddle.org/#quick-start) guide to install PaddlePaddle.
### Data Preparation
- Download the competition's data
```
# Download the competition's data
cd ./data && git clone https://github.com/Semeval2019Task9/Subtask-A.git
cd ../
```
- Download the BERT code and pre-trained model
```
# Download BERT code
git clone https://github.com/PaddlePaddle/LARK && mv LARK/BERT ./
# Download BERT pre-trained model
wget https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz
tar zxf uncased_L-24_H-1024_A-16.tar.gz -C ./
```
### Train
Use this command to start training:
```
# run training script
sh train.sh
```
The trained models will be written to ./output .
### Ensemble & Evaluation
Use this command to evaluate the ensemble result:
```
# run evaluation
python evaluation.py \
./data/Subtask-A/SubtaskA_EvaluationData_labeled.csv \
./probs/prob_raw.txt \
./probs/prob_cnn.txt \
./probs/prob_gru.txt \
./probs/prob_ffa.txt
```
Because the dataset is small, training results may fluctuate; consider re-training several times and comparing the results.
## 3. Advanced
### Task Introduction
[Semeval2019-Task9](https://www.aclweb.org/anthology/S19-2151) presents the pilot SemEval task on suggestion mining. The task consists of subtasks A and B, whose labeled data come from a suggestion forum and hotel reviews respectively. Examples:
|Source |Sentence |Label|
|------| ------|------|
|Hotel reviews |Be sure to specify a room at the back of the hotel. |suggestion|
|Hotel reviews |The point is, don’t advertise the service if there are caveats that go with it.|non-suggestion|
|Suggestion forum| Why not let us have several pages that we can put tiles on and name whatever we want to |suggestion|
|Suggestion forum| It fails with a uninformative message indicating deployment failed.|non-suggestion|
### Model Introduction
The model framework is shown in Figure 1:
<p align="center">
<img src="data/mpm.png"/> <br />
<b>Figure 1: An overall framework and pipeline of our system for suggestion mining</b>
</p>
As shown in Figure 1, our model architecture consists of two modules: a universal encoding module serving as either a sentence or a word encoder, and a task-specific module used for suggestion classification. To fully exploit the information generated by the encoder, we stack a series of different task-specific modules on top of it, each corresponding to a different perspective. Intuitively, the sentence encoding can be used directly for classification. Going further, since language is essentially time-series information, GRU cells can model the sequence state from a temporal perspective and learn structure useful for suggestion mining. Similarly, a CNN can mimic an n-gram model from a spatial perspective. Moreover, we introduce a lightweight attention mechanism, FFA (Raffel and Ellis, 2015), which automatically learns a combination of the most important features. Finally, we ensemble these models with a voting strategy to produce the system's final prediction.
### Result
| Models | CV F1-score | Test score |
| ----- | ----- | ------ |
| BERT-Large-Logistic | 0.8522 (±0.0213) | 0.7697 |
| BERT-Large-Conv | 0.8520 (±0.0231) | 0.7800 |
| BERT-Large-FFA | 0.8516 (±0.0307) | 0.7722 |
| BERT-Large-GRU | 0.8503 (±0.0275) | 0.7725 |
| Ensemble | – | 0.7812 |
## 4. Others
If you use this library in your research project, please cite the paper "OleNet at SemEval-2019 Task 9: BERT based Multi-Perspective Models for Suggestion Mining".
### Citation
```
@inproceedings{BaiduMPM,
title={OleNet at SemEval-2019 Task 9: BERT based Multi-Perspective Models for Suggestion Mining},
author={Jiaxiang Liu and Shuohuan Wang and Yu Sun},
booktitle={Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019)},
year={2019}
}
```
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
"""
Add masks to batch_tokens; return the masked tokens, mask_label, and mask_pos.
Note: mask_pos refers to positions in batch_tokens after padding.
"""
max_len = max([len(sent) for sent in batch_tokens])
mask_label = []
mask_pos = []
prob_mask = np.random.rand(total_token_num)
# Note: the first token is [CLS], so [low=1]
replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)
pre_sent_len = 0
prob_index = 0
for sent_index, sent in enumerate(batch_tokens):
mask_flag = False
prob_index += pre_sent_len
for token_index, token in enumerate(sent):
prob = prob_mask[prob_index + token_index]
if prob > 0.15:
continue
elif 0.03 < prob <= 0.15:
# mask
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
elif 0.015 < prob <= 0.03:
# random replace
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = replace_ids[prob_index + token_index]
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
else:
# keep the original token
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
mask_pos.append(sent_index * max_len + token_index)
pre_sent_len = len(sent)
# ensure that at least one token in each sentence is masked
while not mask_flag:
token_index = int(np.random.randint(1, high=len(sent) - 1, size=1))
if sent[token_index] != SEP and sent[token_index] != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
return batch_tokens, mask_label, mask_pos
def prepare_batch_data(insts,
total_token_num,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
1. generate Tensor of data
2. generate Tensor of position
3. generate self attention mask, [shape: batch_size * max_len * max_len]
"""
batch_src_ids = [inst[0] for inst in insts]
batch_sent_ids = [inst[1] for inst in insts]
batch_pos_ids = [inst[2] for inst in insts]
seq_len = np.array(
[[len(inst[0])] for inst in insts]).astype("int64").reshape([-1, 1])
labels_list = []
# compatible with squad, whose example includes start/end positions,
# or unique id
for i in range(3, len(insts[0]), 1):
labels = [inst[i] for inst in insts]
labels = np.array(labels).astype("int64").reshape([-1, 1])
labels_list.append(labels)
# First step: do mask without padding
if mask_id >= 0:
out, mask_label, mask_pos = mask(
batch_src_ids,
total_token_num,
vocab_size=voc_size,
CLS=cls_id,
SEP=sep_id,
MASK=mask_id)
else:
out = batch_src_ids
# Second step: padding
src_id, self_input_mask = pad_batch_data(
out, pad_idx=pad_id, return_input_mask=True)
pos_id = pad_batch_data(
batch_pos_ids,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
sent_id = pad_batch_data(
batch_sent_ids,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
if mask_id >= 0:
return_list = [
src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
] + labels_list
else:
return_list = [src_id, pos_id, sent_id, self_input_mask, seq_len
] + labels_list
return return_list if len(return_list) > 1 else return_list[0]
def pad_batch_data(insts,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_len=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and input mask.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array([
list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts
])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array(
[[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_len:
seq_len = np.array([[len(inst)] for inst in insts])
return_list += [seq_len.astype("int64").reshape([-1, 1])]
return return_list if len(return_list) > 1 else return_list[0]
if __name__ == "__main__":
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import numpy as np
import paddle.fluid as fluid
sys.path.append("./BERT")
from model.bert import BertModel
def create_model(args,
pyreader_name,
bert_config,
num_labels,
is_prediction=False):
"""
define fine-tuning model
"""
if args.binary:
pyreader = fluid.layers.py_reader(
capacity=50,
shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, 1], [-1, 1]],
dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
lod_levels=[0, 0, 0, 0, 0, 0],
name=pyreader_name,
use_double_buffer=True)
(src_ids, pos_ids, sent_ids, input_mask, seq_len,
labels) = fluid.layers.read_file(pyreader)
bert = BertModel(
src_ids=src_ids,
position_ids=pos_ids,
sentence_ids=sent_ids,
input_mask=input_mask,
config=bert_config,
use_fp16=args.use_fp16)
if args.sub_model_type == 'raw':
cls_feats = bert.get_pooled_output()
elif args.sub_model_type == 'cnn':
bert_seq_out = bert.get_sequence_output()
bert_seq_out = fluid.layers.sequence_unpad(bert_seq_out, seq_len)
cnn_hidden_size = 100
convs = []
for h in [3, 4, 5]:
conv_feats = fluid.layers.sequence_conv(
input=bert_seq_out, num_filters=cnn_hidden_size, filter_size=h)
conv_feats = fluid.layers.batch_norm(input=conv_feats, act="relu")
conv_feats = fluid.layers.sequence_pool(
input=conv_feats, pool_type='max')
convs.append(conv_feats)
cls_feats = fluid.layers.concat(input=convs, axis=1)
elif args.sub_model_type == 'gru':
bert_seq_out = bert.get_sequence_output()
bert_seq_out = fluid.layers.sequence_unpad(bert_seq_out, seq_len)
gru_hidden_size = 1024
gru_input = fluid.layers.fc(input=bert_seq_out,
size=gru_hidden_size * 3)
gru_forward = fluid.layers.dynamic_gru(
input=gru_input, size=gru_hidden_size, is_reverse=False)
gru_backward = fluid.layers.dynamic_gru(
input=gru_input, size=gru_hidden_size, is_reverse=True)
gru_output = fluid.layers.concat([gru_forward, gru_backward], axis=1)
cls_feats = fluid.layers.sequence_pool(
input=gru_output, pool_type='max')
elif args.sub_model_type == 'ffa':
bert_seq_out = bert.get_sequence_output()
attn = fluid.layers.fc(input=bert_seq_out,
num_flatten_dims=2,
size=1,
act='tanh')
attn = fluid.layers.softmax(attn)
weighted_input = bert_seq_out * attn
weighted_input = fluid.layers.sequence_unpad(weighted_input, seq_len)
cls_feats = fluid.layers.sequence_pool(weighted_input, pool_type='sum')
else:
raise NotImplementedError("%s is not implemented!" %
args.sub_model_type)
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=cls_feats,
size=num_labels,
param_attr=fluid.ParamAttr(
name="cls_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
probs = fluid.layers.softmax(logits)
if is_prediction:
feed_targets_name = [
src_ids.name, pos_ids.name, sent_ids.name, input_mask.name
]
return pyreader, probs, feed_targets_name
ce_loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=labels)
loss = fluid.layers.mean(x=ce_loss)
if args.use_fp16 and args.loss_scaling > 1.0:
loss *= args.loss_scaling
num_seqs = fluid.layers.create_tensor(dtype='int64')
accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
return (pyreader, loss, probs, accuracy, labels, num_seqs)
should
be
Please
please
add
Allow
could
Add
make
need
Make
like
Provide
for
to
needs
support
a
fix
allow
provide
Feedly
suggest
Create
want
so
option
would
remove
maybe
API
feedly
us
give
integration
google
least
nice
better
as
back
feeds
wish
control
games
by
Should
helpful
also
function
d
bring
must
XAML
might
Can
custom
default
RSS
if
history
Let
author
let
in
developers
lock
from
engine
too
Include
Could
useful
textbox
feed
allowing
can
Get
feedback
extend
attributes
without
user
Enable
Would
or
property
into
functionality
specific
APP
Change
love
possibility
ALL
Give
Remove
ability
more
display
including
enable
Dialog
Twitter
improve
mark
If
UWP
single
information
consider
concept
clock
multi
performance
minute
suggestion
Update
reset
OneDrive
through
keyboard
specified
Bring
And
net
really
wanted
tools
So
include
these
service
articles
Adding
Maybe
life
controller
screenshots
manifest
making
Project
users
with
filters
email
straight
Why
think
optional
bar
trust
needed
Have
APIs
full
based
3rd
unless
greatly
e
great
Use
enabled
Center
Allowing
Preview
see
string
adding
individual
events
downloading
ru
we
re
window
everyone
priority
percentage
OPML
method
Download
ShowAsync
export
cool
Cortana
localization
your
case
per
opinion
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""script for ensemble and evaluation."""
import os
import sys
import csv
import numpy as np
from sklearn.metrics import f1_score
label_file = sys.argv[1]
prob_file_1 = sys.argv[2]
prob_file_2 = sys.argv[3]
prob_file_3 = sys.argv[4]
prob_file_4 = sys.argv[5]
def get_labels(input_file):
"""
Get the true labels from the labeled CSV file.
"""
readers = csv.reader(open(input_file, "r"), delimiter=',')
lines = []
for line in readers:
lines.append(int(line[2]))
return lines
def get_probs(input_file):
"""
get probs from input file.
"""
return [float(i.strip('\n')) for i in open(input_file)]
def get_pred(probs, threshold=0.5):
"""
get prediction from probs.
"""
pred = []
for p in probs:
if p >= threshold:
pred.append(1)
else:
pred.append(0)
return pred
def vote(pred_list):
"""
get vote result from prediction list.
"""
pred_list = np.array(pred_list).transpose()
preds = []
for p in pred_list:
counts = np.bincount(p)
preds.append(np.argmax(counts))
return preds
def cal_f1(preds, labels):
"""
calculate f1 score.
"""
return f1_score(np.array(labels), np.array(preds))
labels = get_labels(label_file)
file_list = [prob_file_1, prob_file_2, prob_file_3, prob_file_4]
pred_list = []
for f in file_list:
pred_list.append(get_pred(get_probs(f)))
pred_ensemble = vote(pred_list)
print("all model ensemble(vote) f1: %.5f " % cal_f1(pred_ensemble, labels))
#!/bin/sh
export CUDA_VISIBLE_DEVICES=0
output_dir=./output
prob_dir=./probs
bert_dir=./uncased_L-24_H-1024_A-16
mkdir -p $output_dir
mkdir -p $prob_dir
for model_type in raw cnn gru ffa
do
python run_classifier.py \
--bert_config_path ${bert_dir}/bert_config.json \
--checkpoints ${output_dir}/bert_large_${model_type} \
--init_pretraining_params ${bert_dir}/params \
--data_dir ./data/Subtask-A \
--vocab_path ${bert_dir}/vocab.txt \
--task_name sem \
--sub_model_type ${model_type} \
--max_seq_len 128 \
--batch_size 32 \
--random_seed 777 \
--save_steps 200 \
--validation_steps 200 \
--drop_keyword True
mv ${output_dir}/bert_large_${model_type}/prob.txt ${prob_dir}/prob_${model_type}.txt
done
## PaddleNLP for Research
This directory provides reproductions of advanced, high-quality research work.
Code and datasets are fully open, enabling researchers to quickly follow frontier NLP directions and developments.