diff --git a/PaddleNLP/Research/ACL2019-ARNOR/README.md b/PaddleNLP/Research/ACL2019-ARNOR/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6709f3563ffc7f912be7f6cf1da1ef1bf5f254f1
--- /dev/null
+++ b/PaddleNLP/Research/ACL2019-ARNOR/README.md
@@ -0,0 +1,41 @@
+Data
+=====
+
+This dataset accompanies our paper: ARNOR: Attention Regularization based Noise Reduction for Distant Supervision Relation Classification. The test set is intended for sentence-level evaluation.
+
+The original data comes from the dataset in the paper: Cotype: Joint extraction of typed entities and relations with knowledge bases. It is a distant supervision dataset built from NYT (The New York Times), and its test set is annotated by humans. However, the number of positive instances in that test set is small, so we revised it and annotated more test data based on it.
+
+In a data file, each line is a JSON string. The content looks like this:
+
+    {
+        "sentText": "The source sentence text",
+        "relationMentions": [
+            {
+                "em1Text": "The first entity in relation",
+                "em2Text": "The second entity in relation",
+                "label": "Relation label",
+                "is_noise": false  # only occurs in the test set
+            },
+            ...
+        ],
+        "entityMentions": [
+            {
+                "text": "Entity words",
+                "label": "Entity type",
+                ...
+            },
+            ...
+        ]
+        ...
+    }
+
+Data version 1.0.0
+=====
+
+This version of the dataset is the original one used in our paper. It includes four files: train.json, test.json, dev_part.json, and test_part.json. Here dev_part.json and test_part.json are drawn from test.json. The dataset can be downloaded here: https://baidu-nlp.bj.bcebos.com/arnor_dataset-1.0.0.tar.gz
+
+
+Data version 2.0.0
+=====
+
+More test data are coming soon.
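The one-JSON-string-per-line layout described in the ARNOR README can be read with a few lines of Python. The following is a minimal sketch, not code from the paper's release: the helper name `iter_relation_mentions` and the sample record are hypothetical, and it assumes that `is_noise` is simply absent outside the test set, as the README's note suggests.

```python
import json


def iter_relation_mentions(lines):
    """Yield (sentence, em1, em2, label, is_noise) for every relation mention.

    `lines` is any iterable of strings, one JSON record per line.
    This helper is hypothetical and only illustrates the data format.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for mention in record.get("relationMentions", []):
            yield (
                record["sentText"],
                mention["em1Text"],
                mention["em2Text"],
                mention["label"],
                # Assumption: "is_noise" only exists in the test set,
                # so fall back to None when the key is missing.
                mention.get("is_noise"),
            )


# Made-up record for illustration; it is not taken from the dataset.
sample = json.dumps({
    "sentText": "Alice was born in Paris.",
    "relationMentions": [{
        "em1Text": "Alice",
        "em2Text": "Paris",
        "label": "born_in",
        "is_noise": False,
    }],
    "entityMentions": [],
})

for row in iter_relation_mentions([sample]):
    print(row)
```

Because the helper accepts any iterable of lines, an open file handle for one of the files listed in the README (for example `train.json`) could be passed in directly.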
diff --git a/PaddleNLP/Research/NAACL2019-MPM/README.md b/PaddleNLP/Research/NAACL2019-MPM/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b52f384bc6b19a94dd055b8513b4d4d6ba947166
--- /dev/null
+++ b/PaddleNLP/Research/NAACL2019-MPM/README.md
@@ -0,0 +1,95 @@
+# Multi-Perspective Models
+
+This model won first place in SemEval 2019 Task 9 SubTask A - Suggestion Mining from Online Reviews and Forums.
+
+See more information about SemEval 2019: [http://alt.qcri.org/semeval2019/](http://alt.qcri.org/semeval2019/)
+
+## 1. Introduction
+This paper describes our system, which participated in Task 9 of SemEval-2019. The task focuses on suggestion mining and aims to classify given sentences into suggestion and non-suggestion classes in domain-specific and cross-domain training settings, respectively. We propose a multi-perspective architecture for learning representations by using different classical models, including Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), Feed Forward Attention (FFA), etc. To leverage the semantics distributed in large amounts of unsupervised data, we also adopt the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model as an encoder to produce sentence and word representations. The proposed architecture is applied to both subtasks and achieved an F1-score of 0.7812 for subtask A and 0.8579 for subtask B. We won first and second place on the two subtasks, respectively, in the final competition.
+
+## 2. Quick Start
+### Installation
+This project depends on Python 2.7 and paddle.fluid 1.3.2. Please follow the [quick start](http://www.paddlepaddle.org/#quick-start) guide to install them.
+### Data Preparation
+- Download the competition's data
+
+```
+# Download the competition's data
+cd ./data && git clone https://github.com/Semeval2019Task9/Subtask-A.git
+cd ../
+```
+
+- Download BERT and the pre-trained model
+
+```
+# Download BERT code
+git clone https://github.com/PaddlePaddle/LARK && mv LARK/BERT ./
+# Download BERT pre-trained model
+wget https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz
+tar zxf uncased_L-24_H-1024_A-16.tar.gz -C ./
+```
+
+### Train
+Use this command to start training:
+
+```
+# run training script
+sh train.sh
+```
+The models will be written to ./output.
+
+### Ensemble & Evaluation
+Use this command to evaluate the ensemble result:
+
+```
+# run evaluation
+python evaluation.py \
+    ./data/Subtask-A/SubtaskA_EvaluationData_labeled.csv \
+    ./probs/prob_raw.txt \
+    ./probs/prob_cnn.txt \
+    ./probs/prob_gru.txt \
+    ./probs/prob_ffa.txt
+```
+Because the dataset is small, the training result may fluctuate; please try re-training several times.
+
+## 3. Advanced
+### Task Introduction
+[Semeval2019-Task9](https://www.aclweb.org/anthology/S19-2151) presents the pilot SemEval task on Suggestion Mining. The task consists of subtasks A and B, with labeled data created from a feedback forum and hotel reviews, respectively. Examples:
+
+|Source |Sentence |Label|
+|------| ------|------|
+|Hotel reviews |Be sure to specify a room at the back of the hotel. |suggestion|
+|Hotel reviews |The point is, don’t advertise the service if there are caveats that go with it.|non-suggestion|
+|Suggestion forum| Why not let us have several pages that we can put tiles on and name whatever we want to |suggestion|
+|Suggestion forum| It fails with a uninformative message indicating deployment failed.|non-suggestion|
+
+### Model Introduction
+The model's framework is shown in Figure 1:
+

+
+Figure 1: An overall framework and pipeline of our system for suggestion mining +

+As shown in Figure 1, our model architecture consists of two modules: a universal encoding module that serves as either a sentence or a word encoder, and a task-specific module used for suggestion classification. To fully explore the information generated by the encoder, we stack a series of different task-specific modules on top of the encoder, each reflecting a different perspective. Intuitively, we could use the sentence encoding directly for classification. Going beyond that, since language is in essence time-series information, time-perspective GRU cells can also be applied to model the sequence state and learn the structure for the suggestion mining task. Similarly, a spatial-perspective CNN can be used to mimic an n-gram model. Moreover, we introduce a convenient attention mechanism, FFA (Raffel and Ellis, 2015), to automatically learn the combination of the most important features. Finally, we ensemble these models with a voting strategy to produce the system's final prediction.
+
+### Result
+| Models | CV f1-score | test score |
+| ----- | ----- | ------ |
+| BERT-Large-Logistic | 0.8522 (±0.0213) | 0.7697 |
+| BERT-Large-Conv | 0.8520 (±0.0231) | 0.7800 |
+| BERT-Large-FFA | 0.8516 (±0.0307) | 0.7722 |
+| BERT-Large-GRU | 0.8503 (±0.0275) | 0.7725 |
+| Ensemble | – | 0.7812 |
+
+
+## 4. Others
+If you use the library in your research project, please cite the paper "OleNet at SemEval-2019 Task 9: BERT based Multi-Perspective Models for Suggestion Mining".
+### Citation
+
+```
+@inproceedings{BaiduMPM,
+  title={OleNet at SemEval-2019 Task 9: BERT based Multi-Perspective Models for Suggestion Mining},
+  author={Jiaxiang Liu and Shuohuan Wang and Yu Sun},
+  booktitle={Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019)},
+  year={2019}
+}
+```
diff --git a/PaddleNLP/Research/NAACL2019-MPM/batching.py b/PaddleNLP/Research/NAACL2019-MPM/batching.py
new file mode 100644
index 0000000000000000000000000000000000000000..bebd37d555c3492aba005cdfebd1ac40f96dd6ef
--- /dev/null
+++ b/PaddleNLP/Research/NAACL2019-MPM/batching.py
@@ -0,0 +1,195 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Mask, padding and batching.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + + +def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3): + """ + Add mask for batch_tokens, return out, mask_label, mask_pos; + Note: mask_pos responding the batch_tokens after padded; + """ + max_len = max([len(sent) for sent in batch_tokens]) + mask_label = [] + mask_pos = [] + prob_mask = np.random.rand(total_token_num) + # Note: the first token is [CLS], so [low=1] + replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num) + pre_sent_len = 0 + prob_index = 0 + for sent_index, sent in enumerate(batch_tokens): + mask_flag = False + prob_index += pre_sent_len + for token_index, token in enumerate(sent): + prob = prob_mask[prob_index + token_index] + if prob > 0.15: + continue + elif 0.03 < prob <= 0.15: + # mask + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = MASK + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + elif 0.015 < prob <= 0.03: + # random replace + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = replace_ids[prob_index + token_index] + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + else: + # keep the original token + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + mask_pos.append(sent_index * max_len + token_index) + pre_sent_len = len(sent) + + # ensure at least mask one word in a sentence + while not mask_flag: + token_index = int(np.random.randint(1, high=len(sent) - 1, size=1)) + if sent[token_index] != SEP and sent[token_index] != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = MASK + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) + mask_pos = 
np.array(mask_pos).astype("int64").reshape([-1, 1]) + return batch_tokens, mask_label, mask_pos + + +def prepare_batch_data(insts, + total_token_num, + voc_size=0, + pad_id=None, + cls_id=None, + sep_id=None, + mask_id=None, + return_input_mask=True, + return_max_len=True, + return_num_token=False): + """ + 1. generate Tensor of data + 2. generate Tensor of position + 3. generate self attention mask, [shape: batch_size * max_len * max_len] + """ + + batch_src_ids = [inst[0] for inst in insts] + batch_sent_ids = [inst[1] for inst in insts] + batch_pos_ids = [inst[2] for inst in insts] + seq_len = np.array( + [[len(inst[0])] for inst in insts]).astype("int64").reshape([-1, 1]) + labels_list = [] + # compatible with squad, whose example includes start/end positions, + # or unique id + + for i in range(3, len(insts[0]), 1): + labels = [inst[i] for inst in insts] + labels = np.array(labels).astype("int64").reshape([-1, 1]) + labels_list.append(labels) + + # First step: do mask without padding + if mask_id >= 0: + out, mask_label, mask_pos = mask( + batch_src_ids, + total_token_num, + vocab_size=voc_size, + CLS=cls_id, + SEP=sep_id, + MASK=mask_id) + else: + out = batch_src_ids + # Second step: padding + src_id, self_input_mask = pad_batch_data( + out, pad_idx=pad_id, return_input_mask=True) + pos_id = pad_batch_data( + batch_pos_ids, + pad_idx=pad_id, + return_pos=False, + return_input_mask=False) + sent_id = pad_batch_data( + batch_sent_ids, + pad_idx=pad_id, + return_pos=False, + return_input_mask=False) + + if mask_id >= 0: + return_list = [ + src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos + ] + labels_list + else: + return_list = [src_id, pos_id, sent_id, self_input_mask, seq_len + ] + labels_list + + return return_list if len(return_list) > 1 else return_list[0] + + +def pad_batch_data(insts, + pad_idx=0, + return_pos=False, + return_input_mask=False, + return_max_len=False, + return_num_token=False, + return_seq_len=False): + """ + Pad the 
instances to the max sequence length in batch, and generate the + corresponding position data and input mask. + """ + return_list = [] + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + + inst_data = np.array([ + list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts + ]) + return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])] + + # position data + if return_pos: + inst_pos = np.array([ + list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) + for inst in insts + ]) + + return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])] + + if return_input_mask: + # This is used to avoid attention on paddings. + input_mask_data = np.array( + [[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + if return_seq_len: + seq_len = np.array([[len(inst)] for inst in insts]) + return_list += [seq_len.astype("int64").reshape([-1, 1])] + return return_list if len(return_list) > 1 else return_list[0] + + +if __name__ == "__main__": + pass diff --git a/PaddleNLP/Research/NAACL2019-MPM/classifier.py b/PaddleNLP/Research/NAACL2019-MPM/classifier.py new file mode 100644 index 0000000000000000000000000000000000000000..8ca3077a0d7094602a090f6ad662c11bf21b56b4 --- /dev/null +++ b/PaddleNLP/Research/NAACL2019-MPM/classifier.py @@ -0,0 +1,135 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Model for classifier.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import sys +import numpy as np +import paddle.fluid as fluid + +sys.path.append("./BERT") +from model.bert import BertModel + + +def create_model(args, + pyreader_name, + bert_config, + num_labels, + is_prediction=False): + """ + define fine-tuning model + """ + if args.binary: + pyreader = fluid.layers.py_reader( + capacity=50, + shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], + [-1, 1], [-1, 1]], + dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'], + lod_levels=[0, 0, 0, 0, 0, 0], + name=pyreader_name, + use_double_buffer=True) + + (src_ids, pos_ids, sent_ids, input_mask, seq_len, + labels) = fluid.layers.read_file(pyreader) + + bert = BertModel( + src_ids=src_ids, + position_ids=pos_ids, + sentence_ids=sent_ids, + input_mask=input_mask, + config=bert_config, + use_fp16=args.use_fp16) + + if args.sub_model_type == 'raw': + cls_feats = bert.get_pooled_output() + + elif args.sub_model_type == 'cnn': + bert_seq_out = bert.get_sequence_output() + bert_seq_out = fluid.layers.sequence_unpad(bert_seq_out, seq_len) + cnn_hidden_size = 100 + convs = [] + for h in [3, 4, 5]: + conv_feats = fluid.layers.sequence_conv( + input=bert_seq_out, num_filters=cnn_hidden_size, filter_size=h) + conv_feats = fluid.layers.batch_norm(input=conv_feats, act="relu") + conv_feats = fluid.layers.sequence_pool( + input=conv_feats, pool_type='max') + 
convs.append(conv_feats) + + cls_feats = fluid.layers.concat(input=convs, axis=1) + + elif args.sub_model_type == 'gru': + bert_seq_out = bert.get_sequence_output() + bert_seq_out = fluid.layers.sequence_unpad(bert_seq_out, seq_len) + gru_hidden_size = 1024 + gru_input = fluid.layers.fc(input=bert_seq_out, + size=gru_hidden_size * 3) + gru_forward = fluid.layers.dynamic_gru( + input=gru_input, size=gru_hidden_size, is_reverse=False) + gru_backward = fluid.layers.dynamic_gru( + input=gru_input, size=gru_hidden_size, is_reverse=True) + gru_output = fluid.layers.concat([gru_forward, gru_backward], axis=1) + cls_feats = fluid.layers.sequence_pool( + input=gru_output, pool_type='max') + + elif args.sub_model_type == 'ffa': + bert_seq_out = bert.get_sequence_output() + attn = fluid.layers.fc(input=bert_seq_out, + num_flatten_dims=2, + size=1, + act='tanh') + attn = fluid.layers.softmax(attn) + weighted_input = bert_seq_out * attn + weighted_input = fluid.layers.sequence_unpad(weighted_input, seq_len) + cls_feats = fluid.layers.sequence_pool(weighted_input, pool_type='sum') + + else: + raise NotImplementedError("%s is not implemented!" 
% + args.sub_model_type) + + cls_feats = fluid.layers.dropout( + x=cls_feats, + dropout_prob=0.1, + dropout_implementation="upscale_in_train") + + logits = fluid.layers.fc( + input=cls_feats, + size=num_labels, + param_attr=fluid.ParamAttr( + name="cls_out_w", + initializer=fluid.initializer.TruncatedNormal(scale=0.02)), + bias_attr=fluid.ParamAttr( + name="cls_out_b", initializer=fluid.initializer.Constant(0.))) + probs = fluid.layers.softmax(logits) + + if is_prediction: + feed_targets_name = [ + src_ids.name, pos_ids.name, sent_ids.name, input_mask.name + ] + return pyreader, probs, feed_targets_name + + ce_loss = fluid.layers.softmax_with_cross_entropy( + logits=logits, label=labels) + loss = fluid.layers.mean(x=ce_loss) + + if args.use_fp16 and args.loss_scaling > 1.0: + loss *= args.loss_scaling + + num_seqs = fluid.layers.create_tensor(dtype='int64') + accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs) + + return (pyreader, loss, probs, accuracy, labels, num_seqs) diff --git a/PaddleNLP/Research/NAACL2019-MPM/data/keywords b/PaddleNLP/Research/NAACL2019-MPM/data/keywords new file mode 100644 index 0000000000000000000000000000000000000000..5dd30a78a58dcaef97cefdea462c50d91fb78be0 --- /dev/null +++ b/PaddleNLP/Research/NAACL2019-MPM/data/keywords @@ -0,0 +1,192 @@ +should +be +Please +please +add +Allow +could +Add +make +need +Make +like +Provide +for +to +needs +support +a +fix +allow +provide +Feedly +suggest +Create +want +so +option +would +remove +maybe +API +feedly +us +give +integration +google +least +nice +better +as +back +feeds +wish +control +games +by +Should +helpful +also +function +d +bring +must +XAML +might +Can +custom +default +RSS +if +history +Let +author +let +in +developers +lock +from +engine +too +Include +Could +useful +textbox +feed +allowing +can +Get +feedback +extend +attributes +without +user +Enable +Would +or +property +into +functionality +specific +APP +Change +love +possibility +ALL +Give +Remove 
+ability +more +display +including +enable +Dialog +Twitter +improve +mark +If +UWP +single +information +consider +concept +clock +multi +performance +minute +suggestion +Update +reset +OneDrive +through +keyboard +specified +Bring +And +net +really +wanted +tools +So +include +these +service +articles +Adding +Maybe +life +controller +screenshots +manifest +making +Project +users +with +filters +email +straight +Why +think +optional +bar +trust +needed +Have +APIs +full +based +3rd +unless +greatly +e +great +Use +enabled +Center +Allowing +Preview +see +string +adding +individual +events +downloading +ru +we +re +window +everyone +priority +percentage +OPML +method +Download +ShowAsync +export +cool +Cortana +localization +your +case +per +opinion diff --git a/PaddleNLP/Research/NAACL2019-MPM/data/mpm.png b/PaddleNLP/Research/NAACL2019-MPM/data/mpm.png new file mode 100644 index 0000000000000000000000000000000000000000..ce125a7388c664312bbf335eee40573da6a252e7 Binary files /dev/null and b/PaddleNLP/Research/NAACL2019-MPM/data/mpm.png differ diff --git a/PaddleNLP/Research/NAACL2019-MPM/evaluation.py b/PaddleNLP/Research/NAACL2019-MPM/evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..8170f11bba1b01d6f476da05e00c8a7c8be7ede1 --- /dev/null +++ b/PaddleNLP/Research/NAACL2019-MPM/evaluation.py @@ -0,0 +1,88 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Script for ensemble and evaluation."""
+
+import os
+import sys
+import csv
+import numpy as np
+from sklearn.metrics import f1_score
+
+label_file = sys.argv[1]
+prob_file_1 = sys.argv[2]
+prob_file_2 = sys.argv[3]
+prob_file_3 = sys.argv[4]
+prob_file_4 = sys.argv[5]
+
+
+def get_labels(input_file):
+    """
+    Get the true labels from the labels file (third CSV column).
+    """
+    readers = csv.reader(open(input_file, "r"), delimiter=',')
+    lines = []
+    for line in readers:
+        lines.append(int(line[2]))
+    return lines
+
+
+def get_probs(input_file):
+    """
+    Get probabilities from the input file, one float per line.
+    """
+    return [float(i.strip('\n')) for i in open(input_file)]
+
+
+def get_pred(probs, threshold=0.5):
+    """
+    Get binary predictions from probabilities using the threshold.
+    """
+    pred = []
+    for p in probs:
+        if p >= threshold:
+            pred.append(1)
+        else:
+            pred.append(0)
+    return pred
+
+
+def vote(pred_list):
+    """
+    Get the majority-vote result from the list of per-model predictions.
+    """
+    pred_list = np.array(pred_list).transpose()
+    preds = []
+    for p in pred_list:
+        counts = np.bincount(p)
+        preds.append(np.argmax(counts))
+    return preds
+
+
+def cal_f1(preds, labels):
+    """
+    Calculate the F1 score.
+    """
+    return f1_score(np.array(labels), np.array(preds))
+
+
+labels = get_labels(label_file)
+
+file_list = [prob_file_1, prob_file_2, prob_file_3, prob_file_4]
+pred_list = []
+for f in file_list:
+    pred_list.append(get_pred(get_probs(f)))
+
+pred_ensemble = vote(pred_list)
+
+print("all model ensemble(vote) f1: %.5f " % cal_f1(pred_ensemble, labels))
diff --git a/PaddleNLP/Research/NAACL2019-MPM/reader.py b/PaddleNLP/Research/NAACL2019-MPM/reader.py
new file mode 100644
index 0000000000000000000000000000000000000000..6813f3dac235b96dc27ac39659204ecbe89f6b30
--- /dev/null
+++ b/PaddleNLP/Research/NAACL2019-MPM/reader.py
@@ -0,0 +1,557 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" module for data reader """ + +import os +import sys +import re +import types +import csv +import random +import numpy as np + +from batching import prepare_batch_data + +sys.path.append('./BERT') +import tokenization + + +class DataProcessor(object): + """Base class for data converters for sequence classification data sets.""" + + def __init__(self, + data_dir, + vocab_path, + max_seq_len, + do_lower_case, + in_tokens, + random_seed=None): + self.data_dir = data_dir + self.max_seq_len = max_seq_len + self.tokenizer = tokenization.FullTokenizer( + vocab_file=vocab_path, do_lower_case=do_lower_case) + self.vocab = self.tokenizer.vocab + self.in_tokens = in_tokens + + np.random.seed(random_seed) + + self.current_train_example = -1 + self.num_examples = {'train': -1, 'dev': -1, 'test': -1} + self.current_train_epoch = -1 + + def get_train_examples(self, data_dir, drop_keyword): + """Gets a collection of `InputExample`s for the train set.""" + raise NotImplementedError() + + def get_dev_examples(self, data_dir): + """Gets a collection of `InputExample`s for the dev set.""" + raise NotImplementedError() + + def get_test_examples(self, data_dir): + """Gets a collection of `InputExample`s for prediction.""" + raise NotImplementedError() + + def get_labels(self): + """Gets the list of labels for this data set.""" + raise NotImplementedError() + + def convert_example(self, index, example, labels, max_seq_len, tokenizer): + """Converts a single `InputExample` into a single `InputFeatures`.""" + feature = convert_single_example(index, example, labels, max_seq_len, 
+ tokenizer) + return feature + + def generate_instance(self, feature): + """ + generate instance with given feature + + Args: + feature: InputFeatures(object). A single set of features of data. + """ + input_pos = list(range(len(feature.input_ids))) + return [ + feature.input_ids, feature.segment_ids, input_pos, feature.label_id + ] + + def generate_batch_data(self, + batch_data, + total_token_num, + voc_size=-1, + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False): + """Generate batch data.""" + return prepare_batch_data( + batch_data, + total_token_num, + voc_size=-1, + pad_id=self.vocab["[PAD]"], + cls_id=self.vocab["[CLS]"], + sep_id=self.vocab["[SEP]"], + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False) + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + lines.append(line) + return lines + + def get_num_examples(self, phase): + """Get number of examples for train, dev or test.""" + if phase not in ['train', 'dev', 'test']: + raise ValueError( + "Unknown phase, which should be in ['train', 'dev', 'test'].") + return self.num_examples[phase] + + def get_train_progress(self): + """Gets progress for training phase.""" + return self.current_train_example, self.current_train_epoch + + def data_generator_for_kfold(self, + examples, + batch_size, + phase='train', + epoch=1, + dev_count=1, + shuffle=True): + """ + Generate data for train, dev or test. + + Args: + examples: list. Train, dev or test data. + batch_size: int. The batch size of generated data. + phase: string. The phase for which to generate data. + epoch: int. Total epoches to generate data. + shuffle: bool. Whether to shuffle examples. 
+ """ + if phase == 'train': + self.num_examples['train'] = len(examples) + elif phase == 'dev': + self.num_examples['dev'] = len(examples) + elif phase == 'test': + self.num_examples['test'] = len(examples) + else: + raise ValueError( + "Unknown phase, which should be in ['train', 'dev', 'test'].") + + def instance_reader(): + """Process sinle example and return.""" + for epoch_index in range(epoch): + if shuffle: + np.random.shuffle(examples) + if phase == 'train': + self.current_train_epoch = epoch_index + for (index, example) in enumerate(examples): + if phase == 'train': + self.current_train_example = index + 1 + feature = self.convert_example( + index, example, + self.get_labels(), self.max_seq_len, self.tokenizer) + + instance = self.generate_instance(feature) + yield instance + + def batch_reader(reader, batch_size, in_tokens): + """Generate batch data and return.""" + batch, total_token_num, max_len = [], 0, 0 + for instance in reader(): + token_ids, sent_ids, pos_ids, label = instance[:4] + max_len = max(max_len, len(token_ids)) + if in_tokens: + to_append = (len(batch) + 1) * max_len <= batch_size + else: + to_append = len(batch) < batch_size + if to_append: + batch.append(instance) + total_token_num += len(token_ids) + else: + yield batch, total_token_num + batch, total_token_num, max_len = [instance], len( + token_ids), len(token_ids) + + if len(batch) > 0: + yield batch, total_token_num + + def wrapper(): + """Data wrapeer.""" + all_dev_batches = [] + for batch_data, total_token_num in batch_reader( + instance_reader, batch_size, self.in_tokens): + batch_data = self.generate_batch_data( + batch_data, + total_token_num, + voc_size=-1, + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False) + if len(all_dev_batches) < dev_count: + all_dev_batches.append(batch_data) + + if len(all_dev_batches) == dev_count: + for batch in all_dev_batches: + yield batch + all_dev_batches = [] + + return wrapper + + def 
data_generator(self, + batch_size, + phase='train', + epoch=1, + dev_count=1, + shuffle=True, + drop_keyword=False): + """ + Generate data for train, dev or test. + + Args: + batch_size: int. The batch size of generated data. + phase: string. The phase for which to generate data. + epoch: int. Total epoches to generate data. + shuffle: bool. Whether to shuffle examples. + """ + if phase == 'train': + examples = self.get_train_examples( + self.data_dir, drop_keyword=drop_keyword) + self.num_examples['train'] = len(examples) + elif phase == 'dev': + examples = self.get_dev_examples(self.data_dir) + self.num_examples['dev'] = len(examples) + elif phase == 'test': + examples = self.get_test_examples(self.data_dir) + self.num_examples['test'] = len(examples) + else: + raise ValueError( + "Unknown phase, which should be in ['train', 'dev', 'test'].") + + def instance_reader(): + """Process sinle example and return.""" + for epoch_index in range(epoch): + if shuffle: + np.random.shuffle(examples) + if phase == 'train': + self.current_train_epoch = epoch_index + for (index, example) in enumerate(examples): + if phase == 'train': + self.current_train_example = index + 1 + feature = self.convert_example( + index, example, + self.get_labels(), self.max_seq_len, self.tokenizer) + + instance = self.generate_instance(feature) + yield instance + + def batch_reader(reader, batch_size, in_tokens): + """Generate batch data and return.""" + batch, total_token_num, max_len = [], 0, 0 + for instance in reader(): + token_ids, sent_ids, pos_ids, label = instance[:4] + max_len = max(max_len, len(token_ids)) + if in_tokens: + to_append = (len(batch) + 1) * max_len <= batch_size + else: + to_append = len(batch) < batch_size + if to_append: + batch.append(instance) + total_token_num += len(token_ids) + else: + yield batch, total_token_num + batch, total_token_num, max_len = [instance], len( + token_ids), len(token_ids) + + if len(batch) > 0: + yield batch, total_token_num + + def wrapper(): 
+ """Data wrapeer.""" + all_dev_batches = [] + for batch_data, total_token_num in batch_reader( + instance_reader, batch_size, self.in_tokens): + batch_data = self.generate_batch_data( + batch_data, + total_token_num, + voc_size=-1, + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False) + if len(all_dev_batches) < dev_count: + all_dev_batches.append(batch_data) + + if len(all_dev_batches) == dev_count: + for batch in all_dev_batches: + yield batch + all_dev_batches = [] + + return wrapper + + +class InputExample(object): + """A single training/test example for simple sequence classification.""" + + def __init__(self, guid, text_a, text_b=None, label=None): + """Constructs a InputExample. + + Args: + guid: Unique id for the example. + text_a: string. The untokenized text of the first sequence. For single + sequence tasks, only this sequence must be specified. + text_b: (Optional) string. The untokenized text of the second sequence. + Only must be specified for sequence pair tasks. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. + """ + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
+ while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, input_ids, input_mask, segment_ids, label_id): + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_id = label_id + + +class SemevalTask9Processor(DataProcessor): + """Processor for Semeval Task9 data set.""" + + def get_train_examples(self, data_dir, header=False, drop_keyword=False): + lines = self._read_csv(data_dir + '/V1.4_Training.csv') + examples = [] + if drop_keyword: + keywords = [ + line.strip() for line in open(data_dir + '/../keywords') + ] + + for i, line in enumerate(lines): + if i == 0 and header: + continue + guid = line[0] + text_a = tokenization.convert_to_unicode(line[1]) + text_a = clean_str(text_a) + + if drop_keyword: + new_tokens = [] + for w in text_a.split(' '): + if w in keywords and random.random() > 0.8: + continue + new_tokens.append(w) + text_a = ' '.join(new_tokens) + text_b = None + label = line[2] + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + def get_dev_examples(self, data_dir, header=True): + lines = self._read_csv(data_dir + '/SubtaskA_Trial_Test_Labeled.csv') + examples = [] + for i, line in enumerate(lines): + if i == 0 and header: + continue + guid = line[0] + text_a = clean_str(line[1]) + text_a = tokenization.convert_to_unicode(text_a) + text_b = None + label = line[2] + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + def get_test_examples(self, data_dir, header=False): + lines = self._read_csv(data_dir + + '/SubtaskA_EvaluationData_labeled.csv') + examples = [] + for i, line in enumerate(lines): + if i == 0 and header: + continue + guid = line[0] + text_a = 
clean_str(line[1]) + text_a = tokenization.convert_to_unicode(text_a) + text_b = None + label = line[2] + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + @classmethod + def _read_csv(cls, input_file): + """Reads a comma separated value file.""" + readers = csv.reader(open(input_file, "r"), delimiter=',') + lines = [] + for line in readers: + lines.append(line) + return lines + + +def clean_str(string): + """ + Tokenization/string cleaning for all datasets except for SST. + Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py + """ + string = string.strip('\n').replace('\n', ' ') + string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) + string = re.sub(r"\'s", " \'s", string) + string = re.sub(r"\'ve", " \'ve", string) + string = re.sub(r"n\'t", " n\'t", string) + string = re.sub(r"\'re", " \'re", string) + string = re.sub(r"\'d", " \'d", string) + string = re.sub(r"\'ll", " \'ll", string) + string = re.sub(r",", " , ", string) + string = re.sub(r"!", " ! ", string) + string = re.sub(r"\(", " ( ", string) + string = re.sub(r"\)", " ) ", string) + string = re.sub(r"\?", " ? 
", string) + string = re.sub(r"\s{2,}", " ", string) + return string + + +def convert_single_example_to_unicode(guid, single_example): + """Convert single example to unicode.""" + text_a = tokenization.convert_to_unicode(single_example[0]) + text_b = tokenization.convert_to_unicode(single_example[1]) + label = tokenization.convert_to_unicode(single_example[2]) + return InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label) + + +def convert_single_example(ex_index, example, label_list, max_seq_length, + tokenizer): + """Converts a single `InputExample` into a single `InputFeatures`.""" + label_map = {} + for (i, label) in enumerate(label_list): + label_map[label] = i + + tokens_a = tokenizer.tokenize(example.text_a) + tokens_b = None + if example.text_b: + tokens_b = tokenizer.tokenize(example.text_b) + + if tokens_b: + # Modifies `tokens_a` and `tokens_b` in place so that the total + # length is less than the specified length. + # Account for [CLS], [SEP], [SEP] with "- 3" + _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) + else: + # Account for [CLS] and [SEP] with "- 2" + if len(tokens_a) > max_seq_length - 2: + tokens_a = tokens_a[0:(max_seq_length - 2)] + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. 
+ # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens = [] + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in tokens_a: + tokens.append(token) + segment_ids.append(0) + tokens.append("[SEP]") + segment_ids.append(0) + + if tokens_b: + for token in tokens_b: + tokens.append(token) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + + input_mask = [1] * len(input_ids) + + label_id = label_map[example.label] + + feature = InputFeatures( + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_id=label_id) + return feature + + +def convert_examples_to_features(examples, label_list, max_seq_length, + tokenizer): + """Convert a set of `InputExample`s to a list of `InputFeatures`.""" + + features = [] + for (ex_index, example) in enumerate(examples): + if ex_index % 10000 == 0: + print("Writing example %d of %d" % (ex_index, len(examples))) + + feature = convert_single_example(ex_index, example, label_list, + max_seq_length, tokenizer) + + features.append(feature) + return features + + +if __name__ == '__main__': + pass diff --git a/PaddleNLP/Research/NAACL2019-MPM/run_classifier.py b/PaddleNLP/Research/NAACL2019-MPM/run_classifier.py new file mode 100644 index 0000000000000000000000000000000000000000..50ab58fcf357ccc5ff15932117cab94f91c27fe7 --- /dev/null +++ b/PaddleNLP/Research/NAACL2019-MPM/run_classifier.py @@ -0,0 +1,749 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Finetuning on classification tasks.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import sys +import time +import argparse +import numpy as np +import multiprocessing + +import paddle +import paddle.fluid as fluid +from classifier import create_model +import reader + +sys.path.append("./BERT") +from model.bert import BertConfig +from optimization import optimization +from utils.args import ArgumentGroup, print_arguments +from utils.init import init_pretraining_params, init_checkpoint +import scipy +from sklearn.model_selection import KFold, StratifiedKFold + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +model_g = ArgumentGroup(parser, "model", "model configuration and paths.") +model_g.add_arg("bert_config_path", str, None, "Path to the json file for bert model config.") +model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.") +model_g.add_arg("init_pretraining_params", str, None, + "Init pre-training params to fine-tune from. 
If the " + "arg 'init_checkpoint' has been set, this argument is ignored.") +model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.") + +train_g = ArgumentGroup(parser, "training", "training options.") +train_g.add_arg("epoch", int, 3, "Number of epochs for fine-tuning.") +train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.") +train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", + "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay']) +train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.") +train_g.add_arg("warmup_proportion", float, 0.1, + "Proportion of training steps to perform linear learning rate warmup for.") +train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.") +train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.") +train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.") +train_g.add_arg("loss_scaling", float, 1.0, + "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.") + +log_g = ArgumentGroup(parser, "logging", "logging related.") +log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.") +log_g.add_arg("verbose", bool, False, "Whether to output verbose log.") + +data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options") +data_g.add_arg("data_dir", str, None, "Path to training data.") +data_g.add_arg("vocab_path", str, None, "Vocabulary path.") +data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest sequence.") +data_g.add_arg("batch_size", int, 32, "Total number of examples in a batch for training. See also --in_tokens.") +data_g.add_arg("in_tokens", bool, False, + "If set, the batch size will be the maximum number of tokens in one batch. 
" + "Otherwise, it will be the maximum number of examples in one batch.") +data_g.add_arg("do_lower_case", bool, True, + "Whether to lower case the input text. Should be True for uncased models and False for cased models.") +data_g.add_arg("random_seed", int, 0, "Random seed.") + +run_type_g = ArgumentGroup(parser, "run_type", "running type options.") +run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.") +run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).") +run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "Ihe iteration intervals to clean up temporary variables.") +run_type_g.add_arg("task_name", str, "sem", + "The name of task to perform fine-tuning, should be in {'xnli', 'mnli', 'cola', 'mrpc'}.") +run_type_g.add_arg("sub_model_type", str, "raw", + "The type of sub model to use, should be in {'raw', 'cnn', 'gru', ffa}.") +run_type_g.add_arg("ksplit", int, -1, + "if ksplit > 0, use kfold training") +run_type_g.add_arg("drop_keyword", bool, False, + "if drop keyword for data augmentation.") +run_type_g.add_arg("kfold_type", str, "normal", + "The type of kfold should be in {'normal', 'stratified'}") +run_type_g.add_arg("binary", bool, True, "if is binary classification.") +run_type_g.add_arg("do_train", bool, True, "Whether to perform training.") +run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.") +run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.") + +args = parser.parse_args() + + +# yapf: enable. + + +def evaluate(exe, test_program, test_pyreader, fetch_list, eval_phase): + """ + evaluation for dev and test dataset. 
+ """ + test_pyreader.start() + total_cost, total_acc, total_num_seqs = 0.0, 0.0, 0.0 + total_label_pos_num, total_pred_pos_num, total_correct_num = 0.0, 0.0, 0.0 + qids, labels, scores = [], [], [] + time_begin = time.time() + while True: + try: + np_loss, np_acc, np_probs, np_labels, np_num_seqs = exe.run( + program=test_program, fetch_list=fetch_list) + total_cost += np.sum(np_loss * np_num_seqs) + total_acc += np.sum(np_acc * np_num_seqs) + total_num_seqs += np.sum(np_num_seqs) + labels.extend(np_labels.reshape((-1)).tolist()) + scores.extend(np_probs[:, 1].reshape(-1).tolist()) + np_preds = np.argmax(np_probs, axis=1).astype(np.float32) + total_label_pos_num += np.sum(np_labels) + total_pred_pos_num += np.sum(np_preds) + total_correct_num += np.sum(np.dot(np_preds, np_labels)) + except fluid.core.EOFException: + test_pyreader.reset() + break + time_end = time.time() + + # Guard against empty denominators (e.g. no positive predictions). + r = total_correct_num / total_label_pos_num if total_label_pos_num > 0 else 0.0 + p = total_correct_num / total_pred_pos_num if total_pred_pos_num > 0 else 0.0 + f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0 + + print( + "[%s evaluation] ave loss: %f, ave_acc: %f, p: %f, r: %f, f1: %f, data_num: %d, elapsed time: %f s" + % (eval_phase, total_cost / total_num_seqs, + total_acc / total_num_seqs, p, r, f, total_num_seqs, + time_end - time_begin)) + + +def predict(exe, test_program, test_pyreader, fetch_list, eval_phase, output_file): + """ + predict function + """ + test_pyreader.start() + qids, scores = [], [] + time_begin = time.time() + while True: + try: + np_probs, np_num_seqs = exe.run( + program=test_program, fetch_list=fetch_list) + scores.extend(np_probs[:, 1].reshape(-1).tolist()) + except fluid.core.EOFException: + test_pyreader.reset() + break + time_end = time.time() + with open(output_file, 'w') as w: + for prob in scores: + w.write(str(prob) + '\n') + + +def train_kfold(args): + """ + main program for training kfold. 
+ """ + task_name = args.task_name.lower() + processors = { + 'sem': reader.SemevalTask9Processor, + } + + processor = processors[task_name](data_dir=args.data_dir, + vocab_path=args.vocab_path, + max_seq_len=args.max_seq_len, + do_lower_case=args.do_lower_case, + in_tokens=args.in_tokens, + random_seed=args.random_seed) + + if not (args.do_train or args.do_val or args.do_test): + raise ValueError("For args `do_train`, `do_val` and `do_test`, at " + "least one of them must be True.") + + train_examples = processor.get_train_examples(args.data_dir, drop_keyword=args.drop_keyword) + test_examples = processor.get_test_examples(args.data_dir) + + if args.kfold_type == 'normal': + kf = KFold(n_splits=args.ksplit, shuffle=True, random_state=args.random_seed) + kf_iter = kf.split(train_examples) + elif args.kfold_type == 'stratified': + kf = StratifiedKFold(n_splits=args.ksplit, shuffle=True, random_state=args.random_seed) + train_labels = [e.label for e in train_examples] + kf_iter = kf.split(train_examples, train_labels) + else: + raise NotImplementedError("%s is not implemented" % args.kfold_type) + + for fold, (train_idx, val_idx) in enumerate(kf_iter): + print("==================== fold %d ===================" % fold) + train_fold = np.array(train_examples)[train_idx] + dev_fold = np.array(train_examples)[val_idx] + test_examples = np.array(test_examples) + kfold_program(args, processor, train_fold, dev_fold, test_examples, str(fold)) + + +def kfold_program(args, processor, train_examples, dev_examples, test_examples, fold): + """ + training program for kfold. 
+ """ + bert_config = BertConfig(args.bert_config_path) + bert_config.print_config() + + if args.use_cuda: + place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0'))) + dev_count = fluid.core.get_cuda_device_count() + else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) + exe = fluid.Executor(place) + + num_labels = len(processor.get_labels()) + + if not (args.do_train or args.do_val or args.do_test): + raise ValueError("For args `do_train`, `do_val` and `do_test`, at " + "least one of them must be True.") + + startup_prog = fluid.Program() + if args.random_seed is not None: + startup_prog.random_seed = args.random_seed + + if args.do_train: + train_data_generator = processor.data_generator_for_kfold( + examples=train_examples, + batch_size=args.batch_size, + phase='train', + epoch=args.epoch, + dev_count=dev_count, + shuffle=True) + + num_train_examples = processor.get_num_examples(phase='train') + + if args.in_tokens: + max_train_steps = args.epoch * num_train_examples // ( + args.batch_size // args.max_seq_len) // dev_count + else: + max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count + + warmup_steps = int(max_train_steps * args.warmup_proportion) + print("Device count: %d" % dev_count) + print("Num train examples: %d" % num_train_examples) + print("Max train steps: %d" % max_train_steps) + print("Num warmup steps: %d" % warmup_steps) + + train_program = fluid.Program() + + with fluid.program_guard(train_program, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, loss, probs, accuracy, labels, num_seqs = create_model( + args, + pyreader_name=fold + 'train_reader', + bert_config=bert_config, + num_labels=num_labels) + scheduled_lr = optimization( + loss=loss, + warmup_steps=warmup_steps, + num_train_steps=max_train_steps, + learning_rate=args.learning_rate, + train_program=train_program, + startup_prog=startup_prog, + weight_decay=args.weight_decay, + 
scheduler=args.lr_scheduler, + use_fp16=args.use_fp16, + loss_scaling=args.loss_scaling) + + fluid.memory_optimize( + input_program=train_program, + skip_opt_set=[ + loss.name, probs.name, accuracy.name, num_seqs.name + ]) + + if args.verbose: + if args.in_tokens: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, + batch_size=args.batch_size // args.max_seq_len) + else: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size) + print("Theoretical memory usage in training: %.3f - %.3f %s" % + (lower_mem, upper_mem, unit)) + + if args.do_val or args.do_test: + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, loss, probs, accuracy, labels, num_seqs = create_model( + args, + pyreader_name=fold + 'test_reader', + bert_config=bert_config, + num_labels=num_labels) + + test_prog = test_prog.clone(for_test=True) + + exe.run(startup_prog) + + if args.do_train: + if args.init_checkpoint and args.init_pretraining_params: + print( + "WARNING: args 'init_checkpoint' and 'init_pretraining_params' " + "both are set! 
Only arg 'init_checkpoint' is made valid.") + if args.init_checkpoint: + init_checkpoint( + exe, + args.init_checkpoint, + main_program=startup_prog, + use_fp16=args.use_fp16) + elif args.init_pretraining_params: + init_pretraining_params( + exe, + args.init_pretraining_params, + main_program=startup_prog, + use_fp16=args.use_fp16) + elif args.do_val or args.do_test: + if not args.init_checkpoint: + raise ValueError("args 'init_checkpoint' should be set if" + "only doing validation or testing!") + init_checkpoint( + exe, + args.init_checkpoint, + main_program=startup_prog, + use_fp16=args.use_fp16) + + if args.do_train: + exec_strategy = fluid.ExecutionStrategy() + exec_strategy.use_experimental_executor = args.use_fast_executor + exec_strategy.num_threads = dev_count + exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope + + train_exe = fluid.ParallelExecutor( + use_cuda=args.use_cuda, + loss_name=loss.name, + exec_strategy=exec_strategy, + main_program=train_program) + + train_pyreader.decorate_tensor_provider(train_data_generator) + else: + train_exe = None + + if args.do_val or args.do_test: + test_exe = fluid.ParallelExecutor( + use_cuda=args.use_cuda, + main_program=test_prog, + share_vars_from=train_exe) + + if args.do_train: + train_pyreader.start() + steps = 0 + total_cost, total_acc, total_num_seqs = [], [], [] + time_begin = time.time() + while True: + try: + steps += 1 + if steps % args.skip_steps == 0: + if warmup_steps <= 0: + fetch_list = [loss.name, accuracy.name, num_seqs.name] + else: + fetch_list = [ + loss.name, accuracy.name, scheduled_lr.name, + num_seqs.name + ] + else: + fetch_list = [] + + outputs = train_exe.run(fetch_list=fetch_list) + + if steps % args.skip_steps == 0: + if warmup_steps <= 0: + np_loss, np_acc, np_num_seqs = outputs + else: + np_loss, np_acc, np_lr, np_num_seqs = outputs + + total_cost.extend(np_loss * np_num_seqs) + total_acc.extend(np_acc * np_num_seqs) + total_num_seqs.extend(np_num_seqs) + + 
if args.verbose: + verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size( + ) + verbose += "learning rate: %f" % ( + np_lr[0] + if warmup_steps > 0 else args.learning_rate) + print(verbose) + + current_example, current_epoch = processor.get_train_progress( + ) + time_end = time.time() + used_time = time_end - time_begin + print("epoch: %d, progress: %d/%d, step: %d, ave loss: %f, " + "ave acc: %f, speed: %f steps/s" % + (current_epoch, current_example, num_train_examples, + steps, np.sum(total_cost) / np.sum(total_num_seqs), + np.sum(total_acc) / np.sum(total_num_seqs), + args.skip_steps / used_time)) + total_cost, total_acc, total_num_seqs = [], [], [] + time_begin = time.time() + + if steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, fold, + "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, train_program) + + if steps % args.validation_steps == 0: + # evaluate dev set + if args.do_val: + test_pyreader.decorate_tensor_provider( + processor.data_generator_for_kfold( + examples=dev_examples, + batch_size=args.batch_size, + phase='dev', + epoch=1, + dev_count=1, + shuffle=False)) + evaluate(exe, test_prog, test_pyreader, + [loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], + "dev") + # evaluate test set + if args.do_test: + test_pyreader.decorate_tensor_provider( + processor.data_generator_for_kfold( + examples=test_examples, + batch_size=args.batch_size, + phase='test', + epoch=1, + dev_count=1, + shuffle=False)) + evaluate(exe, test_prog, test_pyreader, + [loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], + "test") + except fluid.core.EOFException: + save_path = os.path.join(args.checkpoints, fold, "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, train_program) + train_pyreader.reset() + break + + # final eval on dev set + if args.do_val: + test_pyreader.decorate_tensor_provider( + processor.data_generator_for_kfold( + examples=dev_examples, + 
batch_size=args.batch_size, phase='dev', epoch=1, dev_count=1, + shuffle=False)) + print("Final validation result:") + evaluate(exe, test_prog, test_pyreader, + [loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], "dev") + + # final eval on test set + if args.do_test: + test_pyreader.decorate_tensor_provider( + processor.data_generator_for_kfold( + examples=test_examples, + batch_size=args.batch_size, + phase='test', + epoch=1, + dev_count=1, + shuffle=False)) + print("Final test result:") + evaluate(exe, test_prog, test_pyreader, + [loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], "test") + + exe.close() + + +def train_single(args): + """ + training program. + """ + bert_config = BertConfig(args.bert_config_path) + bert_config.print_config() + + if args.use_cuda: + place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0'))) + dev_count = fluid.core.get_cuda_device_count() + else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) + exe = fluid.Executor(place) + + task_name = args.task_name.lower() + processors = { + 'sem': reader.SemevalTask9Processor, + } + + processor = processors[task_name](data_dir=args.data_dir, + vocab_path=args.vocab_path, + max_seq_len=args.max_seq_len, + do_lower_case=args.do_lower_case, + in_tokens=args.in_tokens, + random_seed=args.random_seed) + num_labels = len(processor.get_labels()) + + if not (args.do_train or args.do_val or args.do_test): + raise ValueError("For args `do_train`, `do_val` and `do_test`, at " + "least one of them must be True.") + + startup_prog = fluid.Program() + if args.random_seed is not None: + startup_prog.random_seed = args.random_seed + + if args.do_train: + train_data_generator = processor.data_generator( + batch_size=args.batch_size, + phase='train', + epoch=args.epoch, + dev_count=dev_count, + shuffle=True, + drop_keyword=args.drop_keyword) + + num_train_examples = processor.get_num_examples(phase='train') + + if 
args.in_tokens: + max_train_steps = args.epoch * num_train_examples // ( + args.batch_size // args.max_seq_len) // dev_count + else: + max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count + + warmup_steps = int(max_train_steps * args.warmup_proportion) + print("Device count: %d" % dev_count) + print("Num train examples: %d" % num_train_examples) + print("Max train steps: %d" % max_train_steps) + print("Num warmup steps: %d" % warmup_steps) + + train_program = fluid.Program() + + with fluid.program_guard(train_program, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, loss, probs, accuracy, labels, num_seqs = create_model( + args, + pyreader_name='train_reader', + bert_config=bert_config, + num_labels=num_labels) + scheduled_lr = optimization( + loss=loss, + warmup_steps=warmup_steps, + num_train_steps=max_train_steps, + learning_rate=args.learning_rate, + train_program=train_program, + startup_prog=startup_prog, + weight_decay=args.weight_decay, + scheduler=args.lr_scheduler, + use_fp16=args.use_fp16, + loss_scaling=args.loss_scaling) + + fluid.memory_optimize( + input_program=train_program, + skip_opt_set=[ + loss.name, probs.name, accuracy.name, num_seqs.name + ]) + + if args.verbose: + if args.in_tokens: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, + batch_size=args.batch_size // args.max_seq_len) + else: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size) + print("Theoretical memory usage in training: %.3f - %.3f %s" % + (lower_mem, upper_mem, unit)) + + if args.do_val or args.do_test: + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, loss, probs, accuracy, labels, num_seqs = create_model( + args, + pyreader_name='test_reader', + bert_config=bert_config, + num_labels=num_labels) + + test_prog = test_prog.clone(for_test=True) + + 
exe.run(startup_prog) + + if args.do_train: + if args.init_checkpoint and args.init_pretraining_params: + print( + "WARNING: args 'init_checkpoint' and 'init_pretraining_params' " + "both are set! Only arg 'init_checkpoint' is made valid.") + if args.init_checkpoint: + init_checkpoint( + exe, + args.init_checkpoint, + main_program=startup_prog, + use_fp16=args.use_fp16) + elif args.init_pretraining_params: + init_pretraining_params( + exe, + args.init_pretraining_params, + main_program=startup_prog, + use_fp16=args.use_fp16) + elif args.do_val or args.do_test: + if not args.init_checkpoint: + raise ValueError("args 'init_checkpoint' should be set if" + "only doing validation or testing!") + init_checkpoint( + exe, + args.init_checkpoint, + main_program=startup_prog, + use_fp16=args.use_fp16) + + if args.do_train: + exec_strategy = fluid.ExecutionStrategy() + exec_strategy.use_experimental_executor = args.use_fast_executor + exec_strategy.num_threads = dev_count + exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope + + train_exe = fluid.ParallelExecutor( + use_cuda=args.use_cuda, + loss_name=loss.name, + exec_strategy=exec_strategy, + main_program=train_program) + + train_pyreader.decorate_tensor_provider(train_data_generator) + else: + train_exe = None + + if args.do_val or args.do_test: + test_exe = fluid.ParallelExecutor( + use_cuda=args.use_cuda, + main_program=test_prog, + share_vars_from=train_exe) + + if args.do_train: + train_pyreader.start() + steps = 0 + total_cost, total_acc, total_num_seqs = [], [], [] + time_begin = time.time() + while True: + try: + steps += 1 + if steps % args.skip_steps == 0: + if warmup_steps <= 0: + fetch_list = [loss.name, accuracy.name, num_seqs.name] + else: + fetch_list = [ + loss.name, accuracy.name, scheduled_lr.name, + num_seqs.name + ] + else: + fetch_list = [] + + outputs = train_exe.run(fetch_list=fetch_list) + + if steps % args.skip_steps == 0: + if warmup_steps <= 0: + np_loss, np_acc, 
np_num_seqs = outputs + else: + np_loss, np_acc, np_lr, np_num_seqs = outputs + + total_cost.extend(np_loss * np_num_seqs) + total_acc.extend(np_acc * np_num_seqs) + total_num_seqs.extend(np_num_seqs) + + if args.verbose: + verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size( + ) + verbose += "learning rate: %f" % ( + np_lr[0] + if warmup_steps > 0 else args.learning_rate) + print(verbose) + + current_example, current_epoch = processor.get_train_progress( + ) + time_end = time.time() + used_time = time_end - time_begin + print("epoch: %d, progress: %d/%d, step: %d, ave loss: %f, " + "ave acc: %f, speed: %f steps/s" % + (current_epoch, current_example, num_train_examples, + steps, np.sum(total_cost) / np.sum(total_num_seqs), + np.sum(total_acc) / np.sum(total_num_seqs), + args.skip_steps / used_time)) + total_cost, total_acc, total_num_seqs = [], [], [] + time_begin = time.time() + + if steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, + "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, train_program) + + if steps % args.validation_steps == 0: + # evaluate dev set + if args.do_val: + test_pyreader.decorate_tensor_provider( + processor.data_generator( + batch_size=args.batch_size, + phase='dev', + epoch=1, + dev_count=1, + shuffle=False)) + evaluate(exe, test_prog, test_pyreader, + [loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], + "dev") + # evaluate test set + if args.do_test: + test_pyreader.decorate_tensor_provider( + processor.data_generator( + batch_size=args.batch_size, + phase='test', + epoch=1, + dev_count=1, + shuffle=False)) + evaluate(exe, test_prog, test_pyreader, + [loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], + "test") + except fluid.core.EOFException: + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, train_program) + train_pyreader.reset() + break + + # final eval on dev set + if 
args.do_val: + test_pyreader.decorate_tensor_provider( + processor.data_generator( + batch_size=args.batch_size, phase='dev', epoch=1, dev_count=1, + shuffle=False)) + print("Final validation result:") + evaluate(exe, test_prog, test_pyreader, + [loss.name, accuracy.name, probs.name, labels.name, num_seqs.name], "dev") + + # final eval on test set + if args.do_test: + test_pyreader.decorate_tensor_provider( + processor.data_generator( + batch_size=args.batch_size, + phase='test', + epoch=1, + dev_count=1, + shuffle=False)) + print("Final test result:") + predict(exe, test_prog, test_pyreader, + [probs.name, num_seqs.name], "test", args.checkpoints + '/prob.txt') + + +if __name__ == '__main__': + print_arguments(args) + if args.ksplit <= 0: + train_single(args) + else: + train_kfold(args) diff --git a/PaddleNLP/Research/NAACL2019-MPM/train.sh b/PaddleNLP/Research/NAACL2019-MPM/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..fd43d26d7e91d1c6a242d14927b7fc658a8eea8e --- /dev/null +++ b/PaddleNLP/Research/NAACL2019-MPM/train.sh @@ -0,0 +1,28 @@ +#!/bin/sh +export CUDA_VISIBLE_DEVICES=0 +output_dir=./output +prob_dir=./probs +bert_dir=./uncased_L-24_H-1024_A-16 +mkdir -p $output_dir +mkdir -p $prob_dir + +for model_type in raw cnn gru ffa +do + python run_classifier.py \ + --bert_config_path ${bert_dir}/bert_config.json \ + --checkpoints ${output_dir}/bert_large_${model_type} \ + --init_pretraining_params ${bert_dir}/params \ + --data_dir ./data/Subtask-A \ + --vocab_path ${bert_dir}/vocab.txt \ + --task_name sem \ + --sub_model_type ${model_type} \ + --max_seq_len 128 \ + --batch_size 32 \ + --random_seed 777 \ + --save_steps 200 \ + --validation_steps 200 \ + --drop_keyword True + + mv ${output_dir}/bert_large_${model_type}/prob.txt ${prob_dir}/prob_${model_type}.txt +done + diff --git a/PaddleNLP/Research/README.md b/PaddleNLP/Research/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..8dd8cd2e652d9ab5caf43f51a9d50ae25c4c284e --- /dev/null +++ b/PaddleNLP/Research/README.md @@ -0,0 +1,3 @@ +## PaddleNLP for Research +This directory collects code and datasets for recent NLP research papers. +The code and datasets are fully open so that researchers can quickly get up to speed on current directions in NLP.
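Note on the `batch_reader` helper in the reader module of this diff: its batching rule switches between an example count and a padded token budget depending on `in_tokens`. The sketch below isolates that rule in plain Python so it can be checked by hand. It is an illustration under assumptions, not the code in the diff: the function name `batch_by_budget` is hypothetical, instances are reduced to bare token-id lists, and the `if batch:` guard before the mid-loop `yield` is an addition that the original does not have.

```python
def batch_by_budget(instances, batch_size, in_tokens):
    """Group token-id lists into batches (hypothetical helper, for illustration).

    With in_tokens=True, batch_size is a budget on padded tokens:
    (number of examples) * (longest example in the batch).
    With in_tokens=False, batch_size is simply the number of examples.
    """
    batch, max_len = [], 0
    for token_ids in instances:
        # Length the batch would be padded to if this instance joined it.
        max_len = max(max_len, len(token_ids))
        if in_tokens:
            fits = (len(batch) + 1) * max_len <= batch_size
        else:
            fits = len(batch) < batch_size
        if fits:
            batch.append(token_ids)
        else:
            # Added guard (not in the diff): never emit an empty batch when
            # a single instance alone exceeds the token budget.
            if batch:
                yield batch
            batch, max_len = [token_ids], len(token_ids)
    if batch:
        yield batch


if __name__ == "__main__":
    seqs = [[1] * 3, [1] * 3, [1] * 4, [1] * 2]
    print([len(b) for b in batch_by_budget(seqs, batch_size=8, in_tokens=True)])
    print([len(b) for b in batch_by_budget(seqs, batch_size=3, in_tokens=False)])
```

With a token budget of 8, the third sequence (length 4) would pad a three-example batch to 12 tokens, so it starts a new batch. This is why `max_train_steps` in `run_classifier.py` divides by `batch_size // max_seq_len` when `in_tokens` is set.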