diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/README.md b/PaddleNLP/Research/MRQA2019-BASELINE/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..0e3538c83dc5745b3df4ed731c0c2609fcec95cd
--- /dev/null
+++ b/PaddleNLP/Research/MRQA2019-BASELINE/README.md
@@ -0,0 +1,106 @@
+# A PaddlePaddle Baseline for the 2019 MRQA Shared Task
+
+Machine Reading for Question Answering (MRQA), which requires machines to comprehend text and answer questions about it, is a crucial task in natural language processing.
+
+Although recent systems achieve impressive results on several benchmarks, they are primarily evaluated on in-domain accuracy. The [2019 MRQA Shared Task](https://mrqa.github.io/shared) focuses on testing how well existing systems generalize to out-of-domain datasets.
+
+In this repository, we provide a baseline for the 2019 MRQA Shared Task that is built on top of [PaddlePaddle](https://github.com/paddlepaddle/paddle). It features:
+* ***Pre-trained Language Model***: [ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE) (Enhanced Representation through kNowledge IntEgration) is a pre-trained language model designed to learn better language representations through knowledge masking strategies (e.g. entity-level and phrase-level masking). Our ERNIE-based baseline outperforms the official BERT-based MRQA baseline by 6.1 points (macro-F1) on the out-of-domain dev set.
+* ***Multi-GPU Fine-tuning and Prediction***: Support for multi-GPU fine-tuning and prediction to accelerate experiments.
+
+You can use this repo as a starter codebase for the 2019 MRQA Shared Task and bootstrap your next model.
+
+## How to Run
+### Environment Requirements
+The MRQA baseline system has been tested with Python 2.7.13 and PaddlePaddle 1.5 on CentOS 6.3.
+The model is fine-tuned on 8 P40 GPUs with a total batch size of 4*8=32.
+
+### 1. Download Third-party Dependencies
+We use the evaluation script for *SQuAD v1.1*, which is equivalent to the official MRQA one. To download the SQuAD v1.1 evaluation script, run
+```
+wget https://worksheets.codalab.org/rest/bundles/0xbcd57bee090b421c982906709c8c27e1/contents/blob/ -O evaluate-v1.1.py
+```
+
+### 2. Download Dataset
+To download the MRQA datasets, run
+
+```
+cd data && sh download_data.sh && cd ..
+```
+The training and prediction datasets will be saved in `./data/train/` and `./data/dev/`, respectively.
+
+### 3. Preprocess
+The baseline system only supports dataset files in SQuAD format. Before running the system on MRQA datasets, one needs to convert the official MRQA data to SQuAD format. To do the conversion, run
+
+```
+cd data && sh convert_mrqa2squad.sh && cd ..
+```
+The output files will be named `xxx.raw.json`.
+
+For convenience, we also provide a script that combines all the training data into a single file, and likewise for the development data:
+
+```
+cd data && sh combine.sh && cd ..
+```
+The combined files will be saved in `./data/train/mrqa-combined.raw.json` and `./data/dev/mrqa-combined.raw.json`.
+
+
+### 4. Fine-tuning with ERNIE
+To get better performance than the official baseline, we provide a pre-trained model, **ERNIE**, for fine-tuning. To download the ERNIE parameters, run
+
+```
+sh download_pretrained_model.sh
+```
+The pre-trained model parameters and config files will be saved in `./ernie_model`.
+
+To start fine-tuning, run
+
+```
+sh run_finetuning.sh
+```
+The predicted results and model parameters will be saved in `./output`.
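+
+By default, `run_finetuning.sh` exports `CUDA_VISIBLE_DEVICES=0` and therefore fine-tunes on a single GPU. A minimal sketch of the multi-GPU setting reported above (assuming the Paddle trainer parallelizes over all devices listed in `CUDA_VISIBLE_DEVICES`, which is not verified here) is to edit that line inside the script:
+
+```
+# illustrative change inside run_finetuning.sh, not part of the original script
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7    # instead of: export CUDA_VISIBLE_DEVICES=0
+```
+With the script's `--batch_size 4` per card, 8 cards correspond to the total batch size of 4*8=32 used for the reported results.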
+
+### 5. Prediction
+Once fine-tuned, one can run prediction by specifying a model checkpoint saved in `./output/` (e.g. step\_3000, step\_5000\_final):
+
+```
+sh run_predict.sh parameters_to_restore
+```
+where `parameters_to_restore` refers to the model parameters to be used in the evaluation (e.g. output/step\_5000\_final). The predicted results will be saved in `./output/prediction.json`. For convenience, we also provide **[fine-tuned model parameters](https://baidu-nlp.bj.bcebos.com/MRQA2019-PaddlePaddle-fine-tuned-model.tar.gz)** on MRQA datasets. The model is fine-tuned for 2 epochs on 8 P40 GPUs with a total batch size of 4*8=32. The performance is shown below.
+
+##### in-domain dev (F1/EM)
+
+| Model | HotpotQA | NaturalQ | NewsQA | SearchQA | SQuAD | TriviaQA | Macro-F1 |
+| :------------- | :---------: | :----------: | :---------: | :----------: | :---------: | :----------: |:----------: |
+| baseline + EMA | 82.3/66.8 | 81.6/70.0 | 73.1/57.9 | 85.1/79.1 | 93.3/87.1 | 79.0/73.4 | 82.4 |
+| baseline w/o EMA | 82.4/66.9 | 81.7/69.9 | 73.0/57.8 | 85.1/79.2 | 93.4/87.2 | 79.0/73.4 | 82.4 |
+
+##### out-of-domain dev (F1/EM)
+
+| Model | BioASQ | DROP | DuoRC | RACE | RE | Textbook | Macro-F1 |
+| :------------- | :---------: | :----------: | :---------: | :----------: | :---------: | :----------: |:----------: |
+| baseline + EMA | 70.2/54.7 | 57.3/47.5 | 64.1/52.8 | 51.7/37.2 | 87.9/77.7 | 63.1/53.5 | 65.7 |
+| baseline w/o EMA | 69.9/54.6 | 57.0/47.3 | 64.0/52.8 | 51.8/37.4 | 87.8/77.6 | 63.0/53.4 | 65.6 |
+
+Note that we turn on exponential moving average (EMA) during training by default (in most cases EMA improves performance) and save the EMA parameters into the final checkpoint files. The answers predicted with the EMA parameters are saved in `ema_predictions.json`.
+
+
+### 6. Evaluation
+To evaluate the results, run
+
+```
+sh run_evaluation.sh
+```
+Note that we use the evaluation script for *SQuAD v1.1* here, which is equivalent to the official one.
+
+# Copyright and License
+Copyright 2019 Baidu.com, Inc. All Rights Reserved
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
\ No newline at end of file
diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/data/combine.py b/PaddleNLP/Research/MRQA2019-BASELINE/data/combine.py
new file mode 100644
index 0000000000000000000000000000000000000000..d9f6b0fc03730a688387c4b3cc322f4f426147d0
--- /dev/null
+++ b/PaddleNLP/Research/MRQA2019-BASELINE/data/combine.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+# ==============================================================================
+# Copyright 2017 Baidu.com, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +""" +This module add all train/dev data to a file named "mrqa-combined.raw.json". +""" + +import json +import argparse +import glob + +# path of train/dev data +parser = argparse.ArgumentParser() +parser.add_argument('path', help='the path of train/dev data') +args = parser.parse_args() +path = args.path + +# all train/dev data files +files = glob.glob(path + '/*.raw.json') +print ('files:', files) + +# add all train/dev data to "datasets" +with open(files[0]) as fin: + datasets = json.load(fin) +for i in range(1, len(files)): + with open(files[i]) as fin: + dataset = json.load(fin) + datasets['data'].extend(dataset['data']) + +# save to "mrqa-combined.raw.json" +with open(path + '/mrqa-combined.raw.json', 'w') as fout: + json.dump(datasets, fout, indent=4) diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/data/combine.sh b/PaddleNLP/Research/MRQA2019-BASELINE/data/combine.sh new file mode 100644 index 0000000000000000000000000000000000000000..8173f047967d4a3342b3ca7589250b83c386ddae --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/data/combine.sh @@ -0,0 +1,26 @@ +#!/usr/bin/env bash +# ============================================================================== +# Copyright 2017 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# path of train and dev data +PATH_train=train +PATH_dev=dev + +# add all train data to a file "$PATH_train/mrqa-combined.raw.json". +python combine.py $PATH_train + +# add all dev data to a file "$PATH_dev/mrqa-combined.raw.json". +python combine.py $PATH_dev \ No newline at end of file diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/data/convert_mrqa2squad.py b/PaddleNLP/Research/MRQA2019-BASELINE/data/convert_mrqa2squad.py new file mode 100644 index 0000000000000000000000000000000000000000..a585c485ae0f676e6bb160815b42c66d529d3d8c --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/data/convert_mrqa2squad.py @@ -0,0 +1,158 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# ============================================================================== +# Copyright 2017 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +""" +This module convert MRQA official data to SQuAD format +""" + +import json +import argparse +import re + + +def reader(filename): + """ + This function read a MRQA data file. + :param filename: name of a MRQA data file. + :return: original samples of a MRQA data file. + """ + with open(filename) as fin: + for lidx, line in enumerate(fin): + if lidx == 0: + continue + sample = json.loads(line.strip()) + yield sample + + +def to_squad_para_train(sample): + """ + This function convert training data from MRQA format to SQuAD format. + :param sample: one sample in MRQA format. + :return: paragraphs in SQuAD format. + """ + squad_para = dict() + context = sample['context'] + context = re.sub(r'\[TLE\]|\[DOC\]|\[PAR\]', '[SEP]', context) + # replace special tokens to [SEP] to avoid UNK in BERT + squad_para['context'] = context + qas = [] + for qa in sample['qas']: + text = qa['detected_answers'][0]['text'] + new_start = context.find(text) + # Try to find an exact match (without normalization) of the reference answer. + # Some articles like {a|an|the} my get lost in the original spans. + # E.g. the reference answer is "The London Eye", + # while the original span may only contain "London Eye" due to normalization. + new_end = new_start + len(text) - 1 + org_start = qa['detected_answers'][0]['char_spans'][0][0] + org_end = qa['detected_answers'][0]['char_spans'][0][1] + if new_start == -1 or len(text) < 8: + # If no exact match (without normalization) can be found or reference answer is too short + # (e.g. only contain a character "c", which will cause problems using find), + # use the original span in MRQA dataset. + answer = { + 'text': squad_para['context'][org_start:org_end + 1], + 'answer_start': org_start + } + answer_start = org_start + answer_end = org_end + else: + answer = { + 'text': text, + 'answer_start': new_start + } + answer_start = new_start + answer_end = new_end + # A sanity check + try: + assert answer['text'].lower() == squad_para['context'][answer_start:answer_end + 1].lower() + except AssertionError: + print(answer['text']) + print(squad_para['context'][answer_start:answer_end + 1]) + continue + squad_qa = { + 'question': qa['question'], + 'id': qa['qid'], + 'answers': [answer] + } + qas.append(squad_qa) + squad_para['qas'] = qas + return squad_para + + +def to_squad_para_dev(sample): + """ + This function convert development data from MRQA format to SQuAD format. + :param sample: one sample in MRQA format. + :return: paragraphs in SQuAD format. 
+ """ + + squad_para = dict() + context = sample['context'] + context = re.sub(r'\[TLE\]|\[DOC\]|\[PAR\]', '[SEP]', context) + squad_para['context'] = context + qas = [] + for qa in sample['qas']: + org_answers = qa['answers'] + answers = [] + for org_answer in org_answers: + answer = { + 'text': org_answer, + 'answer_start': -1 + } + answers.append(answer) + squad_qa = { + 'question': qa['question'], + 'id': qa['qid'], + 'answers': answers + } + qas.append(squad_qa) + squad_para['qas'] = qas + return squad_para + + +def doc_wrapper(squad_para, title=""): + """ + This function wrap paragraphs into a document. + :param squad_para: paragraphs in SQuAD format. + :param title: the title of paragraphs. + :return: wrap of title and paragraphs + """ + squad_doc = { + 'title': title, + 'paragraphs': [squad_para] + } + return squad_doc + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument('input', help='the input file') + parser.add_argument('--dev', action='store_true', help='convert devset') + args = parser.parse_args() + file_prefix = args.input[0:-6] + squad = { + 'data': [], + 'version': "1.1" + } + to_squad_para = to_squad_para_dev if args.dev else to_squad_para_train + for org_sample in reader(args.input): + para = to_squad_para(org_sample) + doc = doc_wrapper(para) + squad['data'].append(doc) + with open('{}.raw.json'.format(file_prefix), 'w') as fout: + json.dump(squad, fout, indent=4) diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/data/convert_mrqa2squad.sh b/PaddleNLP/Research/MRQA2019-BASELINE/data/convert_mrqa2squad.sh new file mode 100644 index 0000000000000000000000000000000000000000..eb323d3af5057949e0ecd23c816946ce884cbe50 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/data/convert_mrqa2squad.sh @@ -0,0 +1,34 @@ +#!/usr/bin/env bash +# ============================================================================== +# Copyright 2017 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +# path of train and dev data +PATH_train=train +PATH_dev=dev + +# Convert train data from MRQA format to SQuAD format +NAME_LIST_train="SQuAD NewsQA TriviaQA SearchQA HotpotQA NaturalQuestions" +for name in $NAME_LIST_train;do + echo "Converting training data from MRQA format to SQuAD format: ""$name" + python convert_mrqa2squad.py $PATH_train/$name.jsonl +done + +# Convert dev data from MRQA format to SQuAD format +NAME_LIST_dev="SQuAD NewsQA TriviaQA SearchQA HotpotQA NaturalQuestions BioASQ TextbookQA RelationExtraction DROP DuoRC RACE" +for name in $NAME_LIST_dev;do + echo "Converting development data from MRQA format to SQuAD format: ""$name" + python convert_mrqa2squad.py --dev $PATH_dev/$name.jsonl +done diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/data/dev/md5sum_dev.txt b/PaddleNLP/Research/MRQA2019-BASELINE/data/dev/md5sum_dev.txt new file mode 100644 index 0000000000000000000000000000000000000000..faef333c91f691207d106b9f4a8b9542947f66ac --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/data/dev/md5sum_dev.txt @@ -0,0 +1,12 @@ +05f3f16c5c31ba8e46ff5fa80647ac46 SQuAD.jsonl.gz +5c188c92a84ddffe2ab590ac7598bde2 NewsQA.jsonl.gz +a7a3bd90db58524f666e757db659b047 TriviaQA.jsonl.gz +bfcb304f1b3167693b627cbf0f98bc9e SearchQA.jsonl.gz +675de35c3605353ec039ca4d2854072d HotpotQA.jsonl.gz +c0347eebbca02d10d1b07b9a64efe61d NaturalQuestions.jsonl.gz +6408dc4fcf258535d0ea8b125bba5fbb BioASQ.jsonl.gz +76ca9cc16625dd8da75758d64676e6a1 TextbookQA.jsonl.gz +128d318ea1391bf77234d8c1b69a45df RelationExtraction.jsonl.gz +8b03867e4da2817ef341707040d99785 DROP.jsonl.gz +9e66769a70fdfdec4906a4bcef5f3d71 DuoRC.jsonl.gz +94a7ef9b9ea9402671e5b0248b6a5395 RACE.jsonl.gz \ No newline at end of file diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/data/download_data.sh b/PaddleNLP/Research/MRQA2019-BASELINE/data/download_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..3661253b96e9cfea8463d014997155d6efab9eb5 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/data/download_data.sh @@ -0,0 +1,79 @@ +#!/usr/bin/env bash +# ============================================================================== +# Copyright 2017 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + + +# path to save data +OUTPUT_train=train +OUTPUT_dev=dev + +DATA_URL="https://s3.us-east-2.amazonaws.com/mrqa/release/v2" +alias wget="wget -c --no-check-certificate" +# download training datasets +wget $DATA_URL/train/SQuAD.jsonl.gz -O $OUTPUT_train/SQuAD.jsonl.gz +wget $DATA_URL/train/NewsQA.jsonl.gz -O $OUTPUT_train/NewsQA.jsonl.gz +wget $DATA_URL/train/TriviaQA-web.jsonl.gz -O $OUTPUT_train/TriviaQA.jsonl.gz +wget $DATA_URL/train/SearchQA.jsonl.gz -O $OUTPUT_train/SearchQA.jsonl.gz +wget $DATA_URL/train/HotpotQA.jsonl.gz -O $OUTPUT_train/HotpotQA.jsonl.gz +wget $DATA_URL/train/NaturalQuestionsShort.jsonl.gz -O $OUTPUT_train/NaturalQuestions.jsonl.gz + +# download the in-domain development data +wget $DATA_URL/dev/SQuAD.jsonl.gz -O $OUTPUT_dev/SQuAD.jsonl.gz +wget $DATA_URL/dev/NewsQA.jsonl.gz -O $OUTPUT_dev/NewsQA.jsonl.gz +wget $DATA_URL/dev/TriviaQA-web.jsonl.gz -O $OUTPUT_dev/TriviaQA.jsonl.gz +wget $DATA_URL/dev/SearchQA.jsonl.gz -O $OUTPUT_dev/SearchQA.jsonl.gz +wget $DATA_URL/dev/HotpotQA.jsonl.gz -O $OUTPUT_dev/HotpotQA.jsonl.gz +wget $DATA_URL/dev/NaturalQuestionsShort.jsonl.gz -O $OUTPUT_dev/NaturalQuestions.jsonl.gz + +# download the out-of-domain development data +wget http://participants-area.bioasq.org/MRQA2019/ -O $OUTPUT_dev/BioASQ.jsonl.gz +wget $DATA_URL/dev/TextbookQA.jsonl.gz -O $OUTPUT_dev/TextbookQA.jsonl.gz +wget $DATA_URL/dev/RelationExtraction.jsonl.gz -O $OUTPUT_dev/RelationExtraction.jsonl.gz +wget $DATA_URL/dev/DROP.jsonl.gz -O $OUTPUT_dev/DROP.jsonl.gz +wget $DATA_URL/dev/DuoRC.ParaphraseRC.jsonl.gz -O $OUTPUT_dev/DuoRC.jsonl.gz +wget $DATA_URL/dev/RACE.jsonl.gz -O $OUTPUT_dev/RACE.jsonl.gz + +# check md5sum for training datasets +cd $OUTPUT_train +if md5sum --status -c md5sum_train.txt; then + echo "finish download training data" +else + echo "md5sum check failed!" +fi +cd .. + +# check md5sum for development data +cd $OUTPUT_dev +if md5sum --status -c md5sum_dev.txt; then + echo "finish download development data" +else + echo "md5sum check failed!" +fi +cd .. 
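+
+# Note (illustrative, not part of the original script): the checksum checks above
+# only print a message on mismatch, and the script still unzips the archives below.
+# A stricter variant would abort on failure, e.g.:
+#   md5sum --status -c md5sum_dev.txt || { echo "md5sum check failed!"; exit 1; }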
+ +# gzip training datasets +echo "unzipping train data" +NAME_LIST_train="SQuAD NewsQA TriviaQA SearchQA HotpotQA NaturalQuestions" +for name in $NAME_LIST_train;do + gzip -d $OUTPUT_train/$name.jsonl.gz +done + +# gzip development data +echo "unzipping dev data" +NAME_LIST_dev="SQuAD NewsQA TriviaQA SearchQA HotpotQA NaturalQuestions BioASQ TextbookQA RelationExtraction DROP DuoRC RACE" +for name in $NAME_LIST_dev;do + gzip -d $OUTPUT_dev/$name.jsonl.gz +done diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/data/train/md5sum_train.txt b/PaddleNLP/Research/MRQA2019-BASELINE/data/train/md5sum_train.txt new file mode 100644 index 0000000000000000000000000000000000000000..68d9d2befc8c7e35d47264ae2d1043ed8c5999c7 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/data/train/md5sum_train.txt @@ -0,0 +1,6 @@ +efd6a551d2697c20a694e933210489f8 SQuAD.jsonl.gz +182f4e977b849cb1dbfb796030b91444 NewsQA.jsonl.gz +e18f586152612a9358c22f5536bfd32a TriviaQA.jsonl.gz +612245315e6e7c4d8446e5fcc3dc1086 SearchQA.jsonl.gz +d212c7b3fc949bd0dc47d124e8c34907 HotpotQA.jsonl.gz +e27d27bf7c49eb5ead43cef3f41de6be NaturalQuestions.jsonl.gz \ No newline at end of file diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/download_pretrained_model.sh b/PaddleNLP/Research/MRQA2019-BASELINE/download_pretrained_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..03a8e9869fd08358c74db996feb50de6ef149d45 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/download_pretrained_model.sh @@ -0,0 +1,21 @@ +#!/usr/bin/env bash +# ============================================================================== +# Copyright 2017 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# download pre_train model +wget ftp://cp01-rdqa04-dev119-zhangxiyuan01.epc.baidu.com/home/users/zhangxiyuan01/models/ernie_model.tar.gz +tar -xvf ernie_model.tar.gz +rm ernie_model.tar.gz diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/run_evaluation.sh b/PaddleNLP/Research/MRQA2019-BASELINE/run_evaluation.sh new file mode 100644 index 0000000000000000000000000000000000000000..d0f91d791df47afb67ba0a07668a010099ee2f92 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/run_evaluation.sh @@ -0,0 +1,27 @@ +#!/usr/bin/env bash +# ============================================================================== +# Copyright 2017 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +# path of dev data +PATH_dev=./data/dev +# path of dev prediction +PATH_prediction=./output/ema_predictions.json + +# evaluation +for dataset in `ls $PATH_dev/*.raw.json`;do + echo $dataset + python evaluate-v1.1.py $dataset $PATH_prediction +done diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/run_finetuning.sh b/PaddleNLP/Research/MRQA2019-BASELINE/run_finetuning.sh new file mode 100644 index 0000000000000000000000000000000000000000..1781bf80629d6d1613e444475cc51f5b5adc081d --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/run_finetuning.sh @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +# ============================================================================== +# Copyright 2017 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +set -xe + +export FLAGS_sync_nccl_allreduce=0 +export FLAGS_eager_delete_tensor_gb=1 + +# set CUDA_VISIBLE_DEVICES +export CUDA_VISIBLE_DEVICES=0 + +# path of pre_train model +BERT_BASE_PATH=ernie_model +# path to save checkpoint +CHECKPOINT_PATH=output/ +mkdir -p $CHECKPOINT_PATH +# path of train and dev data +DATA_PATH_train=data/train +DATA_PATH_dev=data/dev + +# fine-tune params +python -u src/run_mrqa.py --use_cuda true\ + --batch_size 4 \ + --in_tokens false \ + --init_pretraining_params ${BERT_BASE_PATH}/params \ + --checkpoints ${CHECKPOINT_PATH} \ + --vocab_path ${BERT_BASE_PATH}/vocab.txt \ + --do_train true \ + --do_predict true \ + --save_steps 10000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --epoch 2 \ + --max_seq_len 512 \ + --bert_config_path ${BERT_BASE_PATH}/bert_config.json \ + --predict_file ${DATA_PATH_dev}/mrqa-combined.raw.json \ + --do_lower_case false \ + --doc_stride 128 \ + --train_file ${DATA_PATH_train}/mrqa-combined.raw.json \ + --learning_rate 3e-5 \ + --lr_scheduler linear_warmup_decay \ + --skip_steps 200 diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/run_predict.sh b/PaddleNLP/Research/MRQA2019-BASELINE/run_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..27dbd54722999c1374069c8a43843c31afe30b04 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/run_predict.sh @@ -0,0 +1,54 @@ +#!/usr/bin/env bash +# ============================================================================== +# Copyright 2017 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +set -xe + +# set CUDA_VISIBLE_DEVICES +export CUDA_VISIBLE_DEVICES=0 + +# path of pre_train model +BERT_BASE_PATH=ernie_model +# path to save checkpoint +CHECKPOINT_PATH=output/ +mkdir -p $CHECKPOINT_PATH +# path of init_checkpoint +PATH_init_checkpoint=$1 +# path of dev data +DATA_PATH_dev=data/dev + +# fine-tune params +python -u src/run_mrqa.py --use_cuda true\ + --batch_size 8 \ + --in_tokens false \ + --init_pretraining_params ${BERT_BASE_PATH}/params \ + --init_checkpoint ${PATH_init_checkpoint} \ + --checkpoints ${CHECKPOINT_PATH} \ + --vocab_path ${BERT_BASE_PATH}/vocab.txt \ + --do_train false \ + --do_predict true \ + --save_steps 10000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --epoch 2 \ + --max_seq_len 512 \ + --bert_config_path ${BERT_BASE_PATH}/bert_config.json \ + --predict_file ${DATA_PATH_dev}/mrqa-combined.raw.json \ + --do_lower_case true \ + --doc_stride 128 \ + --learning_rate 3e-5 \ + --lr_scheduler linear_warmup_decay \ + --skip_steps 200 diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/batching.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/batching.py new file mode 100644 index 0000000000000000000000000000000000000000..13803cf79f00d7ea5857360eeceb2da354a5e626 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/batching.py @@ -0,0 +1,183 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Mask, padding and batching.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import numpy as np + + +def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3): + """ + Add mask for batch_tokens, return out, mask_label, mask_pos; + Note: mask_pos responding the batch_tokens after padded; + """ + max_len = max([len(sent) for sent in batch_tokens]) + mask_label = [] + mask_pos = [] + prob_mask = np.random.rand(total_token_num) + # Note: the first token is [CLS], so [low=1] + replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num) + pre_sent_len = 0 + prob_index = 0 + for sent_index, sent in enumerate(batch_tokens): + mask_flag = False + prob_index += pre_sent_len + for token_index, token in enumerate(sent): + prob = prob_mask[prob_index + token_index] + if prob > 0.15: + continue + elif 0.03 < prob <= 0.15: + # mask + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = MASK + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + elif 0.015 < prob <= 0.03: + # random replace + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = replace_ids[prob_index + token_index] + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + else: + # keep the original token + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + mask_pos.append(sent_index * max_len + token_index) + pre_sent_len = len(sent) + # ensure at least mask one word in a sentence + while not mask_flag: + token_index = int(np.random.randint(1, high=len(sent) - 1, size=1)) + if sent[token_index] != SEP and sent[token_index] != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = MASK + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) + mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) + return batch_tokens, mask_label, mask_pos + + +def prepare_batch_data(insts, + total_token_num, + max_len=None, + voc_size=0, + pad_id=None, + cls_id=None, + sep_id=None, + mask_id=None, + return_input_mask=True, + return_max_len=True, + return_num_token=False): + """ + 1. generate Tensor of data + 2. generate Tensor of position + 3. 
generate self attention mask, [shape: batch_size * max_len * max_len] + """ + batch_src_ids = [inst[0] for inst in insts] + batch_sent_ids = [inst[1] for inst in insts] + batch_pos_ids = [inst[2] for inst in insts] + labels_list = [] + # compatible with mrqa, whose example includes start/end positions, + # or unique id + for i in range(3, len(insts[0]), 1): + labels = [inst[i] for inst in insts] + labels = np.array(labels).astype("int64").reshape([-1, 1]) + labels_list.append(labels) + # First step: do mask without padding + if mask_id >= 0: + out, mask_label, mask_pos = mask( + batch_src_ids, + total_token_num, + vocab_size=voc_size, + CLS=cls_id, + SEP=sep_id, + MASK=mask_id) + else: + out = batch_src_ids + # Second step: padding + src_id, self_input_mask = pad_batch_data( + out, + max_len=max_len, + pad_idx=pad_id, return_input_mask=True) + pos_id = pad_batch_data( + batch_pos_ids, + max_len=max_len, + pad_idx=pad_id, + return_pos=False, + return_input_mask=False) + sent_id = pad_batch_data( + batch_sent_ids, + max_len=max_len, + pad_idx=pad_id, + return_pos=False, + return_input_mask=False) + if mask_id >= 0: + return_list = [ + src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos + ] + labels_list + else: + return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list + return return_list if len(return_list) > 1 else return_list[0] + + +def pad_batch_data(insts, + max_len=None, + pad_idx=0, + return_pos=False, + return_input_mask=False, + return_max_len=False, + return_num_token=False): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and input mask. + """ + return_list = [] + if max_len is None: + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + inst_data = np.array([ + list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts + ]) + return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])] + # position data + if return_pos: + inst_pos = np.array([ + list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) + for inst in insts + ]) + return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])] + if return_input_mask: + # This is used to avoid attention on paddings. + input_mask_data = np.array([[1] * len(inst) + [0] * + (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + if return_max_len: + return_list += [max_len] + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + return return_list if len(return_list) > 1 else return_list[0] + + +if __name__ == "__main__": + pass + + diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/model/__init__.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/model/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/model/bert.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/model/bert.py new file mode 100644 index 0000000000000000000000000000000000000000..c17803caed17e81fafd55f9b9ae9f2b539f9f39c --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/model/bert.py @@ -0,0 +1,226 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""BERT model.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import six +import json +import numpy as np +import paddle.fluid as fluid +from model.transformer_encoder import encoder, pre_process_layer + + +class BertConfig(object): + def __init__(self, config_path): + self._config_dict = self._parse(config_path) + + def _parse(self, config_path): + try: + with open(config_path) as json_file: + config_dict = json.load(json_file) + except Exception: + raise IOError("Error in parsing bert model config file '%s'" % + config_path) + else: + return config_dict + + def __getitem__(self, key): + return self._config_dict[key] + + def print_config(self): + for arg, value in sorted(six.iteritems(self._config_dict)): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + + +class BertModel(object): + def __init__(self, + src_ids, + position_ids, + sentence_ids, + input_mask, + config, + weight_sharing=True, + use_fp16=False): + + self._emb_size = config['hidden_size'] + self._n_layer = config['num_hidden_layers'] + self._n_head = config['num_attention_heads'] + self._voc_size = config['vocab_size'] + self._max_position_seq_len = config['max_position_embeddings'] + self._sent_types = config['type_vocab_size'] + self._hidden_act = config['hidden_act'] + self._prepostprocess_dropout = config['hidden_dropout_prob'] + self._attention_dropout = config['attention_probs_dropout_prob'] + self._weight_sharing = weight_sharing + + self._word_emb_name = "word_embedding" + self._pos_emb_name = "pos_embedding" + self._sent_emb_name = "sent_embedding" + self._dtype = "float16" if use_fp16 else "float32" + + # Initialize all weigths by truncated normal initializer, and all biases + # will be initialized by constant zero by default. 
+ self._param_initializer = fluid.initializer.TruncatedNormal( + scale=config['initializer_range']) + + self._build_model(src_ids, position_ids, sentence_ids, input_mask) + + def _build_model(self, src_ids, position_ids, sentence_ids, input_mask): + # padding id in vocabulary must be set to 0 + emb_out = fluid.layers.embedding( + input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr( + name=self._word_emb_name, initializer=self._param_initializer), + is_sparse=False) + position_emb_out = fluid.layers.embedding( + input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr( + name=self._pos_emb_name, initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding( + sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._dtype, + param_attr=fluid.ParamAttr( + name=self._sent_emb_name, initializer=self._param_initializer)) + + emb_out = emb_out + position_emb_out + emb_out = emb_out + sent_emb_out + + emb_out = pre_process_layer( + emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder') + + if self._dtype == "float16": + input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype) + + self_attn_mask = fluid.layers.matmul( + x=input_mask, y=input_mask, transpose_y=True) + self_attn_mask = fluid.layers.scale( + x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = fluid.layers.stack( + x=[self_attn_mask] * self._n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + self._enc_out = encoder( + enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') + + def get_sequence_output(self): + return self._enc_out + + def get_pooled_output(self): + """Get the first feature of each sequence for classification""" + + next_sent_feat = fluid.layers.slice( + input=self._enc_out, axes=[1], starts=[0], ends=[1]) + next_sent_feat = fluid.layers.fc( + input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr( + name="pooled_fc.w_0", initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") + return next_sent_feat + + def get_pretraining_output(self, mask_label, mask_pos, labels): + """Get the loss & accuracy for pretraining""" + + mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32') + + # extract the first token feature in each sentence + next_sent_feat = self.get_pooled_output() + reshaped_emb_out = fluid.layers.reshape( + x=self._enc_out, shape=[-1, self._emb_size]) + # extract masked tokens' feature + mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) + + # transform: fc + mask_trans_feat = fluid.layers.fc( + input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr( + name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + # transform: layer norm + mask_trans_feat = pre_process_layer( + mask_trans_feat, 'n', name='mask_lm_trans') + + mask_lm_out_bias_attr = fluid.ParamAttr( + 
name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) + if self._weight_sharing: + fc_out = fluid.layers.matmul( + x=mask_trans_feat, + y=fluid.default_main_program().global_block().var( + self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter( + shape=[self._voc_size], + dtype=self._dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) + + else: + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr( + name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) + + mask_lm_loss = fluid.layers.softmax_with_cross_entropy( + logits=fc_out, label=mask_label) + mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) + + next_sent_fc_out = fluid.layers.fc( + input=next_sent_feat, + size=2, + param_attr=fluid.ParamAttr( + name="next_sent_fc.w_0", initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") + + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( + logits=next_sent_fc_out, label=labels, return_softmax=True) + + next_sent_acc = fluid.layers.accuracy( + input=next_sent_softmax, label=labels) + + mean_next_sent_loss = fluid.layers.mean(next_sent_loss) + + loss = mean_next_sent_loss + mean_mask_lm_loss + return next_sent_acc, mean_mask_lm_loss, loss diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/model/transformer_encoder.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..93a77ebe480f0e4a8e2b4f2c0c18b23383075fb7 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/model/transformer_encoder.py @@ -0,0 +1,342 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial +import numpy as np + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError( + "Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_query_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_key_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_value_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape( + x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape( + x=trans_x, + shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], + inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout( + weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat( + [layers.reshape( + cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat( + [layers.reshape( + cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, + dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. 
+ proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_output_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, + d_inner_hid, + d_hid, + dropout_rate, + hidden_act, + param_initializer=None, + name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr( + name=name + '_fc_0.w_0', + initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout( + hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., + name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm( + out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr( + name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr( + name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout( + out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. + This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. 
+ """ + attn_output = multi_head_attention( + pre_process_layer( + enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer( + enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward( + pre_process_layer( + attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer( + attn_output, + ffd_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_ffn') + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer( + enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/optimization.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/optimization.py new file mode 100755 index 0000000000000000000000000000000000000000..151ee27675e559cc1d381facf9391bcf9cff7b97 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/optimization.py @@ -0,0 +1,134 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Optimization and learning rate scheduling.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import paddle.fluid as fluid +from utils.fp16 import create_master_params_grads, master_param_to_train_param + + +def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps): + """ Applies linear warmup of learning rate from 0 and decay to 0.""" + with fluid.default_main_program()._lr_schedule_guard(): + lr = fluid.layers.tensor.create_global_var( + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="scheduled_learning_rate") + + global_step = fluid.layers.learning_rate_scheduler._decay_step_counter() + + with fluid.layers.control_flow.Switch() as switch: + with switch.case(global_step < warmup_steps): + warmup_lr = learning_rate * (global_step / warmup_steps) + fluid.layers.tensor.assign(warmup_lr, lr) + with switch.default(): + decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay( + learning_rate=learning_rate, + decay_steps=num_train_steps, + end_learning_rate=0.0, + power=1.0, + cycle=False) + fluid.layers.tensor.assign(decayed_lr, lr) + + return lr + + +def optimization(loss, + warmup_steps, + num_train_steps, + learning_rate, + train_program, + startup_prog, + weight_decay, + scheduler='linear_warmup_decay', + use_fp16=False, + loss_scaling=1.0): + if warmup_steps > 0: + if scheduler == 'noam_decay': + scheduled_lr = fluid.layers.learning_rate_scheduler\ + .noam_decay(1/(warmup_steps *(learning_rate ** 2)), + warmup_steps) + elif scheduler == 'linear_warmup_decay': + scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps, + num_train_steps) + else: + raise ValueError("Unkown learning rate scheduler, should be " + "'noam_decay' or 'linear_warmup_decay'") + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr) + else: + optimizer = fluid.optimizer.Adam(learning_rate=learning_rate) + scheduled_lr = learning_rate + + fluid.clip.set_gradient_clip( + clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)) + + def exclude_from_weight_decay(name): + if name.find("layer_norm") > -1: + return True + bias_suffix = ["_bias", "_b", ".b_0"] + for suffix in bias_suffix: + if name.endswith(suffix): + return True + return False + + param_list = dict() + + if use_fp16: + param_grads = optimizer.backward(loss) + master_param_grads = create_master_params_grads( + param_grads, train_program, startup_prog, loss_scaling) + + for param, _ in master_param_grads: + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + optimizer.apply_gradients(master_param_grads) + + if weight_decay > 0: + for param, grad in master_param_grads: + if exclude_from_weight_decay(param.name.rstrip(".master")): + continue + with param.block.program._optimized_guard( + [param, grad]), fluid.framework.name_scope("weight_decay"): + updated_param = param - param_list[ + param.name] * weight_decay * scheduled_lr + fluid.layers.assign(output=param, input=updated_param) + + master_param_to_train_param(master_param_grads, param_grads, + train_program) + + else: + for param in train_program.global_block().all_parameters(): + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + _, param_grads = optimizer.minimize(loss) + + if weight_decay > 0: + for param, grad in param_grads: + if exclude_from_weight_decay(param.name): + continue + with param.block.program._optimized_guard( + [param, grad]), 
fluid.framework.name_scope("weight_decay"): + updated_param = param - param_list[ + param.name] * weight_decay * scheduled_lr + fluid.layers.assign(output=param, input=updated_param) + + return scheduled_lr diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/reader/__init__.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/reader/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/reader/mrqa.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/reader/mrqa.py new file mode 100644 index 0000000000000000000000000000000000000000..94a7471e94f7b1e1a6f8f08636c53b44caf381bf --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/reader/mrqa.py @@ -0,0 +1,1101 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Run MRQA""" + +import six +import math +import json +import random +import collections +import numpy as np +import tokenization +from batching import prepare_batch_data + + +class MRQAExample(object): + """A single training/test example for simple sequence classification. + + For examples without an answer, the start and end position are -1. + """ + + def __init__(self, + qas_id, + question_text, + doc_tokens, + orig_answer_text=None, + start_position=None, + end_position=None, + is_impossible=False): + self.qas_id = qas_id + self.question_text = question_text + self.doc_tokens = doc_tokens + self.orig_answer_text = orig_answer_text + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + def __str__(self): + return self.__repr__() + + def __repr__(self): + s = "" + s += "qas_id: %s" % (tokenization.printable_text(self.qas_id)) + s += ", question_text: %s" % ( + tokenization.printable_text(self.question_text)) + s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens)) + if self.start_position: + s += ", start_position: %d" % (self.start_position) + if self.start_position: + s += ", end_position: %d" % (self.end_position) + if self.start_position: + s += ", is_impossible: %r" % (self.is_impossible) + return s + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, + unique_id, + example_index, + doc_span_index, + tokens, + token_to_orig_map, + token_is_max_context, + input_ids, + input_mask, + segment_ids, + start_position=None, + end_position=None, + is_impossible=None): + self.unique_id = unique_id + self.example_index = example_index + self.doc_span_index = doc_span_index + self.tokens = tokens + self.token_to_orig_map = token_to_orig_map + self.token_is_max_context = token_is_max_context + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + +def read_mrqa_examples(input_file, is_training, with_negative=False): + """Read a MRQA json file into a list of 
MRQAExample.""" + with open(input_file, "r") as reader: + input_data = json.load(reader)["data"] + + def is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: + return True + return False + + examples = [] + for entry in input_data: + for paragraph in entry["paragraphs"]: + paragraph_text = paragraph["context"] + doc_tokens = [] + char_to_word_offset = [] + prev_is_whitespace = True + for c in paragraph_text: + if is_whitespace(c): + prev_is_whitespace = True + else: + if prev_is_whitespace: + doc_tokens.append(c) + else: + doc_tokens[-1] += c + prev_is_whitespace = False + char_to_word_offset.append(len(doc_tokens) - 1) + + for qa in paragraph["qas"]: + qas_id = qa["id"] + question_text = qa["question"] + start_position = None + end_position = None + orig_answer_text = None + is_impossible = False + if is_training: + + if with_negative: + is_impossible = qa["is_impossible"] + if (len(qa["answers"]) != 1) and (not is_impossible): + raise ValueError( + "For training, each question should have exactly 1 answer." + ) + if not is_impossible: + answer = qa["answers"][0] + orig_answer_text = answer["text"] + answer_offset = answer["answer_start"] + answer_length = len(orig_answer_text) + start_position = char_to_word_offset[answer_offset] + end_position = char_to_word_offset[answer_offset + + answer_length - 1] + # Only add answers where the text can be exactly recovered from the + # document. If this CAN'T happen it's likely due to weird Unicode + # stuff so we will just skip the example. + # + # Note that this means for training mode, every example is NOT + # guaranteed to be preserved. + actual_text = " ".join(doc_tokens[start_position:( + end_position + 1)]) + cleaned_answer_text = " ".join( + tokenization.whitespace_tokenize(orig_answer_text)) + if actual_text.find(cleaned_answer_text) == -1: + print("Could not find answer: '%s' vs. 
'%s'", + actual_text, cleaned_answer_text) + continue + else: + start_position = -1 + end_position = -1 + orig_answer_text = "" + + example = MRQAExample( + qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_position, + end_position=end_position, + is_impossible=is_impossible) + examples.append(example) + + return examples + + +def convert_examples_to_features( + examples, + tokenizer, + max_seq_length, + doc_stride, + max_query_length, + is_training, + #output_fn +): + """Loads a data file into a list of `InputBatch`s.""" + + unique_id = 1000000000 + + for (example_index, example) in enumerate(examples): + query_tokens = tokenizer.tokenize(example.question_text) + + if len(query_tokens) > max_query_length: + query_tokens = query_tokens[0:max_query_length] + + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training and example.is_impossible: + tok_start_position = -1 + tok_end_position = -1 + if is_training and not example.is_impossible: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = _improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, tokenizer, + example.orig_answer_text) + + # The -3 accounts for [CLS], [SEP] and [SEP] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + + # We can have documents that are longer than the maximum sequence length. + # To deal with this we do a sliding window approach, where we take chunks + # of the up to our max length with a stride of `doc_stride`. + _DocSpan = collections.namedtuple( # pylint: disable=invalid-name + "DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in query_tokens: + tokens.append(token) + segment_ids.append(0) + tokens.append("[SEP]") + segment_ids.append(0) + + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[len(tokens)] = tok_to_orig_index[ + split_token_index] + + is_max_context = _check_is_max_context( + doc_spans, doc_span_index, split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. 
+ #while len(input_ids) < max_seq_length: + # input_ids.append(0) + # input_mask.append(0) + # segment_ids.append(0) + + #assert len(input_ids) == max_seq_length + #assert len(input_mask) == max_seq_length + #assert len(segment_ids) == max_seq_length + + start_position = None + end_position = None + if is_training and not example.is_impossible: + # For training, if our document chunk does not contain an annotation + # we throw it out, since there is nothing to predict. + doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start and + tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + continue + else: + doc_offset = len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + """ + if is_training and example.is_impossible: + start_position = 0 + end_position = 0 + """ + if example_index < 3: + print("*** Example ***") + print("unique_id: %s" % (unique_id)) + print("example_index: %s" % (example_index)) + print("doc_span_index: %s" % (doc_span_index)) + print("tokens: %s" % " ".join( + [tokenization.printable_text(x) for x in tokens])) + print("token_to_orig_map: %s" % " ".join([ + "%d:%d" % (x, y) + for (x, y) in six.iteritems(token_to_orig_map) + ])) + print("token_is_max_context: %s" % " ".join([ + "%d:%s" % (x, y) + for (x, y) in six.iteritems(token_is_max_context) + ])) + print("input_ids: %s" % " ".join([str(x) for x in input_ids])) + print("input_mask: %s" % " ".join([str(x) for x in input_mask])) + print("segment_ids: %s" % + " ".join([str(x) for x in segment_ids])) + if is_training and example.is_impossible: + print("impossible example") + if is_training and not example.is_impossible: + answer_text = " ".join(tokens[start_position:(end_position + + 1)]) + print("start_position: %d" % (start_position)) + print("end_position: %d" % (end_position)) + print("answer: %s" % + (tokenization.printable_text(answer_text))) + + feature = InputFeatures( + unique_id=unique_id, + example_index=example_index, + doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + start_position=start_position, + end_position=end_position, + is_impossible=example.is_impossible) + + unique_id += 1 + + yield feature + + +def estimate_runtime_examples(data_path, sample_rate, tokenizer, \ + max_seq_length, doc_stride, max_query_length, \ + remove_impossible_questions=True, filter_invalid_spans=True): + """Count runtime examples which may differ from number of raw samples due to sliding window operation and etc.. 
This is useful to get correct warmup steps for training.""" + + assert sample_rate > 0.0 and sample_rate <= 1.0, "sample_rate must be set between 0.0~1.0" + + print("loading data with json parser...") + with open(data_path, "r") as reader: + data = json.load(reader)["data"] + + num_raw_examples = 0 + for entry in data: + for paragraph in entry["paragraphs"]: + paragraph_text = paragraph["context"] + for qa in paragraph["qas"]: + num_raw_examples += 1 + print("num raw examples:{}".format(num_raw_examples)) + + def is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: + return True + return False + + sampled_examples = [] + for entry in data: + for paragraph in entry["paragraphs"]: + doc_tokens = None + for qa in paragraph["qas"]: + if random.random() > sample_rate and sample_rate < 1.0: + continue + + if doc_tokens is None: + paragraph_text = paragraph["context"] + doc_tokens = [] + char_to_word_offset = [] + prev_is_whitespace = True + for c in paragraph_text: + if is_whitespace(c): + prev_is_whitespace = True + else: + if prev_is_whitespace: + doc_tokens.append(c) + else: + doc_tokens[-1] += c + prev_is_whitespace = False + char_to_word_offset.append(len(doc_tokens) - 1) + + assert len(qa["answers"]) == 1, "For training, each question should have exactly 1 answer." + + qas_id = qa["id"] + question_text = qa["question"] + start_position = None + end_position = None + orig_answer_text = None + is_impossible = False + + if ('is_impossible' in qa) and (qa["is_impossible"]): + if remove_impossible_questions or filter_invalid_spans: + continue + else: + start_position = -1 + end_position = -1 + orig_answer_text = "" + is_impossible = True + else: + answer = qa["answers"][0] + orig_answer_text = answer["text"] + answer_offset = answer["answer_start"] + answer_length = len(orig_answer_text) + start_position = char_to_word_offset[answer_offset] + end_position = char_to_word_offset[answer_offset + + answer_length - 1] + + # remove corrupt samples + actual_text = " ".join(doc_tokens[start_position:( + end_position + 1)]) + cleaned_answer_text = " ".join( + tokenization.whitespace_tokenize(orig_answer_text)) + if actual_text.find(cleaned_answer_text) == -1: + print("Could not find answer: '%s' vs. 
'%s'", + actual_text, cleaned_answer_text) + continue + + example = MRQAExample( + qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_position, + end_position=end_position, + is_impossible=is_impossible) + sampled_examples.append(example) + + + runtime_sample_rate = len(sampled_examples) / float(num_raw_examples) + # print("DEBUG-> runtime sampled examples: {}, sample rate: {}.".format(len(sampled_examples), runtime_sample_rate)) + + runtime_samp_cnt = 0 + + for example in sampled_examples: + query_tokens = tokenizer.tokenize(example.question_text) + + if len(query_tokens) > max_query_length: + query_tokens = query_tokens[0:max_query_length] + + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = _improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, tokenizer, + example.orig_answer_text) + + # The -3 accounts for [CLS], [SEP] and [SEP] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + + _DocSpan = collections.namedtuple( # pylint: disable=invalid-name + "DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + if filter_invalid_spans and not (tok_start_position >= doc_start and tok_end_position <= doc_end): + continue + runtime_samp_cnt += 1 + return int(runtime_samp_cnt/runtime_sample_rate) + + +def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, + orig_answer_text): + """Returns tokenized answer spans that better match the annotated answer.""" + + # The MRQA annotations are character based. We first project them to + # whitespace-tokenized words. But then after WordPiece tokenization, we can + # often find a "better match". For example: + # + # Question: What year was John Smith born? + # Context: The leader was John Smith (1895-1943). + # Answer: 1895 + # + # The original whitespace-tokenized answer will be "(1895-1943).". However + # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match + # the exact answer, 1895. + # + # However, this is not always possible. Consider the following: + # + # Question: What country is the top exporter of electornics? + # Context: The Japanese electronics industry is the lagest in the world. + # Answer: Japan + # + # In this case, the annotator chose "Japan" as a character sub-span of + # the word "Japanese". Since our WordPiece tokenizer does not split + # "Japanese", we just use "Japanese" as the annotation. 
This is fairly rare + # in MRQA, but does happen. + tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start:(new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + +def _check_is_max_context(doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + # Because of the sliding window approach taken to scoring documents, a single + # token can appear in multiple documents. E.g. + # Doc: the man went to the store and bought a gallon of milk + # Span A: the man went to the + # Span B: to the store and bought + # Span C: and bought a gallon of + # ... + # + # Now the word 'bought' will have two scores from spans B and C. We only + # want to consider the score with "maximum context", which we define as + # the *minimum* of its left and right context (the *sum* of left and + # right context will always be the same, of course). + # + # In the example the maximum context for 'bought' would be span C since + # it has 1 left context and 3 right context, while span B has 4 left context + # and 0 right context. + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + continue + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, + num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + + return cur_span_index == best_span_index + + +class DataProcessor(object): + def __init__(self, vocab_path, do_lower_case, max_seq_length, in_tokens, + doc_stride, max_query_length): + self._tokenizer = tokenization.FullTokenizer( + vocab_file=vocab_path, do_lower_case=do_lower_case) + self._max_seq_length = max_seq_length + self._doc_stride = doc_stride + self._max_query_length = max_query_length + self._in_tokens = in_tokens + + self.vocab = self._tokenizer.vocab + self.vocab_size = len(self.vocab) + self.pad_id = self.vocab["[PAD]"] + self.cls_id = self.vocab["[CLS]"] + self.sep_id = self.vocab["[SEP]"] + self.mask_id = self.vocab["[MASK]"] + + self.current_train_example = -1 + self.num_train_examples = -1 + self.current_train_epoch = -1 + + self.train_examples = None + self.predict_examples = None + self.num_examples = {'train': -1, 'predict': -1} + + def get_train_progress(self): + """Gets progress for training phase.""" + return self.current_train_example, self.current_train_epoch + + def get_examples(self, + data_path, + is_training, + with_negative=False): + examples = read_mrqa_examples( + input_file=data_path, + is_training=is_training, + with_negative=with_negative) + return examples + + def get_num_examples(self, phase): + if phase not in ['train', 'predict']: + raise ValueError( + "Unknown phase, which should be in ['train', 'predict'].") + return self.num_examples[phase] + + def estimate_runtime_examples(self, data_path, sample_rate=0.01, \ + remove_impossible_questions=True, filter_invalid_spans=True): + """Noted that this API Only support for Training phase.""" + return estimate_runtime_examples(data_path, sample_rate, self._tokenizer, \ + self._max_seq_length, self._doc_stride, self._max_query_length, \ + 
remove_impossible_questions=True, filter_invalid_spans=True) + + def get_features(self, examples, is_training): + features = convert_examples_to_features( + examples=examples, + tokenizer=self._tokenizer, + max_seq_length=self._max_seq_length, + doc_stride=self._doc_stride, + max_query_length=self._max_query_length, + is_training=is_training) + return features + + def data_generator(self, + data_path, + batch_size, + max_len=None, + phase='train', + shuffle=False, + dev_count=1, + with_negative=False, + epoch=1): + if phase == 'train': + self.train_examples = self.get_examples( + data_path, + is_training=True, + with_negative=with_negative) + examples = self.train_examples + self.num_examples['train'] = len(self.train_examples) + elif phase == 'predict': + self.predict_examples = self.get_examples( + data_path, + is_training=False, + with_negative=with_negative) + examples = self.predict_examples + self.num_examples['predict'] = len(self.predict_examples) + else: + raise ValueError( + "Unknown phase, which should be in ['train', 'predict'].") + + def batch_reader(features, batch_size, in_tokens): + batch, total_token_num, max_len = [], 0, 0 + for (index, feature) in enumerate(features): + if phase == 'train': + self.current_train_example = index + 1 + seq_len = len(feature.input_ids) + labels = [feature.unique_id + ] if feature.start_position is None else [ + feature.start_position, feature.end_position + ] + example = [ + feature.input_ids, feature.segment_ids, range(seq_len) + ] + labels + max_len = max(max_len, seq_len) + + #max_len = max(max_len, len(token_ids)) + if in_tokens: + to_append = (len(batch) + 1) * max_len <= batch_size + else: + to_append = len(batch) < batch_size + + if to_append: + batch.append(example) + total_token_num += seq_len + else: + yield batch, total_token_num + batch, total_token_num, max_len = [example + ], seq_len, seq_len + if len(batch) > 0: + yield batch, total_token_num + + def wrapper(): + for epoch_index in range(epoch): + if shuffle: + random.shuffle(examples) + if phase == 'train': + self.current_train_epoch = epoch_index + features = self.get_features(examples, is_training=True) + else: + features = self.get_features(examples, is_training=False) + + all_dev_batches = [] + for batch_data, total_token_num in batch_reader( + features, batch_size, self._in_tokens): + batch_data = prepare_batch_data( + batch_data, + total_token_num, + max_len=max_len, + voc_size=-1, + pad_id=self.pad_id, + cls_id=self.cls_id, + sep_id=self.sep_id, + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False) + if len(all_dev_batches) < dev_count: + all_dev_batches.append(batch_data) + + if len(all_dev_batches) == dev_count: + for batch in all_dev_batches: + yield batch + all_dev_batches = [] + + if phase == 'predict' and len(all_dev_batches) > 0: + fake_batch = all_dev_batches[-1] + fake_batch = fake_batch[:-1] + [np.array([-1]*len(fake_batch[0]))] + all_dev_batches = all_dev_batches + [fake_batch] * (dev_count - len(all_dev_batches)) + for batch in all_dev_batches: + yield batch + + return wrapper + + +def write_predictions(all_examples, all_features, all_results, n_best_size, + max_answer_length, do_lower_case, output_prediction_file, + output_nbest_file, output_null_log_odds_file, + with_negative, null_score_diff_threshold, + verbose): + """Write final predictions to the json file and log-odds of null if needed.""" + print("Writing predictions to: %s" % (output_prediction_file)) + print("Writing nbest to: %s" % (output_nbest_file)) + + 
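When `--in_tokens` is set, `batch_reader` above treats `batch_size` as a token budget rather than a sequence count: a feature joins the current batch only while (number of sequences + 1) times the longest sequence length stays within the budget. A toy sketch of that rule, using made-up sequence lengths:

```
def batch_by_tokens(seq_lens, token_budget):
    """Sketch of the in_tokens batching rule used in batch_reader."""
    batches, batch, max_len = [], [], 0
    for n in seq_lens:
        new_max = max(max_len, n)
        if batch and (len(batch) + 1) * new_max > token_budget:
            # The batch is padded to its longest member, so adding this sequence
            # would overflow the budget; flush and start a new batch.
            batches.append(batch)
            batch, max_len = [n], n
        else:
            batch.append(n)
            max_len = new_max
    if batch:
        batches.append(batch)
    return batches

print(batch_by_tokens([120, 130, 90, 400, 60, 64], 512))
# -> [[120, 130, 90], [400], [60, 64]]
```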
example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", [ + "feature_index", "start_index", "end_index", "start_logit", + "end_logit" + ]) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + scores_diff_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = [] + # keep track of the minimum score of null start+end of position 0 + score_null = 1000000 # large and positive + min_null_feature_index = 0 # the paragraph slice with min mull score + null_start_logit = 0 # the start logit at the slice with min null score + null_end_logit = 0 # the end logit at the slice with min null score + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.unique_id] + start_indexes = _get_best_indexes(result.start_logits, n_best_size) + end_indexes = _get_best_indexes(result.end_logits, n_best_size) + # if we could have irrelevant answers, get the min score of irrelevant + if with_negative: + feature_null_score = result.start_logits[0] + result.end_logits[ + 0] + if feature_null_score < score_null: + score_null = feature_null_score + min_null_feature_index = feature_index + null_start_logit = result.start_logits[0] + null_end_logit = result.end_logits[0] + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. 
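Concretely, the loop that follows pairs the top start logits with the top end logits and keeps a span only if it lies in the passage portion of the feature, runs forwards, and does not exceed `max_answer_length` (the real code additionally consults `token_to_orig_map` and `token_is_max_context`). A simplified, self-contained sketch with made-up logits:

```
def top_indexes(logits, n_best_size):
    # Indexes of the n_best_size largest logits, as in _get_best_indexes defined later.
    ranked = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)
    return [i for i, _ in ranked[:n_best_size]]

def candidate_spans(start_logits, end_logits, passage_range,
                    n_best_size, max_answer_length):
    """Sketch of the span enumeration and validity filtering."""
    lo, hi = passage_range  # inclusive token positions covered by the passage
    spans = []
    for s in top_indexes(start_logits, n_best_size):
        for e in top_indexes(end_logits, n_best_size):
            if s < lo or e > hi:            # span must lie inside the passage
                continue
            if e < s or e - s + 1 > max_answer_length:
                continue
            spans.append((s, e, start_logits[s] + end_logits[e]))
    # Rank candidates by start_logit + end_logit, best first.
    return sorted(spans, key=lambda x: x[2], reverse=True)
```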
+ if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get(start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=result.start_logits[start_index], + end_logit=result.end_logits[end_index])) + + if with_negative: + prelim_predictions.append( + _PrelimPrediction( + feature_index=min_null_feature_index, + start_index=0, + end_index=0, + start_logit=null_start_logit, + end_logit=null_end_logit)) + prelim_predictions = sorted( + prelim_predictions, + key=lambda x: (x.start_logit + x.end_logit), + reverse=True) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_logit", "end_logit"]) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1 + )] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end + + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. + tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = " ".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, do_lower_case, + verbose) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append( + _NbestPrediction( + text=final_text, + start_logit=pred.start_logit, + end_logit=pred.end_logit)) + + # if we didn't inlude the empty option in the n-best, inlcude it + if with_negative: + if "" not in seen_predictions: + nbest.append( + _NbestPrediction( + text="", + start_logit=null_start_logit, + end_logit=null_end_logit)) + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. 
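For a non-null candidate, the code above rebuilds the answer string by joining the predicted sub-tokens and stripping the `##` continuation markers before handing the text to `get_final_text`. A minimal sketch of that de-tokenization step:

```
def detokenize_wordpieces(tokens):
    """Sketch of the '##' stripping applied to a predicted span."""
    text = " ".join(tokens)
    text = text.replace(" ##", "")   # glue continuation pieces to the previous piece
    text = text.replace("##", "")    # handle a continuation piece in first position
    return " ".join(text.strip().split())

print(detokenize_wordpieces(["john", "smith", "(", "18", "##95", "-", "19", "##43", ")"]))
# -> "john smith ( 1895 - 1943 )"
```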
+ if not nbest: + nbest.append( + _NbestPrediction( + text="empty", start_logit=0.0, end_logit=0.0)) + + assert len(nbest) >= 1 + + total_scores = [] + best_non_null_entry = None + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + if not best_non_null_entry: + if entry.text: + best_non_null_entry = entry + # debug + if best_non_null_entry is None: + print("Emmm..., sth wrong") + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + if not with_negative: + all_predictions[example.qas_id] = nbest_json[0]["text"] + else: + # predict "" iff the null score - the score of best non-null > threshold + score_diff = score_null - best_non_null_entry.start_logit - ( + best_non_null_entry.end_logit) + scores_diff_json[example.qas_id] = score_diff + if score_diff > null_score_diff_threshold: + all_predictions[example.qas_id] = "" + else: + all_predictions[example.qas_id] = best_non_null_entry.text + + all_nbest_json[example.qas_id] = nbest_json + + with open(output_prediction_file, "w") as writer: + writer.write(json.dumps(all_predictions, indent=4) + "\n") + + with open(output_nbest_file, "w") as writer: + writer.write(json.dumps(all_nbest_json, indent=4) + "\n") + + if with_negative: + with open(output_null_log_odds_file, "w") as writer: + writer.write(json.dumps(scores_diff_json, indent=4) + "\n") + + +def get_final_text(pred_text, orig_text, do_lower_case, verbose): + """Project the tokenized prediction back to the original text.""" + + # When we created the data, we kept track of the alignment between original + # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So + # now `orig_text` contains the span of our original text corresponding to the + # span that we predicted. + # + # However, `orig_text` may contain extra characters that we don't want in + # our prediction. + # + # For example, let's say: + # pred_text = steve smith + # orig_text = Steve Smith's + # + # We don't want to return `orig_text` because it contains the extra "'s". + # + # We don't want to return `pred_text` because it's already been normalized + # (the MRQA eval script also does punctuation stripping/lower casing but + # our tokenizer does additional normalization like stripping accent + # characters). + # + # What we really want to return is "Steve Smith". + # + # Therefore, we have to apply a semi-complicated alignment heruistic between + # `pred_text` and `orig_text` to get a character-to-charcter alignment. This + # can fail in certain cases in which case we just return `orig_text`. + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return (ns_text, ns_to_s_map) + + # We first tokenize `orig_text`, strip whitespace from the result + # and `pred_text`, and check if they are the same length. If they are + # NOT the same length, the heuristic has failed. If they are the same + # length, we assume the characters are one-to-one aligned. 
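`get_final_text` below builds a character-level alignment between the whitespace-stripped tokenized text and the whitespace-stripped original text, then projects the predicted span through it. A compact sketch of that projection; the tokenized string is supplied directly here, whereas the real code obtains it from `BasicTokenizer`:

```
import collections

def strip_spaces(text):
    # De-spaced string plus a map from de-spaced positions back to original positions.
    ns_chars, ns_to_s = [], collections.OrderedDict()
    for i, c in enumerate(text):
        if c == " ":
            continue
        ns_to_s[len(ns_chars)] = i
        ns_chars.append(c)
    return "".join(ns_chars), ns_to_s

def project_span(pred_text, orig_text, tok_text):
    """Sketch of the alignment heuristic in get_final_text."""
    start = tok_text.find(pred_text)
    if start == -1:
        return orig_text
    end = start + len(pred_text) - 1
    orig_ns, orig_map = strip_spaces(orig_text)
    tok_ns, tok_map = strip_spaces(tok_text)
    if len(orig_ns) != len(tok_ns):
        return orig_text              # heuristic failed; fall back to the raw span
    tok_s_to_ns = {v: k for k, v in tok_map.items()}
    ns_start, ns_end = tok_s_to_ns[start], tok_s_to_ns[end]
    return orig_text[orig_map[ns_start]:orig_map[ns_end] + 1]

# BasicTokenizer turns "Steve Smith's" into "steve smith ' s":
print(project_span("steve smith", "Steve Smith's", "steve smith ' s"))
# -> "Steve Smith"
```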
+ tokenizer = tokenization.BasicTokenizer(do_lower_case=do_lower_case) + + tok_text = " ".join(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + if verbose: + print("Unable to find text: '%s' in '%s'" % (pred_text, orig_text)) + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + if verbose: + print("Length not equal after stripping spaces: '%s' vs '%s'", + orig_ns_text, tok_ns_text) + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. + tok_s_to_ns_map = {} + for (i, tok_index) in six.iteritems(tok_ns_to_s_map): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + if verbose: + print("Couldn't map start position") + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + if verbose: + print("Couldn't map end position") + return orig_text + + output_text = orig_text[orig_start_position:(orig_end_position + 1)] + return output_text + + +def _get_best_indexes(logits, n_best_size): + """Get the n-best logits from a list.""" + index_and_score = sorted( + enumerate(logits), key=lambda x: x[1], reverse=True) + + best_indexes = [] + for i in range(len(index_and_score)): + if i >= n_best_size: + break + best_indexes.append(index_and_score[i][0]) + return best_indexes + + +def _compute_softmax(scores): + """Compute softmax probability over raw logits.""" + if not scores: + return [] + + max_score = None + for score in scores: + if max_score is None or score > max_score: + max_score = score + + exp_scores = [] + total_sum = 0.0 + for score in scores: + x = math.exp(score - max_score) + exp_scores.append(x) + total_sum += x + + probs = [] + for score in exp_scores: + probs.append(score / total_sum) + return probs + + +if __name__ == '__main__': + train_file = 'data/mrqa-combined.all_dev.raw.json' + vocab_file = 'uncased_L-12_H-768_A-12/vocab.txt' + do_lower_case = True + tokenizer = tokenization.FullTokenizer( + vocab_file=vocab_file, do_lower_case=do_lower_case) + train_examples = read_mrqa_examples( + input_file=train_file, is_training=True) + print("begin converting") + for (index, feature) in enumerate( + convert_examples_to_features( + examples=train_examples, + tokenizer=tokenizer, + max_seq_length=384, + doc_stride=128, + max_query_length=64, + is_training=True, + #output_fn=train_writer.process_feature + )): + if index < 10: + print(index, feature.input_ids, feature.input_mask, + feature.segment_ids) diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/run_mrqa.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/run_mrqa.py new file mode 100644 index 0000000000000000000000000000000000000000..4ce4e2378474d6c96de06fc4568cabc0fffe2f9f --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/run_mrqa.py @@ -0,0 +1,455 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Finetuning on MRQA.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import collections +import multiprocessing +import os +import time +import numpy as np +import paddle +import paddle.fluid as fluid + +from reader.mrqa import DataProcessor, write_predictions +from model.bert import BertConfig, BertModel +from utils.args import ArgumentGroup, print_arguments +from optimization import optimization +from utils.init import init_pretraining_params, init_checkpoint + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +model_g = ArgumentGroup(parser, "model", "model configuration and paths.") +model_g.add_arg("bert_config_path", str, None, "Path to the json file for bert model config.") +model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.") +model_g.add_arg("init_pretraining_params", str, None, + "Init pre-training params which preforms fine-tuning from. If the " + "arg 'init_checkpoint' has been set, this argument wouldn't be valid.") +model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.") + +train_g = ArgumentGroup(parser, "training", "training options.") +train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.") +train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.") +train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", + "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay']) +train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.") +train_g.add_arg("use_ema", bool, True, "Whether to use ema.") +train_g.add_arg("ema_decay", float, 0.9999, "Decay rate for expoential moving average.") +train_g.add_arg("warmup_proportion", float, 0.1, + "Proportion of training steps to perform linear learning rate warmup for.") +train_g.add_arg("save_steps", int, 1000, "The steps interval to save checkpoints.") +train_g.add_arg("sample_rate", float, 0.02, "train samples num.") +train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.") +train_g.add_arg("loss_scaling", float, 1.0, + "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.") + +log_g = ArgumentGroup(parser, "logging", "logging related.") +log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.") +log_g.add_arg("verbose", bool, False, "Whether to output verbose log.") + +data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options") +data_g.add_arg("train_file", str, None, "json data for training.") +data_g.add_arg("predict_file", str, None, "json data for predictions.") +data_g.add_arg("vocab_path", str, None, "Vocabulary path.") +data_g.add_arg("with_negative", bool, False, + "If true, the examples contain some that do not have an answer.") +data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.") 
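The training entry point below turns these flags into a step budget: the number of runtime examples, `--epoch`, `--batch_size` and the device count give `max_train_steps`, and `--warmup_proportion` of that becomes the warmup phase. A small sketch of that arithmetic (the example figures are illustrative only):

```
def train_step_budget(num_train_examples, epoch, batch_size, dev_count,
                      warmup_proportion, in_tokens=False, max_seq_len=512):
    """Sketch of how max_train_steps and warmup_steps are derived below."""
    if in_tokens:
        # batch_size is a token budget, so roughly batch_size // max_seq_len
        # sequences fit into one batch per device.
        steps = epoch * num_train_examples // (batch_size // max_seq_len) // dev_count
    else:
        steps = epoch * num_train_examples // batch_size // dev_count
    return steps, int(steps * warmup_proportion)

# E.g. 180000 runtime examples, 2 epochs, batch size 4 per card on 8 cards:
print(train_step_budget(180000, 2, 4, 8, 0.1))
# -> (11250, 1125)
```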
+data_g.add_arg("max_query_length", int, 64, "Max query length.") +data_g.add_arg("max_answer_length", int, 30, "Max answer length.") +data_g.add_arg("batch_size", int, 12, "Total examples' number in batch for training. see also --in_tokens.") +data_g.add_arg("in_tokens", bool, False, + "If set, the batch size will be the maximum number of tokens in one batch. " + "Otherwise, it will be the maximum number of examples in one batch.") +data_g.add_arg("do_lower_case", bool, True, + "Whether to lower case the input text. Should be True for uncased models and False for cased models.") +data_g.add_arg("doc_stride", int, 128, + "When splitting up a long document into chunks, how much stride to take between chunks.") +data_g.add_arg("n_best_size", int, 20, + "The total number of n-best predictions to generate in the nbest_predictions.json output file.") +data_g.add_arg("null_score_diff_threshold", float, 0.0, + "If null_score - best_non_null is greater than the threshold predict null.") +data_g.add_arg("random_seed", int, 0, "Random seed.") + +run_type_g = ArgumentGroup(parser, "run_type", "running type options.") +run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.") +run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).") +run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "Ihe iteration intervals to clean up temporary variables.") +run_type_g.add_arg("do_train", bool, True, "Whether to perform training.") +run_type_g.add_arg("do_predict", bool, True, "Whether to perform prediction.") + +args = parser.parse_args() +# yapf: enable. + +def create_model(pyreader_name, bert_config, is_training=False): + if is_training: + pyreader = fluid.layers.py_reader( + capacity=50, + shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], [-1, 1], [-1, 1]], + dtypes=[ + 'int64', 'int64', 'int64', 'float32', 'int64', 'int64'], + lod_levels=[0, 0, 0, 0, 0, 0], + name=pyreader_name, + use_double_buffer=True) + (src_ids, pos_ids, sent_ids, input_mask, start_positions, + end_positions) = fluid.layers.read_file(pyreader) + else: + pyreader = fluid.layers.py_reader( + capacity=50, + shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], [-1, 1]], + dtypes=['int64', 'int64', 'int64', 'float32', 'int64'], + lod_levels=[0, 0, 0, 0, 0], + name=pyreader_name, + use_double_buffer=True) + (src_ids, pos_ids, sent_ids, input_mask, unique_id) = fluid.layers.read_file(pyreader) + + bert = BertModel( + src_ids=src_ids, + position_ids=pos_ids, + sentence_ids=sent_ids, + input_mask=input_mask, + config=bert_config, + use_fp16=args.use_fp16) + + enc_out = bert.get_sequence_output() + + logits = fluid.layers.fc( + input=enc_out, + size=2, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name="cls_squad_out_w", + initializer=fluid.initializer.TruncatedNormal(scale=0.02)), + bias_attr=fluid.ParamAttr( + name="cls_squad_out_b", initializer=fluid.initializer.Constant(0.))) + + logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1]) + start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0) + + batch_ones = fluid.layers.fill_constant_batch_size_like( + input=start_logits, dtype='int64', shape=[1], value=1) + num_seqs = fluid.layers.reduce_sum(input=batch_ones) + + if is_training: + + def compute_loss(logits, positions): + loss = fluid.layers.softmax_with_cross_entropy( + logits=logits, label=positions) + loss = 
fluid.layers.mean(x=loss) + return loss + + start_loss = compute_loss(start_logits, start_positions) + end_loss = compute_loss(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2.0 + if args.use_fp16 and args.loss_scaling > 1.0: + total_loss = total_loss * args.loss_scaling + + return pyreader, total_loss, num_seqs + else: + return pyreader, unique_id, start_logits, end_logits, num_seqs + + +RawResult = collections.namedtuple("RawResult", + ["unique_id", "start_logits", "end_logits"]) + + +def predict(test_exe, test_program, test_pyreader, fetch_list, processor, prefix=''): + if not os.path.exists(args.checkpoints): + os.makedirs(args.checkpoints) + output_prediction_file = os.path.join(args.checkpoints, prefix + "predictions.json") + output_nbest_file = os.path.join(args.checkpoints, prefix + "nbest_predictions.json") + output_null_log_odds_file = os.path.join(args.checkpoints, prefix + "null_odds.json") + + test_pyreader.start() + all_results = [] + time_begin = time.time() + while True: + try: + np_unique_ids, np_start_logits, np_end_logits, np_num_seqs = test_exe.run( + fetch_list=fetch_list, program=test_program) + for idx in range(np_unique_ids.shape[0]): + if np_unique_ids[idx] < 0: + continue + if len(all_results) % 1000 == 0: + print("Processing example: %d" % len(all_results)) + unique_id = int(np_unique_ids[idx]) + start_logits = [float(x) for x in np_start_logits[idx].flat] + end_logits = [float(x) for x in np_end_logits[idx].flat] + all_results.append( + RawResult( + unique_id=unique_id, + start_logits=start_logits, + end_logits=end_logits)) + except fluid.core.EOFException: + test_pyreader.reset() + break + time_end = time.time() + + features = processor.get_features( + processor.predict_examples, is_training=False) + write_predictions(processor.predict_examples, features, all_results, + args.n_best_size, args.max_answer_length, + args.do_lower_case, output_prediction_file, + output_nbest_file, output_null_log_odds_file, + args.with_negative, + args.null_score_diff_threshold, args.verbose) + + +def train(args): + bert_config = BertConfig(args.bert_config_path) + bert_config.print_config() + + if not (args.do_train or args.do_predict): + raise ValueError("For args `do_train` and `do_predict`, at " + "least one of them must be True.") + + if args.use_cuda: + place = fluid.CUDAPlace(0) + dev_count = fluid.core.get_cuda_device_count() + else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) + exe = fluid.Executor(place) + + processor = DataProcessor( + vocab_path=args.vocab_path, + do_lower_case=args.do_lower_case, + max_seq_length=args.max_seq_len, + in_tokens=args.in_tokens, + doc_stride=args.doc_stride, + max_query_length=args.max_query_length) + + startup_prog = fluid.Program() + if args.random_seed is not None: + startup_prog.random_seed = args.random_seed + + if args.do_train: + build_strategy = fluid.BuildStrategy() + print("estimating runtime number of examples...") + num_train_examples = processor.estimate_runtime_examples(args.train_file, sample_rate=args.sample_rate) + print("runtime number of examples:") + print(num_train_examples) + + train_data_generator = processor.data_generator( + data_path=args.train_file, + batch_size=args.batch_size, + max_len=args.max_seq_len, + phase='train', + shuffle=True, + dev_count=dev_count, + with_negative=args.with_negative, + epoch=args.epoch) + + if args.in_tokens: + max_train_steps = args.epoch * num_train_examples // ( + args.batch_size // args.max_seq_len) // 
dev_count + else: + max_train_steps = args.epoch * num_train_examples // ( + args.batch_size) // dev_count + warmup_steps = int(max_train_steps * args.warmup_proportion) + print("Device count: %d" % dev_count) + print("Num train examples: %d" % num_train_examples) + print("Max train steps: %d" % max_train_steps) + print("Num warmup steps: %d" % warmup_steps) + + train_program = fluid.Program() + with fluid.program_guard(train_program, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, loss, num_seqs = create_model( + pyreader_name='train_reader', + bert_config=bert_config, + is_training=True) + + train_pyreader.decorate_tensor_provider(train_data_generator) + + scheduled_lr = optimization( + loss=loss, + warmup_steps=warmup_steps, + num_train_steps=max_train_steps, + learning_rate=args.learning_rate, + train_program=train_program, + startup_prog=startup_prog, + weight_decay=args.weight_decay, + scheduler=args.lr_scheduler, + use_fp16=args.use_fp16, + loss_scaling=args.loss_scaling) + + loss.persistable = True + num_seqs.persistable = True + + ema = fluid.optimizer.ExponentialMovingAverage(args.ema_decay) + ema.update() + + train_compiled_program = fluid.CompiledProgram(train_program).with_data_parallel( + loss_name=loss.name, build_strategy=build_strategy) + + if args.verbose: + if args.in_tokens: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, + batch_size=args.batch_size // args.max_seq_len) + else: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size) + print("Theoretical memory usage in training: %.3f - %.3f %s" % + (lower_mem, upper_mem, unit)) + + if args.do_predict: + build_strategy = fluid.BuildStrategy() + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, unique_ids, start_logits, end_logits, num_seqs = create_model( + pyreader_name='test_reader', + bert_config=bert_config, + is_training=False) + + if 'ema' not in dir(): + ema = fluid.optimizer.ExponentialMovingAverage(args.ema_decay) + + unique_ids.persistable = True + start_logits.persistable = True + end_logits.persistable = True + num_seqs.persistable = True + + test_prog = test_prog.clone(for_test=True) + test_compiled_program = fluid.CompiledProgram(test_prog).with_data_parallel( + build_strategy=build_strategy) + + exe.run(startup_prog) + + if args.do_train: + if args.init_checkpoint and args.init_pretraining_params: + print( + "WARNING: args 'init_checkpoint' and 'init_pretraining_params' " + "both are set! 
Only arg 'init_checkpoint' is made valid.") + if args.init_checkpoint: + init_checkpoint( + exe, + args.init_checkpoint, + main_program=startup_prog, + use_fp16=args.use_fp16) + elif args.init_pretraining_params: + init_pretraining_params( + exe, + args.init_pretraining_params, + main_program=startup_prog, + use_fp16=args.use_fp16) + elif args.do_predict: + if not args.init_checkpoint: + raise ValueError("args 'init_checkpoint' should be set if" + "only doing prediction!") + init_checkpoint( + exe, + args.init_checkpoint, + main_program=startup_prog, + use_fp16=args.use_fp16) + + if args.do_train: + train_pyreader.start() + + steps = 0 + total_cost, total_num_seqs = [], [] + time_begin = time.time() + while True: + try: + steps += 1 + if steps % args.skip_steps == 0: + if warmup_steps <= 0: + fetch_list = [loss.name, num_seqs.name] + else: + fetch_list = [ + loss.name, scheduled_lr.name, num_seqs.name + ] + else: + fetch_list = [] + + outputs = exe.run(train_compiled_program, fetch_list=fetch_list) + + if steps % args.skip_steps == 0: + if warmup_steps <= 0: + np_loss, np_num_seqs = outputs + else: + np_loss, np_lr, np_num_seqs = outputs + total_cost.extend(np_loss * np_num_seqs) + total_num_seqs.extend(np_num_seqs) + + if args.verbose: + verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size( + ) + verbose += "learning rate: %f" % ( + np_lr[0] + if warmup_steps > 0 else args.learning_rate) + print(verbose) + + time_end = time.time() + used_time = time_end - time_begin + current_example, epoch = processor.get_train_progress() + + print("epoch: %d, progress: %d/%d, step: %d, loss: %f, " + "speed: %f steps/s" % + (epoch, current_example, num_train_examples, steps, + np.sum(total_cost) / np.sum(total_num_seqs), + args.skip_steps / used_time)) + total_cost, total_num_seqs = [], [] + time_begin = time.time() + + if steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, + "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, train_program) + if steps == max_train_steps: + save_path = os.path.join(args.checkpoints, + "step_" + str(steps) + "_final") + fluid.io.save_persistables(exe, save_path, train_program) + break + except Exception as err: + save_path = os.path.join(args.checkpoints, + "step_" + str(steps) + "_final") + fluid.io.save_persistables(exe, save_path, train_program) + train_pyreader.reset() + break + + if args.do_predict: + test_pyreader.decorate_tensor_provider( + processor.data_generator( + data_path=args.predict_file, + batch_size=args.batch_size, + max_len=args.max_seq_len, + phase='predict', + shuffle=False, + dev_count=dev_count, + epoch=1)) + + if args.use_ema: + with ema.apply(exe): + predict(exe, test_prog, test_pyreader, [ + unique_ids.name, start_logits.name, end_logits.name, num_seqs.name + ], processor, prefix='ema_') + else: + predict(exe, test_prog, test_pyreader, [ + unique_ids.name, start_logits.name, end_logits.name, num_seqs.name + ], processor) + + +if __name__ == '__main__': + print_arguments(args) + train(args) diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/tokenization.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/tokenization.py new file mode 100644 index 0000000000000000000000000000000000000000..59e035d92cb776f32be20360532b8c18c9b801e5 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/tokenization.py @@ -0,0 +1,374 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import collections +import unicodedata +import six + + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text.decode("utf-8", "ignore") + elif isinstance(text, unicode): + return text + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def printable_text(text): + """Returns text encoded in a way suitable for print or `tf.logging`.""" + + # These functions want `str` for both Python2 and Python3, but in one case + # it's a Unicode string and in the other it's a byte string. + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text + elif isinstance(text, unicode): + return text.encode("utf-8") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + fin = open(vocab_file) + for num, line in enumerate(fin): + items = convert_to_unicode(line.strip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + +def convert_by_vocab(vocab, items): + """Converts a sequence of [tokens|ids] using the vocab.""" + output = [] + for item in items: + output.append(vocab[item]) + return output + + +def convert_tokens_to_ids(vocab, tokens): + return convert_by_vocab(vocab, tokens) + + +def convert_ids_to_tokens(inv_vocab, ids): + return convert_by_vocab(inv_vocab, ids) + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a peice of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class FullTokenizer(object): + """Runs end-to-end tokenziation.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + 
split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class CharTokenizer(object): + """Runs end-to-end tokenziation.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in text.lower().split(" "): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, do_lower_case=True): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. + """ + self.do_lower_case = do_lower_case + self._never_lowercase = ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = convert_to_unicode(text) + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). 
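`FullTokenizer` above chains the two passes defined in this file: `BasicTokenizer` cleans the text, lower-cases it, strips accents and splits punctuation, and `WordpieceTokenizer` then breaks each word into sub-word pieces. A rough, simplified sketch of the basic pass (the real class also pads CJK characters with spaces and never lower-cases special tokens):

```
import unicodedata

def basic_tokenize(text, do_lower_case=True):
    """Rough sketch of BasicTokenizer: lower-case, strip accents, split punctuation."""
    tokens = []
    for token in text.strip().split():
        if do_lower_case:
            token = token.lower()
            # Strip accents by dropping combining marks after NFD normalisation.
            token = "".join(c for c in unicodedata.normalize("NFD", token)
                            if unicodedata.category(c) != "Mn")
        word = ""
        for c in token:
            if not c.isalnum():   # treat any non-alphanumeric character as punctuation
                if word:
                    tokens.append(word)
                tokens.append(c)
                word = ""
            else:
                word += c
        if word:
            tokens.append(word)
    return tokens

print(basic_tokenize("Beyoncé's début (1997)."))
# -> ['beyonce', "'", 's', 'debut', '(', '1997', ')', '.']
```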
+        text = self._tokenize_chinese_chars(text)
+
+        orig_tokens = whitespace_tokenize(text)
+        split_tokens = []
+        for token in orig_tokens:
+            if self.do_lower_case and token not in self._never_lowercase:
+                token = token.lower()
+                token = self._run_strip_accents(token)
+            if token in self._never_lowercase:
+                split_tokens.extend([token])
+            else:
+                split_tokens.extend(self._run_split_on_punc(token))
+
+        output_tokens = whitespace_tokenize(" ".join(split_tokens))
+        return output_tokens
+
+    def _run_strip_accents(self, text):
+        """Strips accents from a piece of text."""
+        text = unicodedata.normalize("NFD", text)
+        output = []
+        for char in text:
+            cat = unicodedata.category(char)
+            if cat == "Mn":
+                continue
+            output.append(char)
+        return "".join(output)
+
+    def _run_split_on_punc(self, text):
+        """Splits punctuation on a piece of text."""
+        chars = list(text)
+        i = 0
+        start_new_word = True
+        output = []
+        while i < len(chars):
+            char = chars[i]
+            if _is_punctuation(char):
+                output.append([char])
+                start_new_word = True
+            else:
+                if start_new_word:
+                    output.append([])
+                start_new_word = False
+                output[-1].append(char)
+            i += 1
+
+        return ["".join(x) for x in output]
+
+    def _tokenize_chinese_chars(self, text):
+        """Adds whitespace around any CJK character."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if self._is_chinese_char(cp):
+                output.append(" ")
+                output.append(char)
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+    def _is_chinese_char(self, cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and handled
+        # like all of the other languages.
+        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
+                (cp >= 0x3400 and cp <= 0x4DBF) or  #
+                (cp >= 0x20000 and cp <= 0x2A6DF) or  #
+                (cp >= 0x2A700 and cp <= 0x2B73F) or  #
+                (cp >= 0x2B740 and cp <= 0x2B81F) or  #
+                (cp >= 0x2B820 and cp <= 0x2CEAF) or
+                (cp >= 0xF900 and cp <= 0xFAFF) or  #
+                (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
+            return True
+
+        return False
+
+    def _clean_text(self, text):
+        """Performs invalid character removal and whitespace cleanup on text."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if cp == 0 or cp == 0xfffd or _is_control(char):
+                continue
+            if _is_whitespace(char):
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+
+class WordpieceTokenizer(object):
+    """Runs WordPiece tokenization."""
+
+    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+        self.vocab = vocab
+        self.unk_token = unk_token
+        self.max_input_chars_per_word = max_input_chars_per_word
+
+    def tokenize(self, text):
+        """Tokenizes a piece of text into its word pieces.
+
+        This uses a greedy longest-match-first algorithm to perform tokenization
+        using the given vocabulary.
+
+        For example:
+            input = "unaffable"
+            output = ["un", "##aff", "##able"]
+
+        Args:
+            text: A single token or whitespace separated tokens. This should have
+                already been passed through `BasicTokenizer`.
+
+        Returns:
+            A list of wordpiece tokens.
+        """
+
+        text = convert_to_unicode(text)
+
+        output_tokens = []
+        for token in whitespace_tokenize(text):
+            chars = list(token)
+            if len(chars) > self.max_input_chars_per_word:
+                output_tokens.append(self.unk_token)
+                continue
+
+            is_bad = False
+            start = 0
+            sub_tokens = []
+            while start < len(chars):
+                end = len(chars)
+                cur_substr = None
+                while start < end:
+                    substr = "".join(chars[start:end])
+                    if start > 0:
+                        substr = "##" + substr
+                    if substr in self.vocab:
+                        cur_substr = substr
+                        break
+                    end -= 1
+                if cur_substr is None:
+                    is_bad = True
+                    break
+                sub_tokens.append(cur_substr)
+                start = end
+
+            if is_bad:
+                output_tokens.append(self.unk_token)
+            else:
+                output_tokens.extend(sub_tokens)
+        return output_tokens
+
+
+def _is_whitespace(char):
+    """Checks whether `char` is a whitespace character."""
+    # \t, \n, and \r are technically control characters but we treat them
+    # as whitespace since they are generally considered as such.
+    if char == " " or char == "\t" or char == "\n" or char == "\r":
+        return True
+    cat = unicodedata.category(char)
+    if cat == "Zs":
+        return True
+    return False
+
+
+def _is_control(char):
+    """Checks whether `char` is a control character."""
+    # These are technically control characters but we count them as whitespace
+    # characters.
+    if char == "\t" or char == "\n" or char == "\r":
+        return False
+    cat = unicodedata.category(char)
+    if cat.startswith("C"):
+        return True
+    return False
+
+
+def _is_punctuation(char):
+    """Checks whether `char` is a punctuation character."""
+    cp = ord(char)
+    # We treat all non-letter/number ASCII as punctuation.
+    # Characters such as "^", "$", and "`" are not in the Unicode
+    # Punctuation class but we treat them as punctuation anyways, for
+    # consistency.
+    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
+            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
+        return True
+    cat = unicodedata.category(char)
+    if cat.startswith("P"):
+        return True
+    return False
diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/__init__.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/args.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/args.py
new file mode 100644
index 0000000000000000000000000000000000000000..b9be634f0f383db61eb667df2345a89262179fd8
--- /dev/null
+++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/args.py
@@ -0,0 +1,48 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Arguments for configuration.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import six +import argparse + + +def str2bool(v): + # because argparse does not support to parse "true, False" as python + # boolean directly + return v.lower() in ("true", "t", "1") + + +class ArgumentGroup(object): + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, **kwargs): + type = str2bool if type == bool else type + self._group.add_argument( + "--" + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +def print_arguments(args): + print('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/fp16.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/fp16.py new file mode 100644 index 0000000000000000000000000000000000000000..e153c2b9a1029897def264278c5dbe72e1f369f5 --- /dev/null +++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/fp16.py @@ -0,0 +1,97 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import paddle +import paddle.fluid as fluid + + +def cast_fp16_to_fp32(i, o, prog): + prog.global_block().append_op( + type="cast", + inputs={"X": i}, + outputs={"Out": o}, + attrs={ + "in_dtype": fluid.core.VarDesc.VarType.FP16, + "out_dtype": fluid.core.VarDesc.VarType.FP32 + }) + + +def cast_fp32_to_fp16(i, o, prog): + prog.global_block().append_op( + type="cast", + inputs={"X": i}, + outputs={"Out": o}, + attrs={ + "in_dtype": fluid.core.VarDesc.VarType.FP32, + "out_dtype": fluid.core.VarDesc.VarType.FP16 + }) + + +def copy_to_master_param(p, block): + v = block.vars.get(p.name, None) + if v is None: + raise ValueError("no param name %s found!" 
+    new_p = fluid.framework.Parameter(
+        block=block,
+        shape=v.shape,
+        dtype=fluid.core.VarDesc.VarType.FP32,
+        type=v.type,
+        lod_level=v.lod_level,
+        stop_gradient=p.stop_gradient,
+        trainable=p.trainable,
+        optimize_attr=p.optimize_attr,
+        regularizer=p.regularizer,
+        gradient_clip_attr=p.gradient_clip_attr,
+        error_clip=p.error_clip,
+        name=v.name + ".master")
+    return new_p
+
+
+def create_master_params_grads(params_grads, main_prog, startup_prog,
+                               loss_scaling):
+    master_params_grads = []
+    tmp_role = main_prog._current_role
+    OpRole = fluid.core.op_proto_and_checker_maker.OpRole
+    main_prog._current_role = OpRole.Backward
+    for p, g in params_grads:
+        # create master parameters
+        master_param = copy_to_master_param(p, main_prog.global_block())
+        startup_master_param = startup_prog.global_block()._clone_variable(
+            master_param)
+        startup_p = startup_prog.global_block().var(p.name)
+        cast_fp16_to_fp32(startup_p, startup_master_param, startup_prog)
+        # cast fp16 gradients to fp32 before applying them (layer_norm params stay fp32)
+        if g.name.find("layer_norm") > -1:
+            if loss_scaling > 1:
+                scaled_g = g / float(loss_scaling)
+            else:
+                scaled_g = g
+            master_params_grads.append([p, scaled_g])
+            continue
+        master_grad = fluid.layers.cast(g, "float32")
+        if loss_scaling > 1:
+            master_grad = master_grad / float(loss_scaling)
+        master_params_grads.append([master_param, master_grad])
+    main_prog._current_role = tmp_role
+    return master_params_grads
+
+
+def master_param_to_train_param(master_params_grads, params_grads, main_prog):
+    for idx, m_p_g in enumerate(master_params_grads):
+        train_p, _ = params_grads[idx]
+        if train_p.name.find("layer_norm") > -1:
+            continue
+        with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
+            cast_fp32_to_fp16(m_p_g[0], train_p, main_prog)
diff --git a/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/init.py b/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/init.py
new file mode 100644
index 0000000000000000000000000000000000000000..3844d01298ecbb70aed37b467aebca62caadd391
--- /dev/null
+++ b/PaddleNLP/Research/MRQA2019-BASELINE/src/utils/init.py
@@ -0,0 +1,81 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+
+import os
+import six
+import ast
+import copy
+
+import numpy as np
+import paddle.fluid as fluid
+
+
+def cast_fp32_to_fp16(exe, main_program):
+    print("Cast parameters to float16 data format.")
+    for param in main_program.global_block().all_parameters():
+        if not param.name.endswith(".master"):
+            param_t = fluid.global_scope().find_var(param.name).get_tensor()
+            data = np.array(param_t)
+            if param.name.find("layer_norm") == -1:
+                param_t.set(np.float16(data).view(np.uint16), exe.place)
+            master_param_var = fluid.global_scope().find_var(param.name +
+                                                             ".master")
+            if master_param_var is not None:
+                master_param_var.get_tensor().set(data, exe.place)
+
+
+def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False):
+    assert os.path.exists(
+        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
+
+    def existed_persistables(var):
+        if not fluid.io.is_persistable(var):
+            return False
+        return os.path.exists(os.path.join(init_checkpoint_path, var.name))
+
+    fluid.io.load_vars(
+        exe,
+        init_checkpoint_path,
+        main_program=main_program,
+        predicate=existed_persistables)
+    print("Loaded model from {}".format(init_checkpoint_path))
+
+    if use_fp16:
+        cast_fp32_to_fp16(exe, main_program)
+
+
+def init_pretraining_params(exe,
+                            pretraining_params_path,
+                            main_program,
+                            use_fp16=False):
+    assert os.path.exists(pretraining_params_path
+                          ), "[%s] can't be found." % pretraining_params_path
+
+    def existed_params(var):
+        if not isinstance(var, fluid.framework.Parameter):
+            return False
+        return os.path.exists(os.path.join(pretraining_params_path, var.name))
+
+    fluid.io.load_vars(
+        exe,
+        pretraining_params_path,
+        main_program=main_program,
+        predicate=existed_params)
+    print("Loaded pretraining parameters from {}.".format(
+        pretraining_params_path))
+
+    if use_fp16:
+        cast_fp32_to_fp16(exe, main_program)
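+
+
+# Usage sketch (illustrative only): a minimal way these helpers could be
+# invoked once the executor and programs exist. The real call sites live in
+# the training/prediction scripts, and the parameter path below is a
+# hypothetical placeholder, not a path created by this repo's scripts.
+#
+#     import paddle.fluid as fluid
+#     place = fluid.CUDAPlace(0)          # or fluid.CPUPlace()
+#     exe = fluid.Executor(place)
+#     exe.run(fluid.default_startup_program())
+#     init_pretraining_params(exe, "./ernie_model/params",
+#                             main_program=fluid.default_main_program())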