diff --git a/PaddleKG/DuIE_Baseline/LICENSE b/PaddleKG/DuIE_Baseline/LICENSE new file mode 100755 index 0000000000000000000000000000000000000000..261eeb9e9f8b2b4b0d119366dda99c6fd7d35c64 --- /dev/null +++ b/PaddleKG/DuIE_Baseline/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. 
Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. 
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
diff --git a/PaddleKG/DuIE_Baseline/README.md b/PaddleKG/DuIE_Baseline/README.md
new file mode 100755
index 0000000000000000000000000000000000000000..8e10001c95c624420a1d9be11731259cbe60fc27
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/README.md
@@ -0,0 +1,64 @@
+## Relation Extraction Baseline System—InfoExtractor 2.0
+### Abstract
+InfoExtractor 2.0 is a relation extraction baseline system developed for DuIE 2.0.
+Different from [DuIE 1.0](http://lic2019.ccf.org.cn/kg), the new 2.0 task leans more toward colloquial language, and further introduces **complex relations**, which entail multiple objects in a single SPO.
+For detailed information about the dataset, please refer to the official website of our [competition](http://bjyz-ai.epc.baidu.com/aistudio/competition/detail/34?isFromCcf=true).
+InfoExtractor 2.0 is built upon the SOTA pre-trained language model [ERNIE](https://arxiv.org/abs/1904.09223) using PaddlePaddle.
+We design a structured **tagging strategy** to directly fine-tune ERNIE, through which multiple, overlapped SPOs can be extracted in **a single pass**.
+The InfoExtractor 2.0 system is simple yet effective, achieving 0.554 F1 on the DuIE 2.0 demo data and 0.848 F1 on DuIE 1.0.
+The hyperparameters are simply set to BATCH_SIZE=16, LEARNING_RATE=2e-5, and EPOCH=10 (without tuning).
+- - -
+### Tagging Strategy
+Our tagging strategy is designed to discover multiple, overlapped SPOs in the DuIE 2.0 task.
+Based on the classic 'BIO' tagging scheme, we assign tags (also known as labels) to each token to indicate its position in an entity span.
+The only difference is that each "B" tag is further distinguished by predicate and by the subject/object dichotomy.
+Suppose there are N predicates. A "B" tag then takes the form "B-predicate-subject" or "B-predicate-object",
+which results in 2*N **mutually exclusive** "B" tags.
+After tagging, we treat the task as token-level multi-label classification, with a total of (2*N+2) labels (2 for the “I” and “O” tags).
+Below is a visual illustration of our tagging strategy:
+*(Figure: Tagging Strategy)*
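+To make the label space concrete, here is a minimal sketch of how such a label set can be constructed (the predicate names below are illustrative placeholders, not the actual DuIE 2.0 schema):
+```
+# Hypothetical predicates -- the real schema defines N task-specific ones.
+predicates = ["birthplace", "author"]            # N = 2
+
+labels = ["I", "O"]                              # the two shared tags
+for p in predicates:
+    labels.append("B-%s-subject" % p)            # opens a subject span
+    labels.append("B-%s-object" % p)             # opens an object span
+
+assert len(labels) == 2 * len(predicates) + 2    # (2*N+2) labels in total
+# Each token receives a multi-hot vector over these labels, so overlapping
+# SPOs can be recovered from a single forward pass.
+```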
+For **complex relations** in the DuIE 2.0 task, we simply treat affiliated objects as independent instances (SPOs) that share the same subject.
+Everything else besides the tagging strategy is implemented in the most straightforward way. The model input is
+[CLS] *input text* [SEP], and the final hidden states are directly projected into classification probabilities.
+- - -
+### Environments
+Python 3 + Paddle Fluid 1.5 for training/evaluation/prediction (please confirm your Python path in the scripts).
+Python 2 for the official evaluation script.
+Dependencies are listed in `./requirements.txt`.
+The code is tested on a single P40 GPU, with CUDA version 10.1 and GPU driver version 418.39.
+
+### Download pre-trained ERNIE model
+Download the ERNIE 1.0 Base (max-len-512) model and extract it into `./pretrained_model/`:
+```
+cd ./pretrained_model/
+wget --no-check-certificate https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz
+tar -zxvf ERNIE_1.0_max-len-512.tar.gz
+```
+### Training
+```
+sh ./script/train.sh
+```
+By default the checkpoints will be saved into `./checkpoints/`.
+The GPU ID can be specified in the script. On P40 devices, the batch size can be set up to 64 under the 256 max-seq-len setting.
+Multi-GPU training is supported after `LD_LIBRARY_PATH` is specified in the script:
+```
+export LD_LIBRARY_PATH=/your/custom/path:$LD_LIBRARY_PATH
+```
+**Accuracy** (token-level and example-level) is printed during the training procedure.
+
+### Prediction
+Specify your checkpoints dir in the prediction script, and then run:
+```
+sh ./script/predict.sh
+```
+This will write the predictions into a JSON file with the same format as the original dataset (required for the final official evaluation). The GPU ID and batch size can be specified in the script. The final prediction file is saved into `./data/`.
+
+### Official Evaluation
+Zip your prediction JSON file and then run the official evaluation:
+```
+zip ./data/predict_test.json.zip ./data/predict_test.json
+python2 ./script/re_official_evaluation.py --golden_file=./data/dev_demo.json --predict_file=./data/predict_test.json.zip [--alias_file alias_dict]
+```
+Precision, Recall and F1 scores are used as the official evaluation metrics to measure the performance of participating systems. The alias file lists entities with more than one correct mention; it is not provided due to security reasons.
diff --git a/PaddleKG/DuIE_Baseline/ernie/__init__.py b/PaddleKG/DuIE_Baseline/ernie/__init__.py
new file mode 100755
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/PaddleKG/DuIE_Baseline/ernie/batching.py b/PaddleKG/DuIE_Baseline/ernie/batching.py
new file mode 100755
index 0000000000000000000000000000000000000000..c3130a3bbe14ae31fbbf08ff8fa005b61ff305ba
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/ernie/batching.py
@@ -0,0 +1,218 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
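+# The mask() routine below follows the BERT/ERNIE-style MLM corruption
+# scheme: roughly 15% of tokens are selected for prediction, of which about
+# 80% are replaced with [MASK], 10% with a random vocabulary id, and 10% are
+# kept unchanged. The probability thresholds 0.15 / 0.03 / 0.015 (and
+# base_prob * 0.2 / base_prob * 0.1 in the whole-word branch) encode exactly
+# this split.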
+"""Mask, padding and batching.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +from six.moves import xrange + + +def mask(batch_tokens, + seg_labels, + mask_word_tags, + total_token_num, + vocab_size, + CLS=1, + SEP=2, + MASK=3): + """ + Add mask for batch_tokens, return out, mask_label, mask_pos; + Note: mask_pos responding the batch_tokens after padded; + """ + max_len = max([len(sent) for sent in batch_tokens]) + mask_label = [] + mask_pos = [] + prob_mask = np.random.rand(total_token_num) + # Note: the first token is [CLS], so [low=1] + replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num) + pre_sent_len = 0 + prob_index = 0 + for sent_index, sent in enumerate(batch_tokens): + mask_flag = False + mask_word = mask_word_tags[sent_index] + prob_index += pre_sent_len + if mask_word: + beg = 0 + for token_index, token in enumerate(sent): + seg_label = seg_labels[sent_index][token_index] + if seg_label == 1: + continue + if beg == 0: + if seg_label != -1: + beg = token_index + continue + + prob = prob_mask[prob_index + beg] + if prob > 0.15: + pass + else: + for index in xrange(beg, token_index): + prob = prob_mask[prob_index + index] + base_prob = 1.0 + if index == beg: + base_prob = 0.15 + if base_prob * 0.2 < prob <= base_prob: + mask_label.append(sent[index]) + sent[index] = MASK + mask_flag = True + mask_pos.append(sent_index * max_len + index) + elif base_prob * 0.1 < prob <= base_prob * 0.2: + mask_label.append(sent[index]) + sent[index] = replace_ids[prob_index + index] + mask_flag = True + mask_pos.append(sent_index * max_len + index) + else: + mask_label.append(sent[index]) + mask_pos.append(sent_index * max_len + index) + + if seg_label == -1: + beg = 0 + else: + beg = token_index + else: + for token_index, token in enumerate(sent): + prob = prob_mask[prob_index + token_index] + if prob > 0.15: + continue + elif 0.03 < prob <= 0.15: + # mask + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = MASK + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + elif 0.015 < prob <= 0.03: + # random replace + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = replace_ids[prob_index + + token_index] + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + else: + # keep the original token + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + mask_pos.append(sent_index * max_len + token_index) + + pre_sent_len = len(sent) + + mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) + mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) + return batch_tokens, mask_label, mask_pos + + +def prepare_batch_data(insts, + total_token_num, + voc_size=0, + pad_id=None, + cls_id=None, + sep_id=None, + mask_id=None, + return_input_mask=True, + return_max_len=True, + return_num_token=False): + + batch_src_ids = [inst[0] for inst in insts] + batch_sent_ids = [inst[1] for inst in insts] + batch_pos_ids = [inst[2] for inst in insts] + labels = [inst[3] for inst in insts] + labels = np.array(labels).astype("int64").reshape([-1, 1]) + seg_labels = [inst[4] for inst in insts] + mask_word_tags = [inst[5] for inst in insts] + + # First step: do mask without padding + assert mask_id >= 0, "[FATAL] mask_id must >= 0" + out, mask_label, mask_pos = mask( + batch_src_ids, + seg_labels, + mask_word_tags, + total_token_num, + 
vocab_size=voc_size, + CLS=cls_id, + SEP=sep_id, + MASK=mask_id) + + # Second step: padding + src_id, self_input_mask = pad_batch_data( + out, pad_idx=pad_id, return_input_mask=True) + pos_id = pad_batch_data(batch_pos_ids, pad_idx=pad_id) + sent_id = pad_batch_data(batch_sent_ids, pad_idx=pad_id) + + return_list = [ + src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos, labels + ] + + return return_list + + +def pad_batch_data(insts, + pad_idx=0, + return_pos=False, + return_input_mask=False, + return_max_len=False, + return_num_token=False, + return_seq_lens=False): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. + """ + return_list = [] + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + + inst_data = np.array( + [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts]) + return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])] + + # position data + if return_pos: + inst_pos = np.array([ + list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) + for inst in insts + ]) + + return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])] + + if return_input_mask: + # This is used to avoid attention on paddings. + input_mask_data = np.array([[1] * len(inst) + [0] * + (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + + if return_seq_lens: + seq_lens = np.array([len(inst) for inst in insts]) + return_list += [seq_lens.astype("int64").reshape([-1, 1])] + + return return_list if len(return_list) > 1 else return_list[0] + + +if __name__ == "__main__": + + pass diff --git a/PaddleKG/DuIE_Baseline/ernie/ernie_encoder.py b/PaddleKG/DuIE_Baseline/ernie/ernie_encoder.py new file mode 100755 index 0000000000000000000000000000000000000000..9773edf9cf6c72703d4e358cb8e2da87da68e06c --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/ernie_encoder.py @@ -0,0 +1,182 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""extract embeddings from ERNIE encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import argparse +import numpy as np +import multiprocessing + +import paddle.fluid as fluid + +import reader.task_reader as task_reader +from model.ernie import ErnieConfig, ErnieModel +from utils.args import ArgumentGroup, print_arguments +from utils.init import init_pretraining_params + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +model_g = ArgumentGroup(parser, "model", "model configuration and paths.") +model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.") +model_g.add_arg("init_pretraining_params", str, None, + "Init pre-training params which preforms fine-tuning from. If the " + "arg 'init_checkpoint' has been set, this argument wouldn't be valid.") +model_g.add_arg("output_dir", str, "embeddings", "path to save embeddings extracted by ernie_encoder.") + +data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options") +data_g.add_arg("data_set", str, None, "Path to data for calculating ernie_embeddings.") +data_g.add_arg("vocab_path", str, None, "Vocabulary path.") +data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.") +data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training.") +data_g.add_arg("do_lower_case", bool, True, + "Whether to lower case the input text. Should be True for uncased models and False for cased models.") + +run_type_g = ArgumentGroup(parser, "run_type", "running type options.") +run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.") +# yapf: enable + + +def create_model(args, pyreader_name, ernie_config): + pyreader = fluid.layers.py_reader( + capacity=50, + shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], [-1, 1]], + dtypes=['int64', 'int64', 'int64', 'int64', 'float', 'int64'], + lod_levels=[0, 0, 0, 0, 0, 0], + name=pyreader_name, + use_double_buffer=True) + + (src_ids, sent_ids, pos_ids, task_ids, input_mask, + seq_lens) = fluid.layers.read_file(pyreader) + + ernie = ErnieModel( + src_ids=src_ids, + position_ids=pos_ids, + sentence_ids=sent_ids, + task_ids=task_ids, + input_mask=input_mask, + config=ernie_config) + + enc_out = ernie.get_sequence_output() + unpad_enc_out = fluid.layers.sequence_unpad(enc_out, length=seq_lens) + cls_feats = ernie.get_pooled_output() + + # set persistable = True to avoid memory opimizing + enc_out.persistable = True + unpad_enc_out.persistable = True + cls_feats.persistable = True + + graph_vars = { + "cls_embeddings": cls_feats, + "top_layer_embeddings": unpad_enc_out, + } + + return pyreader, graph_vars + + +def main(args): + args = parser.parse_args() + ernie_config = ErnieConfig(args.ernie_config_path) + ernie_config.print_config() + + if args.use_cuda: + place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0'))) + dev_count = fluid.core.get_cuda_device_count() + else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) + + exe = fluid.Executor(place) + + reader = task_reader.ExtractEmbeddingReader( + vocab_path=args.vocab_path, + max_seq_len=args.max_seq_len, + do_lower_case=args.do_lower_case) + + startup_prog = fluid.Program() + + data_generator = reader.data_generator( + input_file=args.data_set, + batch_size=args.batch_size, + 
epoch=1, + shuffle=False) + + total_examples = reader.get_num_examples(args.data_set) + + print("Device count: %d" % dev_count) + print("Total num examples: %d" % total_examples) + + infer_program = fluid.Program() + + with fluid.program_guard(infer_program, startup_prog): + with fluid.unique_name.guard(): + pyreader, graph_vars = create_model( + args, pyreader_name='reader', ernie_config=ernie_config) + + infer_program = infer_program.clone(for_test=True) + + exe.run(startup_prog) + + if args.init_pretraining_params: + init_pretraining_params( + exe, args.init_pretraining_params, main_program=startup_prog) + else: + raise ValueError( + "WARNING: args 'init_pretraining_params' must be specified") + + exec_strategy = fluid.ExecutionStrategy() + exec_strategy.num_threads = dev_count + + pyreader.decorate_tensor_provider(data_generator) + pyreader.start() + + total_cls_emb = [] + total_top_layer_emb = [] + total_labels = [] + while True: + try: + cls_emb, unpad_top_layer_emb = exe.run( + program=infer_program, + fetch_list=[ + graph_vars["cls_embeddings"].name, + graph_vars["top_layer_embeddings"].name + ], + return_numpy=False) + # batch_size * embedding_size + total_cls_emb.append(np.array(cls_emb)) + total_top_layer_emb.append(np.array(unpad_top_layer_emb)) + except fluid.core.EOFException: + break + + total_cls_emb = np.concatenate(total_cls_emb) + total_top_layer_emb = np.concatenate(total_top_layer_emb) + + with open(os.path.join(args.output_dir, "cls_emb.npy"), + "wb") as cls_emb_file: + np.save(cls_emb_file, total_cls_emb) + with open(os.path.join(args.output_dir, "top_layer_emb.npy"), + "wb") as top_layer_emb_file: + np.save(top_layer_emb_file, total_top_layer_emb) + + +if __name__ == '__main__': + args = parser.parse_args() + print_arguments(args) + + main(args) diff --git a/PaddleKG/DuIE_Baseline/ernie/extract_chinese_and_punct.py b/PaddleKG/DuIE_Baseline/ernie/extract_chinese_and_punct.py new file mode 100755 index 0000000000000000000000000000000000000000..115e2fdc6aae8a714b28959cd36752f8330b8312 --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/extract_chinese_and_punct.py @@ -0,0 +1,122 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# +# Copyright (c) 2019 Baidu.com, Inc. 
All Rights Reserved +# +""" +requirements: +Authors: daisongtai(daisongtai@baidu.com) +Date: 2019/5/29 6:38 PM +""" +from __future__ import print_function +import sys +import re +import io + +LHan = [ + [0x2E80, 0x2E99], # Han # So [26] CJK RADICAL REPEAT, CJK RADICAL RAP + [0x2E9B, 0x2EF3 + ], # Han # So [89] CJK RADICAL CHOKE, CJK RADICAL C-SIMPLIFIED TURTLE + [0x2F00, 0x2FD5], # Han # So [214] KANGXI RADICAL ONE, KANGXI RADICAL FLUTE + 0x3005, # Han # Lm IDEOGRAPHIC ITERATION MARK + 0x3007, # Han # Nl IDEOGRAPHIC NUMBER ZERO + [0x3021, + 0x3029], # Han # Nl [9] HANGZHOU NUMERAL ONE, HANGZHOU NUMERAL NINE + [0x3038, + 0x303A], # Han # Nl [3] HANGZHOU NUMERAL TEN, HANGZHOU NUMERAL THIRTY + 0x303B, # Han # Lm VERTICAL IDEOGRAPHIC ITERATION MARK + [ + 0x3400, 0x4DB5 + ], # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400, CJK UNIFIED IDEOGRAPH-4DB5 + [ + 0x4E00, 0x9FC3 + ], # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00, CJK UNIFIED IDEOGRAPH-9FC3 + [ + 0xF900, 0xFA2D + ], # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900, CJK COMPATIBILITY IDEOGRAPH-FA2D + [ + 0xFA30, 0xFA6A + ], # Han # Lo [59] CJK COMPATIBILITY IDEOGRAPH-FA30, CJK COMPATIBILITY IDEOGRAPH-FA6A + [ + 0xFA70, 0xFAD9 + ], # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70, CJK COMPATIBILITY IDEOGRAPH-FAD9 + [ + 0x20000, 0x2A6D6 + ], # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000, CJK UNIFIED IDEOGRAPH-2A6D6 + [0x2F800, 0x2FA1D] +] # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800, CJK COMPATIBILITY IDEOGRAPH-2FA1D + +CN_PUNCTS = [(0x3002, "。"), (0xFF1F, "?"), (0xFF01, "!"), (0xFF0C, ","), + (0x3001, "、"), (0xFF1B, ";"), (0xFF1A, ":"), (0x300C, "「"), + (0x300D, "」"), (0x300E, "『"), (0x300F, "』"), (0x2018, "‘"), + (0x2019, "’"), (0x201C, "“"), (0x201D, "”"), (0xFF08, "("), + (0xFF09, ")"), (0x3014, "〔"), (0x3015, "〕"), (0x3010, "【"), + (0x3011, "】"), (0x2014, "—"), (0x2026, "…"), (0x2013, "–"), + (0xFF0E, "."), (0x300A, "《"), (0x300B, "》"), (0x3008, "〈"), + (0x3009, "〉"), (0x2015, "―"), (0xff0d, "-"), (0x0020, " ")] +#(0xFF5E, "~"), + +EN_PUNCTS = [[0x0021, 0x002F], [0x003A, 0x0040], [0x005B, 0x0060], + [0x007B, 0x007E]] + + +class ChineseAndPunctuationExtractor(object): + def __init__(self): + self.chinese_re = self.build_re() + + def is_chinese_or_punct(self, c): + if self.chinese_re.match(c): + return True + else: + return False + + def build_re(self): + L = [] + for i in LHan: + if isinstance(i, list): + f, t = i + try: + f = chr(f) + t = chr(t) + L.append('%s-%s' % (f, t)) + except: + pass # A narrow python build, so can't use chars > 65535 without surrogate pairs! + + else: + try: + L.append(chr(i)) + except: + pass + for j, _ in CN_PUNCTS: + try: + L.append(chr(j)) + except: + pass + + for k in EN_PUNCTS: + f, t = k + try: + f = chr(f) + t = chr(t) + L.append('%s-%s' % (f, t)) + except: + raise ValueError() + pass # A narrow python build, so can't use chars > 65535 without surrogate pairs! 
+ + RE = '[%s]' % ''.join(L) + # print('RE:', RE.encode('utf-8')) + return re.compile(RE, re.UNICODE) + + +if __name__ == '__main__': + extractor = ChineseAndPunctuationExtractor() + for c in "韩邦庆(1856~1894)曾用名寄,字子云,别署太仙、大一山人、花也怜侬、三庆": + if extractor.is_chinese_or_punct(c): + print(c, 'yes') + else: + print(c, "no") + + print("~", extractor.is_chinese_or_punct("~")) + print("~", extractor.is_chinese_or_punct("~")) + print("―", extractor.is_chinese_or_punct("―")) + print("-", extractor.is_chinese_or_punct("-")) diff --git a/PaddleKG/DuIE_Baseline/ernie/finetune/__init__.py b/PaddleKG/DuIE_Baseline/ernie/finetune/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleKG/DuIE_Baseline/ernie/finetune/relation_extraction_multi_cls.py b/PaddleKG/DuIE_Baseline/ernie/finetune/relation_extraction_multi_cls.py new file mode 100755 index 0000000000000000000000000000000000000000..5c27f07c8ae8d405b581cbe80fc8f111c7a292ac --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/finetune/relation_extraction_multi_cls.py @@ -0,0 +1,451 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +import os +import time +import argparse +import numpy as np +import json +import multiprocessing + +import paddle +import logging +import paddle.fluid as fluid + +from model.ernie import ErnieModel + +log = logging.getLogger(__name__) + + +def create_model(args, pyreader_name, ernie_config): + pyreader = fluid.layers.py_reader( + capacity=50, + shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, args.num_labels], [-1, 1], [-1, 1], + [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1]], + dtypes=[ + 'int64', 'int64', 'int64', 'int64', 'float32', 'float32', 'int64', + 'int64', 'int64', 'int64' + ], + lod_levels=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + name=pyreader_name, + use_double_buffer=True) + + (src_ids, sent_ids, pos_ids, task_ids, input_mask, labels, seq_lens, + example_index, tok_to_orig_start_index, + tok_to_orig_end_index) = fluid.layers.read_file(pyreader) + + ernie = ErnieModel( + src_ids=src_ids, + position_ids=pos_ids, + sentence_ids=sent_ids, + task_ids=task_ids, + input_mask=input_mask, + config=ernie_config, + use_fp16=args.use_fp16) + + enc_out = ernie.get_sequence_output() + enc_out = fluid.layers.dropout( + x=enc_out, dropout_prob=0.1, dropout_implementation="upscale_in_train") + logits = fluid.layers.fc( + input=enc_out, + size=args.num_labels, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name="cls_seq_label_out_w", + initializer=fluid.initializer.TruncatedNormal(scale=0.02)), + bias_attr=fluid.ParamAttr( + name="cls_seq_label_out_b", + 
initializer=fluid.initializer.Constant(0.))) + logits = fluid.layers.sigmoid(logits) + + lod_labels = fluid.layers.sequence_unpad(labels, seq_lens) + lod_logits = fluid.layers.sequence_unpad(logits, seq_lens) + lod_tok_to_orig_start_index = fluid.layers.sequence_unpad( + tok_to_orig_start_index, seq_lens) + lod_tok_to_orig_end_index = fluid.layers.sequence_unpad( + tok_to_orig_end_index, seq_lens) + + labels = fluid.layers.flatten(labels, axis=2) + logits = fluid.layers.flatten(logits, axis=2) + input_mask = fluid.layers.flatten(input_mask, axis=2) + + # calculate loss + log_logits = fluid.layers.log(logits) + log_logits_neg = fluid.layers.log(1 - logits) + ce_loss = 0. - labels * log_logits - (1 - labels) * log_logits_neg + + ce_loss = fluid.layers.reduce_mean(ce_loss, dim=1, keep_dim=True) + ce_loss = ce_loss * input_mask + loss = fluid.layers.mean(x=ce_loss) + + graph_vars = { + "inputs": src_ids, + "loss": loss, + "seqlen": seq_lens, + "lod_logit": lod_logits, + "lod_label": lod_labels, + "example_index": example_index, + "tok_to_orig_start_index": lod_tok_to_orig_start_index, + "tok_to_orig_end_index": lod_tok_to_orig_end_index + } + + for k, v in graph_vars.items(): + v.persistable = True + + return pyreader, graph_vars + + +def calculate_acc(logits, labels): + # the golden metric should be "f1" in spo level + # but here only "accuracy" is computed during training for simplicity (provide crude view of your training status) + # accuracy is dependent on the tagging strategy + # for each token, the prediction is counted as correct if all its 100 labels were correctly predicted + # for each example, the prediction is counted as correct if all its token were correctly predicted + + logits_lod = logits.lod() + labels_lod = labels.lod() + logits_tensor = np.array(logits) + labels_tensor = np.array(labels) + assert logits_lod == labels_lod + + num_total = 0 + num_correct = 0 + token_total = 0 + token_correct = 0 + for i in range(len(logits_lod[0]) - 1): + inference_tmp = logits_tensor[logits_lod[0][i]:logits_lod[0][i + 1]] + inference_tmp[inference_tmp >= 0.5] = 1 + inference_tmp[inference_tmp < 0.5] = 0 + label_tmp = labels_tensor[labels_lod[0][i]:labels_lod[0][i + 1]] + num_total += 1 + if (inference_tmp == label_tmp).all(): + num_correct += 1 + for j in range(len(inference_tmp)): + token_total += 1 + if (inference_tmp[j] == label_tmp[j]).all(): + token_correct += 1 + return num_correct, num_total, token_correct, token_total + + +def calculate_metric(spo_list_gt, spo_list_predict): + # calculate golden metric precision, recall and f1 + # may be slightly different with final official evaluation on test set, + # because more comprehensive detail is considered (e.g. 
alias) + tp, fp, fn = 0, 0, 0 + + for spo in spo_list_predict: + flag = 0 + for spo_gt in spo_list_gt: + if spo['predicate'] == spo_gt['predicate'] and spo[ + 'object'] == spo_gt['object'] and spo['subject'] == spo_gt[ + 'subject']: + flag = 1 + tp += 1 + break + if flag == 0: + fp += 1 + ''' + for spo in spo_list_predict: + if spo in spo_list_gt: + tp += 1 + else: + fp += 1 + ''' + + fn = len(spo_list_gt) - tp + return tp, fp, fn + + +def evaluate(args, examples, exe, program, pyreader, graph_vars): + spo_label_map = json.load(open(args.spo_label_map_config)) + + fetch_list = [ + graph_vars["lod_logit"].name, graph_vars["lod_label"].name, + graph_vars["example_index"].name, + graph_vars["tok_to_orig_start_index"].name, + graph_vars["tok_to_orig_end_index"].name + ] + + tp, fp, fn = 0, 0, 0 + + time_begin = time.time() + pyreader.start() + while True: + try: + # prepare fetched batch data: unlod etc. + logits, labels, example_index_list, tok_to_orig_start_index_list, tok_to_orig_end_index_list = \ + exe.run(program=program, fetch_list=fetch_list, return_numpy=False) + example_index_list = np.array(example_index_list).astype( + int) - 100000 + logits_lod = logits.lod() + tok_to_orig_start_index_list_lod = tok_to_orig_start_index_list.lod( + ) + tok_to_orig_end_index_list_lod = tok_to_orig_end_index_list.lod() + logits_tensor = np.array(logits) + tok_to_orig_start_index_list = np.array( + tok_to_orig_start_index_list).flatten() + tok_to_orig_end_index_list = np.array( + tok_to_orig_end_index_list).flatten() + + # perform evaluation + for i in range(len(logits_lod[0]) - 1): + # prepare prediction results for each example + example_index = example_index_list[i] + example = examples[example_index] + tok_to_orig_start_index = tok_to_orig_start_index_list[ + tok_to_orig_start_index_list_lod[0][ + i]:tok_to_orig_start_index_list_lod[0][i + 1] - 2] + tok_to_orig_end_index = tok_to_orig_end_index_list[ + tok_to_orig_end_index_list_lod[0][ + i]:tok_to_orig_end_index_list_lod[0][i + 1] - 2] + inference_tmp = logits_tensor[logits_lod[0][i]:logits_lod[0][i + + 1]] + labels_tmp = np.array(labels)[logits_lod[0][i]:logits_lod[0][i + + 1]] + + # some simple post process + inference_tmp = post_process(inference_tmp) + + # logits -> classification results + inference_tmp[inference_tmp >= 0.5] = 1 + inference_tmp[inference_tmp < 0.5] = 0 + predict_result = [] + for token in inference_tmp: + predict_result.append(np.argwhere(token == 1).tolist()) + + # format prediction into spo, calculate metric + formated_result = format_output( + example, predict_result, spo_label_map, + tok_to_orig_start_index, tok_to_orig_end_index) + tp_tmp, fp_tmp, fn_tmp = calculate_metric( + example['spo_list'], formated_result['spo_list']) + + tp += tp_tmp + fp += fp_tmp + fn += fn_tmp + + except fluid.core.EOFException: + pyreader.reset() + break + + time_end = time.time() + p = tp / (tp + fp) if tp + fp != 0 else 0 + r = tp / (tp + fn) if tp + fn != 0 else 0 + f = 2 * p * r / (p + r) if p + r != 0 else 0 + return "[evaluation] precision: %f, recall: %f, f1: %f, elapsed time: %f s" % ( + p, r, f, time_end - time_begin) + + +def predict(args, examples, exe, test_program, test_pyreader, graph_vars): + + spo_label_map = json.load(open(args.spo_label_map_config)) + + fetch_list = [ + graph_vars["lod_logit"].name, graph_vars["lod_label"].name, + graph_vars["example_index"].name, + graph_vars["tok_to_orig_start_index"].name, + graph_vars["tok_to_orig_end_index"].name + ] + + test_pyreader.start() + res = [] + while True: + try: + # prepare 
fetched batch data: unlod etc. + logits, labels, example_index_list, tok_to_orig_start_index_list, tok_to_orig_end_index_list = \ + exe.run(program=test_program, fetch_list=fetch_list, return_numpy=False) + example_index_list = np.array(example_index_list).astype( + int) - 100000 + logits_lod = logits.lod() + tok_to_orig_start_index_list_lod = tok_to_orig_start_index_list.lod( + ) + tok_to_orig_end_index_list_lod = tok_to_orig_end_index_list.lod() + logits_tensor = np.array(logits) + tok_to_orig_start_index_list = np.array( + tok_to_orig_start_index_list).flatten() + tok_to_orig_end_index_list = np.array( + tok_to_orig_end_index_list).flatten() + + # perform evaluation + for i in range(len(logits_lod[0]) - 1): + # prepare prediction results for each example + example_index = example_index_list[i] + example = examples[example_index] + tok_to_orig_start_index = tok_to_orig_start_index_list[ + tok_to_orig_start_index_list_lod[0][ + i]:tok_to_orig_start_index_list_lod[0][i + 1] - 2] + tok_to_orig_end_index = tok_to_orig_end_index_list[ + tok_to_orig_end_index_list_lod[0][ + i]:tok_to_orig_end_index_list_lod[0][i + 1] - 2] + inference_tmp = logits_tensor[logits_lod[0][i]:logits_lod[0][i + + 1]] + + # some simple post process + inference_tmp = post_process(inference_tmp) + + # logits -> classification results + inference_tmp[inference_tmp >= 0.5] = 1 + inference_tmp[inference_tmp < 0.5] = 0 + predict_result = [] + for token in inference_tmp: + predict_result.append(np.argwhere(token == 1).tolist()) + + # format prediction into spo, calculate metric + formated_result = format_output( + example, predict_result, spo_label_map, + tok_to_orig_start_index, tok_to_orig_end_index) + + res.append(formated_result) + except fluid.core.EOFException: + test_pyreader.reset() + break + return res + + +def post_process(inference): + # this post process only brings limited improvements (less than 0.5 f1) in order to keep simplicity + # to obtain better results, CRF is recommended + reference = [] + for token in inference: + token_ = token.copy() + token_[token_ >= 0.5] = 1 + token_[token_ < 0.5] = 0 + reference.append(np.argwhere(token_ == 1)) + + # token was classified into conflict situation (both 'I' and 'B' tag) + for i, token in enumerate(reference[:-1]): + if [0] in token and len(token) >= 2: + if [1] in reference[i + 1]: + inference[i][0] = 0 + else: + inference[i][2:] = 0 + + # token wasn't assigned any cls ('B', 'I', 'O' tag all zero) + for i, token in enumerate(reference[:-1]): + if len(token) == 0: + if [1] in reference[i - 1] and [1] in reference[i + 1]: + inference[i][1] = 1 + elif [1] in reference[i + 1]: + inference[i][np.argmax(inference[i, 1:]) + 1] = 1 + + # handle with empty spo: to be implemented + + return inference + + +def format_output(example, predict_result, spo_label_map, + tok_to_orig_start_index, tok_to_orig_end_index): + # format prediction into example-style output + complex_relation_label = [8, 10, 26, 32, 46] + complex_relation_affi_label = [9, 11, 27, 28, 29, 33, 47] + instance = {} + predict_result = predict_result[1:len(predict_result) - + 1] # remove [CLS] and [SEP] + text_raw = example['text'] + + flatten_predict = [] + for layer_1 in predict_result: + for layer_2 in layer_1: + flatten_predict.append(layer_2[0]) + + subject_id_list = [] + for cls_label in list(set(flatten_predict)): + if 1 < cls_label <= 56 and (cls_label + 55) in flatten_predict: + subject_id_list.append(cls_label) + subject_id_list = list(set(subject_id_list)) + + def find_entity(id_, predict_result): + 
entity_list = [] + for i in range(len(predict_result)): + if [id_] in predict_result[i]: + j = 0 + while i + j + 1 < len(predict_result): + if [1] in predict_result[i + j + 1]: + j += 1 + else: + break + entity = ''.join(text_raw[tok_to_orig_start_index[i]: + tok_to_orig_end_index[i + j] + 1]) + entity_list.append(entity) + + return list(set(entity_list)) + + spo_list = [] + for id_ in subject_id_list: + if id_ in complex_relation_affi_label: + continue + if id_ not in complex_relation_label: + subjects = find_entity(id_, predict_result) + objects = find_entity(id_ + 55, predict_result) + for subject_ in subjects: + for object_ in objects: + spo_list.append({ + "predicate": spo_label_map['predicate'][id_], + "object_type": { + '@value': spo_label_map['object_type'][id_] + }, + 'subject_type': spo_label_map['subject_type'][id_], + "object": { + '@value': object_ + }, + "subject": subject_ + }) + else: + # traverse all complex relation and look through their corresponding affiliated objects + subjects = find_entity(id_, predict_result) + objects = find_entity(id_ + 55, predict_result) + for subject_ in subjects: + for object_ in objects: + object_dict = {'@value': object_} + object_type_dict = { + '@value': + spo_label_map['object_type'][id_].split('_')[0] + } + + if id_ in [8, 10, 32, 46] and id_ + 1 in subject_id_list: + id_affi = id_ + 1 + object_dict[spo_label_map['object_type'][id_affi].split( + '_')[1]] = find_entity(id_affi + 55, + predict_result)[0] + object_type_dict[spo_label_map['object_type'][ + id_affi].split('_')[1]] = spo_label_map[ + 'object_type'][id_affi].split('_')[0] + elif id_ == 26: + for id_affi in [27, 28, 29]: + if id_affi in subject_id_list: + object_dict[spo_label_map['object_type'][id_affi].split('_')[1]] = \ + find_entity(id_affi + 55, predict_result)[0] + object_type_dict[spo_label_map['object_type'][id_affi].split('_')[1]] = \ + spo_label_map['object_type'][id_affi].split('_')[0] + + spo_list.append({ + "predicate": spo_label_map['predicate'][id_], + "object_type": object_type_dict, + "subject_type": spo_label_map['subject_type'][id_], + "object": object_dict, + "subject": subject_ + }) + + instance['text'] = example['text'] + instance['spo_list'] = spo_list + return instance diff --git a/PaddleKG/DuIE_Baseline/ernie/finetune_args.py b/PaddleKG/DuIE_Baseline/ernie/finetune_args.py new file mode 100755 index 0000000000000000000000000000000000000000..f28445bb44406a15d4ef08c7babf9fa9f5c3bbe1 --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/finetune_args.py @@ -0,0 +1,115 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from __future__ import unicode_literals
+
+import os
+import time
+import argparse
+
+from utils.args import ArgumentGroup
+
+# yapf: disable
+parser = argparse.ArgumentParser(__doc__)
+model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
+model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
+model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
+model_g.add_arg("init_pretraining_params", str, None,
+                "Init pre-training params from which fine-tuning is performed. If the "
+                "arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
+model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
+
+model_g.add_arg("is_classify", bool, True, "is_classify")
+model_g.add_arg("is_regression", bool, False, "is_regression")
+model_g.add_arg("task_id", int, 0, "task id")
+
+train_g = ArgumentGroup(parser, "training", "training options.")
+train_g.add_arg("epoch", int, 3, "Number of epochs for fine-tuning.")
+train_g.add_arg("max_steps", int, 0, "Number of steps for fine-tuning.")
+train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
+train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
+                "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
+train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
+train_g.add_arg("warmup_proportion", float, 0.1,
+                "Proportion of training steps to perform linear learning rate warmup for.")
+train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
+train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
+train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
+train_g.add_arg("use_dynamic_loss_scaling", bool, True, "Whether to use dynamic loss scaling.")
+train_g.add_arg("init_loss_scaling", float, 102400,
+                "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
+
+train_g.add_arg("test_save", str, "./checkpoints/test_result", "test_save")
+train_g.add_arg("metric", str, "simple_accuracy", "metric")
+train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive steps without nan or inf gradients.")
+train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
+                "Decreases loss scaling every n accumulated steps with nan or inf gradients.")
+train_g.add_arg("incr_ratio", float, 2.0,
+                "The multiplier to use when increasing the loss scaling.")
+train_g.add_arg("decr_ratio", float, 0.8,
+                "The less-than-one multiplier to use when decreasing the loss scaling.")
+
+log_g = ArgumentGroup(parser, "logging", "logging related.")
+log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
+log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
+
+data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
+data_g.add_arg("tokenizer", str, "FullTokenizer",
+               "ATTENTION: the INPUT must be split into words separated by spaces when using SentencepieceTokenizer or WordsegTokenizer")
+data_g.add_arg("train_set", str, None, "Path to training data.")
+data_g.add_arg("test_set", str, None, "Path to test data.")
+data_g.add_arg("dev_set", str, None, "Path to validation data.")
+data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
+data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.") +data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.") +data_g.add_arg("predict_batch_size", int, None, "Total examples' number in batch for predict. see also --in_tokens.") +data_g.add_arg("in_tokens", bool, False, + "If set, the batch size will be the maximum number of tokens in one batch. " + "Otherwise, it will be the maximum number of examples in one batch.") +data_g.add_arg("do_lower_case", bool, True, + "Whether to lower case the input text. Should be True for uncased models and False for cased models.") +data_g.add_arg("random_seed", int, None, "Random seed.") +data_g.add_arg("label_map_config", str, None, "label_map_path.") +data_g.add_arg("spo_label_map_config", str, None, "spo_label_map_path.") +data_g.add_arg("num_labels", int, 2, "label number") +data_g.add_arg("diagnostic", str, None, "GLUE Diagnostic Dataset") +data_g.add_arg("diagnostic_save", str, None, "GLUE Diagnostic save f") +data_g.add_arg("max_query_length", int, 64, "Max query length.") +data_g.add_arg("max_answer_length", int, 100, "Max answer length.") +data_g.add_arg("doc_stride", int, 128, + "When splitting up a long document into chunks, how much stride to take between chunks.") +data_g.add_arg("n_best_size", int, 20, + "The total number of n-best predictions to generate in the nbest_predictions.json output file.") +data_g.add_arg("chunk_scheme", type=str, default="IOB", choices=["IO", "IOB", "IOE", "IOBES"], help="chunk scheme") + +run_type_g = ArgumentGroup(parser, "run_type", "running type options.") +run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.") +run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.") +run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).") +run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.") +run_type_g.add_arg("do_train", bool, True, "Whether to perform training.") +run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.") +run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.") +run_type_g.add_arg("use_multi_gpu_test", bool, False, "Whether to perform evaluation using multiple gpu cards") +run_type_g.add_arg("metrics", bool, True, "Whether to perform evaluation on test data set.") +run_type_g.add_arg("shuffle", bool, True, "") +run_type_g.add_arg("for_cn", bool, True, "model train for cn or for other langs.") + +parser.add_argument("--enable_ce", action='store_true', help="The flag indicating whether to run the task for continuous evaluation.") +# yapf: enable diff --git a/PaddleKG/DuIE_Baseline/ernie/finetune_launch.py b/PaddleKG/DuIE_Baseline/ernie/finetune_launch.py new file mode 100755 index 0000000000000000000000000000000000000000..b45d92cb51a17291380d55ba5111c2fcda7403f6 --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/finetune_launch.py @@ -0,0 +1,186 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from __future__ import unicode_literals
+
+import sys
+import subprocess
+import os
+import six
+import copy
+import argparse
+import time
+import logging
+
+from utils.args import ArgumentGroup, print_arguments, prepare_logger
+from finetune_args import parser as worker_parser
+
+# yapf: disable
+parser = argparse.ArgumentParser(__doc__)
+multip_g = ArgumentGroup(parser, "multiprocessing",
+                         "start paddle training using multi-processing mode.")
+multip_g.add_arg("node_ips", str, None,
+                 "paddle trainer ips")
+multip_g.add_arg("node_id", int, 0,
+                 "the trainer id of the node for multi-node distributed training.")
+multip_g.add_arg("print_config", bool, True,
+                 "print the config of multi-processing mode.")
+multip_g.add_arg("current_node_ip", str, None,
+                 "the ip of current node.")
+multip_g.add_arg("split_log_path", str, "log",
+                 "log path for each trainer.")
+multip_g.add_arg("log_prefix", str, "",
+                 "the prefix name of job log.")
+multip_g.add_arg("nproc_per_node", int, 8,
+                 "the number of processes to use on each node.")
+multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7",
+                 "the gpus selected to use.")
+multip_g.add_arg("training_script", str, None, "the program/script to be launched "
+                 "in parallel followed by all the arguments", positional_arg=True)
+multip_g.add_arg("training_script_args", str, None,
+                 "training script args", positional_arg=True, nargs=argparse.REMAINDER)
+# yapf: enable
+
+log = logging.getLogger()
+
+
+def start_procs(args):
+    procs = []
+    log_fns = []
+
+    default_env = os.environ.copy()
+
+    node_id = args.node_id
+    node_ips = [x.strip() for x in args.node_ips.split(',')]
+    current_ip = args.current_node_ip
+    if args.current_node_ip is None:
+        assert len(node_ips) == 1
+        current_ip = node_ips[0]
+    log.info(current_ip)
+
+    num_nodes = len(node_ips)
+    selected_gpus = [x.strip() for x in args.selected_gpus.split(',')]
+    selected_gpu_num = len(selected_gpus)
+
+    all_trainer_endpoints = ""
+    for ip in node_ips:
+        for i in range(args.nproc_per_node):
+            if all_trainer_endpoints != "":
+                all_trainer_endpoints += ","
+            all_trainer_endpoints += "%s:617%d" % (ip, i)
+
+    nranks = num_nodes * args.nproc_per_node
+    # split the selected gpus evenly among the local processes
+    # (ceiling division when they do not divide evenly)
+    if selected_gpu_num % args.nproc_per_node == 0:
+        gpus_per_proc = selected_gpu_num // args.nproc_per_node
+    else:
+        gpus_per_proc = selected_gpu_num // args.nproc_per_node + 1
+
+    selected_gpus_per_proc = [
+        selected_gpus[i:i + gpus_per_proc]
+        for i in range(0, len(selected_gpus), gpus_per_proc)
+    ]
+
+    if args.print_config:
+        log.info("all_trainer_endpoints: %s"
+                 ", node_id: %s"
+                 ", current_ip: %s"
+                 ", num_nodes: %s"
+                 ", node_ips: %s"
+                 ", gpus_per_proc: %s"
+                 ", selected_gpus_per_proc: %s"
+                 ", nranks: %s" %
+                 (all_trainer_endpoints, node_id, current_ip, num_nodes,
+                  node_ips, gpus_per_proc, selected_gpus_per_proc, nranks))
+
+    current_env = copy.copy(default_env)
+    procs = []
+    cmds = []
+    log_fns = []
+    for i in range(0, args.nproc_per_node):
+        trainer_id = node_id * args.nproc_per_node + i
+        assert current_ip is not None
+        current_env.update({
+            "FLAGS_selected_gpus":
+            "%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]),
+            "PADDLE_TRAINER_ID": "%d" % trainer_id,
+            "PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i),
+            "PADDLE_TRAINERS_NUM": "%d" % nranks,
+            "PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints,
+            "PADDLE_NODES_NUM": "%d" % num_nodes
+        })
+
+        try:
+            idx = args.training_script_args.index('--is_distributed')
+            args.training_script_args[idx + 1] = 'true'
+        except ValueError:
+            args.training_script_args += ['--is_distributed', 'true']
+
+        cmd = [sys.executable, "-u", args.training_script
+               ] + args.training_script_args
+        cmds.append(cmd)
+
+        if args.split_log_path:
+            logdir = "%s/%sjob.log.%d" % (args.split_log_path, args.log_prefix,
+                                          trainer_id)
+            try:
+                os.mkdir(os.path.dirname(logdir))
+            except OSError:
+                pass
+            fn = open(logdir, "a")
+            log_fns.append(fn)
+            process = subprocess.Popen(
+                cmd, env=current_env, stdout=fn, stderr=fn)
+            log.info('subprocess launched, check log at %s' % logdir)
+        else:
+            process = subprocess.Popen(cmd, env=current_env)
+            log.info('subprocess launched')
+        procs.append(process)
+
+    try:
+        for i in range(len(procs)):
+            proc = procs[i]
+            proc.wait()
+            if len(log_fns) > 0:
+                log_fns[i].close()
+            if proc.returncode != 0:
+                raise subprocess.CalledProcessError(
+                    returncode=procs[i].returncode, cmd=cmds[i])
+            else:
+                log.info("proc %d finished" % i)
+    except KeyboardInterrupt:
+        for p in procs:
+            log.info('killing %s' % p)
+            p.terminate()
+
+
+def main(args):
+    if args.print_config:
+        print_arguments(args)
+    start_procs(args)
+
+
+if __name__ == "__main__":
+    prepare_logger(log)
+    launch_args = parser.parse_args()
+    finetuning_args = worker_parser.parse_args(launch_args.training_script_args)
+    init_path = finetuning_args.init_pretraining_params
+    log.info("init model: %s" % init_path)
+    if not finetuning_args.use_fp16:
+        # strip the ".master" suffix from fp32 master copies of the
+        # pretrained parameters (requires the "rename" utility)
+        os.system('rename .master "" ' + init_path + '/*.master')
+    main(launch_args)
diff --git a/PaddleKG/DuIE_Baseline/ernie/model/__init__.py b/PaddleKG/DuIE_Baseline/ernie/model/__init__.py
new file mode 100755
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/PaddleKG/DuIE_Baseline/ernie/model/ernie.py b/PaddleKG/DuIE_Baseline/ernie/model/ernie.py
new file mode 100755
index 0000000000000000000000000000000000000000..20b079f37c619dfe95d10a0cf71c8f1b9644756e
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/ernie/model/ernie.py
@@ -0,0 +1,266 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
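The launcher above coordinates one worker per process slot by exporting the PADDLE_* environment variables before each subprocess.Popen call. The following is a minimal, standard-library-only sketch of that same pattern; the endpoint scheme and variable names mirror the code above, while the function name launch_workers and its parameters are illustrative, not part of the baseline:

import os
import subprocess
import sys

def launch_workers(script, nproc=2, ip="127.0.0.1"):
    """Sketch: spawn one worker per process slot, mirroring start_procs."""
    endpoints = ",".join("%s:617%d" % (ip, i) for i in range(nproc))
    procs = []
    for i in range(nproc):
        env = dict(os.environ,
                   FLAGS_selected_gpus=str(i),
                   PADDLE_TRAINER_ID=str(i),
                   PADDLE_CURRENT_ENDPOINT="%s:617%d" % (ip, i),
                   PADDLE_TRAINERS_NUM=str(nproc),
                   PADDLE_TRAINER_ENDPOINTS=endpoints)
        procs.append(subprocess.Popen([sys.executable, "-u", script], env=env))
    for p in procs:
        p.wait()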
+"""Ernie model.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +import json +import six +import logging +import paddle.fluid as fluid +from io import open + +from model.transformer_encoder import encoder, pre_process_layer + +log = logging.getLogger(__name__) + + +class ErnieConfig(object): + def __init__(self, config_path): + self._config_dict = self._parse(config_path) + + def _parse(self, config_path): + try: + with open(config_path, 'r', encoding='utf8') as json_file: + config_dict = json.load(json_file) + except Exception: + raise IOError("Error in parsing Ernie model config file '%s'" % + config_path) + else: + return config_dict + + def __getitem__(self, key): + return self._config_dict.get(key, None) + + def print_config(self): + for arg, value in sorted(six.iteritems(self._config_dict)): + log.info('%s: %s' % (arg, value)) + log.info('------------------------------------------------') + + +class ErnieModel(object): + def __init__(self, + src_ids, + position_ids, + sentence_ids, + task_ids, + input_mask, + config, + weight_sharing=True, + use_fp16=False): + + self._emb_size = config['hidden_size'] + self._n_layer = config['num_hidden_layers'] + self._n_head = config['num_attention_heads'] + self._voc_size = config['vocab_size'] + self._max_position_seq_len = config['max_position_embeddings'] + if config['sent_type_vocab_size']: + self._sent_types = config['sent_type_vocab_size'] + else: + self._sent_types = config['type_vocab_size'] + + self._use_task_id = config['use_task_id'] + if self._use_task_id: + self._task_types = config['task_type_vocab_size'] + self._hidden_act = config['hidden_act'] + self._prepostprocess_dropout = config['hidden_dropout_prob'] + self._attention_dropout = config['attention_probs_dropout_prob'] + self._weight_sharing = weight_sharing + + self._word_emb_name = "word_embedding" + self._pos_emb_name = "pos_embedding" + self._sent_emb_name = "sent_embedding" + self._task_emb_name = "task_embedding" + self._dtype = "float16" if use_fp16 else "float32" + self._emb_dtype = "float32" + + # Initialize all weigths by truncated normal initializer, and all biases + # will be initialized by constant zero by default. 
+ self._param_initializer = fluid.initializer.TruncatedNormal( + scale=config['initializer_range']) + + self._build_model(src_ids, position_ids, sentence_ids, task_ids, + input_mask) + + def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, + input_mask): + # padding id in vocabulary must be set to 0 + emb_out = fluid.layers.embedding( + input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr( + name=self._word_emb_name, initializer=self._param_initializer), + is_sparse=False) + + position_emb_out = fluid.layers.embedding( + input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr( + name=self._pos_emb_name, initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding( + sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr( + name=self._sent_emb_name, initializer=self._param_initializer)) + + emb_out = emb_out + position_emb_out + emb_out = emb_out + sent_emb_out + + if self._use_task_id: + task_emb_out = fluid.layers.embedding( + task_ids, + size=[self._task_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr( + name=self._task_emb_name, + initializer=self._param_initializer)) + + emb_out = emb_out + task_emb_out + + emb_out = pre_process_layer( + emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder') + + if self._dtype == "float16": + emb_out = fluid.layers.cast(x=emb_out, dtype=self._dtype) + input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype) + self_attn_mask = fluid.layers.matmul( + x=input_mask, y=input_mask, transpose_y=True) + + self_attn_mask = fluid.layers.scale( + x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = fluid.layers.stack( + x=[self_attn_mask] * self._n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + self._enc_out = encoder( + enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + name='encoder') + if self._dtype == "float16": + self._enc_out = fluid.layers.cast( + x=self._enc_out, dtype=self._emb_dtype) + + def get_sequence_output(self): + return self._enc_out + + def get_pooled_output(self): + """Get the first feature of each sequence for classification""" + next_sent_feat = fluid.layers.slice( + input=self._enc_out, axes=[1], starts=[0], ends=[1]) + next_sent_feat = fluid.layers.fc( + input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr( + name="pooled_fc.w_0", initializer=self._param_initializer), + bias_attr="pooled_fc.b_0") + return next_sent_feat + + def get_lm_output(self, mask_label, mask_pos): + """Get the loss & accuracy for pretraining""" + + mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32') + + # extract the first token feature in each sentence + self.next_sent_feat = self.get_pooled_output() + reshaped_emb_out = fluid.layers.reshape( + x=self._enc_out, shape=[-1, self._emb_size]) + # extract masked tokens' feature + mask_feat = fluid.layers.gather(input=reshaped_emb_out, 
index=mask_pos) + + # transform: fc + mask_trans_feat = fluid.layers.fc( + input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr( + name='mask_lm_trans_fc.w_0', + initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0')) + + # transform: layer norm + mask_trans_feat = fluid.layers.layer_norm( + mask_trans_feat, + begin_norm_axis=len(mask_trans_feat.shape) - 1, + param_attr=fluid.ParamAttr( + name='mask_lm_trans_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr( + name='mask_lm_trans_layer_norm_bias', + initializer=fluid.initializer.Constant(1.))) + # transform: layer norm + #mask_trans_feat = pre_process_layer( + # mask_trans_feat, 'n', name='mask_lm_trans') + + mask_lm_out_bias_attr = fluid.ParamAttr( + name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) + if self._weight_sharing: + fc_out = fluid.layers.matmul( + x=mask_trans_feat, + y=fluid.default_main_program().global_block().var( + self._word_emb_name), + transpose_y=True) + fc_out += fluid.layers.create_parameter( + shape=[self._voc_size], + dtype=self._emb_dtype, + attr=mask_lm_out_bias_attr, + is_bias=True) + + else: + fc_out = fluid.layers.fc(input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr( + name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr) + + mask_lm_loss = fluid.layers.softmax_with_cross_entropy( + logits=fc_out, label=mask_label) + mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) + + return mean_mask_lm_loss + + def get_task_output(self, task, task_labels): + task_fc_out = fluid.layers.fc(input=self.next_sent_feat, + size=task["num_labels"], + param_attr=fluid.ParamAttr( + name=task["task_name"] + "_fc.w_0", + initializer=self._param_initializer), + bias_attr=task["task_name"] + "_fc.b_0") + task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy( + logits=task_fc_out, label=task_labels, return_softmax=True) + task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels) + mean_task_loss = fluid.layers.mean(task_loss) + return mean_task_loss, task_acc diff --git a/PaddleKG/DuIE_Baseline/ernie/model/transformer_encoder.py b/PaddleKG/DuIE_Baseline/ernie/model/transformer_encoder.py new file mode 100755 index 0000000000000000000000000000000000000000..ac5d293f3b198e8529afa75940c5aaf0a9fdbfc4 --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/model/transformer_encoder.py @@ -0,0 +1,341 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
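The attention mask built in ErnieModel._build_model above turns a [batch, seq, 1] padding mask into an additive bias: any query/key pair that involves a padded position receives -10000 before the softmax, which drives its attention weight to (nearly) zero. A small NumPy sketch of the same arithmetic; the shapes and the scale/bias constants follow the code above, while the toy sizes and inputs are purely illustrative:

import numpy as np

batch, seq, n_head = 1, 4, 2
# 1.0 for real tokens, 0.0 for padding (last position padded here)
input_mask = np.array([[[1.], [1.], [1.], [0.]]], dtype="float32")

# pairwise validity: [batch, seq, seq]
self_attn_mask = np.matmul(input_mask, input_mask.transpose(0, 2, 1))
# scale(x, scale=10000, bias=-1, bias_after_scale=False) == (x - 1) * 10000
self_attn_bias = (self_attn_mask - 1.0) * 10000.0
# replicate per attention head: [batch, n_head, seq, seq]
n_head_bias = np.stack([self_attn_bias] * n_head, axis=1)

print(n_head_bias[0, 0])  # 0.0 for valid pairs, -10000.0 where padding is involved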
+"""Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError( + "Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_query_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_key_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_value_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape( + x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. 
+        return layers.reshape(
+            x=trans_x,
+            shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
+            inplace=True)
+
+    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
+        """
+        Scaled Dot-Product Attention
+        """
+        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
+        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
+        if attn_bias:
+            product += attn_bias
+        weights = layers.softmax(product)
+        if dropout_rate:
+            weights = layers.dropout(
+                weights,
+                dropout_prob=dropout_rate,
+                dropout_implementation="upscale_in_train",
+                is_test=False)
+        out = layers.matmul(weights, v)
+        return out
+
+    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
+
+    if cache is not None:  # use cache and concat time steps
+        # Since the inplace reshape in __split_heads changes the shape of k and
+        # v, which is the cache input for the next time step, reshape the cache
+        # input from the previous time step first.
+        k = cache["k"] = layers.concat(
+            [layers.reshape(
+                cache["k"], shape=[0, 0, d_model]), k], axis=1)
+        v = cache["v"] = layers.concat(
+            [layers.reshape(
+                cache["v"], shape=[0, 0, d_model]), v], axis=1)
+
+    q = __split_heads(q, n_head)
+    k = __split_heads(k, n_head)
+    v = __split_heads(v, n_head)
+
+    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
+                                                  dropout_rate)
+
+    out = __combine_heads(ctx_multiheads)
+
+    # Project back to the model size.
+    proj_out = layers.fc(input=out,
+                         size=d_model,
+                         num_flatten_dims=2,
+                         param_attr=fluid.ParamAttr(
+                             name=name + '_output_fc.w_0',
+                             initializer=param_initializer),
+                         bias_attr=name + '_output_fc.b_0')
+    return proj_out
+
+
+def positionwise_feed_forward(x,
+                              d_inner_hid,
+                              d_hid,
+                              dropout_rate,
+                              hidden_act,
+                              param_initializer=None,
+                              name='ffn'):
+    """
+    Position-wise Feed-Forward Networks.
+    This module consists of two linear transformations with a ReLU activation
+    in between, which is applied to each position separately and identically.
+    """
+    hidden = layers.fc(input=x,
+                       size=d_inner_hid,
+                       num_flatten_dims=2,
+                       act=hidden_act,
+                       param_attr=fluid.ParamAttr(
+                           name=name + '_fc_0.w_0',
+                           initializer=param_initializer),
+                       bias_attr=name + '_fc_0.b_0')
+    if dropout_rate:
+        hidden = layers.dropout(
+            hidden,
+            dropout_prob=dropout_rate,
+            dropout_implementation="upscale_in_train",
+            is_test=False)
+    out = layers.fc(input=hidden,
+                    size=d_hid,
+                    num_flatten_dims=2,
+                    param_attr=fluid.ParamAttr(
+                        name=name + '_fc_1.w_0', initializer=param_initializer),
+                    bias_attr=name + '_fc_1.b_0')
+    return out
+
+
+def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
+                           name=''):
+    """
+    Optionally add residual connection, layer normalization and dropout to the
+    out tensor, according to the value of process_cmd.
+    This will be used before or after multi-head attention and position-wise
+    feed-forward networks.
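+    Each character of process_cmd selects one step, applied in order:
+    "a" adds the residual connection, "n" applies layer normalization and
+    "d" applies dropout, so e.g. "dan" means dropout, then add, then norm.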
+    """
+    for cmd in process_cmd:
+        if cmd == "a":  # add residual connection
+            out = out + prev_out if prev_out else out
+        elif cmd == "n":  # add layer normalization
+            out_dtype = out.dtype
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x=out, dtype="float32")
+            out = layers.layer_norm(
+                out,
+                begin_norm_axis=len(out.shape) - 1,
+                param_attr=fluid.ParamAttr(
+                    name=name + '_layer_norm_scale',
+                    initializer=fluid.initializer.Constant(1.)),
+                bias_attr=fluid.ParamAttr(
+                    name=name + '_layer_norm_bias',
+                    initializer=fluid.initializer.Constant(0.)))
+            if out_dtype == fluid.core.VarDesc.VarType.FP16:
+                out = layers.cast(x=out, dtype="float16")
+        elif cmd == "d":  # add dropout
+            if dropout_rate:
+                out = layers.dropout(
+                    out,
+                    dropout_prob=dropout_rate,
+                    dropout_implementation="upscale_in_train",
+                    is_test=False)
+    return out
+
+
+pre_process_layer = partial(pre_post_process_layer, None)
+post_process_layer = pre_post_process_layer
+
+
+def encoder_layer(enc_input,
+                  attn_bias,
+                  n_head,
+                  d_key,
+                  d_value,
+                  d_model,
+                  d_inner_hid,
+                  prepostprocess_dropout,
+                  attention_dropout,
+                  relu_dropout,
+                  hidden_act,
+                  preprocess_cmd="n",
+                  postprocess_cmd="da",
+                  param_initializer=None,
+                  name=''):
+    """The encoder layers that can be stacked to form a deep encoder.
+    This module consists of a multi-head (self-)attention sublayer followed by
+    a position-wise feed-forward network, each wrapped with post_process_layer
+    to add residual connection, layer normalization and dropout.
+    """
+    attn_output = multi_head_attention(
+        pre_process_layer(
+            enc_input,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name=name + '_pre_att'),
+        None,
+        None,
+        attn_bias,
+        d_key,
+        d_value,
+        d_model,
+        n_head,
+        attention_dropout,
+        param_initializer=param_initializer,
+        name=name + '_multi_head_att')
+    attn_output = post_process_layer(
+        enc_input,
+        attn_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name=name + '_post_att')
+    ffd_output = positionwise_feed_forward(
+        pre_process_layer(
+            attn_output,
+            preprocess_cmd,
+            prepostprocess_dropout,
+            name=name + '_pre_ffn'),
+        d_inner_hid,
+        d_model,
+        relu_dropout,
+        hidden_act,
+        param_initializer=param_initializer,
+        name=name + '_ffn')
+    return post_process_layer(
+        attn_output,
+        ffd_output,
+        postprocess_cmd,
+        prepostprocess_dropout,
+        name=name + '_post_ffn')
+
+
+def encoder(enc_input,
+            attn_bias,
+            n_layer,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            prepostprocess_dropout,
+            attention_dropout,
+            relu_dropout,
+            hidden_act,
+            preprocess_cmd="n",
+            postprocess_cmd="da",
+            param_initializer=None,
+            name=''):
+    """
+    The encoder is composed of a stack of identical layers returned by calling
+    encoder_layer.
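+    After the last layer, pre_process_layer is applied once more (named
+    "post_encoder"), so the returned sequence is normalized whenever
+    preprocess_cmd contains "n".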
+ """ + for i in range(n_layer): + enc_output = encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(i)) + enc_input = enc_output + enc_output = pre_process_layer( + enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + return enc_output diff --git a/PaddleKG/DuIE_Baseline/ernie/optimization.py b/PaddleKG/DuIE_Baseline/ernie/optimization.py new file mode 100755 index 0000000000000000000000000000000000000000..dd2ef1e5e526a991c8a325221890abeb7c2b575d --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/optimization.py @@ -0,0 +1,162 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Optimization and learning rate scheduling.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +import numpy as np +import paddle.fluid as fluid +from utils.fp16 import create_master_params_grads, master_param_to_train_param, apply_dynamic_loss_scaling + + +def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps): + """ Applies linear warmup of learning rate from 0 and decay to 0.""" + with fluid.default_main_program()._lr_schedule_guard(): + lr = fluid.layers.tensor.create_global_var( + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="scheduled_learning_rate") + + global_step = fluid.layers.learning_rate_scheduler._decay_step_counter() + + with fluid.layers.control_flow.Switch() as switch: + with switch.case(global_step < warmup_steps): + warmup_lr = learning_rate * (global_step / warmup_steps) + fluid.layers.tensor.assign(warmup_lr, lr) + with switch.default(): + decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay( + learning_rate=learning_rate, + decay_steps=num_train_steps, + end_learning_rate=0.0, + power=1.0, + cycle=False) + fluid.layers.tensor.assign(decayed_lr, lr) + + return lr + + +def optimization(loss, + warmup_steps, + num_train_steps, + learning_rate, + train_program, + startup_prog, + weight_decay, + scheduler='linear_warmup_decay', + use_fp16=False, + use_dynamic_loss_scaling=False, + init_loss_scaling=1.0, + incr_every_n_steps=1000, + decr_every_n_nan_or_inf=2, + incr_ratio=2.0, + decr_ratio=0.8): + if warmup_steps > 0: + if scheduler == 'noam_decay': + scheduled_lr = fluid.layers.learning_rate_scheduler\ + .noam_decay(1/(warmup_steps *(learning_rate ** 2)), + warmup_steps) + elif scheduler == 'linear_warmup_decay': + scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps, + num_train_steps) + else: + raise ValueError("Unkown learning rate scheduler, should be " + "'noam_decay' or 'linear_warmup_decay'") + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr) + else: + scheduled_lr 
= fluid.layers.create_global_var( + name=fluid.unique_name.generate("learning_rate"), + shape=[1], + value=learning_rate, + dtype='float32', + persistable=True) + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr) + optimizer._learning_rate_map[fluid.default_main_program( + )] = scheduled_lr + + fluid.clip.set_gradient_clip( + clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)) + + def exclude_from_weight_decay(name): + if name.find("layer_norm") > -1: + return True + bias_suffix = ["_bias", "_b", ".b_0"] + for suffix in bias_suffix: + if name.endswith(suffix): + return True + return False + + param_list = dict() + + loss_scaling = fluid.layers.create_global_var( + name=fluid.unique_name.generate("loss_scaling"), + shape=[1], + value=init_loss_scaling, + dtype='float32', + persistable=True) + + if use_fp16: + loss *= loss_scaling + param_grads = optimizer.backward(loss) + + master_param_grads = create_master_params_grads( + param_grads, train_program, startup_prog, loss_scaling) + + for param, _ in master_param_grads: + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + if use_dynamic_loss_scaling: + apply_dynamic_loss_scaling( + loss_scaling, master_param_grads, incr_every_n_steps, + decr_every_n_nan_or_inf, incr_ratio, decr_ratio) + + optimizer.apply_gradients(master_param_grads) + + if weight_decay > 0: + for param, grad in master_param_grads: + if exclude_from_weight_decay(param.name.rstrip(".master")): + continue + with param.block.program._optimized_guard( + [param, grad]), fluid.framework.name_scope("weight_decay"): + updated_param = param - param_list[ + param.name] * weight_decay * scheduled_lr + fluid.layers.assign(output=param, input=updated_param) + + master_param_to_train_param(master_param_grads, param_grads, + train_program) + + else: + for param in train_program.global_block().all_parameters(): + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + _, param_grads = optimizer.minimize(loss) + + if weight_decay > 0: + for param, grad in param_grads: + if exclude_from_weight_decay(param.name): + continue + with param.block.program._optimized_guard( + [param, grad]), fluid.framework.name_scope("weight_decay"): + updated_param = param - param_list[ + param.name] * weight_decay * scheduled_lr + fluid.layers.assign(output=param, input=updated_param) + + return scheduled_lr, loss_scaling diff --git a/PaddleKG/DuIE_Baseline/ernie/reader/__init__.py b/PaddleKG/DuIE_Baseline/ernie/reader/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleKG/DuIE_Baseline/ernie/reader/task_reader.py b/PaddleKG/DuIE_Baseline/ernie/reader/task_reader.py new file mode 100755 index 0000000000000000000000000000000000000000..da2726bf5f760c2fcebe32f1b5d588026f8d2a50 --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/reader/task_reader.py @@ -0,0 +1,434 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +import sys +import os +import json +import random +import logging +import numpy as np +import six +from io import open +from collections import namedtuple + +import tokenization +from batching import pad_batch_data + +import extract_chinese_and_punct + +log = logging.getLogger(__name__) + +if six.PY3: + import io + sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8') + sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8') + + +class BaseReader(object): + def __init__(self, + vocab_path, + label_map_config=None, + max_seq_len=512, + do_lower_case=True, + in_tokens=False, + is_inference=False, + random_seed=None, + tokenizer="FullTokenizer", + is_classify=True, + is_regression=False, + for_cn=True, + task_id=0): + self.max_seq_len = max_seq_len + self.tokenizer = tokenization.FullTokenizer( + vocab_file=vocab_path, do_lower_case=do_lower_case) + self.vocab = self.tokenizer.vocab + self.pad_id = self.vocab["[PAD]"] + self.cls_id = self.vocab["[CLS]"] + self.sep_id = self.vocab["[SEP]"] + self.in_tokens = in_tokens + self.is_inference = is_inference + self.for_cn = for_cn + self.task_id = task_id + + np.random.seed(random_seed) + + self.is_classify = is_classify + self.is_regression = is_regression + self.current_example = 0 + self.current_epoch = 0 + self.num_examples = 0 + + if label_map_config: + with open(label_map_config, encoding='utf8') as f: + self.label_map = json.load(f) + else: + self.label_map = None + + def get_train_progress(self): + """Gets progress for training phase.""" + return self.current_example, self.current_epoch + + def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
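+        # Example: with max_length=6, len(tokens_a)=5 and len(tokens_b)=4,
+        # the loop pops from a, b, a in turn, leaving 3 tokens on each side.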
+        while True:
+            total_length = len(tokens_a) + len(tokens_b)
+            if total_length <= max_length:
+                break
+            if len(tokens_a) > len(tokens_b):
+                tokens_a.pop()
+            else:
+                tokens_b.pop()
+
+
+class RelationExtractionMultiCLSReader(BaseReader):
+    def __init__(self,
+                 vocab_path,
+                 label_map_config=None,
+                 spo_label_map_config=None,
+                 max_seq_len=512,
+                 do_lower_case=True,
+                 in_tokens=False,
+                 is_inference=False,
+                 random_seed=None,
+                 tokenizer="FullTokenizer",
+                 is_classify=True,
+                 is_regression=False,
+                 for_cn=True,
+                 task_id=0,
+                 num_labels=0):
+        self.max_seq_len = max_seq_len
+        self.tokenizer = tokenization.FullTokenizer(
+            vocab_file=vocab_path, do_lower_case=do_lower_case)
+        self.chineseandpunctuationextractor = extract_chinese_and_punct.ChineseAndPunctuationExtractor(
+        )
+        self.vocab = self.tokenizer.vocab
+        self.pad_id = self.vocab["[PAD]"]
+        self.cls_id = self.vocab["[CLS]"]
+        self.sep_id = self.vocab["[SEP]"]
+        self.in_tokens = in_tokens
+        self.is_inference = is_inference
+        self.for_cn = for_cn
+        self.task_id = task_id
+        self.num_labels = num_labels
+
+        np.random.seed(random_seed)
+
+        self.is_classify = is_classify
+        self.is_regression = is_regression
+        self.current_example = 0
+        self.current_epoch = 0
+        self.num_examples = 0
+        # map relation string to relation id
+        if label_map_config:
+            with open(label_map_config, encoding='utf8') as f:
+                self.label_map = json.load(f)
+        else:
+            self.label_map = None
+        # map relation id to strings (subject name, predicate name, object name)
+        if spo_label_map_config:
+            with open(spo_label_map_config, encoding='utf8') as f:
+                self.spo_label_map = json.load(f)
+        else:
+            self.spo_label_map = None
+
+    def _read_json(self, input_file):
+        with open(input_file, 'r', encoding="utf8") as f:
+            return [json.loads(line) for line in f]
+
+    def get_num_examples(self, input_file):
+        examples = self._read_json(input_file)
+        return len(examples)
+
+    def _prepare_batch_data(self, examples, batch_size, phase=None):
+        """Generate batch records."""
+        batch_records, max_len = [], 0
+        for index, example in enumerate(examples):
+            if phase == "train":
+                self.current_example = index
+            example_index = 100000 + index
+            record = self._convert_example_to_record(
+                example_index, example, self.max_seq_len, self.tokenizer)
+            max_len = max(max_len, len(record.token_ids))
+            if self.in_tokens:
+                to_append = (len(batch_records) + 1) * max_len <= batch_size
+            else:
+                to_append = len(batch_records) < batch_size
+            if to_append:
+                batch_records.append(record)
+            else:
+                yield self._pad_batch_records(batch_records)
+                batch_records, max_len = [record], len(record.token_ids)
+
+        if batch_records:
+            yield self._pad_batch_records(batch_records)
+
+    def data_generator(self,
+                       input_file,
+                       batch_size,
+                       epoch,
+                       dev_count=1,
+                       shuffle=True,
+                       phase=None):
+        examples = self._read_json(input_file)
+
+        def wrapper():
+            all_dev_batches = []
+            for epoch_index in range(epoch):
+                if phase == "train":
+                    self.current_example = 0
+                    self.current_epoch = epoch_index
+                if shuffle:
+                    np.random.shuffle(examples)
+
+                for batch_data in self._prepare_batch_data(
+                        examples, batch_size, phase=phase):
+                    if len(all_dev_batches) < dev_count:
+                        all_dev_batches.append(batch_data)
+                    if len(all_dev_batches) == dev_count:
+                        for batch in all_dev_batches:
+                            yield batch
+                        all_dev_batches = []
+
+        def f():
+            try:
+                for i in wrapper():
+                    yield i
+            except Exception:
+                import traceback
+                traceback.print_exc()
+                # re-raise so data errors are not silently swallowed
+                raise
+
+        return f
+
+    def _pad_batch_records(self, batch_records):
+        
batch_token_ids = [record.token_ids for record in batch_records] + batch_text_type_ids = [record.text_type_ids for record in batch_records] + batch_position_ids = [record.position_ids for record in batch_records] + batch_label_ids = [record.label_ids for record in batch_records] + batch_example_index = [record.example_index for record in batch_records] + batch_tok_to_orig_start_index = [ + record.tok_to_orig_start_index for record in batch_records + ] + batch_tok_to_orig_end_index = [ + record.tok_to_orig_end_index for record in batch_records + ] + # padding + padded_token_ids, input_mask, batch_seq_lens = pad_batch_data( + batch_token_ids, + pad_idx=self.pad_id, + return_input_mask=True, + return_seq_lens=True) + padded_text_type_ids = pad_batch_data( + batch_text_type_ids, pad_idx=self.pad_id) + padded_position_ids = pad_batch_data( + batch_position_ids, pad_idx=self.pad_id) + + # label padding for expended dimension + outside_label = np.array([1] + [0] * (self.num_labels - 1)) + max_len = max(len(inst) for inst in batch_label_ids) + padded_label_ids = [] + for i, inst in enumerate(batch_label_ids): + inst = np.concatenate( + (np.array(inst), np.tile(outside_label, ((max_len - len(inst)), + 1))), + axis=0) + padded_label_ids.append(inst) + padded_label_ids = np.stack(padded_label_ids).astype("float32") + + padded_tok_to_orig_start_index = np.array([ + inst + [0] * (max_len - len(inst)) + for inst in batch_tok_to_orig_start_index + ]) + padded_tok_to_orig_end_index = np.array([ + inst + [0] * (max_len - len(inst)) + for inst in batch_tok_to_orig_end_index + ]) + + padded_task_ids = np.ones_like( + padded_token_ids, dtype="int64") * self.task_id + + return_list = [ + padded_token_ids, padded_text_type_ids, padded_position_ids, + padded_task_ids, input_mask, padded_label_ids, batch_seq_lens, + batch_example_index, padded_tok_to_orig_start_index, + padded_tok_to_orig_end_index + ] + return return_list + + def _convert_example_to_record(self, example_index, example, max_seq_length, + tokenizer): + spo_list = example['spo_list'] + text_raw = example['text'] + + sub_text = [] + buff = "" + for char in text_raw: + if self.chineseandpunctuationextractor.is_chinese_or_punct(char): + if buff != "": + sub_text.append(buff) + buff = "" + sub_text.append(char) + else: + buff += char + if buff != "": + sub_text.append(buff) + + tok_to_orig_start_index = [] + tok_to_orig_end_index = [] + orig_to_tok_index = [] + tokens = [] + text_tmp = '' + for (i, token) in enumerate(sub_text): + orig_to_tok_index.append(len(tokens)) + sub_tokens = tokenizer.tokenize(token) + text_tmp += token + for sub_token in sub_tokens: + tok_to_orig_start_index.append(len(text_tmp) - len(token)) + tok_to_orig_end_index.append(len(text_tmp) - 1) + tokens.append(sub_token) + if len(tokens) >= max_seq_length - 2: + break + else: + continue + break + + labels = [[0] * self.num_labels + for i in range(len(tokens))] # initialize tag + # find all entities and tag them with corresponding "B"/"I" labels + for spo in spo_list: + for spo_object in spo['object'].keys(): + # assign relation label + if spo['predicate'] in self.label_map.keys(): + # simple relation + label_subject = self.label_map[spo['predicate']] + label_object = label_subject + 55 + subject_sub_tokens = tokenizer.tokenize(spo['subject']) + object_sub_tokens = tokenizer.tokenize(spo['object'][ + '@value']) + else: + # complex relation + label_subject = self.label_map[spo['predicate'] + '_' + + spo_object] + label_object = label_subject + 55 + subject_sub_tokens = 
tokenizer.tokenize(spo['subject']) + object_sub_tokens = tokenizer.tokenize(spo['object'][ + spo_object]) + + # assign token label + # there are situations where s entity and o entity might overlap, e.g. xyz established xyz corporation + # to prevent single token from being labeled into two different entity + # we tag the longer entity first, then match the shorter entity within the rest text + forbidden_index = None + if len(subject_sub_tokens) > len(object_sub_tokens): + for index in range( + len(tokens) - len(subject_sub_tokens) + 1): + if tokens[index:index + len( + subject_sub_tokens)] == subject_sub_tokens: + labels[index][label_subject] = 1 + for i in range(len(subject_sub_tokens) - 1): + labels[index + i + 1][1] = 1 + forbidden_index = index + break + + for index in range( + len(tokens) - len(object_sub_tokens) + 1): + if tokens[index:index + len( + object_sub_tokens)] == object_sub_tokens: + if forbidden_index is None: + labels[index][label_object] = 1 + for i in range(len(object_sub_tokens) - 1): + labels[index + i + 1][1] = 1 + break + # check if labeled already + elif index < forbidden_index or index >= forbidden_index + len( + subject_sub_tokens): + labels[index][label_object] = 1 + for i in range(len(object_sub_tokens) - 1): + labels[index + i + 1][1] = 1 + break + + else: + for index in range( + len(tokens) - len(object_sub_tokens) + 1): + if tokens[index:index + len( + object_sub_tokens)] == object_sub_tokens: + labels[index][label_object] = 1 + for i in range(len(object_sub_tokens) - 1): + labels[index + i + 1][1] = 1 + forbidden_index = index + break + + for index in range( + len(tokens) - len(subject_sub_tokens) + 1): + if tokens[index:index + len( + subject_sub_tokens)] == subject_sub_tokens: + if forbidden_index is None: + labels[index][label_subject] = 1 + for i in range(len(subject_sub_tokens) - 1): + labels[index + i + 1][1] = 1 + break + elif index < forbidden_index or index >= forbidden_index + len( + object_sub_tokens): + labels[index][label_subject] = 1 + for i in range(len(subject_sub_tokens) - 1): + labels[index + i + 1][1] = 1 + break + + # if token wasn't assigned as any "B"/"I" tag, give it an "O" tag for outside + for i in range(len(labels)): + if labels[i] == [0] * self.num_labels: + labels[i][0] = 1 + + # add [CLS] and [SEP] token, they are tagged into "O" for outside + if len(tokens) > max_seq_length - 2: + tokens = tokens[0:(max_seq_length - 2)] + labels = labels[0:(max_seq_length - 2)] + tokens = ["[CLS]"] + tokens + ["[SEP]"] + outside_label = [[1] + [0] * (self.num_labels - 1)] + labels = outside_label + labels + outside_label + + token_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(token_ids))) + text_type_ids = [0] * len(token_ids) + + Record = namedtuple('Record', [ + 'token_ids', 'text_type_ids', 'position_ids', 'label_ids', + 'example_index', 'tok_to_orig_start_index', 'tok_to_orig_end_index' + ]) + record = Record( + token_ids=token_ids, + text_type_ids=text_type_ids, + position_ids=position_ids, + label_ids=labels, + example_index=example_index, + tok_to_orig_start_index=tok_to_orig_start_index, + tok_to_orig_end_index=tok_to_orig_end_index) + return record + + +if __name__ == '__main__': + pass diff --git a/PaddleKG/DuIE_Baseline/ernie/run_duie.py b/PaddleKG/DuIE_Baseline/ernie/run_duie.py new file mode 100755 index 0000000000000000000000000000000000000000..0682d6ecaf5c2a7614cf7bf3cc469b33a4baea00 --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/run_duie.py @@ -0,0 +1,371 @@ +# Copyright (c) 2019 PaddlePaddle 
Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Finetuning on classification tasks.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +import os +import time +import six +import logging +import multiprocessing +from io import open +import numpy as np +import json +# NOTE(paddle-dev): All of these flags should be +# set before `import paddle`. Otherwise, it would +# not take any effect. +os.environ['FLAGS_eager_delete_tensor_gb'] = '0' # enable gc + +import codecs +import paddle.fluid as fluid + +import reader.task_reader as task_reader +from model.ernie import ErnieConfig +from optimization import optimization +from utils.init import init_pretraining_params, init_checkpoint +from utils.args import print_arguments, check_cuda, prepare_logger +from finetune.relation_extraction_multi_cls import create_model, evaluate, predict, calculate_acc +from finetune_args import parser + +args = parser.parse_args() +log = logging.getLogger() + + +def main(args): + ernie_config = ErnieConfig(args.ernie_config_path) + ernie_config.print_config() + + if args.use_cuda: + dev_list = fluid.cuda_places() + place = dev_list[0] + dev_count = len(dev_list) + else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) + + reader = task_reader.RelationExtractionMultiCLSReader( + vocab_path=args.vocab_path, + label_map_config=args.label_map_config, + spo_label_map_config=args.spo_label_map_config, + max_seq_len=args.max_seq_len, + do_lower_case=args.do_lower_case, + in_tokens=args.in_tokens, + random_seed=args.random_seed, + task_id=args.task_id, + num_labels=args.num_labels) + + if not (args.do_train or args.do_val or args.do_test): + raise ValueError("For args `do_train`, `do_val` and `do_test`, at " + "least one of them must be True.") + + startup_prog = fluid.Program() + if args.random_seed is not None: + startup_prog.random_seed = args.random_seed + + if args.do_train: + train_data_generator = reader.data_generator( + input_file=args.train_set, + batch_size=args.batch_size, + epoch=args.epoch, + shuffle=True, + phase="train") + + num_train_examples = reader.get_num_examples(args.train_set) + + if args.in_tokens: + if args.batch_size < args.max_seq_len: + raise ValueError( + 'if in_tokens=True, batch_size should greater than max_sqelen, got batch_size:%d seqlen:%d' + % (args.batch_size, args.max_seq_len)) + + max_train_steps = args.epoch * num_train_examples // ( + args.batch_size // args.max_seq_len) // dev_count + else: + ''' + if args.max_steps != 0: + max_train_steps = min(args.epoch * num_train_examples // args.batch_size // dev_count, args.max_steps) + else: + max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count + ''' + max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count + + warmup_steps = int(max_train_steps * 
args.warmup_proportion) + log.info("Device count: %d" % dev_count) + log.info("Num train examples: %d" % num_train_examples) + log.info("Max train steps: %d" % max_train_steps) + log.info("Num warmup steps: %d" % warmup_steps) + + train_program = fluid.Program() + + with fluid.program_guard(train_program, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, graph_vars = create_model( + args, + pyreader_name='train_reader', + ernie_config=ernie_config) + scheduled_lr, loss_scaling = optimization( + loss=graph_vars["loss"], + warmup_steps=warmup_steps, + num_train_steps=max_train_steps, + learning_rate=args.learning_rate, + train_program=train_program, + startup_prog=startup_prog, + weight_decay=args.weight_decay, + scheduler=args.lr_scheduler, + use_fp16=args.use_fp16, + use_dynamic_loss_scaling=args.use_dynamic_loss_scaling, + init_loss_scaling=args.init_loss_scaling, + incr_every_n_steps=args.incr_every_n_steps, + decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf, + incr_ratio=args.incr_ratio, + decr_ratio=args.decr_ratio) + + if args.verbose: + if args.in_tokens: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, + batch_size=args.batch_size // args.max_seq_len) + else: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size) + log.info("Theoretical memory usage in training: %.3f - %.3f %s" % + (lower_mem, upper_mem, unit)) + + if args.do_val or args.do_test: + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, graph_vars = create_model( + args, + pyreader_name='test_reader', + ernie_config=ernie_config) + + test_prog = test_prog.clone(for_test=True) + + nccl2_num_trainers = 1 + nccl2_trainer_id = 0 + if args.is_distributed: + trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0")) + worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS") + current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT") + worker_endpoints = worker_endpoints_env.split(",") + trainers_num = len(worker_endpoints) + + log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \ + trainer_id:{}".format(worker_endpoints, trainers_num, + current_endpoint, trainer_id)) + + # prepare nccl2 env. + config = fluid.DistributeTranspilerConfig() + config.mode = "nccl2" + t = fluid.DistributeTranspiler(config=config) + t.transpile( + trainer_id, + trainers=worker_endpoints_env, + current_endpoint=current_endpoint, + program=train_program if args.do_train else test_prog, + startup_program=startup_prog) + nccl2_num_trainers = trainers_num + nccl2_trainer_id = trainer_id + + exe = fluid.Executor(place) + exe.run(startup_prog) + + if args.do_train: + if args.init_checkpoint and args.init_pretraining_params: + log.info( + "WARNING: args 'init_checkpoint' and 'init_pretraining_params' " + "both are set! 
Only arg 'init_checkpoint' is made valid.") + if args.init_checkpoint: + init_checkpoint( + exe, + args.init_checkpoint, + main_program=startup_prog, + use_fp16=args.use_fp16) + elif args.init_pretraining_params: + init_pretraining_params( + exe, + args.init_pretraining_params, + main_program=startup_prog, + use_fp16=args.use_fp16) + elif args.do_val or args.do_test: + if not args.init_checkpoint: + raise ValueError("args 'init_checkpoint' should be set if" + "only doing validation or testing!") + init_checkpoint( + exe, + args.init_checkpoint, + main_program=startup_prog, + use_fp16=args.use_fp16) + + if args.do_train: + exec_strategy = fluid.ExecutionStrategy() + if args.use_fast_executor: + exec_strategy.use_experimental_executor = True + exec_strategy.num_threads = dev_count + exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope + + train_exe = fluid.ParallelExecutor( + use_cuda=args.use_cuda, + loss_name=graph_vars["loss"].name, + exec_strategy=exec_strategy, + main_program=train_program, + num_trainers=nccl2_num_trainers, + trainer_id=nccl2_trainer_id) + + train_pyreader.decorate_tensor_provider(train_data_generator) + else: + train_exe = None + + if args.do_val or args.do_test: + test_exe = fluid.ParallelExecutor( + use_cuda=args.use_cuda, + main_program=test_prog, + share_vars_from=train_exe) + + if args.do_train: + train_pyreader.start() + steps = 0 + graph_vars["learning_rate"] = scheduled_lr + + time_begin = time.time() + while True: + try: + steps += 1 + if steps % args.skip_steps != 0: + train_exe.run(fetch_list=[]) + else: + fetch_list = [ + graph_vars["lod_logit"].name, + graph_vars["lod_label"].name, graph_vars["loss"].name, + graph_vars['learning_rate'].name + ] + + out = train_exe.run(fetch_list=fetch_list, + return_numpy=False) + logits, labels, loss_lod, lr_lod = out + lr = np.array(lr_lod)[0] + loss = np.array(loss_lod).mean() + correct_, num_, token_correct_, token_total_ = calculate_acc( + logits, labels) + accuracy = correct_ / num_ + accuracy_token = token_correct_ / token_total_ + + if args.verbose: + log.info( + "train pyreader queue size: %d, learning rate: %f" % + (train_pyreader.queue.size(), lr + if warmup_steps > 0 else args.learning_rate)) + + current_example, current_epoch = reader.get_train_progress() + time_end = time.time() + used_time = time_end - time_begin + log.info( + "epoch: %d, progress: %d/%d, step: %d, loss: %f, " + "accuracy: %f, accuracy_token: %f, speed: %f steps/s" % + (current_epoch, current_example, num_train_examples, + steps, loss, accuracy, accuracy_token, + args.skip_steps / used_time)) + time_begin = time.time() + + if nccl2_trainer_id == 0 and steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, + "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, train_program) + + if nccl2_trainer_id == 0 and steps % args.validation_steps == 0: + # evaluate dev set + if args.do_val: + evaluate_wrapper(reader, exe, test_prog, test_pyreader, + graph_vars, current_epoch, steps) + # evaluate test set + if args.do_test: + predict_wrapper(reader, exe, test_prog, test_pyreader, + graph_vars, current_epoch, steps) + + except fluid.core.EOFException: + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, train_program) + train_pyreader.reset() + break + + # final eval on dev set + if nccl2_trainer_id == 0 and args.do_val: + evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars, + 'final', 'final') + + if nccl2_trainer_id == 
0 and args.do_test: + predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars, + 'final', 'final') + + +def evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars, epoch, + steps): + # evaluate dev set + batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size + for ds in args.dev_set.split(','): # single card eval + test_pyreader.decorate_tensor_provider( + reader.data_generator( + ds, batch_size=batch_size, epoch=1, dev_count=1, shuffle=False)) + examples = reader._read_json(ds) + log.info("validation result of dataset {}:".format(ds)) + info = evaluate(args, examples, exe, test_prog, test_pyreader, + graph_vars) + log.info(info + ', file: {}, epoch: {}, steps: {}'.format(ds, epoch, + steps)) + + +def predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars, epoch, + steps): + + test_sets = args.test_set.split(',') + save_dirs = args.test_save.split(',') + assert len(test_sets) == len( + save_dirs), 'number of test_sets & test_save not match, got %d vs %d' % ( + len(test_sets), len(save_dirs)) + + batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size + for test_f, save_f in zip(test_sets, save_dirs): + test_pyreader.decorate_tensor_provider( + reader.data_generator( + test_f, + batch_size=batch_size, + epoch=1, + dev_count=1, + shuffle=False)) + examples = reader._read_json(test_f) + save_path = save_f + log.info("testing {}, save to {}".format(test_f, save_path)) + res = predict(args, examples, exe, test_prog, test_pyreader, graph_vars) + save_dir = os.path.dirname(save_path) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + + with codecs.open(save_path, 'w', 'utf-8') as f: + for result in res: + json_str = json.dumps(result, ensure_ascii=False) + f.write(json_str) + f.write('\n') + + +if __name__ == '__main__': + prepare_logger(log) + print_arguments(args) + check_cuda(args.use_cuda) + main(args) diff --git a/PaddleKG/DuIE_Baseline/ernie/tokenization.py b/PaddleKG/DuIE_Baseline/ernie/tokenization.py new file mode 100755 index 0000000000000000000000000000000000000000..8d9c1d83cc2f7654abf9871c360e2caf03ce488c --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/tokenization.py @@ -0,0 +1,422 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
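predict_wrapper above writes one JSON object per line, UTF-8 encoded with ensure_ascii=False. A minimal reader for that JSON-lines format, handy when post-processing the saved predictions; the function name load_predictions and the example path are illustrative:

import json

def load_predictions(path):
    """Read a JSON-lines prediction file written by predict_wrapper."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. results = load_predictions("test_save/predictions.json")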
+"""Tokenization classes.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +from io import open + +import collections +import unicodedata +import six + + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text.decode("utf-8", "ignore") + elif isinstance(text, unicode): + return text + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def printable_text(text): + """Returns text encoded in a way suitable for print or `tf.logging`.""" + + # These functions want `str` for both Python2 and Python3, but in one case + # it's a Unicode string and in the other it's a byte string. + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text + elif isinstance(text, unicode): + return text.encode("utf-8") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + with open(vocab_file, encoding='utf8') as fin: + for num, line in enumerate(fin): + items = convert_to_unicode(line.strip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + +def convert_by_vocab(vocab, items): + """Converts a sequence of [tokens|ids] using the vocab.""" + output = [] + for item in items: + output.append(vocab[item]) + return output + + +def convert_tokens_to_ids(vocab, tokens): + return convert_by_vocab(vocab, tokens) + + +def convert_ids_to_tokens(inv_vocab, ids): + return convert_by_vocab(inv_vocab, ids) + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a peice of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class FullTokenizer(object): + """Runs end-to-end tokenziation.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class CharTokenizer(object): + """Runs end-to-end tokenziation.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in 
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+
+    def tokenize(self, text):
+        if self.do_lower_case:
+            text = text.lower()
+        split_tokens = []
+        for token in text.split(" "):
+            for sub_token in self.wordpiece_tokenizer.tokenize(token):
+                split_tokens.append(sub_token)
+
+        return split_tokens
+
+    def convert_tokens_to_ids(self, tokens):
+        return convert_by_vocab(self.vocab, tokens)
+
+    def convert_ids_to_tokens(self, ids):
+        return convert_by_vocab(self.inv_vocab, ids)
+
+
+class BasicTokenizer(object):
+    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
+
+    def __init__(self, do_lower_case=True):
+        """Constructs a BasicTokenizer.
+
+        Args:
+            do_lower_case: Whether to lower case the input.
+        """
+        self.do_lower_case = do_lower_case
+
+    def tokenize(self, text):
+        """Tokenizes a piece of text."""
+        text = convert_to_unicode(text)
+        text = self._clean_text(text)
+
+        # This was added on November 1st, 2018 for the multilingual and Chinese
+        # models. This is also applied to the English models now, but it doesn't
+        # matter since the English models were not trained on any Chinese data
+        # and generally don't have any Chinese data in them (there are Chinese
+        # characters in the vocabulary because Wikipedia does have some Chinese
+        # words in the English Wikipedia.).
+        text = self._tokenize_chinese_chars(text)
+
+        orig_tokens = whitespace_tokenize(text)
+        split_tokens = []
+        for token in orig_tokens:
+            if self.do_lower_case:
+                token = token.lower()
+                token = self._run_strip_accents(token)
+            split_tokens.extend(self._run_split_on_punc(token))
+
+        output_tokens = whitespace_tokenize(" ".join(split_tokens))
+        return output_tokens
+
+    def _run_strip_accents(self, text):
+        """Strips accents from a piece of text."""
+        text = unicodedata.normalize("NFD", text)
+        output = []
+        for char in text:
+            cat = unicodedata.category(char)
+            if cat == "Mn":
+                continue
+            output.append(char)
+        return "".join(output)
+
+    def _run_split_on_punc(self, text):
+        """Splits punctuation on a piece of text."""
+        chars = list(text)
+        i = 0
+        start_new_word = True
+        output = []
+        while i < len(chars):
+            char = chars[i]
+            if _is_punctuation(char):
+                output.append([char])
+                start_new_word = True
+            else:
+                if start_new_word:
+                    output.append([])
+                start_new_word = False
+                output[-1].append(char)
+            i += 1
+
+        return ["".join(x) for x in output]
+
+    def _tokenize_chinese_chars(self, text):
+        """Adds whitespace around any CJK character."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if self._is_chinese_char(cp):
+                output.append(" ")
+                output.append(char)
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+    def _is_chinese_char(self, cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and are handled
+        # like all of the other languages.
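+        # For example, u'中' (U+4E2D) falls in the base CJK range below and is
+        # space-padded by _tokenize_chinese_chars, while Hangul and kana are not.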
+        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
+                (cp >= 0x3400 and cp <= 0x4DBF) or  #
+                (cp >= 0x20000 and cp <= 0x2A6DF) or  #
+                (cp >= 0x2A700 and cp <= 0x2B73F) or  #
+                (cp >= 0x2B740 and cp <= 0x2B81F) or  #
+                (cp >= 0x2B820 and cp <= 0x2CEAF) or
+                (cp >= 0xF900 and cp <= 0xFAFF) or  #
+                (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
+            return True
+
+        return False
+
+    def _clean_text(self, text):
+        """Performs invalid character removal and whitespace cleanup on text."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if cp == 0 or cp == 0xfffd or _is_control(char):
+                continue
+            if _is_whitespace(char):
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+
+class WordpieceTokenizer(object):
+    """Runs WordPiece tokenization."""
+
+    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+        self.vocab = vocab
+        self.unk_token = unk_token
+        self.max_input_chars_per_word = max_input_chars_per_word
+
+    def tokenize(self, text):
+        """Tokenizes a piece of text into its word pieces.
+
+        This uses a greedy longest-match-first algorithm to perform tokenization
+        using the given vocabulary.
+
+        For example:
+            input = "unaffable"
+            output = ["un", "##aff", "##able"]
+
+        Args:
+            text: A single token or whitespace separated tokens. This should have
+                already been passed through `BasicTokenizer`.
+
+        Returns:
+            A list of wordpiece tokens.
+        """
+
+        text = convert_to_unicode(text)
+
+        output_tokens = []
+        for token in whitespace_tokenize(text):
+            chars = list(token)
+            if len(chars) > self.max_input_chars_per_word:
+                output_tokens.append(self.unk_token)
+                continue
+
+            is_bad = False
+            start = 0
+            sub_tokens = []
+            while start < len(chars):
+                end = len(chars)
+                cur_substr = None
+                while start < end:
+                    substr = "".join(chars[start:end])
+                    if start > 0:
+                        substr = "##" + substr
+                    if substr in self.vocab:
+                        cur_substr = substr
+                        break
+                    end -= 1
+                if cur_substr is None:
+                    is_bad = True
+                    break
+                sub_tokens.append(cur_substr)
+                start = end
+
+            if is_bad:
+                output_tokens.append(self.unk_token)
+            else:
+                output_tokens.extend(sub_tokens)
+        return output_tokens
+
+
+def _is_whitespace(char):
+    """Checks whether `char` is a whitespace character."""
+    # \t, \n, and \r are technically control characters but we treat them
+    # as whitespace since they are generally considered as such.
+    if char == " " or char == "\t" or char == "\n" or char == "\r":
+        return True
+    cat = unicodedata.category(char)
+    if cat == "Zs":
+        return True
+    return False
+
+
+def _is_control(char):
+    """Checks whether `char` is a control character."""
+    # These are technically control characters but we count them as whitespace
+    # characters.
+    if char == "\t" or char == "\n" or char == "\r":
+        return False
+    cat = unicodedata.category(char)
+    if cat.startswith("C"):
+        return True
+    return False
+
+
+def _is_punctuation(char):
+    """Checks whether `char` is a punctuation character."""
+    cp = ord(char)
+    # We treat all non-letter/number ASCII as punctuation.
+    # Characters such as "^", "$", and "`" are not in the Unicode
+    # Punctuation class but we treat them as punctuation anyways, for
+    # consistency.
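+    # For example, "$" (cp 36, Unicode category Sc) and "`" (cp 96, category Sk)
+    # fall inside the ASCII ranges below and are split off as punctuation.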
+    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
+            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
+        return True
+    cat = unicodedata.category(char)
+    if cat.startswith("P"):
+        return True
+    return False
+
+
+def tokenize_chinese_chars(text):
+    """Adds whitespace around any CJK character."""
+
+    def _is_chinese_char(cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and are handled
+        # like all of the other languages.
+        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
+                (cp >= 0x3400 and cp <= 0x4DBF) or  #
+                (cp >= 0x20000 and cp <= 0x2A6DF) or  #
+                (cp >= 0x2A700 and cp <= 0x2B73F) or  #
+                (cp >= 0x2B740 and cp <= 0x2B81F) or  #
+                (cp >= 0x2B820 and cp <= 0x2CEAF) or
+                (cp >= 0xF900 and cp <= 0xFAFF) or  #
+                (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
+            return True
+
+        return False
+
+    def _is_whitespace(c):
+        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
+            return True
+        return False
+
+    output = []
+    buff = ""
+    for char in text:
+        cp = ord(char)
+        if _is_chinese_char(cp) or _is_whitespace(char):
+            if buff != "":
+                output.append(buff)
+                buff = ""
+            output.append(char)
+        else:
+            buff += char
+
+    if buff != "":
+        output.append(buff)
+
+    return output
diff --git a/PaddleKG/DuIE_Baseline/ernie/utils/__init__.py b/PaddleKG/DuIE_Baseline/ernie/utils/__init__.py
new file mode 100755
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/PaddleKG/DuIE_Baseline/ernie/utils/args.py b/PaddleKG/DuIE_Baseline/ernie/utils/args.py
new file mode 100755
index 0000000000000000000000000000000000000000..53bb23de758a942877cacc98f1891bf06c8319df
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/ernie/utils/args.py
@@ -0,0 +1,83 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
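+# Example usage (a minimal sketch; the group and argument names are
+# illustrative, not the ones used by the training script):
+#     parser = argparse.ArgumentParser()
+#     run_g = ArgumentGroup(parser, "run", "running options")
+#     run_g.add_arg("use_cuda", bool, True, "Whether to run on GPU.")
+#     args = parser.parse_args()
+#     print_arguments(args)
+#     check_cuda(args.use_cuda)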
+"""Arguments for configuration.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +import six +import argparse +import logging + +import paddle.fluid as fluid + +log = logging.getLogger(__name__) + + +def prepare_logger(logger, debug=False, save_to_file=None): + formatter = logging.Formatter( + fmt='[%(levelname)s] %(asctime)s [%(filename)12s:%(lineno)5d]:\t%(message)s' + ) + console_hdl = logging.StreamHandler() + console_hdl.setFormatter(formatter) + logger.addHandler(console_hdl) + if save_to_file is not None and not os.path.exits(save_to_file): + file_hdl = logging.FileHandler(save_to_file) + file_hdl.setFormatter(formatter) + logger.addHandler(file_hdl) + logger.setLevel(logging.DEBUG) + logger.propagate = False + + +def str2bool(v): + # because argparse does not support to parse "true, False" as python + # boolean directly + return v.lower() in ("true", "t", "1") + + +class ArgumentGroup(object): + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, positional_arg=False, + **kwargs): + prefix = "" if positional_arg else "--" + type = str2bool if type == bool else type + self._group.add_argument( + prefix + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +def print_arguments(args): + log.info('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + log.info('%s: %s' % (arg, value)) + log.info('------------------------------------------------') + + +def check_cuda(use_cuda, err = \ + "\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \ + Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n" + ): + try: + if use_cuda == True and fluid.is_compiled_with_cuda() == False: + log.error(err) + sys.exit(1) + except Exception as e: + pass diff --git a/PaddleKG/DuIE_Baseline/ernie/utils/fp16.py b/PaddleKG/DuIE_Baseline/ernie/utils/fp16.py new file mode 100755 index 0000000000000000000000000000000000000000..740add267dff2dbf463032bcc47a6741ca9f7c43 --- /dev/null +++ b/PaddleKG/DuIE_Baseline/ernie/utils/fp16.py @@ -0,0 +1,201 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import paddle +import paddle.fluid as fluid + + +def append_cast_op(i, o, prog): + """ + Append a cast op in a given Program to cast input `i` to data type `o.dtype`. + Args: + i (Variable): The input Variable. + o (Variable): The output Variable. + prog (Program): The Program to append cast op. 
+ """ + prog.global_block().append_op( + type="cast", + inputs={"X": i}, + outputs={"Out": o}, + attrs={"in_dtype": i.dtype, + "out_dtype": o.dtype}) + + +def copy_to_master_param(p, block): + v = block.vars.get(p.name, None) + if v is None: + raise ValueError("no param name %s found!" % p.name) + new_p = fluid.framework.Parameter( + block=block, + shape=v.shape, + dtype=fluid.core.VarDesc.VarType.FP32, + type=v.type, + lod_level=v.lod_level, + stop_gradient=p.stop_gradient, + trainable=p.trainable, + optimize_attr=p.optimize_attr, + regularizer=p.regularizer, + gradient_clip_attr=p.gradient_clip_attr, + error_clip=p.error_clip, + name=v.name + ".master") + return new_p + + +def apply_dynamic_loss_scaling(loss_scaling, master_params_grads, + incr_every_n_steps, decr_every_n_nan_or_inf, + incr_ratio, decr_ratio): + _incr_every_n_steps = fluid.layers.fill_constant( + shape=[1], dtype='int32', value=incr_every_n_steps) + _decr_every_n_nan_or_inf = fluid.layers.fill_constant( + shape=[1], dtype='int32', value=decr_every_n_nan_or_inf) + + _num_good_steps = fluid.layers.create_global_var( + name=fluid.unique_name.generate("num_good_steps"), + shape=[1], + value=0, + dtype='int32', + persistable=True) + _num_bad_steps = fluid.layers.create_global_var( + name=fluid.unique_name.generate("num_bad_steps"), + shape=[1], + value=0, + dtype='int32', + persistable=True) + + grads = [fluid.layers.reduce_sum(g) for [_, g] in master_params_grads] + all_grads = fluid.layers.concat(grads) + all_grads_sum = fluid.layers.reduce_sum(all_grads) + is_overall_finite = fluid.layers.isfinite(all_grads_sum) + + update_loss_scaling(is_overall_finite, loss_scaling, _num_good_steps, + _num_bad_steps, _incr_every_n_steps, + _decr_every_n_nan_or_inf, incr_ratio, decr_ratio) + + # apply_gradient append all ops in global block, thus we shouldn't + # apply gradient in the switch branch. + with fluid.layers.Switch() as switch: + with switch.case(is_overall_finite): + pass + with switch.default(): + for _, g in master_params_grads: + fluid.layers.assign(fluid.layers.zeros_like(g), g) + + +def create_master_params_grads(params_grads, main_prog, startup_prog, + loss_scaling): + master_params_grads = [] + for p, g in params_grads: + with main_prog._optimized_guard([p, g]): + # create master parameters + master_param = copy_to_master_param(p, main_prog.global_block()) + startup_master_param = startup_prog.global_block()._clone_variable( + master_param) + startup_p = startup_prog.global_block().var(p.name) + append_cast_op(startup_p, startup_master_param, startup_prog) + # cast fp16 gradients to fp32 before apply gradients + if g.name.find("layer_norm") > -1: + scaled_g = g / loss_scaling + master_params_grads.append([p, scaled_g]) + continue + master_grad = fluid.layers.cast(g, "float32") + master_grad = master_grad / loss_scaling + master_params_grads.append([master_param, master_grad]) + + return master_params_grads + + +def master_param_to_train_param(master_params_grads, params_grads, main_prog): + for idx, m_p_g in enumerate(master_params_grads): + train_p, _ = params_grads[idx] + if train_p.name.find("layer_norm") > -1: + continue + with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]): + append_cast_op(m_p_g[0], train_p, main_prog) + + +def update_loss_scaling(is_overall_finite, prev_loss_scaling, num_good_steps, + num_bad_steps, incr_every_n_steps, + decr_every_n_nan_or_inf, incr_ratio, decr_ratio): + """ + Update loss scaling according to overall gradients. 
+    If all gradients are finite after incr_every_n_steps consecutive steps,
+    loss scaling will increase by incr_ratio. Otherwise, loss scaling will
+    decrease by decr_ratio after decr_every_n_nan_or_inf accumulated steps in
+    which some gradients are infinite. For example, with
+    incr_every_n_steps=1000, incr_ratio=2.0, decr_every_n_nan_or_inf=2 and
+    decr_ratio=0.5, the scaling factor doubles after 1000 consecutive finite
+    steps and halves after 2 accumulated overflow steps.
+
+    Args:
+        is_overall_finite (Variable): A boolean variable indicating whether
+            all gradients are finite.
+        prev_loss_scaling (Variable): Previous loss scaling.
+        num_good_steps (Variable): A variable that accumulates good steps in
+            which all gradients are finite.
+        num_bad_steps (Variable): A variable that accumulates bad steps in
+            which some gradients are infinite.
+        incr_every_n_steps (Variable): A variable representing increasing loss
+            scaling every n consecutive steps with finite gradients.
+        decr_every_n_nan_or_inf (Variable): A variable representing decreasing
+            loss scaling every n accumulated steps with nan or inf gradients.
+        incr_ratio (float): The multiplier to use when increasing the loss
+            scaling.
+        decr_ratio (float): The less-than-one multiplier to use when
+            decreasing the loss scaling.
+    """
+    zero_steps = fluid.layers.fill_constant(shape=[1], dtype='int32', value=0)
+    with fluid.layers.Switch() as switch:
+        with switch.case(is_overall_finite):
+            should_incr_loss_scaling = fluid.layers.less_than(
+                incr_every_n_steps, num_good_steps + 1)
+            with fluid.layers.Switch() as switch1:
+                with switch1.case(should_incr_loss_scaling):
+                    new_loss_scaling = prev_loss_scaling * incr_ratio
+                    loss_scaling_is_finite = fluid.layers.isfinite(
+                        new_loss_scaling)
+                    with fluid.layers.Switch() as switch2:
+                        with switch2.case(loss_scaling_is_finite):
+                            fluid.layers.assign(new_loss_scaling,
+                                                prev_loss_scaling)
+                        with switch2.default():
+                            pass
+                    fluid.layers.assign(zero_steps, num_good_steps)
+                    fluid.layers.assign(zero_steps, num_bad_steps)
+
+                with switch1.default():
+                    fluid.layers.increment(num_good_steps)
+                    fluid.layers.assign(zero_steps, num_bad_steps)
+
+        with switch.default():
+            should_decr_loss_scaling = fluid.layers.less_than(
+                decr_every_n_nan_or_inf, num_bad_steps + 1)
+            with fluid.layers.Switch() as switch3:
+                with switch3.case(should_decr_loss_scaling):
+                    new_loss_scaling = prev_loss_scaling * decr_ratio
+                    static_loss_scaling = \
+                        fluid.layers.fill_constant(shape=[1],
+                                                   dtype='float32',
+                                                   value=1.0)
+                    less_than_one = fluid.layers.less_than(new_loss_scaling,
+                                                           static_loss_scaling)
+                    # never let the scaling factor drop below 1.0
+                    with fluid.layers.Switch() as switch4:
+                        with switch4.case(less_than_one):
+                            fluid.layers.assign(static_loss_scaling,
+                                                prev_loss_scaling)
+                        with switch4.default():
+                            fluid.layers.assign(new_loss_scaling,
+                                                prev_loss_scaling)
+                    fluid.layers.assign(zero_steps, num_good_steps)
+                    fluid.layers.assign(zero_steps, num_bad_steps)
+                with switch3.default():
+                    fluid.layers.assign(zero_steps, num_good_steps)
+                    fluid.layers.increment(num_bad_steps)
diff --git a/PaddleKG/DuIE_Baseline/ernie/utils/init.py b/PaddleKG/DuIE_Baseline/ernie/utils/init.py
new file mode 100755
index 0000000000000000000000000000000000000000..5f4bbc38531d3d2984c552ddfc1365b0a036255d
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/ernie/utils/init.py
@@ -0,0 +1,91 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from __future__ import unicode_literals
+
+import os
+import six
+import ast
+import copy
+import logging
+
+import numpy as np
+import paddle.fluid as fluid
+
+log = logging.getLogger(__name__)
+
+
+def cast_fp32_to_fp16(exe, main_program):
+    log.info("Cast parameters to float16 data format.")
+    for param in main_program.global_block().all_parameters():
+        if not param.name.endswith(".master"):
+            param_t = fluid.global_scope().find_var(param.name).get_tensor()
+            data = np.array(param_t)
+            if param.name.startswith("encoder_layer") \
+                    and "layer_norm" not in param.name:
+                param_t.set(np.float16(data).view(np.uint16), exe.place)
+
+            # also keep an fp32 master copy, if one exists
+            master_param_var = fluid.global_scope().find_var(param.name +
+                                                             ".master")
+            if master_param_var is not None:
+                master_param_var.get_tensor().set(data, exe.place)
+
+
+def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False):
+    assert os.path.exists(
+        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
+
+    def existed_persistables(var):
+        if not fluid.io.is_persistable(var):
+            return False
+        return os.path.exists(os.path.join(init_checkpoint_path, var.name))
+
+    fluid.io.load_vars(
+        exe,
+        init_checkpoint_path,
+        main_program=main_program,
+        predicate=existed_persistables)
+    log.info("Load model from {}".format(init_checkpoint_path))
+
+    if use_fp16:
+        cast_fp32_to_fp16(exe, main_program)
+
+
+def init_pretraining_params(exe,
+                            pretraining_params_path,
+                            main_program,
+                            use_fp16=False):
+    assert os.path.exists(pretraining_params_path
+                          ), "[%s] can't be found." % pretraining_params_path
+
+    def existed_params(var):
+        if not isinstance(var, fluid.framework.Parameter):
+            return False
+        return os.path.exists(os.path.join(pretraining_params_path, var.name))
+
+    fluid.io.load_vars(
+        exe,
+        pretraining_params_path,
+        main_program=main_program,
+        predicate=existed_params)
+    log.info("Load pretraining parameters from {}.".format(
+        pretraining_params_path))
+
+    if use_fp16:
+        cast_fp32_to_fp16(exe, main_program)
diff --git a/PaddleKG/DuIE_Baseline/pretrained_model/README.md b/PaddleKG/DuIE_Baseline/pretrained_model/README.md
new file mode 100755
index 0000000000000000000000000000000000000000..dfb54bc41075b29b97390da883d02ed665235506
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/pretrained_model/README.md
@@ -0,0 +1 @@
+This directory is used to store the pretrained ERNIE checkpoint.
diff --git a/PaddleKG/DuIE_Baseline/requirements.txt b/PaddleKG/DuIE_Baseline/requirements.txt
new file mode 100755
index 0000000000000000000000000000000000000000..90a1298fa42c2e9d197e4f992545cc4a0fc44e69
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/requirements.txt
@@ -0,0 +1,8 @@
+#Python3
+numpy==1.14.5
+six==1.11.0
+paddlepaddle-gpu==1.5.2.post107
+
+#Python2
+zipfile
+ConfigParser
diff --git a/PaddleKG/DuIE_Baseline/script/predict.sh b/PaddleKG/DuIE_Baseline/script/predict.sh
new file mode 100755
index 0000000000000000000000000000000000000000..fd21992016eecf688dba9abf14564659079555b0
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/script/predict.sh
@@ -0,0 +1,27 @@
+set -eux
+
+export TASK_DATA_PATH=./data/
+export MODEL_PATH=./pretrained_model/
+export CHECKPOINT=./checkpoints/step_60000/
+export TEST_SAVE=./data/
+
+export FLAGS_sync_nccl_allreduce=1
+export PYTHONPATH=./ernie:${PYTHONPATH:-}
+CUDA_VISIBLE_DEVICES=7 python -u ./ernie/run_duie.py \
+    --use_cuda true \
+    --do_train false \
+    --do_val false \
+    --do_test true \
+    --batch_size 128 \
+    --init_checkpoint ${CHECKPOINT} \
+    --num_labels 112 \
+    --label_map_config ${TASK_DATA_PATH}relation2label.json \
+    --spo_label_map_config ${TASK_DATA_PATH}label2relation.json \
+    --test_set ${TASK_DATA_PATH}dev_demo.json \
+    --test_save ${TEST_SAVE}predict_test.json \
+    --vocab_path ${MODEL_PATH}vocab.txt \
+    --ernie_config_path ${MODEL_PATH}ernie_config.json \
+    --use_fp16 false \
+    --max_seq_len 512 \
+    --skip_steps 10 \
+    --random_seed 1
diff --git a/PaddleKG/DuIE_Baseline/script/re_official_evaluation.py b/PaddleKG/DuIE_Baseline/script/re_official_evaluation.py
new file mode 100644
index 0000000000000000000000000000000000000000..763f32e0752daa8696461b159a6b91fdd8d1e483
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/script/re_official_evaluation.py
@@ -0,0 +1,288 @@
+# -*- coding: utf-8 -*-
+########################################################
+# Copyright (c) 2019, Baidu Inc. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+########################################################
+"""
+This module calculates the precision, recall and f1-score
+of the predicted results.
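+
+Predicted results must be packed into a zip archive of JSON-lines files; each
+line is a JSON object carrying at least the "text" and "spo_list" fields
+(see check_format below).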
+""" +import sys +import json +import os +import zipfile +import traceback +import argparse +import ConfigParser +reload(sys) +sys.setdefaultencoding('utf-8') + +SUCCESS = 0 +FILE_ERROR = 1 +NOT_ZIP_FILE = 2 +ENCODING_ERROR = 3 +JSON_ERROR = 4 +SCHEMA_ERROR = 5 +ALIAS_FORMAT_ERROR = 6 + +CODE_INFO = { + SUCCESS: 'success', + FILE_ERROR: 'file is not exists', + NOT_ZIP_FILE: 'predict file is not a zipfile', + ENCODING_ERROR: 'file encoding error', + JSON_ERROR: 'json parse is error', + SCHEMA_ERROR: 'schema is error', + ALIAS_FORMAT_ERROR: 'alias dict format is error' +} + + +def del_bookname(entity_name): + """delete the book name""" + if entity_name.startswith(u'《') and entity_name.endswith(u'》'): + entity_name = entity_name[1:-1] + return entity_name + + +def check_format(line): + """检查输入行是否格式错误""" + ret_code = SUCCESS + json_info = {} + try: + line = line.decode('utf8').strip() + except: + ret_code = ENCODING_ERROR + return ret_code, json_info + try: + json_info = json.loads(line) + except: + ret_code = JSON_ERROR + return ret_code, json_info + if 'text' not in json_info or 'spo_list' not in json_info: + ret_code = SCHEMA_ERROR + return ret_code, json_info + required_key_list = ['subject', 'predicate', 'object'] + for spo_item in json_info['spo_list']: + if type(spo_item) is not dict: + ret_code = SCHEMA_ERROR + return ret_code, json_info + if not all( + [required_key in spo_item for required_key in required_key_list]): + ret_code = SCHEMA_ERROR + return ret_code, json_info + if not isinstance(spo_item['subject'], basestring) or \ + not isinstance(spo_item['object'], dict): + ret_code = SCHEMA_ERROR + return ret_code, json_info + return ret_code, json_info + + +def _parse_structured_ovalue(json_info): + spo_result = [] + for item in json_info["spo_list"]: + s = del_bookname(item['subject'].lower()) + o = {} + for o_key, o_value in item['object'].items(): + o_value = del_bookname(o_value).lower() + o[o_key] = o_value + spo_result.append({"predicate": item['predicate'], \ + "subject": s, \ + "object": o}) + return spo_result + + +def load_predict_result(predict_filename): + """Loads the file to be predicted""" + predict_result = {} + ret_code = SUCCESS + if not os.path.exists(predict_filename): + ret_code = FILE_ERROR + return ret_code, predict_result + try: + predict_file_zip = zipfile.ZipFile(predict_filename) + except: + ret_code = NOT_ZIP_FILE + return ret_code, predict_result + for predict_file in predict_file_zip.namelist(): + for line in predict_file_zip.open(predict_file): + ret_code, json_info = check_format(line) + if ret_code != SUCCESS: + return ret_code, predict_result + sent = json_info['text'] + spo_result = _parse_structured_ovalue(json_info) + predict_result[sent] = spo_result + return ret_code, predict_result + + +def load_test_dataset(golden_filename): + """load golden file""" + golden_dict = {} + ret_code = SUCCESS + if not os.path.exists(golden_filename): + ret_code = FILE_ERROR + return ret_code, golden_dict + with open(golden_filename) as gf: + for line in gf: + ret_code, json_info = check_format(line) + if ret_code != SUCCESS: + return ret_code, golden_dict + + sent = json_info['text'] + spo_result = _parse_structured_ovalue(json_info) + golden_dict[sent] = spo_result + return ret_code, golden_dict + + +def load_alias_dict(alias_filename): + """load alias dict""" + alias_dict = {} + ret_code = SUCCESS + if alias_filename == "": + return ret_code, alias_dict + if not os.path.exists(alias_filename): + ret_code = FILE_ERROR + return ret_code, alias_dict + with 
+        for line in af:
+            line = line.decode().strip()
+            try:
+                words = line.split('\t')
+                alias_dict[words[0].lower()] = set()
+                for alias_word in words[1:]:
+                    alias_dict[words[0].lower()].add(alias_word.lower())
+            except:
+                ret_code = ALIAS_FORMAT_ERROR
+                return ret_code, alias_dict
+    return ret_code, alias_dict
+
+
+def del_duplicate(spo_list, alias_dict):
+    """Remove duplicate (alias-equivalent) triples from a spo list."""
+    normalized_spo_list = []
+    for spo in spo_list:
+        if not is_spo_in_list(spo, normalized_spo_list, alias_dict):
+            normalized_spo_list.append(spo)
+    return normalized_spo_list
+
+
+def is_spo_in_list(target_spo, golden_spo_list, alias_dict):
+    """Check whether target_spo is in golden_spo_list, taking aliases into account."""
+    if target_spo in golden_spo_list:
+        return True
+    target_s = target_spo["subject"]
+    target_p = target_spo["predicate"]
+    target_o = target_spo["object"]
+    target_s_alias_set = alias_dict.get(target_s, set())
+    target_s_alias_set.add(target_s)
+    for spo in golden_spo_list:
+        s = spo["subject"]
+        p = spo["predicate"]
+        o = spo["object"]
+        if p != target_p:
+            continue
+        if s in target_s_alias_set and _is_equal_o(o, target_o, alias_dict):
+            return True
+    return False
+
+
+def _is_equal_o(o_a, o_b, alias_dict):
+    for key_a, value_a in o_a.items():
+        if key_a not in o_b:
+            return False
+        value_a_alias_set = alias_dict.get(value_a, set())
+        value_a_alias_set.add(value_a)
+        if o_b[key_a] not in value_a_alias_set:
+            return False
+    for key_b, value_b in o_b.items():
+        if key_b not in o_a:
+            return False
+        value_b_alias_set = alias_dict.get(value_b, set())
+        value_b_alias_set.add(value_b)
+        if o_a[key_b] not in value_b_alias_set:
+            return False
+    return True
+
+
+def calc_pr(predict_filename, alias_filename, golden_filename):
+    """Calculate precision, recall and f1."""
+    ret_info = {}
+
+    # load alias dict
+    ret_code, alias_dict = load_alias_dict(alias_filename)
+    if ret_code != SUCCESS:
+        ret_info['errorCode'] = ret_code
+        ret_info['errorMsg'] = CODE_INFO[ret_code]
+        return ret_info
+    # load golden test dataset
+    ret_code, golden_dict = load_test_dataset(golden_filename)
+    if ret_code != SUCCESS:
+        ret_info['errorCode'] = ret_code
+        ret_info['errorMsg'] = CODE_INFO[ret_code]
+        return ret_info
+    # load predict result
+    ret_code, predict_result = load_predict_result(predict_filename)
+    if ret_code != SUCCESS:
+        ret_info['errorCode'] = ret_code
+        ret_info['errorMsg'] = CODE_INFO[ret_code]
+        return ret_info
+
+    # evaluation
+    correct_sum, predict_sum, recall_sum, recall_correct_sum = 0.0, 0.0, 0.0, 0.0
+    for sent in golden_dict:
+        golden_spo_list = del_duplicate(golden_dict[sent], alias_dict)
+        predict_spo_list = predict_result.get(sent, list())
+        normalized_predict_spo = del_duplicate(predict_spo_list, alias_dict)
+        recall_sum += len(golden_spo_list)
+        predict_sum += len(normalized_predict_spo)
+        for spo in normalized_predict_spo:
+            if is_spo_in_list(spo, golden_spo_list, alias_dict):
+                correct_sum += 1
+        for golden_spo in golden_spo_list:
+            if is_spo_in_list(golden_spo, predict_spo_list, alias_dict):
+                recall_correct_sum += 1
+    print >> sys.stderr, 'correct spo num = ', correct_sum
+    print >> sys.stderr, 'submitted spo num = ', predict_sum
+    print >> sys.stderr, 'golden set spo num = ', recall_sum
+    print >> sys.stderr, 'submitted recall spo num = ', recall_correct_sum
+    precision = correct_sum / predict_sum if predict_sum > 0 else 0.0
+    recall = recall_correct_sum / recall_sum if recall_sum > 0 else 0.0
+    f1 = 2 * precision * recall / (precision + recall) \
+        if precision + recall > 0 else 0.0
+    precision = round(precision, 4)
+    recall = round(recall, 4)
+    f1 = round(f1, 4)
+    ret_info['errorCode'] = SUCCESS
+    ret_info['errorMsg'] = CODE_INFO[SUCCESS]
+    ret_info['data'] = []
+    ret_info['data'].append({'name': 'precision', 'value': precision})
+    ret_info['data'].append({'name': 'recall', 'value': recall})
+    ret_info['data'].append({'name': 'f1-score', 'value': f1})
+    return ret_info
+
+
+if __name__ == '__main__':
+    reload(sys)
+    sys.setdefaultencoding('utf-8')
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--golden_file", type=str, help="true spo results", required=True)
+    parser.add_argument(
+        "--predict_file", type=str, help="spo results predicted", required=True)
+    parser.add_argument(
+        "--alias_file", type=str, default='', help="entities alias dictionary")
+    args = parser.parse_args()
+    golden_filename = args.golden_file
+    predict_filename = args.predict_file
+    alias_filename = args.alias_file
+    ret_info = calc_pr(predict_filename, alias_filename, golden_filename)
+    print json.dumps(ret_info)
diff --git a/PaddleKG/DuIE_Baseline/script/train.sh b/PaddleKG/DuIE_Baseline/script/train.sh
new file mode 100755
index 0000000000000000000000000000000000000000..5df26204296806381c9c08a54b843eb1a6dc333f
--- /dev/null
+++ b/PaddleKG/DuIE_Baseline/script/train.sh
@@ -0,0 +1,41 @@
+set -eux
+
+export BATCH_SIZE=16
+export LR=2e-5
+export EPOCH=10
+export SAVE_STEPS=5000
+
+export SAVE_PATH=./
+export TASK_DATA_PATH=./data/
+export MODEL_PATH=./pretrained_model/
+
+export FLAGS_sync_nccl_allreduce=1
+export PYTHONPATH=./ernie:${PYTHONPATH:-}
+
+CUDA_VISIBLE_DEVICES=7 python -u ./ernie/run_duie.py \
+    --use_cuda true \
+    --do_train true \
+    --do_val true \
+    --do_test false \
+    --batch_size ${BATCH_SIZE} \
+    --init_checkpoint ${MODEL_PATH}params \
+    --num_labels 112 \
+    --chunk_scheme "IOB" \
+    --label_map_config ${TASK_DATA_PATH}relation2label.json \
+    --spo_label_map_config ${TASK_DATA_PATH}label2relation.json \
+    --train_set ${TASK_DATA_PATH}train_demo.json \
+    --dev_set ${TASK_DATA_PATH}dev_demo.json \
+    --vocab_path ${MODEL_PATH}vocab.txt \
+    --ernie_config_path ${MODEL_PATH}ernie_config.json \
+    --checkpoints ${SAVE_PATH}checkpoints \
+    --save_steps ${SAVE_STEPS} \
+    --validation_steps ${SAVE_STEPS} \
+    --weight_decay 0.01 \
+    --warmup_proportion 0.0 \
+    --use_fp16 false \
+    --epoch ${EPOCH} \
+    --max_seq_len 256 \
+    --learning_rate ${LR} \
+    --skip_steps 10 \
+    --num_iteration_per_drop_scope 1 \
+    --random_seed 1
diff --git a/PaddleKG/DuIE_Baseline/tagging_strategy.png b/PaddleKG/DuIE_Baseline/tagging_strategy.png
new file mode 100644
index 0000000000000000000000000000000000000000..0b67f69d775c1811f70bb0de880fa32bfc7009d3
Binary files /dev/null and b/PaddleKG/DuIE_Baseline/tagging_strategy.png differ