Commit 93b2a5ab authored by kgresearch

add DuIE_Baseline

Parent 445fb2c5
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
## Relation Extraction Baseline System—InfoExtractor 2.0
### Abstract
InfoExtractor 2.0 is a relation extraction baseline system developed for DuIE 2.0.
Different from [DuIE 1.0](http://lic2019.ccf.org.cn/kg), the new 2.0 task leans more toward colloquial language and further introduces **complex relations**, in which one single SPO entails multiple objects.
For detailed information about the dataset, please refer to the official website of our [competition](http://bjyz-ai.epc.baidu.com/aistudio/competition/detail/34?isFromCcf=true).
InfoExtractor 2.0 is built upon a SOTA pre-trained language model [ERNIE](https://arxiv.org/abs/1904.09223) using PaddlePaddle.
We design a structured **tagging strategy** to directly fine-tune ERNIE, through which multiple, overlapped SPOs can be extracted in **a single pass**.
The InfoExtractor 2.0 system is simple yet effective, achieving 0.554 F1 on the DuIE 2.0 demo data and 0.848 F1 on DuIE 1.0.
The hyperparameters are simply set to: BATCH_SIZE=16, LEARNING_RATE=2e-5, and EPOCH=10 (without tuning).
- - -
### Tagging Strategy
Our tagging strategy is designed to discover multiple, overlapped SPOs in the DuIE 2.0 task.
Based on the classic 'BIO' tagging scheme, we assign tags (also known as labels) to each token to indicate its position in an entity span.
The only difference is that a "B" tag here is further distinguished by predicate and by the subject/object dichotomy.
Suppose there are N predicates. Then a "B" tag should be like "B-predicate-subject" or "B-predicate-object",
which results in 2*N **mutually exclusive** "B" tags.
After tagging, we treat the task as token-level multi-label classification, with a total of (2*N + 2) labels (the extra two being the "I" and "O" tags).
Below is a visual illustration of our tagging strategy:
<div align="center">
<img src="./tagging_strategy.png" width = "550" height = "420" alt="Tagging Strategy" align=center />
</div>
For **complex relations** in the DuIE 2.0 task, we simply treat affiliated objects as independent instances (SPOs) which share the same subject.
Everything besides the tagging strategy is implemented in the most straightforward way. The model input is
\<CLS\> *input text* \<SEP\>, and the final hidden states are directly projected into classification probabilities.
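As a toy illustration of the label layout (hypothetical predicates, not the actual DuIE 2.0 schema):
```
# Suppose N = 2 predicates: "birthplace" and "director".
# 2*N "B" tags plus "I" and "O" gives 2*N + 2 = 6 labels per token:
labels = ["B-birthplace-subject", "B-birthplace-object",
          "B-director-subject", "B-director-object",
          "I", "O"]
# Each token receives a multi-hot vector over these labels; a token that
# starts the subject of a "director" SPO switches on index 2, and overlapping
# SPOs may switch on several "B" bits at once, which is why the task becomes
# token-level multi-label classification.
```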
- - -
### Environments
Python3 + Paddle Fluid 1.5 for training/evaluation/prediction (please confirm your Python path in scripts).
Python2 for the official evaluation script.
Dependencies are listed in `./requirements.txt`.
The code is tested on a single P40 GPU, with CUDA 10.1 and GPU driver 418.39.
### Download pre-trained ERNIE model
Download the ERNIE 1.0 Base (max-len-512) model and extract it into `./pretrained_model/`:
```
cd ./pretrained_model/
wget --no-check-certificate https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz
tar -zxvf ERNIE_1.0_max-len-512.tar.gz
```
### Training
```
sh ./script/train.sh
```
By default the checkpoints will be saved into `./checkpoints/`.
The GPU ID can be specified in the script. On P40 devices, the batch size can be set as high as 64 under the 256 max-seq-len setting.
Multi-GPU training is supported once `LD_LIBRARY_PATH` is specified in the script:
```
export LD_LIBRARY_PATH=/your/custom/path:$LD_LIBRARY_PATH
```
**Accuracy** (token-level and example-level) is printed during the training procedure.
### Prediction
Specify your checkpoints dir in the prediction script, and then run:
```
sh ./script/predict.sh
```
This will write the predictions into a json file with the same format as the original dataset (required for the final official evaluation). The GPU ID and batch size can be specified in the script. The final prediction file is saved into `./data/`.
### Official Evaluation
Zip your prediction json file and then run the official evaluation:
```
zip ./data/predict_test.json.zip ./data/predict_test.json
python2 ./script/re_official_evaluation.py --golden_file=./data/dev_demo.json --predict_file=./data/predict_test.json.zip [--alias_file alias_dict]
```
Precision, Recall and F1 scores are used as the official evaluation metrics to measure the performance of participating systems. The alias file lists entities with more than one correct mention; it is not provided for security reasons.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
from six.moves import xrange
def mask(batch_tokens,
seg_labels,
mask_word_tags,
total_token_num,
vocab_size,
CLS=1,
SEP=2,
MASK=3):
"""
Add mask for batch_tokens, return out, mask_label, mask_pos;
Note: mask_pos responding the batch_tokens after padded;
"""
max_len = max([len(sent) for sent in batch_tokens])
mask_label = []
mask_pos = []
prob_mask = np.random.rand(total_token_num)
# Note: the first token is [CLS], so random replacement ids start from low=1
replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)
pre_sent_len = 0
prob_index = 0
for sent_index, sent in enumerate(batch_tokens):
mask_flag = False
mask_word = mask_word_tags[sent_index]
prob_index += pre_sent_len
if mask_word:
beg = 0
for token_index, token in enumerate(sent):
seg_label = seg_labels[sent_index][token_index]
if seg_label == 1:
continue
if beg == 0:
if seg_label != -1:
beg = token_index
continue
prob = prob_mask[prob_index + beg]
if prob > 0.15:
pass
else:
for index in xrange(beg, token_index):
prob = prob_mask[prob_index + index]
base_prob = 1.0
if index == beg:
base_prob = 0.15
if base_prob * 0.2 < prob <= base_prob:
mask_label.append(sent[index])
sent[index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + index)
elif base_prob * 0.1 < prob <= base_prob * 0.2:
mask_label.append(sent[index])
sent[index] = replace_ids[prob_index + index]
mask_flag = True
mask_pos.append(sent_index * max_len + index)
else:
mask_label.append(sent[index])
mask_pos.append(sent_index * max_len + index)
if seg_label == -1:
beg = 0
else:
beg = token_index
else:
for token_index, token in enumerate(sent):
prob = prob_mask[prob_index + token_index]
if prob > 0.15:
continue
elif 0.03 < prob <= 0.15:
# mask
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
elif 0.015 < prob <= 0.03:
# random replace
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = replace_ids[prob_index +
token_index]
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
else:
# keep the original token
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
mask_pos.append(sent_index * max_len + token_index)
pre_sent_len = len(sent)
mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
return batch_tokens, mask_label, mask_pos
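# A minimal usage sketch (toy token ids; with mask_word_tags all False the
# char-level branch runs). Note that mask() mutates batch_tokens in place:
# >>> batch = [[1, 15, 37, 42, 2], [1, 88, 91, 2]]   # 1 = [CLS], 2 = [SEP]
# >>> segs = [[-1, 0, 0, 0, -1], [-1, 0, 0, -1]]
# >>> out, mask_label, mask_pos = mask(
# ...     batch, segs, mask_word_tags=[False, False],
# ...     total_token_num=9, vocab_size=1000)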
def prepare_batch_data(insts,
total_token_num,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
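# Each element of `insts` is expected to be a 6-tuple:
# (src_ids, sent_ids, pos_ids, label, seg_labels, mask_word_tag),
# matching the unpacking below. Masking runs before padding, and mask_pos
# is computed as sent_index * max_len + index so it stays valid after padding.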
batch_src_ids = [inst[0] for inst in insts]
batch_sent_ids = [inst[1] for inst in insts]
batch_pos_ids = [inst[2] for inst in insts]
labels = [inst[3] for inst in insts]
labels = np.array(labels).astype("int64").reshape([-1, 1])
seg_labels = [inst[4] for inst in insts]
mask_word_tags = [inst[5] for inst in insts]
# First step: do mask without padding
assert mask_id >= 0, "[FATAL] mask_id must be >= 0"
out, mask_label, mask_pos = mask(
batch_src_ids,
seg_labels,
mask_word_tags,
total_token_num,
vocab_size=voc_size,
CLS=cls_id,
SEP=sep_id,
MASK=mask_id)
# Second step: padding
src_id, self_input_mask = pad_batch_data(
out, pad_idx=pad_id, return_input_mask=True)
pos_id = pad_batch_data(batch_pos_ids, pad_idx=pad_id)
sent_id = pad_batch_data(batch_sent_ids, pad_idx=pad_id)
return_list = [
src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos, labels
]
return return_list
def pad_batch_data(insts,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
# Any token included in the dict can be used for padding, since the paddings'
# loss will be masked out by weights and has no effect on parameter gradients.
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype("int64").reshape([-1, 1])]
return return_list if len(return_list) > 1 else return_list[0]
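# e.g., pad two ragged sequences and also request the attention mask:
# >>> ids, attn_mask = pad_batch_data([[5, 6, 7], [8, 9]], pad_idx=0,
# ...                                 return_input_mask=True)
# >>> ids.shape, attn_mask.shape  # -> (2, 3, 1) each; mask is 0 over padding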
if __name__ == "__main__":
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""extract embeddings from ERNIE encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import argparse
import numpy as np
import multiprocessing
import paddle.fluid as fluid
import reader.task_reader as task_reader
from model.ernie import ErnieConfig, ErnieModel
from utils.args import ArgumentGroup, print_arguments
from utils.init import init_pretraining_params
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("output_dir", str, "embeddings", "path to save embeddings extracted by ernie_encoder.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("data_set", str, None, "Path to data for calculating ernie_embeddings.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
# yapf: enable
def create_model(args, pyreader_name, ernie_config):
pyreader = fluid.layers.py_reader(
capacity=50,
shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, 1]],
dtypes=['int64', 'int64', 'int64', 'int64', 'float32', 'int64'],
lod_levels=[0, 0, 0, 0, 0, 0],
name=pyreader_name,
use_double_buffer=True)
(src_ids, sent_ids, pos_ids, task_ids, input_mask,
seq_lens) = fluid.layers.read_file(pyreader)
ernie = ErnieModel(
src_ids=src_ids,
position_ids=pos_ids,
sentence_ids=sent_ids,
task_ids=task_ids,
input_mask=input_mask,
config=ernie_config)
enc_out = ernie.get_sequence_output()
unpad_enc_out = fluid.layers.sequence_unpad(enc_out, length=seq_lens)
cls_feats = ernie.get_pooled_output()
# set persistable = True so the tensors are not freed by memory optimization
enc_out.persistable = True
unpad_enc_out.persistable = True
cls_feats.persistable = True
graph_vars = {
"cls_embeddings": cls_feats,
"top_layer_embeddings": unpad_enc_out,
}
return pyreader, graph_vars
def main(args):
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
reader = task_reader.ExtractEmbeddingReader(
vocab_path=args.vocab_path,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case)
startup_prog = fluid.Program()
data_generator = reader.data_generator(
input_file=args.data_set,
batch_size=args.batch_size,
epoch=1,
shuffle=False)
total_examples = reader.get_num_examples(args.data_set)
print("Device count: %d" % dev_count)
print("Total num examples: %d" % total_examples)
infer_program = fluid.Program()
with fluid.program_guard(infer_program, startup_prog):
with fluid.unique_name.guard():
pyreader, graph_vars = create_model(
args, pyreader_name='reader', ernie_config=ernie_config)
infer_program = infer_program.clone(for_test=True)
exe.run(startup_prog)
if args.init_pretraining_params:
init_pretraining_params(
exe, args.init_pretraining_params, main_program=startup_prog)
else:
raise ValueError(
"args 'init_pretraining_params' must be specified")
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = dev_count
pyreader.decorate_tensor_provider(data_generator)
pyreader.start()
total_cls_emb = []
total_top_layer_emb = []
total_labels = []
while True:
try:
cls_emb, unpad_top_layer_emb = exe.run(
program=infer_program,
fetch_list=[
graph_vars["cls_embeddings"].name,
graph_vars["top_layer_embeddings"].name
],
return_numpy=False)
# batch_size * embedding_size
total_cls_emb.append(np.array(cls_emb))
total_top_layer_emb.append(np.array(unpad_top_layer_emb))
except fluid.core.EOFException:
break
total_cls_emb = np.concatenate(total_cls_emb)
total_top_layer_emb = np.concatenate(total_top_layer_emb)
with open(os.path.join(args.output_dir, "cls_emb.npy"),
"wb") as cls_emb_file:
np.save(cls_emb_file, total_cls_emb)
with open(os.path.join(args.output_dir, "top_layer_emb.npy"),
"wb") as top_layer_emb_file:
np.save(top_layer_emb_file, total_top_layer_emb)
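# The saved arrays can be reloaded later, e.g.
# np.load("embeddings/cls_emb.npy") has shape (num_examples, hidden_size),
# and top_layer_emb.npy holds the unpadded token-level embeddings.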
if __name__ == '__main__':
args = parser.parse_args()
print_arguments(args)
main(args)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2019 Baidu.com, Inc. All Rights Reserved
#
"""
requirements:
Authors: daisongtai(daisongtai@baidu.com)
Date: 2019/5/29 6:38 PM
"""
from __future__ import print_function
import sys
import re
import io
LHan = [
[0x2E80, 0x2E99], # Han # So [26] CJK RADICAL REPEAT, CJK RADICAL RAP
[0x2E9B, 0x2EF3
], # Han # So [89] CJK RADICAL CHOKE, CJK RADICAL C-SIMPLIFIED TURTLE
[0x2F00, 0x2FD5], # Han # So [214] KANGXI RADICAL ONE, KANGXI RADICAL FLUTE
0x3005, # Han # Lm IDEOGRAPHIC ITERATION MARK
0x3007, # Han # Nl IDEOGRAPHIC NUMBER ZERO
[0x3021,
0x3029], # Han # Nl [9] HANGZHOU NUMERAL ONE, HANGZHOU NUMERAL NINE
[0x3038,
0x303A], # Han # Nl [3] HANGZHOU NUMERAL TEN, HANGZHOU NUMERAL THIRTY
0x303B, # Han # Lm VERTICAL IDEOGRAPHIC ITERATION MARK
[
0x3400, 0x4DB5
], # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400, CJK UNIFIED IDEOGRAPH-4DB5
[
0x4E00, 0x9FC3
], # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00, CJK UNIFIED IDEOGRAPH-9FC3
[
0xF900, 0xFA2D
], # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900, CJK COMPATIBILITY IDEOGRAPH-FA2D
[
0xFA30, 0xFA6A
], # Han # Lo [59] CJK COMPATIBILITY IDEOGRAPH-FA30, CJK COMPATIBILITY IDEOGRAPH-FA6A
[
0xFA70, 0xFAD9
], # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70, CJK COMPATIBILITY IDEOGRAPH-FAD9
[
0x20000, 0x2A6D6
], # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000, CJK UNIFIED IDEOGRAPH-2A6D6
[0x2F800, 0x2FA1D]
] # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800, CJK COMPATIBILITY IDEOGRAPH-2FA1D
CN_PUNCTS = [(0x3002, "。"), (0xFF1F, "?"), (0xFF01, "!"), (0xFF0C, ","),
(0x3001, "、"), (0xFF1B, ";"), (0xFF1A, ":"), (0x300C, "「"),
(0x300D, "」"), (0x300E, "『"), (0x300F, "』"), (0x2018, "‘"),
(0x2019, "’"), (0x201C, "“"), (0x201D, "”"), (0xFF08, "("),
(0xFF09, ")"), (0x3014, "〔"), (0x3015, "〕"), (0x3010, "【"),
(0x3011, "】"), (0x2014, "—"), (0x2026, "…"), (0x2013, "–"),
(0xFF0E, "."), (0x300A, "《"), (0x300B, "》"), (0x3008, "〈"),
(0x3009, "〉"), (0x2015, "―"), (0xff0d, "-"), (0x0020, " ")]
#(0xFF5E, "~"),
EN_PUNCTS = [[0x0021, 0x002F], [0x003A, 0x0040], [0x005B, 0x0060],
[0x007B, 0x007E]]
class ChineseAndPunctuationExtractor(object):
def __init__(self):
self.chinese_re = self.build_re()
def is_chinese_or_punct(self, c):
return self.chinese_re.match(c) is not None
def build_re(self):
L = []
for i in LHan:
if isinstance(i, list):
f, t = i
try:
f = chr(f)
t = chr(t)
L.append('%s-%s' % (f, t))
except:
pass # A narrow python build, so can't use chars > 65535 without surrogate pairs!
else:
try:
L.append(chr(i))
except:
pass
for j, _ in CN_PUNCTS:
try:
L.append(chr(j))
except:
pass
for k in EN_PUNCTS:
f, t = k
try:
f = chr(f)
t = chr(t)
L.append('%s-%s' % (f, t))
except:
raise ValueError()  # EN_PUNCTS are plain ASCII ranges, so chr() should never fail here
RE = '[%s]' % ''.join(L)
# print('RE:', RE.encode('utf-8'))
return re.compile(RE, re.UNICODE)
if __name__ == '__main__':
extractor = ChineseAndPunctuationExtractor()
for c in "韩邦庆(1856~1894)曾用名寄,字子云,别署太仙、大一山人、花也怜侬、三庆":
if extractor.is_chinese_or_punct(c):
print(c, 'yes')
else:
print(c, "no")
print("~", extractor.is_chinese_or_punct("~"))
print("~", extractor.is_chinese_or_punct("~"))
print("―", extractor.is_chinese_or_punct("―"))
print("-", extractor.is_chinese_or_punct("-"))
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import argparse
import numpy as np
import json
import multiprocessing
import paddle
import logging
import paddle.fluid as fluid
from model.ernie import ErnieModel
log = logging.getLogger(__name__)
def create_model(args, pyreader_name, ernie_config):
pyreader = fluid.layers.py_reader(
capacity=50,
shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1],
[-1, args.max_seq_len, args.num_labels], [-1, 1], [-1, 1],
[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1]],
dtypes=[
'int64', 'int64', 'int64', 'int64', 'float32', 'float32', 'int64',
'int64', 'int64', 'int64'
],
lod_levels=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
name=pyreader_name,
use_double_buffer=True)
(src_ids, sent_ids, pos_ids, task_ids, input_mask, labels, seq_lens,
example_index, tok_to_orig_start_index,
tok_to_orig_end_index) = fluid.layers.read_file(pyreader)
ernie = ErnieModel(
src_ids=src_ids,
position_ids=pos_ids,
sentence_ids=sent_ids,
task_ids=task_ids,
input_mask=input_mask,
config=ernie_config,
use_fp16=args.use_fp16)
enc_out = ernie.get_sequence_output()
enc_out = fluid.layers.dropout(
x=enc_out, dropout_prob=0.1, dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=enc_out,
size=args.num_labels,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name="cls_seq_label_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_seq_label_out_b",
initializer=fluid.initializer.Constant(0.)))
logits = fluid.layers.sigmoid(logits)
lod_labels = fluid.layers.sequence_unpad(labels, seq_lens)
lod_logits = fluid.layers.sequence_unpad(logits, seq_lens)
lod_tok_to_orig_start_index = fluid.layers.sequence_unpad(
tok_to_orig_start_index, seq_lens)
lod_tok_to_orig_end_index = fluid.layers.sequence_unpad(
tok_to_orig_end_index, seq_lens)
labels = fluid.layers.flatten(labels, axis=2)
logits = fluid.layers.flatten(logits, axis=2)
input_mask = fluid.layers.flatten(input_mask, axis=2)
# calculate loss
log_logits = fluid.layers.log(logits)
log_logits_neg = fluid.layers.log(1 - logits)
ce_loss = 0. - labels * log_logits - (1 - labels) * log_logits_neg
ce_loss = fluid.layers.reduce_mean(ce_loss, dim=1, keep_dim=True)
ce_loss = ce_loss * input_mask
loss = fluid.layers.mean(x=ce_loss)
graph_vars = {
"inputs": src_ids,
"loss": loss,
"seqlen": seq_lens,
"lod_logit": lod_logits,
"lod_label": lod_labels,
"example_index": example_index,
"tok_to_orig_start_index": lod_tok_to_orig_start_index,
"tok_to_orig_end_index": lod_tok_to_orig_end_index
}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
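# For reference, the masked token-level binary cross-entropy above corresponds
# roughly to this NumPy computation (toy shapes; probs = post-sigmoid scores):
# >>> import numpy as np
# >>> probs = np.random.uniform(0.01, 0.99, (2, 4, 6))  # batch, seq, num_labels
# >>> gold = np.random.randint(0, 2, (2, 4, 6))
# >>> pad = np.array([[1., 1., 1., 0.], [1., 1., 0., 0.]])  # 0 over padding
# >>> ce = -gold * np.log(probs) - (1 - gold) * np.log(1 - probs)
# >>> loss = (ce.mean(axis=-1) * pad).mean()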
def calculate_acc(logits, labels):
# the golden metric should be "f1" in spo level
# but here only "accuracy" is computed during training for simplicity (provide crude view of your training status)
# accuracy is dependent on the tagging strategy
# for each token, the prediction is counted as correct if all its 100 labels were correctly predicted
# for each example, the prediction is counted as correct if all its token were correctly predicted
logits_lod = logits.lod()
labels_lod = labels.lod()
logits_tensor = np.array(logits)
labels_tensor = np.array(labels)
assert logits_lod == labels_lod
num_total = 0
num_correct = 0
token_total = 0
token_correct = 0
for i in range(len(logits_lod[0]) - 1):
inference_tmp = logits_tensor[logits_lod[0][i]:logits_lod[0][i + 1]]
inference_tmp[inference_tmp >= 0.5] = 1
inference_tmp[inference_tmp < 0.5] = 0
label_tmp = labels_tensor[labels_lod[0][i]:labels_lod[0][i + 1]]
num_total += 1
if (inference_tmp == label_tmp).all():
num_correct += 1
for j in range(len(inference_tmp)):
token_total += 1
if (inference_tmp[j] == label_tmp[j]).all():
token_correct += 1
return num_correct, num_total, token_correct, token_total
def calculate_metric(spo_list_gt, spo_list_predict):
# calculate the golden metrics: precision, recall and f1
# results may differ slightly from the final official evaluation on the test set,
# because the official script considers more detail (e.g. entity aliases)
tp, fp, fn = 0, 0, 0
for spo in spo_list_predict:
flag = 0
for spo_gt in spo_list_gt:
if spo['predicate'] == spo_gt['predicate'] and spo[
'object'] == spo_gt['object'] and spo['subject'] == spo_gt[
'subject']:
flag = 1
tp += 1
break
if flag == 0:
fp += 1
'''
for spo in spo_list_predict:
if spo in spo_list_gt:
tp += 1
else:
fp += 1
'''
fn = len(spo_list_gt) - tp
return tp, fp, fn
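# e.g., one exact match plus one spurious prediction:
# >>> gold = [{"predicate": "director", "subject": "A", "object": "B"}]
# >>> pred = gold + [{"predicate": "star", "subject": "A", "object": "C"}]
# >>> calculate_metric(gold, pred)   # -> (1, 1, 0), i.e. tp, fp, fn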
def evaluate(args, examples, exe, program, pyreader, graph_vars):
spo_label_map = json.load(open(args.spo_label_map_config))
fetch_list = [
graph_vars["lod_logit"].name, graph_vars["lod_label"].name,
graph_vars["example_index"].name,
graph_vars["tok_to_orig_start_index"].name,
graph_vars["tok_to_orig_end_index"].name
]
tp, fp, fn = 0, 0, 0
time_begin = time.time()
pyreader.start()
while True:
try:
# prepare fetched batch data: un-LoD (flatten LoD tensors) etc.
logits, labels, example_index_list, tok_to_orig_start_index_list, tok_to_orig_end_index_list = \
exe.run(program=program, fetch_list=fetch_list, return_numpy=False)
example_index_list = np.array(example_index_list).astype(
int) - 100000
logits_lod = logits.lod()
tok_to_orig_start_index_list_lod = tok_to_orig_start_index_list.lod(
)
tok_to_orig_end_index_list_lod = tok_to_orig_end_index_list.lod()
logits_tensor = np.array(logits)
tok_to_orig_start_index_list = np.array(
tok_to_orig_start_index_list).flatten()
tok_to_orig_end_index_list = np.array(
tok_to_orig_end_index_list).flatten()
# perform evaluation
for i in range(len(logits_lod[0]) - 1):
# prepare prediction results for each example
example_index = example_index_list[i]
example = examples[example_index]
tok_to_orig_start_index = tok_to_orig_start_index_list[
tok_to_orig_start_index_list_lod[0][
i]:tok_to_orig_start_index_list_lod[0][i + 1] - 2]
tok_to_orig_end_index = tok_to_orig_end_index_list[
tok_to_orig_end_index_list_lod[0][
i]:tok_to_orig_end_index_list_lod[0][i + 1] - 2]
inference_tmp = logits_tensor[logits_lod[0][i]:logits_lod[0][i +
1]]
labels_tmp = np.array(labels)[logits_lod[0][i]:logits_lod[0][i +
1]]
# some simple post process
inference_tmp = post_process(inference_tmp)
# logits -> classification results
inference_tmp[inference_tmp >= 0.5] = 1
inference_tmp[inference_tmp < 0.5] = 0
predict_result = []
for token in inference_tmp:
predict_result.append(np.argwhere(token == 1).tolist())
# format prediction into spo, calculate metric
formated_result = format_output(
example, predict_result, spo_label_map,
tok_to_orig_start_index, tok_to_orig_end_index)
tp_tmp, fp_tmp, fn_tmp = calculate_metric(
example['spo_list'], formated_result['spo_list'])
tp += tp_tmp
fp += fp_tmp
fn += fn_tmp
except fluid.core.EOFException:
pyreader.reset()
break
time_end = time.time()
p = tp / (tp + fp) if tp + fp != 0 else 0
r = tp / (tp + fn) if tp + fn != 0 else 0
f = 2 * p * r / (p + r) if p + r != 0 else 0
return "[evaluation] precision: %f, recall: %f, f1: %f, elapsed time: %f s" % (
p, r, f, time_end - time_begin)
def predict(args, examples, exe, test_program, test_pyreader, graph_vars):
spo_label_map = json.load(open(args.spo_label_map_config))
fetch_list = [
graph_vars["lod_logit"].name, graph_vars["lod_label"].name,
graph_vars["example_index"].name,
graph_vars["tok_to_orig_start_index"].name,
graph_vars["tok_to_orig_end_index"].name
]
test_pyreader.start()
res = []
while True:
try:
# prepare fetched batch data: un-LoD (flatten LoD tensors) etc.
logits, labels, example_index_list, tok_to_orig_start_index_list, tok_to_orig_end_index_list = \
exe.run(program=test_program, fetch_list=fetch_list, return_numpy=False)
example_index_list = np.array(example_index_list).astype(
int) - 100000
logits_lod = logits.lod()
tok_to_orig_start_index_list_lod = tok_to_orig_start_index_list.lod(
)
tok_to_orig_end_index_list_lod = tok_to_orig_end_index_list.lod()
logits_tensor = np.array(logits)
tok_to_orig_start_index_list = np.array(
tok_to_orig_start_index_list).flatten()
tok_to_orig_end_index_list = np.array(
tok_to_orig_end_index_list).flatten()
# assemble predictions
for i in range(len(logits_lod[0]) - 1):
# prepare prediction results for each example
example_index = example_index_list[i]
example = examples[example_index]
tok_to_orig_start_index = tok_to_orig_start_index_list[
tok_to_orig_start_index_list_lod[0][
i]:tok_to_orig_start_index_list_lod[0][i + 1] - 2]
tok_to_orig_end_index = tok_to_orig_end_index_list[
tok_to_orig_end_index_list_lod[0][
i]:tok_to_orig_end_index_list_lod[0][i + 1] - 2]
inference_tmp = logits_tensor[logits_lod[0][i]:logits_lod[0][i +
1]]
# some simple post process
inference_tmp = post_process(inference_tmp)
# logits -> classification results
inference_tmp[inference_tmp >= 0.5] = 1
inference_tmp[inference_tmp < 0.5] = 0
predict_result = []
for token in inference_tmp:
predict_result.append(np.argwhere(token == 1).tolist())
# format prediction into spo, calculate metric
formated_result = format_output(
example, predict_result, spo_label_map,
tok_to_orig_start_index, tok_to_orig_end_index)
res.append(formated_result)
except fluid.core.EOFException:
test_pyreader.reset()
break
return res
def post_process(inference):
# this post-process is deliberately simple and brings only limited improvement (less than 0.5 f1);
# to obtain better results, a CRF is recommended
reference = []
for token in inference:
token_ = token.copy()
token_[token_ >= 0.5] = 1
token_[token_ < 0.5] = 0
reference.append(np.argwhere(token_ == 1))
# token was classified into a conflicting situation (both 'I' and 'B' tags)
for i, token in enumerate(reference[:-1]):
if [0] in token and len(token) >= 2:
if [1] in reference[i + 1]:
inference[i][0] = 0
else:
inference[i][2:] = 0
# token wasn't assigned any class ('B', 'I', 'O' tags all zero)
for i, token in enumerate(reference[:-1]):
if len(token) == 0:
if [1] in reference[i - 1] and [1] in reference[i + 1]:
inference[i][1] = 1
elif [1] in reference[i + 1]:
inference[i][np.argmax(inference[i, 1:]) + 1] = 1
# handling of empty spo: to be implemented
return inference
def format_output(example, predict_result, spo_label_map,
tok_to_orig_start_index, tok_to_orig_end_index):
# format prediction into example-style output
complex_relation_label = [8, 10, 26, 32, 46]
complex_relation_affi_label = [9, 11, 27, 28, 29, 33, 47]
instance = {}
predict_result = predict_result[1:len(predict_result) -
1] # remove [CLS] and [SEP]
text_raw = example['text']
flatten_predict = []
for layer_1 in predict_result:
for layer_2 in layer_1:
flatten_predict.append(layer_2[0])
subject_id_list = []
for cls_label in list(set(flatten_predict)):
if 1 < cls_label <= 56 and (cls_label + 55) in flatten_predict:
subject_id_list.append(cls_label)
subject_id_list = list(set(subject_id_list))
def find_entity(id_, predict_result):
entity_list = []
for i in range(len(predict_result)):
if [id_] in predict_result[i]:
j = 0
while i + j + 1 < len(predict_result):
if [1] in predict_result[i + j + 1]:
j += 1
else:
break
entity = ''.join(text_raw[tok_to_orig_start_index[i]:
tok_to_orig_end_index[i + j] + 1])
entity_list.append(entity)
return list(set(entity_list))
spo_list = []
for id_ in subject_id_list:
if id_ in complex_relation_affi_label:
continue
if id_ not in complex_relation_label:
subjects = find_entity(id_, predict_result)
objects = find_entity(id_ + 55, predict_result)
for subject_ in subjects:
for object_ in objects:
spo_list.append({
"predicate": spo_label_map['predicate'][id_],
"object_type": {
'@value': spo_label_map['object_type'][id_]
},
'subject_type': spo_label_map['subject_type'][id_],
"object": {
'@value': object_
},
"subject": subject_
})
else:
# traverse all complex relations and look up their corresponding affiliated objects
subjects = find_entity(id_, predict_result)
objects = find_entity(id_ + 55, predict_result)
for subject_ in subjects:
for object_ in objects:
object_dict = {'@value': object_}
object_type_dict = {
'@value':
spo_label_map['object_type'][id_].split('_')[0]
}
if id_ in [8, 10, 32, 46] and id_ + 1 in subject_id_list:
id_affi = id_ + 1
object_dict[spo_label_map['object_type'][id_affi].split(
'_')[1]] = find_entity(id_affi + 55,
predict_result)[0]
object_type_dict[spo_label_map['object_type'][
id_affi].split('_')[1]] = spo_label_map[
'object_type'][id_affi].split('_')[0]
elif id_ == 26:
for id_affi in [27, 28, 29]:
if id_affi in subject_id_list:
object_dict[spo_label_map['object_type'][id_affi].split('_')[1]] = \
find_entity(id_affi + 55, predict_result)[0]
object_type_dict[spo_label_map['object_type'][id_affi].split('_')[1]] = \
spo_label_map['object_type'][id_affi].split('_')[0]
spo_list.append({
"predicate": spo_label_map['predicate'][id_],
"object_type": object_type_dict,
"subject_type": spo_label_map['subject_type'][id_],
"object": object_dict,
"subject": subject_
})
instance['text'] = example['text']
instance['spo_list'] = spo_list
return instance
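# Note on the label-id layout assumed above (inferred from the checks
# `1 < cls_label <= 56` and `cls_label + 55`): label 0 is "O", label 1 is "I",
# labels 2..56 are the per-predicate B-subject tags, and labels 57..111 the
# matching B-object tags, so an object tag id is always its subject id + 55.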
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import argparse
from utils.args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("is_classify", bool, True, "is_classify")
model_g.add_arg("is_regression", bool, False, "is_regression")
model_g.add_arg("task_id", int, 0, "task id")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("max_steps", int, 0, "Number of steps for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, True, "Whether to use dynamic loss scaling.")
train_g.add_arg("init_loss_scaling", float, 102400,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
train_g.add_arg("test_save", str, "./checkpoints/test_result", "test_save")
train_g.add_arg("metric", str, "simple_accuracy", "metric")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
"Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 2.0,
"The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8,
"The less-than-one-multiplier to use when decreasing.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("tokenizer", str, "FullTokenizer",
"ATTENTION: the INPUT must be splited by Word with blank while using SentencepieceTokenizer or WordsegTokenizer")
data_g.add_arg("train_set", str, None, "Path to training data.")
data_g.add_arg("test_set", str, None, "Path to test data.")
data_g.add_arg("dev_set", str, None, "Path to validation data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("predict_batch_size", int, None, "Total examples' number in batch for predict. see also --in_tokens.")
data_g.add_arg("in_tokens", bool, False,
"If set, the batch size will be the maximum number of tokens in one batch. "
"Otherwise, it will be the maximum number of examples in one batch.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, None, "Random seed.")
data_g.add_arg("label_map_config", str, None, "label_map_path.")
data_g.add_arg("spo_label_map_config", str, None, "spo_label_map_path.")
data_g.add_arg("num_labels", int, 2, "label number")
data_g.add_arg("diagnostic", str, None, "GLUE Diagnostic Dataset")
data_g.add_arg("diagnostic_save", str, None, "GLUE Diagnostic save f")
data_g.add_arg("max_query_length", int, 64, "Max query length.")
data_g.add_arg("max_answer_length", int, 100, "Max answer length.")
data_g.add_arg("doc_stride", int, 128,
"When splitting up a long document into chunks, how much stride to take between chunks.")
data_g.add_arg("n_best_size", int, 20,
"The total number of n-best predictions to generate in the nbest_predictions.json output file.")
data_g.add_arg("chunk_scheme", type=str, default="IOB", choices=["IO", "IOB", "IOE", "IOBES"], help="chunk scheme")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("use_multi_gpu_test", bool, False, "Whether to perform evaluation using multiple gpu cards")
run_type_g.add_arg("metrics", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("shuffle", bool, True, "")
run_type_g.add_arg("for_cn", bool, True, "model train for cn or for other langs.")
parser.add_argument("--enable_ce", action='store_true', help="The flag indicating whether to run the task for continuous evaluation.")
# yapf: enable
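# Example invocation (hypothetical entry-point name; the repo's script/train.sh
# wires these flags to the actual training script):
# python run_duie.py --use_cuda true --do_train true \
#     --init_pretraining_params ./pretrained_model/params \
#     --train_set ./data/train_data.json --vocab_path ./pretrained_model/vocab.txt \
#     --batch_size 16 --learning_rate 2e-5 --epoch 10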
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import sys
import subprocess
import os
import six
import copy
import argparse
import time
import logging
from utils.args import ArgumentGroup, print_arguments, prepare_logger
from finetune_args import parser as worker_parser
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
multip_g = ArgumentGroup(parser, "multiprocessing",
"start paddle training using multi-processing mode.")
multip_g.add_arg("node_ips", str, None,
"paddle trainer ips")
multip_g.add_arg("node_id", int, 0,
"the trainer id of the node for multi-node distributed training.")
multip_g.add_arg("print_config", bool, True,
"print the config of multi-processing mode.")
multip_g.add_arg("current_node_ip", str, None,
"the ip of current node.")
multip_g.add_arg("split_log_path", str, "log",
"log path for each trainer.")
multip_g.add_arg("log_prefix", str, "",
"the prefix name of job log.")
multip_g.add_arg("nproc_per_node", int, 8,
"the number of process to use on each node.")
multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7",
"the gpus selected to use.")
multip_g.add_arg("training_script", str, None, "the program/script to be lauched "
"in parallel followed by all the arguments", positional_arg=True)
multip_g.add_arg("training_script_args", str, None,
"training script args", positional_arg=True, nargs=argparse.REMAINDER)
# yapf: enable
log = logging.getLogger()
def start_procs(args):
procs = []
log_fns = []
default_env = os.environ.copy()
node_id = args.node_id
node_ips = [x.strip() for x in args.node_ips.split(',')]
current_ip = args.current_node_ip
if args.current_node_ip is None:
assert len(node_ips) == 1
current_ip = node_ips[0]
log.info(current_ip)
num_nodes = len(node_ips)
selected_gpus = [x.strip() for x in args.selected_gpus.split(',')]
selected_gpu_num = len(selected_gpus)
all_trainer_endpoints = ""
for ip in node_ips:
for i in range(args.nproc_per_node):
if all_trainer_endpoints != "":
all_trainer_endpoints += ","
all_trainer_endpoints += "%s:617%d" % (ip, i)
nranks = num_nodes * args.nproc_per_node
gpus_per_proc = args.nproc_per_node % selected_gpu_num
if gpus_per_proc == 0:
gpus_per_proc = selected_gpu_num // args.nproc_per_node
else:
gpus_per_proc = selected_gpu_num // args.nproc_per_node + 1
selected_gpus_per_proc = [
selected_gpus[i:i + gpus_per_proc]
for i in range(0, len(selected_gpus), gpus_per_proc)
]
if args.print_config:
log.info("all_trainer_endpoints: %s"
", node_id: %s"
", current_ip: %s"
", num_nodes: %s"
", node_ips: %s"
", gpus_per_proc: %s"
", selected_gpus_per_proc: %s"
", nranks: %s" %
(all_trainer_endpoints, node_id, current_ip, num_nodes,
node_ips, gpus_per_proc, selected_gpus_per_proc, nranks))
current_env = copy.copy(default_env)
procs = []
cmds = []
log_fns = []
for i in range(0, args.nproc_per_node):
trainer_id = node_id * args.nproc_per_node + i
assert current_ip is not None
current_env.update({
"FLAGS_selected_gpus":
"%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]),
"PADDLE_TRAINER_ID": "%d" % trainer_id,
"PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i),
"PADDLE_TRAINERS_NUM": "%d" % nranks,
"PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints,
"PADDLE_NODES_NUM": "%d" % num_nodes
})
try:
idx = args.training_script_args.index('--is_distributed')
args.training_script_args[idx + 1] = 'true'
except ValueError:
args.training_script_args += ['--is_distributed', 'true']
cmd = [sys.executable, "-u", args.training_script
] + args.training_script_args
cmds.append(cmd)
if args.split_log_path:
logdir = "%s/%sjob.log.%d" % (args.split_log_path, args.log_prefix,
trainer_id)
try:
os.mkdir(os.path.dirname(logdir))
except OSError:
pass
fn = open(logdir, "a")
log_fns.append(fn)
process = subprocess.Popen(
cmd, env=current_env, stdout=fn, stderr=fn)
log.info('subprocess launched, check log at %s' % logdir)
else:
process = subprocess.Popen(cmd, env=current_env)
log.info('subprocess launched')
procs.append(process)
try:
for i in range(len(procs)):
proc = procs[i]
proc.wait()
if len(log_fns) > 0:
log_fns[i].close()
if proc.returncode != 0:
raise subprocess.CalledProcessError(
returncode=procs[i].returncode, cmd=cmds[i])
else:
log.info("proc %d finsh" % i)
except KeyboardInterrupt as e:
for p in procs:
log.info('killing %s' % p)
p.terminate()
def main(args):
if args.print_config:
print_arguments(args)
start_procs(args)
if __name__ == "__main__":
prepare_logger(log)
launch_args = parser.parse_args()
finetuning_args = worker_parser.parse_args(launch_args.training_script_args)
init_path = finetuning_args.init_pretraining_params
log.info("init model: %s" % init_path)
if not finetuning_args.use_fp16:
os.system('rename .master "" ' + init_path + '/*.master')
main(launch_args)
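# Example single-node multi-GPU launch (hypothetical file name for this
# launcher; `--is_distributed true` is appended to the script args
# automatically in start_procs above):
# python finetune_launch.py --nproc_per_node 8 --selected_gpus 0,1,2,3,4,5,6,7 \
#     your_train_script.py [training-script args...]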
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ernie model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import json
import six
import logging
import paddle.fluid as fluid
from io import open
from model.transformer_encoder import encoder, pre_process_layer
log = logging.getLogger(__name__)
class ErnieConfig(object):
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path, 'r', encoding='utf8') as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing Ernie model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict.get(key, None)
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
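# The config json is expected to provide at least the keys ErnieModel reads
# below; a sketch with illustrative (not authoritative) values:
# {"hidden_size": 768, "num_hidden_layers": 12, "num_attention_heads": 12,
#  "vocab_size": 18000, "max_position_embeddings": 513,
#  "sent_type_vocab_size": null, "type_vocab_size": 2,
#  "use_task_id": false, "hidden_act": "relu",
#  "hidden_dropout_prob": 0.1, "attention_probs_dropout_prob": 0.1,
#  "initializer_range": 0.02}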
class ErnieModel(object):
def __init__(self,
src_ids,
position_ids,
sentence_ids,
task_ids,
input_mask,
config,
weight_sharing=True,
use_fp16=False):
self._emb_size = config['hidden_size']
self._n_layer = config['num_hidden_layers']
self._n_head = config['num_attention_heads']
self._voc_size = config['vocab_size']
self._max_position_seq_len = config['max_position_embeddings']
if config['sent_type_vocab_size']:
self._sent_types = config['sent_type_vocab_size']
else:
self._sent_types = config['type_vocab_size']
self._use_task_id = config['use_task_id']
if self._use_task_id:
self._task_types = config['task_type_vocab_size']
self._hidden_act = config['hidden_act']
self._prepostprocess_dropout = config['hidden_dropout_prob']
self._attention_dropout = config['attention_probs_dropout_prob']
self._weight_sharing = weight_sharing
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._sent_emb_name = "sent_embedding"
self._task_emb_name = "task_embedding"
self._dtype = "float16" if use_fp16 else "float32"
self._emb_dtype = "float32"
        # Initialize all weights with a truncated normal initializer; all
        # biases are zero-initialized by default.
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
self._build_model(src_ids, position_ids, sentence_ids, task_ids,
input_mask)
def _build_model(self, src_ids, position_ids, sentence_ids, task_ids,
input_mask):
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
position_emb_out = fluid.layers.embedding(
input=position_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding(
sentence_ids,
size=[self._sent_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._sent_emb_name, initializer=self._param_initializer))
emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out
if self._use_task_id:
task_emb_out = fluid.layers.embedding(
task_ids,
size=[self._task_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._task_emb_name,
initializer=self._param_initializer))
emb_out = emb_out + task_emb_out
emb_out = pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
if self._dtype == "float16":
emb_out = fluid.layers.cast(x=emb_out, dtype=self._dtype)
input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
self_attn_mask = fluid.layers.matmul(
x=input_mask, y=input_mask, transpose_y=True)
self_attn_mask = fluid.layers.scale(
x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out = encoder(
enc_input=emb_out,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
name='encoder')
if self._dtype == "float16":
self._enc_out = fluid.layers.cast(
x=self._enc_out, dtype=self._emb_dtype)
def get_sequence_output(self):
return self._enc_out
def get_pooled_output(self):
"""Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(
input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(
name="pooled_fc.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc.b_0")
return next_sent_feat
def get_lm_output(self, mask_label, mask_pos):
"""Get the loss & accuracy for pretraining"""
mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
# extract the first token feature in each sentence
self.next_sent_feat = self.get_pooled_output()
reshaped_emb_out = fluid.layers.reshape(
x=self._enc_out, shape=[-1, self._emb_size])
# extract masked tokens' feature
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc
mask_trans_feat = fluid.layers.fc(
input=mask_feat,
size=self._emb_size,
act=self._hidden_act,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_fc.w_0',
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = fluid.layers.layer_norm(
mask_trans_feat,
begin_norm_axis=len(mask_trans_feat.shape) - 1,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
            bias_attr=fluid.ParamAttr(
                name='mask_lm_trans_layer_norm_bias',
                initializer=fluid.initializer.Constant(0.)))
        # equivalent formulation via pre_process_layer:
        #mask_trans_feat = pre_process_layer(
        #    mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr(
name="mask_lm_out_fc.b_0",
initializer=fluid.initializer.Constant(value=0.0))
if self._weight_sharing:
fc_out = fluid.layers.matmul(
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(
self._word_emb_name),
transpose_y=True)
fc_out += fluid.layers.create_parameter(
shape=[self._voc_size],
dtype=self._emb_dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
else:
fc_out = fluid.layers.fc(input=mask_trans_feat,
size=self._voc_size,
param_attr=fluid.ParamAttr(
name="mask_lm_out_fc.w_0",
initializer=self._param_initializer),
bias_attr=mask_lm_out_bias_attr)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
return mean_mask_lm_loss
def get_task_output(self, task, task_labels):
task_fc_out = fluid.layers.fc(input=self.next_sent_feat,
size=task["num_labels"],
param_attr=fluid.ParamAttr(
name=task["task_name"] + "_fc.w_0",
initializer=self._param_initializer),
bias_attr=task["task_name"] + "_fc.b_0")
task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(
logits=task_fc_out, label=task_labels, return_softmax=True)
task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
mean_task_loss = fluid.layers.mean(task_loss)
return mean_task_loss, task_acc
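# Sketch (NumPy only, added for exposition) of the attention-bias trick in
# ErnieModel._build_model above: input_mask has shape [batch, seq_len, 1]
# with 1.0 for real tokens and 0.0 for padding. matmul(mask, mask^T) marks
# valid (query, key) pairs, and scale(..., scale=10000.0, bias=-1.0,
# bias_after_scale=False) computes (m - 1) * 10000, i.e. 0 where attention
# is allowed and -10000 (a large negative logit) where it is masked out.
import numpy as np

def _sketch_attn_bias(input_mask):
    """input_mask: float array of shape [batch, seq_len, 1]."""
    pair_mask = np.matmul(input_mask, np.transpose(input_mask, (0, 2, 1)))
    return (pair_mask - 1.0) * 10000.0
# m = np.array([[[1.], [1.], [0.]]])  # one sequence whose last token is padding
# _sketch_attn_bias(m)[0]             # 0.0 for real pairs, -10000.0 elsewhere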
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
    Multi-Head Attention. Note that attn_bias is added to the logits before
    computing the softmax activation, masking selected positions so that
    they are not considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
        Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
    Optionally add residual connection, layer normalization and dropout to
    the out tensor, according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self-)attention sub-layer followed
    by a position-wise feed-forward network, each wrapped with
    post_process_layer to add residual connection, layer normalization
    and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
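# Note (added for exposition): the preprocess_cmd/postprocess_cmd strings are
# interpreted character by character by pre_post_process_layer above:
# 'a' = residual add, 'n' = layer norm, 'd' = dropout. ErnieModel calls this
# encoder with preprocess_cmd="" and postprocess_cmd="dan", i.e. no
# preprocessing, then dropout -> residual add -> layer norm per sub-layer.
def _describe_process_cmd(process_cmd):
    """Human-readable expansion of a process_cmd string (sketch only)."""
    names = {"a": "residual add", "n": "layer norm", "d": "dropout"}
    return " -> ".join(names[c] for c in process_cmd)
# _describe_process_cmd("dan")  # 'dropout -> residual add -> layer norm'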
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import paddle.fluid as fluid
from utils.fp16 import create_master_params_grads, master_param_to_train_param, apply_dynamic_loss_scaling
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
use_fp16=False,
use_dynamic_loss_scaling=False,
init_loss_scaling=1.0,
incr_every_n_steps=1000,
decr_every_n_nan_or_inf=2,
incr_ratio=2.0,
decr_ratio=0.8):
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(learning_rate ** 2)),
warmup_steps)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
else:
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
optimizer._learning_rate_map[fluid.default_main_program(
)] = scheduled_lr
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
def exclude_from_weight_decay(name):
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
param_list = dict()
loss_scaling = fluid.layers.create_global_var(
name=fluid.unique_name.generate("loss_scaling"),
shape=[1],
value=init_loss_scaling,
dtype='float32',
persistable=True)
if use_fp16:
loss *= loss_scaling
param_grads = optimizer.backward(loss)
master_param_grads = create_master_params_grads(
param_grads, train_program, startup_prog, loss_scaling)
for param, _ in master_param_grads:
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
if use_dynamic_loss_scaling:
apply_dynamic_loss_scaling(
loss_scaling, master_param_grads, incr_every_n_steps,
decr_every_n_nan_or_inf, incr_ratio, decr_ratio)
optimizer.apply_gradients(master_param_grads)
if weight_decay > 0:
            for param, grad in master_param_grads:
                # strip the ".master" suffix before the exclusion check;
                # note str.rstrip(".master") would strip characters, not the
                # suffix, so the suffix is removed explicitly here
                raw_name = param.name[:-len(".master")] if param.name.endswith(
                    ".master") else param.name
                if exclude_from_weight_decay(raw_name):
                    continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
master_param_to_train_param(master_param_grads, param_grads,
train_program)
else:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr, loss_scaling
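# Sketch (pure Python, added for exposition) of the schedule produced by
# linear_warmup_decay above: the learning rate ramps linearly from 0 to
# `learning_rate` over `warmup_steps`, then follows polynomial_decay with
# power=1.0, i.e. decays linearly to 0 at `num_train_steps`.
def _sketch_scheduled_lr(step, learning_rate, warmup_steps, num_train_steps):
    if step < warmup_steps:
        return learning_rate * step / float(warmup_steps)
    frac = min(step, num_train_steps) / float(num_train_steps)
    return learning_rate * (1.0 - frac)
# _sketch_scheduled_lr(50, 5e-5, 100, 1000)   # warmup phase: 2.5e-05
# _sketch_scheduled_lr(500, 5e-5, 100, 1000)  # decay phase:  2.5e-05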
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import sys
import os
import json
import random
import logging
import numpy as np
import six
from io import open
from collections import namedtuple
import tokenization
from batching import pad_batch_data
import extract_chinese_and_punct
log = logging.getLogger(__name__)
if six.PY3:
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
class BaseReader(object):
def __init__(self,
vocab_path,
label_map_config=None,
max_seq_len=512,
do_lower_case=True,
in_tokens=False,
is_inference=False,
random_seed=None,
tokenizer="FullTokenizer",
is_classify=True,
is_regression=False,
for_cn=True,
task_id=0):
self.max_seq_len = max_seq_len
self.tokenizer = tokenization.FullTokenizer(
vocab_file=vocab_path, do_lower_case=do_lower_case)
self.vocab = self.tokenizer.vocab
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.in_tokens = in_tokens
self.is_inference = is_inference
self.for_cn = for_cn
self.task_id = task_id
np.random.seed(random_seed)
self.is_classify = is_classify
self.is_regression = is_regression
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
if label_map_config:
with open(label_map_config, encoding='utf8') as f:
self.label_map = json.load(f)
else:
self.label_map = None
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_example, self.current_epoch
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
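# Standalone sketch (added for exposition) of the truncation heuristic in
# BaseReader._truncate_seq_pair above: pop one token at a time from whichever
# sequence is currently longer until the pair fits into max_length.
def _sketch_truncate_pair(tokens_a, tokens_b, max_length):
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
# a, b = list("aaaaaa"), list("bbb")
# _sketch_truncate_pair(a, b, 6)  # -> a == ['a', 'a', 'a'], b == ['b', 'b', 'b']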
class RelationExtractionMultiCLSReader(BaseReader):
def __init__(self,
vocab_path,
label_map_config=None,
spo_label_map_config=None,
max_seq_len=512,
do_lower_case=True,
in_tokens=False,
is_inference=False,
random_seed=None,
tokenizer="FullTokenizer",
is_classify=True,
is_regression=False,
for_cn=True,
task_id=0,
num_labels=0):
self.max_seq_len = max_seq_len
self.tokenizer = tokenization.FullTokenizer(
vocab_file=vocab_path, do_lower_case=do_lower_case)
self.chineseandpunctuationextractor = extract_chinese_and_punct.ChineseAndPunctuationExtractor(
)
self.vocab = self.tokenizer.vocab
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.in_tokens = in_tokens
self.is_inference = is_inference
self.for_cn = for_cn
self.task_id = task_id
self.num_labels = num_labels
np.random.seed(random_seed)
self.is_classify = is_classify
self.is_regression = is_regression
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
# map string to relation id
if label_map_config:
with open(label_map_config, encoding='utf8') as f:
self.label_map = json.load(f)
else:
self.label_map = None
        # map relation id to string (including subject, predicate and object names)
        if spo_label_map_config:
            with open(spo_label_map_config, encoding='utf8') as f:
                self.spo_label_map = json.load(f)
else:
self.spo_label_map = None
def _read_json(self, input_file):
f = open(input_file, 'r', encoding="utf8")
examples = []
for line in f.readlines():
examples.append(json.loads(line))
f.close()
return examples
def get_num_examples(self, input_file):
examples = self._read_json(input_file)
return len(examples)
def _prepare_batch_data(self, examples, batch_size, phase=None):
"""generate batch records"""
batch_records, max_len = [], 0
for index, example in enumerate(examples):
if phase == "train":
self.current_example = index
example_index = 100000 + index
record = self._convert_example_to_record(
example_index, example, self.max_seq_len, self.tokenizer)
max_len = max(max_len, len(record.token_ids))
if self.in_tokens:
to_append = (len(batch_records) + 1) * max_len <= batch_size
else:
to_append = len(batch_records) < batch_size
if to_append:
batch_records.append(record)
else:
yield self._pad_batch_records(batch_records)
batch_records, max_len = [record], len(record.token_ids)
if batch_records:
yield self._pad_batch_records(batch_records)
def data_generator(self,
input_file,
batch_size,
epoch,
dev_count=1,
shuffle=True,
phase=None):
examples = self._read_json(input_file)
def wrapper():
all_dev_batches = []
for epoch_index in range(epoch):
if phase == "train":
self.current_example = 0
self.current_epoch = epoch_index
if shuffle:
np.random.shuffle(examples)
for batch_data in self._prepare_batch_data(
examples, batch_size, phase=phase):
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
def f():
try:
for i in wrapper():
yield i
except Exception as e:
import traceback
traceback.print_exc()
return f
def _pad_batch_records(self, batch_records):
batch_token_ids = [record.token_ids for record in batch_records]
batch_text_type_ids = [record.text_type_ids for record in batch_records]
batch_position_ids = [record.position_ids for record in batch_records]
batch_label_ids = [record.label_ids for record in batch_records]
batch_example_index = [record.example_index for record in batch_records]
batch_tok_to_orig_start_index = [
record.tok_to_orig_start_index for record in batch_records
]
batch_tok_to_orig_end_index = [
record.tok_to_orig_end_index for record in batch_records
]
# padding
padded_token_ids, input_mask, batch_seq_lens = pad_batch_data(
batch_token_ids,
pad_idx=self.pad_id,
return_input_mask=True,
return_seq_lens=True)
padded_text_type_ids = pad_batch_data(
batch_text_type_ids, pad_idx=self.pad_id)
padded_position_ids = pad_batch_data(
batch_position_ids, pad_idx=self.pad_id)
        # label padding for expanded dimension
outside_label = np.array([1] + [0] * (self.num_labels - 1))
max_len = max(len(inst) for inst in batch_label_ids)
padded_label_ids = []
for i, inst in enumerate(batch_label_ids):
inst = np.concatenate(
(np.array(inst), np.tile(outside_label, ((max_len - len(inst)),
1))),
axis=0)
padded_label_ids.append(inst)
padded_label_ids = np.stack(padded_label_ids).astype("float32")
padded_tok_to_orig_start_index = np.array([
inst + [0] * (max_len - len(inst))
for inst in batch_tok_to_orig_start_index
])
padded_tok_to_orig_end_index = np.array([
inst + [0] * (max_len - len(inst))
for inst in batch_tok_to_orig_end_index
])
padded_task_ids = np.ones_like(
padded_token_ids, dtype="int64") * self.task_id
return_list = [
padded_token_ids, padded_text_type_ids, padded_position_ids,
padded_task_ids, input_mask, padded_label_ids, batch_seq_lens,
batch_example_index, padded_tok_to_orig_start_index,
padded_tok_to_orig_end_index
]
return return_list
def _convert_example_to_record(self, example_index, example, max_seq_length,
tokenizer):
spo_list = example['spo_list']
text_raw = example['text']
sub_text = []
buff = ""
for char in text_raw:
if self.chineseandpunctuationextractor.is_chinese_or_punct(char):
if buff != "":
sub_text.append(buff)
buff = ""
sub_text.append(char)
else:
buff += char
if buff != "":
sub_text.append(buff)
tok_to_orig_start_index = []
tok_to_orig_end_index = []
orig_to_tok_index = []
tokens = []
text_tmp = ''
for (i, token) in enumerate(sub_text):
orig_to_tok_index.append(len(tokens))
sub_tokens = tokenizer.tokenize(token)
text_tmp += token
for sub_token in sub_tokens:
tok_to_orig_start_index.append(len(text_tmp) - len(token))
tok_to_orig_end_index.append(len(text_tmp) - 1)
tokens.append(sub_token)
if len(tokens) >= max_seq_length - 2:
break
else:
continue
break
labels = [[0] * self.num_labels
for i in range(len(tokens))] # initialize tag
# find all entities and tag them with corresponding "B"/"I" labels
for spo in spo_list:
for spo_object in spo['object'].keys():
# assign relation label
if spo['predicate'] in self.label_map.keys():
# simple relation
label_subject = self.label_map[spo['predicate']]
label_object = label_subject + 55
subject_sub_tokens = tokenizer.tokenize(spo['subject'])
object_sub_tokens = tokenizer.tokenize(spo['object'][
'@value'])
else:
# complex relation
label_subject = self.label_map[spo['predicate'] + '_' +
spo_object]
label_object = label_subject + 55
subject_sub_tokens = tokenizer.tokenize(spo['subject'])
object_sub_tokens = tokenizer.tokenize(spo['object'][
spo_object])
# assign token label
                # There are situations where the subject and object entities
                # overlap, e.g. "xyz established xyz corporation". To prevent
                # a single token from being labeled as two different entities,
                # we tag the longer entity first, then match the shorter
                # entity within the remaining text.
forbidden_index = None
if len(subject_sub_tokens) > len(object_sub_tokens):
for index in range(
len(tokens) - len(subject_sub_tokens) + 1):
if tokens[index:index + len(
subject_sub_tokens)] == subject_sub_tokens:
labels[index][label_subject] = 1
for i in range(len(subject_sub_tokens) - 1):
labels[index + i + 1][1] = 1
forbidden_index = index
break
for index in range(
len(tokens) - len(object_sub_tokens) + 1):
if tokens[index:index + len(
object_sub_tokens)] == object_sub_tokens:
if forbidden_index is None:
labels[index][label_object] = 1
for i in range(len(object_sub_tokens) - 1):
labels[index + i + 1][1] = 1
break
# check if labeled already
elif index < forbidden_index or index >= forbidden_index + len(
subject_sub_tokens):
labels[index][label_object] = 1
for i in range(len(object_sub_tokens) - 1):
labels[index + i + 1][1] = 1
break
else:
for index in range(
len(tokens) - len(object_sub_tokens) + 1):
if tokens[index:index + len(
object_sub_tokens)] == object_sub_tokens:
labels[index][label_object] = 1
for i in range(len(object_sub_tokens) - 1):
labels[index + i + 1][1] = 1
forbidden_index = index
break
for index in range(
len(tokens) - len(subject_sub_tokens) + 1):
if tokens[index:index + len(
subject_sub_tokens)] == subject_sub_tokens:
if forbidden_index is None:
labels[index][label_subject] = 1
for i in range(len(subject_sub_tokens) - 1):
labels[index + i + 1][1] = 1
break
elif index < forbidden_index or index >= forbidden_index + len(
object_sub_tokens):
labels[index][label_subject] = 1
for i in range(len(subject_sub_tokens) - 1):
labels[index + i + 1][1] = 1
break
        # if a token was not assigned any "B"/"I" tag, give it an "O" (outside) tag
for i in range(len(labels)):
if labels[i] == [0] * self.num_labels:
labels[i][0] = 1
        # add [CLS] and [SEP] tokens; both are tagged as "O" (outside)
if len(tokens) > max_seq_length - 2:
tokens = tokens[0:(max_seq_length - 2)]
labels = labels[0:(max_seq_length - 2)]
tokens = ["[CLS]"] + tokens + ["[SEP]"]
outside_label = [[1] + [0] * (self.num_labels - 1)]
labels = outside_label + labels + outside_label
token_ids = tokenizer.convert_tokens_to_ids(tokens)
position_ids = list(range(len(token_ids)))
text_type_ids = [0] * len(token_ids)
Record = namedtuple('Record', [
'token_ids', 'text_type_ids', 'position_ids', 'label_ids',
'example_index', 'tok_to_orig_start_index', 'tok_to_orig_end_index'
])
record = Record(
token_ids=token_ids,
text_type_ids=text_type_ids,
position_ids=position_ids,
label_ids=labels,
example_index=example_index,
tok_to_orig_start_index=tok_to_orig_start_index,
tok_to_orig_end_index=tok_to_orig_end_index)
return record
if __name__ == '__main__':
pass
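# Sketch (added for exposition) of the per-token multi-hot label layout built
# by _convert_example_to_record above: index 0 is "O" (outside), index 1 is
# "I" (inside an entity), and the remaining slots are "B" tags, one per
# (predicate, role); the object "B" index is the subject "B" index plus 55,
# which appears to be the number of relation labels in the DuIE schema
# (an assumption inferred from the `+ 55` offset in the code above).
def _sketch_token_label(num_labels, b_indices=(), inside=False):
    label = [0] * num_labels
    for idx in b_indices:      # several "B" tags may co-occur on one token
        label[idx] = 1
    if inside:
        label[1] = 1           # "I"
    if not any(label):
        label[0] = 1           # "O"
    return label
# _sketch_token_label(8, b_indices=[2])  # -> [0, 0, 1, 0, 0, 0, 0, 0]
# _sketch_token_label(8)                 # -> [1, 0, 0, 0, 0, 0, 0, 0]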
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import six
import logging
import multiprocessing
from io import open
import numpy as np
import json
# NOTE(paddle-dev): All of these flags should be
# set before `import paddle`. Otherwise, it would
# not take any effect.
os.environ['FLAGS_eager_delete_tensor_gb'] = '0' # enable gc
import codecs
import paddle.fluid as fluid
import reader.task_reader as task_reader
from model.ernie import ErnieConfig
from optimization import optimization
from utils.init import init_pretraining_params, init_checkpoint
from utils.args import print_arguments, check_cuda, prepare_logger
from finetune.relation_extraction_multi_cls import create_model, evaluate, predict, calculate_acc
from finetune_args import parser
args = parser.parse_args()
log = logging.getLogger()
def main(args):
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
if args.use_cuda:
dev_list = fluid.cuda_places()
place = dev_list[0]
dev_count = len(dev_list)
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
reader = task_reader.RelationExtractionMultiCLSReader(
vocab_path=args.vocab_path,
label_map_config=args.label_map_config,
spo_label_map_config=args.spo_label_map_config,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=args.in_tokens,
random_seed=args.random_seed,
task_id=args.task_id,
num_labels=args.num_labels)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples(args.train_set)
if args.in_tokens:
if args.batch_size < args.max_seq_len:
raise ValueError(
                    'if in_tokens=True, batch_size should be greater than max_seq_len, got batch_size:%d seqlen:%d'
% (args.batch_size, args.max_seq_len))
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
'''
if args.max_steps != 0:
max_train_steps = min(args.epoch * num_train_examples // args.batch_size // dev_count, args.max_steps)
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
'''
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
log.info("Device count: %d" % dev_count)
log.info("Num train examples: %d" % num_train_examples)
log.info("Max train steps: %d" % max_train_steps)
log.info("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = create_model(
args,
pyreader_name='train_reader',
ernie_config=ernie_config)
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
incr_every_n_steps=args.incr_every_n_steps,
decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
incr_ratio=args.incr_ratio,
decr_ratio=args.decr_ratio)
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
log.info("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, graph_vars = create_model(
args,
pyreader_name='test_reader',
ernie_config=ernie_config)
test_prog = test_prog.clone(for_test=True)
nccl2_num_trainers = 1
nccl2_trainer_id = 0
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
log.info(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=startup_prog,
use_fp16=args.use_fp16)
elif args.do_val or args.do_test:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog,
use_fp16=args.use_fp16)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = dev_count
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.decorate_tensor_provider(train_data_generator)
else:
train_exe = None
if args.do_val or args.do_test:
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_prog,
share_vars_from=train_exe)
if args.do_train:
train_pyreader.start()
steps = 0
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
while True:
try:
steps += 1
if steps % args.skip_steps != 0:
train_exe.run(fetch_list=[])
else:
fetch_list = [
graph_vars["lod_logit"].name,
graph_vars["lod_label"].name, graph_vars["loss"].name,
graph_vars['learning_rate'].name
]
out = train_exe.run(fetch_list=fetch_list,
return_numpy=False)
logits, labels, loss_lod, lr_lod = out
lr = np.array(lr_lod)[0]
loss = np.array(loss_lod).mean()
correct_, num_, token_correct_, token_total_ = calculate_acc(
logits, labels)
accuracy = correct_ / num_
accuracy_token = token_correct_ / token_total_
if args.verbose:
log.info(
"train pyreader queue size: %d, learning rate: %f" %
(train_pyreader.queue.size(), lr
if warmup_steps > 0 else args.learning_rate))
current_example, current_epoch = reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
log.info(
"epoch: %d, progress: %d/%d, step: %d, loss: %f, "
"accuracy: %f, accuracy_token: %f, speed: %f steps/s" %
(current_epoch, current_example, num_train_examples,
steps, loss, accuracy, accuracy_token,
args.skip_steps / used_time))
time_begin = time.time()
if nccl2_trainer_id == 0 and steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if nccl2_trainer_id == 0 and steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
evaluate_wrapper(reader, exe, test_prog, test_pyreader,
graph_vars, current_epoch, steps)
# evaluate test set
if args.do_test:
predict_wrapper(reader, exe, test_prog, test_pyreader,
graph_vars, current_epoch, steps)
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
# final eval on dev set
if nccl2_trainer_id == 0 and args.do_val:
evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
'final', 'final')
if nccl2_trainer_id == 0 and args.do_test:
predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
'final', 'final')
def evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars, epoch,
steps):
# evaluate dev set
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for ds in args.dev_set.split(','): # single card eval
test_pyreader.decorate_tensor_provider(
reader.data_generator(
ds, batch_size=batch_size, epoch=1, dev_count=1, shuffle=False))
examples = reader._read_json(ds)
log.info("validation result of dataset {}:".format(ds))
info = evaluate(args, examples, exe, test_prog, test_pyreader,
graph_vars)
log.info(info + ', file: {}, epoch: {}, steps: {}'.format(ds, epoch,
steps))
def predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars, epoch,
steps):
test_sets = args.test_set.split(',')
save_dirs = args.test_save.split(',')
assert len(test_sets) == len(
save_dirs), 'number of test_sets & test_save not match, got %d vs %d' % (
len(test_sets), len(save_dirs))
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for test_f, save_f in zip(test_sets, save_dirs):
test_pyreader.decorate_tensor_provider(
reader.data_generator(
test_f,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False))
examples = reader._read_json(test_f)
save_path = save_f
log.info("testing {}, save to {}".format(test_f, save_path))
res = predict(args, examples, exe, test_prog, test_pyreader, graph_vars)
save_dir = os.path.dirname(save_path)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
with codecs.open(save_path, 'w', 'utf-8') as f:
for result in res:
json_str = json.dumps(result, ensure_ascii=False)
f.write(json_str)
f.write('\n')
if __name__ == '__main__':
prepare_logger(log)
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
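# Helper sketch (added for exposition; hypothetical, not part of the training
# flow above): predict_wrapper writes one JSON object per line (JSON Lines),
# so the saved predictions can be read back like this.
def _read_predictions(path):
    import codecs
    import json
    with codecs.open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]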
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from io import open
import collections
import unicodedata
import six
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, unicode):
return text.encode("utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
with open(vocab_file, encoding='utf8') as fin:
for num, line in enumerate(fin):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a peice of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class CharTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in text.lower().split(" "):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
        # like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically contorl characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
def tokenize_chinese_chars(text):
"""Adds whitespace around any CJK character."""
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
        # like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _is_whitespace(c):
if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
return True
return False
output = []
buff = ""
for char in text:
cp = ord(char)
if _is_chinese_char(cp) or _is_whitespace(char):
if buff != "":
output.append(buff)
buff = ""
output.append(char)
else:
buff += char
if buff != "":
output.append(buff)
return output
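# Standalone sketch (added for exposition) of the greedy longest-match-first
# WordPiece algorithm implemented by WordpieceTokenizer.tokenize above,
# simplified to a single already-basic-tokenized token.
def _sketch_wordpiece(token, vocab, unk_token="[UNK]"):
    pieces, start = [], 0
    while start < len(token):
        end, cur = len(token), None
        while start < end:
            sub = ("##" if start > 0 else "") + token[start:end]
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:            # no vocab entry covers this position
            return [unk_token]
        pieces.append(cur)
        start = end
    return pieces
# _sketch_wordpiece("unaffable", {"un", "##aff", "##able"})
# -> ['un', '##aff', '##able']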
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import sys
import six
import argparse
import logging
import paddle.fluid as fluid
log = logging.getLogger(__name__)
def prepare_logger(logger, debug=False, save_to_file=None):
formatter = logging.Formatter(
fmt='[%(levelname)s] %(asctime)s [%(filename)12s:%(lineno)5d]:\t%(message)s'
)
console_hdl = logging.StreamHandler()
console_hdl.setFormatter(formatter)
logger.addHandler(console_hdl)
    if save_to_file is not None and not os.path.exists(save_to_file):
file_hdl = logging.FileHandler(save_to_file)
file_hdl.setFormatter(formatter)
logger.addHandler(file_hdl)
logger.setLevel(logging.DEBUG)
logger.propagate = False
def str2bool(v):
    # because argparse does not support parsing "True"/"False" strings as
    # Python booleans directly
return v.lower() in ("true", "t", "1")
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, positional_arg=False,
**kwargs):
prefix = "" if positional_arg else "--"
type = str2bool if type == bool else type
self._group.add_argument(
prefix + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
log.info('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
def check_cuda(use_cuda, err = \
    "\nYou cannot set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
    Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n"
    ):
    try:
        if use_cuda and not fluid.is_compiled_with_cuda():
            log.error(err)
            sys.exit(1)
    except Exception:
        pass
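def _example_arg_setup():
    # A minimal usage sketch (the flag names here are illustrative, not
    # flags defined by this module): group related options, echo the parsed
    # configuration, and fail fast if CUDA was requested but is unavailable.
    parser = argparse.ArgumentParser()
    run_group = ArgumentGroup(parser, "run", "running options.")
    run_group.add_arg("use_cuda", bool, True, "Run on GPU if available.")
    run_group.add_arg("batch_size", int, 32, "Examples per mini-batch.")
    args = parser.parse_args()
    print_arguments(args)
    check_cuda(args.use_cuda)
    return args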
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import paddle
import paddle.fluid as fluid
def append_cast_op(i, o, prog):
"""
Append a cast op in a given Program to cast input `i` to data type `o.dtype`.
Args:
i (Variable): The input Variable.
o (Variable): The output Variable.
prog (Program): The Program to append cast op.
"""
prog.global_block().append_op(
type="cast",
inputs={"X": i},
outputs={"Out": o},
attrs={"in_dtype": i.dtype,
"out_dtype": o.dtype})
def copy_to_master_param(p, block):
v = block.vars.get(p.name, None)
if v is None:
raise ValueError("no param name %s found!" % p.name)
new_p = fluid.framework.Parameter(
block=block,
shape=v.shape,
dtype=fluid.core.VarDesc.VarType.FP32,
type=v.type,
lod_level=v.lod_level,
stop_gradient=p.stop_gradient,
trainable=p.trainable,
optimize_attr=p.optimize_attr,
regularizer=p.regularizer,
gradient_clip_attr=p.gradient_clip_attr,
error_clip=p.error_clip,
name=v.name + ".master")
return new_p
def apply_dynamic_loss_scaling(loss_scaling, master_params_grads,
incr_every_n_steps, decr_every_n_nan_or_inf,
incr_ratio, decr_ratio):
_incr_every_n_steps = fluid.layers.fill_constant(
shape=[1], dtype='int32', value=incr_every_n_steps)
_decr_every_n_nan_or_inf = fluid.layers.fill_constant(
shape=[1], dtype='int32', value=decr_every_n_nan_or_inf)
_num_good_steps = fluid.layers.create_global_var(
name=fluid.unique_name.generate("num_good_steps"),
shape=[1],
value=0,
dtype='int32',
persistable=True)
_num_bad_steps = fluid.layers.create_global_var(
name=fluid.unique_name.generate("num_bad_steps"),
shape=[1],
value=0,
dtype='int32',
persistable=True)
    grads = [fluid.layers.reduce_sum(g) for _, g in master_params_grads]
all_grads = fluid.layers.concat(grads)
all_grads_sum = fluid.layers.reduce_sum(all_grads)
is_overall_finite = fluid.layers.isfinite(all_grads_sum)
update_loss_scaling(is_overall_finite, loss_scaling, _num_good_steps,
_num_bad_steps, _incr_every_n_steps,
_decr_every_n_nan_or_inf, incr_ratio, decr_ratio)
# apply_gradient append all ops in global block, thus we shouldn't
# apply gradient in the switch branch.
with fluid.layers.Switch() as switch:
with switch.case(is_overall_finite):
pass
with switch.default():
for _, g in master_params_grads:
fluid.layers.assign(fluid.layers.zeros_like(g), g)
def create_master_params_grads(params_grads, main_prog, startup_prog,
loss_scaling):
master_params_grads = []
for p, g in params_grads:
with main_prog._optimized_guard([p, g]):
# create master parameters
master_param = copy_to_master_param(p, main_prog.global_block())
startup_master_param = startup_prog.global_block()._clone_variable(
master_param)
startup_p = startup_prog.global_block().var(p.name)
append_cast_op(startup_p, startup_master_param, startup_prog)
# cast fp16 gradients to fp32 before apply gradients
if g.name.find("layer_norm") > -1:
scaled_g = g / loss_scaling
master_params_grads.append([p, scaled_g])
continue
master_grad = fluid.layers.cast(g, "float32")
master_grad = master_grad / loss_scaling
master_params_grads.append([master_param, master_grad])
return master_params_grads
def master_param_to_train_param(master_params_grads, params_grads, main_prog):
for idx, m_p_g in enumerate(master_params_grads):
train_p, _ = params_grads[idx]
if train_p.name.find("layer_norm") > -1:
continue
with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
append_cast_op(m_p_g[0], train_p, main_prog)
def update_loss_scaling(is_overall_finite, prev_loss_scaling, num_good_steps,
num_bad_steps, incr_every_n_steps,
decr_every_n_nan_or_inf, incr_ratio, decr_ratio):
"""
Update loss scaling according to overall gradients. If all gradients is
finite after incr_every_n_steps, loss scaling will increase by incr_ratio.
Otherwisw, loss scaling will decrease by decr_ratio after
decr_every_n_nan_or_inf steps and each step some gradients are infinite.
Args:
is_overall_finite (Variable): A boolean variable indicates whether
all gradients are finite.
prev_loss_scaling (Variable): Previous loss scaling.
num_good_steps (Variable): A variable accumulates good steps in which
all gradients are finite.
num_bad_steps (Variable): A variable accumulates bad steps in which
some gradients are infinite.
incr_every_n_steps (Variable): A variable represents increasing loss
scaling every n consecutive steps with
finite gradients.
decr_every_n_nan_or_inf (Variable): A variable represents decreasing
loss scaling every n accumulated
steps with nan or inf gradients.
incr_ratio(float): The multiplier to use when increasing the loss
scaling.
decr_ratio(float): The less-than-one-multiplier to use when decreasing
loss scaling.
"""
zero_steps = fluid.layers.fill_constant(shape=[1], dtype='int32', value=0)
with fluid.layers.Switch() as switch:
with switch.case(is_overall_finite):
should_incr_loss_scaling = fluid.layers.less_than(
incr_every_n_steps, num_good_steps + 1)
with fluid.layers.Switch() as switch1:
with switch1.case(should_incr_loss_scaling):
new_loss_scaling = prev_loss_scaling * incr_ratio
loss_scaling_is_finite = fluid.layers.isfinite(
new_loss_scaling)
with fluid.layers.Switch() as switch2:
with switch2.case(loss_scaling_is_finite):
fluid.layers.assign(new_loss_scaling,
prev_loss_scaling)
with switch2.default():
pass
fluid.layers.assign(zero_steps, num_good_steps)
fluid.layers.assign(zero_steps, num_bad_steps)
with switch1.default():
fluid.layers.increment(num_good_steps)
fluid.layers.assign(zero_steps, num_bad_steps)
with switch.default():
should_decr_loss_scaling = fluid.layers.less_than(
decr_every_n_nan_or_inf, num_bad_steps + 1)
with fluid.layers.Switch() as switch3:
with switch3.case(should_decr_loss_scaling):
new_loss_scaling = prev_loss_scaling * decr_ratio
static_loss_scaling = \
fluid.layers.fill_constant(shape=[1],
dtype='float32',
value=1.0)
less_than_one = fluid.layers.less_than(new_loss_scaling,
static_loss_scaling)
with fluid.layers.Switch() as switch4:
with switch4.case(less_than_one):
fluid.layers.assign(static_loss_scaling,
prev_loss_scaling)
with switch4.default():
fluid.layers.assign(new_loss_scaling,
prev_loss_scaling)
fluid.layers.assign(zero_steps, num_good_steps)
fluid.layers.assign(zero_steps, num_bad_steps)
with switch3.default():
fluid.layers.assign(zero_steps, num_good_steps)
fluid.layers.increment(num_bad_steps)
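def _example_fp16_training(loss, main_prog, startup_prog):
    # A minimal sketch (assuming the paddle 1.5.x static-graph API; the
    # optimizer choice and every hyper-parameter value below are illustrative,
    # not prescribed by this module) of how the helpers above fit together
    # for mixed-precision training.
    loss_scaling = fluid.layers.create_global_var(
        name=fluid.unique_name.generate("loss_scaling"),
        shape=[1],
        value=2.0**15,
        dtype='float32',
        persistable=True)
    optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
    # Scale the loss before backward so small fp16 gradients stay
    # representable; the gradients are unscaled again on the fp32 masters.
    params_grads = optimizer.backward(loss * loss_scaling)
    master_params_grads = create_master_params_grads(
        params_grads, main_prog, startup_prog, loss_scaling)
    apply_dynamic_loss_scaling(
        loss_scaling, master_params_grads,
        incr_every_n_steps=1000, decr_every_n_nan_or_inf=2,
        incr_ratio=2.0, decr_ratio=0.5)
    optimizer.apply_gradients(master_params_grads)
    # Finally cast the updated fp32 master weights back onto the fp16
    # training parameters.
    master_param_to_train_param(master_params_grads, params_grads, main_prog)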
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import six
import ast
import copy
import logging
import numpy as np
import paddle.fluid as fluid
log = logging.getLogger(__name__)
def cast_fp32_to_fp16(exe, main_program):
log.info("Cast parameters to float16 data format.")
for param in main_program.global_block().all_parameters():
if not param.name.endswith(".master"):
param_t = fluid.global_scope().find_var(param.name).get_tensor()
data = np.array(param_t)
if param.name.startswith("encoder_layer") \
and "layer_norm" not in param.name:
param_t.set(np.float16(data).view(np.uint16), exe.place)
            # also refresh the fp32 master copy, if one exists
master_param_var = fluid.global_scope().find_var(param.name +
".master")
if master_param_var is not None:
master_param_var.get_tensor().set(data, exe.place)
def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False):
assert os.path.exists(
init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
    def existed_persistables(var):
        if not fluid.io.is_persistable(var):
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))
    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persistables)
log.info("Load model from {}".format(init_checkpoint_path))
if use_fp16:
cast_fp32_to_fp16(exe, main_program)
def init_pretraining_params(exe,
pretraining_params_path,
main_program,
use_fp16=False):
assert os.path.exists(pretraining_params_path
), "[%s] can't be found." % pretraining_params_path
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
log.info("Load pretraining parameters from {}.".format(
pretraining_params_path))
if use_fp16:
cast_fp32_to_fp16(exe, main_program)
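def _example_warm_start(startup_prog, train_prog):
    # A minimal sketch (the two program arguments and the path below are
    # placeholders supplied by the caller) of warm-starting a training
    # program from pretrained ERNIE parameters with the helper above.
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(startup_prog)
    init_pretraining_params(
        exe,
        "./pretrained_model/params",
        main_program=train_prog,
        use_fp16=False)
    return exe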
This directory is used to store the pretrained ERNIE checkpoint.
#Python3
numpy==1.14.5
six==1.11.0
paddlepaddle-gpu==1.5.2.post107
#Python2
# zipfile and ConfigParser are Python 2 standard-library modules used by
# the evaluation script; no pip install is required for them.
set -eux
export TASK_DATA_PATH=./data/
export MODEL_PATH=./pretrained_model/
export CHECKPOINT=./checkpoints/step_60000/
export TEST_SAVE=./data/
export FLAGS_sync_nccl_allreduce=1
export PYTHONPATH=./ernie:${PYTHONPATH:-}
CUDA_VISIBLE_DEVICES=7 python -u ./ernie/run_duie.py \
--use_cuda true \
--do_train false \
--do_val false \
--do_test true \
--batch_size 128 \
--init_checkpoint ${CHECKPOINT} \
--num_labels 112 \
--label_map_config ${TASK_DATA_PATH}relation2label.json \
--spo_label_map_config ${TASK_DATA_PATH}label2relation.json \
--test_set ${TASK_DATA_PATH}dev_demo.json \
--test_save ${TEST_SAVE}predict_test.json \
--vocab_path ${MODEL_PATH}vocab.txt \
--ernie_config_path ${MODEL_PATH}ernie_config.json \
--use_fp16 false \
--max_seq_len 512 \
--skip_steps 10 \
--random_seed 1
# -*- coding: utf-8 -*-
########################################################
# Copyright (c) 2019, Baidu Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
########################################################
"""
This module to calculate precision, recall and f1-value
of the predicated results.
"""
import sys
import json
import os
import zipfile
import traceback
import argparse
import ConfigParser
reload(sys)
sys.setdefaultencoding('utf-8')
SUCCESS = 0
FILE_ERROR = 1
NOT_ZIP_FILE = 2
ENCODING_ERROR = 3
JSON_ERROR = 4
SCHEMA_ERROR = 5
ALIAS_FORMAT_ERROR = 6
CODE_INFO = {
    SUCCESS: 'success',
    FILE_ERROR: 'file does not exist',
    NOT_ZIP_FILE: 'predict file is not a zipfile',
    ENCODING_ERROR: 'file encoding error',
    JSON_ERROR: 'json parse error',
    SCHEMA_ERROR: 'schema error',
    ALIAS_FORMAT_ERROR: 'alias dict format error'
}
def del_bookname(entity_name):
"""delete the book name"""
if entity_name.startswith(u'《') and entity_name.endswith(u'》'):
entity_name = entity_name[1:-1]
return entity_name
def check_format(line):
"""检查输入行是否格式错误"""
ret_code = SUCCESS
json_info = {}
    try:
        line = line.decode('utf8').strip()
    except UnicodeError:
        ret_code = ENCODING_ERROR
        return ret_code, json_info
    try:
        json_info = json.loads(line)
    except ValueError:
        ret_code = JSON_ERROR
        return ret_code, json_info
if 'text' not in json_info or 'spo_list' not in json_info:
ret_code = SCHEMA_ERROR
return ret_code, json_info
required_key_list = ['subject', 'predicate', 'object']
for spo_item in json_info['spo_list']:
if type(spo_item) is not dict:
ret_code = SCHEMA_ERROR
return ret_code, json_info
if not all(
[required_key in spo_item for required_key in required_key_list]):
ret_code = SCHEMA_ERROR
return ret_code, json_info
if not isinstance(spo_item['subject'], basestring) or \
not isinstance(spo_item['object'], dict):
ret_code = SCHEMA_ERROR
return ret_code, json_info
return ret_code, json_info
def _parse_structured_ovalue(json_info):
spo_result = []
for item in json_info["spo_list"]:
s = del_bookname(item['subject'].lower())
o = {}
for o_key, o_value in item['object'].items():
o_value = del_bookname(o_value).lower()
o[o_key] = o_value
spo_result.append({"predicate": item['predicate'], \
"subject": s, \
"object": o})
return spo_result
def load_predict_result(predict_filename):
"""Loads the file to be predicted"""
predict_result = {}
ret_code = SUCCESS
if not os.path.exists(predict_filename):
ret_code = FILE_ERROR
return ret_code, predict_result
    try:
        predict_file_zip = zipfile.ZipFile(predict_filename)
    except zipfile.BadZipfile:
        ret_code = NOT_ZIP_FILE
        return ret_code, predict_result
for predict_file in predict_file_zip.namelist():
for line in predict_file_zip.open(predict_file):
ret_code, json_info = check_format(line)
if ret_code != SUCCESS:
return ret_code, predict_result
sent = json_info['text']
spo_result = _parse_structured_ovalue(json_info)
predict_result[sent] = spo_result
return ret_code, predict_result
def load_test_dataset(golden_filename):
"""load golden file"""
golden_dict = {}
ret_code = SUCCESS
if not os.path.exists(golden_filename):
ret_code = FILE_ERROR
return ret_code, golden_dict
with open(golden_filename) as gf:
for line in gf:
ret_code, json_info = check_format(line)
if ret_code != SUCCESS:
return ret_code, golden_dict
sent = json_info['text']
spo_result = _parse_structured_ovalue(json_info)
golden_dict[sent] = spo_result
return ret_code, golden_dict
def load_alias_dict(alias_filename):
"""load alias dict"""
alias_dict = {}
ret_code = SUCCESS
if alias_filename == "":
return ret_code, alias_dict
if not os.path.exists(alias_filename):
ret_code = FILE_ERROR
return ret_code, alias_dict
with open(alias_filename) as af:
for line in af:
line = line.decode().strip()
try:
words = line.split('\t')
alias_dict[words[0].lower()] = set()
for alias_word in words[1:]:
alias_dict[words[0].lower()].add(alias_word.lower())
            except Exception:
ret_code = ALIAS_FORMAT_ERROR
return ret_code, alias_dict
return ret_code, alias_dict
def del_duplicate(spo_list, alias_dict):
"""delete synonyms triples in predict result"""
normalized_spo_list = []
for spo in spo_list:
if not is_spo_in_list(spo, normalized_spo_list, alias_dict):
normalized_spo_list.append(spo)
return normalized_spo_list
def is_spo_in_list(target_spo, golden_spo_list, alias_dict):
"""target spo是否在golden_spo_list中"""
if target_spo in golden_spo_list:
return True
target_s = target_spo["subject"]
target_p = target_spo["predicate"]
target_o = target_spo["object"]
    # copy so that adding target_s does not mutate the shared alias_dict entry
    target_s_alias_set = set(alias_dict.get(target_s, set()))
    target_s_alias_set.add(target_s)
for spo in golden_spo_list:
s = spo["subject"]
p = spo["predicate"]
o = spo["object"]
if p != target_p:
continue
if s in target_s_alias_set and _is_equal_o(o, target_o, alias_dict):
return True
return False
def _is_equal_o(o_a, o_b, alias_dict):
for key_a, value_a in o_a.items():
if key_a not in o_b:
return False
        # copy to avoid mutating the shared alias_dict entry
        value_a_alias_set = set(alias_dict.get(value_a, set()))
        value_a_alias_set.add(value_a)
if o_b[key_a] not in value_a_alias_set:
return False
for key_b, value_b in o_b.items():
if key_b not in o_a:
return False
        value_b_alias_set = set(alias_dict.get(value_b, set()))
        value_b_alias_set.add(value_b)
if o_a[key_b] not in value_b_alias_set:
return False
return True
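def _example_alias_match():
    # A minimal sketch of the alias-aware comparison above. The alias dict
    # and triples here are made up: the predicted subject differs from the
    # golden one on the surface but matches through its alias set.
    alias_dict = {u'周树人': set([u'鲁迅'])}
    golden = [{'subject': u'鲁迅', 'predicate': u'作者',
               'object': {u'@value': u'呐喊'}}]
    target = {'subject': u'周树人', 'predicate': u'作者',
              'object': {u'@value': u'呐喊'}}
    assert is_spo_in_list(target, golden, alias_dict)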
def calc_pr(predict_filename, alias_filename, golden_filename):
"""calculate precision, recall, f1"""
ret_info = {}
#load alias dict
ret_code, alias_dict = load_alias_dict(alias_filename)
if ret_code != SUCCESS:
ret_info['errorCode'] = ret_code
ret_info['errorMsg'] = CODE_INFO[ret_code]
return ret_info
#load test golden dataset
ret_code, golden_dict = load_test_dataset(golden_filename)
if ret_code != SUCCESS:
ret_info['errorCode'] = ret_code
ret_info['errorMsg'] = CODE_INFO[ret_code]
return ret_info
#load predict result
ret_code, predict_result = load_predict_result(predict_filename)
if ret_code != SUCCESS:
ret_info['errorCode'] = ret_code
ret_info['errorMsg'] = CODE_INFO[ret_code]
return ret_info
#evaluation
correct_sum, predict_sum, recall_sum, recall_correct_sum = 0.0, 0.0, 0.0, 0.0
for sent in golden_dict:
golden_spo_list = del_duplicate(golden_dict[sent], alias_dict)
predict_spo_list = predict_result.get(sent, list())
normalized_predict_spo = del_duplicate(predict_spo_list, alias_dict)
recall_sum += len(golden_spo_list)
predict_sum += len(normalized_predict_spo)
for spo in normalized_predict_spo:
if is_spo_in_list(spo, golden_spo_list, alias_dict):
correct_sum += 1
for golden_spo in golden_spo_list:
if is_spo_in_list(golden_spo, predict_spo_list, alias_dict):
recall_correct_sum += 1
print >> sys.stderr, 'correct spo num = ', correct_sum
print >> sys.stderr, 'submitted spo num = ', predict_sum
print >> sys.stderr, 'golden set spo num = ', recall_sum
print >> sys.stderr, 'submitted recall spo num = ', recall_correct_sum
precision = correct_sum / predict_sum if predict_sum > 0 else 0.0
recall = recall_correct_sum / recall_sum if recall_sum > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) \
if precision + recall > 0 else 0.0
precision = round(precision, 4)
recall = round(recall, 4)
f1 = round(f1, 4)
ret_info['errorCode'] = SUCCESS
ret_info['errorMsg'] = CODE_INFO[SUCCESS]
ret_info['data'] = []
ret_info['data'].append({'name': 'precision', 'value': precision})
ret_info['data'].append({'name': 'recall', 'value': recall})
ret_info['data'].append({'name': 'f1-score', 'value': f1})
return ret_info
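def _example_metrics():
    # A worked sketch of the metric arithmetic above: if 8 of 10 submitted
    # triples are correct and 8 of 16 golden triples are recalled, then
    # precision = 0.8, recall = 0.5 and F1 = 2*0.8*0.5/(0.8+0.5) = 0.6154.
    precision, recall = 8.0 / 10.0, 8.0 / 16.0
    f1 = 2 * precision * recall / (precision + recall)
    assert round(f1, 4) == 0.6154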
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
parser.add_argument(
"--golden_file", type=str, help="true spo results", required=True)
parser.add_argument(
"--predict_file", type=str, help="spo results predicted", required=True)
parser.add_argument(
"--alias_file", type=str, default='', help="entities alias dictionary")
args = parser.parse_args()
golden_filename = args.golden_file
predict_filename = args.predict_file
alias_filename = args.alias_file
ret_info = calc_pr(predict_filename, alias_filename, golden_filename)
print json.dumps(ret_info)
set -eux
export BATCH_SIZE=16
export LR=2e-5
export EPOCH=10
export SAVE_STEPS=5000
export SAVE_PATH=./
export TASK_DATA_PATH=./data/
export MODEL_PATH=./pretrained_model/
export FLAGS_sync_nccl_allreduce=1
export PYTHONPATH=./ernie:${PYTHONPATH:-}
CUDA_VISIBLE_DEVICES=7 python -u ./ernie/run_duie.py \
--use_cuda true \
--do_train true \
--do_val true \
--do_test false \
--batch_size ${BATCH_SIZE} \
--init_checkpoint ${MODEL_PATH}params \
--num_labels 112 \
--chunk_scheme "IOB" \
--label_map_config ${TASK_DATA_PATH}relation2label.json \
--spo_label_map_config ${TASK_DATA_PATH}label2relation.json \
--train_set ${TASK_DATA_PATH}train_demo.json \
--dev_set ${TASK_DATA_PATH}dev_demo.json \
--vocab_path ${MODEL_PATH}vocab.txt \
--ernie_config_path ${MODEL_PATH}ernie_config.json \
--checkpoints ${SAVE_PATH}checkpoints \
--save_steps ${SAVE_STEPS} \
--validation_steps ${SAVE_STEPS} \
--weight_decay 0.01 \
--warmup_proportion 0.0 \
--use_fp16 false \
--epoch ${EPOCH} \
--max_seq_len 256 \
--learning_rate ${LR} \
--skip_steps 10 \
--num_iteration_per_drop_scope 1 \
--random_seed 1