Unverified commit 63b738e3. Author: Guo Sheng. Committer: GitHub

[cherry-pick] Add BERT into Language Model. (#5125) (#5126)

* Add BERT. (#5125)

* Fix single data file in BERT pre-training. (#5127)
Parent 27fe562c
# BERT
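## Model overview
[BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) uses the [Transformer](https://arxiv.org/abs/1706.03762) encoder as its basic building block. It is pre-trained on large-scale unlabeled text corpora with two tasks, Masked Language Model and Next Sentence Prediction, which yields a general-purpose semantic representation model that incorporates bidirectional context. Starting from this pre-trained model and adding a simple task-specific output layer, the model can be fine-tuned for downstream NLP tasks, and the result is usually better than a model trained directly on the downstream task. BERT previously achieved state-of-the-art results on the [GLUE benchmark tasks](https://gluebenchmark.com/tasks).
This project is an open-source implementation of BERT on Paddle 2.0. It includes code for pre-training and for fine-tuning on the [GLUE benchmark tasks](https://gluebenchmark.com/tasks).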
## Quick start
### Installation
* PaddlePaddle installation
This project requires PaddlePaddle 2.0rc1 or later. Follow the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it.
* PaddleNLP installation
```shell
pip install "paddlenlp>=2.0.0b"
```
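### Data preparation
#### Pre-training data preparation
`create_pretraining_data.py` is the script that creates the data required by the pre-training program. It takes text files as input (one sentence per line, with blank lines separating documents; some sample data is provided in the data directory), tokenizes them with the BERT tokenizer, generates positive and negative sentence-pair samples, masks tokens, and finally writes a data file in hdf5 format. Usage: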
```shell
python create_pretraining_data.py \
--input_file=data/sample_text.txt \
--output_file=data/training_data.hdf5 \
--bert_model=bert-base-uncased \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5
```
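The parameters are:
- `input_file` specifies the input file. A directory may be given instead, in which case all `.txt` files in that directory are used.
- `output_file` specifies the output file.
- `bert_model` selects the BERT model whose tokenizer is used for tokenization.
- `max_seq_length` specifies the maximum sequence length. Longer sequences are truncated and shorter ones are padded.
- `max_predictions_per_seq` is the maximum number of tokens masked in each sequence.
- `masked_lm_prob` is the probability that a token is masked; roughly this fraction of the tokens in each sequence is selected.
- `random_seed` specifies the random seed.
- `dupe_factor` specifies how many times the input data is processed. Each pass generates new random masks.
The data generation program above can be used to process domain-specific data for further pre-training. To use the English Wikipedia and BookCorpus data that the BERT paper used for pre-training, follow the processing steps [here](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT); the resulting data can be fed directly to the pre-training program in this project.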
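
As a rough illustration of how `masked_lm_prob` and `max_predictions_per_seq` interact, the sketch below mirrors the expression used in `create_masked_lm_predictions` of `create_pretraining_data.py`. The helper name `num_masked_tokens` is hypothetical and exists only for this example.

```python
def num_masked_tokens(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Return how many positions are selected for masking in one sequence.

    Hypothetical helper: same formula as create_masked_lm_predictions,
    at least 1 position and at most max_predictions_per_seq.
    """
    return min(max_predictions_per_seq,
               max(1, int(round(num_tokens * masked_lm_prob))))


print(num_masked_tokens(128))  # 128 * 0.15 = 19.2, rounded to 19
print(num_masked_tokens(200))  # 30 would be selected, capped at 20
print(num_masked_tokens(3))    # 0.45 rounds to 0, so the floor of 1 applies
```

With the default settings, a full 128-token sequence therefore gets 19 masked positions, which is why `max_predictions_per_seq=20` is a comfortable upper bound for `max_seq_length=128`.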
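#### Fine-tuning data preparation
##### GLUE benchmark data
The datasets for the GLUE tasks are provided through the paddlenlp API, so no preparation is needed. They are downloaded automatically when `run_glue.py` runs fine-tuning.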
### Run pre-training
```shell
python -u ./run_pretrain.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--max_predictions_per_seq 20 \
--batch_size 32 \
--learning_rate 1e-4 \
--weight_decay 1e-2 \
--adam_epsilon 1e-6 \
--warmup_steps 10000 \
--num_train_epochs 3 \
--input_dir data/ \
--output_dir pretrained_models/ \
--logging_steps 1 \
--save_steps 20000 \
--max_steps 1000000 \
--n_gpu 1
```
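The parameters are:
- `model_type` indicates the model type. Set it to bert when using a BERT model.
- `model_name_or_path` indicates a model with a specific configuration, which has a corresponding pre-trained model and the tokenizer used during pre-training. If the model files are stored locally, a directory path can be given here instead.
- `max_predictions_per_seq` is the maximum number of tokens masked in each sequence. It must match the value used when creating the pre-training data.
- `batch_size` is the number of samples **per card** in each iteration.
- `learning_rate` is the base learning rate. It is multiplied by the value produced by the learning rate scheduler to obtain the current learning rate.
- `weight_decay` is the weight decay coefficient used by the AdamW optimizer.
- `adam_epsilon` is the epsilon value used by the AdamW optimizer.
- `warmup_steps` is the number of steps over which the learning rate is linearly warmed up.
- `num_train_epochs` is the number of training epochs.
- `input_dir` is the input data directory. Every file in it whose name contains training is used as training data.
- `output_dir` is the directory where the model is saved.
- `logging_steps` is the logging interval.
- `save_steps` is the interval for saving and evaluating the model.
- `max_steps` is the maximum number of training steps. If training for `num_train_epochs` epochs would take more steps than this, training ends early once `max_steps` is reached.
- `n_gpu` is the number of GPU cards to use. For multi-card training, set it to the desired number; if it is 0, the CPU is used.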
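
To make the interplay of `learning_rate`, `warmup_steps`, `num_train_epochs` and `max_steps` concrete, here is a minimal sketch. It assumes a linear warmup followed by a linear decay to zero; only the linear warmup is stated in this project (in the `--warmup_steps` help text of `run_glue.py`), so the decay shape is an assumption, and the helper names `total_training_steps` and `lr_factor` are hypothetical.

```python
def total_training_steps(steps_per_epoch, num_train_epochs, max_steps):
    """Steps actually run: max_steps caps the epoch-based total when it is > 0."""
    planned = steps_per_epoch * num_train_epochs
    return min(planned, max_steps) if max_steps > 0 else planned


def lr_factor(step, warmup_steps, total_steps):
    """Scheduler value multiplied with the base learning rate.

    Assumed shape: linear warmup from 0 to 1, then linear decay back to 0.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return step / warmup_steps
    remaining = max(0, total_steps - step)
    return remaining / max(1, total_steps - warmup_steps)


base_lr = 1e-4
total = total_training_steps(steps_per_epoch=500000, num_train_epochs=3,
                             max_steps=1000000)
for step in (0, 5000, 10000, 505000, 1000000):
    print(step, base_lr * lr_factor(step, warmup_steps=10000, total_steps=total))
```

Under these assumptions the learning rate reaches its base value of `1e-4` exactly at step 10000 and falls to half of that at step 505000.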
### Run fine-tuning
Taking the SST-2 task in GLUE as an example, fine-tuning is started as follows:
```shell
python -u ./run_glue.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--task_name SST-2 \
--max_seq_length 128 \
--batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--logging_steps 1 \
--save_steps 500 \
--output_dir ./tmp/ \
--n_gpu 1
```
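The parameters are:
- `model_type` indicates the model type. Set it to bert when using a BERT model.
- `model_name_or_path` indicates a model with a specific configuration, which has a corresponding pre-trained model and the tokenizer used during pre-training. If the model files are stored locally, a directory path can be given here instead.
- `task_name` is the fine-tuning task.
- `max_seq_length` is the maximum sequence length. Longer sequences are truncated.
- `batch_size` is the number of samples **per card** in each iteration.
- `learning_rate` is the base learning rate. It is multiplied by the value produced by the learning rate scheduler to obtain the current learning rate.
- `num_train_epochs` is the number of training epochs.
- `logging_steps` is the logging interval.
- `save_steps` is the interval for saving and evaluating the model.
- `output_dir` is the path where the model is saved.
- `n_gpu` is the number of GPU cards to use. For multi-card training, set it to the desired number; if it is 0, the CPU is used.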
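
For sentence-pair tasks, the truncation controlled by `max_seq_length` follows a longest-first strategy in `run_glue.py` (see `_truncate_seqs` inside `convert_example`). The sketch below restates that logic as a standalone function; the name `truncate_pair` and the sample sentences are made up for illustration, and it assumes `max_seq_length` is at least 3.

```python
def truncate_pair(tokens_a, tokens_b, max_seq_length):
    """Trim a sentence pair in place so it fits next to [CLS], [SEP], [SEP]."""
    budget = max_seq_length - 3  # three slots are reserved for special tokens
    while len(tokens_a) + len(tokens_b) > budget:
        # Drop the last token of whichever sequence is currently longer.
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
    return tokens_a, tokens_b


a = ["the", "movie", "was", "surprisingly", "good"]
b = ["i", "liked", "it"]
truncate_pair(a, b, max_seq_length=9)
print(a)  # ['the', 'movie', 'was']
print(b)  # ['i', 'liked', 'it']
```

Trimming the longer side first keeps both sentences represented instead of cutting one of them away entirely.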
Fine-tuning `bert-base-uncased` on each GLUE task gives the following results on the validation sets:
| Task | Metric | Result |
|:-----:|:----------------------------:|:-----------------:|
| SST-2 | Accuracy | 0.92660 |
| QNLI | Accuracy | 0.91707 |
| CoLA | Matthews corr | 0.59557 |
| MRPC | F1/Accuracy | 0.91667/0.88235 |
| STS-B | Pearson/Spearman corr | 0.88847/0.88350 |
| QQP | Accuracy/F1 | 0.90581/0.87347 |
| MNLI | Matched acc/MisMatched acc | 0.84422/0.84825 |
| RTE | Accuracy | 0.711191 |
# coding=utf-8
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Create masked LM/next sentence masked_lm TF examples for BERT."""
from __future__ import absolute_import, division, print_function, unicode_literals
import argparse
import collections
import logging
import os
import random
from io import open

import h5py
import numpy as np
from tqdm import tqdm

from paddlenlp.transformers import BertTokenizer
from paddlenlp.transformers.tokenizer_utils import convert_to_unicode


class TrainingInstance(object):
    """A single training instance (sentence pair)."""

    def __init__(self, tokens, segment_ids, masked_lm_positions,
                 masked_lm_labels, is_random_next):
        self.tokens = tokens
        self.segment_ids = segment_ids
        self.is_random_next = is_random_next
        self.masked_lm_positions = masked_lm_positions
        self.masked_lm_labels = masked_lm_labels


def write_instance_to_example_file(instances, tokenizer, max_seq_length,
                                   max_predictions_per_seq, output_file):
    """Create TF example files from `TrainingInstance`s."""
    total_written = 0
    features = collections.OrderedDict()
    num_instances = len(instances)
    features["input_ids"] = np.zeros([num_instances, max_seq_length],
                                     dtype="int32")
    features["input_mask"] = np.zeros([num_instances, max_seq_length],
                                      dtype="int32")
    features["segment_ids"] = np.zeros([num_instances, max_seq_length],
                                       dtype="int32")
    features["masked_lm_positions"] = np.zeros(
        [num_instances, max_predictions_per_seq], dtype="int32")
    features["masked_lm_ids"] = np.zeros(
        [num_instances, max_predictions_per_seq], dtype="int32")
    features["next_sentence_labels"] = np.zeros(num_instances, dtype="int32")

    for inst_index, instance in enumerate(tqdm(instances)):
        input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
        input_mask = [1] * len(input_ids)
        segment_ids = list(instance.segment_ids)
        assert len(input_ids) <= max_seq_length

        # Pad every sequence to `max_seq_length`
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        masked_lm_positions = list(instance.masked_lm_positions)
        masked_lm_ids = tokenizer.convert_tokens_to_ids(
            instance.masked_lm_labels)
        masked_lm_weights = [1.0] * len(masked_lm_ids)

        # Pad the masked positions to `max_predictions_per_seq`
        while len(masked_lm_positions) < max_predictions_per_seq:
            masked_lm_positions.append(0)
            masked_lm_ids.append(0)
            masked_lm_weights.append(0.0)

        next_sentence_label = 1 if instance.is_random_next else 0

        features["input_ids"][inst_index] = input_ids
        features["input_mask"][inst_index] = input_mask
        features["segment_ids"][inst_index] = segment_ids
        features["masked_lm_positions"][inst_index] = masked_lm_positions
        features["masked_lm_ids"][inst_index] = masked_lm_ids
        features["next_sentence_labels"][inst_index] = next_sentence_label

        total_written += 1

    print("saving data")
    f = h5py.File(output_file, 'w')
    f.create_dataset("input_ids",
                     data=features["input_ids"],
                     dtype='i4',
                     compression='gzip')
    f.create_dataset("input_mask",
                     data=features["input_mask"],
                     dtype='i1',
                     compression='gzip')
    f.create_dataset("segment_ids",
                     data=features["segment_ids"],
                     dtype='i1',
                     compression='gzip')
    f.create_dataset("masked_lm_positions",
                     data=features["masked_lm_positions"],
                     dtype='i4',
                     compression='gzip')
    f.create_dataset("masked_lm_ids",
                     data=features["masked_lm_ids"],
                     dtype='i4',
                     compression='gzip')
    f.create_dataset("next_sentence_labels",
                     data=features["next_sentence_labels"],
                     dtype='i1',
                     compression='gzip')
    f.flush()
    f.close()


def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
    """Create `TrainingInstance`s from raw text."""
    all_documents = [[]]

    # Input file format:
    # (1) One sentence per line. These should ideally be actual sentences, not
    # entire paragraphs or arbitrary spans of text. (Because we use the
    # sentence boundaries for the "next sentence prediction" task).
    # (2) Blank lines between documents. Document boundaries are needed so
    # that the "next sentence prediction" task doesn't span between documents.
    for input_file in input_files:
        print("creating instance from {}".format(input_file))
        with open(input_file, "r") as reader:
            while True:
                line = convert_to_unicode(reader.readline())
                if not line:
                    break
                line = line.strip()

                # Empty lines are used as document delimiters
                if not line:
                    all_documents.append([])
                # tokens = tokenizer.tokenize(line)
                tokens = tokenizer(line)
                if tokens:
                    all_documents[-1].append(tokens)

    # Remove empty documents
    all_documents = [x for x in all_documents if x]
    rng.shuffle(all_documents)

    # vocab_words = list(tokenizer.vocab.keys())
    vocab_words = list(tokenizer.vocab.token_to_idx.keys())
    instances = []
    for _ in range(dupe_factor):
        for document_index in range(len(all_documents)):
            instances.extend(
                create_instances_from_document(all_documents, document_index,
                                               max_seq_length, short_seq_prob,
                                               masked_lm_prob,
                                               max_predictions_per_seq,
                                               vocab_words, rng))

    rng.shuffle(instances)
    return instances


def create_instances_from_document(all_documents, document_index,
                                   max_seq_length, short_seq_prob,
                                   masked_lm_prob, max_predictions_per_seq,
                                   vocab_words, rng):
    """Creates `TrainingInstance`s for a single document."""
    document = all_documents[document_index]

    # Account for [CLS], [SEP], [SEP]
    max_num_tokens = max_seq_length - 3

    # We *usually* want to fill up the entire sequence since we are padding
    # to `max_seq_length` anyways, so short sequences are generally wasted
    # computation. However, we *sometimes*
    # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
    # sequences to minimize the mismatch between pre-training and fine-tuning.
    # The `target_seq_length` is just a rough target however, whereas
    # `max_seq_length` is a hard limit.
    target_seq_length = max_num_tokens
    if rng.random() < short_seq_prob:
        target_seq_length = rng.randint(2, max_num_tokens)

    # We DON'T just concatenate all of the tokens from a document into a long
    # sequence and choose an arbitrary split point because this would make the
    # next sentence prediction task too easy. Instead, we split the input into
    # segments "A" and "B" based on the actual "sentences" provided by the user
    # input.
    instances = []
    current_chunk = []
    current_length = 0
    i = 0
    while i < len(document):
        segment = document[i]
        current_chunk.append(segment)
        current_length += len(segment)
        if i == len(document) - 1 or current_length >= target_seq_length:
            if current_chunk:
                # `a_end` is how many segments from `current_chunk` go into the `A`
                # (first) sentence.
                a_end = 1
                if len(current_chunk) >= 2:
                    a_end = rng.randint(1, len(current_chunk) - 1)

                tokens_a = []
                for j in range(a_end):
                    tokens_a.extend(current_chunk[j])

                tokens_b = []
                # Random next
                is_random_next = False
                if len(current_chunk) == 1 or rng.random() < 0.5:
                    is_random_next = True
                    target_b_length = target_seq_length - len(tokens_a)

                    # This should rarely go for more than one iteration for large
                    # corpora. However, just to be careful, we try to make sure that
                    # the random document is not the same as the document
                    # we're processing.
                    for _ in range(10):
                        random_document_index = rng.randint(
                            0,
                            len(all_documents) - 1)
                        if random_document_index != document_index:
                            break

                    # If picked random document is the same as the current document
                    if random_document_index == document_index:
                        is_random_next = False

                    random_document = all_documents[random_document_index]
                    random_start = rng.randint(0, len(random_document) - 1)
                    for j in range(random_start, len(random_document)):
                        tokens_b.extend(random_document[j])
                        if len(tokens_b) >= target_b_length:
                            break
                    # We didn't actually use these segments so we "put them back" so
                    # they don't go to waste.
                    num_unused_segments = len(current_chunk) - a_end
                    i -= num_unused_segments
                # Actual next
                else:
                    is_random_next = False
                    for j in range(a_end, len(current_chunk)):
                        tokens_b.extend(current_chunk[j])
                truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

                assert len(tokens_a) >= 1
                assert len(tokens_b) >= 1

                tokens = []
                segment_ids = []
                tokens.append("[CLS]")
                segment_ids.append(0)
                for token in tokens_a:
                    tokens.append(token)
                    segment_ids.append(0)

                tokens.append("[SEP]")
                segment_ids.append(0)

                for token in tokens_b:
                    tokens.append(token)
                    segment_ids.append(1)
                tokens.append("[SEP]")
                segment_ids.append(1)

                (tokens, masked_lm_positions,
                 masked_lm_labels) = create_masked_lm_predictions(
                     tokens, masked_lm_prob, max_predictions_per_seq,
                     vocab_words, rng)
                instance = TrainingInstance(
                    tokens=tokens,
                    segment_ids=segment_ids,
                    is_random_next=is_random_next,
                    masked_lm_positions=masked_lm_positions,
                    masked_lm_labels=masked_lm_labels)
                instances.append(instance)
            current_chunk = []
            current_length = 0
        i += 1

    return instances


MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                          ["index", "label"])


def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
    """Creates the predictions for the masked LM objective."""
    cand_indexes = []
    for (i, token) in enumerate(tokens):
        if token == "[CLS]" or token == "[SEP]":
            continue
        cand_indexes.append(i)

    rng.shuffle(cand_indexes)

    output_tokens = list(tokens)

    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))

    masked_lms = []
    covered_indexes = set()
    for index in cand_indexes:
        if len(masked_lms) >= num_to_predict:
            break
        if index in covered_indexes:
            continue
        covered_indexes.add(index)

        masked_token = None
        # 80% of the time, replace with [MASK]
        if rng.random() < 0.8:
            masked_token = "[MASK]"
        else:
            # 10% of the time, keep original
            if rng.random() < 0.5:
                masked_token = tokens[index]
            # 10% of the time, replace with random word
            else:
                masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

        output_tokens[index] = masked_token

        masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

    masked_lms = sorted(masked_lms, key=lambda x: x.index)

    masked_lm_positions = []
    masked_lm_labels = []
    for p in masked_lms:
        masked_lm_positions.append(p.index)
        masked_lm_labels.append(p.label)

    return (output_tokens, masked_lm_positions, masked_lm_labels)


def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
    """Truncates a pair of sequences to a maximum sequence length."""
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_num_tokens:
            break

        trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        assert len(trunc_tokens) >= 1

        # We want to sometimes truncate from the front and sometimes from the
        # back to add more randomness and avoid biases.
        if rng.random() < 0.5:
            del trunc_tokens[0]
        else:
            trunc_tokens.pop()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_file",
        default=None,
        type=str,
        required=True,
        help=
        "The input train corpus. can be directory with .txt files or a path to a single file"
    )
    parser.add_argument(
        "--output_file",
        default=None,
        type=str,
        required=True,
        help="The output file where created hdf5 formatted data will be written.")
    parser.add_argument("--vocab_file",
                        default=None,
                        type=str,
                        required=False,
                        help="The vocabulary the BERT model will train on. "
                        "Use bert_model argument would ignore this. "
                        "The bert_model argument is recommended.")
    parser.add_argument(
        "--do_lower_case",
        action='store_true',
        default=True,
        help=
        "Whether to lower case the input text. True for uncased models, False for cased models. "
        "Use bert_model argument would ignore this. The bert_model argument is recommended."
    )
    parser.add_argument(
        "--bert_model",
        default="bert-base-uncased",
        type=str,
        required=False,
        help="Bert pre-trained model selected in the list: bert-base-uncased, "
        "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese."
        "If provided, use the pre-trained model used tokenizer to create data "
        "and ignore vocab_file and do_lower_case.")

    ## Other parameters
    # int
    parser.add_argument(
        "--max_seq_length",
        default=128,
        type=int,
        help=
        "The maximum total input sequence length after WordPiece tokenization. \n"
        "Sequences longer than this will be truncated, and sequences shorter \n"
        "than this will be padded.")
    parser.add_argument(
        "--dupe_factor",
        default=10,
        type=int,
        help=
        "Number of times to duplicate the input data (with different masks).")
    parser.add_argument(
        "--max_predictions_per_seq",
        default=20,
        type=int,
        help="Maximum number of masked LM predictions per sequence.")

    # floats
    parser.add_argument("--masked_lm_prob",
                        default=0.15,
                        type=float,
                        help="Masked LM probability.")
    parser.add_argument(
        "--short_seq_prob",
        default=0.1,
        type=float,
        help=
        "Probability to create a sequence shorter than maximum sequence length")

    parser.add_argument('--random_seed',
                        type=int,
                        default=12345,
                        help="random seed for initialization")

    args = parser.parse_args()
    print(args)

    if args.bert_model:
        tokenizer = BertTokenizer.from_pretrained(args.bert_model)
    else:
        assert args.vocab_file, (
            "vocab_file must be set If bert_model is not provided.")
        tokenizer = BertTokenizer(args.vocab_file,
                                  do_lower_case=args.do_lower_case)

    input_files = []
    if os.path.isfile(args.input_file):
        input_files.append(args.input_file)
    elif os.path.isdir(args.input_file):
        input_files = [
            os.path.join(args.input_file, f)
            for f in os.listdir(args.input_file)
            if (os.path.isfile(os.path.join(args.input_file, f))
                and f.endswith('.txt'))
        ]
    else:
        raise ValueError("{} is not a valid path".format(args.input_file))

    rng = random.Random(args.random_seed)
    instances = create_training_instances(input_files, tokenizer,
                                          args.max_seq_length, args.dupe_factor,
                                          args.short_seq_prob,
                                          args.masked_lm_prob,
                                          args.max_predictions_per_seq, rng)

    output_file = args.output_file

    write_instance_to_example_file(instances, tokenizer, args.max_seq_length,
                                   args.max_predictions_per_seq, output_file)


if __name__ == "__main__":
    main()
Zulfiqar A. Bhutta trained as a physician in Pakistan in the early stages of his career.
He holds titles across various organizations in diverse geographies.
Professor Bhutta is the Founding Director of the Center of Excellence in Women and Child Health & Institute for Global Child Health & Development, at the Aga Khan University South-Central Asia, East Africa & United Kingdom.
He is currently the Co-Director at the Centre for Global Child Health, at the Hospital for Sick Children and leads many projects as a Senior Scientist at the Research Institute in the Centre for Global Child Health at Sick Kids.
He holds a Professorship at the University of Toronto in the Department of Nutritional Sciences and the Division of Epidemiology, Dalla Lana School of Public Health.
Additionally, he holds concurrent professorship at the Department of Paediatrics, Aga Khan University in Karachi, Pakistan and at the Schools of Public Health of Johns Hopkins University, Tufts University, Boston University, University of Alberta and the London School of Hygiene & Tropical Medicine.
He is a designated Distinguished National Professor of the Government of Pakistan and was the Founding Chair of the National Research Ethics Committee of the Government of Pakistan from 2003-2014.
Dr. Bhutta received his MBBS from Khyber Medical College in Peshawar, Pakistan in 1977 at which time he was names "Best Graduate of the Year" and awarded the University Gold Medal for overall distinction.
His PhD work was completed at Karolinska Institute in Stockholm, Sweden in 1996.
He is a Fellow of the Royal College of Physicians (Edinburgh & London), the Royal College of Paediatrics and Child Health (London), American Academy of Paediatrics and the Pakistan Academy of Sciences.
Following the completion of his PhD Dr. Bhutta began working as House Surgeon in Obstetrics & Gynecology at the Khyber Teaching Hospital, Peshawar (April-November 1978).
He began work in paediatrics as a physician in November of 1978 in the Professorial Unit at the Institute of Child Health, Jinnah Postgraduate Medical Centre, Karachi (Pakistan).
Through 1980's he continued his work as a surgeon and paediatrician.
He undertook his first professor position in the Department of Paediatrics, The Aga Khan University Hospital, Karachi (Pakistan), from November 1987 to June 1992.
In 2005, Dr. Bhutta became the Chairman of the Department of Paediatrics & Child Health at the Aga Khan University & Medical Center, a position held until 2008.
Following his term as Chairman he became The Noordin Noormahomed Sheriff Professor & Founding Chair, Division of Women & Child Health, The Aga Khan University, a position he held for four years.
Dr. Bhutta currently holds the titles of co-director of the Centre for Global Child Health at the Hospital for Sick Children in Toronto, and founding director of the Centre of Excellence in Women and Child Health at the Aga Khan University.
In 2020, he was appointed founding director of the Institute for Global child Health & Development at the Aga Khan University and elected Fellow to the Royal Society, United Kingdom.
Outside of his professional responsibilities Dr. Bhutta serves on various local and international boards and committees, including a series of editorial boards.
In his various capacities Dr. Bhutta has produced a large collection of publications working with his teams at Sick Kids, AKU and international partners.
These include book reviews, chapters, 1.
"Haematological disorders" "Neonatal Jaundice" in Neonatal Vade‑Mecum, Fleming PJ, Speidel BD, Dunn PM Eds, Lloyd‑Luke Publishers, UK, 1986.
Revised 2nd Edition 1991.
2.
"Nutritional management of acute and persistent diarrhoea".
A M Molla, Bhutta Z A and  A Molla.
In McNeish A S, Mittal S K and Walker-Smith J A (eds).
Recent trends in diarrhoea and malnutrition, MAMC, Delhi, 1991, pp 37-51.
3.
"Paediatric Prescribing” in "Text book of Paediatrics for developing countries"            Arif MA, Hanif SM, Wasti SMK Eds, 1989, 2nd Edition 1996,  PPA, Karachi.
& Lahore 4.
"Innovations in neonatal care : Impact on neonatal survival in the developing world:.
Bhutta Z A  Zaidi S (Editor) 1992.
TWEL Publisher.
Karachi pp 121-131 5.
"Short course therapy in Pediatrics" Bhutta Z A& Teele D.  In Tice A D, Waldvogel F (Eds), Contemporary issues in Infectious Disease Epidemiology and Management, 1993 Gardiner Caldwell, Cheshire, pp 52 - 60.
6.
"Dietary management of persistent diarrhoea".
Bhutta Z A, Molla A M, Issani Z.
In Reflections on  Diarrhoeal Disease & Nutrition  of Children".
1993 Karachi, pp 97 - 103.
7.
"Prescribing practices amongst general practitioners (GPs) and consultant paediatricians in childhood diarrhoea.”  S.Q.
Nizami, I.A.
Khan, Bhutta Z A.
In "Reflections on Diarrhoeal Disease and Nutrition of Children".
1993 Karachi, pp  88-90.
8.
"The challenge of multidrug-resistant typhoid".
Bhutta Z A.
In Puri R K, Sachdev H P S, Choudhry P, Verma I C (Eds), Current concepts in Paediatrics, 1994.
Jaypee Publishers, New Delhi, pp 403.8.
9.
"Perinatal Care in Pakistan: Current status and trends".
In Proceedings of the Workshop in Reproductive Health.
College of Physicians and Surgeons, Pakistan, Karachi, 1995, pp 95-103.
10.
“A study of whole body protein kinetics in malnourished children with persistent diarrhoea” Bhutta Z A, Nizami SQ, Isani Z, Hardy S, Hendricks K, Young V.   Report of the second RCM coordinated Research Programme for application of stable isotope tracer methods to studies of energy metabolism in malnourished populations of developing countries.
NAHRES-30 1996 IAEA Vienna.
11.
"Pneumococcal infections in Pakistan: a country report".
In Adult Immunization in Asia, Fondation Mercel Merieux, Lyon, 1998. pp 79-82.
12.
“Factors affecting protein and aminoacid metabolism in childhood from developing countries".
In Child Nutrition: an international perspective.
Editors Solomons NW, Caballero B, Brown KH.
CRC Press 1998.
13.
"Protein Digestion and Bioavailability".
In Encyclopedia of Human Nutrition.
Editors: Sadler M, Strain JJ, Caballero B.
Academic Press (London), 1998 pp.1646-54.
14.
"Perinatal Care in Pakistan.
Reproductive Health: A manual for family practice and primary health care.
Bhutta Z A, Maqbool S.  College of Physicians and Surgeons, Pakistan, Karachi, 1999, pp 69-78.
15.
“Effective interventions to reduce neonatal mortality and morbidity from perinatal infection.
Bhutta ZA.
In Costello A, Manandhar D (eds).
"Improving Newborn Infant Health in Developing Countries’ 1999.
Imperial College Press, London pp.289-308.
16.
“Ambulatory management of typhoid fever”            “Risk factors and management of micronutrient deficiencies”            “Management of persistent diarrhoea in developing countries”.
In Manual of International Child Health, British Medical Journal, 2000 (in press).
17.
“The role of Cefixime in typhoid fever during childhood” in Cefixime, Adam D, Quintiliani R (Eds), Torre-Lazur-McCann, Tokyo, 2000; pp.107-112.
18.
"Micronutrients and Child Health in the Commonwealth”, Commonwealth Foundation" (UK) (2001).
19.
"Isotopic evaluation of breast milk intake, energy metabolism growth and body composition of exclusively breastfed infants in Pakistan".
Bhutta ZA, Nizami SQ, Weaver LT, Preston T. In Application of Stable Isotopes to evaluate Growth and Body Composition of Exclusively Breastfed infants, IAEA and WHO, NAHRES Report.
2000.
20.
“Typhoid Fever in Childhood: the south Asian experience”.
Ahmad K &Bhutta ZA.
In "Recent Advances in Paediatrics", Gupte S (Ed), 2000, India .
21.
“Neonatal Infections in developing countries” in  Carrera JM, Cabero L, Baraibar R (Eds).
The Perinatal Medicine of the new Millennium.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
import sys
import random
import time
import math
from functools import partial
import numpy as np
import paddle
from paddle.io import DataLoader
from paddle.metric import Metric, Accuracy, Precision, Recall
from paddlenlp.datasets import GlueCoLA, GlueSST2, GlueMRPC, GlueSTSB, GlueQQP, GlueMNLI, GlueQNLI, GlueRTE
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
from paddlenlp.transformers import ElectraForSequenceClassification, ElectraTokenizer
from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer
from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)

TASK_CLASSES = {
    "cola": (GlueCoLA, Mcc),
    "sst-2": (GlueSST2, Accuracy),
    "mrpc": (GlueMRPC, AccuracyAndF1),
    "sts-b": (GlueSTSB, PearsonAndSpearman),
    "qqp": (GlueQQP, AccuracyAndF1),
    "mnli": (GlueMNLI, Accuracy),
    "qnli": (GlueQNLI, Accuracy),
    "rte": (GlueRTE, Accuracy),
}

MODEL_CLASSES = {"bert": (BertForSequenceClassification, BertTokenizer)}


def parse_args():
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument(
        "--task_name",
        default=None,
        type=str,
        required=True,
        help="The name of the task to train selected in the list: " +
        ", ".join(TASK_CLASSES.keys()), )
    parser.add_argument(
        "--model_type",
        default=None,
        type=str,
        required=True,
        help="Model type selected in the list: " +
        ", ".join(MODEL_CLASSES.keys()), )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        help="Path to pre-trained model or shortcut name selected in the list: "
        + ", ".join(
            sum([
                list(classes[-1].pretrained_init_configuration.keys())
                for classes in MODEL_CLASSES.values()
            ], [])), )
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        required=True,
        help="The output directory where the model predictions and checkpoints will be written.",
    )

    parser.add_argument(
        "--max_seq_length",
        default=128,
        type=int,
        help="The maximum total input sequence length after tokenization. Sequences longer "
        "than this will be truncated, sequences shorter will be padded.", )
    parser.add_argument(
        "--learning_rate",
        default=1e-4,
        type=float,
        help="The initial learning rate for Adam.")
    parser.add_argument(
        "--num_train_epochs",
        default=3,
        type=int,
        help="Total number of training epochs to perform.", )
    parser.add_argument(
        "--logging_steps",
        type=int,
        default=100,
        help="Log every X updates steps.")
    parser.add_argument(
        "--save_steps",
        type=int,
        default=100,
        help="Save checkpoint every X updates steps.")
    parser.add_argument(
        "--batch_size",
        default=32,
        type=int,
        help="Batch size per GPU/CPU for training.", )
    parser.add_argument(
        "--weight_decay",
        default=0.0,
        type=float,
        help="Weight decay if we apply some.")
    parser.add_argument(
        "--warmup_steps",
        default=0,
        type=int,
        help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion"
    )
    parser.add_argument(
        "--warmup_proportion",
        default=0.,
        type=float,
        help="Linear warmup proportion over total steps.")
    parser.add_argument(
        "--adam_epsilon",
        default=1e-6,
        type=float,
        help="Epsilon for Adam optimizer.")
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
    )
    parser.add_argument(
        "--seed", default=42, type=int, help="random seed for initialization")
    parser.add_argument(
        "--n_gpu",
        default=1,
        type=int,
        help="number of gpus to use, 0 for cpu.")
    args = parser.parse_args()
    return args


def set_seed(args):
    # Offset the seed by the process rank so each worker draws different numbers
    random.seed(args.seed + paddle.distributed.get_rank())
    np.random.seed(args.seed + paddle.distributed.get_rank())
    paddle.seed(args.seed + paddle.distributed.get_rank())


def evaluate(model, loss_fct, metric, data_loader):
    model.eval()
    metric.reset()
    for batch in data_loader:
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = loss_fct(logits, labels)
        correct = metric.compute(logits, labels)
        metric.update(correct)
    res = metric.accumulate()
    if isinstance(metric, AccuracyAndF1):
        logger.info(
            "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s."
            % (loss.numpy(), res[0], res[1], res[2], res[3], res[4]))
    elif isinstance(metric, Mcc):
        logger.info("eval loss: %f, mcc: %s." % (loss.numpy(), res[0]))
    elif isinstance(metric, PearsonAndSpearman):
        logger.info(
            "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s."
            % (loss.numpy(), res[0], res[1], res[2]))
    else:
        logger.info("eval loss: %f, acc: %s." % (loss.numpy(), res))
    model.train()


def convert_example(example,
                    tokenizer,
                    label_list,
                    max_seq_length=512,
                    is_test=False):
    """convert a glue example into necessary features"""

    def _truncate_seqs(seqs, max_seq_length):
        if len(seqs) == 1:  # single sentence
            # Account for [CLS] and [SEP] with "- 2"
            seqs[0] = seqs[0][0:(max_seq_length - 2)]
        else:  # Sentence pair
            # Account for [CLS], [SEP], [SEP] with "- 3"
            tokens_a, tokens_b = seqs
            max_seq_length -= 3
            while True:  # Truncate with longest_first strategy
                total_length = len(tokens_a) + len(tokens_b)
                if total_length <= max_seq_length:
                    break
                if len(tokens_a) > len(tokens_b):
                    tokens_a.pop()
                else:
                    tokens_b.pop()
        return seqs

    def _concat_seqs(seqs, separators, seq_mask=0, separator_mask=1):
        concat = sum((seq + sep for sep, seq in zip(separators, seqs)), [])
        segment_ids = sum(
            ([i] * (len(seq) + len(sep))
             for i, (sep, seq) in enumerate(zip(separators, seqs))), [])
        if isinstance(seq_mask, int):
            seq_mask = [[seq_mask] * len(seq) for seq in seqs]
        if isinstance(separator_mask, int):
            separator_mask = [[separator_mask] * len(sep) for sep in separators]
        p_mask = sum((s_mask + mask
                      for sep, seq, s_mask, mask in zip(
                          separators, seqs, seq_mask, separator_mask)), [])
        return concat, segment_ids, p_mask

    if not is_test:
        # `label_list == None` is for regression task
        label_dtype = "int64" if label_list else "float32"
        # Get the label
        label = example[-1]
        example = example[:-1]
        # Create label maps if classification task
        if label_list:
            label_map = {}
            for (i, l) in enumerate(label_list):
                label_map[l] = i
            label = label_map[label]
        label = np.array([label], dtype=label_dtype)

    # Tokenize raw text
    tokens_raw = [tokenizer(l) for l in example]
    # Truncate to the truncate_length,
    tokens_trun = _truncate_seqs(tokens_raw, max_seq_length)
    # Concate the sequences with special tokens
    tokens_trun[0] = [tokenizer.cls_token] + tokens_trun[0]
    tokens, segment_ids, _ = _concat_seqs(tokens_trun, [[tokenizer.sep_token]] *
                                          len(tokens_trun))
    # Convert the token to ids
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    valid_length = len(input_ids)
    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    # input_mask = [1] * len(input_ids)
if not is_test:
return input_ids, segment_ids, valid_length, label
else:
return input_ids, segment_ids, valid_length


def do_train(args):
    paddle.set_device("gpu" if args.n_gpu else "cpu")
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    set_seed(args)

    args.task_name = args.task_name.lower()
    dataset_class, metric_class = TASK_CLASSES[args.task_name]
    args.model_type = args.model_type.lower()
    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    train_dataset = dataset_class.get_datasets(["train"])
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)

    trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        label_list=train_dataset.get_labels(),
        max_seq_length=args.max_seq_length)
    train_dataset = train_dataset.apply(trans_func, lazy=True)
    train_batch_sampler = paddle.io.DistributedBatchSampler(
        train_dataset, batch_size=args.batch_size, shuffle=True)
    # Pad ids and segments per batch, then drop the length field (index 2).
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # segment
        Stack(),  # length
        Stack(dtype="int64" if train_dataset.get_labels() else "float32")  # label
    ): [data for i, data in enumerate(fn(samples)) if i != 2]
    train_data_loader = DataLoader(
        dataset=train_dataset,
        batch_sampler=train_batch_sampler,
        collate_fn=batchify_fn,
        num_workers=0,
        return_list=True)
    if args.task_name == "mnli":
        dev_dataset_matched, dev_dataset_mismatched = dataset_class.get_datasets(
            ["dev_matched", "dev_mismatched"])
        dev_dataset_matched = dev_dataset_matched.apply(trans_func, lazy=True)
        dev_dataset_mismatched = dev_dataset_mismatched.apply(
            trans_func, lazy=True)
        dev_batch_sampler_matched = paddle.io.BatchSampler(
            dev_dataset_matched, batch_size=args.batch_size, shuffle=False)
        dev_data_loader_matched = DataLoader(
            dataset=dev_dataset_matched,
            batch_sampler=dev_batch_sampler_matched,
            collate_fn=batchify_fn,
            num_workers=0,
            return_list=True)
        dev_batch_sampler_mismatched = paddle.io.BatchSampler(
            dev_dataset_mismatched, batch_size=args.batch_size, shuffle=False)
        dev_data_loader_mismatched = DataLoader(
            dataset=dev_dataset_mismatched,
            batch_sampler=dev_batch_sampler_mismatched,
            collate_fn=batchify_fn,
            num_workers=0,
            return_list=True)
    else:
        dev_dataset = dataset_class.get_datasets(["dev"])
        dev_dataset = dev_dataset.apply(trans_func, lazy=True)
        dev_batch_sampler = paddle.io.BatchSampler(
            dev_dataset, batch_size=args.batch_size, shuffle=False)
        dev_data_loader = DataLoader(
            dataset=dev_dataset,
            batch_sampler=dev_batch_sampler,
            collate_fn=batchify_fn,
            num_workers=0,
            return_list=True)

    # Regression tasks have no label list and use a single output unit.
    num_classes = 1 if train_dataset.get_labels() is None else len(
        train_dataset.get_labels())
    model = model_class.from_pretrained(
        args.model_name_or_path, num_classes=num_classes)
    if paddle.distributed.get_world_size() > 1:
        model = paddle.DataParallel(model)

    num_training_steps = args.max_steps if args.max_steps > 0 else (
        len(train_data_loader) * args.num_train_epochs)
    warmup_steps = args.warmup_steps if args.warmup_steps > 0 else (
        int(math.floor(num_training_steps * args.warmup_proportion)))
    # Linear warmup to the base learning rate, then linear decay to zero.
    lr_scheduler = paddle.optimizer.lr.LambdaDecay(
        args.learning_rate,
        lambda current_step, num_warmup_steps=warmup_steps,
        num_training_steps=num_training_steps: float(
            current_step) / float(max(1, num_warmup_steps))
        if current_step < num_warmup_steps else max(
            0.0,
            float(num_training_steps - current_step) / float(
                max(1, num_training_steps - num_warmup_steps))))

    # Weight decay is skipped for bias and normalization parameters.
    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        beta1=0.9,
        beta2=0.999,
        epsilon=args.adam_epsilon,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in [
            p.name for n, p in model.named_parameters()
            if not any(nd in n for nd in ["bias", "norm"])
        ])

    loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_dataset.get_labels(
    ) else paddle.nn.loss.MSELoss()

    metric = metric_class()

    global_step = 0
    tic_train = time.time()
    for epoch in range(args.num_train_epochs):
        for step, batch in enumerate(train_data_loader):
            global_step += 1
            input_ids, segment_ids, labels = batch
            logits = model(input_ids, segment_ids)
            loss = loss_fct(logits, labels)
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_gradients()
            if global_step % args.logging_steps == 0:
                logger.info(
                    "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s"
                    % (global_step, num_training_steps, epoch, step,
                       paddle.distributed.get_rank(), loss, optimizer.get_lr(),
                       args.logging_steps / (time.time() - tic_train)))
                tic_train = time.time()
            if global_step % args.save_steps == 0:
                tic_eval = time.time()
                if args.task_name == "mnli":
                    evaluate(model, loss_fct, metric, dev_data_loader_matched)
                    evaluate(model, loss_fct, metric,
                             dev_data_loader_mismatched)
                    logger.info("eval done total : %s s" %
                                (time.time() - tic_eval))
                else:
                    evaluate(model, loss_fct, metric, dev_data_loader)
                    logger.info("eval done total : %s s" %
                                (time.time() - tic_eval))
                if (not args.n_gpu > 1) or paddle.distributed.get_rank() == 0:
                    output_dir = os.path.join(
                        args.output_dir, "%s_ft_model_%d.pdparams" %
                        (args.task_name, global_step))
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    # Need better way to get inner model of DataParallel
                    model_to_save = model._layers if isinstance(
                        model, paddle.DataParallel) else model
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)


def print_arguments(args):
    """print arguments"""
    print('----------- Configuration Arguments -----------')
    for arg, value in sorted(vars(args).items()):
        print('%s: %s' % (arg, value))
    print('------------------------------------------------')


if __name__ == "__main__":
    args = parse_args()
    print_arguments(args)
    if args.n_gpu > 1:
        paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
    else:
        do_train(args)
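
The learning-rate lambda handed to `paddle.optimizer.lr.LambdaDecay` in `do_train` is a linear warmup followed by a linear decay to zero. The sketch below restates that multiplier as a plain function so the shape of the schedule is easier to check; the name `linear_warmup_decay` and the sample step counts are illustrative assumptions, not part of the commit.

```python
def linear_warmup_decay(current_step, num_warmup_steps, num_training_steps):
    """Factor multiplied with the base learning rate at `current_step`.

    Hypothetical helper: same arithmetic as the lambda in `do_train`.
    """
    if current_step < num_warmup_steps:
        # Warmup: rise linearly from 0 to 1.
        return float(current_step) / float(max(1, num_warmup_steps))
    # Decay: fall linearly from 1 to 0, clamped at 0 past the last step.
    return max(
        0.0,
        float(num_training_steps - current_step) /
        float(max(1, num_training_steps - num_warmup_steps)))


# Assumed toy setting: 2 warmup steps out of 10 training steps.
factors = [linear_warmup_decay(s, 2, 10) for s in (0, 1, 2, 6, 10)]
print(factors)
```

With these assumed numbers the factor peaks at the end of warmup (step 2) and reaches zero at step 10.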
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import collections
import itertools
import logging
import os
import random
import time
import h5py
from functools import partial
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import paddle
import paddle.distributed as dist
from paddle.io import DataLoader, Dataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import BertForPretraining, BertModel, BertPretrainingCriterion
from paddlenlp.transformers import BertTokenizer
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)

MODEL_CLASSES = {
    "bert": (BertForPretraining, BertTokenizer),
}


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_type",
        default=None,
        type=str,
        required=True,
        help="Model type selected in the list: " +
        ", ".join(MODEL_CLASSES.keys()),
    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        help="Path to pre-trained model or shortcut name selected in the list: "
        + ", ".join(
            sum([
                list(classes[-1].pretrained_init_configuration.keys())
                for classes in MODEL_CLASSES.values()
            ], [])),
    )
    parser.add_argument(
        "--input_dir",
        default=None,
        type=str,
        required=True,
        help="The input directory where the data will be read from.",
    )
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        required=True,
        help=
        "The output directory where the model predictions and checkpoints will be written.",
    )
    parser.add_argument(
        "--max_predictions_per_seq",
        default=80,
        type=int,
        help="The maximum total of masked tokens in input sequence")
    parser.add_argument(
        "--batch_size",
        default=8,
        type=int,
        help="Batch size per GPU/CPU for training.",
    )
    parser.add_argument("--learning_rate",
                        default=5e-5,
                        type=float,
                        help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay",
                        default=0.0,
                        type=float,
                        help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon",
                        default=1e-8,
                        type=float,
                        help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm",
                        default=1.0,
                        type=float,
                        help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs",
        default=3,
        type=int,
        help="Total number of training epochs to perform.",
    )
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help=
        "If > 0: set total number of training steps to perform. Override num_train_epochs.",
    )
    parser.add_argument("--warmup_steps",
                        default=0,
                        type=int,
                        help="Linear warmup over warmup_steps.")
    parser.add_argument("--logging_steps",
                        type=int,
                        default=500,
                        help="Log every X updates steps.")
    parser.add_argument("--save_steps",
                        type=int,
                        default=500,
                        help="Save checkpoint every X updates steps.")
    parser.add_argument("--seed",
                        type=int,
                        default=42,
                        help="random seed for initialization")
    parser.add_argument("--n_gpu",
                        type=int,
                        default=1,
                        help="number of gpus to use, 0 for cpu.")
    args = parser.parse_args()
    return args


def set_seed(args):
    random.seed(args.seed + paddle.distributed.get_rank())
    np.random.seed(args.seed + paddle.distributed.get_rank())
    paddle.seed(args.seed + paddle.distributed.get_rank())


class WorkerInitObj(object):
    def __init__(self, seed):
        self.seed = seed

    def __call__(self, id):
        np.random.seed(seed=self.seed + id)
        random.seed(self.seed + id)


def create_pretraining_dataset(input_file, max_pred_length, shared_list, args,
                               worker_init):
    train_data = PretrainingDataset(input_file=input_file,
                                    max_pred_length=max_pred_length)
    # files have been sharded, no need to dispatch again
    train_batch_sampler = paddle.io.BatchSampler(train_data,
                                                 batch_size=args.batch_size,
                                                 shuffle=True)

    # DataLoader cannot be pickled because of its place.
    # If it can be pickled, use global function instead of lambda and use
    # ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch.
    def _collate_data(data, stack_fn=Stack()):
        num_fields = len(data[0])
        out = [None] * num_fields
        # input_ids, segment_ids, input_mask, masked_lm_positions,
        # masked_lm_labels, next_sentence_labels, mask_token_num
        for i in (0, 1, 2, 5):
            out[i] = stack_fn([x[i] for x in data])
        batch_size, seq_length = out[0].shape
        size = num_mask = sum(len(x[3]) for x in data)
        # Padding for divisibility by 8 for fp16 or int8 usage
        if size % 8 != 0:
            size += 8 - (size % 8)
        # masked_lm_positions
        # Organize as a 1D tensor for gather or use gather_nd
        out[3] = np.full(size, 0, dtype=np.int64)
        # masked_lm_labels
        out[4] = np.full([size, 1], -1, dtype=np.int64)
        mask_token_num = 0
        for i, x in enumerate(data):
            for j, pos in enumerate(x[3]):
                # Offset each position by its sample's start in the flattened batch.
                out[3][mask_token_num] = i * seq_length + pos
                out[4][mask_token_num] = x[4][j]
                mask_token_num += 1
        # mask_token_num
        out.append(np.asarray([mask_token_num], dtype=np.float32))
        return out

    train_data_loader = DataLoader(dataset=train_data,
                                   batch_sampler=train_batch_sampler,
                                   collate_fn=_collate_data,
                                   num_workers=0,
                                   worker_init_fn=worker_init,
                                   return_list=True)
    return train_data_loader, input_file


class PretrainingDataset(Dataset):
    def __init__(self, input_file, max_pred_length):
        self.input_file = input_file
        self.max_pred_length = max_pred_length
        f = h5py.File(input_file, "r")
        keys = [
            'input_ids', 'input_mask', 'segment_ids', 'masked_lm_positions',
            'masked_lm_ids', 'next_sentence_labels'
        ]
        # Load every field fully into memory, then release the file handle.
        self.inputs = [np.asarray(f[key][:]) for key in keys]
        f.close()

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.inputs[0])

    def __getitem__(self, index):
        [
            input_ids, input_mask, segment_ids, masked_lm_positions,
            masked_lm_ids, next_sentence_labels
        ] = [
            input[index].astype(np.int64)
            if indice < 5 else np.asarray(input[index].astype(np.int64))
            for indice, input in enumerate(self.inputs)
        ]
        # TODO: whether to use reversed mask by changing 1s and 0s to be
        # consistent with nv bert
        # Turn the 0/1 mask into an additive mask: 0 for real tokens, -1e9 for padding.
        input_mask = (1 - np.reshape(input_mask.astype(np.float32),
                                     [1, 1, input_mask.shape[0]])) * -1e9

        index = self.max_pred_length
        # Store the number of masked tokens in `index`.
        # numpy.nonzero returns a tuple of index arrays (one per dimension),
        # unlike torch.nonzero, hence the trailing [0].
        padded_mask_indices = (masked_lm_positions == 0).nonzero()[0]
        if len(padded_mask_indices) != 0:
            index = padded_mask_indices[0].item()
            mask_token_num = index
        else:
            index = 0
            mask_token_num = 0
        # masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64)
        # masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index]
        masked_lm_labels = masked_lm_ids[:index]
        masked_lm_positions = masked_lm_positions[:index]
        # softmax_with_cross_entropy requires the last dimension to have size 1
        masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1)
        next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1)

        return [
            input_ids, segment_ids, input_mask, masked_lm_positions,
            masked_lm_labels, next_sentence_labels
        ]


def do_train(args):
    paddle.set_device("gpu" if args.n_gpu else "cpu")
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    set_seed(args)
    worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank())

    args.model_type = args.model_type.lower()
    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)

    # Build the model from its configuration only: weights are randomly initialized.
    model = BertForPretraining(
        BertModel(**model_class.pretrained_init_configuration[
            args.model_name_or_path]))
    criterion = BertPretrainingCriterion(
        getattr(model,
                BertForPretraining.base_model_prefix).config["vocab_size"])
    if paddle.distributed.get_world_size() > 1:
        model = paddle.DataParallel(model)

    # If use default last_epoch, lr of the first iteration is 0.
    # Use `last_epoch = 0` to be consistent with nv bert.
    # NOTE: `train_data_loader` is not defined yet at this point, so the
    # fallback branch below fails; run with `--max_steps` > 0.
    lr_scheduler = paddle.optimizer.lr.LambdaDecay(
        args.learning_rate,
        lambda current_step, num_warmup_steps=args.warmup_steps,
        num_training_steps=args.max_steps if args.max_steps > 0 else
        (len(train_data_loader) * args.num_train_epochs): float(
            current_step) / float(max(1, num_warmup_steps))
        if current_step < num_warmup_steps else max(
            0.0,
            float(num_training_steps - current_step) / float(
                max(1, num_training_steps - num_warmup_steps))),
        last_epoch=0)

    # Weight decay is skipped for bias and normalization parameters.
    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=args.adam_epsilon,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in [
            p.name for n, p in model.named_parameters()
            if not any(nd in n for nd in ["bias", "norm"])
        ])

    # A single background thread prefetches the next data file.
    pool = ThreadPoolExecutor(1)
    global_step = 0
    tic_train = time.time()
    for epoch in range(args.num_train_epochs):
        files = [
            os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)
            if os.path.isfile(os.path.join(args.input_dir, f))
            and "training" in f
        ]
        files.sort()
        num_files = len(files)
        random.Random(args.seed + epoch).shuffle(files)
        f_start_id = 0

        shared_file_list = {}

        if paddle.distributed.get_world_size() > num_files:
            remainder = paddle.distributed.get_world_size() % num_files
            data_file = files[
                (f_start_id * paddle.distributed.get_world_size() +
                 paddle.distributed.get_rank() + remainder * f_start_id) %
                num_files]
        else:
            data_file = files[(f_start_id * paddle.distributed.get_world_size()
                               + paddle.distributed.get_rank()) % num_files]

        previous_file = data_file

        train_data_loader, _ = create_pretraining_dataset(
            data_file, args.max_predictions_per_seq, shared_file_list, args,
            worker_init)

        # TODO(guosheng): better way to process single file
        single_file = True if f_start_id + 1 == len(files) else False

        for f_id in range(f_start_id, len(files)):
            if not single_file and f_id == f_start_id:
                continue
            if paddle.distributed.get_world_size() > num_files:
                data_file = files[(f_id * paddle.distributed.get_world_size() +
                                   paddle.distributed.get_rank() +
                                   remainder * f_id) % num_files]
            else:
                data_file = files[(f_id * paddle.distributed.get_world_size() +
                                   paddle.distributed.get_rank()) % num_files]

            previous_file = data_file
            # Load the next file in the background while training on the current one.
            dataset_future = pool.submit(create_pretraining_dataset, data_file,
                                         args.max_predictions_per_seq,
                                         shared_file_list, args, worker_init)
            for step, batch in enumerate(train_data_loader):
                global_step += 1
                (input_ids, segment_ids, input_mask, masked_lm_positions,
                 masked_lm_labels, next_sentence_labels,
                 masked_lm_scale) = batch
                prediction_scores, seq_relationship_score = model(
                    input_ids=input_ids,
                    token_type_ids=segment_ids,
                    attention_mask=input_mask,
                    masked_positions=masked_lm_positions)
                loss = criterion(prediction_scores, seq_relationship_score,
                                 masked_lm_labels, next_sentence_labels,
                                 masked_lm_scale)
                if global_step % args.logging_steps == 0:
                    if (not args.n_gpu > 1
                        ) or paddle.distributed.get_rank() == 0:
                        logger.info(
                            "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
                            % (global_step, epoch, step, loss,
                               args.logging_steps / (time.time() - tic_train)))
                    tic_train = time.time()
                loss.backward()
                optimizer.step()
                lr_scheduler.step()
                optimizer.clear_gradients()
                if global_step % args.save_steps == 0:
                    if (not args.n_gpu > 1
                        ) or paddle.distributed.get_rank() == 0:
                        output_dir = os.path.join(args.output_dir,
                                                  "model_%d" % global_step)
                        if not os.path.exists(output_dir):
                            os.makedirs(output_dir)
                        # need better way to get inner model of DataParallel
                        model_to_save = model._layers if isinstance(
                            model, paddle.DataParallel) else model
                        model_to_save.save_pretrained(output_dir)
                        tokenizer.save_pretrained(output_dir)
                        paddle.save(
                            optimizer.state_dict(),
                            os.path.join(output_dir, "model_state.pdopt"))
                # Stop only when a positive step budget has been reached.
                if args.max_steps > 0 and global_step >= args.max_steps:
                    del train_data_loader
                    return

            del train_data_loader
            train_data_loader, data_file = dataset_future.result(timeout=None)


if __name__ == "__main__":
    args = parse_args()
    if args.n_gpu > 1:
        paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
    else:
        do_train(args)
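
`_collate_data` in the pre-training script flattens each sample's masked positions into one 1D index over the whole batch (`i * seq_length + pos`) and pads the total count up to a multiple of 8. The sketch below reproduces only that bookkeeping with plain Python lists in place of the NumPy arrays the script uses; the helper name `flatten_masked_positions` and the toy inputs are assumptions made for illustration.

```python
def flatten_masked_positions(positions_per_sample, labels_per_sample,
                             seq_length):
    """Hypothetical helper mirroring the masked-token loop in `_collate_data`."""
    size = sum(len(p) for p in positions_per_sample)
    # Pad the total number of slots up to a multiple of 8.
    if size % 8 != 0:
        size += 8 - (size % 8)
    flat_positions = [0] * size   # unused slots keep position 0
    flat_labels = [-1] * size     # unused slots keep label -1
    num_masked = 0
    samples = zip(positions_per_sample, labels_per_sample)
    for i, (positions, labels) in enumerate(samples):
        for pos, label in zip(positions, labels):
            # Offset by the sample's start in the flattened [batch * seq] axis.
            flat_positions[num_masked] = i * seq_length + pos
            flat_labels[num_masked] = label
            num_masked += 1
    return flat_positions, flat_labels, num_masked


# Assumed toy batch: two samples of length 4 with 2 and 1 masked tokens.
positions, labels, num_masked = flatten_masked_positions(
    [[1, 3], [2]], [[11, 13], [22]], seq_length=4)
print(positions, labels, num_masked)
```

In the script, the count of real masked tokens is appended to the batch as a float and passed to the criterion as `masked_lm_scale`.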