Commit e92d5d40
Authored Dec 15, 2020 by Zeyu Chen
remove examples/bert, unifiy to benchmark/bert
Parent: b84cace2
Showing 3 changed files with 0 additions and 852 deletions (+0 -852):
PaddleNLP/examples/bert/README.md (+0 -90)
PaddleNLP/examples/bert/run_glue.py (+0 -356)
PaddleNLP/examples/bert/run_pretrain.py (+0 -406)
PaddleNLP/examples/bert/README.md (deleted, 100644 → 0)
# BERT with PaddleNLP

[BERT](https://arxiv.org/abs/1810.04805) is a general-purpose semantic representation model with strong transfer ability. It uses the [Transformer](https://arxiv.org/abs/1706.03762) as its basic network component and is pre-trained with the bidirectional `Masked Language Model` and `Next Sentence Prediction` objectives to learn general semantic representations. Combined with a simple output layer, these representations are applied to downstream NLP tasks and have achieved SOTA results on many of them. This project is an open-source implementation of BERT on Paddle 2.0.
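For orientation, here is a minimal sketch of loading the pre-trained model and tokenizer through `paddlenlp.transformers`, the same calls that `run_glue.py` below relies on; the two-class setting is an assumption matching a binary task such as SST-2, not something this README specifies.

```python
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer

# Download (or load from a local directory) the pre-trained weights and the
# tokenizer that was used during pre-training; "bert-base-uncased" is one of
# the built-in shortcut names.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_classes=2)  # num_classes=2 assumes SST-2-style binary labels
```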
### Release Notes

1) Dygraph BERT model with Fine-tuning support, verified on the GLUE SST-2 task.

2) Pre-training support.

## Fine-tuning on NLP Tasks

Once BERT pre-training is complete, the pre-trained parameters can be fine-tuned on a specific NLP task. The following uses the released pre-trained model to show how to run Fine-tuning for a classification task.

### Single-Sentence and Sentence-Pair Classification

Taking the GLUE/SST-2 task as an example, Fine-tuning is launched as follows (`paddlenlp` must be installed or discoverable on `PYTHONPATH`):

```shell
export CUDA_VISIBLE_DEVICES=0,1
export TASK_NAME=SST-2

python -u ./run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --max_seq_length 128 \
    --batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 3 \
    --logging_steps 1 \
    --save_steps 500 \
    --output_dir ./tmp/$TASK_NAME/ \
    --n_gpu 1
```
The arguments are as follows:

- `model_type`: the model type; currently only BERT is supported.
- `model_name_or_path`: a model of a particular configuration, together with its pre-trained weights and the tokenizer used during pre-training. If the model files are stored locally, a directory path can be given instead.
- `task_name`: the task to fine-tune on.
- `max_seq_length`: the maximum sentence length; longer sequences are truncated.
- `batch_size`: the number of samples **per card** per iteration.
- `learning_rate`: the base learning rate; it is multiplied by the value produced by the learning rate scheduler to obtain the current learning rate (see the sketch after this list).
- `num_train_epochs`: the number of training epochs.
- `logging_steps`: the logging interval.
- `save_steps`: the interval for saving and evaluating the model.
- `output_dir`: the directory where the model is saved.
- `n_gpu`: the number of GPU cards to use. Set it to the desired number for multi-card training; 0 means CPU.
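As a concrete reading of the `learning_rate` description, here is a minimal sketch of the warmup-then-linear-decay factor that `run_glue.py` passes to `paddle.optimizer.lr.LambdaDecay`; the step counts below are illustrative assumptions, not values taken from this README.

```python
import paddle

base_lr = 2e-5           # --learning_rate
warmup_steps = 0         # --warmup_steps
training_steps = 6000    # assumed: len(train_data_loader) * --num_train_epochs

def lr_factor(step, warmup=warmup_steps, total=training_steps):
    # Linear warmup to 1.0, then linear decay towards 0.0. The scheduler's
    # output is multiplied by the base learning rate at every step.
    if step < warmup:
        return float(step) / float(max(1, warmup))
    return max(0.0, float(total - step) / float(max(1, total - warmup)))

lr_scheduler = paddle.optimizer.lr.LambdaDecay(base_lr, lr_factor)
```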
According to the `logging_steps` and `save_steps` settings, training prints logs like the following:
```
global step 996, epoch: 0, batch: 996, loss: 0.248909, speed: 5.07 step/s
global step 997, epoch: 0, batch: 997, loss: 0.113216, speed: 4.53 step/s
global step 998, epoch: 0, batch: 998, loss: 0.218075, speed: 4.55 step/s
global step 999, epoch: 0, batch: 999, loss: 0.133626, speed: 4.51 step/s
global step 1000, epoch: 0, batch: 1000, loss: 0.187652, speed: 4.45 step/s
eval loss: 0.083172, accu: 0.920872
```
Running the command above for single-card Fine-tuning gives the following results on the dev set:
| Task | Metric | Result |
|-------|------------------------------|-------------|
| SST-2 | Accuracy | 92.88 |
| QNLI | Accuracy | 91.67 |
## Pre-training
```shell
export CUDA_VISIBLE_DEVICES=0,1
export DATA_DIR=/guosheng/nv-bert/DeepLearningExamples/PyTorch/LanguageModeling/BERT/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/

python -u ./run_pretrain.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --max_predictions_per_seq 20 \
    --batch_size 32 \
    --learning_rate 1e-4 \
    --weight_decay 1e-2 \
    --adam_epsilon 1e-6 \
    --warmup_steps 10000 \
    --num_train_epochs 1e5 \
    --input_dir $DATA_DIR \
    --output_dir ./tmp2/ \
    --logging_steps 1 \
    --save_steps 20000 \
    --max_steps 1000000 \
    --n_gpu 2
```
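For reference, the pre-training data under `DATA_DIR` consists of HDF5 shards; the following is a minimal sketch of inspecting one shard using the field names that `PretrainingDataset` in `run_pretrain.py` expects. The shard file name here is a hypothetical example.

```python
import h5py

# run_pretrain.py picks up every file in --input_dir whose name contains
# "training"; the exact name below is only an assumption for illustration.
with h5py.File("wikicorpus_en_training_0.hdf5", "r") as f:
    for key in ("input_ids", "input_mask", "segment_ids",
                "masked_lm_positions", "masked_lm_ids",
                "next_sentence_labels"):
        print(key, f[key].shape)
```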
PaddleNLP/examples/bert/run_glue.py (deleted, 100644 → 0)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import logging
import os
import random
import time
from functools import partial

import numpy as np
import paddle
from paddle.io import DataLoader

from paddlenlp.datasets import GlueQNLI, GlueSST2
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.data.sampler import SamplerHelper
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
from paddlenlp.utils.log import logger

TASK_CLASSES = {
    "qnli": (GlueQNLI, paddle.metric.Accuracy),  # (dataset, metric)
    "sst-2": (GlueSST2, paddle.metric.Accuracy),
}

MODEL_CLASSES = {
    "bert": (BertForSequenceClassification, BertTokenizer),
}


def parse_args():
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument(
        "--task_name",
        default=None,
        type=str,
        required=True,
        help="The name of the task to train selected in the list: " +
        ", ".join(TASK_CLASSES.keys()))
    parser.add_argument(
        "--model_type",
        default=None,
        type=str,
        required=True,
        help="Model type selected in the list: " +
        ", ".join(MODEL_CLASSES.keys()))
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        help="Path to pre-trained model or shortcut name selected in the list: "
        + ", ".join(
            sum([
                list(classes[-1].pretrained_init_configuration.keys())
                for classes in MODEL_CLASSES.values()
            ], [])))
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        required=True,
        help="The output directory where the model predictions and checkpoints will be written.")
    parser.add_argument(
        "--max_seq_length",
        default=128,
        type=int,
        help="The maximum total input sequence length after tokenization. Sequences longer "
        "than this will be truncated, sequences shorter will be padded.")
    parser.add_argument(
        "--batch_size",
        default=8,
        type=int,
        help="Batch size per GPU/CPU for training.")
    parser.add_argument(
        "--learning_rate",
        default=5e-5,
        type=float,
        help="The initial learning rate for Adam.")
    parser.add_argument(
        "--weight_decay",
        default=0.0,
        type=float,
        help="Weight decay if we apply some.")
    parser.add_argument(
        "--adam_epsilon",
        default=1e-8,
        type=float,
        help="Epsilon for Adam optimizer.")
    parser.add_argument(
        "--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs",
        default=3,
        type=int,
        help="Total number of training epochs to perform.")
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
    parser.add_argument(
        "--warmup_steps",
        default=0,
        type=int,
        help="Linear warmup over warmup_steps.")
    parser.add_argument(
        "--logging_steps",
        type=int,
        default=500,
        help="Log every X updates steps.")
    parser.add_argument(
        "--save_steps",
        type=int,
        default=500,
        help="Save checkpoint every X updates steps.")
    parser.add_argument(
        "--seed", type=int, default=42, help="random seed for initialization")
    parser.add_argument(
        "--eager_run", type=eval, default=True, help="Use dygraph mode.")
    parser.add_argument(
        "--n_gpu",
        type=int,
        default=1,
        help="number of gpus to use, 0 for cpu.")
    args = parser.parse_args()
    return args


def set_seed(args):
    random.seed(args.seed + paddle.distributed.get_rank())
    np.random.seed(args.seed + paddle.distributed.get_rank())
    paddle.seed(args.seed + paddle.distributed.get_rank())


def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    for batch in data_loader:
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %f, accu: %f" % (loss.numpy(), accu))
    model.train()


def convert_example(example,
                    tokenizer,
                    label_list,
                    max_seq_length=512,
                    is_test=False):
    """convert a glue example into necessary features"""

    def _truncate_seqs(seqs, max_seq_length):
        if len(seqs) == 1:  # single sentence
            # Account for [CLS] and [SEP] with "- 2"
            seqs[0] = seqs[0][0:(max_seq_length - 2)]
        else:  # sentence pair
            # Account for [CLS], [SEP], [SEP] with "- 3"
            tokens_a, tokens_b = seqs
            max_seq_length -= 3
            while True:  # truncate with longest_first strategy
                total_length = len(tokens_a) + len(tokens_b)
                if total_length <= max_seq_length:
                    break
                if len(tokens_a) > len(tokens_b):
                    tokens_a.pop()
                else:
                    tokens_b.pop()
        return seqs

    def _concat_seqs(seqs, separators, seq_mask=0, separator_mask=1):
        concat = sum((seq + sep for sep, seq in zip(separators, seqs)), [])
        segment_ids = sum(
            ([i] * (len(seq) + len(sep))
             for i, (sep, seq) in enumerate(zip(separators, seqs))), [])
        if isinstance(seq_mask, int):
            seq_mask = [[seq_mask] * len(seq) for seq in seqs]
        if isinstance(separator_mask, int):
            separator_mask = [[separator_mask] * len(sep)
                              for sep in separators]
        p_mask = sum((s_mask + mask
                      for sep, seq, s_mask, mask in zip(
                          separators, seqs, seq_mask, separator_mask)), [])
        return concat, segment_ids, p_mask

    if not is_test:
        # `label_list == None` is for regression task
        label_dtype = "int64" if label_list else "float32"
        # get the label
        label = example[-1]
        example = example[:-1]
        # create label maps if classification task
        if label_list:
            label_map = {}
            for (i, l) in enumerate(label_list):
                label_map[l] = i
            label = label_map[label]
        label = np.array([label], dtype=label_dtype)

    # tokenize raw text
    tokens_raw = [tokenizer(l) for l in example]
    # truncate to the truncate_length,
    tokens_trun = _truncate_seqs(tokens_raw, max_seq_length)
    # concate the sequences with special tokens
    tokens_trun[0] = [tokenizer.cls_token] + tokens_trun[0]
    tokens, segment_ids, _ = _concat_seqs(
        tokens_trun, [[tokenizer.sep_token]] * len(tokens_trun))
    # convert the token to ids
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    valid_length = len(input_ids)
    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    # input_mask = [1] * len(input_ids)
    if not is_test:
        return input_ids, segment_ids, valid_length, label
    else:
        return input_ids, segment_ids, valid_length


def do_train(args):
    paddle.enable_static() if not args.eager_run else None
    paddle.set_device("gpu" if args.n_gpu else "cpu")
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    set_seed(args)

    args.task_name = args.task_name.lower()
    dataset_class, metric_class = TASK_CLASSES[args.task_name]
    args.model_type = args.model_type.lower()
    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    train_ds, dev_ds = dataset_class.get_datasets(['train', 'dev'])
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
    trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        label_list=train_ds.get_labels(),
        max_seq_length=args.max_seq_length)
    train_ds = train_ds.apply(trans_func, lazy=True)
    # train_batch_sampler = SamplerHelper(train_ds).shuffle().batch(
    #     batch_size=args.batch_size).shard()
    train_batch_sampler = paddle.io.DistributedBatchSampler(
        train_ds, batch_size=args.batch_size, shuffle=True)
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # segment
        Stack(),  # length
        Stack(dtype="int64" if train_ds.get_labels() else "float32")  # label
    ): [data for i, data in enumerate(fn(samples)) if i != 2]
    train_data_loader = DataLoader(
        dataset=train_ds,
        batch_sampler=train_batch_sampler,
        collate_fn=batchify_fn,
        num_workers=0,
        return_list=True)
    dev_ds = dev_ds.apply(trans_func, lazy=True)
    # dev_batch_sampler = SamplerHelper(dev_ds).batch(
    #     batch_size=args.batch_size)
    dev_batch_sampler = paddle.io.BatchSampler(
        dev_ds, batch_size=args.batch_size, shuffle=False)
    dev_data_loader = DataLoader(
        dataset=dev_ds,
        batch_sampler=dev_batch_sampler,
        collate_fn=batchify_fn,
        num_workers=0,
        return_list=True)

    model = model_class.from_pretrained(
        args.model_name_or_path, num_classes=len(train_ds.get_labels()))
    if paddle.distributed.get_world_size() > 1:
        model = paddle.DataParallel(model)

    lr_scheduler = paddle.optimizer.lr.LambdaDecay(
        args.learning_rate,
        lambda current_step, num_warmup_steps=args.warmup_steps,
        num_training_steps=args.max_steps if args.max_steps > 0 else
        (len(train_data_loader) * args.num_train_epochs): float(
            current_step) / float(max(1, num_warmup_steps))
        if current_step < num_warmup_steps else max(
            0.0,
            float(num_training_steps - current_step) / float(
                max(1, num_training_steps - num_warmup_steps))))

    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=args.adam_epsilon,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in [
            p.name for n, p in model.named_parameters()
            if not any(nd in n for nd in ["bias", "norm"])
        ])

    criterion = paddle.nn.loss.CrossEntropyLoss() if train_ds.get_labels(
    ) else paddle.nn.loss.MSELoss()
    metric = metric_class()

    global_step = 0
    tic_train = time.time()
    for epoch in range(args.num_train_epochs):
        for step, batch in enumerate(train_data_loader):
            global_step += 1
            input_ids, segment_ids, labels = batch
            logits = model(input_ids, segment_ids)
            loss = criterion(logits, labels)
            if global_step % args.logging_steps == 0:
                if (not args.n_gpu > 1) or paddle.distributed.get_rank() == 0:
                    logger.info(
                        "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
                        % (global_step, epoch, step, loss,
                           args.logging_steps / (time.time() - tic_train)))
                    tic_train = time.time()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_gradients()
            if global_step % args.save_steps == 0:
                evaluate(model, criterion, metric, dev_data_loader)
                if (not args.n_gpu > 1) or paddle.distributed.get_rank() == 0:
                    output_dir = os.path.join(args.output_dir,
                                              "model_%d" % global_step)
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    # need better way to get inner model of DataParallel
                    model_to_save = model._layers if isinstance(
                        model, paddle.DataParallel) else model
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)


if __name__ == "__main__":
    args = parse_args()
    if args.n_gpu > 1:
        paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
    else:
        do_train(args)
PaddleNLP/examples/bert/run_pretrain.py (deleted, 100644 → 0)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import collections
import itertools
import logging
import os
import random
import time
import h5py
from functools import partial
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import paddle
import paddle.distributed as dist
from paddle.io import DataLoader, Dataset

from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import BertForPretraining, BertModel, BertPretrainingCriterion
from paddlenlp.transformers import BertTokenizer
from paddlenlp.utils.log import logger

MODEL_CLASSES = {
    "bert": (BertForPretraining, BertTokenizer),
}


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_type",
        default=None,
        type=str,
        required=True,
        help="Model type selected in the list: " +
        ", ".join(MODEL_CLASSES.keys()))
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        help="Path to pre-trained model or shortcut name selected in the list: "
        + ", ".join(
            sum([
                list(classes[-1].pretrained_init_configuration.keys())
                for classes in MODEL_CLASSES.values()
            ], [])))
    parser.add_argument(
        "--input_dir",
        default=None,
        type=str,
        required=True,
        help="The input directory where the data will be read from.")
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        required=True,
        help="The output directory where the model predictions and checkpoints will be written.")
    parser.add_argument(
        "--max_predictions_per_seq",
        default=80,
        type=int,
        help="The maximum total of masked tokens in input sequence")
    parser.add_argument(
        "--batch_size",
        default=8,
        type=int,
        help="Batch size per GPU/CPU for training.")
    parser.add_argument(
        "--learning_rate",
        default=5e-5,
        type=float,
        help="The initial learning rate for Adam.")
    parser.add_argument(
        "--weight_decay",
        default=0.0,
        type=float,
        help="Weight decay if we apply some.")
    parser.add_argument(
        "--adam_epsilon",
        default=1e-8,
        type=float,
        help="Epsilon for Adam optimizer.")
    parser.add_argument(
        "--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs",
        default=3,
        type=int,
        help="Total number of training epochs to perform.")
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
    parser.add_argument(
        "--warmup_steps",
        default=0,
        type=int,
        help="Linear warmup over warmup_steps.")
    parser.add_argument(
        "--logging_steps",
        type=int,
        default=500,
        help="Log every X updates steps.")
    parser.add_argument(
        "--save_steps",
        type=int,
        default=500,
        help="Save checkpoint every X updates steps.")
    parser.add_argument(
        "--seed", type=int, default=42, help="random seed for initialization")
    parser.add_argument(
        "--eager_run", type=eval, default=True, help="Use dygraph mode.")
    parser.add_argument(
        "--n_gpu",
        type=int,
        default=1,
        help="number of gpus to use, 0 for cpu.")
    args = parser.parse_args()
    return args


def set_seed(args):
    random.seed(args.seed + paddle.distributed.get_rank())
    np.random.seed(args.seed + paddle.distributed.get_rank())
    paddle.seed(args.seed + paddle.distributed.get_rank())


class WorkerInitObj(object):
    def __init__(self, seed):
        self.seed = seed

    def __call__(self, id):
        np.random.seed(seed=self.seed + id)
        random.seed(self.seed + id)


def create_pretraining_dataset(input_file, max_pred_length, shared_list, args,
                               worker_init):
    train_data = PretrainingDataset(
        input_file=input_file, max_pred_length=max_pred_length)
    # files have been sharded, no need to dispatch again
    train_batch_sampler = paddle.io.BatchSampler(
        train_data, batch_size=args.batch_size, shuffle=True)

    # DataLoader cannot be pickled because of its place.
    # If it can be pickled, use global function instead of lambda and use
    # ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch.
    def _collate_data(data, stack_fn=Stack()):
        num_fields = len(data[0])
        out = [None] * num_fields
        # input_ids, segment_ids, input_mask, masked_lm_positions,
        # masked_lm_labels, next_sentence_labels, mask_token_num
        for i in (0, 1, 2, 5):
            out[i] = stack_fn([x[i] for x in data])
        batch_size, seq_length = out[0].shape
        size = num_mask = sum(len(x[3]) for x in data)
        # Padding for divisibility by 8 for fp16 or int8 usage
        if size % 8 != 0:
            size += 8 - (size % 8)
        # masked_lm_positions
        # Organize as a 1D tensor for gather or use gather_nd
        out[3] = np.full(size, 0, dtype=np.int64)
        # masked_lm_labels
        out[4] = np.full([size, 1], -1, dtype=np.int64)
        mask_token_num = 0
        for i, x in enumerate(data):
            for j, pos in enumerate(x[3]):
                out[3][mask_token_num] = i * seq_length + pos
                out[4][mask_token_num] = x[4][j]
                mask_token_num += 1
        # mask_token_num
        out.append(np.asarray([mask_token_num], dtype=np.float32))
        return out

    train_data_loader = DataLoader(
        dataset=train_data,
        batch_sampler=train_batch_sampler,
        collate_fn=_collate_data,
        num_workers=0,
        worker_init_fn=worker_init,
        return_list=True)
    return train_data_loader, input_file


class PretrainingDataset(Dataset):
    def __init__(self, input_file, max_pred_length):
        self.input_file = input_file
        self.max_pred_length = max_pred_length
        f = h5py.File(input_file, "r")
        keys = [
            'input_ids', 'input_mask', 'segment_ids', 'masked_lm_positions',
            'masked_lm_ids', 'next_sentence_labels'
        ]
        self.inputs = [np.asarray(f[key][:]) for key in keys]
        f.close()

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.inputs[0])

    def __getitem__(self, index):
        [
            input_ids, input_mask, segment_ids, masked_lm_positions,
            masked_lm_ids, next_sentence_labels
        ] = [
            input[index].astype(np.int64)
            if indice < 5 else np.asarray(input[index].astype(np.int64))
            for indice, input in enumerate(self.inputs)
        ]
        # TODO: whether to use reversed mask by changing 1s and 0s to be
        # consistent with nv bert
        input_mask = (1 - np.reshape(
            input_mask.astype(np.float32),
            [1, 1, input_mask.shape[0]])) * -1e9

        index = self.max_pred_length
        # store number of masked tokens in index
        # outputs of torch.nonzero diff with that of numpy.nonzero by zip
        padded_mask_indices = (masked_lm_positions == 0).nonzero()[0]
        if len(padded_mask_indices) != 0:
            index = padded_mask_indices[0].item()
            mask_token_num = index
        else:
            index = 0
            mask_token_num = 0
        # masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64)
        # masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index]
        masked_lm_labels = masked_lm_ids[:index]
        masked_lm_positions = masked_lm_positions[:index]
        # softmax_with_cross_entropy enforce last dim size equal 1
        masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1)
        next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1)

        return [
            input_ids, segment_ids, input_mask, masked_lm_positions,
            masked_lm_labels, next_sentence_labels
        ]


def do_train(args):
    paddle.enable_static() if not args.eager_run else None
    paddle.set_device("gpu" if args.n_gpu else "cpu")
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    set_seed(args)
    worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank())

    args.model_type = args.model_type.lower()
    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)

    model = BertForPretraining(
        BertModel(**model_class.pretrained_init_configuration[
            args.model_name_or_path]))
    criterion = BertPretrainingCriterion(
        getattr(model,
                BertForPretraining.base_model_prefix).config["vocab_size"])
    if paddle.distributed.get_world_size() > 1:
        model = paddle.DataParallel(model)

    # If use defalut last_epoch, lr of the first iteration is 0.
    # Use `last_epoch = 0` to be consistent with nv bert.
    lr_scheduler = paddle.optimizer.lr.LambdaDecay(
        args.learning_rate,
        lambda current_step, num_warmup_steps=args.warmup_steps,
        num_training_steps=args.max_steps if args.max_steps > 0 else
        (len(train_data_loader) * args.num_train_epochs): float(
            current_step) / float(max(1, num_warmup_steps))
        if current_step < num_warmup_steps else max(
            0.0,
            float(num_training_steps - current_step) / float(
                max(1, num_training_steps - num_warmup_steps))),
        last_epoch=0)

    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=args.adam_epsilon,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in [
            p.name for n, p in model.named_parameters()
            if not any(nd in n for nd in ["bias", "norm"])
        ])

    pool = ThreadPoolExecutor(1)
    global_step = 0
    tic_train = time.time()
    for epoch in range(args.num_train_epochs):
        files = [
            os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)
            if os.path.isfile(os.path.join(args.input_dir, f)) and "training"
            in f
        ]
        files.sort()
        num_files = len(files)
        random.Random(args.seed + epoch).shuffle(files)
        f_start_id = 0

        shared_file_list = {}

        if paddle.distributed.get_world_size() > num_files:
            remainder = paddle.distributed.get_world_size() % num_files
            data_file = files[(
                f_start_id * paddle.distributed.get_world_size() +
                paddle.distributed.get_rank() + remainder * f_start_id) %
                              num_files]
        else:
            data_file = files[(f_start_id *
                               paddle.distributed.get_world_size() +
                               paddle.distributed.get_rank()) % num_files]

        previous_file = data_file

        train_data_loader, _ = create_pretraining_dataset(
            data_file, args.max_predictions_per_seq, shared_file_list, args,
            worker_init)

        for f_id in range(f_start_id + 1, len(files)):
            if paddle.distributed.get_world_size() > num_files:
                data_file = files[(
                    f_id * paddle.distributed.get_world_size() +
                    paddle.distributed.get_rank() + remainder * f_id) %
                                  num_files]
            else:
                data_file = files[(f_id *
                                   paddle.distributed.get_world_size() +
                                   paddle.distributed.get_rank()) % num_files]

            previous_file = data_file
            dataset_future = pool.submit(create_pretraining_dataset, data_file,
                                         args.max_predictions_per_seq,
                                         shared_file_list, args, worker_init)
            for step, batch in enumerate(train_data_loader):
                global_step += 1
                (input_ids, segment_ids, input_mask, masked_lm_positions,
                 masked_lm_labels, next_sentence_labels,
                 masked_lm_scale) = batch
                prediction_scores, seq_relationship_score = model(
                    input_ids=input_ids,
                    token_type_ids=segment_ids,
                    attention_mask=input_mask,
                    masked_positions=masked_lm_positions)
                loss = criterion(prediction_scores, seq_relationship_score,
                                 masked_lm_labels, next_sentence_labels,
                                 masked_lm_scale)
                if global_step % args.logging_steps == 0:
                    if (not args.n_gpu > 1
                        ) or paddle.distributed.get_rank() == 0:
                        logger.info(
                            "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
                            % (global_step, epoch, step, loss,
                               args.logging_steps / (time.time() - tic_train)))
                        tic_train = time.time()
                loss.backward()
                optimizer.step()
                lr_scheduler.step()
                optimizer.clear_gradients()
                if global_step % args.save_steps == 0:
                    if (not args.n_gpu > 1
                        ) or paddle.distributed.get_rank() == 0:
                        output_dir = os.path.join(args.output_dir,
                                                  "model_%d" % global_step)
                        if not os.path.exists(output_dir):
                            os.makedirs(output_dir)
                        # need better way to get inner model of DataParallel
                        model_to_save = model._layers if isinstance(
                            model, paddle.DataParallel) else model
                        model_to_save.save_pretrained(output_dir)
                        tokenizer.save_pretrained(output_dir)
                        paddle.save(
                            optimizer.state_dict(),
                            os.path.join(output_dir, "model_state.pdopt"))
                if global_step >= args.max_steps:
                    del train_data_loader
                    return

            del train_data_loader
            train_data_loader, data_file = dataset_future.result(timeout=None)


if __name__ == "__main__":
    args = parse_args()
    if args.n_gpu > 1:
        paddle.distributed.spawn(do_train, args=(args, ), nprocs=args.n_gpu)
    else:
        do_train(args)