Commit 84f481c0 (unverified)
Author: Zhong Hui, Jan 26, 2021
Committer: GitHub, Jan 26, 2021
Parent: b1fcba33

add gpt2 model for the paddlenlp

Showing 13 changed files with 1920 additions and 0 deletions (+1920, -0)
PaddleNLP/examples/language_model/gpt2/README.md              +151  -0
PaddleNLP/examples/language_model/gpt2/data.py                +263  -0
PaddleNLP/examples/language_model/gpt2/decompress.sh          +18   -0
PaddleNLP/examples/language_model/gpt2/generate_sample.py     +81   -0
PaddleNLP/examples/language_model/gpt2/lr.py                  +49   -0
PaddleNLP/examples/language_model/gpt2/process_data.py        +92   -0
PaddleNLP/examples/language_model/gpt2/run_pretrain.py        +281  -0
PaddleNLP/examples/language_model/gpt2/scripts/run.sh         +13   -0
PaddleNLP/examples/language_model/gpt2/scripts/run_multi.sh   +13   -0
PaddleNLP/paddlenlp/transformers/__init__.py                  +2    -0
PaddleNLP/paddlenlp/transformers/gpt2/__init__.py             +2    -0
PaddleNLP/paddlenlp/transformers/gpt2/modeling.py             +608  -0
PaddleNLP/paddlenlp/transformers/gpt2/tokenizer.py            +347  -0
PaddleNLP/examples/language_model/gpt2/README.md (new file, mode 100644)

# GPT2

## Model Introduction

[GPT2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (Language Models are Unsupervised Multitask Learners) uses the [Transformer](https://arxiv.org/abs/1706.03762) decoder as its basic building block and is pre-trained autoregressively on large-scale unlabeled text corpora, yielding a language generation model.

This project is the PaddlePaddle implementation of the GPT2 language model and covers model training, prediction, and more. Below is a brief directory layout and description of this example:

```text
.
├── data.py            # data handling
├── decompress.sh      # dataset decompression script
├── generate_sample.py # inference demo
├── lr.py              # learning-rate schedule
├── process_data.py    # data preprocessing script
├── README.md          # documentation
├── run_pretrain.py    # pre-training entry point
└── scripts            # training scripts
```

## Quick Start

### Installation

1. Install PaddlePaddle

   This project requires PaddlePaddle 2.0rc1 or later (or a suitable develop build). Please follow the [installation guide](https://www.paddlepaddle.org.cn/install/quick).

2. Download the code

   Clone this repository to your local machine.

3. Environment dependencies

   The model runs on PaddlePaddle. For environment dependencies, please refer first to the corresponding section of the PaddlePaddle [installation notes](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/index_cn.html).

### Data Preparation

#### Obtaining the raw data

[OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/) is an open-source English web-text dataset collected from Reddit. After deduplication, cleaning, and extraction it contains more than 8 million documents.

After downloading, decompress it with:

```shell
xz -d openwebtext.tar.xz
tar xf openwebtext.tar
mkdir raw_data
bash decompress.sh
```

The resulting `raw_data` directory is roughly 54GB.

#### Data preprocessing

To speed up training, the text is converted to the corresponding token ids beforehand and saved in npz format:

```shell
python process_data.py --input_path raw_data \
    --model_name gpt2-medium-en \
    --append_eod \
    --workers 8
```

Running this command produces the file `raw_data_ids.npz`. To make it easy to run and test the model, a preprocessed 300M training sample is also provided:

```shell
wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt2/train.data.json_ids.npz
```

Put all preprocessed npz files into a single directory for training:

```
mkdir data
mv train.data.json_ids.npz data
```
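
To sanity-check a preprocessed file before training, the short sketch below (an illustration added here, not part of the original example) loads the two arrays that `process_data.py` writes: `ids`, the concatenated token ids of all documents, and `lens`, the per-document token counts.

```python
import numpy as np

data = np.load("data/train.data.json_ids.npz")
ids, lens = data["ids"], data["lens"]
print(ids.shape, ids.dtype)    # all token ids, concatenated end to end
print(lens.shape, lens.sum())  # one length per document; the sum equals len(ids)
```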
### Single-GPU Training

```shell
CUDA_VISIBLE_DEVICES=0 python run_pretrain.py --model_name_or_path gpt2-small-en \
    --input_dir "./data" \
    --output_dir "output" \
    --weight_decay 0.01 \
    --grad_clip 1.0 \
    --max_steps 500000 \
    --save_steps 100000 \
    --warmup_rate 0.01 \
    --batch_size 8 \
    --device gpu
```

The arguments are:

- `model_name_or_path` the model to train, or a previously trained checkpoint.
- `input_dir` the input files; a directory may be given, in which case all files in it are used.
- `output_dir` the output directory.
- `weight_decay` weight-decay coefficient.
- `grad_clip` gradient-clipping range.
- `max_steps` maximum number of training steps.
- `save_steps` interval, in steps, at which the model is saved.
- `batch_size` training batch size.
- `device` training device.

You can also start training directly with the provided shell script: `sh scripts/run.sh`.
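
Note on warmup (an added remark, derived from the code in this commit): in `run_pretrain.py` the number of warmup steps is computed as `warmup_step = warmup_rate * decay_steps`. With the command above (`--warmup_rate 0.01`) and the default `--decay_steps` of 360000, the learning rate ramps up linearly over the first 0.01 × 360000 = 3600 steps before the cosine decay defined in `lr.py` takes over.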
### Single-Node Multi-GPU Training

Likewise, eight-GPU training can be launched with:

```shell
unset CUDA_VISIBLE_DEVICES
python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py --model_name_or_path gpt2-small-en \
    --input_dir "./data" \
    --output_dir "output" \
    --weight_decay 0.01 \
    --grad_clip 1.0 \
    --max_steps 500000 \
    --save_steps 100000 \
    --warmup_rate 0.01 \
    --batch_size 8 \
    --device gpu
```

You can also start training directly with the provided shell script: `sh scripts/run_multi.sh`.

### Text Generation

This project provides a simple text-generation demo so that users can check the generation quality.

```shell
python generate_sample.py
```

Sample output:

```text
问题:中国的首都是哪里?答案:北京。
问题:百度的厂长是谁? 答案:
李彦宏。

默写古诗: 大漠孤烟直,长河落日圆。
举杯邀明月,
对影成三人。
```
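
To try prompts of your own rather than the fixed examples above, a minimal sketch (not part of the original README) using the `Demo` class from `generate_sample.py` could look like this; the prompt text here is only a placeholder.

```python
from generate_sample import Demo

demo = Demo("gpt2-base-cn")  # loads the pretrained model and its tokenizer
# greedy generation, following the one-shot question-answer format of the demo
demo.predict("问题:中国的首都是哪里?答案:北京。\n问题:法国的首都是哪里? 答案:", max_len=10)
```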
## References

- [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413)
PaddleNLP/examples/language_model/gpt2/data.py (new file, mode 100644)

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import time
import os

import numpy as np
import paddle


def construct_samples_and_shuffle_data(name, data_prefix, documents, sizes,
                                       num_samples, seq_length, seed,
                                       worker_index):
    # Number of tokens in each epoch and number of required epochs.
    tokens_per_epoch = _num_tokens(sizes)
    num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples)
    # rng state
    np_rng = np.random.RandomState(seed=seed)

    # Filename of the index mappings.
    _filename = data_prefix
    _filename += '_{}_indexmap'.format(name)
    _filename += '_{}ns'.format(num_samples)
    _filename += '_{}sl'.format(seq_length)
    doc_idx_filename = _filename + '_doc_idx.npy'
    sample_idx_filename = _filename + '_sample_idx.npy'
    shuffle_idx_filename = _filename + '_shuffle_idx.npy'

    # Build the indexed mapping if not exist.
    if worker_index == 0:
        if (not os.path.isfile(doc_idx_filename)) or \
           (not os.path.isfile(sample_idx_filename)) or \
           (not os.path.isfile(shuffle_idx_filename)):
            if num_epochs == 1:
                separate_last_epoch = False
            else:
                num_samples_from_epochs_minus_one = (
                    (num_epochs - 1) * tokens_per_epoch - 1) // seq_length
                last_epoch_num_samples = num_samples - \
                    num_samples_from_epochs_minus_one
                assert last_epoch_num_samples >= 0, \
                    'last epoch number of samples should be non-negative.'
                num_samples_per_epoch = (tokens_per_epoch - 1) // seq_length
                assert last_epoch_num_samples < (num_samples_per_epoch + 1), \
                    'last epoch number of samples exceeded max value.'
                separate_last_epoch = (
                    last_epoch_num_samples < int(0.80 * num_samples_per_epoch))

            # doc-idx.
            doc_idx = _build_doc_idx(documents, num_epochs, np_rng,
                                     separate_last_epoch)
            np.save(doc_idx_filename, doc_idx, allow_pickle=True)
            # sample-idx.
            assert doc_idx.dtype == np.int32
            sample_idx = _build_sample_idx(sizes, doc_idx, seq_length,
                                           num_epochs, tokens_per_epoch)
            np.save(sample_idx_filename, sample_idx, allow_pickle=True)
            # shuffle-idx.
            if separate_last_epoch:
                num_samples_ = num_samples_from_epochs_minus_one
            else:
                num_samples_ = sample_idx.shape[0] - 1
            shuffle_idx = _build_shuffle_idx(num_samples_,
                                             sample_idx.shape[0] - 1, np_rng)
            np.save(shuffle_idx_filename, shuffle_idx, allow_pickle=True)
    else:
        # Other workers wait until worker 0 has written the index files.
        while True:
            if (not os.path.isfile(doc_idx_filename)) or \
               (not os.path.isfile(sample_idx_filename)) or \
               (not os.path.isfile(shuffle_idx_filename)):
                time.sleep(3)
            else:
                break

    # Load mappings.
    doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode='r')
    sample_idx = np.load(sample_idx_filename, allow_pickle=True, mmap_mode='r')
    shuffle_idx = np.load(
        shuffle_idx_filename, allow_pickle=True, mmap_mode='r')
    return doc_idx, sample_idx, shuffle_idx


def _num_tokens(lens):
    """Total number of tokens in the dataset."""
    return np.sum(lens)


def _num_epochs(tokens_per_epoch, seq_length, num_samples):
    """Based on number of samples and sequence length, calculate how many
    epochs will be needed."""
    num_epochs = 0
    total_tokens = 0
    while True:
        num_epochs += 1
        total_tokens += tokens_per_epoch
        if ((total_tokens - 1) // seq_length) >= num_samples:
            return num_epochs


def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch):
    """Build an array with length = number-of-epochs * number-of-documents.
    Each index is mapped to a corresponding document."""
    if not separate_last_epoch or num_epochs == 1:
        doc_idx = np.mgrid[0:num_epochs, 0:len(documents)][1]
        doc_idx[:] = documents
        doc_idx = doc_idx.reshape(-1)
        doc_idx = doc_idx.astype(np.int32)
        # np_rng.shuffle(doc_idx)
        return doc_idx

    doc_idx_first = _build_doc_idx(documents, num_epochs - 1, np_rng, False)
    doc_idx_last = _build_doc_idx(documents, 1, np_rng, False)
    return np.concatenate((doc_idx_first, doc_idx_last))


def _build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch):
    num_samples = (num_epochs * tokens_per_epoch - 1) // seq_length
    sample_idx = np.zeros([int(num_samples) + 1, 2], dtype=np.int32)

    sample_index = 0
    doc_idx_index = 0
    doc_offset = 0
    sample_idx[sample_index][0] = doc_idx_index
    sample_idx[sample_index][1] = doc_offset
    sample_index += 1
    while sample_index <= num_samples:
        remaining_seq_length = seq_length + 1
        while remaining_seq_length != 0:
            doc_id = doc_idx[doc_idx_index]
            doc_length = sizes[doc_id] - doc_offset
            remaining_seq_length -= doc_length
            if remaining_seq_length <= 0:
                doc_offset += (remaining_seq_length + doc_length - 1)
                remaining_seq_length = 0
            else:
                doc_idx_index += 1
                doc_offset = 0
        sample_idx[sample_index][0] = doc_idx_index
        sample_idx[sample_index][1] = doc_offset
        sample_index += 1

    return sample_idx


def _build_shuffle_idx(num_samples, total_size, np_rng):
    dtype_ = np.uint32
    if total_size >= (np.iinfo(np.uint32).max - 1):
        dtype_ = np.int64
    shuffle_idx_first = np.arange(
        start=0, stop=num_samples, step=1, dtype=dtype_)
    np_rng.shuffle(shuffle_idx_first)
    if num_samples == total_size:
        return shuffle_idx_first
    shuffle_idx_last = np.arange(
        start=num_samples, stop=total_size, step=1, dtype=dtype_)
    np_rng.shuffle(shuffle_idx_last)
    return np.concatenate((shuffle_idx_first, shuffle_idx_last))


class GPT2Dataset(paddle.io.Dataset):
    def __init__(self,
                 file_path,
                 worker_index,
                 num_samples,
                 eod_id,
                 name="gpt2",
                 max_seq_len=1024,
                 mode="train",
                 seed=1234):
        self.file_path = file_path
        self.max_seq_len = max_seq_len
        self.name = name
        process_datas = np.load(
            self.file_path, mmap_mode="r+", allow_pickle=True)
        self.sample_ids = process_datas["ids"]
        self.sample_lens = process_datas["lens"]
        document_ids = np.arange(0, self.sample_lens.shape[0])
        self.eod_id = eod_id
        self.doc_idx, self.sample_idx, self.shuffle_idx = \
            construct_samples_and_shuffle_data(
                self.name, self.file_path, document_ids, self.sample_lens,
                num_samples, max_seq_len, seed, worker_index)
        self.start_pos = [0] + np.cumsum(self.sample_lens).tolist()

    def _construct_sample(self, tokens):
        tokens = np.array(tokens).astype("int64").tolist()
        labels = tokens[1:]
        tokens = tokens[:-1]
        seq_length = len(tokens)
        # attention mask for the attention calculation
        attention_mask = np.tri(seq_length, seq_length).reshape(
            (1, seq_length, seq_length))

        # the pad and eod tokens do not contribute to the loss
        loss_mask = np.ones(seq_length, dtype="float32")
        loss_mask[np.where(np.array(tokens) == self.eod_id)] = 0.0
        position_ids = np.arange(0, seq_length, dtype="int64")

        # -INF mask value as default
        attention_mask = (attention_mask - 1.0) * 1e9
        # Bool mask of attention
        # attention_mask = attention_mask.astype("float32")
        return [tokens, loss_mask, attention_mask, position_ids, labels]

    def _get_single_sample_from_idx(self, doc_index_f, doc_index_l, offset_f,
                                    offset_l):
        if doc_index_f == doc_index_l:
            current_start_pos = self.start_pos[doc_index_f]
            return self.sample_ids[current_start_pos + offset_f:
                                   current_start_pos + offset_l + 1].tolist()
        elif doc_index_f < doc_index_l:
            current_start_pos = self.start_pos[doc_index_f]
            next_start_pos = self.start_pos[doc_index_f + 1]
            tokens = self.sample_ids[current_start_pos + offset_f:
                                     next_start_pos].tolist()
            for i in range(doc_index_f + 1, doc_index_l):
                current_start_pos = self.start_pos[i]
                next_start_pos = self.start_pos[i + 1]
                tokens.extend(self.sample_ids[current_start_pos:
                                              next_start_pos].tolist())
            last_start_pos = self.start_pos[doc_index_l]
            tokens.extend(self.sample_ids[last_start_pos:
                                          last_start_pos + offset_l +
                                          1].tolist())
        else:
            current_start_pos = self.start_pos[doc_index_f]
            next_start_pos = self.start_pos[-1]
            tokens = self.sample_ids[current_start_pos + offset_f:
                                     next_start_pos].tolist()
            for i in range(0, doc_index_l):
                current_start_pos = self.start_pos[i]
                next_start_pos = self.start_pos[i + 1]
                tokens.extend(self.sample_ids[current_start_pos:
                                              next_start_pos].tolist())
            last_start_pos = self.start_pos[doc_index_l]
            tokens.extend(self.sample_ids[last_start_pos:
                                          last_start_pos + offset_l +
                                          1].tolist())
        return tokens

    def __getitem__(self, index):
        idx = self.shuffle_idx[index]
        # Start and end documents and offsets.
        doc_index_f_raw = self.sample_idx[idx][0]
        doc_index_l_raw = self.sample_idx[idx + 1][0]
        doc_index_f = self.doc_idx[self.sample_idx[idx][0]]
        doc_index_l = self.doc_idx[self.sample_idx[idx + 1][0]]
        offset_f = self.sample_idx[idx][1]
        offset_l = self.sample_idx[idx + 1][1]
        tokens = self._get_single_sample_from_idx(doc_index_f, doc_index_l,
                                                  offset_f, offset_l)
        token_arr = np.array(tokens, dtype="int64")
        return self._construct_sample(tokens)

    def __len__(self):
        return self.sample_idx.shape[0] - 1
```
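
For orientation, a minimal usage sketch (not part of the commit) of `GPT2Dataset`; `run_pretrain.py` later in this commit wraps the same call inside `create_pretrained_dataset`. The `eod_id` value below is assumed here for illustration (the id of `<|endoftext|>` in the English GPT2 vocab).

```python
# Minimal sketch (not part of the commit); run from the example directory.
from data import GPT2Dataset

dataset = GPT2Dataset(
    file_path="data/train.data.json_ids.npz",  # preprocessed file from the README
    worker_index=0,
    num_samples=8 * 500000,  # batch_size * max_steps, as in run_pretrain.py
    eod_id=50256,            # id of <|endoftext|>; an assumption for illustration
    max_seq_len=1024)
tokens, loss_mask, attention_mask, position_ids, labels = dataset[0]
```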
PaddleNLP/examples/language_model/gpt2/decompress.sh (new file, mode 100644)

```shell
#!/bin/bash

n=0
maxjobs=2       # maximum number of concurrent jobs
m=0
maxfiles=12800  # maximum number of files per directory

for i in $(ls openwebtext); do
    echo $i;
    if ((n % $maxfiles == 0)); then
        ((m=n))
        mkdir -p raw_data/data_$m
    fi
    if ((++n % $maxjobs == 0)); then
        wait
    fi
    tar xJf openwebtext/$i --warning=no-timestamp -C raw_data/data_$m/ &
done
```
PaddleNLP/examples/language_model/gpt2/generate_sample.py (new file, mode 100644)

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Many thanks for following projects.
# https://github.com/TsinghuaAI/CPM-Generate
# https://github.com/jm12138/CPM-Generate-Paddle

import argparse

import numpy as np
import paddle
from paddlenlp.utils.tools import loadz
from paddlenlp.transformers import GPT2Model, GPT2ForPretraining
from paddlenlp.transformers import GPT2ChineseTokenizer, GPT2Tokenizer
from paddlenlp.utils.log import logger

MODEL_CLASSES = {
    "gpt2-base-cn": (GPT2ForPretraining, GPT2ChineseTokenizer),
    "gpt2-medium-en": (GPT2ForPretraining, GPT2Tokenizer),
}


class Demo:
    def __init__(self, model_name_or_path="gpt2-base-cn"):
        model_class, tokenizer_class = MODEL_CLASSES[model_name_or_path]
        self.tokenizer = tokenizer_class.from_pretrained(model_name_or_path)
        logger.info('Loading the model parameters, please wait...')
        self.model = model_class.from_pretrained(model_name_or_path)
        self.model.eval()
        logger.info('Model loaded.')

    # prediction function
    def predict(self, text, max_len=10):
        ids = self.tokenizer.encode(text)
        input_id = paddle.to_tensor(
            np.array(ids).reshape(1, -1).astype('int64'))
        output, cached_kvs = self.model(input_id, use_cache=True, cache=None)
        nid = int(np.argmax(output[0, -1].numpy()))
        ids.append(nid)
        out = [nid]
        for i in range(max_len):
            input_id = paddle.to_tensor(
                np.array([nid]).reshape(1, -1).astype('int64'))
            output, cached_kvs = self.model(
                input_id, use_cache=True, cache=cached_kvs)
            nid = int(np.argmax(output[0, -1].numpy()))
            ids.append(nid)
            # if nid is '\n', the prediction is over.
            if nid == 3:
                break
            out.append(nid)
        logger.info(text)
        logger.info(self.tokenizer.decode(out))

    # One shot example
    def ask_question(self, question, max_len=10):
        self.predict("问题:中国的首都是哪里?答案:北京。\n问题:%s 答案:" % question, max_len)

    # dictation poetry
    def dictation_poetry(self, front, max_len=10):
        self.predict('''默写古诗: 大漠孤烟直,长河落日圆。\n%s''' % front, max_len)


if __name__ == "__main__":
    demo = Demo("gpt2-base-cn")
    demo.ask_question("百度的厂长是谁?")
    demo.dictation_poetry("举杯邀明月,")
    del demo
    # demo = Demo("gpt2-medium-en")
    # demo.predict("Question: Where is the capital of China? Answer: Beijing. \nQuestion: Who is the CEO of Apple? Answer:", 20)
```
PaddleNLP/examples/language_model/gpt2/lr.py (new file, mode 100644)

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
import numpy
import warnings
from paddle import Tensor
from paddle.optimizer.lr import LRScheduler


class CosineAnnealingWithWarmupDecay(LRScheduler):
    def __init__(self,
                 max_lr,
                 min_lr,
                 warmup_step,
                 decay_step,
                 last_epoch=-1,
                 verbose=False):
        self.decay_step = decay_step
        self.warmup_step = warmup_step
        self.max_lr = max_lr
        self.min_lr = min_lr
        super(CosineAnnealingWithWarmupDecay, self).__init__(
            max_lr, last_epoch, verbose)

    def get_lr(self):
        # Linear warmup from 0 to max_lr over warmup_step steps.
        if self.warmup_step > 0 and self.last_epoch <= self.warmup_step:
            return float(self.max_lr) * (self.last_epoch) / self.warmup_step

        # After decay_step steps, hold at min_lr.
        if self.last_epoch > self.decay_step:
            return self.min_lr

        # Cosine decay from max_lr down to min_lr.
        num_step_ = self.last_epoch - self.warmup_step
        decay_step_ = self.decay_step - self.warmup_step
        decay_ratio = float(num_step_) / float(decay_step_)
        coeff = 0.5 * (math.cos(math.pi * decay_ratio) + 1.0)
        return self.min_lr + coeff * (self.max_lr - self.min_lr)
```
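
As a quick illustration (not part of the commit), the sketch below shows how this schedule behaves: `run_pretrain.py` passes an instance of it to `paddle.optimizer.AdamW` as `learning_rate`, so the rate climbs linearly to `max_lr` over `warmup_step` steps and then follows a cosine curve down to `min_lr` at `decay_step`. The numbers use the values from `scripts/run.sh` (warmup_rate 0.01 of decay_steps 320000, i.e. 3200 warmup steps).

```python
# Illustrative sketch (not part of the commit).
from lr import CosineAnnealingWithWarmupDecay

scheduler = CosineAnnealingWithWarmupDecay(
    max_lr=0.00015, min_lr=0.00001, warmup_step=3200, decay_step=320000)
for _ in range(3200):
    scheduler.step()       # the training loop calls this once per step
print(scheduler.get_lr())  # 0.00015: end of the linear warmup phase
```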
PaddleNLP/examples/language_model/gpt2/process_data.py (new file, mode 100644)

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import re
import argparse
import json
import multiprocessing

import numpy as np
from paddlenlp.transformers import GPT2Tokenizer
from tqdm import tqdm


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input_path', type=str, required=True, help='Path to input JSON')
    parser.add_argument(
        '--model_name', type=str, required=True, help='What model to use.')
    parser.add_argument(
        '--append_eod',
        action='store_true',
        help='Append an <eod> token to the end of a document.')
    parser.add_argument(
        '--workers',
        type=int,
        default=1,
        help='Number of worker processes to launch')
    args = parser.parse_args()
    return args


class Converter(object):
    def __init__(self, model_name, append_eod):
        self.append_eod = append_eod
        tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        Converter.tokenizer = tokenizer
        self.eod_id = tokenizer.command_name_map["eod"].Id
        self.vocab_size = len(tokenizer)

    def encode(self, text):
        tokens = self.tokenizer.encode(text)
        if self.append_eod:
            tokens.append(self.eod_id)
        return tokens, len(tokens)


def main():
    args = get_args()
    file_paths = []
    if os.path.isfile(args.input_path):
        file_paths.append(args.input_path)
    else:
        for root, _, fs in os.walk(args.input_path):
            for f in fs:
                file_paths.append(os.path.join(root, f))

    all_doc_ids = []
    lens = []
    convert = Converter(args.model_name, args.append_eod)
    pool = multiprocessing.Pool(args.workers)
    # Token ids fit in uint16 for vocabularies smaller than 65500.
    if convert.vocab_size < 65500:
        save_dtype = np.uint16
    else:
        save_dtype = np.int32

    for file_path in tqdm(file_paths):
        text = open(file_path, 'r', encoding='utf-8').read()
        text = re.sub('[\n]+', '\n', text)
        text = re.sub('[ ]+', ' ', text)
        encoded_docs = pool.imap(convert.encode, [text], 25)
        for tokens, sizes in encoded_docs:
            all_doc_ids.extend(tokens)
            lens.append(sizes)

    all_doc_ids = np.array(all_doc_ids, dtype=save_dtype)
    lens = np.array(lens, dtype=save_dtype)
    np.savez(args.input_path + "_ids.npz", ids=all_doc_ids, lens=lens)


if __name__ == "__main__":
    main()
```
PaddleNLP/examples/language_model/gpt2/run_pretrain.py (new file, mode 100644)

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import math
import os
import random
import time

import numpy as np
import paddle
from paddle.io import DataLoader, Dataset

from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import GPT2Model, GPT2ForPretraining, GPT2PretrainingCriterion
from paddlenlp.transformers import GPT2Tokenizer
from paddlenlp.utils.log import logger

from data import GPT2Dataset
import lr

MODEL_CLASSES = {
    "gpt2-small-en": (GPT2ForPretraining, GPT2Tokenizer),
    "gpt2-medium-en": (GPT2ForPretraining, GPT2Tokenizer),
    "gpt2-large-en": (GPT2ForPretraining, GPT2Tokenizer),
}


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        help="Path to pre-trained model or shortcut name selected in the list: "
        + ", ".join(
            sum([
                list(classes[-1].pretrained_init_configuration.keys())
                for classes in MODEL_CLASSES.values()
            ], [])), )
    parser.add_argument(
        "--input_dir",
        default=None,
        type=str,
        required=True,
        help="The input directory where the data will be read from.", )
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        required=True,
        help="The output directory where the model predictions and checkpoints will be written.",
    )
    parser.add_argument(
        "--batch_size",
        default=8,
        type=int,
        help="Batch size per GPU/CPU for training.", )
    parser.add_argument(
        "--weight_decay",
        default=0.0,
        type=float,
        help="Weight decay if we apply some.")
    parser.add_argument(
        "--grad_clip",
        default=0.0,
        type=float,
        help="Grad clip for the parameter.")
    parser.add_argument(
        "--adam_epsilon",
        default=1e-8,
        type=float,
        help="Epsilon for Adam optimizer.")
    parser.add_argument(
        "--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs",
        default=1,
        type=int,
        help="Total number of training epochs to perform.", )
    parser.add_argument(
        "--max_steps",
        default=520000,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
    )
    parser.add_argument(
        "--decay_steps",
        default=360000,
        type=int,
        help="The steps used to control the learning rate. If the step > decay_steps, will use the min_lr.",
    )
    parser.add_argument(
        "--max_lr",
        default=1e-5,
        type=float,
        help="The initial max learning rate for Adam.")
    parser.add_argument(
        "--min_lr",
        default=5e-5,
        type=float,
        help="The initial min learning rate for Adam.")
    parser.add_argument(
        "--warmup_rate",
        default=0.01,
        type=float,
        help="Linear warmup over warmup_steps.")
    parser.add_argument(
        "--logging_steps",
        type=int,
        default=1,
        help="Log every X updates steps.")
    parser.add_argument(
        "--save_steps",
        type=int,
        default=500,
        help="Save checkpoint every X updates steps.")
    parser.add_argument(
        "--seed", type=int, default=42, help="random seed for initialization")
    parser.add_argument(
        "--device",
        type=str,
        default="gpu",
        help="select cpu, gpu, xpu devices.")
    args = parser.parse_args()
    return args


class WorkerInitObj(object):
    def __init__(self, seed):
        self.seed = seed

    def __call__(self, id):
        np.random.seed(seed=self.seed + id)
        random.seed(self.seed + id)


def create_pretrained_dataset(args, input_path, worker_init, worker_index,
                              eod_id):
    train_data = GPT2Dataset(
        file_path=input_path,
        worker_index=worker_index,
        num_samples=args.batch_size * args.max_steps,
        eod_id=eod_id,
        seed=args.seed + worker_index)
    train_batch_sampler = paddle.io.DistributedBatchSampler(
        train_data, batch_size=args.batch_size, shuffle=True, drop_last=True)
    train_data_loader = DataLoader(
        dataset=train_data,
        batch_sampler=train_batch_sampler,
        num_workers=0,
        worker_init_fn=worker_init,
        collate_fn=Tuple(Stack(), Stack(), Stack(), Stack(), Stack()))
    return train_data_loader


def set_seed(args):
    if args.device == "cpu":
        idx = 0
    else:
        idx = paddle.distributed.get_rank()
    random.seed(args.seed + idx)
    np.random.seed(args.seed + idx)
    paddle.seed(args.seed + idx)


def do_train(args):
    assert args.device in [
        "cpu", "gpu", "xpu"
    ], "Invalid device! Available device should be cpu, gpu, or xpu."
    paddle.set_device(args.device)
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    worker_index = paddle.distributed.get_rank()
    worker_num = paddle.distributed.get_world_size()
    set_seed(args)
    worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank())

    model_class, tokenizer_class = MODEL_CLASSES[args.model_name_or_path]
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
    eod_id = tokenizer.command_name_map["eod"].Id

    model = GPT2ForPretraining(
        GPT2Model(**model_class.pretrained_init_configuration[
            args.model_name_or_path]))

    # create the criterion for the gpt model
    criterion = GPT2PretrainingCriterion()

    if args.decay_steps is None:
        args.decay_steps = args.max_steps
    warmup_step = args.warmup_rate * args.decay_steps
    lr_scheduler = lr.CosineAnnealingWithWarmupDecay(
        max_lr=args.max_lr,
        min_lr=args.min_lr,
        warmup_step=warmup_step,
        decay_step=args.decay_steps)

    clip = None
    if args.grad_clip > 0:
        clip = paddle.nn.ClipGradByNorm(clip_norm=args.grad_clip)

    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=args.adam_epsilon,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        grad_clip=clip,
        apply_decay_param_fun=lambda x: x in [
            p.name for n, p in model.named_parameters()
            if not any(nd in n for nd in ["bias", "norm"])
        ])

    global_step = 0
    tic_train = time.time()
    for epoch in range(args.num_train_epochs):
        files = [
            os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)
            if (os.path.isfile(os.path.join(args.input_dir, f)) and "npz_" not
                in str(f))
        ]
        files.sort()
        num_files = len(files)
        for f_id in range(num_files):
            data_file = files[f_id]
            train_data_loader = create_pretrained_dataset(
                args, data_file, worker_init, worker_index, eod_id=eod_id)
            for step, batch in enumerate(train_data_loader):
                global_step += 1
                tokens, loss_mask, attention_mask, position_ids, labels = batch
                loss_mask.stop_gradient = True
                attention_mask.stop_gradient = True

                preds = model(tokens, position_ids, attention_mask)
                loss = criterion(preds, labels, loss_mask)

                if global_step % args.logging_steps == 0:
                    if worker_index == 0:
                        logger.info(
                            "global step %d, epoch: %d, lr: %.10f, batch: %d, loss: %f, speed: %.2f step/s"
                            % (global_step, epoch, optimizer.get_lr(), step,
                               loss, args.logging_steps /
                               (time.time() - tic_train)))
                    tic_train = time.time()

                loss.backward()
                optimizer.step()
                lr_scheduler.step()
                optimizer.clear_gradients()
                if global_step % args.save_steps == 0:
                    if worker_index == 0:
                        output_dir = os.path.join(args.output_dir,
                                                  "model_%d" % global_step)
                        if not os.path.exists(output_dir):
                            os.makedirs(output_dir)
                        # need better way to get inner model of DataParallel
                        model_to_save = model._layers if isinstance(
                            model, paddle.DataParallel) else model
                        model_to_save.save_pretrained(output_dir)
                if global_step >= args.max_steps:
                    del train_data_loader
                    return
            del train_data_loader


if __name__ == "__main__":
    args = parse_args()
    do_train(args)
```
PaddleNLP/examples/language_model/gpt2/scripts/run.sh (new file, mode 100644)

```shell
export CUDA_VISIBLE_DEVICES=0

python run_pretrain.py --model_name_or_path gpt2-small-en --input_dir "./data" \
    --output_dir "output" \
    --max_lr 0.00015 \
    --min_lr 0.00001 \
    --weight_decay 0.01 \
    --grad_clip 1.0 \
    --max_steps 500000 \
    --save_steps 100000 \
    --decay_steps 320000 \
    --warmup_rate 0.01 \
    --batch_size 8 \
    --device gpu
```
PaddleNLP/examples/language_model/gpt2/scripts/run_multi.sh (new file, mode 100644)

```shell
unset CUDA_VISIBLE_DEVICES

python -m paddle.distributed.launch --gpus "0,1" run_pretrain.py --model_name_or_path gpt2-small-en --input_dir "./data" \
    --output_dir "output" \
    --max_lr 0.00015 \
    --min_lr 0.00001 \
    --weight_decay 0.01 \
    --grad_clip 1.0 \
    --max_steps 500000 \
    --save_steps 100000 \
    --decay_steps 320000 \
    --warmup_rate 0.01 \
    --batch_size 8 \
    --device gpu
```
PaddleNLP/paddlenlp/transformers/__init__.py

```diff
@@ -19,6 +19,8 @@ from .bert.modeling import *
 from .bert.tokenizer import *
 from .ernie.modeling import *
 from .ernie.tokenizer import *
+from .gpt2.modeling import *
+from .gpt2.tokenizer import *
 from .roberta.modeling import *
 from .roberta.tokenizer import *
 from .electra.modeling import *
```
PaddleNLP/paddlenlp/transformers/gpt2/__init__.py (new file, mode 100644)

```python
from .modeling import *
from .tokenizer import GPT2ChineseTokenizer
```
PaddleNLP/paddlenlp/transformers/gpt2/modeling.py (new file, mode 100644)

(The diff for this file, 608 added lines, is collapsed in the page view and is not reproduced here.)

PaddleNLP/paddlenlp/transformers/gpt2/tokenizer.py (new file, mode 100644)

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import regex as re
import unicodedata
import json
import sentencepiece
import jieba

from functools import lru_cache
from collections import namedtuple

from paddlenlp.utils.log import logger  # used by set_special_tokens and convert_tokens_to_ids
from .. import PretrainedTokenizer
from ..tokenizer_utils import convert_to_unicode, whitespace_tokenize, \
    _is_whitespace, _is_control, _is_punctuation

__all__ = [
    'GPT2Tokenizer',
    'GPT2ChineseTokenizer',
]

COMMAND_TUPLE = namedtuple('CommandToken', ('name', 'token', 'Id'))
TYPE_TUPLE = namedtuple('TypeToken', ('name', 'token', 'Id'))


class CommandToken(object):
    def __init__(self, name, token, Id):
        self.name = name
        self.token = token
        self.Id = Id

    def __str__(self):
        return str(COMMAND_TUPLE(self.name, self.token, self.Id))


@lru_cache()
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    _chr = chr
    bs = list(range(ord("!"), ord("~") + 1)) + list(
        range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [_chr(n) for n in cs]
    return dict(zip(bs, cs))


def get_pairs(word):
    """Return set of symbol pairs in a word.
    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


class GPT2ChineseTokenizer(PretrainedTokenizer):
    """
    Constructs a GPT2 Chinese tokenizer. It uses a basic tokenizer to do punctuation
    splitting, lower casing and so on, and follows a WordPiece tokenizer to
    tokenize as subwords.
    """
    resource_files_names = {
        "vocab_file": "vocab.json",
        "model_file": "sentencepiece.model"
    }  # for save_pretrained
    pretrained_resource_files_map = {
        "vocab_file": {
            "gpt2-base-cn":
            "https://paddlenlp.bj.bcebos.com/models/transformers/gpt2/gpt2-base-cn-vocab.json",
        },
        "model_file": {
            "gpt2-base-cn":
            "https://paddlenlp.bj.bcebos.com/models/transformers/gpt2/gpt2-base-cn-sentencepiece.model"
        }
    }
    pretrained_init_configuration = {
        "gpt2-base-cn": {
            "do_lower_case": True
        },
    }

    def __init__(self,
                 vocab_file,
                 model_file,
                 do_lower_case=True,
                 max_len=512,
                 bod_id="<bod>",
                 eod_id="<eod>",
                 max_length=None):
        if not os.path.isfile(vocab_file):
            raise ValueError(
                "Can't find a vocabulary file at path '{}'. To load the "
                "vocabulary from a pretrained model please use "
                "`tokenizer = GPT2Tokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
                .format(vocab_file))
        self.max_len = max_len if max_len is not None else int(1e12)
        self.encoder = json.load(open(vocab_file))
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.sp = sentencepiece.SentencePieceProcessor(model_file=model_file)
        self.translator = str.maketrans(" \n", "\u2582\u2583")

    def tokenize(self, text):
        """ Tokenize a string. """
        seg_list = [
            x.translate(self.translator)
            for x in jieba.cut(text, cut_all=False)
        ]
        new_seg = " ".join(seg_list)
        return self.sp.encode(new_seg)

    def encode(self, text):
        return self.convert_tokens_to_ids(text)

    def decode(self, tokens):
        return self.convert_ids_to_tokens(tokens)

    def convert_tokens_to_ids(self, text):
        res = self.tokenize(text)
        return res

    def convert_ids_to_tokens(self, tokens):
        text = self.sp.decode(tokens)
        text = text.replace(' ', '').replace('\u2582', ' ').replace('\u2583',
                                                                    '\n')
        return text


class GPT2Tokenizer(PretrainedTokenizer):
    resource_files_names = {
        "vocab_file": "vocab.json",
        "merges_file": "merges.txt"
    }  # for save_pretrained
    pretrained_resource_files_map = {
        "vocab_file": {
            "gpt2-large-en":
            "http://paddlenlp.bj.bcebos.com/models/transformers/gpt2/gpt2-large-en-vocab.json",
            "gpt2-medium-en":
            "http://paddlenlp.bj.bcebos.com/models/transformers/gpt2/gpt2-medium-en-vocab.json",
            "gpt2-small-en":
            "http://paddlenlp.bj.bcebos.com/models/transformers/gpt2/gpt2-small-en-vocab.json",
        },
        "merges_file": {
            "gpt2-large-en":
            "http://paddlenlp.bj.bcebos.com/models/transformers/gpt2/gpt2-large-en-merges.txt",
            "gpt2-medium-en":
            "http://paddlenlp.bj.bcebos.com/models/transformers/gpt2/gpt2-medium-en-merges.txt",
            "gpt2-small-en":
            "http://paddlenlp.bj.bcebos.com/models/transformers/gpt2/gpt2-small-en-merges.txt",
        }
    }
    pretrained_init_configuration = {
        "gpt2-large-en": {
            "do_lower_case": True
        },
        "gpt2-medium-en": {
            "do_lower_case": True
        },
        "gpt2-small-en": {
            "do_lower_case": True
        },
    }

    def __init__(self,
                 vocab_file,
                 merges_file,
                 errors='replace',
                 special_tokens=None,
                 max_len=None,
                 do_lower_case=True):
        self.max_len = int(1e12)
        self.num_command_tokens = 2
        self.num_type_tokens = 2

        self.encoder = json.load(open(vocab_file))
        self.decoder = {v: k for k, v in self.encoder.items()}

        # construct the command tokens
        self._command_tokens = [
            CommandToken('pad', '<|endoftext|>', self.encoder['<|endoftext|>']),
            CommandToken('eod', '<|endoftext|>', self.encoder['<|endoftext|>']),
        ]
        self.command_name_map = {tok.name: tok for tok in self._command_tokens}
        self.command_token_map = {
            tok.token: tok
            for tok in self._command_tokens
        }
        self.command_id_map = {tok.Id: tok for tok in self._command_tokens}

        self.num_tokens = len(self.encoder)
        self.num_text_tokens = self.num_tokens - 1
        self.errors = errors  # how to handle errors in decoding
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        bpe_data = open(merges_file, encoding='utf-8').read().split('\n')[1:-1]
        bpe_merges = [tuple(merge.split()) for merge in bpe_data]
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.cache = {}
        self.pat = re.compile(
            r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
        )
        self.special_tokens = {}
        self.special_tokens_decoder = {}
        self.set_special_tokens(special_tokens)

    def __len__(self):
        return len(self.encoder) + len(self.special_tokens)

    def set_special_tokens(self, special_tokens):
        """ Add a list of additional tokens to the encoder.
        The additional tokens are indexed starting from the last index of the
        current vocabulary in the order of the `special_tokens` list.
        """
        if not special_tokens:
            self.special_tokens = {}
            self.special_tokens_decoder = {}
            return
        self.special_tokens = dict((tok, len(self.encoder) + i)
                                   for i, tok in enumerate(special_tokens))
        self.special_tokens_decoder = {
            v: k
            for k, v in self.special_tokens.items()
        }
        logger.info("Special tokens {}".format(self.special_tokens))

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token)
        pairs = get_pairs(word)
        if not pairs:
            return token

        while True:
            bigram = min(
                pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except:
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word) - 1 and word[
                        i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word

    def tokenize(self, text):
        """ Tokenize a string. """
        bpe_tokens = []
        for token in re.findall(self.pat, text):
            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
            bpe_tokens.extend(
                bpe_token for bpe_token in self.bpe(token).split(' '))
        return bpe_tokens

    def convert_tokens_to_ids(self, tokens):
        """ Converts a sequence of tokens into ids using the vocab. """
        ids = []
        if isinstance(tokens, str):
            if tokens in self.special_tokens:
                return self.special_tokens[tokens]
            else:
                return self.encoder.get(tokens, 0)
        for token in tokens:
            if token in self.special_tokens:
                ids.append(self.special_tokens[token])
            else:
                ids.append(self.encoder.get(token, 0))
        if len(ids) > self.max_len:
            logger.warning(
                "Token indices sequence length is longer than the specified maximum "
                " sequence length for this OpenAI GPT model ({} > {}). Running this"
                " sequence through the model will result in indexing errors".
                format(len(ids), self.max_len))
        return ids

    def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
        tokens = []
        for i in ids:
            if i in self.special_tokens_decoder:
                if not skip_special_tokens:
                    tokens.append(self.special_tokens_decoder[i])
            else:
                tokens.append(self.decoder[i])
        return tokens

    def encode(self, text, fn=None):
        processed_text = text
        if fn is not None:
            processed_text = fn(text)
        ids = self.convert_tokens_to_ids(self.tokenize(processed_text))
        return ids

    def decode(self, tokens):
        # TODO
        text = ''.join([self.decoder[token] for token in tokens])
        text = bytearray([self.byte_decoder[c] for c in text]).decode(
            'utf-8', errors=self.errors)
        return text
```
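
A small round-trip sketch (not part of the commit) of the byte-level BPE tokenizer defined above, using the pretrained vocab and merges files registered in `pretrained_resource_files_map`:

```python
# Illustrative sketch (not part of the commit).
from paddlenlp.transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-small-en")
ids = tokenizer.encode("Hello world")    # tokenize() + convert_tokens_to_ids()
print(ids)
print(tokenizer.decode(ids))             # round-trips back to "Hello world"
eod = tokenizer.command_name_map["eod"]  # CommandToken for '<|endoftext|>'
print(eod.token, eod.Id)
```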