PaddlePaddle / PALM

Commit 48cf8096
Authored on Dec 10, 2019 by xixiaoyao

merge from master

Parents: ea0664b9, edd3c2ad

Showing 24 changed files with 227 additions and 102 deletions (+227 −102)
README.md                                    +7   −2
demo/demo2/config.yaml                       +2   −0
paddlepalm/README.md                         +0   −0
paddlepalm/_downloader.py                    +2   −1
paddlepalm/backbone/bert.py                  +6   −6
paddlepalm/backbone/ernie.py                 +8   −8
paddlepalm/mtl_controller.py                 +5   −4
paddlepalm/reader/cls.py                     +9   −9
paddlepalm/reader/match.py                   +9   −9
paddlepalm/reader/mlm.py                     +6   −6
paddlepalm/reader/mrc.py                     +11  −12
paddlepalm/reader/utils/batching4bert.py     +5   −5
paddlepalm/reader/utils/batching4ernie.py    +5   −5
paddlepalm/reader/utils/mlm_batching.py      +2   −2
paddlepalm/reader/utils/reader4ernie.py      +9   −9
paddlepalm/task_paradigm/cls.py              +4   −3
paddlepalm/task_paradigm/match.py            +4   −3
paddlepalm/task_paradigm/mlm.py              +5   −4
paddlepalm/task_paradigm/mrc.py              +10  −8
paddlepalm/utils/reader_helper.py            +2   −3
script/convert_params.sh                     +37  −0
script/download_pretrain_backbone.sh         +43  −0
script/recover_params.sh                     +33  −0
setup.py                                     +3   −3
README.md
...
...
@@ -741,7 +741,7 @@ BERT includes the following input objects:

```yaml
token_ids: a matrix of shape [batch_size, seq_len]; each row is one sample and each element is the vocabulary id of the corresponding token in the text.
position_ids: a matrix of shape [batch_size, seq_len]; each row is one sample and each element is the position id of the corresponding token in the text.
-segment_ids: a 0/1 matrix of shape [batch_size, seq_len] supporting BERT/ERNIE-style inputs; 0 means the token belongs to text1 of a classification or matching task, 1 means it belongs to text2 of a matching task.
+segment_ids: a 0/1 matrix of shape [batch_size, seq_len] supporting BERT/ERNIE-style inputs; 0 means the token belongs to text1 of a classification or matching task, 1 means it belongs to text2 of a matching task。
input_mask: a matrix of shape [batch_size, seq_len] whose elements are 0 or 1, marking whether each position is a padding token (1 = real token, 0 = padding).
```
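For illustration, a minimal NumPy sketch of how a padded two-sample batch maps onto these four objects might look as follows; the token ids and the pad id 0 are made up and not taken from the repository.

```python
import numpy as np

# Two toy samples, already tokenized to ids; 0 is assumed to be the padding id.
samples = [[101, 2023, 2003, 102], [101, 2748, 102]]
max_len = max(len(s) for s in samples)

token_ids = np.array([s + [0] * (max_len - len(s)) for s in samples], dtype='int64')
position_ids = np.array([list(range(len(s))) + [0] * (max_len - len(s)) for s in samples], dtype='int64')
segment_ids = np.zeros_like(token_ids)  # single-sentence input, so every token belongs to text1
input_mask = np.array([[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in samples], dtype='float32')

print(token_ids.shape)  # (2, 4) -> [batch_size, seq_len]
```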
...
...
@@ -781,6 +781,7 @@ sentence_pair_embedding: a matrix of shape [batch_size, hidden_size], float...

## Appendix C: Built-in task paradigms (paradigm)

#### Classification paradigm: cls

The classification paradigm additionally supports the following configuration fields:
...
...
@@ -788,6 +789,7 @@ sentence_pair_embedding: a matrix of shape [batch_size, hidden_size], float...

```yaml
n_classes (REQUIRED): int. Number of classes for the classification task.
pred_output_path (OPTIONAL): str. Path for saving prediction outputs; when non-empty, results are saved into the task directory under the path given by the `save_path` field of the global config file.
+save_infermodel_every_n_steps (OPTIONAL): int. Interval (in steps) for periodically saving the inference model; when unset or set to -1, the inference model is saved only when training of this task finishes. Defaults to -1.
```

The classification paradigm takes the following input objects:
...
...
@@ -812,6 +814,7 @@ sentence_embedding: a matrix of shape [batch_size, hidden_size], float32...

```yaml
pred_output_path (OPTIONAL): str. Path for saving prediction outputs; when non-empty, results are saved into the task directory under the path given by the `save_path` field of the global config file.
+save_infermodel_every_n_steps (OPTIONAL): int. Interval (in steps) for periodically saving the inference model; when unset or set to -1, the inference model is saved only when training of this task finishes. Defaults to -1.
```

The matching paradigm takes the following input objects:
...
...
@@ -838,6 +841,7 @@ sentence_pair_embedding: a matrix of shape [batch_size, hidden_size], float...

max_answer_len (REQUIRED): int. Maximum answer length at prediction time.
n_best_size (OPTIONAL): int, defaults to 20. Number of n-best candidates per sample kept in the nbest prediction file.
pred_output_path (OPTIONAL): str. Path for saving prediction outputs; when non-empty, results are saved into the task directory under the path given by the `save_path` field of the global config file.
+save_infermodel_every_n_steps (OPTIONAL): int. Interval (in steps) for periodically saving the inference model; when unset or set to -1, the inference model is saved only when training of this task finishes. Defaults to -1.
```

The machine reading comprehension paradigm takes the following input objects:
...
...
@@ -885,7 +889,8 @@ do_lower_case (OPTIONAL): bool. Lower-casing flag. Defaults to False, i.e. ...

for_cn: bool. Chinese-mode flag. Defaults to False, i.e. the input is assumed to be English; when set to True, the tokenizer, post-processing, etc. are handled for Chinese text.
print_every_n_steps (OPTIONAL): int. Defaults to 5. Logging frequency (in steps) during training.
-save_every_n_steps (OPTIONAL): int. Defaults to -1. Frequency of saving checkpoint models during training; by default nothing is saved.
+save_ckpt_every_n_steps (OPTIONAL): int. Defaults to -1. Frequency of saving full-computation-graph checkpoints during training; with the default -1, a checkpoint is saved automatically only at the last step.
+save_infermodel_every_n_steps (OPTIONAL): int. Interval (in steps) for periodically saving the inference model; when unset or set to -1, the inference model is saved only when training of this task finishes. Defaults to -1.
optimizer (REQUIRED): str. Optimizer name; currently the framework only supports adam, more optimizers will be supported in the future.
learning_rate (REQUIRED): str. Learning rate for the training phase.
...
...
demo/demo2/config.yaml
...
...
@@ -12,6 +12,8 @@ do_lower_case: True

max_seq_len: 512
batch_size: 4
+save_ckpt_every_n_steps: 5
+save_infermodel_every_n_steps: 5
num_epochs: 2
optimizer: "adam"
learning_rate: 3e-5
...
...
paddlepalm/README.md
deleted (file mode 100644 → 0)
paddlepalm/_downloader.py
...
...
@@ -33,6 +33,7 @@ ssl._create_default_https_context = ssl._create_unverified_context

_items = {
    'pretrain': {'ernie-en-uncased-large': 'https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz',
                 'bert-en-uncased-large': 'https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz',
                 'bert-en-uncased-base': 'https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz',
                 'utils': None},
    'reader': {'utils': None},
    'backbone': {'utils': None},
...
...
@@ -90,7 +91,7 @@ def _download(item, scope, path, silent=False):

        tar.extractall(path=data_dir)
        tar.close()
        os.remove(filename)
-       if scope == 'bert-en-uncased-large':
+       if scope.startswith('bert'):
            source_path = data_dir + '/' + data_name.split('.')[0]
            fileList = os.listdir(source_path)
            for file in fileList:
...
...
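The visible change in the second hunk widens the special-cased post-extraction handling from the single 'bert-en-uncased-large' scope to any scope whose name starts with 'bert'. As a standalone illustration of the registry lookup this supports (resolve_url below is a hypothetical helper for this sketch, not part of _downloader.py):

```python
# Mirrors the 'pretrain' portion of the _items registry shown above.
_items = {
    'pretrain': {
        'ernie-en-uncased-large': 'https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz',
        'bert-en-uncased-large': 'https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz',
        'bert-en-uncased-base': 'https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz',
    },
}

def resolve_url(item, scope):
    """Hypothetical helper: look up the download URL for an (item, scope) pair."""
    return _items[item][scope]

print(resolve_url('pretrain', 'bert-en-uncased-base'))
# Any scope starting with 'bert' now also goes through the post-extraction handling branch.
```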
paddlepalm/backbone/bert.py
...
...
@@ -52,9 +52,9 @@ class Model(backbone):

    @property
    def inputs_attr(self):
-       return {"token_ids": [[-1, -1, 1], 'int64'],
-               "position_ids": [[-1, -1, 1], 'int64'],
-               "segment_ids": [[-1, -1, 1], 'int64'],
+       return {"token_ids": [[-1, -1], 'int64'],
+               "position_ids": [[-1, -1], 'int64'],
+               "segment_ids": [[-1, -1], 'int64'],
                "input_mask": [[-1, -1, 1], 'float32']}

    @property
...
...
@@ -73,7 +73,7 @@ class Model(backbone):

        self._emb_dtype = 'float32'

        # padding id in vocabulary must be set to 0
-       emb_out = fluid.layers.embedding(
+       emb_out = fluid.embedding(
            input=src_ids,
            size=[self._voc_size, self._emb_size],
            dtype=self._emb_dtype,
...
...
@@ -84,14 +84,14 @@ class Model(backbone):

        # fluid.global_scope().find_var('backbone-word_embedding').get_tensor()
        embedding_table = fluid.default_main_program().global_block().var(
            scope_name + self._word_emb_name)

-       position_emb_out = fluid.layers.embedding(
+       position_emb_out = fluid.embedding(
            input=pos_ids,
            size=[self._max_position_seq_len, self._emb_size],
            dtype=self._emb_dtype,
            param_attr=fluid.ParamAttr(
                name=scope_name + self._pos_emb_name,
                initializer=self._param_initializer))

-       sent_emb_out = fluid.layers.embedding(
+       sent_emb_out = fluid.embedding(
            sent_ids,
            size=[self._sent_types, self._emb_size],
            dtype=self._emb_dtype,
...
...
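The backbone change made here (and repeated in ernie.py below) is twofold: the reader-facing input shapes drop their trailing singleton dimension (for example [-1, -1, 1] becomes [-1, -1], while input_mask keeps its trailing 1), and the older fluid.layers.embedding call is replaced by fluid.embedding, which consumes the 2-D int64 id tensor directly. A minimal sketch of the new call pattern, assuming a Paddle 1.6-era fluid API; the tensor name and sizes are made up:

```python
import paddle.fluid as fluid

VOC_SIZE, EMB_SIZE = 30522, 768  # illustrative sizes only

# fluid.data (like the layers.data -> fluid.data change in reader_helper.py)
# declares the 2-D id tensor without a trailing [.., 1] dimension.
src_ids = fluid.data(name='token_ids', shape=[-1, -1], dtype='int64')

# fluid.embedding accepts the 2-D int64 ids as-is and appends the embedding axis,
# so emb_out has shape [batch_size, seq_len, EMB_SIZE].
emb_out = fluid.embedding(input=src_ids, size=[VOC_SIZE, EMB_SIZE], dtype='float32')
```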
paddlepalm/backbone/ernie.py
...
...
@@ -62,11 +62,11 @@ class Model(backbone):

    @property
    def inputs_attr(self):
-       return {"token_ids": [[-1, -1, 1], 'int64'],
-               "position_ids": [[-1, -1, 1], 'int64'],
-               "segment_ids": [[-1, -1, 1], 'int64'],
+       return {"token_ids": [[-1, -1], 'int64'],
+               "position_ids": [[-1, -1], 'int64'],
+               "segment_ids": [[-1, -1], 'int64'],
                "input_mask": [[-1, -1, 1], 'float32'],
-               "task_ids": [[-1, -1, 1], 'int64']}
+               "task_ids": [[-1, -1], 'int64']}

    @property
    def outputs_attr(self):
...
...
@@ -85,7 +85,7 @@ class Model(backbone):

        task_ids = inputs['task_ids']

        # padding id in vocabulary must be set to 0
-       emb_out = fluid.layers.embedding(
+       emb_out = fluid.embedding(
            input=src_ids,
            size=[self._voc_size, self._emb_size],
            dtype=self._emb_dtype,
...
...
@@ -96,14 +96,14 @@ class Model(backbone):

        # fluid.global_scope().find_var('backbone-word_embedding').get_tensor()
        embedding_table = fluid.default_main_program().global_block().var(
            scope_name + self._word_emb_name)

-       position_emb_out = fluid.layers.embedding(
+       position_emb_out = fluid.embedding(
            input=pos_ids,
            size=[self._max_position_seq_len, self._emb_size],
            dtype=self._emb_dtype,
            param_attr=fluid.ParamAttr(
                name=scope_name + self._pos_emb_name,
                initializer=self._param_initializer))

-       sent_emb_out = fluid.layers.embedding(
+       sent_emb_out = fluid.embedding(
            sent_ids,
            size=[self._sent_types, self._emb_size],
            dtype=self._emb_dtype,
...
...
@@ -113,7 +113,7 @@ class Model(backbone):

        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out

-       task_emb_out = fluid.layers.embedding(
+       task_emb_out = fluid.embedding(
            task_ids,
            size=[self._task_types, self._emb_size],
            dtype=self._emb_dtype,
...
...
paddlepalm/mtl_controller.py
...
...
@@ -473,7 +473,7 @@ class Controller(object):

        # compute loss
        task_id_var = net_inputs['__task_id']
-       task_id_vec = layers.one_hot(task_id_var, num_instances)
+       task_id_vec = fluid.one_hot(task_id_var, num_instances)
        losses = fluid.layers.concat(
            [task_output_vars[inst.name + '/loss'] for inst in instances], axis=0)
        loss = layers.reduce_sum(task_id_vec * losses)
...
...
@@ -622,8 +622,9 @@ class Controller(object):

                global_step += 1
                cur_task.cur_train_step += 1
-               if cur_task.save_infermodel_every_n_steps > 0 and cur_task.cur_train_step % cur_task.save_infermodel_every_n_steps == 0:
-                   cur_task.save(suffix='.step' + str(cur_task.cur_train_step))
+               cur_task_global_step = cur_task.cur_train_step + cur_task.cur_train_epoch * cur_task.steps_pur_epoch
+               if cur_task.is_target and cur_task.save_infermodel_every_n_steps > 0 and cur_task_global_step % cur_task.save_infermodel_every_n_steps == 0:
+                   cur_task.save(suffix='.step' + str(cur_task_global_step))

                if global_step % main_conf.get('print_every_n_steps', 5) == 0:
                    loss = rt_outputs[cur_task.name + '/loss']
...
...
@@ -641,7 +642,7 @@ class Controller(object):

                    print(cur_task.name + ': train finished!')
                    cur_task.save()

-               if 'save_every_n_steps' in main_conf and global_step % main_conf['save_every_n_steps'] == 0:
+               if 'save_ckpt_every_n_steps' in main_conf and global_step % main_conf['save_ckpt_every_n_steps'] == 0:
                    save_path = os.path.join(main_conf['save_path'], 'ckpt',
                                             "step_" + str(global_step))
                    fluid.io.save_persistables(self.exe, save_path, saver_program)
...
...
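With this change the inference model is saved on a per-task global step (the step within the current epoch plus completed epochs times steps per epoch), and only for target tasks. A small standalone sketch of that arithmetic with made-up numbers; the variable names mirror cur_train_step, cur_train_epoch and steps_pur_epoch but are not the controller's own code:

```python
# Hypothetical values, for illustration only.
steps_per_epoch = 100          # corresponds to cur_task.steps_pur_epoch
save_every = 250               # corresponds to save_infermodel_every_n_steps
is_target = True

for epoch in range(3):
    for step_in_epoch in range(1, steps_per_epoch + 1):
        global_step = step_in_epoch + epoch * steps_per_epoch
        if is_target and save_every > 0 and global_step % save_every == 0:
            print('save inference model at task step', global_step)
# prints once: save inference model at task step 250
```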
paddlepalm/reader/cls.py
...
...
@@ -62,18 +62,18 @@ class Reader(reader):

    @property
    def outputs_attr(self):
        if self._is_training:
-           return {"token_ids": [[-1, -1, 1], 'int64'],
-                   "position_ids": [[-1, -1, 1], 'int64'],
-                   "segment_ids": [[-1, -1, 1], 'int64'],
+           return {"token_ids": [[-1, -1], 'int64'],
+                   "position_ids": [[-1, -1], 'int64'],
+                   "segment_ids": [[-1, -1], 'int64'],
                    "input_mask": [[-1, -1, 1], 'float32'],
-                   "label_ids": [[-1, 1], 'int64'],
-                   "task_ids": [[-1, -1, 1], 'int64']
+                   "label_ids": [[-1], 'int64'],
+                   "task_ids": [[-1, -1], 'int64']
                    }
        else:
-           return {"token_ids": [[-1, -1, 1], 'int64'],
-                   "position_ids": [[-1, -1, 1], 'int64'],
-                   "segment_ids": [[-1, -1, 1], 'int64'],
-                   "task_ids": [[-1, -1, 1], 'int64'],
+           return {"token_ids": [[-1, -1], 'int64'],
+                   "position_ids": [[-1, -1], 'int64'],
+                   "segment_ids": [[-1, -1], 'int64'],
+                   "task_ids": [[-1, -1], 'int64'],
                    "input_mask": [[-1, -1, 1], 'float32']
                    }
...
...
paddlepalm/reader/match.py
...
...
@@ -72,12 +72,12 @@ class Reader(reader):

    @property
    def outputs_attr(self):
        if self._is_training:
-           return {"token_ids": [[-1, -1, 1], 'int64'],
-                   "position_ids": [[-1, -1, 1], 'int64'],
-                   "segment_ids": [[-1, -1, 1], 'int64'],
+           return {"token_ids": [[-1, -1], 'int64'],
+                   "position_ids": [[-1, -1], 'int64'],
+                   "segment_ids": [[-1, -1], 'int64'],
                    "input_mask": [[-1, -1, 1], 'float32'],
-                   "label_ids": [[-1, 1], 'int64'],
-                   "task_ids": [[-1, -1, 1], 'int64']
+                   "label_ids": [[-1], 'int64'],
+                   "task_ids": [[-1, -1], 'int64']
                    }
        if siamese:
            if learning_strategy == 'pointwise':
...
...
@@ -102,10 +102,10 @@ class Reader(reader):

        else:
-           return {"token_ids": [[-1, -1, 1], 'int64'],
-                   "position_ids": [[-1, -1, 1], 'int64'],
-                   "segment_ids": [[-1, -1, 1], 'int64'],
-                   "task_ids": [[-1, -1, 1], 'int64'],
+           return {"token_ids": [[-1, -1], 'int64'],
+                   "position_ids": [[-1, -1], 'int64'],
+                   "segment_ids": [[-1, -1], 'int64'],
+                   "task_ids": [[-1, -1], 'int64'],
                    "input_mask": [[-1, -1, 1], 'float32']
                    }
...
...
paddlepalm/reader/mlm.py
...
...
@@ -60,13 +60,13 @@ class Reader(reader):

    @property
    def outputs_attr(self):
-       return {"token_ids": [[-1, -1, 1], 'int64'],
-               "position_ids": [[-1, -1, 1], 'int64'],
-               "segment_ids": [[-1, -1, 1], 'int64'],
+       return {"token_ids": [[-1, -1], 'int64'],
+               "position_ids": [[-1, -1], 'int64'],
+               "segment_ids": [[-1, -1], 'int64'],
                "input_mask": [[-1, -1, 1], 'float32'],
-               "task_ids": [[-1, -1, 1], 'int64'],
-               "mask_label": [[-1, 1], 'int64'],
-               "mask_pos": [[-1, 1], 'int64'],
+               "task_ids": [[-1, -1], 'int64'],
+               "mask_label": [[-1], 'int64'],
+               "mask_pos": [[-1], 'int64'],
                }
...
...
paddlepalm/reader/mrc.py
...
...
@@ -69,22 +69,21 @@ class Reader(reader):

    @property
    def outputs_attr(self):
        if self._is_training:
-           return {"token_ids": [[-1, -1, 1], 'int64'],
-                   "position_ids": [[-1, -1, 1], 'int64'],
-                   "segment_ids": [[-1, -1, 1], 'int64'],
+           return {"token_ids": [[-1, -1], 'int64'],
+                   "position_ids": [[-1, -1], 'int64'],
+                   "segment_ids": [[-1, -1], 'int64'],
                    "input_mask": [[-1, -1, 1], 'float32'],
-                   "start_positions": [[-1, 1], 'int64'],
-                   "unique_ids": [[-1, 1], 'int64'],
-                   "end_positions": [[-1, 1], 'int64'],
-                   "task_ids": [[-1, -1, 1], 'int64']
+                   "start_positions": [[-1], 'int64'],
+                   "end_positions": [[-1], 'int64'],
+                   "task_ids": [[-1, -1], 'int64']
                    }
        else:
-           return {"token_ids": [[-1, -1, 1], 'int64'],
-                   "position_ids": [[-1, -1, 1], 'int64'],
-                   "segment_ids": [[-1, -1, 1], 'int64'],
-                   "task_ids": [[-1, -1, 1], 'int64'],
+           return {"token_ids": [[-1, -1], 'int64'],
+                   "position_ids": [[-1, -1], 'int64'],
+                   "segment_ids": [[-1, -1], 'int64'],
+                   "task_ids": [[-1, -1], 'int64'],
                    "input_mask": [[-1, -1, 1], 'float32'],
-                   "unique_ids": [[-1, 1], 'int64']
+                   "unique_ids": [[-1], 'int64']
                    }

    @property
...
...
paddlepalm/reader/utils/batching4bert.py
...
...
@@ -67,8 +67,8 @@ def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):

                sent[token_index] = MASK
                mask_flag = True
                mask_pos.append(sent_index * max_len + token_index)
-   mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
-   mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
+   mask_label = np.array(mask_label).astype("int64").reshape([-1])
+   mask_pos = np.array(mask_pos).astype("int64").reshape([-1])
    return batch_tokens, mask_label, mask_pos
...
...
@@ -96,7 +96,7 @@ def prepare_batch_data(insts,

    # or unique id
    for i in range(3, len(insts[0]), 1):
        labels = [inst[i] for inst in insts]
-       labels = np.array(labels).astype("int64").reshape([-1, 1])
+       labels = np.array(labels).astype("int64").reshape([-1])
        labels_list.append(labels)
    # First step: do mask without padding
    if mask_id >= 0:
...
...
@@ -154,14 +154,14 @@ def pad_batch_data(insts,

    inst_data = np.array(
        [list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts])
-   return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
+   return_list += [inst_data.astype("int64").reshape([-1, max_len])]

    # position data
    if return_pos:
        inst_pos = np.array(
            [list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts])
-       return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
+       return_list += [inst_pos.astype("int64").reshape([-1, max_len])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array([[1] * len(inst) + [0] *
...
...
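This file and the two batching utilities that follow make the same adjustment: padded id arrays are now returned as 2-D [batch_size, max_len] and label/position arrays as 1-D, instead of carrying a trailing singleton axis. A self-contained NumPy sketch of that padding step, with made-up instances:

```python
import numpy as np

insts, pad_idx = [[11, 12, 13], [21, 22]], 0
max_len = max(len(inst) for inst in insts)

inst_data = np.array([list(inst) + [pad_idx] * (max_len - len(inst)) for inst in insts])

old_shape = inst_data.astype("int64").reshape([-1, max_len, 1]).shape  # (2, 3, 1) - before this commit
new_shape = inst_data.astype("int64").reshape([-1, max_len]).shape     # (2, 3)    - after this commit
print(old_shape, new_shape)
```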
paddlepalm/reader/utils/batching4ernie.py
...
...
@@ -113,8 +113,8 @@ def mask(batch_tokens,

            pre_sent_len = len(sent)

-   mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
-   mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
+   mask_label = np.array(mask_label).astype("int64").reshape([-1])
+   mask_pos = np.array(mask_pos).astype("int64").reshape([-1])
    return batch_tokens, mask_label, mask_pos
...
...
@@ -136,7 +136,7 @@ def pad_batch_data(insts,

    inst_data = np.array(
        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
-   return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
+   return_list += [inst_data.astype("int64").reshape([-1, max_len])]

    # position data
    if return_pos:
...
...
@@ -145,7 +145,7 @@ def pad_batch_data(insts,

        for inst in insts])
-       return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
+       return_list += [inst_pos.astype("int64").reshape([-1, max_len])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
...
...
@@ -165,7 +165,7 @@ def pad_batch_data(insts,

    if return_seq_lens:
        seq_lens = np.array([len(inst) for inst in insts])
-       return_list += [seq_lens.astype("int64").reshape([-1, 1])]
+       return_list += [seq_lens.astype("int64").reshape([-1])]

    return return_list if len(return_list) > 1 else return_list[0]
...
...
paddlepalm/reader/utils/mlm_batching.py
...
...
@@ -168,14 +168,14 @@ def pad_batch_data(insts,

    inst_data = np.array(
        [list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts])
-   return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
+   return_list += [inst_data.astype("int64").reshape([-1, max_len])]

    # position data
    if return_pos:
        inst_pos = np.array(
            [list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts])
-       return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
+       return_list += [inst_pos.astype("int64").reshape([-1, max_len])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array([[1] * len(inst) + [0] *
...
...
paddlepalm/reader/utils/reader4ernie.py
...
...
@@ -480,17 +480,17 @@ class ClassifyReader(BaseReader):

        batch_labels = [record.label_id for record in batch_records]
        if self.is_classify:
-           batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1])
+           batch_labels = np.array(batch_labels).astype("int64").reshape([-1])
        elif self.is_regression:
-           batch_labels = np.array(batch_labels).astype("float32").reshape([-1, 1])
+           batch_labels = np.array(batch_labels).astype("float32").reshape([-1])

        if batch_records[0].qid:
            batch_qids = [record.qid for record in batch_records]
-           batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1])
+           batch_qids = np.array(batch_qids).astype("int64").reshape([-1])
        else:
-           batch_qids = np.array([]).astype("int64").reshape([-1, 1])
+           batch_qids = np.array([]).astype("int64").reshape([-1])

        # padding
        padded_token_ids, input_mask = pad_batch_data(
...
...
@@ -918,19 +918,19 @@ class MRCReader(BaseReader):

                record.end_position for record in batch_records]
-           batch_start_position = np.array(batch_start_position).astype("int64").reshape([-1, 1])
-           batch_end_position = np.array(batch_end_position).astype("int64").reshape([-1, 1])
+           batch_start_position = np.array(batch_start_position).astype("int64").reshape([-1])
+           batch_end_position = np.array(batch_end_position).astype("int64").reshape([-1])
        else:
            batch_size = len(batch_token_ids)
-           batch_start_position = np.zeros(shape=[batch_size, 1], dtype="int64")
-           batch_end_position = np.zeros(shape=[batch_size, 1], dtype="int64")
+           batch_start_position = np.zeros(shape=[batch_size], dtype="int64")
+           batch_end_position = np.zeros(shape=[batch_size], dtype="int64")

        batch_unique_ids = [record.unique_id for record in batch_records]
-       batch_unique_ids = np.array(batch_unique_ids).astype("int64").reshape([-1, 1])
+       batch_unique_ids = np.array(batch_unique_ids).astype("int64").reshape([-1])

        # padding
        padded_token_ids, input_mask = pad_batch_data(
...
...
paddlepalm/task_paradigm/cls.py
...
...
@@ -43,7 +43,7 @@ class TaskParadigm(task_paradigm):

    @property
    def inputs_attrs(self):
        if self._is_training:
-           reader = {"label_ids": [[-1, 1], 'int64']}
+           reader = {"label_ids": [[-1], 'int64']}
        else:
            reader = {}
        bb = {"sentence_embedding": [[-1, self._hidden_size], 'float32']}
...
...
@@ -75,8 +75,9 @@ class TaskParadigm(task_paradigm):

                name=scope_name + "cls_out_b",
                initializer=fluid.initializer.Constant(0.)))

        if self._is_training:
-           loss = fluid.layers.softmax_with_cross_entropy(
-               logits=logits, label=label_ids)
+           inputs = fluid.layers.softmax(logits)
+           loss = fluid.layers.cross_entropy(input=inputs, label=label_ids)
            loss = layers.mean(loss)
            return {"loss": loss}
        else:
...
...
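This paradigm (and match, mlm and mrc below) swaps the fused fluid.layers.softmax_with_cross_entropy call for an explicit softmax followed by fluid.layers.cross_entropy. For hard labels the two formulations yield the same cross-entropy value; the fused op is essentially a numerically fused variant. A small NumPy check of that equivalence on made-up logits:

```python
import numpy as np

logits = np.array([[2.0, 0.5, -1.0]])
label = 0

# explicit softmax + cross entropy (what the new code computes)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
loss_explicit = -np.log(probs[0, label])

# fused formulation (what softmax_with_cross_entropy computes)
loss_fused = -(logits[0, label] - np.log(np.exp(logits[0]).sum()))

print(np.isclose(loss_explicit, loss_fused))  # True
```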
paddlepalm/task_paradigm/match.py
...
...
@@ -44,7 +44,7 @@ class TaskParadigm(task_paradigm):

    @property
    def inputs_attrs(self):
        if self._is_training:
-           reader = {"label_ids": [[-1, 1], 'int64']}
+           reader = {"label_ids": [[-1], 'int64']}
        else:
            reader = {}
        bb = {"sentence_pair_embedding": [[-1, self._hidden_size], 'float32']}
...
...
@@ -84,8 +84,9 @@ class TaskParadigm(task_paradigm):

                initializer=fluid.initializer.Constant(0.)))

        if self._is_training:
-           ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
-               logits=logits, label=labels, return_softmax=True)
+           inputs = fluid.layers.softmax(logits)
+           ce_loss = fluid.layers.cross_entropy(input=inputs, label=labels)
            loss = fluid.layers.mean(x=ce_loss)
            return {'loss': loss}
        else:
...
...
paddlepalm/task_paradigm/mlm.py
...
...
@@ -33,8 +33,8 @@ class TaskParadigm(task_paradigm):

    @property
    def inputs_attrs(self):
        reader = {
-           "mask_label": [[-1, 1], 'int64'],
-           "mask_pos": [[-1, 1], 'int64']}
+           "mask_label": [[-1], 'int64'],
+           "mask_pos": [[-1], 'int64']}
        if not self._is_training:
            del reader['mask_label']
            del reader['batchsize_x_seqlen']
...
...
@@ -100,8 +100,9 @@ class TaskParadigm(task_paradigm):

            is_bias=True)

        if self._is_training:
-           mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
-               logits=fc_out, label=mask_label)
+           inputs = fluid.layers.softmax(fc_out)
+           mask_lm_loss = fluid.layers.cross_entropy(input=inputs, label=mask_label)
            loss = fluid.layers.mean(mask_lm_loss)
            return {'loss': loss}
        else:
...
...
paddlepalm/task_paradigm/mrc.py
...
...
@@ -49,11 +49,11 @@ class TaskParadigm(task_paradigm):

    @property
    def inputs_attrs(self):
        if self._is_training:
-           reader = {"start_positions": [[-1, 1], 'int64'],
-                     "end_positions": [[-1, 1], 'int64'],
+           reader = {"start_positions": [[-1], 'int64'],
+                     "end_positions": [[-1], 'int64'],
                      }
        else:
-           reader = {'unique_ids': [[-1, 1], 'int64']}
+           reader = {'unique_ids': [[-1], 'int64']}
        bb = {"encoder_outputs": [[-1, -1, self._hidden_size], 'float32']}
        return {'reader': reader, 'backbone': bb}
...
...
@@ -70,7 +70,7 @@ class TaskParadigm(task_paradigm):

        else:
            return {'start_logits': [[-1, -1, 1], 'float32'],
                    'end_logits': [[-1, -1, 1], 'float32'],
-                   'unique_ids': [[-1, 1], 'int64']}
+                   'unique_ids': [[-1], 'int64']}

    def build(self, inputs, scope_name=""):
...
...
@@ -102,9 +102,11 @@ class TaskParadigm(task_paradigm):

        start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)

        def _compute_single_loss(logits, positions):
            """Compute start/end loss for mrc model"""
-           loss = fluid.layers.softmax_with_cross_entropy(
-               logits=logits, label=positions)
+           inputs = fluid.layers.softmax(logits)
+           loss = fluid.layers.cross_entropy(input=inputs, label=positions)
            loss = fluid.layers.mean(x=loss)
            return loss
...
...
@@ -122,7 +124,7 @@ class TaskParadigm(task_paradigm):

    def postprocess(self, rt_outputs):
        """this func will be called after each step(batch) of training/evaluating/predicting process."""
        if not self._is_training:
-           unique_ids = np.squeeze(rt_outputs['unique_ids'], -1)
+           unique_ids = rt_outputs['unique_ids']
            start_logits = rt_outputs['start_logits']
            end_logits = rt_outputs['end_logits']
            for idx in range(len(unique_ids)):
...
...
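The postprocess change follows from the reader change above: unique_ids now arrives with shape [-1] rather than [-1, 1], so there is no trailing axis left to squeeze. A tiny NumPy illustration with made-up ids:

```python
import numpy as np

old_unique_ids = np.array([[1001], [1002]])   # shape (2, 1), as the old reader produced
new_unique_ids = np.array([1001, 1002])       # shape (2,), as the reader produces after this commit

print(np.squeeze(old_unique_ids, -1))         # [1001 1002] -- the squeeze used to be required
print(new_unique_ids)                         # [1001 1002] -- now the ids can be used directly
```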
paddlepalm/utils/reader_helper.py
...
...
@@ -19,7 +19,6 @@ import random

import numpy as np
import paddle
from paddle import fluid
from paddle.fluid import layers

def _check_and_adapt_shape_dtype(rt_val, attr, message=""):
...
...
@@ -65,7 +64,7 @@ def create_net_inputs(input_attrs, async=False, iterator_fn=None, dev_count=1, n

    inputs = []
    ret = {}
    for name, shape, dtype in input_attrs:
-       p = layers.data(name, shape=shape, dtype=dtype)
+       p = fluid.data(name, shape=shape, dtype=dtype)
        ret[name] = p
        inputs.append(p)
...
...
@@ -227,7 +226,7 @@ def merge_input_attrs(backbone_attr, task_attrs, insert_taskid=True, insert_batc

    names = []
    start = 0
    if insert_taskid:
        ret.append(([1, 1], 'int64'))
        names.append('__task_id')
        start += 1
...
...
script/convert_params.sh
new file, mode 100755
#!/bin/sh

if [[ $# != 1 ]]; then
    echo "usage: bash convert_params.sh <params_dir>"
    exit 1
fi

if [[ -f $1/__palminfo__ ]]; then
    echo "already converted."
    exit 0
fi

echo "converting..."
if [[ -d $1/params ]]; then
    cd $1/params
else
    cd $1
fi

mkdir .palm.backup

for file in $(ls *)
do
    cp $file .palm.backup; mv $file "__paddlepalm_"$file
done
tar -cf __rawmodel__ .palm.backup/*
rm .palm.backup/*
mv __rawmodel__ .palm.backup
# find . ! -name '__rawmodel__' -exec rm {} +
tar -cf __palmmodel__ __paddlepalm_*
touch __palminfo__
ls __paddlepalm_* > __palminfo__
rm __paddlepalm_*

cd - >/dev/null
echo "done!"
script/download_pretrain_backbone.sh
new file, mode 100755
#!/bin/bash
set -e

if [[ $# != 1 ]]; then
    echo "Usage: bash download_pretrain.sh <bert|ernie>"
    exit 1
fi

if [[ $1 == 'bert' ]]; then
    name="bert"
    link="https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz"
    packname="uncased_L-24_H-1024_A-16.tar.gz"
    dirname="uncased_L-24_H-1024_A-16"
elif [[ $1 == 'ernie' ]]; then
    name="ernie"
    link="https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz"
    packname="ERNIE_Large_en_stable-2.0.0.tar.gz"
else
    echo "$1 is currently not supported."
    exit 1
fi

if [[ ! -d pretrain_model ]]; then
    mkdir pretrain_model
fi

cd pretrain_model
mkdir $name
cd $name
echo "downloading ${name}..."
wget --no-check-certificate $link
echo "decompressing..."
tar -zxf $packname
rm -rf $packname
if [[ $dirname != "" ]]; then
    mv $dirname/* .
    rm -rf $dirname
fi

cd ../..
script/recover_params.sh
new file, mode 100755
#!/bin/sh

if [[ $# != 1 ]]; then
    echo "usage: bash recover_params.sh <params_dir>"
    exit 1
fi

if [[ ! -d $1 ]]; then
    echo "$1 not found."
    exit 1
fi

if [[ ! -f $1/__palmmodel__ ]]; then
    echo "paddlepalm model not found."
    exit 1
fi

echo "recovering..."

if [[ -d $1/params ]]; then
    cd $1/params
else
    cd $1
fi
rm __palm*
mv .palm.backup/__rawmodel__ .
rm -rf .palm.backup
tar -xf __rawmodel__
mv .palm.backup/* .
rm __rawmodel__
rm -rf .palm.backup
cd - >/dev/null
setup.py
...
...
@@ -18,7 +18,7 @@

"""
Setup script.
Authors: zhouxiangyang(zhouxiangyang@baidu.com)
-Date: 2019/09/29 21:00:01
+Date: 2019/12/05 13:24:01
"""
import setuptools
from io import open
...
...
@@ -27,10 +27,10 @@ with open("README.md", "r", encoding='utf-8') as fh:

setuptools.setup(
    name="paddlepalm",
-   version="0.2.1",
+   version="0.2.2",
    author="PaddlePaddle",
    author_email="zhangyiming04@baidu.com",
-   description="A Multi-task Learning Lib for PaddlePaddle Users.",
+   description="A Lib for PaddlePaddle Users.",
    # long_description=long_description,
    # long_description_content_type="text/markdown",
    url="https://github.com/PaddlePaddle/PALM",
...
...