Unverified commit f6e5d4ff, authored by pkpk, committed by GitHub

Merge pull request #1 from xixiaoyao/master

Release PALM 1.0
PALM
===
PALM (PAddLE Multitask) is a flexible and easy-to-use multi-task learning framework. It has rich model backbones (BERT, ERNIE, etc.), common task paradigms (classification, matching, sequence labeling, machine reading comprehension, etc.), and dataset reading and processing tools built in. For typical task scenarios, users can add a new task with almost no code; for special task scenarios, new tasks can be supported by implementing the framework's predefined interfaces.
## Installation
Currently, the only supported way to use PALM is to clone the source code:
```shell
git clone https://github.com/PaddlePaddle/PALM.git
```
**Dependencies**
- Python >= 2.7
- cuda >= 9.0
- cudnn >= 7.0
- PaddlePaddle >= 1.5.0 (see the [installation guide](http://www.paddlepaddle.org/#quick-start))
## Directory structure
- backbone: backbone networks for multi-task learning; bert, ernie, xlnet, etc. are supported, and users can add their own
- config: configuration files of the individual tasks; when adding a task, create its configuration file here
- data: datasets of the individual tasks
- pretrain_model: pretrained models, vocabularies, and their configurations
- optimizer: optimizers; users can define custom optimizers here
- reader: data reading and processing modules of the individual tasks, plus the joint_reader file that merges the per-task readers
- paradigm: network structures of the task output layers
- utils: shared utility functions
- mtl_run.py: the main multi-task learning workflow
- run.sh: launch script for multi-task learning
## Usage
The framework ships with three fully implemented example tasks: *Machine Reading Comprehension*, *Mask Language Model*, and *Question Answer Matching*. In `mtl_config.yaml`, *Machine Reading Comprehension* is configured as the main task and the others as auxiliary tasks. Multi-task learning can then be started with:
```
bash run.sh
```
### Multi-task learning configuration
`mtl_config.yaml` holds the main configuration for multi-task training and inference. It contains the following fields.
***Required fields***
- main_task: *(str)* name of the main task; currently only a single main task is supported. The name is taken from a configuration file name in the `config` folder (without the `.yaml` extension and without the intermediate suffix used for task sharing)
- auxiliary_task: *(str)* names of the auxiliary tasks; multiple auxiliary tasks are supported, separated by spaces. The names are taken from configuration file names in the `config` folder (without the `.yaml` extension and without the intermediate suffix used for task sharing)
- do_train: *(bool)* training flag
- do_predict: *(bool)* prediction flag; currently only the main task supports prediction
- checkpoint_path: *(str)* path for saving models, resuming training from checkpoints, and loading models for prediction; when loading from this path, the model of the last training step is read by default
- backbone_model: *(str)* backbone network to use; the name is taken from a module in the `backbone` directory
- vocab_path: *(str)* vocabulary file, stored as plain text with one token per line
- optimizer: *(str)* optimizer name, taken from a file name in `optimizer`
- learning_rate: *(str)* learning rate for the training phase
- skip_steps: *(int)* frequency (in steps) of printing logs during training
- epoch: *(int)* number of training epochs of the main task
- use_cuda: *(bool)* flag for training on GPU
- warmup_proportion: *(float)* warmup proportion when fine-tuning the pretrained model
- use_ema: *(bool)* whether to enable EMA (exponential moving average) for training and inference
- ema_decay: *(float)* decay rate when EMA is enabled
- random_seed: *(int)* random seed
- use_fp16: *(bool)* flag for mixed-precision training
- loss_scaling: *(float)* loss scaling factor when mixed-precision training is enabled
***Optional fields***
- pretrain_model_path: *(str)* path from which the pretrained model is loaded; the path should contain a params folder storing the model parameters
- pretrain_config_path: *(str)* configuration file of the pretrained model, in JSON format
- do_lower_case: *(bool)* whether to lower-case the input during preprocessing
- other user-defined fields (a minimal configuration sketch is shown below)
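The following sketch assembles the required fields above into a minimal `mtl_config.yaml`; all values are illustrative.
```yaml
main_task: "reading_comprehension"
auxiliary_task: "mask_language_model answer_matching"
do_train: True
do_predict: True
checkpoint_path: "output_model/firstrun"
backbone_model: "bert_model"
vocab_path: "pretrain_model/bert/vocab.txt"
optimizer: "bert_optimizer"
learning_rate: 3e-5
skip_steps: 10
epoch: 2
use_cuda: True
warmup_proportion: 0.1
use_ema: True
ema_decay: 0.9999
random_seed: 0
use_fp16: False
loss_scaling: 1.0
```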
### Adding a new task
To add a new task, after preparing the task's dataset, three pieces of development are required:
***config module***
Located in the `./config` directory, which stores the configuration file of each task instance, described in `yaml` format. Required fields of a configuration file:
- in_tokens: whether to build batches as LoD tensors; when `in_tokens` is False, batches are built by padding.
- batch_size: number of samples used per training or inference step. When `in_tokens` is True, `batch_size` is the number of tokens contained in each step.
Additional required fields for the training phase:
- train_file: path of the training set file
- mix_ratio: sampling weight of this task during training (1.0 means the expected number of sampling rounds equals that of the main task)
Additional required fields for the inference phase:
- predict_file: path of the test set file
In addition, users can define other hyperparameters as needed; they can be accessed when the task model is created. A minimal task configuration is sketched below.
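A minimal sketch of a task configuration; the paths and values are illustrative (compare the example task configs shipped with the framework):
```yaml
train_file: "data/my_task/train.txt"   # required for training
predict_file: "data/my_task/dev.txt"   # required for inference
mix_ratio: 0.5                         # sampling weight relative to the main task
batch_size: 4
in_tokens: False
# any user-defined hyperparameters may follow
```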
***reader module***
Located in the `./reader` directory; it implements dataset reading and processing. A new reader should be placed in the `reader` directory and must contain a `get_input_shape` function and a `DataProcessor` class (a skeleton is sketched after this list).
- **get_input_shape**: *(function)* defines the shapes and dtypes of the data the reader produces for the backbone and the task paradigm; definitions for both the training and inference phases must be returned.
  - Arguments
    - args: *(dict)* the parsed task configuration
  - Returns
    - train_input_shape: *(dict)* contains the two keys backbone and task; each value is a list of `(shape, dtype)` tuples
    - test_input_shape: *(dict)* contains the two keys backbone and task; each value is a list of `(shape, dtype)` tuples
- **DataProcessor**: *(class)* defines dataset loading, preprocessing, and iteration
  - \_\_init\_\_: constructor; parses and stores the relevant arguments and performs any necessary initialization
    - Arguments
      - args: *(dict)* the parsed task configuration
    - Returns
      - (none)
  - data_generator: *(function)* iterator over the dataset; yields one batch each time it is traversed
    - Arguments
      - phase: *(str)* the task's current phase, either training (`train`) or inference (`predict`)
      - shuffle: *(bool)* whether to shuffle the dataset during training
      - dev_count: *(int)* number of available GPUs or CPUs
    - Yields
      - tensors: *(list)* data yielded as a list, following the input shapes and dtypes declared for the backbone and the task in `get_input_shape`; the head elements of the yielded list are the inputs required by the backbone, and the following elements are the inputs required by the task
  - get_num_examples: *(function)* returns the number of examples. Note that, due to mechanisms such as sliding windows, the number of examples produced at runtime may exceed the number of examples in the dataset; in that case the runtime number should be returned
    - Arguments
      - (none)
    - Returns
      - num_examples: *(int)* number of examples
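A minimal sketch of a reader module under these conventions (e.g. `reader/yelp_senti_reader.py`); the BERT-style input layout and `args.max_seq_len` are illustrative assumptions, and attribute-style access to args mirrors the framework's parsed config:
```python
def get_input_shape(args):
    """Declare the (shape, dtype) pairs fed to the backbone and the task."""
    backbone_shape = [([-1, args.max_seq_len, 1], 'int64'),    # src_ids
                      ([-1, args.max_seq_len, 1], 'int64'),    # pos_ids
                      ([-1, args.max_seq_len, 1], 'int64'),    # sent_ids
                      ([-1, args.max_seq_len, 1], 'float32')]  # input_mask
    train_input_shape = {'backbone': backbone_shape,
                         'task': [([-1, 1], 'int64')]}  # e.g. labels
    test_input_shape = {'backbone': backbone_shape,
                        'task': [([-1, 1], 'int64')]}   # e.g. unique ids
    return train_input_shape, test_input_shape

class DataProcessor(object):
    def __init__(self, args):
        self._args = args
        self._num_examples = 0  # filled in while loading the dataset

    def data_generator(self, phase='train', shuffle=True, dev_count=1):
        """Return a generator creator; each traversal yields one batch."""
        def wrapper():
            # load args.train_file or args.predict_file, tokenize, batch...
            for batch in []:  # placeholder for the real batching loop
                # head elements: backbone inputs; tail elements: task inputs
                yield batch
        return wrapper

    def get_num_examples(self):
        return self._num_examples
```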
***task_paradigm module***
Located in the `./paradigm` directory; it describes the task paradigm (e.g., classification, matching, reading comprehension). A new task paradigm should be placed in the `paradigm` directory and must contain the two required functions `compute_loss` and `create_model`, and may additionally contain the two optional functions `postprocess` and `global_postprocess` (a skeleton is sketched after this list).
- create_model: *(function)* creates the task model
  - Arguments
    - reader_input: *(nested Variables)* outputs of the data input layer, as declared in the `get_input_shape` function of this task's reader module. The first N elements are the backbone's inputs; the following elements are the task's inputs.
    - base_model: *(Model)* the backbone model instance; the task is connected to the backbone by calling the backbone's public output interface. In general, a backbone's output interface includes at least the `final_sentence_representation` and `final_word_representation` properties.
      - base_model.final_sentence_representation: *(Variable)* vector representation of the input text, with shape `[batch_size, hidden_size]`
      - base_model.final_word_representation: *(Variable)* vector representation of each token of the input text, with shape `[batch_size, max_seqlen, hidden_size]`
    - is_training: *(bool)* training flag
    - args: *(Argument)* task-specific configuration; the concrete parameters are defined in the config folder
  - Returns
    - output_tensors: *(dict)* the task's output tensors. During training, the dictionary must contain at least a num_seqs entry, which records the number of samples in the batch (when the input is a LoD tensor, i.e. args.in_tokens is set to True, the samples are flattened and there is no explicit sample-count dimension)
- compute_loss: *(function)* computes the task's batch-averaged loss during training
  - Arguments
    - output_tensors: *(dict)* the value returned when the task is created (by calling `create_model`); maps the names of the Variables needed for the loss computation to their instances
    - args: *(Argument)* task-specific configuration; the concrete parameters are defined in the config folder
  - Returns
    - total_loss: *(Variable)* average loss over the current batch
- postprocess: *(function)* at inference time, post-processes the fetch_results of each inference step; returns the post-processed results for every sample of that step
  - Arguments
    - fetch_results: *(dict)* results computed for the fetch_dict at the current inference step, where fetch_dict is defined and returned by create_model.
  - Returns
    - processed_results: *(list)* post-processed results of all samples in the current inference step.
- global_postprocess: *(function)* after inference finishes, performs the final processing over the post-processed results of all samples (e.g., saving results, second-stage post-processing)
  - Arguments
    - pred_buf: the post-processed prediction results of all test-set samples
    - processor: the instance of the task's DataProcessor (dataset loading and processing) class
    - mtl_args: the multi-task learning configuration, defined in `mtl_config.yaml`
    - task_args: task-specific configuration, defined in the `config` folder
  - Returns
    - (none)
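A minimal sketch of a task paradigm module (e.g. `paradigm/yelp_senti.py`), assuming a binary classification task; the layer size and the bare fc layer are illustrative:
```python
import paddle.fluid as fluid

def create_model(reader_input, base_model=None, is_training=True, args=None):
    labels = reader_input[-1]  # task inputs follow the backbone inputs
    cls_feats = base_model.final_sentence_representation
    logits = fluid.layers.fc(input=cls_feats, size=2)
    num_seqs = fluid.layers.fill_constant(shape=[1], value=1, dtype='int64')
    return {'labels': labels, 'logits': logits, 'num_seqs': num_seqs}

def compute_loss(output_tensors, args=None):
    ce_loss = fluid.layers.softmax_with_cross_entropy(
        logits=output_tensors['logits'], label=output_tensors['labels'])
    return fluid.layers.mean(x=ce_loss)

def postprocess(fetch_results):  # optional
    return list(fetch_results['logits'])

def global_postprocess(pred_buf, processor, mtl_args, task_args):  # optional
    pass
```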
***Naming convention***
After creating the config, task_paradigm, and reader files of a new task, give the three files the same base name, and append a `_reader` suffix to the reader file name. For example, if the new task is named yelp_senti, the config file is `yelp_senti.yaml`, placed in the config folder; the task_paradigm file is `yelp_senti.py`, placed in the paradigm folder; and the reader file is `yelp_senti_reader.py`, placed in the reader folder, as laid out below.
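For the yelp_senti example, the resulting layout is:
```
config/yelp_senti.yaml          # task configuration
paradigm/yelp_senti.py          # task paradigm (create_model / compute_loss)
reader/yelp_senti_reader.py     # dataset reading and processing
```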
***One-to-One mode (task-layer sharing)***
By default, the framework performs multi-task training in one-to-many mode: the tasks share the encoder but not the output layers. This version also supports one-to-one mode, in which the tasks share both the encoder and the output layer (the model parameters are fully shared, but the data sources differ). The mode is enabled through the naming of the config files, as follows.
```
1. In mtl_config.yaml, configure the task names, e.g. main_task: "reading_comprehension"
2. If a task's dataset comes from multiple sources, add one task config per source under config for the same task. For example, if the task "reading_comprehension" has two datasets to train on, and the data within each batch comes from a single dataset, add the two configuration files reading_comprehension.name1.yaml and reading_comprehension.name2.yaml, where name1 and name2 can be chosen freely; the framework does not restrict these names.
3. Start multi-task learning: sh run.sh
```
## License
This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](https://github.com/PaddlePaddle/models/blob/develop/LICENSE).
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT model"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.fluid as fluid
from paddle.fluid import layers
import backbone.utils.transformer as transformer
class Model(object):
def __init__(self,
config,
is_training=False,
model_name=''):
self._emb_size = config["hidden_size"]
self._n_layer = config["num_hidden_layers"]
self._n_head = config["num_attention_heads"]
self._voc_size = config["vocab_size"]
self._max_position_seq_len = config["max_position_embeddings"]
self._sent_types = config["type_vocab_size"]
self._hidden_act = config["hidden_act"]
self._prepostprocess_dropout = config["hidden_dropout_prob"]
self._attention_dropout = config["attention_probs_dropout_prob"]
self._is_training = is_training
self.model_name = model_name
self._word_emb_name = self.model_name + "word_embedding"
self._pos_emb_name = self.model_name + "pos_embedding"
self._sent_emb_name = self.model_name + "sent_embedding"
        # Initialize all weights with a truncated normal initializer; all
        # biases are initialized to constant zero by default.
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config["initializer_range"])
def build_model(self, reader_input, use_fp16=False):
dtype = "float16" if use_fp16 else "float32"
src_ids, pos_ids, sent_ids, input_mask = reader_input[:4]
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
self.emb_out = emb_out
position_emb_out = fluid.layers.embedding(
input=pos_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
self.position_emb_out = position_emb_out
sent_emb_out = fluid.layers.embedding(
sent_ids,
size=[self._sent_types, self._emb_size],
dtype=dtype,
param_attr=fluid.ParamAttr(
name=self._sent_emb_name, initializer=self._param_initializer))
self.sent_emb_out = sent_emb_out
emb_out = emb_out + position_emb_out + sent_emb_out
emb_out = transformer.pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
if dtype == "float16":
input_mask = fluid.layers.cast(x=input_mask, dtype=dtype)
        self_attn_mask = fluid.layers.matmul(
            x=input_mask, y=input_mask, transpose_y=True)
        self_attn_mask = fluid.layers.scale(
            x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
        n_head_self_attn_mask = fluid.layers.stack(
            x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = transformer.encoder(
            enc_input=emb_out,
            attn_bias=n_head_self_attn_mask,
            n_layer=self._n_layer,
            n_head=self._n_head,
            d_key=self._emb_size // self._n_head,
            d_value=self._emb_size // self._n_head,
            d_model=self._emb_size,
            d_inner_hid=self._emb_size * 4,
            prepostprocess_dropout=self._prepostprocess_dropout,
            attention_dropout=self._attention_dropout,
            relu_dropout=0,
            hidden_act=self._hidden_act,
            preprocess_cmd="",
            postprocess_cmd="dan",
            param_initializer=self._param_initializer,
            name=self.model_name + 'encoder')
        next_sent_feat = fluid.layers.slice(
            input=self._enc_out, axes=[1], starts=[0], ends=[1])
        self.next_sent_feat = fluid.layers.fc(
            input=next_sent_feat,
            size=self._emb_size,
            act="tanh",
            param_attr=fluid.ParamAttr(
                name=self.model_name + "pooled_fc.w_0",
                initializer=self._param_initializer),
            bias_attr="pooled_fc.b_0")
@property
def final_word_representation(self):
"""final layer output of transformer encoder as the (contextual) word representation"""
return self._enc_out
@property
def final_sentence_representation(self):
"""final representation of the first token ([CLS]) as sentence representation """
return self.next_sent_feat
if __name__ == "__main__":
print("hello world!")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ernie model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import paddle.fluid as fluid
import backbone.utils.transformer4ernie as transformer
from backbone.interface import backbone
class Model(backbone):
def __init__(self,
config,
is_training=False,
):
self._emb_size = config['hidden_size']
self._n_layer = config['num_hidden_layers']
self._n_head = config['num_attention_heads']
self._voc_size = config['vocab_size']
self._max_position_seq_len = config['max_position_embeddings']
if config['sent_type_vocab_size']:
self._sent_types = config['sent_type_vocab_size']
else:
self._sent_types = config['type_vocab_size']
self._hidden_act = config['hidden_act']
self._prepostprocess_dropout = config['hidden_dropout_prob']
self._attention_dropout = config['attention_probs_dropout_prob']
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._sent_emb_name = "sent_embedding"
self._task_emb_name = "task_embedding"
self._emb_dtype = "float32"
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
def build_model(self, reader_input, use_fp16=False):
dtype = "float16" if use_fp16 else "float32"
src_ids, pos_ids, sent_ids, input_mask = reader_input[:4]
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
position_emb_out = fluid.layers.embedding(
input=pos_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding(
sent_ids,
size=[self._sent_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._sent_emb_name, initializer=self._param_initializer))
emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out
emb_out = transformer.pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
if dtype == "float16":
emb_out = fluid.layers.cast(x=emb_out, dtype=dtype)
input_mask = fluid.layers.cast(x=input_mask, dtype=dtype)
self_attn_mask = fluid.layers.matmul(
x=input_mask, y=input_mask, transpose_y=True)
self_attn_mask = fluid.layers.scale(
x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
self._enc_out = transformer.encoder(
enc_input=emb_out,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
name='encoder')
if dtype == "float16":
self._enc_out = fluid.layers.cast(
x=self._enc_out, dtype=self._emb_dtype)
@property
def final_word_representation(self):
return self._enc_out
@property
def final_sentence_representation(self):
"""Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(
input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(
name="pooled_fc.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc.b_0")
return next_sent_feat
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial, reduce  # reduce is a builtin only on Python 2
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddle.fluid.layer_helper import LayerHelper
def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None):
helper = LayerHelper('layer_norm', **locals())
mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True)
shift_x = layers.elementwise_sub(x=x, y=mean, axis=0)
variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True)
r_stdev = layers.rsqrt(variance + epsilon)
norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0)
param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])]
param_dtype = norm_x.dtype
scale = helper.create_parameter(
attr=param_attr,
shape=param_shape,
dtype=param_dtype,
default_initializer=fluid.initializer.Constant(1.))
bias = helper.create_parameter(
attr=bias_attr,
shape=param_shape,
dtype=param_dtype,
is_bias=True,
default_initializer=fluid.initializer.Constant(0.))
out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1)
out = layers.elementwise_add(x=out, y=bias, axis=-1)
return out
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
        q = layers.fc(input=queries,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_query_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_query_fc.b_0')
        k = layers.fc(input=keys,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_key_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_key_fc.b_0')
        v = layers.fc(input=values,
                      size=d_value * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_value_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x = x, shape = [0, 0, n_head, hidden_size // n_head], inplace=False)
# permuate the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x = trans_x,
shape = [0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace = False)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
    proj_out = layers.fc(input=out,
                         size=d_model,
                         num_flatten_dims=2,
                         param_attr=fluid.ParamAttr(
                             name=name + '_output_fc.w_0',
                             initializer=param_initializer),
                         bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
    if dropout_rate:
        hidden = layers.dropout(
            hidden,
            dropout_prob=dropout_rate,
            dropout_implementation="upscale_in_train",
            is_test=False)
    out = layers.fc(input=hidden,
                    size=d_hid,
                    num_flatten_dims=2,
                    param_attr=fluid.ParamAttr(
                        name=name + '_fc_1.w_0',
                        initializer=param_initializer),
                    bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float32")
            out = layer_norm(
                out,
                begin_norm_axis=len(out.shape) - 1,
                param_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_scale',
                    initializer=fluid.initializer.Constant(1.)),
                bias_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_bias',
                    initializer=fluid.initializer.Constant(0.)))
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float16")
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(
                    out,
                    dropout_prob=dropout_rate,
                    dropout_implementation="upscale_in_train",
                    is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
        param_initializer=param_initializer,
        name=name + '_multi_head_att')
    attn_output = post_process_layer(
        enc_input,
        attn_output,
        postprocess_cmd,
        prepostprocess_dropout,
        name=name + '_post_att')
    ffd_output = positionwise_feed_forward(
        pre_process_layer(
            attn_output,
            preprocess_cmd,
            prepostprocess_dropout,
            name=name + '_pre_ffn'),
        d_inner_hid,
        d_model,
        relu_dropout,
        hidden_act,
        param_initializer=param_initializer,
        name=name + '_ffn')
    return post_process_layer(
        attn_output,
        ffd_output,
        postprocess_cmd,
        prepostprocess_dropout,
        name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name='',
            return_all=False):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
enc_outputs = []
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
            param_initializer=param_initializer,
            name=name + '_layer_' + str(i))
enc_input = enc_output
if i < n_layer - 1:
enc_outputs.append(enc_output)
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
enc_outputs.append(enc_output)
if not return_all:
return enc_output
else:
return enc_output, enc_outputs
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
train_file: "data/am4mrqa/train.txt"
mix_ratio: 0.4
batch_size: 4
in_tokens: False
generate_neg_sample: False
train_file: "data/mlm4mrqa"
mix_ratio: 0.4
batch_size: 4
in_tokens: False
generate_neg_sample: False
train_file: "data/mrqa/mrqa-combined.train.raw.json"
predict_file: "data/mrqa/mrqa-combined.dev.raw.json"
sample_rate: 0.02
mix_ratio: 1.0
batch_size: 4
in_tokens: false
doc_stride: 128
with_negative: false
max_query_length: 64
max_answer_length: 30
n_best_size: 20
null_score_diff_threshold: 0.0
verbose: False
# mtl_config.yaml (the multi-task configuration described in the README above)
main_task: "reading_comprehension"
auxiliary_task: "mask_language_model answer_matching"
do_train: True
do_predict: True
checkpoint_path: "output_model/firstrun"
backbone_model: "bert_model"
pretrain_model_path: "pretrain_model/bert"
pretrain_config_path: "pretrain_model/bert/bert_config.json"
vocab_path: "pretrain_model/bert/vocab.txt"
# backbone_model: "ernie_model"
# pretrain_model_path: "pretrain_model/ernie/params"
# pretrain_config_path: "pretrain_model/ernie/ernie_config.json"
# vocab_path: "pretrain_model/ernie/vocab.txt"
optimizer: "bert_optimizer"
learning_rate: 3e-5
lr_scheduler: "linear_warmup_decay"
skip_steps: 10
save_steps: 10000
epoch: 2
use_cuda: True
warmup_proportion: 0.1
weight_decay: 0.1
do_lower_case: False
max_seq_len: 512
use_ema: True
ema_decay: 0.9999
random_seed: 0
use_fp16: False
loss_scaling: 1.0
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# encoding=utf8
import os
import sys
import time
import argparse
import importlib
import collections
import numpy as np
import multiprocessing
import paddle
import paddle.fluid as fluid
from utils.configure import PDConfig
from utils.placeholder import Placeholder
from utils.configure import JsonConfig, ArgumentGroup, print_arguments
from utils.init import init_pretraining_params, init_checkpoint
sys.path.append("reader")
import joint_reader
from joint_reader import create_reader
sys.path.append("optimizer")
sys.path.append("paradigm")
sys.path.append("backbone")
TASKSET_PATH="config"
def train(multitask_config):
# load task config
print("Loading multi_task configure...................")
args = PDConfig(yaml_file=[multitask_config])
args.build()
index = 0
reader_map_task = dict()
task_args_list = list()
reader_args_list = list()
id_map_task = {index: args.main_task}
print("Loading main task configure....................")
main_task_name = args.main_task
task_config_files = [i for i in os.listdir(TASKSET_PATH) if i.endswith('.yaml')]
main_config_list = [config for config in task_config_files if config.split('.')[0] == main_task_name]
main_args = None
for config in main_config_list:
main_yaml = os.path.join(TASKSET_PATH, config)
main_args = PDConfig(yaml_file=[multitask_config, main_yaml])
main_args.build()
main_args.Print()
if not task_args_list or main_task_name != task_args_list[-1][0]:
task_args_list.append((main_task_name, main_args))
        # str.strip removes a set of characters rather than a suffix, so use
        # os.path.splitext to drop the .yaml extension safely
        reader_args_list.append((os.path.splitext(config)[0], main_args))
        reader_map_task[os.path.splitext(config)[0]] = main_task_name
print("Loading auxiliary tasks configure...................")
aux_task_name_list = args.auxiliary_task.strip().split()
for aux_task_name in aux_task_name_list:
index += 1
id_map_task[index] = aux_task_name
print("Loading %s auxiliary tasks configure......." % aux_task_name)
aux_config_list = [config for config in task_config_files if config.split('.')[0] == aux_task_name]
for aux_yaml in aux_config_list:
aux_yaml = os.path.join(TASKSET_PATH, aux_yaml)
aux_args = PDConfig(yaml_file=[multitask_config, aux_yaml])
aux_args.build()
aux_args.Print()
if aux_task_name != task_args_list[-1][0]:
task_args_list.append((aux_task_name, aux_args))
            reader_args_list.append((os.path.splitext(aux_yaml)[0], aux_args))
            reader_map_task[os.path.splitext(aux_yaml)[0]] = aux_task_name
# import tasks reader module and build joint_input_shape
input_shape_list = []
reader_module_dict = {}
input_shape_dict = {}
for params in task_args_list:
task_reader_mdl = "%s_reader" % params[0]
reader_module = importlib.import_module(task_reader_mdl)
reader_servlet_cls = getattr(reader_module, "get_input_shape")
reader_input_shape = reader_servlet_cls(params[1])
reader_module_dict[params[0]] = reader_module
input_shape_list.append(reader_input_shape)
input_shape_dict[params[0]] = reader_input_shape
train_input_shape, test_input_shape, task_map_id = joint_reader.joint_input_shape(input_shape_list)
# import backbone model
backbone_mdl = args.backbone_model
backbone_cls = "Model"
backbone_module = importlib.import_module(backbone_mdl)
backbone_servlet = getattr(backbone_module, backbone_cls)
if not (args.do_train or args.do_predict):
raise ValueError("For args `do_train` and `do_predict`, at "
"least one of them must be True.")
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
startup_prog = fluid.default_startup_program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
#create joint pyreader
print('creating readers...')
gens = []
main_generator = ""
for params in reader_args_list:
generator_cls = getattr(reader_module_dict[reader_map_task[params[0]]], "DataProcessor")
generator_inst = generator_cls(params[1])
reader_generator = generator_inst.data_generator(phase='train', shuffle=True, dev_count=dev_count)
if not main_generator:
main_generator = generator_inst
gens.append((reader_generator, params[1].mix_ratio, reader_map_task[params[0]]))
joint_generator, train_pyreader, model_inputs = create_reader("train_reader", train_input_shape, True, task_map_id, gens)
train_pyreader.decorate_tensor_provider(joint_generator)
# build task inputs
task_inputs_list = []
main_test_input = []
task_id = model_inputs[0]
backbone_inputs = model_inputs[task_map_id[0][0]: task_map_id[0][1]]
for i in range(1, len(task_map_id)):
task_inputs = backbone_inputs + model_inputs[task_map_id[i][0]: task_map_id[i][1]]
task_inputs_list.append(task_inputs)
# build backbone model
print('building model backbone...')
conf = vars(args)
if args.pretrain_config_path is not None:
model_conf = JsonConfig(args.pretrain_config_path).asdict()
for k, v in model_conf.items():
if k in conf:
                    assert v == conf[k], "ERROR: argument {} in pretrain_model_config is NOT consistent with which in main.yaml".format(k)
conf.update(model_conf)
backbone_inst = backbone_servlet(conf, is_training=True)
print('building task models...')
num_train_examples = main_generator.get_num_examples()
if main_args.in_tokens:
max_train_steps = int(main_args.epoch * num_train_examples) // (
main_args.batch_size // main_args.max_seq_len) // dev_count
else:
max_train_steps = int(main_args.epoch * num_train_examples) // (
main_args.batch_size) // dev_count
mix_ratio_list = [task_args[1].mix_ratio for task_args in task_args_list]
args.max_train_steps = int(max_train_steps * (sum(mix_ratio_list) / main_args.mix_ratio))
print("Max train steps: %d" % max_train_steps)
build_strategy = fluid.BuildStrategy()
train_program = fluid.default_main_program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
backbone_inst.build_model(backbone_inputs)
all_loss_list = []
for i in range(len(task_args_list)):
task_name = task_args_list[i][0]
task_args = task_args_list[i][1]
if hasattr(task_args, 'paradigm'):
task_net = task_args.paradigm
else:
task_net = task_name
task_net_mdl = importlib.import_module(task_net)
task_net_cls = getattr(task_net_mdl, "create_model")
output_tensor = task_net_cls(task_inputs_list[i], base_model=backbone_inst, is_training=True, args=task_args)
loss_cls = getattr(task_net_mdl, "compute_loss")
task_loss = loss_cls(output_tensor, task_args)
all_loss_list.append(task_loss)
num_seqs = output_tensor['num_seqs']
task_one_hot = fluid.layers.one_hot(task_id, len(task_args_list))
all_loss = fluid.layers.concat(all_loss_list, axis=0)
loss = fluid.layers.reduce_sum(task_one_hot * all_loss)
programs = [train_program, startup_prog]
optimizer_mdl = importlib.import_module(args.optimizer)
optimizer_inst = getattr(optimizer_mdl, "optimization")
optimizer_inst(loss, programs, args=args)
loss.persistable = True
num_seqs.persistable = True
ema = fluid.optimizer.ExponentialMovingAverage(args.ema_decay)
ema.update()
train_compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
if args.do_predict:
conf = vars(args)
if args.pretrain_config_path is not None:
model_conf = JsonConfig(args.pretrain_config_path).asdict()
for k, v in model_conf.items():
if k in conf:
assert v == conf[k], "ERROR: argument {} in pretrain_model_config is NOT consistent with which in main.yaml".format(k)
conf.update(model_conf)
mod = reader_module_dict[main_task_name]
DataProcessor = getattr(mod, 'DataProcessor')
predict_processor = DataProcessor(main_args)
test_generator = predict_processor.data_generator(
phase='predict',
shuffle=False,
dev_count=dev_count)
new_test_input_shape = input_shape_dict[main_task_name][1]['backbone'] + input_shape_dict[main_task_name][1]['task']
assert new_test_input_shape == test_input_shape
build_strategy = fluid.BuildStrategy()
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
placeholder = Placeholder(test_input_shape)
test_pyreader, model_inputs = placeholder.build(
capacity=100, reader_name="test_reader")
test_pyreader.decorate_tensor_provider(test_generator)
# create model
backbone_inst = backbone_servlet(conf, is_training=False)
backbone_inst.build_model(model_inputs)
task_net_mdl = importlib.import_module(main_task_name)
task_net_cls = getattr(task_net_mdl, "create_model")
postprocess = getattr(task_net_mdl, "postprocess")
global_postprocess = getattr(task_net_mdl, "global_postprocess")
output_tensor = task_net_cls(model_inputs, base_model=backbone_inst, is_training=False, args=main_args)
if 'ema' not in dir():
ema = fluid.optimizer.ExponentialMovingAverage(args.ema_decay)
pred_fetch_names = []
fetch_vars = []
for i,j in output_tensor.items():
pred_fetch_names.append(i)
fetch_vars.append(j)
for var in fetch_vars:
var.persistable = True
pred_fetch_list = [i.name for i in fetch_vars]
test_prog = test_prog.clone(for_test=True)
test_compiled_program = fluid.CompiledProgram(test_prog).with_data_parallel(
build_strategy=build_strategy)
exe.run(startup_prog)
if args.do_train:
if args.pretrain_model_path:
init_pretraining_params(
exe,
args.pretrain_model_path,
main_program=startup_prog,
use_fp16=args.use_fp16)
if args.checkpoint_path:
if os.path.exists(args.checkpoint_path):
init_checkpoint(
exe,
args.checkpoint_path,
main_program=startup_prog,
use_fp16=args.use_fp16)
else:
os.makedirs(args.checkpoint_path)
elif args.do_predict:
if not args.checkpoint_path:
raise ValueError("args 'checkpoint_path' should be set if"
"only doing prediction!")
init_checkpoint(
exe,
args.checkpoint_path,
main_program=test_prog,
use_fp16=args.use_fp16)
if args.do_train:
print('start training...')
train_pyreader.start()
steps = 0
total_cost, total_num_seqs = [], []
time_begin = time.time()
while True:
try:
steps += 1
if steps % args.skip_steps == 0:
fetch_list = [loss.name, num_seqs.name, task_id.name]
else:
fetch_list = []
outputs = exe.run(train_compiled_program, fetch_list=fetch_list)
if steps % args.skip_steps == 0:
np_loss, np_num_seqs, np_task_id = outputs
total_cost.extend(np_loss * np_num_seqs)
total_num_seqs.extend(np_num_seqs)
time_end = time.time()
used_time = time_end - time_begin
current_example, epoch = main_generator.get_train_progress()
cur_task_name = id_map_task[np_task_id[0][0]]
print("epoch: %d, task_name: %s, progress: %d/%d, step: %d, loss: %f, "
"speed: %f steps/s" %
(epoch, cur_task_name, current_example, num_train_examples, steps,
np.sum(total_cost) / np.sum(total_num_seqs),
args.skip_steps / used_time))
total_cost, total_num_seqs = [], []
time_begin = time.time()
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoint_path,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps == max_train_steps:
save_path = os.path.join(args.checkpoint_path,
"step_" + str(steps) + "_final")
fluid.io.save_persistables(exe, save_path, train_program)
break
except paddle.fluid.core.EOFException as err:
save_path = os.path.join(args.checkpoint_path,
"step_" + str(steps) + "_final")
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
if args.do_predict:
print('start predicting...')
cnt = 0
if args.use_ema:
with ema.apply(exe):
test_pyreader.start()
pred_buf = []
while True:
try:
fetch_res = exe.run(fetch_list=pred_fetch_list, program=test_compiled_program)
cnt += 1
if cnt % 200 == 0:
print('predicting {}th batch...'.format(cnt))
fetch_dict = {}
for key,val in zip(pred_fetch_names, fetch_res):
fetch_dict[key] = val
res = postprocess(fetch_dict)
if res is not None:
pred_buf.extend(res)
except fluid.core.EOFException:
test_pyreader.reset()
break
global_postprocess(pred_buf, predict_processor, args, main_args)
else:
test_pyreader.start()
pred_buf = []
while True:
try:
fetch_res = exe.run(fetch_list=pred_fetch_list, program=test_compiled_program)
cnt += 1
if cnt % 200 == 0:
print('predicting {}th batch...'.format(cnt))
fetch_dict = {}
for key,val in zip(pred_fetch_names, fetch_res):
fetch_dict[key] = val
res = postprocess(fetch_dict)
if res is not None:
pred_buf.extend(res)
except fluid.core.EOFException:
test_pyreader.reset()
break
global_postprocess(pred_buf, predict_processor, args, main_args)
if __name__ == '__main__':
multitask_config = "mtl_config.yaml"
train(multitask_config)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from utils.fp16 import create_master_params_grads, master_param_to_train_param
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
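For intuition, the same schedule can be written in plain Python (a sketch for checking values outside a Paddle program; with power=1.0 and end_learning_rate=0.0, polynomial_decay reduces to linear decay):
```python
def scheduled_lr(step, learning_rate, warmup_steps, num_train_steps):
    """Linear warmup from 0, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / float(warmup_steps)
    # global_step is effectively capped at decay_steps when cycle=False
    frac = min(step, num_train_steps) / float(num_train_steps)
    return learning_rate * (1.0 - frac)

# e.g. learning_rate=3e-5, 100 warmup steps, 1000 total steps
print([scheduled_lr(s, 3e-5, 100, 1000) for s in (0, 50, 100, 500, 1000)])
```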
def optimization(loss, programs, args):
train_program = programs[0]
startup_prog = programs[1]
warmup_steps = args.max_train_steps * args.warmup_proportion
if warmup_steps > 0:
if args.lr_scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(float(args.learning_rate) ** 2)),
warmup_steps)
elif args.lr_scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(float(args.learning_rate), warmup_steps,
args.max_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
else:
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
scheduled_lr = args.learning_rate
clip_norm_thres = 1.0
# When using mixed precision training, scale the gradient clip threshold
# by loss_scaling
if args.use_fp16 and args.loss_scaling > 1.0:
clip_norm_thres *= args.loss_scaling
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm_thres))
def exclude_from_weight_decay(name):
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
param_list = dict()
if args.use_fp16:
param_grads = optimizer.backward(loss)
master_param_grads = create_master_params_grads(
param_grads, train_program, startup_prog, args.loss_scaling)
for param, _ in master_param_grads:
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
optimizer.apply_gradients(master_param_grads)
if args.weight_decay > 0:
for param, grad in master_param_grads:
if exclude_from_weight_decay(param.name.rstrip(".master")):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
                    updated_param = param - param_list[
                        param.name] * args.weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
master_param_to_train_param(master_param_grads, param_grads,
train_program)
else:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if args.weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * args.weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
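The function reads a handful of attributes from args; a minimal sketch of the fields it expects (attribute names taken from the code above, values illustrative; SimpleNamespace stands in for the framework's parsed config):
```python
from types import SimpleNamespace

args = SimpleNamespace(
    max_train_steps=1000,  # set by mtl_run.py before optimization() is called
    warmup_proportion=0.1,
    lr_scheduler="linear_warmup_decay",
    learning_rate=3e-5,
    use_fp16=False,
    loss_scaling=1.0,
    weight_decay=0.01,
)
# inside a fluid.program_guard(train_program, startup_prog) block with a loss Variable:
#     optimization(loss, [train_program, startup_prog], args=args)
```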
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# encoding=utf8
import paddle.fluid as fluid
def compute_loss(output_tensors, args=None):
"""Compute loss for mrc model"""
labels = output_tensors['labels']
logits = output_tensors['logits']
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
if args.use_fp16 and args.loss_scaling > 1.0:
loss *= args.loss_scaling
return loss
def create_model(reader_input, base_model=None, is_training=True, args=None):
"""
given the base model, reader_input
return the output tensors
"""
labels = reader_input[-1]
cls_feats = base_model.final_sentence_representation
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=cls_feats,
size=2,
param_attr=fluid.ParamAttr(
name="cls_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
num_seqs = fluid.layers.fill_constant(shape=[1], value=512, dtype='int64')
output_tensors = {}
output_tensors['labels'] = labels
output_tensors['logits'] = logits
output_tensors['num_seqs'] = num_seqs
return output_tensors
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from backbone.utils.transformer import pre_process_layer
from utils.configure import JsonConfig
def compute_loss(output_tensors, args=None):
"""Compute loss for mlm model"""
fc_out = output_tensors['mlm_out']
mask_label = output_tensors['mask_label']
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(
logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
return mean_mask_lm_loss
def create_model(reader_input, base_model=None, is_training=True, args=None):
"""
given the base model, reader_input
return the output tensors
"""
src_ids, pos_ids, sent_ids, input_mask, mask_label, mask_pos = reader_input
config = JsonConfig(args.pretrain_config_path)
_emb_size = config['hidden_size']
_voc_size = config['vocab_size']
_hidden_act = config['hidden_act']
_word_emb_name = "word_embedding"
_dtype = "float16" if args.use_fp16 else "float32"
_param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32')
enc_out = base_model.final_word_representation
    # flatten to [batch_size * seq_len, emb_size] so that the features of the
    # masked positions can be gathered by index
    reshaped_emb_out = fluid.layers.reshape(
        x=enc_out, shape=[-1, _emb_size])
# extract masked tokens' feature
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
    # constant tensor exposed through the num_seqs output slot
    num_seqs = fluid.layers.fill_constant(shape=[1], value=512, dtype='int64')
# transform: fc
mask_trans_feat = fluid.layers.fc(
input=mask_feat,
size=_emb_size,
act=_hidden_act,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_fc.w_0',
initializer=_param_initializer),
bias_attr=fluid.ParamAttr(name='mask_lm_trans_fc.b_0'))
# transform: layer norm
mask_trans_feat = pre_process_layer(
mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr(
name="mask_lm_out_fc.b_0",
initializer=fluid.initializer.Constant(value=0.0))
fc_out = fluid.layers.matmul(
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(
_word_emb_name),
transpose_y=True)
fc_out += fluid.layers.create_parameter(
shape=[_voc_size],
dtype=_dtype,
attr=mask_lm_out_bias_attr,
is_bias=True)
output_tensors = {}
output_tensors['num_seqs'] = num_seqs
output_tensors['mlm_out'] = fc_out
output_tensors['mask_label'] = mask_label
return output_tensors
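# ----------------------------------------------------------------------------
# Illustrative, hedged numpy sketch of the indexing scheme above: word
# representations are flattened to [batch * seq_len, emb] and masked-token
# features are gathered by mask_pos, which the batching code computes as
# sent_index * max_len + token_index. The sizes below are invented for the
# sketch.
def _demo_masked_gather():
    import numpy as np
    batch_size, max_len, emb_size = 2, 4, 3
    enc_out = np.arange(batch_size * max_len * emb_size,
                        dtype='float32').reshape([batch_size, max_len, emb_size])
    # masked positions: token 1 of sentence 0, token 2 of sentence 1
    mask_pos = np.array([0 * max_len + 1, 1 * max_len + 2])
    flat = enc_out.reshape([-1, emb_size])
    mask_feat = flat[mask_pos]
    assert (mask_feat[0] == enc_out[0, 1]).all()
    assert (mask_feat[1] == enc_out[1, 2]).all()
    print(mask_feat)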
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# encoding=utf8
import paddle.fluid as fluid
import collections
import os
from paddle.fluid import layers
def compute_loss(output_tensors, args=None):
"""Compute loss for mrc model"""
def _compute_single_loss(logits, positions):
"""Compute start/end loss for mrc model"""
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=positions)
loss = fluid.layers.mean(x=loss)
return loss
start_logits = output_tensors['start_logits']
end_logits = output_tensors['end_logits']
start_positions = output_tensors['start_positions']
end_positions = output_tensors['end_positions']
start_loss = _compute_single_loss(start_logits, start_positions)
end_loss = _compute_single_loss(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2.0
if args.use_fp16 and args.loss_scaling > 1.0:
total_loss = total_loss * args.loss_scaling
return total_loss
def create_model(reader_input, base_model=None, is_training=True, args=None):
"""
given the base model, reader_input
return the output tensors
"""
if is_training:
src_ids, pos_ids, sent_ids, input_mask, start_positions, end_positions = reader_input
else:
src_ids, pos_ids, sent_ids, input_mask, unique_id = reader_input
enc_out = base_model.final_word_representation
logits = fluid.layers.fc(
input=enc_out,
size=2,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name="cls_squad_out_w",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="cls_squad_out_b", initializer=fluid.initializer.Constant(0.)))
logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])
start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)
batch_ones = fluid.layers.fill_constant_batch_size_like(
input=start_logits, dtype='int64', shape=[1], value=1)
num_seqs = fluid.layers.reduce_sum(input=batch_ones)
output_tensors = {}
output_tensors['start_logits'] = start_logits
output_tensors['end_logits'] = end_logits
output_tensors['num_seqs'] = num_seqs
if is_training:
output_tensors['start_positions'] = start_positions
output_tensors['end_positions'] = end_positions
else:
output_tensors['unique_id'] = unique_id
return output_tensors
RawResult = collections.namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
def postprocess(fetch_results):
    """Convert fetched numpy results into RawResult tuples, skipping padding
    entries (negative unique_id)."""
    np_unique_ids = fetch_results['unique_id']
    np_start_logits = fetch_results['start_logits']
    np_end_logits = fetch_results['end_logits']
ret = []
for idx in range(np_unique_ids.shape[0]):
if np_unique_ids[idx] < 0:
continue
unique_id = int(np_unique_ids[idx])
start_logits = [float(x) for x in np_start_logits[idx].flat]
end_logits = [float(x) for x in np_end_logits[idx].flat]
ret.append(
RawResult(
unique_id=unique_id,
start_logits=start_logits,
end_logits=end_logits))
return ret
def global_postprocess(pred_buf, processor, mtl_args, task_args):
    """Write MRC predictions (plus n-best and null-odds files) to checkpoint_path."""
    if not os.path.exists(mtl_args.checkpoint_path):
        os.makedirs(mtl_args.checkpoint_path)
output_prediction_file = os.path.join(mtl_args.checkpoint_path, "predictions.json")
output_nbest_file = os.path.join(mtl_args.checkpoint_path, "nbest_predictions.json")
output_null_log_odds_file = os.path.join(mtl_args.checkpoint_path, "null_odds.json")
processor.write_predictions(pred_buf, task_args.n_best_size, task_args.max_answer_length,
task_args.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file,
task_args.with_negative,
task_args.null_score_diff_threshold, task_args.verbose)
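# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: feed postprocess a fake fetch_results dict
# shaped like the tensors fetched at predict time. The sizes are invented;
# entries with a negative unique_id are skipped as padding.
def _demo_postprocess():
    import numpy as np
    fake_results = {
        'unique_id': np.array([[1000], [-1]], dtype='int64'),
        'start_logits': np.random.rand(2, 8).astype('float32'),
        'end_logits': np.random.rand(2, 8).astype('float32'),
    }
    for raw in postprocess(fake_results):
        print(raw.unique_id, len(raw.start_logits), len(raw.end_logits))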
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import types
import csv
import numpy as np
from utils import tokenization
from utils.batching import prepare_batch_data
def get_input_shape(args):
"""
define answer matching model input shape
"""
train_input_shape = {"backbone": [([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'float32')],
"task": [([-1, 1], 'int64')]
}
test_input_shape = {"backbone": [([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'float32')],
"task": [([-1, 1], 'int64')]
}
return train_input_shape, test_input_shape
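# ----------------------------------------------------------------------------
# Illustrative, hedged sketch of the shape definitions for a concrete
# max_seq_len. `_FakeArgs` is a stand-in for the parsed task config.
def _demo_get_input_shape():
    class _FakeArgs(object):
        max_seq_len = 128
    train_shape, test_shape = get_input_shape(_FakeArgs())
    # 4 backbone inputs (src/pos/sent ids + input mask) and 1 task input (label)
    print(train_shape["backbone"])
    print(train_shape["task"])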
class BaseProcessor(object):
"""Base class for data converters for sequence classification data sets."""
    def __init__(self, args):
        self.train_file = args.train_file
        # dev/test files are optional in the task config; default to None so
        # only the phases actually used need to be configured
        self.dev_file = getattr(args, 'dev_file', None)
        self.test_file = getattr(args, 'test_file', None)
        self.max_seq_len = args.max_seq_len
        self.batch_size = args.batch_size
        self.epoch = args.epoch
self.tokenizer = tokenization.FullTokenizer(
vocab_file=args.vocab_path, do_lower_case=args.do_lower_case)
self.vocab = self.tokenizer.vocab
self.in_tokens = args.in_tokens
self.current_train_example = -1
self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
self.current_train_epoch = -1
def get_train_examples(self, file_path):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, file_path):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, file_path):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
def convert_example(self, index, example, labels, max_seq_len, tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
feature = convert_single_example(index, example, labels, max_seq_len,
tokenizer)
return feature
def generate_instance(self, feature):
"""
generate instance with given feature
Args:
feature: InputFeatures(object). A single set of features of data.
"""
input_pos = list(range(len(feature.input_ids)))
return [
feature.input_ids, feature.segment_ids, input_pos, feature.label_id
]
    def generate_batch_data(self,
                            batch_data,
                            total_token_num,
                            voc_size=-1,
                            mask_id=-1,
                            return_input_mask=True,
                            return_max_len=False,
                            return_num_token=False):
        # forward the caller's options instead of hard-coding them, so the
        # defaults in the signature actually take effect
        return prepare_batch_data(
            batch_data,
            total_token_num,
            voc_size=voc_size,
            pad_id=self.vocab["[PAD]"],
            cls_id=self.vocab["[CLS]"],
            sep_id=self.vocab["[SEP]"],
            mask_id=mask_id,
            return_input_mask=return_input_mask,
            return_max_len=return_max_len,
            return_num_token=return_num_token)
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
lines.append(line)
return lines
def get_num_examples(self, phase):
"""Get number of examples for train, dev or test."""
if phase not in ['train', 'dev', 'test']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
return self.num_examples[phase]
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
    def data_generator(self,
                       phase='train',
                       dev_count=1,
                       shuffle=True):
        """
        Generate data for train, dev or test.
        Args:
            phase: string. The phase for which to generate data.
            dev_count: int. The number of devices batches are grouped for.
            shuffle: bool. Whether to shuffle examples.
        """
if phase == 'train':
examples = self.get_train_examples(self.train_file)
self.num_examples['train'] = len(examples)
elif phase == 'dev':
examples = self.get_dev_examples(self.dev_file)
self.num_examples['dev'] = len(examples)
elif phase == 'test':
examples = self.get_test_examples(self.test_file)
self.num_examples['test'] = len(examples)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
def instance_reader():
for epoch_index in range(self.epoch):
if shuffle:
np.random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
for (index, example) in enumerate(examples):
if phase == 'train':
self.current_train_example = index + 1
feature = self.convert_example(
index, example,
self.get_labels(), self.max_seq_len, self.tokenizer)
instance = self.generate_instance(feature)
yield instance
def batch_reader(reader, batch_size, in_tokens):
batch, total_token_num, max_len = [], 0, 0
for instance in reader():
token_ids, sent_ids, pos_ids, label = instance[:4]
max_len = max(max_len, len(token_ids))
if in_tokens:
to_append = (len(batch) + 1) * max_len <= batch_size
else:
to_append = len(batch) < batch_size
if to_append:
batch.append(instance)
total_token_num += len(token_ids)
else:
yield batch, total_token_num
batch, total_token_num, max_len = [instance], len(
token_ids), len(token_ids)
if len(batch) > 0:
yield batch, total_token_num
def wrapper():
all_dev_batches = []
for batch_data, total_token_num in batch_reader(
instance_reader, self.batch_size, self.in_tokens):
batch_data = self.generate_batch_data(
batch_data,
total_token_num,
voc_size=-1,
mask_id=-1,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
return wrapper
class DataProcessor(BaseProcessor):
"""Processor for the MultiNLI data set (GLUE version)."""
def get_train_examples(self, file_path):
"""See base class."""
return self._create_examples(
self._read_tsv(file_path), "train")
def get_dev_examples(self, file_path):
"""See base class."""
return self._create_examples(
self._read_tsv(file_path), "dev")
def get_test_examples(self, file_path):
"""See base class."""
return self._create_examples(
self._read_tsv(file_path), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%d" % (set_type, i)
text_a = tokenization.convert_to_unicode(line[1])
text_b = tokenization.convert_to_unicode(line[2])
if set_type == "test":
label = "0"
else:
label = tokenization.convert_to_unicode(line[0])
examples.append(
InputExample(
guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
def convert_single_example_to_unicode(guid, single_example):
text_a = tokenization.convert_to_unicode(single_example[0])
text_b = tokenization.convert_to_unicode(single_example[1])
label = tokenization.convert_to_unicode(single_example[2])
return InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
    # For classification tasks, the first vector (corresponding to [CLS]) is
    # used as the "sentence vector". Note that this only makes sense because
    # the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
label_id = label_map[example.label]
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer):
"""Convert a set of `InputExample`s to a list of `InputFeatures`."""
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
print("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
features.append(feature)
return features
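# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: run convert_single_example with a toy
# whitespace tokenizer so the feature layout can be inspected without a real
# vocab file. `_ToyTokenizer` is a stand-in invented for the sketch; the real
# pipeline uses tokenization.FullTokenizer.
def _demo_convert_single_example():
    class _ToyTokenizer(object):
        def tokenize(self, text):
            return text.split()

        def convert_tokens_to_ids(self, tokens):
            # deterministic toy ids; FullTokenizer looks them up in the vocab
            return [abs(hash(t)) % 1000 for t in tokens]

    example = InputExample(guid="demo-0", text_a="is this jacksonville",
                           text_b="no it is not", label="1")
    feature = convert_single_example(0, example, ["0", "1"], 16,
                                     _ToyTokenizer())
    print(feature.input_ids)    # [CLS] a-tokens [SEP] b-tokens [SEP]
    print(feature.segment_ids)  # 0s for the first segment, 1s for the second
    print(feature.label_id)     # 1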
if __name__ == '__main__':
pass
#encoding=utf8
import os
import sys
import random
import numpy as np
import paddle
import paddle.fluid as fluid
from utils.placeholder import Placeholder
def repeat(reader):
"""Repeat a generator forever"""
generator = reader()
while True:
try:
yield next(generator)
except StopIteration:
generator = reader()
yield next(generator)
def create_joint_generator(input_shape, generators, task_map_id, is_multi_task=True):
    def empty_output(input_shape, batch_size=1):
        dtype_map = {'int32': np.int32, 'int64': np.int64,
                     'float32': np.float32, 'float64': np.float64}
        results = []
        for i in range(len(input_shape)):
            dtype = dtype_map[input_shape[i][1]]
            # copy before editing: the shared input_shape list must not be
            # mutated across calls
            shape = list(input_shape[i][0])
            shape[0] = batch_size
            pad_tensor = np.zeros(shape=shape, dtype=dtype)
            results.append(pad_tensor)
        return results
def wrapper():
generators_inst = [repeat(gen[0]) for gen in generators]
generators_ratio = [gen[1] for gen in generators]
weights = [ratio/sum(generators_ratio) for ratio in generators_ratio]
task_names = [gen[2] for gen in generators]
task_names_ids = [0]
for i in range(1, len(task_names)):
if task_names[i] == task_names[i - 1]:
task_names_ids.append(task_names_ids[-1])
else:
task_names_ids.append(task_names_ids[-1] + 1)
run_task_id = range(len(generators))
while True:
idx = np.random.choice(run_task_id, p=weights)
gen_results = next(generators_inst[idx])
if not gen_results:
break
batch_size = gen_results[0].shape[0]
results = empty_output(input_shape, batch_size)
task_id_tensor = np.array([[task_names_ids[idx]]]).astype("int64")
results[0] = task_id_tensor
            backbone_range_start = task_map_id[0][0]
            backbone_range_end = task_map_id[0][1]
            for i in range(backbone_range_start, backbone_range_end):
                results[i] = gen_results[i - 1]
            # `i` is intentionally reused after the loop: the sampled task's
            # outputs continue right where the backbone inputs ended
            cur_gene_task = task_names_ids[idx] + 1
            for j in range(task_map_id[cur_gene_task][0], task_map_id[cur_gene_task][1]):
                results[j] = gen_results[i]
                i += 1
yield results
return wrapper
def create_reader(reader_name, input_shape, is_multi_task, task_map_id, *gens):
"""
build reader for multi_task_learning
"""
placeholder = Placeholder(input_shape)
pyreader, model_inputs = placeholder.build(capacity=16, reader_name=reader_name)
joint_generator = create_joint_generator(input_shape, gens[0], task_map_id, is_multi_task=is_multi_task)
return joint_generator, pyreader, model_inputs
def joint_input_shape(input_shape_list):
"""
joint main task and auxiliary tasks input shape
"""
joint_test_input_shape = input_shape_list[0][1]["backbone"] + input_shape_list[0][1]["task"]
joint_train_input_shape = [([1, 1], 'int64')] # task_id_shape
backbone_input_shape = input_shape_list[0][0]["backbone"]
joint_train_input_shape.extend(backbone_input_shape)
task_map_id = [(1, len(input_shape_list[0][0]["backbone"]) + 1)]
for input_shape in input_shape_list:
task_input_shape = input_shape[0]["task"]
joint_train_input_shape.extend(task_input_shape)
task_map_id.append((task_map_id[-1][1], task_map_id[-1][1] + len(task_input_shape)))
return joint_train_input_shape, joint_test_input_shape, task_map_id
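# ----------------------------------------------------------------------------
# Illustrative, hedged sketch of how joint_input_shape merges a main task and
# one auxiliary task. Each list entry is the (train_shape, test_shape) pair
# returned by a reader's get_input_shape; the shapes below are invented.
def _demo_joint_input_shape():
    backbone = [([-1, 128, 1], 'int64'), ([-1, 128, 1], 'float32')]
    main_task = ({"backbone": backbone, "task": [([-1, 1], 'int64')]},
                 {"backbone": backbone, "task": [([-1, 1], 'int64')]})
    aux_task = ({"backbone": backbone, "task": [([-1, 1], 'int64'),
                                                ([-1, 1], 'int64')]},
                {"backbone": backbone, "task": [([-1, 1], 'int64')]})
    train_shape, test_shape, task_map_id = joint_input_shape(
        [main_task, aux_task])
    # slot 0 is the task_id tensor, slots 1..2 the shared backbone inputs,
    # then each task's own input slots follow
    print(len(train_shape))  # 1 task_id + 2 backbone + 1 main + 2 aux = 6
    print(task_map_id)       # [(1, 3), (3, 4), (4, 6)]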
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
from __future__ import division
import os
import numpy as np
import types
import gzip
import logging
import re
import six
import collections
from utils import tokenization
from utils.batching import prepare_batch_data
def get_input_shape(args):
"""
define mask language model input shape
"""
train_input_shape = {"backbone": [([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'float32')],
"task": [([-1, 1], 'int64'),
([-1, 1], 'int64')]
}
test_input_shape = {"backbone": [([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'int64'),
([-1, args.max_seq_len, 1], 'float32')],
"task": [([-1, 1], 'int64'),
([-1, 1], 'int64')]
}
return train_input_shape, test_input_shape
class DataProcessor(object):
def __init__(self, args):
self.vocab = self.load_vocab(args.vocab_path)
self.data_dir = args.train_file
self.batch_size = args.batch_size
self.in_tokens = args.in_tokens
self.epoch = args.epoch
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.current_file = None
self.generate_neg_sample = args.generate_neg_sample
self.voc_size = len(self.vocab)
self.max_seq_len = args.max_seq_len
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
if self.in_tokens:
assert self.batch_size >= self.max_seq_len, "The number of " \
"tokens in batch should not be smaller than max seq length."
def get_progress(self):
"""return current progress of training data
"""
return self.current_epoch, self.current_file_index, self.total_file, self.current_file
def parse_line(self, line, max_seq_len=512):
""" parse one line to token_ids, sentence_ids, pos_ids, label
"""
line = line.strip().decode().split(";")
assert len(line) == 4, "One sample must have 4 fields!"
(token_ids, sent_ids, pos_ids, label) = line
token_ids = [int(token) for token in token_ids.split(" ")]
sent_ids = [int(token) for token in sent_ids.split(" ")]
pos_ids = [int(token) for token in pos_ids.split(" ")]
assert len(token_ids) == len(sent_ids) == len(
pos_ids
), "[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
label = int(label)
if len(token_ids) > max_seq_len:
return None
return [token_ids, sent_ids, pos_ids, label]
def read_file(self, file):
assert file.endswith('.gz'), "[ERROR] %s is not a gzip file" % file
file_path = self.data_dir + "/" + file
with gzip.open(file_path, "rb") as f:
for line in f:
parsed_line = self.parse_line(
line, max_seq_len=self.max_seq_len)
if parsed_line is None:
continue
yield parsed_line
def convert_to_unicode(self, text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
    def load_vocab(self, vocab_file):
        """Loads a vocabulary file into a dictionary."""
        vocab = collections.OrderedDict()
        with open(vocab_file) as fin:
            for num, line in enumerate(fin):
                items = self.convert_to_unicode(line.strip()).split("\t")
                if len(items) > 2:
                    break
                token = items[0]
                index = items[1] if len(items) == 2 else num
                token = token.strip()
                vocab[token] = int(index)
        return vocab
def random_pair_neg_samples(self, pos_samples):
""" randomly generate negative samples using pos_samples
Args:
pos_samples: list of positive samples
Returns:
neg_samples: list of negative samples
"""
np.random.shuffle(pos_samples)
num_sample = len(pos_samples)
neg_samples = []
miss_num = 0
for i in range(num_sample):
pair_index = (i + 1) % num_sample
            origin_src_ids = pos_samples[i][0]
            # token id 2 is assumed to be [SEP], consistent with self.sep_id
            origin_sep_index = origin_src_ids.index(2)
            pair_src_ids = pos_samples[pair_index][0]
            pair_sep_index = pair_src_ids.index(2)
src_ids = origin_src_ids[:origin_sep_index + 1] + pair_src_ids[
pair_sep_index + 1:]
if len(src_ids) >= self.max_seq_len:
miss_num += 1
continue
sent_ids = [0] * len(origin_src_ids[:origin_sep_index + 1]) + [
1
] * len(pair_src_ids[pair_sep_index + 1:])
pos_ids = list(range(len(src_ids)))
neg_sample = [src_ids, sent_ids, pos_ids, 0]
            assert len(src_ids) == len(sent_ids) == len(
                pos_ids
            ), "[ERROR] len(src_ids) == len(sent_ids) == len(pos_ids) must be True"
neg_samples.append(neg_sample)
return neg_samples, miss_num
def mixin_negative_samples(self, pos_sample_generator, buffer=1000):
""" 1. generate negative samples by randomly group sentence_1 and sentence_2 of positive samples
2. combine negative samples and positive samples
Args:
pos_sample_generator: a generator producing a parsed positive sample, which is a list: [token_ids, sent_ids, pos_ids, 1]
Returns:
sample: one sample from shuffled positive samples and negative samples
"""
pos_samples = []
num_total_miss = 0
pos_sample_num = 0
try:
while True:
while len(pos_samples) < buffer:
pos_sample = next(pos_sample_generator)
label = pos_sample[3]
assert label == 1, "positive sample's label must be 1"
pos_samples.append(pos_sample)
pos_sample_num += 1
neg_samples, miss_num = self.random_pair_neg_samples(
pos_samples)
num_total_miss += miss_num
samples = pos_samples + neg_samples
pos_samples = []
np.random.shuffle(samples)
for sample in samples:
yield sample
except StopIteration:
print("stopiteration: reach end of file")
if len(pos_samples) == 1:
yield pos_samples[0]
elif len(pos_samples) == 0:
yield None
else:
neg_samples, miss_num = self.random_pair_neg_samples(
pos_samples)
num_total_miss += miss_num
samples = pos_samples + neg_samples
pos_samples = []
np.random.shuffle(samples)
for sample in samples:
yield sample
print("miss_num:%d\tideal_total_sample_num:%d\tmiss_rate:%f" %
(num_total_miss, pos_sample_num * 2,
num_total_miss / (pos_sample_num * 2)))
def data_generator(self, phase='train', shuffle=True, dev_count=1):
"""
data_generator
"""
files = os.listdir(self.data_dir)
self.total_file = len(files)
assert self.total_file > 0, "[Error] data_dir is empty"
def wrapper():
def reader():
if phase == "train":
epoch_num = self.epoch
is_test = False
else:
epoch_num = 1
is_test = True
for epoch in range(epoch_num):
self.current_epoch = epoch + 1
if shuffle:
np.random.shuffle(files)
for index, file in enumerate(files):
self.current_file_index = index + 1
self.current_file = file
sample_generator = self.read_file(file)
if not is_test and self.generate_neg_sample:
sample_generator = self.mixin_negative_samples(
sample_generator)
for sample in sample_generator:
if sample is None:
continue
yield sample
def batch_reader(reader, batch_size, in_tokens):
batch, total_token_num, max_len = [], 0, 0
for parsed_line in reader():
token_ids, sent_ids, pos_ids, label = parsed_line
max_len = max(max_len, len(token_ids))
if in_tokens:
to_append = (len(batch) + 1) * max_len <= batch_size
else:
to_append = len(batch) < batch_size
if to_append:
batch.append(parsed_line)
total_token_num += len(token_ids)
else:
yield batch, total_token_num
batch, total_token_num, max_len = [parsed_line], len(
token_ids), len(token_ids)
if len(batch) > 0:
yield batch, total_token_num
for batch_data, total_token_num in batch_reader(
reader, self.batch_size, self.in_tokens):
yield prepare_batch_data(
batch_data,
total_token_num,
voc_size=self.voc_size,
pad_id=self.pad_id,
cls_id=self.cls_id,
sep_id=self.sep_id,
mask_id=self.mask_id,
max_len=self.max_seq_len,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
return wrapper
if __name__ == "__main__":
pass
#!/bin/bash
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1
# comment out the export below to fall back to CPU; the check that follows
# only takes the CPU branch when CUDA_VISIBLE_DEVICES is unset or empty
export CUDA_VISIBLE_DEVICES=0
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
export CPU_NUM=1
use_cuda=false
else
use_cuda=true
fi
python -u mtl_run.py
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
"""
Add mask for batch_tokens, return out, mask_label, mask_pos;
Note: mask_pos responding the batch_tokens after padded;
"""
max_len = max([len(sent) for sent in batch_tokens])
mask_label = []
mask_pos = []
prob_mask = np.random.rand(total_token_num)
# Note: the first token is [CLS], so [low=1]
replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)
pre_sent_len = 0
prob_index = 0
for sent_index, sent in enumerate(batch_tokens):
mask_flag = False
prob_index += pre_sent_len
for token_index, token in enumerate(sent):
prob = prob_mask[prob_index + token_index]
if prob > 0.15:
continue
elif 0.03 < prob <= 0.15:
# mask
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
elif 0.015 < prob <= 0.03:
# random replace
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = replace_ids[prob_index + token_index]
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
else:
# keep the original token
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
mask_pos.append(sent_index * max_len + token_index)
pre_sent_len = len(sent)
# ensure at least mask one word in a sentence
while not mask_flag:
token_index = int(np.random.randint(1, high=len(sent) - 1, size=1))
if sent[token_index] != SEP and sent[token_index] != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
return batch_tokens, mask_label, mask_pos
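# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: apply mask() to a toy batch. The special-token
# ids (CLS=1, SEP=2, MASK=3) match the function's defaults; vocab_size is
# invented for the sketch.
def _demo_mask():
    batch_tokens = [[1, 11, 12, 13, 2], [1, 14, 15, 2]]
    total_token_num = sum(len(sent) for sent in batch_tokens)
    out, mask_label, mask_pos = mask(
        batch_tokens, total_token_num, vocab_size=100)
    # mask_pos entries index the padded batch: sent_index * max_len + token_index
    print(out)
    print(mask_label.reshape([-1]), mask_pos.reshape([-1]))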
def prepare_batch_data(insts,
total_token_num,
max_len=None,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
1. generate Tensor of data
2. generate Tensor of position
3. generate self attention mask, [shape: batch_size * max_len * max_len]
"""
batch_src_ids = [inst[0] for inst in insts]
batch_sent_ids = [inst[1] for inst in insts]
batch_pos_ids = [inst[2] for inst in insts]
labels_list = []
# compatible with mrqa, whose example includes start/end positions,
# or unique id
for i in range(3, len(insts[0]), 1):
labels = [inst[i] for inst in insts]
labels = np.array(labels).astype("int64").reshape([-1, 1])
labels_list.append(labels)
# First step: do mask without padding
if mask_id >= 0:
out, mask_label, mask_pos = mask(
batch_src_ids,
total_token_num,
vocab_size=voc_size,
CLS=cls_id,
SEP=sep_id,
MASK=mask_id)
else:
out = batch_src_ids
# Second step: padding
src_id, self_input_mask = pad_batch_data(
out,
max_len=max_len,
pad_idx=pad_id, return_input_mask=True)
pos_id = pad_batch_data(
batch_pos_ids,
max_len=max_len,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
sent_id = pad_batch_data(
batch_sent_ids,
max_len=max_len,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
if mask_id >= 0:
return_list = [
src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
] + labels_list
else:
return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list
return return_list if len(return_list) > 1 else return_list[0]
def pad_batch_data(insts,
max_len=None,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and input mask.
"""
return_list = []
if max_len is None:
max_len = max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array([
list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts
])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
return return_list if len(return_list) > 1 else return_list[0]
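# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: pad a toy batch and build the input mask. The
# shapes follow the reshape rules above.
def _demo_pad_batch_data():
    insts = [[11, 12, 13], [14, 15]]
    src_ids, input_mask = pad_batch_data(
        insts, pad_idx=0, return_input_mask=True)
    print(src_ids.shape)     # (2, 3, 1)
    print(input_mask.shape)  # (2, 3, 1)
    print(input_mask.reshape([2, 3]))  # [[1. 1. 1.] [1. 1. 0.]]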
if __name__ == "__main__":
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import argparse
import json
import yaml
import six
import logging
logging_only_message = "%(message)s"
logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s"
class JsonConfig(object):
"""
    A high-level api for handling json configuration files.
"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
assert isinstance(config_dict, dict), "Object in {} is NOT a dict.".format(config_path)
        except Exception:
            raise IOError("Error in parsing bert model config file '%s'" %
                          config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def asdict(self):
return self._config_dict
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
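# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: round-trip a small dict through a temporary
# json file and read it back with JsonConfig. The keys are invented for the
# sketch.
def _demo_json_config():
    import tempfile
    path = tempfile.mktemp(suffix=".json")
    with open(path, "w") as fout:
        json.dump({"hidden_size": 768, "vocab_size": 21128}, fout)
    conf = JsonConfig(path)
    print(conf["hidden_size"])  # 768
    conf.print_config()
    os.remove(path)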
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
class ArgConfig(object):
"""
A high-level api for handling argument configs.
"""
def __init__(self):
parser = argparse.ArgumentParser()
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5,
"Learning rate used to train with warmup.")
train_g.add_arg(
"lr_scheduler",
str,
"linear_warmup_decay",
"scheduler of learning rate.",
choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01,
"Weight decay rate for L2 regularizer.")
train_g.add_arg(
"warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for."
)
train_g.add_arg("save_steps", int, 1000,
"The steps interval to save checkpoints.")
train_g.add_arg(
"loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled."
)
train_g.add_arg("pred_dir", str, None,
"Path to save the prediction results")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10,
"The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True,
"If set, use GPU for training.")
run_type_g.add_arg(
"use_fast_executor", bool, False,
"If set, use fast parallel executor (in experiment).")
run_type_g.add_arg(
"num_iteration_per_drop_scope", int, 1,
"Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("do_train", bool, True,
"Whether to perform training.")
run_type_g.add_arg("do_predict", bool, True,
"Whether to perform prediction.")
custom_g = ArgumentGroup(parser, "customize", "customized options.")
self.custom_g = custom_g
self.parser = parser
def add_arg(self, name, dtype, default, descrip):
self.custom_g.add_arg(name, dtype, default, descrip)
def build_conf(self):
return self.parser.parse_args()
def str2bool(v):
    # argparse does not parse strings like "True"/"False" into Python
    # booleans directly, so map them explicitly
    return v.lower() in ("true", "t", "1")
def print_arguments(args, log=None):
if not log:
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
else:
log.info('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
class PDConfig(object):
"""
A high-level API for managing configuration files in PaddlePaddle.
    Can work jointly with command-line arguments, json files and yaml files.
"""
def __init__(self, json_file="", yaml_file=[], fuse_args=True):
"""
        Init function for PDConfig.
json_file: the path to the json configure file.
yaml_file: the path to the yaml configure file.
fuse_args: if fuse the json/yaml configs with argparse.
"""
assert isinstance(json_file, str)
assert isinstance(yaml_file, list)
if json_file != "" and yaml_file != []:
raise Warning(
"json_file and yaml_file can not co-exist for now. please only use one configure file type."
)
return
self.args = None
self.arg_config = {}
self.json_config = {}
self.yaml_config = {}
parser = argparse.ArgumentParser()
self.default_g = ArgumentGroup(parser, "default", "default options.")
self.yaml_g = ArgumentGroup(parser, "yaml", "options from yaml.")
self.json_g = ArgumentGroup(parser, "json", "options from json.")
self.com_g = ArgumentGroup(parser, "custom", "customized options.")
"""
self.default_g.add_arg("epoch", int, 2,
"Number of epoches for training.")
self.default_g.add_arg("do_train", bool, False,
"Whether to perform training.")
self.default_g.add_arg("do_predict", bool, False,
"Whether to perform predicting.")
self.default_g.add_arg("do_eval", bool, False,
"Whether to perform evaluating.")
"""
self.parser = parser
if json_file != "":
self.load_json(json_file, fuse_args=fuse_args)
if yaml_file:
self.load_yaml(yaml_file, fuse_args=fuse_args)
def load_json(self, file_path, fuse_args=True):
        if not os.path.exists(file_path):
            raise IOError("the json file %s does not exist." % file_path)
        with open(file_path, "r") as fin:
            self.json_config = json.loads(fin.read())
if fuse_args:
for name in self.json_config:
if not isinstance(self.json_config[name], int) \
and not isinstance(self.json_config[name], float) \
and not isinstance(self.json_config[name], str) \
and not isinstance(self.json_config[name], bool):
continue
self.json_g.add_arg(name,
type(self.json_config[name]),
self.json_config[name],
"This is from %s" % file_path)
def load_yaml(self, file_path_list, fuse_args=True):
for file_path in file_path_list:
            if not os.path.exists(file_path):
                raise IOError("the yaml file %s does not exist." % file_path)
            with open(file_path, "r") as fin:
                self.yaml_config = yaml.load(fin, Loader=yaml.SafeLoader)
if fuse_args:
for name in self.yaml_config:
if not isinstance(self.yaml_config[name], int) \
and not isinstance(self.yaml_config[name], float) \
and not isinstance(self.yaml_config[name], str) \
and not isinstance(self.yaml_config[name], bool):
continue
self.yaml_g.add_arg(name,
type(self.yaml_config[name]),
self.yaml_config[name],
"This is from %s" % file_path)
def build(self):
self.args = self.parser.parse_args()
self.arg_config = vars(self.args)
def __add__(self, new_arg):
assert isinstance(new_arg, list) or isinstance(new_arg, tuple)
assert len(new_arg) >= 3
assert self.args is None
name = new_arg[0]
dtype = new_arg[1]
dvalue = new_arg[2]
desc = new_arg[3] if len(
new_arg) == 4 else "Description is not provided."
self.com_g.add_arg(name, dtype, dvalue, desc)
return self
def __getattr__(self, name):
if name in self.arg_config:
return self.arg_config[name]
if name in self.json_config:
return self.json_config[name]
if name in self.yaml_config:
return self.yaml_config[name]
raise Warning("The argument %s is not defined." % name)
def Print(self):
print("-" * 70)
for name in self.arg_config:
print("{: <25}\t{}".format(str(name), str(self.arg_config[name])))
for name in self.json_config:
if name not in self.arg_config:
print("{: <25}\t{}" %
(str(name), str(self.json_config[name])))
for name in self.yaml_config:
if name not in self.arg_config:
print("{: <25}\t{}" %
(str(name), str(self.yaml_config[name])))
print("-" * 70)
if __name__ == "__main__":
pd_config = PDConfig(yaml_file="./test/bert_config.yaml")
pd_config += ("my_age", int, 18, "I am forever 18.")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
print(pd_config.my_age)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import paddle
import paddle.fluid as fluid
def cast_fp16_to_fp32(i, o, prog):
prog.global_block().append_op(
type="cast",
inputs={"X": i},
outputs={"Out": o},
attrs={
"in_dtype": fluid.core.VarDesc.VarType.FP16,
"out_dtype": fluid.core.VarDesc.VarType.FP32
})
def cast_fp32_to_fp16(i, o, prog):
prog.global_block().append_op(
type="cast",
inputs={"X": i},
outputs={"Out": o},
attrs={
"in_dtype": fluid.core.VarDesc.VarType.FP32,
"out_dtype": fluid.core.VarDesc.VarType.FP16
})
def copy_to_master_param(p, block):
v = block.vars.get(p.name, None)
if v is None:
raise ValueError("no param name %s found!" % p.name)
new_p = fluid.framework.Parameter(
block=block,
shape=v.shape,
dtype=fluid.core.VarDesc.VarType.FP32,
type=v.type,
lod_level=v.lod_level,
stop_gradient=p.stop_gradient,
trainable=p.trainable,
optimize_attr=p.optimize_attr,
regularizer=p.regularizer,
gradient_clip_attr=p.gradient_clip_attr,
error_clip=p.error_clip,
name=v.name + ".master")
return new_p
def create_master_params_grads(params_grads, main_prog, startup_prog,
loss_scaling):
master_params_grads = []
tmp_role = main_prog._current_role
OpRole = fluid.core.op_proto_and_checker_maker.OpRole
main_prog._current_role = OpRole.Backward
for p, g in params_grads:
# create master parameters
master_param = copy_to_master_param(p, main_prog.global_block())
startup_master_param = startup_prog.global_block()._clone_variable(
master_param)
startup_p = startup_prog.global_block().var(p.name)
cast_fp16_to_fp32(startup_p, startup_master_param, startup_prog)
# cast fp16 gradients to fp32 before apply gradients
if g.name.find("layer_norm") > -1:
if loss_scaling > 1:
scaled_g = g / float(loss_scaling)
else:
scaled_g = g
master_params_grads.append([p, scaled_g])
continue
master_grad = fluid.layers.cast(g, "float32")
if loss_scaling > 1:
master_grad = master_grad / float(loss_scaling)
master_params_grads.append([master_param, master_grad])
main_prog._current_role = tmp_role
return master_params_grads
def master_param_to_train_param(master_params_grads, params_grads, main_prog):
for idx, m_p_g in enumerate(master_params_grads):
train_p, _ = params_grads[idx]
if train_p.name.find("layer_norm") > -1:
continue
with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
cast_fp32_to_fp16(m_p_g[0], train_p, main_prog)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def cast_fp32_to_fp16(exe, main_program):
print("Cast parameters to float16 data format.")
for param in main_program.global_block().all_parameters():
if not param.name.endswith(".master"):
param_t = fluid.global_scope().find_var(param.name).get_tensor()
data = np.array(param_t)
if param.name.find("layer_norm") == -1:
param_t.set(np.float16(data).view(np.uint16), exe.place)
master_param_var = fluid.global_scope().find_var(param.name +
".master")
if master_param_var is not None:
master_param_var.get_tensor().set(data, exe.place)
def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False, skip_list = []):
    assert os.path.exists(
        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
assert os.path.isdir(init_checkpoint_path), '{} is not a dir.'.format(init_checkpoint_path)
path = init_checkpoint_path
if not os.path.split(init_checkpoint_path)[-1].startswith('step_') and 'params' != os.path.split(init_checkpoint_path)[-1]:
max_step = 0
for d in os.listdir(init_checkpoint_path):
if os.path.isdir(os.path.join(init_checkpoint_path, d)):
if d.startswith('step_'):
step = int(d.lstrip('step_').rstrip('_final'))
if step > max_step:
path = os.path.join(init_checkpoint_path, d)
max_step = step
    def existed_persistables(var):
        if not fluid.io.is_persistable(var):
            return False
        if var.name in skip_list:
            return False
        return os.path.exists(os.path.join(path, var.name))
    print("loading checkpoint from {}...".format(path))
    fluid.io.load_vars(
        exe,
        path,
        main_program=main_program,
        predicate=existed_persistables)
if use_fp16:
cast_fp32_to_fp16(exe, main_program)
def init_pretraining_params(exe,
pretraining_params_path,
main_program,
use_fp16=False):
    assert os.path.exists(pretraining_params_path
                          ), "[%s] can't be found." % pretraining_params_path
assert os.path.isdir(pretraining_params_path), '{} is not a dir.'.format(pretraining_params_path)
if os.path.exists(os.path.join(pretraining_params_path, 'params')):
pretraining_params_path = os.path.join(pretraining_params_path, 'params')
    if not os.path.split(pretraining_params_path)[-1] == 'params':
        raise ValueError('Dir "params" not found in {}.'.format(pretraining_params_path))
max_step = 0
path = pretraining_params_path
for d in os.listdir(pretraining_params_path):
if os.path.isdir(os.path.join(pretraining_params_path, d)):
if d.startswith('step_'):
step = int(d.lstrip('step_').rstrip('_final'))
if step > max_step:
path = os.path.join(pretraining_params_path, d)
max_step = step
pretraining_params_path = path
print("loading pretrained parameters from {}...".format(
pretraining_params_path))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
if use_fp16:
cast_fp32_to_fp16(exe, main_program)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
class Placeholder(object):
    def __init__(self, input_shapes=None):
        self.shapes = []
        self.dtypes = []
        self.lod_levels = []
        self.names = []
        if input_shapes is None:
            return
        for new_holder in input_shapes:
shape = new_holder[0]
dtype = new_holder[1]
lod_level = new_holder[2] if len(new_holder) >= 3 else 0
name = new_holder[3] if len(new_holder) >= 4 else ""
self.append_placeholder(shape, dtype, lod_level = lod_level, name = name)
def append_placeholder(self, shape, dtype, lod_level = 0, name = ""):
self.shapes.append(shape)
self.dtypes.append(dtype)
self.lod_levels.append(lod_level)
self.names.append(name)
def build(self, capacity, reader_name, use_double_buffer = False):
pyreader = fluid.layers.py_reader(
capacity = capacity,
shapes = self.shapes,
dtypes = self.dtypes,
lod_levels = self.lod_levels,
name = reader_name,
use_double_buffer = use_double_buffer)
return [pyreader, fluid.layers.read_file(pyreader)]
    def __add__(self, new_holder):
        assert isinstance(new_holder, tuple) or isinstance(new_holder, list)
        assert len(new_holder) >= 2
        shape = new_holder[0]
        dtype = new_holder[1]
        lod_level = new_holder[2] if len(new_holder) >= 3 else 0
        name = new_holder[3] if len(new_holder) >= 4 else ""
        self.append_placeholder(shape, dtype, lod_level=lod_level, name=name)
        # return self so `holder + (...)` and `holder += (...)` keep working
        return self
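# ----------------------------------------------------------------------------
# Illustrative, hedged sketch: declare the shapes for one int64 id tensor plus
# a label tensor, then extend the holder with `+`. The shapes are invented;
# build() would turn them into a py_reader.
def _demo_placeholder():
    holder = Placeholder([([-1, 128, 1], 'int64', 0, 'src_ids')])
    holder = holder + ([-1, 1], 'int64')
    print(holder.shapes)   # [[-1, 128, 1], [-1, 1]]
    print(holder.dtypes)   # ['int64', 'int64']
    print(holder.names)    # ['src_ids', '']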
if __name__ == "__main__":
print("hello world!")