Commit 06ec43f5 authored by Steffy-zxf, committed by Zeyu Chen

Simplify Demo (#273)

* update demo

* simplify img cls demo

* remove elmo demo

* simplify multi-label cls demo

* simplify multi-label cls demo

* simplify img cls demo

* simplify text cls demo

* simplify lac demo

* simplify ssd demo

* simplify qa cls demo

* simplify regression demo

* simplify reading comprehension demo

* simplify sequence labeling demo

* simplify senta demo
Parent 802f9bef
# PaddleHub Text Classification

This example shows how to use the PaddleHub Finetune API, together with pretrained Chinese ELMo word embeddings, to complete a classification task on ChnSentiCorp, a Chinese sentiment analysis dataset.

## How to Start Finetuning

After installing PaddlePaddle and PaddleHub, run `sh run_elmo_finetune.sh` to start finetuning ELMo on the ChnSentiCorp dataset.

The script arguments are described below:
```bash
# Model related
--batch_size: batch size; adjust it to fit your GPU memory, and lower it if you run out of memory
--use_gpu: whether to use the GPU for finetuning; True by default
--learning_rate: the maximum learning rate during finetuning
--weight_decay: regularization strength, used to prevent overfitting; 0.01 by default
--warmup_proportion: proportion of warmup steps; e.g. with 0.1, the learning rate grows linearly from 0 to learning_rate over the first 10% of training steps and then decays slowly; 0 by default
--num_epoch: number of finetuning epochs

# Task related
--checkpoint_dir: directory for saving models; PaddleHub automatically saves the model that performs best on the validation set
```
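For reference, these flags map onto a direct invocation like the one below (mirroring the `run_elmo_finetune.sh` script included later in this commit):

```bash
python -u elmo_finetune.py \
    --batch_size=32 \
    --use_gpu=True \
    --checkpoint_dir="./ckpt_chnsenticorp" \
    --learning_rate=1e-4 \
    --weight_decay=1 \
    --num_epoch=3
```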
## Code Steps

Finetuning with the PaddleHub Finetune API takes four steps.

### Step 1: Load the pretrained model
```python
module = hub.Module(name="elmo")
inputs, outputs, program = module.context(trainable=True)
```
### Step 2: Prepare the dataset and read it with LACClassifyReader
```python
dataset = hub.dataset.ChnSentiCorp()
reader = hub.reader.LACClassifyReader(
    dataset=dataset,
    vocab_path=module.get_vocab_path())
```
The dataset-preparation code can be found in [chnsenticorp.py](https://github.com/PaddlePaddle/PaddleHub/blob/release/v1.2/paddlehub/dataset/chnsenticorp.py).

`hub.dataset.ChnSentiCorp()` automatically downloads the dataset and unpacks it into the `$HOME/.paddlehub/dataset` directory under the user's home.

`module.get_vocab_path()` returns the vocabulary file of the pretrained model.

LACClassifyReader's `data_generator` automatically tokenizes the data with the model's vocabulary and yields, as an iterator, the Tensor format ELMo needs, including `word_ids`.
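To sanity-check the reader, you can pull a single batch yourself. This is a quick inspection sketch assuming the PaddleHub v1.x reader interface, where `data_generator` returns a callable that yields batches; the Finetune API normally drives this generator for you:

```python
# Inspection only; assumes the v1.x reader API (data_generator(batch_size, phase)).
train_gen = reader.data_generator(batch_size=32, phase="train")
for batch in train_gen():
    print(batch)  # tokenized word_ids for one batch of samples
    break
```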
### Step 3: Choose an optimization strategy and run configuration
```python
strategy = hub.AdamWeightDecayStrategy(
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_proportion=0.0,
    lr_scheduler="linear_decay",
)

config = hub.RunConfig(
    use_cuda=True,
    use_data_parallel=True,
    use_pyreader=False,
    num_epoch=3,
    batch_size=32,
    strategy=strategy)
```
#### Optimization strategy

For ERNIE/BERT-style tasks, PaddleHub provides `AdamWeightDecayStrategy`, a transfer-learning optimization strategy suited to them:

* `learning_rate`: the maximum learning rate during finetuning;
* `weight_decay`: the model's regularization strength; 0.01 by default. If the model tends to overfit, increase it moderately;
* `warmup_proportion`: if warmup_proportion > 0, e.g. 0.1, the learning rate grows linearly to its peak learning_rate over the first 10% of training steps;
* `lr_scheduler`: two options: (1) with `linear_decay`, the learning rate decays linearly after reaching its peak; (2) with `noam_decay`, it decays polynomially after the peak.
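The schedule these two knobs describe is easy to write down explicitly. Below is a minimal sketch of linear warmup followed by `linear_decay`, for intuition only; it is not PaddleHub's implementation:

```python
def lr_at(step, total_steps, max_lr=5e-5, warmup_proportion=0.1):
    """Learning rate at `step`: linear warmup to max_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * step / warmup_steps  # warmup: 0 -> max_lr
    # decay: max_lr -> 0 over the remaining steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)
```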
#### Run configuration

`RunConfig` controls the finetuning run and exposes the following parameters:

* `log_interval`: interval between progress logs; every 10 steps by default
* `eval_interval`: interval between evaluations; the validation set is evaluated every 100 steps by default
* `save_ckpt_interval`: interval between checkpoints; configure it according to the task size. By default only the best model on the validation set and the final model are saved
* `use_cuda`: whether to train on the GPU; False by default
* `use_data_parallel`: whether to use parallel computation; False by default. Enabling it requires the nccl library
* `use_pyreader`: whether to use pyreader; False by default
* `checkpoint_dir`: checkpoint directory; auto-generated if not specified
* `num_epoch`: number of finetuning epochs
* `batch_size`: training batch size; when using a GPU, adjust it to the available memory
* `enable_memory_optim`: whether to enable memory optimization; True by default
* `strategy`: the finetuning optimization strategy

**Note**: when using LACClassifyReader, use_pyreader must be False.
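Putting the list above together, a fuller `RunConfig` might look like this; the values are illustrative rather than recommendations:

```python
config = hub.RunConfig(
    log_interval=10,       # print training logs every 10 steps
    eval_interval=100,     # evaluate on the validation set every 100 steps
    use_cuda=True,
    use_data_parallel=True,
    use_pyreader=False,    # must stay False with LACClassifyReader
    checkpoint_dir="./ckpt_chnsenticorp",
    num_epoch=3,
    batch_size=32,
    strategy=strategy)
```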
### Step 4: Build the network and create a classification transfer task for finetuning

With a suitable pretrained model and the dataset to transfer to, we assemble a Task:

>* get the module's context, including its input and output variables and the Paddle Program;
>* among the output variables, find the elmo_embedding for the input words and concatenate it with a randomly initialized word embedding;
>* feed the concatenated embedding into a GRU network for text classification, which produces the Task.
```python
word_ids = inputs["word_ids"]
elmo_embedding = outputs["elmo_embed"]
feed_list = [word_ids.name]
switch_main_program(program)
word_embed_dims = 128
word_embedding = fluid.layers.embedding(
input=word_ids,
size=[word_dict_len, word_embed_dims],
param_attr=fluid.ParamAttr(
learning_rate=30,
initializer=fluid.initializer.Uniform(low=-0.1, high=0.1)))
input_feature = fluid.layers.concat(
input=[elmo_embedding, word_embedding], axis=1)
fc = gru_net(program, input_feature)
elmo_task = hub.TextClassifierTask(
data_reader=reader,
feature=fc,
feed_list=feed_list,
num_classes=dataset.num_labels,
config=config)
elmo_task.finetune_and_eval()
```
**NOTE:**
1. `outputs["elmo_embed"]` returns the word embeddings pretrained by the ELMo model.
2. Given the input feature, labels, and the number of target classes, `hub.TextClassifierTask` builds the transfer task `TextClassifierTask` for text classification.
## Visualization

The Finetune API automatically records key training metrics. After launching the program, run:
```bash
$ tensorboard --logdir $CKPT_DIR/visualization --host ${HOST_IP} --port ${PORT_NUM}
```
where ${HOST_IP} is the local IP address and ${PORT_NUM} is an available port. For example, if the local IP is 192.168.0.1 and the port is 8040, open 192.168.0.1:8040 in a browser to watch the training metrics evolve.
## Model Prediction

After finetuning completes, the model that performed best on the validation set is automatically saved in the corresponding ckpt directory.

Configure the script arguments:
```
CKPT_DIR="./ckpt_chnsentiment"
python predict.py --checkpoint_dir --use_gpu True
```
where CKPT_DIR is the path where the Finetune API saved the best model.

Once the arguments are configured, run `sh run_predict.sh` to see the text classification predictions and the final accuracy.

For more details on the prediction steps, see `predict.py`.
**elmo_finetune.py**

#coding:utf-8
import argparse
import ast
import io

import numpy as np
from paddle.fluid.framework import switch_main_program
import paddle.fluid as fluid
import paddlehub as hub

# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--num_epoch", type=int, default=3, help="Number of epochs for fine-tuning.")
parser.add_argument("--use_gpu", type=ast.literal_eval, default=True, help="Whether to use GPU for finetuning; input should be True or False")
parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
parser.add_argument("--batch_size", type=int, default=32, help="Total number of examples in a training batch.")
parser.add_argument("--learning_rate", type=float, default=1e-4, help="Learning rate used to train with warmup.")
parser.add_argument("--weight_decay", type=float, default=5, help="Weight decay rate for L2 regularizer.")
parser.add_argument("--warmup_proportion", type=float, default=0.05, help="Warmup proportion for the warmup strategy")
args = parser.parse_args()
# yapf: enable.
def bow_net(program, input_feature, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    bow = fluid.layers.sequence_pool(input=input_feature, pool_type='sum')
    bow_tanh = fluid.layers.tanh(bow)
    fc_1 = fluid.layers.fc(input=bow_tanh, size=hid_dim, act="tanh")
    fc = fluid.layers.fc(input=fc_1, size=hid_dim2, act="tanh")
    return fc


def cnn_net(program, input_feature, win_size=3, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    conv_3 = fluid.nets.sequence_conv_pool(
        input=input_feature,
        num_filters=hid_dim,
        filter_size=win_size,
        act="relu",
        pool_type="max")
    fc = fluid.layers.fc(input=conv_3, size=hid_dim2)
    return fc


def gru_net(program, input_feature, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    fc0 = fluid.layers.fc(input=input_feature, size=hid_dim * 3)
    gru_h = fluid.layers.dynamic_gru(input=fc0, size=hid_dim, is_reverse=False)
    gru_max = fluid.layers.sequence_pool(input=gru_h, pool_type='max')
    gru_max_tanh = fluid.layers.tanh(gru_max)
    fc = fluid.layers.fc(input=gru_max_tanh, size=hid_dim2, act='tanh')
    return fc


def bilstm_net(program, input_feature, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    fc0 = fluid.layers.fc(input=input_feature, size=hid_dim * 4)
    rfc0 = fluid.layers.fc(input=input_feature, size=hid_dim * 4)
    lstm_h, c = fluid.layers.dynamic_lstm(
        input=fc0, size=hid_dim * 4, is_reverse=False)
    rlstm_h, c = fluid.layers.dynamic_lstm(
        input=rfc0, size=hid_dim * 4, is_reverse=True)
    # extract last step
    lstm_last = fluid.layers.sequence_last_step(input=lstm_h)
    rlstm_last = fluid.layers.sequence_last_step(input=rlstm_h)
    lstm_last_tanh = fluid.layers.tanh(lstm_last)
    rlstm_last_tanh = fluid.layers.tanh(rlstm_last)
    # concat layer
    lstm_concat = fluid.layers.concat(input=[lstm_last, rlstm_last], axis=1)
    # full connect layer
    fc = fluid.layers.fc(input=lstm_concat, size=hid_dim2, act='tanh')
    return fc


def lstm_net(program, input_feature, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    fc0 = fluid.layers.fc(input=input_feature, size=hid_dim * 4)
    lstm_h, c = fluid.layers.dynamic_lstm(
        input=fc0, size=hid_dim * 4, is_reverse=False)
    lstm_max = fluid.layers.sequence_pool(input=lstm_h, pool_type='max')
    lstm_max_tanh = fluid.layers.tanh(lstm_max)
    fc = fluid.layers.fc(input=lstm_max_tanh, size=hid_dim2, act='tanh')
    return fc
if __name__ == '__main__':
    # Step1: load the PaddleHub ELMo pretrained model
    module = hub.Module(name="elmo")
    inputs, outputs, program = module.context(trainable=True)

    # Step2: download the dataset and use LACClassifyReader to read it
    dataset = hub.dataset.ChnSentiCorp()
    reader = hub.reader.LACClassifyReader(
        dataset=dataset, vocab_path=module.get_vocab_path())
    word_dict_len = len(reader.vocab)

    word_ids = inputs["word_ids"]
    elmo_embedding = outputs["elmo_embed"]

    # Step3: switch program and build network
    switch_main_program(program)

    # Embedding layer
    word_embed_dims = 128
    word_embedding = fluid.layers.embedding(
        input=word_ids,
        size=[word_dict_len, word_embed_dims],
        param_attr=fluid.ParamAttr(
            learning_rate=30,
            initializer=fluid.initializer.Uniform(low=-0.1, high=0.1)))

    # Add elmo embedding
    input_feature = fluid.layers.concat(
        input=[elmo_embedding, word_embedding], axis=1)

    # Choose the net you would like: bow, cnn, gru, bilstm, lstm
    # We recommend gru_net
    fc = gru_net(program, input_feature)

    # Setup feed list for data feeder
    # Must feed all the tensors the senta module needs
    feed_list = [word_ids.name]

    # Step4: select the finetune strategy, set up the config and finetune
    strategy = hub.AdamWeightDecayStrategy(
        weight_decay=args.weight_decay,
        learning_rate=args.learning_rate,
        lr_scheduler="linear_decay",
        warmup_proportion=args.warmup_proportion)

    # Step5: set up the running config for the PaddleHub Finetune API
    config = hub.RunConfig(
        use_cuda=args.use_gpu,
        use_data_parallel=True,
        use_pyreader=False,
        num_epoch=args.num_epoch,
        batch_size=args.batch_size,
        checkpoint_dir=args.checkpoint_dir,
        strategy=strategy)

    # Step6: define a classification finetune task with PaddleHub's API
    elmo_task = hub.TextClassifierTask(
        data_reader=reader,
        feature=fc,
        feed_list=feed_list,
        num_classes=dataset.num_labels,
        config=config)

    # Finetune and evaluate with PaddleHub's API:
    # training, evaluation, testing and model saving are handled automatically
    elmo_task.finetune_and_eval()
**predict.py**

#coding:utf-8
import argparse
import ast
import io

import numpy as np
from paddle.fluid.framework import switch_main_program
import paddle.fluid as fluid
import paddlehub as hub

# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--use_gpu", type=ast.literal_eval, default=True, help="Whether to use GPU for prediction; input should be True or False")
parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
parser.add_argument("--batch_size", type=int, default=1, help="Total number of examples in a prediction batch.")
parser.add_argument("--learning_rate", type=float, default=1e-4, help="Learning rate used to train with warmup.")
parser.add_argument("--weight_decay", type=float, default=5, help="Weight decay rate for L2 regularizer.")
parser.add_argument("--warmup_proportion", type=float, default=0.05, help="Warmup proportion for the warmup strategy")
args = parser.parse_args()
# yapf: enable.
def bow_net(program, input_feature, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    bow = fluid.layers.sequence_pool(input=input_feature, pool_type='sum')
    bow_tanh = fluid.layers.tanh(bow)
    fc_1 = fluid.layers.fc(input=bow_tanh, size=hid_dim, act="tanh")
    fc = fluid.layers.fc(input=fc_1, size=hid_dim2, act="tanh")
    return fc


def cnn_net(program, input_feature, win_size=3, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    conv_3 = fluid.nets.sequence_conv_pool(
        input=input_feature,
        num_filters=hid_dim,
        filter_size=win_size,
        act="relu",
        pool_type="max")
    fc = fluid.layers.fc(input=conv_3, size=hid_dim2)
    return fc


def gru_net(program, input_feature, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    fc0 = fluid.layers.fc(input=input_feature, size=hid_dim * 3)
    gru_h = fluid.layers.dynamic_gru(input=fc0, size=hid_dim, is_reverse=False)
    gru_max = fluid.layers.sequence_pool(input=gru_h, pool_type='max')
    gru_max_tanh = fluid.layers.tanh(gru_max)
    fc = fluid.layers.fc(input=gru_max_tanh, size=hid_dim2, act='tanh')
    return fc


def bilstm_net(program, input_feature, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    fc0 = fluid.layers.fc(input=input_feature, size=hid_dim * 4)
    rfc0 = fluid.layers.fc(input=input_feature, size=hid_dim * 4)
    lstm_h, c = fluid.layers.dynamic_lstm(
        input=fc0, size=hid_dim * 4, is_reverse=False)
    rlstm_h, c = fluid.layers.dynamic_lstm(
        input=rfc0, size=hid_dim * 4, is_reverse=True)
    # extract last step
    lstm_last = fluid.layers.sequence_last_step(input=lstm_h)
    rlstm_last = fluid.layers.sequence_last_step(input=rlstm_h)
    lstm_last_tanh = fluid.layers.tanh(lstm_last)
    rlstm_last_tanh = fluid.layers.tanh(rlstm_last)
    # concat layer
    lstm_concat = fluid.layers.concat(input=[lstm_last, rlstm_last], axis=1)
    # full connect layer
    fc = fluid.layers.fc(input=lstm_concat, size=hid_dim2, act='tanh')
    return fc


def lstm_net(program, input_feature, hid_dim=128, hid_dim2=96):
    switch_main_program(program)
    fc0 = fluid.layers.fc(input=input_feature, size=hid_dim * 4)
    lstm_h, c = fluid.layers.dynamic_lstm(
        input=fc0, size=hid_dim * 4, is_reverse=False)
    lstm_max = fluid.layers.sequence_pool(input=lstm_h, pool_type='max')
    lstm_max_tanh = fluid.layers.tanh(lstm_max)
    fc = fluid.layers.fc(input=lstm_max_tanh, size=hid_dim2, act='tanh')
    return fc
if __name__ == '__main__':
    # Step1: load the PaddleHub ELMo pretrained model
    module = hub.Module(name="elmo")
    inputs, outputs, program = module.context(trainable=True)

    # Step2: download the dataset and use LACClassifyReader to read it
    dataset = hub.dataset.ChnSentiCorp()
    reader = hub.reader.LACClassifyReader(
        dataset=dataset, vocab_path=module.get_vocab_path())
    word_dict_len = len(reader.vocab)

    word_ids = inputs["word_ids"]
    elmo_embedding = outputs["elmo_embed"]

    # Step3: switch program and build network
    switch_main_program(program)

    # Embedding layer
    word_embed_dims = 128
    word_embedding = fluid.layers.embedding(
        input=word_ids,
        size=[word_dict_len, word_embed_dims],
        param_attr=fluid.ParamAttr(
            learning_rate=30,
            initializer=fluid.initializer.Uniform(low=-0.1, high=0.1)))

    # Add elmo embedding
    input_feature = fluid.layers.concat(
        input=[elmo_embedding, word_embedding], axis=1)

    # Choose the net you would like: bow, cnn, gru, bilstm, lstm
    # We recommend gru_net
    fc = gru_net(program, input_feature)

    # Setup feed list for data feeder
    # Must feed all the tensors the senta module needs
    feed_list = [word_ids.name]

    # Step4: select the finetune strategy, set up the config and finetune
    strategy = hub.AdamWeightDecayStrategy(
        weight_decay=args.weight_decay,
        learning_rate=args.learning_rate,
        lr_scheduler="linear_decay",
        warmup_proportion=args.warmup_proportion)

    # Step5: set up the running config for the PaddleHub Finetune API
    config = hub.RunConfig(
        use_cuda=args.use_gpu,
        use_data_parallel=True,
        use_pyreader=False,
        batch_size=args.batch_size,
        checkpoint_dir=args.checkpoint_dir,
        strategy=strategy)

    # Step6: define a classification finetune task with PaddleHub's API
    elmo_task = hub.TextClassifierTask(
        data_reader=reader,
        feature=fc,
        feed_list=feed_list,
        num_classes=dataset.num_labels,
        config=config)

    # Data to be predicted
    data = [
        "这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般", "交通方便;环境很好;服务态度很好 房间较小",
        "还稍微重了点,可能是硬盘大的原故,还要再轻半斤就好了。其他要进一步验证。贴的几种膜气泡较多,用不了多久就要更换了,屏幕膜稍好点,但比没有要强多了。建议配赠几张膜让用用户自己贴。",
        "前台接待太差,酒店有A B楼之分,本人check-in后,前台未告诉B楼在何处,并且B楼无明显指示;房间太小,根本不像4星级设施,下次不会再选择入住此店啦",
        "19天硬盘就罢工了~~~算上运来的一周都没用上15天~~~可就是不能换了~~~唉~~~~你说这算什么事呀~~~"
    ]

    index = 0
    run_states = elmo_task.predict(data=data)
    results = [run_state.run_results for run_state in run_states]
    for batch_result in results:
        # get the predicted label index
        batch_result = np.argmax(batch_result, axis=2)[0]
        for result in batch_result:
            print("%s\tpredict=%s" % (data[index], result))
            index += 1
**run_elmo_finetune.sh**

export FLAGS_eager_delete_tensor_gb=0.0
export CUDA_VISIBLE_DEVICES=0

python -u elmo_finetune.py \
    --batch_size=32 \
    --use_gpu=True \
    --checkpoint_dir="./ckpt_chnsenticorp" \
    --learning_rate=1e-4 \
    --weight_decay=1 \
    --num_epoch=3
**run_predict.sh**

export FLAGS_eager_delete_tensor_gb=0.0
export CUDA_VISIBLE_DEVICES=0

CKPT_DIR="./ckpt_chnsenticorp"
python -u predict.py --checkpoint_dir $CKPT_DIR --use_gpu True
```diff
@@ -15,7 +15,6 @@
 parser.add_argument("--batch_size", type=int, default=16, help="Total examples' number in batch for training.")
 parser.add_argument("--module", type=str, default="resnet50", help="Module used as feature extractor.")
 parser.add_argument("--dataset", type=str, default="flowers", help="Dataset to finetune.")
-parser.add_argument("--use_pyreader", type=ast.literal_eval, default=True, help="Whether use pyreader to feed data.")
 parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=True, help="Whether use data parallel.")
 # yapf: enable.
@@ -30,9 +29,11 @@
 def finetune(args):
+    # Load Paddlehub pretrained model
     module = hub.Module(name=args.module)
     input_dict, output_dict, program = module.context(trainable=True)
+    # Download dataset
     if args.dataset.lower() == "flowers":
         dataset = hub.dataset.Flowers()
     elif args.dataset.lower() == "dogcat":
@@ -46,6 +47,7 @@
     else:
         raise ValueError("%s dataset is not defined" % args.dataset)
+    # Use ImageClassificationReader to read dataset
     data_reader = hub.reader.ImageClassificationReader(
         image_width=module.get_expected_image_width(),
         image_height=module.get_expected_image_height(),
@@ -55,25 +57,27 @@
     feature_map = output_dict["feature_map"]
-    img = input_dict["image"]
-    feed_list = [img.name]
+    # Setup feed list for data feeder
+    feed_list = [input_dict["image"].name]
+    # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
         use_data_parallel=args.use_data_parallel,
-        use_pyreader=args.use_pyreader,
         use_cuda=args.use_gpu,
         num_epoch=args.num_epoch,
         batch_size=args.batch_size,
-        enable_memory_optim=False,
         checkpoint_dir=args.checkpoint_dir,
         strategy=hub.finetune.strategy.DefaultFinetuneStrategy())
+    # Define a reading comprehension finetune task by PaddleHub's API
     task = hub.ImageClassifierTask(
         data_reader=data_reader,
         feed_list=feed_list,
         feature=feature_map,
         num_classes=dataset.num_labels,
         config=config)
+    # Finetune by PaddleHub's API
     task.finetune_and_eval()
...
```
```diff
@@ -14,7 +14,6 @@
 parser.add_argument("--batch_size", type=int, default=16, help="Total examples' number in batch for training.")
 parser.add_argument("--module", type=str, default="resnet50", help="Module used as a feature extractor.")
 parser.add_argument("--dataset", type=str, default="flowers", help="Dataset to finetune.")
-parser.add_argument("--use_pyreader", type=ast.literal_eval, default=False, help="Whether use pyreader to feed data.")
 # yapf: enable.

 module_map = {
@@ -28,9 +27,11 @@
 def predict(args):
+    # Load Paddlehub pretrained model
     module = hub.Module(name=args.module)
     input_dict, output_dict, program = module.context(trainable=True)
+    # Download dataset
     if args.dataset.lower() == "flowers":
         dataset = hub.dataset.Flowers()
     elif args.dataset.lower() == "dogcat":
@@ -44,6 +45,7 @@
     else:
         raise ValueError("%s dataset is not defined" % args.dataset)
+    # Use ImageClassificationReader to read dataset
     data_reader = hub.reader.ImageClassificationReader(
         image_width=module.get_expected_image_width(),
         image_height=module.get_expected_image_height(),
@@ -53,19 +55,19 @@
     feature_map = output_dict["feature_map"]
-    img = input_dict["image"]
-    feed_list = [img.name]
+    # Setup feed list for data feeder
+    feed_list = [input_dict["image"].name]
+    # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
         use_data_parallel=False,
-        use_pyreader=args.use_pyreader,
         use_cuda=args.use_gpu,
         batch_size=args.batch_size,
-        enable_memory_optim=False,
         checkpoint_dir=args.checkpoint_dir,
         strategy=hub.finetune.strategy.DefaultFinetuneStrategy())
-    task = hub.ClassifierTask(
+    # Define a reading comprehension finetune task by PaddleHub's API
+    task = hub.ImageClassifierTask(
         data_reader=data_reader,
         feed_list=feed_list,
         feature=feature_map,
...
```
python ../../paddlehub/commands/hub.py run lac --input_file test/test.txt
今天是个好日子
天气预报说今天要下雨
下一班地铁马上就要到了
input_data:
  text:
    type: TEXT
    key: TEXT_INPUT
```diff
@@ -34,35 +34,33 @@
 args = parser.parse_args()
 # yapf: enable.

 if __name__ == '__main__':
-    # Load Paddlehub BERT pretrained model
+    # Load Paddlehub ERNIE 2.0 pretrained model
     module = hub.Module(name="ernie_v2_eng_base")
     inputs, outputs, program = module.context(
         trainable=True, max_seq_len=args.max_seq_len)

-    # Setup feed list for data feeder
-    feed_list = [
-        inputs["input_ids"].name, inputs["position_ids"].name,
-        inputs["segment_ids"].name, inputs["input_mask"].name
-    ]
-
     # Download dataset and use MultiLabelReader to read dataset
     dataset = hub.dataset.Toxic()
     reader = hub.reader.MultiLabelClassifyReader(
         dataset=dataset,
         vocab_path=module.get_vocab_path(),
         max_seq_len=args.max_seq_len)

+    # Setup feed list for data feeder
+    feed_list = [
+        inputs["input_ids"].name, inputs["position_ids"].name,
+        inputs["segment_ids"].name, inputs["input_mask"].name
+    ]
+
     # Construct transfer learning network
     # Use "pooled_output" for classification tasks on an entire sentence.
     pooled_output = outputs["pooled_output"]

     # Select finetune strategy, setup config and finetune
     strategy = hub.AdamWeightDecayStrategy(
+        warmup_proportion=args.warmup_proportion,
         weight_decay=args.weight_decay,
-        learning_rate=args.learning_rate,
-        lr_scheduler="linear_decay")
+        learning_rate=args.learning_rate)

     # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
...
```
```diff
@@ -40,12 +40,18 @@
 args = parser.parse_args()
 # yapf: enable.

 if __name__ == '__main__':
-    # Load Paddlehub BERT pretrained model
-    module = hub.Module(name="ernie_eng_base.hub_module")
+    # Load Paddlehub ERNIE 2.0 pretrained model
+    module = hub.Module(name="ernie_v2_eng_base")
     inputs, outputs, program = module.context(
         trainable=True, max_seq_len=args.max_seq_len)

+    # Download dataset and use MultiLabelReader to read dataset
+    dataset = hub.dataset.Toxic()
+    reader = hub.reader.MultiLabelClassifyReader(
+        dataset=dataset,
+        vocab_path=module.get_vocab_path(),
+        max_seq_len=args.max_seq_len)
+
     # Setup feed list for data feeder
     feed_list = [
         inputs["input_ids"].name,
@@ -54,14 +60,6 @@
         inputs["input_mask"].name,
     ]

-    # Download dataset and use MultiLabelReader to read dataset
-    dataset = hub.dataset.Toxic()
-    reader = hub.reader.MultiLabelClassifyReader(
-        dataset=dataset,
-        vocab_path=module.get_vocab_path(),
-        max_seq_len=args.max_seq_len)
-
     # Construct transfer learning network
     # Use "pooled_output" for classification tasks on an entire sentence.
     # Use "sequence_output" for token-level output.
@@ -70,10 +68,8 @@
     # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
         use_data_parallel=False,
-        use_pyreader=False,
         use_cuda=args.use_gpu,
         batch_size=args.batch_size,
-        enable_memory_optim=False,
         checkpoint_dir=args.checkpoint_dir,
         strategy=hub.finetune.strategy.DefaultFinetuneStrategy())
@@ -85,7 +81,7 @@
         num_classes=dataset.num_labels,
         config=config)

-    # Data to be prdicted
+    # Data to be predicted
     data = [
         [
             "Yes you did. And you admitted to doing it. See the Warren Kinsella talk page."
...
```
```diff
 export FLAGS_eager_delete_tensor_gb=0.0
 export CUDA_VISIBLE_DEVICES=0

-# User can select chnsenticorp, nlpcc_dbqa, lcqmc for different task
-DATASET="toxic"
-CKPT_DIR="./ckpt_${DATASET}"
-# Recommending hyper parameters for difference task
-# ChnSentiCorp: batch_size=24, weight_decay=0.01, num_epoch=3, max_seq_len=128, lr=5e-5
-# NLPCC_DBQA: batch_size=8, weight_decay=0.01, num_epoch=3, max_seq_len=512, lr=2e-5
-# LCQMC: batch_size=32, weight_decay=0, num_epoch=3, max_seq_len=128, lr=2e-5
+CKPT_DIR="./ckpt_toxic"

 python -u multi_label_classifier.py \
     --batch_size=32 \
@@ -16,4 +10,5 @@
     --learning_rate=5e-5 \
     --weight_decay=0.01 \
     --max_seq_len=128 \
+    --warmup_proportion=0.1 \
     --num_epoch=3
```
```diff
@@ -30,7 +30,6 @@
 parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
 parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.")
 parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.")
-parser.add_argument("--use_pyreader", type=ast.literal_eval, default=False, help="Whether use pyreader to feed data.")
 parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=False, help="Whether use data parallel.")
 args = parser.parse_args()
 # yapf: enable.
@@ -38,13 +37,11 @@
 if __name__ == '__main__':
     # Load Paddlehub ERNIE pretrained model
     module = hub.Module(name="ernie")
-    # module = hub.Module(name="bert_chinese_L-12_H-768_A-12")
     inputs, outputs, program = module.context(
         trainable=True, max_seq_len=args.max_seq_len)

     # Download dataset and use ClassifyReader to read dataset
     dataset = hub.dataset.NLPCC_DBQA()
     reader = hub.reader.ClassifyReader(
         dataset=dataset,
         vocab_path=module.get_vocab_path(),
@@ -66,14 +63,13 @@
     # Select finetune strategy, setup config and finetune
     strategy = hub.AdamWeightDecayStrategy(
+        warmup_proportion=args.warmup_proportion,
         weight_decay=args.weight_decay,
-        learning_rate=args.learning_rate,
-        lr_scheduler="linear_decay")
+        learning_rate=args.learning_rate)

     # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
         use_data_parallel=args.use_data_parallel,
-        use_pyreader=args.use_pyreader,
         use_cuda=args.use_gpu,
         num_epoch=args.num_epoch,
         batch_size=args.batch_size,
...
```
```diff
@@ -34,7 +34,6 @@
 parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number in batch for training.")
 parser.add_argument("--max_seq_len", type=int, default=128, help="Number of words of the longest seqence.")
 parser.add_argument("--use_gpu", type=ast.literal_eval, default=False, help="Whether use GPU for finetuning, input should be True or False")
-parser.add_argument("--use_pyreader", type=ast.literal_eval, default=False, help="Whether use pyreader to feed data.")
 args = parser.parse_args()
 # yapf: enable.
@@ -50,9 +49,6 @@
         vocab_path=module.get_vocab_path(),
         max_seq_len=args.max_seq_len)

-    place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
-    exe = fluid.Executor(place)
-
     # Construct transfer learning network
     # Use "pooled_output" for classification tasks on an entire sentence.
     # Use "sequence_output" for token-level output.
@@ -70,10 +66,8 @@
     # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
         use_data_parallel=False,
-        use_pyreader=args.use_pyreader,
         use_cuda=args.use_gpu,
         batch_size=args.batch_size,
-        enable_memory_optim=False,
         checkpoint_dir=args.checkpoint_dir,
         strategy=hub.finetune.strategy.DefaultFinetuneStrategy())
@@ -100,5 +94,5 @@
             max_probs = batch_result[0][0, 1]
             max_flag = index

-    print("question:%s\tthe predict answer:%s\t" % (data[max_flag][0],
-                                                    data[max_flag][1]))
+    print("question:%s\tthe predicted matched answer:%s\t" %
+          (data[max_flag][0], data[max_flag][1]))
```
```diff
@@ -2,10 +2,6 @@
 export CUDA_VISIBLE_DEVICES=0

 CKPT_DIR="./ckpt_qa"
-# Recommending hyper parameters for difference task
-# ChnSentiCorp: batch_size=24, weight_decay=0.01, num_epoch=3, max_seq_len=128, lr=5e-5
-# NLPCC_DBQA: batch_size=8, weight_decay=0.01, num_epoch=3, max_seq_len=512, lr=2e-5
-# LCQMC: batch_size=32, weight_decay=0, num_epoch=3, max_seq_len=128, lr=2e-5

 python -u classifier.py \
     --batch_size=24 \
@@ -13,7 +9,7 @@
     --checkpoint_dir=${CKPT_DIR} \
     --learning_rate=5e-5 \
     --weight_decay=0.01 \
+    --warmup_proportion=0.1 \
     --max_seq_len=128 \
     --num_epoch=3 \
-    --use_pyreader=False \
-    --use_data_parallel=False \
+    --use_data_parallel=True \
```
```diff
@@ -41,44 +41,23 @@
 hub.common.logger.logger.setLevel("INFO")

 parser = argparse.ArgumentParser(__doc__)
 parser.add_argument("--num_epoch", type=int, default=1, help="Number of epoches for fine-tuning.")
 parser.add_argument("--use_gpu", type=ast.literal_eval, default=True, help="Whether use GPU for finetuning, input should be True or False")
-parser.add_argument("--learning_rate", type=float, default=4e-5, help="Learning rate used to train with warmup.")
-parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.")
-parser.add_argument("--warmup_proportion", type=float, default=0.0, help="Warmup proportion params for warmup strategy")
 parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint.")
-parser.add_argument("--result_dir", type=str, default=None, help="Directory to predicted results to be written.")
 parser.add_argument("--max_seq_len", type=int, default=384, help="Number of words of the longest seqence.")
 parser.add_argument("--batch_size", type=int, default=8, help="Total examples' number in batch for training.")
-parser.add_argument("--use_pyreader", type=ast.literal_eval, default=False, help="Whether use pyreader to feed data.")
-parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=True, help="Whether use data parallel.")
-parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.")
-parser.add_argument("--n_best_size", type=int, default=20, help="The total number of n-best predictions to generate in the nbest_predictions.json output file.")
-parser.add_argument("--null_score_diff_threshold", type=float, default=0.0, help="If null_score - best_non_null is greater than the threshold predict null.")
-parser.add_argument("--dataset", type=str, default="squad", help="Support squad, squad2.0, drcd and cmrc2018")
 args = parser.parse_args()
 # yapf: enable.

 if __name__ == '__main__':
-    # Download dataset and use ReadingComprehensionReader to read dataset
-    if args.dataset == "squad":
-        dataset = hub.dataset.SQUAD(version_2_with_negative=False)
-        module = hub.Module(name="bert_uncased_L-12_H-768_A-12")
-    elif args.dataset == "squad2.0" or args.dataset == "squad2":
-        args.dataset = "squad2.0"
-        dataset = hub.dataset.SQUAD(version_2_with_negative=True)
-        module = hub.Module(name="bert_uncased_L-12_H-768_A-12")
-    elif args.dataset == "drcd":
-        dataset = hub.dataset.DRCD()
-        module = hub.Module(name="roberta_wwm_ext_chinese_L-24_H-1024_A-16")
-    elif args.dataset == "cmrc2018":
-        dataset = hub.dataset.CMRC2018()
-        module = hub.Module(name="roberta_wwm_ext_chinese_L-24_H-1024_A-16")
-    else:
-        raise Exception(
-            "Only support datasets: squad, squad2.0, drcd and cmrc2018")
-
+    # Load Paddlehub BERT pretrained model
+    module = hub.Module(name="bert_uncased_L-12_H-768_A-12")
     inputs, outputs, program = module.context(
         trainable=True, max_seq_len=args.max_seq_len)

+    # Download dataset and use ReadingComprehensionReader to read dataset
+    # If you wanna load SQuAD 2.0 dataset, just set version_2_with_negative as True
+    dataset = hub.dataset.SQUAD(version_2_with_negative=False)
+    # dataset = hub.dataset.SQUAD(version_2_with_negative=True)
     reader = hub.reader.ReadingComprehensionReader(
         dataset=dataset,
         vocab_path=module.get_vocab_path(),
@@ -97,25 +76,13 @@
         inputs["input_mask"].name,
     ]

-    # Select finetune strategy, setup config and finetune
-    strategy = hub.AdamWeightDecayStrategy(
-        weight_decay=args.weight_decay,
-        learning_rate=args.learning_rate,
-        warmup_proportion=args.warmup_proportion,
-        lr_scheduler="linear_decay")
-
     # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
-        log_interval=10,
-        use_pyreader=args.use_pyreader,
-        use_data_parallel=args.use_data_parallel,
-        save_ckpt_interval=100,
+        use_data_parallel=False,
         use_cuda=args.use_gpu,
-        num_epoch=args.num_epoch,
         batch_size=args.batch_size,
         checkpoint_dir=args.checkpoint_dir,
-        enable_memory_optim=True,
-        strategy=strategy)
+        strategy=hub.AdamWeightDecayStrategy())

     # Define a reading comprehension finetune task by PaddleHub's API
     reading_comprehension_task = hub.ReadingComprehensionTask(
@@ -125,5 +92,5 @@
         config=config)

     # Data to be predicted
-    data = dataset.dev_examples[97:98]
+    data = dataset.dev_examples[:10]
     reading_comprehension_task.predict(data=data)
```
```diff
@@ -31,38 +31,22 @@
 parser.add_argument("--warmup_proportion", type=float, default=0.0, help="Warmup proportion params for warmup strategy")
 parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
 parser.add_argument("--max_seq_len", type=int, default=384, help="Number of words of the longest seqence.")
-parser.add_argument("--null_score_diff_threshold", type=float, default=0.0, help="If null_score - best_non_null is greater than the threshold predict null.")
-parser.add_argument("--n_best_size", type=int, default=20, help="The total number of n-best predictions to generate in the nbest_predictions.json output file.")
-parser.add_argument("--max_answer_length", type=int, default=30, help="The maximum length of an answer that can be generated. This is needed because the start and end predictions are not conditioned on one another.")
 parser.add_argument("--batch_size", type=int, default=8, help="Total examples' number in batch for training.")
-parser.add_argument("--use_pyreader", type=ast.literal_eval, default=False, help="Whether use pyreader to feed data.")
-parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=False, help="Whether use data parallel.")
-parser.add_argument("--dataset", type=str, default="squad", help="Support squad, squad2.0, drcd and cmrc2018")
+parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=True, help="Whether use data parallel.")
 args = parser.parse_args()
 # yapf: enable.

 if __name__ == '__main__':
-    # Download dataset and use ReadingComprehensionReader to read dataset
-    if args.dataset == "squad":
-        dataset = hub.dataset.SQUAD(version_2_with_negative=False)
-        module = hub.Module(name="bert_uncased_L-12_H-768_A-12")
-    elif args.dataset == "squad2.0" or args.dataset == "squad2":
-        args.dataset = "squad2.0"
-        dataset = hub.dataset.SQUAD(version_2_with_negative=True)
-        module = hub.Module(name="bert_uncased_L-12_H-768_A-12")
-    elif args.dataset == "drcd":
-        dataset = hub.dataset.DRCD()
-        module = hub.Module(name="roberta_wwm_ext_chinese_L-24_H-1024_A-16")
-    elif args.dataset == "cmrc2018":
-        dataset = hub.dataset.CMRC2018()
-        module = hub.Module(name="roberta_wwm_ext_chinese_L-24_H-1024_A-16")
-    else:
-        raise Exception(
-            "Only support datasets: squad, squad2.0, drcd and cmrc2018")
-
+    # Load Paddlehub BERT pretrained model
+    module = hub.Module(name="bert_uncased_L-12_H-768_A-12")
     inputs, outputs, program = module.context(
         trainable=True, max_seq_len=args.max_seq_len)

+    # Download dataset and use ReadingComprehensionReader to read dataset
+    # If you wanna load SQuAD 2.0 dataset, just set version_2_with_negative as True
+    dataset = hub.dataset.SQUAD(version_2_with_negative=False)
+    # dataset = hub.dataset.SQUAD(version_2_with_negative=True)
     reader = hub.reader.ReadingComprehensionReader(
         dataset=dataset,
         vocab_path=module.get_vocab_path(),
@@ -84,19 +68,16 @@
     strategy = hub.AdamWeightDecayStrategy(
         weight_decay=args.weight_decay,
         learning_rate=args.learning_rate,
-        warmup_proportion=args.warmup_proportion,
-        lr_scheduler="linear_decay")
+        warmup_proportion=args.warmup_proportion)

     # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
         eval_interval=300,
-        use_pyreader=args.use_pyreader,
         use_data_parallel=args.use_data_parallel,
         use_cuda=args.use_gpu,
         num_epoch=args.num_epoch,
         batch_size=args.batch_size,
         checkpoint_dir=args.checkpoint_dir,
-        enable_memory_optim=True,
         strategy=strategy)

     # Define a reading comprehension finetune task by PaddleHub's API
@@ -105,7 +86,7 @@
         feature=seq_output,
         feed_list=feed_list,
         config=config,
-        sub_task=args.dataset,
+        sub_task="squad",
     )

     # Finetune by PaddleHub's API
...
```
```diff
 export FLAGS_eager_delete_tensor_gb=0.0
 export CUDA_VISIBLE_DEVICES=0

-# Recommending hyper parameters for difference task
-# squad: batch_size=8, weight_decay=0, num_epoch=3, max_seq_len=512, lr=5e-5
-# squad2.0: batch_size=8, weight_decay=0, num_epoch=3, max_seq_len=512, lr=5e-5
+# The suggested hyper parameters for difference task
+# squad: batch_size=8, weight_decay=0.01, num_epoch=3, max_seq_len=512, lr=3e-5
+# squad2.0: batch_size=8, weight_decay=0.01, num_epoch=3, max_seq_len=512, lr=3e-5
 # cmrc2018: batch_size=8, weight_decay=0, num_epoch=2, max_seq_len=512, lr=2.5e-5
 # drcd: batch_size=8, weight_decay=0, num_epoch=2, max_seq_len=512, lr=2.5e-5

-dataset=cmrc2018
 python -u reading_comprehension.py \
     --batch_size=8 \
     --use_gpu=True \
-    --checkpoint_dir=./ckpt_${dataset} \
-    --learning_rate=2.5e-5 \
+    --checkpoint_dir="./ckpt_squad" \
+    --learning_rate=3e-5 \
     --weight_decay=0.01 \
     --warmup_proportion=0.1 \
     --num_epoch=2 \
     --max_seq_len=512 \
-    --dataset=${dataset}
+    --use_data_parallel=True
```
```diff
 export FLAGS_eager_delete_tensor_gb=0.0
 export CUDA_VISIBLE_DEVICES=0

-CKPT_DIR="./ckpt_cmrc2018"
-dataset=cmrc2018
 python -u predict.py \
-    --batch_size=8 \
+    --batch_size=1 \
     --use_gpu=True \
-    --dataset=${dataset} \
-    --checkpoint_dir=${CKPT_DIR} \
-    --learning_rate=2.5e-5 \
-    --weight_decay=0.01 \
-    --warmup_proportion=0.1 \
-    --num_epoch=1 \
+    --checkpoint_dir="./ckpt_squad" \
     --max_seq_len=512 \
-    --use_pyreader=False \
-    --use_data_parallel=False
```
```diff
@@ -34,29 +34,17 @@
 parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number in batch for training.")
 parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.")
 parser.add_argument("--use_gpu", type=ast.literal_eval, default=False, help="Whether use GPU for finetuning, input should be True or False")
-parser.add_argument("--use_pyreader", type=ast.literal_eval, default=False, help="Whether use pyreader to feed data.")
-parser.add_argument("--dataset", type=str, default="STS-B", help="Directory to model checkpoint")
 args = parser.parse_args()
 # yapf: enable.

 if __name__ == '__main__':
-    dataset = None
-    metrics_choices = []
-    # Download dataset and use ClassifyReader to read dataset
-    if args.dataset.lower() == "sts-b":
-        dataset = hub.dataset.GLUE("STS-B")
-        module = hub.Module(name="bert_uncased_L-12_H-768_A-12")
-        metrics_choices = ["acc"]
-    else:
-        raise ValueError("%s dataset is not defined" % args.dataset)
-
-    support_metrics = ["acc", "f1", "matthews"]
-    for metric in metrics_choices:
-        if metric not in support_metrics:
-            raise ValueError("\"%s\" metric is not defined" % metric)
-
+    # Load Paddlehub ERNIE 2.0 pretrained model
+    module = hub.Module(name="ernie_v2_eng_base")
     inputs, outputs, program = module.context(
         trainable=True, max_seq_len=args.max_seq_len)

+    # Download dataset and use RegressionReader to read dataset
+    dataset = hub.dataset.GLUE("STS-B")
     reader = hub.reader.RegressionReader(
         dataset=dataset,
         vocab_path=module.get_vocab_path(),
@@ -79,35 +67,27 @@
     # Setup runing config for PaddleHub Finetune API
     config = hub.RunConfig(
         use_data_parallel=False,
-        use_pyreader=args.use_pyreader,
         use_cuda=args.use_gpu,
         batch_size=args.batch_size,
-        enable_memory_optim=False,
         checkpoint_dir=args.checkpoint_dir,
-        strategy=hub.finetune.strategy.DefaultFinetuneStrategy())
+        strategy=hub.AdamWeightDecayStrategy())

     # Define a regression finetune task by PaddleHub's API
     reg_task = hub.RegressionTask(
         data_reader=reader,
         feature=pooled_output,
         feed_list=feed_list,
-        config=config)
+        config=config,
+    )

     # Data to be prdicted
-    data = [[d.text_a, d.text_b] for d in dataset.get_predict_examples()]
+    data = [[d.text_a, d.text_b] for d in dataset.get_predict_examples()[:10]]

     index = 0
     run_states = reg_task.predict(data=data)
     results = [run_state.run_results for run_state in run_states]
-    if not os.path.exists("output"):
-        os.makedirs("output")
-    fout = open(os.path.join("output", "%s.tsv" % args.dataset.upper()), 'w')
-    fout.write("index\tprediction")
     for batch_result in results:
         for result in batch_result[0]:
-            if index < 3:
-                print("%s\t%s\tpredict=%.3f" % (data[index][0], data[index][1],
-                                                result[0]))
-            fout.write("\n%s\t%.3f" % (index, result[0]))
+            print("text:%s\t%s\tpredict:%.3f" % (data[index][0], data[index][1],
+                                                 result[0]))
             index += 1
-    fout.close()
```
...@@ -24,30 +24,25 @@ import paddlehub as hub ...@@ -24,30 +24,25 @@ import paddlehub as hub
parser = argparse.ArgumentParser(__doc__) parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--num_epoch", type=int, default=3, help="Number of epoches for fine-tuning.") parser.add_argument("--num_epoch", type=int, default=3, help="Number of epoches for fine-tuning.")
parser.add_argument("--use_gpu", type=ast.literal_eval, default=False, help="Whether use GPU for finetuning, input should be True or False") parser.add_argument("--use_gpu", type=ast.literal_eval, default=False, help="Whether use GPU for finetuning, input should be True or False")
parser.add_argument("--dataset", type=str, default="STS-B", help="Directory to model checkpoint")
parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.") parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.")
parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.")
parser.add_argument("--warmup_proportion", type=float, default=0.0, help="Warmup proportion params for warmup strategy") parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy")
parser.add_argument("--data_dir", type=str, default=None, help="Path to training data.")
parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint") parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.") parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.")
parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.") parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.")
parser.add_argument("--use_pyreader", type=ast.literal_eval, default=False, help="Whether use pyreader to feed data.")
parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=False, help="Whether use data parallel.") parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=False, help="Whether use data parallel.")
args = parser.parse_args() args = parser.parse_args()
# yapf: enable. # yapf: enable.
if __name__ == '__main__': if __name__ == '__main__':
dataset = None
# Download dataset and use ClassifyReader to read dataset
if args.dataset.lower() == "sts-b":
dataset = hub.dataset.GLUE("STS-B")
module = hub.Module(name="ernie_v2_eng_base")
else:
raise ValueError("%s dataset is not defined" % args.dataset)
# Load Paddlehub ERNIE 2.0 pretrained model
module = hub.Module(name="ernie_v2_eng_base")
inputs, outputs, program = module.context( inputs, outputs, program = module.context(
trainable=True, max_seq_len=args.max_seq_len) trainable=True, max_seq_len=args.max_seq_len)
# Download dataset and use RegressionReader to read dataset
dataset = hub.dataset.GLUE("STS-B")
reader = hub.reader.RegressionReader( reader = hub.reader.RegressionReader(
dataset=dataset, dataset=dataset,
vocab_path=module.get_vocab_path(), vocab_path=module.get_vocab_path(),
...@@ -69,14 +64,14 @@ if __name__ == '__main__': ...@@ -69,14 +64,14 @@ if __name__ == '__main__':
    # Select finetune strategy, setup config and finetune
    strategy = hub.AdamWeightDecayStrategy(
        warmup_proportion=args.warmup_proportion,
        weight_decay=args.weight_decay,
        learning_rate=args.learning_rate)

    # Setup running config for PaddleHub Finetune API
    config = hub.RunConfig(
        use_data_parallel=args.use_data_parallel,
        use_cuda=args.use_gpu,
        num_epoch=args.num_epoch,
        batch_size=args.batch_size,
......
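The hunk above stops before the task itself is constructed. As a minimal sketch of how the elided remainder typically reads in PaddleHub v1.x regression demos, assuming `hub.RegressionTask` and feed tensor names inferred from the other scripts in this commit:

```python
# Sketch only: wire the reader, strategy, and config into a regression task.
# hub.RegressionTask and the exact feed list are assumptions, not verbatim code.
pooled_output = outputs["pooled_output"]
feed_list = [
    inputs["input_ids"].name, inputs["position_ids"].name,
    inputs["segment_ids"].name, inputs["input_mask"].name
]
reg_task = hub.RegressionTask(
    data_reader=reader,
    feature=pooled_output,
    feed_list=feed_list,
    config=config)
reg_task.finetune_and_eval()
```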
export FLAGS_eager_delete_tensor_gb=0.0
export CUDA_VISIBLE_DEVICES=0

CKPT_DIR="./ckpt_stsb"
# STS-B: batch_size=32, max_seq_len=128
python -u predict.py --checkpoint_dir ${CKPT_DIR} \
    --max_seq_len 128 \
    --use_gpu True \
    --batch_size=1
export FLAGS_eager_delete_tensor_gb=0.0
export CUDA_VISIBLE_DEVICES=0

CKPT_DIR="./ckpt_stsb"
# Recommended hyperparameters for this task
# STS-B: batch_size=32, weight_decay=0.1, num_epoch=3, max_seq_len=128, lr=4e-5
python -u regression.py \
    --batch_size=32 \
    --use_gpu=True \
    --checkpoint_dir=${CKPT_DIR} \
    --learning_rate=4e-5 \
    --warmup_proportion=0.1 \
    --weight_decay=0.1 \
    --max_seq_len=128 \
    --num_epoch=3 \
    --use_data_parallel=False
python ../../paddlehub/commands/hub.py run senta_bilstm --input_file test/test.txt
@@ -17,6 +17,7 @@ import paddlehub as hub
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
parser.add_argument("--use_gpu", type=ast.literal_eval, default=True, help="Whether to use GPU for finetuning, input should be True or False")
parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number in batch when the program predicts.")
args = parser.parse_args()
# yapf: enable.
@@ -25,31 +26,26 @@ if __name__ == '__main__':
module = hub.Module(name="senta_bilstm") module = hub.Module(name="senta_bilstm")
inputs, outputs, program = module.context(trainable=True) inputs, outputs, program = module.context(trainable=True)
# Sentence classification dataset reader # Download dataset and use LACClassifyReader to read dataset
dataset = hub.dataset.ChnSentiCorp() dataset = hub.dataset.ChnSentiCorp()
reader = hub.reader.LACClassifyReader( reader = hub.reader.LACClassifyReader(
dataset=dataset, vocab_path=module.get_vocab_path()) dataset=dataset, vocab_path=module.get_vocab_path())
strategy = hub.AdamWeightDecayStrategy( sent_feature = outputs["sentence_feature"]
weight_decay=0.01,
warmup_proportion=0.1, # Setup feed list for data feeder
learning_rate=5e-5, # Must feed all the tensor of senta's module need
lr_scheduler="linear_decay", feed_list = [inputs["words"].name]
optimizer_name="adam")
# Setup runing config for PaddleHub Finetune API
config = hub.RunConfig( config = hub.RunConfig(
use_data_parallel=False, use_data_parallel=False,
use_pyreader=False,
use_cuda=args.use_gpu, use_cuda=args.use_gpu,
batch_size=1, batch_size=args.batch_size,
enable_memory_optim=False,
checkpoint_dir=args.checkpoint_dir, checkpoint_dir=args.checkpoint_dir,
strategy=strategy) strategy=hub.AdamWeightDecayStrategy())
sent_feature = outputs["sentence_feature"]
feed_list = [inputs["words"].name]
# Define a classfication finetune task by PaddleHub's API
cls_task = hub.TextClassifierTask( cls_task = hub.TextClassifierTask(
data_reader=reader, data_reader=reader,
feature=sent_feature, feature=sent_feature,
@@ -57,9 +53,12 @@ if __name__ == '__main__':
        num_classes=dataset.num_labels,
        config=config)

    # Data to be predicted
    data = ["这家餐厅很好吃", "这部电影真的很差劲"]

    # Predict by PaddleHub's API
    run_states = cls_task.predict(data=data)
    results = [run_state.run_results for run_state in run_states]
    index = 0
    for batch_result in results:
......
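The loop body is truncated above. A plausible completion, following the pattern used elsewhere in these demos; the `run_results` layout and the `numpy` post-processing are assumptions:

```python
import numpy as np  # assumed to be imported at the top of the script

# Hypothetical loop body: take the argmax over the class probabilities
for batch_result in results:
    batch_result = np.argmax(batch_result, axis=2)[0]  # assumed layout
    for result in batch_result:
        print("%s\tpredict=%s" % (data[index], result))
        index += 1
```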
export FLAGS_eager_delete_tensor_gb=0.0
export CUDA_VISIBLE_DEVICES=0

CKPT_DIR="./ckpt_chnsenticorp"
python -u senta_finetune.py \
    --batch_size=24 \
    --use_gpu=True \
    --checkpoint_dir=${CKPT_DIR} \
    --num_epoch=3
export FLAGS_eager_delete_tensor_gb=0.0
export CUDA_VISIBLE_DEVICES=0

CKPT_DIR="./ckpt_chnsenticorp"
python -u predict.py --checkpoint_dir $CKPT_DIR --use_gpu True
@@ -11,10 +11,11 @@ if __name__ == "__main__":
    # Load Senta-BiLSTM module
    senta = hub.Module(name="senta_bilstm")

    # Data to be predicted
    test_text = ["这家餐厅很好吃", "这部电影真的很差劲"]

    # Execute prediction and print the result
    input_dict = {"text": test_text}
    results = senta.sentiment_classify(data=input_dict)

    for index, text in enumerate(test_text):
......
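The print loop is elided here. A hypothetical body, assuming the result fields (`sentiment_key`, `positive_probs`) that the senta_bilstm module returns:

```python
# Hypothetical loop body; the result keys are assumptions
for index, text in enumerate(test_text):
    sentiment = results[index]
    print("text: %s" % text)
    print("label: %s, positive prob: %s" %
          (sentiment.get("sentiment_key"), sentiment.get("positive_probs")))
```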
@@ -15,13 +15,12 @@ args = parser.parse_args()
# yapf: enable.

if __name__ == '__main__':
    # Load PaddleHub senta pretrained model
    module = hub.Module(name="senta_bilstm")
    inputs, outputs, program = module.context(trainable=True)

    # Download dataset and use LACClassifyReader to read dataset
    dataset = hub.dataset.ChnSentiCorp()
    reader = hub.reader.LACClassifyReader(
        dataset=dataset, vocab_path=module.get_vocab_path())
@@ -31,16 +30,15 @@ if __name__ == '__main__':
    # Must feed all the tensors that senta's module needs
    feed_list = [inputs["words"].name]

    # Setup running config for PaddleHub Finetune API
    config = hub.RunConfig(
        use_cuda=args.use_gpu,
        use_pyreader=False,
        use_data_parallel=False,
        num_epoch=args.num_epoch,
        batch_size=args.batch_size,
        checkpoint_dir=args.checkpoint_dir,
        strategy=hub.AdamWeightDecayStrategy())

    # Define a classification finetune task by PaddleHub's API
    cls_task = hub.TextClassifierTask(
......
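The task construction is cut off at the end of this hunk. Mirroring the predict script shown earlier, the remaining lines would plausibly read as below; `finetune_and_eval` is the standard PaddleHub v1.x entry point, and the exact tail of this file is an assumption:

```python
# Hypothetical completion, mirroring predict.py above
cls_task = hub.TextClassifierTask(
    data_reader=reader,
    feature=sent_feature,
    feed_list=feed_list,
    num_classes=dataset.num_labels,
    config=config)
cls_task.finetune_and_eval()
```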
@@ -35,7 +35,6 @@
parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest sequence.")
parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number in batch for training.")
parser.add_argument("--use_gpu", type=ast.literal_eval, default=False, help="Whether to use GPU for finetuning, input should be True or False")
args = parser.parse_args()
# yapf: enable.
@@ -52,10 +51,8 @@ if __name__ == '__main__':
        max_seq_len=args.max_seq_len,
        sp_model_path=module.get_spm_path(),
        word_dict_path=module.get_word_dict_path())

    inv_label_map = {val: key for key, val in reader.label_map.items()}

    # Construct transfer learning network
    # Use "sequence_output" for token-level output.
@@ -73,10 +70,8 @@ if __name__ == '__main__':
    # Setup running config for PaddleHub Finetune API
    config = hub.RunConfig(
        use_data_parallel=False,
        use_cuda=args.use_gpu,
        batch_size=args.batch_size,
        checkpoint_dir=args.checkpoint_dir,
        strategy=hub.finetune.strategy.DefaultFinetuneStrategy())
@@ -91,7 +86,7 @@ if __name__ == '__main__':
        config=config,
        add_crf=True)

    # Data to be predicted
    data = [
        ["我们变而以书会友,以书结缘,把欧美、港台流行的食品类图谱、画册、工具书汇集一堂。"],
        ["为了跟踪国际最新食品工艺、流行趋势,大量搜集海外专业书刊资料是提高技艺的捷径。"],
......
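The prediction and decoding steps are truncated here. A sketch of how the `inv_label_map` defined earlier is typically used to turn predicted ids back into tags; the `run_results` layout is an assumption:

```python
# Hypothetical continuation: predict and decode label ids into tags
run_states = seq_label_task.predict(data=data)
for run_state in run_states:
    # run_state.run_results[0] is assumed to hold per-token label ids
    label_ids = run_state.run_results[0].reshape([-1]).astype("int32").tolist()
    print(" ".join(inv_label_map[label_id] for label_id in label_ids))
```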
@@ -9,5 +9,5 @@ python -u sequence_label.py \
    --checkpoint_dir $CKPT_DIR \
    --max_seq_len 128 \
    --learning_rate 5e-5 \
    --warmup_proportion 0.1 \
    --use_data_parallel True
@@ -26,17 +26,16 @@
parser.add_argument("--use_gpu", type=ast.literal_eval, default=True, help="Whether to use GPU for finetuning, input should be True or False")
parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.")
parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.")
parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy")
parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest sequence.")
parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.")
parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=False, help="Whether to use data parallel.")
args = parser.parse_args()
# yapf: enable.

if __name__ == '__main__':
    # Load PaddleHub ERNIE Tiny pretrained model
    module = hub.Module(name="ernie_tiny")
    inputs, outputs, program = module.context(
        trainable=True, max_seq_len=args.max_seq_len)
@@ -55,8 +54,7 @@ if __name__ == '__main__':
    sequence_output = outputs["sequence_output"]

    # Setup feed list for data feeder
    # Must feed all the tensors that the module needs
    feed_list = [
        inputs["input_ids"].name, inputs["position_ids"].name,
        inputs["segment_ids"].name, inputs["input_mask"].name
@@ -64,15 +62,13 @@ if __name__ == '__main__':
    # Select a finetune strategy
    strategy = hub.AdamWeightDecayStrategy(
        warmup_proportion=args.warmup_proportion,
        weight_decay=args.weight_decay,
        learning_rate=args.learning_rate)

    # Setup running config for PaddleHub Finetune API
    config = hub.RunConfig(
        use_data_parallel=args.use_data_parallel,
        use_cuda=args.use_gpu,
        num_epoch=args.num_epoch,
        batch_size=args.batch_size,
@@ -80,7 +76,7 @@ if __name__ == '__main__':
        strategy=strategy)

    # Define a sequence labeling finetune task by PaddleHub's API
    # If add_crf is True, the network uses CRF as the decoder
    seq_label_task = hub.SequenceLabelTask(
        data_reader=reader,
        feature=sequence_output,
......
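The call is truncated after `feature=sequence_output`. Judging from the predict script above, which passes `config=config` and `add_crf=True`, the elided arguments plausibly are:

```python
# Sketch of the elided tail; max_seq_len and num_classes are assumptions
seq_label_task = hub.SequenceLabelTask(
    data_reader=reader,
    feature=sequence_output,
    feed_list=feed_list,
    max_seq_len=args.max_seq_len,
    num_classes=dataset.num_labels,
    config=config,
    add_crf=True)
seq_label_task.finetune_and_eval()
```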
python ../../paddlehub/commands/hub.py run ssd_mobilenet_v1_pascal --input_file test/test.txt
@@ -33,99 +33,27 @@
parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number in batch for training.")
parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest sequence.")
parser.add_argument("--use_gpu", type=ast.literal_eval, default=False, help="Whether to use GPU for finetuning, input should be True or False")
parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=False, help="Whether to use data parallel.")
args = parser.parse_args()
# yapf: enable.

if __name__ == '__main__':
    # Load PaddleHub ERNIE Tiny pretrained model
    module = hub.Module(name="ernie_tiny")
    inputs, outputs, program = module.context(
        trainable=True, max_seq_len=args.max_seq_len)

    # Download dataset and use accuracy as metrics
    # Choose dataset: GLUE/XNLI/ChineseGLUE/NLPCC-DBQA/LCQMC
    dataset = hub.dataset.ChnSentiCorp()

    # For ernie_tiny, it uses sub-words to tokenize Chinese sentences
    # If the module is not ernie_tiny, sp_model_path and word_dict_path should be set to None
    reader = hub.reader.ClassifyReader(
        dataset=dataset,
        vocab_path=module.get_vocab_path(),
        max_seq_len=args.max_seq_len,
        sp_model_path=module.get_spm_path(),
        word_dict_path=module.get_word_dict_path())

    # Construct transfer learning network
    # Use "pooled_output" for classification tasks on an entire sentence.
@@ -133,7 +61,7 @@ if __name__ == '__main__':
    pooled_output = outputs["pooled_output"]

    # Setup feed list for data feeder
    # Must feed all the tensors that the module needs
    feed_list = [
        inputs["input_ids"].name,
        inputs["position_ids"].name,
@@ -143,13 +71,11 @@ if __name__ == '__main__':
    # Setup running config for PaddleHub Finetune API
    config = hub.RunConfig(
        use_data_parallel=args.use_data_parallel,
        use_cuda=args.use_gpu,
        batch_size=args.batch_size,
        checkpoint_dir=args.checkpoint_dir,
        strategy=hub.AdamWeightDecayStrategy())

    # Define a classification finetune task by PaddleHub's API
    cls_task = hub.TextClassifierTask(
@@ -157,11 +83,11 @@ if __name__ == '__main__':
        feature=pooled_output,
        feed_list=feed_list,
        num_classes=dataset.num_labels,
        config=config)

    # Data to be predicted
    data = [["这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般"], ["交通方便;环境很好;服务态度很好 房间较小"],
            ["19天硬盘就罢工了~~~算上运来的一周都没用上15天~~~可就是不能换了~~~唉~~~~你说这算什么事呀~~~"]]

    index = 0
    run_states = cls_task.predict(data=data)
......
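The result-printing loop is elided after `cls_task.predict(data=data)`. A plausible continuation in the style of the other predict scripts; the `numpy` reshaping reflects an assumed result layout:

```python
# Hypothetical continuation: map batched probabilities to predicted labels
results = [run_state.run_results for run_state in run_states]
for batch_result in results:
    batch_result = np.argmax(batch_result, axis=2)[0]  # assumed layout
    for result in batch_result:
        print("%s\tpredict=%s" % (data[index][0], result))
        index += 1
```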
export FLAGS_eager_delete_tensor_gb=0.0
export CUDA_VISIBLE_DEVICES=0

CKPT_DIR="./ckpt_chnsenticorp"
python -u text_classifier.py \
    --batch_size=24 \
    --use_gpu=True \
    --checkpoint_dir=${CKPT_DIR} \
    --learning_rate=5e-5 \
    --weight_decay=0.01 \
    --max_seq_len=128 \
    --warmup_proportion=0.1 \
    --num_epoch=3 \
    --use_data_parallel=True

# Suggested hyperparameters for different tasks
# for ChineseGLUE:
# TNews: batch_size=32, weight_decay=0, num_epoch=3, max_seq_len=128, lr=5e-5
# LCQMC: batch_size=32, weight_decay=0, num_epoch=3, max_seq_len=128, lr=5e-5
......
export FLAGS_eager_delete_tensor_gb=0.0
export CUDA_VISIBLE_DEVICES=0

CKPT_DIR="./ckpt_chnsenticorp"
python -u predict.py --checkpoint_dir=$CKPT_DIR \
    --max_seq_len=128 \
    --use_gpu=True \
    --batch_size=24
@@ -23,106 +23,38 @@ import paddlehub as hub
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--num_epoch", type=int, default=3, help="Number of epochs for fine-tuning.")
parser.add_argument("--use_gpu", type=ast.literal_eval, default=True, help="Whether to use GPU for finetuning, input should be True or False")
parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.")
parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.")
parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy")
parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest sequence.")
parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.")
parser.add_argument("--use_data_parallel", type=ast.literal_eval, default=False, help="Whether to use data parallel.")
args = parser.parse_args()
# yapf: enable.

if __name__ == '__main__':
    # Load PaddleHub ERNIE Tiny pretrained model
    module = hub.Module(name="ernie_tiny")
    inputs, outputs, program = module.context(
        trainable=True, max_seq_len=args.max_seq_len)

    # Download dataset and use accuracy as metrics
    # Choose dataset: GLUE/XNLI/ChineseGLUE/NLPCC-DBQA/LCQMC
    # metric should be acc, f1 or matthews
    dataset = hub.dataset.ChnSentiCorp()
    metrics_choices = ["acc"]

    # For ernie_tiny, it uses sub-words to tokenize Chinese sentences
    # If the module is not ernie_tiny, sp_model_path and word_dict_path should be set to None
    reader = hub.reader.ClassifyReader(
        dataset=dataset,
        vocab_path=module.get_vocab_path(),
        max_seq_len=args.max_seq_len,
        sp_model_path=module.get_spm_path(),
        word_dict_path=module.get_word_dict_path())

    # Construct transfer learning network
    # Use "pooled_output" for classification tasks on an entire sentence.
    # Use "sequence_output" for token-level output.
@@ -136,26 +68,16 @@ if __name__ == '__main__':
        inputs["segment_ids"].name,
        inputs["input_mask"].name,
    ]

    # Select finetune strategy, setup config and finetune
    strategy = hub.AdamWeightDecayStrategy(
        warmup_proportion=args.warmup_proportion,
        weight_decay=args.weight_decay,
        learning_rate=args.learning_rate)

    # Setup running config for PaddleHub Finetune API
    config = hub.RunConfig(
        use_data_parallel=args.use_data_parallel,
        use_cuda=args.use_gpu,
        num_epoch=args.num_epoch,
        batch_size=args.batch_size,
......
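The file is truncated inside the `RunConfig` call. Completing it in the same pattern as the other demos, the tail below, including `metrics_choices` and `finetune_and_eval`, is a sketch rather than the verbatim file:

```python
# Hypothetical tail: close the RunConfig call, build the task, and train
config = hub.RunConfig(
    use_data_parallel=args.use_data_parallel,
    use_cuda=args.use_gpu,
    num_epoch=args.num_epoch,
    batch_size=args.batch_size,
    checkpoint_dir=args.checkpoint_dir,
    strategy=strategy)

cls_task = hub.TextClassifierTask(
    data_reader=reader,
    feature=pooled_output,
    feed_list=feed_list,
    num_classes=dataset.num_labels,
    config=config,
    metrics_choices=metrics_choices)
cls_task.finetune_and_eval()
```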