Commit d5fbe650 authored by 0YuanZhang0, committed by pkpk

update Dgu and ade module (#2971)

* dgu_and_ade

* fix_comment

Parent: dc9116d0
# Auto Dialogue Evaluation Module (ADE)

* [1. Model Introduction](#1-model-introduction)
* [2. Quick Start](#2-quick-start)
* [3. Advanced Usage](#3-advanced-usage)
* [4. References](#4-references)
* [5. Release Notes](#5-release-notes)

## 1. Model Introduction

&ensp;&ensp;&ensp;&ensp;Auto Dialogue Evaluation estimates the response quality of open-domain dialogue systems. It helps businesses and individuals assess a dialogue system's responses quickly and reduces the cost of human evaluation.

1. Without labeled data, a matching model trained via negative sampling serves as the evaluation tool and can rank the response quality of multiple dialogue systems;
2. With a small amount of labeled data (human scores for a specific dialogue system or scenario), fine-tuning the matching model significantly improves evaluation quality for that system or scenario.

## 2. Quick Start
### Installation

#### &ensp;&ensp;a. Environment dependencies

- Python >= 2.7
- cuda >= 9.0
- cudnn >= 7.0
- pandas >= 0.20.1
- PaddlePaddle >= 1.3.1; see the [installation guide](http://www.paddlepaddle.org/#quick-start). This module fine-tunes on top of a pretrained model and training is relatively slow, so installing the GPU build of PaddlePaddle is recommended.

&ensp;&ensp; Note: on Windows GPU environments, replace [fluid.ParallelExecutor](http://paddlepaddle.org/documentation/docs/zh/1.4/api_cn/fluid_cn.html#parallelexecutor) in the sample code with [fluid.Executor](http://paddlepaddle.org/documentation/docs/zh/1.4/api_cn/fluid_cn.html#executor)

#### &ensp;&ensp;b. Install the code

&ensp;&ensp;&ensp;&ensp;Clone the repository:

```
git clone https://github.com/PaddlePaddle/models.git
cd models/PaddleNLP/dialogue_model_toolkit/auto_dialogue_evaluation
```
### Task overview

&ensp;&ensp;&ensp;&ensp; Training in this module has two stages:

&ensp;&ensp;&ensp;&ensp; 1) Stage one: train a matching model that serves as the evaluation tool and can rank the responses of the dialogue systems under evaluation (the matching task);

&ensp;&ensp;&ensp;&ensp; 2) Stage two: fine-tune the stage-one matching model with a small amount of labeled data from the dialogue systems, which improves evaluation quality (four fine-tuning tasks: human, keywords, seq2seq_att, seq2seq_naive);

&ensp;&ensp;&ensp;&ensp;The dialogue systems used for stage-two fine-tuning are the following four:

```
human: a human-simulated dialogue system;
keywords: a seq2seq keywords dialogue system;
seq2seq_att: a seq2seq attention model dialogue system;
seq2seq_naive: a naive seq2seq model dialogue system;
```

### Data preparation

&ensp;&ensp;&ensp;&ensp;Dataset notes: this module provides the training procedure only. Of the matching data and the data of the four dialogue systems involved, only the test sets are open-sourced; the rest is released as samples. Users who need automated evaluation of their own dialogue system can prepare a business dataset and train on it following the workflow in this document;

```
unlabel_data (the stage-one matching training set)
label_data (the stage-two fine-tuning sets)
1. human: labeled data produced by the human dialogue system;
2. keywords: labeled data produced by the keywords dialogue system;
3. seq2seq_att: labeled dialogue data produced by the seq2seq attention model;
4. seq2seq_naive: labeled dialogue data produced by the naive seq2seq model;
```

&ensp;&ensp;&ensp;&ensp; We also open-source a model pretrained on large-scale unlabeled data and models fine-tuned on a small amount of labeled data, ready for direct use. Download the datasets and models:

```
cd ade && sh prepare_data_and_model.sh
```

&ensp;&ensp;&ensp;&ensp;This downloads the preprocessed data; afterwards the data directory contains unlabel_data (train.ids/val.ids/test.ids), label_data for the four tasks human, keywords, seq2seq_att and seq2seq_naive (train.ids/val.ids/test.ids each), and word2ids.
### Single-machine training

#### 1. Stage one: training the matching model

#### &ensp;&ensp;&ensp;&ensp; Option 1 (recommended): train with the module's script

```
sh run.sh matching train
```

&ensp;&ensp;&ensp;&ensp; For CPU training with option 1:

```
Set the following parameter in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```

&ensp;&ensp;&ensp;&ensp; For GPU training with option 1:

```
Set the following parameters in run.sh:
1. For single-GPU training (pick an idle card):
export CUDA_VISIBLE_DEVICES=0
2. For multi-GPU training (pick idle cards):
export CUDA_VISIBLE_DEVICES=0,1,2,3
```

#### &ensp;&ensp;&ensp;&ensp; Option 2: run the training code directly:

```
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1  #enable memory optimization
export CUDA_VISIBLE_DEVICES=0  #single-GPU training
#export CUDA_VISIBLE_DEVICES=0,1,2,3  #multi-GPU training
#export CUDA_VISIBLE_DEVICES=  #CPU training
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
use_cuda=false
else
use_cuda=true
fi
pretrain_model_path="data/saved_models/matching_pretrained"
if [ ! -d ${pretrain_model_path} ]
then
mkdir ${pretrain_model_path}
fi
python -u main.py \
--do_train=true \
--use_cuda=${use_cuda} \
--loss_type="CLS" \
--max_seq_len=50 \
--save_model_path="data/saved_models/matching_pretrained" \
--save_param="params" \
--training_file="data/input/data/unlabel_data/train.ids" \
--epoch=20 \
--print_step=1 \
--save_step=400 \
--batch_size=256 \
--hidden_size=256 \
--emb_size=256 \
--vocab_size=484016 \
--learning_rate=0.001 \
--sample_pro 0.1
```
#### 2. Stage two: training the fine-tuned models

#### &ensp;&ensp;&ensp;&ensp; Option 1 (recommended): train with the module's script

```
sh run.sh task_name task_type
Parameters:
task_name: one of seq2seq_naive, seq2seq_att, keywords, human;
task_type: one of train, predict, evaluate, inference (train: run training only; predict: run prediction only; evaluate: run evaluation only, which depends on the prediction output; inference: save the inference model);
Training example: sh run.sh human train
```

&ensp;&ensp;&ensp;&ensp; CPU and GPU usage is the same as in step 1 of single-machine training;

#### &ensp;&ensp;&ensp;&ensp; Option 2: run the training code directly:

```
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1  #enable memory optimization
export CUDA_VISIBLE_DEVICES=0  #single-GPU training
#export CUDA_VISIBLE_DEVICES=0,1,2,3  #multi-GPU training
#export CUDA_VISIBLE_DEVICES=  #CPU training
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
use_cuda=false
else
use_cuda=true
fi
save_model_path="data/saved_models/human_finetuned"
if [ ! -d ${save_model_path} ]
then
mkdir ${save_model_path}
fi
python -u main.py \
--do_train=true \
--use_cuda=${use_cuda} \
--loss_type="L2" \
--max_seq_len=50 \
--init_from_pretrain_model="data/saved_models/trained_models/matching_pretrained/params" \
--save_model_path="data/saved_models/human_finetuned" \
--save_param="params" \
--training_file="data/input/data/label_data/human/train.ids" \
--epoch=50 \
--print_step=1 \
--save_step=400 \
--batch_size=256 \
--hidden_size=256 \
--emb_size=256 \
--vocab_size=484016 \
--learning_rate=0.001 \
--sample_pro 0.1
```
### Model prediction

#### 1. Stage one: prediction with the matching model

#### &ensp;&ensp;&ensp;&ensp; Option 1 (recommended): predict with the module's script

```
sh run.sh matching predict
```

&ensp;&ensp;&ensp;&ensp; For CPU prediction with option 1:

```
Set the following parameter in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```

&ensp;&ensp;&ensp;&ensp; For GPU prediction with option 1:

```
Set the following parameter in run.sh:
Single-GPU prediction:
export CUDA_VISIBLE_DEVICES=0  #pick any idle card
```

Note: with option 1, set the init_from_params parameter in run.sh to the model you want to predict with; by default the code predicts with the trained model shipped with this module;

#### &ensp;&ensp;&ensp;&ensp; Option 2: run the prediction code directly:

```
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1  #enable memory optimization
export CUDA_VISIBLE_DEVICES=0  #single-GPU prediction
#export CUDA_VISIBLE_DEVICES=  #CPU prediction
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
use_cuda=false
else
use_cuda=true
fi
python -u main.py \
--do_predict=true \
--use_cuda=${use_cuda} \
--predict_file="data/input/data/unlabel_data/test.ids" \
--init_from_params="data/saved_models/trained_models/matching_pretrained/params" \
--loss_type="CLS" \
--output_prediction_file="data/output/pretrain_matching_predict" \
--max_seq_len=50 \
--batch_size=256 \
--hidden_size=256 \
--emb_size=256 \
--vocab_size=484016
```

Note: with option 2, see run.sh for the parameter settings of each specific task;

#### 2. Stage two: prediction with the fine-tuned models

#### &ensp;&ensp;&ensp;&ensp; Option 1 (recommended): predict with the module's script

```
sh run.sh task_name task_type
Parameters:
task_name: one of seq2seq_naive, seq2seq_att, keywords, human;
task_type: one of train, predict, evaluate, inference (train: run training only; predict: run prediction only; evaluate: run evaluation only, which depends on the prediction output; inference: save the inference model);
Prediction example: sh run.sh human predict
```

&ensp;&ensp;&ensp;&ensp; Specify CPU or GPU as shown in model prediction step 1;

#### &ensp;&ensp;&ensp;&ensp; Option 2: run the prediction code directly:

```
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1  #enable memory optimization
export CUDA_VISIBLE_DEVICES=0  #single-GPU prediction
#export CUDA_VISIBLE_DEVICES=  #CPU prediction
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
use_cuda=false
else
use_cuda=true
fi
python -u main.py \
--do_predict=true \
--use_cuda=${use_cuda} \
--predict_file="data/input/data/label_data/human/test.ids" \
--init_from_params="data/saved_models/trained_models/human_finetuned/params" \
--loss_type="L2" \
--output_prediction_file="data/output/finetuning_human_predict" \
--max_seq_len=50 \
--batch_size=256 \
--hidden_size=256 \
--emb_size=256 \
--vocab_size=484016
```
### Model evaluation

&ensp;&ensp;&ensp;&ensp; The module contains 5 tasks; the evaluation metrics supported by each are listed below (a small metric sketch follows the list):

```
Stage one:
matching: evaluated with R1@2, R1@10, R2@10 and R5@10, which measure the ranking quality of the matching model;
Stage two:
human: spearman correlation between the evaluation model's scores and the actual human scores of the dialogue system;
keywords: spearman correlation between the evaluation model's scores and the actual human scores of the dialogue system;
seq2seq_att: spearman correlation between the evaluation model's scores and the actual human scores of the dialogue system;
seq2seq_naive: spearman correlation between the evaluation model's scores and the actual human scores of the dialogue system;
```
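For intuition, here is a minimal, self-contained sketch of the two kinds of metrics: Spearman correlation between automatic and human scores, and Rk@N recall over fixed-size groups of ranked response candidates. This is illustrative only, not the module's own `evaluation.py`/`ade/evaluate.py`; it assumes `scipy` is available and that candidates arrive in fixed-size groups with exactly one positive per group.

```
# Illustrative metric sketch; the module's implementation may group
# candidates differently.
import numpy as np
from scipy.stats import spearmanr

def spearman_cor(auto_scores, human_scores):
    """Spearman rank correlation between automatic and human scores."""
    cor, _pvalue = spearmanr(auto_scores, human_scores)
    return cor

def recall_at_k(scores, labels, group_size=10, k=1):
    """Rk@N: fraction of groups whose positive candidate is ranked
    within the top k by score (one positive per group assumed)."""
    hits, groups = 0, 0
    for start in range(0, len(scores), group_size):
        group = list(zip(scores[start:start + group_size],
                         labels[start:start + group_size]))
        group.sort(key=lambda x: x[0], reverse=True)
        if any(label == 1 for _score, label in group[:k]):
            hits += 1
        groups += 1
    return hits / float(groups)

if __name__ == "__main__":
    print(spearman_cor([0.9, 0.1, 0.5], [2.0, 0.0, 1.0]))  # -> 1.0
    print(recall_at_k([0.9, 0.1, 0.3, 0.8], [1, 0, 0, 1], group_size=2, k=1))
```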
&ensp;&ensp;&ensp;&ensp; For effectiveness, we take four different dialogue systems (seq2seq\_naive/seq2seq\_att/keywords/human) as examples and score them with the auto dialogue evaluation tool.

&ensp;&ensp;&ensp;&ensp; 1. Without labeled data, evaluating directly with the pretrained tool, the spearman correlation between automatic and human scores on the four systems is:

/|seq2seq\_naive|seq2seq\_att|keywords|human
--|:--:|--:|:--:|--:
cor|0.361|0.343|0.324|0.288

&ensp;&ensp;&ensp;&ensp; Ranking the four systems by average score (k=keywords, n=seq2seq\_naive, a=seq2seq\_att, h=human):

Human evaluation|k(0.591)<n(0.847)<a(1.116)<h(1.240)
--|--:
Automatic evaluation|k(0.625)<n(0.909)<a(1.399)<h(1.683)

&ensp;&ensp;&ensp;&ensp; 2. After fine-tuning with a small amount of labeled data, the spearman correlation between automatic and human scores is:

/|seq2seq\_naive|seq2seq\_att|keywords|human
--|:--:|--:|:--:|--:
cor|0.474|0.477|0.443|0.378
#### 1. Stage one: evaluating the matching model

#### &ensp;&ensp;&ensp;&ensp; Option 1 (recommended): evaluate with the module's script

```
sh run.sh matching evaluate
```

Note: evaluation computes the score between ground_truth and predict_label; running on CPU is sufficient by default;

#### &ensp;&ensp;&ensp;&ensp; Option 2: run the evaluation code directly:

```
export CUDA_VISIBLE_DEVICES=  #CPU evaluation by default
python -u main.py \
--do_eval=true \
--use_cuda=false \
--evaluation_file="data/input/data/unlabel_data/test.ids" \
--output_prediction_file="data/output/pretrain_matching_predict" \
--loss_type="CLS"
```

#### 2. Stage two: evaluating the fine-tuned models

#### &ensp;&ensp;&ensp;&ensp; Option 1 (recommended): evaluate with the module's script

```
sh run.sh task_name task_type
Parameters:
task_name: one of seq2seq_naive, seq2seq_att, keywords, human;
task_type: one of train, predict, evaluate, inference (train: run training only; predict: run prediction only; evaluate: run evaluation only, which depends on the prediction output; inference: save the inference model);
Evaluation example: sh run.sh human evaluate
```

#### &ensp;&ensp;&ensp;&ensp; Option 2: run the evaluation code directly:

```
export CUDA_VISIBLE_DEVICES=  #CPU evaluation by default
python -u main.py \
--do_eval=true \
--use_cuda=false \
--evaluation_file="data/input/data/label_data/human/test.ids" \
--output_prediction_file="data/output/finetuning_human_predict" \
--loss_type="L2"
```
### Model inference

#### 1. Stage one: inference with the matching model

#### &ensp;&ensp;&ensp;&ensp; Option 1 (recommended): save the inference model with the module's script

```
sh run.sh matching inference
```

&ensp;&ensp;&ensp;&ensp; To save the inference model on CPU with option 1:

```
Set the following parameter in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```

&ensp;&ensp;&ensp;&ensp; To save the inference model on GPU with option 1:

```
Set the following parameter in run.sh:
Single-GPU inference (pick an idle card):
export CUDA_VISIBLE_DEVICES=0
```

#### &ensp;&ensp;&ensp;&ensp; Option 2: run the inference-model code directly:

```
export CUDA_VISIBLE_DEVICES=0  #single-GPU inference
#export CUDA_VISIBLE_DEVICES=  #CPU inference
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
use_cuda=false
else
use_cuda=true
fi
python -u main.py \
--do_save_inference_model=true \
--use_cuda=${use_cuda} \
--init_from_params="data/saved_models/trained_models/matching_pretrained/params" \
--inference_model_dir="data/inference_models/matching_inference_model"
```

#### 2. Stage two: inference with the fine-tuned models

#### &ensp;&ensp;&ensp;&ensp; Option 1 (recommended): save the inference model with the module's script

```
sh run.sh task_name task_type
Parameters:
task_name: one of seq2seq_naive, seq2seq_att, keywords, human;
task_type: one of train, predict, evaluate, inference (train: run training only; predict: run prediction only; evaluate: run evaluation only, which depends on the prediction output; inference: save the inference model);
Inference example: sh run.sh human inference
```
&ensp;&ensp;&ensp;&ensp; Specify CPU or GPU as shown in model inference step 1;

#### &ensp;&ensp;&ensp;&ensp; Option 2: run the inference-model code directly:

```
export CUDA_VISIBLE_DEVICES=0  #single-GPU inference
#export CUDA_VISIBLE_DEVICES=  #CPU inference
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
use_cuda=false
else
use_cuda=true
fi
python -u main.py \
--do_save_inference_model=true \
--use_cuda=${use_cuda} \
--init_from_params="data/saved_models/trained_models/human_finetuned/params" \
--inference_model_dir="data/inference_models/human_inference_model"
```
### Serving deployment

&ensp;&ensp;&ensp;&ensp; The module provides trained inference_model files for its dialogue tasks, which users can download and use according to their own business needs.

#### Server-side deployment

&ensp;&ensp;&ensp;&ensp; See PaddlePaddle's official [server-side deployment](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/advanced_usage/deploy/inference/index_cn.html) documentation to deploy the model in production.
## 3. Advanced Usage

### Background

&ensp;&ensp;&ensp;&ensp; The input of the auto dialogue evaluation task is a text pair (context, response) and the output is a response quality score. Matching (predicting whether a context and a response match) is naturally related to auto evaluation, so this project uses the matching task as pretraining for auto evaluation and then fine-tunes on top of the matching model with a small amount of labeled data.

### Model overview

&ensp;&ensp;&ensp;&ensp; The models provided in this module are (a loss sketch follows below):

&ensp;&ensp;&ensp;&ensp; 1) Matching model: takes the context and response as input, uses an LSTM to learn a representation of each sentence, then computes the bilinear tensor product of the two representations as the logits and trains with sigmoid_cross_entropy_with_logits as the loss; the resulting score measures similarity;

&ensp;&ensp;&ensp;&ensp; 2) Fine-tuning model: on top of the matching model, replaces the sigmoid_cross_entropy_with_logits loss with a squared (L2) loss for training;
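As a minimal numerical sketch of the two losses (pure NumPy, illustrative only; the real model computes logits with fluid.layers.bilinear_tensor_product as in ade_net.py), note how the fine-tuning stage maps the logit to a score in [0, 2] via 2*sigmoid(logit) and penalizes the squared distance to the human label, scaled by 1/4:

```
# Illustrative loss computation for a single (logit, label) pair; assumes the
# label is 0/1 for matching (CLS) and a human score in [0, 2] for fine-tuning (L2).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cls_loss(logit, label):
    """Numerically stable sigmoid cross entropy with logits (matching stage)."""
    return max(logit, 0.0) - logit * label + np.log1p(np.exp(-abs(logit)))

def l2_loss(logit, label):
    """Fine-tuning stage: score = 2*sigmoid(logit) in [0, 2]; squared error / 4."""
    norm_score = 2.0 * sigmoid(logit)
    return (norm_score - label) ** 2 / 4.0

print(cls_loss(2.0, 1))    # small loss: the logit agrees with the positive label
print(l2_loss(2.0, 1.76))  # near zero: 2*sigmoid(2.0) is approximately 1.76
```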
&ensp;&ensp;&ensp;&ensp; The data format required by the models is as follows:

&ensp;&ensp;&ensp;&ensp; The data used for training, prediction and evaluation has three columns separated by tabs ('\t'): the first is the space-separated context ids, the second the space-separated response ids, and the third the label, for example (a parsing sketch follows the sample):
```
723 236 7823 12 8 887 13 77 4 2
8474 13 44 34 2 87 91 23 0
```
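A minimal sketch of reading this format (illustrative only; the module's own reader is the DataProcessor class in reader.py below), assuming each line holds the three tab-separated fields described above:

```
# Parse one dataset line: context ids TAB response ids TAB label.
def parse_example(line, max_seq_len=50):
    context, response, label = line.rstrip("\n").split("\t")
    context_ids = [int(x) for x in context.split()[:max_seq_len]]
    response_ids = [int(x) for x in response.split()[:max_seq_len]]
    return context_ids, response_ids, int(label)

print(parse_example("723 236 7823\t12 8 887\t1"))
# -> ([723, 236, 7823], [12, 8, 887], 1)
```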
&ensp;&ensp;&ensp;&ensp; Note: the project also ships a word segmentation preprocessing script (in the preprocess directory); usage:
```
python tokenizer.py --test_data_dir ./test.txt.utf8 --batch_size 1 > test.txt.utf8.seg
```
## 4. References

1. Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198.
2. Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. arXiv preprint arXiv:1708.07149.
3. Sebastian Möller, Roman Englert, Klaus Engelbrecht, Verena Hafner, Anthony Jameson, Antti Oulasvirta, Alexander Raake, and Norbert Reithinger. 2006. MeMo: towards automatic usability evaluation of spoken dialogue services by user error simulations. In Ninth International Conference on Spoken Language Processing.
4. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.
5. Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2017. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. arXiv preprint arXiv:1701.03079.
6. Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, pages 271-280. Association for Computational Linguistics.
7. Zhao Yan, Nan Duan, Junwei Bao, Peng Chen, Ming Zhou, Zhoujun Li, and Jianshe Zhou. 2016. DocChat: An information retrieval approach for chatbot engines using unstructured documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 516-525.
8. Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
9. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.
### Code structure

main.py: the project's entry point, wrapping training, prediction and evaluation
config.py: the project's model configuration, including the model type and hyperparameters
reader.py: data reading and vocabulary loading
evaluation.py: evaluation functions
init.py: model loading functions
run.sh: the training/prediction/evaluation launcher script

## 5. Release Notes

First release: PaddlePaddle 1.4.0
Features: training, prediction and system performance evaluation on data from 4 different dialogue systems

Second release: PaddlePaddle 1.6.0
Updates: on top of the first release, the module's training, prediction and evaluation code was refactored to follow PaddlePaddle's model standardization guidelines, improving usability;

## Authors

zhangxiyuan01@baidu.com
zhouxiangyang@baidu.com
lilu12@baidu.com

## Contributing

&ensp;&ensp;&ensp;&ensp;If you can fix an issue or add a new feature, feel free to submit a PR. If the PR is accepted, we will score the contribution by quality and difficulty (0-5, the higher the better). Once you accumulate 10 points, you can contact us for an interview opportunity or a recommendation letter.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""this file is only used for continuous evaluation test!"""
import os
import sys
......
"""
Evaluation for auto dialogue evaluation
"""
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Evaluation for auto dialogue evaluation"""
import sys
import numpy as np
......
#!/bin/bash
#check data directory
cd ..
echo "Start download data and models.............."
if [ ! -d "data" ]; then
echo "Directory data does not exist, make new data directory"
mkdir data
fi
cd data
#check configure file
if [ ! -d "config" ]; then
echo "config directory not exist........"
exit 255
else
if [ ! -f "config/ade.yaml" ]; then
echo "config file dgu.yaml has been lost........"
exit 255
fi
fi
#check and download input data
if [ ! -d "input" ]; then
echo "Directory input does not exist, make new input directory"
mkdir input
fi
cd input
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_dataset-1.0.0.tar.gz
tar -zxvf auto_dialogue_evaluation_dataset-1.0.0.tar.gz
rm auto_dialogue_evaluation_dataset-1.0.0.tar.gz
cd ..
#check and download pretrain model
if [ ! -d "pretrain_model" ]; then
echo "Directory pretrain_model does not exist, make new pretrain_model directory"
mkdir pretrain_model
fi
#check and download inference model
if [ ! -d "inference_models" ]; then
echo "Directory inference_models does not exist, make new inference_models directory"
mkdir inference_models
fi
#check output
if [ ! -d "output" ]; then
echo "Directory output does not exist, make new output directory"
mkdir output
fi
#check saved model
if [ ! -d "saved_models" ]; then
echo "Directory saved_models does not exist, make new saved_models directory"
mkdir saved_models
fi
cd saved_models
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_models.2.0.0.tar.gz
tar -xvf auto_dialogue_evaluation_models.2.0.0.tar.gz
rm auto_dialogue_evaluation_models.2.0.0.tar.gz
echo "Finish.............."
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Reader for auto dialogue evaluation"""
import sys
import time
import random
import numpy as np
import paddle
import paddle.fluid as fluid
class DataProcessor(object):
def __init__(self, data_path, max_seq_length, batch_size):
"""init"""
self.data_file = data_path
self.max_seq_len = max_seq_length
self.batch_size = batch_size
self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
def get_examples(self):
"""load examples"""
examples = []
with open(self.data_file, 'r') as fr:
for line in fr:
examples.append(line.strip())
return examples
def get_num_examples(self, phase):
"""Get number of examples for train, dev or test."""
if phase not in ['train', 'dev', 'test']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'test'].")
        with open(self.data_file, 'r') as fr:
            count = len(fr.readlines())
self.num_examples[phase] = count
return self.num_examples[phase]
def data_generator(self,
place,
phase="train",
shuffle=True,
sample_pro=1):
"""
Generate data for train, dev or test.
Args:
phase: string. The phase for which to generate data.
shuffle: bool. Whether to shuffle examples.
sample_pro: sample data ratio
"""
examples = self.get_examples()
if shuffle:
np.random.shuffle(examples)
def batch_reader():
"""read batch data"""
batch = []
for example in examples:
if sample_pro < 1:
if random.random() > sample_pro:
continue
tokens = example.strip().split('\t')
assert len(tokens) == 3
context = [int(x) for x in tokens[0].split()[: self.max_seq_len]]
response = [int(x) for x in tokens[1].split()[: self.max_seq_len]]
label = [int(tokens[2])]
instance = (context, response, label)
if len(batch) < self.batch_size:
batch.append(instance)
else:
if len(batch) == self.batch_size:
yield batch
batch = [instance]
if len(batch) > 0:
yield batch
def create_lodtensor(data_ids, place):
"""create LodTensor for input ids"""
cur_len = 0
lod = [cur_len]
seq_lens = [len(ids) for ids in data_ids]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
flattened_data = np.concatenate(data_ids, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res.set(flattened_data, place)
res.set_lod([lod])
return res
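        # Illustration (assumed example, not part of the original file): for
        # data_ids = [[1, 2, 3], [4, 5]] the computed lod is [0, 3, 5] and
        # flattened_data has shape [5, 1]; Fluid uses these offsets to recover
        # the two variable-length sequences packed into a single LoDTensor.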
def wrapper():
"""yield batch data to network"""
for batch_data in batch_reader():
context_ids = [batch[0] for batch in batch_data]
response_ids = [batch[1] for batch in batch_data]
label_ids = [batch[2] for batch in batch_data]
context_res = create_lodtensor(context_ids, place)
response_res = create_lodtensor(response_ids, place)
label_ids = np.array(label_ids).astype("int64").reshape([-1, 1])
input_batch = [context_res, response_res, label_ids]
yield input_batch
return wrapper
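# Typical usage of DataProcessor (illustrative sketch, not from the original file):
#     processor = DataProcessor("data/input/data/unlabel_data/train.ids",
#                               max_seq_length=50, batch_size=256)
#     for context, response, labels in processor.data_generator(place)():
#         ...  # feed [context, response, labels] to the network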
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import argparse
import json
import warnings
import yaml
import six
import logging
logging_only_message = "%(message)s"
logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s"
class JsonConfig(object):
"""
A high-level api for handling json configure file.
"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
        except Exception:
            raise IOError("Error in parsing bert model config file '%s'" %
                          config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
class ArgConfig(object):
"""
A high-level api for handling argument configs.
"""
def __init__(self):
parser = argparse.ArgumentParser()
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5,
"Learning rate used to train with warmup.")
train_g.add_arg(
"lr_scheduler",
str,
"linear_warmup_decay",
"scheduler of learning rate.",
choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01,
"Weight decay rate for L2 regularizer.")
train_g.add_arg(
"warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for."
)
train_g.add_arg("save_steps", int, 1000,
"The steps interval to save checkpoints.")
train_g.add_arg("use_fp16", bool, False,
"Whether to use fp16 mixed precision training.")
train_g.add_arg(
"loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled."
)
train_g.add_arg("pred_dir", str, None,
"Path to save the prediction results")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10,
"The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True,
"If set, use GPU for training.")
run_type_g.add_arg(
"use_fast_executor", bool, False,
"If set, use fast parallel executor (in experiment).")
run_type_g.add_arg(
"num_iteration_per_drop_scope", int, 1,
"Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("do_train", bool, True,
"Whether to perform training.")
run_type_g.add_arg("do_predict", bool, True,
"Whether to perform prediction.")
custom_g = ArgumentGroup(parser, "customize", "customized options.")
self.custom_g = custom_g
self.parser = parser
def add_arg(self, name, dtype, default, descrip):
self.custom_g.add_arg(name, dtype, default, descrip)
def build_conf(self):
return self.parser.parse_args()
def str2bool(v):
# because argparse does not support to parse "true, False" as python
# boolean directly
return v.lower() in ("true", "t", "1")
def print_arguments(args, log=None):
if not log:
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
else:
log.info('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
class PDConfig(object):
"""
A high-level API for managing configuration files in PaddlePaddle.
    Can jointly work with command-line arguments, json files and yaml files.
"""
def __init__(self, json_file="", yaml_file="", fuse_args=True):
"""
        Init function for PDConfig.
json_file: the path to the json configure file.
yaml_file: the path to the yaml configure file.
fuse_args: if fuse the json/yaml configs with argparse.
"""
assert isinstance(json_file, str)
assert isinstance(yaml_file, str)
if json_file != "" and yaml_file != "":
            warnings.warn(
                "json_file and yaml_file can not co-exist for now. please only use one configure file type."
            )
            return
self.args = None
self.arg_config = {}
self.json_config = {}
self.yaml_config = {}
parser = argparse.ArgumentParser()
self.default_g = ArgumentGroup(parser, "default", "default options.")
self.yaml_g = ArgumentGroup(parser, "yaml", "options from yaml.")
self.json_g = ArgumentGroup(parser, "json", "options from json.")
self.com_g = ArgumentGroup(parser, "custom", "customized options.")
self.default_g.add_arg("epoch", int, 2,
"Number of epoches for training.")
self.default_g.add_arg("learning_rate", float, 1e-2,
"Learning rate used to train.")
self.default_g.add_arg("do_train", bool, False,
"Whether to perform training.")
self.default_g.add_arg("do_predict", bool, False,
"Whether to perform predicting.")
self.default_g.add_arg("do_eval", bool, False,
"Whether to perform evaluating.")
self.parser = parser
if json_file != "":
self.load_json(json_file, fuse_args=fuse_args)
if yaml_file:
self.load_yaml(yaml_file, fuse_args=fuse_args)
def load_json(self, file_path, fuse_args=True):
if not os.path.exists(file_path):
raise Warning("the json file %s does not exist." % file_path)
return
with open(file_path, "r") as fin:
self.json_config = json.loads(fin.read())
fin.close()
if fuse_args:
for name in self.json_config:
if not isinstance(self.json_config[name], int) \
and not isinstance(self.json_config[name], float) \
and not isinstance(self.json_config[name], str) \
and not isinstance(self.json_config[name], bool):
continue
self.json_g.add_arg(name,
type(self.json_config[name]),
self.json_config[name],
"This is from %s" % file_path)
def load_yaml(self, file_path, fuse_args=True):
if not os.path.exists(file_path):
raise Warning("the yaml file %s does not exist." % file_path)
return
with open(file_path, "r") as fin:
self.yaml_config = yaml.load(fin, Loader=yaml.SafeLoader)
fin.close()
if fuse_args:
for name in self.yaml_config:
if not isinstance(self.yaml_config[name], int) \
and not isinstance(self.yaml_config[name], float) \
and not isinstance(self.yaml_config[name], str) \
and not isinstance(self.yaml_config[name], bool):
continue
self.yaml_g.add_arg(name,
type(self.yaml_config[name]),
self.yaml_config[name],
"This is from %s" % file_path)
def build(self):
self.args = self.parser.parse_args()
self.arg_config = vars(self.args)
def __add__(self, new_arg):
assert isinstance(new_arg, list) or isinstance(new_arg, tuple)
assert len(new_arg) >= 3
assert self.args is None
name = new_arg[0]
dtype = new_arg[1]
dvalue = new_arg[2]
desc = new_arg[3] if len(
new_arg) == 4 else "Description is not provided."
self.com_g.add_arg(name, dtype, dvalue, desc)
return self
def __getattr__(self, name):
if name in self.arg_config:
return self.arg_config[name]
if name in self.json_config:
return self.json_config[name]
if name in self.yaml_config:
return self.yaml_config[name]
raise Warning("The argument %s is not defined." % name)
def Print(self):
print("-" * 70)
for name in self.arg_config:
print("%s:\t\t\t\t%s" % (str(name), str(self.arg_config[name])))
for name in self.json_config:
if name not in self.arg_config:
print("%s:\t\t\t\t%s" %
(str(name), str(self.json_config[name])))
for name in self.yaml_config:
if name not in self.arg_config:
print("%s:\t\t\t\t%s" %
(str(name), str(self.yaml_config[name])))
print("-" * 70)
if __name__ == "__main__":
"""
pd_config = PDConfig(json_file = "./test/bert_config.json")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
pd_config = PDConfig(yaml_file = "./test/bert_config.yaml")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
"""
pd_config = PDConfig(yaml_file="./test/bert_config.yaml")
pd_config += ("my_age", int, 18, "I am forever 18.")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
print(pd_config.my_age)
"""
Init for pretrained para
"""
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
from __future__ import print_function
import os
......
import numpy as np
import paddle.fluid as fluid
def init_pretraining_params(exe, pretraining_params_path, main_program):
"""
Init pretraining params
"""
    assert os.path.exists(pretraining_params_path
                          ), "[%s] can't be found." % pretraining_params_path
def existed_params(var):
"""
Test existed
"""
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
print("Load pretraining parameters from {}.".format(
pretraining_params_path))
class InputField(object):
def __init__(self, input_field):
"""init inpit field"""
self.context_wordseq = input_field[0]
self.response_wordseq = input_field[1]
self.labels = input_field[2]
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import paddle
import paddle.fluid as fluid
def check_cuda(use_cuda, err = \
"\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n"
):
try:
        if use_cuda and not fluid.is_compiled_with_cuda():
print(err)
sys.exit(1)
except Exception as e:
pass
if __name__ == "__main__":
check_cuda(True)
check_cuda(False)
check_cuda(True, "This is only for testing.")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""save or load model api"""
import os
import sys
import warnings
import paddle
import paddle.fluid as fluid
def init_from_pretrain_model(args, exe, program):
assert isinstance(args.init_from_pretrain_model, str)
if not os.path.exists(args.init_from_pretrain_model):
raise Warning("The pretrained params do not exist.")
return False
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(
os.path.join(args.init_from_pretrain_model, var.name))
fluid.io.load_vars(
exe,
args.init_from_pretrain_model,
main_program=program,
predicate=existed_params)
print("finish initing model from pretrained params from %s" %
(args.init_from_pretrain_model))
return True
def init_from_checkpoint(args, exe, program):
assert isinstance(args.init_from_checkpoint, str)
if not os.path.exists(args.init_from_checkpoint):
raise Warning("the checkpoint path does not exist.")
return False
fluid.io.load_persistables(
executor=exe,
dirname=args.init_from_checkpoint,
main_program=program,
filename="checkpoint.pdckpt")
print("finish initing model from checkpoint from %s" %
(args.init_from_checkpoint))
return True
def init_from_params(args, exe, program):
assert isinstance(args.init_from_params, str)
if not os.path.exists(args.init_from_params):
raise Warning("the params path does not exist.")
return False
fluid.io.load_params(
executor=exe,
dirname=args.init_from_params,
main_program=program,
filename="params.pdparams")
print("finish init model from params from %s" % (args.init_from_params))
return True
def save_checkpoint(args, exe, program, dirname):
assert isinstance(args.save_model_path, str)
checkpoint_dir = os.path.join(args.save_model_path, args.save_checkpoint)
if not os.path.exists(checkpoint_dir):
os.mkdir(checkpoint_dir)
fluid.io.save_persistables(
exe,
os.path.join(checkpoint_dir, dirname),
main_program=program,
filename="checkpoint.pdckpt")
print("save checkpoint at %s" % (os.path.join(checkpoint_dir, dirname)))
return True
def save_param(args, exe, program, dirname):
assert isinstance(args.save_model_path, str)
param_dir = os.path.join(args.save_model_path, args.save_param)
if not os.path.exists(param_dir):
os.mkdir(param_dir)
fluid.io.save_params(
exe,
os.path.join(param_dir, dirname),
main_program=program,
filename="params.pdparams")
print("save parameters at %s" % (os.path.join(param_dir, dirname)))
return True
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Network for auto dialogue evaluation"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle
import paddle.fluid as fluid
def create_net(
is_training,
model_input,
args,
clip_value=10.0,
word_emb_name="shared_word_emb",
lstm_W_name="shared_lstm_W",
lstm_bias_name="shared_lstm_bias"):
context_wordseq = model_input.context_wordseq
response_wordseq = model_input.response_wordseq
label = model_input.labels
#emb
context_emb = fluid.layers.embedding(
input=context_wordseq,
size=[args.vocab_size, args.emb_size],
is_sparse=True,
param_attr=fluid.ParamAttr(
name=word_emb_name,
initializer=fluid.initializer.Normal(scale=0.1)))
response_emb = fluid.layers.embedding(
input=response_wordseq,
size=[args.vocab_size, args.emb_size],
is_sparse=True,
param_attr=fluid.ParamAttr(
name=word_emb_name,
initializer=fluid.initializer.Normal(scale=0.1)))
#fc to fit dynamic LSTM
context_fc = fluid.layers.fc(
input=context_emb,
size=args.hidden_size * 4,
param_attr=fluid.ParamAttr(name='fc_weight'),
bias_attr=fluid.ParamAttr(name='fc_bias'))
response_fc = fluid.layers.fc(
input=response_emb,
size=args.hidden_size * 4,
param_attr=fluid.ParamAttr(name='fc_weight'),
bias_attr=fluid.ParamAttr(name='fc_bias'))
#LSTM
context_rep, _ = fluid.layers.dynamic_lstm(
input=context_fc,
size=args.hidden_size * 4,
param_attr=fluid.ParamAttr(name=lstm_W_name),
bias_attr=fluid.ParamAttr(name=lstm_bias_name))
context_rep = fluid.layers.sequence_last_step(context_rep)
response_rep, _ = fluid.layers.dynamic_lstm(
input=response_fc,
size=args.hidden_size * 4,
param_attr=fluid.ParamAttr(name=lstm_W_name),
bias_attr=fluid.ParamAttr(name=lstm_bias_name))
response_rep = fluid.layers.sequence_last_step(input=response_rep)
logits = fluid.layers.bilinear_tensor_product(
context_rep, response_rep, size=1)
if args.loss_type == 'CLS':
label = fluid.layers.cast(x=label, dtype='float32')
loss = fluid.layers.sigmoid_cross_entropy_with_logits(logits, label)
loss = fluid.layers.reduce_mean(
fluid.layers.clip(
loss, min=-clip_value, max=clip_value))
elif args.loss_type == 'L2':
norm_score = 2 * fluid.layers.sigmoid(logits)
label = fluid.layers.cast(x=label, dtype='float32')
loss = fluid.layers.square_error_cost(norm_score, label) / 4
loss = fluid.layers.reduce_mean(loss)
    else:
        raise ValueError("Unknown loss type: %s" % args.loss_type)
if is_training:
return loss
else:
return logits
def set_word_embedding(word_emb, place, word_emb_name="shared_word_emb"):
"""
Set word embedding
"""
word_emb_param = fluid.global_scope().find_var(
word_emb_name).get_tensor()
word_emb_param.set(word_emb, place)
"""
Auto Dialogue Evaluation.
"""
import argparse
import six
def parse_args():
"""
Auto Dialogue Evaluation Config
"""
parser = argparse.ArgumentParser('Automatic Dialogue Evaluation.')
parser.add_argument(
'--do_train', type=bool, default=False, help='Whether to perform training.')
parser.add_argument(
'--do_val', type=bool, default=False, help='Whether to perform evaluation.')
parser.add_argument(
'--do_infer', type=bool, default=False, help='Whether to perform inference.')
parser.add_argument(
'--loss_type', type=str, default='CLS', help='Loss type, CLS or L2.')
#data path
parser.add_argument(
'--train_path', type=str, default=None, help='Path of training data')
parser.add_argument(
'--val_path', type=str, default=None, help='Path of validation data')
parser.add_argument(
'--test_path', type=str, default=None, help='Path of test data')
parser.add_argument(
'--save_path', type=str, default='tmp', help='Save path')
#step fit for data size
parser.add_argument(
'--print_step', type=int, default=50, help='Print step')
parser.add_argument(
'--save_step', type=int, default=400, help='Save step')
parser.add_argument(
        '--num_scan_data', type=int, default=20, help='Number of times to scan the training data')
parser.add_argument(
'--word_emb_init', type=str, default=None, help='Path to the initial word embedding')
parser.add_argument(
'--init_model', type=str, default=None, help='Path to the init model')
parser.add_argument(
'--use_cuda',
action='store_true',
help='If set, use cuda for training.')
parser.add_argument(
'--batch_size', type=int, default=256, help='Batch size')
parser.add_argument(
'--hidden_size', type=int, default=256, help='Hidden size')
parser.add_argument(
'--emb_size', type=int, default=256, help='Embedding size')
parser.add_argument(
'--vocab_size', type=int, default=484016, help='Vocabulary size')
parser.add_argument(
'--learning_rate', type=float, default=0.001, help='Learning rate')
parser.add_argument(
'--sample_pro', type=float, default=1, help='Sample probability for training data')
parser.add_argument(
'--max_len', type=int, default=50, help='Max length for sentences')
args = parser.parse_args()
return args
def print_arguments(args):
"""
Print Config
"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
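# Typical usage of this configuration module (illustrative):
#     args = parse_args()
#     print_arguments(args)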
loss_type: "CLS"
training_file: ""
val_file: ""
predict_file: ""
print_steps: 10
save_steps: 10
num_scan_data: ""
word_emb_init: ""
init_model: ""
use_cuda: ""
batch_size: 256
hidden_size: 256
emb_size: 256
vocab_size: 484016
sample_pro: 1.0
output_prediction_file: ""
init_from_checkpoint: ""
init_from_params: ""
init_from_pretrain_model: ""
inference_model_dir: ""
save_model_path: ""
save_checkpoint: ""
save_param: ""
evaluation_file: ""
vocab_path: ""
max_seq_len: 128
random_seed: 110
do_save_inference_model: False
enable_ce: "store_true"
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_dataset-1.0.0.tar.gz
tar -xzf auto_dialogue_evaluation_dataset-1.0.0.tar.gz
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""evaluation metrics"""
import os
import sys
import numpy as np
import ade.evaluate as evaluate
from ade.utils.configure import PDConfig
def do_eval(args):
"""evaluate metrics"""
labels = []
with open(args.evaluation_file, 'r') as fr:
for line in fr:
tokens = line.strip().split('\t')
assert len(tokens) == 3
label = int(tokens[2])
labels.append(label)
scores = []
with open(args.output_prediction_file, 'r') as fr:
for line in fr:
tokens = line.strip().split('\t')
assert len(tokens) == 2
score = tokens[1].strip("[]").split()
score = np.array(score)
score = score.astype(np.float64)
scores.append(score)
if args.loss_type == 'CLS':
recall_dict = evaluate.evaluate_Recall(list(zip(scores, labels)))
mean_score = sum(scores) / len(scores)
print('mean score: %.6f' % mean_score)
print('evaluation recall result:')
print('1_in_2: %.6f\t1_in_10: %.6f\t2_in_10: %.6f\t5_in_10: %.6f' %
(recall_dict['1_in_2'], recall_dict['1_in_10'],
recall_dict['2_in_10'], recall_dict['5_in_10']))
elif args.loss_type == 'L2':
scores = [x[0] for x in scores]
mean_score = sum(scores) / len(scores)
cor = evaluate.evaluate_cor(scores, labels)
print('mean score: %.6f\nevaluation cor results:%.6f' %
(mean_score, cor))
else:
raise ValueError
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
do_eval(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""save inference model for auto dialogue evaluation"""
import os
import sys
import six
import numpy as np
import time
import multiprocessing
import paddle
import paddle.fluid as fluid
import ade.reader as reader
from ade_net import create_net
from ade.utils.configure import PDConfig
from ade.utils.input_field import InputField
from ade.utils.model_check import check_cuda
import ade.utils.save_load_io as save_load_io
def do_save_inference_model(args):
test_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(test_prog, startup_prog):
test_prog.random_seed = args.random_seed
startup_prog.random_seed = args.random_seed
with fluid.unique_name.guard():
context_wordseq = fluid.layers.data(
name='context_wordseq', shape=[1], dtype='int64', lod_level=1)
response_wordseq = fluid.layers.data(
name='response_wordseq', shape=[1], dtype='int64', lod_level=1)
labels = fluid.layers.data(
name='labels', shape=[1], dtype='int64')
input_inst = [context_wordseq, response_wordseq, labels]
input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(feed_list=input_inst,
capacity=4, iterable=False)
logits = create_net(
is_training=False,
model_input=input_field,
args=args
)
if args.use_cuda:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(startup_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model)
if args.init_from_params:
save_load_io.init_from_params(args, exe, test_prog)
elif args.init_from_pretrain_model:
save_load_io.init_from_pretrain_model(args, exe, test_prog)
# saving inference model
fluid.io.save_inference_model(
args.inference_model_dir,
feeded_var_names=[
input_field.context_wordseq.name,
input_field.response_wordseq.name,
],
target_vars=[
logits,
],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
print("save inference model at %s" % (args.inference_model_dir))
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
check_cuda(args.use_cuda)
do_save_inference_model(args)
"""
Auto dialogue evaluation task
"""
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import six
import numpy as np
import time
import multiprocessing
import paddle
import paddle.fluid as fluid
import reader as reader
import evaluation as eva
import init as init
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
sys.path.append('../../models/dialogue_model_toolkit/auto_dialogue_evaluation/')
sys.path.append('../../models/')
from net import Network
import config
from model_check import check_cuda
def train(args):
"""Train
"""
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
net = Network(args.vocab_size, args.emb_size, args.hidden_size)
train_program = fluid.Program()
train_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
train_program.random_seed = 110
train_startup.random_seed = 110
with fluid.program_guard(train_program, train_startup):
with fluid.unique_name.guard():
logits, loss = net.network(args.loss_type)
loss.persistable = True
logits.persistable = True
# gradient clipping
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
max=1.0, min=-1.0))
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
optimizer.minimize(loss)
print("begin memory optimization ...")
fluid.memory_optimize(train_program)
print("end memory optimization ...")
test_program = fluid.Program()
test_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
test_program.random_seed = 110
test_startup.random_seed = 110
with fluid.program_guard(test_program, test_startup):
with fluid.unique_name.guard():
logits, loss = net.network(args.loss_type)
loss.persistable = True
logits.persistable = True
test_program = test_program.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
print("device count %d" % dev_count)
print("theoretical memory usage: ")
print(
fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size))
exe = fluid.Executor(place)
exe.run(train_startup)
exe.run(test_startup)
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda, loss_name=loss.name, main_program=train_program)
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_program,
share_vars_from=train_exe)
if args.word_emb_init is not None:
print("start loading word embedding init ...")
if six.PY2:
word_emb = np.array(pickle.load(open(args.word_emb_init,
'rb'))).astype('float32')
else:
word_emb = np.array(
pickle.load(
open(args.word_emb_init, 'rb'), encoding="bytes")).astype(
'float32')
net.set_word_embedding(word_emb, place)
print("finish init word embedding ...")
print("start loading data ...")
def train_with_feed(batch_data):
"""
Train on one batch
"""
#to do get_feed_names
feed_dict = dict(zip(net.get_feed_names(), batch_data))
cost = train_exe.run(feed=feed_dict, fetch_list=[loss.name])
return cost[0]
def test_with_feed(batch_data):
"""
Test on one batch
"""
feed_dict = dict(zip(net.get_feed_names(), batch_data))
score = test_exe.run(feed=feed_dict, fetch_list=[logits.name])
return score[0]
def evaluate():
"""
Evaluate to choose model
"""
val_batches = reader.batch_reader(args.val_path, args.batch_size, place,
args.max_len, 1)
scores = []
labels = []
for batch in val_batches:
scores.extend(test_with_feed(batch))
labels.extend([x[0] for x in batch[2]])
return eva.evaluate_Recall(list(zip(scores, labels)))
def save_exe(step, best_recall):
"""
Save exe conditional
"""
recall_dict = evaluate()
print('evaluation recall result:')
print('1_in_2: %s\t1_in_10: %s\t2_in_10: %s\t5_in_10: %s' %
(recall_dict['1_in_2'], recall_dict['1_in_10'],
recall_dict['2_in_10'], recall_dict['5_in_10']))
if recall_dict['1_in_10'] > best_recall and step != 0:
fluid.io.save_inference_model(
args.save_path,
net.get_feed_inference_names(),
logits,
exe,
main_program=train_program)
print("Save model at step %d ... " % step)
print(
time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
best_recall = recall_dict['1_in_10']
return best_recall
    # train over different epochs
global_step, train_time = 0, 0.0
best_recall = 0
for epoch in six.moves.xrange(args.num_scan_data):
train_batches = reader.batch_reader(args.train_path, args.batch_size,
place, args.max_len,
args.sample_pro)
begin_time = time.time()
sum_cost = 0
ce_cost = 0
for batch in train_batches:
if (args.save_path is not None) and (
global_step % args.save_step == 0):
best_recall = save_exe(global_step, best_recall)
cost = train_with_feed(batch)
global_step += 1
sum_cost += cost.mean()
ce_cost = cost.mean()
if global_step % args.print_step == 0:
print('training step %s avg loss %s' %
(global_step, sum_cost / args.print_step))
sum_cost = 0
pass_time_cost = time.time() - begin_time
train_time += pass_time_cost
print("Pass {0}, pass_time_cost {1}"
.format(epoch, "%2.2f sec" % pass_time_cost))
if "CE_MODE_X" in os.environ and epoch == args.num_scan_data - 1:
card_num = get_cards()
print("kpis\ttrain_duration_card%s\t%s" %
(card_num, pass_time_cost))
print("kpis\ttrain_loss_card%s\t%s" % (card_num, ce_cost))
def finetune(args):
"""
Finetune
"""
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
net = Network(args.vocab_size, args.emb_size, args.hidden_size)
train_program = fluid.Program()
train_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
train_program.random_seed = 110
train_startup.random_seed = 110
with fluid.program_guard(train_program, train_startup):
with fluid.unique_name.guard():
logits, loss = net.network(args.loss_type)
loss.persistable = True
logits.persistable = True
# gradient clipping
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
max=1.0, min=-1.0))
optimizer = fluid.optimizer.Adam(
learning_rate=fluid.layers.exponential_decay(
learning_rate=args.learning_rate,
decay_steps=400,
decay_rate=0.9,
staircase=True))
optimizer.minimize(loss)
print("begin memory optimization ...")
fluid.memory_optimize(train_program)
print("end memory optimization ...")
test_program = fluid.Program()
test_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
test_program.random_seed = 110
test_startup.random_seed = 110
with fluid.program_guard(test_program, test_startup):
with fluid.unique_name.guard():
logits, loss = net.network(args.loss_type)
loss.persistable = True
logits.persistable = True
test_program = test_program.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
print("device count %d" % dev_count)
print("theoretical memory usage: ")
print(
fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size))
exe = fluid.Executor(place)
exe.run(train_startup)
exe.run(test_startup)
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda, loss_name=loss.name, main_program=train_program)
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_program,
share_vars_from=train_exe)
if args.init_model:
init.init_pretraining_params(
exe, args.init_model, main_program=train_startup)
print('success init %s' % args.init_model)
print("start loading data ...")
def train_with_feed(batch_data):
"""
Train on one batch
"""
# TODO: get_feed_names
feed_dict = dict(zip(net.get_feed_names(), batch_data))
cost = train_exe.run(feed=feed_dict, fetch_list=[loss.name])
return cost[0]
def test_with_feed(batch_data):
"""
Test on one batch
"""
feed_dict = dict(zip(net.get_feed_names(), batch_data))
score = test_exe.run(feed=feed_dict, fetch_list=[logits.name])
return score[0]
def evaluate():
"""
Evaluate to choose model
"""
val_batches = reader.batch_reader(args.val_path, args.batch_size, place,
args.max_len, 1)
scores = []
labels = []
for batch in val_batches:
scores.extend(test_with_feed(batch))
labels.extend([x[0] for x in batch[2]])
scores = [x[0] for x in scores]
return eva.evaluate_cor(scores, labels)
def save_exe(step, best_cor):
"""
Save the inference model when the evaluation correlation improves
"""
cor = evaluate()
print('evaluation correlation result: %s' % cor)
if cor > best_cor and step != 0:
fluid.io.save_inference_model(
args.save_path,
net.get_feed_inference_names(),
logits,
exe,
main_program=train_program)
print("Save model at step %d ... " % step)
print(
time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
best_cor = cor
return best_cor
# train over different epochs
global_step, train_time = 0, 0.0
best_cor = 0.0
pre_index = -1
for epoch in six.moves.xrange(args.num_scan_data):
train_batches = reader.batch_reader(args.train_path, args.batch_size,
place, args.max_len,
args.sample_pro)
begin_time = time.time()
sum_cost = 0
for batch in train_batches:
if (args.save_path is not None) and (
global_step % args.save_step == 0):
best_cor = save_exe(global_step, best_cor)
cost = train_with_feed(batch)
global_step += 1
sum_cost += cost.mean()
if global_step % args.print_step == 0:
print('training step %s avg loss %s' %
(global_step, sum_cost / args.print_step))
sum_cost = 0
pass_time_cost = time.time() - begin_time
train_time += pass_time_cost
print("Pass {0}, pass_time_cost {1}"
.format(epoch, "%2.2f sec" % pass_time_cost))
def evaluate(args):
"""
Evaluate model for both pretrained and finetuned
"""
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
t0 = time.time()
with fluid.scope_guard(fluid.Scope()):
infer_program, feed_target_names, fetch_vars = fluid.io.load_inference_model(
args.init_model, exe)
print('init model %s' % args.init_model)
global_step, infer_time = 0, 0.0
test_batches = reader.batch_reader(args.test_path, args.batch_size,
place, args.max_len, 1)
scores = []
labels = []
for batch in test_batches:
logits = exe.run(infer_program,
feed={
'context_wordseq': batch[0],
'response_wordseq': batch[1]
},
fetch_list=fetch_vars)
logits = [x[0] for x in logits[0]]
scores.extend(logits)
labels.extend([x[0] for x in batch[2]])
print('len scores: %s len labels: %s' % (len(scores), len(labels)))
mean_score = sum(scores) / len(scores)
if args.loss_type == 'CLS':
recall_dict = eva.evaluate_Recall(list(zip(scores, labels)))
print('mean score: %s' % mean_score)
print('evaluation recall result:')
print('1_in_2: %s\t1_in_10: %s\t2_in_10: %s\t5_in_10: %s' %
(recall_dict['1_in_2'], recall_dict['1_in_10'],
recall_dict['2_in_10'], recall_dict['5_in_10']))
elif args.loss_type == 'L2':
cor = eva.evaluate_cor(scores, labels)
print('mean score: %s\nevaluation cor results: %s' %
(mean_score, cor))
else:
raise ValueError
t1 = time.time()
print("finish evaluate model:%s on data:%s time_cost(s):%.2f" %
(args.init_model, args.test_path, t1 - t0))
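# Note (illustrative, not part of this module): the correlation reported by
# eva.evaluate_cor for L2-type evaluation is the correlation between model
# scores and human labels; an equivalent Spearman correlation could be
# computed with scipy, e.g.:
#   from scipy.stats import spearmanr
#   cor, _ = spearmanr(scores, labels)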
def infer(args):
"""
Inference function
"""
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
t0 = time.time()
with fluid.scope_guard(fluid.Scope()):
infer_program, feed_target_names, fetch_vars = fluid.io.load_inference_model(
args.init_model, exe)
global_step, infer_time = 0, 0.0
test_batches = reader.batch_reader(args.test_path, args.batch_size,
place, args.max_len, 1)
scores = []
for batch in test_batches:
logits = exe.run(infer_program,
feed={
'context_wordseq': batch[0],
'response_wordseq': batch[1]
},
fetch_list=fetch_vars)
logits = [x[0] for x in logits[0]]
scores.extend(logits)
in_file = open(args.test_path, 'r')
out_path = args.test_path + '.infer'
out_file = open(out_path, 'w')
for line, s in zip(in_file, scores):
out_file.write('%s\t%s\n' % (line.strip(), s))
in_file.close()
out_file.close()
t1 = time.time()
print("finish infer model:%s out file: %s time_cost(s):%.2f" %
(args.init_model, out_path, t1 - t0))
from eval import do_eval
from train import do_train
from predict import do_predict
from inference_model import do_save_inference_model
from ade.utils.configure import PDConfig
def get_cards():
num = 0
cards = os.environ.get('CUDA_VISIBLE_DEVICES', '')
if cards != '':
num = len(cards.split(","))
return num
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
args.Print()
check_cuda(args.use_cuda)
if args.do_train:
do_train(args)
if args.do_predict:
do_predict(args)
if args.do_eval:
do_eval(args)
if args.do_save_inference_model:
do_save_inference_model(args)
def main():
"""
main
"""
args = config.parse_args()
config.print_arguments(args)
if args.do_train:
if args.loss_type == 'CLS':
train(args)
elif args.loss_type == 'L2':
finetune(args)
else:
raise ValueError
elif args.do_val:
evaluate(args)
elif args.do_infer:
infer(args)
else:
raise ValueError
if __name__ == '__main__':
main()
# vim: set ts=4 sw=4 sts=4 tw=100:
#matching pretrained
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_matching_pretrained-1.0.0.tar.gz
tar -xzf auto_dialogue_evaluation_matching_pretrained-1.0.0.tar.gz
#finetuned
for task in seq2seq_naive seq2seq_att keywords human
do
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_${task}_finetuned-1.0.0.tar.gz
tar -xzf auto_dialogue_evaluation_${task}_finetuned-1.0.0.tar.gz
done
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""predict auto dialogue evaluation task"""
import os
import sys
import six
import time
import numpy as np
import multiprocessing
import paddle
import paddle.fluid as fluid
import ade.reader as reader
from ade_net import create_net
from ade.utils.configure import PDConfig
from ade.utils.input_field import InputField
from ade.utils.model_check import check_cuda
import ade.utils.save_load_io as save_load_io
def do_predict(args):
"""
predict function
"""
test_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(test_prog, startup_prog):
test_prog.random_seed = args.random_seed
startup_prog.random_seed = args.random_seed
with fluid.unique_name.guard():
context_wordseq = fluid.layers.data(
name='context_wordseq', shape=[1], dtype='int64', lod_level=1)
response_wordseq = fluid.layers.data(
name='response_wordseq', shape=[1], dtype='int64', lod_level=1)
labels = fluid.layers.data(
name='labels', shape=[1], dtype='int64')
input_inst = [context_wordseq, response_wordseq, labels]
input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(feed_list=input_inst,
capacity=4, iterable=False)
logits = create_net(
is_training=False,
model_input=input_field,
args=args
)
logits.persistable = True
fetch_list = [logits.name]
# for_test=True changes the is_test attribute of operators to True
test_prog = test_prog.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(startup_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model)
if args.init_from_params:
save_load_io.init_from_params(args, exe, test_prog)
if args.init_from_pretrain_model:
save_load_io.init_from_pretrain_model(args, exe, test_prog)
compiled_test_prog = fluid.CompiledProgram(test_prog)
processor = reader.DataProcessor(
data_path=args.predict_file,
max_seq_length=args.max_seq_len,
batch_size=args.batch_size)
batch_generator = processor.data_generator(
place=place,
phase="test",
shuffle=False,
sample_pro=1)
num_test_examples = processor.get_num_examples(phase='test')
data_reader.decorate_batch_generator(batch_generator)
data_reader.start()
scores = []
while True:
try:
results = exe.run(compiled_test_prog, fetch_list=fetch_list)
scores.extend(results[0])
except fluid.core.EOFException:
data_reader.reset()
break
scores = scores[: num_test_examples]
with open(args.output_prediction_file, 'w') as fw:
for index, score in enumerate(scores):
fw.write("%s\t%s\n" % (index, score))
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
args.Print()
check_cuda(args.use_cuda)
do_predict(args)
"""
Reader for auto dialogue evaluation
"""
import sys
import time
import numpy as np
import random
import paddle.fluid as fluid
import paddle
def to_lodtensor(data, place):
"""
Convert to LODtensor
"""
seq_lens = [len(seq) for seq in data]
cur_len = 0
lod = [cur_len]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res.set(flattened_data, place)
res.set_lod([lod])
return res
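# Example (illustrative): to_lodtensor([[1, 2, 3], [4, 5]], fluid.CPUPlace())
# yields flattened data of shape [5, 1] with lod [[0, 3, 5]]: the first
# sequence occupies rows 0-2 and the second rows 3-4.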
def reshape_batch(batch, place):
"""
Reshape batch
"""
context_reshape = to_lodtensor([dat[0] for dat in batch], place)
response_reshape = to_lodtensor([dat[1] for dat in batch], place)
label_reshape = [dat[2] for dat in batch]
return (context_reshape, response_reshape, label_reshape)
def batch_reader(data_path,
batch_size,
place,
max_len=50,
sample_pro=1):
"""
Yield batch
"""
batch = []
with open(data_path, 'r') as f:
for line in f:
#sample for training data
if sample_pro < 1:
if random.random() > sample_pro:
continue
tokens = line.strip().split('\t')
assert len(tokens) == 3
context = [int(x) for x in tokens[0].split()[:max_len]]
response = [int(x) for x in tokens[1].split()[:max_len]]
label = [int(tokens[2])]
#label = int(tokens[2])
instance = (context, response, label)
if len(batch) < batch_size:
batch.append(instance)
else:
if len(batch) == batch_size:
yield reshape_batch(batch, place)
batch = [instance]
if len(batch) == batch_size:
yield reshape_batch(batch, place)
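# Note: a trailing batch smaller than batch_size is dropped by design.
# Example usage (illustrative):
#   for context, response, labels in batch_reader('data/unlabel_data/train.ids',
#                                                 32, fluid.CPUPlace()):
#       ...  # context/response are LoDTensors, labels is a list of [label]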
export CUDA_VISIBLE_DEVICES=4
export FLAGS_eager_delete_tensor_gb=0.0
#pretrain
python -u main.py \
--do_train True \
--use_cuda \
--save_path model_files_tmp/matching_pretrained \
--train_path data/unlabel_data/train.ids \
--val_path data/unlabel_data/val.ids
#finetune based on one task
TASK=human
python -u main.py \
--do_train True \
--loss_type L2 \
--use_cuda \
--save_path model_files_tmp/${TASK}_finetuned \
--init_model model_files/matching_pretrained \
--train_path data/label_data/$TASK/train.ids \
--val_path data/label_data/$TASK/val.ids \
--print_step 1 \
--save_step 1 \
--num_scan_data 50
#evaluate pretrained model by Recall
python -u main.py \
--do_val True \
--use_cuda \
--test_path data/unlabel_data/test.ids \
--init_model model_files/matching_pretrained \
--loss_type CLS
#evaluate pretrained model by Cor
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--use_cuda \
--test_path data/label_data/$task/test.ids \
--init_model model_files/matching_pretrained \
--loss_type L2
done
#evaluate finetuned model by Cor
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--use_cuda \
--test_path data/label_data/$task/test.ids \
--init_model model_files/${task}_finetuned \
--loss_type L2
done
#infer
TASK=human
python -u main.py \
--do_infer True \
--use_cuda \
--test_path data/label_data/$TASK/test.ids \
--init_model model_files/${TASK}_finetuned
#!/bin/bash
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1.0
export CUDA_VISIBLE_DEVICES=0
if [ $# -ne 2 ]
then
echo "please input parameters: TRAIN_TYPE and TASK_TYPE"
echo "TRAIN_TYPE: [matching|seq2seq_naive|seq2seq_att|keywords|human]"
echo "TASK_TYPE: [train|predict|evaluate|inference]"
exit 255
fi
TRAIN_TYPE=$1
TASK_TYPE=$2
typeset -l TRAIN_TYPE
typeset -l TASK_TYPE
candi_train_type=("matching" "seq2seq_naive" "seq2seq_att" "keywords" "human")
candi_task_type=("train" "predict" "evaluate" "inference")
if [[ ! "${candi_train_type[@]}" =~ ${TRAIN_TYPE} ]]
then
echo "unknown parameter: ${TRAIN_TYPE}, just support [matching|seq2seq_naive|seq2seq_att|keywords|human]"
exit 255
fi
if [[ ! "${candi_task_type[@]}" =~ ${TASK_TYPE} ]]
then
echo "unknown parameter: ${TRAIN_TYPE}, just support [train|predict|evaluate|inference]"
exit 255
fi
INPUT_PATH="data/input/data"
OUTPUT_PATH="data/output"
SAVED_MODELS="data/saved_models"
INFERENCE_MODEL="data/inference_models"
PYTHON_PATH="python"
#train pretrain model
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
use_cuda=false
else
use_cuda=true
fi
#training
function pretrain_train()
{
pretrain_model_path="${SAVED_MODELS}/matching_pretrained"
if [ ! -d ${pretrain_model_path} ]
then
mkdir ${pretrain_model_path}
fi
${PYTHON_PATH} -u main.py \
--do_train=true \
--use_cuda=${1} \
--loss_type="CLS" \
--max_seq_len=50 \
--save_model_path=${pretrain_model_path} \
--save_param="params" \
--training_file="${INPUT_PATH}/unlabel_data/train.ids" \
--epoch=20 \
--print_step=1 \
--save_step=400 \
--batch_size=256 \
--hidden_size=256 \
--emb_size=256 \
--vocab_size=484016 \
--learning_rate=0.001 \
--sample_pro 0.1
}
function finetuning_train()
{
save_model_path="${SAVED_MODELS}/${2}_finetuned"
if [ ! -d ${save_model_path} ]
then
mkdir ${save_model_path}
fi
${PYTHON_PATH} -u main.py \
--do_train=true \
--use_cuda=${1} \
--loss_type="L2" \
--max_seq_len=50 \
--init_from_pretrain_model="${SAVED_MODELS}/matching_pretrained/params/step_final" \
--save_model_path=${save_model_path} \
--save_param="params" \
--training_file="${INPUT_PATH}/label_data/${2}/train.ids" \
--epoch=50 \
--print_step=1 \
--save_step=400 \
--batch_size=256 \
--hidden_size=256 \
--emb_size=256 \
--vocab_size=484016 \
--learning_rate=0.001 \
--sample_pro 0.1
}
#predict
function pretrain_predict()
{
${PYTHON_PATH} -u main.py \
--do_predict=true \
--use_cuda=${1} \
--predict_file="${INPUT_PATH}/unlabel_data/test.ids" \
--init_from_params="${SAVED_MODELS}/trained_models/matching_pretrained/params" \
--loss_type="CLS" \
--output_prediction_file="${OUTPUT_PATH}/pretrain_matching_predict" \
--max_seq_len=50 \
--batch_size=256 \
--hidden_size=256 \
--emb_size=256 \
--vocab_size=484016
}
function finetuning_predict()
{
${PYTHON_PATH} -u main.py \
--do_predict=true \
--use_cuda=${1} \
--predict_file="${INPUT_PATH}/label_data/${2}/test.ids" \
--init_from_params=${SAVED_MODELS}/trained_models/${2}_finetuned/params \
--loss_type="L2" \
--output_prediction_file="${OUTPUT_PATH}/finetuning_${2}_predict" \
--max_seq_len=50 \
--batch_size=256 \
--hidden_size=256 \
--emb_size=256 \
--vocab_size=484016
}
#evaluate
function pretrain_eval()
{
${PYTHON_PATH} -u main.py \
--do_eval=true \
--use_cuda=${1} \
--evaluation_file="${INPUT_PATH}/unlabel_data/test.ids" \
--output_prediction_file="${OUTPUT_PATH}/pretrain_matching_predict" \
--loss_type="CLS"
}
function finetuning_eval()
{
${PYTHON_PATH} -u main.py \
--do_eval=true \
--use_cuda=${1} \
--evaluation_file="${INPUT_PATH}/label_data/${2}/test.ids" \
--output_prediction_file="${OUTPUT_PATH}/finetuning_${2}_predict" \
--loss_type="L2"
}
#inference model
function pretrain_infer()
{
${PYTHON_PATH} -u main.py \
--do_save_inference_model=true \
--use_cuda=${1} \
--init_from_params="${SAVED_MODELS}/trained_models/matching_pretrained/params" \
--inference_model_dir="${INFERENCE_MODEL}/matching_inference_model"
}
function finetuning_infer()
{
${PYTHON_PATH} -u main.py \
--do_save_inference_model=true \
--use_cuda=${1} \
--init_from_params="${SAVED_MODELS}/trained_models/${2}_finetuned/params" \
--inference_model_dir="${INFERENCE_MODEL}/${2}_inference_model"
}
if [ "${TASK_TYPE}" = "train" ]
then
echo "train ${TRAIN_TYPE} start.........."
if [ "${TRAIN_TYPE}" = "matching" ]
then
pretrain_train ${use_cuda};
else
finetuning_train ${use_cuda} ${TRAIN_TYPE};
fi
elif [ "${TASK_TYPE}" = "predict" ]
then
echo "predict ${TRAIN_TYPE} start.........."
if [ "${TRAIN_TYPE}" = "matching" ]
then
pretrain_predict ${use_cuda};
else
finetuning_predict ${use_cuda} ${TRAIN_TYPE};
fi
elif [ "${TASK_TYPE}" = "evaluate" ]
then
echo "evaluate ${TRAIN_TYPE} start.........."
if [ "${TRAIN_TYPE}" = "matching" ]
then
pretrain_eval ${use_cuda};
else
finetuning_eval ${use_cuda} ${TRAIN_TYPE};
fi
elif [ "${TASK_TYPE}" = "inference" ]
then
echo "save ${TRAIN_TYPE} inference model start.........."
if [ "${TRAIN_TYPE}" = "matching" ]
then
pretrain_infer ${use_cuda};
else
finetuning_infer ${use_cuda} ${TRAIN_TYPE};
fi
else
exit 255
fi
export FLAGS_eager_delete_tensor_gb=0.0
#pretrain
python -u main.py \
--do_train True \
--sample_pro 0.9 \
--batch_size 64 \
--save_path model_files_tmp/matching_pretrained \
--train_path data/unlabel_data/train.ids \
--val_path data/unlabel_data/val.ids
#finetune based on one task
TASK=human
python -u main.py \
--do_train True \
--loss_type L2 \
--save_path model_files_tmp/${TASK}_finetuned \
--init_model model_files/matching_pretrained \
--train_path data/label_data/$TASK/train.ids \
--val_path data/label_data/$TASK/val.ids \
--print_step 1 \
--save_step 1 \
--num_scan_data 50
#evaluate pretrained model by Recall
python -u main.py \
--do_val True \
--test_path data/unlabel_data/test.ids \
--init_model model_files/matching_pretrained \
--loss_type CLS
#evaluate pretrained model by Cor
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--test_path data/label_data/$task/test.ids \
--init_model model_files/matching_pretrained \
--loss_type L2
done
#evaluate finetuned model by Cor
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--test_path data/label_data/$task/test.ids \
--init_model model_files/${task}_finetuned \
--loss_type L2
done
#infer
TASK=human
python -u main.py \
--do_infer True \
--test_path data/label_data/$TASK/test.ids \
--init_model model_files/${TASK}_finetuned
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""train auto dialogue evaluation task"""
import os
import sys
import six
import time
import numpy as np
import multiprocessing
import paddle
import paddle.fluid as fluid
import ade.reader as reader
from ade_net import create_net, set_word_embedding
from ade.utils.configure import PDConfig
from ade.utils.input_field import InputField
from ade.utils.model_check import check_cuda
import ade.utils.save_load_io as save_load_io
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
def do_train(args):
"""train function"""
train_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(train_prog, startup_prog):
train_prog.random_seed = args.random_seed
startup_prog.random_seed = args.random_seed
with fluid.unique_name.guard():
context_wordseq = fluid.layers.data(
name='context_wordseq', shape=[1], dtype='int64', lod_level=1)
response_wordseq = fluid.layers.data(
name='response_wordseq', shape=[1], dtype='int64', lod_level=1)
labels = fluid.layers.data(
name='labels', shape=[1], dtype='int64')
input_inst = [context_wordseq, response_wordseq, labels]
input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(feed_list=input_inst,
capacity=4, iterable=False)
loss = create_net(
is_training=True,
model_input=input_field,
args=args
)
loss.persistable = True
# gradient clipping
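# (clip each gradient element into [-1.0, 1.0] before the Adam update)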
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
max=1.0, min=-1.0))
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
optimizer.minimize(loss)
if args.use_cuda:
dev_count = fluid.core.get_cuda_device_count()
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
dev_count = int(
os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
place = fluid.CPUPlace()
processor = reader.DataProcessor(
data_path=args.training_file,
max_seq_length=args.max_seq_len,
batch_size=args.batch_size)
batch_generator = processor.data_generator(
place=place,
phase="train",
shuffle=True,
sample_pro=args.sample_pro)
num_train_examples = processor.get_num_examples(phase='train')
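# each pass over the data takes num_train_examples / (dev_count * batch_size) steps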
max_train_steps = args.epoch * num_train_examples // dev_count // args.batch_size
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
data_reader.decorate_batch_generator(batch_generator)
exe = fluid.Executor(place)
exe.run(startup_prog)
assert (args.init_from_checkpoint == "") or (
args.init_from_pretrain_model == "")
#init from some checkpoint, to resume the previous training
if args.init_from_checkpoint:
save_load_io.init_from_checkpoint(args, exe, train_prog)
#init from some pretrain models, to better solve the current task
if args.init_from_pretrain_model:
save_load_io.init_from_pretrain_model(args, exe, train_prog)
if args.word_emb_init:
print("start loading word embedding init ...")
if six.PY2:
word_emb = np.array(pickle.load(open(args.word_emb_init, 'rb'))).astype('float32')
else:
word_emb = np.array(pickle.load(open(args.word_emb_init, 'rb'), encoding="bytes")).astype('float32')
set_word_embedding(word_emb, place)
print("finish init word embedding ...")
build_strategy = fluid.compiler.BuildStrategy()
build_strategy.enable_inplace = True
compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
steps = 0
begin_time = time.time()
for epoch_step in range(args.epoch):
data_reader.start()
sum_loss = 0.0
ce_loss = 0.0
while True:
try:
steps += 1
fetch_list = [loss.name]
outputs = exe.run(compiled_train_prog, fetch_list=fetch_list)
np_loss = outputs
sum_loss += np.array(np_loss).mean()
ce_loss = np.array(np_loss).mean()
if steps % args.print_steps == 0:
print('epoch: %d, step: %s, avg loss %s' % (epoch_step, steps, sum_loss / args.print_steps))
sum_loss = 0.0
if steps % args.save_steps == 0:
if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog, "step_" + str(steps))
if args.save_param:
save_load_io.save_param(args, exe, train_prog, "step_" + str(steps))
except fluid.core.EOFException:
data_reader.reset()
break
if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog, "step_final")
if args.save_param:
save_load_io.save_param(args, exe, train_prog, "step_final")
if args.enable_ce:
card_num = get_cards()
pass_time_cost = time.time() - begin_time
print("test_card_num", card_num)
print("kpis\ttrain_duration_card%s\t%s" % (card_num, pass_time_cost))
print("kpis\ttrain_loss_card%s\t%f" % (card_num, ce_loss))
def get_cards():
num = 0
cards = os.environ.get('CUDA_VISIBLE_DEVICES', '')
if cards != '':
num = len(cards.split(","))
return num
if __name__ == '__main__':
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
args.Print()
check_cuda(args.use_cuda)
do_train(args)
# 对话通用理解模块DGU
* [1、模型简介](#1、模型简介)
* [2、快速开始](#2、快速开始)
* [3、进阶使用](#3、进阶使用)
* [4、参考论文](#4、参考论文)
* [5、版本更新](#5、版本更新)

## 1、模型简介
&ensp;&ensp;&ensp;&ensp;对话相关的任务中,Dialogue System常常需要根据场景的变化去解决多种多样的任务。任务的多样性(意图识别、槽位解析、DA识别、DST等等),以及领域训练数据的稀少,给Dialogue System的研究和应用带来了巨大的困难和挑战,要使得dialogue system得到更好的发展,需要开发一个通用的对话理解模型。为此,我们给出了基于BERT的对话通用理解模块(DGU: DialogueGeneralUnderstanding),通过实验表明,使用base-model(BERT)并结合常见的学习范式,就可以在几乎全部对话理解任务上取得比肩甚至超越各个领域业内最好的模型的效果,展现了学习一个通用对话理解模型的巨大潜力。
&ensp;&ensp;&ensp;&ensp;a、效果上,我们基于对话相关的业内公开数据集进行评测,效果如下表所示:
| task_name | udc | udc | udc | atis_slot | dstc2 | atis_intent | swda | mrda |
| :------ | :------ | :------ | :------ | :------| :------ | :------ | :------ | :------ |
| 对话任务 | 匹配 | 匹配 | 匹配 | 槽位解析 | DST | 意图识别 | DA | DA |
| 任务类型 | 分类 | 分类 | 分类 | 序列标注 | 多标签分类 | 分类 | 分类 | 分类 |
| 评估指标 | R1@10 | R2@10 | R5@10 | F1 | JOINT ACC | ACC | ACC | ACC |
| SOTA | 76.70% | 87.40% | 96.90% | 96.89% | 74.50% | 98.32% | 81.30% | 91.70% |
| DGU | 82.02% | 90.43% | 97.75% | 97.10% | 89.57% | 97.65% | 80.19% | 91.43% |
&ensp;&ensp;&ensp;&ensp;b、数据集说明:
```
UDC: Ubuntu Corpus V1;
ATIS: 微软提供的公开数据集,Airline Travel Information System;
DSTC2: 对话状态跟踪挑战(Dialog State Tracking Challenge)2;
MRDA: Meeting Recorder Dialogue Act;
SWDA:Switchboard Dialogue Act Corpus;
```
## 2、快速开始
### 1、安装说明
#### &ensp;&ensp;a、环境依赖
- Python >= 2.7
- cuda >= 9.0
- cudnn >= 7.0
- PaddlePaddle >= 1.3.1,请参考[安装指南](http://www.paddlepaddle.org/#quick-start)进行安装, 由于模块内模型基于bert做finetuning, 训练速度较慢, 建议用户安装GPU版本PaddlePaddle进行训练。

&ensp;&ensp; 注意:使用Windows GPU环境的用户,需要将示例代码中的[fluid.ParallelExecutor](http://paddlepaddle.org/documentation/docs/zh/1.4/api_cn/fluid_cn.html#parallelexecutor)替换为[fluid.Executor](http://paddlepaddle.org/documentation/docs/zh/1.4/api_cn/fluid_cn.html#executor)
#### &ensp;&ensp;b、安装代码
&ensp;&ensp;&ensp;&ensp;克隆代码库到本地
```
git clone https://github.com/PaddlePaddle/models.git
cd models/PaddleNLP/dialogue_model_toolkit/dialogue_general_understanding
```
### 任务简介
&ensp;&ensp;&ensp;&ensp; 本模块内共包含6个任务,内容如下:
```
udc: 使用Ubuntu Corpus V1公开数据集,实现对话匹配任务;
atis_slot: 使用微软提供的公开数据集(Airline Travel Information System),实现槽位识别任务;
dstc2: 使用对话状态跟踪挑战(Dialog State Tracking Challenge)2公开数据集,实现对话状态追踪(DST)任务;
atis_intent: 使用微软提供的公开数据集(Airline Travel Information System),实现意图识别任务;
mrda: 使用公开数据集Meeting Recorder Dialogue Act,实现DA识别任务;
swda:使用公开数据集Switchboard Dialogue Act Corpus,实现DA识别任务;
```
### 数据准备
&ensp;&ensp;&ensp;&ensp;数据集说明:
```
UDC: Ubuntu Corpus V1;
ATIS: 微软提供的公开数据集(Airline Travel Information System),模块内包含意图识别和槽位解析两个任务的数据;
DSTC2: 对话状态跟踪挑战(Dialog State Tracking Challenge)2;
MRDA: Meeting Recorder Dialogue Act;
SWDA:Switchboard Dialogue Act Corpus;
```
&ensp;&ensp;&ensp;&ensp; 数据集、相关模型下载:
```
cd dgu && sh prepare_data_and_model.sh
```
&ensp;&ensp;&ensp;&ensp; 下载的数据集中已提供了训练集,测试集和验证集,用户如果需要重新生成某任务数据集的训练数据,可执行:
```
cd dgu/scripts && sh run_build_data.sh task_name
参数说明:
task_name: udc, swda, mrda, atis, dstc2, 选择5个数据集选项中用户需要生成的数据名
```
### 单机训练
#### &ensp;&ensp;&ensp;&ensp; 方式一: 推荐直接使用模块内脚本训练
```
sh run.sh task_name task_type
参数说明:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2,选择6个任务中任意一项;
task_type: train,predict, evaluate, inference, all, 选择5个参数选项中任意一项(train: 只执行训练,predict: 只执行预测,evaluate:只执行评估过程,依赖预测的结果,inference: 保存inference model,all: 顺序执行训练、预测、评估、保存inference model的过程);
训练示例: sh run.sh atis_intent train
```
&ensp;&ensp;&ensp;&ensp; 方式一如果为CPU训练:
```
请将run.sh内参数设置为:
1、export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp; 方式一如果为GPU训练:
```
请将run.sh内参数设置为:
1、如果为单卡训练(用户指定空闲的单卡):
export CUDA_VISIBLE_DEVICES=0
2、如果为多卡训练(用户指定空闲的多张卡):
export CUDA_VISIBLE_DEVICES=0,1,2,3
```
#### &ensp;&ensp;&ensp;&ensp; 方式二: 执行训练相关的代码:
```
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1 #开启显存优化

export CUDA_VISIBLE_DEVICES=0 #GPU单卡训练
#export CUDA_VISIBLE_DEVICES=0,1,2,3 #GPU多卡训练
#export CUDA_VISIBLE_DEVICES= #CPU训练

if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
    use_cuda=false
else
    use_cuda=true
fi

TASK_NAME="atis_intent" #指定训练的任务名称
BERT_BASE_PATH="data/pretrain_model/uncased_L-12_H-768_A-12"

python -u main.py \
  --task_name=${TASK_NAME} \
  --use_cuda=${use_cuda} \
  --do_train=true \
  --in_tokens=true \
  --epoch=20 \
  --batch_size=4096 \
  --do_lower_case=true \
  --data_dir="./data/input/data/atis/${TASK_NAME}" \
  --bert_config_path="${BERT_BASE_PATH}/bert_config.json" \
  --vocab_path="${BERT_BASE_PATH}/vocab.txt" \
  --init_from_pretrain_model="${BERT_BASE_PATH}/params" \
  --save_model_path="./data/saved_models/${TASK_NAME}" \
  --save_param="params" \
  --save_steps=100 \
  --learning_rate=2e-5 \
  --weight_decay=0.01 \
  --max_seq_len=128 \
  --print_steps=10 \
  --use_fp16 false
```
注:采用方式二时,模型训练过程可参考run.sh内相关任务的参数设置
### 模型预测
#### &ensp;&ensp;&ensp;&ensp; 方式一: 推荐直接使用模块内脚本预测
```
sh run.sh task_name task_type
参数说明:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2,选择6个任务中任意一项;
task_type: train,predict, evaluate, inference, all, 选择5个参数选项中任意一项(train: 只执行训练,predict: 只执行预测,evaluate:只执行评估过程,依赖预测的结果,inference: 保存inference model,all: 顺序执行训练、预测、评估、保存inference model的过程);
预测示例: sh run.sh atis_intent predict
```
&ensp;&ensp;&ensp;&ensp; 方式一如果为CPU预测:
```
请将run.sh内参数设置为:
1、export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp; 方式一如果为GPU预测:
```
请将run.sh内参数设置为:
支持单卡预测(用户指定空闲的单卡):
export CUDA_VISIBLE_DEVICES=0
```
注:预测时,如采用方式一,用户可通过修改run.sh中init_from_params参数来指定自己训练好的需要预测的模型,目前代码中默认为加载官方已经训练好的模型;
#### &ensp;&ensp;&ensp;&ensp; 方式二: 执行预测相关的代码:
```
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1 #开启显存优化

export CUDA_VISIBLE_DEVICES=0 #单卡预测
#export CUDA_VISIBLE_DEVICES= #CPU预测

if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
    use_cuda=false
else
    use_cuda=true
fi

TASK_NAME="atis_intent" #指定预测的任务名称
BERT_BASE_PATH="./data/pretrain_model/uncased_L-12_H-768_A-12"

python -u main.py \
  --task_name=${TASK_NAME} \
  --use_cuda=${use_cuda} \
  --do_predict=true \
  --in_tokens=true \
  --batch_size=4096 \
  --do_lower_case=true \
  --data_dir="./data/input/data/atis/${TASK_NAME}" \
  --init_from_params="./data/saved_models/trained_models/${TASK_NAME}/params" \
  --bert_config_path="${BERT_BASE_PATH}/bert_config.json" \
  --vocab_path="${BERT_BASE_PATH}/vocab.txt" \
  --output_prediction_file="./data/output/pred_${TASK_NAME}" \
  --max_seq_len=128
```
注:采用方式二时,模型预测过程可参考run.sh内具体任务的参数设置
### 模型评估
&ensp;&ensp;&ensp;&ensp; 模块中6个任务,各任务支持计算的评估指标内容如下:
```
udc: 使用R1@10、R2@10、R5@10三个指标评估匹配任务的效果;
atis_slot: 使用F1指标来评估序列标注任务;
dstc2: 使用joint acc 指标来评估DST任务的多标签分类结果;
atis_intent: 使用acc指标来评估分类结果;
mrda: 使用acc指标来评估DA任务分类结果;
swda:使用acc指标来评估DA任务分类结果;
```
&ensp;&ensp;&ensp;&ensp; 效果上,6个任务公开数据集评测效果如下表所示:

| task_name | udc | udc | udc | atis_slot | dstc2 | atis_intent | swda | mrda |
| :------ | :------ | :------ | :------ | :------| :------ | :------ | :------ | :------ |
| 对话任务 | 匹配 | 匹配 | 匹配 | 槽位解析 | DST | 意图识别 | DA | DA |
| 任务类型 | 分类 | 分类 | 分类 | 序列标注 | 多标签分类 | 分类 | 分类 | 分类 |
| 评估指标 | R1@10 | R2@10 | R5@10 | F1 | JOINT ACC | ACC | ACC | ACC |
| SOTA | 76.70% | 87.40% | 96.90% | 96.89% | 74.50% | 98.32% | 81.30% | 91.70% |
| DGU | 82.03% | 90.59% | 97.73% | 97.14% | 91.23% | 97.76% | 80.37% | 91.53% |

#### &ensp;&ensp;&ensp;&ensp; 方式一: 推荐直接使用模块内脚本评估
```
sh run.sh task_name task_type
参数说明:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2,选择6个任务中任意一项;
task_type: train,predict, evaluate, inference, all, 选择5个参数选项中任意一项(train: 只执行训练,predict: 只执行预测,evaluate:只执行评估过程,依赖预测的结果,inference: 保存inference model,all: 顺序执行训练、预测、评估、保存inference model的过程);
评估示例: sh run.sh atis_intent evaluate
```
注:评估计算ground_truth和predict_label之间的打分,默认CPU计算即可;
#### &ensp;&ensp;&ensp;&ensp; 方式二: 执行评估相关的代码:
```
TASK_NAME="atis_intent" #指定预测的任务名称

python -u main.py \
  --task_name=${TASK_NAME} \
  --use_cuda=false \
  --do_eval=true \
  --evaluation_file="./data/input/data/atis/${TASK_NAME}/test.txt" \
  --output_prediction_file="./data/output/pred_${TASK_NAME}"
```
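&ensp;&ensp;&ensp;&ensp;以匹配任务的Rk@n指标为例,其含义可以用下面的示意代码理解(仅为便于理解的示意,并非模块内的实际实现;假设测试集中每10条样本为一组,组内含1条正例):
```
def recall_at_k(scores, labels, k, group_size=10):
    """统计正例得分排进组内前k名的组所占的比例(示意实现)"""
    hits, groups = 0, 0
    for i in range(0, len(scores), group_size):
        group = sorted(zip(scores[i:i + group_size], labels[i:i + group_size]),
                       key=lambda x: x[0], reverse=True)
        hits += int(any(label == 1 for _, label in group[:k]))
        groups += 1
    return hits / float(groups)
```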
### 模型推断
#### &ensp;&ensp;&ensp;&ensp; 方式一: 推荐直接使用模块内脚本保存inference model
```
sh run.sh task_name task_type
参数说明:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2,选择6个任务中任意一项;
task_type: train,predict, evaluate, inference, all, 选择5个参数选项中任意一项(train: 只执行训练,predict: 只执行预测,evaluate:只执行评估过程,依赖预测的结果,inference: 保存inference model,all: 顺序执行训练、预测、评估、保存inference model的过程);
保存模型示例: sh run.sh atis_intent inference
```
&ensp;&ensp;&ensp;&ensp; 方式一如果为CPU执行inference model过程:
```
请将run.sh内参数设置为:
1、export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp; 方式一如果为GPU执行inference model过程:
```
请将run.sh内参数设置为:
1、单卡模型推断(用户指定空闲的单卡):
export CUDA_VISIBLE_DEVICES=0
```
#### &ensp;&ensp;&ensp;&ensp; 方式二: 执行inference model相关的代码:
```
TASK_NAME="atis_intent" #指定预测的任务名称
BERT_BASE_PATH="./data/pretrain_model/uncased_L-12_H-768_A-12"

export CUDA_VISIBLE_DEVICES=0 #单卡推断inference model
#export CUDA_VISIBLE_DEVICES= #CPU预测

if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
    use_cuda=false
else
    use_cuda=true
fi

python -u main.py \
  --task_name=${TASK_NAME} \
  --use_cuda=${use_cuda} \
  --do_save_inference_model=true \
  --init_from_params="./data/saved_models/trained_models/${TASK_NAME}/params" \
  --bert_config_path="${BERT_BASE_PATH}/bert_config.json" \
  --inference_model_dir="data/inference_models/${TASK_NAME}"
```
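&ensp;&ensp;&ensp;&ensp;保存好的inference model可直接通过fluid加载后用于部署预测,示意如下(模型目录为假设,feed数据需按feed_names自行组织):
```
import paddle.fluid as fluid

place = fluid.CPUPlace()
exe = fluid.Executor(place)
# 加载上一步保存的inference model(目录为假设)
program, feed_names, fetch_targets = fluid.io.load_inference_model(
    "data/inference_models/atis_intent", exe)
# results = exe.run(program, feed=feed_dict, fetch_list=fetch_targets)
```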
## 3、进阶使用
### 背景介绍
&ensp;&ensp;&ensp;&ensp;dialogue_general_understanding模块,针对数据集开发了相关的模型训练过程,支持分类,多标签分类,序列标注等任务,用户可针对自己的数据集,进行相关的模型定制;并取得了比肩业内最好模型的效果:
### 模型概览
<p align="center">
<img src="./images/dgu.png" width="500">
</p>
&ensp;&ensp;&ensp;&ensp;本项目针对对话理解相关的问题,底层基于BERT,上层定义范式(分类,多标签分类,序列标注), 开源了一系列公开数据集相关的模型,供用户可配置地使用:
### 预训练模型
&ensp;&ensp;&ensp;&ensp; 支持PaddlePaddle官方提供的BERT及ERNIE相关模型作为预训练模型

| Model | Layers | Hidden size | Heads |Parameters |
| :------| :------: | :------: |:------: |:------: |
| [BERT-Base, Uncased](https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz) | 12 | 768 |12 |110M |
| [BERT-Large, Uncased](https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz) | 24 | 1024 |16 |340M |
|[BERT-Base, Cased](https://bert-models.bj.bcebos.com/cased_L-12_H-768_A-12.tar.gz)|12|768|12|110M|
|[BERT-Large, Cased](https://bert-models.bj.bcebos.com/cased_L-24_H-1024_A-16.tar.gz)|24|1024|16|340M|
|[ERNIE, english](https://ernie.bj.bcebos.com/ERNIE_en_1.0.tgz)|24|1024|16|3.8G|

### 服务部署
&ensp;&ensp;&ensp;&ensp; 模块内提供已训练好6个对话任务的inference_model模型,用户可根据自身业务情况进行下载使用。
#### 服务器部署
&ensp;&ensp;&ensp;&ensp; 请参考PaddlePaddle官方提供的[服务器端部署](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/advanced_usage/deploy/inference/index_cn.html)文档进行部署上线。
### 数据格式说明
&ensp;&ensp;&ensp;&ensp;训练、预测、评估使用的数据可以由用户根据实际的对话应用场景,自己组织数据。输入网络的数据格式统一,示例如下:
......@@ -207,72 +320,49 @@ task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2
&ensp;&ensp;&ensp;&ensp;输入数据以[CLS]开始,[SEP]分割内容为对话内容相关三部分,如上文,当前句,下文等,如[SEP]分割的每部分内部由多轮组成的话,使用[INNER_SEP]进行分割;第二部分和第三部分皆可缺省;
&ensp;&ensp;&ensp;&ensp;目前dialogue_general_understanding模块内已将数据准备部分集成到代码内,用户可根据上面输入数据格式,组装自己的数据;
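&ensp;&ensp;&ensp;&ensp;一个单条样本的示意如下(仅用于说明格式,token内容为虚构,实际字段以生成的数据为准):
```
[CLS] 你 好 [INNER_SEP] 想 订 一 张 机 票 [SEP] 好 的 请 问 出 发 城 市 是 哪 里 [SEP]
```
&ensp;&ensp;&ensp;&ensp;其中第一部分(上文)内部的两轮对话用[INNER_SEP]分割,第二、三部分可按任务缺省。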
### 代码结构说明
```
.
├── run_train.sh # 训练执行脚本
├── run_predict.sh # 预测执行脚本
├── run_eval_metrics.sh # 评估执行脚本
├── download_data.sh # 下载数据脚本
├── download_models.sh # 下载对话模型脚本
├── download_pretrain_model.sh # 下载bert pretrain模型脚本
├── train.py # train流程
├── predict.py # predict流程
├── eval_metrics.py # 指标评估
├── define_predict_pack.py # 封装预测结果
├── finetune_args.py # 模型训练相关的配置参数
├── batching.py # 封装yield batch数据
├── optimization.py # 模型优化器
├── tokenization.py # tokenizer工具
├── reader/data_reader.py: # 数据的处理和组装过程,每个数据集都定义一个类进行处理
├── README.md # 文档
├── utils/* # 定义了其他常用的功能函数
└── scripts # 数据处理脚本集合
├── run_build_data.sh # 数据处理运行脚本
├── build_atis_dataset.py # 构建atis_intent和atis_slot训练数据
├── build_dstc2_dataset.py # 构建dstc2训练数据
├── build_mrda_dataset.py # 构建mrda训练数据
├── build_swda_dataset.py # 构建swda训练数据
├── commonlib.py # 数据处理通用方法
└── conf # 公开数据集中训练集、验证集、测试集划分
../../models/dialogue_model_toolkit/dialogue_general_understanding
├── bert.py # 底层bert模型
├── define_paradigm.py # 上层网络范式
└── create_model.py # 创建底层bert模型+上层网络范式网络结构
```
### 如何组建自己的模型
&ensp;&ensp;&ensp;&ensp;用户可以根据自己的需求,组建自定义的模型,具体方法如下所示:
&ensp;&ensp;&ensp;&ensp;a、自定义数据
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;如用户目前有数据集为**task_name**, 则在**data/input/data**下定义**task_name**文件夹,将数据集存放进去;在**dgu/reader.py**中,新增自定义的数据处理的类,如**udc**数据集对应**UDCProcessor**; 在**train.py**内设置**task_name**与**processor**的对应关系(如**processors = {'udc': reader.UDCProcessor}**),具体写法可参考下面的示意代码。
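&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;下面是一个最小示意(仅供参考:**MyTaskProcessor**、**mytask**等名称均为假设,基类方法与辅助函数以**dgu/reader.py**内实际定义为准):
```
# dgu/reader.py 中新增自定义数据处理类(示意代码,名称均为假设)
class MyTaskProcessor(DataProcessor):
    """处理 data/input/data/mytask 下的数据集"""

    def get_train_examples(self, data_dir):
        # 读取训练文件并构造样本列表(_read_tsv/_create_examples 为假设的辅助方法)
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.txt")), "train")

    @staticmethod
    def get_labels():
        # 返回该任务的标签集合
        return ["0", "1"]

# train.py 中注册 task_name 与 processor 的对应关系(示意)
processors = {
    'udc': reader.UDCProcessor,
    'mytask': reader.MyTaskProcessor,
}
```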
&ensp;&ensp;&ensp;&ensp;b、自定义上层网络范式
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;如果用户自定义模型属于分类、多分类和序列标注这3种类型其中一个,则只需要在**dgu/define_paradigm.py** 内指明**task_name**和相应上层范式函数的对应关系即可,如用户自定义模型属于其他模型,则需要自定义上层范式函数并指明其与**task_name**之间的关系;
&ensp;&ensp;&ensp;&ensp;c、自定义预测封装接口
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;用户可在**dgu/define_predict_pack.py**内定义task_name和自定义封装预测接口的对应关系;
## 4、参考论文
1、Harshit Kumar, Arvind Agarwal, Riddhiman Dasgupta, Sachindra Joshi, and Arun Kumar. 2017. Dialogue act sequence labeling using hierarchical encoder with CRF. arXiv preprint arXiv:1709.04250.
2、Changliang Li, Liang Li, and Ji Qi. 2018. A self-attentive model with gate mechanism for spoken language understanding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3824–3833.
3、Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.
4、Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
5、Hiroki Ouchi and Yuta Tsuboi. 2016. Addressee and response selection for multi-party conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2133–2143.
6、Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. Technical report, International Computer Science Institute, Berkeley, CA.
7、Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.
8、Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31.
9、Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
10、Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In Interspeech, pages 2524–2528.
11、Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1118–1127.
12、Su Zhu and Kai Yu. 2017. Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5675–5679. IEEE.
13、Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404–413.
### 如何训练
&ensp;&ensp;&ensp;&ensp;i、按照上文所述的数据组织形式,组织自己的训练、评估、预测数据
&ensp;&ensp;&ensp;&ensp;ii、运行训练脚本
```
sh run_train.sh task_name
parameters:
task_name: 用户自定义名称
```
## 5、版本更新
第一版:PaddlePaddle 1.4.0版本
主要功能:支持对话6个数据集上任务的训练、预测和评估
第二版:PaddlePaddle 1.6.0版本
更新功能:在第一版的基础上,根据PaddlePaddle的模型规范化标准,对模块内训练、预测、评估等代码进行了重构,提高易用性;
## 作者
zhangxiyuan01@baidu.com
zhouxiangyang@baidu.com
## 如何贡献代码
&ensp;&ensp;&ensp;&ensp;如果你可以修复某个issue或者增加一个新功能,欢迎给我们提交PR。如果对应的PR被接受了,我们将根据贡献的质量和难度进行打分(0-5分,越高越好)。如果你累计获得了10分,可以联系我们获得面试机会或者为你写推荐信。
# this file is only used for continuous evaluation test!
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""this file is only used for continuous evaluation test!"""
import os
import sys
......
task_name: ""
data_dir: ""
bert_config_path: ""
init_from_checkpoint: ""
init_from_params: ""
init_from_pretrain_model: ""
inference_model_dir: ""
save_model_path: ""
save_checkpoint: ""
save_param: ""
lr_scheduler: "linear_warmup_decay"
weight_decay: 0.01
warmup_proportion: 0.1
save_steps: 1000
use_fp16: False
loss_scaling: 1.0
print_steps: 20
evaluation_file: ""
output_prediction_file: ""
vocab_path: ""
max_seq_len: 128
batch_size: 2
verbose: False
do_lower_case: False
random_seed: 0
use_cuda: True
in_tokens: False
do_save_inference_model: False
enable_ce: "store_true"
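# Illustrative usage of this config file (a sketch; the import path below is
# assumed, mirroring ade.utils.configure used elsewhere in this repo):
#   from dgu.utils.configure import PDConfig
#   args = PDConfig(yaml_file="./data/config/dgu.yaml")
#   args.build()   # merge YAML defaults with command-line overrides
#   args.Print()   # dump the effective configuration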
Pretrain model directory: this module uses BERT as the pretrained model.
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -74,7 +74,8 @@ def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
return batch_tokens, mask_label, mask_pos
def prepare_batch_data(insts,
def prepare_batch_data(task_name,
insts,
max_len,
total_token_num,
voc_size=0,
......@@ -90,7 +91,6 @@ def prepare_batch_data(insts,
2. generate Tensor of position
3. generate self attention mask, [shape: batch_size * max_len * max_len]
"""
batch_src_ids = [inst[0] for inst in insts]
batch_sent_ids = [inst[1] for inst in insts]
batch_pos_ids = [inst[2] for inst in insts]
......@@ -99,10 +99,10 @@ def prepare_batch_data(insts,
# or unique id
if isinstance(insts[0][3], list):
if max_len != -1:
if task_name == "atis_slot":
labels_list = [inst[3] + [0] * (max_len - len(inst[3])) for inst in insts]
labels_list = [np.array(labels_list).astype("int64").reshape([-1, max_len])]
else:
elif task_name == "dstc2":
labels_list = [inst[3] for inst in insts]
labels_list = [np.array(labels_list).astype("int64")]
else:
......
......@@ -24,9 +24,7 @@ import json
import numpy as np
import paddle.fluid as fluid
_WORK_DIR = os.path.split(os.path.realpath(__file__))[0]
sys.path.append(os.path.join(_WORK_DIR, "../../"))
from transformer_encoder import encoder, pre_process_layer
from dgu.transformer_encoder import encoder, pre_process_layer
class BertConfig(object):
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -49,15 +49,9 @@ class Paradigm(object):
bias_attr=fluid.ParamAttr(
name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
if params['is_prediction']:
if not params['is_training']:
probs = fluid.layers.softmax(logits)
feed_targets_name = [
params['src_ids'].name,
params['pos_ids'].name,
params['sent_ids'].name,
params['input_mask'].name,
]
results = {"probs": probs, "feed_targets_name": feed_targets_name}
results = {"probs": probs}
return results
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
......@@ -67,11 +61,6 @@ class Paradigm(object):
accuracy = fluid.layers.accuracy(
input=probs, label=params['labels'], total=num_seqs)
loss.persistable = True
probs.persistable = True
accuracy.persistable = True
num_seqs.persistable = True
results = {
"loss": loss,
"probs": probs,
......@@ -105,22 +94,13 @@ class Paradigm(object):
loss = fluid.layers.mean(x=ce_loss)
probs = fluid.layers.sigmoid(logits)
if params['is_prediction']:
feed_targets_name = [
params['src_ids'].name,
params['pos_ids'].name,
params['sent_ids'].name,
params['input_mask'].name,
]
results = {"probs": probs, "feed_targets_name": feed_targets_name}
if not params['is_training']:
results = {"probs": probs}
return results
num_seqs = fluid.layers.tensor.fill_constant(
shape=[1], dtype='int64', value=1)
loss.persistable = True
probs.persistable = True
num_seqs.persistable = True
results = {"loss": loss, "probs": probs, "num_seqs": num_seqs}
return results
......@@ -138,14 +118,8 @@ class Paradigm(object):
fluid.layers.argmax(
logits, axis=1), dtype='int32')
if params['is_prediction']:
feed_targets_name = [
params['src_ids'].name,
params['pos_ids'].name,
params['sent_ids'].name,
params['input_mask'].name,
]
results = {"probs": probs, "feed_targets_name": feed_targets_name}
if not params['is_training']:
results = {"probs": probs}
return results
num_seqs = fluid.layers.tensor.fill_constant(
......@@ -160,10 +134,6 @@ class Paradigm(object):
label=fluid.layers.reshape(params['labels'], [-1, 1]))
loss = fluid.layers.mean(x=ce_loss)
loss.persistable = True
probs.persistable = True
accuracy.persistable = True
num_seqs.persistable = True
results = {
"loss": loss,
"probs": probs,
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -11,9 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
evaluate task metrics
"""
"""evaluate task metrics"""
import sys
......@@ -22,15 +20,12 @@ class EvalDA(object):
"""
evaluate da testset, swda|mrda
"""
def __init__(self, task_name, pred):
def __init__(self, task_name, pred, refer):
"""
predict file
"""
self.pred_file = pred
if task_name == 'swda':
self.refer_file = "./data/swda/test.txt"
elif task_name == "mrda":
self.refer_file = "./data/mrda/test.txt"
self.refer_file = refer
def load_data(self):
"""
......@@ -70,12 +65,12 @@ class EvalATISIntent(object):
"""
evaluate da testset, swda|mrda
"""
def __init__(self, pred):
def __init__(self, pred, refer):
"""
predict file
"""
self.pred_file = pred
self.refer_file = "./data/atis/atis_intent/test.txt"
self.refer_file = refer
def load_data(self):
"""
......@@ -115,12 +110,12 @@ class EvalATISSlot(object):
"""
evaluate atis slot
"""
def __init__(self, pred):
def __init__(self, pred, refer):
"""
pred file
"""
self.pred_file = pred
self.refer_file = "./data/atis/atis_slot/test.txt"
self.refer_file = refer
def load_data(self):
"""
......@@ -200,12 +195,12 @@ class EvalUDC(object):
"""
evaluate udc
"""
def __init__(self, pred):
def __init__(self, pred, refer):
"""
predict file
"""
self.pred_file = pred
self.refer_file = "./data/udc/test.txt"
self.refer_file = refer
def load_data(self):
"""
......@@ -272,13 +267,13 @@ class EvalDSTC2(object):
"""
evaluate dst testset, dstc2
"""
def __init__(self, task_name, pred):
def __init__(self, task_name, pred, refer):
"""
predict file
"""
self.task_name = task_name
self.pred_file = pred
self.refer_file = "./data/dstc2/%s/test.txt" % self.task_name
self.refer_file = refer
def load_data(self):
"""
......@@ -320,15 +315,10 @@ class EvalDSTC2(object):
return metrics_out
if __name__ == "__main__":
if len(sys.argv[1:]) < 2:
print("please input task_name predict_file")
task_name = sys.argv[1]
pred_file = sys.argv[2]
def evaluate(task_name, pred_file, refer_file):
"""evaluate task metrics"""
if task_name.lower() == 'udc':
eval_inst = EvalUDC(pred_file)
eval_inst = EvalUDC(pred_file, refer_file)
eval_metrics = eval_inst.evaluate()
print("MATCHING TASK: %s metrics in testset: " % task_name)
print("R1@2: %s" % eval_metrics[0])
......@@ -337,29 +327,29 @@ if __name__ == "__main__":
print("R5@10: %s" % eval_metrics[3])
elif task_name.lower() in ['swda', 'mrda']:
eval_inst = EvalDA(task_name.lower(), pred_file)
eval_inst = EvalDA(task_name.lower(), pred_file, refer_file)
eval_metrics = eval_inst.evaluate()
print("DA TASK: %s metrics in testset: " % task_name)
print("ACC: %s" % eval_metrics)
elif task_name.lower() == 'atis_intent':
eval_inst = EvalATISIntent(pred_file)
eval_inst = EvalATISIntent(pred_file, refer_file)
eval_metrics = eval_inst.evaluate()
print("INTENTION TASK: %s metrics in testset: " % task_name)
print("ACC: %s" % eval_metrics)
elif task_name.lower() == 'atis_slot':
eval_inst = EvalATISSlot(pred_file)
eval_inst = EvalATISSlot(pred_file, refer_file)
eval_metrics = eval_inst.evaluate()
print("SLOT FILLING TASK: %s metrics in testset: " % task_name)
print(eval_metrics)
elif task_name.lower() in ['dstc2', 'dstc2_asr']:
eval_inst = EvalDSTC2(task_name.lower(), pred_file)
eval_inst = EvalDSTC2(task_name.lower(), pred_file, refer_file)
eval_metrics = eval_inst.evaluate()
print("DST TASK: %s metrics in testset: " % task_name)
print("JOINT ACC: %s" % eval_metrics[0])
elif task_name.lower() == "multi-woz":
eval_inst = EvalMultiWoz(pred_file)
eval_inst = EvalMultiWoz(pred_file, refer_file)
eval_metrics = eval_inst.evaluate()
print("DST TASK: %s metrics in testset: " % task_name)
print("JOINT ACC: %s" % eval_metrics[0])
......@@ -367,3 +357,14 @@ if __name__ == "__main__":
else:
print("task name not in [udc|swda|mrda|atis_intent|atis_slot|dstc2|dstc2_asr|multi-woz]")
if __name__ == "__main__":
if len(sys.argv[1:]) < 3:
print("please input task_name predict_file reference_file")
task_name = sys.argv[1]
pred_file = sys.argv[2]
refer_file = sys.argv[3]
evaluate(task_name, pred_file, refer_file)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -19,7 +19,7 @@ from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from utils.fp16 import create_master_params_grads, master_param_to_train_param
from dgu.utils.fp16 import create_master_params_grads, master_param_to_train_param
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
......
#!/bin/bash
#check data directory
cd ..
echo "Start download data and models.............."
if [ ! -d "data" ]; then
echo "Directory data does not exist, make new data directory"
mkdir data
fi
cd data
#check configure file
if [ ! -d "config" ]; then
echo "config directory not exist........"
exit 255
else
if [ ! -f "config/dgu.yaml" ]; then
echo "config file dgu.yaml has been lost........"
exit 255
fi
fi
#check and download input data
if [ ! -d "input" ]; then
echo "Directory input does not exist, make new input directory"
mkdir input
fi
cd input
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz
tar -xvf dmtk_data_1.0.0.tar.gz
rm dmtk_data_1.0.0.tar.gz
cd ..
#check and download pretrain model
if [ ! -d "pretrain_model" ]; then
echo "Directory pretrain_model does not exist, make new pretrain_model directory"
mkdir pretrain_model
fi
cd pretrain_model
wget --no-check-certificate https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz
tar -xvf uncased_L-12_H-768_A-12.tar.gz
rm uncased_L-12_H-768_A-12.tar.gz
cd ..
#check and download inference model
if [ ! -d "inference_models" ]; then
echo "Directory inference_models does not exist, make new inference_models directory"
mkdir inference_models
fi
#check output
if [ ! -d "output" ]; then
echo "Directory output does not exist, make new output directory"
mkdir output
fi
#check saved model
if [ ! -d "saved_models" ]; then
echo "Directory saved_models does not exist, make new saved_models directory"
mkdir saved_models
fi
cd saved_models
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dgu_models_2.0.0.tar.gz
tar -xvf dgu_models_2.0.0.tar.gz
rm dgu_models_2.0.0.tar.gz
cd ..
echo "Finish.............."
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -13,9 +13,10 @@
# limitations under the License.
"""data reader"""
import os
import csv
import types
import numpy as np
import tokenization
from batching import prepare_batch_data
......@@ -40,9 +41,7 @@ class DataProcessor(object):
np.random.seed(random_seed)
self.current_train_example = -1
self.num_examples = {'train': -1, 'dev': -1, 'test': -1}
self.current_train_epoch = -1
self.task_name = task_name
def get_train_examples(self, data_dir):
......@@ -57,7 +56,8 @@ class DataProcessor(object):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
@staticmethod
def get_labels():
"""Gets the list of labels for this data set."""
raise NotImplementedError()
......@@ -90,6 +90,7 @@ class DataProcessor(object):
return_num_token=False):
"""generate batch data"""
return prepare_batch_data(
self.task_name,
batch_data,
max_len,
total_token_num,
......@@ -119,18 +120,13 @@ class DataProcessor(object):
"Unknown phase, which should be in ['train', 'dev', 'test'].")
return self.num_examples[phase]
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_train_example, self.current_train_epoch
def data_generator(self, batch_size, phase='train', epoch=1, shuffle=False):
def data_generator(self, batch_size, phase='train', shuffle=False):
"""
Generate data for train, dev or test.
Args:
batch_size: int. The batch size of generated data.
phase: string. The phase for which to generate data.
epoch: int. Total epoches to generate data.
shuffle: bool. Whether to shuffle examples.
"""
if phase == 'train':
......@@ -148,25 +144,19 @@ class DataProcessor(object):
def instance_reader():
"""generate instance data"""
for epoch_index in range(epoch):
if shuffle:
np.random.shuffle(examples)
if phase == 'train':
self.current_train_epoch = epoch_index
for (index, example) in enumerate(examples):
if phase == 'train':
self.current_train_example = index + 1
feature = self.convert_example(
index, example,
self.get_labels(), self.max_seq_len, self.tokenizer)
instance = self.generate_instance(feature)
yield instance
if shuffle:
np.random.shuffle(examples)
for (index, example) in enumerate(examples):
feature = self.convert_example(
index, example,
self.get_labels(), self.max_seq_len, self.tokenizer)
instance = self.generate_instance(feature)
yield instance
def batch_reader(reader, batch_size, in_tokens):
"""read batch data"""
batch, total_token_num, max_len = [], 0, 0
for instance in reader():
token_ids, sent_ids, pos_ids, label = instance[:4]
max_len = max(max_len, len(token_ids))
if in_tokens:
......@@ -294,7 +284,8 @@ class UDCProcessor(DataProcessor):
examples = self._create_examples(lines, "test")
return examples
def get_labels(self):
@staticmethod
def get_labels():
"""See base class."""
return ["0", "1"]
......@@ -327,7 +318,8 @@ class SWDAProcessor(DataProcessor):
examples = self._create_examples(lines, "test")
return examples
def get_labels(self):
@staticmethod
def get_labels():
"""See base class."""
labels = range(42)
labels = [str(label) for label in labels]
......@@ -362,7 +354,8 @@ class MRDAProcessor(DataProcessor):
examples = self._create_examples(lines, "test")
return examples
def get_labels(self):
@staticmethod
def get_labels():
"""See base class."""
labels = range(42)
labels = [str(label) for label in labels]
......@@ -406,7 +399,8 @@ class ATISSlotProcessor(DataProcessor):
examples = self._create_examples(lines, "test")
return examples
def get_labels(self):
@staticmethod
def get_labels():
"""See base class."""
labels = range(130)
labels = [str(label) for label in labels]
......@@ -449,7 +443,8 @@ class ATISIntentProcessor(DataProcessor):
examples = self._create_examples(lines, "test")
return examples
def get_labels(self):
@staticmethod
def get_labels():
"""See base class."""
labels = range(26)
labels = [str(label) for label in labels]
......@@ -522,7 +517,8 @@ class DSTC2Processor(DataProcessor):
examples = self._create_examples(lines, "test")
return examples
def get_labels(self):
@staticmethod
def get_labels():
"""See base class."""
labels = range(217)
labels = [str(label) for label in labels]
......@@ -598,7 +594,8 @@ class MULTIWOZProcessor(DataProcessor):
examples = self._create_examples(lines, "test")
return examples
def get_labels(self):
@staticmethod
def get_labels():
"""See base class."""
labels = range(722)
labels = [str(label) for label in labels]
......@@ -666,8 +663,8 @@ def convert_tokens(tokens, sep_id, tokenizer):
tok_text = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tok_text)
tokens_ids.extend(ids)
if text != tokens[-1]:
tokens_ids.append(sep_id)
tokens_ids.append(sep_id)
tokens_ids = tokens_ids[: -1]
else:
tok_text = tokenizer.tokenize(tokens)
tokens_ids = tokenizer.convert_tokens_to_ids(tok_text)
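The list branch above joins per-utterance token ids with a separator id and then drops the trailing separator. A tiny pure-Python sketch of that pattern (the helper and the stub tokenizer are illustrative, not part of the module):

```python
def join_with_sep(texts, sep_id, to_ids):
    """Tokenize each text, join the id sequences with sep_id."""
    ids = []
    for text in texts:
        ids.extend(to_ids(text))
        ids.append(sep_id)
    return ids[:-1]  # drop the trailing separator

# stub tokenizer: one id per character
assert join_with_sep(["hi", "yo"], 102, lambda t: [ord(c) for c in t]) == \
    [104, 105, 102, 121, 111]
```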
......@@ -719,7 +716,8 @@ def convert_single_example(ex_index, example, label_list, max_seq_length,
if tokens_b_ids:
tokens_b_ids = tokens_b_ids[:min(limit_length, len(tokens_b_ids))]
else:
tokens_a_ids = tokens_a_ids[len(tokens_a_ids) - max_seq_length + 2:]
if len(tokens_a_ids) > max_seq_length - 2:
tokens_a_ids = tokens_a_ids[len(tokens_a_ids) - max_seq_length + 2:]
if not tokens_c_ids:
if len(tokens_a_ids) > max_seq_length - len(tokens_b_ids) - 3:
tokens_a_ids = tokens_a_ids[len(tokens_a_ids) - max_seq_length + len(tokens_b_ids) + 3:]
......@@ -727,13 +725,10 @@ def convert_single_example(ex_index, example, label_list, max_seq_length,
if len(tokens_a_ids) + len(tokens_b_ids) + len(tokens_c_ids) > max_seq_length - 4:
left_num = max_seq_length - len(tokens_b_ids) - 4
if len(tokens_a_ids) > len(tokens_c_ids):
if not tokens_c_ids:
tokens_a_ids = tokens_a_ids[max(0, len(tokens_a_ids) - left_num):]
else:
suffix_num = int(left_num / 2)
tokens_c_ids = tokens_c_ids[: min(len(tokens_c_ids), suffix_num)]
prefix_num = left_num - len(tokens_c_ids)
tokens_a_ids = tokens_a_ids[max(0, len(tokens_a_ids) - prefix_num):]
suffix_num = int(left_num / 2)
tokens_c_ids = tokens_c_ids[: min(len(tokens_c_ids), suffix_num)]
prefix_num = left_num - len(tokens_c_ids)
tokens_a_ids = tokens_a_ids[max(0, len(tokens_a_ids) - prefix_num):]
else:
if not tokens_a_ids:
tokens_c_ids = tokens_c_ids[max(0, len(tokens_c_ids) - left_num):]
......
scripts: data-processing scripts that convert the official public datasets into the training-data format required by the model.
Usage:
sh run_build_data.sh [udc|swda|mrda|atis|dstc2]
1) To build the train/dev/test sets for the MATCHING task:
sh run_build_data.sh udc
The generated data is placed under dialogue_general_understanding/data/input/data/udc
2) To build the train/dev/test sets for the DA tasks:
sh run_build_data.sh swda
sh run_build_data.sh mrda
The generated data is placed under dialogue_general_understanding/data/input/data/swda and dialogue_general_understanding/data/input/data/mrda respectively
3) To build the train/dev/test sets for the DST task:
sh run_build_data.sh dstc2
The generated data is placed under dialogue_general_understanding/data/input/data/dstc2
4) To build the train/dev/test sets for the intent-detection and slot-filling tasks:
sh run_build_data.sh atis
The slot-filling data is placed under dialogue_general_understanding/data/input/data/atis/atis_slot
The intent-detection data is placed under dialogue_general_understanding/data/input/data/atis/atis_intent
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""build swda train dev test dataset"""
import json
......@@ -32,11 +33,11 @@ class ATIS(object):
self.slot_dict = {"PAD": 0, "O": 1}
self.intent_id = 0
self.intent_dict = dict()
self.src_dir = "../data/atis/source_data"
self.out_slot_dir = "../data/atis/atis_slot"
self.out_intent_dir = "../data/atis/atis_intent"
self.map_tag_slot = "../data/atis/atis_slot/map_tag_slot_id.txt"
self.map_tag_intent = "../data/atis/atis_intent/map_tag_intent_id.txt"
self.src_dir = "../../data/input/data/atis/source_data"
self.out_slot_dir = "../../data/input/data/atis/atis_slot"
self.out_intent_dir = "../../data/input/data/atis/atis_intent"
self.map_tag_slot = "../../data/input/data/atis/atis_slot/map_tag_slot_id.txt"
self.map_tag_intent = "../../data/input/data/atis/atis_intent/map_tag_intent_id.txt"
def _load_file(self, data_type):
"""
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -31,12 +31,12 @@ class DSTC2(object):
init instance
"""
self.map_tag_dict = {}
self.out_dir = "../data/dstc2/dstc2"
self.out_asr_dir = "../data/dstc2/dstc2_asr"
self.out_dir = "../../data/input/data/dstc2/dstc2"
self.out_asr_dir = "../../data/input/data/dstc2/dstc2_asr"
self.data_list = "./conf/dstc2.conf"
self.map_tag = "../data/dstc2/dstc2/map_tag_id.txt"
self.src_dir = "../data/dstc2/source_data"
self.onto_json = "../data/dstc2/source_data/ontology_dstc2.json"
self.map_tag = "../../data/input/data/dstc2/dstc2/map_tag_id.txt"
self.src_dir = "../../data/input/data/dstc2/source_data"
self.onto_json = "../../data/input/data/dstc2/source_data/ontology_dstc2.json"
self._load_file()
self._load_ontology()
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -31,11 +31,11 @@ class MRDA(object):
"""
self.tag_id = 0
self.map_tag_dict = dict()
self.out_dir = "../data/mrda"
self.out_dir = "../../data/input/data/mrda"
self.data_list = "./conf/mrda.conf"
self.map_tag = "../data/mrda/map_tag_id.txt"
self.voc_map_tag = "../data/mrda/source_data/icsi_mrda+hs_corpus_050512/classmaps/map_01b_expanded_w_split"
self.src_dir = "../data/mrda/source_data/icsi_mrda+hs_corpus_050512/data"
self.map_tag = "../../data/input/data/mrda/map_tag_id.txt"
self.voc_map_tag = "../../data/input/data/mrda/source_data/icsi_mrda+hs_corpus_050512/classmaps/map_01b_expanded_w_split"
self.src_dir = "../../data/input/data/mrda/source_data/icsi_mrda+hs_corpus_050512/data"
self._load_file()
self.tag_dict = commonlib.load_voc(self.voc_map_tag)
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -31,10 +31,10 @@ class SWDA(object):
"""
self.tag_id = 0
self.map_tag_dict = dict()
self.out_dir = "../data/swda"
self.out_dir = "../../data/input/data/swda"
self.data_list = "./conf/swda.conf"
self.map_tag = "../data/swda/map_tag_id.txt"
self.src_dir = "../data/swda/source_data/swda"
self.map_tag = "../../data/input/data/swda/map_tag_id.txt"
self.src_dir = "../../data/input/data/swda/source_data/swda"
self._load_file()
def _load_file(self):
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......
......@@ -15,8 +15,8 @@ then
elif [[ "${TASK_DATA}" =~ "atis" ]]
then
python build_atis_dataset.py
cat ../data/atis/atis_slot/test.txt > ../data/atis/atis_slot/dev.txt
cat ../data/atis/atis_intent/test.txt > ../data/atis/atis_intent/dev.txt
cat ../../data/input/data/atis/atis_slot/test.txt > ../../data/input/data/atis/atis_slot/dev.txt
cat ../../data/input/data/atis/atis_intent/test.txt > ../../data/input/data/atis/atis_intent/dev.txt
elif [ "${TASK_DATA}" = "dstc2" ]
then
python build_dstc2_dataset.py
......
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is the reverse of __split_heads.
"""
if len(x.shape) == 3:
return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key ** -0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
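# process_cmd is a string of one-letter ops applied in order: "a" adds the
# residual, "n" applies layer norm, "d" applies dropout; e.g. the defaults
# below use preprocess_cmd="n" (normalize the input) and postprocess_cmd="da"
# (dropout, then add the residual).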
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sub-layer followed by
a position-wise feed-forward network, with both components wrapped by
post_process_layer to add the residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
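A minimal sketch of wiring the encoder above into a program; all sizes are toy values chosen for illustration, and the input names "src" and "bias" are placeholders:

```python
import paddle.fluid as fluid

# [batch, seq_len=128, d_model=256] input and a per-head attention bias
src = fluid.layers.data(name="src", shape=[128, 256], dtype="float32")
bias = fluid.layers.data(name="bias", shape=[8, 128, 128], dtype="float32")

enc_out = encoder(
    enc_input=src, attn_bias=bias,
    n_layer=2, n_head=8, d_key=32, d_value=32,
    d_model=256, d_inner_hid=1024,
    prepostprocess_dropout=0.1, attention_dropout=0.1, relu_dropout=0.1,
    hidden_act="relu")
```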
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import argparse
import json
import yaml
import six
import logging
logging_only_message = "%(message)s"
logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s"
class JsonConfig(object):
"""
A high-level API for handling a JSON configuration file.
"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except:
raise IOError("Error in parsing bert model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
class ArgConfig(object):
"""
A high-level api for handling argument configs.
"""
def __init__(self):
parser = argparse.ArgumentParser()
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5,
"Learning rate used to train with warmup.")
train_g.add_arg(
"lr_scheduler",
str,
"linear_warmup_decay",
"scheduler of learning rate.",
choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01,
"Weight decay rate for L2 regularizer.")
train_g.add_arg(
"warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for."
)
train_g.add_arg("save_steps", int, 1000,
"The steps interval to save checkpoints.")
train_g.add_arg("use_fp16", bool, False,
"Whether to use fp16 mixed precision training.")
train_g.add_arg(
"loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled."
)
train_g.add_arg("pred_dir", str, None,
"Path to save the prediction results")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10,
"The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True,
"If set, use GPU for training.")
run_type_g.add_arg(
"use_fast_executor", bool, False,
"If set, use fast parallel executor (in experiment).")
run_type_g.add_arg(
"num_iteration_per_drop_scope", int, 1,
"Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("do_train", bool, True,
"Whether to perform training.")
run_type_g.add_arg("do_predict", bool, True,
"Whether to perform prediction.")
custom_g = ArgumentGroup(parser, "customize", "customized options.")
self.custom_g = custom_g
self.parser = parser
def add_arg(self, name, dtype, default, descrip):
self.custom_g.add_arg(name, dtype, default, descrip)
def build_conf(self):
return self.parser.parse_args()
def str2bool(v):
# argparse does not support parsing "True"/"False" strings as Python
# booleans directly
return v.lower() in ("true", "t", "1")
def print_arguments(args, log=None):
if not log:
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
else:
log.info('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
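An illustrative use of ArgConfig and print_arguments together; the custom argument is made up for the example:

```python
# sketch only: define, parse and echo a configuration
arg_conf = ArgConfig()
arg_conf.add_arg("task_name", str, "udc", "Which task to fine-tune.")
args = arg_conf.build_conf()   # argparse.parse_args() under the hood
print_arguments(args)
```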
class PDConfig(object):
"""
A high-level API for managing configuration files in PaddlePaddle.
Works jointly with command-line arguments, JSON files and YAML files.
"""
def __init__(self, json_file="", yaml_file="", fuse_args=True):
"""
Init function for PDConfig.
json_file: the path to the json configure file.
yaml_file: the path to the yaml configure file.
fuse_args: if fuse the json/yaml configs with argparse.
"""
assert isinstance(json_file, str)
assert isinstance(yaml_file, str)
if json_file != "" and yaml_file != "":
raise Warning(
"json_file and yaml_file can not co-exist for now. please only use one configure file type."
)
return
self.args = None
self.arg_config = {}
self.json_config = {}
self.yaml_config = {}
parser = argparse.ArgumentParser()
self.default_g = ArgumentGroup(parser, "default", "default options.")
self.yaml_g = ArgumentGroup(parser, "yaml", "options from yaml.")
self.json_g = ArgumentGroup(parser, "json", "options from json.")
self.com_g = ArgumentGroup(parser, "custom", "customized options.")
self.default_g.add_arg("epoch", int, 2,
"Number of epoches for training.")
self.default_g.add_arg("learning_rate", float, 1e-2,
"Learning rate used to train.")
self.default_g.add_arg("do_train", bool, False,
"Whether to perform training.")
self.default_g.add_arg("do_predict", bool, False,
"Whether to perform predicting.")
self.default_g.add_arg("do_eval", bool, False,
"Whether to perform evaluating.")
self.parser = parser
if json_file != "":
self.load_json(json_file, fuse_args=fuse_args)
if yaml_file:
self.load_yaml(yaml_file, fuse_args=fuse_args)
def load_json(self, file_path, fuse_args=True):
if not os.path.exists(file_path):
raise Warning("the json file %s does not exist." % file_path)
with open(file_path, "r") as fin:
self.json_config = json.loads(fin.read())
fin.close()
if fuse_args:
for name in self.json_config:
if not isinstance(self.json_config[name], int) \
and not isinstance(self.json_config[name], float) \
and not isinstance(self.json_config[name], str) \
and not isinstance(self.json_config[name], bool):
continue
self.json_g.add_arg(name,
type(self.json_config[name]),
self.json_config[name],
"This is from %s" % file_path)
def load_yaml(self, file_path, fuse_args=True):
if not os.path.exists(file_path):
raise Warning("the yaml file %s does not exist." % file_path)
with open(file_path, "r") as fin:
self.yaml_config = yaml.load(fin, Loader=yaml.SafeLoader)
fin.close()
if fuse_args:
for name in self.yaml_config:
if not isinstance(self.yaml_config[name], int) \
and not isinstance(self.yaml_config[name], float) \
and not isinstance(self.yaml_config[name], str) \
and not isinstance(self.yaml_config[name], bool):
continue
self.yaml_g.add_arg(name,
type(self.yaml_config[name]),
self.yaml_config[name],
"This is from %s" % file_path)
def build(self):
self.args = self.parser.parse_args()
self.arg_config = vars(self.args)
def __add__(self, new_arg):
assert isinstance(new_arg, list) or isinstance(new_arg, tuple)
assert len(new_arg) >= 3
assert self.args is None
name = new_arg[0]
dtype = new_arg[1]
dvalue = new_arg[2]
desc = new_arg[3] if len(
new_arg) == 4 else "Description is not provided."
self.com_g.add_arg(name, dtype, dvalue, desc)
return self
def __getattr__(self, name):
if name in self.arg_config:
return self.arg_config[name]
if name in self.json_config:
return self.json_config[name]
if name in self.yaml_config:
return self.yaml_config[name]
raise Warning("The argument %s is not defined." % name)
def Print(self):
print("-" * 70)
for name in self.arg_config:
print("%s:\t\t\t\t%s" % (str(name), str(self.arg_config[name])))
for name in self.json_config:
if name not in self.arg_config:
print("%s:\t\t\t\t%s" %
(str(name), str(self.json_config[name])))
for name in self.yaml_config:
if name not in self.arg_config:
print("%s:\t\t\t\t%s" %
(str(name), str(self.yaml_config[name])))
print("-" * 70)
if __name__ == "__main__":
"""
pd_config = PDConfig(json_file = "./test/bert_config.json")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
pd_config = PDConfig(yaml_file = "./test/bert_config.yaml")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
"""
pd_config = PDConfig(yaml_file="./test/bert_config.yaml")
pd_config += ("my_age", int, 18, "I am forever 18.")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
print(pd_config.my_age)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -11,38 +11,25 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import six
import argparse
def str2bool(v):
# argparse does not support parsing "True"/"False" strings as Python
# booleans directly
return v.lower() in ("true", "t", "1")
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
import ast
import copy
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
import numpy as np
import paddle.fluid as fluid
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class InputField(object):
def __init__(self, input_field):
"""init inpit field"""
self.src_ids = input_field[0]
self.pos_ids = input_field[1]
self.sent_ids = input_field[2]
self.input_mask = input_field[3]
self.labels = input_field[4]
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import paddle
import paddle.fluid as fluid
def check_cuda(use_cuda, err = \
"\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n"
):
try:
if use_cuda and not fluid.is_compiled_with_cuda():
print(err)
sys.exit(1)
except Exception as e:
pass
if __name__ == "__main__":
check_cuda(True)
check_cuda(False)
check_cuda(True, "This is only for testing.")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""save or load model api"""
import os
import sys
import paddle
import paddle.fluid as fluid
def init_from_pretrain_model(args, exe, program):
assert isinstance(args.init_from_pretrain_model, str)
if not os.path.exists(args.init_from_pretrain_model):
raise Warning("The pretrained params do not exist.")
return False
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(
os.path.join(args.init_from_pretrain_model, var.name))
fluid.io.load_vars(
exe,
args.init_from_pretrain_model,
main_program=program,
predicate=existed_params)
print("finish initing model from pretrained params from %s" %
(args.init_from_pretrain_model))
return True
def init_from_checkpoint(args, exe, program):
assert isinstance(args.init_from_checkpoint, str)
if not os.path.exists(args.init_from_checkpoint):
raise Warning("the checkpoint path does not exist.")
return False
fluid.io.load_persistables(
executor=exe,
dirname=args.init_from_checkpoint,
main_program=program,
filename="checkpoint.pdckpt")
print("finish initing model from checkpoint from %s" %
(args.init_from_checkpoint))
return True
def init_from_params(args, exe, program):
assert isinstance(args.init_from_params, str)
if not os.path.exists(args.init_from_params):
raise Warning("the params path does not exist.")
return False
fluid.io.load_params(
executor=exe,
dirname=args.init_from_params,
main_program=program,
filename="params.pdparams")
print("finish init model from params from %s" % (args.init_from_params))
return True
def save_checkpoint(args, exe, program, dirname):
assert isinstance(args.save_model_path, str)
checkpoint_dir = os.path.join(args.save_model_path, args.save_checkpoint)
if not os.path.exists(checkpoint_dir):
os.mkdir(checkpoint_dir)
fluid.io.save_persistables(
exe,
os.path.join(checkpoint_dir, dirname),
main_program=program,
filename="checkpoint.pdckpt")
print("save checkpoint at %s" % (os.path.join(checkpoint_dir, dirname)))
return True
def save_param(args, exe, program, dirname):
assert isinstance(args.save_model_path, str)
param_dir = os.path.join(args.save_model_path, args.save_param)
if not os.path.exists(param_dir):
os.mkdir(param_dir)
fluid.io.save_params(
exe,
os.path.join(param_dir, dirname),
main_program=program,
filename="params.pdparams")
print("save parameters at %s" % (os.path.join(param_dir, dirname)))
return True
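A hedged sketch of driving the helpers above; the Namespace fields mirror the attributes they read, and the paths, parameter name and step name are all placeholders:

```python
import os
from argparse import Namespace

import paddle.fluid as fluid

# a tiny program with one parameter, just so there is something to save
w = fluid.layers.create_parameter(shape=[2, 2], dtype="float32", name="w0")

args = Namespace(save_model_path="./tmp_models",
                 save_checkpoint="checkpoints",
                 save_param="params")
if not os.path.exists(args.save_model_path):
    os.makedirs(args.save_model_path)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
save_checkpoint(args, exe, fluid.default_main_program(), "step_0")
save_param(args, exe, fluid.default_main_program(), "step_0")
```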
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -19,43 +19,33 @@ from __future__ import print_function
import paddle.fluid as fluid
from bert import BertModel
from dgu.bert import BertModel
from dgu.utils.configure import JsonConfig
def create_model(args,
pyreader_name,
bert_config,
num_labels,
paradigm_inst,
is_prediction=False):
def create_net(
is_training,
model_input,
num_labels,
paradigm_inst,
args):
"""create dialogue task model"""
if args.task_name == 'atis_slot':
label_dim = [-1, args.max_seq_len]
lod_level = 1
elif args.task_name in ['dstc2', 'dstc2_asr', 'multi-woz']:
label_dim = [-1, num_labels]
lod_level = 0
else:
label_dim = [-1, 1]
lod_level = 0
pyreader = fluid.layers.py_reader(
capacity=50,
shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], label_dim],
dtypes=['int64', 'int64', 'int64', 'float32', 'int64'],
lod_levels=[0, 0, 0, 0, lod_level],
name=pyreader_name,
use_double_buffer=True)
src_ids = model_input.src_ids
pos_ids = model_input.pos_ids
sent_ids = model_input.sent_ids
input_mask = model_input.input_mask
labels = model_input.labels
(src_ids, pos_ids, sent_ids, input_mask,
labels) = fluid.layers.read_file(pyreader)
assert isinstance(args.bert_config_path, str)
bert_conf = JsonConfig(args.bert_config_path)
bert = BertModel(
src_ids=src_ids,
position_ids=pos_ids,
sentence_ids=sent_ids,
input_mask=input_mask,
config=bert_config,
config=bert_conf,
use_fp16=args.use_fp16)
params = {'num_labels': num_labels,
......@@ -64,15 +54,8 @@ def create_model(args,
'sent_ids': sent_ids,
'input_mask': input_mask,
'labels': labels,
'is_prediction': is_prediction}
if is_prediction:
results = paradigm_inst.paradigm(bert, params)
results['pyreader'] = pyreader
return results
'is_training': is_training}
results = paradigm_inst.paradigm(bert, params)
results['pyreader'] = pyreader
return results
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz
tar -xvf dmtk_data_1.0.0.tar.gz
rm dmtk_data_1.0.0.tar.gz
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dmtk_models_1.0.0.tar.gz
tar -xvf dmtk_models_1.0.0.tar.gz
rm dmtk_models_1.0.0.tar.gz
wget --no-check-certificate https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz
tar -xvf uncased_L-12_H-768_A-12.tar.gz
rm uncased_L-12_H-768_A-12.tar.gz
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""evaluation"""
import os
import sys
from dgu.evaluation import evaluate
from dgu.utils.configure import PDConfig
def do_eval(args):
task_name = args.task_name.lower()
reference = args.evaluation_file
predictions = args.output_prediction_file
evaluate(task_name, predictions, reference)
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/dgu.yaml")
args.build()
do_eval(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import argparse
from utils.args import ArgumentGroup
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("bert_config_path", str, None, "Path to the json file for bert model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("save_inference_model_path", str, None, "Path to save model.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("data_dir", str, None, "Path to training data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("task_name", str, None,
"The name of task to perform fine-tuning, "
"should be in {'udc', 'swda', 'mrda', 'atis_slot', 'atis_intent', 'dstc2'}.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
parser.add_argument('--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.')
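For illustration, a sketch of consuming the parser defined above; the flag values are placeholders, and booleans go through the str2bool conversion in utils.args:

```python
# parse an explicit argv list instead of the real command line
args = parser.parse_args(["--task_name", "udc",
                          "--data_dir", "./data/input/data/udc",
                          "--epoch", "2"])
print(args.task_name, args.epoch, args.do_train)
```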
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""save inference model"""
import os
import sys
import argparse
import collections
import numpy as np
import paddle
import paddle.fluid as fluid
from dgu.utils.configure import PDConfig
from dgu.utils.input_field import InputField
from dgu.utils.model_check import check_cuda
import dgu.utils.save_load_io as save_load_io
import dgu.reader as reader
from dgu_net import create_net
import dgu.define_paradigm as define_paradigm
def do_save_inference_model(args):
"""save inference model function"""
task_name = args.task_name.lower()
paradigm_inst = define_paradigm.Paradigm(task_name)
processors = {
'udc': reader.UDCProcessor,
'swda': reader.SWDAProcessor,
'mrda': reader.MRDAProcessor,
'atis_slot': reader.ATISSlotProcessor,
'atis_intent': reader.ATISIntentProcessor,
'dstc2': reader.DSTC2Processor,
}
test_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(test_prog, startup_prog):
test_prog.random_seed = args.random_seed
startup_prog.random_seed = args.random_seed
with fluid.unique_name.guard():
# define inputs of the network
num_labels = len(processors[task_name].get_labels())
src_ids = fluid.layers.data(
name='src_ids', shape=[args.max_seq_len, 1], dtype='int64')
pos_ids = fluid.layers.data(
name='pos_ids', shape=[args.max_seq_len, 1], dtype='int64')
sent_ids = fluid.layers.data(
name='sent_ids', shape=[args.max_seq_len, 1], dtype='int64')
input_mask = fluid.layers.data(
name='input_mask', shape=[args.max_seq_len, 1], dtype='float32')
if args.task_name == 'atis_slot':
labels = fluid.layers.data(
name='labels', shape=[args.max_seq_len], dtype='int64')
elif args.task_name in ['dstc2', 'dstc2_asr', 'multi-woz']:
labels = fluid.layers.data(
name='labels', shape=[num_labels], dtype='int64')
else:
labels = fluid.layers.data(
name='labels', shape=[1], dtype='int64')
input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels]
input_field = InputField(input_inst)
results = create_net(
is_training=False,
model_input=input_field,
num_labels=num_labels,
paradigm_inst=paradigm_inst,
args=args)
probs = results.get("probs", None)
if args.use_cuda:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(startup_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model)
if args.init_from_params:
save_load_io.init_from_params(args, exe, test_prog)
elif args.init_from_pretrain_model:
save_load_io.init_from_pretrain_model(args, exe, test_prog)
# saving inference model
fluid.io.save_inference_model(
args.inference_model_dir,
feeded_var_names=[
input_field.src_ids.name,
input_field.pos_ids.name,
input_field.sent_ids.name,
input_field.input_mask.name
],
target_vars=[
probs
],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
print("save inference model at %s" % (args.inference_model_dir))
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/dgu.yaml")
args.build()
check_cuda(args.use_cuda)
do_save_inference_model(args)
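Once saved, the inference model can be reloaded for deployment. A minimal sketch, assuming the same model/params filenames as above and a placeholder directory:

```python
import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
infer_prog, feed_names, fetch_targets = fluid.io.load_inference_model(
    dirname="./data/inference_models/udc",   # placeholder path
    executor=exe,
    model_filename="model.pdmodel",
    params_filename="params.pdparams")
# feed_names lists the src_ids/pos_ids/sent_ids/input_mask variable names;
# fetch_targets[0] is the probs tensor saved above.
```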
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import numpy as np
import paddle
import paddle.fluid as fluid
from eval import do_eval
from train import do_train
from predict import do_predict
from inference_model import do_save_inference_model
from dgu.utils.configure import PDConfig
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/dgu.yaml")
args.build()
args.Print()
if args.do_train:
do_train(args)
if args.do_predict:
do_predict(args)
if args.do_eval:
do_eval(args)
if args.do_save_inference_model:
do_save_inference_model(args)
# vim: set ts=4 sw=4 sts=4 tw=100:
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -11,44 +11,28 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Load checkpoint of running classifier to do prediction and save inference model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import time
import numpy as np
import multiprocessing
import argparse
import collections
import paddle
import paddle.fluid as fluid
from finetune_args import parser
from utils.args import print_arguments
from utils.init import init_pretraining_params, init_checkpoint
import define_predict_pack
import reader.data_reader as reader
import dgu.reader as reader
from dgu_net import create_net
import dgu.define_paradigm as define_paradigm
import dgu.define_predict_pack as define_predict_pack
_WORK_DIR = os.path.split(os.path.realpath(__file__))[0]
sys.path.append(
'../../models/dialogue_model_toolkit/dialogue_general_understanding')
sys.path.append('../../models/')
from dgu.utils.configure import PDConfig
from dgu.utils.input_field import InputField
from dgu.utils.model_check import check_cuda
import dgu.utils.save_load_io as save_load_io
from bert import BertConfig, BertModel
from create_model import create_model
import define_paradigm
from model_check import check_cuda
def main(args):
"""main function"""
bert_config = BertConfig(args.bert_config_path)
bert_config.print_config()
def do_predict(args):
"""predict function"""
task_name = args.task_name.lower()
paradigm_inst = define_paradigm.Paradigm(task_name)
......@@ -62,107 +46,118 @@ def main(args):
'atis_slot': reader.ATISSlotProcessor,
'atis_intent': reader.ATISIntentProcessor,
'dstc2': reader.DSTC2Processor,
'dstc2_asr': reader.DSTC2Processor,
}
in_tokens = {
'udc': True,
'swda': True,
'mrda': True,
'atis_slot': False,
'atis_intent': True,
'dstc2': True,
'dstc2_asr': True
}
test_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
processor = processors[task_name](data_dir=args.data_dir,
vocab_path=args.vocab_path,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=in_tokens[task_name],
task_name=task_name,
random_seed=args.random_seed)
num_labels = len(processor.get_labels())
with fluid.program_guard(test_prog, startup_prog):
test_prog.random_seed = args.random_seed
startup_prog.random_seed = args.random_seed
predict_prog = fluid.Program()
predict_startup = fluid.Program()
with fluid.program_guard(predict_prog, predict_startup):
with fluid.unique_name.guard():
pred_results = create_model(
args,
pyreader_name='predict_reader',
bert_config=bert_config,
num_labels=num_labels,
paradigm_inst=paradigm_inst,
is_prediction=True)
predict_pyreader = pred_results.get('pyreader', None)
probs = pred_results.get('probs', None)
feed_target_names = pred_results.get('feed_targets_name', None)
predict_prog = predict_prog.clone(for_test=True)
# define inputs of the network
num_labels = len(processors[task_name].get_labels())
src_ids = fluid.layers.data(
name='src_ids', shape=[args.max_seq_len, 1], dtype='int64')
pos_ids = fluid.layers.data(
name='pos_ids', shape=[args.max_seq_len, 1], dtype='int64')
sent_ids = fluid.layers.data(
name='sent_ids', shape=[args.max_seq_len, 1], dtype='int64')
input_mask = fluid.layers.data(
name='input_mask', shape=[args.max_seq_len, 1], dtype='float32')
if args.task_name == 'atis_slot':
labels = fluid.layers.data(
name='labels', shape=[args.max_seq_len], dtype='int64')
elif args.task_name in ['dstc2', 'dstc2_asr', 'multi-woz']:
labels = fluid.layers.data(
name='labels', shape=[num_labels], dtype='int64')
else:
labels = fluid.layers.data(
name='labels', shape=[1], dtype='int64')
input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels]
input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(feed_list=input_inst,
capacity=4, iterable=False)
results = create_net(
is_training=False,
model_input=input_field,
num_labels=num_labels,
paradigm_inst=paradigm_inst,
args=args)
probs = results.get("probs", None)
probs.persistable = True
fetch_list = [probs.name]
# for_test=True: clone the program with operators' is_test attribute set to True
test_prog = test_prog.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
place = fluid.CUDAPlace(0) if args.use_cuda == True else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(predict_startup)
if args.init_checkpoint:
init_pretraining_params(exe, args.init_checkpoint, predict_prog)
else:
raise ValueError("args 'init_checkpoint' should be set for prediction!")
exe.run(startup_prog)
predict_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda, main_program=predict_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model)
test_data_generator = processor.data_generator(
batch_size=args.batch_size, phase='test', epoch=1, shuffle=False)
predict_pyreader.decorate_tensor_provider(test_data_generator)
if args.init_from_params:
save_load_io.init_from_params(args, exe, test_prog)
if args.init_from_pretrain_model:
save_load_io.init_from_pretrain_model(args, exe, test_prog)
predict_pyreader.start()
compiled_test_prog = fluid.CompiledProgram(test_prog)
processor = processors[task_name](data_dir=args.data_dir,
vocab_path=args.vocab_path,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=args.in_tokens,
task_name=task_name,
random_seed=args.random_seed)
batch_generator = processor.data_generator(
batch_size=args.batch_size,
phase='test',
shuffle=False)
data_reader.decorate_batch_generator(batch_generator)
data_reader.start()
all_results = []
time_begin = time.time()
while True:
try:
results = predict_exe.run(fetch_list=[probs.name])
while True:
try:
results = exe.run(compiled_test_prog, fetch_list=fetch_list)
all_results.extend(results[0])
except fluid.core.EOFException:
predict_pyreader.reset()
except fluid.core.EOFException:
data_reader.reset()
break
time_end = time.time()
np.set_printoptions(precision=4, suppress=True)
print("-------------- prediction results --------------")
print("example_id\t" + ' '.join(processor.get_labels()))
if in_tokens[task_name]:
for index, result in enumerate(all_results):
tags = pred_func(result)
print("%s\t%s" % (index, tags))
else:
tags = pred_func(all_results, args.max_seq_len)
for index, tag in enumerate(tags):
print("%s\t%s" % (index, tag))
with open(args.output_prediction_file, 'w') as fw:
if task_name not in ['atis_slot']:
for index, result in enumerate(all_results):
tags = pred_func(result)
fw.write("%s\t%s\n" % (index, tags))
else:
tags = pred_func(all_results, args.max_seq_len)
for index, tag in enumerate(tags):
fw.write("%s\t%s\n" % (index, tag))
if args.save_inference_model_path:
_, ckpt_dir = os.path.split(args.init_checkpoint)
dir_name = ckpt_dir + '_inference_model'
model_path = os.path.join(args.save_inference_model_path, dir_name)
fluid.io.save_inference_model(
model_path,
feed_target_names, [probs],
exe,
main_program=predict_prog)
if __name__ == '__main__':
    args = PDConfig(yaml_file="./data/config/dgu.yaml")
    args.build()
    args.Print()

    check_cuda(args.use_cuda)
    do_predict(args)
#!/bin/bash
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1
export CUDA_VISIBLE_DEVICES=0
if [ ! "$CUDA_VISIBLE_DEVICES" ]
then
use_cuda=false
else
use_cuda=true
fi
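# use_cuda is derived from CUDA_VISIBLE_DEVICES: an empty value falls back to CPU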
TASK_NAME=$1
TASK_TYPE=$2
typeset -l TASK_NAME
typeset -l TASK_TYPE
BERT_BASE_PATH="./data/pretrain_model/uncased_L-12_H-768_A-12"
INPUT_PATH="./data/input/data/${TASK_NAME}"
SAVE_MODEL_PATH="./data/saved_models/${TASK_NAME}"
TRAIN_MODEL_PATH="./data/saved_models/trained_models"
OUTPUT_PATH="./data/output"
INFERENCE_MODEL="data/inference_models"
PYTHON_PATH="python"
if [ ! -d ${SAVE_MODEL_PATH} ]; then
mkdir ${SAVE_MODEL_PATH}
fi
#parameter configuration
if [ "${TASK_NAME}" = "udc" ]
then
save_steps=1000
max_seq_len=210
print_steps=1000
batch_size=6720
in_tokens=true
epoch=2
learning_rate=2e-5
elif [ "${TASK_NAME}" = "swda" ]
then
save_steps=500
max_seq_len=128
print_steps=200
batch_size=6720
in_tokens=true
epoch=3
learning_rate=2e-5
elif [ "${TASK_NAME}" = "mrda" ]
then
save_steps=500
max_seq_len=128
print_steps=200
batch_size=4096
in_tokens=true
epoch=7
learning_rate=2e-5
elif [ "${TASK_NAME}" = "atis_intent" ]
then
save_steps=100
max_seq_len=128
print_steps=10
batch_size=4096
in_tokens=true
epoch=1
learning_rate=2e-5
INPUT_PATH="./data/input/data/atis/${TASK_NAME}"
elif [ "${TASK_NAME}" = "atis_slot" ]
then
save_steps=100
max_seq_len=128
print_steps=10
batch_size=32
in_tokens=false
epoch=50
learning_rate=2e-5
INPUT_PATH="./data/input/data/atis/${TASK_NAME}"
elif [ "${TASK_NAME}" = "dstc2" ]
then
save_steps=400
print_steps=20
batch_size=8192
in_tokens=true
epoch=40
learning_rate=5e-5
INPUT_PATH="./data/input/data/dstc2/${TASK_NAME}"
if [ "${TASK_TYPE}" = "train" ]
then
max_seq_len=256
else
max_seq_len=512
fi
else
echo "not support ${TASK_NAME} dataset.."
exit 255
fi
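# note: for tasks with in_tokens=true, batch_size above is a per-batch token
# budget rather than a number of examples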
#training
function train()
{
$PYTHON_PATH -u main.py \
--task_name=${TASK_NAME} \
--use_cuda=$1 \
--do_train=true \
--in_tokens=${in_tokens} \
--epoch=${epoch} \
--batch_size=${batch_size} \
--do_lower_case=true \
--data_dir=${INPUT_PATH} \
--bert_config_path=${BERT_BASE_PATH}/bert_config.json \
--vocab_path=${BERT_BASE_PATH}/vocab.txt \
--init_from_pretrain_model=${BERT_BASE_PATH}/params \
--save_model_path=${SAVE_MODEL_PATH} \
--save_param="params" \
--save_steps=${save_steps} \
--learning_rate=${learning_rate} \
--weight_decay=0.01 \
--max_seq_len=${max_seq_len} \
--print_steps=${print_steps} \
--use_fp16=false;
}
#predicting
function predict()
{
$PYTHON_PATH -u main.py \
--task_name=${TASK_NAME} \
--use_cuda=$1 \
--do_predict=true \
--in_tokens=${in_tokens} \
--batch_size=${batch_size} \
--data_dir=${INPUT_PATH} \
--do_lower_case=true \
--init_from_params=${TRAIN_MODEL_PATH}/${TASK_NAME}/params \
--bert_config_path=${BERT_BASE_PATH}/bert_config.json \
--vocab_path=${BERT_BASE_PATH}/vocab.txt \
--output_prediction_file=${OUTPUT_PATH}/pred_${TASK_NAME} \
--max_seq_len=${max_seq_len};
}
#evaluating
function evaluate()
{
$PYTHON_PATH -u main.py \
--task_name=${TASK_NAME} \
--use_cuda=$1 \
--do_eval=true \
--evaluation_file=${INPUT_PATH}/test.txt \
--output_prediction_file=${OUTPUT_PATH}/pred_${TASK_NAME};
}
#saving the inference model
function save_inference()
{
$PYTHON_PATH -u main.py \
--task_name=${TASK_NAME} \
--use_cuda=$1 \
--init_from_params=${TRAIN_MODEL_PATH}/${TASK_NAME}/params \
--do_save_inference_model=true \
--bert_config_path=${BERT_BASE_PATH}/bert_config.json \
--inference_model_dir=${INFERENCE_MODEL}/${TASK_NAME};
}
if [ "${TASK_TYPE}" = "train" ]
then
echo "train $TASK_NAME start..........";
train $use_cuda;
echo ""train $TASK_NAME finish..........
elif [ "${TASK_TYPE}" = "predict" ]
then
echo "predict $TASK_NAME start..........";
predict $use_cuda;
echo "predict $TASK_NAME finish..........";
elif [ "${TASK_TYPE}" = "evaluate" ]
then
export CUDA_VISIBLE_DEVICES=
echo "evaluate $TASK_NAME start..........";
evaluate false;
echo "evaluate $TASK_NAME finish..........";
elif [ "${TASK_TYPE}" = "inference" ]
then
echo "save $TASK_NAME inference model start..........";
save_inference $use_cuda;
echo "save $TASK_NAME inference model finish..........";
elif [ "${TASK_TYPE}" = "all" ]
then
echo "Execute train、predict、evaluate and save inference model in sequence...."
train $use_cuda;
predict $use_cuda;
evaluate false;
save_inference $use_cuda;
echo "done";
else
echo "Parameter $TASK_TYPE is not supported, you can input parameter in [train|predict|evaluate|inference|all]"
exit 255;
fi
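# example invocations (assuming this script is saved as run.sh):
#   sh run.sh udc train
#   sh run.sh udc predict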
#!/bin/bash
TASK_NAME=$1
PRED_FILE="./pred_"${TASK_NAME}
PYTHON_PATH="python"
echo "run predict............................"
sh run_predict.sh ${TASK_NAME} > ${PRED_FILE}
echo "eval_metrics..........................."
${PYTHON_PATH} eval_metrics.py ${TASK_NAME} ${PRED_FILE}
#!/bin/bash
export CUDA_VISIBLE_DEVICES=4
export CPU_NUM=1
TASK_NAME=$1
BERT_BASE_PATH="./uncased_L-12_H-768_A-12"
INPUT_PATH="./data/${TASK_NAME}"
OUTPUT_PATH="./output/${TASK_NAME}"
PYTHON_PATH="python"
if [ "$TASK_NAME" = "udc" ]
then
best_model="step_62500"
max_seq_len=210
batch_size=6720
elif [ "$TASK_NAME" = "swda" ]
then
best_model="step_12500"
max_seq_len=128
batch_size=6720
elif [ "$TASK_NAME" = "mrda" ]
then
best_model="step_6500"
max_seq_len=128
batch_size=6720
elif [ "$TASK_NAME" = "atis_intent" ]
then
best_model="step_600"
max_seq_len=128
batch_size=4096
INPUT_PATH="./data/atis/${TASK_NAME}"
elif [ "$TASK_NAME" = "atis_slot" ]
then
best_model="step_7500"
max_seq_len=128
batch_size=32
INPUT_PATH="./data/atis/${TASK_NAME}"
elif [ "$TASK_NAME" = "dstc2" ]
then
best_model="step_12000"
max_seq_len=700
batch_size=6000
INPUT_PATH="./data/dstc2/${TASK_NAME}"
else
echo "not support ${TASK_NAME} dataset.."
exit 255
fi
$PYTHON_PATH -u predict.py --task_name ${TASK_NAME} \
--use_cuda true \
--batch_size ${batch_size} \
--init_checkpoint ${OUTPUT_PATH}/${best_model} \
--data_dir ${INPUT_PATH} \
--vocab_path ${BERT_BASE_PATH}/vocab.txt \
--max_seq_len ${max_seq_len} \
--bert_config_path ${BERT_BASE_PATH}/bert_config.json
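# the eval wrapper above invokes this script as: sh run_predict.sh ${TASK_NAME}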
#!/bin/bash
export CUDA_VISIBLE_DEVICES=3
export CPU_NUM=1
TASK_NAME=$1
typeset -l TASK_NAME
BERT_BASE_PATH="./uncased_L-12_H-768_A-12"
INPUT_PATH="./data/${TASK_NAME}"
OUTPUT_PATH="./output/${TASK_NAME}"
PYTHON_PATH="python"
DO_TRAIN=true
DO_VAL=true
DO_TEST=true
#parameter configuration
if [ "${TASK_NAME}" = "udc" ]
then
save_steps=1000
max_seq_len=210
skip_steps=1000
batch_size=6720
epoch=2
learning_rate=2e-5
DO_VAL=false
DO_TEST=false
elif [ "${TASK_NAME}" = "swda" ]
then
save_steps=500
max_seq_len=128
skip_steps=200
batch_size=6720
epoch=10
learning_rate=2e-5
elif [ "${TASK_NAME}" = "mrda" ]
then
save_steps=500
max_seq_len=128
skip_steps=200
batch_size=4096
epoch=4
learning_rate=2e-5
elif [ "${TASK_NAME}" = "atis_intent" ]
then
save_steps=100
max_seq_len=128
skip_steps=10
batch_size=4096
epoch=20
learning_rate=2e-5
INPUT_PATH="./data/atis/${TASK_NAME}"
elif [ "${TASK_NAME}" = "atis_slot" ]
then
save_steps=100
max_seq_len=128
skip_steps=10
batch_size=32
epoch=50
learning_rate=2e-5
INPUT_PATH="./data/atis/${TASK_NAME}"
elif [ "${TASK_NAME}" = "dstc2" ]
then
save_steps=400
max_seq_len=256
skip_steps=20
batch_size=8192
epoch=40
learning_rate=5e-5
INPUT_PATH="./data/dstc2/${TASK_NAME}"
else
echo "not support ${TASK_NAME} dataset.."
exit 255
fi
# build train, dev, test dataset
cd scripts && sh run_build_data.sh ${TASK_NAME} && cd ..
#training
$PYTHON_PATH -u train.py --task_name ${TASK_NAME} \
--use_cuda true \
--do_train ${DO_TRAIN} \
--do_val ${DO_VAL} \
--do_test ${DO_TEST} \
--epoch ${epoch} \
--batch_size ${batch_size} \
--data_dir ${INPUT_PATH} \
--bert_config_path ${BERT_BASE_PATH}/bert_config.json \
--vocab_path ${BERT_BASE_PATH}/vocab.txt \
--init_pretraining_params ${BERT_BASE_PATH}/params \
--checkpoints ${OUTPUT_PATH} \
--save_steps ${save_steps} \
--learning_rate ${learning_rate} \
--weight_decay 0.01 \
--max_seq_len ${max_seq_len} \
--skip_steps ${skip_steps} \
--validation_steps 1000000 \
--num_iteration_per_drop_scope 10 \
--use_fp16 false
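# usage: sh <this script> <task_name>, where task_name is one of
# udc|swda|mrda|atis_intent|atis_slot|dstc2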
scripts: directory of the data preprocessing scripts
Usage:
sh run_build_data.sh [udc|swda|mrda|atis|dstc2]
To build the train/dev/test sets for the DA tasks:
sh run_build_data.sh swda
sh run_build_data.sh mrda
The generated data is written to open-dialog/data/swda and open-dialog/data/mrda
To build the train/dev/test sets for the DST task:
sh run_build_data.sh dstc2
The generated data is written to open-dialog/data/dstc2
To build the train/dev/test sets for intent detection and slot filling:
sh run_build_data.sh atis
Slot filling data is written to open-dialog/data/atis/atis_slot
Intent detection data is written to open-dialog/data/atis/atis_intent
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
import multiprocessing

import paddle
import paddle.fluid as fluid

from dgu_net import create_net
import dgu.reader as reader
from dgu.optimization import optimization
import dgu.define_paradigm as define_paradigm
from dgu.utils.configure import PDConfig
from dgu.utils.input_field import InputField
from dgu.utils.model_check import check_cuda
import dgu.utils.save_load_io as save_load_io
def evaluate(test_exe, test_program, test_pyreader, fetch_list, eval_phase):
    """evaluate validation or test data"""
    test_pyreader.start()
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            if len(fetch_list) > 2:
                np_loss, np_acc, np_num_seqs = test_exe.run(
                    fetch_list=fetch_list)
                total_acc.extend(np_acc * np_num_seqs)
            else:
                np_loss, np_num_seqs = test_exe.run(fetch_list=fetch_list)
            total_cost.extend(np_loss * np_num_seqs)
            total_num_seqs.extend(np_num_seqs)
        except fluid.core.EOFException:
            test_pyreader.reset()
            break
    time_end = time.time()
    current_time = time.strftime('%Y-%m-%d %H:%M:%S',
                                 time.localtime(time.time()))
    if len(fetch_list) > 2:
        print("[%s evaluation] %s ave loss: %f, ave acc: %f, elapsed time: %f s"
              % (eval_phase, current_time, np.sum(total_cost) /
                 np.sum(total_num_seqs), np.sum(total_acc) /
                 np.sum(total_num_seqs), time_end - time_begin))
    else:
        print("[%s evaluation] %s ave loss: %f, elapsed time: %f s" %
              (eval_phase, current_time, np.sum(total_cost) /
               np.sum(total_num_seqs), time_end - time_begin))
def do_train(args):
    """train function"""
    task_name = args.task_name.lower()
    paradigm_inst = define_paradigm.Paradigm(task_name)

    processors = {
        'udc': reader.UDCProcessor,
        'swda': reader.SWDAProcessor,
        'mrda': reader.MRDAProcessor,
        'atis_slot': reader.ATISSlotProcessor,
        'atis_intent': reader.ATISIntentProcessor,
        'dstc2': reader.DSTC2Processor,
    }
    # whether each task batches by tokens (True) or by examples (False)
    in_tokens = {
        'udc': True,
        'swda': True,
        'mrda': True,
        'atis_slot': False,
        'atis_intent': True,
        'dstc2': True,
    }
    train_prog = fluid.default_main_program()
    startup_prog = fluid.default_startup_program()

    with fluid.program_guard(train_prog, startup_prog):
        train_prog.random_seed = args.random_seed
        startup_prog.random_seed = args.random_seed

        with fluid.unique_name.guard():
            # define inputs of the network
            num_labels = len(processors[task_name].get_labels())
            src_ids = fluid.layers.data(
                name='src_ids', shape=[args.max_seq_len, 1], dtype='int64')
            pos_ids = fluid.layers.data(
                name='pos_ids', shape=[args.max_seq_len, 1], dtype='int64')
            sent_ids = fluid.layers.data(
                name='sent_ids', shape=[args.max_seq_len, 1], dtype='int64')
            input_mask = fluid.layers.data(
                name='input_mask', shape=[args.max_seq_len, 1], dtype='float32')
            if args.task_name == 'atis_slot':
                labels = fluid.layers.data(
                    name='labels', shape=[args.max_seq_len], dtype='int64')
            elif args.task_name in ['dstc2']:
                labels = fluid.layers.data(
                    name='labels', shape=[num_labels], dtype='int64')
            else:
                labels = fluid.layers.data(
                    name='labels', shape=[1], dtype='int64')

            input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels]
            input_field = InputField(input_inst)
            data_reader = fluid.io.PyReader(
                feed_list=input_inst, capacity=4, iterable=False)

            processor = processors[task_name](data_dir=args.data_dir,
                                              vocab_path=args.vocab_path,
                                              max_seq_len=args.max_seq_len,
                                              do_lower_case=args.do_lower_case,
                                              in_tokens=args.in_tokens,
                                              task_name=task_name,
                                              random_seed=args.random_seed)

            results = create_net(
                is_training=True,
                model_input=input_field,
                num_labels=num_labels,
                paradigm_inst=paradigm_inst,
                args=args)

            loss = results.get("loss", None)
            probs = results.get("probs", None)
            accuracy = results.get("accuracy", None)
            num_seqs = results.get("num_seqs", None)

            loss.persistable = True
            probs.persistable = True
            if accuracy:
                accuracy.persistable = True
            num_seqs.persistable = True

            if args.use_cuda:
                dev_count = fluid.core.get_cuda_device_count()
            else:
                dev_count = int(
                    os.environ.get('CPU_NUM', multiprocessing.cpu_count()))

            batch_generator = processor.data_generator(
                batch_size=args.batch_size, phase='train', shuffle=True)
            num_train_examples = processor.get_num_examples(phase='train')

            if args.in_tokens:
                max_train_steps = args.epoch * num_train_examples // (
                    args.batch_size // args.max_seq_len) // dev_count
            else:
                max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count

            warmup_steps = int(max_train_steps * args.warmup_proportion)
            print("Num train examples: %d" % num_train_examples)
            print("Max train steps: %d" % max_train_steps)
            print("Num warmup steps: %d" % warmup_steps)

            optimizor = optimization(
                loss=loss,
                warmup_steps=warmup_steps,
                num_train_steps=max_train_steps,
                learning_rate=args.learning_rate,
                train_program=train_prog,
                startup_prog=startup_prog,
                weight_decay=args.weight_decay,
                scheduler=args.lr_scheduler,
                use_fp16=args.use_fp16,
                loss_scaling=args.loss_scaling)

    data_reader.decorate_batch_generator(batch_generator)

    if args.use_cuda:
        place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
    else:
        place = fluid.CPUPlace()

    exe = fluid.Executor(place)
    exe.run(startup_prog)

    assert (args.init_from_checkpoint == "") or (
        args.init_from_pretrain_model == "")
    # init from some checkpoint, to resume the previous training
    if args.init_from_checkpoint:
        save_load_io.init_from_checkpoint(args, exe, train_prog)
    # init from some pretrain models, to better solve the current task
    if args.init_from_pretrain_model:
        save_load_io.init_from_pretrain_model(args, exe, train_prog)

    build_strategy = fluid.compiler.BuildStrategy()
    build_strategy.enable_inplace = True
    compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)

    # start training
    steps = 0
    time_begin = time.time()
    ce_info = []
    for epoch_step in range(args.epoch):
        data_reader.start()
        while True:
            try:
                steps += 1
                if steps % args.print_steps == 0:
                    if warmup_steps <= 0:
                        if accuracy is not None:
                            fetch_list = [
                                loss.name, accuracy.name, num_seqs.name
                            ]
                        else:
                            fetch_list = [loss.name, num_seqs.name]
                    else:
                        if accuracy is not None:
                            fetch_list = [
                                loss.name, accuracy.name, optimizor.name,
                                num_seqs.name
                            ]
                        else:
                            fetch_list = [
                                loss.name, optimizor.name, num_seqs.name
                            ]
                else:
                    fetch_list = []

                outputs = exe.run(compiled_train_prog, fetch_list=fetch_list)

                if steps % args.print_steps == 0:
                    if warmup_steps <= 0:
                        if accuracy is not None:
                            np_loss, np_acc, np_num_seqs = outputs
                        else:
                            np_loss, np_num_seqs = outputs
                    else:
                        if accuracy is not None:
                            np_loss, np_acc, np_lr, np_num_seqs = outputs
                        else:
                            np_loss, np_lr, np_num_seqs = outputs

                    time_end = time.time()
                    used_time = time_end - time_begin
                    current_time = time.strftime('%Y-%m-%d %H:%M:%S',
                                                 time.localtime(time.time()))
                    if accuracy is not None:
                        print(
                            "%s epoch: %d, step: %d, ave loss: %f, "
                            "ave acc: %f, speed: %f steps/s" %
                            (current_time, epoch_step, steps,
                             np.mean(np_loss),
                             np.mean(np_acc),
                             args.print_steps / used_time))
                        ce_info.append([
                            np.mean(np_loss),
                            np.mean(np_acc),
                            args.print_steps / used_time
                        ])
                    else:
                        print(
                            "%s epoch: %d, step: %d, ave loss: %f, "
                            "speed: %f steps/s" %
                            (current_time, epoch_step, steps,
                             np.mean(np_loss),
                             args.print_steps / used_time))
                        ce_info.append([
                            np.mean(np_loss),
                            args.print_steps / used_time
                        ])
                    time_begin = time.time()

                if steps % args.save_steps == 0:
                    save_path = "step_" + str(steps)
                    if args.save_checkpoint:
                        save_load_io.save_checkpoint(args, exe, train_prog,
                                                     save_path)
                    if args.save_param:
                        save_load_io.save_param(args, exe, train_prog,
                                                save_path)
            except fluid.core.EOFException:
                data_reader.reset()
                break

    if args.save_checkpoint:
        save_load_io.save_checkpoint(args, exe, train_prog, "step_final")
    if args.save_param:
        save_load_io.save_param(args, exe, train_prog, "step_final")

    def get_cards():
        num = 0
        cards = os.environ.get('CUDA_VISIBLE_DEVICES', '')
        print("test_cards", cards)
        if cards != '':
            num = len(cards.split(","))
        return num

    if args.enable_ce:
        card_num = get_cards()
        print("test_card_num", card_num)
        ce_loss = 0
        ce_acc = 0
        ce_time = 0
        # ce_loss, ce_acc and ce_time are derived from ce_info here
        # (lines elided in the diff)
        print("kpis\ttrain_loss_%s_card%s\t%f" % (task_name, card_num, ce_loss))
        print("kpis\ttrain_acc_%s_card%s\t%f" % (task_name, card_num, ce_acc))
if __name__ == '__main__':
    args = PDConfig(yaml_file="./data/config/dgu.yaml")
    args.build()
    args.Print()

    check_cuda(args.use_cuda)
    do_train(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def cast_fp32_to_fp16(exe, main_program):
    print("Cast parameters to float16 data format.")
    for param in main_program.global_block().all_parameters():
        if not param.name.endswith(".master"):
            param_t = fluid.global_scope().find_var(param.name).get_tensor()
            data = np.array(param_t)
            if param.name.find("layer_norm") == -1:
                param_t.set(np.float16(data).view(np.uint16), exe.place)
            master_param_var = fluid.global_scope().find_var(param.name +
                                                             ".master")
            if master_param_var is not None:
                master_param_var.get_tensor().set(data, exe.place)


def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False):
    assert os.path.exists(
        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path

    def existed_persistables(var):
        if not fluid.io.is_persistable(var):
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))

    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persistables)
    print("Load model from {}".format(init_checkpoint_path))

    if use_fp16:
        cast_fp32_to_fp16(exe, main_program)


def init_pretraining_params(exe,
                            pretraining_params_path,
                            main_program,
                            use_fp16=False):
    assert os.path.exists(pretraining_params_path
                          ), "[%s] can't be found." % pretraining_params_path

    def existed_params(var):
        if not isinstance(var, fluid.framework.Parameter):
            return False
        return os.path.exists(os.path.join(pretraining_params_path, var.name))

    fluid.io.load_vars(
        exe,
        pretraining_params_path,
        main_program=main_program,
        predicate=existed_params)
    print("Load pretraining parameters from {}.".format(
        pretraining_params_path))

    if use_fp16:
        cast_fp32_to_fp16(exe, main_program)
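A minimal usage sketch for these helpers; the executor setup and the `./bert_params` path below are illustrative assumptions, not values from the module:

```python
import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

# load BERT parameters from an assumed directory before fine-tuning
init_pretraining_params(
    exe, "./bert_params", main_program=fluid.default_main_program())
```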
"""
Network for auto dialogue evaluation
"""
import os
import sys
import time
import six
import numpy as np
import math
import paddle.fluid as fluid
import paddle
class Network(object):
"""
Network
"""
def __init__(self,
vocab_size,
emb_size,
hidden_size,
clip_value=10.0,
word_emb_name="shared_word_emb",
lstm_W_name="shared_lstm_W",
lstm_bias_name="shared_lstm_bias"):
"""
Init function
"""
self.vocab_size = vocab_size
self.emb_size = emb_size
self.hidden_size = hidden_size
self.clip_value = clip_value
self.word_emb_name = word_emb_name
self.lstm_W_name = lstm_W_name
self.lstm_bias_name = lstm_bias_name
def network(self, loss_type='CLS'):
"""
Network definition
"""
#Input data
context_wordseq = fluid.layers.data(
name="context_wordseq", shape=[1], dtype="int64", lod_level=1)
response_wordseq = fluid.layers.data(
name="response_wordseq", shape=[1], dtype="int64", lod_level=1)
label = fluid.layers.data(name="label", shape=[1], dtype="float32")
self._feed_name = ["context_wordseq", "response_wordseq", "label"]
self._feed_infer_name = ["context_wordseq", "response_wordseq"]
#emb
context_emb = fluid.layers.embedding(
input=context_wordseq,
size=[self.vocab_size, self.emb_size],
is_sparse=True,
param_attr=fluid.ParamAttr(
name=self.word_emb_name,
initializer=fluid.initializer.Normal(scale=0.1)))
response_emb = fluid.layers.embedding(
input=response_wordseq,
size=[self.vocab_size, self.emb_size],
is_sparse=True,
param_attr=fluid.ParamAttr(
name=self.word_emb_name,
initializer=fluid.initializer.Normal(scale=0.1)))
#fc to fit dynamic LSTM
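        # dynamic_lstm expects its input width to be 4 * hidden_size,
        # one slice per LSTM gate plus the cell candidate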
        context_fc = fluid.layers.fc(
            input=context_emb,
            size=self.hidden_size * 4,
            param_attr=fluid.ParamAttr(name='fc_weight'),
            bias_attr=fluid.ParamAttr(name='fc_bias'))
        response_fc = fluid.layers.fc(
            input=response_emb,
            size=self.hidden_size * 4,
            param_attr=fluid.ParamAttr(name='fc_weight'),
            bias_attr=fluid.ParamAttr(name='fc_bias'))

        #LSTM
        context_rep, _ = fluid.layers.dynamic_lstm(
            input=context_fc,
            size=self.hidden_size * 4,
            param_attr=fluid.ParamAttr(name=self.lstm_W_name),
            bias_attr=fluid.ParamAttr(name=self.lstm_bias_name))
        context_rep = fluid.layers.sequence_last_step(input=context_rep)
        print('context_rep shape: %s' % str(context_rep.shape))

        response_rep, _ = fluid.layers.dynamic_lstm(
            input=response_fc,
            size=self.hidden_size * 4,
            param_attr=fluid.ParamAttr(name=self.lstm_W_name),
            bias_attr=fluid.ParamAttr(name=self.lstm_bias_name))
        response_rep = fluid.layers.sequence_last_step(input=response_rep)
        print('response_rep shape: %s' % str(response_rep.shape))

        logits = fluid.layers.bilinear_tensor_product(
            context_rep, response_rep, size=1)
        print('logits shape: %s' % str(logits.shape))  #[batch,1]

        if loss_type == 'CLS':
            loss = fluid.layers.sigmoid_cross_entropy_with_logits(logits, label)
            print('before reduce mean loss shape: %s' % str(loss.shape))
            loss = fluid.layers.reduce_mean(
                fluid.layers.clip(
                    loss, min=-self.clip_value, max=self.clip_value))
            print('after reduce mean loss shape: %s' % str(loss.shape))
        elif loss_type == 'L2':
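            # sigmoid maps logits into (0, 1), so 2 * sigmoid is a score in
            # (0, 2); dividing the squared error by 4 keeps the loss in [0, 1]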
            norm_score = 2 * fluid.layers.sigmoid(logits)
            loss = fluid.layers.square_error_cost(norm_score, label) / 4
            loss = fluid.layers.reduce_mean(loss)
        else:
            raise ValueError

        return logits, loss

    def set_word_embedding(self, word_emb, place):
        """
        Set word embedding
        """
        word_emb_param = fluid.global_scope().find_var(
            self.word_emb_name).get_tensor()
        word_emb_param.set(word_emb, place)

    def get_feed_names(self):
        """
        Return feed names
        """
        return self._feed_name

    def get_feed_inference_names(self):
        """
        Return inference names
        """
        return self._feed_infer_name
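A minimal usage sketch for this class; the vocabulary and layer sizes below are illustrative assumptions, not values from the module:

```python
import paddle.fluid as fluid

# illustrative hyperparameters for the matching model
net = Network(vocab_size=50000, emb_size=256, hidden_size=256)
logits, loss = net.network(loss_type='CLS')  # build the matching graph

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
```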