Unverified commit a3f6ee86, authored by Yibing Liu, committed by GitHub

Init paddle-nlp (#2112)

* init paddle-nlp tools for QA test

* Fix paragraph extraction bug

* Update download links

* first update LAC README.md

* rename EmoTect as emotion_detection

* download data from bos

* Update README.md

* Rename project

* second add code

* modify downloads.sh for lac

* rename LAC to lexical_analysis

* update lac readme

* Update README.md

* Update README.md

* Update README.md

* add struct.jpg

* Update README.md

* Update README.md

* update README

* Update README.md

* update emotion_detection README

* add download_data.sh and download_model.sh

* first commit ADE

* dialogue_model_toolkit_update

* update emotion_detection model bos url

* update README

* update readme

* update readme

* update download file

* first commit DAM

* add readme

* fix readme

* fix readme

* fix readme

* fix readme

* fix readme

* rename

* rename again

* 1. add gradient_clip for ernie_lac
2. delete LARK config

* fix download.sh

* Rename MRC task

* fix logger

* fix to douban

* fix final

* update readme

* update readme

* update readme

* fix batch is null

* fix typo

* fix typo

* fix typo

* update ernie config

* update readme

* add AI platform url in readme

* update readme subtitle style

* update

* Update README.md

* Update README.md

* update

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* update batch size

* adapt to small data size

* update ERNIE bcebos url

* add language model

* modify readme

* update

* update

* Update README.md

* Update README.md

* fix readme

* fix max_step, update run.sh and run_ernie.sh

* add finetuned model for lac

* fix bug

* Update README.md

* update

* Update README.md

* add ERNIE pretrained model, and update README

* update readme

* add CPU

* update infer in run.sh and run_ernie.sh

* Update README.md

* Update README.md

* Delete test.py

* fix bug

* fix run.sh infer bug & add ernie infer code

* fix cpu mode

* Update README.md

* fix bug for python3

* fix CPU and GPU diff result bug

* Update README.md

* update readme

* Update run_classifier.py

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update run.sh

* Update run_ernie.sh

* modify dir

* Update README.md

* modify dir too

* modify path

* Update README.md

* PaddleNLP modules backup to old/, rm links-LAC,Senta,SimNet

* mv all modules out of paddle-nlp, rm Senta, auto_dialog_eval, deep_match

* mv models/classify to models/classification, models/seq_lab to models/sequence_labeling

* update readme for models/classification

* update sentiment_classification and rm README

* Add Transformer into paddle-nlp

* change seq_lab to sequence labeling

* Rename old as unarchived in PaddleNLP

* add LARK

* Update README, add paddlehub

* add paddlehub

* Add tmp readme

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update run_ernie.sh

* Update run_ernie.sh

* Update README.md

* Update run_ernie_classifier.py

* Update README.md

* Update README.md

* Update run.sh

* Update run_ernie_classifier.py

* update

* fix chunk_evaluator bug

* change names

* Update README

* add gitmodules

* add install code

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update READMEs

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* README

* Update README.md

* update emotion_detection README

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* update README, add finetune doc

* update emotion_detection readme

* change run.sh

* Update README.md

* Update the link in fluid dir

* update readme

* update README for markdown style

* Update README.md

* Update README.md
Parent f89fab5b
@@ -13,3 +13,9 @@
[submodule "PaddleNLP/knowledge-driven-dialogue"]
path = PaddleNLP/knowledge-driven-dialogue
url = https://github.com/baidu/knowledge-driven-dialogue
[submodule "PaddleNLP/language_representations_kit"]
path = PaddleNLP/language_representations_kit
url = https://github.com/PaddlePaddle/LARK
[submodule "PaddleNLP/knowledge_driven_dialogue"]
path = PaddleNLP/knowledge_driven_dialogue
url = https://github.com/baidu/knowledge-driven-dialogue/
Subproject commit a4eb73b2fb64d8aab8499a1184edf4fc386f8268
Subproject commit 77ab80a7061024c4b28f0b41fdd6ba42d5e6d9e1
@@ -9,7 +9,7 @@ Machine Translation (NMT) stages; only after NMT matured did machine translation truly
The Transformer implemented in this example is exactly such a machine-translation model based on self-attention: it contains no RNN or CNN structures and relies entirely on attention to learn contextual dependencies in language. Compared with RNN/CNN, this structure has lower computational complexity within a single layer, is easier to parallelize, and models long-range dependencies more easily; it has achieved the best translation quality across multiple language pairs.
- [Transformer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README_cn.md)
- [Transformer](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README.md)
Chinese Lexical Analysis
@@ -35,7 +35,7 @@ Machine Translation (NMT) stages; only after NMT matured did machine translation truly
The DAM (Deep Attention Matching Network) released in this example is work from Baidu's NLP department published at ACL 2018, used for response selection in multi-turn dialogue of retrieval-based chatbots. Inspired by the Transformer, its network structure is built entirely on attention: stacked self-attention learns semantic representations of the response and the context at multiple granularities, and cross-attention then captures the relevance between the response and the context. It outperforms other models on two large-scale multi-turn dialogue datasets.
- [Deep Attention Matching Network](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/deep_attention_matching_net)
- [Deep Attention Matching Network](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching)
AnyQ
----
@@ -53,4 +53,4 @@ SimNet is a semantic matching framework developed in-house by Baidu's NLP department in 2013
The Baidu reading-comprehension dataset is a real-world dataset open-sourced by Baidu's NLP department: all questions and passages come from real data (Baidu Search and the Baidu Zhidao Q&A community), and the answers are written by humans. Each question corresponds to multiple answers. The dataset contains 200k questions, 1000k passages, and 420k answers, making it the largest Chinese MRC dataset so far. Baidu also open-sourced the corresponding reading-comprehension model, DuReader: it adopts the now-common layered network design, captures the interaction between question and passage with a bidirectional attention mechanism to build a query-aware passage representation, and finally predicts the answer span from that representation with a pointer network.
- [DuReader in PaddlePaddle Fluid](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/machine_reading_comprehension/README.md)
- [DuReader in PaddlePaddle Fluid](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/reading_comprehension)
\ No newline at end of file
Subproject commit dc1af6a83dd1372055158ac6d17f6d14b3a0f0f8
Subproject commit b3e096b92f26720f6e3b020b374e11aa0748c032
# Auto Dialogue Evaluation
## Introduction
### Task description
Auto Dialogue Evaluation scores the response quality of open-domain dialogue systems, helping companies and individuals quickly assess response quality while reducing the cost of human evaluation.
1. With no labeled data, a matching model trained with negative sampling serves as the evaluation tool and produces a quality ranking over the responses of multiple dialogue systems (see the sketch after this list);
2. With a small amount of labeled data (human scores for a specific dialogue system or scenario), finetuning the matching model significantly improves evaluation quality for that system or scenario.
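A minimal sketch of the negative-sampling idea above; `build_pairs` and its arguments are hypothetical names for illustration only, not part of this project's code:
```
import random

def build_pairs(contexts, responses, num_neg=1):
    # Pair each context with its true response (label 1) and with randomly
    # sampled responses from other dialogues (label 0).
    # Assumes there is more than one dialogue in the corpus.
    pairs = []
    for i, (ctx, pos) in enumerate(zip(contexts, responses)):
        pairs.append((ctx, pos, 1))
        for _ in range(num_neg):
            j = random.randrange(len(responses))
            while j == i:
                j = random.randrange(len(responses))
            pairs.append((ctx, responses[j], 0))
    return pairs
```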
### Results
We take four different dialogue systems (seq2seq\_naive / seq2seq\_att / keywords / human) as examples and score them with the auto dialogue evaluation tool.
1. With no labeled data, evaluate directly with the pretrained evaluation tool;
The spearman correlation between the automatic scores and the human scores on the four systems:

/|seq2seq\_naive|seq2seq\_att|keywords|human
:--|:--:|:--:|:--:|:--:
cor|0.361|0.343|0.324|0.288

Ranking of the four systems by average score:

Human evaluation|k(0.591)<n(0.847)<a(1.116)<h(1.240)
:--|:--
Automatic evaluation|k(0.625)<n(0.909)<a(1.399)<h(1.683)

2. After finetuning with a small amount of labeled data, the spearman correlation between the automatic scores and the human scores:

/|seq2seq\_naive|seq2seq\_att|keywords|human
:--|:--:|:--:|:--:|:--:
cor|0.474|0.477|0.443|0.378
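The correlations above are spearman rank correlations, computed the same way as `evaluate_cor` in evaluation.py; a minimal sketch with made-up scores:
```
import pandas as pd

auto_scores = [0.62, 0.91, 1.40, 1.68]   # made-up automatic scores
human_scores = [0.59, 0.85, 1.12, 1.24]  # made-up human scores
df = pd.DataFrame({'pred': auto_scores, 'true': human_scores})
print(df.corr('spearman')['pred']['true'])  # 1.0 for this monotonic toy data
```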
## Quick start
### Installation
1. Install PaddlePaddle
This project depends on PaddlePaddle Fluid 1.3.1; please follow the installation guide to install it.
2. Clone the code
3. Environment dependencies
### Running the model for the first time
1. Data preparation
Download the preprocessed data. After running the script, the data directory will contain unlabel_data (train.ids/val.ids/test.ids/word2ids) and label_data (train.ids/val.ids/test.ids for each of the four tasks).
This project only releases the test-set data; for the other data, only samples are provided.
```
sh download_data.sh
```
2. Model download
We release a model pretrained on large amounts of unlabeled data, as well as models finetuned on small labeled sets, both ready for direct use.
```
cd model_files
sh download_model.sh
```
3. Model inference
With the models and data above, the following command scores dialogue data directly.
```
TASK=human
python -u main.py \
--do_infer True \
--use_cuda \
--test_path data/label_data/$TASK/test.ids \
--init_model model_files/${TASK}_finetuned
```
4. Model evaluation
With the models and data above, the following commands measure evaluation quality.
Evaluate the pretrained model as the automatic scorer:
```
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--use_cuda \
--test_path data/label_data/$task/test.ids \
--init_model model_files/matching_pretrained \
--loss_type L2
done
```
Evaluate the finetuned models:
```
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--use_cuda \
--test_path data/label_data/$task/test.ids \
--init_model model_files/${task}_finetuned \
--loss_type L2
done
```
5. Training and validation
With the sample dataset, the following command runs the first-stage (matching pretraining) training:
```
python -u main.py \
--do_train True \
--use_cuda \
--save_path model_files_tmp/matching_pretrained \
--train_path data/unlabel_data/train.ids \
--val_path data/unlabel_data/val.ids
```
On top of the first-stage model, a small amount of labeled data can be used for second-stage (finetuning) training:
```
TASK=human
python -u main.py \
--do_train True \
--loss_type L2 \
--use_cuda \
--save_path model_files_tmp/${TASK}_finetuned \
--init_model model_files/matching_pretrained \
--train_path data/label_data/$TASK/train.ids \
--val_path data/label_data/$TASK/val.ids \
--print_step 1 \
--save_step 1 \
--num_scan_data 50
```
## Advanced usage
### Task definition and modeling
The input of the auto dialogue evaluation task is a text pair (context, response); the output is a response quality score.
### Model overview
Matching (predicting whether a context and a response match) is naturally related to automatic evaluation, so this project uses the matching task as pretraining for automatic evaluation;
the matching model is then finetuned with a small amount of labeled data.
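A minimal sketch of the two loss choices (`CLS` for matching pretraining, `L2` for regressing to human scores in finetuning); the layers used are standard Fluid APIs, but the shapes and variable names here are illustrative, not this project's actual network code:
```
import paddle.fluid as fluid

# one matching score per (context, response) pair
logits = fluid.layers.data(name='logits', shape=[1], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='float32')
# CLS: binary matching loss used in pretraining
cls_loss = fluid.layers.sigmoid_cross_entropy_with_logits(x=logits, label=label)
# L2: regression to human scores used in finetuning
l2_loss = fluid.layers.square_error_cost(input=logits, label=label)
```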
### Data format
The data used for training, inference, and evaluation consists of three tab-separated ('\t') columns: the first is the space-separated context ids, the second the space-separated response ids, and the third the label.
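An illustrative line with made-up ids (tabs shown as `\t`):
```
12 45 9 876\t33 2 19\t1
```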
Note: this project also provides a tokenization/preprocessing script (under the preprocess directory); usage:
```
python tokenizer.py --test_data_dir ./test.txt.utf8 --batch_size 1 > test.txt.utf8.seg
```
### Code structure
main.py: entry point of the project, wrapping training, inference, and evaluation
config.py: configuration of the model, including the model type and its hyperparameters
reader.py: data reading and vocabulary loading
evaluation.py: evaluation functions
init.py: model loading
run.sh: script for training, inference, and evaluation
## Others
How to contribute
If you can fix an issue or add a new feature, feel free to submit a PR. If the PR is accepted, we will score the contribution by quality and difficulty (0-5, higher is better); once you accumulate 10 points, you can contact us for an interview opportunity or a recommendation letter.
"""
Auto Dialogue Evaluation.
"""
import argparse
import six
def parse_args():
"""
Auto Dialogue Evaluation Config
"""
parser = argparse.ArgumentParser('Automatic Dialogue Evaluation.')
parser.add_argument(
'--do_train', type=bool, default=False, help='Whether to perform training.')
parser.add_argument(
'--do_val', type=bool, default=False, help='Whether to perform evaluation.')
parser.add_argument(
'--do_infer', type=bool, default=False, help='Whether to perform inference.')
parser.add_argument(
'--loss_type', type=str, default='CLS', help='Loss type, CLS or L2.')
#data path
parser.add_argument(
'--train_path', type=str, default=None, help='Path of training data')
parser.add_argument(
'--val_path', type=str, default=None, help='Path of validation data')
parser.add_argument(
        '--test_path', type=str, default=None, help='Path of test data')
parser.add_argument(
'--save_path', type=str, default='tmp', help='Save path')
#step fit for data size
parser.add_argument(
'--print_step', type=int, default=50, help='Print step')
parser.add_argument(
'--save_step', type=int, default=400, help='Save step')
parser.add_argument(
        '--num_scan_data', type=int, default=20, help='Number of passes (epochs) over the training data')
parser.add_argument(
'--word_emb_init', type=str, default=None, help='Path to the initial word embedding')
parser.add_argument(
'--init_model', type=str, default=None, help='Path to the init model')
parser.add_argument(
'--use_cuda',
action='store_true',
help='If set, use cuda for training.')
parser.add_argument(
'--batch_size', type=int, default=256, help='Batch size')
parser.add_argument(
'--hidden_size', type=int, default=256, help='Hidden size')
parser.add_argument(
'--emb_size', type=int, default=256, help='Embedding size')
parser.add_argument(
'--vocab_size', type=int, default=484016, help='Vocabulary size')
parser.add_argument(
'--learning_rate', type=float, default=0.001, help='Learning rate')
parser.add_argument(
'--sample_pro', type=float, default=0.1, help='Sample probability for training data')
parser.add_argument(
'--max_len', type=int, default=50, help='Max length for sentences')
args = parser.parse_args()
return args
def print_arguments(args):
"""
Print Config
"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_dataset-1.0.0.tar.gz
tar -xzf auto_dialogue_evaluation_dataset-1.0.0.tar.gz
"""
Evaluation for auto dialogue evaluation
"""
import sys
import numpy as np
import pandas as pd
def get_p_at_n_in_m(data, n, m, ind):
"""
Get n in m
"""
pos_score = data[ind][0]
curr = data[ind : ind+m]
curr = sorted(curr, key = lambda x: x[0], reverse=True)
if curr[n-1][0] <= pos_score:
return 1
return 0
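# Note: evaluate_Recall below assumes the scored data lists candidates in
# groups of 10, with the positive response first in each group (enforced by
# the assert on data[ind][1] == 1).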
def evaluate_Recall(data):
"""
Evaluate Recall
"""
p_at_1_in_2 = 0.0
p_at_1_in_10 = 0.0
p_at_2_in_10 = 0.0
p_at_5_in_10 = 0.0
    data = list(data)  # callers may pass a zip iterator (Python 3)
    length = len(data) // 10
    for i in range(0, length):
ind = i * 10
assert data[ind][1] == 1
p_at_1_in_2 += get_p_at_n_in_m(data, 1, 2, ind)
p_at_1_in_10 += get_p_at_n_in_m(data, 1, 10, ind)
p_at_2_in_10 += get_p_at_n_in_m(data, 2, 10, ind)
p_at_5_in_10 += get_p_at_n_in_m(data, 5, 10, ind)
recall_dict = {
'1_in_2': p_at_1_in_2 / length,
'1_in_10': p_at_1_in_10 / length,
'2_in_10': p_at_2_in_10 / length,
'5_in_10': p_at_5_in_10 / length}
return recall_dict
def evaluate_cor(pred, true):
"""
Evaluate cor
"""
df = pd.DataFrame({'pred': pred, 'true': true})
cor_matrix = df.corr('spearman')
return cor_matrix['pred']['true']
"""
Init for pretrained para
"""
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def init_pretraining_params(exe,
pretraining_params_path,
main_program):
"""
Init pretraining params
"""
assert os.path.exists(pretraining_params_path
), "[%s] cann't be found." % pretraining_params_path
def existed_params(var):
"""
Test existed
"""
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
print("Load pretraining parameters from {}.".format(
pretraining_params_path))
"""
Auto dialogue evaluation task
"""
import os
import sys
import six
import numpy as np
import time
import multiprocessing
import paddle
import paddle.fluid as fluid
import reader as reader
import evaluation as eva
import init as init
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
sys.path.append('../../models/dialogue_model_toolkit/auto_dialogue_evaluation/')
from net import Network
import config
def train(args):
"""Train
"""
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
net = Network(args.vocab_size, args.emb_size, args.hidden_size)
train_program = fluid.Program()
train_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
train_program.random_seed = 110
train_startup.random_seed = 110
with fluid.program_guard(train_program, train_startup):
with fluid.unique_name.guard():
logits, loss = net.network(args.loss_type)
loss.persistable = True
logits.persistable = True
# gradient clipping
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
max=1.0, min=-1.0))
optimizer = fluid.optimizer.Adam(
learning_rate=args.learning_rate)
optimizer.minimize(loss)
print("begin memory optimization ...")
fluid.memory_optimize(train_program)
print("end memory optimization ...")
test_program = fluid.Program()
test_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
test_program.random_seed = 110
test_startup.random_seed = 110
with fluid.program_guard(test_program, test_startup):
with fluid.unique_name.guard():
logits, loss = net.network(args.loss_type)
loss.persistable = True
logits.persistable = True
test_program = test_program.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
print("device count %d" % dev_count)
print("theoretical memory usage: ")
print(
fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size))
exe = fluid.Executor(place)
exe.run(train_startup)
exe.run(test_startup)
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda, loss_name=loss.name, main_program=train_program)
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_program,
share_vars_from=train_exe)
if args.word_emb_init is not None:
print("start loading word embedding init ...")
if six.PY2:
word_emb = np.array(pickle.load(open(args.word_emb_init,
'rb'))).astype('float32')
else:
word_emb = np.array(
pickle.load(
open(args.word_emb_init, 'rb'), encoding="bytes")).astype(
'float32')
net.set_word_embedding(word_emb, place)
print("finish init word embedding ...")
print("start loading data ...")
def train_with_feed(batch_data):
"""
Train on one batch
"""
        # TODO: get_feed_names
feed_dict = dict(zip(net.get_feed_names(), batch_data))
cost = train_exe.run(feed=feed_dict, fetch_list=[loss.name])
return cost[0]
def test_with_feed(batch_data):
"""
Test on one batch
"""
feed_dict = dict(zip(net.get_feed_names(), batch_data))
score = test_exe.run(feed=feed_dict, fetch_list=[logits.name])
return score[0]
def evaluate():
"""
Evaluate to choose model
"""
val_batches = reader.batch_reader(
args.val_path, args.batch_size, place, args.max_len, 1)
scores = []
labels = []
for batch in val_batches:
scores.extend(test_with_feed(batch))
labels.extend([x[0] for x in batch[2]])
return eva.evaluate_Recall(zip(scores, labels))
def save_exe(step, best_recall):
"""
Save exe conditional
"""
recall_dict = evaluate()
print('evaluation recall result:')
print('1_in_2: %s\t1_in_10: %s\t2_in_10: %s\t5_in_10: %s' % (
recall_dict['1_in_2'], recall_dict['1_in_10'],
recall_dict['2_in_10'], recall_dict['5_in_10']))
if recall_dict['1_in_10'] > best_recall and step != 0:
fluid.io.save_inference_model(args.save_path,
net.get_feed_inference_names(),
logits, exe, main_program=train_program)
print("Save model at step %d ... " % step)
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
best_recall = recall_dict['1_in_10']
return best_recall
# train over different epoches
global_step, train_time = 0, 0.0
best_recall = 0
for epoch in six.moves.xrange(args.num_scan_data):
train_batches = reader.batch_reader(
args.train_path, args.batch_size, place,
args.max_len, args.sample_pro)
begin_time = time.time()
sum_cost = 0
for batch in train_batches:
if (args.save_path is not None) and (global_step % args.save_step == 0):
best_recall = save_exe(global_step, best_recall)
cost = train_with_feed(batch)
global_step += 1
sum_cost += cost.mean()
if global_step % args.print_step == 0:
print('training step %s avg loss %s' % (global_step, sum_cost / args.print_step))
sum_cost = 0
pass_time_cost = time.time() - begin_time
train_time += pass_time_cost
print("Pass {0}, pass_time_cost {1}"
.format(epoch, "%2.2f sec" % pass_time_cost))
def finetune(args):
"""
Finetune
"""
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
net = Network(args.vocab_size, args.emb_size, args.hidden_size)
train_program = fluid.Program()
train_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
train_program.random_seed = 110
train_startup.random_seed = 110
with fluid.program_guard(train_program, train_startup):
with fluid.unique_name.guard():
logits, loss = net.network(args.loss_type)
loss.persistable = True
logits.persistable = True
# gradient clipping
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
max=1.0, min=-1.0))
optimizer = fluid.optimizer.Adam(
learning_rate=fluid.layers.exponential_decay(
learning_rate=args.learning_rate,
decay_steps=400,
decay_rate=0.9,
staircase=True))
optimizer.minimize(loss)
print("begin memory optimization ...")
fluid.memory_optimize(train_program)
print("end memory optimization ...")
test_program = fluid.Program()
test_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
test_program.random_seed = 110
test_startup.random_seed = 110
with fluid.program_guard(test_program, test_startup):
with fluid.unique_name.guard():
logits, loss = net.network(args.loss_type)
loss.persistable = True
logits.persistable = True
test_program = test_program.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
print("device count %d" % dev_count)
print("theoretical memory usage: ")
print(
fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size))
exe = fluid.Executor(place)
exe.run(train_startup)
exe.run(test_startup)
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda, loss_name=loss.name, main_program=train_program)
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_program,
share_vars_from=train_exe)
if args.init_model:
init.init_pretraining_params(
exe,
args.init_model,
main_program=train_startup)
        print('success init %s' % args.init_model)
print("start loading data ...")
def train_with_feed(batch_data):
"""
Train on one batch
"""
        # TODO: get_feed_names
feed_dict = dict(zip(net.get_feed_names(), batch_data))
cost = train_exe.run(feed=feed_dict, fetch_list=[loss.name])
return cost[0]
def test_with_feed(batch_data):
"""
Test on one batch
"""
feed_dict = dict(zip(net.get_feed_names(), batch_data))
score = test_exe.run(feed=feed_dict, fetch_list=[logits.name])
return score[0]
def evaluate():
"""
Evaluate to choose model
"""
val_batches = reader.batch_reader(
args.val_path, args.batch_size, place, args.max_len, 1)
scores = []
labels = []
for batch in val_batches:
scores.extend(test_with_feed(batch))
labels.extend([x[0] for x in batch[2]])
scores = [x[0] for x in scores]
return eva.evaluate_cor(scores, labels)
def save_exe(step, best_cor):
"""
Save exe conditional
"""
cor = evaluate()
        print('evaluation correlation: %s' % cor)
if cor > best_cor and step != 0:
fluid.io.save_inference_model(args.save_path,
net.get_feed_inference_names(), logits,
exe, main_program=train_program)
print("Save model at step %d ... " % step)
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
best_cor = cor
return best_cor
# train over different epoches
global_step, train_time = 0, 0.0
best_cor = 0.0
pre_index = -1
for epoch in six.moves.xrange(args.num_scan_data):
train_batches = reader.batch_reader(
args.train_path,
args.batch_size, place,
args.max_len, args.sample_pro)
begin_time = time.time()
sum_cost = 0
for batch in train_batches:
if (args.save_path is not None) and (global_step % args.save_step == 0):
best_cor = save_exe(global_step, best_cor)
cost = train_with_feed(batch)
global_step += 1
sum_cost += cost.mean()
if global_step % args.print_step == 0:
print('training step %s avg loss %s' % (global_step, sum_cost / args.print_step))
sum_cost = 0
pass_time_cost = time.time() - begin_time
train_time += pass_time_cost
print("Pass {0}, pass_time_cost {1}"
.format(epoch, "%2.2f sec" % pass_time_cost))
def evaluate(args):
"""
Evaluate model for both pretrained and finetuned
"""
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
t0 = time.time()
with fluid.scope_guard(fluid.core.Scope()):
infer_program, feed_target_names, fetch_vars = fluid.io.load_inference_model(
args.init_model, exe)
global_step, infer_time = 0, 0.0
test_batches = reader.batch_reader(
args.test_path, args.batch_size, place,
args.max_len, 1)
scores = []
labels = []
for batch in test_batches:
logits = exe.run(
infer_program,
feed = {
'context_wordseq': batch[0],
'response_wordseq': batch[1]},
fetch_list = fetch_vars)
logits = [x[0] for x in logits[0]]
scores.extend(logits)
labels.extend([x[0] for x in batch[2]])
mean_score = sum(scores)/len(scores)
if args.loss_type == 'CLS':
recall_dict = eva.evaluate_Recall(zip(scores, labels))
print('mean score: %s' % mean_score)
print('evaluation recall result:')
print('1_in_2: %s\t1_in_10: %s\t2_in_10: %s\t5_in_10: %s' % (
recall_dict['1_in_2'], recall_dict['1_in_10'],
recall_dict['2_in_10'], recall_dict['5_in_10']))
elif args.loss_type == 'L2':
cor = eva.evaluate_cor(scores, labels)
        print('mean score: %s\nevaluation cor results: %s' % (mean_score, cor))
else:
raise ValueError
t1 = time.time()
print("finish evaluate model:%s on data:%s time_cost(s):%.2f" %
(args.init_model, args.test_path, t1 - t0))
def infer(args):
"""
Inference function
"""
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
t0 = time.time()
with fluid.scope_guard(fluid.core.Scope()):
infer_program, feed_target_names, fetch_vars = fluid.io.load_inference_model(
args.init_model, exe)
global_step, infer_time = 0, 0.0
test_batches = reader.batch_reader(
args.test_path, args.batch_size, place,
args.max_len, 1)
scores = []
for batch in test_batches:
logits = exe.run(
infer_program,
feed = {
'context_wordseq': batch[0],
'response_wordseq': batch[1]},
fetch_list = fetch_vars)
logits = [x[0] for x in logits[0]]
scores.extend(logits)
in_file = open(args.test_path, 'r')
out_path = args.test_path + '.infer'
out_file = open(out_path, 'w')
for line, s in zip(in_file, scores):
out_file.write('%s\t%s\n' % (line.strip(), s))
in_file.close()
out_file.close()
t1 = time.time()
print("finish infer model:%s out file: %s time_cost(s):%.2f" %
(args.init_model, out_path, t1 - t0))
def main():
"""
main
"""
args = config.parse_args()
config.print_arguments(args)
    if args.do_train:
if args.loss_type == 'CLS':
train(args)
elif args.loss_type == 'L2':
finetune(args)
else:
raise ValueError
    elif args.do_val:
evaluate(args)
    elif args.do_infer:
infer(args)
else:
raise ValueError
if __name__ == '__main__':
main()
#matching pretrained
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_matching_pretrained-1.0.0.tar.gz
tar -xzf auto_dialogue_evaluation_matching_pretrained-1.0.0.tar.gz
#finetuned
for task in seq2seq_naive seq2seq_att keywords human
do
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_${task}_finetuned-1.0.0.tar.gz
tar -xzf auto_dialogue_evaluation_${task}_finetuned-1.0.0.tar.gz
done
"""
Reader for auto dialogue evaluation
"""
import sys
import time
import numpy as np
import random
import paddle.fluid as fluid
import paddle
def to_lodtensor(data, place):
"""
    Pack a batch of variable-length id sequences into one LoDTensor,
    with a level-1 LoD recording the cumulative sequence offsets
"""
seq_lens = [len(seq) for seq in data]
cur_len = 0
lod = [cur_len]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res.set(flattened_data, place)
res.set_lod([lod])
return res
def reshape_batch(batch, place):
"""
Reshape batch
"""
context_reshape = to_lodtensor([dat[0] for dat in batch], place)
response_reshape = to_lodtensor([dat[1] for dat in batch], place)
label_reshape = [dat[2] for dat in batch]
return (context_reshape, response_reshape, label_reshape)
def batch_reader(data_path,
batch_size,
place,
max_len=50,
sample_pro=1):
"""
Yield batch
"""
batch = []
with open(data_path, 'r') as f:
for line in f:
#sample for training data
if sample_pro < 1:
if random.random() > sample_pro:
continue
tokens = line.strip().split('\t')
assert len(tokens) == 3
context = [int(x) for x in tokens[0].split()[:max_len]]
response = [int(x) for x in tokens[1].split()[:max_len]]
label = [int(tokens[2])]
#label = int(tokens[2])
instance = (context, response, label)
if len(batch) < batch_size:
batch.append(instance)
else:
if len(batch) == batch_size:
yield reshape_batch(batch, place)
batch = [instance]
if len(batch) == batch_size:
yield reshape_batch(batch, place)
export CUDA_VISIBLE_DEVICES=4
export FLAGS_eager_delete_tensor_gb=0.0
#pretrain
python -u main.py \
--do_train True \
--use_cuda \
--save_path model_files_tmp/matching_pretrained \
--train_path data/unlabel_data/train.ids \
--val_path data/unlabel_data/val.ids
#finetune based on one task
TASK=human
python -u main.py \
--do_train True \
--loss_type L2 \
--use_cuda \
--save_path model_files_tmp/${TASK}_finetuned \
--init_model model_files/matching_pretrained \
--train_path data/label_data/$TASK/train.ids \
--val_path data/label_data/$TASK/val.ids \
--print_step 1 \
--save_step 1 \
--num_scan_data 50
#evaluate pretrained model by Recall
python -u main.py \
--do_val True \
--use_cuda \
--test_path data/unlabel_data/test.ids \
--init_model model_files/matching_pretrained \
--loss_type CLS
#evaluate pretrained model by Cor
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--use_cuda \
--test_path data/label_data/$task/test.ids \
--init_model model_files/matching_pretrained \
--loss_type L2
done
#evaluate finetuned model by Cor
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--use_cuda \
--test_path data/label_data/$task/test.ids \
--init_model model_files/${task}_finetuned \
--loss_type L2
done
#infer
TASK=human
python -u main.py \
--do_infer True \
--use_cuda \
--test_path data/label_data/$TASK/test.ids \
--init_model model_files/${TASK}_finetuned
export FLAGS_eager_delete_tensor_gb=0.0
#pretrain
python -u main.py \
--do_train True \
--sample_pro 0.9 \
--batch_size 64 \
--save_path model_files_tmp/matching_pretrained \
--train_path data/unlabel_data/train.ids \
--val_path data/unlabel_data/val.ids
#finetune based on one task
TASK=human
python -u main.py \
--do_train True \
--loss_type L2 \
--save_path model_files_tmp/${TASK}_finetuned \
--init_model model_files/matching_pretrained \
--train_path data/label_data/$TASK/train.ids \
--val_path data/label_data/$TASK/val.ids \
--print_step 1 \
--save_step 1 \
--num_scan_data 50
#evaluate pretrained model by Recall
python -u main.py \
--do_val True \
--test_path data/unlabel_data/test.ids \
--init_model model_files/matching_pretrained \
--loss_type CLS
#evaluate pretrained model by Cor
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--test_path data/label_data/$task/test.ids \
--init_model model_files/matching_pretrained \
--loss_type L2
done
#evaluate finetuned model by Cor
for task in seq2seq_naive seq2seq_att keywords human
do
echo $task
python -u main.py \
--do_val True \
--test_path data/label_data/$task/test.ids \
--init_model model_files/${task}_finetuned \
--loss_type L2
done
#infer
TASK=human
python -u main.py \
--do_infer True \
--test_path data/label_data/$TASK/test.ids \
--init_model model_files/${TASK}_finetuned
# __Deep Attention Matching Network__
## Introduction
### Task description
Deep Attention Matching Network is a response matching model for open-domain multi-turn dialogue: given the multi-turn dialogue history and a set of candidate responses, it ranks the candidates to select the most appropriate response.
The network structure is shown below; for details see the paper: [http://aclweb.org/anthology/P18-1103](http://aclweb.org/anthology/P18-1103).
<p align="center">
<img src="images/Figure1.png"/> <br />
Overview of Deep Attention Matching Network
</p>
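At inference time the model assigns each candidate response a matching score and the candidates are ranked by that score; a minimal sketch with made-up names and scores:
```
candidates = ['reply_a', 'reply_b', 'reply_c']
scores = [0.12, 0.87, 0.45]  # made-up matching scores
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(ranked[0][0])  # best response: reply_b
```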
### Results
Results of the model on two public datasets:
<p align="center">
<img src="images/Figure2.png"/> <br />
</p>
## Quick start
### Installation
1. Install PaddlePaddle
This project depends on PaddlePaddle Fluid 1.3.1; please follow the installation guide to install it.
2. Clone the code
3. Environment dependencies
### Running the model for the first time
1. Data preparation
Download the preprocessed data. After running the script, the data directory will contain two folders, ubuntu and douban (created by download_data.sh).
```
cd data
sh download_data.sh
```
2. Model training
```
python -u main.py \
--do_train True \
--use_cuda \
--data_path ./data/ubuntu/data_small.pkl \
--save_path ./model_files/ubuntu \
--use_pyreader \
--vocab_size 434512 \
--_EOS_ 28270 \
--batch_size 32
```
3. Model evaluation
```
python -u main.py \
--do_test True \
--use_cuda \
--data_path ./data/ubuntu/data_small.pkl \
--save_path ./model_files/ubuntu/step_372 \
--model_path ./model_files/ubuntu/step_372 \
--vocab_size 434512 \
--_EOS_ 28270 \
--batch_size 100
```
## Advanced usage
### Task definition and modeling
The input of the multi-turn response matching task is the multi-turn dialogue history and candidate responses; the output is a matching score for each response, by which the candidates are ranked.
### Model overview
See the paper: [http://aclweb.org/anthology/P18-1103](http://aclweb.org/anthology/P18-1103).
### Data format
The data used for training, inference, and evaluation consists of three tab-separated ('\t') columns: the first is the space-separated context ids, the second the space-separated response ids, and the third the label.
Note: this project also provides a tokenization/preprocessing script (under the preprocess directory); usage:
```
python tokenizer.py \
--test_data_dir ./test.txt.utf8 \
--batch_size 1 > test.txt.utf8.seg
```
### Code structure
main.py: entry point of the project, wrapping training and inference
config.py: configuration of the model, including the model type and its hyperparameters
reader.py: data reading and vocabulary loading
evaluation.py: evaluation functions
run.sh: script for training and inference
## Others
How to contribute
If you can fix an issue or add a new feature, feel free to submit a PR. If the PR is accepted, we will score the contribution by quality and difficulty (0-5, higher is better); once you accumulate 10 points, you can contact us for an interview opportunity or a recommendation letter.
"""
Deep Attention Matching Network
"""
import argparse
import six
def parse_args():
"""
Deep Attention Matching Network Config
"""
parser = argparse.ArgumentParser("DAM Config")
parser.add_argument(
'--do_train',
type=bool,
default=False,
help='Whether to perform training.')
parser.add_argument(
'--do_test',
type=bool,
default=False,
        help='Whether to perform testing.')
parser.add_argument(
'--batch_size',
type=int,
default=256,
help='Batch size for training. (default: %(default)d)')
parser.add_argument(
'--num_scan_data',
type=int,
default=2,
help='Number of pass for training. (default: %(default)d)')
parser.add_argument(
'--learning_rate',
type=float,
default=1e-3,
help='Learning rate used to train. (default: %(default)f)')
parser.add_argument(
'--data_path',
type=str,
default="data/data_small.pkl",
help='Path to training data. (default: %(default)s)')
parser.add_argument(
'--save_path',
type=str,
default="saved_models",
help='Path to save trained models. (default: %(default)s)')
parser.add_argument(
'--model_path',
type=str,
default=None,
help='Path to load well-trained models. (default: %(default)s)')
parser.add_argument(
'--use_cuda',
action='store_true',
help='If set, use cuda for training.')
parser.add_argument(
'--use_pyreader',
action='store_true',
help='If set, use pyreader for reading data.')
parser.add_argument(
'--ext_eval',
action='store_true',
        help='If set, use MAP, MRR, etc. for evaluation.')
parser.add_argument(
'--max_turn_num',
type=int,
default=9,
help='Maximum number of utterances in context.')
parser.add_argument(
'--max_turn_len',
type=int,
default=50,
        help='Maximum length of sentences in turns.')
parser.add_argument(
'--word_emb_init',
type=str,
default=None,
help='Path to the initial word embedding.')
parser.add_argument(
'--vocab_size',
type=int,
default=434512,
help='The size of vocabulary.')
parser.add_argument(
'--emb_size',
type=int,
default=200,
help='The dimension of word embedding.')
parser.add_argument(
'--_EOS_',
type=int,
default=28270,
help='The id for the end of sentence in vocabulary.')
parser.add_argument(
'--stack_num',
type=int,
default=5,
help='The number of stacked attentive modules in network.')
parser.add_argument(
'--channel1_num',
type=int,
default=32,
help="The channels' number of the 1st conv3d layer's output.")
parser.add_argument(
'--channel2_num',
type=int,
default=16,
help="The channels' number of the 2nd conv3d layer's output.")
args = parser.parse_args()
return args
def print_arguments(args):
"""
Print Config
"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
ubuntu_url=http://dam-data.cdn.bcebos.com/ubuntu.tar.gz
ubuntu_md5=9d7db116a040530a16f68dc0ab44e4b6
if [ ! -e ubuntu.tar.gz ]; then
wget -c $ubuntu_url
fi
echo "Checking md5 sum ..."
md5sum_tmp=`md5sum ubuntu.tar.gz | cut -d ' ' -f1`
if [ $md5sum_tmp != $ubuntu_md5 ]; then
echo "Md5sum check failed, please remove and redownload ubuntu.tar.gz"
exit 1
fi
echo "Untar ubuntu.tar.gz ..."
tar -xzvf ubuntu.tar.gz
mv data ubuntu
douban_url=http://dam-data.cdn.bcebos.com/douban.tar.gz
douban_md5=e07ca68f21c20e09efb3e8b247194405
if [ ! -e douban.tar.gz ]; then
wget -c $douban_url
fi
echo "Checking md5 sum ..."
md5sum_tmp=`md5sum douban.tar.gz | cut -d ' ' -f1`
if [ $md5sum_tmp != $douban_md5 ]; then
echo "Md5sum check failed, please remove and redownload douban.tar.gz"
exit 1
fi
echo "Untar douban.tar.gz ..."
tar -xzvf douban.tar.gz
mv data douban
"""
Evaluation
"""
import sys
import six
import numpy as np
from sklearn.metrics import average_precision_score
def evaluate_ubuntu(file_path):
"""
Evaluate on ubuntu data
"""
def get_p_at_n_in_m(data, n, m, ind):
"""
Recall n at m
"""
pos_score = data[ind][0]
curr = data[ind:ind + m]
curr = sorted(curr, key=lambda x: x[0], reverse=True)
if curr[n - 1][0] <= pos_score:
return 1
return 0
data = []
with open(file_path, 'r') as file:
for line in file:
line = line.strip()
tokens = line.split("\t")
if len(tokens) != 2:
continue
data.append((float(tokens[0]), int(tokens[1])))
#assert len(data) % 10 == 0
p_at_1_in_2 = 0.0
p_at_1_in_10 = 0.0
p_at_2_in_10 = 0.0
p_at_5_in_10 = 0.0
length = len(data) // 10
for i in six.moves.xrange(0, length):
ind = i * 10
assert data[ind][1] == 1
p_at_1_in_2 += get_p_at_n_in_m(data, 1, 2, ind)
p_at_1_in_10 += get_p_at_n_in_m(data, 1, 10, ind)
p_at_2_in_10 += get_p_at_n_in_m(data, 2, 10, ind)
p_at_5_in_10 += get_p_at_n_in_m(data, 5, 10, ind)
result_dict = {
"1_in_2": p_at_1_in_2 / length,
"1_in_10": p_at_1_in_10 / length,
"2_in_10": p_at_2_in_10 / length,
"5_in_10": p_at_5_in_10 / length}
return result_dict
def evaluate_douban(file_path):
"""
Evaluate douban data
"""
def mean_average_precision(sort_data):
"""
Evaluate mean average precision
"""
count_1 = 0
sum_precision = 0
for index in six.moves.xrange(len(sort_data)):
if sort_data[index][1] == 1:
count_1 += 1
sum_precision += 1.0 * count_1 / (index + 1)
return sum_precision / count_1
def mean_reciprocal_rank(sort_data):
"""
Evaluate MRR
"""
        sort_label = [s_d[1] for s_d in sort_data]
        assert 1 in sort_label
        return 1.0 / (1 + sort_label.index(1))
def precision_at_position_1(sort_data):
"""
Evaluate precision
"""
if sort_data[0][1] == 1:
return 1
else:
return 0
def recall_at_position_k_in_10(sort_data, k):
""""
Evaluate recall
"""
sort_lable = [s_d[1] for s_d in sort_data]
select_lable = sort_lable[:k]
return 1.0 * select_lable.count(1) / sort_lable.count(1)
def evaluation_one_session(data):
"""
Evaluate one session
"""
sort_data = sorted(data, key=lambda x: x[0], reverse=True)
m_a_p = mean_average_precision(sort_data)
m_r_r = mean_reciprocal_rank(sort_data)
p_1 = precision_at_position_1(sort_data)
r_1 = recall_at_position_k_in_10(sort_data, 1)
r_2 = recall_at_position_k_in_10(sort_data, 2)
r_5 = recall_at_position_k_in_10(sort_data, 5)
return m_a_p, m_r_r, p_1, r_1, r_2, r_5
sum_m_a_p = 0
sum_m_r_r = 0
sum_p_1 = 0
sum_r_1 = 0
sum_r_2 = 0
sum_r_5 = 0
i = 0
total_num = 0
with open(file_path, 'r') as infile:
for line in infile:
if i % 10 == 0:
data = []
tokens = line.strip().split('\t')
data.append((float(tokens[0]), int(tokens[1])))
if i % 10 == 9:
total_num += 1
m_a_p, m_r_r, p_1, r_1, r_2, r_5 = evaluation_one_session(data)
sum_m_a_p += m_a_p
sum_m_r_r += m_r_r
sum_p_1 += p_1
sum_r_1 += r_1
sum_r_2 += r_2
sum_r_5 += r_5
i += 1
result_dict = {
"MAP": 1.0 * sum_m_a_p / total_num,
"MRR": 1.0 * sum_m_r_r / total_num,
"P_1": 1.0 * sum_p_1 / total_num,
"1_in_10": 1.0 * sum_r_1 / total_num,
"2_in_10": 1.0 * sum_r_2 / total_num,
"5_in_10": 1.0 * sum_r_5 / total_num}
return result_dict
"""
Deep Attention Matching Network
"""
import sys
import os
import six
import numpy as np
import time
import multiprocessing
import paddle
import paddle.fluid as fluid
import reader as reader
from util import mkdir
import evaluation as eva
import config
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
sys.path.append('../../models/dialogue_model_toolkit/deep_attention_matching/')
from net import Net
def evaluate(score_path, result_file_path):
"""
Evaluate both douban and ubuntu dataset
"""
if args.ext_eval:
result = eva.evaluate_douban(score_path)
else:
result = eva.evaluate_ubuntu(score_path)
#write evaluation result
with open(result_file_path, 'w') as out_file:
for p_at in result:
out_file.write(p_at + '\t' + str(result[p_at]) + '\n')
print('finish evaluation')
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
def test_with_feed(exe, program, feed_names, fetch_list, score_path, batches,
batch_num, dev_count):
"""
Test with feed
"""
score_file = open(score_path, 'w')
for it in six.moves.xrange(batch_num // dev_count):
feed_list = []
for dev in six.moves.xrange(dev_count):
val_index = it * dev_count + dev
batch_data = reader.make_one_batch_input(batches, val_index)
feed_dict = dict(zip(feed_names, batch_data))
feed_list.append(feed_dict)
predicts = exe.run(feed=feed_list, fetch_list=fetch_list)
scores = np.array(predicts[0])
for dev in six.moves.xrange(dev_count):
val_index = it * dev_count + dev
for i in six.moves.xrange(args.batch_size):
score_file.write(
str(scores[args.batch_size * dev + i][0]) + '\t' + str(
batches["label"][val_index][i]) + '\n')
score_file.close()
def test_with_pyreader(exe, program, pyreader, fetch_list, score_path, batches,
batch_num, dev_count):
"""
Test with pyreader
"""
def data_provider():
"""
Data reader
"""
for index in six.moves.xrange(batch_num):
yield reader.make_one_batch_input(batches, index)
score_file = open(score_path, 'w')
pyreader.decorate_tensor_provider(data_provider)
it = 0
pyreader.start()
while True:
try:
predicts = exe.run(fetch_list=fetch_list)
scores = np.array(predicts[0])
for dev in six.moves.xrange(dev_count):
val_index = it * dev_count + dev
for i in six.moves.xrange(args.batch_size):
score_file.write(
str(scores[args.batch_size * dev + i][0]) + '\t' + str(
batches["label"][val_index][i]) + '\n')
it += 1
except fluid.core.EOFException:
pyreader.reset()
break
score_file.close()
def train(args):
"""
Train Program
"""
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
# data data_config
data_conf = {
"batch_size": args.batch_size,
"max_turn_num": args.max_turn_num,
"max_turn_len": args.max_turn_len,
"_EOS_": args._EOS_,
}
dam = Net(args.max_turn_num, args.max_turn_len, args.vocab_size,
args.emb_size, args.stack_num, args.channel1_num,
args.channel2_num)
train_program = fluid.Program()
train_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
train_program.random_seed = 110
train_startup.random_seed = 110
with fluid.program_guard(train_program, train_startup):
with fluid.unique_name.guard():
if args.use_pyreader:
train_pyreader = dam.create_py_reader(
capacity=10, name='train_reader')
else:
dam.create_data_layers()
loss, logits = dam.create_network()
loss.persistable = True
logits.persistable = True
# gradient clipping
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
max=1.0, min=-1.0))
optimizer = fluid.optimizer.Adam(
learning_rate=fluid.layers.exponential_decay(
learning_rate=args.learning_rate,
decay_steps=400,
decay_rate=0.9,
staircase=True))
optimizer.minimize(loss)
print("begin memory optimization ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
fluid.memory_optimize(train_program)
print("end memory optimization ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
test_program = fluid.Program()
test_startup = fluid.Program()
if "CE_MODE_X" in os.environ:
test_program.random_seed = 110
test_startup.random_seed = 110
with fluid.program_guard(test_program, test_startup):
with fluid.unique_name.guard():
if args.use_pyreader:
test_pyreader = dam.create_py_reader(
capacity=10, name='test_reader')
else:
dam.create_data_layers()
loss, logits = dam.create_network()
loss.persistable = True
logits.persistable = True
test_program = test_program.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
print("device count %d" % dev_count)
print("theoretical memory usage: ")
print(
fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size))
exe = fluid.Executor(place)
exe.run(train_startup)
exe.run(test_startup)
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda, loss_name=loss.name, main_program=train_program)
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_program,
share_vars_from=train_exe)
if args.word_emb_init is not None:
print("start loading word embedding init ...")
if six.PY2:
word_emb = np.array(pickle.load(open(args.word_emb_init,
'rb'))).astype('float32')
else:
word_emb = np.array(
pickle.load(
open(args.word_emb_init, 'rb'), encoding="bytes")).astype(
'float32')
dam.set_word_embedding(word_emb, place)
print("finish init word embedding ...")
print("start loading data ...")
with open(args.data_path, 'rb') as f:
if six.PY2:
train_data, val_data, test_data = pickle.load(f)
else:
train_data, val_data, test_data = pickle.load(f, encoding="bytes")
print("finish loading data ...")
val_batches = reader.build_batches(val_data, data_conf)
batch_num = len(train_data[six.b('y')]) // args.batch_size
val_batch_num = len(val_batches["response"])
print_step = max(1, batch_num // (dev_count * 100))
save_step = max(1, batch_num // (dev_count * 10))
print("begin model training ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
def train_with_feed(step):
"""
Train on one epoch data by feeding
"""
ave_cost = 0.0
for it in six.moves.xrange(batch_num // dev_count):
feed_list = []
for dev in six.moves.xrange(dev_count):
index = it * dev_count + dev
batch_data = reader.make_one_batch_input(train_batches, index)
feed_dict = dict(zip(dam.get_feed_names(), batch_data))
feed_list.append(feed_dict)
cost = train_exe.run(feed=feed_list, fetch_list=[loss.name])
ave_cost += np.array(cost[0]).mean()
step = step + 1
if step % print_step == 0:
print("processed: [" + str(step * dev_count * 1.0 / batch_num) +
"] ave loss: [" + str(ave_cost / print_step) + "]")
ave_cost = 0.0
if (args.save_path is not None) and (step % save_step == 0):
save_path = os.path.join(args.save_path, "step_" + str(step))
print("Save model at step %d ... " % step)
print(
time.strftime('%Y-%m-%d %H:%M:%S',
time.localtime(time.time())))
fluid.io.save_persistables(exe, save_path, train_program)
score_path = os.path.join(args.save_path, 'score.' + str(step))
test_with_feed(test_exe, test_program,
dam.get_feed_names(), [logits.name], score_path,
val_batches, val_batch_num, dev_count)
result_file_path = os.path.join(args.save_path,
'result.' + str(step))
evaluate(score_path, result_file_path)
return step, np.array(cost[0]).mean()
def train_with_pyreader(step):
"""
Train on one epoch with pyreader
"""
def data_provider():
"""
Data reader
"""
for index in six.moves.xrange(batch_num):
yield reader.make_one_batch_input(train_batches, index)
train_pyreader.decorate_tensor_provider(data_provider)
ave_cost = 0.0
train_pyreader.start()
while True:
try:
cost = train_exe.run(fetch_list=[loss.name])
ave_cost += np.array(cost[0]).mean()
step = step + 1
if step % print_step == 0:
print("processed: [" + str(step * dev_count * 1.0 /
batch_num) + "] ave loss: [" +
str(ave_cost / print_step) + "]")
ave_cost = 0.0
if (args.save_path is not None) and (step % save_step == 0):
save_path = os.path.join(args.save_path,
"step_" + str(step))
print("Save model at step %d ... " % step)
print(
time.strftime('%Y-%m-%d %H:%M:%S',
time.localtime(time.time())))
fluid.io.save_persistables(exe, save_path, train_program)
score_path = os.path.join(args.save_path,
'score.' + str(step))
test_with_pyreader(test_exe, test_program, test_pyreader,
[logits.name], score_path, val_batches,
val_batch_num, dev_count)
result_file_path = os.path.join(args.save_path,
'result.' + str(step))
evaluate(score_path, result_file_path)
except fluid.core.EOFException:
train_pyreader.reset()
break
return step, np.array(cost[0]).mean()
# train over different epoches
global_step, train_time = 0, 0.0
for epoch in six.moves.xrange(args.num_scan_data):
shuffle_train = reader.unison_shuffle(
train_data, seed=110 if ("CE_MODE_X" in os.environ) else None)
train_batches = reader.build_batches(shuffle_train, data_conf)
begin_time = time.time()
if args.use_pyreader:
global_step, last_cost = train_with_pyreader(global_step)
else:
global_step, last_cost = train_with_feed(global_step)
pass_time_cost = time.time() - begin_time
train_time += pass_time_cost
print("Pass {0}, pass_time_cost {1}"
.format(epoch, "%2.2f sec" % pass_time_cost))
# For internal continuous evaluation
if "CE_MODE_X" in os.environ:
print("kpis train_cost %f" % last_cost)
print("kpis train_duration %f" % train_time)
def test(args):
"""
Test
"""
if not os.path.exists(args.save_path):
mkdir(args.save_path)
if not os.path.exists(args.model_path):
raise ValueError("Invalid model init path %s" % args.model_path)
# data data_config
data_conf = {
"batch_size": args.batch_size,
"max_turn_num": args.max_turn_num,
"max_turn_len": args.max_turn_len,
"_EOS_": args._EOS_,
}
dam = Net(args.max_turn_num, args.max_turn_len, args.vocab_size,
args.emb_size, args.stack_num, args.channel1_num,
args.channel2_num)
dam.create_data_layers()
loss, logits = dam.create_network()
loss.persistable = True
logits.persistable = True
# gradient clipping
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
max=1.0, min=-1.0))
test_program = fluid.default_main_program().clone(for_test=True)
optimizer = fluid.optimizer.Adam(
learning_rate=fluid.layers.exponential_decay(
learning_rate=args.learning_rate,
decay_steps=400,
decay_rate=0.9,
staircase=True))
optimizer.minimize(loss)
print("begin memory optimization ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
fluid.memory_optimize(fluid.default_main_program())
print("end memory optimization ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
#dev_count = multiprocessing.cpu_count()
dev_count = 1
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
fluid.io.load_persistables(exe, args.model_path)
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda, main_program=test_program)
print("start loading data ...")
with open(args.data_path, 'rb') as f:
if six.PY2:
train_data, val_data, test_data = pickle.load(f)
else:
train_data, val_data, test_data = pickle.load(f, encoding="bytes")
print("finish loading data ...")
test_batches = reader.build_batches(test_data, data_conf)
test_batch_num = len(test_batches["response"])
print("test batch num: %d" % test_batch_num)
print("begin inference ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
score_path = os.path.join(args.save_path, 'score.txt')
score_file = open(score_path, 'w')
for it in six.moves.xrange(test_batch_num // dev_count):
feed_list = []
for dev in six.moves.xrange(dev_count):
index = it * dev_count + dev
batch_data = reader.make_one_batch_input(test_batches, index)
feed_dict = dict(zip(dam.get_feed_names(), batch_data))
feed_list.append(feed_dict)
predicts = test_exe.run(feed=feed_list, fetch_list=[logits.name])
scores = np.array(predicts[0])
print("step = %d" % it)
for dev in six.moves.xrange(dev_count):
index = it * dev_count + dev
for i in six.moves.xrange(args.batch_size):
score_file.write(
str(scores[args.batch_size * dev + i][0]) + '\t' + str(
test_batches["label"][index][i]) + '\n')
score_file.close()
#write evaluation result
if args.ext_eval:
result = eva.evaluate_douban(score_path)
else:
result = eva.evaluate_ubuntu(score_path)
result_file_path = os.path.join(args.save_path, 'result.txt')
with open(result_file_path, 'w') as out_file:
for metric in result:
out_file.write(metric + '\t' + str(result[metric]) + '\n')
print('finish test')
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
if __name__ == '__main__':
args = config.parse_args()
config.print_arguments(args)
if args.do_train:
train(args)
if args.do_test:
test(args)
"""
Reader for deep attention matching network
"""
import six
import numpy as np
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
def unison_shuffle(data, seed=None):
"""
Shuffle data
"""
if seed is not None:
np.random.seed(seed)
y = np.array(data[six.b('y')])
c = np.array(data[six.b('c')])
r = np.array(data[six.b('r')])
assert len(y) == len(c) == len(r)
p = np.random.permutation(len(y))
shuffle_data = {six.b('y'): y[p], six.b('c'): c[p], six.b('r'): r[p]}
return shuffle_data
def split_c(c, split_id):
"""
    Split context c into turns.
    c is a list of token ids for one context;
    split_id is an integer (conf['_EOS_']) marking turn boundaries.
    Returns a nested list of turns.
"""
turns = [[]]
for _id in c:
if _id != split_id:
turns[-1].append(_id)
else:
turns.append([])
if turns[-1] == [] and len(turns) > 1:
turns.pop()
return turns
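# Illustrative example (made-up ids): split_c([1, 2, 9, 3, 9], split_id=9) -> [[1, 2], [3]]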
def normalize_length(_list, length, cut_type='tail'):
"""_list is a list or nested list, example turns/r/single turn c
cut_type is head or tail, if _list len > length is used
return a list len=length and min(read_length, length)
"""
real_length = len(_list)
if real_length == 0:
return [0] * length, 0
if real_length <= length:
if not isinstance(_list[0], list):
_list.extend([0] * (length - real_length))
else:
_list.extend([[]] * (length - real_length))
return _list, real_length
if cut_type == 'head':
return _list[:length], length
if cut_type == 'tail':
return _list[-length:], length
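# Illustrative examples (made-up ids):
#   normalize_length([1, 2, 3], 5)          -> ([1, 2, 3, 0, 0], 3)
#   normalize_length([1, 2, 3, 4, 5, 6], 5) -> ([2, 3, 4, 5, 6], 5)  # 'tail' keeps the end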
def produce_one_sample(data,
index,
split_id,
max_turn_num,
max_turn_len,
turn_cut_type='tail',
term_cut_type='tail'):
"""max_turn_num=10
max_turn_len=50
return y, nor_turns_nor_c, nor_r, turn_len, term_len, r_len
"""
c = data[six.b('c')][index]
r = data[six.b('r')][index][:]
y = data[six.b('y')][index]
turns = split_c(c, split_id)
#normalize turns_c length, nor_turns length is max_turn_num
nor_turns, turn_len = normalize_length(turns, max_turn_num, turn_cut_type)
nor_turns_nor_c = []
term_len = []
#nor_turn_nor_c length is max_turn_num, element is a list length is max_turn_len
for c in nor_turns:
#nor_c length is max_turn_len
nor_c, nor_c_len = normalize_length(c, max_turn_len, term_cut_type)
nor_turns_nor_c.append(nor_c)
term_len.append(nor_c_len)
nor_r, r_len = normalize_length(r, max_turn_len, term_cut_type)
return y, nor_turns_nor_c, nor_r, turn_len, term_len, r_len
def build_one_batch(data,
batch_index,
conf,
turn_cut_type='tail',
term_cut_type='tail'):
"""
Build one batch
"""
_turns = []
_tt_turns_len = []
_every_turn_len = []
_response = []
_response_len = []
_label = []
for i in six.moves.xrange(conf['batch_size']):
index = batch_index * conf['batch_size'] + i
y, nor_turns_nor_c, nor_r, turn_len, term_len, r_len = produce_one_sample(
data, index, conf['_EOS_'], conf['max_turn_num'],
conf['max_turn_len'], turn_cut_type, term_cut_type)
_label.append(y)
_turns.append(nor_turns_nor_c)
_response.append(nor_r)
_every_turn_len.append(term_len)
_tt_turns_len.append(turn_len)
_response_len.append(r_len)
return _turns, _tt_turns_len, _every_turn_len, _response, _response_len, _label
def build_one_batch_dict(data,
batch_index,
conf,
turn_cut_type='tail',
term_cut_type='tail'):
"""
Build one batch dict
"""
_turns, _tt_turns_len, _every_turn_len, _response, _response_len, _label = build_one_batch(
data, batch_index, conf, turn_cut_type, term_cut_type)
ans = {
'turns': _turns,
'tt_turns_len': _tt_turns_len,
'every_turn_len': _every_turn_len,
'response': _response,
'response_len': _response_len,
'label': _label
}
return ans
def build_batches(data, conf, turn_cut_type='tail', term_cut_type='tail'):
"""
Build batches
"""
_turns_batches = []
_tt_turns_len_batches = []
_every_turn_len_batches = []
_response_batches = []
_response_len_batches = []
_label_batches = []
batch_len = len(data[six.b('y')]) // conf['batch_size']
for batch_index in six.moves.range(batch_len):
_turns, _tt_turns_len, _every_turn_len, _response, _response_len, _label = build_one_batch(
data, batch_index, conf, turn_cut_type='tail', term_cut_type='tail')
_turns_batches.append(_turns)
_tt_turns_len_batches.append(_tt_turns_len)
_every_turn_len_batches.append(_every_turn_len)
_response_batches.append(_response)
_response_len_batches.append(_response_len)
_label_batches.append(_label)
ans = {
"turns": _turns_batches,
"tt_turns_len": _tt_turns_len_batches,
"every_turn_len": _every_turn_len_batches,
"response": _response_batches,
"response_len": _response_len_batches,
"label": _label_batches
}
return ans
def make_one_batch_input(data_batches, index):
"""Split turns and return feeding data.
Args:
data_batches: All data batches
index: The index for current batch
Return:
feeding dictionary
"""
turns = np.array(data_batches["turns"][index])
tt_turns_len = np.array(data_batches["tt_turns_len"][index])
every_turn_len = np.array(data_batches["every_turn_len"][index])
response = np.array(data_batches["response"][index])
response_len = np.array(data_batches["response_len"][index])
batch_size = turns.shape[0]
max_turn_num = turns.shape[1]
max_turn_len = turns.shape[2]
turns_list = [turns[:, i, :] for i in six.moves.xrange(max_turn_num)]
every_turn_len_list = [
every_turn_len[:, i] for i in six.moves.xrange(max_turn_num)
]
feed_list = []
for i, turn in enumerate(turns_list):
turn = np.expand_dims(turn, axis=-1)
feed_list.append(turn)
for i, turn_len in enumerate(every_turn_len_list):
turn_mask = np.ones((batch_size, max_turn_len, 1)).astype("float32")
for row in six.moves.xrange(batch_size):
turn_mask[row, turn_len[row]:, 0] = 0
feed_list.append(turn_mask)
response = np.expand_dims(response, axis=-1)
feed_list.append(response)
response_mask = np.ones((batch_size, max_turn_len, 1)).astype("float32")
for row in six.moves.xrange(batch_size):
response_mask[row, response_len[row]:, 0] = 0
feed_list.append(response_mask)
label = np.array([data_batches["label"][index]]).reshape(
[-1, 1]).astype("float32")
feed_list.append(label)
return feed_list
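# For reference (comment added for clarity): with max_turn_num = N, the
# returned feed_list is laid out as
#   [turn_0 ... turn_{N-1},            # each (batch_size, max_turn_len, 1)
#    turn_mask_0 ... turn_mask_{N-1},  # each (batch_size, max_turn_len, 1)
#    response, response_mask,          # each (batch_size, max_turn_len, 1)
#    label]                            # (batch_size, 1)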
if __name__ == '__main__':
conf = {
"batch_size": 256,
"max_turn_num": 10,
"max_turn_len": 50,
"_EOS_": 28270,
}
with open('../ubuntu/data/data_small.pkl', 'rb') as f:
if six.PY2:
train, val, test = pickle.load(f)
else:
train, val, test = pickle.load(f, encoding="bytes")
print('load data success')
train_batches = build_batches(train, conf)
val_batches = build_batches(val, conf)
test_batches = build_batches(test, conf)
print('build batches success')
export CUDA_VISIBLE_DEVICES=3
export FLAGS_eager_delete_tensor_gb=0.0
#train on ubuntu
python -u main.py \
--do_train True \
--use_cuda \
--data_path ./data/ubuntu/data_small.pkl \
--save_path ./model_files/ubuntu \
--use_pyreader \
--vocab_size 434512 \
--_EOS_ 28270 \
--batch_size 32
#test on ubuntu
python -u main.py \
--do_test True \
--use_cuda \
--data_path ./data/ubuntu/data_small.pkl \
--save_path ./model_files/ubuntu/step_31 \
--model_path ./model_files/ubuntu/step_31 \
--vocab_size 434512 \
--_EOS_ 28270 \
--batch_size 100
#train on douban
python -u main.py \
--do_train True \
--use_cuda \
--data_path ./data/douban/data_small.pkl \
--save_path ./model_files/douban \
--use_pyreader \
--vocab_size 172130 \
--_EOS_ 1 \
--channel1_num 16 \
--batch_size 32
#test on douban
python -u main.py \
--do_test True \
--use_cuda \
--ext_eval \
--data_path ./data/douban/data_small.pkl \
--save_path ./model_files/douban/step_31 \
--model_path ./model_files/douban/step_31 \
--vocab_size 172130 \
--_EOS_ 1 \
--channel1_num 16 \
--batch_size 32
export CPU_NUM=1
export FLAGS_eager_delete_tensor_gb=0.0
#train on ubuntu
python -u main.py \
--do_train True \
--data_path ./data/ubuntu/data_small.pkl \
--save_path ./model_files_cpu/ubuntu \
--use_pyreader \
--stack_num 2 \
--vocab_size 434512 \
--_EOS_ 28270 \
--batch_size 32
#test on ubuntu
python -u main.py \
--do_test True \
--data_path ./data/ubuntu/data_small.pkl \
--save_path ./model_files_cpu/ubuntu/step_31 \
--model_path ./model_files_cpu/ubuntu/step_31 \
--stack_num 2 \
--vocab_size 434512 \
--_EOS_ 28270 \
--batch_size 40
#train on douban
python -u main.py \
--do_train True \
--data_path ./data/douban/data_small.pkl \
--save_path ./model_files_cpu/douban \
--use_pyreader \
--stack_num 2 \
--vocab_size 172130 \
--_EOS_ 1 \
--channel1_num 16 \
--batch_size 32
#test on douban
python -u main.py \
--do_test True \
--ext_eval \
--data_path ./data/douban/data_small.pkl \
--save_path ./model_files_cpu/douban/step_31 \
--model_path ./model_files_cpu/douban/step_31 \
--stack_num 2 \
--vocab_size 172130 \
--_EOS_ 1 \
--channel1_num 16 \
--batch_size 40
""""
Utils
"""
import six
import os
def print_arguments(args):
"""
Print arguments
"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def mkdir(path):
"""
Recursively create path and any missing parent directories (like `mkdir -p`)
"""
if not os.path.isdir(path):
mkdir(os.path.split(path)[0])
else:
return
os.mkdir(path)
def pos_encoding_init():
"""
Positional encoding init (placeholder, not yet implemented)
"""
pass
def scaled_dot_product_attention():
"""
Scaled dot product attention (placeholder, not yet implemented)
"""
pass
# Dialogue General Understanding (DGU) Module
- [1. Introduction](#1-introduction)
- [2. Quick Start](#2-quick-start)
- [3. Advanced Usage](#3-advanced-usage)
- [4. Others](#4-others)
## 1. Introduction
### Task Description
&ensp;&ensp;&ensp;&ensp;In dialogue-related applications, a dialogue system has to solve a wide variety of tasks as the scenario changes. This diversity of tasks (intent detection, slot filling, dialogue act recognition, dialogue state tracking, and so on), combined with the scarcity of in-domain training data, poses great difficulties and challenges for the research and application of dialogue systems; advancing them calls for a general-purpose dialogue understanding model. To this end, we provide a BERT-based Dialogue General Understanding (DGU) module. Experiments show that a single base model (BERT) combined with common learning paradigms can match or even surpass the best in-domain models on almost all dialogue understanding tasks, demonstrating the great potential of learning one general dialogue understanding model.
### Results
&ensp;&ensp;&ensp;&ensp;a. We evaluated on public dialogue datasets widely used in the field; the results are shown in the table below:
| task_name | udc | udc | udc | atis_slot | dstc2 | atis_intent | swda | mrda |
| :------ | :------ | :------ | :------ | :------| :------ | :------ | :------ | :------ |
| Dialogue task | Matching | Matching | Matching | Slot filling | DST | Intent detection | DA | DA |
| Task type | Classification | Classification | Classification | Sequence labeling | Multi-label classification | Classification | Classification | Classification |
| Task name | udc | udc | udc| atis_slot | dstc2 | atis_intent | swda | mrda |
| Metric | R1@10 | R2@10 | R5@10 | F1 | JOINT ACC | ACC | ACC | ACC |
| SOTA | 76.70% | 87.40% | 96.90% | 96.89% | 74.50% | 98.32% | 81.30% | 91.70% |
| DGU | 82.02% | 90.43% | 97.75% | 97.10% | 89.57% | 97.65% | 80.19% | 91.43% |
&ensp;&ensp;&ensp;&ensp;b. Datasets:
```
UDC: Ubuntu Corpus V1;
ATIS: Airline Travel Information System, a public dataset provided by Microsoft;
DSTC2: Dialog State Tracking Challenge 2;
MRDA: Meeting Recorder Dialogue Act;
SWDA: Switchboard Dialogue Act Corpus;
```
## 2. Quick Start
### 1. Installation
#### &ensp;&ensp;a. Install PaddlePaddle
&ensp;&ensp;&ensp;&ensp;This project depends on Paddle Fluid 1.3.1; please follow the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it.
#### &ensp;&ensp;b. Get the code
&ensp;&ensp;&ensp;&ensp;Clone the repository to your local machine:
```
git clone https://github.com/PaddlePaddle/models.git
cd models/PaddleNLP/dialogue_model_toolkit/dialogue_general_understanding
```
#### &ensp;&ensp;c. Environment
&ensp;&ensp;&ensp;&ensp;Python 2.7 is required.
### 2. Running a Model for the First Time
#### &ensp;&ensp;a. Data preparation (data/model download and preprocessing)
&ensp;&ensp;&ensp;&ensp;i. Download the data:
```
sh download_data.sh
```
&ensp;&ensp;&ensp;&ensp;ii. (Optional) The downloaded datasets already contain training, test, and validation sets. If you need to regenerate the training data for a dataset, run:
```
cd dialogue_general_understanding/scripts && sh run_build_data.sh task_name
parameters:
task_name: udc, swda, mrda, atis, dstc2
```
#### &ensp;&ensp;b. Model download
&ensp;&ensp;&ensp;&ensp;In this project, the dialogue models are built on top of BERT: training fine-tunes from a pretrained BERT model, and several dialogue models already trained on the public datasets are provided.
&ensp;&ensp;&ensp;&ensp;i. Download the pretrained BERT model:
```
sh download_pretrain_model.sh
```
&ensp;&ensp;&ensp;&ensp;ii. Download the dialogue models of the dialogue_general_understanding module:
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;Option 1: via the PaddleHub command-line tool (see https://github.com/PaddlePaddle/PaddleHub for installation):
```
hub download dmtk_models --output_path ./
tar -xvf dmtk_models_1.0.0.tar.gz
```
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;Option 2: direct download:
```
sh download_models.sh
```
#### &ensp;&ensp;c. CPU/GPU training settings
&ensp;&ensp;&ensp;&ensp;For CPU training and prediction:
```
Set the following two lines in run_train.sh and run_predict.sh to:
1. export CUDA_VISIBLE_DEVICES=
2. --use_cuda false
```
&ensp;&ensp;&ensp;&ensp;For GPU training and prediction:
```
Set the following two lines in run_train.sh and run_predict.sh to:
1. export CUDA_VISIBLE_DEVICES=4 (you may pick any idle GPU)
2. --use_cuda true
```
#### &ensp;&ensp;d. Training
&ensp;&ensp;&ensp;&ensp;Option 1 (recommended):
```
sh run_train.sh task_name
parameters:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2
```
&ensp;&ensp;&ensp;&ensp;Option 2:
```
python -u train.py --task_name mrda \ # name of the task (and model) to use. [udc|swda|mrda|atis_intent|atis_slot|dstc2]
--use_cuda true \ # If set, use GPU for training.
--do_train true \ # Whether to perform training.
--do_val true \ # Whether to perform evaluation on dev data set.
--do_test true \ # Whether to perform evaluation on test data set.
--epoch 10 \ # Number of epochs for fine-tuning.
--batch_size 4096 \ # Total number of examples in a training batch; see also --in_tokens.
--data_dir ./data/mrda \ # Path to training data.
--bert_config_path ./uncased_L-12_H-768_A-12/bert_config.json \ # Path to the json file for bert model config.
--vocab_path ./uncased_L-12_H-768_A-12/vocab.txt \ # Vocabulary path.
--init_pretraining_params ./uncased_L-12_H-768_A-12/params \ # Pre-training params from which fine-tuning is performed.
--checkpoints ./output/mrda \ # Path to save checkpoints.
--save_steps 200 \ # The steps interval to save checkpoints.
--learning_rate 2e-5 \ # Learning rate used to train with warmup.
--weight_decay 0.01 \ # Weight decay rate for L2 regularizer.
--max_seq_len 128 \ # Number of words of the longest sequence.
--skip_steps 100 \ # The steps interval to print loss.
--validation_steps 500 \ # The steps interval to evaluate model performance.
--num_iteration_per_drop_scope 10 \ # The iteration intervals to clean up temporary variables.
--use_fp16 false # If set, use fp16 for training.
```
#### &ensp;&ensp;e. Prediction (method f below is recommended for prediction plus evaluation)
&ensp;&ensp;&ensp;&ensp;Option 1 (recommended):
```
sh run_predict.sh task_name
parameters:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2
```
&ensp;&ensp;&ensp;&ensp;Option 2:
```
python -u predict.py --task_name mrda \ # name of the task (and model) to use. [udc|swda|mrda|atis_intent|atis_slot|dstc2]
--use_cuda true \ # If set, use GPU for prediction.
--batch_size 4096 \ # Total number of examples in a prediction batch; see also --in_tokens.
--init_checkpoint ./output/mrda/step_6500 \ # Init model
--data_dir ./data/mrda \ # Path to training data.
--vocab_path ./uncased_L-12_H-768_A-12/vocab.txt \ # Vocabulary path.
--max_seq_len 128 \ # Number of words of the longest sequence.
--bert_config_path ./uncased_L-12_H-768_A-12/bert_config.json # Path to the json file for bert model config.
```
#### &ensp;&ensp;f. Prediction + evaluation (recommended)
&ensp;&ensp;&ensp;&ensp;The dialogue_general_understanding module ships trained dialogue models, downloadable via sh download_models.sh. If you do not train a model yourself, you can run prediction and evaluation with the provided models:
```
sh run_eval_metrics.sh task_name
parameters:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2
```
## 3. Advanced Usage
### 1. Task Definition and Modeling
&ensp;&ensp;&ensp;&ensp;The dialogue_general_understanding module implements training pipelines for the datasets above and supports classification, multi-label classification, and sequence labeling tasks; you can customize the models for your own dataset.
### 2. Model Overview
&ensp;&ensp;&ensp;&ensp;For dialogue understanding problems, the module uses BERT as the underlying encoder and defines task paradigms (classification, multi-label classification, sequence labeling) on top of it; a series of models for the public datasets is open-sourced for configurable use:
### 3. Data Format
&ensp;&ensp;&ensp;&ensp;You can organize your own training, prediction, and evaluation data according to your actual dialogue application. The data fed to the network follows one unified format, for example:
```
[CLS] token11 token12 token13 [INNER_SEP] token11 token12 token13 [SEP] token21 token22 token23 [SEP] token31 token32 token33 [SEP]
```
&ensp;&ensp;&ensp;&ensp;The input starts with [CLS], and [SEP] separates up to three dialogue-related parts, e.g. the preceding context, the current utterance, and the following context; if a [SEP]-separated part itself consists of multiple turns, the turns are separated with [INNER_SEP]. The second and third parts may both be omitted.
&ensp;&ensp;&ensp;&ensp;Data preparation is already integrated into the dialogue_general_understanding code, so you can assemble your own data following the input format above; a minimal sketch is shown below.
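&ensp;&ensp;&ensp;&ensp;As an illustrative sketch (not an API of this toolkit; `build_dgu_input` is a hypothetical helper), the single-example input above can be assembled from its parts like this:
```
def build_dgu_input(context_turns, current_utt, following_utt=None):
    """Join up to three parts with [SEP]; turns inside a part use [INNER_SEP]."""
    parts = [" [INNER_SEP] ".join(context_turns), current_utt]
    if following_utt is not None:
        parts.append(following_utt)
    return "[CLS] " + " [SEP] ".join(parts) + " [SEP]"

print(build_dgu_input(["hi there", "how can i help"], "book a flight"))
# [CLS] hi there [INNER_SEP] how can i help [SEP] book a flight [SEP]
```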
### 4. Code Structure
```
.
├── run_train.sh               # training launcher script
├── run_predict.sh             # prediction launcher script
├── run_eval_metrics.sh        # evaluation launcher script
├── download_data.sh           # data download script
├── download_models.sh         # dialogue model download script
├── download_pretrain_model.sh # BERT pretrained-model download script
├── train.py                   # training pipeline
├── predict.py                 # prediction pipeline
├── eval_metrics.py            # metric evaluation
├── define_predict_pack.py     # packaging of prediction results
├── finetune_args.py           # configuration arguments for training
├── batching.py                # yields batched data
├── optimization.py            # model optimizer
├── tokenization.py            # tokenizer utilities
├── reader/data_reader.py:     # data processing and assembly; one class per dataset
├── README.md                  # this document
├── utils/*                    # other common utility functions
└── scripts                    # data processing scripts
    ├── run_build_data.sh      # data processing launcher
    ├── build_atis_dataset.py  # builds atis_intent and atis_slot training data
    ├── build_dstc2_dataset.py # builds dstc2 training data
    ├── build_mrda_dataset.py  # builds mrda training data
    ├── build_swda_dataset.py  # builds swda training data
    ├── commonlib.py           # common data-processing helpers
    └── conf                   # train/dev/test splits of the public datasets
../../models/dialogue_model_toolkit/dialogue_general_understanding
├── bert.py                    # underlying BERT model
├── define_paradigm.py         # upper-layer task paradigms
└── create_model.py            # builds the BERT encoder plus the task-paradigm network
```
### 5. Building Your Own Model
&ensp;&ensp;&ensp;&ensp;You can build a custom model to fit your own needs as follows:
&ensp;&ensp;&ensp;&ensp;i. Custom data
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;Suppose your dataset is named **task_name**: create a **task_name** folder under **data** and put the dataset there. In **reader/data_reader.py**, add a custom data processing class (e.g. the **udc** dataset corresponds to **UDCProcessor**). Then, in **train.py**, register the mapping from **task_name** to its **processor** (e.g. **processors = {'udc': reader.UDCProcessor}**), and whether this dataset computes batch size the **in_tokens** way during training (e.g. **in_tokens = {'udc': True}**); see the sketch after this list.
&ensp;&ensp;&ensp;&ensp;ii. Custom upper-layer network paradigm
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;If your custom model is one of the three supported types (classification, multi-label classification, sequence labeling), you only need to register the mapping from **task_name** to the corresponding upper-layer paradigm function in **paddle-nlp/models/dialogue_model_toolkit/dialogue_general_understanding/define_paradigm.py**; otherwise, define a new upper-layer paradigm function and register its mapping to **task_name**.
&ensp;&ensp;&ensp;&ensp;iii. Custom prediction packaging
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;Define the mapping from task_name to your custom prediction-packaging function in define_predict_pack.py.
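&ensp;&ensp;&ensp;&ensp;A hypothetical example of step i (names like **MyTaskProcessor** and **my_task** are yours to choose; this assumes the new class follows the same structure as the existing processors in reader/data_reader.py):
```
# in reader/data_reader.py (sketch; mirror the structure of UDCProcessor)
class MyTaskProcessor(object):
    """Processes ./data/my_task/{train,dev,test}.txt for a classification task."""

# in train.py
processors = {
    'udc': reader.UDCProcessor,
    'my_task': reader.MyTaskProcessor,
}
in_tokens = {
    'udc': True,
    'my_task': True,
}
```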
### 6. Training
&ensp;&ensp;&ensp;&ensp;i. Organize your own training, evaluation, and prediction data in the format described above.
&ensp;&ensp;&ensp;&ensp;ii. Run the training script:
```
sh run_train.sh task_name
parameters:
task_name: your custom task name
```
## 4. Others
### Contributing
&ensp;&ensp;&ensp;&ensp;If you can fix an issue or add a new feature, feel free to submit a PR. If the PR is accepted, we will score the contribution by quality and difficulty (0 to 5 points, the higher the better). Once you accumulate 10 points, you may contact us for an interview opportunity or a recommendation letter.
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
"""
Add mask for batch_tokens; return out, mask_label, mask_pos.
Note: mask_pos refers to positions in batch_tokens after padding.
"""
max_len = max([len(sent) for sent in batch_tokens])
mask_label = []
mask_pos = []
prob_mask = np.random.rand(total_token_num)
# Note: the first token is [CLS], so [low=1]
replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)
pre_sent_len = 0
prob_index = 0
for sent_index, sent in enumerate(batch_tokens):
mask_flag = False
prob_index += pre_sent_len
for token_index, token in enumerate(sent):
prob = prob_mask[prob_index + token_index]
if prob > 0.15:
continue
elif 0.03 < prob <= 0.15:
# mask
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
elif 0.015 < prob <= 0.03:
# random replace
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
sent[token_index] = replace_ids[prob_index + token_index]
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
else:
# keep the original token
if token != SEP and token != CLS:
mask_label.append(sent[token_index])
mask_pos.append(sent_index * max_len + token_index)
pre_sent_len = len(sent)
# ensure at least mask one word in a sentence
while not mask_flag:
token_index = int(np.random.randint(1, high=len(sent) - 1, size=1))
if sent[token_index] != SEP and sent[token_index] != CLS:
mask_label.append(sent[token_index])
sent[token_index] = MASK
mask_flag = True
mask_pos.append(sent_index * max_len + token_index)
mask_label = np.array(mask_label).astype("int64").reshape([-1, 1])
mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1])
return batch_tokens, mask_label, mask_pos
def prepare_batch_data(insts,
max_len,
total_token_num,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
1. generate Tensor of data
2. generate Tensor of position
3. generate self attention mask, [shape: batch_size * max_len * max_len]
"""
batch_src_ids = [inst[0] for inst in insts]
batch_sent_ids = [inst[1] for inst in insts]
batch_pos_ids = [inst[2] for inst in insts]
labels_list = []
# compatible with squad, whose example includes start/end positions,
# or unique id
if isinstance(insts[0][3], list):
if max_len != -1:
labels_list = [inst[3] + [0] * (max_len - len(inst[3])) for inst in insts]
labels_list = [np.array(labels_list).astype("int64").reshape([-1, max_len])]
else:
labels_list = [inst[3] for inst in insts]
labels_list = [np.array(labels_list).astype("int64")]
else:
for i in range(3, len(insts[0]), 1):
labels = [inst[i] for inst in insts]
labels = np.array(labels).astype("int64").reshape([-1, 1])
labels_list.append(labels)
# First step: do mask without padding
if mask_id >= 0:
out, mask_label, mask_pos = mask(
batch_src_ids,
total_token_num,
vocab_size=voc_size,
CLS=cls_id,
SEP=sep_id,
MASK=mask_id)
else:
out = batch_src_ids
# Second step: padding
src_id, self_input_mask = pad_batch_data(
out,
max_len,
pad_idx=pad_id,
return_input_mask=True)
pos_id = pad_batch_data(
batch_pos_ids,
max_len,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
sent_id = pad_batch_data(
batch_sent_ids,
max_len,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
if mask_id >= 0:
return_list = [
src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
] + labels_list
else:
return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list
return return_list if len(return_list) > 1 else return_list[0]
def pad_batch_data(insts,
max_len_in,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max_len_in if max_len_in != -1 else max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts
])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
return return_list if len(return_list) > 1 else return_list[0]
if __name__ == "__main__":
pass
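# Illustrative usage of pad_batch_data (comment added for clarity; not part
# of the original file). Two instances of lengths 3 and 2, padded to the
# batch max length, with an attention mask over the padded positions:
#
#   src, input_mask = pad_batch_data([[5, 6, 7], [8, 9]], max_len_in=-1,
#                                    pad_idx=0, return_input_mask=True)
#   src.shape        == (2, 3, 1)   # int64 token ids, zero-padded
#   input_mask.shape == (2, 3, 1)   # float32, 1.0 for tokens, 0.0 for padding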
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""define prediction results"""
import re
import sys
import numpy as np
import paddle
import paddle.fluid as fluid
class DefinePredict(object):
"""
Packaging Prediction Results
"""
def __init__(self):
"""
init
"""
self.task_map = {'udc': 'get_matching_res',
'swda': 'get_cls_res',
'mrda': 'get_cls_res',
'atis_intent': 'get_cls_res',
'atis_slot': 'get_sequence_tagging',
'dstc2': 'get_multi_cls_res',
'dstc2_asr': 'get_multi_cls_res',
'multi-woz': 'get_multi_cls_res'}
def get_matching_res(self, probs, params=None):
"""
get matching score
"""
probs = list(probs)
return probs[1]
def get_cls_res(self, probs, params=None):
"""
get da classify tag
"""
probs = list(probs)
max_prob = max(probs)
tag = probs.index(max_prob)
return tag
def get_sequence_tagging(self, probs, params=None):
"""
get sequence tagging tag
"""
labels = []
batch_labels = np.array(probs).reshape(-1, params)
labels = [" ".join([str(l) for l in list(l_l)]) for l_l in batch_labels]
return labels
def get_multi_cls_res(self, probs, params=None):
"""
get dst classify tag
"""
labels = []
probs = list(probs)
for i in range(len(probs)):
if probs[i] >= 0.5:
labels.append(i)
if not labels:
max_prob = max(probs)
label_str = str(probs.index(max_prob))
else:
label_str = " ".join([str(l) for l in sorted(labels)])
return label_str
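# Illustrative behaviour of the packers above (comment added for clarity;
# not part of the original file): get_multi_cls_res keeps every label whose
# probability is at least 0.5 and falls back to the argmax when none passes.
#
#   dp = DefinePredict()
#   dp.get_cls_res([0.1, 0.7, 0.2])        -> 1
#   dp.get_multi_cls_res([0.2, 0.7, 0.6])  -> "1 2"
#   dp.get_multi_cls_res([0.2, 0.3, 0.1])  -> "1"    # argmax fallback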
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz
tar -xvf dmtk_data_1.0.0.tar.gz
rm dmtk_data_1.0.0.tar.gz
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dmtk_models_1.0.0.tar.gz
tar -xvf dmtk_models_1.0.0.tar.gz
rm dmtk_models_1.0.0.tar.gz
wget --no-check-certificate https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz
tar -xvf uncased_L-12_H-768_A-12.tar.gz
rm uncased_L-12_H-768_A-12.tar.gz
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
evaluate task metrics
"""
import sys
class EvalDA(object):
"""
evaluate da testset, swda|mrda
"""
def __init__(self, task_name, pred):
"""
predict file
"""
self.pred_file = pred
if task_name == 'swda':
self.refer_file = "./data/swda/test.txt"
elif task_name == "mrda":
self.refer_file = "./data/mrda/test.txt"
def load_data(self):
"""
load reference label and predict label
"""
pred_label = []
refer_label = []
with open(self.refer_file, 'r') as fr:
for line in fr:
label = line.rstrip('\n').split('\t')[1]
refer_label.append(int(label))
idx = 0
with open(self.pred_file, 'r') as fr:
for line in fr:
elems = line.rstrip('\n').split('\t')
if len(elems) != 2 or not elems[0].isdigit():
continue
tag_id = int(elems[1])
pred_label.append(tag_id)
return pred_label, refer_label
def evaluate(self):
"""
calculate acc metrics
"""
pred_label, refer_label = self.load_data()
common_num = 0
total_num = len(pred_label)
for i in range(total_num):
if pred_label[i] == refer_label[i]:
common_num += 1
acc = float(common_num) / total_num
return acc
class EvalATISIntent(object):
"""
evaluate intent test set, atis_intent
"""
def __init__(self, pred):
"""
predict file
"""
self.pred_file = pred
self.refer_file = "./data/atis/atis_intent/test.txt"
def load_data(self):
"""
load reference label and predict label
"""
pred_label = []
refer_label = []
with open(self.refer_file, 'r') as fr:
for line in fr:
label = line.rstrip('\n').split('\t')[0]
refer_label.append(int(label))
idx = 0
with open(self.pred_file, 'r') as fr:
for line in fr:
elems = line.rstrip('\n').split('\t')
if len(elems) != 2 or not elems[0].isdigit():
continue
tag_id = int(elems[1])
pred_label.append(tag_id)
return pred_label, refer_label
def evaluate(self):
"""
calculate acc metrics
"""
pred_label, refer_label = self.load_data()
common_num = 0
total_num = len(pred_label)
for i in range(total_num):
if pred_label[i] == refer_label[i]:
common_num += 1
acc = float(common_num) / total_num
return acc
class EvalATISSlot(object):
"""
evaluate atis slot
"""
def __init__(self, pred):
"""
pred file
"""
self.pred_file = pred
self.refer_file = "./data/atis/atis_slot/test.txt"
def load_data(self):
"""
load reference label and predict label
"""
pred_label = []
refer_label = []
with open(self.refer_file, 'r') as fr:
for line in fr:
labels = line.rstrip('\n').split('\t')[1].split()
labels = [int(l) for l in labels]
refer_label.append(labels)
with open(self.pred_file, 'r') as fr:
for line in fr:
if len(line.split('\t')) != 2 or not line[0].isdigit():
continue
labels = line.rstrip('\n').split('\t')[1].split()[1:]
labels = [int(l) for l in labels]
pred_label.append(labels)
pred_label_equal = []
refer_label_equal = []
assert len(refer_label) == len(pred_label)
for i in range(len(refer_label)):
num = len(refer_label[i])
refer_label_equal.extend(refer_label[i])
pred_label[i] = pred_label[i][: num]
pred_label_equal.extend(pred_label[i])
return pred_label_equal, refer_label_equal
def evaluate(self):
"""
evaluate f1_micro score
"""
pred_label, refer_label = self.load_data()
tp = dict()
fn = dict()
fp = dict()
for i in range(len(refer_label)):
if refer_label[i] == pred_label[i]:
if refer_label[i] not in tp:
tp[refer_label[i]] = 0
tp[refer_label[i]] += 1
else:
if pred_label[i] not in fp:
fp[pred_label[i]] = 0
fp[pred_label[i]] += 1
if refer_label[i] not in fn:
fn[refer_label[i]] = 0
fn[refer_label[i]] += 1
results = ["label precision recall"]
for i in range(0, 130):
if i not in tp:
results.append(" %s: 0.0 0.0" % i)
continue
if i in fp:
precision = float(tp[i]) / (tp[i] + fp[i])
else:
precision = 1.0
if i in fn:
recall = float(tp[i]) / (tp[i] + fn[i])
else:
recall = 1.0
results.append(" %s: %.4f %.4f" % (i, precision, recall))
tp_total = sum(tp.values())
fn_total = sum(fn.values())
fp_total = sum(fp.values())
p_total = float(tp_total) / (tp_total + fp_total)
r_total = float(tp_total) / (tp_total + fn_total)
f_micro = 2 * p_total * r_total / (p_total + r_total)
results.append("f1_micro: %.4f" % (f_micro))
return "\n".join(results)
class EvalUDC(object):
"""
evaluate udc
"""
def __init__(self, pred):
"""
predict file
"""
self.pred_file = pred
self.refer_file = "./data/udc/test.txt"
def load_data(self):
"""
load reference label and predict label
"""
data = []
refer_label = []
with open(self.refer_file, 'r') as fr:
for line in fr:
label = line.rstrip('\n').split('\t')[0]
refer_label.append(label)
idx = 0
with open(self.pred_file, 'r') as fr:
for line in fr:
elems = line.rstrip('\n').split('\t')
if len(elems) != 2 or not elems[0].isdigit():
continue
match_prob = elems[1]
data.append((float(match_prob), int(refer_label[idx])))
idx += 1
return data
def get_p_at_n_in_m(self, data, n, m, ind):
"""
return 1 if the positive response is ranked within the top n of m candidates
"""
pos_score = data[ind][0]
curr = data[ind: ind + m]
curr = sorted(curr, key = lambda x: x[0], reverse = True)
if curr[n - 1][0] <= pos_score:
return 1
return 0
def evaluate(self):
"""
calculate recall@k metrics on the udc test set
"""
data = self.load_data()
assert len(data) % 10 == 0
p_at_1_in_2 = 0.0
p_at_1_in_10 = 0.0
p_at_2_in_10 = 0.0
p_at_5_in_10 = 0.0
length = len(data) // 10  # integer division, so range() below also works on Python 3
for i in range(0, length):
ind = i * 10
assert data[ind][1] == 1
p_at_1_in_2 += self.get_p_at_n_in_m(data, 1, 2, ind)
p_at_1_in_10 += self.get_p_at_n_in_m(data, 1, 10, ind)
p_at_2_in_10 += self.get_p_at_n_in_m(data, 2, 10, ind)
p_at_5_in_10 += self.get_p_at_n_in_m(data, 5, 10, ind)
metrics_out = [p_at_1_in_2 / length, p_at_1_in_10 / length, \
p_at_2_in_10 / length, p_at_5_in_10 / length]
return metrics_out
class EvalDSTC2(object):
"""
evaluate dst testset, dstc2
"""
def __init__(self, task_name, pred):
"""
predict file
"""
self.task_name = task_name
self.pred_file = pred
self.refer_file = "./data/dstc2/%s/test.txt" % self.task_name
def load_data(self):
"""
load reference label and predict label
"""
pred_label = []
refer_label = []
with open(self.refer_file, 'r') as fr:
for line in fr:
line = line.strip('\n')
labels = [int(l) for l in line.split('\t')[-1].split()]
labels = sorted(list(set(labels)))
refer_label.append(" ".join([str(l) for l in labels]))
all_pred = []
with open(self.pred_file, 'r') as fr:
for line in fr:
line = line.strip('\n')
all_pred.append(line)
all_pred = all_pred[len(all_pred) - len(refer_label):]
for line in all_pred:
labels = [int(l) for l in line.split('\t')[-1].split()]
labels = sorted(list(set(labels)))
pred_label.append(" ".join([str(l) for l in labels]))
return pred_label, refer_label
def evaluate(self):
"""
calculate joint acc && overall acc
"""
overall_all = 0.0
correct_joint = 0
pred_label, refer_label = self.load_data()
for i in range(len(refer_label)):
if refer_label[i] != pred_label[i]:
continue
correct_joint += 1
joint_all = float(correct_joint) / len(refer_label)
metrics_out = [joint_all, overall_all]
return metrics_out
if __name__ == "__main__":
if len(sys.argv[1:]) < 2:
print("please input task_name predict_file")
task_name = sys.argv[1]
pred_file = sys.argv[2]
if task_name.lower() == 'udc':
eval_inst = EvalUDC(pred_file)
eval_metrics = eval_inst.evaluate()
print("MATCHING TASK: %s metrics in testset: " % task_name)
print("R1@2: %s" % eval_metrics[0])
print("R1@10: %s" % eval_metrics[1])
print("R2@10: %s" % eval_metrics[2])
print("R5@10: %s" % eval_metrics[3])
elif task_name.lower() in ['swda', 'mrda']:
eval_inst = EvalDA(task_name.lower(), pred_file)
eval_metrics = eval_inst.evaluate()
print("DA TASK: %s metrics in testset: " % task_name)
print("ACC: %s" % eval_metrics)
elif task_name.lower() == 'atis_intent':
eval_inst = EvalATISIntent(pred_file)
eval_metrics = eval_inst.evaluate()
print("INTENTION TASK: %s metrics in testset: " % task_name)
print("ACC: %s" % eval_metrics)
elif task_name.lower() == 'atis_slot':
eval_inst = EvalATISSlot(pred_file)
eval_metrics = eval_inst.evaluate()
print("SLOT FILLING TASK: %s metrics in testset: " % task_name)
print(eval_metrics)
elif task_name.lower() in ['dstc2', 'dstc2_asr']:
eval_inst = EvalDSTC2(task_name.lower(), pred_file)
eval_metrics = eval_inst.evaluate()
print("DST TASK: %s metrics in testset: " % task_name)
print("JOINT ACC: %s" % eval_metrics[0])
elif task_name.lower() == "multi-woz":
eval_inst = EvalMultiWoz(pred_file)
eval_metrics = eval_inst.evaluate()
print("DST TASK: %s metrics in testset: " % task_name)
print("JOINT ACC: %s" % eval_metrics[0])
print("OVERALL ACC: %s" % eval_metrics[1])
else:
print("task name not in [udc|swda|mrda|atis_intent|atis_slot|dstc2|dstc2_asr|multi-woz]")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import argparse
from utils.args import ArgumentGroup
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("bert_config_path", str, None, "Path to the json file for bert model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("save_inference_model_path", str, None, "Path to save model.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("data_dir", str, None, "Path to training data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("task_name", str, None,
"The name of task to perform fine-tuning, "
"should be in {'udc', 'swda', 'mrda', 'atis_slot', 'atis_intent', 'dstc2'}.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from utils.fp16 import create_master_params_grads, master_param_to_train_param
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
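# For reference (a plain-Python sketch added for clarity, not part of the
# training graph): the schedule above warms up linearly to `learning_rate`
# over `warmup_steps`, then decays linearly (power=1.0 polynomial) to 0
# over `num_train_steps`:
#
#   def lr_at(step, learning_rate, warmup_steps, num_train_steps):
#       if step < warmup_steps:
#           return learning_rate * float(step) / warmup_steps
#       return learning_rate * max(0.0, 1.0 - float(step) / num_train_steps)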
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
use_fp16=False,
loss_scaling=1.0):
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler\
.noam_decay(1/(warmup_steps *(learning_rate ** 2)),
warmup_steps)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
else:
optimizer = fluid.optimizer.Adam(learning_rate=learning_rate)
scheduled_lr = learning_rate
clip_norm_thres = 1.0
# When using mixed precision training, scale the gradient clip threshold
# by loss_scaling
if use_fp16 and loss_scaling > 1.0:
clip_norm_thres *= loss_scaling
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm_thres))
def exclude_from_weight_decay(name):
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
param_list = dict()
if use_fp16:
param_grads = optimizer.backward(loss)
master_param_grads = create_master_params_grads(
param_grads, train_program, startup_prog, loss_scaling)
for param, _ in master_param_grads:
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
optimizer.apply_gradients(master_param_grads)
if weight_decay > 0:
for param, grad in master_param_grads:
if exclude_from_weight_decay(param.name.rstrip(".master")):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
master_param_to_train_param(master_param_grads, param_grads,
train_program)
else:
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr
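# In effect (comment added for clarity), every parameter not excluded by
# exclude_from_weight_decay receives decoupled weight decay after the Adam
# step, using the pre-update snapshot stored in param_list:
#
#   param_new = adam_update(param) - weight_decay * scheduled_lr * param_old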
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Load checkpoint of running classifier to do prediction and save inference model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import time
import numpy as np
import multiprocessing
import paddle
import paddle.fluid as fluid
from finetune_args import parser
from utils.args import print_arguments
from utils.init import init_pretraining_params, init_checkpoint
import define_predict_pack
import reader.data_reader as reader
_WORK_DIR = os.path.split(os.path.realpath(__file__))[0]
sys.path.append('../../models/dialogue_model_toolkit/dialogue_general_understanding')
from bert import BertConfig, BertModel
from create_model import create_model
import define_paradigm
def main(args):
"""main function"""
bert_config = BertConfig(args.bert_config_path)
bert_config.print_config()
task_name = args.task_name.lower()
paradigm_inst = define_paradigm.Paradigm(task_name)
pred_inst = define_predict_pack.DefinePredict()
pred_func = getattr(pred_inst, pred_inst.task_map[task_name])
processors = {
'udc': reader.UDCProcessor,
'swda': reader.SWDAProcessor,
'mrda': reader.MRDAProcessor,
'atis_slot': reader.ATISSlotProcessor,
'atis_intent': reader.ATISIntentProcessor,
'dstc2': reader.DSTC2Processor,
'dstc2_asr': reader.DSTC2Processor,
}
in_tokens = {
'udc': True,
'swda': True,
'mrda': True,
'atis_slot': False,
'atis_intent': True,
'dstc2': True,
'dstc2_asr': True
}
processor = processors[task_name](data_dir=args.data_dir,
vocab_path=args.vocab_path,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
in_tokens=in_tokens[task_name],
task_name=task_name,
random_seed=args.random_seed)
num_labels = len(processor.get_labels())
predict_prog = fluid.Program()
predict_startup = fluid.Program()
with fluid.program_guard(predict_prog, predict_startup):
with fluid.unique_name.guard():
pred_results = create_model(
args,
pyreader_name='predict_reader',
bert_config=bert_config,
num_labels=num_labels,
paradigm_inst=paradigm_inst,
is_prediction=True)
predict_pyreader = pred_results.get('pyreader', None)
probs = pred_results.get('probs', None)
feed_target_names = pred_results.get('feed_target_names', None)
predict_prog = predict_prog.clone(for_test=True)
if args.use_cuda:
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
exe.run(predict_startup)
if args.init_checkpoint:
init_pretraining_params(exe, args.init_checkpoint, predict_prog)
else:
raise ValueError("args 'init_checkpoint' should be set for prediction!")
predict_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda, main_program=predict_prog)
test_data_generator = processor.data_generator(
batch_size=args.batch_size,
phase='test',
epoch=1,
shuffle=False)
predict_pyreader.decorate_tensor_provider(test_data_generator)
predict_pyreader.start()
all_results = []
time_begin = time.time()
while True:
try:
results = predict_exe.run(fetch_list=[probs.name])
all_results.extend(results[0])
except fluid.core.EOFException:
predict_pyreader.reset()
break
time_end = time.time()
np.set_printoptions(precision=4, suppress=True)
print("-------------- prediction results --------------")
print("example_id\t" + ' '.join(processor.get_labels()))
if in_tokens[task_name]:
for index, result in enumerate(all_results):
tags = pred_func(result)
print("%s\t%s" % (index, tags))
else:
tags = pred_func(all_results, args.max_seq_len)
for index, tag in enumerate(tags):
print("%s\t%s" % (index, tag))
if args.save_inference_model_path:
_, ckpt_dir = os.path.split(args.init_checkpoint)
dir_name = ckpt_dir + '_inference_model'
model_path = os.path.join(args.save_inference_model_path, dir_name)
fluid.io.save_inference_model(
model_path,
feed_target_names, [probs],
exe,
main_program=predict_prog)
if __name__ == '__main__':
args = parser.parse_args()
print_arguments(args)
main(args)
#!/bin/bash
TASK_NAME=$1
PRED_FILE="./pred_"${TASK_NAME}
PYTHON_PATH="python"
echo "run predict............................"
sh run_predict.sh ${TASK_NAME} > ${PRED_FILE}
echo "eval_metrics..........................."
${PYTHON_PATH} eval_metrics.py ${TASK_NAME} ${PRED_FILE}
#!/bin/bash
export CUDA_VISIBLE_DEVICES=4
export CPU_NUM=1
TASK_NAME=$1
BERT_BASE_PATH="./uncased_L-12_H-768_A-12"
INPUT_PATH="./data/${TASK_NAME}"
OUTPUT_PATH="./output/${TASK_NAME}"
PYTHON_PATH="python"
if [ "$TASK_NAME" = "udc" ]
then
best_model="step_62500"
max_seq_len=210
batch_size=6720
elif [ "$TASK_NAME" = "swda" ]
then
best_model="step_12500"
max_seq_len=128
batch_size=6720
elif [ "$TASK_NAME" = "mrda" ]
then
best_model="step_6500"
max_seq_len=128
batch_size=6720
elif [ "$TASK_NAME" = "atis_intent" ]
then
best_model="step_600"
max_seq_len=128
batch_size=4096
INPUT_PATH="./data/atis/${TASK_NAME}"
elif [ "$TASK_NAME" = "atis_slot" ]
then
best_model="step_7500"
max_seq_len=128
batch_size=32
INPUT_PATH="./data/atis/${TASK_NAME}"
elif [ "$TASK_NAME" = "dstc2" ]
then
best_model="step_12000"
max_seq_len=700
batch_size=6000
INPUT_PATH="./data/dstc2/${TASK_NAME}"
else
echo "not support ${TASK_NAME} dataset.."
exit 255
fi
$PYTHON_PATH -u predict.py --task_name ${TASK_NAME} \
--use_cuda true\
--batch_size ${batch_size} \
--init_checkpoint ${OUTPUT_PATH}/${best_model} \
--data_dir ${INPUT_PATH} \
--vocab_path ${BERT_BASE_PATH}/vocab.txt \
--max_seq_len ${max_seq_len} \
--bert_config_path ${BERT_BASE_PATH}/bert_config.json
#!/bin/bash
export CUDA_VISIBLE_DEVICES=3
export CPU_NUM=1
TASK_NAME=$1
typeset -l TASK_NAME
BERT_BASE_PATH="./uncased_L-12_H-768_A-12"
INPUT_PATH="./data/${TASK_NAME}"
OUTPUT_PATH="./output/${TASK_NAME}"
PYTHON_PATH="python"
DO_TRAIN=true
DO_VAL=true
DO_TEST=true
#parameter configuration
if [ "${TASK_NAME}" = "udc" ]
then
save_steps=1000
max_seq_len=210
skip_steps=1000
batch_size=6720
epoch=2
learning_rate=2e-5
DO_VAL=false
DO_TEST=false
elif [ "${TASK_NAME}" = "swda" ]
then
save_steps=500
max_seq_len=128
skip_steps=200
batch_size=6720
epoch=10
learning_rate=2e-5
elif [ "${TASK_NAME}" = "mrda" ]
then
save_steps=500
max_seq_len=128
skip_steps=200
batch_size=4096
epoch=4
learning_rate=2e-5
elif [ "${TASK_NAME}" = "atis_intent" ]
then
save_steps=100
max_seq_len=128
skip_steps=10
batch_size=4096
epoch=20
learning_rate=2e-5
INPUT_PATH="./data/atis/${TASK_NAME}"
elif [ "${TASK_NAME}" = "atis_slot" ]
then
save_steps=100
max_seq_len=128
skip_steps=10
batch_size=32
epoch=50
learning_rate=2e-5
INPUT_PATH="./data/atis/${TASK_NAME}"
elif [ "${TASK_NAME}" = "dstc2" ]
then
save_steps=400
max_seq_len=256
skip_steps=20
batch_size=8192
epoch=40
learning_rate=5e-5
INPUT_PATH="./data/dstc2/${TASK_NAME}"
else
echo "not support ${TASK_NAME} dataset.."
exit 255
fi
# build train, dev, test dataset
cd scripts && sh run_build_data.sh ${TASK_NAME} && cd ..
#training
$PYTHON_PATH -u train.py --task_name ${TASK_NAME} \
--use_cuda true\
--do_train ${DO_TRAIN} \
--do_val ${DO_VAL} \
--do_test ${DO_TEST} \
--epoch ${epoch} \
--batch_size ${batch_size} \
--data_dir ${INPUT_PATH} \
--bert_config_path ${BERT_BASE_PATH}/bert_config.json \
--vocab_path ${BERT_BASE_PATH}/vocab.txt \
--init_pretraining_params ${BERT_BASE_PATH}/params \
--checkpoints ${OUTPUT_PATH} \
--save_steps ${save_steps} \
--learning_rate ${learning_rate} \
--weight_decay 0.01 \
--max_seq_len ${max_seq_len} \
--skip_steps ${skip_steps} \
--validation_steps 1000000 \
--num_iteration_per_drop_scope 10 \
--use_fp16 false
scripts: directory of data-processing scripts
Usage:
sh run_build_data.sh [udc|swda|mrda|atis]
To generate the train/dev/test sets for the DA tasks:
sh run_build_data.sh swda
sh run_build_data.sh mrda
The generated data is placed under open-dialog/data/swda and open-dialog/data/mrda respectively.
To generate the train/dev/test sets for the DST task:
sh run_build_data.sh dstc2
The generated data is placed under open-dialog/data/dstc2.
To generate the train/dev/test sets for the intent detection and slot filling tasks:
sh run_build_data.sh atis
Slot filling data is generated under open-dialog/data/atis/atis_slot.
Intent detection data is generated under open-dialog/data/atis/atis_intent.
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""build swda train dev test dataset"""
import json
import sys
import csv
import os
import re
class ATIS(object):
"""
nlu dataset atis data process
"""
def __init__(self):
"""
init instance
"""
self.slot_id = 2
self.slot_dict = {"PAD": 0, "O": 1}
self.intent_id = 0
self.intent_dict = dict()
self.src_dir = "../data/atis/source_data"
self.out_slot_dir = "../data/atis/atis_slot"
self.out_intent_dir = "../data/atis/atis_intent"
self.map_tag_slot = "../data/atis/atis_slot/map_tag_slot_id.txt"
self.map_tag_intent = "../data/atis/atis_intent/map_tag_intent_id.txt"
def _load_file(self, data_type):
"""
load dataset filename
"""
slot_stat = os.path.exists(self.out_slot_dir)
if not slot_stat:
os.makedirs(self.out_slot_dir)
intent_stat = os.path.exists(self.out_intent_dir)
if not intent_stat:
os.makedirs(self.out_intent_dir)
src_examples = []
json_file = os.path.join(self.src_dir, "%s.json" % data_type)
with open(json_file, 'r') as load_f:
json_dict = json.load(load_f)
examples = json_dict['rasa_nlu_data']['common_examples']
for example in examples:
text = example.get('text')
intent = example.get('intent')
entities = example.get('entities')
src_examples.append((text, intent, entities))
return src_examples
def _parser_intent_data(self, examples, data_type):
"""
parser intent dataset
"""
out_filename = "%s/%s.txt" % (self.out_intent_dir, data_type)
with open(out_filename, 'w') as fw:
for example in examples:
if example[1] not in self.intent_dict:
self.intent_dict[example[1]] = self.intent_id
self.intent_id += 1
fw.write("%s\t%s\n" % (self.intent_dict[example[1]], example[0].lower()))
with open(self.map_tag_intent, 'w') as fw:
for tag in self.intent_dict:
fw.write("%s\t%s\n" % (tag, self.intent_dict[tag]))
def _parser_slot_data(self, examples, data_type):
"""
parser slot dataset
"""
out_filename = "%s/%s.txt" % (self.out_slot_dir, data_type)
with open(out_filename, 'w') as fw:
for example in examples:
tags = []
text = example[0]
entities = example[2]
if not entities:
tags = [str(self.slot_dict['O'])] * len(text.strip().split())
# write out examples with no entities (all-O tags) before skipping the entity loop
fw.write("%s\t%s\n" % (text.encode('utf8'), " ".join(tags).encode('utf8')))
continue
for i in range(len(entities)):
enty = entities[i]
start = enty['start']
value_num = len(enty['value'].split())
tags_slot = []
for j in range(value_num):
if j == 0:
bround_tag = "B"
else:
bround_tag = "I"
tag = "%s-%s" % (bround_tag, enty['entity'])
if tag not in self.slot_dict:
self.slot_dict[tag] = self.slot_id
self.slot_id += 1
tags_slot.append(str(self.slot_dict[tag]))
if i == 0:
if start not in [0, 1]:
prefix_num = len(text[: start].strip().split())
tags.extend([str(self.slot_dict['O'])] * prefix_num)
tags.extend(tags_slot)
else:
prefix_num = len(text[entities[i - 1]['end']: start].strip().split())
tags.extend([str(self.slot_dict['O'])] * prefix_num)
tags.extend(tags_slot)
if entities[-1]['end'] < len(text):
suffix_num = len(text[entities[-1]['end']:].strip().split())
tags.extend([str(self.slot_dict['O'])] * suffix_num)
fw.write("%s\t%s\n" % (text.encode('utf8'), " ".join(tags).encode('utf8')))
with open(self.map_tag_slot, 'w') as fw:
for slot in self.slot_dict:
fw.write("%s\t%s\n" % (slot, self.slot_dict[slot]))
def get_train_dataset(self):
"""
parser train dataset and print train.txt
"""
train_examples = self._load_file("train")
self._parser_intent_data(train_examples, "train")
self._parser_slot_data(train_examples, "train")
def get_test_dataset(self):
"""
parser test dataset and print test.txt
"""
test_examples = self._load_file("test")
self._parser_intent_data(test_examples, "test")
self._parser_slot_data(test_examples, "test")
def main(self):
"""
run data process
"""
self.get_train_dataset()
self.get_test_dataset()
if __name__ == "__main__":
atis_inst = ATIS()
atis_inst.main()
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""build mrda train dev test dataset"""
import json
import sys
import csv
import os
import re
import commonlib
class DSTC2(object):
"""
dialogue state tracking dstc2 data process
"""
def __init__(self):
"""
init instance
"""
self.map_tag_dict = {}
self.out_dir = "../data/dstc2/dstc2"
self.out_asr_dir = "../data/dstc2/dstc2_asr"
self.data_list = "./conf/dstc2.conf"
self.map_tag = "../data/dstc2/dstc2/map_tag_id.txt"
self.src_dir = "../data/dstc2/source_data"
self.onto_json = "../data/dstc2/source_data/ontology_dstc2.json"
self._load_file()
self._load_ontology()
def _load_file(self):
"""
load dataset filename
"""
self.data_dict = commonlib.load_dict(self.data_list)
for data_type in self.data_dict:
for i in range(len(self.data_dict[data_type])):
self.data_dict[data_type][i] = os.path.join(self.src_dir, self.data_dict[data_type][i])
def _load_ontology(self):
"""
load ontology tag
"""
tag_id = 1
self.map_tag_dict['none'] = 0
with open(self.onto_json, 'r') as fr:
ontology = json.load(fr)
slots_values = ontology['informable']
for slot in slots_values:
for value in slots_values[slot]:
key = "%s_%s" % (slot, value)
self.map_tag_dict[key] = tag_id
tag_id += 1
key = "%s_none" % (slot)
self.map_tag_dict[key] = tag_id
tag_id += 1
def _parser_dataset(self, data_type):
"""
parser train dev test dataset
"""
stat = os.path.exists(self.out_dir)
if not stat:
os.makedirs(self.out_dir)
asr_stat = os.path.exists(self.out_asr_dir)
if not asr_stat:
os.makedirs(self.out_asr_dir)
out_file = os.path.join(self.out_dir, "%s.txt" % data_type)
out_asr_file = os.path.join(self.out_asr_dir, "%s.txt" % data_type)
with open(out_file, 'w') as fw, open(out_asr_file, 'w') as fw_asr:
data_list = self.data_dict.get(data_type)
for fn in data_list:
log_file = os.path.join(fn, "log.json")
label_file = os.path.join(fn, "label.json")
with open(log_file, 'r') as f_log, open(label_file, 'r') as f_label:
log_json = json.load(f_log)
label_json = json.load(f_label)
session_id = log_json['session-id']
assert len(label_json["turns"]) == len(log_json["turns"])
for i in range(len(label_json["turns"])):
log_turn = log_json["turns"][i]
label_turn = label_json["turns"][i]
assert log_turn["turn-index"] == label_turn["turn-index"]
labels = ["%s_%s" % (slot, label_turn["goal-labels"][slot]) for slot in label_turn["goal-labels"]]
labels_ids = " ".join([str(self.map_tag_dict.get(label, self.map_tag_dict["%s_none" % label.split('_')[0]])) for label in labels])
mach = log_turn['output']['transcript']
user = label_turn['transcription']
if not labels_ids.strip():
labels_ids = self.map_tag_dict['none']
out = "%s\t%s\1%s\t%s" % (session_id, mach, user, labels_ids)
user_asr = log_turn['input']['live']['asr-hyps'][0]['asr-hyp'].strip()
out_asr = "%s\t%s\1%s\t%s" % (session_id, mach, user_asr, labels_ids)
fw.write("%s\n" % out.encode('utf8'))
fw_asr.write("%s\n" % out_asr.encode('utf8'))
def get_train_dataset(self):
"""
parser train dataset and print train.txt
"""
self._parser_dataset("train")
def get_dev_dataset(self):
"""
parser dev dataset and print dev.txt
"""
self._parser_dataset("dev")
def get_test_dataset(self):
"""
parser test dataset and print test.txt
"""
self._parser_dataset("test")
def get_labels(self):
"""
get tag and map ids file
"""
with open(self.map_tag, 'w') as fw:
for elem in self.map_tag_dict:
fw.write("%s\t%s\n" % (elem, self.map_tag_dict[elem]))
def main(self):
"""
run data process
"""
self.get_train_dataset()
self.get_dev_dataset()
self.get_test_dataset()
self.get_labels()
if __name__ == "__main__":
dstc_inst = DSTC2()
dstc_inst.main()
#!/bin/bash
TASK_DATA=$1
typeset -l TASK_DATA
if [ "${TASK_DATA}" = "udc" ]
then
exit 0
elif [ "${TASK_DATA}" = "swda" ]
then
python build_swda_dataset.py
elif [ "${TASK_DATA}" = "mrda" ]
then
python build_mrda_dataset.py
elif [[ "${TASK_DATA}" =~ "atis" ]]
then
python build_atis_dataset.py
cat ../data/atis/atis_slot/test.txt > ../data/atis/atis_slot/dev.txt
cat ../data/atis/atis_intent/test.txt > ../data/atis/atis_intent/dev.txt
elif [ "${TASK_DATA}" = "dstc2" ]
then
python build_dstc2_dataset.py
else
echo "can not support $TASK_DATA , please choose [swda|mrda|atis|dstc2|multi-woz]"
fi
{
"model_type": "textcnn_net",
"vocab_size": 240465
}
"""
EmoTect config
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import json
class EmoTectConfig(object):
"""
EmoTect Config
"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing emotect model config file '%s'" % config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def print_config(self):
"""
Print Config
"""
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
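# Illustrative usage (not part of the original file), assuming the JSON
# shown above is saved as config.json:
#
#   config = EmoTectConfig('config.json')
#   config.print_config()      # prints model_type and vocab_size
#   config['model_type']       -> 'textcnn_net'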
Subproject commit b9dae026c25602b96adf7ee776ff9f894c912338