Unverified commit f492ae4f, authored by pkpk, committed by GitHub

refactor the PaddleNLP (#4351)

* Update README.md (#4267)

* test=develop (#4269)

* 3d use new api (#4275)

* PointNet++ and PointRCNN use new API

* Update Readme of Dygraph BERT (#4277)

Fix some typos.

* Update run_classifier_multi_gpu.sh (#4279)

remove the CUDA_VISIBLE_DEVICES

* Update README.md (#4280)

* 17 update api (#4294)

* update1.7 save/load & fluid.data

* update datafeed to dataloader

* Update resnet_acnet.py (#4297)

The bias attr of the square conv should be "False" rather than None in training mode.

* test=develop

* test=develop

* test=develop

* test=develop

* test
Co-authored-by: Kaipeng Deng <dengkaipeng@baidu.com>
Co-authored-by: zhang wenhui <frankwhzhang@126.com>
Co-authored-by: parap1uie-s <parap1uie-s@users.noreply.github.com>
Parent: 8dc42c73
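Several of the squashed PRs above ("17 update api", "update datafeed to dataloader") move the scripts to the Fluid 1.7 input pipeline that recurs throughout the diffs below: typed `fluid.data` placeholders fed through a `fluid.io.PyReader`. The following is only a minimal sketch of that pattern, not code from this commit; `batch_generator` is a placeholder for any generator yielding feedable batches.

```python
import paddle.fluid as fluid

# Typed input placeholders (1.7-style fluid.data instead of fluid.layers.data).
context = fluid.data(
    name='context_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')

# Reader bound to the placeholders, as the refactored ADE/DGU scripts do.
data_reader = fluid.io.PyReader(
    feed_list=[context, labels], capacity=4, iterable=False)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

# Feeding loop used by the scripts in this commit (batch_generator is assumed
# to be supplied by the caller, e.g. from a DataProcessor.data_generator call):
# data_reader.decorate_batch_generator(batch_generator)
# data_reader.start()
# while True:
#     try:
#         exe.run(fetch_list=[])
#     except fluid.core.EOFException:
#         data_reader.reset()
#         break
```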
......@@ -13,6 +13,3 @@
[submodule "PaddleSpeech/DeepSpeech"]
path = PaddleSpeech/DeepSpeech
url = https://github.com/PaddlePaddle/DeepSpeech.git
[submodule "PaddleNLP/PALM"]
path = PaddleNLP/PALM
url = https://github.com/PaddlePaddle/PALM
Subproject commit 5426f75073cf5bd416622dbe71b146d3dc8fffb6
Subproject commit 30b892e3c029bff706337f269e6c158b0a223f60
......@@ -10,7 +10,7 @@
- **Rich and comprehensive NLP task support:**
  - PaddleNLP provides multi-granularity, multi-scenario application support, covering basic NLP techniques such as [word segmentation](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis), [part-of-speech tagging](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis), and [named entity recognition](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis), as well as core NLP technologies such as [text classification](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/sentiment_classification), [text similarity](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/similarity_net), [semantic representation](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleLARK), and [text generation](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleTextGEN). PaddleNLP also provides dedicated core technologies, tool components, models, and pretrained parameters for common large-scale NLP application systems such as [reading comprehension](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMRC), [dialogue systems](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleDialogue), and [machine translation](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMT), so you can work freely across the NLP landscape.
  - PaddleNLP provides multi-granularity, multi-scenario application support, covering basic NLP techniques such as [word segmentation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), [part-of-speech tagging](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), and [named entity recognition](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), as well as core NLP technologies such as [text classification](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/sentiment_classification), [text similarity](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/similarity_net), [semantic representation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/pretrain_langauge_models), and [text generation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/seq2seq). PaddleNLP also provides dedicated core technologies, tool components, models, and pretrained parameters for common large-scale NLP application systems such as [reading comprehension](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_reading_comprehension), [dialogue systems](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/dialogue_system), and [machine translation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_translation), so you can work freely across the NLP landscape.
- **Stable, reliable NLP models and powerful pretrained parameters:**
......@@ -50,17 +50,17 @@ cd models/PaddleNLP/sentiment_classification
| Task | Project / Directory | Description |
| :------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
| **Chinese word segmentation**, **POS tagging**, **named entity recognition** :fire: | [LAC](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis) | LAC (Lexical Analysis of Chinese) is a Chinese language processing toolkit widely used inside Baidu, covering common tasks such as Chinese word segmentation, part-of-speech tagging, and named entity recognition. |
| **Word embeddings (word2vec)** | [word2vec](https://github.com/PaddlePaddle/models/tree/develop/PaddleRec/word2vec) | Provides distributed training of Chinese word embeddings (single-machine multi-GPU, multi-machine), supports mainstream embedding models (skip-gram, CBOW, etc.), and lets you quickly train embeddings on custom data. |
| **Language model** | [Language_model](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/language_model) | A classic neural language model based on recurrent neural networks (RNN). |
| **Sentiment classification** :fire: | [Senta](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/sentiment_classification), [EmotionDetection](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/emotion_detection) | The Senta (Sentiment Classification) and EmotionDetection projects provide sentiment analysis models for *general scenarios* and *human-machine dialogue scenarios*, respectively. |
| **Text similarity** :fire: | [SimNet](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/similarity_net) | SimNet (Similarity Net) provides efficient and reliable text similarity tools and pretrained models. |
| **Semantic representation** :fire: | [PaddleLARK](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleLARK) | PaddleLARK (Paddle LAnguage Representation toolKit) integrates popular Chinese and English pretrained models such as ELMo, BERT, ERNIE 1.0, ERNIE 2.0, and XLNet. |
| **Text generation** | [PaddleTextGEN](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleTextGEN) | Paddle Text Generation provides a series of classic text generation model examples, such as vanilla seq2seq, seq2seq with attention, and variational seq2seq. |
| **Reading comprehension** | [PaddleMRC](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMRC) | PaddleMRC (Paddle Machine Reading Comprehension) collects Baidu's models, tools, and open datasets for reading comprehension, including DuReader (a large-scale Chinese reading comprehension dataset built from real search-user behavior, open-sourced by Baidu), KT-Net (a knowledge-enhanced reading comprehension model that once ranked first on SQuAD and ReCoRD), D-Net (a pretrain/fine-tune framework that won first place in the EMNLP 2019 MRQA shared task), and more. |
| **Dialogue systems** | [PaddleDialogue](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleDialogue) | Includes: 1) DGU (Dialogue General Understanding), covering common dialogue-system tasks such as context-response matching for **retrieval-based chatbots**, plus **intent recognition**, **slot filling**, and **dialogue state tracking** for **task-oriented dialogue systems**, with the best results on six public datasets. <br/> 2) knowledge-driven dialogue: a knowledge-grounded open-domain dialogue dataset open-sourced by Baidu, published at ACL 2019. <br/> 3) ADEM (Auto Dialogue Evaluation Model): an automatic dialogue evaluation model for assessing the response quality of different dialogue generation models. |
| **Machine translation** | [PaddleMT](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMT) | Paddle Machine Translation, a classic Transformer-based machine translation model. |
| **Other cutting-edge work** | [Research](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research) | Open-source releases of Baidu's latest research work. |
| **Chinese word segmentation**, **POS tagging**, **named entity recognition** :fire: | [LAC](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis) | LAC (Lexical Analysis of Chinese) is a Chinese language processing toolkit widely used inside Baidu, covering common tasks such as Chinese word segmentation, part-of-speech tagging, and named entity recognition. |
| **Word embeddings (word2vec)** | [word2vec](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/word2vec) | Provides distributed training of Chinese word embeddings (single-machine multi-GPU, multi-machine), supports mainstream embedding models (skip-gram, CBOW, etc.), and lets you quickly train embeddings on custom data. |
| **Language model** | [Language_model](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/language_model) | A classic neural language model based on recurrent neural networks (RNN). |
| **Sentiment classification** :fire: | [Senta](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/sentiment_classification), [EmotionDetection](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/emotion_detection) | The Senta (Sentiment Classification) and EmotionDetection projects provide sentiment analysis models for *general scenarios* and *human-machine dialogue scenarios*, respectively. |
| **Text similarity** :fire: | [SimNet](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/similarity_net) | SimNet (Similarity Net) provides efficient and reliable text similarity tools and pretrained models. |
| **Semantic representation** :fire: | [pretrain_langauge_models](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/pretrain_langauge_models) | Integrates popular Chinese and English pretrained models such as ELMo, BERT, ERNIE 1.0, ERNIE 2.0, and XLNet. |
| **Text generation** | [seq2seq](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/PaddleTextGEN) | seq2seq provides a series of classic text generation model examples, such as vanilla seq2seq, seq2seq with attention, and variational seq2seq. |
| **Reading comprehension** | [machine_reading_comprehension](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_reading_comprehension) | Paddle Machine Reading Comprehension collects Baidu's models, tools, and open datasets for reading comprehension, including DuReader (a large-scale Chinese reading comprehension dataset built from real search-user behavior, open-sourced by Baidu), KT-Net (a knowledge-enhanced reading comprehension model that once ranked first on SQuAD and ReCoRD), D-Net (a pretrain/fine-tune framework that won first place in the EMNLP 2019 MRQA shared task), and more. |
| **Dialogue systems** | [dialogue_system](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/dialogue_system) | Includes: 1) DGU (Dialogue General Understanding), covering common dialogue-system tasks such as context-response matching for **retrieval-based chatbots**, plus **intent recognition**, **slot filling**, and **dialogue state tracking** for **task-oriented dialogue systems**, with the best results on six public datasets. <br/> 2) knowledge-driven dialogue: a knowledge-grounded open-domain dialogue dataset open-sourced by Baidu, published at ACL 2019. <br/> 3) ADEM (Auto Dialogue Evaluation Model): an automatic dialogue evaluation model for assessing the response quality of different dialogue generation models. |
| **Machine translation** | [machine_translation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_translation) | Paddle Machine Translation, a classic Transformer-based machine translation model. |
| **Other cutting-edge work** | [Research](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/Research) | Open-source releases of Baidu's latest research work. |
......@@ -70,13 +70,13 @@ cd models/PaddleNLP/sentiment_classification
```text
.
├── Research # Collection of Baidu NLP research work
├── PaddleMT # Machine translation code, data, and pretrained models
├── PaddleDialogue # Dialogue system code, data, and pretrained models
├── PaddleMRC # Reading comprehension code, data, and pretrained models
├── PaddleLARK # Language representation toolkit
├── machine_translation # Machine translation code, data, and pretrained models
├── dialogue_system # Dialogue system code, data, and pretrained models
├── machine_reading_comprehension # Reading comprehension code, data, and pretrained models
├── pretrain_langauge_models # Language representation toolkit
├── language_model # Language model
├── lexical_analysis # LAC lexical analysis
├── models # Shared networks
├── shared_modules/models # Shared networks
│ ├── __init__.py
│ ├── classification
│ ├── dialogue_model_toolkit
......@@ -87,7 +87,7 @@ cd models/PaddleNLP/sentiment_classification
│ ├── representation
│ ├── sequence_labeling
│ └── transformer_encoder.py
├── preprocess # Shared text preprocessing tools
├── shared_modules/preprocess # Shared text preprocessing tools
│ ├── __init__.py
│ ├── ernie
│ ├── padding.py
......
......@@ -468,7 +468,7 @@ python -u main.py \
--loss_type="CLS"
```
#### On Windows:
Evaluation:
```
python -u main.py --do_eval=true --use_cuda=false --evaluation_file=data\input\data\unlabel_data\test.ids --output_prediction_file=data\output\pretrain_matching_predict --loss_type=CLS
```
......
......@@ -21,14 +21,16 @@ from kpi import DurationKpi
train_loss_card1 = CostKpi('train_loss_card1', 0.03, 0, actived=True)
train_loss_card4 = CostKpi('train_loss_card4', 0.03, 0, actived=True)
train_duration_card1 = DurationKpi('train_duration_card1', 0.01, 0, actived=True)
train_duration_card4 = DurationKpi('train_duration_card4', 0.01, 0, actived=True)
train_duration_card1 = DurationKpi(
'train_duration_card1', 0.01, 0, actived=True)
train_duration_card4 = DurationKpi(
'train_duration_card4', 0.01, 0, actived=True)
tracking_kpis = [
train_loss_card1,
train_loss_card4,
train_duration_card1,
train_duration_card4,
train_loss_card1,
train_loss_card4,
train_duration_card1,
train_duration_card4,
]
......
......@@ -20,48 +20,52 @@ import sys
import io
import os
URLLIB=urllib
if sys.version_info >= (3, 0):
URLLIB = urllib
if sys.version_info >= (3, 0):
import urllib.request
URLLIB=urllib.request
URLLIB = urllib.request
DATA_MODEL_PATH = {"DATA_PATH": "https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_dataset-1.0.0.tar.gz",
"TRAINED_MODEL": "https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_models.2.0.0.tar.gz"}
DATA_MODEL_PATH = {
"DATA_PATH":
"https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_dataset-1.0.0.tar.gz",
"TRAINED_MODEL":
"https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_models.2.0.0.tar.gz"
}
PATH_MAP = {'DATA_PATH': "./data/input",
'TRAINED_MODEL': './data/saved_models'}
PATH_MAP = {'DATA_PATH': "./data/input", 'TRAINED_MODEL': './data/saved_models'}
def un_tar(tar_name, dir_name):
try:
def un_tar(tar_name, dir_name):
try:
t = tarfile.open(tar_name)
t.extractall(path = dir_name)
t.extractall(path=dir_name)
return True
except Exception as e:
print(e)
return False
def download_model_and_data():
def download_model_and_data():
print("Downloading ade data, pretrain model and trained models......")
print("This process is quite long, please wait patiently............")
for path in ['./data/input/data', './data/saved_models/trained_models']:
if not os.path.exists(path):
for path in ['./data/input/data', './data/saved_models/trained_models']:
if not os.path.exists(path):
continue
shutil.rmtree(path)
for path_key in DATA_MODEL_PATH:
for path_key in DATA_MODEL_PATH:
filename = os.path.basename(DATA_MODEL_PATH[path_key])
URLLIB.urlretrieve(DATA_MODEL_PATH[path_key], os.path.join("./", filename))
URLLIB.urlretrieve(DATA_MODEL_PATH[path_key],
os.path.join("./", filename))
state = un_tar(filename, PATH_MAP[path_key])
if not state:
if not state:
print("Tar %s error....." % path_key)
return False
os.remove(filename)
return True
if __name__ == "__main__":
if __name__ == "__main__":
state = download_model_and_data()
if not state:
if not state:
exit(1)
print("Downloading data and models sucess......")
......@@ -25,8 +25,8 @@ import numpy as np
import paddle.fluid as fluid
class InputField(object):
def __init__(self, input_field):
class InputField(object):
def __init__(self, input_field):
"""init inpit field"""
self.context_wordseq = input_field[0]
self.response_wordseq = input_field[1]
......
......@@ -30,7 +30,7 @@ def check_cuda(use_cuda, err = \
if __name__ == "__main__":
check_cuda(True)
check_cuda(False)
......
......@@ -69,8 +69,8 @@ def init_from_checkpoint(args, exe, program):
def init_from_params(args, exe, program):
assert isinstance(args.init_from_params, str)
if not os.path.exists(args.init_from_params):
if not os.path.exists(args.init_from_params):
raise Warning("the params path does not exist.")
return False
......@@ -122,5 +122,3 @@ def save_param(args, exe, program, dirname):
print("save parameters at %s" % (os.path.join(param_dir, dirname)))
return True
......@@ -21,14 +21,13 @@ import paddle
import paddle.fluid as fluid
def create_net(
is_training,
model_input,
args,
clip_value=10.0,
word_emb_name="shared_word_emb",
lstm_W_name="shared_lstm_W",
lstm_bias_name="shared_lstm_bias"):
def create_net(is_training,
model_input,
args,
clip_value=10.0,
word_emb_name="shared_word_emb",
lstm_W_name="shared_lstm_W",
lstm_bias_name="shared_lstm_bias"):
context_wordseq = model_input.context_wordseq
response_wordseq = model_input.response_wordseq
......@@ -52,17 +51,15 @@ def create_net(
initializer=fluid.initializer.Normal(scale=0.1)))
#fc to fit dynamic LSTM
context_fc = fluid.layers.fc(
input=context_emb,
size=args.hidden_size * 4,
param_attr=fluid.ParamAttr(name='fc_weight'),
bias_attr=fluid.ParamAttr(name='fc_bias'))
context_fc = fluid.layers.fc(input=context_emb,
size=args.hidden_size * 4,
param_attr=fluid.ParamAttr(name='fc_weight'),
bias_attr=fluid.ParamAttr(name='fc_bias'))
response_fc = fluid.layers.fc(
input=response_emb,
size=args.hidden_size * 4,
param_attr=fluid.ParamAttr(name='fc_weight'),
bias_attr=fluid.ParamAttr(name='fc_bias'))
response_fc = fluid.layers.fc(input=response_emb,
size=args.hidden_size * 4,
param_attr=fluid.ParamAttr(name='fc_weight'),
bias_attr=fluid.ParamAttr(name='fc_bias'))
#LSTM
context_rep, _ = fluid.layers.dynamic_lstm(
......@@ -82,7 +79,7 @@ def create_net(
logits = fluid.layers.bilinear_tensor_product(
context_rep, response_rep, size=1)
if args.loss_type == 'CLS':
if args.loss_type == 'CLS':
label = fluid.layers.cast(x=label, dtype='float32')
loss = fluid.layers.sigmoid_cross_entropy_with_logits(logits, label)
loss = fluid.layers.reduce_mean(
......@@ -95,10 +92,10 @@ def create_net(
loss = fluid.layers.reduce_mean(loss)
else:
raise ValueError
if is_training:
if is_training:
return loss
else:
else:
return logits
......@@ -106,7 +103,5 @@ def set_word_embedding(word_emb, place, word_emb_name="shared_word_emb"):
"""
Set word embedding
"""
word_emb_param = fluid.global_scope().find_var(
word_emb_name).get_tensor()
word_emb_param = fluid.global_scope().find_var(word_emb_name).get_tensor()
word_emb_param.set(word_emb, place)
......@@ -23,13 +23,13 @@ import ade.evaluate as evaluate
from ade.utils.configure import PDConfig
def do_eval(args):
def do_eval(args):
"""evaluate metrics"""
labels = []
fr = io.open(args.evaluation_file, 'r', encoding="utf8")
for line in fr:
for line in fr:
tokens = line.strip().split('\t')
assert len(tokens) == 3
assert len(tokens) == 3
label = int(tokens[2])
labels.append(label)
......@@ -43,25 +43,25 @@ def do_eval(args):
score = score.astype(np.float64)
scores.append(score)
if args.loss_type == 'CLS':
if args.loss_type == 'CLS':
recall_dict = evaluate.evaluate_Recall(list(zip(scores, labels)))
mean_score = sum(scores) / len(scores)
print('mean score: %.6f' % mean_score)
print('evaluation recall result:')
print('1_in_2: %.6f\t1_in_10: %.6f\t2_in_10: %.6f\t5_in_10: %.6f' %
(recall_dict['1_in_2'], recall_dict['1_in_10'],
recall_dict['2_in_10'], recall_dict['5_in_10']))
elif args.loss_type == 'L2':
(recall_dict['1_in_2'], recall_dict['1_in_10'],
recall_dict['2_in_10'], recall_dict['5_in_10']))
elif args.loss_type == 'L2':
scores = [x[0] for x in scores]
mean_score = sum(scores) / len(scores)
cor = evaluate.evaluate_cor(scores, labels)
print('mean score: %.6f\nevaluation cor results:%.6f' %
(mean_score, cor))
(mean_score, cor))
else:
raise ValueError
if __name__ == "__main__":
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
......
......@@ -42,22 +42,24 @@ def do_save_inference_model(args):
with fluid.unique_name.guard():
context_wordseq = fluid.data(
name='context_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
name='context_wordseq',
shape=[-1, 1],
dtype='int64',
lod_level=1)
response_wordseq = fluid.data(
name='response_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
labels = fluid.data(
name='labels', shape=[-1, 1], dtype='int64')
name='response_wordseq',
shape=[-1, 1],
dtype='int64',
lod_level=1)
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')
input_inst = [context_wordseq, response_wordseq, labels]
input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(feed_list=input_inst,
capacity=4, iterable=False)
data_reader = fluid.io.PyReader(
feed_list=input_inst, capacity=4, iterable=False)
logits = create_net(
is_training=False,
model_input=input_field,
args=args
)
is_training=False, model_input=input_field, args=args)
if args.use_cuda:
place = fluid.CUDAPlace(0)
......@@ -68,7 +70,7 @@ def do_save_inference_model(args):
exe.run(startup_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model)
if args.init_from_params:
save_load_io.init_from_params(args, exe, test_prog)
elif args.init_from_pretrain_model:
......@@ -76,24 +78,22 @@ def do_save_inference_model(args):
# saving inference model
fluid.io.save_inference_model(
args.inference_model_dir,
feeded_var_names=[
input_field.context_wordseq.name,
input_field.response_wordseq.name,
],
target_vars=[
logits,
],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
args.inference_model_dir,
feeded_var_names=[
input_field.context_wordseq.name,
input_field.response_wordseq.name,
],
target_vars=[logits, ],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
print("save inference model at %s" % (args.inference_model_dir))
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/ade.yaml")
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
check_cuda(args.use_cuda)
......
......@@ -26,7 +26,6 @@ from inference_model import do_save_inference_model
from ade.utils.configure import PDConfig
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/ade.yaml")
......
......@@ -32,7 +32,7 @@ from ade.utils.model_check import check_cuda
import ade.utils.save_load_io as save_load_io
def do_predict(args):
def do_predict(args):
"""
predict function
"""
......@@ -46,30 +46,32 @@ def do_predict(args):
with fluid.unique_name.guard():
context_wordseq = fluid.data(
name='context_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
name='context_wordseq',
shape=[-1, 1],
dtype='int64',
lod_level=1)
response_wordseq = fluid.data(
name='response_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
labels = fluid.data(
name='labels', shape=[-1, 1], dtype='int64')
name='response_wordseq',
shape=[-1, 1],
dtype='int64',
lod_level=1)
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')
input_inst = [context_wordseq, response_wordseq, labels]
input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(feed_list=input_inst,
capacity=4, iterable=False)
data_reader = fluid.io.PyReader(
feed_list=input_inst, capacity=4, iterable=False)
logits = create_net(
is_training=False,
model_input=input_field,
args=args
)
is_training=False, model_input=input_field, args=args)
logits.persistable = True
fetch_list = [logits.name]
#for_test is True if change the is_test attribute of operators to True
test_prog = test_prog.clone(for_test=True)
if args.use_cuda:
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
else:
place = fluid.CPUPlace()
exe = fluid.Executor(place)
......@@ -85,42 +87,39 @@ def do_predict(args):
processor = reader.DataProcessor(
data_path=args.predict_file,
max_seq_length=args.max_seq_len,
max_seq_length=args.max_seq_len,
batch_size=args.batch_size)
batch_generator = processor.data_generator(
place=place,
phase="test",
shuffle=False,
sample_pro=1)
place=place, phase="test", shuffle=False, sample_pro=1)
num_test_examples = processor.get_num_examples(phase='test')
data_reader.decorate_batch_generator(batch_generator)
data_reader.start()
scores = []
while True:
try:
while True:
try:
results = exe.run(compiled_test_prog, fetch_list=fetch_list)
scores.extend(results[0])
except fluid.core.EOFException:
data_reader.reset()
break
scores = scores[: num_test_examples]
scores = scores[:num_test_examples]
print("Write the predicted results into the output_prediction_file")
fw = io.open(args.output_prediction_file, 'w', encoding="utf8")
for index, score in enumerate(scores):
for index, score in enumerate(scores):
fw.write("%s\t%s\n" % (index, score))
print("finish........................................")
if __name__ == "__main__":
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
args.Print()
check_cuda(args.use_cuda)
do_predict(args)
do_predict(args)
......@@ -31,7 +31,7 @@ from ade.utils.input_field import InputField
from ade.utils.model_check import check_cuda
import ade.utils.save_load_io as save_load_io
try:
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
......@@ -47,24 +47,26 @@ def do_train(args):
train_prog.random_seed = args.random_seed
startup_prog.random_seed = args.random_seed
with fluid.unique_name.guard():
with fluid.unique_name.guard():
context_wordseq = fluid.data(
name='context_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
name='context_wordseq',
shape=[-1, 1],
dtype='int64',
lod_level=1)
response_wordseq = fluid.data(
name='response_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
labels = fluid.data(
name='labels', shape=[-1, 1], dtype='int64')
name='response_wordseq',
shape=[-1, 1],
dtype='int64',
lod_level=1)
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')
input_inst = [context_wordseq, response_wordseq, labels]
input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(feed_list=input_inst,
capacity=4, iterable=False)
data_reader = fluid.io.PyReader(
feed_list=input_inst, capacity=4, iterable=False)
loss = create_net(
is_training=True,
model_input=input_field,
args=args
)
is_training=True, model_input=input_field, args=args)
loss.persistable = True
# gradient clipping
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
......@@ -74,20 +76,21 @@ def do_train(args):
if args.use_cuda:
dev_count = fluid.core.get_cuda_device_count()
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CUDAPlace(
int(os.getenv('FLAGS_selected_gpus', '0')))
else:
dev_count = int(os.environ.get('CPU_NUM', 1))
place = fluid.CPUPlace()
processor = reader.DataProcessor(
data_path=args.training_file,
max_seq_length=args.max_seq_len,
max_seq_length=args.max_seq_len,
batch_size=args.batch_size)
batch_generator = processor.data_generator(
place=place,
phase="train",
shuffle=True,
shuffle=True,
sample_pro=args.sample_pro)
num_train_examples = processor.get_num_examples(phase='train')
......@@ -105,18 +108,23 @@ def do_train(args):
args.init_from_pretrain_model == "")
#init from some checkpoint, to resume the previous training
if args.init_from_checkpoint:
if args.init_from_checkpoint:
save_load_io.init_from_checkpoint(args, exe, train_prog)
#init from some pretrain models, to better solve the current task
if args.init_from_pretrain_model:
if args.init_from_pretrain_model:
save_load_io.init_from_pretrain_model(args, exe, train_prog)
if args.word_emb_init:
print("start loading word embedding init ...")
if six.PY2:
word_emb = np.array(pickle.load(io.open(args.word_emb_init, 'rb'))).astype('float32')
word_emb = np.array(
pickle.load(io.open(args.word_emb_init, 'rb'))).astype(
'float32')
else:
word_emb = np.array(pickle.load(io.open(args.word_emb_init, 'rb'), encoding="bytes")).astype('float32')
word_emb = np.array(
pickle.load(
io.open(args.word_emb_init, 'rb'),
encoding="bytes")).astype('float32')
set_word_embedding(word_emb, place)
print("finish init word embedding ...")
......@@ -124,69 +132,74 @@ def do_train(args):
build_strategy.enable_inplace = True
compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
loss_name=loss.name, build_strategy=build_strategy)
steps = 0
begin_time = time.time()
time_begin = time.time()
time_begin = time.time()
for epoch_step in range(args.epoch):
for epoch_step in range(args.epoch):
data_reader.start()
sum_loss = 0.0
ce_loss = 0.0
while True:
try:
try:
fetch_list = [loss.name]
outputs = exe.run(compiled_train_prog, fetch_list=fetch_list)
np_loss = outputs
sum_loss += np.array(np_loss).mean()
ce_loss = np.array(np_loss).mean()
if steps % args.print_steps == 0:
if steps % args.print_steps == 0:
time_end = time.time()
used_time = time_end - time_begin
current_time = time.strftime('%Y-%m-%d %H:%M:%S',
time.localtime(time.time()))
print('%s epoch: %d, step: %s, avg loss %s, speed: %f steps/s' % (current_time, epoch_step, steps, sum_loss / args.print_steps, args.print_steps / used_time))
time.localtime(time.time()))
print(
'%s epoch: %d, step: %s, avg loss %s, speed: %f steps/s'
% (current_time, epoch_step, steps, sum_loss /
args.print_steps, args.print_steps / used_time))
sum_loss = 0.0
time_begin = time.time()
if steps % args.save_steps == 0:
if steps % args.save_steps == 0:
if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog, "step_" + str(steps))
if args.save_param:
save_load_io.save_param(args, exe, train_prog, "step_" + str(steps))
save_load_io.save_checkpoint(args, exe, train_prog,
"step_" + str(steps))
if args.save_param:
save_load_io.save_param(args, exe, train_prog,
"step_" + str(steps))
steps += 1
except fluid.core.EOFException:
except fluid.core.EOFException:
data_reader.reset()
break
if args.save_checkpoint:
if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog, "step_final")
if args.save_param:
if args.save_param:
save_load_io.save_param(args, exe, train_prog, "step_final")
def get_cards():
def get_cards():
num = 0
cards = os.environ.get('CUDA_VISIBLE_DEVICES', '')
if cards != '':
if cards != '':
num = len(cards.split(","))
return num
if args.enable_ce:
if args.enable_ce:
card_num = get_cards()
pass_time_cost = time.time() - begin_time
print("test_card_num", card_num)
print("kpis\ttrain_duration_card%s\t%s" % (card_num, pass_time_cost))
print("kpis\ttrain_loss_card%s\t%f" % (card_num, ce_loss))
if __name__ == '__main__':
args = PDConfig(yaml_file="./data/config/ade.yaml")
args.build()
args.Print()
check_cuda(args.use_cuda)
do_train(args)
......@@ -62,7 +62,7 @@ SWDA:Switchboard Dialogue Act Corpus;
&ensp;&ensp;&ensp;&ensp;Download the datasets and related models:
&ensp;&ensp;&ensp;&ensp;On Linux:
```
python dgu/prepare_data_and_model.py
```
&ensp;&ensp;&ensp;&ensp;Data path: data/input/data
......@@ -72,7 +72,7 @@ python dgu/prepare_data_and_model.py
&ensp;&ensp;&ensp;&ensp;On Windows:
```
python dgu\prepare_data_and_model.py
```
&ensp;&ensp;&ensp;&ensp;The downloaded datasets already include the training, test, and validation sets. If you need to regenerate the training data for a task, run:
......@@ -164,19 +164,19 @@ task_type: train,predict, evaluate, inference, all, 选择5个参数选项中
Training example: bash run.sh atis_intent train
```
&ensp;&ensp;&ensp;&ensp;To train on CPU:
```
Set the following parameters in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp;To train on GPU:
```
Set the following parameters in run.sh:
1. For single-GPU training (specify an idle GPU):
export CUDA_VISIBLE_DEVICES=0
2. For multi-GPU training (specify idle GPUs):
export CUDA_VISIBLE_DEVICES=0,1,2,3
```
......@@ -252,19 +252,19 @@ task_type: train,predict, evaluate, inference, all, 选择5个参数选项中
Prediction example: bash run.sh atis_intent predict
```
&ensp;&ensp;&ensp;&ensp;To predict on CPU:
```
Set the following parameters in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp;To predict on GPU:
```
Set the following parameters in run.sh:
Single-GPU prediction is supported (specify an idle GPU):
export CUDA_VISIBLE_DEVICES=0
```
Note: when predicting with Method 1, you can change the init_from_params parameter in run.sh to point to your own trained model; by default, the code loads the officially released trained model.
......@@ -348,7 +348,7 @@ task_type: train,predict, evaluate, inference, all, 选择5个参数选项中
Note: evaluation computes a score between ground_truth and predict_label; by default, running on CPU is sufficient.
#### &ensp;&ensp;&ensp;&ensp;Method 2: run the evaluation code directly:
```
TASK_NAME="atis_intent"  #specify the task name to predict
......@@ -363,7 +363,7 @@ python -u main.py \
#### On Windows
```
python -u main.py --task_name=atis_intent --use_cuda=false --do_eval=true --evaluation_file=data\input\data\atis\atis_intent\test.txt --output_prediction_file=data\output\pred_atis_intent
```
### Model inference
......@@ -378,22 +378,22 @@ task_type: train,predict, evaluate, inference, all, 选择5个参数选项中
Example of saving the inference model: bash run.sh atis_intent inference
```
&ensp;&ensp;&ensp;&ensp;To run the inference-model step on CPU:
```
Set the following parameters in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp;To run the inference-model step on GPU:
```
Set the following parameters in run.sh:
1. Single-GPU inference (specify an idle GPU):
export CUDA_VISIBLE_DEVICES=0
```
#### &ensp;&ensp;&ensp;&ensp;Method 2: run the inference-model code directly:
```
TASK_NAME="atis_intent"  #specify the task name to predict
......@@ -459,7 +459,7 @@ python -u main.py \
&ensp;&ensp;&ensp;&ensp;You can also assemble a custom model to fit your own needs, as follows:
&ensp;&ensp;&ensp;&ensp;a. Custom data
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;If you have a dataset named **task_name**, create a **task_name** folder under **data/input/data** and place the dataset there; add a custom data-processing class in **dgu/reader.py** (for example, the **udc** dataset corresponds to **UDCProcessor**); then register the **task_name**-to-**processor** mapping in **train.py** (e.g. **processors = {'udc': reader.UDCProcessor}**). A minimal sketch of such a processor is shown below.
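&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;This sketch is hypothetical: the dataset name **my_task**, the one-example-per-line `label\ttext` format, and the standalone class are illustrative assumptions; the real processors in **dgu/reader.py** (such as **UDCProcessor**) define the actual interface to follow.
```python
import io
import os


class MyTaskProcessor(object):
    """Hypothetical data processor for a dataset named my_task."""

    def __init__(self, data_dir):
        # data/input/data/my_task is the task folder described above.
        self.data_dir = os.path.join(data_dir, "my_task")

    def get_labels(self):
        # Placeholder label set for a binary classification task.
        return ["0", "1"]

    def get_train_examples(self):
        # Assumes one "label\ttext" example per line in train.txt.
        examples = []
        with io.open(
                os.path.join(self.data_dir, "train.txt"),
                encoding="utf8") as fr:
            for line in fr:
                label, text = line.rstrip("\n").split("\t", 1)
                examples.append((label, text))
        return examples


# In train.py, register the new task next to the existing ones, e.g.:
# processors = {'udc': reader.UDCProcessor, 'my_task': MyTaskProcessor}
```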
......@@ -481,7 +481,7 @@ python -u main.py \
- Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. Technical report, International Computer Science Institute, Berkeley, CA.
- Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.
- Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31.
- Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404–413.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In Interspeech, pages 2524–2528.
- Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1118–1127.
- Su Zhu and Kai Yu. 2017. Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5675–5679. IEEE.
......
......@@ -20,20 +20,26 @@ from kpi import CostKpi
from kpi import DurationKpi
from kpi import AccKpi
each_step_duration_atis_slot_card1 = DurationKpi('each_step_duration_atis_slot_card1', 0.01, 0, actived=True)
train_loss_atis_slot_card1 = CostKpi('train_loss_atis_slot_card1', 0.08, 0, actived=True)
train_acc_atis_slot_card1 = CostKpi('train_acc_atis_slot_card1', 0.01, 0, actived=True)
each_step_duration_atis_slot_card4 = DurationKpi('each_step_duration_atis_slot_card4', 0.06, 0, actived=True)
train_loss_atis_slot_card4 = CostKpi('train_loss_atis_slot_card4', 0.03, 0, actived=True)
train_acc_atis_slot_card4 = CostKpi('train_acc_atis_slot_card4', 0.01, 0, actived=True)
each_step_duration_atis_slot_card1 = DurationKpi(
'each_step_duration_atis_slot_card1', 0.01, 0, actived=True)
train_loss_atis_slot_card1 = CostKpi(
'train_loss_atis_slot_card1', 0.08, 0, actived=True)
train_acc_atis_slot_card1 = CostKpi(
'train_acc_atis_slot_card1', 0.01, 0, actived=True)
each_step_duration_atis_slot_card4 = DurationKpi(
'each_step_duration_atis_slot_card4', 0.06, 0, actived=True)
train_loss_atis_slot_card4 = CostKpi(
'train_loss_atis_slot_card4', 0.03, 0, actived=True)
train_acc_atis_slot_card4 = CostKpi(
'train_acc_atis_slot_card4', 0.01, 0, actived=True)
tracking_kpis = [
each_step_duration_atis_slot_card1,
train_loss_atis_slot_card1,
train_acc_atis_slot_card1,
each_step_duration_atis_slot_card4,
train_loss_atis_slot_card4,
train_acc_atis_slot_card4,
each_step_duration_atis_slot_card1,
train_loss_atis_slot_card1,
train_acc_atis_slot_card1,
each_step_duration_atis_slot_card4,
train_loss_atis_slot_card4,
train_acc_atis_slot_card4,
]
......
......@@ -75,8 +75,8 @@ def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
def prepare_batch_data(task_name,
insts,
max_len,
insts,
max_len,
total_token_num,
voc_size=0,
pad_id=None,
......@@ -98,14 +98,18 @@ def prepare_batch_data(task_name,
# compatible with squad, whose example includes start/end positions,
# or unique id
if isinstance(insts[0][3], list):
if task_name == "atis_slot":
labels_list = [inst[3] + [0] * (max_len - len(inst[3])) for inst in insts]
labels_list = [np.array(labels_list).astype("int64").reshape([-1, max_len])]
elif task_name == "dstc2":
if isinstance(insts[0][3], list):
if task_name == "atis_slot":
labels_list = [
inst[3] + [0] * (max_len - len(inst[3])) for inst in insts
]
labels_list = [
np.array(labels_list).astype("int64").reshape([-1, max_len])
]
elif task_name == "dstc2":
labels_list = [inst[3] for inst in insts]
labels_list = [np.array(labels_list).astype("int64")]
else:
else:
for i in range(3, len(insts[0]), 1):
labels = [inst[i] for inst in insts]
labels = np.array(labels).astype("int64").reshape([-1, 1])
......@@ -124,28 +128,25 @@ def prepare_batch_data(task_name,
out = batch_src_ids
# Second step: padding
src_id, self_input_mask = pad_batch_data(
out,
max_len,
pad_idx=pad_id,
return_input_mask=True)
out, max_len, pad_idx=pad_id, return_input_mask=True)
pos_id = pad_batch_data(
batch_pos_ids,
max_len,
pad_idx=pad_id,
return_pos=False,
batch_pos_ids,
max_len,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
sent_id = pad_batch_data(
batch_sent_ids,
max_len,
pad_idx=pad_id,
return_pos=False,
batch_sent_ids,
max_len,
pad_idx=pad_id,
return_pos=False,
return_input_mask=False)
if mask_id >= 0:
return_list = [
src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
] + labels_list
else:
else:
return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list
return return_list if len(return_list) > 1 else return_list[0]
......@@ -163,13 +164,13 @@ def pad_batch_data(insts,
corresponding position data and attention bias.
"""
return_list = []
max_len = max_len_in if max_len_in != -1 else max(len(inst) for inst in insts)
max_len = max_len_in if max_len_in != -1 else max(
len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts
])
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, max_len])]
# position data
......@@ -183,10 +184,10 @@ def pad_batch_data(insts,
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] *
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
......
......@@ -21,31 +21,34 @@ import paddle
import paddle.fluid as fluid
class DefinePredict(object):
class DefinePredict(object):
"""
Packaging Prediction Results
"""
def __init__(self):
def __init__(self):
"""
init
"""
self.task_map = {'udc': 'get_matching_res',
'swda': 'get_cls_res',
'mrda': 'get_cls_res',
'atis_intent': 'get_cls_res',
'atis_slot': 'get_sequence_tagging',
'dstc2': 'get_multi_cls_res',
'dstc2_asr': 'get_multi_cls_res',
'multi-woz': 'get_multi_cls_res'}
self.task_map = {
'udc': 'get_matching_res',
'swda': 'get_cls_res',
'mrda': 'get_cls_res',
'atis_intent': 'get_cls_res',
'atis_slot': 'get_sequence_tagging',
'dstc2': 'get_multi_cls_res',
'dstc2_asr': 'get_multi_cls_res',
'multi-woz': 'get_multi_cls_res'
}
def get_matching_res(self, probs, params=None):
def get_matching_res(self, probs, params=None):
"""
get matching score
"""
probs = list(probs)
return probs[1]
def get_cls_res(self, probs, params=None):
def get_cls_res(self, probs, params=None):
"""
get da classify tag
"""
......@@ -54,7 +57,7 @@ class DefinePredict(object):
tag = probs.index(max_prob)
return tag
def get_sequence_tagging(self, probs, params=None):
def get_sequence_tagging(self, probs, params=None):
"""
get sequence tagging tag
"""
......@@ -63,23 +66,19 @@ class DefinePredict(object):
labels = [" ".join([str(l) for l in list(l_l)]) for l_l in batch_labels]
return labels
def get_multi_cls_res(self, probs, params=None):
def get_multi_cls_res(self, probs, params=None):
"""
get dst classify tag
"""
labels = []
probs = list(probs)
for i in range(len(probs)):
if probs[i] >= 0.5:
for i in range(len(probs)):
if probs[i] >= 0.5:
labels.append(i)
if not labels:
if not labels:
max_prob = max(probs)
label_str = str(probs.index(max_prob))
else:
else:
label_str = " ".join([str(l) for l in sorted(labels)])
return label_str
......@@ -20,51 +20,60 @@ import sys
import io
import os
URLLIB=urllib
if sys.version_info >= (3, 0):
URLLIB = urllib
if sys.version_info >= (3, 0):
import urllib.request
URLLIB=urllib.request
URLLIB = urllib.request
DATA_MODEL_PATH = {"DATA_PATH": "https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz",
"PRETRAIN_MODEL": "https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz",
"TRAINED_MODEL": "https://baidu-nlp.bj.bcebos.com/dgu_models_2.0.0.tar.gz"}
DATA_MODEL_PATH = {
"DATA_PATH": "https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz",
"PRETRAIN_MODEL":
"https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz",
"TRAINED_MODEL": "https://baidu-nlp.bj.bcebos.com/dgu_models_2.0.0.tar.gz"
}
PATH_MAP = {'DATA_PATH': "./data/input",
'PRETRAIN_MODEL': './data/pretrain_model',
'TRAINED_MODEL': './data/saved_models'}
PATH_MAP = {
'DATA_PATH': "./data/input",
'PRETRAIN_MODEL': './data/pretrain_model',
'TRAINED_MODEL': './data/saved_models'
}
def un_tar(tar_name, dir_name):
try:
def un_tar(tar_name, dir_name):
try:
t = tarfile.open(tar_name)
t.extractall(path = dir_name)
t.extractall(path=dir_name)
return True
except Exception as e:
print(e)
return False
def download_model_and_data():
def download_model_and_data():
print("Downloading dgu data, pretrain model and trained models......")
print("This process is quite long, please wait patiently............")
for path in ['./data/input/data', './data/pretrain_model/uncased_L-12_H-768_A-12', './data/saved_models/trained_models']:
if not os.path.exists(path):
for path in [
'./data/input/data',
'./data/pretrain_model/uncased_L-12_H-768_A-12',
'./data/saved_models/trained_models'
]:
if not os.path.exists(path):
continue
shutil.rmtree(path)
for path_key in DATA_MODEL_PATH:
for path_key in DATA_MODEL_PATH:
filename = os.path.basename(DATA_MODEL_PATH[path_key])
URLLIB.urlretrieve(DATA_MODEL_PATH[path_key], os.path.join("./", filename))
URLLIB.urlretrieve(DATA_MODEL_PATH[path_key],
os.path.join("./", filename))
state = un_tar(filename, PATH_MAP[path_key])
if not state:
if not state:
print("Tar %s error....." % path_key)
return False
os.remove(filename)
return True
if __name__ == "__main__":
if __name__ == "__main__":
state = download_model_and_data()
if not state:
if not state:
exit(1)
print("Downloading data and models sucess......")
......@@ -6,7 +6,7 @@ scripts:运行数据处理脚本目录, 将官方公开数据集转换成模
python run_build_data.py udc
The generated data is in dialogue_general_understanding/data/input/data/udc
2) To generate the training, dev, and test sets for the DA tasks:
python run_build_data.py swda
python run_build_data.py mrda
The generated data is in dialogue_general_understanding/data/input/data/swda and dialogue_general_understanding/data/input/data/mrda, respectively
......@@ -19,6 +19,3 @@ python run_build_data.py udc
python run_build_data.py atis
The generated slot-filling data is in dialogue_general_understanding/data/input/data/atis/atis_slot
The generated intent-detection data is in dialogue_general_understanding/data/input/data/atis/atis_intent
......@@ -12,7 +12,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""build swda train dev test dataset"""
import json
......@@ -23,11 +22,12 @@ import io
import re
class ATIS(object):
class ATIS(object):
"""
nlu dataset atis data process
"""
def __init__(self):
def __init__(self):
"""
init instance
"""
......@@ -41,91 +41,94 @@ class ATIS(object):
self.map_tag_slot = "../../data/input/data/atis/atis_slot/map_tag_slot_id.txt"
self.map_tag_intent = "../../data/input/data/atis/atis_intent/map_tag_intent_id.txt"
def _load_file(self, data_type):
def _load_file(self, data_type):
"""
load dataset filename
"""
slot_stat = os.path.exists(self.out_slot_dir)
if not slot_stat:
if not slot_stat:
os.makedirs(self.out_slot_dir)
intent_stat = os.path.exists(self.out_intent_dir)
if not intent_stat:
if not intent_stat:
os.makedirs(self.out_intent_dir)
src_examples = []
json_file = os.path.join(self.src_dir, "%s.json" % data_type)
load_f = io.open(json_file, 'r', encoding="utf8")
json_dict = json.load(load_f)
examples = json_dict['rasa_nlu_data']['common_examples']
for example in examples:
for example in examples:
text = example.get('text')
intent = example.get('intent')
entities = example.get('entities')
src_examples.append((text, intent, entities))
return src_examples
def _parser_intent_data(self, examples, data_type):
def _parser_intent_data(self, examples, data_type):
"""
parser intent dataset
"""
out_filename = "%s/%s.txt" % (self.out_intent_dir, data_type)
fw = io.open(out_filename, 'w', encoding="utf8")
for example in examples:
if example[1] not in self.intent_dict:
for example in examples:
if example[1] not in self.intent_dict:
self.intent_dict[example[1]] = self.intent_id
self.intent_id += 1
fw.write(u"%s\t%s\n" % (self.intent_dict[example[1]], example[0].lower()))
fw.write(u"%s\t%s\n" %
(self.intent_dict[example[1]], example[0].lower()))
fw = io.open(self.map_tag_intent, 'w', encoding="utf8")
for tag in self.intent_dict:
for tag in self.intent_dict:
fw.write(u"%s\t%s\n" % (tag, self.intent_dict[tag]))
def _parser_slot_data(self, examples, data_type):
def _parser_slot_data(self, examples, data_type):
"""
parser slot dataset
"""
out_filename = "%s/%s.txt" % (self.out_slot_dir, data_type)
fw = io.open(out_filename, 'w', encoding="utf8")
for example in examples:
for example in examples:
tags = []
text = example[0]
entities = example[2]
if not entities:
if not entities:
tags = [str(self.slot_dict['O'])] * len(text.strip().split())
continue
for i in range(len(entities)):
for i in range(len(entities)):
enty = entities[i]
start = enty['start']
value_num = len(enty['value'].split())
tags_slot = []
for j in range(value_num):
if j == 0:
for j in range(value_num):
if j == 0:
bround_tag = "B"
else:
else:
bround_tag = "I"
tag = "%s-%s" % (bround_tag, enty['entity'])
if tag not in self.slot_dict:
if tag not in self.slot_dict:
self.slot_dict[tag] = self.slot_id
self.slot_id += 1
tags_slot.append(str(self.slot_dict[tag]))
if i == 0:
if start not in [0, 1]:
prefix_num = len(text[: start].strip().split())
if i == 0:
if start not in [0, 1]:
prefix_num = len(text[:start].strip().split())
tags.extend([str(self.slot_dict['O'])] * prefix_num)
tags.extend(tags_slot)
else:
prefix_num = len(text[entities[i - 1]['end']: start].strip().split())
else:
prefix_num = len(text[entities[i - 1]['end']:start].strip()
.split())
tags.extend([str(self.slot_dict['O'])] * prefix_num)
tags.extend(tags_slot)
if entities[-1]['end'] < len(text):
if entities[-1]['end'] < len(text):
suffix_num = len(text[entities[-1]['end']:].strip().split())
tags.extend([str(self.slot_dict['O'])] * suffix_num)
fw.write(u"%s\t%s\n" % (text.encode('utf8'), " ".join(tags).encode('utf8')))
fw.write(u"%s\t%s\n" %
(text.encode('utf8'), " ".join(tags).encode('utf8')))
fw = io.open(self.map_tag_slot, 'w', encoding="utf8")
for slot in self.slot_dict:
for slot in self.slot_dict:
fw.write(u"%s\t%s\n" % (slot, self.slot_dict[slot]))
def get_train_dataset(self):
def get_train_dataset(self):
"""
parser train dataset and print train.txt
"""
......@@ -133,7 +136,7 @@ class ATIS(object):
self._parser_intent_data(train_examples, "train")
self._parser_slot_data(train_examples, "train")
def get_test_dataset(self):
def get_test_dataset(self):
"""
parser test dataset and print test.txt
"""
......@@ -141,7 +144,7 @@ class ATIS(object):
self._parser_intent_data(test_examples, "test")
self._parser_slot_data(test_examples, "test")
def main(self):
def main(self):
"""
run data process
"""
......@@ -149,10 +152,6 @@ class ATIS(object):
self.get_test_dataset()
if __name__ == "__main__":
if __name__ == "__main__":
atis_inst = ATIS()
atis_inst.main()
......@@ -24,11 +24,12 @@ import re
import commonlib
class DSTC2(object):
class DSTC2(object):
"""
dialogue state tracking dstc2 data process
"""
def __init__(self):
def __init__(self):
"""
init instance
"""
......@@ -42,16 +43,17 @@ class DSTC2(object):
self._load_file()
self._load_ontology()
def _load_file(self):
def _load_file(self):
"""
load dataset filename
"""
self.data_dict = commonlib.load_dict(self.data_list)
for data_type in self.data_dict:
for i in range(len(self.data_dict[data_type])):
self.data_dict[data_type][i] = os.path.join(self.src_dir, self.data_dict[data_type][i])
for data_type in self.data_dict:
for i in range(len(self.data_dict[data_type])):
self.data_dict[data_type][i] = os.path.join(
self.src_dir, self.data_dict[data_type][i])
def _load_ontology(self):
def _load_ontology(self):
"""
load ontology tag
"""
......@@ -60,8 +62,8 @@ class DSTC2(object):
fr = io.open(self.onto_json, 'r', encoding="utf8")
ontology = json.load(fr)
slots_values = ontology['informable']
for slot in slots_values:
for value in slots_values[slot]:
for slot in slots_values:
for value in slots_values[slot]:
key = "%s_%s" % (slot, value)
self.map_tag_dict[key] = tag_id
tag_id += 1
......@@ -69,22 +71,22 @@ class DSTC2(object):
self.map_tag_dict[key] = tag_id
tag_id += 1
def _parser_dataset(self, data_type):
def _parser_dataset(self, data_type):
"""
parser train dev test dataset
"""
stat = os.path.exists(self.out_dir)
if not stat:
if not stat:
os.makedirs(self.out_dir)
asr_stat = os.path.exists(self.out_asr_dir)
if not asr_stat:
if not asr_stat:
os.makedirs(self.out_asr_dir)
out_file = os.path.join(self.out_dir, "%s.txt" % data_type)
out_asr_file = os.path.join(self.out_asr_dir, "%s.txt" % data_type)
fw = io.open(out_file, 'w', encoding="utf8")
fw_asr = io.open(out_asr_file, 'w', encoding="utf8")
data_list = self.data_dict.get(data_type)
for fn in data_list:
for fn in data_list:
log_file = os.path.join(fn, "log.json")
label_file = os.path.join(fn, "label.json")
f_log = io.open(log_file, 'r', encoding="utf8")
......@@ -93,49 +95,59 @@ class DSTC2(object):
label_json = json.load(f_label)
session_id = log_json['session-id']
assert len(label_json["turns"]) == len(log_json["turns"])
for i in range(len(label_json["turns"])):
for i in range(len(label_json["turns"])):
log_turn = log_json["turns"][i]
label_turn = label_json["turns"][i]
assert log_turn["turn-index"] == label_turn["turn-index"]
labels = ["%s_%s" % (slot, label_turn["goal-labels"][slot]) for slot in label_turn["goal-labels"]]
labels_ids = " ".join([str(self.map_tag_dict.get(label, self.map_tag_dict["%s_none" % label.split('_')[0]])) for label in labels])
labels = [
"%s_%s" % (slot, label_turn["goal-labels"][slot])
for slot in label_turn["goal-labels"]
]
labels_ids = " ".join([
str(
self.map_tag_dict.get(label, self.map_tag_dict[
"%s_none" % label.split('_')[0]]))
for label in labels
])
mach = log_turn['output']['transcript']
user = label_turn['transcription']
if not labels_ids.strip():
if not labels_ids.strip():
labels_ids = self.map_tag_dict['none']
out = "%s\t%s\1%s\t%s" % (session_id, mach, user, labels_ids)
user_asr = log_turn['input']['live']['asr-hyps'][0]['asr-hyp'].strip()
out_asr = "%s\t%s\1%s\t%s" % (session_id, mach, user_asr, labels_ids)
user_asr = log_turn['input']['live']['asr-hyps'][0][
'asr-hyp'].strip()
out_asr = "%s\t%s\1%s\t%s" % (session_id, mach, user_asr,
labels_ids)
fw.write(u"%s\n" % out.encode('utf8'))
fw_asr.write(u"%s\n" % out_asr.encode('utf8'))
def get_train_dataset(self):
def get_train_dataset(self):
"""
parser train dataset and print train.txt
"""
self._parser_dataset("train")
def get_dev_dataset(self):
def get_dev_dataset(self):
"""
parser dev dataset and print dev.txt
"""
self._parser_dataset("dev")
def get_test_dataset(self):
def get_test_dataset(self):
"""
parser test dataset and print test.txt
"""
self._parser_dataset("test")
def get_labels(self):
def get_labels(self):
"""
get tag and map ids file
"""
fw = io.open(self.map_tag, 'w', encoding="utf8")
for elem in self.map_tag_dict:
for elem in self.map_tag_dict:
fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem]))
def main(self):
def main(self):
"""
run data process
"""
......@@ -144,10 +156,7 @@ class DSTC2(object):
self.get_test_dataset()
self.get_labels()
if __name__ == "__main__":
if __name__ == "__main__":
dstc_inst = DSTC2()
dstc_inst.main()
......@@ -23,11 +23,12 @@ import re
import commonlib
class MRDA(object):
class MRDA(object):
"""
dialogue act dataset mrda data process
"""
def __init__(self):
def __init__(self):
"""
init instance
"""
......@@ -41,7 +42,7 @@ class MRDA(object):
self._load_file()
self.tag_dict = commonlib.load_voc(self.voc_map_tag)
def _load_file(self):
def _load_file(self):
"""
load dataset filename
"""
......@@ -49,30 +50,30 @@ class MRDA(object):
self.trans_dict = {}
self.data_dict = commonlib.load_dict(self.data_list)
file_list, file_path = commonlib.get_file_list(self.src_dir)
for i in range(len(file_list)):
for i in range(len(file_list)):
name = file_list[i]
keyword = name.split('.')[0]
if 'dadb' in name:
if 'dadb' in name:
self.dadb_dict[keyword] = file_path[i]
if 'trans' in name:
if 'trans' in name:
self.trans_dict[keyword] = file_path[i]
def load_dadb(self, data_type):
def load_dadb(self, data_type):
"""
load dadb dataset
"""
dadb_dict = {}
conv_id_list = []
dadb_list = self.data_dict[data_type]
for dadb_key in dadb_list:
for dadb_key in dadb_list:
dadb_file = self.dadb_dict[dadb_key]
fr = io.open(dadb_file, 'r', encoding="utf8")
row = csv.reader(fr, delimiter = ',')
for line in row:
row = csv.reader(fr, delimiter=',')
for line in row:
elems = line
conv_id = elems[2]
conv_id_list.append(conv_id)
if len(elems) != 14:
if len(elems) != 14:
continue
error_code = elems[3]
da_tag = elems[-9]
......@@ -80,17 +81,17 @@ class MRDA(object):
dadb_dict[conv_id] = (error_code, da_ori_tag, da_tag)
return dadb_dict, conv_id_list
def load_trans(self, data_type):
def load_trans(self, data_type):
"""load trans data"""
trans_dict = {}
trans_list = self.data_dict[data_type]
for trans_key in trans_list:
for trans_key in trans_list:
trans_file = self.trans_dict[trans_key]
fr = io.open(trans_file, 'r', encoding="utf8")
row = csv.reader(fr, delimiter = ',')
for line in row:
row = csv.reader(fr, delimiter=',')
for line in row:
elems = line
if len(elems) != 3:
if len(elems) != 3:
continue
conv_id = elems[0]
text = elems[1]
......@@ -98,7 +99,7 @@ class MRDA(object):
trans_dict[conv_id] = (text, text_process)
return trans_dict
def _parser_dataset(self, data_type):
def _parser_dataset(self, data_type):
"""
parser train dev test dataset
"""
......@@ -106,50 +107,51 @@ class MRDA(object):
dadb_dict, conv_id_list = self.load_dadb(data_type)
trans_dict = self.load_trans(data_type)
fw = io.open(out_filename, 'w', encoding="utf8")
for elem in conv_id_list:
for elem in conv_id_list:
v_dadb = dadb_dict[elem]
v_trans = trans_dict[elem]
da_tag = v_dadb[2]
if da_tag not in self.tag_dict:
if da_tag not in self.tag_dict:
continue
tag = self.tag_dict[da_tag]
if tag == "Z":
if tag == "Z":
continue
if tag not in self.map_tag_dict:
if tag not in self.map_tag_dict:
self.map_tag_dict[tag] = self.tag_id
self.tag_id += 1
caller = elem.split('_')[0].split('-')[-1]
conv_no = elem.split('_')[0].split('-')[0]
out = "%s\t%s\t%s\t%s" % (conv_no, self.map_tag_dict[tag], caller, v_trans[0])
out = "%s\t%s\t%s\t%s" % (conv_no, self.map_tag_dict[tag], caller,
v_trans[0])
fw.write(u"%s\n" % out)
def get_train_dataset(self):
def get_train_dataset(self):
"""
parser train dataset and print train.txt
"""
self._parser_dataset("train")
def get_dev_dataset(self):
def get_dev_dataset(self):
"""
parser dev dataset and print dev.txt
"""
self._parser_dataset("dev")
def get_test_dataset(self):
def get_test_dataset(self):
"""
parser test dataset and print test.txt
"""
self._parser_dataset("test")
def get_labels(self):
def get_labels(self):
"""
get tag and map ids file
"""
fw = io.open(self.map_tag, 'w', encoding="utf8")
for elem in self.map_tag_dict:
for elem in self.map_tag_dict:
fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem]))
def main(self):
def main(self):
"""
run data process
"""
......@@ -158,10 +160,7 @@ class MRDA(object):
self.get_test_dataset()
self.get_labels()
if __name__ == "__main__":
if __name__ == "__main__":
mrda_inst = MRDA()
mrda_inst.main()
......@@ -23,11 +23,12 @@ import re
import commonlib
class SWDA(object):
class SWDA(object):
"""
dialogue act dataset swda data process
"""
def __init__(self):
def __init__(self):
"""
init instance
"""
......@@ -39,94 +40,94 @@ class SWDA(object):
self.src_dir = "../../data/input/data/swda/source_data/swda"
self._load_file()
def _load_file(self):
def _load_file(self):
"""
load dataset filename
"""
self.data_dict = commonlib.load_dict(self.data_list)
self.file_dict = {}
child_dir = commonlib.get_dir_list(self.src_dir)
for chd in child_dir:
for chd in child_dir:
file_list, file_path = commonlib.get_file_list(chd)
for i in range(len(file_list)):
name = file_list[i]
for i in range(len(file_list)):
name = file_list[i]
keyword = "sw%s" % name.split('.')[0].split('_')[-1]
self.file_dict[keyword] = file_path[i]
def _parser_dataset(self, data_type):
def _parser_dataset(self, data_type):
"""
parser train dev test dataset
"""
out_filename = "%s/%s.txt" % (self.out_dir, data_type)
fw = io.open(out_filename, 'w', encoding='utf8')
for name in self.data_dict[data_type]:
for name in self.data_dict[data_type]:
file_path = self.file_dict[name]
fr = io.open(file_path, 'r', encoding="utf8")
idx = 0
row = csv.reader(fr, delimiter = ',')
for r in row:
if idx == 0:
row = csv.reader(fr, delimiter=',')
for r in row:
if idx == 0:
idx += 1
continue
out = self._parser_utterence(r)
fw.write(u"%s\n" % out)
def _clean_text(self, text):
def _clean_text(self, text):
"""
text cleaning for dialogue act dataset
"""
if text.startswith('<') and text.endswith('>.'):
if text.startswith('<') and text.endswith('>.'):
return text
if "[" in text or "]" in text:
stat = True
else:
else:
stat = False
group = re.findall("\[.*?\+.*?\]", text)
while group and stat:
for elem in group:
while group and stat:
for elem in group:
elem_src = elem
elem = re.sub('\+', '', elem.lstrip('[').rstrip(']'))
text = text.replace(elem_src, elem)
if "[" in text or "]" in text:
if "[" in text or "]" in text:
stat = True
else:
else:
stat = False
group = re.findall("\[.*?\+.*?\]", text)
if "{" in text or "}" in text:
if "{" in text or "}" in text:
stat = True
else:
else:
stat = False
group = re.findall("{[A-Z].*?}", text)
while group and stat:
while group and stat:
child_group = re.findall("{[A-Z]*(.*?)}", text)
for i in range(len(group)):
for i in range(len(group)):
text = text.replace(group[i], child_group[i])
if "{" in text or "}" in text:
if "{" in text or "}" in text:
stat = True
else:
else:
stat = False
group = re.findall("{[A-Z].*?}", text)
if "(" in text or ")" in text:
if "(" in text or ")" in text:
stat = True
else:
else:
stat = False
group = re.findall("\(\(.*?\)\)", text)
while group and stat:
for elem in group:
if elem:
while group and stat:
for elem in group:
if elem:
elem_clean = re.sub("\(|\)", "", elem)
text = text.replace(elem, elem_clean)
else:
else:
text = text.replace(elem, "mumblex")
if "(" in text or ")" in text:
stat = True
else:
else:
stat = False
group = re.findall("\(\((.*?)\)\)", text)
group = re.findall("\<.*?\>", text)
if group:
for elem in group:
if group:
for elem in group:
text = text.replace(elem, "")
text = re.sub(r" \'s", "\'s", text)
......@@ -137,24 +138,24 @@ class SWDA(object):
text = re.sub("\[|\]|\+|\>|\<|\{|\}", "", text)
return text.strip().lower()
def _map_tag(self, da_tag):
def _map_tag(self, da_tag):
"""
map tag to 42 classes
"""
curr_da_tags = []
curr_das = re.split(r"\s*[,;]\s*", da_tag)
for curr_da in curr_das:
for curr_da in curr_das:
if curr_da == "qy_d" or curr_da == "qw^d" or curr_da == "b^m":
pass
elif curr_da == "nn^e":
curr_da = "ng"
elif curr_da == "ny^e":
curr_da = "na"
else:
else:
curr_da = re.sub(r'(.)\^.*', r'\1', curr_da)
curr_da = re.sub(r'[\(\)@*]', '', curr_da)
tag = curr_da
if tag in ('qr', 'qy'):
if tag in ('qr', 'qy'):
tag = 'qy'
elif tag in ('fe', 'ba'):
tag = 'ba'
......@@ -170,12 +171,12 @@ class SWDA(object):
tag = 'fo_o_fw_"_by_bc'
curr_da = tag
curr_da_tags.append(curr_da)
if curr_da_tags[0] not in self.map_tag_dict:
if curr_da_tags[0] not in self.map_tag_dict:
self.map_tag_dict[curr_da_tags[0]] = self.tag_id
self.tag_id += 1
return self.map_tag_dict[curr_da_tags[0]]
def _parser_utterence(self, line):
def _parser_utterence(self, line):
"""
parse one dialogue turn
"""
......@@ -188,34 +189,34 @@ class SWDA(object):
out = "%s\t%s\t%s\t%s" % (conversation_no, act_tag, caller, text)
return out
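# The returned record uses the same tab-separated layout:
# <conversation_no>\t<act_tag>\t<caller>\t<cleaned text>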
def get_train_dataset(self):
def get_train_dataset(self):
"""
parse the train dataset and write train.txt
"""
self._parser_dataset("train")
def get_dev_dataset(self):
def get_dev_dataset(self):
"""
parse the dev dataset and write dev.txt
"""
self._parser_dataset("dev")
def get_test_dataset(self):
def get_test_dataset(self):
"""
parse the test dataset and write test.txt
"""
self._parser_dataset("test")
def get_labels(self):
def get_labels(self):
"""
write the tag-to-id mapping file
"""
fw = io.open(self.map_tag, 'w', encoding='utf8')
for elem in self.map_tag_dict:
for elem in self.map_tag_dict:
fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem]))
def main(self):
def main(self):
"""
run data process
"""
......@@ -224,10 +225,7 @@ class SWDA(object):
self.get_test_dataset()
self.get_labels()
if __name__ == "__main__":
if __name__ == "__main__":
swda_inst = SWDA()
swda_inst.main()
......@@ -25,52 +25,49 @@ def get_file_list(dir_name):
file_list = list()
file_path = list()
for root, dirs, files in os.walk(dir_name):
for file in files:
for file in files:
file_list.append(file)
file_path.append(os.path.join(root, file))
return file_list, file_path
def get_dir_list(dir_name):
def get_dir_list(dir_name):
"""
get directory names
"""
child_dir = []
dir_list = os.listdir(dir_name)
for cur_file in dir_list:
for cur_file in dir_list:
path = os.path.join(dir_name, cur_file)
if not os.path.isdir(path):
if not os.path.isdir(path):
continue
child_dir.append(path)
return child_dir
def load_dict(conf):
def load_dict(conf):
"""
load swda dataset config
"""
conf_dict = dict()
fr = io.open(conf, 'r', encoding="utf8")
for line in fr:
for line in fr:
line = line.strip()
elems = line.split('\t')
if elems[0] not in conf_dict:
if elems[0] not in conf_dict:
conf_dict[elems[0]] = []
conf_dict[elems[0]].append(elems[1])
return conf_dict
def load_voc(conf):
def load_voc(conf):
"""
load map dict
"""
map_dict = {}
fr = io.open(conf, 'r', encoding="utf8")
for line in fr:
for line in fr:
line = line.strip()
elems = line.split('\t')
map_dict[elems[0]] = elems[1]
return map_dict
......@@ -20,29 +20,29 @@ from build_dstc2_dataset import DSTC2
from build_mrda_dataset import MRDA
from build_swda_dataset import SWDA
if __name__ == "__main__":
if __name__ == "__main__":
task_name = sys.argv[1]
task_name = task_name.lower()
if task_name not in ['swda', 'mrda', 'atis', 'dstc2', 'udc']:
if task_name not in ['swda', 'mrda', 'atis', 'dstc2', 'udc']:
print("task name error: we support [swda|mrda|atis|dstc2|udc]")
exit(1)
if task_name == 'swda':
if task_name == 'swda':
swda_inst = SWDA()
swda_inst.main()
elif task_name == 'mrda':
elif task_name == 'mrda':
mrda_inst = MRDA()
mrda_inst.main()
elif task_name == 'atis':
elif task_name == 'atis':
atis_inst = ATIS()
atis_inst.main()
shutil.copyfile("../../data/input/data/atis/atis_slot/test.txt", "../../data/input/data/atis/atis_slot/dev.txt")
shutil.copyfile("../../data/input/data/atis/atis_intent/test.txt", "../../data/input/data/atis/atis_intent/dev.txt")
elif task_name == 'dstc2':
shutil.copyfile("../../data/input/data/atis/atis_slot/test.txt",
"../../data/input/data/atis/atis_slot/dev.txt")
shutil.copyfile("../../data/input/data/atis/atis_intent/test.txt",
"../../data/input/data/atis/atis_intent/dev.txt")
elif task_name == 'dstc2':
dstc_inst = DSTC2()
dstc_inst.main()
else:
else:
exit(0)
......@@ -12,7 +12,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
......
......@@ -113,7 +113,7 @@ def multi_head_attention(queries,
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key ** -0.5)
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
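# The lines above are the core of scaled dot-product attention; with attn_bias
# acting as an additive mask, the full operation computed here is
# Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_key) + attn_bias) · V.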
......
......@@ -25,8 +25,8 @@ import numpy as np
import paddle.fluid as fluid
class InputField(object):
def __init__(self, input_field):
class InputField(object):
def __init__(self, input_field):
"""init inpit field"""
self.src_ids = input_field[0]
self.pos_ids = input_field[1]
......
......@@ -69,8 +69,8 @@ def init_from_checkpoint(args, exe, program):
def init_from_params(args, exe, program):
assert isinstance(args.init_from_params, str)
if not os.path.exists(args.init_from_params):
if not os.path.exists(args.init_from_params):
raise Warning("the params path does not exist.")
return False
......@@ -113,7 +113,7 @@ def save_param(args, exe, program, dirname):
if not os.path.exists(param_dir):
os.makedirs(param_dir)
fluid.io.save_params(
exe,
os.path.join(param_dir, dirname),
......@@ -122,5 +122,3 @@ def save_param(args, exe, program, dirname):
print("save parameters at %s" % (os.path.join(param_dir, dirname)))
return True
......@@ -23,14 +23,9 @@ from dgu.bert import BertModel
from dgu.utils.configure import JsonConfig
def create_net(
is_training,
model_input,
num_labels,
paradigm_inst,
args):
def create_net(is_training, model_input, num_labels, paradigm_inst, args):
"""create dialogue task model"""
src_ids = model_input.src_ids
pos_ids = model_input.pos_ids
sent_ids = model_input.sent_ids
......@@ -48,14 +43,15 @@ def create_net(
config=bert_conf,
use_fp16=False)
params = {'num_labels': num_labels,
'src_ids': src_ids,
'pos_ids': pos_ids,
'sent_ids': sent_ids,
'input_mask': input_mask,
'labels': labels,
'is_training': is_training}
params = {
'num_labels': num_labels,
'src_ids': src_ids,
'pos_ids': pos_ids,
'sent_ids': sent_ids,
'input_mask': input_mask,
'labels': labels,
'is_training': is_training
}
results = paradigm_inst.paradigm(bert, params)
return results
......@@ -20,17 +20,17 @@ from dgu.evaluation import evaluate
from dgu.utils.configure import PDConfig
def do_eval(args):
def do_eval(args):
task_name = args.task_name.lower()
reference = args.evaluation_file
predictions = args.output_prediction_file
evaluate(task_name, predictions, reference)
if __name__ == "__main__":
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/dgu.yaml")
args.build()
......
......@@ -29,10 +29,10 @@ import dgu.utils.save_load_io as save_load_io
import dgu.reader as reader
from dgu_net import create_net
import dgu.define_paradigm as define_paradigm
import dgu.define_paradigm as define_paradigm
def do_save_inference_model(args):
def do_save_inference_model(args):
"""save inference model function"""
task_name = args.task_name.lower()
......@@ -57,35 +57,36 @@ def do_save_inference_model(args):
with fluid.unique_name.guard():
# define inputs of the network
num_labels = len(processors[task_name].get_labels())
num_labels = len(processors[task_name].get_labels())
src_ids = fluid.data(
name='src_ids', shape=[-1, args.max_seq_len], dtype='int64')
name='src_ids', shape=[-1, args.max_seq_len], dtype='int64')
pos_ids = fluid.data(
name='pos_ids', shape=[-1, args.max_seq_len], dtype='int64')
name='pos_ids', shape=[-1, args.max_seq_len], dtype='int64')
sent_ids = fluid.data(
name='sent_ids', shape=[-1, args.max_seq_len], dtype='int64')
name='sent_ids', shape=[-1, args.max_seq_len], dtype='int64')
input_mask = fluid.data(
name='input_mask', shape=[-1, args.max_seq_len], dtype='float32')
if args.task_name == 'atis_slot':
name='input_mask',
shape=[-1, args.max_seq_len],
dtype='float32')
if args.task_name == 'atis_slot':
labels = fluid.data(
name='labels', shape=[-1, args.max_seq_len], dtype='int64')
name='labels', shape=[-1, args.max_seq_len], dtype='int64')
elif args.task_name in ['dstc2', 'dstc2_asr', 'multi-woz']:
labels = fluid.data(
name='labels', shape=[-1, num_labels], dtype='int64')
else:
labels = fluid.data(
name='labels', shape=[-1, 1], dtype='int64')
name='labels', shape=[-1, num_labels], dtype='int64')
else:
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')
input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels]
input_field = InputField(input_inst)
results = create_net(
is_training=False,
model_input=input_field,
num_labels=num_labels,
paradigm_inst=paradigm_inst,
args=args)
is_training=False,
model_input=input_field,
num_labels=num_labels,
paradigm_inst=paradigm_inst,
args=args)
probs = results.get("probs", None)
if args.use_cuda:
......@@ -97,7 +98,7 @@ def do_save_inference_model(args):
exe.run(startup_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model)
if args.init_from_params:
save_load_io.init_from_params(args, exe, test_prog)
elif args.init_from_pretrain_model:
......@@ -105,20 +106,16 @@ def do_save_inference_model(args):
# saving inference model
fluid.io.save_inference_model(
args.inference_model_dir,
feeded_var_names=[
input_field.src_ids.name,
input_field.pos_ids.name,
input_field.sent_ids.name,
input_field.input_mask.name
],
target_vars=[
probs
],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
args.inference_model_dir,
feeded_var_names=[
input_field.src_ids.name, input_field.pos_ids.name,
input_field.sent_ids.name, input_field.input_mask.name
],
target_vars=[probs],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
print("save inference model at %s" % (args.inference_model_dir))
......
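For reference, a minimal sketch of how the inference model saved above could be loaded back and queried, assuming the same Paddle Fluid 1.x APIs; the helper name, the `max_seq_len` default and the zero-filled dummy batch are hypothetical placeholders, not part of the original script.

import numpy as np
import paddle.fluid as fluid

def run_saved_inference_model(inference_model_dir, max_seq_len=128):
    """Load the saved DGU inference model and run it on a dummy batch."""
    exe = fluid.Executor(fluid.CPUPlace())
    # load_inference_model returns the pruned program plus the feed names and
    # fetch targets registered by save_inference_model above.
    infer_prog, feed_names, fetch_targets = fluid.io.load_inference_model(
        inference_model_dir,
        exe,
        model_filename="model.pdmodel",
        params_filename="params.pdparams")
    # Dummy batch of one example: int64 token ids plus a float32 input mask.
    feed = {}
    for name in feed_names:
        dtype = "float32" if "mask" in name else "int64"
        feed[name] = np.zeros((1, max_seq_len), dtype=dtype)
    probs = exe.run(infer_prog, feed=feed, fetch_list=fetch_targets)
    return probs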
......@@ -26,7 +26,6 @@ from inference_model import do_save_inference_model
from dgu.utils.configure import PDConfig
if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/dgu.yaml")
......
......@@ -28,7 +28,7 @@ import paddle.fluid as fluid
from dgu_net import create_net
import dgu.reader as reader
from dgu.optimization import optimization
import dgu.define_paradigm as define_paradigm
import dgu.define_paradigm as define_paradigm
from dgu.utils.configure import PDConfig
from dgu.utils.input_field import InputField
from dgu.utils.model_check import check_cuda
......@@ -37,7 +37,7 @@ import dgu.utils.save_load_io as save_load_io
def do_train(args):
"""train function"""
task_name = args.task_name.lower()
paradigm_inst = define_paradigm.Paradigm(task_name)
......@@ -53,34 +53,35 @@ def do_train(args):
train_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.program_guard(train_prog, startup_prog):
train_prog.random_seed = args.random_seed
startup_prog.random_seed = args.random_seed
with fluid.unique_name.guard():
with fluid.unique_name.guard():
num_labels = len(processors[task_name].get_labels())
src_ids = fluid.data(
name='src_ids', shape=[-1, args.max_seq_len], dtype='int64')
name='src_ids', shape=[-1, args.max_seq_len], dtype='int64')
pos_ids = fluid.data(
name='pos_ids', shape=[-1, args.max_seq_len], dtype='int64')
name='pos_ids', shape=[-1, args.max_seq_len], dtype='int64')
sent_ids = fluid.data(
name='sent_ids', shape=[-1, args.max_seq_len], dtype='int64')
name='sent_ids', shape=[-1, args.max_seq_len], dtype='int64')
input_mask = fluid.data(
name='input_mask', shape=[-1, args.max_seq_len], dtype='float32')
if args.task_name == 'atis_slot':
name='input_mask',
shape=[-1, args.max_seq_len],
dtype='float32')
if args.task_name == 'atis_slot':
labels = fluid.data(
name='labels', shape=[-1, args.max_seq_len], dtype='int64')
name='labels', shape=[-1, args.max_seq_len], dtype='int64')
elif args.task_name in ['dstc2']:
labels = fluid.data(
name='labels', shape=[-1, num_labels], dtype='int64')
else:
labels = fluid.data(
name='labels', shape=[-1, 1], dtype='int64')
name='labels', shape=[-1, num_labels], dtype='int64')
else:
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')
input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels]
input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(feed_list=input_inst,
capacity=4, iterable=False)
data_reader = fluid.io.PyReader(
feed_list=input_inst, capacity=4, iterable=False)
processor = processors[task_name](data_dir=args.data_dir,
vocab_path=args.vocab_path,
max_seq_len=args.max_seq_len,
......@@ -90,12 +91,12 @@ def do_train(args):
random_seed=args.random_seed)
results = create_net(
is_training=True,
model_input=input_field,
num_labels=num_labels,
paradigm_inst=paradigm_inst,
args=args)
is_training=True,
model_input=input_field,
num_labels=num_labels,
paradigm_inst=paradigm_inst,
args=args)
loss = results.get("loss", None)
probs = results.get("probs", None)
accuracy = results.get("accuracy", None)
......@@ -103,21 +104,19 @@ def do_train(args):
loss.persistable = True
probs.persistable = True
if accuracy:
if accuracy:
accuracy.persistable = True
num_seqs.persistable = True
if args.use_cuda:
if args.use_cuda:
dev_count = fluid.core.get_cuda_device_count()
else:
else:
dev_count = int(os.environ.get('CPU_NUM', 1))
batch_generator = processor.data_generator(
batch_size=args.batch_size,
phase='train',
shuffle=True)
batch_size=args.batch_size, phase='train', shuffle=True)
num_train_examples = processor.get_num_examples(phase='train')
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
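# Worked example with hypothetical numbers: when batching by tokens (in_tokens),
# epoch=2, 50000 examples, batch_size=4096 tokens, max_seq_len=128 and
# dev_count=4 give 2*50000 // (4096 // 128) // 4 = 781 training steps.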
......@@ -147,32 +146,32 @@ def do_train(args):
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(startup_prog)
assert (args.init_from_checkpoint == "") or (
args.init_from_pretrain_model == "")
args.init_from_pretrain_model == "")
# init from some checkpoint, to resume the previous training
if args.init_from_checkpoint:
if args.init_from_checkpoint:
save_load_io.init_from_checkpoint(args, exe, train_prog)
# init from some pretrain models, to better solve the current task
if args.init_from_pretrain_model:
if args.init_from_pretrain_model:
save_load_io.init_from_pretrain_model(args, exe, train_prog)
build_strategy = fluid.compiler.BuildStrategy()
build_strategy.enable_inplace = True
compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
loss_name=loss.name, build_strategy=build_strategy)
# start training
steps = 0
time_begin = time.time()
ce_info = []
for epoch_step in range(args.epoch):
for epoch_step in range(args.epoch):
data_reader.start()
while True:
try:
......@@ -216,43 +215,38 @@ def do_train(args):
used_time = time_end - time_begin
current_time = time.strftime('%Y-%m-%d %H:%M:%S',
time.localtime(time.time()))
if accuracy is not None:
print(
"%s epoch: %d, step: %d, ave loss: %f, "
"ave acc: %f, speed: %f steps/s" %
(current_time, epoch_step, steps,
np.mean(np_loss),
np.mean(np_acc),
args.print_steps / used_time))
if accuracy is not None:
print("%s epoch: %d, step: %d, ave loss: %f, "
"ave acc: %f, speed: %f steps/s" %
(current_time, epoch_step, steps,
np.mean(np_loss), np.mean(np_acc),
args.print_steps / used_time))
ce_info.append([
np.mean(np_loss),
np.mean(np_acc),
np.mean(np_loss), np.mean(np_acc),
args.print_steps / used_time
])
else:
print(
"%s epoch: %d, step: %d, ave loss: %f, "
"speed: %f steps/s" %
(current_time, epoch_step, steps,
np.mean(np_loss),
args.print_steps / used_time))
ce_info.append([
np.mean(np_loss),
args.print_steps / used_time
])
print("%s epoch: %d, step: %d, ave loss: %f, "
"speed: %f steps/s" %
(current_time, epoch_step, steps,
np.mean(np_loss), args.print_steps / used_time))
ce_info.append(
[np.mean(np_loss), args.print_steps / used_time])
time_begin = time.time()
if steps % args.save_steps == 0:
if steps % args.save_steps == 0:
save_path = "step_" + str(steps)
if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog, save_path)
if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog,
save_path)
if args.save_param:
save_load_io.save_param(args, exe, train_prog, save_path)
except fluid.core.EOFException:
save_load_io.save_param(args, exe, train_prog,
save_path)
except fluid.core.EOFException:
data_reader.reset()
break
if args.save_checkpoint:
if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog, "step_final")
if args.save_param:
save_load_io.save_param(args, exe, train_prog, "step_final")
......@@ -264,7 +258,7 @@ def do_train(args):
if cards != '':
num = len(cards.split(","))
return num
if args.enable_ce:
card_num = get_cards()
print("test_card_num", card_num)
......@@ -283,8 +277,8 @@ def do_train(args):
print("kpis\ttrain_acc_%s_card%s\t%f" % (task_name, card_num, ce_acc))
if __name__ == '__main__':
if __name__ == '__main__':
args = PDConfig(yaml_file="./data/config/dgu.yaml")
args.build()
args.Print()
......
......@@ -19,8 +19,7 @@ from __future__ import print_function
import os
import sys
sys.path.append("../")
sys.path.append("../shared_modules/")
import paddle
import paddle.fluid as fluid
import numpy as np
......
......@@ -23,7 +23,7 @@ import os
import time
import multiprocessing
import sys
sys.path.append("../")
sys.path.append("../shared_modules/")
import paddle
import paddle.fluid as fluid
......
......@@ -24,7 +24,7 @@ import time
import argparse
import multiprocessing
import sys
sys.path.append("../")
sys.path.append("../shared_modules/")
import paddle
import paddle.fluid as fluid
......
......@@ -36,7 +36,7 @@ import sys
if sys.version[0] == '2':
reload(sys)
sys.setdefaultencoding("utf-8")
sys.path.append('../')
sys.path.append('../shared_modules/')
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
......
......@@ -26,7 +26,7 @@ from paddle.fluid.initializer import NormalInitializer
from reader import Dataset
from ernie_reader import SequenceLabelReader
sys.path.append("..")
sys.path.append("../shared_modules/")
from models.sequence_labeling import nets
from models.representation.ernie import ernie_encoder, ernie_pyreader
......@@ -35,7 +35,8 @@ def create_model(args, vocab_size, num_labels, mode='train'):
"""create lac model"""
# model's input data
words = fluid.data(name='words', shape=[None, 1], dtype='int64', lod_level=1)
words = fluid.data(
name='words', shape=[None, 1], dtype='int64', lod_level=1)
targets = fluid.data(
name='targets', shape=[None, 1], dtype='int64', lod_level=1)
......@@ -88,7 +89,8 @@ def create_pyreader(args,
return_reader=False,
mode='train'):
# init reader
device_count = len(fluid.cuda_places()) if args.use_cuda else len(fluid.cpu_places())
device_count = len(fluid.cuda_places()) if args.use_cuda else len(
fluid.cpu_places())
if model == 'lac':
pyreader = fluid.io.DataLoader.from_generator(
......@@ -107,14 +109,14 @@ def create_pyreader(args,
fluid.io.shuffle(
reader.file_reader(file_name),
buf_size=args.traindata_shuffle_buffer),
batch_size=args.batch_size/device_count),
batch_size=args.batch_size / device_count),
places=place)
else:
pyreader.set_sample_list_generator(
fluid.io.batch(
reader.file_reader(
file_name, mode=mode),
batch_size=args.batch_size/device_count),
batch_size=args.batch_size / device_count),
places=place)
elif model == 'ernie':
......
......@@ -20,7 +20,7 @@ import sys
from collections import namedtuple
import numpy as np
sys.path.append("..")
sys.path.append("../shared_modules/")
from preprocess.ernie.task_reader import BaseReader, tokenization
......
......@@ -24,7 +24,7 @@ import paddle
import utils
import reader
import creator
sys.path.append('../models/')
sys.path.append('../shared_modules/models/')
from model_check import check_cuda
from model_check import check_version
......
......@@ -10,7 +10,7 @@ import paddle.fluid as fluid
import creator
import reader
import utils
sys.path.append('../models/')
sys.path.append('../shared_modules/models/')
from model_check import check_cuda
from model_check import check_version
......
......@@ -24,7 +24,7 @@ import paddle
import utils
import reader
import creator
sys.path.append('../models/')
sys.path.append('../shared_modules/models/')
from model_check import check_cuda
from model_check import check_version
......
......@@ -34,7 +34,7 @@ import paddle.fluid as fluid
import creator
import utils
sys.path.append("..")
sys.path.append("../shared_modules/")
from models.representation.ernie import ErnieConfig
from models.model_check import check_cuda
from models.model_check import check_version
......@@ -187,8 +187,8 @@ def do_train(args):
end_time - start_time, train_pyreader.queue.size()))
if steps % args.save_steps == 0:
save_path = os.path.join(args.model_save_dir, "step_" + str(steps),
"checkpoint")
save_path = os.path.join(args.model_save_dir,
"step_" + str(steps), "checkpoint")
print("\tsaving model as %s" % (save_path))
fluid.save(train_program, save_path)
......@@ -196,9 +196,10 @@ def do_train(args):
evaluate(exe, test_program, test_pyreader, train_ret)
save_path = os.path.join(args.model_save_dir, "step_" + str(steps),
"checkpoint")
"checkpoint")
fluid.save(train_program, save_path)
def do_eval(args):
# init executor
if args.use_cuda:
......
......@@ -29,7 +29,7 @@ import reader
import utils
import creator
from eval import test_process
sys.path.append('../models/')
sys.path.append('../shared_modules/models/')
from model_check import check_cuda
from model_check import check_version
......@@ -151,8 +151,7 @@ def do_train(args):
# save checkpoints
if step % args.save_steps == 0 and step != 0:
save_path = os.path.join(args.model_save_dir,
"step_" + str(step),
"checkpoint")
"step_" + str(step), "checkpoint")
fluid.save(train_program, save_path)
step += 1
......
......@@ -14,12 +14,12 @@ DuReader是一个大规模、面向真实应用、由人类生成的中文阅读
- Answers are generated by humans
- Oriented to real application scenarios
- Richer, more fine-grained annotations
More details about the DuReader dataset can be found on the [DuReader official website](https://ai.baidu.com//broad/subordinate?dataset=dureader).
### DuReader baseline system
The DuReader baseline system uses the [PaddlePaddle](http://paddlepaddle.org) deep learning framework to implement and upgrade a classic reading comprehension model, BiDAF, on the **DuReader reading comprehension dataset**.
## [KT-Net](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/ACL2019-KTNET)
......@@ -30,7 +30,7 @@ KT-NET是百度NLP提出的具有开创性意义的语言表示与知识表示
- Accepted as a long paper at ACL 2019 ([paper link](https://www.aclweb.org/anthology/P19-1226/))
In addition, KT-NET is highly general: beyond machine reading comprehension, it also helps other language understanding tasks such as natural language inference, paraphrase identification, and semantic similarity.
## [D-NET](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET)
D-NET is a "pretrain-finetune" framework aimed at improving the **generalization ability of reading comprehension models**. Its features include:
......@@ -39,4 +39,3 @@ D-NET是一个以提升**阅读理解模型泛化能力**为目标的“预训
- A multi-task, multi-domain learning strategy in the fine-tuning stage (based on the [PALM](https://github.com/PaddlePaddle/PALM) multi-task learning framework), which effectively improves generalization across domains
Using the D-NET framework, Baidu won the EMNLP 2019 [MRQA](https://mrqa.github.io/shared) machine reading comprehension shared task by nearly two percentage points over the runner-up, ranking first on 10 of the 12 test sets.
......@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
def get_input_descs(args):
"""
Generate a dict mapping data fields to the corresponding data shapes and
......@@ -42,11 +43,12 @@ def get_input_descs(args):
# encoder.
# The actual data shape of src_slf_attn_bias is:
# [batch_size, n_head, max_src_len_in_batch, max_src_len_in_batch]
"src_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
"src_slf_attn_bias":
[(batch_size, n_head, seq_len, seq_len), "float32"],
# The actual data shape of trg_word is:
# [batch_size, max_trg_len_in_batch, 1]
"trg_word": [(batch_size, seq_len), "int64",
2], # lod_level is only used in fast decoder.
2], # lod_level is only used in fast decoder.
# The actual data shape of trg_pos is:
# [batch_size, max_trg_len_in_batch, 1]
"trg_pos": [(batch_size, seq_len), "int64"],
......@@ -54,12 +56,14 @@ def get_input_descs(args):
# subsequent words in the decoder.
# The actual data shape of trg_slf_attn_bias is:
# [batch_size, n_head, max_trg_len_in_batch, max_trg_len_in_batch]
"trg_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
"trg_slf_attn_bias":
[(batch_size, n_head, seq_len, seq_len), "float32"],
# This input is used to remove attention weights on paddings of the source
# input in the encoder-decoder attention.
# The actual data shape of trg_src_attn_bias is:
# [batch_size, n_head, max_trg_len_in_batch, max_src_len_in_batch]
"trg_src_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
"trg_src_attn_bias":
[(batch_size, n_head, seq_len, seq_len), "float32"],
# This input is used in independent decoder program for inference.
# The actual data shape of enc_output is:
# [batch_size, max_src_len_in_batch, d_model]
......@@ -80,6 +84,7 @@ def get_input_descs(args):
return input_descs
# Names of word embedding table which might be reused for weight sharing.
word_emb_param_names = (
"src_word_emb_table",
......
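Each descriptor returned by get_input_descs above is a list of [shape, dtype] with an optional lod_level as a third element. A minimal sketch of how such descriptors could be turned into network inputs, assuming Paddle Fluid 1.x (the helper name make_inputs is hypothetical):

import paddle.fluid as fluid

def make_inputs(input_descs, field_names):
    """Create fluid.data variables for the requested data fields."""
    inputs = []
    for name in field_names:
        desc = input_descs[name]
        lod_level = desc[2] if len(desc) > 2 else 0
        inputs.append(
            fluid.data(
                name=name,
                shape=list(desc[0]),
                dtype=desc[1],
                lod_level=lod_level))
    return inputs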
......@@ -87,13 +87,14 @@ def do_save_inference_model(args):
# saving inference model
fluid.io.save_inference_model(args.inference_model_dir,
feeded_var_names=list(input_field_names),
target_vars=[out_ids, out_scores],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
fluid.io.save_inference_model(
args.inference_model_dir,
feeded_var_names=list(input_field_names),
target_vars=[out_ids, out_scores],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
print("save inference model at %s" % (args.inference_model_dir))
......
......@@ -25,7 +25,6 @@ from train import do_train
from predict import do_predict
from inference_model import do_save_inference_model
if __name__ == "__main__":
LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s"
logging.basicConfig(
......@@ -43,4 +42,4 @@ if __name__ == "__main__":
do_predict(args)
if args.do_save_inference_model:
do_save_inference_model(args)
\ No newline at end of file
do_save_inference_model(args)