Unverified commit f492ae4f, authored by pkpk, committed by GitHub

refactor the PaddleNLP (#4351)

* Update README.md (#4267)

* test=develop (#4269)

* 3d use new api (#4275)

* PointNet++ and PointRCNN use new API

* Update Readme of Dygraph BERT (#4277)

Fix some typos.

* Update run_classifier_multi_gpu.sh (#4279)

remove the CUDA_VISIBLE_DEVICES

* Update README.md (#4280)

* 17 update api (#4294)

* update 1.7 save/load & fluid.data

* update datafeed to dataloader (see the sketch after this message)

* Update resnet_acnet.py (#4297)

The bias attr of the square conv should be "False" rather than None in training mode.

* test=develop

* test=develop

* test=develop

* test=develop

* test
Co-authored-by: Kaipeng Deng <dengkaipeng@baidu.com>
Co-authored-by: zhang wenhui <frankwhzhang@126.com>
Co-authored-by: parap1uie-s <parap1uie-s@users.noreply.github.com>
Parent 8dc42c73
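The "17 update api" items above move the ADE and DGU code from the old `fluid.layers.data`/DataFeeder pipeline to typed `fluid.data` inputs fed through a `PyReader`. A minimal, self-contained sketch of that 1.7-style pipeline (toy model and data, illustrative names; not code from this commit):

```python
import numpy as np
import paddle.fluid as fluid

# Typed input declaration (replaces fluid.layers.data in 1.7)
ids = fluid.data(name='ids', shape=[-1, 1], dtype='int64')
label = fluid.data(name='label', shape=[-1, 1], dtype='int64')
loss = fluid.layers.reduce_mean(
    fluid.layers.cast(ids + label, dtype='float32'))

# PyReader replaces the old DataFeeder-style feeding
loader = fluid.io.PyReader(feed_list=[ids, label], capacity=4, iterable=False)

def batch_generator():
    for _ in range(4):  # four toy batches of size 8
        yield [np.random.randint(0, 100, (8, 1)).astype('int64'),
               np.random.randint(0, 2, (8, 1)).astype('int64')]

loader.decorate_batch_generator(batch_generator)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
loader.start()
while True:
    try:
        exe.run(fetch_list=[loss.name])
    except fluid.core.EOFException:  # reader drained: reset and stop
        loader.reset()
        break
```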
@@ -13,6 +13,3 @@
 [submodule "PaddleSpeech/DeepSpeech"]
 	path = PaddleSpeech/DeepSpeech
 	url = https://github.com/PaddlePaddle/DeepSpeech.git
-[submodule "PaddleNLP/PALM"]
-	path = PaddleNLP/PALM
-	url = https://github.com/PaddlePaddle/PALM
-Subproject commit 5426f75073cf5bd416622dbe71b146d3dc8fffb6
+Subproject commit 30b892e3c029bff706337f269e6c158b0a223f60
@@ -10,7 +10,7 @@
 - **Rich and comprehensive NLP task support:**
-  - PaddleNLP offers multi-granularity, multi-scenario application support, covering basic NLP techniques such as [word segmentation](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis), [part-of-speech tagging](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis), and [named entity recognition](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis), as well as core NLP techniques such as [text classification](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/sentiment_classification), [text similarity](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/similarity_net), [semantic representation](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleLARK), and [text generation](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleTextGEN). PaddleNLP also provides the core techniques, tool components, models, and pretrained parameters for common large-scale NLP application systems, such as [reading comprehension](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMRC), [dialogue systems](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleDialogue), and [machine translation](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMT), so nothing stands in your way in NLP.
+  - PaddleNLP offers multi-granularity, multi-scenario application support, covering basic NLP techniques such as [word segmentation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), [part-of-speech tagging](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), and [named entity recognition](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis), as well as core NLP techniques such as [text classification](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/sentiment_classification), [text similarity](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/similarity_net), [semantic representation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/pretrain_langauge_models), and [text generation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/seq2seq). PaddleNLP also provides the core techniques, tool components, models, and pretrained parameters for common large-scale NLP application systems, such as [reading comprehension](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_reading_comprehension), [dialogue systems](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/dialogue_system), and [machine translation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_translation), so nothing stands in your way in NLP.
 - **Stable, reliable NLP models and strong pretrained parameters:**
@@ -50,17 +50,17 @@ cd models/PaddleNLP/sentiment_classification
 | Task | Project/Directory | Description |
 | :--: | :--: | :--: |
-| **Chinese word segmentation**, **POS tagging**, **NER** :fire: | [LAC](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis) | LAC (Lexical Analysis of Chinese) is a Chinese-processing tool widely used inside Baidu, covering common Chinese tasks such as word segmentation, POS tagging, and named entity recognition. |
+| **Chinese word segmentation**, **POS tagging**, **NER** :fire: | [LAC](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/lexical_analysis) | LAC (Lexical Analysis of Chinese) is a Chinese-processing tool widely used inside Baidu, covering common Chinese tasks such as word segmentation, POS tagging, and named entity recognition. |
-| **Word embeddings (word2vec)** | [word2vec](https://github.com/PaddlePaddle/models/tree/develop/PaddleRec/word2vec) | Distributed training of Chinese word embeddings (single-machine multi-GPU, multi-machine); supports mainstream embedding models (skip-gram, CBOW, etc.) and makes it easy to train on custom data. |
+| **Word embeddings (word2vec)** | [word2vec](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/word2vec) | Distributed training of Chinese word embeddings (single-machine multi-GPU, multi-machine); supports mainstream embedding models (skip-gram, CBOW, etc.) and makes it easy to train on custom data. |
-| **Language model** | [Language_model](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/language_model) | A classic RNN-based neural language model. |
+| **Language model** | [Language_model](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/language_model) | A classic RNN-based neural language model. |
-| **Sentiment classification** :fire: | [Senta](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/sentiment_classification), [EmotionDetection](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/emotion_detection) | Senta (Sentiment Classification) and EmotionDetection provide sentiment analysis models for *general-purpose* scenarios and for *human-machine dialogue* scenarios, respectively. |
+| **Sentiment classification** :fire: | [Senta](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/sentiment_classification), [EmotionDetection](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/emotion_detection) | Senta (Sentiment Classification) and EmotionDetection provide sentiment analysis models for *general-purpose* scenarios and for *human-machine dialogue* scenarios, respectively. |
-| **Text similarity** :fire: | [SimNet](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/similarity_net) | SimNet (Similarity Net) provides efficient, reliable text-similarity tools and pretrained models. |
+| **Text similarity** :fire: | [SimNet](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/similarity_net) | SimNet (Similarity Net) provides efficient, reliable text-similarity tools and pretrained models. |
-| **Semantic representation** :fire: | [PaddleLARK](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleLARK) | PaddleLARK (Paddle LAngauge Representation Toolkit) integrates popular Chinese and English pretrained models such as ELMo, BERT, ERNIE 1.0, ERNIE 2.0, and XLNet. |
+| **Semantic representation** :fire: | [pretrain_langauge_models](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/pretrain_langauge_models) | Integrates popular Chinese and English pretrained models such as ELMo, BERT, ERNIE 1.0, ERNIE 2.0, and XLNet. |
-| **Text generation** | [PaddleTextGEN](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleTextGEN) | Paddle Text Generation provides a series of classic text-generation models: vanilla seq2seq, seq2seq with attention, variational seq2seq, and more. |
+| **Text generation** | [seq2seq](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/PaddleTextGEN) | seq2seq provides a series of classic text-generation models: vanilla seq2seq, seq2seq with attention, variational seq2seq, and more. |
-| **Reading comprehension** | [PaddleMRC](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMRC) | PaddleMRC (Paddle Machine Reading Comprehension) collects Baidu's models, tools, and open datasets for reading comprehension, including DuReader (Baidu's open large-scale Chinese MRC dataset built from real search behavior), KT-Net (a knowledge-enhanced MRC model that formerly ranked first on SQuAD and ReCoRD), and D-Net (a pretrain/fine-tune framework that took first place in the EMNLP 2019 MRQA shared task). |
+| **Reading comprehension** | [machine_reading_comprehension](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_reading_comprehension) | Paddle Machine Reading Comprehension collects Baidu's models, tools, and open datasets for reading comprehension, including DuReader (Baidu's open large-scale Chinese MRC dataset built from real search behavior), KT-Net (a knowledge-enhanced MRC model that formerly ranked first on SQuAD and ReCoRD), and D-Net (a pretrain/fine-tune framework that took first place in the EMNLP 2019 MRQA shared task). |
-| **Dialogue systems** | [PaddleDialogue](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleDialogue) | Includes: 1) DGU (Dialogue General Understanding), covering common dialogue tasks such as context-response matching for **retrieval-based chat** and the **intent detection**, **slot filling**, and **dialogue state tracking** of **task-oriented dialogue**, with best results on 6 public datasets.<br/> 2) knowledge-driven dialogue: Baidu's open knowledge-grounded open-domain dialogue dataset, published at ACL 2019.<br/>3) ADEM (Auto Dialogue Evaluation Model): automatically scores the response quality of different dialogue-generation models. |
+| **Dialogue systems** | [dialogue_system](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/dialogue_system) | Includes: 1) DGU (Dialogue General Understanding), covering common dialogue tasks such as context-response matching for **retrieval-based chat** and the **intent detection**, **slot filling**, and **dialogue state tracking** of **task-oriented dialogue**, with best results on 6 public datasets.<br/> 2) knowledge-driven dialogue: Baidu's open knowledge-grounded open-domain dialogue dataset, published at ACL 2019.<br/>3) ADEM (Auto Dialogue Evaluation Model): automatically scores the response quality of different dialogue-generation models. |
-| **Machine translation** | [PaddleMT](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMT) | Paddle Machine Translation, a classic Transformer-based machine translation model. |
+| **Machine translation** | [machine_translation](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/machine_translation) | Paddle Machine Translation, a classic Transformer-based machine translation model. |
-| **Other frontier work** | [Research](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research) | Baidu's latest research, open-sourced. |
+| **Other frontier work** | [Research](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/Research) | Baidu's latest research, open-sourced. |
@@ -70,13 +70,13 @@ cd models/PaddleNLP/sentiment_classification
```text
.
├── Research                        # collection of Baidu NLP research work
-├── PaddleMT                        # machine translation code, data, and pretrained models
+├── machine_translation             # machine translation code, data, and pretrained models
-├── PaddleDialogue                  # dialogue system code, data, and pretrained models
+├── dialogue_system                 # dialogue system code, data, and pretrained models
-├── PaddleMRC                       # reading comprehension code, data, and pretrained models
+├── machine_reading_comprehension   # reading comprehension code, data, and pretrained models
-├── PaddleLARK                      # language representation toolkit
+├── pretrain_langauge_models        # language representation toolkit
├── language_model                  # language model
├── lexical_analysis                # LAC lexical analysis
-├── models                          # shared networks
+├── shared_modules/models           # shared networks
│   ├── __init__.py
│   ├── classification
│   ├── dialogue_model_toolkit
@@ -87,7 +87,7 @@ cd models/PaddleNLP/sentiment_classification
│   ├── representation
│   ├── sequence_labeling
│   └── transformer_encoder.py
-├── preprocess                      # shared text preprocessing tools
+├── shared_modules/preprocess       # shared text preprocessing tools
│   ├── __init__.py
│   ├── ernie
│   ├── padding.py
...
@@ -468,7 +468,7 @@ python -u main.py \
    --loss_type="CLS"
```
#### On Windows:
Evaluation:
```
python -u main.py --do_eval=true --use_cuda=false --evaluation_file=data\input\data\unlabel_data\test.ids --output_prediction_file=data\output\pretrain_matching_predict --loss_type=CLS
```
...
@@ -21,14 +21,16 @@ from kpi import DurationKpi
train_loss_card1 = CostKpi('train_loss_card1', 0.03, 0, actived=True)
train_loss_card4 = CostKpi('train_loss_card4', 0.03, 0, actived=True)
-train_duration_card1 = DurationKpi('train_duration_card1', 0.01, 0, actived=True)
-train_duration_card4 = DurationKpi('train_duration_card4', 0.01, 0, actived=True)
+train_duration_card1 = DurationKpi(
+    'train_duration_card1', 0.01, 0, actived=True)
+train_duration_card4 = DurationKpi(
+    'train_duration_card4', 0.01, 0, actived=True)

tracking_kpis = [
    train_loss_card1,
    train_loss_card4,
    train_duration_card1,
    train_duration_card4,
]
...
@@ -20,48 +20,52 @@ import sys
import io
import os

-URLLIB=urllib
+URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
-    URLLIB=urllib.request
+    URLLIB = urllib.request

-DATA_MODEL_PATH = {"DATA_PATH": "https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_dataset-1.0.0.tar.gz",
-                   "TRAINED_MODEL": "https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_models.2.0.0.tar.gz"}
+DATA_MODEL_PATH = {
+    "DATA_PATH":
+    "https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_dataset-1.0.0.tar.gz",
+    "TRAINED_MODEL":
+    "https://baidu-nlp.bj.bcebos.com/auto_dialogue_evaluation_models.2.0.0.tar.gz"
+}

-PATH_MAP = {'DATA_PATH': "./data/input",
-            'TRAINED_MODEL': './data/saved_models'}
+PATH_MAP = {'DATA_PATH': "./data/input", 'TRAINED_MODEL': './data/saved_models'}


def un_tar(tar_name, dir_name):
    try:
        t = tarfile.open(tar_name)
-        t.extractall(path = dir_name)
+        t.extractall(path=dir_name)
        return True
    except Exception as e:
        print(e)
        return False


def download_model_and_data():
    print("Downloading ade data, pretrain model and trained models......")
    print("This process is quite long, please wait patiently............")
    for path in ['./data/input/data', './data/saved_models/trained_models']:
        if not os.path.exists(path):
            continue
        shutil.rmtree(path)
    for path_key in DATA_MODEL_PATH:
        filename = os.path.basename(DATA_MODEL_PATH[path_key])
-        URLLIB.urlretrieve(DATA_MODEL_PATH[path_key], os.path.join("./", filename))
+        URLLIB.urlretrieve(DATA_MODEL_PATH[path_key],
+                           os.path.join("./", filename))
        state = un_tar(filename, PATH_MAP[path_key])
        if not state:
            print("Tar %s error....." % path_key)
            return False
        os.remove(filename)
    return True


if __name__ == "__main__":
    state = download_model_and_data()
    if not state:
        exit(1)
    print("Downloading data and models sucess......")
@@ -25,8 +25,8 @@ import numpy as np
import paddle.fluid as fluid


class InputField(object):
    def __init__(self, input_field):
        """init inpit field"""
        self.context_wordseq = input_field[0]
        self.response_wordseq = input_field[1]
...
@@ -30,7 +30,7 @@ def check_cuda(use_cuda, err = \
if __name__ == "__main__":
    check_cuda(True)

    check_cuda(False)
...
@@ -69,8 +69,8 @@ def init_from_checkpoint(args, exe, program):
def init_from_params(args, exe, program):
    assert isinstance(args.init_from_params, str)

    if not os.path.exists(args.init_from_params):
        raise Warning("the params path does not exist.")
        return False
@@ -122,5 +122,3 @@ def save_param(args, exe, program, dirname):
    print("save parameters at %s" % (os.path.join(param_dir, dirname)))

    return True
@@ -21,14 +21,13 @@ import paddle
import paddle.fluid as fluid


-def create_net(
-        is_training,
-        model_input,
-        args,
-        clip_value=10.0,
-        word_emb_name="shared_word_emb",
-        lstm_W_name="shared_lstm_W",
-        lstm_bias_name="shared_lstm_bias"):
+def create_net(is_training,
+               model_input,
+               args,
+               clip_value=10.0,
+               word_emb_name="shared_word_emb",
+               lstm_W_name="shared_lstm_W",
+               lstm_bias_name="shared_lstm_bias"):

    context_wordseq = model_input.context_wordseq
    response_wordseq = model_input.response_wordseq
@@ -52,17 +51,15 @@ def create_net(
            initializer=fluid.initializer.Normal(scale=0.1)))

    #fc to fit dynamic LSTM
-    context_fc = fluid.layers.fc(
-        input=context_emb,
-        size=args.hidden_size * 4,
-        param_attr=fluid.ParamAttr(name='fc_weight'),
-        bias_attr=fluid.ParamAttr(name='fc_bias'))
+    context_fc = fluid.layers.fc(input=context_emb,
+                                 size=args.hidden_size * 4,
+                                 param_attr=fluid.ParamAttr(name='fc_weight'),
+                                 bias_attr=fluid.ParamAttr(name='fc_bias'))
-    response_fc = fluid.layers.fc(
-        input=response_emb,
-        size=args.hidden_size * 4,
-        param_attr=fluid.ParamAttr(name='fc_weight'),
-        bias_attr=fluid.ParamAttr(name='fc_bias'))
+    response_fc = fluid.layers.fc(input=response_emb,
+                                  size=args.hidden_size * 4,
+                                  param_attr=fluid.ParamAttr(name='fc_weight'),
+                                  bias_attr=fluid.ParamAttr(name='fc_bias'))

    #LSTM
    context_rep, _ = fluid.layers.dynamic_lstm(
@@ -82,7 +79,7 @@ def create_net(
    logits = fluid.layers.bilinear_tensor_product(
        context_rep, response_rep, size=1)

    if args.loss_type == 'CLS':
        label = fluid.layers.cast(x=label, dtype='float32')
        loss = fluid.layers.sigmoid_cross_entropy_with_logits(logits, label)
        loss = fluid.layers.reduce_mean(
@@ -95,10 +92,10 @@ def create_net(
        loss = fluid.layers.reduce_mean(loss)
    else:
        raise ValueError

    if is_training:
        return loss
    else:
        return logits
@@ -106,7 +103,5 @@ def set_word_embedding(word_emb, place, word_emb_name="shared_word_emb"):
    """
    Set word embedding
    """
-    word_emb_param = fluid.global_scope().find_var(
-        word_emb_name).get_tensor()
+    word_emb_param = fluid.global_scope().find_var(word_emb_name).get_tensor()
    word_emb_param.set(word_emb, place)
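For orientation, the network assembled above scores a context/response pair with a bilinear product of their pooled LSTM representations. A toy numpy sketch of that final scoring step (shapes illustrative, bias omitted; not the repo's code):

```python
import numpy as np

hidden = 4
c = np.random.randn(hidden)          # pooled context representation
r = np.random.randn(hidden)          # pooled response representation
W = np.random.randn(hidden, hidden)  # learned interaction matrix

logit = c @ W @ r                    # what bilinear_tensor_product(size=1) computes
prob = 1.0 / (1.0 + np.exp(-logit))  # sigmoid, as in the CLS loss branch
print(prob)
```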
@@ -23,13 +23,13 @@ import ade.evaluate as evaluate
from ade.utils.configure import PDConfig


def do_eval(args):
    """evaluate metrics"""
    labels = []
    fr = io.open(args.evaluation_file, 'r', encoding="utf8")
    for line in fr:
        tokens = line.strip().split('\t')
        assert len(tokens) == 3
        label = int(tokens[2])
        labels.append(label)
@@ -43,25 +43,25 @@ def do_eval(args):
        score = score.astype(np.float64)
        scores.append(score)

    if args.loss_type == 'CLS':
        recall_dict = evaluate.evaluate_Recall(list(zip(scores, labels)))
        mean_score = sum(scores) / len(scores)
        print('mean score: %.6f' % mean_score)
        print('evaluation recall result:')
        print('1_in_2: %.6f\t1_in_10: %.6f\t2_in_10: %.6f\t5_in_10: %.6f' %
              (recall_dict['1_in_2'], recall_dict['1_in_10'],
               recall_dict['2_in_10'], recall_dict['5_in_10']))
    elif args.loss_type == 'L2':
        scores = [x[0] for x in scores]
        mean_score = sum(scores) / len(scores)
        cor = evaluate.evaluate_cor(scores, labels)
        print('mean score: %.6f\nevaluation cor results:%.6f' %
              (mean_score, cor))
    else:
        raise ValueError


if __name__ == "__main__":
    args = PDConfig(yaml_file="./data/config/ade.yaml")
    args.build()
...
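eval.py delegates the 1_in_2/1_in_10/2_in_10/5_in_10 numbers to evaluate.evaluate_Recall. As a rough sketch of what an R@k metric over fixed-size candidate groups computes (assuming each group contains exactly one positive, as in UDC-style evaluation; this is not the repo's implementation):

```python
import numpy as np

def recall_at_k(scores, labels, group=10, k=1):
    """Fraction of groups whose positive candidate ranks in the top k."""
    hits = 0
    groups = len(scores) // group
    for i in range(0, groups * group, group):
        g_scores = np.asarray(scores[i:i + group], dtype=np.float64)
        g_labels = np.asarray(labels[i:i + group])
        pos = int(np.where(g_labels == 1)[0][0])      # the positive's index
        rank = int((g_scores > g_scores[pos]).sum())  # candidates scored higher
        hits += rank < k
    return hits / float(groups)

print(recall_at_k([0.9, 0.1, 0.3, 0.8], [1, 0, 1, 0], group=2, k=1))  # 0.5
```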
@@ -42,22 +42,24 @@ def do_save_inference_model(args):

    with fluid.unique_name.guard():
-        context_wordseq = fluid.data(
-            name='context_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
+        context_wordseq = fluid.data(
+            name='context_wordseq',
+            shape=[-1, 1],
+            dtype='int64',
+            lod_level=1)
-        response_wordseq = fluid.data(
-            name='response_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
+        response_wordseq = fluid.data(
+            name='response_wordseq',
+            shape=[-1, 1],
+            dtype='int64',
+            lod_level=1)
-        labels = fluid.data(
-            name='labels', shape=[-1, 1], dtype='int64')
+        labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')

        input_inst = [context_wordseq, response_wordseq, labels]
        input_field = InputField(input_inst)
-        data_reader = fluid.io.PyReader(feed_list=input_inst,
-                capacity=4, iterable=False)
+        data_reader = fluid.io.PyReader(
+            feed_list=input_inst, capacity=4, iterable=False)

-        logits = create_net(
-            is_training=False,
-            model_input=input_field,
-            args=args
-        )
+        logits = create_net(
+            is_training=False, model_input=input_field, args=args)

    if args.use_cuda:
        place = fluid.CUDAPlace(0)
@@ -68,7 +70,7 @@ def do_save_inference_model(args):
    exe.run(startup_prog)

    assert (args.init_from_params) or (args.init_from_pretrain_model)

    if args.init_from_params:
        save_load_io.init_from_params(args, exe, test_prog)
    elif args.init_from_pretrain_model:
@@ -76,24 +78,22 @@ def do_save_inference_model(args):

    # saving inference model
    fluid.io.save_inference_model(
        args.inference_model_dir,
        feeded_var_names=[
            input_field.context_wordseq.name,
            input_field.response_wordseq.name,
        ],
-        target_vars=[
-            logits,
-        ],
-        executor=exe,
-        main_program=test_prog,
-        model_filename="model.pdmodel",
-        params_filename="params.pdparams")
+        target_vars=[logits, ],
+        executor=exe,
+        main_program=test_prog,
+        model_filename="model.pdmodel",
+        params_filename="params.pdparams")

    print("save inference model at %s" % (args.inference_model_dir))


if __name__ == "__main__":
    args = PDConfig(yaml_file="./data/config/ade.yaml")
    args.build()

    check_cuda(args.use_cuda)
...
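The exported directory can be consumed with fluid.io.load_inference_model. A hedged sketch (the directory path and the toy word-id sequences are made up; the two inputs are LoD tensors because the net declares them with lod_level=1):

```python
import numpy as np
import paddle.fluid as fluid

place = fluid.CPUPlace()
exe = fluid.Executor(place)

# File names match the model_filename/params_filename used above
infer_prog, feed_names, fetch_targets = fluid.io.load_inference_model(
    './inference_model', exe,
    model_filename='model.pdmodel', params_filename='params.pdparams')

# One context of 3 word ids and one response of 2 word ids, as LoD tensors
ctx = fluid.create_lod_tensor(
    np.array([[1], [2], [3]], dtype='int64'), [[3]], place)
resp = fluid.create_lod_tensor(
    np.array([[4], [5]], dtype='int64'), [[2]], place)

logits = exe.run(infer_prog,
                 feed=dict(zip(feed_names, [ctx, resp])),
                 fetch_list=fetch_targets)
print(logits)
```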
@@ -26,7 +26,6 @@ from inference_model import do_save_inference_model
from ade.utils.configure import PDConfig


if __name__ == "__main__":

    args = PDConfig(yaml_file="./data/config/ade.yaml")
...
@@ -32,7 +32,7 @@ from ade.utils.model_check import check_cuda
import ade.utils.save_load_io as save_load_io


def do_predict(args):
    """
    predict function
    """
@@ -46,30 +46,32 @@ def do_predict(args):

    with fluid.unique_name.guard():
-        context_wordseq = fluid.data(
-            name='context_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
+        context_wordseq = fluid.data(
+            name='context_wordseq',
+            shape=[-1, 1],
+            dtype='int64',
+            lod_level=1)
-        response_wordseq = fluid.data(
-            name='response_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
+        response_wordseq = fluid.data(
+            name='response_wordseq',
+            shape=[-1, 1],
+            dtype='int64',
+            lod_level=1)
-        labels = fluid.data(
-            name='labels', shape=[-1, 1], dtype='int64')
+        labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')

        input_inst = [context_wordseq, response_wordseq, labels]
        input_field = InputField(input_inst)
-        data_reader = fluid.io.PyReader(feed_list=input_inst,
-                capacity=4, iterable=False)
+        data_reader = fluid.io.PyReader(
+            feed_list=input_inst, capacity=4, iterable=False)

-        logits = create_net(
-            is_training=False,
-            model_input=input_field,
-            args=args
-        )
+        logits = create_net(
+            is_training=False, model_input=input_field, args=args)
        logits.persistable = True

        fetch_list = [logits.name]
    #for_test is True if change the is_test attribute of operators to True
    test_prog = test_prog.clone(for_test=True)

    if args.use_cuda:
        place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
    else:
        place = fluid.CPUPlace()

    exe = fluid.Executor(place)
@@ -85,42 +87,39 @@ def do_predict(args):
    processor = reader.DataProcessor(
        data_path=args.predict_file,
        max_seq_length=args.max_seq_len,
        batch_size=args.batch_size)

-    batch_generator = processor.data_generator(
-        place=place,
-        phase="test",
-        shuffle=False,
-        sample_pro=1)
+    batch_generator = processor.data_generator(
+        place=place, phase="test", shuffle=False, sample_pro=1)

    num_test_examples = processor.get_num_examples(phase='test')

    data_reader.decorate_batch_generator(batch_generator)
    data_reader.start()

    scores = []
    while True:
        try:
            results = exe.run(compiled_test_prog, fetch_list=fetch_list)
            scores.extend(results[0])
        except fluid.core.EOFException:
            data_reader.reset()
            break

-    scores = scores[: num_test_examples]
+    scores = scores[:num_test_examples]

    print("Write the predicted results into the output_prediction_file")
    fw = io.open(args.output_prediction_file, 'w', encoding="utf8")
    for index, score in enumerate(scores):
        fw.write("%s\t%s\n" % (index, score))
    print("finish........................................")


if __name__ == "__main__":
    args = PDConfig(yaml_file="./data/config/ade.yaml")
    args.build()
    args.Print()

    check_cuda(args.use_cuda)

    do_predict(args)
...
@@ -31,7 +31,7 @@ from ade.utils.input_field import InputField
from ade.utils.model_check import check_cuda
import ade.utils.save_load_io as save_load_io

try:
    import cPickle as pickle  #python 2
except ImportError as e:
    import pickle  #python 3
@@ -47,24 +47,26 @@ def do_train(args):
    train_prog.random_seed = args.random_seed
    startup_prog.random_seed = args.random_seed

    with fluid.unique_name.guard():
-        context_wordseq = fluid.data(
-            name='context_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
+        context_wordseq = fluid.data(
+            name='context_wordseq',
+            shape=[-1, 1],
+            dtype='int64',
+            lod_level=1)
-        response_wordseq = fluid.data(
-            name='response_wordseq', shape=[-1, 1], dtype='int64', lod_level=1)
+        response_wordseq = fluid.data(
+            name='response_wordseq',
+            shape=[-1, 1],
+            dtype='int64',
+            lod_level=1)
-        labels = fluid.data(
-            name='labels', shape=[-1, 1], dtype='int64')
+        labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')

        input_inst = [context_wordseq, response_wordseq, labels]
        input_field = InputField(input_inst)
-        data_reader = fluid.io.PyReader(feed_list=input_inst,
-                capacity=4, iterable=False)
+        data_reader = fluid.io.PyReader(
+            feed_list=input_inst, capacity=4, iterable=False)

-        loss = create_net(
-            is_training=True,
-            model_input=input_field,
-            args=args
-        )
+        loss = create_net(
+            is_training=True, model_input=input_field, args=args)
        loss.persistable = True

    # gradient clipping
    fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByValue(
@@ -74,20 +76,21 @@ def do_train(args):

    if args.use_cuda:
        dev_count = fluid.core.get_cuda_device_count()
-        place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
+        place = fluid.CUDAPlace(
+            int(os.getenv('FLAGS_selected_gpus', '0')))
    else:
        dev_count = int(os.environ.get('CPU_NUM', 1))
        place = fluid.CPUPlace()

    processor = reader.DataProcessor(
        data_path=args.training_file,
        max_seq_length=args.max_seq_len,
        batch_size=args.batch_size)

    batch_generator = processor.data_generator(
        place=place,
        phase="train",
        shuffle=True,
        sample_pro=args.sample_pro)

    num_train_examples = processor.get_num_examples(phase='train')
@@ -105,18 +108,23 @@ def do_train(args):
        args.init_from_pretrain_model == "")

    #init from some checkpoint, to resume the previous training
    if args.init_from_checkpoint:
        save_load_io.init_from_checkpoint(args, exe, train_prog)

    #init from some pretrain models, to better solve the current task
    if args.init_from_pretrain_model:
        save_load_io.init_from_pretrain_model(args, exe, train_prog)

    if args.word_emb_init:
        print("start loading word embedding init ...")
        if six.PY2:
-            word_emb = np.array(pickle.load(io.open(args.word_emb_init, 'rb'))).astype('float32')
+            word_emb = np.array(
+                pickle.load(io.open(args.word_emb_init, 'rb'))).astype(
+                    'float32')
        else:
-            word_emb = np.array(pickle.load(io.open(args.word_emb_init, 'rb'), encoding="bytes")).astype('float32')
+            word_emb = np.array(
+                pickle.load(
+                    io.open(args.word_emb_init, 'rb'),
+                    encoding="bytes")).astype('float32')
        set_word_embedding(word_emb, place)
        print("finish init word embedding ...")
@@ -124,69 +132,74 @@ def do_train(args):
    build_strategy.enable_inplace = True

    compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)

    steps = 0
    begin_time = time.time()
    time_begin = time.time()

    for epoch_step in range(args.epoch):
        data_reader.start()
        sum_loss = 0.0
        ce_loss = 0.0
        while True:
            try:
                fetch_list = [loss.name]
                outputs = exe.run(compiled_train_prog, fetch_list=fetch_list)
                np_loss = outputs
                sum_loss += np.array(np_loss).mean()
                ce_loss = np.array(np_loss).mean()

                if steps % args.print_steps == 0:
                    time_end = time.time()
                    used_time = time_end - time_begin
                    current_time = time.strftime('%Y-%m-%d %H:%M:%S',
                                                 time.localtime(time.time()))
-                    print('%s epoch: %d, step: %s, avg loss %s, speed: %f steps/s' % (current_time, epoch_step, steps, sum_loss / args.print_steps, args.print_steps / used_time))
+                    print(
+                        '%s epoch: %d, step: %s, avg loss %s, speed: %f steps/s'
+                        % (current_time, epoch_step, steps, sum_loss /
+                           args.print_steps, args.print_steps / used_time))
                    sum_loss = 0.0
                    time_begin = time.time()

                if steps % args.save_steps == 0:
                    if args.save_checkpoint:
-                        save_load_io.save_checkpoint(args, exe, train_prog, "step_" + str(steps))
+                        save_load_io.save_checkpoint(args, exe, train_prog,
+                                                     "step_" + str(steps))
                    if args.save_param:
-                        save_load_io.save_param(args, exe, train_prog, "step_" + str(steps))
+                        save_load_io.save_param(args, exe, train_prog,
+                                                "step_" + str(steps))

                steps += 1
            except fluid.core.EOFException:
                data_reader.reset()
                break

    if args.save_checkpoint:
        save_load_io.save_checkpoint(args, exe, train_prog, "step_final")
    if args.save_param:
        save_load_io.save_param(args, exe, train_prog, "step_final")

    def get_cards():
        num = 0
        cards = os.environ.get('CUDA_VISIBLE_DEVICES', '')
        if cards != '':
            num = len(cards.split(","))
        return num

    if args.enable_ce:
        card_num = get_cards()
        pass_time_cost = time.time() - begin_time
        print("test_card_num", card_num)
        print("kpis\ttrain_duration_card%s\t%s" % (card_num, pass_time_cost))
        print("kpis\ttrain_loss_card%s\t%f" % (card_num, ce_loss))


if __name__ == '__main__':
    args = PDConfig(yaml_file="./data/config/ade.yaml")
    args.build()
    args.Print()

    check_cuda(args.use_cuda)

    do_train(args)
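do_train unpickles args.word_emb_init into a float32 array and copies it into the shared embedding via set_word_embedding. A sketch of producing a compatible file (vocabulary size and dimension are placeholders; the real shape must match the model config):

```python
import pickle

import numpy as np

vocab_size, emb_dim = 50000, 256  # placeholders: must match the model config
word_emb = np.random.normal(scale=0.1, size=(vocab_size, emb_dim))

with open('word_emb.pkl', 'wb') as f:
    # protocol 2 keeps the file loadable from the Python 2 branch above
    pickle.dump(word_emb, f, protocol=2)
```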
@@ -62,7 +62,7 @@ SWDA: Switchboard Dialogue Act Corpus;
&ensp;&ensp;&ensp;&ensp;Dataset and model download:
&ensp;&ensp;&ensp;&ensp;On Linux:
```
python dgu/prepare_data_and_model.py
```
&ensp;&ensp;&ensp;&ensp;Data path: data/input/data
@@ -72,7 +72,7 @@ python dgu/prepare_data_and_model.py
&ensp;&ensp;&ensp;&ensp;On Windows:
```
python dgu\prepare_data_and_model.py
```
&ensp;&ensp;&ensp;&ensp;The downloaded datasets already include training, test, and validation sets. If you need to regenerate the training data for a given task, run:
@@ -164,19 +164,19 @@ task_type: train, predict, evaluate, inference, all; choose one of the 5 options
Training example: bash run.sh atis_intent train
```
&ensp;&ensp;&ensp;&ensp;For CPU training:
```
Set the following parameters in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp;For GPU training:
```
Set the following parameters in run.sh:
1. For single-GPU training (specify a free GPU):
    export CUDA_VISIBLE_DEVICES=0
2. For multi-GPU training (specify several free GPUs):
    export CUDA_VISIBLE_DEVICES=0,1,2,3
```
@@ -252,19 +252,19 @@ task_type: train, predict, evaluate, inference, all; choose one of the 5 options
Prediction example: bash run.sh atis_intent predict
```
&ensp;&ensp;&ensp;&ensp;For CPU prediction:
```
Set the following parameters in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp;For GPU prediction:
```
Set the following parameters in run.sh:
Single-GPU prediction is supported (specify a free GPU):
    export CUDA_VISIBLE_DEVICES=0
```
Note: for prediction with option 1, you can point init_from_params in run.sh at your own trained model; by default the code loads the officially trained model.
@@ -348,7 +348,7 @@ task_type: train, predict, evaluate, inference, all; choose one of the 5 options
Note: evaluation scores ground_truth against predict_label; the default CPU computation is sufficient.

#### &ensp;&ensp;&ensp;&ensp;Option 2: run the evaluation code directly:
```
TASK_NAME="atis_intent"  #specify the task name to predict
@@ -363,7 +363,7 @@ python -u main.py \
#### On Windows
```
python -u main.py --task_name=atis_intent --use_cuda=false --do_eval=true --evaluation_file=data\input\data\atis\atis_intent\test.txt --output_prediction_file=data\output\pred_atis_intent
```
### Model inference
@@ -378,22 +378,22 @@ task_type: train, predict, evaluate, inference, all; choose one of the 5 options
Model export example: bash run.sh atis_intent inference
```
&ensp;&ensp;&ensp;&ensp;To run the inference-model step on CPU:
```
Set the following parameters in run.sh:
1. export CUDA_VISIBLE_DEVICES=
```
&ensp;&ensp;&ensp;&ensp;To run the inference-model step on GPU:
```
Set the following parameters in run.sh:
1. Single-GPU inference (specify a free GPU):
    export CUDA_VISIBLE_DEVICES=0
```

#### &ensp;&ensp;&ensp;&ensp;Option 2: run the inference-model code directly:
```
TASK_NAME="atis_intent"  #specify the task name to predict
@@ -459,7 +459,7 @@ python -u main.py \
&ensp;&ensp;&ensp;&ensp;You can also assemble a custom model to fit your own needs, as follows:
&ensp;&ensp;&ensp;&ensp;a. Custom data
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;If you have a dataset named **task_name**, create a **task_name** folder under **data/input/data** and place the dataset there; add a custom data-processing class in **dgu/reader.py** (for example, the **udc** dataset corresponds to **UDCProcessor**); then register the mapping between **task_name** and its **processor** in **train.py** (e.g. **processors = {'udc': reader.UDCProcessor}**).
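A hypothetical sketch of that registration (class and method names here are invented for illustration; mirror the real hooks on UDCProcessor when implementing):

```python
import io
import os

from dgu import reader  # assumed import path


class TaskNameProcessor(reader.DataProcessor):
    """Invented processor for a custom `task_name` dataset."""

    def get_examples(self, data_dir, phase):
        # e.g. data/input/data/task_name/train.txt with tab-separated fields
        path = os.path.join(data_dir, 'task_name', '%s.txt' % phase)
        with io.open(path, 'r', encoding='utf8') as f:
            return [line.rstrip('\n').split('\t') for line in f]


# In train.py, register the task so --task_name=task_name resolves to it:
processors = {'udc': reader.UDCProcessor, 'task_name': TaskNameProcessor}
```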
@@ -481,7 +481,7 @@ python -u main.py \
- Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. Technical report, International Computer Science Institute, Berkeley, CA.
- Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.
- Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31.
- Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404–413.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In Interspeech, pages 2524–2528.
- Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1118–1127.
- Su Zhu and Kai Yu. 2017. Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5675–5679. IEEE.
...
@@ -20,20 +20,26 @@ from kpi import CostKpi
from kpi import DurationKpi
from kpi import AccKpi

-each_step_duration_atis_slot_card1 = DurationKpi('each_step_duration_atis_slot_card1', 0.01, 0, actived=True)
-train_loss_atis_slot_card1 = CostKpi('train_loss_atis_slot_card1', 0.08, 0, actived=True)
-train_acc_atis_slot_card1 = CostKpi('train_acc_atis_slot_card1', 0.01, 0, actived=True)
-each_step_duration_atis_slot_card4 = DurationKpi('each_step_duration_atis_slot_card4', 0.06, 0, actived=True)
-train_loss_atis_slot_card4 = CostKpi('train_loss_atis_slot_card4', 0.03, 0, actived=True)
-train_acc_atis_slot_card4 = CostKpi('train_acc_atis_slot_card4', 0.01, 0, actived=True)
+each_step_duration_atis_slot_card1 = DurationKpi(
+    'each_step_duration_atis_slot_card1', 0.01, 0, actived=True)
+train_loss_atis_slot_card1 = CostKpi(
+    'train_loss_atis_slot_card1', 0.08, 0, actived=True)
+train_acc_atis_slot_card1 = CostKpi(
+    'train_acc_atis_slot_card1', 0.01, 0, actived=True)
+each_step_duration_atis_slot_card4 = DurationKpi(
+    'each_step_duration_atis_slot_card4', 0.06, 0, actived=True)
+train_loss_atis_slot_card4 = CostKpi(
+    'train_loss_atis_slot_card4', 0.03, 0, actived=True)
+train_acc_atis_slot_card4 = CostKpi(
+    'train_acc_atis_slot_card4', 0.01, 0, actived=True)

tracking_kpis = [
    each_step_duration_atis_slot_card1,
    train_loss_atis_slot_card1,
    train_acc_atis_slot_card1,
    each_step_duration_atis_slot_card4,
    train_loss_atis_slot_card4,
    train_acc_atis_slot_card4,
]
...
@@ -75,8 +75,8 @@ def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
def prepare_batch_data(task_name,
                       insts,
                       max_len,
                       total_token_num,
                       voc_size=0,
                       pad_id=None,
@@ -98,14 +98,18 @@ def prepare_batch_data(task_name,
    # compatible with squad, whose example includes start/end positions,
    # or unique id

    if isinstance(insts[0][3], list):
        if task_name == "atis_slot":
-            labels_list = [inst[3] + [0] * (max_len - len(inst[3])) for inst in insts]
-            labels_list = [np.array(labels_list).astype("int64").reshape([-1, max_len])]
-        elif task_name == "dstc2":
+            labels_list = [
+                inst[3] + [0] * (max_len - len(inst[3])) for inst in insts
+            ]
+            labels_list = [
+                np.array(labels_list).astype("int64").reshape([-1, max_len])
+            ]
+        elif task_name == "dstc2":
            labels_list = [inst[3] for inst in insts]
            labels_list = [np.array(labels_list).astype("int64")]
    else:
        for i in range(3, len(insts[0]), 1):
            labels = [inst[i] for inst in insts]
            labels = np.array(labels).astype("int64").reshape([-1, 1])
@@ -124,28 +128,25 @@ def prepare_batch_data(task_name,
    out = batch_src_ids
    # Second step: padding
    src_id, self_input_mask = pad_batch_data(
-        out,
-        max_len,
-        pad_idx=pad_id,
-        return_input_mask=True)
+        out, max_len, pad_idx=pad_id, return_input_mask=True)
    pos_id = pad_batch_data(
        batch_pos_ids,
        max_len,
        pad_idx=pad_id,
        return_pos=False,
        return_input_mask=False)
    sent_id = pad_batch_data(
        batch_sent_ids,
        max_len,
        pad_idx=pad_id,
        return_pos=False,
        return_input_mask=False)

    if mask_id >= 0:
        return_list = [
            src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos
        ] + labels_list
    else:
        return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list

    return return_list if len(return_list) > 1 else return_list[0]
@@ -163,13 +164,13 @@ def pad_batch_data(insts,
    corresponding position data and attention bias.
    """
    return_list = []
-    max_len = max_len_in if max_len_in != -1 else max(len(inst) for inst in insts)
+    max_len = max_len_in if max_len_in != -1 else max(
+        len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and make no effect on parameter gradients.
    inst_data = np.array(
-        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts
-         ])
+        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
    return_list += [inst_data.astype("int64").reshape([-1, max_len])]
    # position data
@@ -183,10 +184,10 @@ def pad_batch_data(insts,
    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array([[1] * len(inst) + [0] *
                                    (max_len - len(inst)) for inst in insts])
        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
        return_list += [input_mask_data.astype("float32")]
    if return_max_len:
        return_list += [max_len]
...
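A worked toy example of what the padding and input-mask logic in pad_batch_data produces (values chosen arbitrarily):

```python
import numpy as np

insts, pad_idx, max_len = [[3, 5, 7], [9]], 0, 4

padded = np.array(
    [inst + [pad_idx] * (max_len - len(inst)) for inst in insts])
mask = np.expand_dims(
    np.array([[1] * len(inst) + [0] * (max_len - len(inst))
              for inst in insts]), axis=-1).astype('float32')

print(padded)      # [[3 5 7 0] [9 0 0 0]]
print(mask.shape)  # (2, 4, 1): zeros keep attention off the padding
```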
@@ -21,31 +21,34 @@ import paddle
import paddle.fluid as fluid


class DefinePredict(object):
    """
    Packaging Prediction Results
    """

    def __init__(self):
        """
        init
        """
-        self.task_map = {'udc': 'get_matching_res',
-                         'swda': 'get_cls_res',
-                         'mrda': 'get_cls_res',
-                         'atis_intent': 'get_cls_res',
-                         'atis_slot': 'get_sequence_tagging',
-                         'dstc2': 'get_multi_cls_res',
-                         'dstc2_asr': 'get_multi_cls_res',
-                         'multi-woz': 'get_multi_cls_res'}
+        self.task_map = {
+            'udc': 'get_matching_res',
+            'swda': 'get_cls_res',
+            'mrda': 'get_cls_res',
+            'atis_intent': 'get_cls_res',
+            'atis_slot': 'get_sequence_tagging',
+            'dstc2': 'get_multi_cls_res',
+            'dstc2_asr': 'get_multi_cls_res',
+            'multi-woz': 'get_multi_cls_res'
+        }

    def get_matching_res(self, probs, params=None):
        """
        get matching score
        """
        probs = list(probs)
        return probs[1]

    def get_cls_res(self, probs, params=None):
        """
        get da classify tag
        """
@@ -54,7 +57,7 @@ class DefinePredict(object):
        tag = probs.index(max_prob)
        return tag

    def get_sequence_tagging(self, probs, params=None):
        """
        get sequence tagging tag
        """
@@ -63,23 +66,19 @@ class DefinePredict(object):
        labels = [" ".join([str(l) for l in list(l_l)]) for l_l in batch_labels]
        return labels

    def get_multi_cls_res(self, probs, params=None):
        """
        get dst classify tag
        """
        labels = []
        probs = list(probs)
        for i in range(len(probs)):
            if probs[i] >= 0.5:
                labels.append(i)
        if not labels:
            max_prob = max(probs)
            label_str = str(probs.index(max_prob))
        else:
            label_str = " ".join([str(l) for l in sorted(labels)])

        return label_str
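A worked example of the multi-label branch in get_multi_cls_res: each class is thresholded at 0.5, with a fallback to the argmax when no class clears it:

```python
probs = [0.2, 0.7, 0.6]
labels = [i for i, p in enumerate(probs) if p >= 0.5]
label_str = (" ".join(str(l) for l in sorted(labels))
             if labels else str(probs.index(max(probs))))
print(label_str)  # "1 2"; for probs=[0.2, 0.4, 0.1] it would be "1"
```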
...@@ -20,51 +20,60 @@ import sys ...@@ -20,51 +20,60 @@ import sys
import io import io
import os import os
URLLIB = urllib
URLLIB=urllib if sys.version_info >= (3, 0):
if sys.version_info >= (3, 0):
import urllib.request import urllib.request
URLLIB=urllib.request URLLIB = urllib.request
DATA_MODEL_PATH = {"DATA_PATH": "https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz", DATA_MODEL_PATH = {
"PRETRAIN_MODEL": "https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz", "DATA_PATH": "https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz",
"TRAINED_MODEL": "https://baidu-nlp.bj.bcebos.com/dgu_models_2.0.0.tar.gz"} "PRETRAIN_MODEL":
"https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz",
"TRAINED_MODEL": "https://baidu-nlp.bj.bcebos.com/dgu_models_2.0.0.tar.gz"
}
PATH_MAP = {'DATA_PATH': "./data/input", PATH_MAP = {
'PRETRAIN_MODEL': './data/pretrain_model', 'DATA_PATH': "./data/input",
'TRAINED_MODEL': './data/saved_models'} 'PRETRAIN_MODEL': './data/pretrain_model',
'TRAINED_MODEL': './data/saved_models'
}
def un_tar(tar_name, dir_name): def un_tar(tar_name, dir_name):
try: try:
t = tarfile.open(tar_name) t = tarfile.open(tar_name)
t.extractall(path = dir_name) t.extractall(path=dir_name)
return True return True
except Exception as e: except Exception as e:
print(e) print(e)
return False return False
def download_model_and_data(): def download_model_and_data():
print("Downloading dgu data, pretrain model and trained models......") print("Downloading dgu data, pretrain model and trained models......")
print("This process is quite long, please wait patiently............") print("This process is quite long, please wait patiently............")
for path in [
        './data/input/data',
        './data/pretrain_model/uncased_L-12_H-768_A-12',
        './data/saved_models/trained_models'
]:
    if not os.path.exists(path):
        continue
    shutil.rmtree(path)
for path_key in DATA_MODEL_PATH: for path_key in DATA_MODEL_PATH:
filename = os.path.basename(DATA_MODEL_PATH[path_key]) filename = os.path.basename(DATA_MODEL_PATH[path_key])
URLLIB.urlretrieve(DATA_MODEL_PATH[path_key],
                   os.path.join("./", filename))
state = un_tar(filename, PATH_MAP[path_key]) state = un_tar(filename, PATH_MAP[path_key])
if not state: if not state:
print("Tar %s error....." % path_key) print("Tar %s error....." % path_key)
return False return False
os.remove(filename) os.remove(filename)
return True return True
if __name__ == "__main__": if __name__ == "__main__":
state = download_model_and_data() state = download_model_and_data()
if not state: if not state:
exit(1) exit(1)
print("Downloading data and models sucess......") print("Downloading data and models sucess......")
...@@ -6,7 +6,7 @@ scripts: directory of the data-processing scripts that convert the official public datasets into the …
python run_build_data.py udc
The generated data is written to dialogue_general_understanding/data/input/data/udc
2) To generate the training, development, and test sets required for the DA tasks:
python run_build_data.py swda
python run_build_data.py mrda
The generated data is written to dialogue_general_understanding/data/input/data/swda and dialogue_general_understanding/data/input/data/mrda, respectively
...@@ -19,6 +19,3 @@ python run_build_data.py udc
python run_build_data.py atis
The generated slot-recognition data is written to dialogue_general_understanding/data/input/data/atis/atis_slot
The generated intent-recognition data is written to dialogue_general_understanding/data/input/data/atis/atis_intent
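To sanity-check the generated files, a short peek script can help. This is a sketch under the assumption that the DA-task rows carry the four tab-separated fields written by the SWDA/MRDA builders shown later (conversation id, label id, caller, utterance); the path is the default location quoted above:

```python
import io

# Hypothetical peek at the first rows of a generated DA-task file.
path = "dialogue_general_understanding/data/input/data/swda/train.txt"
with io.open(path, encoding="utf8") as f:
    for i, line in enumerate(f):
        conv_no, label_id, caller, text = line.rstrip("\n").split("\t")
        print(conv_no, label_id, caller, text[:40])
        if i == 2:
            break
```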
...@@ -12,7 +12,6 @@ ...@@ -12,7 +12,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""build swda train dev test dataset""" """build swda train dev test dataset"""
import json import json
...@@ -23,11 +22,12 @@ import io ...@@ -23,11 +22,12 @@ import io
import re import re
class ATIS(object): class ATIS(object):
""" """
nlu dataset atis data process nlu dataset atis data process
""" """
def __init__(self):
    """
    init instance
    """
...@@ -41,91 +41,94 @@ class ATIS(object): ...@@ -41,91 +41,94 @@ class ATIS(object):
self.map_tag_slot = "../../data/input/data/atis/atis_slot/map_tag_slot_id.txt" self.map_tag_slot = "../../data/input/data/atis/atis_slot/map_tag_slot_id.txt"
self.map_tag_intent = "../../data/input/data/atis/atis_intent/map_tag_intent_id.txt" self.map_tag_intent = "../../data/input/data/atis/atis_intent/map_tag_intent_id.txt"
def _load_file(self, data_type): def _load_file(self, data_type):
""" """
load dataset filename load dataset filename
""" """
slot_stat = os.path.exists(self.out_slot_dir) slot_stat = os.path.exists(self.out_slot_dir)
if not slot_stat: if not slot_stat:
os.makedirs(self.out_slot_dir) os.makedirs(self.out_slot_dir)
intent_stat = os.path.exists(self.out_intent_dir) intent_stat = os.path.exists(self.out_intent_dir)
if not intent_stat: if not intent_stat:
os.makedirs(self.out_intent_dir) os.makedirs(self.out_intent_dir)
src_examples = [] src_examples = []
json_file = os.path.join(self.src_dir, "%s.json" % data_type) json_file = os.path.join(self.src_dir, "%s.json" % data_type)
load_f = io.open(json_file, 'r', encoding="utf8") load_f = io.open(json_file, 'r', encoding="utf8")
json_dict = json.load(load_f) json_dict = json.load(load_f)
examples = json_dict['rasa_nlu_data']['common_examples'] examples = json_dict['rasa_nlu_data']['common_examples']
for example in examples: for example in examples:
text = example.get('text') text = example.get('text')
intent = example.get('intent') intent = example.get('intent')
entities = example.get('entities') entities = example.get('entities')
src_examples.append((text, intent, entities)) src_examples.append((text, intent, entities))
return src_examples return src_examples
def _parser_intent_data(self, examples, data_type): def _parser_intent_data(self, examples, data_type):
""" """
parse intent dataset parse intent dataset
""" """
out_filename = "%s/%s.txt" % (self.out_intent_dir, data_type) out_filename = "%s/%s.txt" % (self.out_intent_dir, data_type)
fw = io.open(out_filename, 'w', encoding="utf8") fw = io.open(out_filename, 'w', encoding="utf8")
for example in examples: for example in examples:
if example[1] not in self.intent_dict: if example[1] not in self.intent_dict:
self.intent_dict[example[1]] = self.intent_id self.intent_dict[example[1]] = self.intent_id
self.intent_id += 1 self.intent_id += 1
fw.write(u"%s\t%s\n" % (self.intent_dict[example[1]], example[0].lower())) fw.write(u"%s\t%s\n" %
(self.intent_dict[example[1]], example[0].lower()))
fw = io.open(self.map_tag_intent, 'w', encoding="utf8") fw = io.open(self.map_tag_intent, 'w', encoding="utf8")
for tag in self.intent_dict: for tag in self.intent_dict:
fw.write(u"%s\t%s\n" % (tag, self.intent_dict[tag])) fw.write(u"%s\t%s\n" % (tag, self.intent_dict[tag]))
def _parser_slot_data(self, examples, data_type): def _parser_slot_data(self, examples, data_type):
""" """
parse slot dataset parse slot dataset
""" """
out_filename = "%s/%s.txt" % (self.out_slot_dir, data_type) out_filename = "%s/%s.txt" % (self.out_slot_dir, data_type)
fw = io.open(out_filename, 'w', encoding="utf8") fw = io.open(out_filename, 'w', encoding="utf8")
for example in examples: for example in examples:
tags = [] tags = []
text = example[0] text = example[0]
entities = example[2] entities = example[2]
if not entities: if not entities:
tags = [str(self.slot_dict['O'])] * len(text.strip().split()) tags = [str(self.slot_dict['O'])] * len(text.strip().split())
continue continue
for i in range(len(entities)): for i in range(len(entities)):
enty = entities[i] enty = entities[i]
start = enty['start'] start = enty['start']
value_num = len(enty['value'].split()) value_num = len(enty['value'].split())
tags_slot = [] tags_slot = []
for j in range(value_num): for j in range(value_num):
if j == 0: if j == 0:
bround_tag = "B" bround_tag = "B"
else: else:
bround_tag = "I" bround_tag = "I"
tag = "%s-%s" % (bround_tag, enty['entity']) tag = "%s-%s" % (bround_tag, enty['entity'])
if tag not in self.slot_dict: if tag not in self.slot_dict:
self.slot_dict[tag] = self.slot_id self.slot_dict[tag] = self.slot_id
self.slot_id += 1 self.slot_id += 1
tags_slot.append(str(self.slot_dict[tag])) tags_slot.append(str(self.slot_dict[tag]))
if i == 0: if i == 0:
if start not in [0, 1]: if start not in [0, 1]:
prefix_num = len(text[: start].strip().split()) prefix_num = len(text[:start].strip().split())
tags.extend([str(self.slot_dict['O'])] * prefix_num) tags.extend([str(self.slot_dict['O'])] * prefix_num)
tags.extend(tags_slot) tags.extend(tags_slot)
else: else:
prefix_num = len(text[entities[i - 1]['end']:start].strip()
                 .split())
tags.extend([str(self.slot_dict['O'])] * prefix_num) tags.extend([str(self.slot_dict['O'])] * prefix_num)
tags.extend(tags_slot) tags.extend(tags_slot)
if entities[-1]['end'] < len(text): if entities[-1]['end'] < len(text):
suffix_num = len(text[entities[-1]['end']:].strip().split()) suffix_num = len(text[entities[-1]['end']:].strip().split())
tags.extend([str(self.slot_dict['O'])] * suffix_num) tags.extend([str(self.slot_dict['O'])] * suffix_num)
fw.write(u"%s\t%s\n" % (text.encode('utf8'), " ".join(tags).encode('utf8'))) fw.write(u"%s\t%s\n" %
(text.encode('utf8'), " ".join(tags).encode('utf8')))
fw = io.open(self.map_tag_slot, 'w', encoding="utf8") fw = io.open(self.map_tag_slot, 'w', encoding="utf8")
for slot in self.slot_dict: for slot in self.slot_dict:
fw.write(u"%s\t%s\n" % (slot, self.slot_dict[slot])) fw.write(u"%s\t%s\n" % (slot, self.slot_dict[slot]))
def get_train_dataset(self): def get_train_dataset(self):
""" """
parse train dataset and write train.txt parse train dataset and write train.txt
""" """
...@@ -133,7 +136,7 @@ class ATIS(object): ...@@ -133,7 +136,7 @@ class ATIS(object):
self._parser_intent_data(train_examples, "train") self._parser_intent_data(train_examples, "train")
self._parser_slot_data(train_examples, "train") self._parser_slot_data(train_examples, "train")
def get_test_dataset(self): def get_test_dataset(self):
""" """
parse test dataset and write test.txt parse test dataset and write test.txt
""" """
...@@ -141,7 +144,7 @@ class ATIS(object): ...@@ -141,7 +144,7 @@ class ATIS(object):
self._parser_intent_data(test_examples, "test") self._parser_intent_data(test_examples, "test")
self._parser_slot_data(test_examples, "test") self._parser_slot_data(test_examples, "test")
def main(self): def main(self):
""" """
run data process run data process
""" """
...@@ -149,10 +152,6 @@ class ATIS(object): ...@@ -149,10 +152,6 @@ class ATIS(object):
self.get_test_dataset() self.get_test_dataset()
if __name__ == "__main__": if __name__ == "__main__":
atis_inst = ATIS() atis_inst = ATIS()
atis_inst.main() atis_inst.main()
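The slot-filling conversion above is easiest to see on a toy utterance. A standalone sketch of the same BIO scheme (entity offsets and names are invented for illustration, and string tags stand in for the ids kept in self.slot_dict):

```python
def bio_tags(text, entities):
    # O for every whitespace token, then overwrite each entity span with
    # B-/I- tags, mirroring the scheme in _parser_slot_data above.
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for ent in entities:
        first = len(text[:ent["start"]].split())
        for k in range(len(ent["value"].split())):
            tags[first + k] = ("B-" if k == 0 else "I-") + ent["entity"]
    return list(zip(tokens, tags))

entities = [
    {"start": 18, "end": 24, "value": "denver", "entity": "fromloc"},
    {"start": 28, "end": 34, "value": "boston", "entity": "toloc"},
]
print(bio_tags("show flights from denver to boston", entities))
# [('show', 'O'), ('flights', 'O'), ('from', 'O'),
#  ('denver', 'B-fromloc'), ('to', 'O'), ('boston', 'B-toloc')]
```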
...@@ -24,11 +24,12 @@ import re ...@@ -24,11 +24,12 @@ import re
import commonlib import commonlib
class DSTC2(object): class DSTC2(object):
""" """
dialogue state tracking dstc2 data process dialogue state tracking dstc2 data process
""" """
def __init__(self):
    """
    init instance
    """
...@@ -42,16 +43,17 @@ class DSTC2(object): ...@@ -42,16 +43,17 @@ class DSTC2(object):
self._load_file() self._load_file()
self._load_ontology() self._load_ontology()
def _load_file(self): def _load_file(self):
""" """
load dataset filename load dataset filename
""" """
self.data_dict = commonlib.load_dict(self.data_list) self.data_dict = commonlib.load_dict(self.data_list)
for data_type in self.data_dict: for data_type in self.data_dict:
for i in range(len(self.data_dict[data_type])): for i in range(len(self.data_dict[data_type])):
self.data_dict[data_type][i] = os.path.join(
    self.src_dir, self.data_dict[data_type][i])
def _load_ontology(self): def _load_ontology(self):
""" """
load ontology tag load ontology tag
""" """
...@@ -60,8 +62,8 @@ class DSTC2(object): ...@@ -60,8 +62,8 @@ class DSTC2(object):
fr = io.open(self.onto_json, 'r', encoding="utf8") fr = io.open(self.onto_json, 'r', encoding="utf8")
ontology = json.load(fr) ontology = json.load(fr)
slots_values = ontology['informable'] slots_values = ontology['informable']
for slot in slots_values: for slot in slots_values:
for value in slots_values[slot]: for value in slots_values[slot]:
key = "%s_%s" % (slot, value) key = "%s_%s" % (slot, value)
self.map_tag_dict[key] = tag_id self.map_tag_dict[key] = tag_id
tag_id += 1 tag_id += 1
...@@ -69,22 +71,22 @@ class DSTC2(object): ...@@ -69,22 +71,22 @@ class DSTC2(object):
self.map_tag_dict[key] = tag_id self.map_tag_dict[key] = tag_id
tag_id += 1 tag_id += 1
def _parser_dataset(self, data_type): def _parser_dataset(self, data_type):
""" """
parse train dev test dataset parse train dev test dataset
""" """
stat = os.path.exists(self.out_dir) stat = os.path.exists(self.out_dir)
if not stat: if not stat:
os.makedirs(self.out_dir) os.makedirs(self.out_dir)
asr_stat = os.path.exists(self.out_asr_dir) asr_stat = os.path.exists(self.out_asr_dir)
if not asr_stat: if not asr_stat:
os.makedirs(self.out_asr_dir) os.makedirs(self.out_asr_dir)
out_file = os.path.join(self.out_dir, "%s.txt" % data_type) out_file = os.path.join(self.out_dir, "%s.txt" % data_type)
out_asr_file = os.path.join(self.out_asr_dir, "%s.txt" % data_type) out_asr_file = os.path.join(self.out_asr_dir, "%s.txt" % data_type)
fw = io.open(out_file, 'w', encoding="utf8") fw = io.open(out_file, 'w', encoding="utf8")
fw_asr = io.open(out_asr_file, 'w', encoding="utf8") fw_asr = io.open(out_asr_file, 'w', encoding="utf8")
data_list = self.data_dict.get(data_type) data_list = self.data_dict.get(data_type)
for fn in data_list: for fn in data_list:
log_file = os.path.join(fn, "log.json") log_file = os.path.join(fn, "log.json")
label_file = os.path.join(fn, "label.json") label_file = os.path.join(fn, "label.json")
f_log = io.open(log_file, 'r', encoding="utf8") f_log = io.open(log_file, 'r', encoding="utf8")
...@@ -93,49 +95,59 @@ class DSTC2(object): ...@@ -93,49 +95,59 @@ class DSTC2(object):
label_json = json.load(f_label) label_json = json.load(f_label)
session_id = log_json['session-id'] session_id = log_json['session-id']
assert len(label_json["turns"]) == len(log_json["turns"]) assert len(label_json["turns"]) == len(log_json["turns"])
for i in range(len(label_json["turns"])): for i in range(len(label_json["turns"])):
log_turn = log_json["turns"][i] log_turn = log_json["turns"][i]
label_turn = label_json["turns"][i] label_turn = label_json["turns"][i]
assert log_turn["turn-index"] == label_turn["turn-index"] assert log_turn["turn-index"] == label_turn["turn-index"]
labels = [
    "%s_%s" % (slot, label_turn["goal-labels"][slot])
    for slot in label_turn["goal-labels"]
]
labels_ids = " ".join([
    str(
        self.map_tag_dict.get(label, self.map_tag_dict[
            "%s_none" % label.split('_')[0]]))
    for label in labels
])
mach = log_turn['output']['transcript'] mach = log_turn['output']['transcript']
user = label_turn['transcription'] user = label_turn['transcription']
if not labels_ids.strip(): if not labels_ids.strip():
labels_ids = self.map_tag_dict['none'] labels_ids = self.map_tag_dict['none']
out = "%s\t%s\1%s\t%s" % (session_id, mach, user, labels_ids) out = "%s\t%s\1%s\t%s" % (session_id, mach, user, labels_ids)
user_asr = log_turn['input']['live']['asr-hyps'][0][
    'asr-hyp'].strip()
out_asr = "%s\t%s\1%s\t%s" % (session_id, mach, user_asr,
                              labels_ids)
fw.write(u"%s\n" % out.encode('utf8')) fw.write(u"%s\n" % out.encode('utf8'))
fw_asr.write(u"%s\n" % out_asr.encode('utf8')) fw_asr.write(u"%s\n" % out_asr.encode('utf8'))
def get_train_dataset(self): def get_train_dataset(self):
""" """
parse train dataset and write train.txt parse train dataset and write train.txt
""" """
self._parser_dataset("train") self._parser_dataset("train")
def get_dev_dataset(self): def get_dev_dataset(self):
""" """
parse dev dataset and write dev.txt parse dev dataset and write dev.txt
""" """
self._parser_dataset("dev") self._parser_dataset("dev")
def get_test_dataset(self): def get_test_dataset(self):
""" """
parse test dataset and write test.txt parse test dataset and write test.txt
""" """
self._parser_dataset("test") self._parser_dataset("test")
def get_labels(self): def get_labels(self):
""" """
get tag and map ids file get tag and map ids file
""" """
fw = io.open(self.map_tag, 'w', encoding="utf8") fw = io.open(self.map_tag, 'w', encoding="utf8")
for elem in self.map_tag_dict: for elem in self.map_tag_dict:
fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem])) fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem]))
def main(self): def main(self):
""" """
run data process run data process
""" """
...@@ -144,10 +156,7 @@ class DSTC2(object): ...@@ -144,10 +156,7 @@ class DSTC2(object):
self.get_test_dataset() self.get_test_dataset()
self.get_labels() self.get_labels()
if __name__ == "__main__":
dstc_inst = DSTC2() dstc_inst = DSTC2()
dstc_inst.main() dstc_inst.main()
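The ontology flattening in _load_ontology assigns one integer id to every informable slot/value pair; the per-slot none fallback key is inferred here from its use in _parser_dataset. A toy sketch:

```python
# Toy ontology; the real one is read from the JSON file in self.onto_json.
ontology = {"informable": {"food": ["chinese", "italian"], "area": ["north"]}}

map_tag_dict, tag_id = {}, 0
for slot, values in ontology["informable"].items():
    for value in values:
        map_tag_dict["%s_%s" % (slot, value)] = tag_id
        tag_id += 1
    map_tag_dict["%s_none" % slot] = tag_id  # assumed per-slot fallback key
    tag_id += 1
print(map_tag_dict)
# e.g. {'food_chinese': 0, 'food_italian': 1, 'food_none': 2,
#       'area_north': 3, 'area_none': 4}
```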
...@@ -23,11 +23,12 @@ import re ...@@ -23,11 +23,12 @@ import re
import commonlib import commonlib
class MRDA(object): class MRDA(object):
""" """
dialogue act dataset mrda data process dialogue act dataset mrda data process
""" """
def __init__(self):
    """
    init instance
    """
...@@ -41,7 +42,7 @@ class MRDA(object): ...@@ -41,7 +42,7 @@ class MRDA(object):
self._load_file() self._load_file()
self.tag_dict = commonlib.load_voc(self.voc_map_tag) self.tag_dict = commonlib.load_voc(self.voc_map_tag)
def _load_file(self): def _load_file(self):
""" """
load dataset filename load dataset filename
""" """
...@@ -49,30 +50,30 @@ class MRDA(object): ...@@ -49,30 +50,30 @@ class MRDA(object):
self.trans_dict = {} self.trans_dict = {}
self.data_dict = commonlib.load_dict(self.data_list) self.data_dict = commonlib.load_dict(self.data_list)
file_list, file_path = commonlib.get_file_list(self.src_dir) file_list, file_path = commonlib.get_file_list(self.src_dir)
for i in range(len(file_list)): for i in range(len(file_list)):
name = file_list[i] name = file_list[i]
keyword = name.split('.')[0] keyword = name.split('.')[0]
if 'dadb' in name: if 'dadb' in name:
self.dadb_dict[keyword] = file_path[i] self.dadb_dict[keyword] = file_path[i]
if 'trans' in name: if 'trans' in name:
self.trans_dict[keyword] = file_path[i] self.trans_dict[keyword] = file_path[i]
def load_dadb(self, data_type): def load_dadb(self, data_type):
""" """
load dadb dataset load dadb dataset
""" """
dadb_dict = {} dadb_dict = {}
conv_id_list = [] conv_id_list = []
dadb_list = self.data_dict[data_type] dadb_list = self.data_dict[data_type]
for dadb_key in dadb_list: for dadb_key in dadb_list:
dadb_file = self.dadb_dict[dadb_key] dadb_file = self.dadb_dict[dadb_key]
fr = io.open(dadb_file, 'r', encoding="utf8") fr = io.open(dadb_file, 'r', encoding="utf8")
row = csv.reader(fr, delimiter = ',') row = csv.reader(fr, delimiter=',')
for line in row: for line in row:
elems = line elems = line
conv_id = elems[2] conv_id = elems[2]
conv_id_list.append(conv_id) conv_id_list.append(conv_id)
if len(elems) != 14: if len(elems) != 14:
continue continue
error_code = elems[3] error_code = elems[3]
da_tag = elems[-9] da_tag = elems[-9]
...@@ -80,17 +81,17 @@ class MRDA(object): ...@@ -80,17 +81,17 @@ class MRDA(object):
dadb_dict[conv_id] = (error_code, da_ori_tag, da_tag) dadb_dict[conv_id] = (error_code, da_ori_tag, da_tag)
return dadb_dict, conv_id_list return dadb_dict, conv_id_list
def load_trans(self, data_type): def load_trans(self, data_type):
"""load trans data""" """load trans data"""
trans_dict = {} trans_dict = {}
trans_list = self.data_dict[data_type] trans_list = self.data_dict[data_type]
for trans_key in trans_list: for trans_key in trans_list:
trans_file = self.trans_dict[trans_key] trans_file = self.trans_dict[trans_key]
fr = io.open(trans_file, 'r', encoding="utf8") fr = io.open(trans_file, 'r', encoding="utf8")
row = csv.reader(fr, delimiter = ',') row = csv.reader(fr, delimiter=',')
for line in row: for line in row:
elems = line elems = line
if len(elems) != 3: if len(elems) != 3:
continue continue
conv_id = elems[0] conv_id = elems[0]
text = elems[1] text = elems[1]
...@@ -98,7 +99,7 @@ class MRDA(object): ...@@ -98,7 +99,7 @@ class MRDA(object):
trans_dict[conv_id] = (text, text_process) trans_dict[conv_id] = (text, text_process)
return trans_dict return trans_dict
def _parser_dataset(self, data_type): def _parser_dataset(self, data_type):
""" """
parse train dev test dataset parse train dev test dataset
""" """
...@@ -106,50 +107,51 @@ class MRDA(object): ...@@ -106,50 +107,51 @@ class MRDA(object):
dadb_dict, conv_id_list = self.load_dadb(data_type) dadb_dict, conv_id_list = self.load_dadb(data_type)
trans_dict = self.load_trans(data_type) trans_dict = self.load_trans(data_type)
fw = io.open(out_filename, 'w', encoding="utf8") fw = io.open(out_filename, 'w', encoding="utf8")
for elem in conv_id_list: for elem in conv_id_list:
v_dadb = dadb_dict[elem] v_dadb = dadb_dict[elem]
v_trans = trans_dict[elem] v_trans = trans_dict[elem]
da_tag = v_dadb[2] da_tag = v_dadb[2]
if da_tag not in self.tag_dict: if da_tag not in self.tag_dict:
continue continue
tag = self.tag_dict[da_tag] tag = self.tag_dict[da_tag]
if tag == "Z": if tag == "Z":
continue continue
if tag not in self.map_tag_dict: if tag not in self.map_tag_dict:
self.map_tag_dict[tag] = self.tag_id self.map_tag_dict[tag] = self.tag_id
self.tag_id += 1 self.tag_id += 1
caller = elem.split('_')[0].split('-')[-1] caller = elem.split('_')[0].split('-')[-1]
conv_no = elem.split('_')[0].split('-')[0] conv_no = elem.split('_')[0].split('-')[0]
out = "%s\t%s\t%s\t%s" % (conv_no, self.map_tag_dict[tag], caller, v_trans[0]) out = "%s\t%s\t%s\t%s" % (conv_no, self.map_tag_dict[tag], caller,
v_trans[0])
fw.write(u"%s\n" % out) fw.write(u"%s\n" % out)
def get_train_dataset(self): def get_train_dataset(self):
""" """
parse train dataset and write train.txt parse train dataset and write train.txt
""" """
self._parser_dataset("train") self._parser_dataset("train")
def get_dev_dataset(self): def get_dev_dataset(self):
""" """
parse dev dataset and write dev.txt parse dev dataset and write dev.txt
""" """
self._parser_dataset("dev") self._parser_dataset("dev")
def get_test_dataset(self): def get_test_dataset(self):
""" """
parse test dataset and write test.txt parse test dataset and write test.txt
""" """
self._parser_dataset("test") self._parser_dataset("test")
def get_labels(self): def get_labels(self):
""" """
get tag and map ids file get tag and map ids file
""" """
fw = io.open(self.map_tag, 'w', encoding="utf8") fw = io.open(self.map_tag, 'w', encoding="utf8")
for elem in self.map_tag_dict: for elem in self.map_tag_dict:
fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem])) fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem]))
def main(self): def main(self):
""" """
run data process run data process
""" """
...@@ -158,10 +160,7 @@ class MRDA(object): ...@@ -158,10 +160,7 @@ class MRDA(object):
self.get_test_dataset() self.get_test_dataset()
self.get_labels() self.get_labels()
if __name__ == "__main__":
mrda_inst = MRDA() mrda_inst = MRDA()
mrda_inst.main() mrda_inst.main()
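The MRDA builder joins the dadb and trans records on their shared conv_id before writing a row. In miniature (toy records with an illustrative conv_id layout; real ones come from load_dadb and load_trans):

```python
# Minimal sketch of the conv_id join performed by _parser_dataset above.
dadb_dict = {"Bed003-c1_0": ("0", "s", "s")}      # (error_code, ori_tag, da_tag)
trans_dict = {"Bed003-c1_0": ("hello", "hello")}  # (text, text_process)

for conv_id in dadb_dict:
    da_tag = dadb_dict[conv_id][2]
    text = trans_dict[conv_id][0]
    caller = conv_id.split("_")[0].split("-")[-1]  # -> "c1"
    conv_no = conv_id.split("_")[0].split("-")[0]  # -> "Bed003"
    print("%s\t%s\t%s\t%s" % (conv_no, da_tag, caller, text))
```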
...@@ -23,11 +23,12 @@ import re ...@@ -23,11 +23,12 @@ import re
import commonlib import commonlib
class SWDA(object): class SWDA(object):
""" """
dialogue act dataset swda data process dialogue act dataset swda data process
""" """
def __init__(self):
    """
    init instance
    """
...@@ -39,94 +40,94 @@ class SWDA(object): ...@@ -39,94 +40,94 @@ class SWDA(object):
self.src_dir = "../../data/input/data/swda/source_data/swda" self.src_dir = "../../data/input/data/swda/source_data/swda"
self._load_file() self._load_file()
def _load_file(self): def _load_file(self):
""" """
load dataset filename load dataset filename
""" """
self.data_dict = commonlib.load_dict(self.data_list) self.data_dict = commonlib.load_dict(self.data_list)
self.file_dict = {} self.file_dict = {}
child_dir = commonlib.get_dir_list(self.src_dir) child_dir = commonlib.get_dir_list(self.src_dir)
for chd in child_dir: for chd in child_dir:
file_list, file_path = commonlib.get_file_list(chd) file_list, file_path = commonlib.get_file_list(chd)
for i in range(len(file_list)): for i in range(len(file_list)):
name = file_list[i] name = file_list[i]
keyword = "sw%s" % name.split('.')[0].split('_')[-1] keyword = "sw%s" % name.split('.')[0].split('_')[-1]
self.file_dict[keyword] = file_path[i] self.file_dict[keyword] = file_path[i]
def _parser_dataset(self, data_type): def _parser_dataset(self, data_type):
""" """
parse train dev test dataset parse train dev test dataset
""" """
out_filename = "%s/%s.txt" % (self.out_dir, data_type) out_filename = "%s/%s.txt" % (self.out_dir, data_type)
fw = io.open(out_filename, 'w', encoding='utf8') fw = io.open(out_filename, 'w', encoding='utf8')
for name in self.data_dict[data_type]: for name in self.data_dict[data_type]:
file_path = self.file_dict[name] file_path = self.file_dict[name]
fr = io.open(file_path, 'r', encoding="utf8") fr = io.open(file_path, 'r', encoding="utf8")
idx = 0 idx = 0
row = csv.reader(fr, delimiter = ',') row = csv.reader(fr, delimiter=',')
for r in row: for r in row:
if idx == 0: if idx == 0:
idx += 1 idx += 1
continue continue
out = self._parser_utterence(r) out = self._parser_utterence(r)
fw.write(u"%s\n" % out) fw.write(u"%s\n" % out)
def _clean_text(self, text): def _clean_text(self, text):
""" """
text cleaning for dialogue act dataset text cleaning for dialogue act dataset
""" """
if text.startswith('<') and text.endswith('>.'): if text.startswith('<') and text.endswith('>.'):
return text return text
if "[" in text or "]" in text: if "[" in text or "]" in text:
stat = True stat = True
else: else:
stat = False stat = False
group = re.findall("\[.*?\+.*?\]", text) group = re.findall("\[.*?\+.*?\]", text)
while group and stat: while group and stat:
for elem in group: for elem in group:
elem_src = elem elem_src = elem
elem = re.sub('\+', '', elem.lstrip('[').rstrip(']')) elem = re.sub('\+', '', elem.lstrip('[').rstrip(']'))
text = text.replace(elem_src, elem) text = text.replace(elem_src, elem)
if "[" in text or "]" in text: if "[" in text or "]" in text:
stat = True stat = True
else: else:
stat = False stat = False
group = re.findall("\[.*?\+.*?\]", text) group = re.findall("\[.*?\+.*?\]", text)
if "{" in text or "}" in text: if "{" in text or "}" in text:
stat = True stat = True
else: else:
stat = False stat = False
group = re.findall("{[A-Z].*?}", text) group = re.findall("{[A-Z].*?}", text)
while group and stat: while group and stat:
child_group = re.findall("{[A-Z]*(.*?)}", text) child_group = re.findall("{[A-Z]*(.*?)}", text)
for i in range(len(group)): for i in range(len(group)):
text = text.replace(group[i], child_group[i]) text = text.replace(group[i], child_group[i])
if "{" in text or "}" in text: if "{" in text or "}" in text:
stat = True stat = True
else: else:
stat = False stat = False
group = re.findall("{[A-Z].*?}", text) group = re.findall("{[A-Z].*?}", text)
if "(" in text or ")" in text: if "(" in text or ")" in text:
stat = True stat = True
else: else:
stat = False stat = False
group = re.findall("\(\(.*?\)\)", text) group = re.findall("\(\(.*?\)\)", text)
while group and stat: while group and stat:
for elem in group: for elem in group:
if elem: if elem:
elem_clean = re.sub("\(|\)", "", elem) elem_clean = re.sub("\(|\)", "", elem)
text = text.replace(elem, elem_clean) text = text.replace(elem, elem_clean)
else: else:
text = text.replace(elem, "mumblex") text = text.replace(elem, "mumblex")
if "(" in text or ")" in text: if "(" in text or ")" in text:
stat = True stat = True
else: else:
stat = False stat = False
group = re.findall("\(\((.*?)\)\)", text) group = re.findall("\(\((.*?)\)\)", text)
group = re.findall("\<.*?\>", text) group = re.findall("\<.*?\>", text)
if group: if group:
for elem in group: for elem in group:
text = text.replace(elem, "") text = text.replace(elem, "")
text = re.sub(r" \'s", "\'s", text) text = re.sub(r" \'s", "\'s", text)
...@@ -137,24 +138,24 @@ class SWDA(object): ...@@ -137,24 +138,24 @@ class SWDA(object):
text = re.sub("\[|\]|\+|\>|\<|\{|\}", "", text) text = re.sub("\[|\]|\+|\>|\<|\{|\}", "", text)
return text.strip().lower() return text.strip().lower()
def _map_tag(self, da_tag): def _map_tag(self, da_tag):
""" """
map tag to 42 classes map tag to 42 classes
""" """
curr_da_tags = [] curr_da_tags = []
curr_das = re.split(r"\s*[,;]\s*", da_tag) curr_das = re.split(r"\s*[,;]\s*", da_tag)
for curr_da in curr_das: for curr_da in curr_das:
if curr_da == "qy_d" or curr_da == "qw^d" or curr_da == "b^m": if curr_da == "qy_d" or curr_da == "qw^d" or curr_da == "b^m":
pass pass
elif curr_da == "nn^e": elif curr_da == "nn^e":
curr_da = "ng" curr_da = "ng"
elif curr_da == "ny^e": elif curr_da == "ny^e":
curr_da = "na" curr_da = "na"
else: else:
curr_da = re.sub(r'(.)\^.*', r'\1', curr_da) curr_da = re.sub(r'(.)\^.*', r'\1', curr_da)
curr_da = re.sub(r'[\(\)@*]', '', curr_da) curr_da = re.sub(r'[\(\)@*]', '', curr_da)
tag = curr_da tag = curr_da
if tag in ('qr', 'qy'): if tag in ('qr', 'qy'):
tag = 'qy' tag = 'qy'
elif tag in ('fe', 'ba'): elif tag in ('fe', 'ba'):
tag = 'ba' tag = 'ba'
...@@ -170,12 +171,12 @@ class SWDA(object): ...@@ -170,12 +171,12 @@ class SWDA(object):
tag = 'fo_o_fw_"_by_bc' tag = 'fo_o_fw_"_by_bc'
curr_da = tag curr_da = tag
curr_da_tags.append(curr_da) curr_da_tags.append(curr_da)
if curr_da_tags[0] not in self.map_tag_dict: if curr_da_tags[0] not in self.map_tag_dict:
self.map_tag_dict[curr_da_tags[0]] = self.tag_id self.map_tag_dict[curr_da_tags[0]] = self.tag_id
self.tag_id += 1 self.tag_id += 1
return self.map_tag_dict[curr_da_tags[0]] return self.map_tag_dict[curr_da_tags[0]]
def _parser_utterence(self, line): def _parser_utterence(self, line):
""" """
parse one dialogue turn parse one dialogue turn
""" """
...@@ -188,34 +189,34 @@ class SWDA(object): ...@@ -188,34 +189,34 @@ class SWDA(object):
out = "%s\t%s\t%s\t%s" % (conversation_no, act_tag, caller, text) out = "%s\t%s\t%s\t%s" % (conversation_no, act_tag, caller, text)
return out return out
def get_train_dataset(self): def get_train_dataset(self):
""" """
parse train dataset and write train.txt parse train dataset and write train.txt
""" """
self._parser_dataset("train") self._parser_dataset("train")
def get_dev_dataset(self): def get_dev_dataset(self):
""" """
parse dev dataset and write dev.txt parse dev dataset and write dev.txt
""" """
self._parser_dataset("dev") self._parser_dataset("dev")
def get_test_dataset(self): def get_test_dataset(self):
""" """
parse test dataset and write test.txt parse test dataset and write test.txt
""" """
self._parser_dataset("test") self._parser_dataset("test")
def get_labels(self): def get_labels(self):
""" """
get tag and map ids file get tag and map ids file
""" """
fw = io.open(self.map_tag, 'w', encoding='utf8') fw = io.open(self.map_tag, 'w', encoding='utf8')
for elem in self.map_tag_dict: for elem in self.map_tag_dict:
fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem])) fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem]))
def main(self): def main(self):
""" """
run data process run data process
""" """
...@@ -224,10 +225,7 @@ class SWDA(object): ...@@ -224,10 +225,7 @@ class SWDA(object):
self.get_test_dataset() self.get_test_dataset()
self.get_labels() self.get_labels()
if __name__ == "__main__":
swda_inst = SWDA() swda_inst = SWDA()
swda_inst.main() swda_inst.main()
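The regex-heavy _clean_text above is easiest to follow one transformation at a time. A standalone sketch of just the "[ reparandum + repair ]" edit handling (toy input, not the full cleaning pipeline):

```python
import re

def strip_edit_brackets(text):
    # Collapse "[ a + b ]" disfluency edits to "a b", repeatedly, as the
    # first while-loop of _clean_text above does.
    group = re.findall(r"\[.*?\+.*?\]", text)
    while group:
        for elem in group:
            fixed = re.sub(r"\+", "", elem.lstrip("[").rstrip("]"))
            text = text.replace(elem, fixed)
        group = re.findall(r"\[.*?\+.*?\]", text)
    return text

print(strip_edit_brackets("[ i, + i ] think it's fine"))
```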
...@@ -25,52 +25,49 @@ def get_file_list(dir_name): ...@@ -25,52 +25,49 @@ def get_file_list(dir_name):
file_list = list() file_list = list()
file_path = list() file_path = list()
for root, dirs, files in os.walk(dir_name): for root, dirs, files in os.walk(dir_name):
for file in files: for file in files:
file_list.append(file) file_list.append(file)
file_path.append(os.path.join(root, file)) file_path.append(os.path.join(root, file))
return file_list, file_path return file_list, file_path
def get_dir_list(dir_name): def get_dir_list(dir_name):
""" """
get directory names get directory names
""" """
child_dir = [] child_dir = []
dir_list = os.listdir(dir_name) dir_list = os.listdir(dir_name)
for cur_file in dir_list: for cur_file in dir_list:
path = os.path.join(dir_name, cur_file) path = os.path.join(dir_name, cur_file)
if not os.path.isdir(path): if not os.path.isdir(path):
continue continue
child_dir.append(path) child_dir.append(path)
return child_dir return child_dir
def load_dict(conf): def load_dict(conf):
""" """
load swda dataset config load swda dataset config
""" """
conf_dict = dict() conf_dict = dict()
fr = io.open(conf, 'r', encoding="utf8") fr = io.open(conf, 'r', encoding="utf8")
for line in fr: for line in fr:
line = line.strip() line = line.strip()
elems = line.split('\t') elems = line.split('\t')
if elems[0] not in conf_dict: if elems[0] not in conf_dict:
conf_dict[elems[0]] = [] conf_dict[elems[0]] = []
conf_dict[elems[0]].append(elems[1]) conf_dict[elems[0]].append(elems[1])
return conf_dict return conf_dict
def load_voc(conf): def load_voc(conf):
""" """
load map dict load map dict
""" """
map_dict = {} map_dict = {}
fr = io.open(conf, 'r', encoding="utf8") fr = io.open(conf, 'r', encoding="utf8")
for line in fr: for line in fr:
line = line.strip() line = line.strip()
elems = line.split('\t') elems = line.split('\t')
map_dict[elems[0]] = elems[1] map_dict[elems[0]] = elems[1]
return map_dict return map_dict
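load_dict above groups tab-separated key/value rows into a list per key. A tiny self-contained demo using that function (the file name and contents here are invented):

```python
import io
import os

conf = "demo_data_list.txt"
with io.open(conf, "w", encoding="utf8") as fw:
    fw.write(u"train\tsw00utt\ntrain\tsw01utt\ntest\tsw02utt\n")
print(load_dict(conf))  # {'train': ['sw00utt', 'sw01utt'], 'test': ['sw02utt']}
os.remove(conf)
```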
...@@ -20,29 +20,29 @@ from build_dstc2_dataset import DSTC2 ...@@ -20,29 +20,29 @@ from build_dstc2_dataset import DSTC2
from build_mrda_dataset import MRDA from build_mrda_dataset import MRDA
from build_swda_dataset import SWDA from build_swda_dataset import SWDA
if __name__ == "__main__":
task_name = sys.argv[1] task_name = sys.argv[1]
task_name = task_name.lower() task_name = task_name.lower()
if task_name not in ['swda', 'mrda', 'atis', 'dstc2', 'udc']: if task_name not in ['swda', 'mrda', 'atis', 'dstc2', 'udc']:
print("task name error: we support [swda|mrda|atis|dstc2|udc]") print("task name error: we support [swda|mrda|atis|dstc2|udc]")
exit(1) exit(1)
if task_name == 'swda': if task_name == 'swda':
swda_inst = SWDA() swda_inst = SWDA()
swda_inst.main() swda_inst.main()
elif task_name == 'mrda': elif task_name == 'mrda':
mrda_inst = MRDA() mrda_inst = MRDA()
mrda_inst.main() mrda_inst.main()
elif task_name == 'atis': elif task_name == 'atis':
atis_inst = ATIS() atis_inst = ATIS()
atis_inst.main() atis_inst.main()
shutil.copyfile("../../data/input/data/atis/atis_slot/test.txt", "../../data/input/data/atis/atis_slot/dev.txt") shutil.copyfile("../../data/input/data/atis/atis_slot/test.txt",
shutil.copyfile("../../data/input/data/atis/atis_intent/test.txt", "../../data/input/data/atis/atis_intent/dev.txt") "../../data/input/data/atis/atis_slot/dev.txt")
elif task_name == 'dstc2': shutil.copyfile("../../data/input/data/atis/atis_intent/test.txt",
"../../data/input/data/atis/atis_intent/dev.txt")
elif task_name == 'dstc2':
dstc_inst = DSTC2() dstc_inst = DSTC2()
dstc_inst.main() dstc_inst.main()
else: else:
exit(0) exit(0)
...@@ -12,7 +12,6 @@ ...@@ -12,7 +12,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Tokenization classes.""" """Tokenization classes."""
from __future__ import absolute_import from __future__ import absolute_import
......
...@@ -113,7 +113,7 @@ def multi_head_attention(queries, ...@@ -113,7 +113,7 @@ def multi_head_attention(queries,
""" """
Scaled Dot-Product Attention Scaled Dot-Product Attention
""" """
scaled_q = layers.scale(x=q, scale=d_key ** -0.5) scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True) product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias: if attn_bias:
product += attn_bias product += attn_bias
......
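For intuition, the computation in the snippet above (scale by d_key**-0.5, take QK^T, softmax over keys, mix the values) in plain NumPy; a standalone sketch, not the Paddle implementation:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, d_key):
    # Scale queries, score against keys, softmax over keys, mix values.
    scores = (q * d_key ** -0.5) @ np.swapaxes(k, -1, -2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = np.random.rand(2, 4, 8)  # [batch, seq_len, d_key]
print(scaled_dot_product_attention(x, x, x, d_key=8).shape)  # (2, 4, 8)
```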
...@@ -25,8 +25,8 @@ import numpy as np ...@@ -25,8 +25,8 @@ import numpy as np
import paddle.fluid as fluid import paddle.fluid as fluid
class InputField(object): class InputField(object):
def __init__(self, input_field): def __init__(self, input_field):
"""init inpit field""" """init inpit field"""
self.src_ids = input_field[0] self.src_ids = input_field[0]
self.pos_ids = input_field[1] self.pos_ids = input_field[1]
......
...@@ -30,7 +30,7 @@ def check_cuda(use_cuda, err = \ ...@@ -30,7 +30,7 @@ def check_cuda(use_cuda, err = \
if __name__ == "__main__": if __name__ == "__main__":
check_cuda(True) check_cuda(True)
check_cuda(False) check_cuda(False)
......
...@@ -69,8 +69,8 @@ def init_from_checkpoint(args, exe, program): ...@@ -69,8 +69,8 @@ def init_from_checkpoint(args, exe, program):
def init_from_params(args, exe, program): def init_from_params(args, exe, program):
assert isinstance(args.init_from_params, str) assert isinstance(args.init_from_params, str)
if not os.path.exists(args.init_from_params): if not os.path.exists(args.init_from_params):
raise Warning("the params path does not exist.") raise Warning("the params path does not exist.")
return False return False
...@@ -113,7 +113,7 @@ def save_param(args, exe, program, dirname): ...@@ -113,7 +113,7 @@ def save_param(args, exe, program, dirname):
if not os.path.exists(param_dir): if not os.path.exists(param_dir):
os.makedirs(param_dir) os.makedirs(param_dir)
fluid.io.save_params( fluid.io.save_params(
exe, exe,
os.path.join(param_dir, dirname), os.path.join(param_dir, dirname),
...@@ -122,5 +122,3 @@ def save_param(args, exe, program, dirname): ...@@ -122,5 +122,3 @@ def save_param(args, exe, program, dirname):
print("save parameters at %s" % (os.path.join(param_dir, dirname))) print("save parameters at %s" % (os.path.join(param_dir, dirname)))
return True return True
...@@ -23,14 +23,9 @@ from dgu.bert import BertModel ...@@ -23,14 +23,9 @@ from dgu.bert import BertModel
from dgu.utils.configure import JsonConfig from dgu.utils.configure import JsonConfig
def create_net(is_training, model_input, num_labels, paradigm_inst, args):
"""create dialogue task model""" """create dialogue task model"""
src_ids = model_input.src_ids src_ids = model_input.src_ids
pos_ids = model_input.pos_ids pos_ids = model_input.pos_ids
sent_ids = model_input.sent_ids sent_ids = model_input.sent_ids
...@@ -48,14 +43,15 @@ def create_net( ...@@ -48,14 +43,15 @@ def create_net(
config=bert_conf, config=bert_conf,
use_fp16=False) use_fp16=False)
params = {
    'num_labels': num_labels,
    'src_ids': src_ids,
    'pos_ids': pos_ids,
    'sent_ids': sent_ids,
    'input_mask': input_mask,
    'labels': labels,
    'is_training': is_training
}
results = paradigm_inst.paradigm(bert, params) results = paradigm_inst.paradigm(bert, params)
return results return results
...@@ -20,17 +20,17 @@ from dgu.evaluation import evaluate ...@@ -20,17 +20,17 @@ from dgu.evaluation import evaluate
from dgu.utils.configure import PDConfig from dgu.utils.configure import PDConfig
def do_eval(args): def do_eval(args):
task_name = args.task_name.lower() task_name = args.task_name.lower()
reference = args.evaluation_file reference = args.evaluation_file
predictions = args.output_prediction_file predictions = args.output_prediction_file
evaluate(task_name, predictions, reference) evaluate(task_name, predictions, reference)
if __name__ == "__main__": if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/dgu.yaml") args = PDConfig(yaml_file="./data/config/dgu.yaml")
args.build() args.build()
......
...@@ -29,10 +29,10 @@ import dgu.utils.save_load_io as save_load_io ...@@ -29,10 +29,10 @@ import dgu.utils.save_load_io as save_load_io
import dgu.reader as reader import dgu.reader as reader
from dgu_net import create_net from dgu_net import create_net
import dgu.define_paradigm as define_paradigm import dgu.define_paradigm as define_paradigm
def do_save_inference_model(args): def do_save_inference_model(args):
"""save inference model function""" """save inference model function"""
task_name = args.task_name.lower() task_name = args.task_name.lower()
...@@ -57,35 +57,36 @@ def do_save_inference_model(args): ...@@ -57,35 +57,36 @@ def do_save_inference_model(args):
with fluid.unique_name.guard(): with fluid.unique_name.guard():
# define inputs of the network # define inputs of the network
num_labels = len(processors[task_name].get_labels()) num_labels = len(processors[task_name].get_labels())
src_ids = fluid.data( src_ids = fluid.data(
name='src_ids', shape=[-1, args.max_seq_len], dtype='int64') name='src_ids', shape=[-1, args.max_seq_len], dtype='int64')
pos_ids = fluid.data( pos_ids = fluid.data(
name='pos_ids', shape=[-1, args.max_seq_len], dtype='int64') name='pos_ids', shape=[-1, args.max_seq_len], dtype='int64')
sent_ids = fluid.data( sent_ids = fluid.data(
name='sent_ids', shape=[-1, args.max_seq_len], dtype='int64') name='sent_ids', shape=[-1, args.max_seq_len], dtype='int64')
input_mask = fluid.data(
    name='input_mask', shape=[-1, args.max_seq_len], dtype='float32')
if args.task_name == 'atis_slot':
labels = fluid.data( labels = fluid.data(
name='labels', shape=[-1, args.max_seq_len], dtype='int64') name='labels', shape=[-1, args.max_seq_len], dtype='int64')
elif args.task_name in ['dstc2', 'dstc2_asr', 'multi-woz']: elif args.task_name in ['dstc2', 'dstc2_asr', 'multi-woz']:
labels = fluid.data( labels = fluid.data(
name='labels', shape=[-1, num_labels], dtype='int64') name='labels', shape=[-1, num_labels], dtype='int64')
else: else:
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')
input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels] input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels]
input_field = InputField(input_inst) input_field = InputField(input_inst)
results = create_net( results = create_net(
is_training=False, is_training=False,
model_input=input_field, model_input=input_field,
num_labels=num_labels, num_labels=num_labels,
paradigm_inst=paradigm_inst, paradigm_inst=paradigm_inst,
args=args) args=args)
probs = results.get("probs", None) probs = results.get("probs", None)
if args.use_cuda: if args.use_cuda:
...@@ -97,7 +98,7 @@ def do_save_inference_model(args): ...@@ -97,7 +98,7 @@ def do_save_inference_model(args):
exe.run(startup_prog) exe.run(startup_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model) assert (args.init_from_params) or (args.init_from_pretrain_model)
if args.init_from_params: if args.init_from_params:
save_load_io.init_from_params(args, exe, test_prog) save_load_io.init_from_params(args, exe, test_prog)
elif args.init_from_pretrain_model: elif args.init_from_pretrain_model:
...@@ -105,20 +106,16 @@ def do_save_inference_model(args): ...@@ -105,20 +106,16 @@ def do_save_inference_model(args):
# saving inference model # saving inference model
fluid.io.save_inference_model(
    args.inference_model_dir,
    feeded_var_names=[
        input_field.src_ids.name, input_field.pos_ids.name,
        input_field.sent_ids.name, input_field.input_mask.name
    ],
    target_vars=[probs],
    executor=exe,
    main_program=test_prog,
    model_filename="model.pdmodel",
    params_filename="params.pdparams")
print("save inference model at %s" % (args.inference_model_dir)) print("save inference model at %s" % (args.inference_model_dir))
......
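For completeness, a sketch of loading the exported model back for prediction with the Paddle 1.x API; the directory is whatever args.inference_model_dir was at export time, and the filenames match the save call above:

```python
import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
infer_prog, feed_names, fetch_targets = fluid.io.load_inference_model(
    "./inference_model",  # assumed value of args.inference_model_dir
    exe,
    model_filename="model.pdmodel",
    params_filename="params.pdparams")
print(feed_names)  # expected: ['src_ids', 'pos_ids', 'sent_ids', 'input_mask']
```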
...@@ -26,7 +26,6 @@ from inference_model import do_save_inference_model ...@@ -26,7 +26,6 @@ from inference_model import do_save_inference_model
from dgu.utils.configure import PDConfig from dgu.utils.configure import PDConfig
if __name__ == "__main__": if __name__ == "__main__":
args = PDConfig(yaml_file="./data/config/dgu.yaml") args = PDConfig(yaml_file="./data/config/dgu.yaml")
......
...@@ -28,7 +28,7 @@ import paddle.fluid as fluid ...@@ -28,7 +28,7 @@ import paddle.fluid as fluid
from dgu_net import create_net from dgu_net import create_net
import dgu.reader as reader import dgu.reader as reader
from dgu.optimization import optimization from dgu.optimization import optimization
import dgu.define_paradigm as define_paradigm import dgu.define_paradigm as define_paradigm
from dgu.utils.configure import PDConfig from dgu.utils.configure import PDConfig
from dgu.utils.input_field import InputField from dgu.utils.input_field import InputField
from dgu.utils.model_check import check_cuda from dgu.utils.model_check import check_cuda
...@@ -37,7 +37,7 @@ import dgu.utils.save_load_io as save_load_io ...@@ -37,7 +37,7 @@ import dgu.utils.save_load_io as save_load_io
def do_train(args): def do_train(args):
"""train function""" """train function"""
task_name = args.task_name.lower() task_name = args.task_name.lower()
paradigm_inst = define_paradigm.Paradigm(task_name) paradigm_inst = define_paradigm.Paradigm(task_name)
...@@ -53,34 +53,35 @@ def do_train(args): ...@@ -53,34 +53,35 @@ def do_train(args):
train_prog = fluid.default_main_program() train_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program() startup_prog = fluid.default_startup_program()
with fluid.program_guard(train_prog, startup_prog): with fluid.program_guard(train_prog, startup_prog):
train_prog.random_seed = args.random_seed train_prog.random_seed = args.random_seed
startup_prog.random_seed = args.random_seed startup_prog.random_seed = args.random_seed
with fluid.unique_name.guard(): with fluid.unique_name.guard():
num_labels = len(processors[task_name].get_labels()) num_labels = len(processors[task_name].get_labels())
src_ids = fluid.data( src_ids = fluid.data(
name='src_ids', shape=[-1, args.max_seq_len], dtype='int64') name='src_ids', shape=[-1, args.max_seq_len], dtype='int64')
pos_ids = fluid.data( pos_ids = fluid.data(
name='pos_ids', shape=[-1, args.max_seq_len], dtype='int64') name='pos_ids', shape=[-1, args.max_seq_len], dtype='int64')
sent_ids = fluid.data( sent_ids = fluid.data(
name='sent_ids', shape=[-1, args.max_seq_len], dtype='int64') name='sent_ids', shape=[-1, args.max_seq_len], dtype='int64')
input_mask = fluid.data(
    name='input_mask', shape=[-1, args.max_seq_len], dtype='float32')
if args.task_name == 'atis_slot':
labels = fluid.data( labels = fluid.data(
name='labels', shape=[-1, args.max_seq_len], dtype='int64') name='labels', shape=[-1, args.max_seq_len], dtype='int64')
elif args.task_name in ['dstc2']: elif args.task_name in ['dstc2']:
labels = fluid.data( labels = fluid.data(
name='labels', shape=[-1, num_labels], dtype='int64') name='labels', shape=[-1, num_labels], dtype='int64')
else: else:
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')
input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels] input_inst = [src_ids, pos_ids, sent_ids, input_mask, labels]
input_field = InputField(input_inst) input_field = InputField(input_inst)
data_reader = fluid.io.PyReader(
    feed_list=input_inst, capacity=4, iterable=False)
processor = processors[task_name](data_dir=args.data_dir, processor = processors[task_name](data_dir=args.data_dir,
vocab_path=args.vocab_path, vocab_path=args.vocab_path,
max_seq_len=args.max_seq_len, max_seq_len=args.max_seq_len,
...@@ -90,12 +91,12 @@ def do_train(args): ...@@ -90,12 +91,12 @@ def do_train(args):
random_seed=args.random_seed) random_seed=args.random_seed)
results = create_net( results = create_net(
is_training=True, is_training=True,
model_input=input_field, model_input=input_field,
num_labels=num_labels, num_labels=num_labels,
paradigm_inst=paradigm_inst, paradigm_inst=paradigm_inst,
args=args) args=args)
loss = results.get("loss", None) loss = results.get("loss", None)
probs = results.get("probs", None) probs = results.get("probs", None)
accuracy = results.get("accuracy", None) accuracy = results.get("accuracy", None)
...@@ -103,21 +104,19 @@ def do_train(args): ...@@ -103,21 +104,19 @@ def do_train(args):
loss.persistable = True loss.persistable = True
probs.persistable = True probs.persistable = True
if accuracy: if accuracy:
accuracy.persistable = True accuracy.persistable = True
num_seqs.persistable = True num_seqs.persistable = True
if args.use_cuda: if args.use_cuda:
dev_count = fluid.core.get_cuda_device_count() dev_count = fluid.core.get_cuda_device_count()
else: else:
dev_count = int(os.environ.get('CPU_NUM', 1)) dev_count = int(os.environ.get('CPU_NUM', 1))
batch_generator = processor.data_generator(
    batch_size=args.batch_size, phase='train', shuffle=True)
num_train_examples = processor.get_num_examples(phase='train') num_train_examples = processor.get_num_examples(phase='train')
if args.in_tokens: if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // ( max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count args.batch_size // args.max_seq_len) // dev_count
...@@ -147,32 +146,32 @@ def do_train(args): ...@@ -147,32 +146,32 @@ def do_train(args):
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0'))) place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else: else:
place = fluid.CPUPlace() place = fluid.CPUPlace()
exe = fluid.Executor(place) exe = fluid.Executor(place)
exe.run(startup_prog) exe.run(startup_prog)
assert (args.init_from_checkpoint == "") or ( assert (args.init_from_checkpoint == "") or (
args.init_from_pretrain_model == "") args.init_from_pretrain_model == "")
# init from some checkpoint, to resume the previous training # init from some checkpoint, to resume the previous training
if args.init_from_checkpoint: if args.init_from_checkpoint:
save_load_io.init_from_checkpoint(args, exe, train_prog) save_load_io.init_from_checkpoint(args, exe, train_prog)
# init from some pretrain models, to better solve the current task # init from some pretrain models, to better solve the current task
if args.init_from_pretrain_model: if args.init_from_pretrain_model:
save_load_io.init_from_pretrain_model(args, exe, train_prog) save_load_io.init_from_pretrain_model(args, exe, train_prog)
build_strategy = fluid.compiler.BuildStrategy() build_strategy = fluid.compiler.BuildStrategy()
build_strategy.enable_inplace = True build_strategy.enable_inplace = True
compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel( compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy) loss_name=loss.name, build_strategy=build_strategy)
# start training # start training
steps = 0 steps = 0
time_begin = time.time() time_begin = time.time()
ce_info = [] ce_info = []
for epoch_step in range(args.epoch): for epoch_step in range(args.epoch):
data_reader.start() data_reader.start()
while True: while True:
try: try:
...@@ -216,43 +215,38 @@ def do_train(args): ...@@ -216,43 +215,38 @@ def do_train(args):
used_time = time_end - time_begin used_time = time_end - time_begin
current_time = time.strftime('%Y-%m-%d %H:%M:%S', current_time = time.strftime('%Y-%m-%d %H:%M:%S',
time.localtime(time.time())) time.localtime(time.time()))
if accuracy is not None: if accuracy is not None:
    print("%s epoch: %d, step: %d, ave loss: %f, "
          "ave acc: %f, speed: %f steps/s" %
          (current_time, epoch_step, steps,
           np.mean(np_loss), np.mean(np_acc),
           args.print_steps / used_time))
    ce_info.append([
        np.mean(np_loss), np.mean(np_acc),
        args.print_steps / used_time
    ])
else:
    print("%s epoch: %d, step: %d, ave loss: %f, "
          "speed: %f steps/s" %
          (current_time, epoch_step, steps,
           np.mean(np_loss), args.print_steps / used_time))
    ce_info.append(
        [np.mean(np_loss), args.print_steps / used_time])
time_begin = time.time() time_begin = time.time()
if steps % args.save_steps == 0: if steps % args.save_steps == 0:
save_path = "step_" + str(steps) save_path = "step_" + str(steps)
if args.save_checkpoint: if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog, save_path)
if args.save_param: if args.save_param:
save_load_io.save_param(args, exe, train_prog, save_path)
except fluid.core.EOFException:
data_reader.reset() data_reader.reset()
break break
if args.save_checkpoint: if args.save_checkpoint:
save_load_io.save_checkpoint(args, exe, train_prog, "step_final") save_load_io.save_checkpoint(args, exe, train_prog, "step_final")
if args.save_param: if args.save_param:
save_load_io.save_param(args, exe, train_prog, "step_final") save_load_io.save_param(args, exe, train_prog, "step_final")
...@@ -264,7 +258,7 @@ def do_train(args): ...@@ -264,7 +258,7 @@ def do_train(args):
if cards != '': if cards != '':
num = len(cards.split(",")) num = len(cards.split(","))
return num return num
if args.enable_ce: if args.enable_ce:
card_num = get_cards() card_num = get_cards()
print("test_card_num", card_num) print("test_card_num", card_num)
...@@ -283,8 +277,8 @@ def do_train(args): ...@@ -283,8 +277,8 @@ def do_train(args):
print("kpis\ttrain_acc_%s_card%s\t%f" % (task_name, card_num, ce_acc)) print("kpis\ttrain_acc_%s_card%s\t%f" % (task_name, card_num, ce_acc))
if __name__ == '__main__': if __name__ == '__main__':
args = PDConfig(yaml_file="./data/config/dgu.yaml") args = PDConfig(yaml_file="./data/config/dgu.yaml")
args.build() args.build()
args.Print() args.Print()
......
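The training loop in do_train above follows the standard non-iterable PyReader pattern. Condensed, and assuming a batch generator has been attached (decorate_batch_generator is the usual choice; the attach call itself is not shown in the hunks above), it looks like:

```python
# Condensed sketch of the PyReader loop used by do_train above.
data_reader.decorate_batch_generator(batch_generator)  # attach data source
for epoch_step in range(args.epoch):
    data_reader.start()                  # begin feeding this epoch
    while True:
        try:
            exe.run(compiled_train_prog, fetch_list=[loss.name])
        except fluid.core.EOFException:  # generator exhausted
            data_reader.reset()
            break
```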
...@@ -19,8 +19,7 @@ from __future__ import print_function ...@@ -19,8 +19,7 @@ from __future__ import print_function
import os import os
import sys import sys
sys.path.append("../") sys.path.append("../shared_modules/")
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy as np import numpy as np
......
...@@ -23,7 +23,7 @@ import os ...@@ -23,7 +23,7 @@ import os
import time import time
import multiprocessing import multiprocessing
import sys import sys
sys.path.append("../") sys.path.append("../shared_modules/")
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
......
...@@ -24,7 +24,7 @@ import time ...@@ -24,7 +24,7 @@ import time
import argparse import argparse
import multiprocessing import multiprocessing
import sys import sys
sys.path.append("../") sys.path.append("../shared_modules/")
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
......
...@@ -36,7 +36,7 @@ import sys ...@@ -36,7 +36,7 @@ import sys
if sys.version[0] == '2': if sys.version[0] == '2':
reload(sys) reload(sys)
sys.setdefaultencoding("utf-8") sys.setdefaultencoding("utf-8")
sys.path.append('../') sys.path.append('../shared_modules/')
import os import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
......
...@@ -26,7 +26,7 @@ from paddle.fluid.initializer import NormalInitializer ...@@ -26,7 +26,7 @@ from paddle.fluid.initializer import NormalInitializer
from reader import Dataset from reader import Dataset
from ernie_reader import SequenceLabelReader from ernie_reader import SequenceLabelReader
sys.path.append("..") sys.path.append("../shared_modules/")
from models.sequence_labeling import nets from models.sequence_labeling import nets
from models.representation.ernie import ernie_encoder, ernie_pyreader from models.representation.ernie import ernie_encoder, ernie_pyreader
...@@ -35,7 +35,8 @@ def create_model(args, vocab_size, num_labels, mode='train'): ...@@ -35,7 +35,8 @@ def create_model(args, vocab_size, num_labels, mode='train'):
"""create lac model""" """create lac model"""
# model's input data # model's input data
words = fluid.data(
    name='words', shape=[None, 1], dtype='int64', lod_level=1)
targets = fluid.data( targets = fluid.data(
name='targets', shape=[None, 1], dtype='int64', lod_level=1) name='targets', shape=[None, 1], dtype='int64', lod_level=1)
@@ -88,7 +89,8 @@ def create_pyreader(args,
                    return_reader=False,
                    mode='train'):
    # init reader
-    device_count = len(fluid.cuda_places()) if args.use_cuda else len(fluid.cpu_places())
+    device_count = len(fluid.cuda_places()) if args.use_cuda else len(
+        fluid.cpu_places())
    if model == 'lac':
        pyreader = fluid.io.DataLoader.from_generator(
@@ -107,14 +109,14 @@ def create_pyreader(args,
                    fluid.io.shuffle(
                        reader.file_reader(file_name),
                        buf_size=args.traindata_shuffle_buffer),
-                    batch_size=args.batch_size/device_count),
+                    batch_size=args.batch_size / device_count),
                places=place)
        else:
            pyreader.set_sample_list_generator(
                fluid.io.batch(
                    reader.file_reader(
                        file_name, mode=mode),
-                    batch_size=args.batch_size/device_count),
+                    batch_size=args.batch_size / device_count),
                places=place)
    elif model == 'ernie':
...
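The code above divides the global batch size by the device count because, with multiple `places`, each device consumes one mini-batch per step. A self-contained sketch of the same wiring, assuming the Paddle 1.7 `fluid.io` API; the reader and sizes are illustrative, not from the repo, and `//` is used deliberately since true division would yield a float batch size under Python 3:

```python
import paddle.fluid as fluid

use_cuda = False
places = fluid.cuda_places() if use_cuda else fluid.cpu_places()
device_count = len(places)

words = fluid.data(name='words', shape=[None, 1], dtype='int64', lod_level=1)
targets = fluid.data(name='targets', shape=[None, 1], dtype='int64', lod_level=1)

def sample_reader():
    # Hypothetical stand-in for reader.file_reader(...): yields one
    # (word_ids, label_ids) sample at a time.
    for i in range(100):
        yield [i % 7], [i % 3]

loader = fluid.io.DataLoader.from_generator(
    feed_list=[words, targets], capacity=16, iterable=True)
loader.set_sample_list_generator(
    fluid.io.batch(
        fluid.io.shuffle(sample_reader, buf_size=64),
        batch_size=32 // device_count),  # one mini-batch per device
    places=places)
```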
@@ -20,7 +20,7 @@ import sys
from collections import namedtuple
import numpy as np
-sys.path.append("..")
+sys.path.append("../shared_modules/")
from preprocess.ernie.task_reader import BaseReader, tokenization
...
@@ -24,7 +24,7 @@ import paddle
import utils
import reader
import creator
-sys.path.append('../models/')
+sys.path.append('../shared_modules/models/')
from model_check import check_cuda
from model_check import check_version
...
@@ -10,7 +10,7 @@ import paddle.fluid as fluid
import creator
import reader
import utils
-sys.path.append('../models/')
+sys.path.append('../shared_modules/models/')
from model_check import check_cuda
from model_check import check_version
...
@@ -24,7 +24,7 @@ import paddle
import utils
import reader
import creator
-sys.path.append('../models/')
+sys.path.append('../shared_modules/models/')
from model_check import check_cuda
from model_check import check_version
...
@@ -34,7 +34,7 @@ import paddle.fluid as fluid
import creator
import utils
-sys.path.append("..")
+sys.path.append("../shared_modules/")
from models.representation.ernie import ErnieConfig
from models.model_check import check_cuda
from models.model_check import check_version
@@ -187,8 +187,8 @@ def do_train(args):
                    end_time - start_time, train_pyreader.queue.size()))
            if steps % args.save_steps == 0:
-                save_path = os.path.join(args.model_save_dir, "step_" + str(steps),
-                                         "checkpoint")
+                save_path = os.path.join(args.model_save_dir,
+                                         "step_" + str(steps), "checkpoint")
                print("\tsaving model as %s" % (save_path))
                fluid.save(train_program, save_path)
@@ -196,9 +196,10 @@ def do_train(args):
        evaluate(exe, test_program, test_pyreader, train_ret)
    save_path = os.path.join(args.model_save_dir, "step_" + str(steps),
                             "checkpoint")
    fluid.save(train_program, save_path)
def do_eval(args):
    # init executor
    if args.use_cuda:
...
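Both hunks converge on the same checkpoint layout: one `step_<N>` directory per save step, with `fluid.save` writing the program state under a `checkpoint` prefix. A minimal sketch of the save/restore pair, assuming the Paddle 1.7 `fluid.save`/`fluid.load` API (the parameter/optimizer file suffixes are produced by that API, not shown in the diff):

```python
import os
import paddle.fluid as fluid

def save_step_checkpoint(program, model_save_dir, steps):
    # One directory per step; fluid.save appends its own file
    # suffixes to the "checkpoint" prefix.
    save_path = os.path.join(model_save_dir, "step_" + str(steps), "checkpoint")
    fluid.save(program, save_path)
    return save_path

def load_step_checkpoint(program, model_save_dir, steps, exe):
    # Counterpart restore, pointing at the same prefix.
    load_path = os.path.join(model_save_dir, "step_" + str(steps), "checkpoint")
    fluid.load(program, load_path, exe)
```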
@@ -29,7 +29,7 @@ import reader
import utils
import creator
from eval import test_process
-sys.path.append('../models/')
+sys.path.append('../shared_modules/models/')
from model_check import check_cuda
from model_check import check_version
@@ -151,8 +151,7 @@ def do_train(args):
            # save checkpoints
            if step % args.save_steps == 0 and step != 0:
                save_path = os.path.join(args.model_save_dir,
-                                         "step_" + str(step),
-                                         "checkpoint")
+                                         "step_" + str(step), "checkpoint")
                fluid.save(train_program, save_path)
            step += 1
...
@@ -14,12 +14,12 @@ DuReader is a large-scale Chinese reading comprehension dataset built from real applications with human-generated answers
- Answers are generated by humans
- Oriented to real application scenarios
- Annotations are richer and more detailed
More details about the DuReader dataset are available on the [DuReader official site](https://ai.baidu.com//broad/subordinate?dataset=dureader).
### The DuReader baseline system
The DuReader baseline system uses the [PaddlePaddle](http://paddlepaddle.org) deep learning framework to implement and upgrade a classic reading comprehension model, BiDAF, on the **DuReader reading comprehension dataset**.
## [KT-Net](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/ACL2019-KTNET)
@@ -30,7 +30,7 @@ KT-NET is Baidu NLP's pioneering approach to fusing language representation with knowledge representation
- Accepted as a long paper at ACL 2019 ([paper link](https://www.aclweb.org/anthology/P19-1226/))
KT-NET is also highly general: beyond machine reading comprehension, it helps with other language understanding tasks such as natural language inference, paraphrase identification, and semantic similarity.
## [D-NET](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET)
D-NET is a pretrain-and-fine-tune framework aimed at improving the **generalization ability of reading comprehension models**. Its features include:
@@ -39,4 +39,3 @@ D-NET is a pretrain-and-fine-tune framework aimed at improving the generalization ability of reading comprehension models
- A multi-task, multi-domain learning strategy in the fine-tuning stage (built on the [PALM](https://github.com/PaddlePaddle/PALM) multi-task learning framework), which effectively improves generalization across domains
Using D-NET, Baidu won the EMNLP 2019 [MRQA](https://mrqa.github.io/shared) international reading comprehension evaluation by nearly two percentage points over the runner-up, ranking first on 10 of the 12 test sets.
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
def get_input_descs(args):
    """
    Generate a dict mapping data fields to the corresponding data shapes and
@@ -42,11 +43,12 @@ def get_input_descs(args):
        # encoder.
        # The actual data shape of src_slf_attn_bias is:
        # [batch_size, n_head, max_src_len_in_batch, max_src_len_in_batch]
-        "src_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
+        "src_slf_attn_bias":
+        [(batch_size, n_head, seq_len, seq_len), "float32"],
        # The actual data shape of trg_word is:
        # [batch_size, max_trg_len_in_batch, 1]
        "trg_word": [(batch_size, seq_len), "int64",
                     2],  # lod_level is only used in fast decoder.
        # The actual data shape of trg_pos is:
        # [batch_size, max_trg_len_in_batch, 1]
        "trg_pos": [(batch_size, seq_len), "int64"],
@@ -54,12 +56,14 @@ def get_input_descs(args):
        # subsequent words in the decoder.
        # The actual data shape of trg_slf_attn_bias is:
        # [batch_size, n_head, max_trg_len_in_batch, max_trg_len_in_batch]
-        "trg_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
+        "trg_slf_attn_bias":
+        [(batch_size, n_head, seq_len, seq_len), "float32"],
        # This input is used to remove attention weights on paddings of the source
        # input in the encoder-decoder attention.
        # The actual data shape of trg_src_attn_bias is:
        # [batch_size, n_head, max_trg_len_in_batch, max_src_len_in_batch]
-        "trg_src_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
+        "trg_src_attn_bias":
+        [(batch_size, n_head, seq_len, seq_len), "float32"],
        # This input is used in independent decoder program for inference.
        # The actual data shape of enc_output is:
        # [batch_size, max_src_len_in_batch, d_model]
@@ -80,6 +84,7 @@ def get_input_descs(args):
    return input_descs
# Names of word embedding table which might be reused for weight sharing.
word_emb_param_names = (
    "src_word_emb_table",
...
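Each entry in `input_descs` pairs a shape tuple with a dtype and an optional LoD level. A minimal sketch, not taken from the repo, of how such descriptors can be materialized into `fluid.data` placeholders (assumes the shape symbols resolve to ints or `None` at this point):

```python
import paddle.fluid as fluid

def build_inputs(input_descs, field_names):
    # Turn (shape, dtype[, lod_level]) descriptors into graph inputs.
    inputs = []
    for name in field_names:
        desc = input_descs[name]
        shape, dtype = list(desc[0]), desc[1]
        lod_level = desc[2] if len(desc) > 2 else 0
        inputs.append(
            fluid.data(name=name, shape=shape, dtype=dtype,
                       lod_level=lod_level))
    return inputs
```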
@@ -87,13 +87,14 @@ def do_save_inference_model(args):
    # saving inference model
-    fluid.io.save_inference_model(args.inference_model_dir,
-                                  feeded_var_names=list(input_field_names),
-                                  target_vars=[out_ids, out_scores],
-                                  executor=exe,
-                                  main_program=test_prog,
-                                  model_filename="model.pdmodel",
-                                  params_filename="params.pdparams")
+    fluid.io.save_inference_model(
+        args.inference_model_dir,
+        feeded_var_names=list(input_field_names),
+        target_vars=[out_ids, out_scores],
+        executor=exe,
+        main_program=test_prog,
+        model_filename="model.pdmodel",
+        params_filename="params.pdparams")
    print("save inference model at %s" % (args.inference_model_dir))
...
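For completeness, a model saved this way is restored with the matching loader; a minimal sketch, assuming the same file names chosen above (the executor and directory are illustrative):

```python
import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
# model_filename/params_filename must match the values used at save time.
inference_program, feed_names, fetch_targets = fluid.io.load_inference_model(
    dirname="inference_model",
    executor=exe,
    model_filename="model.pdmodel",
    params_filename="params.pdparams")
```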
@@ -25,7 +25,6 @@ from train import do_train
from predict import do_predict
from inference_model import do_save_inference_model
if __name__ == "__main__":
    LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s"
    logging.basicConfig(
@@ -43,4 +42,4 @@ if __name__ == "__main__":
        do_predict(args)
    if args.do_save_inference_model:
        do_save_inference_model(args)
\ No newline at end of file
(This file diff is collapsed and not shown.)