diff --git a/PaddleNLP/dialogue_domain_classification/README.MD b/PaddleNLP/dialogue_domain_classification/README.MD
new file mode 100755
index 0000000000000000000000000000000000000000..664f858e1057dd6d64417a24e815d2fea8579959
--- /dev/null
+++ b/PaddleNLP/dialogue_domain_classification/README.MD
@@ -0,0 +1,223 @@
+# PaddleNLP: Dialogue Domain Classifier
+
+
+
+## Model Overview
+
+In dialogue applications, a complete dialogue capability is usually provided by semantic-parsing bots from several domains. A dialogue domain classifier routes incoming traffic to the bots of the relevant domains according to the needs of the business scenario. It saves machine resources, because a query is dispatched only to the bots of the domains it belongs to and useless bot calls are avoided; at the same time, its precise dispatching filters out irrelevant parses, which makes the final parsing result more accurate.
+
+
+
+
+## Quick Start
+**The model requires PaddlePaddle 1.6 or later, or a suitable develop build.**
+
+### 1. Install PaddlePaddle
+
+The training code is compatible with Python 2.7.x and Python 3.7.x, and depends on PaddlePaddle 1.6 under CentOS. For installation, please follow the official [installation guide](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/index_cn.html).
+
+Note: the model supports both CPU and GPU for training and prediction; install paddlepaddle-gpu or paddlepaddle according to your needs.
+
+> Warning: the GPU and CPU builds of PaddlePaddle are released as paddlepaddle-gpu and paddlepaddle respectively; be careful to install the right one.
+
+
+### 2. Get the Code
+
+Clone the models repository to your machine:
+
+```shell
+git clone https://github.com/PaddlePaddle/models.git
+cd models/PaddleNLP/dialogue_domain_classification
+```
+
+
+
+### 3. Data Preparation
+
+Part of the dataset involved in this project can be downloaded with the commands below. Running them creates a `data/input` directory containing the training set (train.txt), the development set (eval.txt), the test set (test.txt), the character dictionary (char.dict), the domain dictionary (domain.dict), and the model configuration file (model.conf).
+
+```shell
+mkdir -p data/input
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dialogue_domain_classification-dataset-1.0.0.tar.gz
+tar -zxvf dialogue_domain_classification-dataset-1.0.0.tar.gz -C ./data/input
+```
+
+**Data format**
+
+
+1. Data format
+
+The input and output files share the same format.
+
+Each line is: query \t domain_1 \002 domain_2 (multiple labels are separated by \002).
+
+The input data directory is specified by the `data_dir` argument:
+
+Training set: train.txt
+Development set: eval.txt
+Test set: test.txt
+
+The output directory is specified by the `save_dir` argument:
+
+Predictions on the test set are written to test.rst. (A parsing sketch is given after the configuration example below.)
+
+2. Model configuration
+
+The `config_path` argument points to the model configuration file, whose format is:
+```shell
+[model]
+emb_dim = 128
+win_sizes = [5, 5, 5]
+hid_dim = 160
+hid_dim2 = 160
+```
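+
+For reference, the following is a minimal Python sketch (not part of the released scripts) showing how a line in this data format and the configuration file above can be parsed; the file paths are the defaults described in this section:
+
+```python
+# -*- coding: utf-8 -*-
+# Minimal parsing sketch; assumes the default file layout described above.
+import codecs
+try:
+    import configparser as cp   # Python 3
+except ImportError:
+    import ConfigParser as cp   # Python 2
+
+
+def parse_line(line):
+    """Split one sample into (query, [domain labels])."""
+    query, labels = line.rstrip("\n").split("\t")
+    return query, labels.split("\002")   # multiple labels are separated by \002
+
+
+conf = cp.ConfigParser()
+conf.read("./data/input/model.conf")
+print(dict(conf.items("model")))   # e.g. {'emb_dim': '128', 'win_sizes': '[5, 5, 5]', ...}
+
+with codecs.open("./data/input/train.txt", "r", encoding="utf8") as fin:
+    for line in fin:
+        query, domains = parse_line(line)
+```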
+
+
+
+### 4. Download the Pretrained Model
+
+For the five domains "phone call, weather, train ticket booking, flight ticket booking, music", we release a dialogue domain classification model trained with a CharCNN. It can be downloaded with:
+
+```shell
+mkdir -p model
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dialogue_domain_classification-model-1.0.0.tar.gz
+tar -zxvf dialogue_domain_classification-model-1.0.0.tar.gz -C ./model
+```
+
+### 5. Script Arguments
+
+Run the following command to list the arguments of the entry script together with their descriptions:
+ `export PATH="/path/to/your/python:$PATH"; python run_classifier.py --help `
+
+```shell
+1. Model arguments
+--init_checkpoint # checkpoint to warm-start from. Default: None.
+--checkpoints # directory in which checkpoints are saved. Default: ./checkpoints.
+--config_path # model configuration file. Default: ./data/input/model.conf.
+--build_dict # whether to build the char and domain dictionaries from the training data. Default: False.
+
+2. Training arguments
+--epoch # number of training epochs. Default: 100.
+--learning_rate # learning rate. Default: 0.1.
+--save_steps # save a checkpoint every x steps. Default: 1000.
+--validation_steps # evaluate the model on the dev set every x steps. Default: 100.
+--random_seed # random seed. Default: 7.
+--threshold # per-domain confidence threshold; a domain label is predicted when its confidence exceeds it. Default: 0.1.
+--cpu_num # number of threads for CPU training (only effective when use_cuda=False). Default: 3.
+
+3. Logging
+--skip_steps # print the training loss every x steps. Default: 10.
+
+4. Data
+--data_dir # dataset directory, containing train.txt (training set), eval.txt (dev set) and test.txt (test set). Default: ./data/input/
+--save_dir # output directory. Default: ./data/output/
+--max_seq_len # maximum sentence length; longer sentences are truncated. Default: 50.
+--batch_size # batch size. Default: 64.
+
+5. Run configuration
+--use_cuda # whether to use GPU. Default: False.
+--do_train # whether to run training. Default: True.
+--do_eval # whether to run evaluation. Default: True.
+--do_test # whether to run testing. Default: True.
+```
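+
+The boolean flags above (`--use_cuda`, `--do_train`, ...) cannot be handled by argparse's built-in `bool` type, which is why `utils.py` defines `str2bool` and registers every flag through the `ArgumentGroup` helper. Below is a minimal sketch of the same pattern; the group title is only illustrative:
+
+```python
+import argparse
+
+
+def str2bool(v):
+    # argparse would treat bool("False") as True, so parse the string explicitly
+    return v.lower() in ("true", "t", "1")
+
+
+parser = argparse.ArgumentParser()
+run_group = parser.add_argument_group(title="run", description="running options")
+run_group.add_argument("--use_cuda", type=str2bool, default=False,
+                       help="Whether to use GPU. Default: %(default)s.")
+
+args = parser.parse_args(["--use_cuda", "False"])
+print(args.use_cuda)   # False
+```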
+
+
+
+### 6. Model Training
+
+With a training set and a development set built from the sample data, run the command below to train the model and validate it on the development set:
+
+```
+sh run.sh train
+```
+
+> Warning 1: see the `run.sh` script and the **Script Arguments** in Section 5 to change the default arguments.
+
+> Warning 2: with multi-threaded CPU or multi-GPU training, each step feeds one batch to every CPU thread or GPU card, so the effective batch size is `batch_size` times the number of threads or cards.
+
+
+### 7. Model Evaluation
+
+Given the pretrained model and the data, run the command below to evaluate the trained model on the development set (eval.txt):
+
+```
+sh run.sh eval
+```
+
+> Warning: see the `run.sh` script and the **Script Arguments** in Section 5 to change the default arguments.
+
+### 8. Model Inference
+
+```
+sh run.sh test
+```
+> Warning: see the `run.sh` script and the **Script Arguments** in Section 5 to change the default arguments.
+
+
+
+## Advanced Usage
+
+
+
+### 1. Task Definition and Modeling
+
+In real-world applications, a semantic parsing service is usually composed of semantic-parsing bots for several different domains, so that it can cover multiple scenarios at once, e.g. a dialogue bot that can check the weather, play music, and look up stocks.
+
+At the same time, user queries come in many forms and are often ambiguous. For example, the query `“下雨了”` ("it's raining") belongs both to the `weather` domain and to the `music` domain (it is the title of a song by the singer Xue Zhiqian). A common way to handle such ambiguity is to "broadcast" the query, i.e. to send it to every semantic-parsing bot at the same time and then coarsely rank the returned parses to obtain the final result.
+
+The dialogue domain classifier handles the case where a single query hits several domains at once: based on its output, the query is dispatched only to the bots of the relevant domains. This avoids the waste caused by broadcast-style calls and saves a large amount of machine resources, and it also improves the accuracy of the final, coarsely ranked parsing result.
+
+
+The dialogue domain classification model solves a multi-label classification problem: the user's text is the model input, and the classifier predicts a confidence score for every label; the resulting multi-label output is then used to dispatch the query accordingly.
+
+
+
+### 2. Model Architecture
+
+The overall structure of the dialogue domain classifier is shown in the figure below. The user input is vectorized by the `input layer` and fed into the `classifier model`; the final output is a multi-label result `[label_1, label_2, ..., label_n]` of dimension `n` (the training data defines `n-1` domains, one label per domain, plus one extra background label for input that belongs to none of the training domains).
+
+Each `label_i` is a probability between 0 and 1 that the input belongs to the `i`-th domain, and the label probabilities do not have to sum to 1. A threshold can then be applied to each label probability to obtain the multi-label prediction, as in the sketch after the figure.
+
+![net](./imgs/nets.png)
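+
+Concretely, the multi-label decision thresholds each sigmoid output independently; the sketch below (plain Python, mirroring the `thresholded_relu` + `ceil` step in `nets.py` and the post-processing in `run_classifier.py`, with the default `--threshold` of 0.1) is for illustration only:
+
+```python
+def multi_label_decision(probs, threshold=0.1):
+    """probs: per-label sigmoid outputs; index 0 is the background label."""
+    labels = [i for i in range(1, len(probs)) if probs[i] > threshold]
+    if not labels:   # no domain exceeds the threshold -> fall back to the background label
+        labels = [0]
+    return labels
+
+
+# e.g. a 4-dimensional output: background, weather, music, phone call
+print(multi_label_decision([0.02, 0.83, 0.46, 0.01]))   # -> [1, 2]
+```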
+
+**Evaluation metrics**
+
+As in a conventional binary classification task, precision, recall, and the F1 score are used to evaluate the model.
+
+
+![fuction](./imgs/function.png)
+
+
+
+**Definition of positive and negative samples in this project**
+
+In this multi-label task, samples are divided into positive (Pos) and negative (Neg) samples. A sample that carries at least one domain label, i.e. that has to be dispatched to at least one bot for parsing, is a positive sample; conversely, a sample that carries no domain label does not need to be dispatched and is a negative sample.
+
+The dialogue domain classifier reduces machine-resource consumption while preserving the original parsing quality; that is, it raises precision as much as possible while keeping the recall of positive samples.
+
+
+**Definition of a correct prediction in this project**
+1. If the `ground truth` contains no domain label, the prediction is correct when the `predicted result` contains no domain label either.
+2. If the `ground truth` contains domain labels, the prediction is correct when the `predicted result` contains all of them (i.e. the predicted labels are a superset of the ground-truth labels).
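+
+Under these definitions, precision, recall, F1, and accuracy can be computed as in the sketch below (consistent with `get_score` in `run_classifier.py`); a positive sample counts as a true positive only when the prediction covers all of its ground-truth labels:
+
+```python
+def get_metrics(pred_labels_list, gold_labels_list):
+    """Each argument is a list with one label set per sample."""
+    tp = pred_pos = pos = correct = 0
+    total = len(gold_labels_list)
+    for pred, gold in zip(pred_labels_list, gold_labels_list):
+        if pred:                                    # predicted at least one domain
+            pred_pos += 1
+        if gold:                                    # positive sample
+            pos += 1
+            if set(gold).issubset(set(pred)):       # prediction covers the ground truth
+                tp += 1
+                correct += 1
+        elif not pred:                              # negative sample predicted as negative
+            correct += 1
+    precision = tp * 1.0 / pred_pos if pred_pos else 0.0
+    recall = tp * 1.0 / pos if pos else 0.0
+    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
+    acc = correct * 1.0 / total if total else 0.0
+    return precision, recall, f1, acc
+
+
+print(get_metrics([["weather"], [], ["music", "weather"]],
+                  [["weather"], [], ["music"]]))    # (1.0, 1.0, 1.0, 1.0)
+```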
+
+
+
+
+### 3. Code Structure
+
+```
+├── run_classifier.py: entry point; wraps training, prediction, and evaluation
+├── nets.py: network structures used by the model
+├── utils.py: common helper functions
+├── run.sh: demo script that launches the entry point
+```
+
+
+### 4. Building Your Own Model
+You can build a customized model for your own needs as follows:
+
+1. Define your own dialogue domain model by adding a network structure to [nets.py](./nets.py); a minimal sketch is given at the end of this section.
+
+2. Prepare your own domain dialogue data following the data format described in **Section 3 Data Preparation**.
+
+3. For model training, evaluation, and prediction, modify the model path, data path, dictionary path, and other arguments in [run.sh](./run.sh); see the **Script Arguments** in Section 5 for details.
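+
+As an illustration only, the sketch below defines a hypothetical bag-of-words network with the same interface as `textcnn_net_multi_label`, so that `create_net` in `run_classifier.py` could call it in the same way; the function name and the pooling choice are assumptions, not part of the released code:
+
+```python
+import paddle.fluid as fluid
+
+
+def bow_net_multi_label(data, label, dict_dim, emb_dim=128, hid_dim=128,
+                        hid_dim2=96, class_dim=2, win_sizes=None,
+                        is_infer=False, threshold=0.5, max_seq_len=100):
+    """Hypothetical bag-of-words alternative; win_sizes and hid_dim are unused here."""
+    # data has shape [-1, max_seq_len, 1]; the embedding appends an emb_dim axis
+    emb = fluid.embedding(input=data, size=[dict_dim, emb_dim])
+    # average the character embeddings over the sequence dimension
+    pooled = fluid.layers.reduce_mean(emb, dim=1)
+    fc_1 = fluid.layers.fc(input=pooled, size=hid_dim2, act="relu")
+    logits = fluid.layers.fc(input=fc_1, size=class_dim, act=None)
+    prediction = fluid.layers.sigmoid(logits)
+    if is_infer:
+        return prediction
+    cost = fluid.layers.sigmoid_cross_entropy_with_logits(x=logits, label=label)
+    avg_cost = fluid.layers.mean(x=cost)
+    pred_label = fluid.layers.ceil(fluid.layers.thresholded_relu(prediction, threshold))
+    return [avg_cost, prediction, pred_label, label]
+```
+
+To try such a network, register its name alongside `"textcnn_net"` in the `model_name` branch of `create_net` in `run_classifier.py`.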
diff --git a/PaddleNLP/dialogue_domain_classification/imgs/function.png b/PaddleNLP/dialogue_domain_classification/imgs/function.png
new file mode 100755
index 0000000000000000000000000000000000000000..40a236d2dc681ec79ecad68303fbc4ff081db56b
Binary files /dev/null and b/PaddleNLP/dialogue_domain_classification/imgs/function.png differ
diff --git a/PaddleNLP/dialogue_domain_classification/imgs/nets.png b/PaddleNLP/dialogue_domain_classification/imgs/nets.png
new file mode 100755
index 0000000000000000000000000000000000000000..812003b1fc4b181b33395f18902742929e6f522d
Binary files /dev/null and b/PaddleNLP/dialogue_domain_classification/imgs/nets.png differ
diff --git a/PaddleNLP/dialogue_domain_classification/nets.py b/PaddleNLP/dialogue_domain_classification/nets.py
new file mode 100755
index 0000000000000000000000000000000000000000..77912b3b0cda2fcda2f6e264478b909e3472ff77
--- /dev/null
+++ b/PaddleNLP/dialogue_domain_classification/nets.py
@@ -0,0 +1,96 @@
+"""
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+
+import paddle.fluid as fluid
+import paddle
+
+def textcnn_net_multi_label(data,
+ label,
+ dict_dim,
+ emb_dim=128,
+ hid_dim=128,
+ hid_dim2=96,
+ class_dim=2,
+ win_sizes=None,
+ is_infer=False,
+ threshold=0.5,
+ max_seq_len=100):
+ """
+    Multi-label TextCNN network: returns prediction if is_infer, otherwise [avg_cost, prediction, pred_label, label].
+ """
+ init_bound = 0.1
+ initializer = fluid.initializer.Uniform(low=-init_bound, high=init_bound)
+ #gradient_clip = fluid.clip.GradientClipByNorm(10.0)
+ gradient_clip = None
+ regularizer = fluid.regularizer.L2DecayRegularizer(
+ regularization_coeff=1e-4)
+ seg_param_attrs = fluid.ParamAttr(name="seg_weight",
+ learning_rate=640.0,
+ initializer=initializer,
+ gradient_clip=gradient_clip,
+ trainable=True)
+ fc_param_attrs_1 = fluid.ParamAttr(name="fc_weight_1",
+ learning_rate=1.0,
+ regularizer=regularizer,
+ initializer=initializer,
+ gradient_clip=gradient_clip,
+ trainable=True)
+ fc_param_attrs_2 = fluid.ParamAttr(name="fc_weight_2",
+ learning_rate=1.0,
+ regularizer=regularizer,
+ initializer=initializer,
+ gradient_clip=gradient_clip,
+ trainable=True)
+
+ if win_sizes is None:
+ win_sizes = [1, 2, 3]
+
+ # embedding layer
+
+ emb = fluid.embedding(input=data, size=[dict_dim, emb_dim], param_attr=seg_param_attrs)
+
+ # convolution layer
+    # reshape to NCHW so conv2d can slide windows over the character sequence
+    emb = fluid.layers.reshape(x=emb, shape=[-1, 1, max_seq_len, emb_dim], inplace=True)
+    convs = []
+    for cnt, win_size in enumerate(win_sizes):
+ filter_size = (win_size, emb_dim)
+ cnn_param_attrs = fluid.ParamAttr(name="cnn_weight" + str(cnt),
+ learning_rate=1.0,
+ regularizer=regularizer,
+ initializer=initializer,
+ trainable=True)
+ conv_out = fluid.layers.conv2d(input=emb, num_filters=hid_dim, filter_size=filter_size, act="relu", \
+ param_attr=cnn_param_attrs)
+ pool_out = fluid.layers.pool2d(
+ input=conv_out,
+ pool_type='max',
+ pool_stride=1,
+ global_pooling=True)
+ convs.append(pool_out)
+ convs_out = fluid.layers.concat(input=convs, axis=1)
+
+    # fully connected layer over the concatenated multi-window features
+    fc_1 = fluid.layers.fc(input=[convs_out], size=hid_dim2, act=None, param_attr=fc_param_attrs_1)
+    # project to class_dim logits; sigmoid yields an independent probability per label
+ fc_2 = fluid.layers.fc(input=[fc_1], size=class_dim, act=None, param_attr=fc_param_attrs_2)
+ prediction = fluid.layers.sigmoid(fc_2)
+ if is_infer:
+ return prediction
+
+ cost = fluid.layers.sigmoid_cross_entropy_with_logits(x=fc_2, label=label)
+ avg_cost = fluid.layers.mean(x=cost)
+ pred_label = fluid.layers.ceil(fluid.layers.thresholded_relu(prediction, threshold))
+ return [avg_cost, prediction, pred_label, label]
diff --git a/PaddleNLP/dialogue_domain_classification/run.sh b/PaddleNLP/dialogue_domain_classification/run.sh
new file mode 100755
index 0000000000000000000000000000000000000000..81efc2b64fb417435ff1bee10d52ba10f1138e5e
--- /dev/null
+++ b/PaddleNLP/dialogue_domain_classification/run.sh
@@ -0,0 +1,118 @@
+export PATH="/home/guohongjie/tmp/paddle/paddle_release_home/python/bin/:$PATH"
+
+
+
+
+
+# CPU setting
+:< 0:
+ pred_pos_num += 1
+ if len(actual_labels) > 0:
+ pos_num += 1
+ if set(actual_labels).issubset(set(pred_labels)):
+ tp += 1
+ true_cnt += 1
+ elif len(pred_labels) == 0 and len(actual_labels) == 0:
+ true_cnt += 1
+ try:
+ precision = tp * 1.0 / pred_pos_num
+ recall = tp * 1.0 / pos_num
+ f1 = 2 * precision * recall / (recall + precision)
+ except Exception as e:
+ precision = 0
+ recall = 0
+ f1 = 0
+ acc = true_cnt * 1.0 / total
+ logger.info("tp, pred_pos_num, pos_num, total")
+ logger.info("%d, %d, %d, %d" % (tp, pred_pos_num, pos_num, total))
+ logger.info("%s result is : precision is %f, recall is %f, f1_score is %f, acc is %f" % (eval_phase, precision, \
+ recall, f1, acc))
+
+
+def train(args, train_exe, compiled_prog, build_res, place):
+ """[train the net]
+
+ Arguments:
+ args {[type]} -- [description]
+ train_exe {[type]} -- [description]
+ compiled_prog{[type]} -- [description]
+ build_res {[type]} -- [description]
+ place {[type]} -- [description]
+ """
+ global DEV_COUNT
+ cost = build_res["cost"]
+ prediction = build_res["prediction"]
+ pred_label = build_res["pred_label"]
+ label = build_res["label"]
+ fetch_list = [cost.name, prediction.name, pred_label.name, label.name]
+ train_pyreader = build_res["train_pyreader"]
+ train_prog = build_res["train_prog"]
+ steps = 0
+ time_begin = time.time()
+ test_exe = train_exe
+ logger.info("Begin training")
+ feed_data = []
+ for i in range(args.epoch):
+ try:
+ for data in train_pyreader():
+ feed_data.extend(data)
+ if len(feed_data) == DEV_COUNT:
+ avg_cost_np, avg_pred_np, pred_label, label = train_exe.run(feed=feed_data, program=compiled_prog, \
+ fetch_list=fetch_list)
+ feed_data = []
+ steps += 1
+ if steps % int(args.skip_steps) == 0:
+ time_end = time.time()
+ used_time = time_end - time_begin
+ get_score(pred_label, label, eval_phase = "Train")
+ logger.info('loss is {}'.format(avg_cost_np))
+ logger.info("epoch: %d, step: %d, speed: %f steps/s" % (i, steps, args.skip_steps / used_time))
+ time_begin = time.time()
+ if steps % args.save_steps == 0:
+ save_path = os.path.join(args.checkpoints,
+ "step_" + str(steps))
+ fluid.io.save_persistables(train_exe, save_path, train_prog)
+ logger.info("[save]step %d : save at %s" % (steps, save_path))
+ if steps % args.validation_steps == 0:
+ if args.do_eval:
+ evaluate(args, test_exe, build_res["eval_prog"], build_res, place, "eval")
+ if args.do_test:
+ evaluate(args, test_exe, build_res["test_prog"], build_res, place, "test")
+ except Exception as e:
+ logger.exception(str(e))
+ logger.error("Train error : %s" % str(e))
+ exit(1)
+ save_path = os.path.join(args.checkpoints, "step_" + str(steps))
+ fluid.io.save_persistables(train_exe, save_path, train_prog)
+ logger.info("[save]step %d : save at %s" % (steps, save_path))
+
+
+def evaluate(args, test_exe, test_prog, build_res, place, eval_phase, save_result=False, id2intent=None):
+ """[evaluate on dev/test dataset]
+
+ Arguments:
+ args {[type]} -- [description]
+ test_exe {[type]} -- [description]
+ test_prog {[type]} -- [description]
+ build_res {[type]} -- [description]
+ place {[type]} -- [description]
+ eval_phase {[type]} -- [description]
+
+ Keyword Arguments:
+ threshold {float} -- [description] (default: {0.5})
+ save_result {bool} -- [description] (default: {False})
+ id2intent {[type]} -- [description] (default: {None})
+ """
+ threshold = args.threshold
+ cost = build_res["cost"]
+ prediction = build_res["prediction"]
+ pred_label = build_res["pred_label"]
+ label = build_res["label"]
+ fetch_list = [cost.name, prediction.name, pred_label.name, label.name]
+ total_cost, total_acc, pred_prob_list, pred_label_list, label_list = [], [], [], [], []
+ if eval_phase == "eval":
+ test_pyreader = build_res["eval_pyreader"]
+ elif eval_phase == "test":
+ test_pyreader = build_res["test_pyreader"]
+ else:
+ exit(1)
+ logger.info("-----------------------------------------------------------")
+ for data in test_pyreader():
+        avg_cost_np, avg_pred_np, pred_label, label = test_exe.run(program=test_prog, fetch_list=fetch_list, feed=data, \
+ return_numpy=True)
+ total_cost.append(avg_cost_np)
+ pred_prob_list.extend(avg_pred_np)
+ pred_label_list.extend(pred_label)
+ label_list.extend(label)
+
+ if save_result:
+ logger.info("save result at : %s" % args.save_dir + "/" + eval_phase + ".rst")
+ save_dir = args.save_dir
+ if not os.path.exists(save_dir):
+ logger.warning("save dir not exists, and create it")
+ os.makedirs(save_dir)
+ fin = codecs.open(os.path.join(args.data_dir, eval_phase + ".txt"), "r", encoding="utf8")
+ fout = codecs.open(args.save_dir + "/" + eval_phase + ".rst", "w", encoding="utf8")
+ for line in pred_prob_list:
+ query = fin.readline().rsplit("\t", 1)[0]
+ res = []
+ for i in range(1, len(line)):
+ if line[i] > threshold:
+ #res.append(id2intent[i]+":"+str(line[i]))
+ res.append(id2intent[i])
+ if len(res) == 0:
+ res.append(id2intent[0])
+ fout.write("%s\t%s\n" % (query, "\2".join(sorted(res))))
+ fout.close()
+ fin.close()
+
+ logger.info("[%s] result: " % eval_phase)
+ get_score(pred_label_list, label_list, eval_phase)
+ logger.info('loss is {}'.format(sum(total_cost) * 1.0 / len(total_cost)))
+ logger.info("-----------------------------------------------------------")
+
+
+
+def create_net(args, flow_data, class_dim, dict_dim, place, model_name="textcnn_net", is_infer=False):
+ """[create network and pyreader]
+
+ Arguments:
+ flow_data {[type]} -- [description]
+ class_dim {[type]} -- [description]
+ dict_dim {[type]} -- [description]
+ place {[type]} -- [description]
+
+ Keyword Arguments:
+ model_name {str} -- [description] (default: {"textcnn_net"})
+ is_infer {bool} -- [description] (default: {False})
+
+ Returns:
+ [type] -- [description]
+ """
+ if model_name == "textcnn_net":
+ model = textcnn_net_multi_label
+ else:
+ return
+ char_list = fluid.data(name="char", shape=[None, args.max_seq_len, 1], dtype="int64", lod_level=0)
+ label = fluid.data(name="label", shape=[None, class_dim], dtype="float32", lod_level=0) # label data
+ reader = fluid.io.PyReader(feed_list=[char_list, label], capacity=args.batch_size * 10, iterable=True, \
+ return_list=False)
+ output = model(char_list, label, dict_dim,
+ emb_dim=flow_data["model"]["emb_dim"],
+ hid_dim=flow_data["model"]["hid_dim"],
+ hid_dim2=flow_data["model"]["hid_dim2"],
+ class_dim=class_dim,
+ win_sizes=flow_data["model"]["win_sizes"],
+ is_infer=is_infer,
+ threshold=args.threshold,
+ max_seq_len=args.max_seq_len)
+ if is_infer:
+ prediction = output
+ return [reader, prediction]
+ else:
+ avg_cost, prediction, pred_label, label = output[0], output[1], output[2], output[3]
+ return [reader, avg_cost, prediction, pred_label, label]
+
+
+def build_data_reader(args, char_dict, intent_dict):
+ """[decorate samples for pyreader]
+
+ Arguments:
+ args {[type]} -- [description]
+ char_dict {[type]} -- [description]
+ intent_dict {[type]} -- [description]
+
+ Returns:
+ [type] -- [description]
+ """
+ reader_res = {}
+ if args.do_train:
+ train_processor = DataReader(char_dict, intent_dict, args.max_seq_len)
+ train_data_generator = train_processor.prepare_data(
+ data_path=args.data_dir + "train.txt",
+ batch_size=args.batch_size,
+ mode='train')
+ reader_res["train_data_generator"] = train_data_generator
+ num_train_examples = train_processor._get_num_examples()
+ logger.info("Num train examples: %d" % num_train_examples)
+ logger.info("Num train steps: %d" % (math.ceil(num_train_examples * 1.0 / args.batch_size) * \
+ args.epoch // DEV_COUNT))
+ if math.ceil(num_train_examples * 1.0 / args.batch_size) // DEV_COUNT <= 0:
+ logger.error("Num of train steps is less than 0 or equals to 0, exit")
+ exit(1)
+ if args.do_eval:
+ eval_processor = DataReader(char_dict, intent_dict, args.max_seq_len)
+ eval_data_generator = eval_processor.prepare_data(
+ data_path=args.data_dir + "eval.txt",
+ batch_size=args.batch_size,
+ mode='eval')
+ reader_res["eval_data_generator"] = eval_data_generator
+ num_eval_examples = eval_processor._get_num_examples()
+ logger.info("Num eval examples: %d" % num_eval_examples)
+ if args.do_test:
+ test_processor = DataReader(char_dict, intent_dict, args.max_seq_len)
+ test_data_generator = test_processor.prepare_data(
+ data_path=args.data_dir + "test.txt",
+ batch_size=args.batch_size,
+ mode='test')
+ reader_res["test_data_generator"] = test_data_generator
+ return reader_res
+
+
+def build_graph(args, model_config, num_labels, dict_dim, place, reader_res):
+ """[build paddle graph]
+
+ Arguments:
+ args {[type]} -- [description]
+ model_config {[type]} -- [description]
+ num_labels {[type]} -- [description]
+ dict_dim {[type]} -- [description]
+ place {[type]} -- [description]
+ reader_res {[type]} -- [description]
+
+ Returns:
+ [type] -- [description]
+ """
+ res = {}
+ cost, prediction, pred_label, label = None, None, None, None
+ train_prog = fluid.default_main_program()
+
+ startup_prog = fluid.default_startup_program()
+ eval_prog = train_prog.clone(for_test=True)
+ test_prog = train_prog.clone(for_test=True)
+ train_prog.random_seed = args.random_seed
+ startup_prog.random_seed = args.random_seed
+ if args.do_train:
+ with fluid.program_guard(train_prog, startup_prog):
+ with fluid.unique_name.guard():
+ train_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \
+ dict_dim, place, model_name="textcnn_net")
+ train_pyreader.decorate_sample_list_generator(reader_res['train_data_generator'], places=place)
+ res["train_pyreader"] = train_pyreader
+ sgd_optimizer = fluid.optimizer.SGD(learning_rate=fluid.layers.exponential_decay(
+ learning_rate=args.learning_rate, decay_steps=1000, decay_rate=0.5, staircase=True))
+ sgd_optimizer.minimize(cost)
+ if args.do_eval:
+ with fluid.program_guard(eval_prog, startup_prog):
+ with fluid.unique_name.guard():
+ eval_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \
+ dict_dim, place, model_name="textcnn_net")
+ eval_pyreader.decorate_sample_list_generator(reader_res['eval_data_generator'], places=place)
+ res["eval_pyreader"] = eval_pyreader
+ if args.do_test:
+ with fluid.program_guard(test_prog, startup_prog):
+ with fluid.unique_name.guard():
+ test_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \
+ dict_dim, place, model_name="textcnn_net")
+ test_pyreader.decorate_sample_list_generator(reader_res['test_data_generator'], places=place)
+ res["test_pyreader"] = test_pyreader
+ res["cost"] = cost
+ res["prediction"] = prediction
+ res["label"] = label
+ res["pred_label"] = pred_label
+ res["train_prog"] =train_prog
+ res["eval_prog"] = eval_prog
+ res["test_prog"] = test_prog
+
+
+ return res
+
+
+def main(args):
+ """
+ Main Function
+ """
+ global DEV_COUNT
+ startup_prog = fluid.default_startup_program()
+ random.seed(args.random_seed)
+ model_config = ConfigReader.read_conf(args.config_path)
+ if args.use_cuda:
+ place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
+ DEV_COUNT = fluid.core.get_cuda_device_count()
+ else:
+ place = fluid.CPUPlace()
+ os.environ['CPU_NUM'] = str(args.cpu_num)
+ DEV_COUNT = args.cpu_num
+ logger.info("Dev Num is %s" % str(DEV_COUNT))
+ exe = fluid.Executor(place)
+ if args.do_train and args.build_dict:
+ DataProcesser.build_dict(args.data_dir + "train.txt", args.data_dir)
+ # read dict
+ char_dict = DataProcesser.read_dict(args.data_dir + "char.dict")
+ dict_dim = len(char_dict)
+ intent_dict = DataProcesser.read_dict(args.data_dir + "domain.dict")
+ id2intent = {}
+ for key, value in intent_dict.items():
+ id2intent[int(value)] = key
+ num_labels = len(intent_dict)
+ # build model
+ reader_res = build_data_reader(args, char_dict, intent_dict)
+ build_res = build_graph(args, model_config, num_labels, dict_dim, place, reader_res)
+ if not (args.do_train or args.do_eval or args.do_test):
+ raise ValueError("For args `do_train`, `do_eval` and `do_test`, at "
+ "least one of them must be True.")
+
+ exe.run(startup_prog)
+ if args.init_checkpoint and args.init_checkpoint != "None":
+ try:
+ init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog)
+ logger.info("Load model from %s" % args.init_checkpoint)
+ except Exception as e:
+ logger.exception(str(e))
+ logger.error("Faild load model from %s [%s]" % (args.init_checkpoint, str(e)))
+
+ if args.do_train:
+ build_strategy = fluid.compiler.BuildStrategy()
+ compiled_prog = fluid.compiler.CompiledProgram(build_res["train_prog"]).with_data_parallel( \
+ loss_name=build_res["cost"].name, build_strategy=build_strategy)
+ build_res["compiled_prog"] = compiled_prog
+ train(args, exe, compiled_prog, build_res, place)
+    if args.do_eval:
+        evaluate(args, exe, build_res["eval_prog"], build_res, place, "eval", \
+ save_result=True, id2intent=id2intent)
+    if args.do_test:
+        evaluate(args, exe, build_res["test_prog"], build_res, place, "test", \
+ save_result=True, id2intent=id2intent)
+
+
+
+
+if __name__ == "__main__":
+ logger.info("the paddle version is %s" % paddle.__version__)
+ check_version('1.6.0')
+ print_arguments(args)
+ main(args)
diff --git a/PaddleNLP/dialogue_domain_classification/utils.py b/PaddleNLP/dialogue_domain_classification/utils.py
new file mode 100755
index 0000000000000000000000000000000000000000..2c839a2ccc605fae3c602f241586fda2838fea15
--- /dev/null
+++ b/PaddleNLP/dialogue_domain_classification/utils.py
@@ -0,0 +1,354 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+"""
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+
+from __future__ import unicode_literals
+import sys
+import os
+import random
+import paddle
+import logging
+import paddle.fluid as fluid
+import numpy as np
+import collections
+import six
+import codecs
+try:
+ import configparser as cp
+except ImportError:
+ import ConfigParser as cp
+
+
+random_seed = 7
+logger = logging.getLogger()
+format = "%(asctime)s - %(name)s - %(levelname)s -%(filename)s-%(lineno)4d -%(message)s"
+# format = "%(levelname)8s: %(asctime)s: %(filename)s:%(lineno)4d %(message)s"
+logging.basicConfig(format=format)
+logger.setLevel(logging.INFO)
+logger = logging.getLogger('Paddle-DDC')
+
+
+def str2bool(v):
+ """[ because argparse does not support to parse "true, False" as python
+ boolean directly]
+ Arguments:
+ v {[type]} -- [description]
+ Returns:
+ [type] -- [description]
+ """
+ return v.lower() in ("true", "t", "1")
+
+
+def to_lodtensor(data, place):
+ """
+    Convert a batch of id sequences into a fluid.LoDTensor on the given place.
+ """
+ seq_lens = [len(seq) for seq in data]
+ cur_len = 0
+ lod = [cur_len]
+ for l in seq_lens:
+ cur_len += l
+ lod.append(cur_len)
+ flattened_data = np.concatenate(data, axis=0).astype("int64")
+ flattened_data = flattened_data.reshape([len(flattened_data), 1])
+ res = fluid.LoDTensor()
+ res.set(flattened_data, place)
+ res.set_lod([lod])
+ return res
+
+
+class ArgumentGroup(object):
+ """[ArgumentGroup]
+
+ Arguments:
+ object {[type]} -- [description]
+ """
+ def __init__(self, parser, title, des):
+ self._group = parser.add_argument_group(title=title, description=des)
+
+ def add_arg(self, name, type, default, help, **kwargs):
+ """[add_arg]
+
+ Arguments:
+ name {[type]} -- [description]
+ type {[type]} -- [description]
+ default {[type]} -- [description]
+ help {[type]} -- [description]
+ """
+ type = str2bool if type == bool else type
+ self._group.add_argument(
+ "--" + name,
+ default=default,
+ type=type,
+ help=help + ' Default: %(default)s.',
+ **kwargs)
+
+
+class DataReader(object):
+ """[get data generator for dataset]
+
+ Arguments:
+ object {[type]} -- [description]
+
+ Returns:
+ [type] -- [description]
+ """
+ def __init__(self, char_vocab, intent_dict, max_len):
+ self._char_vocab = char_vocab
+ self._intent_dict = intent_dict
+ self._oov_id = 0
+ self.intent_size = len(intent_dict)
+ self.all_data = []
+ self.max_len = max_len
+ self.padding_id = 0
+
+ def _get_num_examples(self):
+ return len(self.all_data)
+
+ def prepare_data(self, data_path, batch_size, mode):
+ """
+ prepare data
+ """
+ # print word_dict_path
+ # assert os.path.exists(
+ # word_dict_path), "The given word dictionary dose not exist."
+ assert os.path.exists(data_path), "The given data file does not exist."
+ if mode == "train":
+ train_reader = fluid.io.batch(paddle.reader.shuffle(self.data_reader(data_path, self.max_len, shuffle=True),
+ buf_size=batch_size * 100), batch_size)
+ # train_reader = fluid.io.batch(self.data_reader(data_path), batch_size)
+ return train_reader
+ else:
+ test_reader = fluid.io.batch(self.data_reader(data_path, self.max_len), batch_size)
+ return test_reader
+
+ def data_reader(self, file_path, max_len, shuffle=False):
+ """
+        Convert each query into a fixed-length id list with the given vocabulary
+        (padding/truncating to max_len) and return a sample generator.
+ """
+
+ for line in codecs.open(file_path, "r", encoding="utf8"):
+ line = line.strip()
+ if isinstance(line, six.binary_type):
+ line = line.decode("utf8", errors="ignore")
+ query, intent = line.split("\t")
+ char_id_list = list(map(lambda x: 0 if x not in self._char_vocab else int(self._char_vocab[x]), \
+ list(query)))
+ if len(char_id_list) < max_len:
+ char_id_list.extend([self.padding_id] * (max_len - len(char_id_list)))
+ char_id_list = char_id_list[:max_len]
+ intent_id_list = [self.padding_id] * self.intent_size
+ for item in intent.split('\2'):
+ intent_id_list[int(self._intent_dict[item])] = 1
+ self.all_data.append([char_id_list, intent_id_list])
+ if shuffle:
+ random.seed(random_seed)
+ random.shuffle(self.all_data)
+ def reader():
+ """
+ reader
+ """
+ for char_id_list, intent_id_list in self.all_data:
+ # print char_id_list, intent_id
+ yield char_id_list, intent_id_list
+ return reader
+
+
+class DataProcesser(object):
+ """[file process methods]
+
+ Arguments:
+ object {[type]} -- [description]
+
+ Returns:
+ [type] -- [description]
+ """
+ @staticmethod
+ def read_dict(filename):
+ """
+        Read a dictionary file whose lines are key\002value pairs.
+ """
+ res_dict = {}
+ for line in codecs.open(filename, encoding="utf8"):
+ try:
+ if isinstance(line, six.binary_type):
+ line = line.strip().decode("utf8")
+ line = line.strip()
+ key, value = line.strip().split("\2")
+ res_dict[key] = value
+ except Exception as err:
+ logger.error(str(err))
+ logger.error("read dict[%s] failed" % filename)
+ return res_dict
+
+ @staticmethod
+ def build_dict(filename, save_dir, min_num_char=2, min_num_intent=2):
+ """[build_dict from file]
+
+ Arguments:
+ filename {[type]} -- [description]
+ save_dir {[type]} -- [description]
+
+ Keyword Arguments:
+ min_num_char {int} -- [description] (default: {2})
+ min_num_intent {int} -- [description] (default: {2})
+ """
+ char_dict = {}
+ intent_dict = {}
+ # readfile
+ for line in codecs.open(filename):
+ line = line.strip()
+ if isinstance(line, six.binary_type):
+ line = line.strip().decode("utf8", errors="ignore")
+ query, intents = line.split("\t")
+ # read query
+ for char_item in list(query):
+ if char_item not in char_dict:
+ char_dict[char_item] = 0
+ char_dict[char_item] += 1
+ # read intents
+ for intent in intents.split('\002'):
+ if intent not in intent_dict:
+ intent_dict[intent] = 0
+ intent_dict[intent] += 1
+ # save char dict
+ with codecs.open("%s/char.dict" % save_dir, "w", encoding="utf8") as f_out:
+ f_out.write("PAD\0020\n")
+ f_out.write("OOV\0021\n")
+ char_id = 2
+ for key, value in char_dict.items():
+ if value >= min_num_char:
+ if isinstance(key, six.binary_type):
+                    key = key.decode("utf8")
+ f_out.write("%s\002%d\n" % (key, char_id))
+ char_id += 1
+ # save intent dict
+ with codecs.open("%s/domain.dict" % save_dir, "w", encoding="utf8") as f_out:
+ f_out.write("SYS_OTHER\0020\n")
+ intent_id = 1
+ for key, value in intent_dict.items():
+ if value >= min_num_intent and key != u'SYS_OTHER':
+ if isinstance(key, six.binary_type):
+                    key = key.decode("utf8")
+ f_out.write("%s\002%d\n" % (key, intent_id))
+ intent_id += 1
+
+
+
+class ConfigReader(object):
+ """[read model config file]
+
+ Arguments:
+ object {[type]} -- [description]
+
+ Returns:
+ [type] -- [description]
+ """
+
+ @staticmethod
+ def read_conf(conf_file):
+ """[read_conf]
+
+ Arguments:
+ conf_file {[type]} -- [description]
+
+ Returns:
+ [type] -- [description]
+ """
+ flow_data = collections.defaultdict(lambda: {})
+ class2key = set(["model"])
+ param_conf = cp.ConfigParser()
+ param_conf.read(conf_file)
+ for section in param_conf.sections():
+ if section not in class2key:
+ continue
+ for option in param_conf.items(section):
+ flow_data[section][option[0]] = eval(option[1])
+ return flow_data
+
+
+def init_pretraining_params(exe,
+ pretraining_params_path,
+ main_program,
+ use_fp16=False):
+ """load params of pretrained model, NOT including moment, learning_rate"""
+ assert os.path.exists(pretraining_params_path
+ ), "[%s] cann't be found." % pretraining_params_path
+
+ def _existed_params(var):
+ if not isinstance(var, fluid.framework.Parameter):
+ return False
+ return os.path.exists(os.path.join(pretraining_params_path, var.name))
+
+ fluid.io.load_vars(
+ exe,
+ pretraining_params_path,
+ main_program=main_program,
+ predicate=_existed_params)
+ print("Load pretraining parameters from {}.".format(
+ pretraining_params_path))
+
+
+def init_checkpoint(exe, init_checkpoint_path, main_program):
+ """
+ Init CheckPoint
+ """
+ assert os.path.exists(
+ init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
+
+    def existed_persistables(var):
+        """
+        Check whether var is a persistable variable saved in the checkpoint directory.
+        """
+ if not fluid.io.is_persistable(var):
+ return False
+ return os.path.exists(os.path.join(init_checkpoint_path, var.name))
+
+ fluid.io.load_vars(
+ exe,
+ init_checkpoint_path,
+ main_program=main_program,
+        predicate=existed_persistables)
+ print ("Load model from {}".format(init_checkpoint_path))
+
+def print_arguments(args):
+ """
+ Print Arguments
+ """
+ print('----------- Configuration Arguments -----------')
+ for arg, value in sorted(six.iteritems(vars(args))):
+ print('%s: %s' % (arg, value))
+ print('------------------------------------------------')
+
+
+def check_version(version='1.6.0'):
+ """
+ Log error and exit when the installed version of paddlepaddle is
+ not satisfied.
+ """
+ err = "PaddlePaddle version 1.6 or higher is required, " \
+ "or a suitable develop version is satisfied as well. \n" \
+ "Please make sure the version is good with your code." \
+
+ try:
+ fluid.require_version(version)
+ except Exception as e:
+ logger.error(err)
+ sys.exit(1)
+
+