diff --git a/PaddleNLP/dialogue_domain_classification/README.MD b/PaddleNLP/dialogue_domain_classification/README.MD new file mode 100755 index 0000000000000000000000000000000000000000..664f858e1057dd6d64417a24e815d2fea8579959 --- /dev/null +++ b/PaddleNLP/dialogue_domain_classification/README.MD @@ -0,0 +1,223 @@ +# Paddle NLP(对话领域分类器) + + + +## 模型简介 + +​ 在对话业务场景中,完整的对话能力往往由多个领域的语义解析bot组成并提供,对话领域分类器能够根据业务场景需求,将流量分发到对应领域的语义解析bot。对话领域分类器不但能够节省机器资源,流量只分发到所属领域的bot,避免了无效流量调用bot; 同时,对话领域分类器的精准分发,过滤了无效的解析结果,也使得最终的解析结果更加精确。 + + + + +## 快速开始 +**目前模型要求使用PaddlePaddle 1.6及以上版本或适当的develop版本运行。** + +### 1. Paddle版本安装 + +本项目训练模块兼容Python2.7.x以及Python3.7.x, 依赖PaddlePaddle 1.6版本以及CentOS系统环境, 安装请参考官网 [快速安装](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/index_cn.html)。 + +注意:该模型同时支持cpu和gpu训练和预测,用户可以根据自身需求,选择安装对应的paddlepaddle-gpu或paddlepaddle版本。 + +> Warning: GPU 和 CPU 版本的 PaddlePaddle 分别是 paddlepaddle-gpu 和 paddlepaddle,请安装时注意区别。 + + +### 2. 代码安装 + +克隆工具集代码库到本地 + +```shell +git clone https://github.com/PaddlePaddle/models.git +cd models/PaddleNLP/dialogue_domain_classification +``` + + + +### 3. 数据准备 + +本项目提供了部分涉及的数据集,通过运行以下指令可以快速下载。运行指令后会生成`data/input`目录,`data/input`目录下有训练集数据(train.txt)、开发集数据(eval.txt)、测试集数据(test.txt),对应词典(char.dict),领域词表(domain.dict) 以及模型配置文件(model.conf) + +```shell +mkdir -p data/input +wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dialogue_domain_classification-dataset-1.0.0.tar.gz +tar -zxvf dialogue_domain_classification-dataset-1.0.0.tar.gz -C ./data/input +``` + +**数据格式说明** + + +1. 数据格式 + +输入和输出的数据格式相同。 + +数据格式为: query \t domain_1 \002 domain_2 (多个标签, 使用\002分隔开) + +指定输入数据的文件夹: 参数`data_dir` + +训练文件: train.txt +验证集: eval.txt +测试集: test.txt + +指定输出结果的文件夹: 参数`save_dir` + +测试集预测结果为: test.rst + +2. 模型配置 + +参数`config_path` 指定模型配置文件地址, 格式如下: +```shell +[model] +emb_dim = 128 +win_sizes = [5, 5, 5] +hid_dim = 160 +hid_dim2 = 160 +``` + + + +### 4. 模型下载 + +针对于"打电话, 天气, 火车票预订, 机票预订, 音乐"这5个领域数据,我们开源了一个使用CharCNN训练好的对话领域分类模型,使用以下指令可以对模型进行下载。 + +```model +mkdir -p model +wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dialogue_domain_classification-model-1.0.0.tar.gz +tar -zxvf dialogue_domain_classification-model-1.0.0.tar.gz -C ./model +``` + +### 5. 脚本参数说明 + +通过执行如下指令,可以查看入口脚本文件所需要的参数以及说明,指令如下: + `export PATH="/path/to/your/python:$PATH"; python run_classifier.py --help ` + +```shell +1. 模型参数 +--init_checkpoint # 指定热启动加载的checkpoint模型, Default: None. +--checkpoints # 指定保存checkpoints的地址,Default: ./checkpoints. +--config_path # 指定模型配置文件,Default: ./data/input/model.conf. +--build_dict # 是否根据训练数据建立char字典和domain字典,Default: False + +2. 训练参数 +--epoch # 训练的轮次,Default: 100. +--learning_rate # 学习率, Default: 0.1. +--save_steps # 保存模型的频率,每x个steps保存一次模型,Default: 1000. +--validation_steps # 模型评估的频率,每x个steps在验证集上验证模型的效果,Default: 100. +--random_seed # 随机数种子,Default: 7 +--threshold # 领域置信度阈值,当置信度超过阈值,预测结果出对应的领域标签。 Default: 0.1. +--cpu_num # 当使用cpu训练时的线程数(当use_cuda=False才起作用)。 Default: 3. + +3. logging +--skip_steps # 训练时打印loss的频率,每x个steps打印一次loss,Default: 10. + +4. 数据 +--data_dir # 数据集的目录,其中train.txt为训练集,eval.txt为验证集,test.txt为测试集。Default: ./data/input/ +--save_dir # 模型产出的目录, Default: ./data/output/ +--max_seq_len # 最大句子长度,超过会进行截断,Default: 50. +--batch_size # 批大小, Default: 64. + +5. 脚本运行配置 +--use_cuda # 是否使用GPU,Default: False +--do_train # 是否进行训练,Default: True +--do_eval # 是否进行验证,Default: True +--do_test # 是否进行测试,Default: True +``` + + + +### 6. 
模型训练 + +用户可以基于示例数据构建训练集和开发集,可以运行下面的命令,进行模型训练和开发集验证。 + +``` +sh run.sh train +``` + +> Warning1: 可以参考`run.sh`脚本以及第5节的**脚本参数说明**, 对默认参数进行修改。 + +> Warning2: CPU多线程以及GPU多卡训练时,每个step训练分别给每一个CPU核或者GPU卡提供一个batch数据,实际上的batch_size为单核的线程数倍或者单卡的多卡数倍。 + + +### 7. 模型评估 + +基于已有的预训练模型和数据,可以运行下面的命令进行测试,查看训练的模型在验证集(test.tsv)上的评测结果 + +``` +sh run.sh eval +``` + +> Warning: 可以参考`run.sh`脚本以及第5节的**脚本参数说明**, 对默认参数进行修改。 + +### 8. 模型推断 + +``` +sh run.sh test +``` +> Warning: 可以参考`run.sh`脚本以及第5节的**脚本参数说明**, 对默认参数进行修改。 + + + +## 进阶使用 + + + +### 1. 任务定义与建模 + +在真实复杂业务场景中,语义解析服务往往由多个不同领域的语义解析bot组成,从而同时满足多个场景下的语义解析需求。例如:同时能查天气、播放音乐、查询股票等多种功能的对话bot。 + +与此同时用户输入的query句子形式各样,而且存在很多歧义。比如用户输入的query为`“下雨了”`, 这条query的语义解析既属于`天气`领域, 又属于`音乐`领域(薛之谦的歌曲)。针对这种多歧义的情况,业务上常见的方法是将query进行"广播",即同时请求每一个语义解析bot,再对返回的解析结果进行粗排,得到最终的语义解析结果。 + +对话领域分类器能够处理同一query同时命中多个领域的情况,根据对话领域分类器的解析结果,可以对query进行有效的分发到各个领域的bot。对话领域分类器对query进行有效的分发,可以避免"广播"式调用带来的资源浪费,大量的节省了机器资源;同时也提高了最终粗排后的语义解析结果的准确率。 + + +对话领域分类模型解决了一个多标签分类(Multilabel Classification)的问题, 将用户输入的文本作为模型的输入,分类器会预测出输入文本对应的每一个标签的置信度,从而得到多标签结果,并依次对query分发。 + + + +### 2. 模型原理介绍 + +对话领域分类器的大体结构如下图所示,用户输入通过`输入层`进行向量化后,作为`分类器模型`的输入,`分类器`最终的输出是一个多标签结果为`[label_1, label_2, ..., label_n]`,它的维度为`n`.(训练数据定义的训练领域总共有`n-1`个,每一个领域对应一个标签,还有额外一个标签表示背景,即不属于任何一个训练领域) + +其中每个`label_i`的概率为0到1之间,且所有label的概率之和不恒为1,它表示当前输入属于第`i`个领域的概率。最后可以人为对每一个label的概率设置阈值,从而可以得到多标签分类的结果。 + +![net](./imgs/nets.png) + +**评估指标说明** + +传统的二分类任务中,通常使用准确率、召回率和F1值对模型效果进行评估。 + +
+![function](./imgs/function.png) +
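结合 `run_classifier.py` 中 `get_score` 的实现,上述指标在本项目中的计算口径可以用下面的极简 Python 片段示意(`compute_metrics` 及其入参均为示意命名,并非项目源码;正负样本与预测正确的具体定义见下文):

```python
# 示意:按 get_score 的口径统计 precision / recall / f1 / acc
def compute_metrics(pred_label_sets, gold_label_sets):
    """pred_label_sets / gold_label_sets:每个样本预测出 / 实际包含的领域标签集合"""
    tp = pred_pos_num = pos_num = true_cnt = 0
    total = len(gold_label_sets)
    for pred, gold in zip(pred_label_sets, gold_label_sets):
        if pred:                      # 预测出至少一个领域标签
            pred_pos_num += 1
        if gold:                      # 实际包含领域标签,即正样本
            pos_num += 1
            if gold.issubset(pred):   # 预测标签覆盖全部真实标签,判定为预测正确
                tp += 1
                true_cnt += 1
        elif not pred:                # 正确结果与预测结果均不含领域标签,同样判定为正确
            true_cnt += 1
    precision = tp * 1.0 / pred_pos_num if pred_pos_num else 0.0
    recall = tp * 1.0 / pos_num if pos_num else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    acc = true_cnt * 1.0 / total if total else 0.0
    return precision, recall, f1, acc

# 用法示意:compute_metrics([{"天气", "音乐"}, set()], [{"天气"}, set()]) -> (1.0, 1.0, 1.0, 1.0)
```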
+ + +**该项目中对于正负样本的定义** + +在多标签分类任务中,我们将样本分为正样本(Pos)与负样本(Neg)两种。如果样本包含了领域标签,表示需要分发到至少1个bot进行解析,则为正样本;反之,样本不包含任何领域标签流量,表示不需要分发,则为负样本。 + +我们的对话领域分类器在保证了原有解析效果的基础之上,有效的降低机器资源的消耗。即在保证正样本召回率的情况下,尽可能提高准确率。 + + +**该项目中样本预测正确的定义** +1. 如果`正确结果`不包含领域标签, 则`预测结果`也不包含领域标签时,预测正确。 +2. 如果`正确结果`包含领域标签, 则`预测结果`包含`正确结果`的所有领域标签时(即`预测结果`的标签是`正确结果`的超集,预测正确。 + + + + +### 3. 代码结构说明 + +``` +├── run_classifier.py:该项目的主函数,封装包括训练、预测、评估的部分 +├── nets.py : 定义了模型所使用的网络结构 +├── utils.py:定义了其他常用的功能函数 +├── run.sh: 启动主函数的demo脚本 +``` + + +### 4. 如何组建自己的模型 +可以根据自己的需求,组建自定义的模型,具体方法如下所示: + +1. 定义自己的对话领域模型,可以在 ../models/classification/nets.py 中添加自己的网络结构。 + +2. 定义自己的领域对话数据,可以参考**第3节数据准备**的数据格式,准备自己的训练数据。 + +3. 模型训练、评估、预测的逻辑,需要在[run.sh](./run.sh)中修改对应的模型路径、数据路径和词典路径等参数,具体说明请参照**第5节的脚本参数说明**. diff --git a/PaddleNLP/dialogue_domain_classification/imgs/function.png b/PaddleNLP/dialogue_domain_classification/imgs/function.png new file mode 100755 index 0000000000000000000000000000000000000000..40a236d2dc681ec79ecad68303fbc4ff081db56b Binary files /dev/null and b/PaddleNLP/dialogue_domain_classification/imgs/function.png differ diff --git a/PaddleNLP/dialogue_domain_classification/imgs/nets.png b/PaddleNLP/dialogue_domain_classification/imgs/nets.png new file mode 100755 index 0000000000000000000000000000000000000000..812003b1fc4b181b33395f18902742929e6f522d Binary files /dev/null and b/PaddleNLP/dialogue_domain_classification/imgs/nets.png differ diff --git a/PaddleNLP/dialogue_domain_classification/nets.py b/PaddleNLP/dialogue_domain_classification/nets.py new file mode 100755 index 0000000000000000000000000000000000000000..77912b3b0cda2fcda2f6e264478b909e3472ff77 --- /dev/null +++ b/PaddleNLP/dialogue_domain_classification/nets.py @@ -0,0 +1,96 @@ +""" +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +import paddle.fluid as fluid +import paddle + +def textcnn_net_multi_label(data, + label, + dict_dim, + emb_dim=128, + hid_dim=128, + hid_dim2=96, + class_dim=2, + win_sizes=None, + is_infer=False, + threshold=0.5, + max_seq_len=100): + """ + multi labels Textcnn_net + """ + init_bound = 0.1 + initializer = fluid.initializer.Uniform(low=-init_bound, high=init_bound) + #gradient_clip = fluid.clip.GradientClipByNorm(10.0) + gradient_clip = None + regularizer = fluid.regularizer.L2DecayRegularizer( + regularization_coeff=1e-4) + seg_param_attrs = fluid.ParamAttr(name="seg_weight", + learning_rate=640.0, + initializer=initializer, + gradient_clip=gradient_clip, + trainable=True) + fc_param_attrs_1 = fluid.ParamAttr(name="fc_weight_1", + learning_rate=1.0, + regularizer=regularizer, + initializer=initializer, + gradient_clip=gradient_clip, + trainable=True) + fc_param_attrs_2 = fluid.ParamAttr(name="fc_weight_2", + learning_rate=1.0, + regularizer=regularizer, + initializer=initializer, + gradient_clip=gradient_clip, + trainable=True) + + if win_sizes is None: + win_sizes = [1, 2, 3] + + # embedding layer + + emb = fluid.embedding(input=data, size=[dict_dim, emb_dim], param_attr=seg_param_attrs) + + # convolution layer + convs = [] + for cnt, win_size in enumerate(win_sizes): + emb = fluid.layers.reshape(x=emb, shape=[-1, 1, max_seq_len, emb_dim], inplace=True) + filter_size = (win_size, emb_dim) + cnn_param_attrs = fluid.ParamAttr(name="cnn_weight" + str(cnt), + learning_rate=1.0, + regularizer=regularizer, + initializer=initializer, + trainable=True) + conv_out = fluid.layers.conv2d(input=emb, num_filters=hid_dim, filter_size=filter_size, act="relu", \ + param_attr=cnn_param_attrs) + pool_out = fluid.layers.pool2d( + input=conv_out, + pool_type='max', + pool_stride=1, + global_pooling=True) + convs.append(pool_out) + convs_out = fluid.layers.concat(input=convs, axis=1) + + # full connect layer + fc_1 = fluid.layers.fc(input=[pool_out], size=hid_dim2, act=None, param_attr=fc_param_attrs_1) + # sigmoid layer + fc_2 = fluid.layers.fc(input=[fc_1], size=class_dim, act=None, param_attr=fc_param_attrs_2) + prediction = fluid.layers.sigmoid(fc_2) + if is_infer: + return prediction + + cost = fluid.layers.sigmoid_cross_entropy_with_logits(x=fc_2, label=label) + avg_cost = fluid.layers.mean(x=cost) + pred_label = fluid.layers.ceil(fluid.layers.thresholded_relu(prediction, threshold)) + return [avg_cost, prediction, pred_label, label] diff --git a/PaddleNLP/dialogue_domain_classification/run.sh b/PaddleNLP/dialogue_domain_classification/run.sh new file mode 100755 index 0000000000000000000000000000000000000000..81efc2b64fb417435ff1bee10d52ba10f1138e5e --- /dev/null +++ b/PaddleNLP/dialogue_domain_classification/run.sh @@ -0,0 +1,118 @@ +export PATH="/home/guohongjie/tmp/paddle/paddle_release_home/python/bin/:$PATH" + + + + + +# CPU setting +:< 0: + pred_pos_num += 1 + if len(actual_labels) > 0: + pos_num += 1 + if set(actual_labels).issubset(set(pred_labels)): + tp += 1 + true_cnt += 1 + elif len(pred_labels) == 0 and len(actual_labels) == 0: + true_cnt += 1 + try: + precision = tp * 1.0 / pred_pos_num + recall = tp * 1.0 / pos_num + f1 = 2 * precision * recall / (recall + precision) + except Exception as e: + precision = 0 + recall = 0 + f1 = 0 + acc = true_cnt * 1.0 / total + logger.info("tp, pred_pos_num, pos_num, total") + logger.info("%d, %d, %d, %d" % (tp, pred_pos_num, pos_num, total)) + logger.info("%s result is : precision is %f, recall is %f, f1_score is %f, acc is %f" % (eval_phase, 
precision, \ + recall, f1, acc)) + + +def train(args, train_exe, compiled_prog, build_res, place): + """[train the net] + + Arguments: + args {[type]} -- [description] + train_exe {[type]} -- [description] + compiled_prog{[type]} -- [description] + build_res {[type]} -- [description] + place {[type]} -- [description] + """ + global DEV_COUNT + cost = build_res["cost"] + prediction = build_res["prediction"] + pred_label = build_res["pred_label"] + label = build_res["label"] + fetch_list = [cost.name, prediction.name, pred_label.name, label.name] + train_pyreader = build_res["train_pyreader"] + train_prog = build_res["train_prog"] + steps = 0 + time_begin = time.time() + test_exe = train_exe + logger.info("Begin training") + feed_data = [] + for i in range(args.epoch): + try: + for data in train_pyreader(): + feed_data.extend(data) + if len(feed_data) == DEV_COUNT: + avg_cost_np, avg_pred_np, pred_label, label = train_exe.run(feed=feed_data, program=compiled_prog, \ + fetch_list=fetch_list) + feed_data = [] + steps += 1 + if steps % int(args.skip_steps) == 0: + time_end = time.time() + used_time = time_end - time_begin + get_score(pred_label, label, eval_phase = "Train") + logger.info('loss is {}'.format(avg_cost_np)) + logger.info("epoch: %d, step: %d, speed: %f steps/s" % (i, steps, args.skip_steps / used_time)) + time_begin = time.time() + if steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, + "step_" + str(steps)) + fluid.io.save_persistables(train_exe, save_path, train_prog) + logger.info("[save]step %d : save at %s" % (steps, save_path)) + if steps % args.validation_steps == 0: + if args.do_eval: + evaluate(args, test_exe, build_res["eval_prog"], build_res, place, "eval") + if args.do_test: + evaluate(args, test_exe, build_res["test_prog"], build_res, place, "test") + except Exception as e: + logger.exception(str(e)) + logger.error("Train error : %s" % str(e)) + exit(1) + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + fluid.io.save_persistables(train_exe, save_path, train_prog) + logger.info("[save]step %d : save at %s" % (steps, save_path)) + + +def evaluate(args, test_exe, test_prog, build_res, place, eval_phase, save_result=False, id2intent=None): + """[evaluate on dev/test dataset] + + Arguments: + args {[type]} -- [description] + test_exe {[type]} -- [description] + test_prog {[type]} -- [description] + build_res {[type]} -- [description] + place {[type]} -- [description] + eval_phase {[type]} -- [description] + + Keyword Arguments: + threshold {float} -- [description] (default: {0.5}) + save_result {bool} -- [description] (default: {False}) + id2intent {[type]} -- [description] (default: {None}) + """ + threshold = args.threshold + cost = build_res["cost"] + prediction = build_res["prediction"] + pred_label = build_res["pred_label"] + label = build_res["label"] + fetch_list = [cost.name, prediction.name, pred_label.name, label.name] + total_cost, total_acc, pred_prob_list, pred_label_list, label_list = [], [], [], [], [] + if eval_phase == "eval": + test_pyreader = build_res["eval_pyreader"] + elif eval_phase == "test": + test_pyreader = build_res["test_pyreader"] + else: + exit(1) + logger.info("-----------------------------------------------------------") + for data in test_pyreader(): + avg_cost_np, avg_pred_np, pred_label, label= test_exe.run(program=test_prog, fetch_list=fetch_list, feed=data, \ + return_numpy=True) + total_cost.append(avg_cost_np) + pred_prob_list.extend(avg_pred_np) + pred_label_list.extend(pred_label) + 
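# gold labels are accumulated batch by batch in the same way; once the loop finishes,
# get_score() computes precision / recall / f1 / acc over the whole eval / test set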
label_list.extend(label) + + if save_result: + logger.info("save result at : %s" % args.save_dir + "/" + eval_phase + ".rst") + save_dir = args.save_dir + if not os.path.exists(save_dir): + logger.warning("save dir not exists, and create it") + os.makedirs(save_dir) + fin = codecs.open(os.path.join(args.data_dir, eval_phase + ".txt"), "r", encoding="utf8") + fout = codecs.open(args.save_dir + "/" + eval_phase + ".rst", "w", encoding="utf8") + for line in pred_prob_list: + query = fin.readline().rsplit("\t", 1)[0] + res = [] + for i in range(1, len(line)): + if line[i] > threshold: + #res.append(id2intent[i]+":"+str(line[i])) + res.append(id2intent[i]) + if len(res) == 0: + res.append(id2intent[0]) + fout.write("%s\t%s\n" % (query, "\2".join(sorted(res)))) + fout.close() + fin.close() + + logger.info("[%s] result: " % eval_phase) + get_score(pred_label_list, label_list, eval_phase) + logger.info('loss is {}'.format(sum(total_cost) * 1.0 / len(total_cost))) + logger.info("-----------------------------------------------------------") + + + +def create_net(args, flow_data, class_dim, dict_dim, place, model_name="textcnn_net", is_infer=False): + """[create network and pyreader] + + Arguments: + flow_data {[type]} -- [description] + class_dim {[type]} -- [description] + dict_dim {[type]} -- [description] + place {[type]} -- [description] + + Keyword Arguments: + model_name {str} -- [description] (default: {"textcnn_net"}) + is_infer {bool} -- [description] (default: {False}) + + Returns: + [type] -- [description] + """ + if model_name == "textcnn_net": + model = textcnn_net_multi_label + else: + return + char_list = fluid.data(name="char", shape=[None, args.max_seq_len, 1], dtype="int64", lod_level=0) + label = fluid.data(name="label", shape=[None, class_dim], dtype="float32", lod_level=0) # label data + reader = fluid.io.PyReader(feed_list=[char_list, label], capacity=args.batch_size * 10, iterable=True, \ + return_list=False) + output = model(char_list, label, dict_dim, + emb_dim=flow_data["model"]["emb_dim"], + hid_dim=flow_data["model"]["hid_dim"], + hid_dim2=flow_data["model"]["hid_dim2"], + class_dim=class_dim, + win_sizes=flow_data["model"]["win_sizes"], + is_infer=is_infer, + threshold=args.threshold, + max_seq_len=args.max_seq_len) + if is_infer: + prediction = output + return [reader, prediction] + else: + avg_cost, prediction, pred_label, label = output[0], output[1], output[2], output[3] + return [reader, avg_cost, prediction, pred_label, label] + + +def build_data_reader(args, char_dict, intent_dict): + """[decorate samples for pyreader] + + Arguments: + args {[type]} -- [description] + char_dict {[type]} -- [description] + intent_dict {[type]} -- [description] + + Returns: + [type] -- [description] + """ + reader_res = {} + if args.do_train: + train_processor = DataReader(char_dict, intent_dict, args.max_seq_len) + train_data_generator = train_processor.prepare_data( + data_path=args.data_dir + "train.txt", + batch_size=args.batch_size, + mode='train') + reader_res["train_data_generator"] = train_data_generator + num_train_examples = train_processor._get_num_examples() + logger.info("Num train examples: %d" % num_train_examples) + logger.info("Num train steps: %d" % (math.ceil(num_train_examples * 1.0 / args.batch_size) * \ + args.epoch // DEV_COUNT)) + if math.ceil(num_train_examples * 1.0 / args.batch_size) // DEV_COUNT <= 0: + logger.error("Num of train steps is less than 0 or equals to 0, exit") + exit(1) + if args.do_eval: + eval_processor = DataReader(char_dict, intent_dict, 
args.max_seq_len) + eval_data_generator = eval_processor.prepare_data( + data_path=args.data_dir + "eval.txt", + batch_size=args.batch_size, + mode='eval') + reader_res["eval_data_generator"] = eval_data_generator + num_eval_examples = eval_processor._get_num_examples() + logger.info("Num eval examples: %d" % num_eval_examples) + if args.do_test: + test_processor = DataReader(char_dict, intent_dict, args.max_seq_len) + test_data_generator = test_processor.prepare_data( + data_path=args.data_dir + "test.txt", + batch_size=args.batch_size, + mode='test') + reader_res["test_data_generator"] = test_data_generator + return reader_res + + +def build_graph(args, model_config, num_labels, dict_dim, place, reader_res): + """[build paddle graph] + + Arguments: + args {[type]} -- [description] + model_config {[type]} -- [description] + num_labels {[type]} -- [description] + dict_dim {[type]} -- [description] + place {[type]} -- [description] + reader_res {[type]} -- [description] + + Returns: + [type] -- [description] + """ + res = {} + cost, prediction, pred_label, label = None, None, None, None + train_prog = fluid.default_main_program() + + startup_prog = fluid.default_startup_program() + eval_prog = train_prog.clone(for_test=True) + test_prog = train_prog.clone(for_test=True) + train_prog.random_seed = args.random_seed + startup_prog.random_seed = args.random_seed + if args.do_train: + with fluid.program_guard(train_prog, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \ + dict_dim, place, model_name="textcnn_net") + train_pyreader.decorate_sample_list_generator(reader_res['train_data_generator'], places=place) + res["train_pyreader"] = train_pyreader + sgd_optimizer = fluid.optimizer.SGD(learning_rate=fluid.layers.exponential_decay( + learning_rate=args.learning_rate, decay_steps=1000, decay_rate=0.5, staircase=True)) + sgd_optimizer.minimize(cost) + if args.do_eval: + with fluid.program_guard(eval_prog, startup_prog): + with fluid.unique_name.guard(): + eval_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \ + dict_dim, place, model_name="textcnn_net") + eval_pyreader.decorate_sample_list_generator(reader_res['eval_data_generator'], places=place) + res["eval_pyreader"] = eval_pyreader + if args.do_test: + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \ + dict_dim, place, model_name="textcnn_net") + test_pyreader.decorate_sample_list_generator(reader_res['test_data_generator'], places=place) + res["test_pyreader"] = test_pyreader + res["cost"] = cost + res["prediction"] = prediction + res["label"] = label + res["pred_label"] = pred_label + res["train_prog"] =train_prog + res["eval_prog"] = eval_prog + res["test_prog"] = test_prog + + + return res + + +def main(args): + """ + Main Function + """ + global DEV_COUNT + startup_prog = fluid.default_startup_program() + random.seed(args.random_seed) + model_config = ConfigReader.read_conf(args.config_path) + if args.use_cuda: + place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0'))) + DEV_COUNT = fluid.core.get_cuda_device_count() + else: + place = fluid.CPUPlace() + os.environ['CPU_NUM'] = str(args.cpu_num) + DEV_COUNT = args.cpu_num + logger.info("Dev Num is %s" % str(DEV_COUNT)) + exe = fluid.Executor(place) + if args.do_train and args.build_dict: + 
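# when --build_dict is True, char.dict and domain.dict are rebuilt from the training file:
# DataProcesser.build_dict keeps characters / domain labels that appear at least twice
# (min_num_char / min_num_intent default to 2), reserving PAD=0 / OOV=1 in char.dict
# and SYS_OTHER=0 in domain.dict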
DataProcesser.build_dict(args.data_dir + "train.txt", args.data_dir) + # read dict + char_dict = DataProcesser.read_dict(args.data_dir + "char.dict") + dict_dim = len(char_dict) + intent_dict = DataProcesser.read_dict(args.data_dir + "domain.dict") + id2intent = {} + for key, value in intent_dict.items(): + id2intent[int(value)] = key + num_labels = len(intent_dict) + # build model + reader_res = build_data_reader(args, char_dict, intent_dict) + build_res = build_graph(args, model_config, num_labels, dict_dim, place, reader_res) + if not (args.do_train or args.do_eval or args.do_test): + raise ValueError("For args `do_train`, `do_eval` and `do_test`, at " + "least one of them must be True.") + + exe.run(startup_prog) + if args.init_checkpoint and args.init_checkpoint != "None": + try: + init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog) + logger.info("Load model from %s" % args.init_checkpoint) + except Exception as e: + logger.exception(str(e)) + logger.error("Faild load model from %s [%s]" % (args.init_checkpoint, str(e))) + + if args.do_train: + build_strategy = fluid.compiler.BuildStrategy() + compiled_prog = fluid.compiler.CompiledProgram(build_res["train_prog"]).with_data_parallel( \ + loss_name=build_res["cost"].name, build_strategy=build_strategy) + build_res["compiled_prog"] = compiled_prog + train(args, exe, compiled_prog, build_res, place) + if args.do_eval: + + evaluate(args, exe, build_res["eval_prog"], build_res, place, "eval", \ + save_result=True, id2intent=id2intent) + if args.do_test: + + evaluate(args, exe, build_res["test_prog"], build_res, place, "test",\ + save_result=True, id2intent=id2intent) + + + + +if __name__ == "__main__": + logger.info("the paddle version is %s" % paddle.__version__) + check_version('1.6.0') + print_arguments(args) + main(args) diff --git a/PaddleNLP/dialogue_domain_classification/utils.py b/PaddleNLP/dialogue_domain_classification/utils.py new file mode 100755 index 0000000000000000000000000000000000000000..2c839a2ccc605fae3c602f241586fda2838fea15 --- /dev/null +++ b/PaddleNLP/dialogue_domain_classification/utils.py @@ -0,0 +1,354 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +""" +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +from __future__ import unicode_literals +import sys +import os +import random +import paddle +import logging +import paddle.fluid as fluid +import numpy as np +import collections +import six +import codecs +try: + import configparser as cp +except ImportError: + import ConfigParser as cp + + +random_seed = 7 +logger = logging.getLogger() +format = "%(asctime)s - %(name)s - %(levelname)s -%(filename)s-%(lineno)4d -%(message)s" +# format = "%(levelname)8s: %(asctime)s: %(filename)s:%(lineno)4d %(message)s" +logging.basicConfig(format=format) +logger.setLevel(logging.INFO) +logger = logging.getLogger('Paddle-DDC') + + +def str2bool(v): + """[ because argparse does not support to parse "true, False" as python + boolean directly] + Arguments: + v {[type]} -- [description] + Returns: + [type] -- [description] + """ + return v.lower() in ("true", "t", "1") + + +def to_lodtensor(data, place): + """ + convert ot LODtensor + """ + seq_lens = [len(seq) for seq in data] + cur_len = 0 + lod = [cur_len] + for l in seq_lens: + cur_len += l + lod.append(cur_len) + flattened_data = np.concatenate(data, axis=0).astype("int64") + flattened_data = flattened_data.reshape([len(flattened_data), 1]) + res = fluid.LoDTensor() + res.set(flattened_data, place) + res.set_lod([lod]) + return res + + +class ArgumentGroup(object): + """[ArgumentGroup] + + Arguments: + object {[type]} -- [description] + """ + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, **kwargs): + """[add_arg] + + Arguments: + name {[type]} -- [description] + type {[type]} -- [description] + default {[type]} -- [description] + help {[type]} -- [description] + """ + type = str2bool if type == bool else type + self._group.add_argument( + "--" + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +class DataReader(object): + """[get data generator for dataset] + + Arguments: + object {[type]} -- [description] + + Returns: + [type] -- [description] + """ + def __init__(self, char_vocab, intent_dict, max_len): + self._char_vocab = char_vocab + self._intent_dict = intent_dict + self._oov_id = 0 + self.intent_size = len(intent_dict) + self.all_data = [] + self.max_len = max_len + self.padding_id = 0 + + def _get_num_examples(self): + return len(self.all_data) + + def prepare_data(self, data_path, batch_size, mode): + """ + prepare data + """ + # print word_dict_path + # assert os.path.exists( + # word_dict_path), "The given word dictionary dose not exist." + assert os.path.exists(data_path), "The given data file does not exist." 
+ if mode == "train": + train_reader = fluid.io.batch(paddle.reader.shuffle(self.data_reader(data_path, self.max_len, shuffle=True), + buf_size=batch_size * 100), batch_size) + # train_reader = fluid.io.batch(self.data_reader(data_path), batch_size) + return train_reader + else: + test_reader = fluid.io.batch(self.data_reader(data_path, self.max_len), batch_size) + return test_reader + + def data_reader(self, file_path, max_len, shuffle=False): + """ + Convert query into id list + use fixed voc + """ + + for line in codecs.open(file_path, "r", encoding="utf8"): + line = line.strip() + if isinstance(line, six.binary_type): + line = line.decode("utf8", errors="ignore") + query, intent = line.split("\t") + char_id_list = list(map(lambda x: 0 if x not in self._char_vocab else int(self._char_vocab[x]), \ + list(query))) + if len(char_id_list) < max_len: + char_id_list.extend([self.padding_id] * (max_len - len(char_id_list))) + char_id_list = char_id_list[:max_len] + intent_id_list = [self.padding_id] * self.intent_size + for item in intent.split('\2'): + intent_id_list[int(self._intent_dict[item])] = 1 + self.all_data.append([char_id_list, intent_id_list]) + if shuffle: + random.seed(random_seed) + random.shuffle(self.all_data) + def reader(): + """ + reader + """ + for char_id_list, intent_id_list in self.all_data: + # print char_id_list, intent_id + yield char_id_list, intent_id_list + return reader + + +class DataProcesser(object): + """[file process methods] + + Arguments: + object {[type]} -- [description] + + Returns: + [type] -- [description] + """ + @staticmethod + def read_dict(filename): + """ + read_dict: key\2value + """ + res_dict = {} + for line in codecs.open(filename, encoding="utf8"): + try: + if isinstance(line, six.binary_type): + line = line.strip().decode("utf8") + line = line.strip() + key, value = line.strip().split("\2") + res_dict[key] = value + except Exception as err: + logger.error(str(err)) + logger.error("read dict[%s] failed" % filename) + return res_dict + + @staticmethod + def build_dict(filename, save_dir, min_num_char=2, min_num_intent=2): + """[build_dict from file] + + Arguments: + filename {[type]} -- [description] + save_dir {[type]} -- [description] + + Keyword Arguments: + min_num_char {int} -- [description] (default: {2}) + min_num_intent {int} -- [description] (default: {2}) + """ + char_dict = {} + intent_dict = {} + # readfile + for line in codecs.open(filename): + line = line.strip() + if isinstance(line, six.binary_type): + line = line.strip().decode("utf8", errors="ignore") + query, intents = line.split("\t") + # read query + for char_item in list(query): + if char_item not in char_dict: + char_dict[char_item] = 0 + char_dict[char_item] += 1 + # read intents + for intent in intents.split('\002'): + if intent not in intent_dict: + intent_dict[intent] = 0 + intent_dict[intent] += 1 + # save char dict + with codecs.open("%s/char.dict" % save_dir, "w", encoding="utf8") as f_out: + f_out.write("PAD\0020\n") + f_out.write("OOV\0021\n") + char_id = 2 + for key, value in char_dict.items(): + if value >= min_num_char: + if isinstance(key, six.binary_type): + key = key.encode("utf8") + f_out.write("%s\002%d\n" % (key, char_id)) + char_id += 1 + # save intent dict + with codecs.open("%s/domain.dict" % save_dir, "w", encoding="utf8") as f_out: + f_out.write("SYS_OTHER\0020\n") + intent_id = 1 + for key, value in intent_dict.items(): + if value >= min_num_intent and key != u'SYS_OTHER': + if isinstance(key, six.binary_type): + key = key.encode("utf8") + 
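# each retained domain label is written as "label\002id" (ids start from 1, id 0 is
# reserved for SYS_OTHER above), the same "\2"-separated format parsed by read_dict()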
f_out.write("%s\002%d\n" % (key, intent_id)) + intent_id += 1 + + + +class ConfigReader(object): + """[read model config file] + + Arguments: + object {[type]} -- [description] + + Returns: + [type] -- [description] + """ + + @staticmethod + def read_conf(conf_file): + """[read_conf] + + Arguments: + conf_file {[type]} -- [description] + + Returns: + [type] -- [description] + """ + flow_data = collections.defaultdict(lambda: {}) + class2key = set(["model"]) + param_conf = cp.ConfigParser() + param_conf.read(conf_file) + for section in param_conf.sections(): + if section not in class2key: + continue + for option in param_conf.items(section): + flow_data[section][option[0]] = eval(option[1]) + return flow_data + + +def init_pretraining_params(exe, + pretraining_params_path, + main_program, + use_fp16=False): + """load params of pretrained model, NOT including moment, learning_rate""" + assert os.path.exists(pretraining_params_path + ), "[%s] cann't be found." % pretraining_params_path + + def _existed_params(var): + if not isinstance(var, fluid.framework.Parameter): + return False + return os.path.exists(os.path.join(pretraining_params_path, var.name)) + + fluid.io.load_vars( + exe, + pretraining_params_path, + main_program=main_program, + predicate=_existed_params) + print("Load pretraining parameters from {}.".format( + pretraining_params_path)) + + +def init_checkpoint(exe, init_checkpoint_path, main_program): + """ + Init CheckPoint + """ + assert os.path.exists( + init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path + + def existed_persitables(var): + """ + If existed presitabels + """ + if not fluid.io.is_persistable(var): + return False + return os.path.exists(os.path.join(init_checkpoint_path, var.name)) + + fluid.io.load_vars( + exe, + init_checkpoint_path, + main_program=main_program, + predicate=existed_persitables) + print ("Load model from {}".format(init_checkpoint_path)) + +def print_arguments(args): + """ + Print Arguments + """ + print('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + + +def check_version(version='1.6.0'): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version(version) + except Exception as e: + logger.error(err) + sys.exit(1) + +