diff --git a/dssm/README.cn.md b/dssm/README.cn.md index 17f9a923c0418c72194f4f54316f04b70ba443dc..e1bd3cab89a2f76752c946b5a54cc250def9441b 100644 --- a/dssm/README.cn.md +++ b/dssm/README.cn.md @@ -1,19 +1,14 @@ # 深度结构化语义模型 (Deep Structured Semantic Models, DSSM) -DSSM使用DNN模型在一个连续的语义空间中学习文本低纬的表示向量,并且建模两个句子间的语义相似度。 -本例演示如何使用 PaddlePaddle实现一个通用的DSSM 模型,用于建模两个字符串间的语义相似度, -模型实现支持通用的数据格式,用户替换数据便可以在真实场景中使用该模型。 +DSSM使用DNN模型在一个连续的语义空间中学习文本低维的表示向量,并且建模两个句子间的语义相似度。本例演示如何使用PaddlePaddle实现一个通用的DSSM模型,用于建模两个字符串间的语义相似度;模型实现支持通用的数据格式,用户替换数据便可以在真实场景中使用该模型。 ## 背景介绍 -DSSM \[[1](##参考文献)\]是微软研究院13年提出来的经典的语义模型,用于学习两个文本之间的语义距离, -广义上模型也可以推广和适用如下场景: +DSSM \[[1](#参考文献)\]是微软研究院2013年提出的经典语义模型,用于学习两个文本之间的语义距离,广义上该模型也可以推广适用于如下场景: 1. CTR预估模型,衡量用户搜索词(Query)与候选网页集合(Documents)之间的相关联程度。 2. 文本相关性,衡量两个字符串间的语义相关程度。 3. 自动推荐,衡量User与被推荐的Item之间的关联程度。 -DSSM 已经发展成了一个框架,可以很自然地建模两个记录之间的距离关系, -例如对于文本相关性问题,可以用余弦相似度 (cosin similarity) 来刻画语义距离; -而对于搜索引擎的结果排序,可以在DSSM上接上Rank损失训练出一个排序模型。 +DSSM 已经发展成了一个框架,可以很自然地建模两个记录之间的距离关系,例如对于文本相关性问题,可以用余弦相似度 (cosine similarity) 来刻画语义距离;而对于搜索引擎的结果排序,可以在DSSM上接上Rank损失训练出一个排序模型。 ## 模型简介 在原论文\[[1](#参考文献)\]中,DSSM模型用来衡量用户搜索词 Query 和文档集合 Documents 之间隐含的语义关系,模型结构如下 @@ -23,12 +18,9 @@ DSSM 已经发展成了一个框架,可以很自然地建模两个记录之间 图 1. DSSM 原始结构

-其贯彻的思想是, **用DNN将高维特征向量转化为低纬空间的连续向量(图中红色框部分)** , -**在上层用cosine similarity来衡量用户搜索词与候选文档间的语义相关性** 。 +其贯彻的思想是:**用DNN将高维特征向量转化为低维空间的连续向量(图中红色框部分)**,**在上层使用cosine similarity来衡量用户搜索词与候选文档间的语义相关性**。 -在最顶层损失函数的设计上,原始模型使用类似Word2Vec中负例采样的方法, -一个Query会抽取正例 $D+$ 和4个负例 $D-$ 整体上算条件概率用对数似然函数作为损失, -这也就是图 1中类似 $P(D_1|Q)$ 的结构,具体细节请参考原论文。 +在最顶层损失函数的设计上,原始模型使用类似Word2Vec中负例采样的方法:一个Query会抽取正例 $D^+$ 和4个负例 $D^-$,整体上计算条件概率,并用对数似然函数作为损失,这也就是图 1 中类似 $P(D_1|Q)$ 的结构,具体细节请参考原论文。 随着后续优化DSSM模型的结构得以简化\[[3](#参考文献)\],演变为: @@ -37,37 +29,30 @@ DSSM 已经发展成了一个框架,可以很自然地建模两个记录之间 图 2. DSSM通用结构

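上文提到的原始损失可以用下面一段极简的 NumPy 草图来示意(仅为说明性质的示例,与本例的 PaddlePaddle 实现无关,向量均用随机数模拟):对一个 Query 取一个正例 $D^+$ 和4个负例 $D^-$,先计算 Query 与各文档的余弦相似度,经 softmax 得到条件概率 $P(D|Q)$,再取正例的负对数似然作为损失;其中 gamma 为原论文中的平滑超参数,此处取值仅作演示。

```python
import numpy as np

def cos_sim(a, b):
    # 余弦相似度,值域 [-1, 1]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 模拟已由 DNN 学出的低维语义向量:docs[0] 为正例 D+,其余为 4 个负例 D-
query = np.random.rand(128)
docs = [np.random.rand(128) for _ in range(5)]

gamma = 10.0  # 平滑因子,原论文中的超参数,取值仅作演示
scores = np.array([gamma * cos_sim(query, d) for d in docs])

# softmax 得到条件概率 P(D|Q),损失取正例的负对数似然
probs = np.exp(scores - scores.max())
probs /= probs.sum()
loss = -np.log(probs[0])
print("P(D+|Q) = %.4f, loss = %.4f" % (probs[0], loss))
```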
-图中的空白方框可以用任何模型替代,比如全连接FC,卷积CNN,RNN等都可以, -该模型结构专门用于衡量两个元素(比如字符串)间的语义距离。 - -在现实使用中,DSSM模型会作为基础的积木,搭配上不同的损失函数来实现具体的功能,比如 +图中的空白方框可以用任何模型替代,例如全连接FC、卷积CNN、RNN等。该模型结构专门用于衡量两个元素(比如字符串)间的语义距离。在实际任务中,DSSM模型会作为基础的积木,搭配上不同的损失函数来实现具体的功能,比如: - 在排序学习中,将 图 2 中结构添加 pairwise rank损失,变成一个排序模型 - 在CTR预估中,对点击与否做0,1二元分类,添加交叉熵损失变成一个分类模型 - 在需要对一个子串打分时,可以使用余弦相似度来计算相似度,变成一个回归模型 -本例将尝试面向应用提供一个比较通用的解决方案,在模型任务类型上支持 +本例提供一个比较通用的解决方案,在模型任务类型上支持: - 分类 - [-1, 1] 值域内的回归 - Pairwise-Rank -在生成低纬语义向量的模型结构上,本模型支持以下三种: +在生成低维语义向量的模型结构上,支持以下三种: - FC, 多层全连接层 - CNN,卷积神经网络 - RNN,递归神经网络 ## 模型实现 -DSSM模型可以拆成三小块实现,分别是左边和右边的DNN,以及顶层的损失函数。 -在复杂任务中,左右两边DNN的结构可以是不同的,比如在原始论文中左右分别学习Query和Document的semantic vector, -两者数据的数据不同,建议对应定制DNN的结构。 +DSSM模型可以拆成三部分:左边和右边的DNN,以及顶层的损失函数。在复杂任务中,左右两边DNN的结构可以不同。在原始论文中,左右网络分别学习Query和Document的语义向量,两者的数据不同,建议对应定制DNN的结构。 -本例中为了简便和通用,将左右两个DNN的结构都设为相同的,因此只有三个选项FC,CNN,RNN等。 +**本例中为了简便和通用,将左右两个DNN的结构设为相同,因此只提供三个选项:FC、CNN、RNN**。 -在损失函数的设计方面,也支持三种,分类, 回归, 排序; -其中,在回归和排序两种损失中,左右两边的匹配程度通过余弦相似度(cossim)来计算; -在分类任务中,类别预测的分布通过softmax计算。 +损失函数的设计也支持三种类型:分类、回归、排序;其中,在回归和排序两种损失中,左右两边的匹配程度通过余弦相似度(cosine similarity)来计算;在分类任务中,类别预测的分布通过softmax计算。 在其它教程中,对上述很多内容都有过详细的介绍,例如: @@ -77,19 +62,17 @@ DSSM模型可以拆成三小块实现,分别是左边和右边的DNN,以及 相关原理在此不再赘述,本文接下来的篇幅主要集中介绍使用PaddlePaddle实现这些结构上。 -如图3,回归和分类模型的结构很相似 +如图3,回归和分类模型的结构相似:



图 3. DSSM for REGRESSION or CLASSIFICATION

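下面用 NumPy 粗略示意图 3 顶层的两种做法(仅为说明性质的草图:假设左右两个低维语义向量已由 FC/CNN/RNN 学出,这里用随机向量代替;分类分支对应后文 `_build_classification_or_regression_model` 中拼接向量后接 softmax 的结构,回归分支对应其中 `paddle.layer.cos_sim` 的打分):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 假设左右两个语义向量已经学出(此处随机模拟)
left_vec, right_vec = np.random.rand(32), np.random.rand(32)

# 回归:直接用余弦相似度打分,值域 [-1, 1]
score = left_vec.dot(right_vec) / (
    np.linalg.norm(left_vec) * np.linalg.norm(right_vec))

# 分类:拼接两个语义向量,过一层全连接(权重随机模拟)后接 softmax
class_num = 2
w = np.random.rand(left_vec.size + right_vec.size, class_num)
class_prob = softmax(np.concatenate([left_vec, right_vec]).dot(w))
print("regression score: %.4f" % score)
print("class distribution:", class_prob)
```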
-最重要的组成部分包括词向量,图中`(1)`,`(2)`两个低纬向量的学习器(可以用RNN/CNN/FC中的任意一种实现), -最上层对应的损失函数。 - -而Pairwise Rank的结构会复杂一些,类似两个 图 4. 中的结构,增加了对应的损失函数: +最重要的组成部分包括:词向量、图中`(1)`、`(2)`两个低维向量的学习器(可以用RNN/CNN/FC中的任意一种实现),以及最上层对应的损失函数。 -- 模型总体思想是,用同一个source(源)为左右两个target(目标)分别打分——`(a),(b)`,学习目标是(a),(b)间的大小关系 +Pairwise Rank的结构会复杂一些,图 4 中的结构会出现两次,并增加了对应的损失函数。模型总体思想是: +- 给定同一个source(源),为左右两个target(目标)分别打分——`(a)`、`(b)`,学习目标是`(a)`、`(b)`之间的大小关系 - `(a)`和`(b)`类似图3中结构,用于给source和target的pair打分 - `(1)`和`(2)`的结构其实是共用的,都表示同一个source,图中为了表达效果展开成两个 @@ -98,17 +81,18 @@ DSSM模型可以拆成三小块实现,分别是左边和右边的DNN,以及 图 4. DSSM for Pairwise Rank

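对应图 4,下面给出 pairwise 打分与损失的一个 NumPy 草图(仅为示意,假设三个语义向量已学出,这里用随机向量代替;损失写成 RankNet 一族的形式,与正文代码中使用的 `paddle.layer.rank_cost` 属于同一思路):

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 同一个 source 分别与左右两个 target 打分(向量随机模拟)
source = np.random.rand(32)
left_target, right_target = np.random.rand(32), np.random.rand(32)

left_score = cos_sim(source, left_target)    # 图中 (a)
right_score = cos_sim(source, right_target)  # 图中 (b)

# label 为 1 表示左边的 target 应排在前面
label = 1.0
o = left_score - right_score
# RankNet 形式的 pairwise 损失:-label * o + log(1 + e^o)
cost = -label * o + np.log(1.0 + np.exp(o))
print("left=%.4f right=%.4f cost=%.4f" % (left_score, right_score, cost))
```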
-下面是各个部分具体的实现方法,所有的代码均包含在 `./network_conf.py` 中。 +下面是各个部分的具体实现,相关代码均包含在 `./network_conf.py` 中。 ### 创建文本的词向量表 ```python def create_embedding(self, input, prefix=''): - ''' - Create an embedding table whose name has a `prefix`. - ''' - logger.info("create embedding table [%s] which dimention is %d" % + """ + Create word embedding. The `prefix` is added in front of the name of + embedding"s learnable parameter. + """ + logger.info("Create embedding table [%s] whose dimention is %d" % (prefix, self.dnn_dims[0])) emb = paddle.layer.embedding( input=input, @@ -123,14 +107,15 @@ def create_embedding(self, input, prefix=''): ```python def create_cnn(self, emb, prefix=''): - ''' + + """ A multi-layer CNN. + :param emb: The word embedding. + :type emb: paddle.layer + :param prefix: The prefix will be added to of layers' names. + :type prefix: str + """ - @emb: paddle.layer - output of the embedding layer - @prefix: str - prefix of layers' names, used to share parameters between more than one `cnn` parts. - ''' def create_conv(context_len, hidden_size, prefix): key = "%s_%d_%d" % (prefix, context_len, hidden_size) conv = paddle.networks.sequence_conv_pool( @@ -138,21 +123,18 @@ def create_cnn(self, emb, prefix=''): context_len=context_len, hidden_size=hidden_size, # set parameter attr for parameter sharing - context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'), - fc_param_attr=ParamAttr(name=key + '_fc.w'), - fc_bias_attr=ParamAttr(name=key + '_fc.b'), - pool_bias_attr=ParamAttr(name=key + '_pool.b')) + context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"), + fc_param_attr=ParamAttr(name=key + "_fc.w"), + fc_bias_attr=ParamAttr(name=key + "_fc.b"), + pool_bias_attr=ParamAttr(name=key + "_pool.b")) return conv - logger.info('create a sequence_conv_pool which context width is 3') conv_3 = create_conv(3, self.dnn_dims[1], "cnn") - logger.info('create a sequence_conv_pool which context width is 4') conv_4 = create_conv(4, self.dnn_dims[1], "cnn") return conv_3, conv_4 ``` -CNN 接受 embedding table输出的词向量序列,通过卷积和池化操作捕捉到原始句子的关键信息, -最终输出一个语义向量(可以认为是句子向量)。 +CNN 接受词向量序列,通过卷积和池化操作捕捉到原始句子的关键信息,最终输出一个语义向量(可以认为是句子向量)。 本例的实现中,分别使用了窗口长度为3和4的CNN学到的句子向量按元素求和得到最终的句子向量。 @@ -162,9 +144,9 @@ RNN很适合学习变长序列的信息,使用RNN来学习句子的信息几 ```python def create_rnn(self, emb, prefix=''): - ''' + """ A GRU sentence vector learner. - ''' + """ gru = paddle.networks.simple_gru( input=emb, size=self.dnn_dims[1], @@ -176,18 +158,19 @@ def create_rnn(self, emb, prefix=''): return sent_vec ``` -### FC 结构实现 +### 多层全连接网络FC ```python def create_fc(self, emb, prefix=''): - ''' + + """ A multi-layer fully connected neural networks. + :param emb: The output of the embedding layer + :type emb: paddle.layer + :param prefix: A prefix will be added to the layers' names. + :type prefix: str + """ - @emb: paddle.layer - output of the embedding layer - @prefix: str - prefix of layers' names, used to share parameters between more than one `fc` parts. - ''' _input_layer = paddle.layer.pooling( input=emb, pooling_type=paddle.pooling.Max()) fc = paddle.layer.fc( @@ -198,21 +181,17 @@ def create_fc(self, emb, prefix=''): return fc ``` -在构建FC时需要首先使用`paddle.layer.pooling` 对词向量序列进行最大池化操作,将边长序列转化为一个固定维度向量, -作为整个句子的语义表达,使用最大池化能够降低句子长度对句向量表达的影响。 +在构建全连接网络时首先使用`paddle.layer.pooling` 对词向量序列进行最大池化操作,将边长序列转化为一个固定维度向量,作为整个句子的语义表达,使用最大池化能够降低句子长度对句向量表达的影响。 -### 多层DNN实现 +### 多层DNN 在 CNN/DNN/FC提取出 semantic vector后,在上层可继续接多层FC来实现深层DNN结构。 ```python def create_dnn(self, sent_vec, prefix): - # if more than three layers exists, a fc layer will be added. 
if len(self.dnn_dims) > 1: _input_layer = sent_vec for id, dim in enumerate(self.dnn_dims[1:]): name = "%s_fc_%d_%d" % (prefix, id, dim) - logger.info("create fc layer [%s] which dimention is %d" % - (name, dim)) fc = paddle.layer.fc( input=_input_layer, size=dim, @@ -224,119 +203,13 @@ def create_dnn(self, sent_vec, prefix): return _input_layer ``` -### 分类或回归实现 -分类和回归的结构比较相似,因此可以用一个函数创建出来 +### 分类及回归 +分类和回归的结构比较相似,具体实现请参考[network_conf.py]( https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py)中的 +`_build_classification_or_regression_model` 函数。 -```python -def _build_classification_or_regression_model(self, is_classification): - ''' - Build a classification/regression model, and the cost is returned. - - A Classification has 3 inputs: - - source sentence - - target sentence - - classification label - - ''' - # prepare inputs. - assert self.class_num - - source = paddle.layer.data( - name='source_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0])) - target = paddle.layer.data( - name='target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - label = paddle.layer.data( - name='label_input', - type=paddle.data_type.integer_value(self.class_num) - if is_classification else paddle.data_type.dense_input) - - prefixs = '_ _'.split( - ) if self.share_semantic_generator else 'source target'.split() - embed_prefixs = '_ _'.split( - ) if self.share_embed else 'source target'.split() - - word_vecs = [] - for id, input in enumerate([source, target]): - x = self.create_embedding(input, prefix=embed_prefixs[id]) - word_vecs.append(x) - - semantics = [] - for id, input in enumerate(word_vecs): - x = self.model_arch_creater(input, prefix=prefixs[id]) - semantics.append(x) - - if is_classification: - concated_vector = paddle.layer.concat(semantics) - prediction = paddle.layer.fc( - input=concated_vector, - size=self.class_num, - act=paddle.activation.Softmax()) - cost = paddle.layer.classification_cost( - input=prediction, label=label) - else: - prediction = paddle.layer.cos_sim(*semantics) - cost = paddle.layer.square_error_cost(prediction, label) - - if not self.is_infer: - return cost, prediction, label - return prediction -``` -### Pairwise Rank实现 -Pairwise Rank复用上面的DNN结构,同一个source对两个target求相似度打分, -如果左边的target打分高,预测为1,否则预测为 0。 +### Pairwise Rank +Pairwise Rank复用上面的DNN结构,同一个source对两个target求相似度打分,如果左边的target打分高,预测为1,否则预测为 0。实现请参考 [network_conf.py]( https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) 中的`_build_rank_model` 函数。 -```python -def _build_rank_model(self): - ''' - Build a pairwise rank model, and the cost is returned. - - A pairwise rank model has 3 inputs: - - source sentence - - left_target sentence - - right_target sentence - - label, 1 if left_target should be sorted in front of right_target, otherwise 0. 
- ''' - source = paddle.layer.data( - name='source_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0])) - left_target = paddle.layer.data( - name='left_target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - right_target = paddle.layer.data( - name='right_target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - label = paddle.layer.data( - name='label_input', type=paddle.data_type.integer_value(1)) - - prefixs = '_ _ _'.split( - ) if self.share_semantic_generator else 'source target target'.split() - embed_prefixs = '_ _'.split( - ) if self.share_embed else 'source target target'.split() - - word_vecs = [] - for id, input in enumerate([source, left_target, right_target]): - x = self.create_embedding(input, prefix=embed_prefixs[id]) - word_vecs.append(x) - - semantics = [] - for id, input in enumerate(word_vecs): - x = self.model_arch_creater(input, prefix=prefixs[id]) - semantics.append(x) - - # cossim score of source and left_target - left_score = paddle.layer.cos_sim(semantics[0], semantics[1]) - # cossim score of source and right target - right_score = paddle.layer.cos_sim(semantics[0], semantics[2]) - - # rank cost - cost = paddle.layer.rank_cost(left_score, right_score, label=label) - # prediction = left_score - right_score - # but this operator is not supported currently. - # so AUC will not used. - return cost, None, None -``` ## 数据格式 在 `./data` 中有简单的示例数据 @@ -371,7 +244,6 @@ def _build_rank_model(self): 6 10 \t 8 3 1 \t 1 ``` - ### 排序的数据格式 ``` # 4 fields each line: @@ -391,68 +263,11 @@ def _build_rank_model(self): ## 执行训练 -可以直接执行 `python train.py -y 0 --model_arch 0` 使用 `./data/classification` 目录里简单的数据来训练一个分类的FC模型。 - -其他模型结构也可以通过命令行实现定制,详细命令行参数如下 +可以直接执行 `python train.py -y 0 --model_arch 0` 使用 `./data/classification` 目录里的实例数据来测试能否直接运行训练分类FC模型。 -``` -usage: train.py [-h] [-i TRAIN_DATA_PATH] [-t TEST_DATA_PATH] - [-s SOURCE_DIC_PATH] [--target_dic_path TARGET_DIC_PATH] - [-b BATCH_SIZE] [-p NUM_PASSES] -y MODEL_TYPE -a MODEL_ARCH - [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET] - [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS] - [--num_workers NUM_WORKERS] [--use_gpu USE_GPU] [-c CLASS_NUM] - [--model_output_prefix MODEL_OUTPUT_PREFIX] - [-g NUM_BATCHES_TO_LOG] [-e NUM_BATCHES_TO_TEST] - [-z NUM_BATCHES_TO_SAVE_MODEL] - -PaddlePaddle DSSM example - -optional arguments: - -h, --help show this help message and exit - -i TRAIN_DATA_PATH, --train_data_path TRAIN_DATA_PATH - path of training dataset - -t TEST_DATA_PATH, --test_data_path TEST_DATA_PATH - path of testing dataset - -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH - path of the source's word dic - --target_dic_path TARGET_DIC_PATH - path of the target's word dic, if not set, the - `source_dic_path` will be used - -b BATCH_SIZE, --batch_size BATCH_SIZE - size of mini-batch (default:32) - -p NUM_PASSES, --num_passes NUM_PASSES - number of passes to run(default:10) - -y MODEL_TYPE, --model_type MODEL_TYPE - model type, 0 for classification, 1 for pairwise rank, - 2 for regression (default: classification) - -a MODEL_ARCH, --model_arch MODEL_ARCH - model architecture, 1 for CNN, 0 for FC, 2 for RNN - --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET - whether to share network parameters between source and - target - --share_embed SHARE_EMBED - whether to share word embedding between source and - target - --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32', - which 
means create a 4-layer dnn, demention of each - layer is 256, 128, 64 and 32 - --num_workers NUM_WORKERS - num worker threads, default 1 - --use_gpu USE_GPU whether to use GPU devices (default: False) - -c CLASS_NUM, --class_num CLASS_NUM - number of categories for classification task. - --model_output_prefix MODEL_OUTPUT_PREFIX - prefix of the path for model to store, (default: ./) - -g NUM_BATCHES_TO_LOG, --num_batches_to_log NUM_BATCHES_TO_LOG - number of batches to output train log, (default: 100) - -e NUM_BATCHES_TO_TEST, --num_batches_to_test NUM_BATCHES_TO_TEST - number of batches to test, (default: 200) - -z NUM_BATCHES_TO_SAVE_MODEL, --num_batches_to_save_model NUM_BATCHES_TO_SAVE_MODEL - number of batches to output model, (default: 400) -``` +其他模型结构也可以通过命令行实现定制,详细命令行参数请执行 `python train.py --help`进行查阅。 -重要的参数描述如下 +这里介绍最重要的几个参数: - `train_data_path` 训练数据路径 - `test_data_path` 测试数据路局,可以不设置 @@ -462,49 +277,8 @@ optional arguments: - `model_arch` 模型结构,FC 0, CNN 1, RNN 2 - `dnn_dims` 模型各层的维度设置,默认为 `256,128,64,32`,即模型有4层,各层维度如上设置 -## 用训练好的模型预测 -``` -usage: infer.py [-h] --model_path MODEL_PATH -i DATA_PATH -o - PREDICTION_OUTPUT_PATH -y MODEL_TYPE [-s SOURCE_DIC_PATH] - [--target_dic_path TARGET_DIC_PATH] -a MODEL_ARCH - [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET] - [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS] - [-c CLASS_NUM] - -PaddlePaddle DSSM infer - -optional arguments: - -h, --help show this help message and exit - --model_path MODEL_PATH - path of model parameters file - -i DATA_PATH, --data_path DATA_PATH - path of the dataset to infer - -o PREDICTION_OUTPUT_PATH, --prediction_output_path PREDICTION_OUTPUT_PATH - path to output the prediction - -y MODEL_TYPE, --model_type MODEL_TYPE - model type, 0 for classification, 1 for pairwise rank, - 2 for regression (default: classification) - -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH - path of the source's word dic - --target_dic_path TARGET_DIC_PATH - path of the target's word dic, if not set, the - `source_dic_path` will be used - -a MODEL_ARCH, --model_arch MODEL_ARCH - model architecture, 1 for CNN, 0 for FC, 2 for RNN - --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET - whether to share network parameters between source and - target - --share_embed SHARE_EMBED - whether to share word embedding between source and - target - --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32', - which means create a 4-layer dnn, demention of each - layer is 256, 128, 64 and 32 - -c CLASS_NUM, --class_num CLASS_NUM - number of categories for classification task. -``` - -部分参数可以参考 `train.py`,重要参数解释如下 +## 使用训练好的模型预测 +详细命令行参数请执行 `python train.py --help`进行查阅。重要参数解释如下: - `data_path` 需要预测的数据路径 - `prediction_output_path` 预测的输出路径 diff --git a/dssm/README.md b/dssm/README.md index 2d5e0effadabe356cf91e625bb6891fd5152140a..8148ea6557183df1446b98ed6d3a4da1f92c6438 100644 --- a/dssm/README.md +++ b/dssm/README.md @@ -65,10 +65,11 @@ In below, we describe how to train DSSM model in PaddlePaddle. All the codes are ### Create a word vector table for the text ```python def create_embedding(self, input, prefix=''): - ''' - Create an embedding table whose name has a `prefix`. - ''' - logger.info("create embedding table [%s] which dimention is %d" % + """ + Create word embedding. The `prefix` is added in front of the name of + embedding"s learnable parameter. 
+ """ + logger.info("Create embedding table [%s] whose dimention is %d" % (prefix, self.dnn_dims[0])) emb = paddle.layer.embedding( input=input, @@ -82,14 +83,15 @@ Since the input (embedding table) is a list of the IDs of the words correspondin ### CNN implementation ```python def create_cnn(self, emb, prefix=''): - ''' + + """ A multi-layer CNN. + :param emb: The word embedding. + :type emb: paddle.layer + :param prefix: The prefix will be added to of layers' names. + :type prefix: str + """ - @emb: paddle.layer - output of the embedding layer - @prefix: str - prefix of layers' names, used to share parameters between more than one `cnn` parts. - ''' def create_conv(context_len, hidden_size, prefix): key = "%s_%d_%d" % (prefix, context_len, hidden_size) conv = paddle.networks.sequence_conv_pool( @@ -97,15 +99,13 @@ def create_cnn(self, emb, prefix=''): context_len=context_len, hidden_size=hidden_size, # set parameter attr for parameter sharing - context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'), - fc_param_attr=ParamAttr(name=key + '_fc.w'), - fc_bias_attr=ParamAttr(name=key + '_fc.b'), - pool_bias_attr=ParamAttr(name=key + '_pool.b')) + context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"), + fc_param_attr=ParamAttr(name=key + "_fc.w"), + fc_bias_attr=ParamAttr(name=key + "_fc.b"), + pool_bias_attr=ParamAttr(name=key + "_pool.b")) return conv - logger.info('create a sequence_conv_pool which context width is 3') conv_3 = create_conv(3, self.dnn_dims[1], "cnn") - logger.info('create a sequence_conv_pool which context width is 4') conv_4 = create_conv(4, self.dnn_dims[1], "cnn") return conv_3, conv_4 ``` @@ -118,9 +118,9 @@ RNN is suitable for learning variable length of the information ```python def create_rnn(self, emb, prefix=''): - ''' + """ A GRU sentence vector learner. - ''' + """ gru = paddle.networks.simple_gru( input=emb, size=self.dnn_dims[1], @@ -136,14 +136,15 @@ def create_rnn(self, emb, prefix=''): ```python def create_fc(self, emb, prefix=''): - ''' + + """ A multi-layer fully connected neural networks. + :param emb: The output of the embedding layer + :type emb: paddle.layer + :param prefix: A prefix will be added to the layers' names. + :type prefix: str + """ - @emb: paddle.layer - output of the embedding layer - @prefix: str - prefix of layers' names, used to share parameters between more than one `fc` parts. - ''' _input_layer = paddle.layer.pooling( input=emb, pooling_type=paddle.pooling.Max()) fc = paddle.layer.fc( @@ -160,13 +161,10 @@ In the construction of FC, we use `paddle.layer.pooling` for the maximum pooling ```python def create_dnn(self, sent_vec, prefix): - # if more than three layers exists, a fc layer will be added. if len(self.dnn_dims) > 1: _input_layer = sent_vec for id, dim in enumerate(self.dnn_dims[1:]): name = "%s_fc_%d_%d" % (prefix, id, dim) - logger.info("create fc layer [%s] which dimention is %d" % - (name, dim)) fc = paddle.layer.fc( input=_input_layer, size=dim, @@ -180,117 +178,12 @@ def create_dnn(self, sent_vec, prefix): ### Classification / Regression The structure of classification and regression is similar. Below function can be used for both tasks. - -```python -def _build_classification_or_regression_model(self, is_classification): - ''' - Build a classification/regression model, and the cost is returned. - - A Classification has 3 inputs: - - source sentence - - target sentence - - classification label - - ''' - # prepare inputs. 
- assert self.class_num - - source = paddle.layer.data( - name='source_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0])) - target = paddle.layer.data( - name='target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - label = paddle.layer.data( - name='label_input', - type=paddle.data_type.integer_value(self.class_num) - if is_classification else paddle.data_type.dense_input) - - prefixs = '_ _'.split( - ) if self.share_semantic_generator else 'source target'.split() - embed_prefixs = '_ _'.split( - ) if self.share_embed else 'source target'.split() - - word_vecs = [] - for id, input in enumerate([source, target]): - x = self.create_embedding(input, prefix=embed_prefixs[id]) - word_vecs.append(x) - - semantics = [] - for id, input in enumerate(word_vecs): - x = self.model_arch_creater(input, prefix=prefixs[id]) - semantics.append(x) - - if is_classification: - concated_vector = paddle.layer.concat(semantics) - prediction = paddle.layer.fc( - input=concated_vector, - size=self.class_num, - act=paddle.activation.Softmax()) - cost = paddle.layer.classification_cost( - input=prediction, label=label) - else: - prediction = paddle.layer.cos_sim(*semantics) - cost = paddle.layer.square_error_cost(prediction, label) - - if not self.is_infer: - return cost, prediction, label - return prediction -``` +Please check the function `_build_classification_or_regression_model` in [network_conf.py]( https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for detail implementation. ### Pairwise Rank +Please check the function `_build_rank_model` in [network_conf.py]( https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for implementation. -```python -def _build_rank_model(self): - ''' - Build a pairwise rank model, and the cost is returned. - - A pairwise rank model has 3 inputs: - - source sentence - - left_target sentence - - right_target sentence - - label, 1 if left_target should be sorted in front of right_target, otherwise 0. - ''' - source = paddle.layer.data( - name='source_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0])) - left_target = paddle.layer.data( - name='left_target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - right_target = paddle.layer.data( - name='right_target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - label = paddle.layer.data( - name='label_input', type=paddle.data_type.integer_value(1)) - - prefixs = '_ _ _'.split( - ) if self.share_semantic_generator else 'source target target'.split() - embed_prefixs = '_ _'.split( - ) if self.share_embed else 'source target target'.split() - - word_vecs = [] - for id, input in enumerate([source, left_target, right_target]): - x = self.create_embedding(input, prefix=embed_prefixs[id]) - word_vecs.append(x) - - semantics = [] - for id, input in enumerate(word_vecs): - x = self.model_arch_creater(input, prefix=prefixs[id]) - semantics.append(x) - - # cossim score of source and left_target - left_score = paddle.layer.cos_sim(semantics[0], semantics[1]) - # cossim score of source and right target - right_score = paddle.layer.cos_sim(semantics[0], semantics[2]) - - # rank cost - cost = paddle.layer.rank_cost(left_score, right_score, label=label) - # prediction = left_score - right_score - # but this operator is not supported currently. - # so AUC will not used. 
- return cost, None, None -``` ## Data Format Below is a simple example for the data in `./data` @@ -347,67 +240,7 @@ The example of this format is as follows. ## Training -We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification. - - ``` -usage: train.py [-h] [-i TRAIN_DATA_PATH] [-t TEST_DATA_PATH] - [-s SOURCE_DIC_PATH] [--target_dic_path TARGET_DIC_PATH] - [-b BATCH_SIZE] [-p NUM_PASSES] -y MODEL_TYPE -a MODEL_ARCH - [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET] - [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS] - [--num_workers NUM_WORKERS] [--use_gpu USE_GPU] [-c CLASS_NUM] - [--model_output_prefix MODEL_OUTPUT_PREFIX] - [-g NUM_BATCHES_TO_LOG] [-e NUM_BATCHES_TO_TEST] - [-z NUM_BATCHES_TO_SAVE_MODEL] - -PaddlePaddle DSSM example - -optional arguments: - -h, --help show this help message and exit - -i TRAIN_DATA_PATH, --train_data_path TRAIN_DATA_PATH - path of training dataset - -t TEST_DATA_PATH, --test_data_path TEST_DATA_PATH - path of testing dataset - -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH - path of the source's word dic - --target_dic_path TARGET_DIC_PATH - path of the target's word dic, if not set, the - `source_dic_path` will be used - -b BATCH_SIZE, --batch_size BATCH_SIZE - size of mini-batch (default:32) - -p NUM_PASSES, --num_passes NUM_PASSES - number of passes to run(default:10) - -y MODEL_TYPE, --model_type MODEL_TYPE - model type, 0 for classification, 1 for pairwise rank, - 2 for regression (default: classification) - -a MODEL_ARCH, --model_arch MODEL_ARCH - model architecture, 1 for CNN, 0 for FC, 2 for RNN - --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET - whether to share network parameters between source and - target - --share_embed SHARE_EMBED - whether to share word embedding between source and - target - --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32', - which means create a 4-layer dnn, demention of each - layer is 256, 128, 64 and 32 - --num_workers NUM_WORKERS - num worker threads, default 1 - --use_gpu USE_GPU whether to use GPU devices (default: False) - -c CLASS_NUM, --class_num CLASS_NUM - number of categories for classification task. - --model_output_prefix MODEL_OUTPUT_PREFIX - prefix of the path for model to store, (default: ./) - -g NUM_BATCHES_TO_LOG, --num_batches_to_log NUM_BATCHES_TO_LOG - number of batches to output train log, (default: 100) - -e NUM_BATCHES_TO_TEST, --num_batches_to_test NUM_BATCHES_TO_TEST - number of batches to test, (default: 200) - -z NUM_BATCHES_TO_SAVE_MODEL, --num_batches_to_save_model NUM_BATCHES_TO_SAVE_MODEL - number of batches to output model, (default: 400) -``` - -Parameter description: +We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification. The parameters of the script `train.py` can be listed by running `python train.py --help`. Some important parameters are: - `train_data_path` Training data path - `test_data_path` Test data path, optional @@ -418,48 +251,8 @@ Parameter description: - `dnn_dims` The dimensions of the model's layers; the default is `256,128,64,32`, i.e. a 4-layer DNN (see the parsing sketch below).
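As a small standalone illustration (not part of the training scripts), the sketch below shows how such a `dnn_dims` string is turned into per-layer dimensions, mirroring the `layer_dims = map(int, args.dnn_dims.split(","))` call in `infer.py`:

```python
# Parse the default `--dnn_dims` value into a list of layer dimensions.
dnn_dims = "256,128,64,32"
layer_dims = [int(dim) for dim in dnn_dims.split(",")]
assert len(layer_dims) > 1, "the DNN needs more than one layer"
print(layer_dims)  # -> [256, 128, 64, 32], i.e. a 4-layer DNN
```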
## To predict using the trained model -``` -usage: infer.py [-h] --model_path MODEL_PATH -i DATA_PATH -o - PREDICTION_OUTPUT_PATH -y MODEL_TYPE [-s SOURCE_DIC_PATH] - [--target_dic_path TARGET_DIC_PATH] -a MODEL_ARCH - [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET] - [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS] - [-c CLASS_NUM] - -PaddlePaddle DSSM infer - -optional arguments: - -h, --help show this help message and exit - --model_path MODEL_PATH - path of model parameters file - -i DATA_PATH, --data_path DATA_PATH - path of the dataset to infer - -o PREDICTION_OUTPUT_PATH, --prediction_output_path PREDICTION_OUTPUT_PATH - path to output the prediction - -y MODEL_TYPE, --model_type MODEL_TYPE - model type, 0 for classification, 1 for pairwise rank, - 2 for regression (default: classification) - -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH - path of the source's word dic - --target_dic_path TARGET_DIC_PATH - path of the target's word dic, if not set, the - `source_dic_path` will be used - -a MODEL_ARCH, --model_arch MODEL_ARCH - model architecture, 1 for CNN, 0 for FC, 2 for RNN - --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET - whether to share network parameters between source and - target - --share_embed SHARE_EMBED - whether to share word embedding between source and - target - --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32', - which means create a 4-layer dnn, demention of each - layer is 256, 128, 64 and 32 - -c CLASS_NUM, --class_num CLASS_NUM - number of categories for classification task. -``` -Important parameters are +The paremeters to execute the script `infer.py` can be found by execution `python infer.py --help`. Some important parameters are: - `data_path` Path for the data to predict - `prediction_output_path` Prediction output path diff --git a/dssm/index.html b/dssm/index.html index b4777a28960f5c5ac83e3e0598ad31794b73a8dd..5c4a1a9d316821f25bbf204c3ba7698573722b94 100644 --- a/dssm/index.html +++ b/dssm/index.html @@ -107,10 +107,11 @@ In below, we describe how to train DSSM model in PaddlePaddle. All the codes are ### Create a word vector table for the text ```python def create_embedding(self, input, prefix=''): - ''' - Create an embedding table whose name has a `prefix`. - ''' - logger.info("create embedding table [%s] which dimention is %d" % + """ + Create word embedding. The `prefix` is added in front of the name of + embedding"s learnable parameter. + """ + logger.info("Create embedding table [%s] whose dimention is %d" % (prefix, self.dnn_dims[0])) emb = paddle.layer.embedding( input=input, @@ -124,14 +125,15 @@ Since the input (embedding table) is a list of the IDs of the words correspondin ### CNN implementation ```python def create_cnn(self, emb, prefix=''): - ''' + + """ A multi-layer CNN. + :param emb: The word embedding. + :type emb: paddle.layer + :param prefix: The prefix will be added to of layers' names. + :type prefix: str + """ - @emb: paddle.layer - output of the embedding layer - @prefix: str - prefix of layers' names, used to share parameters between more than one `cnn` parts. 
- ''' def create_conv(context_len, hidden_size, prefix): key = "%s_%d_%d" % (prefix, context_len, hidden_size) conv = paddle.networks.sequence_conv_pool( @@ -139,15 +141,13 @@ def create_cnn(self, emb, prefix=''): context_len=context_len, hidden_size=hidden_size, # set parameter attr for parameter sharing - context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'), - fc_param_attr=ParamAttr(name=key + '_fc.w'), - fc_bias_attr=ParamAttr(name=key + '_fc.b'), - pool_bias_attr=ParamAttr(name=key + '_pool.b')) + context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"), + fc_param_attr=ParamAttr(name=key + "_fc.w"), + fc_bias_attr=ParamAttr(name=key + "_fc.b"), + pool_bias_attr=ParamAttr(name=key + "_pool.b")) return conv - logger.info('create a sequence_conv_pool which context width is 3') conv_3 = create_conv(3, self.dnn_dims[1], "cnn") - logger.info('create a sequence_conv_pool which context width is 4') conv_4 = create_conv(4, self.dnn_dims[1], "cnn") return conv_3, conv_4 ``` @@ -160,9 +160,9 @@ RNN is suitable for learning variable length of the information ```python def create_rnn(self, emb, prefix=''): - ''' + """ A GRU sentence vector learner. - ''' + """ gru = paddle.networks.simple_gru( input=emb, size=self.dnn_dims[1], @@ -178,14 +178,15 @@ def create_rnn(self, emb, prefix=''): ```python def create_fc(self, emb, prefix=''): - ''' + + """ A multi-layer fully connected neural networks. + :param emb: The output of the embedding layer + :type emb: paddle.layer + :param prefix: A prefix will be added to the layers' names. + :type prefix: str + """ - @emb: paddle.layer - output of the embedding layer - @prefix: str - prefix of layers' names, used to share parameters between more than one `fc` parts. - ''' _input_layer = paddle.layer.pooling( input=emb, pooling_type=paddle.pooling.Max()) fc = paddle.layer.fc( @@ -202,13 +203,10 @@ In the construction of FC, we use `paddle.layer.pooling` for the maximum pooling ```python def create_dnn(self, sent_vec, prefix): - # if more than three layers exists, a fc layer will be added. if len(self.dnn_dims) > 1: _input_layer = sent_vec for id, dim in enumerate(self.dnn_dims[1:]): name = "%s_fc_%d_%d" % (prefix, id, dim) - logger.info("create fc layer [%s] which dimention is %d" % - (name, dim)) fc = paddle.layer.fc( input=_input_layer, size=dim, @@ -222,117 +220,12 @@ def create_dnn(self, sent_vec, prefix): ### Classification / Regression The structure of classification and regression is similar. Below function can be used for both tasks. - -```python -def _build_classification_or_regression_model(self, is_classification): - ''' - Build a classification/regression model, and the cost is returned. - - A Classification has 3 inputs: - - source sentence - - target sentence - - classification label - - ''' - # prepare inputs. 
- assert self.class_num - - source = paddle.layer.data( - name='source_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0])) - target = paddle.layer.data( - name='target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - label = paddle.layer.data( - name='label_input', - type=paddle.data_type.integer_value(self.class_num) - if is_classification else paddle.data_type.dense_input) - - prefixs = '_ _'.split( - ) if self.share_semantic_generator else 'source target'.split() - embed_prefixs = '_ _'.split( - ) if self.share_embed else 'source target'.split() - - word_vecs = [] - for id, input in enumerate([source, target]): - x = self.create_embedding(input, prefix=embed_prefixs[id]) - word_vecs.append(x) - - semantics = [] - for id, input in enumerate(word_vecs): - x = self.model_arch_creater(input, prefix=prefixs[id]) - semantics.append(x) - - if is_classification: - concated_vector = paddle.layer.concat(semantics) - prediction = paddle.layer.fc( - input=concated_vector, - size=self.class_num, - act=paddle.activation.Softmax()) - cost = paddle.layer.classification_cost( - input=prediction, label=label) - else: - prediction = paddle.layer.cos_sim(*semantics) - cost = paddle.layer.square_error_cost(prediction, label) - - if not self.is_infer: - return cost, prediction, label - return prediction -``` +Please check the function `_build_classification_or_regression_model` in [network_conf.py]( https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for detail implementation. ### Pairwise Rank +Please check the function `_build_rank_model` in [network_conf.py]( https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for implementation. -```python -def _build_rank_model(self): - ''' - Build a pairwise rank model, and the cost is returned. - - A pairwise rank model has 3 inputs: - - source sentence - - left_target sentence - - right_target sentence - - label, 1 if left_target should be sorted in front of right_target, otherwise 0. - ''' - source = paddle.layer.data( - name='source_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0])) - left_target = paddle.layer.data( - name='left_target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - right_target = paddle.layer.data( - name='right_target_input', - type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) - label = paddle.layer.data( - name='label_input', type=paddle.data_type.integer_value(1)) - - prefixs = '_ _ _'.split( - ) if self.share_semantic_generator else 'source target target'.split() - embed_prefixs = '_ _'.split( - ) if self.share_embed else 'source target target'.split() - - word_vecs = [] - for id, input in enumerate([source, left_target, right_target]): - x = self.create_embedding(input, prefix=embed_prefixs[id]) - word_vecs.append(x) - - semantics = [] - for id, input in enumerate(word_vecs): - x = self.model_arch_creater(input, prefix=prefixs[id]) - semantics.append(x) - - # cossim score of source and left_target - left_score = paddle.layer.cos_sim(semantics[0], semantics[1]) - # cossim score of source and right target - right_score = paddle.layer.cos_sim(semantics[0], semantics[2]) - - # rank cost - cost = paddle.layer.rank_cost(left_score, right_score, label=label) - # prediction = left_score - right_score - # but this operator is not supported currently. - # so AUC will not used. 
- return cost, None, None -``` ## Data Format Below is a simple example for the data in `./data` @@ -389,67 +282,7 @@ The example of this format is as follows. ## Training -We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification. - - -``` -usage: train.py [-h] [-i TRAIN_DATA_PATH] [-t TEST_DATA_PATH] - [-s SOURCE_DIC_PATH] [--target_dic_path TARGET_DIC_PATH] - [-b BATCH_SIZE] [-p NUM_PASSES] -y MODEL_TYPE -a MODEL_ARCH - [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET] - [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS] - [--num_workers NUM_WORKERS] [--use_gpu USE_GPU] [-c CLASS_NUM] - [--model_output_prefix MODEL_OUTPUT_PREFIX] - [-g NUM_BATCHES_TO_LOG] [-e NUM_BATCHES_TO_TEST] - [-z NUM_BATCHES_TO_SAVE_MODEL] - -PaddlePaddle DSSM example - -optional arguments: - -h, --help show this help message and exit - -i TRAIN_DATA_PATH, --train_data_path TRAIN_DATA_PATH - path of training dataset - -t TEST_DATA_PATH, --test_data_path TEST_DATA_PATH - path of testing dataset - -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH - path of the source's word dic - --target_dic_path TARGET_DIC_PATH - path of the target's word dic, if not set, the - `source_dic_path` will be used - -b BATCH_SIZE, --batch_size BATCH_SIZE - size of mini-batch (default:32) - -p NUM_PASSES, --num_passes NUM_PASSES - number of passes to run(default:10) - -y MODEL_TYPE, --model_type MODEL_TYPE - model type, 0 for classification, 1 for pairwise rank, - 2 for regression (default: classification) - -a MODEL_ARCH, --model_arch MODEL_ARCH - model architecture, 1 for CNN, 0 for FC, 2 for RNN - --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET - whether to share network parameters between source and - target - --share_embed SHARE_EMBED - whether to share word embedding between source and - target - --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32', - which means create a 4-layer dnn, demention of each - layer is 256, 128, 64 and 32 - --num_workers NUM_WORKERS - num worker threads, default 1 - --use_gpu USE_GPU whether to use GPU devices (default: False) - -c CLASS_NUM, --class_num CLASS_NUM - number of categories for classification task. - --model_output_prefix MODEL_OUTPUT_PREFIX - prefix of the path for model to store, (default: ./) - -g NUM_BATCHES_TO_LOG, --num_batches_to_log NUM_BATCHES_TO_LOG - number of batches to output train log, (default: 100) - -e NUM_BATCHES_TO_TEST, --num_batches_to_test NUM_BATCHES_TO_TEST - number of batches to test, (default: 200) - -z NUM_BATCHES_TO_SAVE_MODEL, --num_batches_to_save_model NUM_BATCHES_TO_SAVE_MODEL - number of batches to output model, (default: 400) -``` - -Parameter description: +We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification. The paremeters to execute the script `train.py` can be found by execution `python infer.py --help`. Some important parameters are: - `train_data_path` Training data path - `test_data_path` Test data path, optional @@ -460,48 +293,8 @@ Parameter description: - `dnn_dims` The dimension of each layer of the model is set, the default is `256,128,64,32`,with 4 layers. 
## To predict using the trained model -``` -usage: infer.py [-h] --model_path MODEL_PATH -i DATA_PATH -o - PREDICTION_OUTPUT_PATH -y MODEL_TYPE [-s SOURCE_DIC_PATH] - [--target_dic_path TARGET_DIC_PATH] -a MODEL_ARCH - [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET] - [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS] - [-c CLASS_NUM] - -PaddlePaddle DSSM infer - -optional arguments: - -h, --help show this help message and exit - --model_path MODEL_PATH - path of model parameters file - -i DATA_PATH, --data_path DATA_PATH - path of the dataset to infer - -o PREDICTION_OUTPUT_PATH, --prediction_output_path PREDICTION_OUTPUT_PATH - path to output the prediction - -y MODEL_TYPE, --model_type MODEL_TYPE - model type, 0 for classification, 1 for pairwise rank, - 2 for regression (default: classification) - -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH - path of the source's word dic - --target_dic_path TARGET_DIC_PATH - path of the target's word dic, if not set, the - `source_dic_path` will be used - -a MODEL_ARCH, --model_arch MODEL_ARCH - model architecture, 1 for CNN, 0 for FC, 2 for RNN - --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET - whether to share network parameters between source and - target - --share_embed SHARE_EMBED - whether to share word embedding between source and - target - --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32', - which means create a 4-layer dnn, demention of each - layer is 256, 128, 64 and 32 - -c CLASS_NUM, --class_num CLASS_NUM - number of categories for classification task. -``` -Important parameters are +The paremeters to execute the script `infer.py` can be found by execution `python infer.py --help`. Some important parameters are: - `data_path` Path for the data to predict - `prediction_output_path` Prediction output path diff --git a/dssm/infer.py b/dssm/infer.py index f0c65e44a8c5f9249172f0c1912dc9c195ce69c2..63a9657341d7d220b72696fd215d1850b1718f32 100644 --- a/dssm/infer.py +++ b/dssm/infer.py @@ -9,83 +9,81 @@ from utils import logger, ModelType, ModelArch, load_dic parser = argparse.ArgumentParser(description="PaddlePaddle DSSM infer") parser.add_argument( - '--model_path', - type=str, - required=True, - help="path of model parameters file") + "--model_path", type=str, required=True, help="The path of trained model.") parser.add_argument( - '-i', - '--data_path', + "-i", + "--data_path", type=str, required=True, - help="path of the dataset to infer") + help="The path of the data for inferring.") parser.add_argument( - '-o', - '--prediction_output_path', + "-o", + "--prediction_output_path", type=str, required=True, - help="path to output the prediction") + help="The path to save the predictions.") parser.add_argument( - '-y', - '--model_type', + "-y", + "--model_type", type=int, required=True, default=ModelType.CLASSIFICATION_MODE, - help=("model type, %d for classification, %d for pairwise rank, " - "%d for regression (default: classification)") % + help=("The model type: %d for classification, %d for pairwise rank, " + "%d for regression (default: classification).") % (ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE, ModelType.REGRESSION_MODE)) parser.add_argument( - '-s', - '--source_dic_path', + "-s", + "--source_dic_path", type=str, required=False, - help="path of the source's word dic") + help="The path of the source's word dictionary.") parser.add_argument( - '--target_dic_path', + "--target_dic_path", type=str, required=False, - help=("path of the 
target's word dictionary, " - "if not set, the `source_dic_path` will be used")) + help=("The path of the target's word dictionary, " + "if this parameter is not set, the `source_dic_path` will be used.")) parser.add_argument( - '-a', - '--model_arch', + "-a", + "--model_arch", type=int, required=True, default=ModelArch.CNN_MODE, help="model architecture, %d for CNN, %d for FC, %d for RNN" % (ModelArch.CNN_MODE, ModelArch.FC_MODE, ModelArch.RNN_MODE)) parser.add_argument( - '--share_network_between_source_target', + "--share_network_between_source_target", type=distutils.util.strtobool, default=False, help="whether to share network parameters between source and target") parser.add_argument( - '--share_embed', + "--share_embed", type=distutils.util.strtobool, default=False, help="whether to share word embedding between source and target") parser.add_argument( - '--dnn_dims', + "--dnn_dims", type=str, - default='256,128,64,32', - help=("dimentions of dnn layers, default is '256,128,64,32', " - "which means create a 4-layer dnn, " - "demention of each layer is 256, 128, 64 and 32")) + default="256,128,64,32", + help=("The dimentions of dnn layers, default is `256,128,64,32`, " + "which means a dnn with 4 layers with " + "dmentions 256, 128, 64 and 32 will be created.")) parser.add_argument( - '-c', - '--class_num', + "-c", + "--class_num", type=int, default=0, - help="number of categories for classification task.") + help="The number of categories for classification task.") args = parser.parse_args() args.model_type = ModelType(args.model_type) args.model_arch = ModelArch(args.model_arch) if args.model_type.is_classification(): - assert args.class_num > 1, "--class_num should be set in classification task." + assert args.class_num > 1, ("The parameter class_num should be set " + "in classification task.") -layer_dims = map(int, args.dnn_dims.split(',')) +layer_dims = map(int, args.dnn_dims.split(",")) args.target_dic_path = args.source_dic_path if not args.target_dic_path \ else args.target_dic_path @@ -94,8 +92,6 @@ paddle.init(use_gpu=False, trainer_count=1) class Inferer(object): def __init__(self, param_path): - logger.info("create DSSM model") - prediction = DSSM( dnn_dims=layer_dims, vocab_sizes=[ @@ -110,14 +106,13 @@ class Inferer(object): is_infer=True)() # load parameter - logger.info("load model parameters from %s" % param_path) + logger.info("Load the trained model from %s." % param_path) self.parameters = paddle.parameters.Parameters.from_tar( - open(param_path, 'r')) + open(param_path, "r")) self.inferer = paddle.inference.Inference( output_layer=prediction, parameters=self.parameters) def infer(self, data_path): - logger.info("infer data...") dataset = reader.Dataset( train_path=data_path, test_path=None, @@ -125,19 +120,20 @@ class Inferer(object): target_dic_path=args.target_dic_path, model_type=args.model_type, ) infer_reader = paddle.batch(dataset.infer, batch_size=1000) - logger.warning('write predictions to %s' % args.prediction_output_path) + logger.warning("Write predictions to %s." 
% args.prediction_output_path) - output_f = open(args.prediction_output_path, 'w') + output_f = open(args.prediction_output_path, "w") for id, batch in enumerate(infer_reader()): res = self.inferer.infer(input=batch) - predictions = [' '.join(map(str, x)) for x in res] + predictions = [" ".join(map(str, x)) for x in res] assert len(batch) == len(predictions), ( - "predict error, %d inputs, " - "but %d predictions") % (len(batch), len(predictions)) - output_f.write('\n'.join(map(str, predictions)) + '\n') + "Error! %d inputs are given, " + "but only %d predictions are returned.") % (len(batch), + len(predictions)) + output_f.write("\n".join(map(str, predictions)) + "\n") -if __name__ == '__main__': +if __name__ == "__main__": inferer = Inferer(args.model_path) inferer.infer(args.data_path) diff --git a/dssm/network_conf.py b/dssm/network_conf.py index 6888ca0ef44fe9711ba63fcb77fbcea088ce0171..135a00bf6f64da611e6df54c08837fcee9b39041 100644 --- a/dssm/network_conf.py +++ b/dssm/network_conf.py @@ -13,26 +13,33 @@ class DSSM(object): class_num=None, share_embed=False, is_infer=False): - ''' - @dnn_dims: list of int - dimentions of each layer in semantic vector generator. - @vocab_sizes: 2-d tuple - size of both left and right items. - @model_type: int - type of task, should be 'rank: 0', 'regression: 1' or 'classification: 2' - @model_arch: int - model architecture - @share_semantic_generator: bool - whether to share the semantic vector generator for both left and right. - @share_embed: bool - whether to share the embeddings between left and right. - @class_num: int - number of categories. - ''' + """ + :param dnn_dims: The dimention of each layer in the semantic vector + generator. + :type dnn_dims: list of int + :param vocab_sizes: The size of left and right items. + :type vocab_sizes: A list having 2 elements. + :param model_type: The type of task to train the DSSM model. The value + should be "rank: 0", "regression: 1" or + "classification: 2". + :type model_type: int + :param model_arch: A value indicating the model architecture to use. + :type model_arch: int + :param share_semantic_generator: A flag indicating whether to share the + semantic vector between the left and + the right item. + :type share_semantic_generator: bool + :param share_embed: A floag indicating whether to share the embeddings + between the left and the right item. + :type share_embed: bool + :param class_num: The number of categories. + :type class_num: int + """ assert len(vocab_sizes) == 2, ( - "vocab_sizes specify the sizes left and right inputs, " - "and dim should be 2.") - assert len(dnn_dims) > 1, "more than two layers is needed." + "The vocab_sizes specifying the sizes left and right inputs. 
" + "Its dimension should be 2.") + assert len(dnn_dims) > 1, ("In the DNN model, more than two layers " + "are needed.") self.dnn_dims = dnn_dims self.vocab_sizes = vocab_sizes @@ -42,91 +49,89 @@ class DSSM(object): self.model_arch = ModelArch(model_arch) self.class_num = class_num self.is_infer = is_infer - logger.warning("build DSSM model with config of %s, %s" % + logger.warning("Build DSSM model with config of %s, %s" % (self.model_type, self.model_arch)) - logger.info("vocabulary sizes: %s" % str(self.vocab_sizes)) + logger.info("The vocabulary size is : %s" % str(self.vocab_sizes)) # bind model architecture _model_arch = { - 'cnn': self.create_cnn, - 'fc': self.create_fc, - 'rnn': self.create_rnn, + "cnn": self.create_cnn, + "fc": self.create_fc, + "rnn": self.create_rnn, } - def _model_arch_creater(emb, prefix=''): + def _model_arch_creater(emb, prefix=""): sent_vec = _model_arch.get(str(model_arch))(emb, prefix) dnn = self.create_dnn(sent_vec, prefix) return dnn self.model_arch_creater = _model_arch_creater - # build model type _model_type = { - 'classification': self._build_classification_model, - 'rank': self._build_rank_model, - 'regression': self._build_regression_model, + "classification": self._build_classification_model, + "rank": self._build_rank_model, + "regression": self._build_regression_model, } - print 'model type: ', str(self.model_type) + print("model type: ", str(self.model_type)) self.model_type_creater = _model_type[str(self.model_type)] def __call__(self): return self.model_type_creater() - def create_embedding(self, input, prefix=''): - ''' - Create an embedding table whose name has a `prefix`. - ''' - logger.info("create embedding table [%s] which dimention is %d" % + def create_embedding(self, input, prefix=""): + """ + Create word embedding. The `prefix` is added in front of the name of + embedding"s learnable parameter. + """ + logger.info("Create embedding table [%s] whose dimention is %d. " % (prefix, self.dnn_dims[0])) emb = paddle.layer.embedding( input=input, size=self.dnn_dims[0], - param_attr=ParamAttr(name='%s_emb.w' % prefix)) + param_attr=ParamAttr(name="%s_emb.w" % prefix)) return emb - def create_fc(self, emb, prefix=''): - ''' + def create_fc(self, emb, prefix=""): + """ A multi-layer fully connected neural networks. - @emb: paddle.layer - output of the embedding layer - @prefix: str - prefix of layers' names, used to share parameters between - more than one `fc` parts. - ''' + :param emb: The output of the embedding layer + :type emb: paddle.layer + :param prefix: A prefix will be added to the layers' names. + :type prefix: str + """ _input_layer = paddle.layer.pooling( input=emb, pooling_type=paddle.pooling.Max()) fc = paddle.layer.fc( input=_input_layer, size=self.dnn_dims[1], - param_attr=ParamAttr(name='%s_fc.w' % prefix), + param_attr=ParamAttr(name="%s_fc.w" % prefix), bias_attr=ParamAttr(name="%s_fc.b" % prefix, initial_std=0.)) return fc - def create_rnn(self, emb, prefix=''): - ''' + def create_rnn(self, emb, prefix=""): + """ A GRU sentence vector learner. 
- ''' + """ gru = paddle.networks.simple_gru( input=emb, size=self.dnn_dims[1], - mixed_param_attr=ParamAttr(name='%s_gru_mixed.w' % prefix), + mixed_param_attr=ParamAttr(name="%s_gru_mixed.w" % prefix), mixed_bias_param_attr=ParamAttr(name="%s_gru_mixed.b" % prefix), - gru_param_attr=ParamAttr(name='%s_gru.w' % prefix), + gru_param_attr=ParamAttr(name="%s_gru.w" % prefix), gru_bias_attr=ParamAttr(name="%s_gru.b" % prefix)) sent_vec = paddle.layer.last_seq(gru) return sent_vec - def create_cnn(self, emb, prefix=''): - ''' + def create_cnn(self, emb, prefix=""): + """ A multi-layer CNN. - @emb: paddle.layer - output of the embedding layer - @prefix: str - prefix of layers' names, used to share parameters between - more than one `cnn` parts. - ''' + :param emb: The word embedding. + :type emb: paddle.layer + :param prefix: The prefix will be added to of layers' names. + :type prefix: str + """ def create_conv(context_len, hidden_size, prefix): key = "%s_%d_%d" % (prefix, context_len, hidden_size) @@ -135,15 +140,15 @@ class DSSM(object): context_len=context_len, hidden_size=hidden_size, # set parameter attr for parameter sharing - context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'), - fc_param_attr=ParamAttr(name=key + '_fc.w'), - fc_bias_attr=ParamAttr(name=key + '_fc.b'), - pool_bias_attr=ParamAttr(name=key + '_pool.b')) + context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"), + fc_param_attr=ParamAttr(name=key + "_fc.w"), + fc_bias_attr=ParamAttr(name=key + "_fc.b"), + pool_bias_attr=ParamAttr(name=key + "_pool.b")) return conv - logger.info('create a sequence_conv_pool which context width is 3') + logger.info("create a sequence_conv_pool which context width is 3") conv_3 = create_conv(3, self.dnn_dims[1], "cnn") - logger.info('create a sequence_conv_pool which context width is 4') + logger.info("create a sequence_conv_pool which context width is 4") conv_4 = create_conv(4, self.dnn_dims[1], "cnn") return conv_3, conv_4 @@ -160,8 +165,8 @@ class DSSM(object): input=_input_layer, size=dim, act=paddle.activation.Tanh(), - param_attr=ParamAttr(name='%s.w' % name), - bias_attr=ParamAttr(name='%s.b' % name, initial_std=0.)) + param_attr=ParamAttr(name="%s.w" % name), + bias_attr=ParamAttr(name="%s.b" % name, initial_std=0.)) _input_layer = fc return _input_layer @@ -178,7 +183,7 @@ class DSSM(object): is_classification=False) def _build_rank_model(self): - ''' + """ Build a pairwise rank model, and the cost is returned. A pairwise rank model has 3 inputs: @@ -187,26 +192,26 @@ class DSSM(object): - right_target sentence - label, 1 if left_target should be sorted in front of right_target, otherwise 0. 
- ''' + """ logger.info("build rank model") assert self.model_type.is_rank() source = paddle.layer.data( - name='source_input', + name="source_input", type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0])) left_target = paddle.layer.data( - name='left_target_input', + name="left_target_input", type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) right_target = paddle.layer.data( - name='right_target_input', + name="right_target_input", type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1])) if not self.is_infer: label = paddle.layer.data( - name='label_input', type=paddle.data_type.integer_value(1)) + name="label_input", type=paddle.data_type.integer_value(1)) - prefixs = '_ _ _'.split( - ) if self.share_semantic_generator else 'source target target'.split() - embed_prefixs = '_ _'.split( - ) if self.share_embed else 'source target target'.split() + prefixs = "_ _ _".split( + ) if self.share_semantic_generator else "source target target".split() + embed_prefixs = "_ _".split( + ) if self.share_embed else "source target target".split() word_vecs = [] for id, input in enumerate([source, left_target, right_target]): @@ -218,9 +223,9 @@ class DSSM(object): x = self.model_arch_creater(input, prefix=prefixs[id]) semantics.append(x) - # cossim score of source and left_target + # The cosine similarity score of source and left_target. left_score = paddle.layer.cos_sim(semantics[0], semantics[1]) - # cossim score of source and right target + # The cosine similarity score of source and right target. right_score = paddle.layer.cos_sim(semantics[0], semantics[2]) if not self.is_infer: @@ -233,34 +238,33 @@ class DSSM(object): return right_score def _build_classification_or_regression_model(self, is_classification): - ''' + """ Build a classification/regression model, and the cost is returned. - A Classification has 3 inputs: + The classification/regression task expects 3 inputs: - source sentence - target sentence - classification label - ''' + """ if is_classification: - # prepare inputs. 
             assert self.class_num

         source = paddle.layer.data(
-            name='source_input',
+            name="source_input",
             type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
         target = paddle.layer.data(
-            name='target_input',
+            name="target_input",
             type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
         label = paddle.layer.data(
-            name='label_input',
+            name="label_input",
             type=paddle.data_type.integer_value(self.class_num)
             if is_classification else paddle.data_type.dense_vector(1))

-        prefixs = '_ _'.split(
-        ) if self.share_semantic_generator else 'source target'.split()
-        embed_prefixs = '_ _'.split(
-        ) if self.share_embed else 'source target'.split()
+        prefixs = "_ _".split(
+        ) if self.share_semantic_generator else "source target".split()
+        embed_prefixs = "_ _".split(
+        ) if self.share_embed else "source target".split()

         word_vecs = []
         for id, input in enumerate([source, target]):
diff --git a/dssm/train.py b/dssm/train.py
index eb563d1d1a3c8f8f09543a887335ead824c1ee2a..9d5b5782ff0d372e61aae31be6cf8101904215e0 100644
--- a/dssm/train.py
+++ b/dssm/train.py
@@ -9,120 +9,129 @@ from utils import TaskType, load_dic, logger, ModelType, ModelArch, display_args

 parser = argparse.ArgumentParser(description="PaddlePaddle DSSM example")
 parser.add_argument(
-    '-i',
-    '--train_data_path',
+    "-i",
+    "--train_data_path",
     type=str,
     required=False,
-    help="path of training dataset")
+    help="The path of the training data.")
 parser.add_argument(
-    '-t',
-    '--test_data_path',
+    "-t",
+    "--test_data_path",
     type=str,
     required=False,
-    help="path of testing dataset")
+    help="The path of the testing data.")
 parser.add_argument(
-    '-s',
-    '--source_dic_path',
+    "-s",
+    "--source_dic_path",
     type=str,
     required=False,
-    help="path of the source's word dic")
+    help="The path of the source's word dictionary.")
 parser.add_argument(
-    '--target_dic_path',
+    "--target_dic_path",
     type=str,
     required=False,
-    help=("path of the target's word dictionary, "
-          "if not set, the `source_dic_path` will be used"))
+    help=("The path of the target's word dictionary. "
+          "If this parameter is not set, `source_dic_path` will be used."))
 parser.add_argument(
-    '-b',
-    '--batch_size',
+    "-b",
+    "--batch_size",
     type=int,
     default=32,
-    help="size of mini-batch (default:32)")
+    help="The size of a mini-batch (default: 32).")
 parser.add_argument(
-    '-p',
-    '--num_passes',
+    "-p",
+    "--num_passes",
     type=int,
     default=10,
-    help="number of passes to run(default:10)")
+    help="The number of passes to run (default: 10).")
 parser.add_argument(
-    '-y',
-    '--model_type',
+    "-y",
+    "--model_type",
     type=int,
     required=True,
     default=ModelType.CLASSIFICATION_MODE,
-    help="model type, %d for classification, %d for pairwise rank, %d for regression (default: classification)"
-    % (ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE,
-       ModelType.REGRESSION_MODE))
+    help=("The model type, %d for classification, %d for pairwise rank, "
+          "%d for regression (default: classification).") %
+    (ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE,
+     ModelType.REGRESSION_MODE))
 parser.add_argument(
-    '-a',
-    '--model_arch',
+    "-a",
+    "--model_arch",
     type=int,
     required=True,
     default=ModelArch.CNN_MODE,
-    help="model architecture, %d for CNN, %d for FC, %d for RNN" %
+    help="The model architecture, %d for CNN, %d for FC, %d for RNN." %
     (ModelArch.CNN_MODE, ModelArch.FC_MODE, ModelArch.RNN_MODE))
 parser.add_argument(
-    '--share_network_between_source_target',
+    "--share_network_between_source_target",
     type=distutils.util.strtobool,
     default=False,
-    help="whether to share network parameters between source and target")
+    help="Whether to share network parameters between source and target.")
 parser.add_argument(
-    '--share_embed',
+    "--share_embed",
     type=distutils.util.strtobool,
     default=False,
-    help="whether to share word embedding between source and target")
+    help="Whether to share word embedding between source and target.")
 parser.add_argument(
-    '--dnn_dims',
+    "--dnn_dims",
     type=str,
-    default='256,128,64,32',
-    help="dimentions of dnn layers, default is '256,128,64,32', which means create a 4-layer dnn, demention of each layer is 256, 128, 64 and 32"
-)
+    default="256,128,64,32",
+    help=("The dimensions of the DNN layers, default is '256,128,64,32', "
+          "which means a 4-layer DNN is created, and the dimensions of the "
+          "layers are 256, 128, 64 and 32 respectively."))
 parser.add_argument(
-    '--num_workers', type=int, default=1, help="num worker threads, default 1")
+    "--num_workers",
+    type=int,
+    default=1,
+    help="The number of worker threads (default: 1).")
 parser.add_argument(
-    '--use_gpu',
+    "--use_gpu",
     type=distutils.util.strtobool,
     default=False,
-    help="whether to use GPU devices (default: False)")
+    help="Whether to use GPU devices (default: False).")
 parser.add_argument(
-    '-c',
-    '--class_num',
+    "-c",
+    "--class_num",
     type=int,
     default=0,
-    help="number of categories for classification task.")
+    help="The number of categories for the classification task.")
 parser.add_argument(
-    '--model_output_prefix',
+    "--model_output_prefix",
     type=str,
     default="./",
-    help="prefix of the path for model to store, (default: ./)")
+    help="The prefix of the path to store the trained models (default: ./).")
 parser.add_argument(
-    '-g',
-    '--num_batches_to_log',
+    "-g",
+    "--num_batches_to_log",
     type=int,
     default=100,
-    help="number of batches to output train log, (default: 100)")
+    help=("The log period. Every num_batches_to_log batches, "
+          "a training log will be printed (default: 100)."))
 parser.add_argument(
-    '-e',
-    '--num_batches_to_test',
+    "-e",
+    "--num_batches_to_test",
     type=int,
     default=200,
-    help="number of batches to test, (default: 200)")
+    help=("The test period. Every num_batches_to_test batches, "
+          "the model will be evaluated on the test data (default: 200)."))
 parser.add_argument(
-    '-z',
-    '--num_batches_to_save_model',
+    "-z",
+    "--num_batches_to_save_model",
    type=int,
    default=400,
-    help="number of batches to output model, (default: 400)")
+    help=("The save period. Every num_batches_to_save_model batches, "
+          "a trained model will be saved (default: 400)."))

-# arguments check.
 args = parser.parse_args()
 args.model_type = ModelType(args.model_type)
 args.model_arch = ModelArch(args.model_arch)
 if args.model_type.is_classification():
-    assert args.class_num > 1, "--class_num should be set in classification task."
+    assert args.class_num > 1, ("The parameter class_num should be set to a "
+                                "value larger than 1 in the classification "
+                                "task.")

-layer_dims = [int(i) for i in args.dnn_dims.split(',')]
-args.target_dic_path = args.source_dic_path if not args.target_dic_path else args.target_dic_path
+layer_dims = [int(i) for i in args.dnn_dims.split(",")]
+args.target_dic_path = args.target_dic_path or args.source_dic_path


 def train(train_data_path=None,
@@ -138,15 +147,15 @@ def train(train_data_path=None,
           class_num=None,
           num_workers=1,
           use_gpu=False):
-    '''
+    """
     Train the DSSM.
-    '''
-    default_train_path = './data/rank/train.txt'
-    default_test_path = './data/rank/test.txt'
-    default_dic_path = './data/vocab.txt'
+    """
+    default_train_path = "./data/rank/train.txt"
+    default_test_path = "./data/rank/test.txt"
+    default_dic_path = "./data/vocab.txt"
     if not model_type.is_rank():
-        default_train_path = './data/classification/train.txt'
-        default_test_path = './data/classification/test.txt'
+        default_train_path = "./data/classification/train.txt"
+        default_test_path = "./data/classification/test.txt"

     use_default_data = not train_data_path

@@ -200,19 +209,19 @@ def train(train_data_path=None,

     feeding = {}
     if model_type.is_classification() or model_type.is_regression():
-        feeding = {'source_input': 0, 'target_input': 1, 'label_input': 2}
+        feeding = {"source_input": 0, "target_input": 1, "label_input": 2}
     else:
         feeding = {
-            'source_input': 0,
-            'left_target_input': 1,
-            'right_target_input': 2,
-            'label_input': 3
+            "source_input": 0,
+            "left_target_input": 1,
+            "right_target_input": 2,
+            "label_input": 3
         }

     def _event_handler(event):
-        '''
-        Define batch handler
-        '''
+        """
+        The event handler for the training process.
+        """
         if isinstance(event, paddle.event.EndIteration):
             # output train log
             if event.batch_id % args.num_batches_to_log == 0:
@@ -249,7 +258,7 @@ def train(train_data_path=None,

     logger.info("Training has finished.")


-if __name__ == '__main__':
+if __name__ == "__main__":
     display_args(args)
     train(
         train_data_path=args.train_data_path,
diff --git a/dssm/utils.py b/dssm/utils.py
index 7bcbec6ebfb2e41d94e49b4e39eaf106e414c103..97296fd5dcc2dc664c97dd83d658c8805221fc57 100644
--- a/dssm/utils.py
+++ b/dssm/utils.py
@@ -8,7 +8,7 @@ logger.setLevel(logging.INFO)


 def mode_attr_name(mode):
-    return mode.upper() + '_MODE'
+    return mode.upper() + "_MODE"


 def create_attrs(cls):
@@ -17,9 +17,9 @@ def create_attrs(cls):


 def make_check_method(cls):
-    '''
-    create methods for classes.
+    """
+    Create an `is_<mode>` check method for every mode of a mode class.
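+
+    For example (an illustrative note based on the classes below), for
+    `TaskType` whose modes are `train test infer`, this attaches
+    `is_train()`, `is_test()` and `is_infer()` predicates.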
-    '''
+    """

     def method(mode):
         def _method(self):
@@ -28,7 +28,7 @@ def make_check_method(cls):
             return _method

     for id, mode in enumerate(cls.modes):
-        setattr(cls, 'is_' + mode, method(mode))
+        setattr(cls, "is_" + mode, method(mode))


 def make_create_method(cls):
@@ -41,10 +41,10 @@ def make_create_method(cls):
         return _method

     for id, mode in enumerate(cls.modes):
-        setattr(cls, 'create_' + mode, method(mode))
+        setattr(cls, "create_" + mode, method(mode))


-def make_str_method(cls, type_name='unk'):
+def make_str_method(cls, type_name="unk"):
     def _str_(self):
         for mode in cls.modes:
             if self.mode == getattr(cls, mode_attr_name(mode)):
@@ -53,9 +53,9 @@ def make_str_method(cls, type_name='unk'):
     def _hash_(self):
         return self.mode

-    setattr(cls, '__str__', _str_)
-    setattr(cls, '__repr__', _str_)
-    setattr(cls, '__hash__', _hash_)
+    setattr(cls, "__str__", _str_)
+    setattr(cls, "__repr__", _str_)
+    setattr(cls, "__hash__", _hash_)

     cls.__name__ = type_name

@@ -65,7 +65,7 @@ def _init_(self, mode, cls):
     elif isinstance(mode, cls):
         self.mode = mode.mode
     else:
-        raise Exception("wrong mode type, get type: %s, value: %s" %
+        raise Exception("Wrong mode type, got type: %s, value: %s." %
                         (type(mode), mode))

@@ -77,21 +77,21 @@ def build_mode_class(cls):

 class TaskType(object):
-    modes = 'train test infer'.split()
+    modes = "train test infer".split()

     def __init__(self, mode):
         _init_(self, mode, TaskType)


 class ModelType:
-    modes = 'classification rank regression'.split()
+    modes = "classification rank regression".split()

     def __init__(self, mode):
         _init_(self, mode, ModelType)


 class ModelArch:
-    modes = 'fc cnn rnn'.split()
+    modes = "fc cnn rnn".split()

     def __init__(self, mode):
         _init_(self, mode, ModelArch)

@@ -103,22 +103,16 @@ build_mode_class(ModelArch)


 def sent2ids(sent, vocab):
-    '''
-    transform a sentence to a list of ids.
-
-    @sent: str
-        a sentence.
-    @vocab: dict
-        a word dic
-    '''
+    """
+    Transform a sentence into a list of word ids.
+
+    :param sent: The input sentence.
+    :type sent: str
+    :param vocab: The word dictionary.
+    :type vocab: dict
+    """
     return [vocab.get(w, UNK) for w in sent.split()]


 def load_dic(path):
-    '''
-    word dic format:
-    each line is a word
-    '''
+    """
+    Load the word dictionary. Each line of the dictionary is a single word;
+    its (zero-based) line number is used as the id of the word.
+    """
     dic = {}
     with open(path) as f:
         for id, line in enumerate(f):
@@ -128,13 +122,6 @@ def load_dic(path):


 def display_args(args):
-    logger.info("arguments passed by command line:")
-    for k, v in sorted(v for v in vars(args).items()):
+    logger.info("The arguments passed from the command line:")
+    for k, v in sorted(vars(args).items()):
         logger.info("{}:\t{}".format(k, v))
-
-
-if __name__ == '__main__':
-    t = TaskType(1)
-    t = TaskType.create_train()
-    print t
-    print 'is', t.is_train()
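+
+
+if __name__ == "__main__":
+    # A minimal smoke test (an illustrative sketch): build a TaskType with
+    # a generated factory method and query it with a generated check method.
+    t = TaskType.create_train()
+    print("task type: {}, is_train: {}".format(t, t.is_train()))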