diff --git a/dssm/README.cn.md b/dssm/README.cn.md
index 17f9a923c0418c72194f4f54316f04b70ba443dc..e1bd3cab89a2f76752c946b5a54cc250def9441b 100644
--- a/dssm/README.cn.md
+++ b/dssm/README.cn.md
@@ -1,19 +1,14 @@
# Deep Structured Semantic Models (DSSM)
-DSSM uses a DNN to learn low-dimensional representation vectors for texts in a continuous semantic space, and models the semantic similarity between two sentences.
-This example demonstrates how to use PaddlePaddle to implement a generic DSSM model for modeling the semantic similarity between two strings.
-The implementation supports a generic data format, so users can apply the model to real-world scenarios simply by replacing the data.
+DSSM uses a DNN to learn low-dimensional representation vectors for texts in a continuous semantic space, and models the semantic similarity between two sentences. This example demonstrates how to use PaddlePaddle to implement a generic DSSM model for modeling the semantic similarity between two strings. The implementation supports a generic data format, so users can apply the model to real-world scenarios simply by replacing the data.
## Background
-DSSM \[[1](#参考文献)\] is a classic semantic model proposed by Microsoft Research in 2013 for learning the semantic distance between two texts.
-More broadly, the model also generalizes to the following scenarios:
+DSSM \[[1](#参考文献)\] is a classic semantic model proposed by Microsoft Research in 2013 for learning the semantic distance between two texts. More broadly, the model also generalizes to the following scenarios:
1. CTR prediction, measuring the relevance between a user query (Query) and a set of candidate web pages (Documents).
2. Text relevance, measuring the degree of semantic relatedness between two strings.
3. Recommendation, measuring the degree of association between a User and a candidate Item.
-DSSM has evolved into a framework that naturally models the distance between two records.
-For text relevance, for example, cosine similarity can be used to characterize the semantic distance;
-for ranking search engine results, a Rank loss can be attached on top of DSSM to train a ranking model.
+DSSM has evolved into a framework that naturally models the distance between two records. For text relevance, for example, cosine similarity can be used to characterize the semantic distance; for ranking search engine results, a Rank loss can be attached on top of DSSM to train a ranking model.
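+
+As a reminder, the cosine similarity between two semantic vectors $a$ and $b$ is
+
+$$\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$$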
## Model Overview
In the original paper \[[1](#参考文献)\], the DSSM model measures the implicit semantic relationship between a user's search query (Query) and a collection of documents (Documents). The model structure is shown below:
@@ -23,12 +18,9 @@ DSSM has evolved into a framework that naturally models the distance between two
Figure 1. The original DSSM structure
-The core idea is to **use a DNN to map high-dimensional feature vectors into continuous vectors in a low-dimensional space (the red boxes in the figure)**,
-**and to measure the semantic relevance between the search query and the candidate documents with cosine similarity at the top layer**.
+The core idea is to **use a DNN to map high-dimensional feature vectors into continuous vectors in a low-dimensional space (the red boxes in the figure)**, **and to measure the semantic relevance between the search query and the candidate documents with cosine similarity at the top layer**.
-For the loss at the top layer, the original model uses a negative-sampling scheme similar to the one in Word2Vec:
-for each Query, one positive example $D+$ and 4 negative examples $D-$ are drawn, a conditional probability is computed over them, and the log-likelihood serves as the loss.
-This corresponds to the $P(D_1|Q)$ structure in Figure 1; see the original paper for details.
+For the loss at the top layer, the original model uses a negative-sampling scheme similar to the one in Word2Vec: for each Query, one positive example $D+$ and 4 negative examples $D-$ are drawn, a conditional probability is computed over them, and the log-likelihood serves as the loss. This corresponds to the $P(D_1|Q)$ structure in Figure 1; see the original paper for details.
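+
+Concretely, the original paper defines the posterior probability of a document $D$ given a query $Q$ as a softmax over smoothed cosine similarities,
+
+$$P(D|Q) = \frac{\exp(\gamma R(Q, D))}{\sum_{D' \in \mathbf{D}} \exp(\gamma R(Q, D'))}$$
+
+where $R(Q, D)$ is the cosine similarity of the two semantic vectors, $\gamma$ is a smoothing factor, and $\mathbf{D}$ contains the positive document and the sampled negatives; training minimizes the negative log-likelihood of the positive documents.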
With subsequent refinements, the DSSM structure was simplified \[[3](#参考文献)\] and evolved into:
@@ -37,37 +29,30 @@ DSSM has evolved into a framework that naturally models the distance between two
Figure 2. The generic DSSM structure
-The blank boxes in the figure can be filled with any model, such as fully connected (FC) layers, a CNN, or an RNN;
-this structure is designed to measure the semantic distance between two elements (for example, two strings).
-
-In practice, the DSSM model serves as a basic building block that is combined with different loss functions to achieve a specific goal, for example:
+The blank boxes in the figure can be filled with any model, for example fully connected (FC) layers, a CNN, or an RNN. This structure is designed to measure the semantic distance between two elements (for example, two strings). In practice, the DSSM model serves as a basic building block that is combined with different loss functions to achieve a specific goal, for example:
- In learning to rank, attach a pairwise rank loss to the structure in Figure 2 to obtain a ranking model.
- In CTR prediction, treat click/no-click as binary 0/1 classification and attach a cross-entropy loss to obtain a classification model.
- When a candidate string needs a score, use cosine similarity to compute the similarity, which yields a regression model.
-This example will try to provide a fairly generic, application-oriented solution; the supported task types are
+This example provides a fairly generic solution. The supported task types are:
- Classification
- Regression within the range [-1, 1]
- Pairwise-Rank
-For generating the low-dimensional semantic vectors, this model supports the following three structures:
+For generating the low-dimensional semantic vectors, the following three structures are supported (see the dispatch sketch after this list):
- FC, multi-layer fully connected layers
- CNN, convolutional neural network
- RNN, recurrent neural network
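+
+Internally, `network_conf.py` binds the chosen architecture to one of the three creator functions through a small dict; a condensed sketch of that dispatch (rewritten here as a method for readability):
+
+```python
+def _model_arch_creater(self, emb, prefix=""):
+    # Map the architecture name to its creator and build the sentence vector.
+    _model_arch = {
+        "cnn": self.create_cnn,
+        "fc": self.create_fc,
+        "rnn": self.create_rnn,
+    }
+    sent_vec = _model_arch[str(self.model_arch)](emb, prefix)
+    # Stack the upper fully connected layers on top of the sentence vector.
+    return self.create_dnn(sent_vec, prefix)
+```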
## Model Implementation
-The DSSM model can be decomposed into three parts: the left and right DNNs, and the loss function on top.
-In complex tasks the structures of the two DNNs can differ; for example, in the original paper the two sides learn the semantic vectors of the Query and the Documents respectively.
-Since the two inputs differ, it is advisable to customize each DNN structure accordingly.
+The DSSM model can be decomposed into three parts: the left and right DNNs, and the loss function on top. In complex tasks the structures of the two DNNs can differ. In the original paper, the two networks learn the semantic vectors of the Query and the Documents respectively; since the two inputs differ, it is advisable to customize each DNN structure accordingly.
-For simplicity and generality, this example uses the same structure for both DNNs, so there are only three options: FC, CNN, and RNN.
+**For simplicity and generality, this example uses the same structure for the left and right DNNs, so only three options are provided: FC, CNN, and RNN**.
-Three kinds of loss functions are supported: classification, regression, and ranking;
-for the regression and ranking losses, the matching degree between the two sides is computed with cosine similarity;
-for the classification task, the distribution over classes is computed with softmax.
+Three kinds of loss functions are supported: classification, regression, and ranking. For the regression and ranking losses, the matching degree between the two sides is computed with cosine similarity; for the classification task, the distribution over classes is computed with softmax.
Many of the topics above are covered in detail in other tutorials, for example:
@@ -77,19 +62,17 @@ The DSSM model can be decomposed into three parts: the left and right DNNs, and
We will not repeat those principles here; the rest of this document focuses on how to implement these structures with PaddlePaddle.
-As shown in Figure 3, the regression and classification models have very similar structures
+As shown in Figure 3, the regression and classification models have similar structures:

Figure 3. DSSM for REGRESSION or CLASSIFICATION
-The most important components are the word embeddings, the two low-dimensional vector learners marked `(1)` and `(2)` in the figure (each can be implemented with any of RNN/CNN/FC),
-and the corresponding loss function at the top.
-
-The Pairwise Rank structure is more complex, resembling two copies of the structure shown in Figure 4, with the corresponding loss function added:
+The most important components are the word embeddings, the two low-dimensional vector learners marked `(1)` and `(2)` in the figure (each can be implemented with any of RNN/CNN/FC), and the corresponding loss function at the top.
-- The overall idea of the model is to use the same source to score the left and right targets separately, giving `(a)` and `(b)`; the learning objective is the ordering relation between (a) and (b)
+The Pairwise Rank structure is more complex: the structure shown in Figure 4 appears twice, with the corresponding loss function added. The overall idea of the model is:
+- Given the same source, score the left and right targets separately, giving `(a)` and `(b)`; the learning objective is the ordering relation between (a) and (b)
- `(a)` and `(b)` are similar to the structure in Figure 3 and are used to score a (source, target) pair
- `(1)` and `(2)` actually share the same structure and represent the same source; the figure unfolds them into two copies only for presentation
@@ -98,17 +81,18 @@ The DSSM model can be decomposed into three parts: the left and right DNNs, and
Figure 4. DSSM for Pairwise Rank
-The concrete implementation of each part is described below; all of the code is contained in `./network_conf.py`.
+The implementation of each part is described below; the relevant code is contained in `./network_conf.py`.
### Create a word vector table for the text
```python
def create_embedding(self, input, prefix=''):
- '''
- Create an embedding table whose name has a `prefix`.
- '''
- logger.info("create embedding table [%s] which dimention is %d" %
+ """
+    Create the word embedding. The `prefix` is added in front of the name
+    of the embedding's learnable parameter.
+    """
+    logger.info("Create embedding table [%s] whose dimension is %d" %
(prefix, self.dnn_dims[0]))
emb = paddle.layer.embedding(
input=input,
@@ -123,14 +107,15 @@ def create_embedding(self, input, prefix=''):
```python
def create_cnn(self, emb, prefix=''):
-    '''
+    """
A multi-layer CNN.
+ :param emb: The word embedding.
+ :type emb: paddle.layer
+    :param prefix: The prefix to be added to the layers' names.
+ :type prefix: str
+ """
- @emb: paddle.layer
- output of the embedding layer
- @prefix: str
- prefix of layers' names, used to share parameters between more than one `cnn` parts.
- '''
def create_conv(context_len, hidden_size, prefix):
key = "%s_%d_%d" % (prefix, context_len, hidden_size)
conv = paddle.networks.sequence_conv_pool(
@@ -138,21 +123,18 @@ def create_cnn(self, emb, prefix=''):
context_len=context_len,
hidden_size=hidden_size,
# set parameter attr for parameter sharing
- context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'),
- fc_param_attr=ParamAttr(name=key + '_fc.w'),
- fc_bias_attr=ParamAttr(name=key + '_fc.b'),
- pool_bias_attr=ParamAttr(name=key + '_pool.b'))
+ context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"),
+ fc_param_attr=ParamAttr(name=key + "_fc.w"),
+ fc_bias_attr=ParamAttr(name=key + "_fc.b"),
+ pool_bias_attr=ParamAttr(name=key + "_pool.b"))
return conv
- logger.info('create a sequence_conv_pool which context width is 3')
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
- logger.info('create a sequence_conv_pool which context width is 4')
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
```
-The CNN takes the sequence of word vectors output by the embedding table, captures the key information of the original sentence through convolution and pooling operations,
-and finally outputs a semantic vector (which can be regarded as a sentence vector).
+The CNN takes a sequence of word vectors, captures the key information of the original sentence through convolution and pooling operations, and finally outputs a semantic vector (which can be regarded as a sentence vector).
In this implementation, the sentence vectors learned by CNNs with window widths 3 and 4 are summed element-wise to produce the final sentence vector.
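+
+A minimal sketch of that combination, assuming the v2 `paddle.layer.addto` layer is available for the element-wise sum:
+
+```python
+conv_3, conv_4 = self.create_cnn(emb, prefix="cnn")
+# Element-wise sum of the two sentence vectors; both have size dnn_dims[1].
+sent_vec = paddle.layer.addto(input=[conv_3, conv_4])
+```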
@@ -162,9 +144,9 @@ RNNs are well suited to learning from variable-length sequences; using an RNN to learn sentence information is
```python
def create_rnn(self, emb, prefix=''):
- '''
+ """
A GRU sentence vector learner.
- '''
+ """
gru = paddle.networks.simple_gru(
input=emb,
size=self.dnn_dims[1],
@@ -176,18 +158,19 @@ def create_rnn(self, emb, prefix=''):
return sent_vec
```
-### FC implementation
+### Multi-layer fully connected network (FC)
```python
def create_fc(self, emb, prefix=''):
-    '''
+    """
A multi-layer fully connected neural networks.
+    :param emb: The output of the embedding layer.
+    :type emb: paddle.layer
+    :param prefix: A prefix to be added to the layers' names.
+ :type prefix: str
+ """
- @emb: paddle.layer
- output of the embedding layer
- @prefix: str
- prefix of layers' names, used to share parameters between more than one `fc` parts.
- '''
_input_layer = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
fc = paddle.layer.fc(
@@ -198,21 +181,17 @@ def create_fc(self, emb, prefix=''):
return fc
```
-When building the FC structure, `paddle.layer.pooling` is first used to apply max pooling to the sequence of word vectors, converting the variable-length sequence into a fixed-dimension vector
-that serves as the semantic representation of the whole sentence; max pooling reduces the influence of sentence length on the sentence-vector representation.
+When building the fully connected network, `paddle.layer.pooling` is first used to apply max pooling to the sequence of word vectors, converting the variable-length sequence into a fixed-dimension vector that serves as the semantic representation of the whole sentence. Max pooling reduces the influence of sentence length on the sentence-vector representation.
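+
+As a plain NumPy illustration (not the Paddle layer itself), max pooling keeps the per-dimension maximum over a variable-length sequence of word vectors:
+
+```python
+import numpy as np
+
+# A sentence of 5 words with 32-dimensional embeddings; the length may vary.
+word_vecs = np.random.rand(5, 32)
+# Max pooling yields a fixed 32-d vector regardless of the sentence length.
+sent_vec = word_vecs.max(axis=0)
+assert sent_vec.shape == (32,)
+```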
-### Multi-layer DNN implementation
+### Multi-layer DNN
After the CNN/RNN/FC extracts the semantic vector, multiple fully connected layers can be stacked on top to form a deeper DNN structure.
```python
def create_dnn(self, sent_vec, prefix):
- # if more than three layers exists, a fc layer will be added.
if len(self.dnn_dims) > 1:
_input_layer = sent_vec
for id, dim in enumerate(self.dnn_dims[1:]):
name = "%s_fc_%d_%d" % (prefix, id, dim)
- logger.info("create fc layer [%s] which dimention is %d" %
- (name, dim))
fc = paddle.layer.fc(
input=_input_layer,
size=dim,
@@ -224,119 +203,13 @@ def create_dnn(self, sent_vec, prefix):
return _input_layer
```
-### Classification or regression implementation
-The classification and regression structures are quite similar, so a single function can create both
+### Classification and regression
+The classification and regression structures are quite similar; for the concrete implementation, please refer to the
+`_build_classification_or_regression_model` function in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py).
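+
+For the regression branch, the score is the cosine similarity of the two semantic vectors $s$ and $t$, and the loss is the squared error against the label $y \in [-1, 1]$:
+
+$$L = \left(\cos(s, t) - y\right)^2$$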
-```python
-def _build_classification_or_regression_model(self, is_classification):
- '''
- Build a classification/regression model, and the cost is returned.
-
- A Classification has 3 inputs:
- - source sentence
- - target sentence
- - classification label
-
- '''
- # prepare inputs.
- assert self.class_num
-
- source = paddle.layer.data(
- name='source_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
- target = paddle.layer.data(
- name='target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- label = paddle.layer.data(
- name='label_input',
- type=paddle.data_type.integer_value(self.class_num)
- if is_classification else paddle.data_type.dense_input)
-
- prefixs = '_ _'.split(
- ) if self.share_semantic_generator else 'source target'.split()
- embed_prefixs = '_ _'.split(
- ) if self.share_embed else 'source target'.split()
-
- word_vecs = []
- for id, input in enumerate([source, target]):
- x = self.create_embedding(input, prefix=embed_prefixs[id])
- word_vecs.append(x)
-
- semantics = []
- for id, input in enumerate(word_vecs):
- x = self.model_arch_creater(input, prefix=prefixs[id])
- semantics.append(x)
-
- if is_classification:
- concated_vector = paddle.layer.concat(semantics)
- prediction = paddle.layer.fc(
- input=concated_vector,
- size=self.class_num,
- act=paddle.activation.Softmax())
- cost = paddle.layer.classification_cost(
- input=prediction, label=label)
- else:
- prediction = paddle.layer.cos_sim(*semantics)
- cost = paddle.layer.square_error_cost(prediction, label)
-
- if not self.is_infer:
- return cost, prediction, label
- return prediction
-```
-### Pairwise Rank implementation
-Pairwise Rank reuses the DNN structure above: the same source is scored for similarity against two targets;
-if the left target scores higher, the prediction is 1, otherwise 0.
+### Pairwise Rank
+Pairwise Rank reuses the DNN structure above: the same source is scored for similarity against two targets; if the left target scores higher, the prediction is 1, otherwise 0. For the implementation, please refer to the `_build_rank_model` function in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py).
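+
+The `rank_cost` layer used there implements a RankNet-style pairwise loss. In its common form, with scores $s_l$ and $s_r$ for the left and right targets and label $y \in \{0, 1\}$,
+
+$$P = \frac{1}{1 + e^{-(s_l - s_r)}}, \qquad C = -y \log P - (1 - y) \log (1 - P)$$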
-```python
-def _build_rank_model(self):
- '''
- Build a pairwise rank model, and the cost is returned.
-
- A pairwise rank model has 3 inputs:
- - source sentence
- - left_target sentence
- - right_target sentence
- - label, 1 if left_target should be sorted in front of right_target, otherwise 0.
- '''
- source = paddle.layer.data(
- name='source_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
- left_target = paddle.layer.data(
- name='left_target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- right_target = paddle.layer.data(
- name='right_target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- label = paddle.layer.data(
- name='label_input', type=paddle.data_type.integer_value(1))
-
- prefixs = '_ _ _'.split(
- ) if self.share_semantic_generator else 'source target target'.split()
- embed_prefixs = '_ _'.split(
- ) if self.share_embed else 'source target target'.split()
-
- word_vecs = []
- for id, input in enumerate([source, left_target, right_target]):
- x = self.create_embedding(input, prefix=embed_prefixs[id])
- word_vecs.append(x)
-
- semantics = []
- for id, input in enumerate(word_vecs):
- x = self.model_arch_creater(input, prefix=prefixs[id])
- semantics.append(x)
-
- # cossim score of source and left_target
- left_score = paddle.layer.cos_sim(semantics[0], semantics[1])
- # cossim score of source and right target
- right_score = paddle.layer.cos_sim(semantics[0], semantics[2])
-
- # rank cost
- cost = paddle.layer.rank_cost(left_score, right_score, label=label)
- # prediction = left_score - right_score
- # but this operator is not supported currently.
- # so AUC will not used.
- return cost, None, None
-```
## Data Format
Simple example data is provided in `./data`
@@ -371,7 +244,6 @@ def _build_rank_model(self):
6 10 \t 8 3 1 \t 1
```
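+
+Each line holds `\t`-separated fields, and each field is a space-separated list of word IDs (the last field is the label). A minimal sketch of a parser for one line of the classification format:
+
+```python
+def parse_line(line):
+    # "6 10 \t 8 3 1 \t 1" -> (source ids, target ids, label)
+    source, target, label = line.strip().split("\t")
+    return ([int(w) for w in source.split()],
+            [int(w) for w in target.split()],
+            int(label))
+```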
-
### Data format for ranking
```
# 4 fields each line:
@@ -391,68 +263,11 @@ def _build_rank_model(self):
## Training
-You can directly run `python train.py -y 0 --model_arch 0` to train a classification FC model with the simple data in the `./data/classification` directory.
-
-Other model structures can also be customized from the command line; the detailed command-line arguments are listed below
+You can directly run `python train.py -y 0 --model_arch 0` with the example data in the `./data/classification` directory to verify that training a classification FC model runs end to end.
-```
-usage: train.py [-h] [-i TRAIN_DATA_PATH] [-t TEST_DATA_PATH]
- [-s SOURCE_DIC_PATH] [--target_dic_path TARGET_DIC_PATH]
- [-b BATCH_SIZE] [-p NUM_PASSES] -y MODEL_TYPE -a MODEL_ARCH
- [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET]
- [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS]
- [--num_workers NUM_WORKERS] [--use_gpu USE_GPU] [-c CLASS_NUM]
- [--model_output_prefix MODEL_OUTPUT_PREFIX]
- [-g NUM_BATCHES_TO_LOG] [-e NUM_BATCHES_TO_TEST]
- [-z NUM_BATCHES_TO_SAVE_MODEL]
-
-PaddlePaddle DSSM example
-
-optional arguments:
- -h, --help show this help message and exit
- -i TRAIN_DATA_PATH, --train_data_path TRAIN_DATA_PATH
- path of training dataset
- -t TEST_DATA_PATH, --test_data_path TEST_DATA_PATH
- path of testing dataset
- -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH
- path of the source's word dic
- --target_dic_path TARGET_DIC_PATH
- path of the target's word dic, if not set, the
- `source_dic_path` will be used
- -b BATCH_SIZE, --batch_size BATCH_SIZE
- size of mini-batch (default:32)
- -p NUM_PASSES, --num_passes NUM_PASSES
- number of passes to run(default:10)
- -y MODEL_TYPE, --model_type MODEL_TYPE
- model type, 0 for classification, 1 for pairwise rank,
- 2 for regression (default: classification)
- -a MODEL_ARCH, --model_arch MODEL_ARCH
- model architecture, 1 for CNN, 0 for FC, 2 for RNN
- --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET
- whether to share network parameters between source and
- target
- --share_embed SHARE_EMBED
- whether to share word embedding between source and
- target
- --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32',
- which means create a 4-layer dnn, demention of each
- layer is 256, 128, 64 and 32
- --num_workers NUM_WORKERS
- num worker threads, default 1
- --use_gpu USE_GPU whether to use GPU devices (default: False)
- -c CLASS_NUM, --class_num CLASS_NUM
- number of categories for classification task.
- --model_output_prefix MODEL_OUTPUT_PREFIX
- prefix of the path for model to store, (default: ./)
- -g NUM_BATCHES_TO_LOG, --num_batches_to_log NUM_BATCHES_TO_LOG
- number of batches to output train log, (default: 100)
- -e NUM_BATCHES_TO_TEST, --num_batches_to_test NUM_BATCHES_TO_TEST
- number of batches to test, (default: 200)
- -z NUM_BATCHES_TO_SAVE_MODEL, --num_batches_to_save_model NUM_BATCHES_TO_SAVE_MODEL
- number of batches to output model, (default: 400)
-```
+Other model structures can also be customized from the command line; for the detailed command-line arguments, run `python train.py --help`.
-The important parameters are described below
+The most important parameters are described here:
- `train_data_path` Path of the training data
- `test_data_path` Path of the test data, optional
@@ -462,49 +277,8 @@ optional arguments:
- `model_arch` Model architecture, FC 0, CNN 1, RNN 2
- `dnn_dims` Dimensions of the model layers, `256,128,64,32` by default, i.e. a 4-layer model with the layer dimensions listed above; the string is parsed as in the sketch below
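+
+A minimal sketch of how the `dnn_dims` string maps to the per-layer dimensions:
+
+```python
+dnn_dims = "256,128,64,32"
+# One integer per layer, from the embedding side to the top.
+layer_dims = [int(d) for d in dnn_dims.split(",")]
+assert layer_dims == [256, 128, 64, 32]
+```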
-## Predict with the trained model
-```
-usage: infer.py [-h] --model_path MODEL_PATH -i DATA_PATH -o
- PREDICTION_OUTPUT_PATH -y MODEL_TYPE [-s SOURCE_DIC_PATH]
- [--target_dic_path TARGET_DIC_PATH] -a MODEL_ARCH
- [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET]
- [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS]
- [-c CLASS_NUM]
-
-PaddlePaddle DSSM infer
-
-optional arguments:
- -h, --help show this help message and exit
- --model_path MODEL_PATH
- path of model parameters file
- -i DATA_PATH, --data_path DATA_PATH
- path of the dataset to infer
- -o PREDICTION_OUTPUT_PATH, --prediction_output_path PREDICTION_OUTPUT_PATH
- path to output the prediction
- -y MODEL_TYPE, --model_type MODEL_TYPE
- model type, 0 for classification, 1 for pairwise rank,
- 2 for regression (default: classification)
- -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH
- path of the source's word dic
- --target_dic_path TARGET_DIC_PATH
- path of the target's word dic, if not set, the
- `source_dic_path` will be used
- -a MODEL_ARCH, --model_arch MODEL_ARCH
- model architecture, 1 for CNN, 0 for FC, 2 for RNN
- --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET
- whether to share network parameters between source and
- target
- --share_embed SHARE_EMBED
- whether to share word embedding between source and
- target
- --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32',
- which means create a 4-layer dnn, demention of each
- layer is 256, 128, 64 and 32
- -c CLASS_NUM, --class_num CLASS_NUM
- number of categories for classification task.
-```
-
-Some parameters are the same as in `train.py`; the important ones are explained below
+## Predict with the trained model
+For the detailed command-line arguments, run `python infer.py --help`. The important parameters are explained below:
- `data_path` Path of the data to predict on
- `prediction_output_path` Path for the prediction output
diff --git a/dssm/README.md b/dssm/README.md
index 2d5e0effadabe356cf91e625bb6891fd5152140a..8148ea6557183df1446b98ed6d3a4da1f92c6438 100644
--- a/dssm/README.md
+++ b/dssm/README.md
@@ -65,10 +65,11 @@ In below, we describe how to train DSSM model in PaddlePaddle. All the codes are
### Create a word vector table for the text
```python
def create_embedding(self, input, prefix=''):
- '''
- Create an embedding table whose name has a `prefix`.
- '''
- logger.info("create embedding table [%s] which dimention is %d" %
+ """
+    Create the word embedding. The `prefix` is added in front of the name
+    of the embedding's learnable parameter.
+    """
+    logger.info("Create embedding table [%s] whose dimension is %d" %
(prefix, self.dnn_dims[0]))
emb = paddle.layer.embedding(
input=input,
@@ -82,14 +83,15 @@ Since the input (embedding table) is a list of the IDs of the words correspondin
### CNN implementation
```python
def create_cnn(self, emb, prefix=''):
-    '''
+    """
A multi-layer CNN.
+ :param emb: The word embedding.
+ :type emb: paddle.layer
+    :param prefix: The prefix to be added to the layers' names.
+ :type prefix: str
+ """
- @emb: paddle.layer
- output of the embedding layer
- @prefix: str
- prefix of layers' names, used to share parameters between more than one `cnn` parts.
- '''
def create_conv(context_len, hidden_size, prefix):
key = "%s_%d_%d" % (prefix, context_len, hidden_size)
conv = paddle.networks.sequence_conv_pool(
@@ -97,15 +99,13 @@ def create_cnn(self, emb, prefix=''):
context_len=context_len,
hidden_size=hidden_size,
# set parameter attr for parameter sharing
- context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'),
- fc_param_attr=ParamAttr(name=key + '_fc.w'),
- fc_bias_attr=ParamAttr(name=key + '_fc.b'),
- pool_bias_attr=ParamAttr(name=key + '_pool.b'))
+ context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"),
+ fc_param_attr=ParamAttr(name=key + "_fc.w"),
+ fc_bias_attr=ParamAttr(name=key + "_fc.b"),
+ pool_bias_attr=ParamAttr(name=key + "_pool.b"))
return conv
- logger.info('create a sequence_conv_pool which context width is 3')
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
- logger.info('create a sequence_conv_pool which context width is 4')
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
```
@@ -118,9 +118,9 @@ RNN is suitable for learning variable length of the information
```python
def create_rnn(self, emb, prefix=''):
- '''
+ """
A GRU sentence vector learner.
- '''
+ """
gru = paddle.networks.simple_gru(
input=emb,
size=self.dnn_dims[1],
@@ -136,14 +136,15 @@ def create_rnn(self, emb, prefix=''):
```python
def create_fc(self, emb, prefix=''):
-    '''
+    """
A multi-layer fully connected neural networks.
+    :param emb: The output of the embedding layer.
+    :type emb: paddle.layer
+    :param prefix: A prefix to be added to the layers' names.
+ :type prefix: str
+ """
- @emb: paddle.layer
- output of the embedding layer
- @prefix: str
- prefix of layers' names, used to share parameters between more than one `fc` parts.
- '''
_input_layer = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
fc = paddle.layer.fc(
@@ -160,13 +161,10 @@ In the construction of FC, we use `paddle.layer.pooling` for the maximum pooling
```python
def create_dnn(self, sent_vec, prefix):
- # if more than three layers exists, a fc layer will be added.
if len(self.dnn_dims) > 1:
_input_layer = sent_vec
for id, dim in enumerate(self.dnn_dims[1:]):
name = "%s_fc_%d_%d" % (prefix, id, dim)
- logger.info("create fc layer [%s] which dimention is %d" %
- (name, dim))
fc = paddle.layer.fc(
input=_input_layer,
size=dim,
@@ -180,117 +178,12 @@ def create_dnn(self, sent_vec, prefix):
### Classification / Regression
The structure of classification and regression is similar. Below function can be used for both tasks.
-
-```python
-def _build_classification_or_regression_model(self, is_classification):
- '''
- Build a classification/regression model, and the cost is returned.
-
- A Classification has 3 inputs:
- - source sentence
- - target sentence
- - classification label
-
- '''
- # prepare inputs.
- assert self.class_num
-
- source = paddle.layer.data(
- name='source_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
- target = paddle.layer.data(
- name='target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- label = paddle.layer.data(
- name='label_input',
- type=paddle.data_type.integer_value(self.class_num)
- if is_classification else paddle.data_type.dense_input)
-
- prefixs = '_ _'.split(
- ) if self.share_semantic_generator else 'source target'.split()
- embed_prefixs = '_ _'.split(
- ) if self.share_embed else 'source target'.split()
-
- word_vecs = []
- for id, input in enumerate([source, target]):
- x = self.create_embedding(input, prefix=embed_prefixs[id])
- word_vecs.append(x)
-
- semantics = []
- for id, input in enumerate(word_vecs):
- x = self.model_arch_creater(input, prefix=prefixs[id])
- semantics.append(x)
-
- if is_classification:
- concated_vector = paddle.layer.concat(semantics)
- prediction = paddle.layer.fc(
- input=concated_vector,
- size=self.class_num,
- act=paddle.activation.Softmax())
- cost = paddle.layer.classification_cost(
- input=prediction, label=label)
- else:
- prediction = paddle.layer.cos_sim(*semantics)
- cost = paddle.layer.square_error_cost(prediction, label)
-
- if not self.is_infer:
- return cost, prediction, label
- return prediction
-```
+Please check the function `_build_classification_or_regression_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the detailed implementation.
### Pairwise Rank
+Please check the function `_build_rank_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the implementation.
-```python
-def _build_rank_model(self):
- '''
- Build a pairwise rank model, and the cost is returned.
-
- A pairwise rank model has 3 inputs:
- - source sentence
- - left_target sentence
- - right_target sentence
- - label, 1 if left_target should be sorted in front of right_target, otherwise 0.
- '''
- source = paddle.layer.data(
- name='source_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
- left_target = paddle.layer.data(
- name='left_target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- right_target = paddle.layer.data(
- name='right_target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- label = paddle.layer.data(
- name='label_input', type=paddle.data_type.integer_value(1))
-
- prefixs = '_ _ _'.split(
- ) if self.share_semantic_generator else 'source target target'.split()
- embed_prefixs = '_ _'.split(
- ) if self.share_embed else 'source target target'.split()
-
- word_vecs = []
- for id, input in enumerate([source, left_target, right_target]):
- x = self.create_embedding(input, prefix=embed_prefixs[id])
- word_vecs.append(x)
-
- semantics = []
- for id, input in enumerate(word_vecs):
- x = self.model_arch_creater(input, prefix=prefixs[id])
- semantics.append(x)
-
- # cossim score of source and left_target
- left_score = paddle.layer.cos_sim(semantics[0], semantics[1])
- # cossim score of source and right target
- right_score = paddle.layer.cos_sim(semantics[0], semantics[2])
-
- # rank cost
- cost = paddle.layer.rank_cost(left_score, right_score, label=label)
- # prediction = left_score - right_score
- # but this operator is not supported currently.
- # so AUC will not used.
- return cost, None, None
-```
## Data Format
Below is a simple example for the data in `./data`
@@ -347,67 +240,7 @@ The example of this format is as follows.
## Training
-We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification.
-
-
-```
-usage: train.py [-h] [-i TRAIN_DATA_PATH] [-t TEST_DATA_PATH]
- [-s SOURCE_DIC_PATH] [--target_dic_path TARGET_DIC_PATH]
- [-b BATCH_SIZE] [-p NUM_PASSES] -y MODEL_TYPE -a MODEL_ARCH
- [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET]
- [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS]
- [--num_workers NUM_WORKERS] [--use_gpu USE_GPU] [-c CLASS_NUM]
- [--model_output_prefix MODEL_OUTPUT_PREFIX]
- [-g NUM_BATCHES_TO_LOG] [-e NUM_BATCHES_TO_TEST]
- [-z NUM_BATCHES_TO_SAVE_MODEL]
-
-PaddlePaddle DSSM example
-
-optional arguments:
- -h, --help show this help message and exit
- -i TRAIN_DATA_PATH, --train_data_path TRAIN_DATA_PATH
- path of training dataset
- -t TEST_DATA_PATH, --test_data_path TEST_DATA_PATH
- path of testing dataset
- -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH
- path of the source's word dic
- --target_dic_path TARGET_DIC_PATH
- path of the target's word dic, if not set, the
- `source_dic_path` will be used
- -b BATCH_SIZE, --batch_size BATCH_SIZE
- size of mini-batch (default:32)
- -p NUM_PASSES, --num_passes NUM_PASSES
- number of passes to run(default:10)
- -y MODEL_TYPE, --model_type MODEL_TYPE
- model type, 0 for classification, 1 for pairwise rank,
- 2 for regression (default: classification)
- -a MODEL_ARCH, --model_arch MODEL_ARCH
- model architecture, 1 for CNN, 0 for FC, 2 for RNN
- --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET
- whether to share network parameters between source and
- target
- --share_embed SHARE_EMBED
- whether to share word embedding between source and
- target
- --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32',
- which means create a 4-layer dnn, demention of each
- layer is 256, 128, 64 and 32
- --num_workers NUM_WORKERS
- num worker threads, default 1
- --use_gpu USE_GPU whether to use GPU devices (default: False)
- -c CLASS_NUM, --class_num CLASS_NUM
- number of categories for classification task.
- --model_output_prefix MODEL_OUTPUT_PREFIX
- prefix of the path for model to store, (default: ./)
- -g NUM_BATCHES_TO_LOG, --num_batches_to_log NUM_BATCHES_TO_LOG
- number of batches to output train log, (default: 100)
- -e NUM_BATCHES_TO_TEST, --num_batches_to_test NUM_BATCHES_TO_TEST
- number of batches to test, (default: 200)
- -z NUM_BATCHES_TO_SAVE_MODEL, --num_batches_to_save_model NUM_BATCHES_TO_SAVE_MODEL
- number of batches to output model, (default: 400)
-```
-
-Parameter description:
+We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification. The parameters of the `train.py` script can be listed by running `python train.py --help`. Some important parameters are:
- `train_data_path` Training data path
- `test_data_path` Test data path, optional
@@ -418,48 +251,8 @@ Parameter description:
- `dnn_dims` The dimension of each layer of the model is set, the default is `256,128,64,32`,with 4 layers.
## To predict using the trained model
-```
-usage: infer.py [-h] --model_path MODEL_PATH -i DATA_PATH -o
- PREDICTION_OUTPUT_PATH -y MODEL_TYPE [-s SOURCE_DIC_PATH]
- [--target_dic_path TARGET_DIC_PATH] -a MODEL_ARCH
- [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET]
- [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS]
- [-c CLASS_NUM]
-
-PaddlePaddle DSSM infer
-
-optional arguments:
- -h, --help show this help message and exit
- --model_path MODEL_PATH
- path of model parameters file
- -i DATA_PATH, --data_path DATA_PATH
- path of the dataset to infer
- -o PREDICTION_OUTPUT_PATH, --prediction_output_path PREDICTION_OUTPUT_PATH
- path to output the prediction
- -y MODEL_TYPE, --model_type MODEL_TYPE
- model type, 0 for classification, 1 for pairwise rank,
- 2 for regression (default: classification)
- -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH
- path of the source's word dic
- --target_dic_path TARGET_DIC_PATH
- path of the target's word dic, if not set, the
- `source_dic_path` will be used
- -a MODEL_ARCH, --model_arch MODEL_ARCH
- model architecture, 1 for CNN, 0 for FC, 2 for RNN
- --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET
- whether to share network parameters between source and
- target
- --share_embed SHARE_EMBED
- whether to share word embedding between source and
- target
- --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32',
- which means create a 4-layer dnn, demention of each
- layer is 256, 128, 64 and 32
- -c CLASS_NUM, --class_num CLASS_NUM
- number of categories for classification task.
-```
-Important parameters are
+The parameters of the `infer.py` script can be listed by running `python infer.py --help`. Some important parameters are:
- `data_path` Path for the data to predict
- `prediction_output_path` Prediction output path
diff --git a/dssm/index.html b/dssm/index.html
index b4777a28960f5c5ac83e3e0598ad31794b73a8dd..5c4a1a9d316821f25bbf204c3ba7698573722b94 100644
--- a/dssm/index.html
+++ b/dssm/index.html
@@ -107,10 +107,11 @@ In below, we describe how to train DSSM model in PaddlePaddle. All the codes are
### Create a word vector table for the text
```python
def create_embedding(self, input, prefix=''):
- '''
- Create an embedding table whose name has a `prefix`.
- '''
- logger.info("create embedding table [%s] which dimention is %d" %
+ """
+    Create the word embedding. The `prefix` is added in front of the name
+    of the embedding's learnable parameter.
+    """
+    logger.info("Create embedding table [%s] whose dimension is %d" %
(prefix, self.dnn_dims[0]))
emb = paddle.layer.embedding(
input=input,
@@ -124,14 +125,15 @@ Since the input (embedding table) is a list of the IDs of the words correspondin
### CNN implementation
```python
def create_cnn(self, emb, prefix=''):
-    '''
+    """
A multi-layer CNN.
+ :param emb: The word embedding.
+ :type emb: paddle.layer
+    :param prefix: The prefix to be added to the layers' names.
+ :type prefix: str
+ """
- @emb: paddle.layer
- output of the embedding layer
- @prefix: str
- prefix of layers' names, used to share parameters between more than one `cnn` parts.
- '''
def create_conv(context_len, hidden_size, prefix):
key = "%s_%d_%d" % (prefix, context_len, hidden_size)
conv = paddle.networks.sequence_conv_pool(
@@ -139,15 +141,13 @@ def create_cnn(self, emb, prefix=''):
context_len=context_len,
hidden_size=hidden_size,
# set parameter attr for parameter sharing
- context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'),
- fc_param_attr=ParamAttr(name=key + '_fc.w'),
- fc_bias_attr=ParamAttr(name=key + '_fc.b'),
- pool_bias_attr=ParamAttr(name=key + '_pool.b'))
+ context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"),
+ fc_param_attr=ParamAttr(name=key + "_fc.w"),
+ fc_bias_attr=ParamAttr(name=key + "_fc.b"),
+ pool_bias_attr=ParamAttr(name=key + "_pool.b"))
return conv
- logger.info('create a sequence_conv_pool which context width is 3')
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
- logger.info('create a sequence_conv_pool which context width is 4')
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
```
@@ -160,9 +160,9 @@ RNN is suitable for learning variable length of the information
```python
def create_rnn(self, emb, prefix=''):
- '''
+ """
A GRU sentence vector learner.
- '''
+ """
gru = paddle.networks.simple_gru(
input=emb,
size=self.dnn_dims[1],
@@ -178,14 +178,15 @@ def create_rnn(self, emb, prefix=''):
```python
def create_fc(self, emb, prefix=''):
-    '''
+    """
A multi-layer fully connected neural networks.
+    :param emb: The output of the embedding layer.
+    :type emb: paddle.layer
+    :param prefix: A prefix to be added to the layers' names.
+ :type prefix: str
+ """
- @emb: paddle.layer
- output of the embedding layer
- @prefix: str
- prefix of layers' names, used to share parameters between more than one `fc` parts.
- '''
_input_layer = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
fc = paddle.layer.fc(
@@ -202,13 +203,10 @@ In the construction of FC, we use `paddle.layer.pooling` for the maximum pooling
```python
def create_dnn(self, sent_vec, prefix):
- # if more than three layers exists, a fc layer will be added.
if len(self.dnn_dims) > 1:
_input_layer = sent_vec
for id, dim in enumerate(self.dnn_dims[1:]):
name = "%s_fc_%d_%d" % (prefix, id, dim)
- logger.info("create fc layer [%s] which dimention is %d" %
- (name, dim))
fc = paddle.layer.fc(
input=_input_layer,
size=dim,
@@ -222,117 +220,12 @@ def create_dnn(self, sent_vec, prefix):
### Classification / Regression
The structure of classification and regression is similar. Below function can be used for both tasks.
-
-```python
-def _build_classification_or_regression_model(self, is_classification):
- '''
- Build a classification/regression model, and the cost is returned.
-
- A Classification has 3 inputs:
- - source sentence
- - target sentence
- - classification label
-
- '''
- # prepare inputs.
- assert self.class_num
-
- source = paddle.layer.data(
- name='source_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
- target = paddle.layer.data(
- name='target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- label = paddle.layer.data(
- name='label_input',
- type=paddle.data_type.integer_value(self.class_num)
- if is_classification else paddle.data_type.dense_input)
-
- prefixs = '_ _'.split(
- ) if self.share_semantic_generator else 'source target'.split()
- embed_prefixs = '_ _'.split(
- ) if self.share_embed else 'source target'.split()
-
- word_vecs = []
- for id, input in enumerate([source, target]):
- x = self.create_embedding(input, prefix=embed_prefixs[id])
- word_vecs.append(x)
-
- semantics = []
- for id, input in enumerate(word_vecs):
- x = self.model_arch_creater(input, prefix=prefixs[id])
- semantics.append(x)
-
- if is_classification:
- concated_vector = paddle.layer.concat(semantics)
- prediction = paddle.layer.fc(
- input=concated_vector,
- size=self.class_num,
- act=paddle.activation.Softmax())
- cost = paddle.layer.classification_cost(
- input=prediction, label=label)
- else:
- prediction = paddle.layer.cos_sim(*semantics)
- cost = paddle.layer.square_error_cost(prediction, label)
-
- if not self.is_infer:
- return cost, prediction, label
- return prediction
-```
+Please check the function `_build_classification_or_regression_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the detailed implementation.
### Pairwise Rank
+Please check the function `_build_rank_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the implementation.
-```python
-def _build_rank_model(self):
- '''
- Build a pairwise rank model, and the cost is returned.
-
- A pairwise rank model has 3 inputs:
- - source sentence
- - left_target sentence
- - right_target sentence
- - label, 1 if left_target should be sorted in front of right_target, otherwise 0.
- '''
- source = paddle.layer.data(
- name='source_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
- left_target = paddle.layer.data(
- name='left_target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- right_target = paddle.layer.data(
- name='right_target_input',
- type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
- label = paddle.layer.data(
- name='label_input', type=paddle.data_type.integer_value(1))
-
- prefixs = '_ _ _'.split(
- ) if self.share_semantic_generator else 'source target target'.split()
- embed_prefixs = '_ _'.split(
- ) if self.share_embed else 'source target target'.split()
-
- word_vecs = []
- for id, input in enumerate([source, left_target, right_target]):
- x = self.create_embedding(input, prefix=embed_prefixs[id])
- word_vecs.append(x)
-
- semantics = []
- for id, input in enumerate(word_vecs):
- x = self.model_arch_creater(input, prefix=prefixs[id])
- semantics.append(x)
-
- # cossim score of source and left_target
- left_score = paddle.layer.cos_sim(semantics[0], semantics[1])
- # cossim score of source and right target
- right_score = paddle.layer.cos_sim(semantics[0], semantics[2])
-
- # rank cost
- cost = paddle.layer.rank_cost(left_score, right_score, label=label)
- # prediction = left_score - right_score
- # but this operator is not supported currently.
- # so AUC will not used.
- return cost, None, None
-```
## Data Format
Below is a simple example for the data in `./data`
@@ -389,67 +282,7 @@ The example of this format is as follows.
## Training
-We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification.
-
-
-```
-usage: train.py [-h] [-i TRAIN_DATA_PATH] [-t TEST_DATA_PATH]
- [-s SOURCE_DIC_PATH] [--target_dic_path TARGET_DIC_PATH]
- [-b BATCH_SIZE] [-p NUM_PASSES] -y MODEL_TYPE -a MODEL_ARCH
- [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET]
- [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS]
- [--num_workers NUM_WORKERS] [--use_gpu USE_GPU] [-c CLASS_NUM]
- [--model_output_prefix MODEL_OUTPUT_PREFIX]
- [-g NUM_BATCHES_TO_LOG] [-e NUM_BATCHES_TO_TEST]
- [-z NUM_BATCHES_TO_SAVE_MODEL]
-
-PaddlePaddle DSSM example
-
-optional arguments:
- -h, --help show this help message and exit
- -i TRAIN_DATA_PATH, --train_data_path TRAIN_DATA_PATH
- path of training dataset
- -t TEST_DATA_PATH, --test_data_path TEST_DATA_PATH
- path of testing dataset
- -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH
- path of the source's word dic
- --target_dic_path TARGET_DIC_PATH
- path of the target's word dic, if not set, the
- `source_dic_path` will be used
- -b BATCH_SIZE, --batch_size BATCH_SIZE
- size of mini-batch (default:32)
- -p NUM_PASSES, --num_passes NUM_PASSES
- number of passes to run(default:10)
- -y MODEL_TYPE, --model_type MODEL_TYPE
- model type, 0 for classification, 1 for pairwise rank,
- 2 for regression (default: classification)
- -a MODEL_ARCH, --model_arch MODEL_ARCH
- model architecture, 1 for CNN, 0 for FC, 2 for RNN
- --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET
- whether to share network parameters between source and
- target
- --share_embed SHARE_EMBED
- whether to share word embedding between source and
- target
- --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32',
- which means create a 4-layer dnn, demention of each
- layer is 256, 128, 64 and 32
- --num_workers NUM_WORKERS
- num worker threads, default 1
- --use_gpu USE_GPU whether to use GPU devices (default: False)
- -c CLASS_NUM, --class_num CLASS_NUM
- number of categories for classification task.
- --model_output_prefix MODEL_OUTPUT_PREFIX
- prefix of the path for model to store, (default: ./)
- -g NUM_BATCHES_TO_LOG, --num_batches_to_log NUM_BATCHES_TO_LOG
- number of batches to output train log, (default: 100)
- -e NUM_BATCHES_TO_TEST, --num_batches_to_test NUM_BATCHES_TO_TEST
- number of batches to test, (default: 200)
- -z NUM_BATCHES_TO_SAVE_MODEL, --num_batches_to_save_model NUM_BATCHES_TO_SAVE_MODEL
- number of batches to output model, (default: 400)
-```
-
-Parameter description:
+We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification. The parameters of the `train.py` script can be listed by running `python train.py --help`. Some important parameters are:
- `train_data_path` Training data path
- `test_data_path` Test data path, optional
@@ -460,48 +293,8 @@ Parameter description:
- `dnn_dims` The dimension of each layer of the model is set, the default is `256,128,64,32`,with 4 layers.
## To predict using the trained model
-```
-usage: infer.py [-h] --model_path MODEL_PATH -i DATA_PATH -o
- PREDICTION_OUTPUT_PATH -y MODEL_TYPE [-s SOURCE_DIC_PATH]
- [--target_dic_path TARGET_DIC_PATH] -a MODEL_ARCH
- [--share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET]
- [--share_embed SHARE_EMBED] [--dnn_dims DNN_DIMS]
- [-c CLASS_NUM]
-
-PaddlePaddle DSSM infer
-
-optional arguments:
- -h, --help show this help message and exit
- --model_path MODEL_PATH
- path of model parameters file
- -i DATA_PATH, --data_path DATA_PATH
- path of the dataset to infer
- -o PREDICTION_OUTPUT_PATH, --prediction_output_path PREDICTION_OUTPUT_PATH
- path to output the prediction
- -y MODEL_TYPE, --model_type MODEL_TYPE
- model type, 0 for classification, 1 for pairwise rank,
- 2 for regression (default: classification)
- -s SOURCE_DIC_PATH, --source_dic_path SOURCE_DIC_PATH
- path of the source's word dic
- --target_dic_path TARGET_DIC_PATH
- path of the target's word dic, if not set, the
- `source_dic_path` will be used
- -a MODEL_ARCH, --model_arch MODEL_ARCH
- model architecture, 1 for CNN, 0 for FC, 2 for RNN
- --share_network_between_source_target SHARE_NETWORK_BETWEEN_SOURCE_TARGET
- whether to share network parameters between source and
- target
- --share_embed SHARE_EMBED
- whether to share word embedding between source and
- target
- --dnn_dims DNN_DIMS dimentions of dnn layers, default is '256,128,64,32',
- which means create a 4-layer dnn, demention of each
- layer is 256, 128, 64 and 32
- -c CLASS_NUM, --class_num CLASS_NUM
- number of categories for classification task.
-```
-Important parameters are
+The parameters of the `infer.py` script can be listed by running `python infer.py --help`. Some important parameters are:
- `data_path` Path for the data to predict
- `prediction_output_path` Prediction output path
diff --git a/dssm/infer.py b/dssm/infer.py
index f0c65e44a8c5f9249172f0c1912dc9c195ce69c2..63a9657341d7d220b72696fd215d1850b1718f32 100644
--- a/dssm/infer.py
+++ b/dssm/infer.py
@@ -9,83 +9,81 @@ from utils import logger, ModelType, ModelArch, load_dic
parser = argparse.ArgumentParser(description="PaddlePaddle DSSM infer")
parser.add_argument(
- '--model_path',
- type=str,
- required=True,
- help="path of model parameters file")
+ "--model_path", type=str, required=True, help="The path of trained model.")
parser.add_argument(
- '-i',
- '--data_path',
+ "-i",
+ "--data_path",
type=str,
required=True,
- help="path of the dataset to infer")
+ help="The path of the data for inferring.")
parser.add_argument(
- '-o',
- '--prediction_output_path',
+ "-o",
+ "--prediction_output_path",
type=str,
required=True,
- help="path to output the prediction")
+ help="The path to save the predictions.")
parser.add_argument(
- '-y',
- '--model_type',
+ "-y",
+ "--model_type",
type=int,
required=True,
default=ModelType.CLASSIFICATION_MODE,
- help=("model type, %d for classification, %d for pairwise rank, "
- "%d for regression (default: classification)") %
+ help=("The model type: %d for classification, %d for pairwise rank, "
+ "%d for regression (default: classification).") %
(ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE,
ModelType.REGRESSION_MODE))
parser.add_argument(
- '-s',
- '--source_dic_path',
+ "-s",
+ "--source_dic_path",
type=str,
required=False,
- help="path of the source's word dic")
+ help="The path of the source's word dictionary.")
parser.add_argument(
- '--target_dic_path',
+ "--target_dic_path",
type=str,
required=False,
- help=("path of the target's word dictionary, "
- "if not set, the `source_dic_path` will be used"))
+ help=("The path of the target's word dictionary, "
+ "if this parameter is not set, the `source_dic_path` will be used."))
parser.add_argument(
- '-a',
- '--model_arch',
+ "-a",
+ "--model_arch",
type=int,
required=True,
default=ModelArch.CNN_MODE,
help="model architecture, %d for CNN, %d for FC, %d for RNN" %
(ModelArch.CNN_MODE, ModelArch.FC_MODE, ModelArch.RNN_MODE))
parser.add_argument(
- '--share_network_between_source_target',
+ "--share_network_between_source_target",
type=distutils.util.strtobool,
default=False,
help="whether to share network parameters between source and target")
parser.add_argument(
- '--share_embed',
+ "--share_embed",
type=distutils.util.strtobool,
default=False,
help="whether to share word embedding between source and target")
parser.add_argument(
- '--dnn_dims',
+ "--dnn_dims",
type=str,
- default='256,128,64,32',
- help=("dimentions of dnn layers, default is '256,128,64,32', "
- "which means create a 4-layer dnn, "
- "demention of each layer is 256, 128, 64 and 32"))
+ default="256,128,64,32",
+ help=("The dimentions of dnn layers, default is `256,128,64,32`, "
+ "which means a dnn with 4 layers with "
+ "dmentions 256, 128, 64 and 32 will be created."))
parser.add_argument(
- '-c',
- '--class_num',
+ "-c",
+ "--class_num",
type=int,
default=0,
- help="number of categories for classification task.")
+ help="The number of categories for classification task.")
args = parser.parse_args()
args.model_type = ModelType(args.model_type)
args.model_arch = ModelArch(args.model_arch)
if args.model_type.is_classification():
- assert args.class_num > 1, "--class_num should be set in classification task."
+ assert args.class_num > 1, ("The parameter class_num should be set "
+ "in classification task.")
-layer_dims = map(int, args.dnn_dims.split(','))
+layer_dims = map(int, args.dnn_dims.split(","))
args.target_dic_path = args.source_dic_path if not args.target_dic_path \
else args.target_dic_path
@@ -94,8 +92,6 @@ paddle.init(use_gpu=False, trainer_count=1)
class Inferer(object):
def __init__(self, param_path):
- logger.info("create DSSM model")
-
prediction = DSSM(
dnn_dims=layer_dims,
vocab_sizes=[
@@ -110,14 +106,13 @@ class Inferer(object):
is_infer=True)()
# load parameter
- logger.info("load model parameters from %s" % param_path)
+ logger.info("Load the trained model from %s." % param_path)
self.parameters = paddle.parameters.Parameters.from_tar(
- open(param_path, 'r'))
+ open(param_path, "r"))
self.inferer = paddle.inference.Inference(
output_layer=prediction, parameters=self.parameters)
def infer(self, data_path):
- logger.info("infer data...")
dataset = reader.Dataset(
train_path=data_path,
test_path=None,
@@ -125,19 +120,20 @@ class Inferer(object):
target_dic_path=args.target_dic_path,
model_type=args.model_type, )
infer_reader = paddle.batch(dataset.infer, batch_size=1000)
- logger.warning('write predictions to %s' % args.prediction_output_path)
+ logger.warning("Write predictions to %s." % args.prediction_output_path)
- output_f = open(args.prediction_output_path, 'w')
+ output_f = open(args.prediction_output_path, "w")
for id, batch in enumerate(infer_reader()):
res = self.inferer.infer(input=batch)
- predictions = [' '.join(map(str, x)) for x in res]
+ predictions = [" ".join(map(str, x)) for x in res]
assert len(batch) == len(predictions), (
- "predict error, %d inputs, "
- "but %d predictions") % (len(batch), len(predictions))
- output_f.write('\n'.join(map(str, predictions)) + '\n')
+ "Error! %d inputs are given, "
+ "but only %d predictions are returned.") % (len(batch),
+ len(predictions))
+ output_f.write("\n".join(map(str, predictions)) + "\n")
-if __name__ == '__main__':
+if __name__ == "__main__":
inferer = Inferer(args.model_path)
inferer.infer(args.data_path)
diff --git a/dssm/network_conf.py b/dssm/network_conf.py
index 6888ca0ef44fe9711ba63fcb77fbcea088ce0171..135a00bf6f64da611e6df54c08837fcee9b39041 100644
--- a/dssm/network_conf.py
+++ b/dssm/network_conf.py
@@ -13,26 +13,33 @@ class DSSM(object):
class_num=None,
share_embed=False,
is_infer=False):
- '''
- @dnn_dims: list of int
- dimentions of each layer in semantic vector generator.
- @vocab_sizes: 2-d tuple
- size of both left and right items.
- @model_type: int
- type of task, should be 'rank: 0', 'regression: 1' or 'classification: 2'
- @model_arch: int
- model architecture
- @share_semantic_generator: bool
- whether to share the semantic vector generator for both left and right.
- @share_embed: bool
- whether to share the embeddings between left and right.
- @class_num: int
- number of categories.
- '''
+ """
+    :param dnn_dims: The dimension of each layer in the semantic vector
+ generator.
+ :type dnn_dims: list of int
+    :param vocab_sizes: The sizes of the left and right items.
+ :type vocab_sizes: A list having 2 elements.
+ :param model_type: The type of task to train the DSSM model. The value
+ should be "rank: 0", "regression: 1" or
+ "classification: 2".
+ :type model_type: int
+ :param model_arch: A value indicating the model architecture to use.
+ :type model_arch: int
+ :param share_semantic_generator: A flag indicating whether to share the
+ semantic vector between the left and
+ the right item.
+ :type share_semantic_generator: bool
+    :param share_embed: A flag indicating whether to share the embeddings
+ between the left and the right item.
+ :type share_embed: bool
+ :param class_num: The number of categories.
+ :type class_num: int
+ """
assert len(vocab_sizes) == 2, (
- "vocab_sizes specify the sizes left and right inputs, "
- "and dim should be 2.")
- assert len(dnn_dims) > 1, "more than two layers is needed."
+ "The vocab_sizes specifying the sizes left and right inputs. "
+ "Its dimension should be 2.")
+ assert len(dnn_dims) > 1, ("In the DNN model, more than two layers "
+ "are needed.")
self.dnn_dims = dnn_dims
self.vocab_sizes = vocab_sizes
@@ -42,91 +49,89 @@ class DSSM(object):
self.model_arch = ModelArch(model_arch)
self.class_num = class_num
self.is_infer = is_infer
- logger.warning("build DSSM model with config of %s, %s" %
+ logger.warning("Build DSSM model with config of %s, %s" %
(self.model_type, self.model_arch))
- logger.info("vocabulary sizes: %s" % str(self.vocab_sizes))
+ logger.info("The vocabulary size is : %s" % str(self.vocab_sizes))
# bind model architecture
_model_arch = {
- 'cnn': self.create_cnn,
- 'fc': self.create_fc,
- 'rnn': self.create_rnn,
+ "cnn": self.create_cnn,
+ "fc": self.create_fc,
+ "rnn": self.create_rnn,
}
- def _model_arch_creater(emb, prefix=''):
+ def _model_arch_creater(emb, prefix=""):
sent_vec = _model_arch.get(str(model_arch))(emb, prefix)
dnn = self.create_dnn(sent_vec, prefix)
return dnn
self.model_arch_creater = _model_arch_creater
- # build model type
_model_type = {
- 'classification': self._build_classification_model,
- 'rank': self._build_rank_model,
- 'regression': self._build_regression_model,
+ "classification": self._build_classification_model,
+ "rank": self._build_rank_model,
+ "regression": self._build_regression_model,
}
- print 'model type: ', str(self.model_type)
+ print("model type: ", str(self.model_type))
self.model_type_creater = _model_type[str(self.model_type)]
def __call__(self):
return self.model_type_creater()
- def create_embedding(self, input, prefix=''):
- '''
- Create an embedding table whose name has a `prefix`.
- '''
- logger.info("create embedding table [%s] which dimention is %d" %
+ def create_embedding(self, input, prefix=""):
+ """
+        Create the word embedding. The `prefix` is added in front of the
+        name of the embedding's learnable parameter.
+        """
+        logger.info("Create embedding table [%s] whose dimension is %d." %
(prefix, self.dnn_dims[0]))
emb = paddle.layer.embedding(
input=input,
size=self.dnn_dims[0],
- param_attr=ParamAttr(name='%s_emb.w' % prefix))
+ param_attr=ParamAttr(name="%s_emb.w" % prefix))
return emb
- def create_fc(self, emb, prefix=''):
- '''
+ def create_fc(self, emb, prefix=""):
+ """
A multi-layer fully connected neural networks.
- @emb: paddle.layer
- output of the embedding layer
- @prefix: str
- prefix of layers' names, used to share parameters between
- more than one `fc` parts.
- '''
+        :param emb: The output of the embedding layer.
+        :type emb: paddle.layer
+        :param prefix: A prefix to be added to the layers' names.
+ :type prefix: str
+ """
_input_layer = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
fc = paddle.layer.fc(
input=_input_layer,
size=self.dnn_dims[1],
- param_attr=ParamAttr(name='%s_fc.w' % prefix),
+ param_attr=ParamAttr(name="%s_fc.w" % prefix),
bias_attr=ParamAttr(name="%s_fc.b" % prefix, initial_std=0.))
return fc
- def create_rnn(self, emb, prefix=''):
- '''
+ def create_rnn(self, emb, prefix=""):
+ """
A GRU sentence vector learner.
- '''
+ """
gru = paddle.networks.simple_gru(
input=emb,
size=self.dnn_dims[1],
- mixed_param_attr=ParamAttr(name='%s_gru_mixed.w' % prefix),
+ mixed_param_attr=ParamAttr(name="%s_gru_mixed.w" % prefix),
mixed_bias_param_attr=ParamAttr(name="%s_gru_mixed.b" % prefix),
- gru_param_attr=ParamAttr(name='%s_gru.w' % prefix),
+ gru_param_attr=ParamAttr(name="%s_gru.w" % prefix),
gru_bias_attr=ParamAttr(name="%s_gru.b" % prefix))
sent_vec = paddle.layer.last_seq(gru)
return sent_vec
- def create_cnn(self, emb, prefix=''):
- '''
+ def create_cnn(self, emb, prefix=""):
+ """
A multi-layer CNN.
- @emb: paddle.layer
- output of the embedding layer
- @prefix: str
- prefix of layers' names, used to share parameters between
- more than one `cnn` parts.
- '''
+ :param emb: The word embedding.
+ :type emb: paddle.layer
+        :param prefix: The prefix to be added to the layers' names.
+ :type prefix: str
+ """
def create_conv(context_len, hidden_size, prefix):
key = "%s_%d_%d" % (prefix, context_len, hidden_size)
@@ -135,15 +140,15 @@ class DSSM(object):
context_len=context_len,
hidden_size=hidden_size,
# set parameter attr for parameter sharing
- context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'),
- fc_param_attr=ParamAttr(name=key + '_fc.w'),
- fc_bias_attr=ParamAttr(name=key + '_fc.b'),
- pool_bias_attr=ParamAttr(name=key + '_pool.b'))
+ context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"),
+ fc_param_attr=ParamAttr(name=key + "_fc.w"),
+ fc_bias_attr=ParamAttr(name=key + "_fc.b"),
+ pool_bias_attr=ParamAttr(name=key + "_pool.b"))
return conv
- logger.info('create a sequence_conv_pool which context width is 3')
+ logger.info("create a sequence_conv_pool which context width is 3")
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
- logger.info('create a sequence_conv_pool which context width is 4')
+ logger.info("create a sequence_conv_pool which context width is 4")
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
@@ -160,8 +165,8 @@ class DSSM(object):
input=_input_layer,
size=dim,
act=paddle.activation.Tanh(),
- param_attr=ParamAttr(name='%s.w' % name),
- bias_attr=ParamAttr(name='%s.b' % name, initial_std=0.))
+ param_attr=ParamAttr(name="%s.w" % name),
+ bias_attr=ParamAttr(name="%s.b" % name, initial_std=0.))
_input_layer = fc
return _input_layer
@@ -178,7 +183,7 @@ class DSSM(object):
is_classification=False)
def _build_rank_model(self):
- '''
+ """
Build a pairwise rank model, and the cost is returned.
A pairwise rank model has 3 inputs:
@@ -187,26 +192,26 @@ class DSSM(object):
- right_target sentence
- label, 1 if left_target should be sorted in front of
right_target, otherwise 0.
- '''
+ """
logger.info("build rank model")
assert self.model_type.is_rank()
source = paddle.layer.data(
- name='source_input',
+ name="source_input",
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
left_target = paddle.layer.data(
- name='left_target_input',
+ name="left_target_input",
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
right_target = paddle.layer.data(
- name='right_target_input',
+ name="right_target_input",
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
if not self.is_infer:
label = paddle.layer.data(
- name='label_input', type=paddle.data_type.integer_value(1))
+ name="label_input", type=paddle.data_type.integer_value(1))
- prefixs = '_ _ _'.split(
- ) if self.share_semantic_generator else 'source target target'.split()
- embed_prefixs = '_ _'.split(
- ) if self.share_embed else 'source target target'.split()
+ prefixs = "_ _ _".split(
+ ) if self.share_semantic_generator else "source target target".split()
+ embed_prefixs = "_ _".split(
+ ) if self.share_embed else "source target target".split()
word_vecs = []
for id, input in enumerate([source, left_target, right_target]):
@@ -218,9 +223,9 @@ class DSSM(object):
x = self.model_arch_creater(input, prefix=prefixs[id])
semantics.append(x)
- # cossim score of source and left_target
+ # The cosine similarity score of source and left_target.
left_score = paddle.layer.cos_sim(semantics[0], semantics[1])
- # cossim score of source and right target
+ # The cosine similarity score of source and right target.
right_score = paddle.layer.cos_sim(semantics[0], semantics[2])
if not self.is_infer:
@@ -233,34 +238,33 @@ class DSSM(object):
return right_score
def _build_classification_or_regression_model(self, is_classification):
- '''
+ """
Build a classification/regression model, and the cost is returned.
- A Classification has 3 inputs:
+ The classification/regression task expects 3 inputs:
- source sentence
- target sentence
- classification label
- '''
+ """
if is_classification:
- # prepare inputs.
assert self.class_num
source = paddle.layer.data(
- name='source_input',
+ name="source_input",
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
target = paddle.layer.data(
- name='target_input',
+ name="target_input",
type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
label = paddle.layer.data(
- name='label_input',
+ name="label_input",
type=paddle.data_type.integer_value(self.class_num)
if is_classification else paddle.data_type.dense_vector(1))
- prefixs = '_ _'.split(
- ) if self.share_semantic_generator else 'source target'.split()
- embed_prefixs = '_ _'.split(
- ) if self.share_embed else 'source target'.split()
+ prefixs = "_ _".split(
+ ) if self.share_semantic_generator else "source target".split()
+ embed_prefixs = "_ _".split(
+ ) if self.share_embed else "source target".split()
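+        # As in the rank model, a shared "_" prefix makes source and target
+        # resolve to the same parameter names and thus share weights.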
word_vecs = []
for id, input in enumerate([source, target]):
diff --git a/dssm/train.py b/dssm/train.py
index eb563d1d1a3c8f8f09543a887335ead824c1ee2a..9d5b5782ff0d372e61aae31be6cf8101904215e0 100644
--- a/dssm/train.py
+++ b/dssm/train.py
@@ -9,120 +9,129 @@ from utils import TaskType, load_dic, logger, ModelType, ModelArch, display_args
parser = argparse.ArgumentParser(description="PaddlePaddle DSSM example")
parser.add_argument(
- '-i',
- '--train_data_path',
+ "-i",
+ "--train_data_path",
type=str,
required=False,
- help="path of training dataset")
+ help="The path of training data.")
parser.add_argument(
- '-t',
- '--test_data_path',
+ "-t",
+ "--test_data_path",
type=str,
required=False,
- help="path of testing dataset")
+ help="The path of testing data.")
parser.add_argument(
- '-s',
- '--source_dic_path',
+ "-s",
+ "--source_dic_path",
type=str,
required=False,
- help="path of the source's word dic")
+ help="The path of the source's word dictionary.")
parser.add_argument(
- '--target_dic_path',
+ "--target_dic_path",
type=str,
required=False,
- help=("path of the target's word dictionary, "
- "if not set, the `source_dic_path` will be used"))
+    help=("The path of the target's word dictionary. "
+          "If this parameter is not set, `source_dic_path` will be used."))
parser.add_argument(
- '-b',
- '--batch_size',
+ "-b",
+ "--batch_size",
type=int,
default=32,
- help="size of mini-batch (default:32)")
+    help="The size of the mini-batch (default: 32).")
parser.add_argument(
- '-p',
- '--num_passes',
+ "-p",
+ "--num_passes",
type=int,
default=10,
- help="number of passes to run(default:10)")
+    help="The number of passes to run (default: 10).")
parser.add_argument(
- '-y',
- '--model_type',
+ "-y",
+ "--model_type",
type=int,
required=True,
default=ModelType.CLASSIFICATION_MODE,
- help="model type, %d for classification, %d for pairwise rank, %d for regression (default: classification)"
- % (ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE,
- ModelType.REGRESSION_MODE))
+    help=("The model type, %d for classification, %d for pairwise rank, "
+ "%d for regression (default: classification).") %
+ (ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE,
+ ModelType.REGRESSION_MODE))
parser.add_argument(
- '-a',
- '--model_arch',
+ "-a",
+ "--model_arch",
type=int,
required=True,
default=ModelArch.CNN_MODE,
- help="model architecture, %d for CNN, %d for FC, %d for RNN" %
+ help="The model architecture, %d for CNN, %d for FC, %d for RNN." %
(ModelArch.CNN_MODE, ModelArch.FC_MODE, ModelArch.RNN_MODE))
parser.add_argument(
- '--share_network_between_source_target',
+ "--share_network_between_source_target",
type=distutils.util.strtobool,
default=False,
- help="whether to share network parameters between source and target")
+ help="Whether to share network parameters between source and target.")
parser.add_argument(
- '--share_embed',
+ "--share_embed",
type=distutils.util.strtobool,
default=False,
- help="whether to share word embedding between source and target")
+ help="Whether to share word embedding between source and target.")
parser.add_argument(
- '--dnn_dims',
+ "--dnn_dims",
type=str,
- default='256,128,64,32',
- help="dimentions of dnn layers, default is '256,128,64,32', which means create a 4-layer dnn, demention of each layer is 256, 128, 64 and 32"
-)
+ default="256,128,64,32",
+    help=("The dimensions of the dnn layers, default is '256,128,64,32', "
+          "which means a 4-layer dnn is created. The dimension of each "
+          "layer is 256, 128, 64 and 32."))
parser.add_argument(
- '--num_workers', type=int, default=1, help="num worker threads, default 1")
+ "--num_workers",
+ type=int,
+ default=1,
+    help="The number of worker threads (default: 1).")
parser.add_argument(
- '--use_gpu',
+ "--use_gpu",
type=distutils.util.strtobool,
default=False,
- help="whether to use GPU devices (default: False)")
+    help="Whether to use GPU devices (default: False).")
parser.add_argument(
- '-c',
- '--class_num',
+ "-c",
+ "--class_num",
type=int,
default=0,
- help="number of categories for classification task.")
+    help="The number of categories for the classification task.")
parser.add_argument(
- '--model_output_prefix',
+ "--model_output_prefix",
type=str,
default="./",
- help="prefix of the path for model to store, (default: ./)")
+ help="The prefix of the path to store the trained models (default: ./).")
parser.add_argument(
- '-g',
- '--num_batches_to_log',
+ "-g",
+ "--num_batches_to_log",
type=int,
default=100,
- help="number of batches to output train log, (default: 100)")
+    help=("The log period. Every num_batches_to_log batches, "
+          "a training log will be printed (default: 100)."))
parser.add_argument(
- '-e',
- '--num_batches_to_test',
+ "-e",
+ "--num_batches_to_test",
type=int,
default=200,
- help="number of batches to test, (default: 200)")
+    help=("The test period. Every num_batches_to_test batches, "
+          "an evaluation on the test data will be performed (default: 200)."))
parser.add_argument(
- '-z',
- '--num_batches_to_save_model',
+ "-z",
+ "--num_batches_to_save_model",
type=int,
default=400,
- help="number of batches to output model, (default: 400)")
+ help=("Every num_batches_to_save_model batches, "
+ "a trained model will be saved (default: 400)."))
-# arguments check.
args = parser.parse_args()
args.model_type = ModelType(args.model_type)
args.model_arch = ModelArch(args.model_arch)
if args.model_type.is_classification():
- assert args.class_num > 1, "--class_num should be set in classification task."
+    assert args.class_num > 1, ("The parameter --class_num should be set "
+                                "for the classification task.")
-layer_dims = [int(i) for i in args.dnn_dims.split(',')]
-args.target_dic_path = args.source_dic_path if not args.target_dic_path else args.target_dic_path
+layer_dims = [int(i) for i in args.dnn_dims.split(",")]
+if not args.target_dic_path:
+    args.target_dic_path = args.source_dic_path
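+# A hedged usage sketch (the numeric mode ids are assumed to follow the
+# enumeration order in utils.py: classification=0, rank=1, regression=2 and
+# fc=0, cnn=1, rnn=2):
+#   python train.py --model_type 0 --model_arch 1 --class_num 2
+# would train a CNN classification model on the bundled demo data.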
def train(train_data_path=None,
@@ -138,15 +147,15 @@ def train(train_data_path=None,
class_num=None,
num_workers=1,
use_gpu=False):
- '''
+ """
Train the DSSM.
- '''
- default_train_path = './data/rank/train.txt'
- default_test_path = './data/rank/test.txt'
- default_dic_path = './data/vocab.txt'
+ """
+ default_train_path = "./data/rank/train.txt"
+ default_test_path = "./data/rank/test.txt"
+ default_dic_path = "./data/vocab.txt"
if not model_type.is_rank():
- default_train_path = './data/classification/train.txt'
- default_test_path = './data/classification/test.txt'
+ default_train_path = "./data/classification/train.txt"
+ default_test_path = "./data/classification/test.txt"
use_default_data = not train_data_path
@@ -200,19 +209,19 @@ def train(train_data_path=None,
feeding = {}
if model_type.is_classification() or model_type.is_regression():
- feeding = {'source_input': 0, 'target_input': 1, 'label_input': 2}
+ feeding = {"source_input": 0, "target_input": 1, "label_input": 2}
else:
feeding = {
- 'source_input': 0,
- 'left_target_input': 1,
- 'right_target_input': 2,
- 'label_input': 3
+ "source_input": 0,
+ "left_target_input": 1,
+ "right_target_input": 2,
+ "label_input": 3
}
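+    # Both feeding dicts map each data layer's name to the position of the
+    # corresponding field in the tuples produced by the data reader.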
def _event_handler(event):
- '''
+ """
Define batch handler
- '''
+ """
if isinstance(event, paddle.event.EndIteration):
# output train log
if event.batch_id % args.num_batches_to_log == 0:
@@ -249,7 +258,7 @@ def train(train_data_path=None,
logger.info("Training has finished.")
-if __name__ == '__main__':
+if __name__ == "__main__":
display_args(args)
train(
train_data_path=args.train_data_path,
diff --git a/dssm/utils.py b/dssm/utils.py
index 7bcbec6ebfb2e41d94e49b4e39eaf106e414c103..97296fd5dcc2dc664c97dd83d658c8805221fc57 100644
--- a/dssm/utils.py
+++ b/dssm/utils.py
@@ -8,7 +8,7 @@ logger.setLevel(logging.INFO)
def mode_attr_name(mode):
- return mode.upper() + '_MODE'
+ return mode.upper() + "_MODE"
def create_attrs(cls):
@@ -17,9 +17,9 @@ def create_attrs(cls):
def make_check_method(cls):
- '''
+ """
create methods for classes.
- '''
+ """
def method(mode):
def _method(self):
@@ -28,7 +28,7 @@ def make_check_method(cls):
return _method
for id, mode in enumerate(cls.modes):
- setattr(cls, 'is_' + mode, method(mode))
+ setattr(cls, "is_" + mode, method(mode))
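+        # e.g. ModelType gains is_classification(), is_rank() and
+        # is_regression() predicates; TaskType and ModelArch likewise.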
def make_create_method(cls):
@@ -41,10 +41,10 @@ def make_create_method(cls):
return _method
for id, mode in enumerate(cls.modes):
- setattr(cls, 'create_' + mode, method(mode))
+ setattr(cls, "create_" + mode, method(mode))
-def make_str_method(cls, type_name='unk'):
+def make_str_method(cls, type_name="unk"):
def _str_(self):
for mode in cls.modes:
if self.mode == getattr(cls, mode_attr_name(mode)):
@@ -53,9 +53,9 @@ def make_str_method(cls, type_name='unk'):
def _hash_(self):
return self.mode
- setattr(cls, '__str__', _str_)
- setattr(cls, '__repr__', _str_)
- setattr(cls, '__hash__', _hash_)
+ setattr(cls, "__str__", _str_)
+ setattr(cls, "__repr__", _str_)
+ setattr(cls, "__hash__", _hash_)
cls.__name__ = type_name
@@ -65,7 +65,7 @@ def _init_(self, mode, cls):
elif isinstance(mode, cls):
self.mode = mode.mode
else:
- raise Exception("wrong mode type, get type: %s, value: %s" %
+        raise Exception("Wrong mode type, got type: %s, value: %s." %
(type(mode), mode))
@@ -77,21 +77,21 @@ def build_mode_class(cls):
class TaskType(object):
- modes = 'train test infer'.split()
+ modes = "train test infer".split()
def __init__(self, mode):
_init_(self, mode, TaskType)
class ModelType:
- modes = 'classification rank regression'.split()
+ modes = "classification rank regression".split()
def __init__(self, mode):
_init_(self, mode, ModelType)
class ModelArch:
- modes = 'fc cnn rnn'.split()
+ modes = "fc cnn rnn".split()
def __init__(self, mode):
_init_(self, mode, ModelArch)
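+# A brief usage note (mirroring the removed __main__ demo below): e.g.
+# TaskType.create_train() builds a TaskType whose is_train() returns True,
+# and str() maps the instance back to its mode name.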
@@ -103,22 +103,16 @@ build_mode_class(ModelArch)
def sent2ids(sent, vocab):
- '''
+ """
transform a sentence to a list of ids.
-
- @sent: str
- a sentence.
- @vocab: dict
- a word dic
- '''
+ """
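+    # e.g. with vocab = {"hello": 3}, sent2ids("hello world", vocab) yields
+    # [3, UNK], since "world" is out of vocabulary.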
return [vocab.get(w, UNK) for w in sent.split()]
def load_dic(path):
- '''
- word dic format:
- each line is a word
- '''
+ """
+    The format of the word dictionary: each line is a word.
+ """
dic = {}
with open(path) as f:
for id, line in enumerate(f):
@@ -128,13 +122,6 @@ def load_dic(path):
def display_args(args):
- logger.info("arguments passed by command line:")
+    logger.info("The arguments passed from the command line are:")
-    for k, v in sorted(v for v in vars(args).items()):
+    for k, v in sorted(vars(args).items()):
logger.info("{}:\t{}".format(k, v))
-
-
-if __name__ == '__main__':
- t = TaskType(1)
- t = TaskType.create_train()
- print t
- print 'is', t.is_train()