+# Deep Structured Semantic Models (DSSM)
+Deep Structured Semantic Models (DSSM) is a simple but powerful DNN-based model for matching web search queries and URL-based documents. This example demonstrates how to use PaddlePaddle to implement a generic DSSM model for modeling the semantic similarity between two strings.
+
+## Background Introduction
+DSSM \[[1](#references)\] is a classic semantic model proposed by Microsoft Research. It is used to measure the semantic distance between two texts. More broadly, it can also be applied to the following scenarios:
+
+1. CTR prediction, which measures the degree of association between a user search query (Query) and candidate web pages (Documents).
+2. Text relevance, which measures the degree of semantic correlation between two strings.
+3. Automatic recommendation, which measures the degree of association between a user (User) and a recommended item (Item).
+
+
+## Model Architecture
+In the original paper \[[1](#references)\], the DSSM model uses the implicit semantic relation between the user search query and the document as the metric. The model structure is as follows:
+
+
+
+Figure 1. DSSM in the original paper
+
+
+
+With subsequent optimizations that simplify the structure \[[3](#references)\], the model becomes:
+
+
+
+Figure 2. DSSM generic structure
+
+
+The blank box in the figure can be replaced by any model, such as a fully connected network (FC), a convolutional neural network (CNN), or a recurrent neural network (RNN). The structure is designed to measure the semantic distance between two elements (such as strings).
+
+In practice, the DSSM model serves as a basic building block that is combined with different loss functions to solve specific tasks, for example:
+
+- In a ranking system, add a pairwise rank loss to the structure in Figure 2.
+- In CTR estimation, treat whether the user clicks as a binary classification problem and use the cross-entropy loss.
+- In a regression model, use the cosine similarity between the two semantic vectors as the predicted similarity score.
+
+## Model Implementation
+At a high level, the DSSM model is composed of three components: the left DNN, the right DNN, and a loss function on top of them. In complex tasks, the structures of the left and right DNNs can be different. In this example we keep the two DNN structures the same, and the architecture can be any of FC, CNN, or RNN.
+
+In PaddlePaddle, loss functions are provided for classification, regression, and ranking. The distance between the outputs of the left and right DNNs is measured by cosine similarity (cossim); in the classification task, the predicted class distribution is computed by a softmax layer.
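+
+To make this concrete, here is a minimal, hypothetical sketch of how a cosine-similarity score and a softmax prediction can be paired with PaddlePaddle's regression and classification losses (the pairwise rank loss is shown later in the Pairwise Rank section). The layer sizes and names below are illustrative assumptions, not the code of `./network_conf.py`.
+
+```python
+import paddle.v2 as paddle
+
+paddle.init(use_gpu=False, trainer_count=1)
+
+# two hypothetical semantic vectors produced by the left and right DNNs
+left_vec = paddle.layer.data(
+    name='left_vec', type=paddle.data_type.dense_vector(128))
+right_vec = paddle.layer.data(
+    name='right_vec', type=paddle.data_type.dense_vector(128))
+
+# regression: the cosine similarity itself is the predicted score
+score = paddle.layer.cos_sim(left_vec, right_vec)
+score_label = paddle.layer.data(
+    name='score_label', type=paddle.data_type.dense_vector(1))
+regression_cost = paddle.layer.mse_cost(score, score_label)
+
+# classification (e.g. CTR): a softmax over the concatenated vectors
+class_label = paddle.layer.data(
+    name='class_label', type=paddle.data_type.integer_value(2))
+prediction = paddle.layer.fc(
+    input=paddle.layer.concat(input=[left_vec, right_vec]),
+    size=2,
+    act=paddle.activation.Softmax())
+classification_cost = paddle.layer.classification_cost(
+    input=prediction, label=class_label)
+```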
+
+Related references:
+
+- For how CNN and FC extract text information, refer to [text classification](https://github.com/PaddlePaddle/models/blob/develop/text_classification/README.md#模型详解).
+- Details of RNN / GRU can be found in [Machine Translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#gated-recurrent-unit-gru).
+- For Pairwise Rank learning, please refer to [learn to rank](https://github.com/PaddlePaddle/models/blob/develop/ltr/README.md).
+
+Figure 3 shows the general architecture for both regression and classification models.
+
+
+
+Figure 3. DSSM for REGRESSION or CLASSIFICATION
+
+
+The structure of the Pairwise Rank is more complex, as shown in Figure 4.
+
+
+
+Figure 4. DSSM for Pairwise Rank
+
+
+Below, we describe how to build the DSSM model in PaddlePaddle. All the code is included in `./network_conf.py`.
+
+
+### Create a word vector table for the text
+```python
+def create_embedding(self, input, prefix=''):
+ '''
+ Create an embedding table whose name has a `prefix`.
+ '''
+    logger.info("create embedding table [%s] which dimension is %d" %
+                (prefix, self.dnn_dims[0]))
+ emb = paddle.layer.embedding(
+ input=input,
+ size=self.dnn_dims[0],
+ param_attr=ParamAttr(name='%s_emb.w' % prefix))
+ return emb
+```
+
+The input to this layer is a sequence of word IDs for a sentence; the embedding table maps each ID to a word vector, so the layer outputs a sequence of word vectors.
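+
+For illustration, the sketch below (with a made-up vocabulary size and embedding dimension) shows the kind of input `create_embedding` expects and what it produces:
+
+```python
+import paddle.v2 as paddle
+
+VOCAB_SIZE = 10000  # hypothetical vocabulary size
+EMB_DIM = 256       # hypothetical embedding dimension, i.e. dnn_dims[0]
+
+# each sample is a variable-length sequence of word IDs
+words = paddle.layer.data(
+    name='source_input',
+    type=paddle.data_type.integer_value_sequence(VOCAB_SIZE))
+
+# every word ID is mapped to an EMB_DIM-dimensional vector, so the output
+# is a sequence of word vectors, one per input word
+emb = paddle.layer.embedding(input=words, size=EMB_DIM)
+```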
+
+### CNN implementation
+```python
+def create_cnn(self, emb, prefix=''):
+ '''
+ A multi-layer CNN.
+
+ @emb: paddle.layer
+ output of the embedding layer
+ @prefix: str
+        prefix of layers' names, used to share parameters among multiple `cnn` parts.
+ '''
+ def create_conv(context_len, hidden_size, prefix):
+ key = "%s_%d_%d" % (prefix, context_len, hidden_size)
+ conv = paddle.networks.sequence_conv_pool(
+ input=emb,
+ context_len=context_len,
+ hidden_size=hidden_size,
+ # set parameter attr for parameter sharing
+ context_proj_param_attr=ParamAttr(name=key + 'contex_proj.w'),
+ fc_param_attr=ParamAttr(name=key + '_fc.w'),
+ fc_bias_attr=ParamAttr(name=key + '_fc.b'),
+ pool_bias_attr=ParamAttr(name=key + '_pool.b'))
+ return conv
+
+ logger.info('create a sequence_conv_pool which context width is 3')
+ conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
+ logger.info('create a sequence_conv_pool which context width is 4')
+ conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
+ return conv_3, conv_4
+```
+
+The CNN takes the word vector sequence from the embedding table, processes it with convolution and pooling, and outputs semantic feature vectors.
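+
+`create_cnn` returns two pooled feature vectors, one for a context window of 3 and one for a window of 4. A simple way to fuse them into a single sentence vector, assumed here purely for illustration, is concatenation:
+
+```python
+# fuse the two convolution branches into one semantic vector
+conv_3, conv_4 = self.create_cnn(emb, prefix='left')
+sent_vec = paddle.layer.concat(input=[conv_3, conv_4])
+```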
+
+### RNN implementation
+
+RNNs are suitable for learning from sequences of variable length. Here a GRU encodes the word vector sequence into a sentence vector.
+
+```python
+def create_rnn(self, emb, prefix=''):
+ '''
+ A GRU sentence vector learner.
+ '''
+    gru = paddle.networks.simple_gru(input=emb, size=self.dnn_dims[1])
+    # take the encoding at the last time step as the sentence vector
+    sent_vec = paddle.layer.last_seq(gru)
+ return sent_vec
+```
+
+### FC implementation
+
+```python
+def create_fc(self, emb, prefix=''):
+ '''
+    A multi-layer fully connected neural network.
+
+ @emb: paddle.layer
+ output of the embedding layer
+ @prefix: str
+        prefix of layers' names, used to share parameters among multiple `fc` parts.
+ '''
+ _input_layer = paddle.layer.pooling(
+ input=emb, pooling_type=paddle.pooling.Max())
+ fc = paddle.layer.fc(input=_input_layer, size=self.dnn_dims[1])
+ return fc
+```
+
+In the FC network, we first use `paddle.layer.pooling` to apply max pooling over the word vector sequence, which turns the variable-length sequence into a fixed-dimensional vector, and then feed that vector into a fully connected layer.
+
+### Multi-layer DNN implementation
+
+```python
+def create_dnn(self, sent_vec, prefix):
+    # stack a fully connected layer for every dimension after the first
+    # one (the embedding dimension)
+ if len(self.dnn_dims) > 1:
+ _input_layer = sent_vec
+ for id, dim in enumerate(self.dnn_dims[1:]):
+ name = "%s_fc_%d_%d" % (prefix, id, dim)
+ logger.info("create fc layer [%s] which dimention is %d" %
+ (name, dim))
+ fc = paddle.layer.fc(
+ input=_input_layer,
+ size=dim,
+ name=name,
+ act=paddle.activation.Tanh(),
+ param_attr=ParamAttr(name='%s.w' % name),
+ bias_attr=ParamAttr(name='%s.b' % name),
+ )
+ _input_layer = fc
+ return _input_layer
+```
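+
+For example, with a hypothetical setting of `dnn_dims = [256, 128, 64, 32]`, `create_dnn` stacks three fully connected Tanh layers on top of the sentence vector of each tower:
+
+```python
+# hypothetical configuration: embedding dim 256 plus three hidden sizes
+self.dnn_dims = [256, 128, 64, 32]
+
+# builds left_fc_0_128 -> left_fc_1_64 -> left_fc_2_32 (all Tanh) and
+# returns the top-most layer as the semantic vector of the left tower
+left_semantic = self.create_dnn(sent_vec, prefix='left')
+```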
+
+### Classification / Regression
+The structures of the classification and regression models are similar, so the function below can be used for both tasks.
+
+```python
+def _build_classification_or_regression_model(self, is_classification):
+ '''
+ Build a classification/regression model, and the cost is returned.
+
+    A classification/regression model has 3 inputs:
+ - source sentence
+ - target sentence
+ - classification label
+
+ '''
+ # prepare inputs.
+ assert self.class_num
+
+ source = paddle.layer.data(
+ name='source_input',
+ type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
+ target = paddle.layer.data(
+ name='target_input',
+ type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
+    label = paddle.layer.data(
+        name='label_input',
+        type=paddle.data_type.integer_value(self.class_num)
+        if is_classification else paddle.data_type.dense_vector(1))
+
+ prefixs = '_ _'.split(
+ ) if self.share_semantic_generator else 'left right'.split()
+ embed_prefixs = '_ _'.split(
+ ) if self.share_embed else 'left right'.split()
+
+ word_vecs = []
+ for id, input in enumerate([source, target]):
+ x = self.create_embedding(input, prefix=embed_prefixs[id])
+ word_vecs.append(x)
+
+ semantics = []
+ for id, input in enumerate(word_vecs):
+ x = self.model_arch_creater(input, prefix=prefixs[id])
+ semantics.append(x)
+
+    if is_classification:
+        concated_vector = paddle.layer.concat(input=semantics)
+        prediction = paddle.layer.fc(
+            input=concated_vector,
+            size=self.class_num,
+            act=paddle.activation.Softmax())
+        cost = paddle.layer.classification_cost(
+            input=prediction, label=label)
+    else:
+        # for regression, the cosine similarity between the two semantic
+        # vectors is used directly as the predicted score
+        prediction = paddle.layer.cos_sim(semantics[0], semantics[1])
+        cost = paddle.layer.mse_cost(prediction, label)
+    return cost, prediction, label
+```
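+
+A minimal sketch of how the returned cost could be plugged into the PaddlePaddle v2 training API; the `model` object, optimizer, and hyper-parameters below are illustrative assumptions rather than the example's actual training script:
+
+```python
+import paddle.v2 as paddle
+
+paddle.init(use_gpu=False, trainer_count=1)
+
+# `model` is assumed to be an already-constructed instance of the
+# DSSM network class sketched above
+cost, prediction, label = model._build_classification_or_regression_model(
+    is_classification=True)
+
+parameters = paddle.parameters.create(cost)
+optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
+trainer = paddle.trainer.SGD(
+    cost=cost, parameters=parameters, update_equation=optimizer)
+```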
+
+### Pairwise Rank
+
+
+```python
+def _build_rank_model(self):
+ '''
+ Build a pairwise rank model, and the cost is returned.
+
+    A pairwise rank model has 4 inputs:
+ - source sentence
+ - left_target sentence
+ - right_target sentence
+ - label, 1 if left_target should be sorted in front of right_target, otherwise 0.
+ '''
+ source = paddle.layer.data(
+ name='source_input',
+ type=paddle.data_type.integer_value_sequence(self.vocab_sizes[0]))
+ left_target = paddle.layer.data(
+ name='left_target_input',
+ type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
+ right_target = paddle.layer.data(
+ name='right_target_input',
+ type=paddle.data_type.integer_value_sequence(self.vocab_sizes[1]))
+ label = paddle.layer.data(
+ name='label_input', type=paddle.data_type.integer_value(1))
+
+ prefixs = '_ _ _'.split(
+ ) if self.share_semantic_generator else 'source left right'.split()
+    embed_prefixs = '_ _ _'.split(
+    ) if self.share_embed else 'source target target'.split()
+
+ word_vecs = []
+ for id, input in enumerate([source, left_target, right_target]):
+ x = self.create_embedding(input, prefix=embed_prefixs[id])
+ word_vecs.append(x)
+
+ semantics = []
+ for id, input in enumerate(word_vecs):
+ x = self.model_arch_creater(input, prefix=prefixs[id])
+ semantics.append(x)
+
+ # cossim score of source and left_target
+ left_score = paddle.layer.cos_sim(semantics[0], semantics[1])
+ # cossim score of source and right target
+ right_score = paddle.layer.cos_sim(semantics[0], semantics[2])
+
+ # rank cost
+ cost = paddle.layer.rank_cost(left_score, right_score, label=label)
+    # prediction = left_score - right_score
+    # is not supported as a layer at the moment,
+    # so AUC is not computed for this model.
+ return cost, None, None
+```
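+
+Because the score difference cannot be expressed as a layer here, a common workaround at inference time, sketched below as an assumption rather than code from `network_conf.py`, is to infer the two cosine scores separately and compare them outside the network:
+
+```python
+# hypothetical post-processing of the two inferred cosine scores:
+# left_target should be ranked before right_target iff its score is larger
+def rank_pair(left_score, right_score):
+    return 1 if left_score > right_score else 0
+```
+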
+## Data Format
+Below are simple examples of the data in `./data`.
+
+### Regression data format
+```
+# 3 fields each line:
+# - source's word ids
+# - target's word ids
+# - target
+
+<source's word ids> \t <target's word ids> \t <target>
+```
+
+The example of this format is as follows.
+
+```
+3 6 10 \t 6 8 33 \t 0.7
+6 0 \t 6 9 330 \t 0.03
+```
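+
+Purely for illustration, a minimal reader for this format might look like the sketch below; the function name is made up and the example's own data reader may be implemented differently:
+
+```python
+def regression_reader(path):
+    '''
+    Yield (source word ids, target word ids, score) tuples parsed from
+    lines of the form "<ids> \t <ids> \t <score>".
+    '''
+    def reader():
+        with open(path) as f:
+            for line in f:
+                source, target, score = [
+                    field.strip() for field in line.strip().split('\t')
+                ]
+                yield ([int(w) for w in source.split()],
+                       [int(w) for w in target.split()],
+                       float(score))
+    return reader
+```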
+
+### Classification data format
+```
+# 3 fields each line:
+# - source's word ids
+# - target's word ids
+# - label
+<source's word ids> \t <target's word ids> \t <label>
+```
+