The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).

---

# Deep Structured Semantic Models (DSSM)

Deep Structured Semantic Models (DSSM) is a simple but powerful DNN-based model for matching web search queries and URL-based documents. This example demonstrates how to use PaddlePaddle to implement a generic DSSM model for modeling the semantic similarity between two strings.

## Background Introduction

DSSM \[[1](#References)] is a classic semantic model proposed by Microsoft Research. It is used to measure the semantic distance between two texts. Typical applications of DSSM include the following:

1. CTR prediction, which measures the degree of association between a user search query and candidate web pages.
2. Text relevance, which measures the degree of semantic correlation between two strings.
3. Automatic recommendation, which measures the degree of association between a user and a recommended item.


## Model Architecture

In the original paper \[[1](#References)], the DSSM model uses the implicit semantic relation between the user search query and the document as its metric. The model structure is as follows:

<p align="center">
<img src="./images/dssm.png"/><br/><br/>
Figure 1. DSSM in the original paper
</p>


With subsequent optimizations that simplify its structure \[[3](#References)], the model becomes:

<p align="center">
<img src="./images/dssm2.png" width="600"/><br/><br/>
Figure 2. DSSM generic structure
</p>

The blank boxes in the figure can be replaced by any model, such as a fully connected network (FC), a convolutional network (CNN), or a recurrent network (RNN). The structure is designed to measure the semantic distance between two elements (such as strings).

In practice, the DSSM model serves as a basic building block, combined with different loss functions to solve specific tasks, such as:

J
julie 已提交
37 38
- In ranking system, the pairwise rank loss function.
- In the CTR estimate, instead of the binary classification on the click, use cross-entropy loss for a classification model
J
julie 已提交
39
- In regression model,  the cosine similarity is used to calculate the similarity
S
Superjom 已提交
40

## Model Implementation

At a high level, the DSSM model is composed of three components: the left DNN, the right DNN, and a loss function on top of them. In complex tasks, the structures of the left and right DNNs can be different. In this example, we keep the two DNN structures the same, and the DNN can be any of FC, CNN, or RNN.

In PaddlePaddle, loss functions are supported for classification, regression, and ranking. In all of them, the distance between the left and right DNN outputs is measured by the cosine similarity. In the classification task, the predicted distribution is computed by a softmax.
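
To make this concrete, below is a minimal sketch of how the cosine similarity between the two semantic vectors can be computed with the PaddlePaddle v2 layer API. The layer names and sizes here are our own placeholders, not code from `./network_conf.py`.

```python
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# Stand-ins for the semantic vectors produced by the left and right DNNs.
left_input = paddle.layer.data(
    name='left_input', type=paddle.data_type.dense_vector(32))
right_input = paddle.layer.data(
    name='right_input', type=paddle.data_type.dense_vector(32))
left_vec = paddle.layer.fc(input=left_input, size=32)
right_vec = paddle.layer.fc(input=right_input, size=32)

# The distance between the two outputs is their cosine similarity; a
# classification model would additionally put a softmax layer on top.
sim = paddle.layer.cos_sim(a=left_vec, b=right_vec)
```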

J
julie 已提交
46
Here we demonstrate:
S
Superjom 已提交
47

- For how CNN and FC extract text information, refer to [text classification](https://github.com/PaddlePaddle/models/blob/develop/text_classification/README.md#模型详解).
- For details on RNN / GRU, see [Machine Translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#gated-recurrent-unit-gru).
- For pairwise rank learning, refer to [learning to rank](https://github.com/PaddlePaddle/models/blob/develop/ltr/README.md).

Figure 3 shows the general architecture for both regression and classification models.

<p align="center">
<img src="./images/dssm3.jpg"/><br/><br/>
Figure 3. DSSM for regression or classification
</p>

The structure of Pairwise Rank is more complex, as shown in Figure 4.

<p align="center">
<img src="./images/dssm2.jpg"/><br/><br/>
Figure 4. DSSM for Pairwise Rank
</p>

Below, we describe how to train a DSSM model in PaddlePaddle. All the code is included in `./network_conf.py`.


### Create a word vector table for the text
```python
def create_embedding(self, input, prefix=''):
    """
    Create word embedding. The `prefix` is added in front of the name of
    the embedding's learnable parameter.
    """
    logger.info("Create embedding table [%s] whose dimension is %d" %
                (prefix, self.dnn_dims[0]))
    emb = paddle.layer.embedding(
        input=input,
        size=self.dnn_dims[0],
        param_attr=ParamAttr(name='%s_emb.w' % prefix))
    return emb
```

Since the input of the embedding table is a list of the IDs of the words in a sentence, the embedding layer outputs the corresponding sequence of word vectors.
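
For example, the `input` argument of `create_embedding` would typically be a data layer of word IDs, declared roughly as follows (a sketch; `dict_size` stands for the real dictionary size loaded from `source_dic_path`):

```python
import paddle.v2 as paddle

dict_size = 10000  # hypothetical vocabulary size

# A sentence enters the network as a variable-length sequence of word IDs;
# create_embedding maps it to a sequence of word vectors.
source_input = paddle.layer.data(
    name='source_input',
    type=paddle.data_type.integer_value_sequence(dict_size))
```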

### CNN implementation
```python
def create_cnn(self, emb, prefix=''):
    """
    A multi-layer CNN.
    :param emb: The word embedding.
    :type emb: paddle.layer
    :param prefix: The prefix will be added to the layers' names.
    :type prefix: str
    """

    def create_conv(context_len, hidden_size, prefix):
        key = "%s_%d_%d" % (prefix, context_len, hidden_size)
        conv = paddle.networks.sequence_conv_pool(
            input=emb,
            context_len=context_len,
            hidden_size=hidden_size,
            # set parameter attr for parameter sharing
            context_proj_param_attr=ParamAttr(name=key + "_context_proj.w"),
            fc_param_attr=ParamAttr(name=key + "_fc.w"),
            fc_bias_attr=ParamAttr(name=key + "_fc.b"),
            pool_bias_attr=ParamAttr(name=key + "_pool.b"))
        return conv

    conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
    conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
    return paddle.layer.concat(input=[conv_3, conv_4])
```

The CNN accepts the word sequence from the embedding layer, processes the data by convolution and pooling, and finally outputs a semantic vector. Here `conv_3` and `conv_4` cover 3-word and 4-word contexts respectively, so after `paddle.layer.concat` the semantic vector has dimension `2 * self.dnn_dims[1]`.

### RNN implementation

RNN is suitable for learning from variable-length sequences.

```python
def create_rnn(self, emb, prefix=''):
    """
    A GRU sentence vector learner.
    """
    gru = paddle.networks.simple_gru(
        input=emb,
        size=self.dnn_dims[1],
        mixed_param_attr=ParamAttr(name='%s_gru_mixed.w' % prefix),
        mixed_bias_param_attr=ParamAttr(name="%s_gru_mixed.b" % prefix),
        gru_param_attr=ParamAttr(name='%s_gru.w' % prefix),
        gru_bias_attr=ParamAttr(name="%s_gru.b" % prefix))
    sent_vec = paddle.layer.last_seq(gru)
    return sent_vec
```

### FC implementation

```python
def create_fc(self, emb, prefix=''):
    """
    A multi-layer fully connected neural network.
    :param emb: The output of the embedding layer
    :type emb: paddle.layer
    :param prefix: A prefix will be added to the layers' names.
    :type prefix: str
    """

    _input_layer = paddle.layer.pooling(
        input=emb, pooling_type=paddle.pooling.Max())
    fc = paddle.layer.fc(
        input=_input_layer,
        size=self.dnn_dims[1],
        param_attr=ParamAttr(name='%s_fc.w' % prefix),
        bias_attr=ParamAttr(name="%s_fc.b" % prefix))
    return fc
```

In the FC implementation, we use `paddle.layer.pooling` to perform max pooling over the word vector sequence, which turns the variable-length sequence into a fixed-dimension vector before the fully connected layer.

### Multi-layer DNN implementation

```python
def create_dnn(self, sent_vec, prefix):
    if len(self.dnn_dims) > 1:
        _input_layer = sent_vec
        for id, dim in enumerate(self.dnn_dims[1:]):
            name = "%s_fc_%d_%d" % (prefix, id, dim)
            fc = paddle.layer.fc(
                input=_input_layer,
                size=dim,
                act=paddle.activation.Tanh(),
                param_attr=ParamAttr(name='%s.w' % name),
                bias_attr=ParamAttr(name='%s.b' % name))
            _input_layer = fc
    return _input_layer
```

### Classification / Regression
The structures of classification and regression are similar. The function below can be used for both tasks.
Please check the function `_build_classification_or_regression_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the detailed implementation.
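
Its core logic can be sketched as follows. This is a simplified illustration under our own naming, not the verbatim repository code; `left_vec` and `right_vec` stand for the outputs of the two DNNs.

```python
# Simplified sketch of the classification / regression head.
if is_classification:
    # Classification: combine the two semantic vectors and predict
    # a distribution over class_num classes with a softmax.
    concat = paddle.layer.concat(input=[left_vec, right_vec])
    prediction = paddle.layer.fc(
        input=concat, size=class_num, act=paddle.activation.Softmax())
    label = paddle.layer.data(
        name='label_input', type=paddle.data_type.integer_value(class_num))
    cost = paddle.layer.classification_cost(input=prediction, label=label)
else:
    # Regression: fit the cosine similarity of the two vectors to the target.
    sim = paddle.layer.cos_sim(a=left_vec, b=right_vec)
    label = paddle.layer.data(
        name='label_input', type=paddle.data_type.dense_vector(1))
    cost = paddle.layer.mse_cost(input=sim, label=label)
```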

### Pairwise Rank

Please check the function `_build_rank_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the detailed implementation.
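
Its essential structure can be sketched as follows (again a simplified illustration; `source_vec`, `left_target_vec`, and `right_target_vec` stand for the DNN outputs of the source and the two candidate targets):

```python
# Simplified sketch of the pairwise-rank head: score one source against two
# candidate targets and learn their relative order with a rank cost.
sim_left = paddle.layer.cos_sim(a=source_vec, b=left_target_vec)
sim_right = paddle.layer.cos_sim(a=source_vec, b=right_target_vec)

# The label says whether the left target should rank above the right one.
label = paddle.layer.data(
    name='label_input', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.rank_cost(left=sim_left, right=sim_right, label=label)
```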

## Data Format

Below are simple examples of the data in `./data`.

### Regression data format
```
# 3 fields each line:
#   - source word list
#   - target word list
#   - target
<word list> \t <word list> \t <float>
```

An example of this format:

```
Six bags of apples    Apple 6s    0.1
The new driver    The driving school    0.9
```
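
A record in this format can be parsed with ordinary string operations. For example, a data reader might do something like this (plain Python, our own sketch):

```python
def parse_regression_line(line):
    """Split one record into (source words, target words, score)."""
    source, target, score = line.rstrip('\n').split('\t')
    return source.split(), target.split(), float(score)

# parse_regression_line('Six bags of apples\tApple 6s\t0.1')
# -> (['Six', 'bags', 'of', 'apples'], ['Apple', '6s'], 0.1)
```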

### Classification data format
```
# 3 fields each line:
#   - source word list
#   - target word list
#   - target
<word list> \t <word list> \t <label>
```

An example of this format:

```
Six bags of apples    Apple 6s    0
The new driver    The driving school    1
```


### Ranking data format
```
# 4 fields each line:
#   - source word list
#   - target1 word list
#   - target2 word list
#   - label
<word list> \t <word list> \t <word list> \t <label>
```

An example of this format:

```
Six bags of apples    Apple 6s    The new driver    1
The new driver    The driving school    Apple 6s    1
```

## Training

We use `python train.py -y 0 --model_arch 0 --class_num 2` with the data in `./data/classification` to train a DSSM model for classification. The parameters of the script `train.py` can be listed by executing `python train.py --help`. Some important parameters are listed below, followed by an example invocation.

- `train_data_path` Training data path
- `test_data_path`  Test data path, optional
- `source_dic_path`  Source dictionary path
- `target_dic_path` Target dictionary path
- `model_type` The type of loss function of the model: 0 for classification, 1 for ranking, 2 for regression
- `model_arch` Model structure: 0 for FC, 1 for CNN, 2 for RNN
- `dnn_dims` The dimensions of each layer of the model. The default is `256,128,64,32`, i.e. a DNN with 4 layers.
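
For example, a hypothetical invocation that trains a ranking model with a CNN backbone could look like this (the data and dictionary paths are placeholders for your own files):

```
python train.py \
    --train_data_path ./data/rank/train.txt \
    --source_dic_path ./data/vocab.txt \
    --target_dic_path ./data/vocab.txt \
    --model_type 1 \
    --model_arch 1
```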

## Prediction Using a Trained Model

The parameters of the script `infer.py` can be listed by executing `python infer.py --help`. Some important parameters are:

- `data_path` Path for the data to predict
- `prediction_output_path` Prediction output path
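
For example (the paths are placeholders; the remaining flags, such as the path of the trained model, can be listed with `python infer.py --help`):

```
python infer.py \
    --data_path ./data/classification/test.txt \
    --prediction_output_path ./predictions.txt
```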

## References

1. Huang P S, He X, Gao J, et al. Learning deep structured semantic models for web search using clickthrough data[C]//Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 2013: 2333-2338.
2. [Microsoft Learning to Rank Datasets](https://www.microsoft.com/en-us/research/project/mslr/)
3. [Gao J, He X, Deng L. Deep Learning for Web Search and Natural Language Processing[J]. Microsoft Research Technical Report, 2015.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/wsdm2015.v3.pdf)