# Deep Structured Semantic Models (DSSM)
Deep Structured Semantic Models (DSSM) is a simple but powerful DNN-based model for matching web search queries and URL-based documents. This example demonstrates how to use PaddlePaddle to implement a generic DSSM model for measuring the semantic similarity between two strings.

## Background
DSSM \[[1](#References)] is a classic semantic model proposed by Microsoft Research. It is used to measure the semantic distance between two texts. Typical applications of DSSM include:

1. CTR prediction, which measures the degree of association between a user's search query and a candidate web page.
2. Text relevance, which measures the degree of semantic correlation between two strings.
3. Automatic recommendation, which measures the degree of association between a user and a recommended item.


## Model Architecture

In the original paper \[[1](#References)], the DSSM model uses the implicit semantic relation between the user's search query and the document as its metric. The model structure is as follows:

<p align="center">
<img src="./images/dssm.png"/><br/><br/>
Figure 1. DSSM in the original paper
</p>


With subsequent optimizations that simplify the structure \[[3](#References)], the model becomes:

<p align="center">
<img src="./images/dssm2.png" width="600"/><br/><br/>
Figure 2. DSSM generic structure
</p>

The blank box in the figure can be replaced by any model, such as a fully connected network (FC), a convolutional neural network (CNN), or a recurrent neural network (RNN). The structure is designed to measure the semantic distance between two elements (such as strings).

In practice, the DSSM model serves as a basic building block, combined with different loss functions to implement specific tasks, such as:

- In a ranking system, using the pairwise rank loss function.
- In CTR estimation, using a classification model with cross-entropy loss for the binary classification of clicks.
- In a regression model, using cosine similarity to compute the similarity score.

## Model Implementation
At a high level, the DSSM model is composed of three components: the left DNN, the right DNN, and a loss function on top of them. In complex tasks, the structures of the left and right DNNs can be different. In this example, we keep the two DNN structures the same, and any of FC, CNN, and RNN can be chosen as the DNN architecture.

In PaddlePaddle, loss functions are supported for classification, regression, and ranking. The distance between the outputs of the left and right DNNs is computed by cosine similarity; in the classification task, the predicted distribution is computed by softmax.

Here we demonstrate:

- How CNN and FC extract text information; see [text classification](https://github.com/PaddlePaddle/models/blob/develop/text_classification/README.md#模型详解)
- How RNN/GRU works; see [Machine Translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#gated-recurrent-unit-gru)
- Pairwise Rank learning; see [learn to rank](https://github.com/PaddlePaddle/models/blob/develop/ltr/README.md)

Figure 3 shows the general architecture for both regression and classification models.

<p align="center">
<img src="./images/dssm3.jpg"/><br/><br/>
Figure 3. DSSM for REGRESSION or CLASSIFICATION
</p>

The structure of the Pairwise Rank model is more complex, as shown in Figure 4.

<p align="center">
<img src="./images/dssm2.jpg"/><br/><br/>
Figure 4. DSSM for Pairwise Rank
</p>

Below, we describe how to train a DSSM model in PaddlePaddle. All the code is included in `./network_conf.py`.


### Create a word vector table for the text
```python
def create_embedding(self, input, prefix=''):
    """
    Create word embedding. The `prefix` is added in front of the name of
    embedding"s learnable parameter.
    """
    logger.info("Create embedding table [%s] whose dimention is %d" %
                (prefix, self.dnn_dims[0]))
    emb = paddle.layer.embedding(
        input=input,
        size=self.dnn_dims[0],
        param_attr=ParamAttr(name='%s_emb.w' % prefix))
    return emb
```

Since the input to the embedding table is a list of word IDs corresponding to a sentence, the embedding layer outputs a sequence of word vectors.
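
As a minimal usage sketch, the embedding layer might be wired up as follows. The layer name `source_input` and the variable `dict_size` are illustrative assumptions; the actual wiring lives in `train.py`.

```python
# Hypothetical wiring: a sentence arrives as a sequence of word IDs,
# and create_embedding maps it to a sequence of word vectors.
source = paddle.layer.data(
    name='source_input',
    type=paddle.data_type.integer_value_sequence(dict_size))
emb = self.create_embedding(source, prefix='left')
```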

### CNN implementation
```python
def create_cnn(self, emb, prefix=''):
    """
    A multi-layer CNN.
    :param emb: The word embedding.
    :type emb: paddle.layer
    :param prefix: The prefix will be added to the names of the layers.
    :type prefix: str
    """

    def create_conv(context_len, hidden_size, prefix):
        key = "%s_%d_%d" % (prefix, context_len, hidden_size)
        conv = paddle.networks.sequence_conv_pool(
            input=emb,
            context_len=context_len,
            hidden_size=hidden_size,
            # set parameter attr for parameter sharing
            context_proj_param_attr=ParamAttr(name=key + "_context_proj.w"),
            fc_param_attr=ParamAttr(name=key + "_fc.w"),
            fc_bias_attr=ParamAttr(name=key + "_fc.b"),
            pool_bias_attr=ParamAttr(name=key + "_pool.b"))
        return conv

    conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
    conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
    return paddle.layer.concat(input=[conv_3, conv_4])
```

The CNN accepts the word sequence from the embedding layer, processes the data by convolution and pooling, and finally outputs a semantic vector.

### RNN implementation

RNNs are suitable for learning from variable-length sequential information.

```python
def create_rnn(self, emb, prefix=''):
    """
    A GRU sentence vector learner.
    """
    gru = paddle.networks.simple_gru(
        input=emb,
        size=self.dnn_dims[1],
        mixed_param_attr=ParamAttr(name='%s_gru_mixed.w' % prefix),
        mixed_bias_param_attr=ParamAttr(name="%s_gru_mixed.b" % prefix),
        gru_param_attr=ParamAttr(name='%s_gru.w' % prefix),
        gru_bias_attr=ParamAttr(name="%s_gru.b" % prefix))
    sent_vec = paddle.layer.last_seq(gru)
    return sent_vec
```

### FC implementation

```python
def create_fc(self, emb, prefix=''):
    """
    A multi-layer fully connected neural network.
    :param emb: The output of the embedding layer
    :type emb: paddle.layer
    :param prefix: A prefix will be added to the layers' names.
    :type prefix: str
    """

    _input_layer = paddle.layer.pooling(
        input=emb, pooling_type=paddle.pooling.Max())
    fc = paddle.layer.fc(
        input=_input_layer,
        size=self.dnn_dims[1],
        param_attr=ParamAttr(name='%s_fc.w' % prefix),
        bias_attr=ParamAttr(name="%s_fc.b" % prefix))
    return fc
```

In the FC implementation, we use `paddle.layer.pooling` to apply max pooling over the word vector sequence, transforming the sequence into a fixed-dimensional vector.

### Multi-layer DNN implementation

```python
def create_dnn(self, sent_vec, prefix):
    if len(self.dnn_dims) > 1:
        _input_layer = sent_vec
        for id, dim in enumerate(self.dnn_dims[1:]):
            name = "%s_fc_%d_%d" % (prefix, id, dim)
            fc = paddle.layer.fc(
                input=_input_layer,
                size=dim,
                act=paddle.activation.Tanh(),
                param_attr=ParamAttr(name='%s.w' % name),
                bias_attr=ParamAttr(name='%s.b' % name),
                )
            _input_layer = fc
    return _input_layer
```

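Putting these helpers together, one arm of the DSSM can be sketched as below. This is an illustrative sketch assuming the methods above; in the real code, the `model_arch` option selects FC, CNN, or RNN.

```python
# Sketch: one semantic arm -- embedding -> sentence vector -> multi-layer DNN.
# 'left' is an illustrative prefix; the right arm mirrors this structure
# with its own parameters.
emb = self.create_embedding(source, prefix='left')
sent_vec = self.create_cnn(emb, prefix='left')  # or create_fc / create_rnn
left_semantic = self.create_dnn(sent_vec, prefix='left')
```
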
### Classification / Regression
The structures of the classification and regression models are similar, and the same function can be used for both tasks.
Please check the function `_build_classification_or_regression_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the detailed implementation.
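
As a rough sketch of what that function builds (assuming `left_semantic`, `right_semantic`, `label`, and `class_num` are already defined; this is not the verbatim implementation):

```python
# Regression: score the pair by cosine similarity and fit the float target.
prediction = paddle.layer.cos_sim(left_semantic, right_semantic)
cost = paddle.layer.square_error_cost(input=prediction, label=label)

# Classification: concatenate the two semantic vectors, then apply softmax.
concated = paddle.layer.concat(input=[left_semantic, right_semantic])
prediction = paddle.layer.fc(
    input=concated, size=class_num, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=prediction, label=label)
```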

### Pairwise Rank

Please check the function `_build_rank_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the implementation.
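
Roughly, the rank model scores the source against each of the two candidate targets and applies a pairwise rank cost. The sketch below assumes the three semantic vectors and the `label` layer are already defined; it is not the verbatim implementation.

```python
# Sketch: cosine scores for the two (source, target) pairs feed a rank cost.
left_score = paddle.layer.cos_sim(source_semantic, left_target_semantic)
right_score = paddle.layer.cos_sim(source_semantic, right_target_semantic)
cost = paddle.layer.rank_cost(left=left_score, right=right_score, label=label)
```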

## Data Format
Below are simple examples of the data in `./data`.

### Regression data format
```
# 3 fields each line:
#   - source word list
#   - target word list
#   - target
<word list> \t <word list> \t <float>
```

An example in this format is as follows.

```
Six bags of apples    Apple 6s    0.1
The new driver    The driving school    0.9
```

### Classification data format
```
# 3 fields each line:
#   - source word list
#   - target word list
#   - target
<word list> \t <word list> \t <label>
```

An example in this format is as follows.

```
Six bags of apples    Apple 6s    0
The new driver    The driving school    1
```


### Ranking data format
```
# 4 fields each line:
#   - source word list
#   - target1 word list
#   - target2 word list
#   - label
<word list> \t <word list> \t <word list> \t <label>
```

An example in this format is as follows.

```
Six bags of apples    Apple 6s    The new driver    1
The new driver    The driving school    Apple 6s    1
```

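To illustrate how such a line might be parsed, below is a hypothetical reader sketch; the actual readers live in `reader.py`, and the word-to-ID dictionaries come from `source_dic_path` / `target_dic_path`.

```python
# Hypothetical parser for the tab-separated ranking format above.
# Unknown words fall back to ID 0 purely for illustration.
def ranking_record_reader(path, source_dic, target_dic):
    with open(path) as f:
        for line in f:
            source, left, right, label = line.rstrip('\n').split('\t')
            yield ([source_dic.get(w, 0) for w in source.split()],
                   [target_dic.get(w, 0) for w in left.split()],
                   [target_dic.get(w, 0) for w in right.split()],
                   int(label))
```
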
## Training

We use `python train.py -y 0 --model_arch 0 --class_num 2` with the data in `./data/classification` to train a DSSM model for classification. The parameters of the script `train.py` can be listed by executing `python train.py --help`. Some important parameters are:

- `train_data_path` Training data path
- `test_data_path`  Test data path, optional
- `source_dic_path`  Source dictionary path
- `target_dic_path` Target dictionary path
- `model_type` The type of the loss function: 0 for classification, 1 for ranking, 2 for regression
- `model_arch` Model architecture: 0 for FC, 1 for CNN, 2 for RNN
- `dnn_dims` The dimensions of each layer of the model; the default is `256,128,64,32`, i.e. a DNN with 4 layers.

## Prediction Using the Trained Model

The parameters of the script `infer.py` can be listed by executing `python infer.py --help`. Some important parameters are:

- `data_path` Path for the data to predict
- `prediction_output_path` Prediction output path

## References

1. Huang P S, He X, Gao J, et al. Learning deep structured semantic models for web search using clickthrough data[C]//Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 2013: 2333-2338.
2. [Microsoft Learning to Rank Datasets](https://www.microsoft.com/en-us/research/project/mslr/)
3. [Gao J, He X, Deng L. Deep Learning for Web Search and Natural Language Processing[J]. Microsoft Research Technical Report, 2015.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/wsdm2015.v3.pdf)