# Text matching on Quora question-answer pair dataset

## Contents

* [Introduction](#introduction)
  * [A brief review of the Quora Question Pair (QQP) Task](#a-brief-review-of-the-quora-question-pair-qqp-task)
  * [Our Work](#our-work)
* [Environment Preparation](#environment-preparation)
  * [Install Fluid release 1.0](#install-fluid-release-10)
    * [CPU version](#cpu-version)
    * [GPU version](#gpu-version)
    * [Have I installed Fluid successfully?](#have-i-installed-fluid-successfully)
* [Prepare Data](#prepare-data)
* [Train and evaluate](#train-and-evaluate)
* [Models](#models)
* [Results](#results)


## Introduction

### A brief review of the Quora Question Pair (QQP) Task

The [Quora Question Pair](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) dataset contains more than 400,000 question pairs from [Quora](https://www.quora.com/), a site where people ask and answer questions on specific topics. Each sample consists of two questions (both in English) and a label that indicates whether the questions are duplicates. The labels were annotated by humans.

Below are two samples from the dataset. The last column indicates whether the two questions are duplicates (1) or not (0).

|id | qid1 | qid2| question1| question2| is_duplicate
|:---:|:---:|:---:|:---:|:---:|:---:|
|0 |1 |2 |What is the step by step guide to invest in share market in india? |What is the step by step guide to invest in share market? |0|
|1 |3 |4 |What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? |0|
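
As a rough illustration of the column layout above, the sketch below parses rows with Python's standard `csv` module. The `QuestionPair` type, the `read_pairs` helper, and the in-memory sample are hypothetical names introduced for this example, and it assumes a comma-separated file with a header row, which may not match the file you actually download.

```python
import csv
import io
from typing import List, NamedTuple


class QuestionPair(NamedTuple):
    """Hypothetical container for one row of the dataset."""
    question1: str
    question2: str
    is_duplicate: int


def read_pairs(handle) -> List[QuestionPair]:
    """Parse rows that use the column names shown in the table above."""
    reader = csv.DictReader(handle)
    return [
        QuestionPair(row["question1"], row["question2"], int(row["is_duplicate"]))
        for row in reader
    ]


# Tiny in-memory sample (illustrative only) standing in for the real file.
SAMPLE = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    '0,1,2,"What is the step by step guide to invest in share market in india?",'
    '"What is the step by step guide to invest in share market?",0\n'
)

pairs = read_pairs(SAMPLE)
print(pairs[0].question2, pairs[0].is_duplicate)
```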

A [Kaggle competition](https://www.kaggle.com/c/quora-question-pairs#description) based on this dataset was held in 2017. Participants were given a labeled training set and asked to make predictions on an unlabeled test set. The predictions were evaluated by log loss (negative log-likelihood) on the test data.
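
To make the metric concrete, here is a minimal sketch of mean binary log loss in plain Python. The function name `log_loss` and the clipping constant `eps` are assumptions made for this example; the competition's exact scoring code is not reproduced here.

```python
import math
from typing import Sequence


def log_loss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-15) -> float:
    """Mean binary log loss; `probs` are predicted probabilities of label 1."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip the probability to avoid log(0); eps is an assumed constant.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Toy labels and predicted probabilities, for illustration only.
print(log_loss([1, 0, 0], [0.9, 0.2, 0.1]))
```

A confident wrong prediction is penalized heavily, so a model can have good accuracy and still score poorly on this metric.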

The Kaggle competition inspired a lot of effective work. However, most of the resulting models are rule-based and difficult to transfer to new tasks. Researchers are looking for more general models that work well on this task and on other natural language processing (NLP) tasks.

[Wang _et al._](https://arxiv.org/abs/1702.03814) proposed a bilateral multi-perspective matching (BiMPM) model and evaluated it on the Quora Question Pair dataset. They split the original dataset into [3 parts](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing): _train.tsv_ (384,348 samples), _dev.tsv_ (10,000 samples) and _test.tsv_ (10,000 samples). The class distribution of _train.tsv_ is unbalanced (37% positive and 63% negative), while those of _dev.tsv_ and _test.tsv_ are balanced (50% positive and 50% negative). We used the same split in our experiments.

### Our Work

Based on the Quora Question Pair dataset, we implemented some classic models in the area of natural language understanding (NLU). Prediction accuracy is evaluated on the _test.tsv_ split from [Wang _et al._](https://arxiv.org/abs/1702.03814).
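
For reference, accuracy here means the fraction of pairs whose predicted label matches the gold label. The sketch below is a generic illustration with toy data; the `accuracy` helper is a hypothetical name and is not taken from this repository's evaluation code.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of predictions that equal the corresponding gold label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


# Toy gold labels and predictions, for illustration only.
print(accuracy([1, 0, 1, 0], [1, 0, 0, 0]))
```

Because the test split is balanced, always predicting one class would score about 50% on this metric.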

## Environment Preparation

### Install Fluid release 1.0

Please follow the [official document](http://www.paddlepaddle.org/documentation/docs/en/1.0/build_and_install/pip_install_en.html) to install the Fluid deep learning framework.

#### CPU version

```shell
pip install paddlepaddle==1.0.1
```

#### GPU version

Assuming you have installed CUDA 9.0 and cuDNN 7, here is an example:

```shell
pip install paddlepaddle-gpu==1.0.1.post97
```

### Have I installed Fluid successfully?

Run the following command from your command line:

```shell
python -c "import paddle"
```

If Fluid is installed successfully, you should see no error message. Feel free to open an issue in the [PaddlePaddle repository](https://github.com/PaddlePaddle/Paddle/issues) for support.
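
If you prefer a check that also reports which version was found, something like the following sketch can be run as a script. The `check_paddle` helper is a hypothetical name, and reading `__version__` is guarded with a fallback in case that attribute is not present in your build.

```python
import importlib


def check_paddle():
    """Return the installed paddle version string, or None if the import fails."""
    try:
        paddle = importlib.import_module("paddle")
    except ImportError as err:
        print("paddle could not be imported:", err)
        return None
    # Fall back to "unknown" in case the attribute is missing in this build.
    return getattr(paddle, "__version__", "unknown")


version = check_paddle()
print("paddle version:", version)
```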

## Prepare Data

Please download the Quora dataset from [Google Drive](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing) and unzip it to `$HOME/.cache/paddle/dataset`.

Then run _data/prepare_quora_data.sh_ to download the pre-trained GloVe word embedding file, _glove.840B.300d.zip_:

```shell
sh data/prepare_quora_data.sh
```
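
For readers unfamiliar with the embedding file, the sketch below shows one way to read GloVe-style text lines (a token followed by space-separated float values) for a small vocabulary. It is an illustration only: `load_vectors` is a hypothetical helper, the toy lines are 3-dimensional rather than 300-dimensional, and this is not necessarily how the code in this repository loads embeddings.

```python
from typing import Dict, Iterable, List, Set


def load_vectors(lines: Iterable[str], vocab: Set[str], dim: int) -> Dict[str, List[float]]:
    """Collect vectors for the words in `vocab` from GloVe-style text lines."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        # Split from the right so a token that itself contains spaces stays intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        word = parts[0]
        if word in vocab:
            vectors[word] = [float(x) for x in parts[1:]]
    return vectors


# Toy 3-dimensional lines standing in for the real 300-dimensional file.
toy_lines = ["what 0.1 0.2 0.3\n", "is -0.5 0.0 0.25\n", "unused 1.0 1.0 1.0\n"]
embeddings = load_vectors(toy_lines, {"what", "is"}, dim=3)
print(sorted(embeddings))
```

Filtering by vocabulary while streaming avoids holding the whole embedding file in memory. With the real file you would pass an open file handle and `dim=300`.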

At this point, the dataset directory (`$HOME/.cache/paddle/dataset`) should have the following structure:

```shell

$HOME/.cache/paddle/dataset
    |- Quora_question_pair_partition
        |- train.tsv
        |- test.tsv
        |- dev.tsv
        |- readme.txt
        |- wordvec.txt
    |- glove.840B.300d.txt
```
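
The layout above can be checked programmatically. The following sketch lists any expected file that is missing; `EXPECTED_FILES` and `missing_files` are hypothetical names for this example, and the paths are copied from the tree above.

```python
import os
from typing import List

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    os.path.join("Quora_question_pair_partition", "train.tsv"),
    os.path.join("Quora_question_pair_partition", "test.tsv"),
    os.path.join("Quora_question_pair_partition", "dev.tsv"),
    os.path.join("Quora_question_pair_partition", "readme.txt"),
    os.path.join("Quora_question_pair_partition", "wordvec.txt"),
    "glove.840B.300d.txt",
]


def missing_files(dataset_dir: str) -> List[str]:
    """Return the expected relative paths that do not exist under `dataset_dir`."""
    return [
        rel for rel in EXPECTED_FILES
        if not os.path.isfile(os.path.join(dataset_dir, rel))
    ]


dataset_dir = os.path.join(os.path.expanduser("~"), ".cache", "paddle", "dataset")
for rel in missing_files(dataset_dir):
    print("missing:", rel)
```

If nothing is printed, all listed files were found.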

## Train and evaluate

We provide multiple models and configurations. Details can be found in the `models` and `configs` directories. For a quick start, run the _cdssmNet_ model with its corresponding configuration:

```shell
python train_and_evaluate.py  \
    --model_name=cdssmNet  \
    --config=cdssm_base
```

Logs are written to the console. If everything works, the log output will have the same format as the content of _cdssm_base.log_.

All configurations used in our experiments are as follows:

|Model|Config|Command|
|:----:|:----:|:----:|
|cdssmNet|cdssm_base|`python train_and_evaluate.py --model_name=cdssmNet --config=cdssm_base`|
|DecAttNet|decatt_glove|`python train_and_evaluate.py --model_name=DecAttNet --config=decatt_glove`|
|InferSentNet|infer_sent_v1|`python train_and_evaluate.py --model_name=InferSentNet --config=infer_sent_v1`|
|InferSentNet|infer_sent_v2|`python train_and_evaluate.py --model_name=InferSentNet --config=infer_sent_v2`|
|SSENet|sse_base|`python train_and_evaluate.py --model_name=SSENet --config=sse_base`|
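
To run every configuration in sequence, a small driver script along these lines may help. It is a sketch, not part of the repository: `EXPERIMENTS`, `build_command`, and `run_all` are hypothetical names, the pairs are taken from the Model and Config columns of the table, and it assumes the script is started from the directory that contains `train_and_evaluate.py`.

```python
import subprocess
import sys
from typing import List

# (model_name, config) pairs taken from the Model and Config columns above.
EXPERIMENTS = [
    ("cdssmNet", "cdssm_base"),
    ("DecAttNet", "decatt_glove"),
    ("InferSentNet", "infer_sent_v1"),
    ("InferSentNet", "infer_sent_v2"),
    ("SSENet", "sse_base"),
]


def build_command(model_name: str, config: str) -> List[str]:
    """Build the argument list for one training run."""
    return [
        sys.executable,
        "train_and_evaluate.py",
        "--model_name=" + model_name,
        "--config=" + config,
    ]


def run_all(dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for model_name, config in EXPERIMENTS:
        cmd = build_command(model_name, config)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop if a run exits with an error.
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

Call `run_all(dry_run=False)` to actually launch the runs one after another.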

## Models

We have implemented 4 models so far: the convolutional deep-structured semantic model (CDSSM, CNN-based), the InferSent model (RNN-based), the shortcut-stacked encoder (SSE, RNN-based), and the decomposed attention model (DecAtt, attention-based).

|Model|Features|Context Encoder|Match Layer|Classification Layer|
|:----:|:----:|:----:|:----:|:----:|
|CDSSM|word|1-layer conv1d|concatenation|MLP|
|DecAtt|word|Attention|concatenation|MLP|
|InferSent|word|1-layer Bi-LSTM|concatenation/element-wise product/<br>absolute element-wise difference|MLP|
|SSE|word|3-layer Bi-LSTM|concatenation/element-wise product/<br>absolute element-wise difference|MLP|
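
The match layer listed for InferSent and SSE combines the two encoded sentence vectors `u` and `v`. The sketch below illustrates the idea with plain Python lists rather than Fluid tensors. The ordering `[u; v; |u - v|; u * v]` follows the convention of the InferSent paper and is an assumption about this repository's implementation, and `match_features` is a hypothetical helper name.

```python
from typing import List, Sequence


def match_features(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Combine two sentence vectors as [u; v; |u - v|; u * v]."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]  # absolute element-wise difference
    product = [a * b for a, b in zip(u, v)]        # element-wise product
    return list(u) + list(v) + abs_diff + product  # concatenation


# Toy 2-dimensional sentence vectors, for illustration only.
print(match_features([1.0, 2.0], [3.0, -1.0]))
```

The resulting vector is four times the encoder dimension and is what the MLP classification layer would receive.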

### CDSSM

```
@inproceedings{shen2014learning,
  title={Learning semantic representations using convolutional neural networks for web search},
  author={Shen, Yelong and He, Xiaodong and Gao, Jianfeng and Deng, Li and Mesnil, Gr{\'e}goire},
  booktitle={Proceedings of the 23rd International Conference on World Wide Web},
  pages={373--374},
  year={2014},
  organization={ACM}
}
```

### InferSent

```
@article{conneau2017supervised,
  title={Supervised learning of universal sentence representations from natural language inference data},
  author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Loic and Bordes, Antoine},
  journal={arXiv preprint arXiv:1705.02364},
  year={2017}
}
```

### SSE

```
@article{nie2017shortcut,
  title={Shortcut-stacked sentence encoders for multi-domain inference},
  author={Nie, Yixin and Bansal, Mohit},
  journal={arXiv preprint arXiv:1708.02312},
  year={2017}
}
```

### DecAtt

```
@article{tomar2017neural,
  title={Neural paraphrase identification of questions with noisy pretraining},
  author={Tomar, Gaurav Singh and Duque, Thyago and T{\"a}ckstr{\"o}m, Oscar and Uszkoreit, Jakob and Das, Dipanjan},
  journal={arXiv preprint arXiv:1704.04565},
  year={2017}
}
```

## Results

In our experiments, we found that LSTM-based models outperformed convolution-based models. The DecAtt model has fewer parameters than the LSTM-based models, but is sensitive to hyperparameters.

|Model|Config|Dev accuracy|Test accuracy|
|:----:|:----:|:----:|:----:|
|cdssmNet|cdssm_base|83.56%|82.83%|
|DecAttNet|decatt_glove|86.31%|86.22%|
|InferSentNet|infer_sent_v1|87.15%|86.62%|
|InferSentNet|infer_sent_v2|88.55%|88.43%|
|SSENet|sse_base|88.35%|88.25%|
 
 
<p align="center"> 

 <img src="imgs/models_test_acc.png" width = "500" alt="test_acc"/> 

</p>