# Text matching on Quora question-answer pair dataset

## contents

* [Introduction](#introduction)
  * [a brief review of the Quora Question Pair (QQP) Task](#a-brief-review-of-the-quora-question-pair-qqp-task)
  * [Our Work](#our-work)
* [Environment Preparation](#environment-preparation)
  * [Install Fluid release 1.0](#install-fluid-release-10)
    * [cpu version](#cpu-version)
    * [gpu version](#gpu-version)
    * [Have I installed Fluid successfully?](#have-i-installed-fluid-successfully)
* [Prepare Data](#prepare-data)
* [Train and evaluate](#train-and-evaluate)
* [Models](#models)
* [Results](#results)


## Introduction

### a brief review of the Quora Question Pair (QQP) Task

The [Quora Question Pair](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) dataset contains over 400,000 question pairs from [Quora](https://www.quora.com/), where people ask and answer questions related to specific areas. Each sample in the dataset consists of two questions (both in English) and a label that indicates whether the questions are duplicates. The dataset is annotated by humans.

Below are two samples from the dataset. The last column indicates whether the two questions are duplicate (1) or not (0).

|id | qid1 | qid2| question1| question2| is_duplicate
|:---:|:---:|:---:|:---:|:---:|:---:|
|0 |1 |2 |What is the step by step guide to invest in share market in india? |What is the step by step guide to invest in share market? |0|
|1 |3 |4 |What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? |0|

A [Kaggle competition](https://www.kaggle.com/c/quora-question-pairs#description) was held based on this dataset in 2017. Participants were given a training dataset (with labels) and asked to make predictions on a test dataset (without labels). The predictions were evaluated by the log loss (negative log-likelihood) on the test data.
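
As a rough illustration of the metric, the binary log loss can be computed as in the sketch below. The function name `log_loss`, the clipping constant `eps`, and the sample numbers are assumptions made for this example; they are not taken from the competition's scoring code.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities.

    labels: iterable of 0/1 ground-truth values (1 = duplicate).
    probs:  iterable of predicted probabilities that the label is 1.
    eps:    clipping constant (an assumption) that keeps log() finite.
    """
    labels = list(labels)
    probs = list(probs)
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Made-up predictions for three question pairs.
print(f"log loss: {log_loss([0, 0, 1], [0.1, 0.4, 0.8]):.4f}")
```

Lower values are better: a confident wrong prediction is penalized far more heavily than an uncertain one.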

The Kaggle competition inspired much effective work. However, most of the resulting models are rule-based and difficult to transfer to new tasks. Researchers are seeking more general models that work well on this task and on other natural language processing (NLP) tasks.

[Wang _et al._](https://arxiv.org/abs/1702.03814) proposed a bilateral multi-perspective matching (BiMPM) model and evaluated it on the Quora Question Pair dataset. They split the original dataset into [3 parts](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing): _train.tsv_ (384,348 samples), _dev.tsv_ (10,000 samples) and _test.tsv_ (10,000 samples). The class distribution of _train.tsv_ is unbalanced (37% positive and 63% negative), while those of _dev.tsv_ and _test.tsv_ are balanced (50% positive and 50% negative). We used the same split in our experiments.
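
The class balance of a split can be checked with a few lines of Python. This is only a sketch: the helper names `read_tsv` and `label_distribution` are hypothetical, and the position of the label column (`label_column=0`) is an assumption that should be verified against the downloaded files.

```python
import csv
from collections import Counter


def read_tsv(path):
    """Yield the rows of a tab-separated file as lists of strings."""
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: question text may contain unbalanced quote characters.
        yield from csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)


def label_distribution(rows, label_column=0):
    """Return {label: fraction} for the given rows.

    label_column is an assumption about the file layout, not a documented fact.
    """
    counts = Counter(row[label_column] for row in rows if row)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Tiny in-memory example; for a real split, pass read_tsv("<path to train.tsv>").
sample_rows = [["1", "q1", "q2"], ["0", "q3", "q4"], ["0", "q5", "q6"], ["0", "q7", "q8"]]
print(label_distribution(sample_rows))
```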

### Our Work

Based on the Quora Question Pair dataset, we implemented some classic models in the area of natural language understanding (NLU). Prediction accuracy is evaluated on the _test.tsv_ from [Wang _et al._](https://arxiv.org/abs/1702.03814).

## Environment Preparation

### Install Fluid release 1.0

Please follow the [official document in English](http://www.paddlepaddle.org/documentation/docs/en/1.0/build_and_install/pip_install_en.html) or [official document in Chinese](http://www.paddlepaddle.org/documentation/docs/zh/1.0/beginners_guide/install/Start.html) to install the Fluid deep learning framework. 

#### Have I installed Fluid successfully?

Run the following script from your command line:

```shell
python -c "import paddle"
```

If Fluid was installed successfully, the command prints no error message. If you run into problems, feel free to open an issue in the [PaddlePaddle repository](https://github.com/PaddlePaddle/Paddle/issues).
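
For scripted setups, a similar check can be done from Python without triggering the import itself. The helper `is_installed` below is hypothetical and only tests whether the package can be located on the current interpreter's path; it does not prove that the import would succeed (for example, if GPU libraries are missing).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


print("paddle found:", is_installed("paddle"))
```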

## Prepare Data

Please download the Quora dataset from [Google drive](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing) and unzip it to `$HOME/.cache/paddle/dataset`.

Then run _data/prepare_quora_data.sh_ to download the pre-trained _GloVe_ word embedding file, _glove.840B.300d.zip_:

```shell
sh data/prepare_quora_data.sh
```

At this point, the dataset directory (`$HOME/.cache/paddle/dataset`) should have the following structure:

```shell

$HOME/.cache/paddle/dataset
    |- Quora_question_pair_partition
        |- train.tsv
        |- test.tsv
        |- dev.tsv
        |- readme.txt
        |- wordvec.txt
    |- glove.840B.300d.txt
```
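
The layout above can be verified before training with a short script. This is a minimal sketch: the helper `missing_files` and the `EXPECTED_FILES` list are written for this example, based only on the tree shown above, and are not part of the repository.

```python
import os

# Relative paths taken from the directory tree shown above.
EXPECTED_FILES = [
    os.path.join("Quora_question_pair_partition", "train.tsv"),
    os.path.join("Quora_question_pair_partition", "test.tsv"),
    os.path.join("Quora_question_pair_partition", "dev.tsv"),
    os.path.join("Quora_question_pair_partition", "readme.txt"),
    os.path.join("Quora_question_pair_partition", "wordvec.txt"),
    "glove.840B.300d.txt",
]


def missing_files(root):
    """Return the expected files that do not exist under root."""
    return [rel for rel in EXPECTED_FILES
            if not os.path.isfile(os.path.join(root, rel))]


dataset_root = os.path.join(os.path.expanduser("~"), ".cache", "paddle", "dataset")
missing = missing_files(dataset_root)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Dataset directory looks complete.")
```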

## Train and evaluate

We provide multiple models and configurations. Details can be found in the `models` and `configs` directories. For a quick start, please run the _cdssmNet_ model with the corresponding configuration:

```shell
python train_and_evaluate.py \
    --model_name=cdssmNet  \
    --config=cdssm_base
```

Logs are printed to the console. If everything works, the log output will have the same format as the content of _cdssm_base.log_.

All configurations used in our experiments are as follows:

|Model|Config|Command|
|:----:|:----:|:----:|
|cdssmNet|cdssm_base|`python train_and_evaluate.py --model_name=cdssmNet --config=cdssm_base`|
|DecAttNet|decatt_glove|`python train_and_evaluate.py --model_name=DecAttNet --config=decatt_glove`|
|InferSentNet|infer_sent_v1|`python train_and_evaluate.py --model_name=InferSentNet --config=infer_sent_v1`|
|InferSentNet|infer_sent_v2|`python train_and_evaluate.py --model_name=InferSentNet --config=infer_sent_v2`|
|SSENet|sse_base|`python train_and_evaluate.py --model_name=SSENet --config=sse_base`|
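
To run several configurations in sequence, the commands can be generated from the model/config pairs listed above. The sketch below only builds and prints the commands; the `EXPERIMENTS` list and the helper `build_command` are written for this example and are not part of the repository.

```python
# (model_name, config) pairs copied from the table above.
EXPERIMENTS = [
    ("cdssmNet", "cdssm_base"),
    ("DecAttNet", "decatt_glove"),
    ("InferSentNet", "infer_sent_v1"),
    ("InferSentNet", "infer_sent_v2"),
    ("SSENet", "sse_base"),
]


def build_command(model_name, config):
    """Return the training command as an argument list."""
    return [
        "python",
        "train_and_evaluate.py",
        f"--model_name={model_name}",
        f"--config={config}",
    ]


for model_name, config in EXPERIMENTS:
    # To launch training instead of printing, pass the list to
    # subprocess.run(command, check=True) from the repository directory.
    print(" ".join(build_command(model_name, config)))
```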

## Models

We have implemented four models so far: the convolutional deep-structured semantic model (CDSSM, CNN-based), the InferSent model (RNN-based), the shortcut-stacked encoder (SSE, RNN-based), and the decomposed attention model (DecAtt, attention-based).

|Model|Features|Context Encoder|Match Layer|Classification Layer|
|:----:|:----:|:----:|:----:|:----:|
|CDSSM|word|1 layer conv1d|concatenation|MLP|
|DecAtt|word|Attention|concatenation|MLP|
|InferSent|word|1 layer Bi-LSTM|concatenation/element-wise product/<br>absolute element-wise difference|MLP|
|SSE|word|3 layer Bi-LSTM|concatenation/element-wise product/<br>absolute element-wise difference|MLP|
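
To make the match layer used by InferSent and SSE concrete, the sketch below combines two encoded sentence vectors into one feature vector. It uses plain Python lists rather than Fluid tensors; the function name `match_features` and the ordering of the four parts are assumptions for illustration, not the implementation in the `models` directory.

```python
def match_features(u, v):
    """Combine two sentence vectors into a single match vector.

    Output layout (an assumed ordering): [u, v, u * v, |u - v|],
    i.e. concatenation, element-wise product, and absolute
    element-wise difference.
    """
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same length")
    product = [a * b for a, b in zip(u, v)]
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + product + abs_diff


# Two toy 2-dimensional sentence encodings.
features = match_features([1.0, 2.0], [3.0, -1.0])
print(len(features), features)
```

The resulting vector is four times the size of a single sentence encoding and is what the MLP classification layer would consume.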

### CDSSM

```
@inproceedings{shen2014learning,
  title={Learning semantic representations using convolutional neural networks for web search},
  author={Shen, Yelong and He, Xiaodong and Gao, Jianfeng and Deng, Li and Mesnil, Gr{\'e}goire},
  booktitle={Proceedings of the 23rd International Conference on World Wide Web},
  pages={373--374},
  year={2014},
  organization={ACM}
}
```

### InferSent

```
@article{conneau2017supervised,
  title={Supervised learning of universal sentence representations from natural language inference data},
  author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Loic and Bordes, Antoine},
  journal={arXiv preprint arXiv:1705.02364},
  year={2017}
}
```

### SSE

```
@article{nie2017shortcut,
  title={Shortcut-stacked sentence encoders for multi-domain inference},
  author={Nie, Yixin and Bansal, Mohit},
  journal={arXiv preprint arXiv:1708.02312},
  year={2017}
}
```

### DecAtt

```
@article{tomar2017neural,
  title={Neural paraphrase identification of questions with noisy pretraining},
  author={Tomar, Gaurav Singh and Duque, Thyago and T{\"a}ckstr{\"o}m, Oscar and Uszkoreit, Jakob and Das, Dipanjan},
  journal={arXiv preprint arXiv:1704.04565},
  year={2017}
}
```

## Results

|Model|Config|Dev accuracy|Test accuracy|
|:----:|:----:|:----:|:----:|
|cdssmNet|cdssm_base|83.56%|82.83%|
|DecAttNet|decatt_glove|86.31%|86.22%|
|InferSentNet|infer_sent_v1|87.15%|86.62%|
|InferSentNet|infer_sent_v2|88.55%|88.43%|
|SSENet|sse_base|88.35%|88.25%|

In our experiments, the LSTM-based models outperformed the convolution-based model. The DecAtt model has fewer parameters than the LSTM-based models but is sensitive to hyperparameters.

<p align="center"> 

 <img src="imgs/models_test_acc.png" width = "500" alt="test_acc"/> 

</p>