# Text matching on the Quora question pair dataset

## Introduction

### A brief review of the Quora Question Pair (QQP) task

The [Quora Question Pair dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) contains over 400,000 question pairs from the [Quora forum](https://www.quora.com/), where people post questions for others to answer. Each sample consists of two English questions and a label that indicates whether the two questions are duplicates. The labels were provided by human annotators.

Below are two samples from the dataset. The last column indicates whether the two questions are duplicates (1) or not (0).

|id |qid1 |qid2 |question1 |question2 |is_duplicate|
|:---:|:---:|:---:|:---:|:---:|:---:|
|0 |1 |2 |What is the step by step guide to invest in share market in india? |What is the step by step guide to invest in share market? |0|
|1 |3 |4 |What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? |0|

A [Kaggle competition](https://www.kaggle.com/c/quora-question-pairs#description) based on this dataset was held in 2017. Participants were given the training data (with labels) and the test data (without labels) and were asked to make predictions on the test data. The predictions were evaluated by log loss (negative log-likelihood) on the test data.
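
For reference, the log loss can be sketched in a few lines of Python. This is an illustrative implementation of the standard binary log loss, not the competition's scoring code; the function name `log_loss` and the clipping constant `eps` are assumptions made for this example.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary log loss; labels are 0/1, probs are P(is_duplicate = 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

A model that always predicts 0.5 scores ln 2 ≈ 0.693 under this metric, which gives a simple baseline to compare against.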

The Kaggle competition inspired a lot of effective work. However, most of the models are rule-based and are therefore hard to transfer to new tasks. Researchers continue to look for more general models that work well on this task and on other NLP (Natural Language Processing) tasks.

[Wang et al.](https://arxiv.org/abs/1702.03814) proposed the BiMPM (Bilateral Multi-Perspective Matching) model based on the Quora Question Pair dataset. They split the original dataset into [three parts](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing): train.tsv (384,348 samples), dev.tsv (10,000 samples) and test.tsv (10,000 samples). The class distribution in train.tsv is unbalanced (37% positive, 63% negative), whereas dev.tsv and test.tsv are balanced (50% positive, 50% negative). We follow this split in our experiments.
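
The class balance of a split can be checked with a short script such as the sketch below. It assumes that each line of the `.tsv` files is tab-separated with the label (`0` or `1`) in the first column; please verify this against the readme.txt shipped with the data before relying on it. The helper name `label_distribution` is made up for this example.

```python
from collections import Counter

def label_distribution(lines, label_column=0):
    """Return {label: fraction} for an iterable of tab-separated lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        counts[line.split("\t")[label_column]] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Example with in-memory lines; for a real file, pass an open file object:
#   with open("train.tsv", encoding="utf-8") as f:
#       print(label_distribution(f))
sample = ["1\tq a\tq b\t0", "0\tq c\tq d\t1", "0\tq e\tq f\t2", "0\tq g\tq h\t3"]
print(label_distribution(sample))
```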

### Our Work

Based on the Quora Question Pair dataset, we will implement some classic models in the area of NLU (Natural Language Understanding). The predictions will be evaluated by accuracy on test.tsv, following [Wang et al.](https://arxiv.org/abs/1702.03814).
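
Accuracy here is the fraction of question pairs whose predicted label matches the gold label. A minimal sketch, assuming the model outputs a probability of "duplicate" for each pair and that a threshold of 0.5 is used (both are assumptions for illustration, as is the name `accuracy`):

```python
def accuracy(labels, probs, threshold=0.5):
    """Fraction of pairs where the thresholded prediction equals the 0/1 label."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    # A pair is predicted as duplicate (1) when its probability reaches the threshold.
    correct = sum(int(p >= threshold) == y for y, p in zip(labels, probs))
    return correct / len(labels)
```

Because dev.tsv and test.tsv are balanced, a classifier that always predicts the same class scores 50% on them.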

## Environment Preparation: Install Fluid release 1.0

You can follow Fluid's [official documentation](http://www.paddlepaddle.org/documentation/docs/en/1.0/build_and_install/pip_install_en.html) to install it.

[Attention] Python and pip must be installed before you install Fluid.

### CPU version

```shell
pip install paddlepaddle==1.0.1
```

### GPU version

Assuming you have installed CUDA (9.0) and the cuDNN (7) libraries, here is an example:

```shell
pip install paddlepaddle-gpu==1.0.1.post97
```

### Have I installed Fluid successfully?

You can run the following command in your terminal:

```shell
python -c "import paddle"
```

Fluid is installed successfully if no error message is printed. If you get an error, feel free to open an issue in the [PaddlePaddle repository](https://github.com/PaddlePaddle/Paddle/issues).
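
If you prefer a check that reports a missing package instead of raising a traceback, something like the following sketch can be used. It relies only on the Python standard library; the helper name `is_installed` is an assumption for this example.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    status = "found" if is_installed("paddle") else "NOT found"
    print("paddle: " + status)
```

This only confirms that the package is discoverable. Actually importing `paddle` remains the real test, since the import also loads the native libraries.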

## Prepare Data

First, download the Quora dataset from [Google Drive](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing) to $HOME/.cache/paddle/dataset and unzip it.

Then run data/prepare_quora_data.sh to download the pretrained GloVe word embedding file glove.840B.300d.zip:

```shell
sh data/prepare_quora_data.sh
```
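
Once the archive is unpacked, the embedding file is plain text: each line holds one token followed by its 300 vector components, separated by single spaces. The sketch below shows one way to read such a file for a given vocabulary. It is not the loader used in this repository, and the names `load_glove` and `vocab` are made up for the example. Splitting from the right keeps tokens that themselves contain spaces intact.

```python
def load_glove(lines, dim=300, vocab=None):
    """Parse GloVe text lines into {token: [float, ...]}.

    lines: iterable of strings (e.g. an open file object).
    vocab: optional set of tokens to keep; None keeps everything.
    """
    embeddings = {}
    for line in lines:
        # Split off exactly `dim` numbers from the right; the rest is the token.
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        token = parts[0]
        if vocab is not None and token not in vocab:
            continue
        embeddings[token] = [float(x) for x in parts[1:]]
    return embeddings

# Tiny in-memory example with 3-dimensional vectors.
demo = ["what 0.1 0.2 0.3", ". . . 0.4 0.5 0.6", "is 0.7 0.8 0.9"]
print(load_glove(demo, dim=3, vocab={"what", ". . ."}))
```

For the real file, pass an open file object (for example `open(path, encoding="utf-8")`) and restrict `vocab` to the words that occur in the dataset, since the full embedding file is large.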

Finally, the dataset directory ($HOME/.cache/paddle/dataset) should look like this:

```shell

$HOME/.cache/paddle/dataset
    |- Quora_question_pair_partition
        |- train.tsv
        |- test.tsv
        |- dev.tsv
        |- readme.txt
        |- wordvec.txt
    |- glove.840B.300d.txt
```
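
Before training, it can save time to verify that the files are where the layout above expects them. The following is a small sketch using only the standard library; the helper name `missing_files` is an assumption, and the file list simply mirrors the tree shown above.

```python
import os

# Relative paths taken from the directory tree in this README.
EXPECTED = [
    os.path.join("Quora_question_pair_partition", "train.tsv"),
    os.path.join("Quora_question_pair_partition", "test.tsv"),
    os.path.join("Quora_question_pair_partition", "dev.tsv"),
    os.path.join("Quora_question_pair_partition", "readme.txt"),
    os.path.join("Quora_question_pair_partition", "wordvec.txt"),
    "glove.840B.300d.txt",
]

def missing_files(dataset_dir, expected=EXPECTED):
    """Return the expected relative paths that do not exist under dataset_dir."""
    return [p for p in expected
            if not os.path.isfile(os.path.join(dataset_dir, p))]

if __name__ == "__main__":
    root = os.path.join(os.path.expanduser("~"), ".cache", "paddle", "dataset")
    for path in missing_files(root):
        print("missing: " + path)
```

An empty result means every file in the tree was found.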

## Train and evaluate

We provide multiple models and configs; see the models and configs directories for details. For a quick start, you can run cdssmNet with the cdssm_base config:

```shell
python train_and_evaluate.py \
    --model_name=cdssmNet \
    --config=cdssm_base
```

You should get a log similar to cdssm_base.log.

## Results

### Models

#### CDSSM

#### InferSent

#### SSE

#### DecAtt

### Test Accuracy

|Model|dev accuracy|test accuracy|
|:----:|:----:|:----:|
|CDSSM|||
|InferSent|||
|SSE|||
|DecAtt|||