Unverified commit 96166665, authored by guru4elephant, committed by GitHub

Merge pull request #1367 from mapingshuo/world_conference

Add Text Matching models on Quora
# Text matching on Quora question-answer pair dataset
## Contents
* [Introduction](#introduction)
* [a brief review of the Quora Question Pair (QQP) Task](#a-brief-review-of-the-quora-question-pair-qqp-task)
* [Our Work](#our-work)
* [Environment Preparation](#environment-preparation)
* [Install Fluid release 1.0](#install-fluid-release-10)
* [Have I installed Fluid successfully?](#have-i-installed-fluid-successfully)
* [Prepare Data](#prepare-data)
* [Train and evaluate](#train-and-evaluate)
* [Models](#models)
* [Results](#results)
## Introduction
### a brief review of the Quora Question Pair (QQP) Task
The [Quora Question Pair](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) dataset contains over 400,000 question pairs from [Quora](https://www.quora.com/), where people ask and answer questions related to specific areas. Each sample in the dataset consists of two questions (both in English) and a label indicating whether the two questions are duplicates. The dataset is well annotated by humans.
Below are two samples from the dataset. The last column indicates whether the two questions are duplicates (1) or not (0).
|id | qid1 | qid2| question1| question2| is_duplicate
|:---:|:---:|:---:|:---:|:---:|:---:|
|0 |1 |2 |What is the step by step guide to invest in share market in india? |What is the step by step guide to invest in share market? |0|
|1 |3 |4 |What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? |0|
A [Kaggle competition](https://www.kaggle.com/c/quora-question-pairs#description) based on this dataset was held in 2017. Participants were given a training set (with labels) and asked to make predictions on a test set (without labels). The predictions were evaluated with log loss on the test data.
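For reference, the log loss used for that evaluation can be computed as in the following sketch (a minimal numpy illustration, not part of this repository):
```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy; y_prob is the predicted P(is_duplicate = 1)."""
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1 - eps)
    y_true = np.asarray(y_true, dtype=np.float64)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

print(log_loss([0, 1], [0.1, 0.8]))  # ~0.164: lower is better
```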
The Kaggle competition inspired much effective work. However, most of these models are rule-based and difficult to transfer to new tasks. Researchers are therefore seeking more general models that work well on this task and on other natural language processing (NLP) tasks.
[Wang _et al._](https://arxiv.org/abs/1702.03814) proposed a bilateral multi-perspective matching (BiMPM) model based on the Quora Question Pair dataset. They split the original dataset into [3 parts](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing): _train.tsv_ (384,348 samples), _dev.tsv_ (10,000 samples) and _test.tsv_ (10,000 samples). The class distribution of _train.tsv_ is unbalanced (37% positive and 63% negative), while those of _dev.tsv_ and _test.tsv_ are balanced (50% positive and 50% negative). We use the same split in our experiments.
### Our Work
Based on the Quora Question Pair dataset, we implemented several classic models from the area of natural language understanding (NLU). Prediction accuracy is evaluated on the _test.tsv_ split from [Wang _et al._](https://arxiv.org/abs/1702.03814).
## Environment Preparation
### Install Fluid release 1.0
Please follow the [official document in English](http://www.paddlepaddle.org/documentation/docs/en/1.0/build_and_install/pip_install_en.html) or [official document in Chinese](http://www.paddlepaddle.org/documentation/docs/zh/1.0/beginners_guide/install/Start.html) to install the Fluid deep learning framework.
#### Have I installed Fluid successfully?
Run the following script from your command line:
```shell
python -c "import paddle"
```
If Fluid is installed successfully, you should see no error message. Feel free to open an issue in the [PaddlePaddle repository](https://github.com/PaddlePaddle/Paddle/issues) if you need support.
## Prepare Data
Please download the Quora dataset from [Google drive](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing) and unzip it to $HOME/.cache/paddle/dataset.
Then run _data/prepare_quora_data.sh_ to download the pre-trained word embedding file _glove.840B.300d.zip_ (GloVe 840B 300d vectors):
```shell
sh data/prepare_quora_data.sh
```
At this point the dataset directory ($HOME/.cache/paddle/dataset) structure should be:
```shell
$HOME/.cache/paddle/dataset
|- Quora_question_pair_partition
|- train.tsv
|- test.tsv
|- dev.tsv
|- readme.txt
|- wordvec.txt
|- glove.840B.300d.txt
```
## Train and evaluate
We provide multiple models and configurations. Details can be found in the `models` and `configs` directories. For a quick start, run the _cdssmNet_ model with the corresponding configuration:
```shell
python train_and_evaluate.py \
--model_name=cdssmNet \
--config=cdssm_base
```
Logs will be output to the console. If everything works well, the logging information will have the same format as the content in _cdssm_base.log_.
All configurations used in our experiments are as follows:
|Model|Config|command
|:----:|:----:|:----:|
|cdssmNet|cdssm_base|python train_and_evaluate.py --model_name=cdssmNet --config=cdssm_base
|DecAttNet|decatt_glove|python train_and_evaluate.py --model_name=DecAttNet --config=decatt_glove
|InferSentNet|infer_sent_v1|python train_and_evaluate.py --model_name=InferSentNet --config=infer_sent_v1
|InferSentNet|infer_sent_v2|python train_and_evaluate.py --model_name=InferSentNet --config=infer_sent_v2
|SSENet|sse_base|python train_and_evaluate.py --model_name=SSENet --config=sse_base
## Models
We have implemented 4 models so far: the convolutional deep-structured semantic model (CDSSM, CNN-based), the InferSent model (RNN-based), the shortcut-stacked encoder (SSE, RNN-based), and the decomposed attention model (DecAtt, attention-based).
|Model|features|Context Encoder|Match Layer|Classification Layer
|:----:|:----:|:----:|:----:|:----:|
|CDSSM|word|1 layer conv1d|concatenation|MLP
|DecAtt|word|Attention|concatenation|MLP
|InferSent|word|1 layer Bi-LSTM|concatenation/element-wise product/<br>absolute element-wise difference|MLP
|SSE|word|3 layer Bi-LSTM|concatenation/element-wise product/<br>absolute element-wise difference|MLP
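The match layer combines the two question representations into a single matching feature vector. As a rough illustration, the concatenation/element-wise product/absolute element-wise difference combination used by InferSent and SSE can be sketched as follows (a minimal numpy sketch assuming two fixed-size sentence vectors; the actual Fluid implementation is `ElementwiseMatching` in _models/match_layers.py_):
```python
import numpy as np

def elementwise_match(v1, v2):
    """Concatenate the two vectors, their element-wise product and |v1 - v2|."""
    return np.concatenate([v1, v2, v1 * v2, np.abs(v1 - v2)])

# Two 4-d sentence vectors produce a 16-d matching feature.
v1, v2 = np.random.rand(4), np.random.rand(4)
assert elementwise_match(v1, v2).shape == (16,)
```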
### CDSSM
```
@inproceedings{shen2014learning,
title={Learning semantic representations using convolutional neural networks for web search},
author={Shen, Yelong and He, Xiaodong and Gao, Jianfeng and Deng, Li and Mesnil, Gr{\'e}goire},
booktitle={Proceedings of the 23rd International Conference on World Wide Web},
pages={373--374},
year={2014},
organization={ACM}
}
```
### InferSent
```
@article{conneau2017supervised,
title={Supervised learning of universal sentence representations from natural language inference data},
author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Loic and Bordes, Antoine},
journal={arXiv preprint arXiv:1705.02364},
year={2017}
}
```
### SSE
```
@article{nie2017shortcut,
title={Shortcut-stacked sentence encoders for multi-domain inference},
author={Nie, Yixin and Bansal, Mohit},
journal={arXiv preprint arXiv:1708.02312},
year={2017}
}
```
### DecAtt
```
@article{tomar2017neural,
title={Neural paraphrase identification of questions with noisy pretraining},
author={Tomar, Gaurav Singh and Duque, Thyago and T{\"a}ckstr{\"o}m, Oscar and Uszkoreit, Jakob and Das, Dipanjan},
journal={arXiv preprint arXiv:1704.04565},
year={2017}
}
```
## Results
|Model|Config|dev accuracy| test accuracy
|:----:|:----:|:----:|:----:|
|cdssmNet|cdssm_base|83.56%|82.83%|
|DecAttNet|decatt_glove|86.31%|86.22%|
|InferSentNet|infer_sent_v1|87.15%|86.62%|
|InferSentNet|infer_sent_v2|88.55%|88.43%|
|SSENet|sse_base|88.35%|88.25%|
In our experiments, we found that LSTM-based models outperformed convolution-based models. The DecAtt model has fewer parameters than the LSTM-based models, but is sensitive to hyper-parameters.
<p align="center">
<img src="imgs/models_test_acc.png" width = "500" alt="test_acc"/>
</p>
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .cdssm import cdssm_base
from .dec_att import decatt_glove
from .sse import sse_base
from .infer_sent import infer_sent_v1
from .infer_sent import infer_sent_v2
from __future__ import print_function
class config(object):
def __init__(self):
self.batch_size = 128
self.epoch_num = 50
self.optimizer_type = 'adam' # sgd, adagrad
# pretrained word embedding
self.use_pretrained_word_embedding = True
# when employing pretrained word embedding,
# out of vocabulary words' embedding is initialized with uniform or normal numbers
self.OOV_fill = 'uniform'
self.embedding_norm = False
        # use LoDTensor to represent variable-length sequences; otherwise, use padding and masks for sequence data
self.use_lod_tensor = True
# lr = lr * lr_decay after each epoch
self.lr_decay = 1
self.learning_rate = 0.001
self.save_dirname = 'model_dir'
self.train_samples_num = 384348
self.duplicate_data = False
self.metric_type = ['accuracy']
def list_config(self):
print("config", self.__dict__)
def has_member(self, var_name):
return var_name in self.__dict__
if __name__ == "__main__":
basic = config()
basic.list_config()
basic.ahh = 2
basic.list_config()
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from . import basic_config
def cdssm_base():
"""
set configs
"""
config = basic_config.config()
config.learning_rate = 0.001
config.save_dirname = "model_dir"
config.use_pretrained_word_embedding = True
config.dict_dim = 40000 # approx_vocab_size
# net config
config.emb_dim = 300
config.kernel_size = 5
config.kernel_count = 300
config.fc_dim = 128
config.mlp_hid_dim = [128, 128]
config.droprate_conv = 0.1
config.droprate_fc = 0.1
config.class_dim = 2
return config
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from . import basic_config
def decatt_glove():
"""
use config 'decAtt_glove' in the paper 'Neural Paraphrase Identification of Questions with Noisy Pretraining'
"""
config = basic_config.config()
config.learning_rate = 0.05
config.save_dirname = "model_dir"
config.use_pretrained_word_embedding = True
config.dict_dim = 40000 # approx_vocab_size
config.metric_type = ['accuracy', 'accuracy_with_threshold']
config.optimizer_type = 'sgd'
config.lr_decay = 1
config.use_lod_tensor = False
config.embedding_norm = False
config.OOV_fill = 'uniform'
config.duplicate_data = False
# net config
config.emb_dim = 300
config.proj_emb_dim = 200 #TODO: has project?
config.num_units = [400, 200]
config.word_embedding_trainable = True
config.droprate = 0.1
config.share_wight_btw_seq = True
config.class_dim = 2
return config
def decatt_word():
"""
use config 'decAtt_glove' in the paper 'Neural Paraphrase Identification of Questions with Noisy Pretraining'
"""
config = basic_config.config()
config.learning_rate = 0.05
config.save_dirname = "model_dir"
config.use_pretrained_word_embedding = False
config.dict_dim = 40000 # approx_vocab_size
config.metric_type = ['accuracy', 'accuracy_with_threshold']
config.optimizer_type = 'sgd'
config.lr_decay = 1
config.use_lod_tensor = False
config.embedding_norm = False
config.OOV_fill = 'uniform'
config.duplicate_data = False
# net config
config.emb_dim = 300
config.proj_emb_dim = 200 #TODO: has project?
config.num_units = [400, 200]
config.word_embedding_trainable = True
config.droprate = 0.1
config.share_wight_btw_seq = True
config.class_dim = 2
return config
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from . import basic_config
def infer_sent_v1():
"""
set configs
"""
config = basic_config.config()
config.learning_rate = 0.1
config.lr_decay = 0.99
config.optimizer_type = 'sgd'
config.save_dirname = "model_dir"
config.use_pretrained_word_embedding = True
config.dict_dim = 40000 # approx_vocab_size
config.class_dim = 2
# net config
config.emb_dim = 300
config.droprate_lstm = 0.0
config.droprate_fc = 0.0
config.word_embedding_trainable = False
config.rnn_hid_dim = 2048
config.mlp_non_linear = False
return config
def infer_sent_v2():
"""
use our own config
"""
config = basic_config.config()
config.learning_rate = 0.0002
config.lr_decay = 0.99
config.optimizer_type = 'adam'
config.save_dirname = "model_dir"
config.use_pretrained_word_embedding = True
config.dict_dim = 40000 # approx_vocab_size
config.class_dim = 2
# net config
config.emb_dim = 300
config.droprate_lstm = 0.0
config.droprate_fc = 0.2
config.word_embedding_trainable = False
config.rnn_hid_dim = 2048
config.mlp_non_linear = True
return config
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from . import basic_config
def sse_base():
"""
use config in the paper 'Shortcut-Stacked Sentence Encoders for Multi-Domain Inference'
"""
config = basic_config.config()
config.learning_rate = 0.0002
config.lr_decay = 0.7
config.save_dirname = "model_dir"
config.use_pretrained_word_embedding = True
config.dict_dim = 40000 # approx_vocab_size
config.metric_type = ['accuracy']
config.optimizer_type = 'adam'
config.use_lod_tensor = True
config.embedding_norm = False
config.OOV_fill = 'uniform'
config.duplicate_data = False
# net config
config.emb_dim = 300
config.rnn_hid_dim = [512, 1024, 2048]
config.fc_dim = [1600, 1600]
config.droprate_lstm = 0.0
config.droprate_fc = 0.1
config.class_dim = 2
return config
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Please download the Quora dataset first from https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing
# to the ROOT_DIR: $HOME/.cache/paddle/dataset
DATA_DIR=$HOME/.cache/paddle/dataset
wget --directory-prefix=$DATA_DIR http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip $DATA_DIR/glove.840B.300d.zip
# The final dataset directory should look like
# $HOME/.cache/paddle/dataset
# |- Quora_question_pair_partition
# |- train.tsv
# |- test.tsv
# |- dev.tsv
# |- readme.txt
# |- wordvec.txt
# |- glove.840B.300d.txt
Image files for this model: text_matching_on_quora
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
"""
This module defines evaluation metrics for classification tasks
"""
def accuracy(y_pred, label):
"""
    a prediction is correct when the top-1 class in y_pred equals the label
"""
y_pred = np.squeeze(y_pred)
y_pred_idx = np.argmax(y_pred, axis=1)
return 1.0 * np.sum(y_pred_idx == label) / label.shape[0]
def accuracy_with_threshold(y_pred, label, threshold=0.5):
"""
    predict class 1 when its probability in y_pred exceeds threshold;
    when threshold is 0.5, this function is equivalent to accuracy
"""
y_pred = np.squeeze(y_pred)
y_pred_idx = (y_pred[:, 1] > threshold).astype(int)
return 1.0 * np.sum(y_pred_idx == label) / label.shape[0]
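# Usage sketch (hypothetical values, for illustration only): y_pred is a softmax
# output of shape [batch_size, 2] and label is an integer array of shape [batch_size].
#
#   y_pred = np.array([[0.9, 0.1], [0.3, 0.7]])
#   label = np.array([0, 1])
#   accuracy(y_pred, label)                      # -> 1.0
#   accuracy_with_threshold(y_pred, label, 0.5)  # -> 1.0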
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .cdssm import cdssmNet
from .dec_att import DecAttNet
from .sse import SSENet
from .infer_sent import InferSentNet
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
class cdssmNet():
"""cdssm net"""
def __init__(self, config):
self._config = config
def __call__(self, seq1, seq2, label):
return self.body(seq1, seq2, label, self._config)
def body(self, seq1, seq2, label, config):
"""Body function"""
def conv_model(seq):
embed = fluid.layers.embedding(input=seq, size=[config.dict_dim, config.emb_dim], param_attr='emb.w')
conv = fluid.layers.sequence_conv(embed,
num_filters=config.kernel_count,
filter_size=config.kernel_size,
filter_stride=1,
padding=True, # TODO: what is padding
bias_attr=False,
param_attr='conv1d.w',
act='relu')
#print paddle.parameters.get('conv1d.w').shape
conv = fluid.layers.dropout(conv, dropout_prob = config.droprate_conv)
pool = fluid.layers.sequence_pool(conv, pool_type="max")
fc = fluid.layers.fc(pool,
size=config.fc_dim,
param_attr='fc1.w',
bias_attr='fc1.b',
act='relu')
return fc
def MLP(vec):
for dim in config.mlp_hid_dim:
vec = fluid.layers.fc(vec, size=dim, act='relu')
vec = fluid.layers.dropout(vec, dropout_prob=config.droprate_fc)
return vec
seq1_fc = conv_model(seq1)
seq2_fc = conv_model(seq2)
concated_seq = fluid.layers.concat(input=[seq1_fc, seq2_fc], axis=1)
mlp_res = MLP(concated_seq)
prediction = fluid.layers.fc(mlp_res, size=config.class_dim, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_cost = fluid.layers.mean(x=loss)
acc = fluid.layers.accuracy(input=prediction, label=label)
return avg_cost, acc, prediction
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
class DecAttNet():
"""decompose attention net"""
def __init__(self, config):
self._config = config
self.initializer = fluid.initializer.Xavier(uniform=False)
def __call__(self, seq1, seq2, mask1, mask2, label):
return self.body(seq1, seq2, mask1, mask2, label)
def body(self, seq1, seq2, mask1, mask2, label):
"""Body function"""
transformed_q1 = self.transformation(seq1)
transformed_q2 = self.transformation(seq2)
masked_q1 = self.apply_mask(transformed_q1, mask1)
masked_q2 = self.apply_mask(transformed_q2, mask2)
alpha, beta = self.attend(masked_q1, masked_q2)
if self._config.share_wight_btw_seq:
seq1_compare = self.compare(masked_q1, beta, param_prefix='compare')
seq2_compare = self.compare(masked_q2, alpha, param_prefix='compare')
else:
seq1_compare = self.compare(masked_q1, beta, param_prefix='compare_1')
seq2_compare = self.compare(masked_q2, alpha, param_prefix='compare_2')
aggregate_res = self.aggregate(seq1_compare, seq2_compare)
prediction = fluid.layers.fc(aggregate_res, size=self._config.class_dim, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_cost = fluid.layers.mean(x=loss)
acc = fluid.layers.accuracy(input=prediction, label=label)
return avg_cost, acc, prediction
def apply_mask(self, seq, mask):
"""
apply mask on seq
Input: seq in shape [batch_size, seq_len, embedding_size]
Input: mask in shape [batch_size, seq_len]
Output: masked seq in shape [batch_size, seq_len, embedding_size]
"""
return fluid.layers.elementwise_mul(x=seq, y=mask, axis=0)
def feed_forward_2d(self, vec, param_prefix):
"""
Input: vec in shape [batch_size, seq_len, vec_dim]
Output: fc2 in shape [batch_size, seq_len, num_units[1]]
"""
fc1 = fluid.layers.fc(vec, size=self._config.num_units[0], num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=param_prefix+'_fc1.w',
initializer=self.initializer),
bias_attr=param_prefix + '_fc1.b', act='relu')
fc1 = fluid.layers.dropout(fc1, dropout_prob = self._config.droprate)
fc2 = fluid.layers.fc(fc1, size=self._config.num_units[1], num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=param_prefix+'_fc2.w',
initializer=self.initializer),
bias_attr=param_prefix + '_fc2.b', act='relu')
fc2 = fluid.layers.dropout(fc2, dropout_prob = self._config.droprate)
return fc2
def feed_forward(self, vec, param_prefix):
"""
Input: vec in shape [batch_size, vec_dim]
Output: fc2 in shape [batch_size, num_units[1]]
"""
fc1 = fluid.layers.fc(vec, size=self._config.num_units[0], num_flatten_dims=1,
param_attr=fluid.ParamAttr(name=param_prefix+'_fc1.w',
initializer=self.initializer),
bias_attr=param_prefix + '_fc1.b', act='relu')
fc1 = fluid.layers.dropout(fc1, dropout_prob = self._config.droprate)
fc2 = fluid.layers.fc(fc1, size=self._config.num_units[1], num_flatten_dims=1,
param_attr=fluid.ParamAttr(name=param_prefix+'_fc2.w',
initializer=self.initializer),
bias_attr=param_prefix + '_fc2.b', act='relu')
fc2 = fluid.layers.dropout(fc2, dropout_prob = self._config.droprate)
return fc2
def transformation(self, seq):
embed = fluid.layers.embedding(input=seq, size=[self._config.dict_dim, self._config.emb_dim],
param_attr=fluid.ParamAttr(name='emb.w', trainable=self._config.word_embedding_trainable))
if self._config.proj_emb_dim is not None:
return fluid.layers.fc(embed, size=self._config.proj_emb_dim, num_flatten_dims=2,
param_attr=fluid.ParamAttr(name='project' + '_fc1.w',
initializer=self.initializer),
bias_attr=False,
act=None)
return embed
def attend(self, seq1, seq2):
"""
Input: seq1, shape [batch_size, seq_len1, embed_size]
Input: seq2, shape [batch_size, seq_len2, embed_size]
        Output: alpha, shape [batch_size, seq_len2, embed_size]
        Output: beta, shape [batch_size, seq_len1, embed_size]
"""
if self._config.share_wight_btw_seq:
seq1 = self.feed_forward_2d(seq1, param_prefix="attend")
seq2 = self.feed_forward_2d(seq2, param_prefix="attend")
else:
seq1 = self.feed_forward_2d(seq1, param_prefix="attend_1")
seq2 = self.feed_forward_2d(seq2, param_prefix="attend_2")
attention_weight = fluid.layers.matmul(seq1, seq2, transpose_y=True)
normalized_attention_weight = fluid.layers.softmax(attention_weight)
beta = fluid.layers.matmul(normalized_attention_weight, seq2)
attention_weight_t = fluid.layers.transpose(attention_weight, perm=[0, 2, 1])
normalized_attention_weight_t = fluid.layers.softmax(attention_weight_t)
alpha = fluid.layers.matmul(normalized_attention_weight_t, seq1)
return alpha, beta
def compare(self, seq, soft_alignment, param_prefix):
concat_seq = fluid.layers.concat(input=[seq, soft_alignment], axis=2)
        return self.feed_forward_2d(concat_seq, param_prefix=param_prefix)
def aggregate(self, vec1, vec2):
vec1 = fluid.layers.reduce_sum(vec1, dim=1)
vec2 = fluid.layers.reduce_sum(vec2, dim=1)
concat_vec = fluid.layers.concat(input=[vec1, vec2], axis=1)
return self.feed_forward(concat_vec, param_prefix='aggregate')
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from .my_layers import bi_lstm_layer
from .match_layers import ElementwiseMatching
class InferSentNet():
"""
    Based on the paper: Supervised Learning of Universal Sentence Representations from Natural Language Inference Data:
https://arxiv.org/abs/1705.02364
"""
def __init__(self, config):
self._config = config
def __call__(self, seq1, seq2, label):
return self.body(seq1, seq2, label, self._config)
def body(self, seq1, seq2, label, config):
"""Body function"""
seq1_rnn = self.encoder(seq1)
seq2_rnn = self.encoder(seq2)
seq_match = ElementwiseMatching(seq1_rnn, seq2_rnn)
mlp_res = self.MLP(seq_match)
prediction = fluid.layers.fc(mlp_res, size=self._config.class_dim, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_cost = fluid.layers.mean(x=loss)
acc = fluid.layers.accuracy(input=prediction, label=label)
return avg_cost, acc, prediction
def encoder(self, seq):
"""encoder"""
embed = fluid.layers.embedding(
input=seq,
size=[self._config.dict_dim, self._config.emb_dim],
param_attr=fluid.ParamAttr(name='emb.w', trainable=self._config.word_embedding_trainable))
bi_lstm_h = bi_lstm_layer(
embed,
rnn_hid_dim = self._config.rnn_hid_dim,
name='encoder')
bi_lstm_h = fluid.layers.dropout(bi_lstm_h, dropout_prob=self._config.droprate_lstm)
pool = fluid.layers.sequence_pool(input=bi_lstm_h, pool_type='max')
return pool
def MLP(self, vec):
if self._config.mlp_non_linear:
drop1 = fluid.layers.dropout(vec, dropout_prob=self._config.droprate_fc)
fc1 = fluid.layers.fc(drop1, size=512, act='tanh')
drop2 = fluid.layers.dropout(fc1, dropout_prob=self._config.droprate_fc)
fc2 = fluid.layers.fc(drop2, size=512, act='tanh')
res = fluid.layers.dropout(fc2, dropout_prob=self._config.droprate_fc)
else:
fc1 = fluid.layers.fc(vec, size=512, act=None)
res = fluid.layers.fc(fc1, size=512, act=None)
return res
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This module provides different kinds of match layers
"""
import paddle.fluid as fluid
def MultiPerspectiveMatching(vec1, vec2, perspective_num):
"""
MultiPerspectiveMatching
"""
sim_res = None
for i in range(perspective_num):
vec1_res = fluid.layers.elementwise_add_with_weight(
vec1,
param_attr="elementwise_add_with_weight." + str(i))
vec2_res = fluid.layers.elementwise_add_with_weight(
vec2,
param_attr="elementwise_add_with_weight." + str(i))
m = fluid.layers.cos_sim(vec1_res, vec2_res)
if sim_res is None:
sim_res = m
else:
sim_res = fluid.layers.concat(input=[sim_res, m], axis=1)
return sim_res
def ConcateMatching(vec1, vec2):
"""
ConcateMatching
"""
#TODO: assert shape
return fluid.layers.concat(input=[vec1, vec2], axis=1)
def ElementwiseMatching(vec1, vec2):
"""
reference: [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364)
"""
elementwise_mul = fluid.layers.elementwise_mul(x=vec1, y=vec2)
elementwise_sub = fluid.layers.elementwise_sub(x=vec1, y=vec2)
elementwise_abs_sub = fluid.layers.abs(elementwise_sub)
return fluid.layers.concat(input=[vec1, vec2, elementwise_mul, elementwise_abs_sub], axis=1)
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This module defines some Frequently-used DNN layers
"""
import paddle.fluid as fluid
def bi_lstm_layer(input, rnn_hid_dim, name):
"""
    This is a bi-directional LSTM (long short-term memory) module
"""
fc0 = fluid.layers.fc(input=input, # fc for lstm
size=rnn_hid_dim * 4,
param_attr=name + '.fc0.w',
bias_attr=False,
act=None)
lstm_h, c = fluid.layers.dynamic_lstm(
input=fc0,
size=rnn_hid_dim * 4,
is_reverse=False,
param_attr=name + '.lstm_w',
bias_attr=name + '.lstm_b')
reversed_lstm_h, reversed_c = fluid.layers.dynamic_lstm(
input=fc0,
size=rnn_hid_dim * 4,
is_reverse=True,
param_attr=name + '.reversed_lstm_w',
bias_attr=name + '.reversed_lstm_b')
return fluid.layers.concat(input=[lstm_h, reversed_lstm_h], axis=1)
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from .my_layers import bi_lstm_layer
from .match_layers import ElementwiseMatching
class SSENet():
"""
SSE net: Shortcut-Stacked Sentence Encoders for Multi-Domain Inference
https://arxiv.org/abs/1708.02312
"""
def __init__(self, config):
self._config = config
def __call__(self, seq1, seq2, label):
return self.body(seq1, seq2, label, self._config)
def body(self, seq1, seq2, label, config):
"""Body function"""
def stacked_bi_rnn_model(seq):
embed = fluid.layers.embedding(input=seq, size=[self._config.dict_dim, self._config.emb_dim], param_attr='emb.w')
stacked_lstm_out = [embed]
for i in range(len(self._config.rnn_hid_dim)):
if i == 0:
feature = embed
else:
feature = fluid.layers.concat(input = stacked_lstm_out, axis=1)
bi_lstm_h = bi_lstm_layer(feature,
rnn_hid_dim=self._config.rnn_hid_dim[i],
name="lstm_" + str(i))
# add dropout except for the last stacked lstm layer
if i != len(self._config.rnn_hid_dim) - 1:
bi_lstm_h = fluid.layers.dropout(bi_lstm_h, dropout_prob=self._config.droprate_lstm)
stacked_lstm_out.append(bi_lstm_h)
pool = fluid.layers.sequence_pool(input=bi_lstm_h, pool_type='max')
return pool
def MLP(vec):
for i in range(len(self._config.fc_dim)):
vec = fluid.layers.fc(vec, size=self._config.fc_dim[i], act='relu')
# add dropout after every layer of MLP
vec = fluid.layers.dropout(vec, dropout_prob=self._config.droprate_fc)
return vec
seq1_rnn = stacked_bi_rnn_model(seq1)
seq2_rnn = stacked_bi_rnn_model(seq2)
seq_match = ElementwiseMatching(seq1_rnn, seq2_rnn)
mlp_res = MLP(seq_match)
prediction = fluid.layers.fc(mlp_res, size=self._config.class_dim, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_cost = fluid.layers.mean(x=loss)
acc = fluid.layers.accuracy(input=prediction, label=label)
return avg_cost, acc, prediction
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This module provides pretrained word embeddings
"""
from __future__ import print_function, unicode_literals
import numpy as np
import time, datetime
import os, sys
def Glove840B_300D(filepath, keys=None):
"""
input: the "glove.840B.300d.txt" file path
return: a dict, key: word (unicode), value: a numpy array with shape [300]
"""
if keys is not None:
assert(isinstance(keys, set))
print("loading word2vec from ", filepath)
print("please wait for a minute.")
start = time.time()
word2vec = {}
with open(filepath, "r") as f:
for line in f:
if sys.version_info <= (3, 0): # for python2
line = line.decode('utf-8')
info = line.strip("\n").split(" ")
word = info[0]
if (keys is not None) and (word not in keys):
continue
vector = info[1:]
assert(len(vector) == 300)
word2vec[word] = np.asarray(vector, dtype='float32')
end = time.time()
print("Spent ", str(datetime.timedelta(seconds=end-start)), " on loading word2vec.")
return word2vec
if __name__ == '__main__':
from os.path import expanduser
home = expanduser("~")
embed_dict = Glove840B_300D(os.path.join(home, "./.cache/paddle/dataset/glove.840B.300d.txt"))
exit(0)
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
"""
import paddle.dataset.common
import collections
import tarfile
import re
import string
import random
import os, sys
import nltk
from os.path import expanduser
__all__ = ['word_dict', 'train', 'dev', 'test']
URL = "https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view"
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset')
DATA_DIR = "Quora_question_pair_partition"
QUORA_TRAIN_FILE_NAME = os.path.join(DATA_HOME, DATA_DIR, 'train.tsv')
QUORA_DEV_FILE_NAME = os.path.join(DATA_HOME, DATA_DIR, 'dev.tsv')
QUORA_TEST_FILE_NAME = os.path.join(DATA_HOME, DATA_DIR, 'test.tsv')
# punctuation or nltk or space
TOKENIZE_METHOD='space'
COLUMN_COUNT = 4
def tokenize(s):
if sys.version_info <= (3, 0): # for python2
s = s.decode('utf-8')
if TOKENIZE_METHOD == "nltk":
return nltk.tokenize.word_tokenize(s)
elif TOKENIZE_METHOD == "punctuation":
return s.translate({ord(char): None for char in string.punctuation}).lower().split()
elif TOKENIZE_METHOD == "space":
return s.split()
else:
raise RuntimeError("Invalid tokenize method")
def maybe_open(file_name):
if not os.path.isfile(file_name):
msg = "file not exist: %s\nPlease download the dataset firstly from: %s\n\n" % (file_name, URL) + \
("# The finally dataset dir should be like\n\n"
"$HOME/.cache/paddle/dataset\n"
" |- Quora_question_pair_partition\n"
" |- train.tsv\n"
" |- test.tsv\n"
" |- dev.tsv\n"
" |- readme.txt\n"
" |- wordvec.txt\n")
raise RuntimeError(msg)
return open(file_name, 'r')
def tokenized_question_pairs(file_name):
"""
"""
with maybe_open(file_name) as f:
questions = {}
lines = f.readlines()
for line in lines:
info = line.strip().split('\t')
if len(info) != COLUMN_COUNT:
# formatting error
continue
(label, question1, question2, id) = info
question1 = tokenize(question1)
question2 = tokenize(question2)
yield question1, question2, int(label)
def tokenized_questions(file_name):
"""
"""
with maybe_open(file_name) as f:
lines = f.readlines()
for line in lines:
info = line.strip().split('\t')
if len(info) != COLUMN_COUNT:
# formatting error
continue
(label, question1, question2, id) = info
yield tokenize(question1)
yield tokenize(question2)
def build_dict(file_name, cutoff):
"""
Build a word dictionary from the corpus. Keys of the dictionary are words,
and values are zero-based IDs of these words.
"""
word_freq = collections.defaultdict(int)
for doc in tokenized_questions(file_name):
for word in doc:
word_freq[word] += 1
word_freq = filter(lambda x: x[1] > cutoff, word_freq.items())
dictionary = sorted(word_freq, key=lambda x: (-x[1], x[0]))
words, _ = list(zip(*dictionary))
word_idx = dict(zip(words, range(len(words))))
word_idx['<unk>'] = len(words)
word_idx['<pad>'] = len(words) + 1
return word_idx
def reader_creator(file_name, word_idx):
UNK_ID = word_idx['<unk>']
def reader():
for (q1, q2, label) in tokenized_question_pairs(file_name):
q1_ids = [word_idx.get(w, UNK_ID) for w in q1]
q2_ids = [word_idx.get(w, UNK_ID) for w in q2]
if q1_ids != [] and q2_ids != []: # [] is not allowed in fluid
assert(label in [0, 1])
yield q1_ids, q2_ids, label
return reader
def train(word_idx):
"""
Quora training set creator.
It returns a reader creator, each sample in the reader is two zero-based ID
list and label in [0, 1].
:param word_idx: word dictionary
:type word_idx: dict
:return: Training reader creator
:rtype: callable
"""
return reader_creator(QUORA_TRAIN_FILE_NAME, word_idx)
def dev(word_idx):
"""
Quora develop set creator.
It returns a reader creator, each sample in the reader is two zero-based ID
list and label in [0, 1].
:param word_idx: word dictionary
:type word_idx: dict
:return: develop reader creator
:rtype: callable
"""
return reader_creator(QUORA_DEV_FILE_NAME, word_idx)
def test(word_idx):
"""
Quora test set creator.
It returns a reader creator, each sample in the reader is two zero-based ID
list and label in [0, 1].
:param word_idx: word dictionary
:type word_idx: dict
:return: Test reader creator
:rtype: callable
"""
return reader_creator(QUORA_TEST_FILE_NAME, word_idx)
def word_dict():
"""
Build a word dictionary from the corpus.
:return: Word dictionary
:rtype: dict
"""
return build_dict(file_name=QUORA_TRAIN_FILE_NAME, cutoff=4)
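# Usage sketch (assumed workflow, for illustration only): build the vocabulary from
# the training file once, then create per-sample readers with it.
#
#   word_idx = word_dict()
#   for q1_ids, q2_ids, label in train(word_idx)():
#       pass  # q1_ids / q2_ids are lists of word IDs, label is 0 or 1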
#Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import sys
import time
import argparse
import unittest
import contextlib
import numpy as np
import paddle.fluid as fluid
import utils, metric, configs
import models
from pretrained_word2vec import Glove840B_300D
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('--model_name', type=str, default='cdssmNet', help="Which model to train")
parser.add_argument('--config', type=str, default='cdssm_base', help="The global config setting")
DATA_DIR = os.path.join(os.path.expanduser('~'), '.cache/paddle/dataset')
def evaluate(epoch_id, exe, inference_program, dev_reader, test_reader, fetch_list, feeder, metric_type):
"""
evaluate on test/dev dataset
"""
def infer(test_reader):
"""
do inference function
"""
total_cost = 0.0
total_count = 0
preds, labels = [], []
for data in test_reader():
avg_cost, avg_acc, batch_prediction = exe.run(inference_program,
feed=feeder.feed(data),
fetch_list=fetch_list,
return_numpy=True)
total_cost += avg_cost * len(data)
total_count += len(data)
preds.append(batch_prediction)
labels.append(np.asarray([x[-1] for x in data], dtype=np.int64))
y_pred = np.concatenate(preds)
y_label = np.concatenate(labels)
metric_res = []
for metric_name in metric_type:
if metric_name == 'accuracy_with_threshold':
metric_res.append((metric_name, metric.accuracy_with_threshold(y_pred, y_label, threshold=0.3)))
elif metric_name == 'accuracy':
metric_res.append((metric_name, metric.accuracy(y_pred, y_label)))
else:
print("Unknown metric type: ", metric_name)
exit()
return total_cost / (total_count * 1.0), metric_res
dev_cost, dev_metric_res = infer(dev_reader)
print("[%s] epoch_id: %d, dev_cost: %f, " % (
time.asctime( time.localtime(time.time()) ),
epoch_id,
dev_cost)
+ ', '.join([str(x[0]) + ": " + str(x[1]) for x in dev_metric_res]))
test_cost, test_metric_res = infer(test_reader)
print("[%s] epoch_id: %d, test_cost: %f, " % (
time.asctime( time.localtime(time.time()) ),
epoch_id,
test_cost)
+ ', '.join([str(x[0]) + ": " + str(x[1]) for x in test_metric_res]))
print("")
def train_and_evaluate(train_reader,
test_reader,
dev_reader,
network,
optimizer,
global_config,
pretrained_word_embedding,
use_cuda,
parallel):
"""
train network
"""
# define the net
if global_config.use_lod_tensor:
# automatic add batch dim
q1 = fluid.layers.data(
name="question1", shape=[1], dtype="int64", lod_level=1)
q2 = fluid.layers.data(
name="question2", shape=[1], dtype="int64", lod_level=1)
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
cost, acc, prediction = network(q1, q2, label)
else:
# shape: [batch_size, max_seq_len_in_batch, 1]
q1 = fluid.layers.data(
name="question1", shape=[-1, -1, 1], dtype="int64")
q2 = fluid.layers.data(
name="question2", shape=[-1, -1, 1], dtype="int64")
# shape: [batch_size, max_seq_len_in_batch]
mask1 = fluid.layers.data(name="mask1", shape=[-1, -1], dtype="float32")
mask2 = fluid.layers.data(name="mask2", shape=[-1, -1], dtype="float32")
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
cost, acc, prediction = network(q1, q2, mask1, mask2, label)
if parallel:
        # TODO: Parallel Training
print("Parallel Training is not supported for now.")
sys.exit(1)
#optimizer.minimize(cost)
if use_cuda:
print("Using GPU")
place = fluid.CUDAPlace(0)
else:
print("Using CPU")
place = fluid.CPUPlace()
exe = fluid.Executor(place)
if global_config.use_lod_tensor:
feeder = fluid.DataFeeder(feed_list=[q1, q2, label], place=place)
else:
feeder = fluid.DataFeeder(feed_list=[q1, q2, mask1, mask2, label], place=place)
# logging param info
for param in fluid.default_main_program().global_block().all_parameters():
print("param name: %s; param shape: %s" % (param.name, param.shape))
# define inference_program
inference_program = fluid.default_main_program().clone(for_test=True)
optimizer.minimize(cost)
exe.run(fluid.default_startup_program())
    # load the pretrained embedding from a numpy array
if pretrained_word_embedding is not None:
print("loading pretrained word embedding to param")
embedding_name = "emb.w"
embedding_param = fluid.global_scope().find_var(embedding_name).get_tensor()
embedding_param.set(pretrained_word_embedding, place)
evaluate(-1,
exe,
inference_program,
dev_reader,
test_reader,
fetch_list=[cost, acc, prediction],
feeder=feeder,
metric_type=global_config.metric_type)
# start training
print("[%s] Start Training" % time.asctime(time.localtime(time.time())))
for epoch_id in range(global_config.epoch_num):
data_size, data_count, total_acc, total_cost = 0, 0, 0.0, 0.0
batch_id = 0
for data in train_reader():
avg_cost_np, avg_acc_np = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[cost, acc])
data_size = len(data)
total_acc += data_size * avg_acc_np
total_cost += data_size * avg_cost_np
data_count += data_size
if batch_id % 100 == 0:
print("[%s] epoch_id: %d, batch_id: %d, cost: %f, acc: %f" % (
time.asctime(time.localtime(time.time())),
epoch_id,
batch_id,
avg_cost_np,
avg_acc_np))
batch_id += 1
avg_cost = total_cost / data_count
avg_acc = total_acc / data_count
print("")
print("[%s] epoch_id: %d, train_avg_cost: %f, train_avg_acc: %f" % (
time.asctime( time.localtime(time.time()) ), epoch_id, avg_cost, avg_acc))
epoch_model = global_config.save_dirname + "/" + "epoch" + str(epoch_id)
fluid.io.save_inference_model(epoch_model, ["question1", "question2", "label"], acc, exe)
evaluate(epoch_id,
exe,
inference_program,
dev_reader,
test_reader,
fetch_list=[cost, acc, prediction],
feeder=feeder,
metric_type=global_config.metric_type)
def main():
"""
    This function parses arguments, prepares data and loads the pretrained embedding
"""
args = parser.parse_args()
global_config = configs.__dict__[args.config]()
print("net_name: ", args.model_name)
net = models.__dict__[args.model_name](global_config)
# get word_dict
word_dict = utils.getDict(data_type="quora_question_pairs")
# get reader
train_reader, dev_reader, test_reader = utils.prepare_data(
"quora_question_pairs",
word_dict=word_dict,
batch_size = global_config.batch_size,
buf_size=800000,
duplicate_data=global_config.duplicate_data,
use_pad=(not global_config.use_lod_tensor))
# load pretrained_word_embedding
if global_config.use_pretrained_word_embedding:
word2vec = Glove840B_300D(filepath=os.path.join(DATA_DIR, "glove.840B.300d.txt"),
keys=set(word_dict.keys()))
pretrained_word_embedding = utils.get_pretrained_word_embedding(
word2vec=word2vec,
word2id=word_dict,
config=global_config)
print("pretrained_word_embedding to be load:", pretrained_word_embedding)
else:
pretrained_word_embedding = None
# define optimizer
optimizer = utils.getOptimizer(global_config)
# use cuda or not
if not global_config.has_member('use_cuda'):
global_config.use_cuda = 'CUDA_VISIBLE_DEVICES' in os.environ
global_config.list_config()
train_and_evaluate(
train_reader,
        test_reader,
        dev_reader,
net,
optimizer,
global_config,
pretrained_word_embedding,
use_cuda=global_config.use_cuda,
parallel=False)
if __name__ == "__main__":
main()
source ~/mapingshuo/.bash_mapingshuo_fluid
export CUDA_VISIBLE_DEVICES=1
fluid train_and_evaluate.py \
--model_name=cdssmNet \
--config=cdssm_base
#fluid train_and_evaluate.py \
# --model_name=DecAttNet \
# --config=decatt_glove
#fluid train_and_evaluate.py \
# --model_name=DecAttNet \
# --config=decatt_word
#fluid train_and_evaluate.py \
# --model_name=ESIMNet \
# --config=esim_seq
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This module provides utilities for data generator and optimizer definition
"""
import sys
import time
import numpy as np
import paddle.fluid as fluid
import paddle
import quora_question_pairs
def to_lodtensor(data, place):
"""
    convert a batch of variable-length sequences to a LoDTensor
"""
seq_lens = [len(seq) for seq in data]
cur_len = 0
lod = [cur_len]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res.set(flattened_data, place)
res.set_lod([lod])
return res
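# Example (illustrative): for data = [[1, 2, 3], [4, 5]] the flattened tensor holds
# [[1], [2], [3], [4], [5]] and lod = [0, 3, 5], i.e. the two sequences occupy
# offsets [0, 3) and [3, 5).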
def getOptimizer(global_config):
"""
get Optimizer by config
"""
if global_config.optimizer_type == "adam":
optimizer = fluid.optimizer.Adam(learning_rate=fluid.layers.exponential_decay(
learning_rate=global_config.learning_rate,
decay_steps=global_config.train_samples_num // global_config.batch_size,
decay_rate=global_config.lr_decay))
elif global_config.optimizer_type == "sgd":
optimizer = fluid.optimizer.SGD(learning_rate=fluid.layers.exponential_decay(
learning_rate=global_config.learning_rate,
decay_steps=global_config.train_samples_num // global_config.batch_size,
decay_rate=global_config.lr_decay))
elif global_config.optimizer_type == "adagrad":
optimizer = fluid.optimizer.Adagrad(learning_rate=fluid.layers.exponential_decay(
learning_rate=global_config.learning_rate,
decay_steps=global_config.train_samples_num // global_config.batch_size,
decay_rate=global_config.lr_decay))
return optimizer
def get_pretrained_word_embedding(word2vec, word2id, config):
"""get pretrained embedding in shape [config.dict_dim, config.emb_dim]"""
print("preparing pretrained word embedding ...")
assert(config.dict_dim >= len(word2id))
word2id = sorted(word2id.items(), key = lambda x : x[1])
words = [x[0] for x in word2id]
words = words + ['<not-a-real-words>'] * (config.dict_dim - len(words))
pretrained_emb = []
for _, word in enumerate(words):
if word in word2vec:
            assert(len(word2vec[word]) == config.emb_dim)
if config.embedding_norm:
pretrained_emb.append(word2vec[word] / np.linalg.norm(word2vec[word]))
else:
pretrained_emb.append(word2vec[word])
elif config.OOV_fill == 'uniform':
pretrained_emb.append(np.random.uniform(-0.05, 0.05, size=[config.emb_dim]).astype(np.float32))
elif config.OOV_fill == 'normal':
pretrained_emb.append(np.random.normal(loc=0.0, scale=0.1, size=[config.emb_dim]).astype(np.float32))
else:
print("Unkown OOV fill method: ", OOV_fill)
exit()
word_embedding = np.stack(pretrained_emb)
return word_embedding
def getDict(data_type="quora_question_pairs"):
"""
get word2id dict from quora dataset
"""
print("Generating word dict...")
if data_type == "quora_question_pairs":
word_dict = quora_question_pairs.word_dict()
else:
raise RuntimeError("No such dataset")
print("Vocab size: ", len(word_dict))
return word_dict
def duplicate(reader):
"""
    duplicate the Quora question pairs by swapping the two questions in each sample
Input: reader, which yield (question1, question2, label)
Output: reader, which yield (question1, question2, label) and yield (question2, question1, label)
"""
def duplicated_reader():
for data in reader():
(q1, q2, label) = data
yield (q1, q2, label)
yield (q2, q1, label)
return duplicated_reader
def pad(reader, PAD_ID):
"""
Input: reader, yield batches of [(question1, question2, label), ... ]
Output: padded_reader, yield batches of [(padded_question1, padded_question2, mask1, mask2, label), ... ]
"""
assert(isinstance(PAD_ID, int))
def padded_reader():
for batch in reader():
max_len1 = max([len(data[0]) for data in batch])
max_len2 = max([len(data[1]) for data in batch])
padded_batch = []
for data in batch:
question1, question2, label = data
seq_len1 = len(question1)
seq_len2 = len(question2)
mask1 = [1] * seq_len1 + [0] * (max_len1 - seq_len1)
mask2 = [1] * seq_len2 + [0] * (max_len2 - seq_len2)
padded_question1 = question1 + [PAD_ID] * (max_len1 - seq_len1)
padded_question2 = question2 + [PAD_ID] * (max_len2 - seq_len2)
                padded_question1 = [[x] for x in padded_question1] # last dim of questions must be 1, as required by fluid
padded_question2 = [[x] for x in padded_question2]
assert(len(mask1) == max_len1)
assert(len(mask2) == max_len2)
assert(len(padded_question1) == max_len1)
assert(len(padded_question2) == max_len2)
padded_batch.append((padded_question1, padded_question2, mask1, mask2, label))
yield padded_batch
return padded_reader
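# Example (illustrative): with PAD_ID = 0 and a batch sample ([1, 2, 3], [7, 8], 1),
# where max_len1 = 3 and max_len2 = 4 in that batch, the padded sample becomes
# ([[1], [2], [3]], [[7], [8], [0], [0]], [1, 1, 1], [1, 1, 0, 0], 1).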
def prepare_data(data_type,
word_dict,
batch_size,
buf_size=50000,
duplicate_data=False,
use_pad=False):
"""
prepare data
"""
PAD_ID=word_dict['<pad>']
if data_type == "quora_question_pairs":
        # train/dev/test readers are batched iterators which yield a batch of (question1, question2, label) each time
        # question1 and question2 are lists of word IDs
        # label is 0 or 1
        # for example: ([1, 3, 2], [7, 5, 4, 99], 1)
def prepare_reader(reader):
if duplicate_data:
reader = duplicate(reader)
reader = paddle.batch(
paddle.reader.shuffle(reader, buf_size=buf_size),
batch_size=batch_size,
drop_last=False)
if use_pad:
reader = pad(reader, PAD_ID=PAD_ID)
return reader
train_reader = prepare_reader(quora_question_pairs.train(word_dict))
dev_reader = prepare_reader(quora_question_pairs.dev(word_dict))
test_reader = prepare_reader(quora_question_pairs.test(word_dict))
else:
raise RuntimeError("no such dataset")
return train_reader, dev_reader, test_reader
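# Usage sketch (assumed workflow, for illustration only): the training script builds
# the word dict first and then requests the three batched readers.
#
#   word_dict = getDict("quora_question_pairs")
#   train_reader, dev_reader, test_reader = prepare_data(
#       "quora_question_pairs", word_dict, batch_size=128, use_pad=False)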