Unverified commit bed0a70a authored by: W wuzhihua, committed by: GitHub

Merge branch 'master' into multiview-simnet

@@ -22,12 +22,12 @@
| Model | Description | Paper |
| :------------------: | :--------------------: | :---------: |
| TagSpace | Tag recommendation | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
| Classification | Text classification | [EMNLP 2014][Convolutional neural networks for sentence classification](https://www.aclweb.org/anthology/D14-1181.pdf) |

Below is a brief introduction to each model (note: the figures are taken from the linked papers).

[TagSpace model](https://www.aclweb.org/anthology/D14-1194.pdf)
<p align="center">
<img align="center" src="../../doc/imgs/tagspace.png">
<p>
@@ -37,25 +37,24 @@
<img align="center" src="../../doc/imgs/cnn-ckim2014.png">
<p>
## Tutorial (Quick Start)
```
git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
cd paddle-rec
python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
python -m paddlerec.run -m models/contentunderstanding/classification/config.yaml
```
## Tutorial (Reproducing the Paper)

### Note
To help users run every model quickly, we provide sample data under each model directory. To reproduce the results reported in this README, please use the scripts below to download the corresponding dataset and preprocess it.

**(1) TagSpace**

### Data Processing
[Dataset](https://github.com/mhjabreel/CharCNN/tree/master/data/), [backup dataset](https://paddle-tagspace.bj.bcebos.com/data.tar)

The data format is as follows:
@@ -63,63 +62,148 @@ python -m paddlerec.run -m models/contentunderstanding/classification/config.yaml
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
```
This guide provides a script that quickly converts the raw text in the dataset into a trainable format. After extracting the dataset, place the raw data in the raw_big_train_data and raw_big_test_data directories and run the provided text2paddle.py under a Python 3 environment. This generates the train_big_data and test_big_data directories, which can be used directly for training. The commands are as follows:
```
mkdir raw_big_train_data
mkdir raw_big_test_data
mv train.csv raw_big_train_data
mv test.csv raw_big_test_data
python3 text2paddle.py raw_big_train_data/ raw_big_test_data/ train_big_data test_big_data big_vocab_text.txt big_vocab_tag.txt
```
The data directory after running the script:
```
big_vocab_tag.txt    # tag vocabulary size
big_vocab_text.txt   # text vocabulary size
data.tar             # downloaded dataset archive
raw_big_train_data   # raw training set from the dataset
raw_big_test_data    # raw test set from the dataset
train_data           # sample training set
test_data            # sample test set
train_big_data       # processed training set
test_big_data        # processed test set
text2paddle.py       # preprocessing script
```
The processed data format is as follows:
```
2,27 7062 8390 456 407 8 11589 3166 4 7278 31046 33 3898 2897 426 1
2,27 9493 836 355 20871 300 81 19 3 4125 9 449 462 13832 6 16570 1380 2874 5 0 797 236 19 3688 2106 14 8615 7 209 304 4 0 123 1
2,27 12754 637 106 3839 1532 66 0 379 6 0 1246 9 307 33 161 2 8100 36 0 350 123 101 74 181 0 6657 4 0 1222 17195 1
```
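Each processed line is the tag id, then a comma, then the space-separated token ids of the text. For reference, here is a minimal Python sketch of splitting such a line back into its parts (illustrative only — the repository's reader.py does the actual parsing during training, and the file path assumes the processed file keeps the raw file's name):

```
def parse_processed_line(line):
    """Parse '2,27 7062 8390 ...' into (2, [27, 7062, 8390, ...])."""
    tag_str, text_str = line.strip().split(",", 1)
    return int(tag_str), [int(tok) for tok in text_str.split()]

# Example: iterate over the processed training file (assumed path).
with open("train_big_data/train.csv") as f:
    for line in f:
        tag_id, token_ids = parse_processed_line(line)
```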
### Training
Return to the tagspace directory and open config.yaml to change the following parameters:
Change workspace to your current absolute path (you can get it with the pwd command).
Change the batch_size of sample_1 under dataset from 10 to 128.
Change the data_path of sample_1 under dataset to: {workspace}/data/train_big_data
Change the batch_size of inferdata under dataset from 10 to 500.
Change the data_path of inferdata under dataset to: {workspace}/data/test_big_data

Run the command to start training:
```
python -m paddlerec.run -m ./config.yaml
```
### Prediction
After training finishes, the model runs prediction on the validation set.

Output:
```
PaddleRec: Runner infer_runner Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment/9
batch: 1, acc: [0.91], loss: [0.02495437]
batch: 2, acc: [0.936], loss: [0.01941476]
batch: 3, acc: [0.918], loss: [0.02116447]
batch: 4, acc: [0.916], loss: [0.0219945]
batch: 5, acc: [0.902], loss: [0.02242816]
batch: 6, acc: [0.9], loss: [0.02421589]
batch: 7, acc: [0.9], loss: [0.026441]
batch: 8, acc: [0.934], loss: [0.01797657]
batch: 9, acc: [0.932], loss: [0.01687362]
batch: 10, acc: [0.926], loss: [0.02047823]
batch: 11, acc: [0.918], loss: [0.01998716]
batch: 12, acc: [0.898], loss: [0.0229556]
batch: 13, acc: [0.928], loss: [0.01736144]
batch: 14, acc: [0.93], loss: [0.01911209]
```
**(2) Classification**

### Data Processing
Sentiment classification (Senta) automatically determines the sentiment polarity of Chinese text containing subjective opinions and outputs a confidence score. Sentiment is classified as positive or negative. Sentiment analysis helps enterprises understand users' consumption habits, analyze trending topics, and monitor public-opinion crises, providing useful decision support.

Sentiment is a high-level human cognitive behavior; recognizing the sentiment of text requires deep semantic modeling. In addition, sentiment is expressed differently across domains (e.g., catering, sports), so training requires large-scale data covering many domains. We address these two problems with deep-learning-based semantic models and large-scale data mining. For evaluation, we use the open-source sentiment classification dataset ChnSentiCorp.

You can download our pre-tokenized dataset with the commands below. After extraction, the senta_data directory will contain the training data (train.tsv), development data (dev.tsv), test data (test.tsv), and the corresponding vocabulary (word_dict.txt):
```
wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz
tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz
```
Each sample is one Chinese review sentence plus a label carrying the sentiment, separated by a tab (\t). The review sentence is already tokenized, with words separated by spaces.
```
15.4寸 笔记本 的 键盘 确实 爽 , 基本 跟 台式机 差不多 了 , 蛮 喜欢 数字 小 键盘 , 输 数字 特 方便 , 样子 也 很 美观 , 做工 也 相当 不错 1
跟 心灵 鸡汤 没 什么 本质 区别 嘛 , 至少 我 不 喜欢 这样 读 经典 , 把 经典 都 解读 成 这样 有点 去 中国 化 的 味道 了 0
```
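A minimal Python sketch of splitting one such line into the tokenized text and its label (illustrative only; the repository's reader performs the actual parsing):

```
def parse_senta_line(line):
    """Parse 'word word ... <TAB> label' into (list_of_words, int_label)."""
    text, label = line.rstrip("\n").split("\t")
    words = text.split()      # the sentence is already tokenized; words are space separated
    return words, int(label)  # label: 1 = positive, 0 = negative

words, label = parse_senta_line("跟 心灵 鸡汤 没 什么 本质 区别 嘛\t0")
```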
This guide provides a script that quickly converts the Chinese text in the dataset into a trainable format. After extracting the dataset, copy preprocess.py into the senta_data directory and run it; this converts the provided dev.tsv, test.tsv, and train.tsv into dev.txt, test.txt, and train.txt, which can be used directly for training.
```
cp ./data/preprocess.py ./senta_data/
cd senta_data/
python preprocess.py
```
### Training
Create directories for the training and test sets and move the data into them.
```
mkdir train
mv train.txt train
mkdir test
mv dev.txt test
cd ..
```
Open config.yaml and change the following parameters:
Change workspace to your current absolute path (you can get it with the pwd command).
Change the batch_size under data1 from 10 to 128.
Change the data_path under data1 to: {workspace}/senta_data/train
Change the batch_size under dataset_infer from 2 to 256.
Change the data_path under dataset_infer to: {workspace}/senta_data/test

Run the command to start training:
```
python -m paddlerec.run -m ./config.yaml
```
### Prediction
After training finishes, the model runs prediction on the validation set.

Output:
```
PaddleRec: Runner infer_runner Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment/14
batch: 1, acc: [0.91796875], loss: [0.2287855]
batch: 2, acc: [0.91796875], loss: [0.22827303]
batch: 3, acc: [0.90234375], loss: [0.27907994]
```
## Results Comparison

### Model Results (Test)

| Dataset | Model | loss | acc |
| :------------------: | :--------------------: | :---------: | :---------: |
| ag news dataset | TagSpace | 0.0199 | 0.9179 |
| ChnSentiCorp | Classification | 0.2282 | 0.9177 |
@@ -16,16 +16,21 @@ workspace: "models/contentunderstanding/tagspace"
dataset:
- name: sample_1
  type: DataLoader
  batch_size: 10
  data_path: "{workspace}/data/train_data"
  data_converter: "{workspace}/reader.py"
- name: inferdata
  type: DataLoader
  batch_size: 10
  data_path: "{workspace}/data/test_data"
  data_converter: "{workspace}/reader.py"

hyper_parameters:
  optimizer:
    class: Adagrad
    learning_rate: 0.001
  vocab_text_size: 75378
  vocab_tag_size: 4
  emb_dim: 10
  hid_dim: 1000
@@ -34,22 +39,34 @@ hyper_parameters:
  neg_size: 3
  num_devices: 1

mode: [runner1,infer_runner]
runner:
- name: runner1
  class: train
  epochs: 10
  device: cpu
  save_checkpoint_interval: 1
  save_inference_interval: 1
  save_checkpoint_path: "increment"
  save_inference_path: "inference"
  save_inference_feed_varnames: []
  save_inference_fetch_varnames: []
  phases: phase1
- name: infer_runner
  class: infer
  # device to run training or infer
  device: cpu
  print_interval: 1
  init_model_path: "increment/9" # load model path
  phases: phase_infer

phase:
- name: phase1
  model: "{workspace}/model.py"
  dataset_name: sample_1
  thread_num: 1
- name: phase_infer
  model: "{workspace}/model.py"
  dataset_name: inferdata
  thread_num: 1
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import six
import collections
import os
import csv
import re

if six.PY2:
    reload(sys)
    sys.setdefaultencoding('utf-8')

def word_count(column_num, input_file, word_freq=None):
    """
    compute word count from corpus
    """
    if word_freq is None:
        word_freq = collections.defaultdict(int)

    data_file = csv.reader(input_file)
    for row in data_file:
        for w in re.split(r'\W+', row[column_num].strip()):
            word_freq[w] += 1
    return word_freq


def build_dict(column_num=2, min_word_freq=0, train_dir="", test_dir=""):
    """
    Build a word dictionary from the corpus. Keys of the dictionary are words,
    and values are zero-based IDs of these words.
    """
    word_freq = collections.defaultdict(int)
    files = os.listdir(train_dir)
    for fi in files:
        with open(os.path.join(train_dir, fi), "r", encoding='utf-8') as f:
            word_freq = word_count(column_num, f, word_freq)
    files = os.listdir(test_dir)
    for fi in files:
        with open(os.path.join(test_dir, fi), "r", encoding='utf-8') as f:
            word_freq = word_count(column_num, f, word_freq)

    word_freq = [x for x in six.iteritems(word_freq) if x[1] > min_word_freq]
    word_freq_sorted = sorted(word_freq, key=lambda x: (-x[1], x[0]))
    words, _ = list(zip(*word_freq_sorted))
    word_idx = dict(list(zip(words, six.moves.range(len(words)))))
    return word_idx


def write_paddle(text_idx, tag_idx, train_dir, test_dir, output_train_dir,
                 output_test_dir):
    files = os.listdir(train_dir)
    if not os.path.exists(output_train_dir):
        os.mkdir(output_train_dir)
    for fi in files:
        with open(os.path.join(train_dir, fi), "r", encoding='utf-8') as f:
            with open(
                    os.path.join(output_train_dir, fi), "w",
                    encoding='utf-8') as wf:
                data_file = csv.reader(f)
                for row in data_file:
                    tag_raw = re.split(r'\W+', row[0].strip())
                    pos_index = tag_idx.get(tag_raw[0])
                    wf.write(str(pos_index) + ",")
                    text_raw = re.split(r'\W+', row[2].strip())
                    l = [text_idx.get(w) for w in text_raw]
                    for w in l:
                        wf.write(str(w) + " ")
                    wf.write("\n")

    files = os.listdir(test_dir)
    if not os.path.exists(output_test_dir):
        os.mkdir(output_test_dir)
    for fi in files:
        with open(os.path.join(test_dir, fi), "r", encoding='utf-8') as f:
            with open(
                    os.path.join(output_test_dir, fi), "w",
                    encoding='utf-8') as wf:
                data_file = csv.reader(f)
                for row in data_file:
                    tag_raw = re.split(r'\W+', row[0].strip())
                    pos_index = tag_idx.get(tag_raw[0])
                    wf.write(str(pos_index) + ",")
                    text_raw = re.split(r'\W+', row[2].strip())
                    l = [text_idx.get(w) for w in text_raw]
                    for w in l:
                        wf.write(str(w) + " ")
                    wf.write("\n")


def text2paddle(train_dir, test_dir, output_train_dir, output_test_dir,
                output_vocab_text, output_vocab_tag):
    print("start construct word dict")
    vocab_text = build_dict(2, 0, train_dir, test_dir)
    with open(output_vocab_text, "w", encoding='utf-8') as wf:
        wf.write(str(len(vocab_text)) + "\n")

    vocab_tag = build_dict(0, 0, train_dir, test_dir)
    with open(output_vocab_tag, "w", encoding='utf-8') as wf:
        wf.write(str(len(vocab_tag)) + "\n")

    print("construct word dict done\n")
    write_paddle(vocab_text, vocab_tag, train_dir, test_dir, output_train_dir,
                 output_test_dir)


train_dir = sys.argv[1]
test_dir = sys.argv[2]
output_train_dir = sys.argv[3]
output_test_dir = sys.argv[4]
output_vocab_text = sys.argv[5]
output_vocab_tag = sys.argv[6]
text2paddle(train_dir, test_dir, output_train_dir, output_test_dir,
            output_vocab_text, output_vocab_tag)
@@ -16,7 +16,6 @@ import paddle.fluid as fluid
import paddle.fluid.layers.nn as nn
import paddle.fluid.layers.tensor as tensor
import paddle.fluid.layers.control_flow as cf
from paddlerec.core.model import ModelBase
from paddlerec.core.utils import envs
@@ -98,14 +97,19 @@ class Model(ModelBase):
            tensor.fill_constant_batch_size_like(
                input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'),
            loss_part2)
        avg_cost = fluid.layers.mean(loss_part3)
        less = tensor.cast(cf.less_than(cos_neg, cos_pos), dtype='float32')
        label_ones = fluid.layers.fill_constant_batch_size_like(
            input=cos_neg, dtype='float32', shape=[-1, 1], value=1.0)
        correct = nn.reduce_sum(less)
        total = fluid.layers.reduce_sum(label_ones)
        acc = fluid.layers.elementwise_div(correct, total)
        self._cost = avg_cost

        if is_infer:
            self._infer_results["acc"] = acc
            self._infer_results["loss"] = self._cost
        else:
            self._metrics["acc"] = acc
            self._metrics["loss"] = self._cost
# tagspace Text Classification Model

Below is a brief directory structure and description for this example:

```
├── data                     # sample data
    ├── train_data
        ├── small_train.csv  # sample training data
    ├── test_data
        ├── small_test.csv   # sample test data
    ├── text2paddle.py       # data preprocessing script
├── __init__.py
├── README.md                # documentation
├── model.py                 # model file
├── config.yaml              # configuration file
├── reader.py                # data reader
```

Note: before reading this example, we recommend that you first read the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## Contents

- [Model Introduction](#model-introduction)
- [Data Preparation](#data-preparation)
- [Runtime Environment](#runtime-environment)
- [Quick Start](#quick-start)
- [Reproducing Results](#reproducing-results)
- [Advanced Usage](#advanced-usage)
- [FAQ](#faq)
## Model Introduction
The TagSpace model learns to tag text: it learns a mapping from short texts to relevant topic tags. The paper uses a CNN to build the document vector and optimizes the distance between f(w,t+) and f(w,t-) as the objective, so that tags and documents end up as vectors in a shared feature space, which can then be used to find hashtags for a document.

The network structure from the paper [TAGSPACE: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) is shown below: an input layer, a convolution layer, a pooling layer, and a final fully connected layer for dimensionality reduction.
<p align="center">
<img align="center" src="../../../doc/imgs/tagspace.png">
<p>
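To make the objective concrete, here is a small NumPy sketch of a pairwise margin ranking loss on the scores of a positive tag f(w,t+) and a negative tag f(w,t-), in the spirit of the description above (illustrative only — the margin value is assumed, and the repository's actual implementation lives in model.py):

```
import numpy as np

def tagspace_rank_loss(cos_pos, cos_neg, margin=0.1):
    """Penalize pairs where the positive tag does not beat the negative tag by `margin`."""
    cos_pos = np.asarray(cos_pos, dtype=np.float32)
    cos_neg = np.asarray(cos_neg, dtype=np.float32)
    loss = np.maximum(0.0, margin + cos_neg - cos_pos)  # hinge on the score gap
    acc = np.mean(cos_neg < cos_pos)                     # fraction of correctly ranked pairs
    return loss.mean(), acc

loss, acc = tagspace_rank_loss([0.8, 0.4], [0.2, 0.5])
```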
## Data Preparation
[Dataset](https://github.com/mhjabreel/CharCNN/tree/master/data/), [backup dataset](https://paddle-tagspace.bj.bcebos.com/data.tar)

The data format is as follows:
```
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
```
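Each row holds three fields: the class index, the title, and the description. A minimal Python sketch of reading the raw rows (illustrative only; the shipped text2paddle.py performs the real conversion to id format):

```
import csv

with open("raw_big_train_data/train.csv", newline="") as f:
    for row in csv.reader(f):
        tag, title, description = row[0], row[1], row[2]
        # text2paddle.py builds its tag vocabulary from column 0
        # and its text vocabulary from the description in column 2.
```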
## Runtime Environment
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os: windows/linux/macos

## Quick Start
This example ships with sample data so that you can try it out quickly. Run the following command directly from the paddlerec directory to start training:
```
python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
```
## Reproducing Results
To help users run every model quickly, we provide sample data under each model directory. To reproduce the results in this README, follow the steps below in order.
1. Make sure your current directory is PaddleRec/models/contentunderstanding/tagspace
2. Download and extract the dataset in the data directory with the following commands:
```
cd data
wget https://paddle-tagspace.bj.bcebos.com/data.tar
tar -xvf data.tar
```
3. This example provides a script that quickly converts the raw text in the dataset into a trainable format. After extracting the dataset, place the raw data in the raw_big_train_data and raw_big_test_data directories and run the provided text2paddle.py under Python 3 to generate the train_big_data and test_big_data directories, which can be used directly for training. The commands are as follows:
```
mkdir raw_big_train_data
mkdir raw_big_test_data
mv train.csv raw_big_train_data
mv test.csv raw_big_test_data
python3 text2paddle.py raw_big_train_data/ raw_big_test_data/ train_big_data test_big_data big_vocab_text.txt big_vocab_tag.txt
```
The data directory after running the script:
```
big_vocab_tag.txt    # tag vocabulary size
big_vocab_text.txt   # text vocabulary size
data.tar             # downloaded dataset archive
raw_big_train_data   # raw training set from the dataset
raw_big_test_data    # raw test set from the dataset
train_data           # sample training set
test_data            # sample test set
train_big_data       # processed training set
test_big_data        # processed test set
text2paddle.py       # preprocessing script
```
The processed data format is as follows:
```
2,27 7062 8390 456 407 8 11589 3166 4 7278 31046 33 3898 2897 426 1
2,27 9493 836 355 20871 300 81 19 3 4125 9 449 462 13832 6 16570 1380 2874 5 0 797 236 19 3688 2106 14 8615 7 209 304 4 0 123 1
2,27 12754 637 106 3839 1532 66 0 379 6 0 1246 9 307 33 161 2 8100 36 0 350 123 101 74 181 0 6657 4 0 1222 17195 1
```
4. Return to the tagspace directory and open config.yaml to change the following parameters:
Change workspace to your current absolute path (you can get it with the pwd command).
Change the batch_size of sample_1 under dataset from 10 to 128.
Change the data_path of sample_1 under dataset to: {workspace}/data/train_big_data
Change the batch_size of inferdata under dataset from 10 to 500.
Change the data_path of inferdata under dataset to: {workspace}/data/test_big_data
5. Run the command to start training:
```
python -m paddlerec.run -m ./config.yaml
```
6. Output:
```
PaddleRec: Runner infer_runner Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment/9
batch: 1, acc: [0.91], loss: [0.02495437]
batch: 2, acc: [0.936], loss: [0.01941476]
batch: 3, acc: [0.918], loss: [0.02116447]
batch: 4, acc: [0.916], loss: [0.0219945]
batch: 5, acc: [0.902], loss: [0.02242816]
batch: 6, acc: [0.9], loss: [0.02421589]
batch: 7, acc: [0.9], loss: [0.026441]
batch: 8, acc: [0.934], loss: [0.01797657]
batch: 9, acc: [0.932], loss: [0.01687362]
batch: 10, acc: [0.926], loss: [0.02047823]
batch: 11, acc: [0.918], loss: [0.01998716]
batch: 12, acc: [0.898], loss: [0.0229556]
batch: 13, acc: [0.928], loss: [0.01736144]
batch: 14, acc: [0.93], loss: [0.01911209]
```
## Advanced Usage

## FAQ
@@ -17,50 +17,52 @@ workspace: "models/match/dssm"
dataset:
- name: dataset_train
  batch_size: 8
  type: DataLoader # or QueueDataset
  data_path: "{workspace}/data/train"
  data_converter: "{workspace}/synthetic_reader.py"
- name: dataset_infer
  batch_size: 1
  type: DataLoader # or QueueDataset
  data_path: "{workspace}/data/test"
  data_converter: "{workspace}/synthetic_evaluate_reader.py"

hyper_parameters:
  optimizer:
    class: sgd
    learning_rate: 0.001
    strategy: async
  trigram_d: 1439
  neg_num: 1
  fc_sizes: [300, 300, 128]
  fc_acts: ['tanh', 'tanh', 'tanh']

mode: [train_runner,infer_runner]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: train_runner
  class: train
  # num of epochs
  epochs: 3
  # device to run training or infer
  device: cpu
  save_checkpoint_interval: 1 # save model interval of epochs
  save_inference_interval: 1 # save inference
  save_checkpoint_path: "increment" # save checkpoint path
  save_inference_path: "inference" # save inference path
  save_inference_feed_varnames: ["query", "doc_pos"] # feed vars of save inference
  save_inference_fetch_varnames: ["cos_sim_0.tmp_0"] # fetch vars of save inference
  init_model_path: "" # load model path
  print_interval: 2
  phases: phase1
- name: infer_runner
  class: infer
  # device to run training or infer
  device: cpu
  print_interval: 1
  init_model_path: "increment/2" # load model path
  phases: phase2

# runner will run all the phase in each epoch
phase:
@@ -68,7 +70,7 @@ phase:
- name: phase1
  model: "{workspace}/model.py" # user-defined model
  dataset_name: dataset_train # select dataset by name
  thread_num: 1
- name: phase2
  model: "{workspace}/model.py" # user-defined model
  dataset_name: dataset_infer # select dataset by name
  thread_num: 1
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#encoding=utf-8
import os
import sys
import numpy as np
import random
f = open("./zhidao", "r")
lines = f.readlines()
f.close()

# build the vocabulary
word_dict = {}
for line in lines:
    line = line.strip().split("\t")
    text = line[0].split(" ") + line[1].split(" ")
    for word in text:
        if word in word_dict:
            continue
        else:
            word_dict[word] = len(word_dict) + 1

f = open("./zhidao", "r")
lines = f.readlines()
f.close()

lines = [line.strip().split("\t") for line in lines]

# build a dict keyed by query whose values are the negative examples
neg_dict = {}
for line in lines:
    if line[2] == "0":
        if line[0] in neg_dict:
            neg_dict[line[0]].append(line[1])
        else:
            neg_dict[line[0]] = [line[1]]

# build a dict keyed by query whose values are the positive examples
pos_dict = {}
for line in lines:
    if line[2] == "1":
        if line[0] in pos_dict:
            pos_dict[line[0]].append(line[1])
        else:
            pos_dict[line[0]] = [line[1]]

# split into training and test sets
query_list = list(pos_dict.keys())
#print(len(query))
random.shuffle(query_list)
train_query = query_list[:90]
test_query = query_list[90:]

# build the training set
train_set = []
for query in train_query:
    for pos in pos_dict[query]:
        if query not in neg_dict:
            continue
        for neg in neg_dict[query]:
            train_set.append([query, pos, neg])
random.shuffle(train_set)

# build the test set
test_set = []
for query in test_query:
    for pos in pos_dict[query]:
        test_set.append([query, pos, 1])
    if query not in neg_dict:
        continue
    for neg in neg_dict[query]:
        test_set.append([query, neg, 0])
random.shuffle(test_set)

# convert query, pos, and neg in the training set to bag-of-words vectors
f = open("train.txt", "w")
for line in train_set:
    query = line[0].strip().split(" ")
    pos = line[1].strip().split(" ")
    neg = line[2].strip().split(" ")
    query_token = [0] * (len(word_dict) + 1)
    for word in query:
        query_token[word_dict[word]] = 1
    pos_token = [0] * (len(word_dict) + 1)
    for word in pos:
        pos_token[word_dict[word]] = 1
    neg_token = [0] * (len(word_dict) + 1)
    for word in neg:
        neg_token[word_dict[word]] = 1
    f.write(','.join([str(x) for x in query_token]) + "\t" + ','.join([
        str(x) for x in pos_token
    ]) + "\t" + ','.join([str(x) for x in neg_token]) + "\n")
f.close()

# convert query and pos in the test set to bag-of-words vectors
f = open("test.txt", "w")
fa = open("label.txt", "w")
for line in test_set:
    query = line[0].strip().split(" ")
    pos = line[1].strip().split(" ")
    label = line[2]
    query_token = [0] * (len(word_dict) + 1)
    for word in query:
        query_token[word_dict[word]] = 1
    pos_token = [0] * (len(word_dict) + 1)
    for word in pos:
        pos_token[word_dict[word]] = 1
    f.write(','.join([str(x) for x in query_token]) + "\t" + ','.join(
        [str(x) for x in pos_token]) + "\n")
    fa.write(str(label) + "\n")
f.close()
fa.close()
@@ -73,6 +73,7 @@ class Model(ModelBase):
        query_fc = fc(inputs[0], self.hidden_layers, self.hidden_acts,
                      ['query_l1', 'query_l2', 'query_l3'])

        doc_pos_fc = fc(inputs[1], self.hidden_layers, self.hidden_acts,
                        ['doc_pos_l1', 'doc_pos_l2', 'doc_pos_l3'])
        R_Q_D_p = fluid.layers.cos_sim(query_fc, doc_pos_fc)
@@ -93,7 +94,7 @@ class Model(ModelBase):
        prob = fluid.layers.softmax(concat_Rs, axis=1)
        hit_prob = fluid.layers.slice(
            prob, axes=[0, 1], starts=[0, 0], ends=[8, 1])
        loss = -fluid.layers.reduce_sum(fluid.layers.log(hit_prob))
        avg_cost = fluid.layers.mean(x=loss)
        self._cost = avg_cost
# DSSM Text Matching Model

Below is a brief directory structure and description for this example:

```
├── data                           # sample data
    ├── train
        ├── train.txt              # sample training data
    ├── test
        ├── test.txt               # sample test data
    ├── preprocess.py              # data preprocessing script
├── __init__.py
├── README.md                      # documentation
├── model.py                       # model file
├── config.yaml                    # configuration file
├── synthetic_reader.py            # reader for the training set
├── synthetic_evaluate_reader.py   # reader for the test set
├── transform.py                   # reorganizes the output into a format convenient for computing metrics
├── run.sh                         # training script for the full dataset: training, prediction, and metric computation
```

Note: before reading this example, we recommend that you first read the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## Contents

- [Model Introduction](#model-introduction)
- [Data Preparation](#data-preparation)
- [Runtime Environment](#runtime-environment)
- [Quick Start](#quick-start)
- [Reproducing Results](#reproducing-results)
- [Advanced Usage](#advanced-usage)
- [FAQ](#faq)
## Model Introduction
DSSM stands for Deep Structured Semantic Model, a semantic model based on deep networks. Its core idea is to map the query and the doc into a semantic space of the same dimensionality and to train the latent semantic model by maximizing the cosine similarity between the query and doc semantic vectors, which serves the goal of retrieval. DSSM has broad applications, for example search engine retrieval, ad relevance, question answering, and machine translation.

DSSM takes BOW (bag of words) input, which discards word-position information: all the words of a sentence are simply thrown into one bag. Each sentence converted in this way becomes a single vector fed into the DNN.

The semantic similarity of a query and a doc is expressed as the cosine distance between their two vectors, and a softmax function then selects the doc sample most semantically similar to the query.

For the details of the model, see the [DSSM](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) paper:
<p align="center">
<img align="center" src="../../../doc/imgs/dssm.png">
<p>
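As a rough NumPy sketch of the scoring described above (illustrative only; the real network is defined in model.py): cosine similarity compares the query vector with the positive and negative doc vectors, a softmax over those scores gives the probability of the positive doc, and its negative log is the loss.

```
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def dssm_loss(query_vec, pos_vec, neg_vecs):
    """Softmax over [positive, negatives] cosine scores; loss = -log P(positive)."""
    scores = np.array([cosine(query_vec, pos_vec)] +
                      [cosine(query_vec, n) for n in neg_vecs])
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
q, p, negs = rng.normal(size=128), rng.normal(size=128), [rng.normal(size=128)]
print(dssm_loss(q, p, negs))
```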
## Data Preparation
We have released our in-house evaluation sets, covering the Baidu Zhidao, ECOM, QQSIM, and UNICOM datasets. Here we use the Baidu Zhidao dataset for training. Run the following commands to download the data.
```
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/simnet_dataset-1.0.0.tar.gz
tar xzf simnet_dataset-1.0.0.tar.gz
rm simnet_dataset-1.0.0.tar.gz
```
## Runtime Environment
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os: windows/linux/macos

## Quick Start
This example ships with sample data so that you can try it out quickly. Run the following command from the paddlerec directory to start training:
```
python -m paddlerec.run -m models/match/dssm/config.yaml
```
Sample output:
```
PaddleRec: Runner train_runner Begin
Executor Mode: train
processor_register begin
Running SingleInstance.
Running SingleNetwork.
file_list : ['models/match/dssm/data/train/train.txt']
Running SingleStartup.
Running SingleRunner.
!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list.
CPU_NUM indicates that how many CPUPlace are used in the current task.
And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.
export CPU_NUM=32 # for example, set CPU_NUM as number of physical CPU core which is 32.
!!! The default number of CPU_NUM=1.
I0821 06:56:26.224299 31061 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0821 06:56:26.231163 31061 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0821 06:56:26.237023 31061 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0821 06:56:26.240788 31061 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
batch: 2, LOSS: [4.538238]
batch: 4, LOSS: [4.16424]
batch: 6, LOSS: [3.8121371]
batch: 8, LOSS: [3.4250507]
batch: 10, LOSS: [3.2285979]
batch: 12, LOSS: [3.2116117]
batch: 14, LOSS: [3.1406002]
epoch 0 done, use time: 0.357971906662, global metrics: LOSS=[3.0968776]
batch: 2, LOSS: [2.6843479]
batch: 4, LOSS: [2.546976]
batch: 6, LOSS: [2.4103594]
batch: 8, LOSS: [2.301374]
batch: 10, LOSS: [2.264183]
batch: 12, LOSS: [2.315862]
batch: 14, LOSS: [2.3409634]
epoch 1 done, use time: 0.22123003006, global metrics: LOSS=[2.344321]
batch: 2, LOSS: [2.0882485]
batch: 4, LOSS: [2.006743]
batch: 6, LOSS: [1.9231766]
batch: 8, LOSS: [1.8850241]
batch: 10, LOSS: [1.8829436]
batch: 12, LOSS: [1.9336565]
batch: 14, LOSS: [1.9784685]
epoch 2 done, use time: 0.212922096252, global metrics: LOSS=[1.9934461]
PaddleRec Finish
```
## Reproducing Results
To help users run every model quickly, we provide sample data under each model directory. To reproduce the results in this README, follow the steps below in order.
1. Make sure your current directory is PaddleRec/models/match/dssm
2. Download and extract the dataset in the data directory with the following commands:
```
cd data
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/simnet_dataset-1.0.0.tar.gz
tar xzf simnet_dataset-1.0.0.tar.gz
rm simnet_dataset-1.0.0.tar.gz
```
3. This example provides a script that quickly converts the Chinese text in the dataset into a trainable format. After extracting the dataset you will find a file named zhidao in the directory. Run the provided preprocess.py under Python 3 to generate test.txt, train.txt, and label.txt, which can be used directly for training, then move them into the train and test directories so they can be picked up during training. The commands are as follows:
```
mv data/zhidao ./
rm -rf data
python3 preprocess.py
rm -f ./train/train.txt
mv train.txt ./train
rm -f ./test/test.txt
mv test.txt test
cd ..
```
The preprocessed format:
The training set consists of three sparse BOW vectors: query, pos, neg
The test set consists of two sparse BOW vectors: query, pos
label.txt holds the labels corresponding to the test set

4. Return to the dssm directory and open config.yaml to change the following parameters:
Change workspace to your current absolute path (you can get it with the pwd command).
Change batch_size in dataset_train from 8 to 128.
In model.py, change hit_prob = fluid.layers.slice(prob, axes=[0, 1], starts=[0, 0], ends=[8, 1])
to hit_prob = fluid.layers.slice(prob, axes=[0, 1], starts=[0, 0], ends=[128, 1]). Whenever you change the batch size, the first value in ends must change with it.
5. Run the script to start training. The script runs python -m paddlerec.run -m ./config.yaml to train, writes the output to the result file, then runs transform.py to reorganize the data, and finally computes the positive/negative order ratio metric:
```
sh run.sh
```
Sample output:
```
................run.................
!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list.
CPU_NUM indicates that how many CPUPlace are used in the current task.
And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.
export CPU_NUM=32 # for example, set CPU_NUM as number of physical CPU core which is 32.
!!! The default number of CPU_NUM=1.
I0821 07:16:04.512531 32200 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0821 07:16:04.515708 32200 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0821 07:16:04.518872 32200 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0821 07:16:04.520995 32200 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
75
pnr: 2.25581395349
query_num: 11
pair_num: 184 184
equal_num: 44
正序率: 0.692857142857
97 43
```
6. Note: because a fairly small dataset is used for training and testing, the metrics can fluctuate considerably. If the metrics you obtain are not as expected, simply repeat step 5 a few times to get reasonable values.
## Advanced Usage

## FAQ
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/bin/bash
echo "................run................."
python -m paddlerec.run -m ./config.yaml >result1.txt
grep -i "query_doc_sim" ./result1.txt >./result2.txt
sed '$d' result2.txt >result.txt
rm -f result1.txt
rm -f result2.txt
python transform.py
sort -t $'\t' -k1,1 -k 2nr,2 pair.txt >result.txt
rm -f pair.txt
python ../../../tools/cal_pos_neg.py result.txt
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import numpy as np
import sklearn.metrics

label = []
filename = './data/label.txt'
f = open(filename, "r")
f.readline()
num = 0
for line in f.readlines():
    num = num + 1
    line = line.strip()
    label.append(line)
f.close()
print(num)

filename = './result.txt'
sim = []
for line in open(filename):
    line = line.strip().split(",")
    line[1] = line[1].split(":")
    line = line[1][1].strip(" ")
    line = line.strip("[")
    line = line.strip("]")
    sim.append(float(line))

filename = './data/test/test.txt'
f = open(filename, "r")
f.readline()
query = []
for line in f.readlines():
    line = line.strip().split("\t")
    query.append(line[0])
f.close()

filename = 'pair.txt'
f = open(filename, "w")
for i in range(len(sim)):
    f.write(str(query[i]) + "\t" + str(sim[i]) + "\t" + str(label[i]) + "\n")
f.close()
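run.sh then sorts pair.txt by query and score and passes it to tools/cal_pos_neg.py to obtain the positive/negative order ratio (pnr). A simplified, hypothetical sketch of that metric — counting correctly ordered versus inverted answer pairs per query; the shipped tool also reports ties separately — could look like this:

```
from collections import defaultdict

def positive_negative_ratio(pairs):
    """pairs: iterable of (query, score, label); returns #correctly-ordered / #inverted pairs."""
    by_query = defaultdict(list)
    for query, score, label in pairs:
        by_query[query].append((float(score), int(label)))
    pos, neg = 0, 0
    for items in by_query.values():
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (s1, l1), (s2, l2) = items[i], items[j]
                if l1 == l2:
                    continue
                if (s1 - s2) * (l1 - l2) > 0:
                    pos += 1
                elif (s1 - s2) * (l1 - l2) < 0:
                    neg += 1
    return pos / neg if neg else float("inf")
```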
# match-pyramid Text Matching Model

Below is a brief directory structure and description for this example:
```
├── data                          # sample data
    ├── process.py                # data processing script
    ├── relation.test.fold1.txt   # relation file used when computing the evaluation metric
    ├── train
        ├── train.txt             # sample training data
    ├── test
        ├── test.txt              # sample test data
├── __init__.py
├── README.md                     # documentation
├── model.py                      # model file
├── config.yaml                   # configuration file
├── data_process.sh               # data download and processing script
├── eval.py                       # evaluation program that computes the metric
├── run.sh                        # one-click run script
├── test_reader.py                # test set reader
├── train_reader.py               # training set reader
```

Note: before reading this example, we recommend that you first read the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## Contents

- [Model Introduction](#model-introduction)
- [Data Preparation](#data-preparation)
- [Runtime Environment](#runtime-environment)
- [Quick Start](#quick-start)
- [Reproducing the Paper](#reproducing-the-paper)
- [Advanced Usage](#advanced-usage)
- [FAQ](#faq)
## Model Introduction
Matching two texts is a fundamental problem in many natural language processing tasks. An effective approach is to extract meaningful matching patterns from words, phrases, and sentences to produce a matching score. Inspired by the success of convolutional neural networks in image recognition, where neurons capture complex patterns from elementary visual patterns such as oriented edges and corners, we model text matching as an image recognition problem. This implementation is aligned with the TensorFlow code open-sourced by the original author Liang Pang: https://github.com/pl8787/MatchPyramid-TensorFlow/blob/master/model/model_mp.py, and implements the Match-Pyramid model proposed in the following paper:
```text
@@ -19,8 +55,23 @@
3. Relation files: relation files store the relationship between two sentences, e.g., between a query and a document. Examples: relation.train.fold1.txt, relation.test.fold1.txt
4. Embedding file: the pre-trained word vectors are stored in the embedding file. Example: embed_wiki-pdc_d50_norm
## Runtime Environment
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os: windows/linux/macos

## Quick Start
This example ships with sample data so that you can try it out quickly. Run the following command directly from the paddlerec directory to start training:
```
python -m paddlerec.run -m models/match/match-pyramid/config.yaml
```

## Reproducing the Paper
1. Make sure your current directory is PaddleRec/models/match/match-pyramid
2. This example provides a script to download the original dataset and generate the training and test data in one step. Simply run: bash data_process.sh
The script downloads the Letor07 dataset from a mirror server in China, removes the existing relation.test.fold1.txt and relation.train.fold1.txt from the data folder, and extracts the full dataset into the data folder. It then runs process.py to place the full training data in `./data/train` and the full test data in `./data/test`, and generates the embedding.npy file used to initialize the embedding layer.
The expected output of the script is:
```
@@ -69,9 +120,11 @@ data/embed_wiki-pdc_d50_norm
[./data/relation.test.fold1.txt]
Instance size: 13652
```
3. Open config.yaml and change its parameters:
Change workspace to your current absolute path (you can get it with the pwd command).
4. Then simply run bash run.sh to reproduce the results reported in the paper.
After it starts, the script runs python -m paddlerec.run -m ./config.yaml to train and test the model, saves the test results to result.txt, and finally runs eval.py to compute the MAP metric of the data.
The expected output of the script is:
```
@@ -80,15 +133,6 @@ data/embed_wiki-pdc_d50_norm
336
('map=', 0.420878322843591)
```
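eval.py reports the MAP (mean average precision) over the test relations. A simplified standalone sketch of that metric (hypothetical; not the shipped eval.py):

```
def average_precision(ranked_labels):
    """ranked_labels: 1/0 relevance labels of one query's docs, sorted by predicted score."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """queries: list of ranked label lists, one per query."""
    return sum(average_precision(q) for q in queries) / len(queries)

print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))  # ~0.67
```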
## Advanced Usage

## FAQ
#!/bin/bash
echo "................run................."
python -m paddlerec.run -m ./config.yaml >result1.txt
grep -i "prediction" ./result1.txt >./result.txt
rm -f result1.txt
python eval.py
# Matching Model Zoo

## Introduction
We provide PaddleRec implementations of the models commonly used in matching tasks, together with single-machine training & prediction quality metrics and distributed training & prediction performance metrics. The implemented models include [DSSM](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/dssm), [MultiView-Simnet](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/multiview-simnet), and [match-pyramid](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/match-pyramid).

More models are being added to the library continuously; stay tuned.
@@ -18,6 +18,8 @@
| :------------------: | :--------------------: | :---------: |
| DSSM | Deep Structured Semantic Models | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) |
| MultiView-Simnet | Multi-view Simnet for Personalized recommendation | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) |
| match-pyramid | Text Matching as Image Recognition | [arXiv W2016][Text Matching as Image Recognition](https://arxiv.org/pdf/1602.06359.pdf) |
Below is a brief introduction to each model (note: the figures are taken from the linked papers).
@@ -31,24 +33,26 @@
<img align="center" src="../../doc/imgs/multiview-simnet.png">
<p>
[match-pyramid](https://arxiv.org/pdf/1602.06359.pdf):
<p align="center">
<img align="center" src="../../doc/imgs/match-pyramid.png">
<p>

## Tutorial (Quick Start)
### Training & Prediction
Each model ships with sample data so that you can try it out quickly. Run the following commands directly from the paddlerec directory to start training:
```
python -m paddlerec.run -m models/match/dssm/config.yaml # dssm
python -m paddlerec.run -m models/match/multiview-simnet/config.yaml # multiview-simnet
python -m paddlerec.run -m models/match/match-pyramid/config.yaml # match-pyramid
```
### Reproducing Results
Each model's README contains a detailed tutorial for reproducing its results; see the corresponding model directory for details.
### Model Results (Test)

| Dataset | Model | Positive/negative order ratio | map |
| :------------------: | :--------------------: | :---------: | :---------: |
| zhidao | DSSM | 2.25 | -- |
| Letor07 | match-pyramid | -- | 0.42 |
| zhidao | multiview-simnet | 1.72 | -- |