Unverified commit 260d18aa, authored by yinhaofeng, committed by GitHub

add tagspace model and readme (#183)

* add tagspace model and readme

* change
Co-authored-by: Ntangwei12 <tangwei12@baidu.com>
Parent 14a31629
......@@ -22,12 +22,12 @@
| Model | Description | Paper |
| :------------------: | :--------------------: | :---------: |
| TagSpace | Tag recommendation | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://research.fb.com/publications/tagspace-semantic-embeddings-from-hashtags/) |
| TagSpace | Tag recommendation | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
| Classification | Text classification | [EMNLP 2014][Convolutional neural networks for sentence classification](https://www.aclweb.org/anthology/D14-1181.pdf) |
Below is a brief introduction to each model (note: the figures are taken from the linked papers).
[TagSpace model](https://research.fb.com/publications/tagspace-semantic-embeddings-from-hashtags)
[TagSpace model](https://www.aclweb.org/anthology/D14-1194.pdf)
<p align="center">
<img align="center" src="../../doc/imgs/tagspace.png">
<p>
......@@ -37,89 +37,173 @@
<img align="center" src="../../doc/imgs/cnn-ckim2014.png">
<p>
##Tutorial (Quick Start)
## Tutorial (Quick Start)
```
git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
cd paddle-rec
cd PaddleRec
python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
python -m paddlerec.run -m models/contentunderstanding/classification/config.yaml
```
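## Tutorial (Reproducing the Paper)
###Note
### Note
So that users can get each model running quickly, sample data is provided with every model. To reproduce the results reported in this README, use the scripts below to download and preprocess the corresponding datasets.
### Data Processing
**(1)TagSpace**
### Data Processing
[Data](https://github.com/mhjabreel/CharCNN/tree/master/data/) , [backup data](https://paddle-tagspace.bj.bcebos.com/data.tar)
The data format is as follows: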
```
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
```
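After extracting the data, convert the text data to Paddle data; first place the data in the training and test data directories.
This example provides a script that quickly converts the raw text in the dataset into a trainable format. After extracting the dataset, place the raw data in the two directories raw_big_train_data and raw_big_test_data, then run the provided text2paddle.py in a Python 3 environment. This generates the directories train_big_data and test_big_data, which can be used directly for training. The commands are as follows: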
```
mkdir raw_big_train_data
mkdir raw_big_test_data
mv train.csv raw_big_train_data
mv test.csv raw_big_test_data
python3 text2paddle.py raw_big_train_data/ raw_big_test_data/ train_big_data test_big_data big_vocab_text.txt big_vocab_tag.txt
```
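Run the script text2paddle.py to generate the Paddle input format.
The data directory after running:
```
big_vocab_tag.txt #tag vocabulary size
big_vocab_text.txt #text vocabulary size
data.tar #dataset archive
raw_big_train_data #original training set from the dataset
raw_big_test_data #original test set from the dataset
train_data #sample training set
test_data #sample test set
train_big_data #processed training set
test_big_data #processed test set
text2paddle.py #preprocessing script
```
The processed data has the following format: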
```
python text2paddle.py raw_big_train_data/ raw_big_test_data/ train_big_data test_big_data big_vocab_text.txt big_vocab_tag.txt
2,27 7062 8390 456 407 8 11589 3166 4 7278 31046 33 3898 2897 426 1
2,27 9493 836 355 20871 300 81 19 3 4125 9 449 462 13832 6 16570 1380 2874 5 0 797 236 19 3688 2106 14 8615 7 209 304 4 0 123 1
2,27 12754 637 106 3839 1532 66 0 379 6 0 1246 9 307 33 161 2 8100 36 0 350 123 101 74 181 0 6657 4 0 1222 17195 1
```
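The processed lines above pair a tag ID with a sequence of word IDs. As a rough illustration of how such a line can be produced, the sketch below follows the same idea as text2paddle.py (split on non-word characters, assign zero-based IDs by descending frequency with alphabetical tie-breaking). The helper names `build_vocab` and `encode_row` and the two toy rows are assumptions made for this example, and the output omits the trailing space that the real script writes after each ID.

```python
import collections
import re


def build_vocab(texts):
    """Map each token to a zero-based ID: most frequent first, ties alphabetical."""
    freq = collections.Counter()
    for text in texts:
        freq.update(re.split(r"\W+", text.strip()))
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ordered)}


def encode_row(tag, text, tag_vocab, text_vocab):
    """Render one example as '<tag_id>,<word_id> <word_id> ...'."""
    ids = [text_vocab[w] for w in re.split(r"\W+", text.strip())]
    return "{},{}".format(tag_vocab[tag], " ".join(str(i) for i in ids))


# Toy rows (tag, text); these are illustrative, not taken from the dataset.
rows = [("3", "stocks rise as oil falls"), ("4", "oil prices rise again")]
text_vocab = build_vocab(text for _, text in rows)
tag_vocab = build_vocab(tag for tag, _ in rows)
lines = [encode_row(tag, text, tag_vocab, text_vocab) for tag, text in rows]
for line in lines:
    print(line)
```

Because IDs are assigned by frequency, common words receive small IDs, which is consistent with the many small numbers in the processed sample.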
### 训练
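Return to the tagspace directory, open config.yaml, and change the following parameters:
Set workspace to your current absolute path (you can get it with the pwd command).
Change batch_size of sample_1 under dataset from 10 to 128.
Change data_path of sample_1 under dataset to: {workspace}/data/train_big_data
Change batch_size of inferdata under dataset from 10 to 500.
Change data_path of inferdata under dataset to: {workspace}/data/test_big_data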
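To double-check the paths before training, it can help to see what the `{workspace}` placeholder is expected to expand to. The snippet below is only an illustration: `resolve_path` is a hypothetical helper and the workspace path is an assumed example, not PaddleRec's own substitution code.

```python
def resolve_path(template, workspace):
    """Hypothetical helper: expand the {workspace} placeholder in a config path."""
    return template.replace("{workspace}", workspace.rstrip("/"))


# Assumed example location; use the output of `pwd` on your machine instead.
workspace = "/home/user/PaddleRec/models/contentunderstanding/tagspace"

train_path = resolve_path("{workspace}/data/train_big_data", workspace)
test_path = resolve_path("{workspace}/data/test_big_data", workspace)
print(train_path)
print(test_path)
```

If either printed directory does not exist after preprocessing, the data_path entries or the workspace value probably need another look.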
Run the following command to start training:
```
cd models/contentunderstanding/tagspace
python -m paddlerec.run -m ./config.yaml # after customizing the hyperparameters, pass the config file to use the custom configuration
python -m paddlerec.run -m ./config.yaml
```
### Prediction
After training finishes, the model starts predicting on the validation set.
Run results:
```
PaddleRec: Runner infer_runner Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment/9
batch: 1, acc: [0.91], loss: [0.02495437]
batch: 2, acc: [0.936], loss: [0.01941476]
batch: 3, acc: [0.918], loss: [0.02116447]
batch: 4, acc: [0.916], loss: [0.0219945]
batch: 5, acc: [0.902], loss: [0.02242816]
batch: 6, acc: [0.9], loss: [0.02421589]
batch: 7, acc: [0.9], loss: [0.026441]
batch: 8, acc: [0.934], loss: [0.01797657]
batch: 9, acc: [0.932], loss: [0.01687362]
batch: 10, acc: [0.926], loss: [0.02047823]
batch: 11, acc: [0.918], loss: [0.01998716]
batch: 12, acc: [0.898], loss: [0.0229556]
batch: 13, acc: [0.928], loss: [0.01736144]
batch: 14, acc: [0.93], loss: [0.01911209]
```
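The log reports accuracy and loss per batch rather than a single overall figure. A minimal sketch for averaging them, assuming the log lines keep exactly the `batch: N, acc: [x], loss: [y]` shape shown above (the function name `summarize` is made up for this example):

```python
import re

# Matches lines such as: batch: 1, acc: [0.91], loss: [0.02495437]
LOG_LINE = re.compile(r"batch: (\d+), acc: \[([0-9.]+)\], loss: \[([0-9.]+)\]")


def summarize(log_text):
    """Return (mean accuracy, mean loss) over all batch lines found in the log."""
    accs, losses = [], []
    for match in LOG_LINE.finditer(log_text):
        accs.append(float(match.group(2)))
        losses.append(float(match.group(3)))
    if not accs:
        raise ValueError("no batch lines found in log")
    return sum(accs) / len(accs), sum(losses) / len(losses)


sample = """batch: 1, acc: [0.91], loss: [0.02495437]
batch: 2, acc: [0.936], loss: [0.01941476]
batch: 3, acc: [0.918], loss: [0.02116447]"""

mean_acc, mean_loss = summarize(sample)
print("mean acc: %.4f, mean loss: %.5f" % (mean_acc, mean_loss))
```

Averaging per-batch values equals the dataset-level value only when all batches have the same size, so a smaller final batch would be slightly over-weighted.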
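# Edit the model's config.yaml and set workspace to the absolute path of the current directory
# Edit the model's config.yaml and set mode to infer_runner
# Example: mode: train_runner -> mode: infer_runner
# In infer_runner, set class to: class: infer
# Edit the configuration of the infer phase, following the comments in the config
# After editing config.yaml, run: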
python -m paddlerec.run -m ./config.yaml
**(2)Classification**
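### Data Processing
Sentiment Classification (Senta for short) takes Chinese text containing subjective descriptions, automatically determines its sentiment polarity, and gives a corresponding confidence score. Sentiment is classified as positive or negative. Sentiment analysis can help companies understand consumer habits, analyze trending topics, and monitor public opinion during crises, providing useful decision support.
Sentiment is a high-level form of human intelligence, and identifying the sentiment of a text requires deep semantic modeling. In addition, sentiment is expressed differently across domains (such as dining and sports), so model training needs large-scale data covering many domains. We address these two problems with deep-learning-based semantic models and large-scale data mining. Results are evaluated on the open-source sentiment classification dataset ChnSentiCorp.
You can run the following commands to download the dataset, which has already been word-segmented. After extraction, the senta_data directory contains the training data (train.tsv), development data (dev.tsv), test data (test.tsv), and the corresponding dictionary (word_dict.txt):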
```
wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz
tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz
```
**(2)Classification**
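Each example consists of a Chinese review sentence and a label representing its sentiment, separated by a tab (\t). The review sentence is already word-segmented, with words separated by spaces.
### Training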
```
cd models/contentunderstanding/classification
python -m paddlerec.run -m ./config.yaml # after customizing the hyperparameters, pass the config file to use the custom configuration
15.4寸 笔记本 的 键盘 确实 爽 , 基本 跟 台式机 差不多 了 , 蛮 喜欢 数字 小 键盘 , 输 数字 特 方便 , 样子 也 很 美观 , 做工 也 相当 不错 1
跟 心灵 鸡汤 没 什么 本质 区别 嘛 , 至少 我 不 喜欢 这样 读 经典 , 把 经典 都 解读 成 这样 有点 去 中国 化 的 味道 了 0
```
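As a quick way to inspect the raw examples, the sketch below splits one line into its tokens and label. It assumes the sentence and label are separated by a single tab and that the label is an integer, as in the samples above; `parse_example` is a name chosen for this illustration and is not part of preprocess.py.

```python
def parse_example(line):
    """Split '<space-separated tokens>\\t<label>' into (tokens, label)."""
    text, label = line.rstrip("\n").rsplit("\t", 1)
    return text.split(" "), int(label)


# Shortened sample in the same format as the dataset lines.
sample = "蛮 喜欢 数字 小 键盘\t1"
tokens, label = parse_example(sample)
print(tokens, label)
```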
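This example provides a script that quickly converts the Chinese text in the dataset into a trainable format. After extracting the dataset, copy preprocess.py into the senta_data directory and run it to convert dev.tsv, test.tsv, and train.tsv into dev.txt, test.txt, and train.txt, which can be used directly for training.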
```
cp ./data/preprocess.py ./senta_data/
cd senta_data/
python preprocess.py
```
### Prediction
### Training
Create directories for the training and test sets and move the data into them.
```
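# Edit the model's config.yaml and set workspace to the absolute path of the current directory
# Edit the model's config.yaml and set mode to infer_runner
# Example: mode: train_runner -> mode: infer_runner
# In infer_runner, set class to: class: infer
# Edit the configuration of the infer phase, following the comments in the config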
mkdir train
mv train.txt train
mkdir test
mv dev.txt test
cd ..
```
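Open config.yaml and change the following parameters:
Set workspace to your current absolute path (you can get it with the pwd command).
Change batch_size under data1 from 10 to 128.
Change data_path under data1 to: {workspace}/senta_data/train
Change batch_size under dataset_infer from 2 to 256.
Change data_path under dataset_infer to: {workspace}/senta_data/test
# After editing config.yaml, run:
Run the following command to start training: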
```
python -m paddlerec.run -m ./config.yaml
```
### Prediction
After training finishes, the model starts predicting on the validation set.
Run results:
```
PaddleRec: Runner infer_runner Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment/14
batch: 1, acc: [0.91796875], loss: [0.2287855]
batch: 2, acc: [0.91796875], loss: [0.22827303]
batch: 3, acc: [0.90234375], loss: [0.27907994]
```
## Results Comparison
### Model Performance (Test)
| Dataset | Model | loss | auc | acc | mae |
| :------------------: | :--------------------: | :---------: |:---------: | :---------: |:---------: |
| ag news dataset | TagSpace | -- | -- | -- | -- |
| -- | Classification | -- | -- | -- | -- |
| Dataset | Model | loss | acc |
| :------------------: | :--------------------: | :---------: |:---------: |
| ag news dataset | TagSpace | 0.2282 | 0.9179 |
| ChnSentiCorp | Classification | 0.9177 | 0.0199 |
......@@ -16,16 +16,21 @@ workspace: "models/contentunderstanding/tagspace"
dataset:
- name: sample_1
  type: QueueDataset
  batch_size: 5
  type: DataLoader
  batch_size: 10
  data_path: "{workspace}/data/train_data"
  data_converter: "{workspace}/reader.py"
- name: inferdata
  type: DataLoader
  batch_size: 10
  data_path: "{workspace}/data/test_data"
  data_converter: "{workspace}/reader.py"
hyper_parameters:
  optimizer:
    class: Adagrad
    learning_rate: 0.001
  vocab_text_size: 11447
  vocab_text_size: 75378
  vocab_tag_size: 4
  emb_dim: 10
  hid_dim: 1000
......@@ -34,22 +39,34 @@ hyper_parameters:
  neg_size: 3
  num_devices: 1
mode: runner1
mode: [runner1,infer_runner]
runner:
- name: runner1
  class: train
  epochs: 10
  device: cpu
  save_checkpoint_interval: 2
  save_inference_interval: 4
  save_checkpoint_interval: 1
  save_inference_interval: 1
  save_checkpoint_path: "increment"
  save_inference_path: "inference"
  save_inference_feed_varnames: []
  save_inference_fetch_varnames: []
  phases: phase1
- name: infer_runner
  class: infer
  # device to run training or infer
  device: cpu
  print_interval: 1
  init_model_path: "increment/9" # load model path
  phases: phase_infer
phase:
- name: phase1
  model: "{workspace}/model.py"
  dataset_name: sample_1
  thread_num: 1
- name: phase_infer
  model: "{workspace}/model.py"
  dataset_name: inferdata
  thread_num: 1
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import six
import collections
import os
import csv
import re

if six.PY2:
    reload(sys)
    sys.setdefaultencoding('utf-8')


def word_count(column_num, input_file, word_freq=None):
    """
    compute word count from corpus
    """
    if word_freq is None:
        word_freq = collections.defaultdict(int)
    data_file = csv.reader(input_file)
    for row in data_file:
        for w in re.split(r'\W+', row[column_num].strip()):
            word_freq[w] += 1
    return word_freq


def build_dict(column_num=2, min_word_freq=0, train_dir="", test_dir=""):
    """
    Build a word dictionary from the corpus, Keys of the dictionary are words,
    and values are zero-based IDs of these words.
    """
    word_freq = collections.defaultdict(int)
    files = os.listdir(train_dir)
    for fi in files:
        with open(os.path.join(train_dir, fi), "r", encoding='utf-8') as f:
            word_freq = word_count(column_num, f, word_freq)
    files = os.listdir(test_dir)
    for fi in files:
        with open(os.path.join(test_dir, fi), "r", encoding='utf-8') as f:
            word_freq = word_count(column_num, f, word_freq)
    word_freq = [x for x in six.iteritems(word_freq) if x[1] > min_word_freq]
    # Most frequent words first; ties are broken alphabetically.
    word_freq_sorted = sorted(word_freq, key=lambda x: (-x[1], x[0]))
    words, _ = list(zip(*word_freq_sorted))
    word_idx = dict(list(zip(words, six.moves.range(len(words)))))
    return word_idx


def write_paddle(text_idx, tag_idx, train_dir, test_dir, output_train_dir,
                 output_test_dir):
    files = os.listdir(train_dir)
    if not os.path.exists(output_train_dir):
        os.mkdir(output_train_dir)
    for fi in files:
        with open(os.path.join(train_dir, fi), "r", encoding='utf-8') as f:
            with open(
                    os.path.join(output_train_dir, fi), "w",
                    encoding='utf-8') as wf:
                data_file = csv.reader(f)
                for row in data_file:
                    # Column 0 holds the tag, column 2 holds the text.
                    tag_raw = re.split(r'\W+', row[0].strip())
                    pos_index = tag_idx.get(tag_raw[0])
                    wf.write(str(pos_index) + ",")
                    text_raw = re.split(r'\W+', row[2].strip())
                    l = [text_idx.get(w) for w in text_raw]
                    for w in l:
                        wf.write(str(w) + " ")
                    wf.write("\n")
    files = os.listdir(test_dir)
    if not os.path.exists(output_test_dir):
        os.mkdir(output_test_dir)
    for fi in files:
        with open(os.path.join(test_dir, fi), "r", encoding='utf-8') as f:
            with open(
                    os.path.join(output_test_dir, fi), "w",
                    encoding='utf-8') as wf:
                data_file = csv.reader(f)
                for row in data_file:
                    tag_raw = re.split(r'\W+', row[0].strip())
                    pos_index = tag_idx.get(tag_raw[0])
                    wf.write(str(pos_index) + ",")
                    text_raw = re.split(r'\W+', row[2].strip())
                    l = [text_idx.get(w) for w in text_raw]
                    for w in l:
                        wf.write(str(w) + " ")
                    wf.write("\n")


def text2paddle(train_dir, test_dir, output_train_dir, output_test_dir,
                output_vocab_text, output_vocab_tag):
    print("start construct word dict")
    vocab_text = build_dict(2, 0, train_dir, test_dir)
    with open(output_vocab_text, "w", encoding='utf-8') as wf:
        wf.write(str(len(vocab_text)) + "\n")
    vocab_tag = build_dict(0, 0, train_dir, test_dir)
    with open(output_vocab_tag, "w", encoding='utf-8') as wf:
        wf.write(str(len(vocab_tag)) + "\n")
    print("construct word dict done\n")
    write_paddle(vocab_text, vocab_tag, train_dir, test_dir, output_train_dir,
                 output_test_dir)


train_dir = sys.argv[1]
test_dir = sys.argv[2]
output_train_dir = sys.argv[3]
output_test_dir = sys.argv[4]
output_vocab_text = sys.argv[5]
output_vocab_tag = sys.argv[6]
text2paddle(train_dir, test_dir, output_train_dir, output_test_dir,
            output_vocab_text, output_vocab_tag)
......@@ -16,7 +16,6 @@ import paddle.fluid as fluid
import paddle.fluid.layers.nn as nn
import paddle.fluid.layers.tensor as tensor
import paddle.fluid.layers.control_flow as cf
from paddlerec.core.model import ModelBase
from paddlerec.core.utils import envs
......@@ -98,14 +97,19 @@ class Model(ModelBase):
            tensor.fill_constant_batch_size_like(
                input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'),
            loss_part2)
        avg_cost = nn.mean(loss_part3)
        avg_cost = fluid.layers.mean(loss_part3)
        less = tensor.cast(cf.less_than(cos_neg, cos_pos), dtype='float32')
        label_ones = fluid.layers.fill_constant_batch_size_like(
            input=cos_neg, dtype='float32', shape=[-1, 1], value=1.0)
        correct = nn.reduce_sum(less)
        total = fluid.layers.reduce_sum(label_ones)
        acc = fluid.layers.elementwise_div(correct, total)
        self._cost = avg_cost
        if is_infer:
            self._infer_results["correct"] = correct
            self._infer_results["cos_pos"] = cos_pos
            self._infer_results["acc"] = acc
            self._infer_results["loss"] = self._cost
        else:
            self._metrics["correct"] = correct
            self._metrics["cos_pos"] = cos_pos
            self._metrics["acc"] = acc
            self._metrics["loss"] = self._cost
# TagSpace Text Classification Model
Below is a brief overview of this example's directory structure:
```
├── data #sample data
    ├── train_data
        ├── small_train.csv #sample training data
    ├── test_data
        ├── small_test.csv #sample test data
    ├── text2paddle.py #data preprocessing script
├── __init__.py
├── README.md #documentation
├── model.py #model file
├── config.yaml #configuration file
├── reader.py #data reader
```
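Note: before reading this example, we recommend that you first read the following:
[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## Contents
- [Model Overview](#model-overview)
- [Data Preparation](#data-preparation)
- [Requirements](#requirements)
- [Quick Start](#quick-start)
- [Reproducing the Results](#reproducing-the-results)
- [Advanced Usage](#advanced-usage)
- [FAQ](#FAQ)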
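## Model Overview
TagSpace is a method for tagging text; it mainly learns a mapping from short texts to relevant topic tags. The paper uses a CNN to produce the document vector and then optimizes the distance between f(w,t+) and f(w,t-) as the objective, yielding vector representations of tags (t) and documents in a shared feature space, so that the hashtags for a document can be retrieved.
The network structure in the paper [TAGSPACE: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) is shown in the figure below: an input layer, a convolutional layer, a pooling layer, and a final fully connected layer for dimensionality reduction.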
<p align="center">
<img align="center" src="../../../doc/imgs/tagspace.png">
<p>
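To make the objective over f(w,t+) and f(w,t-) more concrete, here is a small pure-Python sketch of a margin-based ranking loss on cosine similarities, together with the accuracy criterion used in model.py (a prediction counts as correct when the negative tag scores below the positive tag). The margin value of 0.1, the use of the highest-scoring negative, and the toy vectors are assumptions for illustration; the exact loss in the paper and in model.py may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def hinge_rank_loss(doc, pos_tag, neg_tags, margin=0.1):
    """Return (loss, correct) for one document.

    loss    = max(0, margin - cos(doc, t+) + max_k cos(doc, t-_k))
    correct = the best negative tag scores lower than the positive tag
    """
    cos_pos = cosine(doc, pos_tag)
    cos_neg = max(cosine(doc, t) for t in neg_tags)
    loss = max(0.0, margin - cos_pos + cos_neg)
    return loss, cos_neg < cos_pos


# Toy 2-d embeddings: the document aligns with its positive tag.
loss_good, correct_good = hinge_rank_loss(
    [1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Here a negative tag matches the document better than the positive one.
loss_bad, correct_bad = hinge_rank_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(loss_good, correct_good, loss_bad, correct_bad)
```

The loss is zero once the positive tag beats every sampled negative by at least the margin, so training effort concentrates on documents whose tags are still ranked incorrectly.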
## Data Preparation
[Data](https://github.com/mhjabreel/CharCNN/tree/master/data/) , [backup data](https://paddle-tagspace.bj.bcebos.com/data.tar)
The data format is as follows:
```
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
```
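Each raw row is a quoted CSV record with three fields. The sketch below parses the sample row with Python's standard csv module; the field names `label`, `title`, and `description` are labels chosen for this illustration. text2paddle.py reads column 0 as the tag and column 2 as the text.

```python
import csv
import io

# The sample row from above, kept verbatim (the backslash is literal data).
raw = r'''"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."'''

row = next(csv.reader(io.StringIO(raw)))
label, title, description = row

print(label)        # column 0: used as the tag
print(title)        # column 1: not read by text2paddle.py
print(description)  # column 2: used as the text
```

Note that commas inside quoted fields do not split the record, which is why the script relies on csv.reader rather than a plain string split.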
## Requirements
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os : windows/linux/macos
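## Quick Start
Sample data is provided so that you can try the model quickly. Run the following command directly from the paddlerec directory to start training: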
```
python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
```
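## Reproducing the Results
So that users can get each model running quickly, sample data is provided with every model. To reproduce the results reported in this README, follow the steps below in order.
1. Make sure your current directory is PaddleRec/models/contentunderstanding/tagspace
2. Download and extract the dataset in the data directory with the following commands: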
```
cd data
wget https://paddle-tagspace.bj.bcebos.com/data.tar
tar -xvf data.tar
```
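3. This example provides a script that quickly converts the raw text in the dataset into a trainable format. After extracting the dataset, place the raw data in the two directories raw_big_train_data and raw_big_test_data, then run the provided text2paddle.py in a Python 3 environment. This generates the directories train_big_data and test_big_data, which can be used directly for training. The commands are as follows: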
```
mkdir raw_big_train_data
mkdir raw_big_test_data
mv train.csv raw_big_train_data
mv test.csv raw_big_test_data
python3 text2paddle.py raw_big_train_data/ raw_big_test_data/ train_big_data test_big_data big_vocab_text.txt big_vocab_tag.txt
```
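The data directory after running:
```
big_vocab_tag.txt #tag vocabulary size
big_vocab_text.txt #text vocabulary size
data.tar #dataset archive
raw_big_train_data #original training set from the dataset
raw_big_test_data #original test set from the dataset
train_data #sample training set
test_data #sample test set
train_big_data #processed training set
test_big_data #processed test set
text2paddle.py #preprocessing script
```
The processed data has the following format: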
```
2,27 7062 8390 456 407 8 11589 3166 4 7278 31046 33 3898 2897 426 1
2,27 9493 836 355 20871 300 81 19 3 4125 9 449 462 13832 6 16570 1380 2874 5 0 797 236 19 3688 2106 14 8615 7 209 304 4 0 123 1
2,27 12754 637 106 3839 1532 66 0 379 6 0 1246 9 307 33 161 2 8100 36 0 350 123 101 74 181 0 6657 4 0 1222 17195 1
```
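4. Return to the tagspace directory, open config.yaml, and change the following parameters:
Set workspace to your current absolute path (you can get it with the pwd command).
Change batch_size of sample_1 under dataset from 10 to 128.
Change data_path of sample_1 under dataset to: {workspace}/data/train_big_data
Change batch_size of inferdata under dataset from 10 to 500.
Change data_path of inferdata under dataset to: {workspace}/data/test_big_data
5. Run the following command to start training: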
```
python -m paddlerec.run -m ./config.yaml
```
6. Run results:
```
PaddleRec: Runner infer_runner Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment/9
batch: 1, acc: [0.91], loss: [0.02495437]
batch: 2, acc: [0.936], loss: [0.01941476]
batch: 3, acc: [0.918], loss: [0.02116447]
batch: 4, acc: [0.916], loss: [0.0219945]
batch: 5, acc: [0.902], loss: [0.02242816]
batch: 6, acc: [0.9], loss: [0.02421589]
batch: 7, acc: [0.9], loss: [0.026441]
batch: 8, acc: [0.934], loss: [0.01797657]
batch: 9, acc: [0.932], loss: [0.01687362]
batch: 10, acc: [0.926], loss: [0.02047823]
batch: 11, acc: [0.918], loss: [0.01998716]
batch: 12, acc: [0.898], loss: [0.0229556]
batch: 13, acc: [0.928], loss: [0.01736144]
batch: 14, acc: [0.93], loss: [0.01911209]
```
## Advanced Usage
## FAQ