未验证 提交 3329d488 编写于 作者: P pkpk 提交者: GitHub

Merge pull request #57 from wangxiao1021/master

add sentiment classification
## 简介
情感是人类的一种高级智能行为,为了识别文本的情感倾向,需要深入的语义建模。另外,不同领域(如餐饮、体育)在情感的表达各不相同,因而需要有大规模覆盖各个领域的数据进行模型训练。为此,我们通过基于深度学习的语义模型和大规模数据挖掘解决上述两个问题。效果上,我们基于开源情感倾向分类数据集ChnSentiCorp进行评测。具体数据如下所示:
| 模型 | dev | test |
| :------| :------ | :------ |
| CNN | 90.6% | 89.7% |
| BOW | 90.1% | 90.3% |
| GRU | 90.0% | 91.1% |
| BIGRU | 89.7% | 89.6% |
动态图文档请见[Dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/user_guides/howto/dygraph/DyGraph.html)
## 快速开始
本项目依赖于 Paddlepaddle 1.7.0 及以上版本,请参考 [安装指南](http://www.paddlepaddle.org/#quick-start) 进行安装。
python版本依赖python 2.7或python 3.5及以上版本。
#### 代码下载及环境变量设置
克隆代码库到本地,并设置`PYTHONPATH`环境变量
```shell
git clone https://github.com/PaddlePaddle/hapi
cd hapi
export PYTHONPATH=$PYTHONPATH:`pwd`
cd examples/sentiment_classification
```
#### 数据准备
下载经过预处理的数据,文件解压之后,senta_data目录下会存在训练数据(train.tsv)、开发集数据(dev.tsv)、测试集数据(test.tsv)以及对应的词典(word_dict.txt)
```shell
wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz
tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz
```
#### 模型训练
基于示例的数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证。训练阶段需手动创建模型需要保存的文件夹,并且通过checkpoints设置保存文件路径。
model_type从bow_net,cnn_net,gru_net,bigru_net中选择。
模型相关参数均在`senta.yaml`中设置,模型训练需确保`senta.yaml``do_train`属性置为`True`
```shell
python sentiment_classifier.py
```
#### 模型预测
利用已有模型,可以运行下面命令,对未知label的数据(test.tsv)进行预测。
模型预测需确保`senta.yaml``do_infer`属性置为`True`
```shell
python sentiment_classifier.py
```
#### 模型参数
模型参数配置文件:`senta.yaml`
1. batch_size, 根据模型情况和GPU占用率选择batch_size, 建议cnn/bow选择较大batch_size, gru/bigru选择较小batch_size。
2. padding_size默认为150。
3. epoch, training时默认设置为5,infer默认为1。
4. learning_rate默认为0.002。
## 进阶使用
#### 任务定义
传统的情感分类主要基于词典或者特征工程的方式进行分类,这种方法需要繁琐的人工特征设计和先验知识,理解停留于浅层并且扩展泛化能力差。为了避免传统方法的局限,我们采用近年来飞速发展的深度学习技术。基于深度学习的情感分类不依赖于人工特征,它能够端到端的对输入文本进行语义理解,并基于语义表示进行情感倾向的判断。
#### 模型原理介绍
本项目针对情感倾向性分类问题,:
+ CNN(Convolutional Neural Networks),是一个基础的序列模型,能处理变长序列输入,提取局部区域之内的特征;
+ BOW(Bag Of Words)模型,是一个非序列模型,使用基本的全连接结构;
+ GRU(Gated Recurrent Unit),序列模型,能够较好地解决序列文本中长距离依赖的问题;
+ BI-GRU(Bidirectional Gated Recurrent Unit),序列模型,采用双向双层GRU结构,更好地捕获句子中的语义特征;
#### 数据格式说明
训练、预测、评估使用的数据可以由用户根据实际的应用场景,自己组织数据。数据由两列组成,以制表符分隔,第一列是以空格分词的中文文本(分词预处理方法将在下文具体说明),文件为utf8编码;第二列是情感倾向分类的类别(0表示消极;1表示积极),注意数据文件第一行固定表示为"text_a\tlabel"
```text
特 喜欢 这种 好看的 狗狗 1
这 真是 惊艳 世界 的 中国 黑科技 1
环境 特别 差 ,脏兮兮 的,再也 不去 了 0
```
#### 代码结构说明
```text
.
├── sentiment_classifier.py # 该项目的主函数,封装包括训练、预测、评估的部分
├── models.py # 网络结构
```
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Linear, Embedding
from paddle.fluid.dygraph.base import to_variable
import numpy as np
from hapi.model import Model
from hapi.text.text import GRUEncoderLayer as BiGRUEncoder
from hapi.text.test import BOWEncoder, CNNEncoder, GRUEncoder
class CNN(Model):
def __init__(self, dict_dim, batch_size, seq_len):
super(CNN, self).__init__()
self.dict_dim = dict_dim
self.emb_dim = 128
self.hid_dim = 128
self.fc_hid_dim = 96
self.class_dim = 2
self.channels = 1
self.win_size = [3, self.hid_dim]
self.batch_size = batch_size
self.seq_len = seq_len
self._encoder = CNNEncoder(
dict_size=self.dict_dim + 1,
emb_dim=self.emb_dim,
seq_len=self.seq_len,
filter_size= self.win_size,
num_filters= self.hid_dim,
hidden_dim= self.hid_dim,
padding_idx=None,
act='tanh')
self._fc1 = Linear(input_dim = self.hid_dim*self.seq_len, output_dim=self.fc_hid_dim, act="softmax")
self._fc_prediction = Linear(input_dim = self.fc_hid_dim,
output_dim = self.class_dim,
act="softmax")
def forward(self, inputs):
conv_3 = self._encoder(inputs)
fc_1 = self._fc1(conv_3)
prediction = self._fc_prediction(fc_1)
return prediction
class BOW(Model):
def __init__(self, dict_dim, batch_size, seq_len):
super(BOW, self).__init__()
self.dict_dim = dict_dim
self.emb_dim = 128
self.hid_dim = 128
self.fc_hid_dim = 96
self.class_dim = 2
self.batch_size = batch_size
self.seq_len = seq_len
self._encoder = BOWEncoder(
dict_size=self.dict_dim + 1,
emb_dim=self.emb_dim,
padding_idx=None,
bow_dim=self.hid_dim,
seq_len=self.seq_len)
self._fc1 = Linear(input_dim = self.hid_dim, output_dim=self.hid_dim, act="tanh")
self._fc2 = Linear(input_dim = self.hid_dim, output_dim=self.fc_hid_dim, act="tanh")
self._fc_prediction = Linear(input_dim = self.fc_hid_dim,
output_dim = self.class_dim,
act="softmax")
def forward(self, inputs):
bow_1 = self._encoder(inputs)
bow_1 = fluid.layers.tanh(bow_1)
fc_1 = self._fc1(bow_1)
fc_2 = self._fc2(fc_1)
prediction = self._fc_prediction(fc_2)
return prediction
class GRU(Model):
def __init__(self, dict_dim, batch_size, seq_len):
super(GRU, self).__init__()
self.dict_dim = dict_dim
self.emb_dim = 128
self.hid_dim = 128
self.fc_hid_dim = 96
self.class_dim = 2
self.batch_size = batch_size
self.seq_len = seq_len
self._fc1 = Linear(input_dim=self.hid_dim, output_dim=self.fc_hid_dim, act="tanh")
self._fc_prediction = Linear(input_dim=self.fc_hid_dim,
output_dim=self.class_dim,
act="softmax")
self._encoder = GRUEncoder(
dict_size=self.dict_dim + 1,
emb_dim=self.emb_dim,
gru_dim=self.hid_dim,
hidden_dim=self.hid_dim,
padding_idx=None,
seq_len=self.seq_len)
def forward(self, inputs):
emb = self._encoder(inputs)
fc_1 = self._fc1(emb)
prediction = self._fc_prediction(fc_1)
return prediction
class BiGRU(Model):
def __init__(self, dict_dim, batch_size, seq_len):
super(BiGRU, self).__init__()
self.dict_dim = dict_dim
self.emb_dim = 128
self.hid_dim = 128
self.fc_hid_dim = 96
self.class_dim = 2
self.batch_size = batch_size
self.seq_len = seq_len
self.embedding = Embedding(
size=[self.dict_dim + 1, self.emb_dim],
dtype='float32',
param_attr=fluid.ParamAttr(learning_rate=30),
is_sparse=False)
h_0 = np.zeros((self.batch_size, self.hid_dim), dtype="float32")
h_0 = to_variable(h_0)
self._fc1 = Linear(input_dim = self.hid_dim, output_dim=self.hid_dim*3)
self._fc2 = Linear(input_dim = self.hid_dim*2, output_dim=self.fc_hid_dim, act="tanh")
self._fc_prediction = Linear(input_dim=self.fc_hid_dim,
output_dim=self.class_dim,
act="softmax")
self._encoder = BiGRUEncoder(
grnn_hidden_dim=self.hid_dim,
input_dim=self.hid_dim * 3,
h_0=h_0,
init_bound=0.1,
is_bidirection=True)
def forward(self, inputs):
emb = self.embedding(inputs)
emb = fluid.layers.reshape(emb, shape=[self.batch_size, -1, self.hid_dim])
fc_1 = self._fc1(emb)
encoded_vector = self._encoder(fc_1)
encoded_vector = fluid.layers.tanh(encoded_vector)
encoded_vector = fluid.layers.reduce_max(encoded_vector, dim=1)
fc_2 = self._fc2(encoded_vector)
prediction = self._fc_prediction(fc_2)
return prediction
checkpoints: "./checkpoints"
epoch: 5
save_freq: 1
eval_freq: 1
lr: 0.002
padding_size: 150
skip_steps: 10
verbose: False
data_dir: "./senta_data/"
vocab_path: "./senta_data/word_dict.txt"
vocab_size: 33256
batch_size: 20
random_seed: 0
use_cuda: True
do_train: True
do_infer: False
model_type: "bow_net"
output_dir: "./output"
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Sentiment Classification in Paddle Dygraph Mode. """
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from hapi.model import set_device, Model, CrossEntropy, Input
from hapi.configure import Config
from hapi.text.senta import SentaProcessor
from hapi.metrics import Accuracy
from models import CNN, BOW, GRU, BiGRU
import json
import os
args = Config(yaml_file='./senta.yaml')
args.build()
args.Print()
device = set_device("gpu" if args.use_cuda else "cpu")
dev_count = fluid.core.get_cuda_device_count() if args.use_cuda else 1
def main():
if args.do_train:
train()
elif args.do_infer:
infer()
def train():
fluid.enable_dygraph(device)
processor = SentaProcessor(
data_dir=args.data_dir,
vocab_path=args.vocab_path,
random_seed=args.random_seed)
num_labels = len(processor.get_labels())
num_train_examples = processor.get_num_examples(phase="train")
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
train_data_generator = processor.data_generator(
batch_size=args.batch_size,
padding_size=args.padding_size,
places=device,
phase='train',
epoch=args.epoch,
shuffle=False)
eval_data_generator = processor.data_generator(
batch_size=args.batch_size,
padding_size=args.padding_size,
places=device,
phase='dev',
epoch=args.epoch,
shuffle=False)
if args.model_type == 'cnn_net':
model = CNN( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'bow_net':
model = BOW( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'gru_net':
model = GRU( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'bigru_net':
model = BiGRU( args.vocab_size, args.batch_size,
args.padding_size)
optimizer = fluid.optimizer.Adagrad(learning_rate=args.lr, parameter_list=model.parameters())
inputs = [Input([None, None], 'int64', name='doc')]
labels = [Input([None, 1], 'int64', name='label')]
model.prepare(
optimizer,
CrossEntropy(),
Accuracy(topk=(1,)),
inputs,
labels,
device=device)
model.fit(train_data=train_data_generator,
eval_data=eval_data_generator,
batch_size=args.batch_size,
epochs=args.epoch,
save_dir=args.checkpoints,
eval_freq=args.eval_freq,
save_freq=args.save_freq)
def infer():
fluid.enable_dygraph(device)
processor = SentaProcessor(
data_dir=args.data_dir,
vocab_path=args.vocab_path,
random_seed=args.random_seed)
infer_data_generator = processor.data_generator(
batch_size=args.batch_size,
padding_size=args.padding_size,
places=device,
phase='infer',
epoch=1,
shuffle=False)
if args.model_type == 'cnn_net':
model_infer = CNN( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'bow_net':
model_infer = BOW( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'gru_net':
model_infer = GRU( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'bigru_net':
model_infer = BiGRU( args.vocab_size, args.batch_size,
args.padding_size)
print('Do inferring ...... ')
inputs = [Input([None, None], 'int64', name='doc')]
model_infer.prepare(
None,
CrossEntropy(),
Accuracy(topk=(1,)),
inputs,
device=device)
model_infer.load(args.checkpoints, reset_optimizer=True)
preds = model_infer.predict(test_data=infer_data_generator)
preds = np.array(preds[0]).reshape((-1, 2))
if args.output_dir:
with open(os.path.join(args.output_dir, 'predictions.json'), 'w') as w:
for p in range(len(preds)):
label = np.argmax(preds[p])
result = json.dumps({'index': p, 'label': label, 'probs': preds[p].tolist()})
w.write(result+'\n')
print('Predictions saved at '+os.path.join(args.output_dir, 'predictions.json'))
if __name__ == '__main__':
main()
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from hapi.text.senta.data_processer import SentaProcessor
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from hapi.text.senta.data_reader import load_vocab
from hapi.text.senta.data_reader import data_reader
from paddle.io import DataLoader
class SentaProcessor(object):
def __init__(self, data_dir, vocab_path, random_seed=None):
self.data_dir = data_dir
self.vocab = load_vocab(vocab_path)
self.num_examples = {"train": -1, "dev": -1, "infer": -1}
np.random.seed(random_seed)
def get_train_examples(self, data_dir, epoch, shuffle, batch_size, places, padding_size):
train_reader = data_reader((self.data_dir + "/train.tsv"), self.vocab,
self.num_examples, "train", epoch, padding_size, shuffle)
loader = DataLoader.from_generator(capacity=50, return_list=True)
loader.set_sample_generator(train_reader, batch_size=batch_size, drop_last=False, places=places)
return loader
def get_dev_examples(self, data_dir, epoch, shuffle, batch_size, places, padding_size):
dev_reader = data_reader((self.data_dir + "/dev.tsv"), self.vocab,
self.num_examples, "dev", epoch, padding_size, shuffle)
loader = DataLoader.from_generator(capacity=50, return_list=True)
loader.set_sample_generator(dev_reader, batch_size=batch_size, drop_last=False, places=places)
return loader
def get_test_examples(self, data_dir, epoch, batch_size, places, padding_size):
test_reader = data_reader((self.data_dir + "/test.tsv"), self.vocab,
self.num_examples, "infer", epoch, padding_size)
loader = DataLoader.from_generator(capacity=50, return_list=True)
loader.set_sample_generator(test_reader, batch_size=batch_size, drop_last=False, places=places)
return loader
def get_labels(self):
return ["0", "1"]
def get_num_examples(self, phase):
if phase not in ['train', 'dev', 'infer']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'infer'].")
return self.num_examples[phase]
def get_train_progress(self):
return self.current_train_example, self.current_train_epoch
def data_generator(self, padding_size, batch_size, places, phase='train', epoch=1, shuffle=True):
if phase == "train":
return self.get_train_examples(self.data_dir, epoch, shuffle, batch_size, places, padding_size)
elif phase == "dev":
return self.get_dev_examples(self.data_dir, epoch, shuffle, batch_size, places, padding_size)
elif phase == "infer":
return self.get_test_examples(self.data_dir, epoch, batch_size, places, padding_size)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'infer'].")
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import io
import sys
import random
def str2bool(v):
return v.lower() in ("true", "t", "1")
def data_reader(file_path, word_dict, num_examples, phrase, epoch, padding_size, shuffle=False):
unk_id = len(word_dict)
all_data = []
with io.open(file_path, "r", encoding='utf8') as fin:
for line in fin:
if line.startswith('text_a'):
continue
cols = line.strip().split("\t")
if len(cols) != 2:
sys.stderr.write("[NOTICE] Error Format Line!")
continue
label = [int(cols[1])]
wids = [
word_dict[x] if x in word_dict else unk_id
for x in cols[0].split(" ")
]
wids = wids[:padding_size]
while len(wids) < padding_size:
wids.append(unk_id)
all_data.append((wids, label))
if shuffle:
if phrase == "train":
random.shuffle(all_data)
num_examples[phrase] = len(all_data)
def reader():
for epoch_index in range(epoch):
for doc, label in all_data:
yield doc, label
return reader
def load_vocab(file_path):
vocab = {}
with io.open(file_path, 'r', encoding='utf8') as f:
wid = 0
for line in f:
if line.strip() not in vocab:
vocab[line.strip()] = wid
wid += 1
vocab["<unk>"] = len(vocab)
return vocab
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册