Commit 48e5d584 authored by peterzhang2029

add nest text classification

Parent 5e74b46e
# Nested Sequence Text Classification
## Introduction
Sequence data is one of the main input types in natural language processing: a sentence is a sequence of words, and several sentences in turn make up a paragraph. A paragraph can therefore be viewed as a nested, double-layer sequence in which every element is itself a sequence.
The double-layer sequence is a very flexible data organization supported by `PaddlePaddle`. It helps us describe more complex language data such as paragraphs and multi-turn dialogues. With double-layer sequence input, we can design a hierarchical network that encodes the input at both the word level and the sentence level, which serves complex language understanding tasks better.
This example demonstrates how to organize double-layer sequence text data with `PaddlePaddle` and complete a text classification task on it.
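For example, a two-sentence paragraph can be laid out as a nested list of word IDs; the vocabulary and IDs below are made up purely for illustration:
```python
# A double-layer sequence: the outer list runs over sentences,
# each inner list runs over the word IDs of one sentence.
# (The IDs are hypothetical.)
paragraph = [
    [2, 15, 8, 4, 11],   # sentence 1, e.g. "this movie is very good"
    [7, 3, 22, 9],       # sentence 2
]
label = 1                # one class label for the whole paragraph
```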
## Model Overview
For text classification we treat a piece of text as an array of sentences, and each sentence as an array of words; this is exactly a double-layer sequence input. A convolutional neural network encodes every sentence of the paragraph into a vector, and a pooling layer then combines the per-sentence vectors into one vector that represents the whole paragraph. For classification, this paragraph representation is fed into a classifier to obtain the result.
**The model structure is shown in the figure below.**
<p align="center">
<img src="images/model.jpg" width = "60%" align="center"/><br/>
Figure 1. The text classification model used in this example
</p>
The PaddlePaddle implementation of this network is in `network_conf.py`.
To process a double-layer sequence, the nested time series has to be transformed into single-layer time series first, and each single-layer series is then processed on its own. PaddlePaddle provides the `recurrent_group` interface for this transformation. In this example, each paragraph of the text data is split apart by `recurrent_group`, and every resulting sentence is passed through a CNN to learn its vector representation.
``` python
nest_group = paddle.layer.recurrent_group(
    input=[paddle.layer.SubsequenceInput(emb), hidden_size],
    step=cnn_cov_group)
```
When transforming with the `recurrent_group` interface, the input sequence is passed through the `input` argument. Since the transformation needed here is `double-layer time series => single-layer time series`, the input data has to be marked as `SubsequenceInput`.
Each single-layer sequence produced by the split is encoded into a vector by a CNN whose structure contains the following parts:
- **Convolution layer**: For text classification the convolution runs along the time dimension, and the width of each kernel matches the matrix produced by the word embedding layer. Each convolution yields a "feature map", and kernels of several different heights yield several feature maps. By default this example uses kernels of size 3 (the red box in Figure 1) and size 4 (the blue box in Figure 1).
- **Max pooling layer**: Max pooling is applied to each feature map separately. Since a feature map is itself a vector, max pooling simply selects the largest element of each vector; all these maxima are then concatenated into a new vector.
- **Linear projection layer**: The max-pooled results of the different convolutions are concatenated into one long vector, which is passed through a linear projection to obtain the representation vector of the single-layer sequence.
The CNN is implemented as follows:
```python
def cnn_cov_group(group_input, hidden_size):
conv3 = paddle.networks.sequence_conv_pool(
input=group_input, context_len=3, hidden_size=hidden_size)
conv4 = paddle.networks.sequence_conv_pool(
input=group_input, context_len=4, hidden_size=hidden_size)
    output_group = paddle.layer.fc(
        input=[conv3, conv4],
        size=hidden_size,
        param_attr=paddle.attr.ParamAttr(name='_cov_value_weight'),
        bias_attr=paddle.attr.ParamAttr(name='_cov_value_bias'),
        act=paddle.activation.Linear())
return output_group
```
`paddle.networks.sequence_conv_pool` is a prebuilt PaddlePaddle module that combines text sequence convolution with pooling and can be called directly.
After the representation vector of each sentence is obtained, an average pooling layer turns all sentence vectors into a single vector representing the sample, and a fully connected layer maps this vector to the final prediction. The code is as follows:
```python
avg_pool = paddle.layer.pooling(
    input=nest_group,
    pooling_type=paddle.pooling.Avg(),
    agg_level=paddle.layer.AggregateLevel.TO_NO_SEQUENCE)
prob = paddle.layer.mixed(
    size=class_num,
    input=[paddle.layer.full_matrix_projection(input=avg_pool)],
    act=paddle.activation.Softmax())
```
## Running with PaddlePaddle Built-in Data
### Training
Run in a terminal:
```bash
python train.py
```
This runs the example on `imdb`, the sentiment classification dataset built into PaddlePaddle.
### Prediction
After training, the models are stored in the specified directory (`models` by default). Run in a terminal:
```bash
python infer.py
```
By default, the prediction script loads the model trained for one pass and tests it on the `imdb` test set. Each output line contains the predicted label, the class probabilities, and the input text, separated by tabs.
## Training and Predicting with Custom Data
### Training
1. Data organization
Suppose the training data has the following format: one sample per line, fields separated by `\t`, with the class label in the first column and the input text in the second. Two example lines:
```
1 This movie is very good. The actor is so handsome.
0 What a terrible movie. I waste so much time.
```
2. Writing the data reader
A custom data reader only needs a Python generator that implements the logic of **parsing one training sample from the raw input text**. The following snippet reads the raw data and returns values of the types `paddle.data_type.integer_value_sub_sequence` and `paddle.data_type.integer_value`:
```python
def train_reader(data_dir, word_dict):
"""
Reader interface for training data
:param data_dir: data directory
:type data_dir: str
    :param word_dict: the word dictionary; it must contain
        an "<unk>" entry for out-of-vocabulary words.
    :type word_dict: Python dict
"""
def reader():
UNK_ID = word_dict['<unk>']
word_col = 1
lbl_col = 0
for file_name in os.listdir(data_dir):
file_path = os.path.join(data_dir, file_name)
if not os.path.isfile(file_path):
continue
with open(file_path, "r") as f:
for line in f:
line_split = line.strip().split("\t")
doc = line_split[word_col]
doc_ids = []
for sent in doc.strip().split("."):
sent_ids = [
word_dict.get(w, UNK_ID)
for w in sent.split()]
if sent_ids:
doc_ids.append(sent_ids)
yield doc_ids, int(line_split[lbl_col])
return reader
```
Note that this example uses the English period `'.'` as the delimiter that splits a paragraph into a number of sentences, and each sentence is represented as an array of word dictionary indices (`sent_ids`). Because the sample representation (`doc_ids`) contains all sentences of the paragraph, its type is `paddle.data_type.integer_value_sub_sequence`; the sketch below traces one line through this logic.
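As a minimal, self-contained illustration (the toy dictionary here is made up for the example), a labeled line is converted into a nested ID list like this:
```python
# Toy dictionary; a real one is built from the training corpus.
word_dict = {'<unk>': 0, 'this': 1, 'movie': 2, 'is': 3, 'very': 4, 'good': 5}

line = "1\tthis movie is very good. it is great"
label, doc = line.split("\t")
doc_ids = []
for sent in doc.strip().split("."):
    sent_ids = [word_dict.get(w, word_dict['<unk>']) for w in sent.split()]
    if sent_ids:  # skip empty fragments produced by trailing periods
        doc_ids.append(sent_ids)

print(doc_ids, int(label))
# [[1, 2, 3, 4, 5], [0, 3, 0]] 1   ("it" and "great" map to <unk>)
```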
3. Training with command line arguments
The training script `train.py` accepts the following arguments:
```
--train_data_dir TRAIN_DATA_DIR
path of training dataset (default: None). if this
parameter is not set, imdb dataset will be used.
--test_data_dir TEST_DATA_DIR
path of testing dataset (default: None). if this
parameter is not set, imdb dataset will be used.
--word_dict WORD_DICT
                        path of word dictionary (default: None). if this
                        parameter is not set, imdb dataset will be used. if
                        this parameter is set, but the file does not exist,
                        word dictionary will be built from the training data
                        automatically.
--class_num CLASS_NUM
class number.
--batch_size BATCH_SIZE
the number of training examples in one
forward/backward pass
--num_passes NUM_PASSES
number of passes to train
--model_save_dir MODEL_SAVE_DIR
path to save the trained models.
```
Modify the launch arguments of `train.py` to run this example directly. Taking the sample data under the `data` directory as an example, run in a terminal:
```bash
python train.py --train_data_dir 'data/train_data' --test_data_dir 'data/test_data' --word_dict 'dict.txt'
```
This trains the model on the sample data.
### Prediction
1. Modify the following variables in `infer.py` to specify the model and the test data to use:
```python
model_path = "models/params_pass_00000.tar.gz" # 指定模型所在的路径
assert os.path.exists(model_path), "the trained model does not exist."
infer_path = 'data/infer.txt' # 指定测试文件所在的目录
word_dict = 'dict.txt' # 指定字典所在的路径
```
2. Run `python infer.py` in a terminal.
[The commit also adds sample data under `data/`: raw movie reviews in `data/infer.txt`, and labeled, tab-separated reviews in `data/train_data` and `data/test_data`, following the format described above.]
#!/usr/bin/env python
# -*- coding: utf-8 -*-
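# infer.py: prediction script for this example.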
import sys
import os
import gzip
import paddle.v2 as paddle
import reader
from network_conf import nest_net
from utils import logger
def infer(data_path, model_path, word_dict_path, batch_size, class_num):
def _infer_a_batch(inferer, test_batch, ids_2_word):
probs = inferer.infer(input=test_batch, field=["value"])
assert len(probs) == len(test_batch)
for word_ids, prob in zip(test_batch, probs):
sent_ids = []
for sent in word_ids[0]:
sent_ids.extend(sent)
word_text = " ".join([ids_2_word[id] for id in sent_ids])
print("%s\t%s\t%s" % (prob.argmax(),
" ".join(["{:0.4f}".format(p)
for p in prob]), word_text))
logger.info("begin to predict...")
use_default_data = (data_path is None)
if use_default_data:
word_dict = reader.imdb_word_dict()
word_reverse_dict = dict((value, key)
for key, value in word_dict.iteritems())
test_reader = reader.imdb_test(word_dict)
class_num = 2
else:
assert os.path.exists(
word_dict_path), "the word dictionary file does not exist"
word_dict = reader.load_dict(word_dict_path)
word_reverse_dict = dict((value, key)
for key, value in word_dict.iteritems())
test_reader = reader.infer_reader(data_path, word_dict)()
dict_dim = len(word_dict)
prob_layer = nest_net(dict_dim, class_num=class_num, is_infer=True)
# initialize PaddlePaddle
paddle.init(use_gpu=True, trainer_count=4)
# load the trained models
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_path, "r"))
inferer = paddle.inference.Inference(
output_layer=prob_layer, parameters=parameters)
test_batch = []
for idx, item in enumerate(test_reader):
test_batch.append([item[0]])
if len(test_batch) == batch_size:
_infer_a_batch(inferer, test_batch, word_reverse_dict)
test_batch = []
if len(test_batch):
_infer_a_batch(inferer, test_batch, word_reverse_dict)
test_batch = []
if __name__ == "__main__":
model_path = "models/params_pass_00000.tar.gz"
assert os.path.exists(model_path), "the trained model does not exist."
infer_path = None
word_dict = None
infer(
data_path=infer_path,
word_dict_path=word_dict,
model_path=model_path,
batch_size=10,
class_num=2)
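# network_conf.py: network definition shared by train.py and infer.py.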
import paddle.v2 as paddle
def cnn_cov_group(group_input, hidden_size):
conv3 = paddle.networks.sequence_conv_pool(
input=group_input, context_len=3, hidden_size=hidden_size)
conv4 = paddle.networks.sequence_conv_pool(
input=group_input, context_len=4, hidden_size=hidden_size)
output_group = paddle.layer.fc(
input=[conv3, conv4],
size=hidden_size,
param_attr=paddle.attr.ParamAttr(name='_cov_value_weight'),
bias_attr=paddle.attr.ParamAttr(name='_cov_value_bias'),
act=paddle.activation.Linear())
return output_group
def nest_net(dict_dim,
             emb_size=28,
             hidden_size=128,
             class_num=2,
             is_infer=False):
    # The input is a double-layer sequence: a paragraph made of sentences,
    # each sentence being a sequence of word ids.
    data = paddle.layer.data(
        "word", paddle.data_type.integer_value_sub_sequence(dict_dim))
    emb = paddle.layer.embedding(input=data, size=emb_size)
    # Split the paragraph into sentences and encode each one with the CNN.
    nest_group = paddle.layer.recurrent_group(
        input=[paddle.layer.SubsequenceInput(emb), hidden_size],
        step=cnn_cov_group)
    # Average the sentence vectors into a single paragraph vector.
    avg_pool = paddle.layer.pooling(
        input=nest_group,
        pooling_type=paddle.pooling.Avg(),
        agg_level=paddle.layer.AggregateLevel.TO_NO_SEQUENCE)
    # Project the paragraph vector to class probabilities.
    prob = paddle.layer.mixed(
        size=class_num,
        input=[paddle.layer.full_matrix_projection(input=avg_pool)],
        act=paddle.activation.Softmax())
    if not is_infer:
        label = paddle.layer.data("label",
                                  paddle.data_type.integer_value(class_num))
        cost = paddle.layer.classification_cost(input=prob, label=label)
        return cost, prob, label
    return prob
#!/usr/bin/env python
# -*- coding: utf-8 -*-
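# reader.py: data readers for the built-in imdb dataset and for custom data.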
"""
IMDB dataset.
This module downloads IMDB dataset from
http://ai.stanford.edu/%7Eamaas/data/sentiment/. This dataset contains a set
of 25,000 highly polar movie reviews for training, and 25,000 for testing.
Besides, this module also provides API for building dictionary.
"""
import collections
import tarfile
import Queue
import re
import string
import threading
import os
import paddle.v2.dataset.common
URL = 'http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz'
MD5 = '7c2ac02c03563afcf9b574c7e56c153a'
def tokenize(pattern):
"""
Read files that match the given pattern. Tokenize and yield each file.
"""
with tarfile.open(
paddle.v2.dataset.common.download(URL, 'imdb', MD5)) as tarf:
tf = tarf.next()
        while tf is not None:
if bool(pattern.match(tf.name)):
# newline and punctuations removal and ad-hoc tokenization.
docs = tarf.extractfile(tf).read().rstrip("\n\r").lower().split(
'.')
doc_list = []
for doc in docs:
doc = doc.strip()
if doc:
doc_without_punc = doc.translate(
None, string.punctuation).strip()
if doc_without_punc:
doc_list.append(
[word for word in doc_without_punc.split()])
yield doc_list
tf = tarf.next()
def imdb_build_dict(pattern, cutoff):
"""
Build a word dictionary from the corpus. Keys of the dictionary are words,
and values are zero-based IDs of these words.
"""
word_freq = collections.defaultdict(int)
for doc_list in tokenize(pattern):
for doc in doc_list:
for word in doc:
word_freq[word] += 1
word_freq['<unk>'] = cutoff + 1
word_freq = filter(lambda x: x[1] > cutoff, word_freq.items())
dictionary = sorted(word_freq, key=lambda x: (-x[1], x[0]))
words, _ = list(zip(*dictionary))
word_idx = dict(zip(words, xrange(len(words))))
return word_idx
def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
UNK = word_idx['<unk>']
qs = [Queue.Queue(maxsize=buffer_size), Queue.Queue(maxsize=buffer_size)]
def load(pattern, queue):
for doc_list in tokenize(pattern):
queue.put(doc_list)
queue.put(None)
def reader():
        # Create two threads that load positive and negative samples
        # into qs.
t0 = threading.Thread(target=load, args=(pos_pattern, qs[0], ))
t0.daemon = True
t0.start()
t1 = threading.Thread(target=load, args=(neg_pattern, qs[1], ))
t1.daemon = True
t1.start()
        # Read alternately from qs[0] and qs[1].
i = 0
doc_list = qs[i].get()
        while doc_list is not None:
ids_list = []
for doc in doc_list:
ids_list.append([word_idx.get(w, UNK) for w in doc])
yield ids_list, i % 2
i += 1
doc_list = qs[i % 2].get()
# If any queue is empty, reads from the other queue.
i += 1
doc_list = qs[i % 2].get()
        while doc_list is not None:
ids_list = []
for doc in doc_list:
ids_list.append([word_idx.get(w, UNK) for w in doc])
yield ids_list, i % 2
doc_list = qs[i % 2].get()
return reader()
def imdb_train(word_idx):
"""
IMDB training set creator.
It returns a reader creator, each sample in the reader is an zero-based ID
subsequence and label in [0, 1].
:param word_idx: word dictionary
:type word_idx: dict
:return: Training reader creator
:rtype: callable
"""
return reader_creator(
re.compile("aclImdb/train/pos/.*\.txt$"),
re.compile("aclImdb/train/neg/.*\.txt$"), word_idx, 1000)
def imdb_test(word_idx):
"""
IMDB test set creator.
It returns a reader creator, each sample in the reader is an zero-based ID
subsequence and label in [0, 1].
:param word_idx: word dictionary
:type word_idx: dict
:return: Test reader creator
:rtype: callable
"""
return reader_creator(
re.compile("aclImdb/test/pos/.*\.txt$"),
re.compile("aclImdb/test/neg/.*\.txt$"), word_idx, 1000)
def imdb_word_dict():
"""
Build a word dictionary from the corpus.
:return: Word dictionary
:rtype: dict
"""
return imdb_build_dict(
re.compile("aclImdb/((train)|(test))/((pos)|(neg))/.*\.txt$"), 150)
def build_dict(data_dir, save_path, use_col=1, cutoff_fre=1):
values = collections.defaultdict(int)
for file_name in os.listdir(data_dir):
file_path = os.path.join(data_dir, file_name)
if not os.path.isfile(file_path):
continue
with open(file_path, "r") as fdata:
for line in fdata:
line_splits = line.strip().split("\t")
if len(line_splits) < use_col:
continue
doc = line_splits[use_col]
for sent in doc.strip().split("."):
for w in sent.split():
values[w] += 1
values['<unk>'] = cutoff_fre
with open(save_path, "w") as f:
for v, count in sorted(
values.iteritems(), key=lambda x: x[1], reverse=True):
if count < cutoff_fre:
break
f.write("%s\t%d\n" % (v, count))
def load_dict(dict_path):
return dict((line.strip().split("\t")[0], idx)
for idx, line in enumerate(open(dict_path, "r").readlines()))
def train_reader(data_dir, word_dict):
"""
Reader interface for training data
:param data_dir: data directory
:type data_dir: str
    :param word_dict: the word dictionary; it must contain
        an "<unk>" entry for out-of-vocabulary words.
    :type word_dict: Python dict
"""
def reader():
UNK_ID = word_dict['<unk>']
word_col = 1
lbl_col = 0
for file_name in os.listdir(data_dir):
file_path = os.path.join(data_dir, file_name)
if not os.path.isfile(file_path):
continue
with open(file_path, "r") as f:
for line in f:
line_split = line.strip().split("\t")
doc = line_split[word_col]
doc_ids = []
for sent in doc.strip().split("."):
sent_ids = [
word_dict.get(w, UNK_ID) for w in sent.split()
]
if sent_ids:
doc_ids.append(sent_ids)
yield doc_ids, int(line_split[lbl_col])
return reader
def infer_reader(file_path, word_dict):
"""
Reader interface for prediction
    :param file_path: path of the file to predict on
    :type file_path: str
    :param word_dict: the word dictionary; it must contain
        an "<unk>" entry for out-of-vocabulary words.
    :type word_dict: Python dict
"""
def reader():
UNK_ID = word_dict['<unk>']
with open(file_path, "r") as f:
for doc in f:
doc_ids = []
for sent in doc.strip().split("."):
sent_ids = [word_dict.get(w, UNK_ID) for w in sent.split()]
if sent_ids:
doc_ids.append(sent_ids)
yield doc_ids, doc
return reader
#!/usr/bin/env python
# -*- coding: utf-8 -*-
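# train.py: training script for this example.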
import os
import sys
import gzip
import paddle.v2 as paddle
import reader
from network_conf import nest_net
from utils import logger, parse_train_cmd
def train(train_data_dir=None,
          test_data_dir=None,
          word_dict_path=None,
          model_save_dir="models",
          batch_size=32,
          num_passes=10,
          class_num=2):
    """
    :param train_data_dir: path of training data; if not specified,
        the imdb dataset is used to run this example
    :type train_data_dir: str
    :param test_data_dir: path of testing data; if not specified,
        the imdb dataset is used to run this example
    :type test_data_dir: str
    :param word_dict_path: path of the word dictionary; if not specified
        or the file does not exist, it is built from the training data
    :type word_dict_path: str
    :param model_save_dir: directory where the trained models are saved
    :type model_save_dir: str
    :param batch_size: number of training examples in one forward/backward pass
    :type batch_size: int
    :param num_passes: number of passes to train
    :type num_passes: int
    :param class_num: number of classes (only used with custom data)
    :type class_num: int
    """
if not os.path.exists(model_save_dir):
os.mkdir(model_save_dir)
use_default_data = (train_data_dir is None)
if use_default_data:
logger.info(("No training data are porivided, "
"use imdb to train the model."))
logger.info("please wait to build the word dictionary ...")
word_dict = reader.imdb_word_dict()
train_reader = paddle.batch(
paddle.reader.shuffle(
lambda: reader.imdb_train(word_dict), buf_size=1000),
batch_size=100)
test_reader = paddle.batch(
lambda: reader.imdb_test(word_dict), batch_size=100)
class_num = 2
else:
if word_dict_path is None or not os.path.exists(word_dict_path):
logger.info(("word dictionary is not given, the dictionary "
"is automatically built from the training data."))
# build the word dictionary to map the original string-typed
# words into integer-typed index
reader.build_dict(
data_dir=train_data_dir,
save_path=word_dict_path,
use_col=1,
cutoff_fre=0)
word_dict = reader.load_dict(word_dict_path)
logger.info("class number is : %d." % class_num)
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader(train_data_dir, word_dict), buf_size=1000),
batch_size=batch_size)
if test_data_dir is not None:
# here, because training and testing data share a same format,
# we still use the reader.train_reader to read the testing data.
test_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader(test_data_dir, word_dict),
buf_size=1000),
batch_size=batch_size)
else:
test_reader = None
dict_dim = len(word_dict)
emb_size = 28
hidden_size = 128
logger.info("length of word dictionary is : %d." % (dict_dim))
paddle.init(use_gpu=True, trainer_count=4)
# network config
cost, prob, label = nest_net(
dict_dim, emb_size, hidden_size, class_num, is_infer=False)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# create trainer
trainer = paddle.trainer.SGD(
cost=cost,
extra_layers=paddle.evaluator.auc(input=prob, label=label),
parameters=parameters,
update_equation=adam_optimizer)
# begin training network
feeding = {"word": 0, "label": 1}
def _event_handler(event):
"""
Define end batch and end pass event handler
"""
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
logger.info("Pass %d, Batch %d, Cost %f, %s\n" % (
event.pass_id, event.batch_id, event.cost, event.metrics))
if isinstance(event, paddle.event.EndPass):
if test_reader is not None:
result = trainer.test(reader=test_reader, feeding=feeding)
logger.info("Test at Pass %d, %s \n" % (event.pass_id,
result.metrics))
with gzip.open(
os.path.join(model_save_dir, "params_pass_%05d.tar.gz" %
event.pass_id), "w") as f:
parameters.to_tar(f)
trainer.train(
reader=train_reader,
event_handler=_event_handler,
feeding=feeding,
num_passes=num_passes)
logger.info("Training has finished.")
def main(args):
    train(
        train_data_dir=args.train_data_dir,
        test_data_dir=args.test_data_dir,
        word_dict_path=args.word_dict,
        class_num=args.class_num,
        batch_size=args.batch_size,
        num_passes=args.num_passes,
        model_save_dir=args.model_save_dir)
if __name__ == "__main__":
args = parse_train_cmd()
if args.train_data_dir is not None:
        assert args.word_dict, ("the parameters --train_data_dir and "
                                "--word_dict should be set at the same time.")
main(args)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
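# utils.py: logger setup and command line argument parsing.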
import logging
import os
import argparse
from collections import defaultdict
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
def parse_train_cmd():
parser = argparse.ArgumentParser(
description="PaddlePaddle text classification demo")
parser.add_argument(
"--train_data_dir",
type=str,
required=False,
help=("path of training dataset (default: None). "
"if this parameter is not set, "
"imdb dataset will be used."),
default=None)
parser.add_argument(
"--test_data_dir",
type=str,
required=False,
help=("path of testing dataset (default: None). "
"if this parameter is not set, "
"imdb dataset will be used."),
default=None)
parser.add_argument(
"--word_dict",
type=str,
required=False,
help=("path of word dictionary (default: None)."
"if this parameter is not set, imdb dataset will be used."
"if this parameter is set, but the file does not exist, "
"word dictionay will be built from "
"the training data automatically."),
default=None)
parser.add_argument(
"--class_num",
type=int,
required=False,
help=("class number."),
default=2)
parser.add_argument(
"--batch_size",
type=int,
default=32,
help="the number of training examples in one forward/backward pass")
parser.add_argument(
"--num_passes", type=int, default=10, help="number of passes to train")
parser.add_argument(
"--model_save_dir",
type=str,
required=False,
help=("path to save the trained models."),
default="models")
return parser.parse_args()