diff --git a/understand_sentiment/.gitignore b/understand_sentiment/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..40d91f5b09769c44a4b31d4c2454b3a8117295ae --- /dev/null +++ b/understand_sentiment/.gitignore @@ -0,0 +1,12 @@ +data/aclImdb +data/imdb +data/pre-imdb +data/mosesdecoder-master +logs/ +model_output +dataprovider_copy_1.py +model.list +test.log +train.log +*.pyc +.DS_Store diff --git a/understand_sentiment/README.md b/understand_sentiment/README.md index b15e24afacd4fcf6b19fdbb9017d45813242194c..43eba6d68fe9bc904435b3a488184264b7cff7d2 100644 --- a/understand_sentiment/README.md +++ b/understand_sentiment/README.md @@ -11,14 +11,16 @@ | 为了讽刺官场刻意丑化农村人的傻片子,圆方镜头全程炫技,色调背景美则美矣,但剧情拖沓,口音不伦不类,一直努力却始终无法入戏。不建议进电影院观看,不然睡着了躺都没地方躺。| 负面| |剧情四星。但是圆镜视角加上婺源的风景整个非常有中国写意山水画的感觉,看得实在太舒服了。。难怪作为今年TIFF special presentation的开幕电影。范爷美爆,再往上加一星。|正面| -
表格 1 电影评论情感分析
+

表格 1 电影评论情感分析


  实际上,在自然语言处理中,情感分析属于典型的**文本分类**问题,即,把需要进行情感分析的文本划分为其所属类别。文本分类问题可以分解为两个子问题:文本表示和分类。在深度学习的方法出现之前,主流的文本表示方法为BOW(bag of words),分类方法有SVM,LR,Boosting等等。BOW忽略了词的顺序信息,而且是高维度的稀疏向量表示,这种表示浮于表面,并未充分表示文本的语义信息。例如,句子`这部电影糟糕透了`和`一个乏味,空洞,没有内涵的作品`在情感分析中具有很高的语义相似度,但是它们的BOW表示的相似度为0。又如,句子`小明很喜欢小芳,但是小芳不喜欢小明`和`小芳很喜欢小明,但是小明不喜欢小芳`的BOW相似度为1,但实际上它们的意思很不一样。本章我们所要介绍的深度学习模型克服了BOW表示的上述缺陷,它在考虑词的顺序的基础上把文本映射到低维度的语义空间,并且以端对端(end to end)的方式进行文本表示及分类,其性能相对于传统方法有显著的提升。 ## 模型概览
  本章所使用的文本表示模型为卷积神经网络(Convolutional Neural Networks)和循环神经网络(Recurrent Neural Networks)及其扩展。我们首先介绍处理文本的卷积神经网络。 ### 文本卷积神经网络
  卷积神经网络经常用来处理具有类似网格拓扑结构(grid-like topology)的数据。例如,图像可以视为2D网格的像素点,自然语言可以视为1D的词序列。卷积神经网络可以提取多种局部特征,并对其进行组合抽象得到更高级的特征表示,且其对于数据的某些变化具有不变性。大量实验表明,卷积神经网络能高效地对图像及文本问题进行建模处理。本小节我们讲解如何使用卷积神经网络处理文本(以句子为例)。 -
![rnn](image/text_cnn.png)
-
图 1 卷积神经网络文本分类模型
+

+
+图 1 卷积神经网络文本分类模型 +


  假设一个句子的长度为$n$,其中第$i$个词的word embedding为$x_i\in\mathbb{R}^k$,其维度大小为$k$,我们可以将整个句子表示为$x_{1:n}=x_1\oplus x_2\oplus \ldots \oplus x_n$,其中,$\oplus$表示拼接(concatenation)操作。一般地,我们用$x_{i:i+j}$表示词序列$x_{i},x_{i+1},\ldots,x_{i+j}$的拼接。卷积操作把filter(也称为kernel)$w\in\mathbb{R}^{hk}$应用于包含$h$个词的窗口$x_{i:i+h-1}$,得到特征$c_i$: $$c_i=f(w\cdot x_{i:i+h-1}+b)$$
  其中$b\in\mathbb{R}$为偏置项(bias),$f$为非线性激活函数,如sigmoid。将filter应用于句子中所有的词窗口$\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$序列,产生一个feature map: @@ -31,9 +33,11 @@ $$\hat c=max(c)$$ #### 简单的循环神经网络
  循环神经网络是一种能对序列数据进行精确建模的有力工具。实际上,循环神经网络的理论计算能力是图灵完备的(Siegelmann, H. T. and Sontag, E. D., 1995)。
  自然语言是一种典型的序列数据(词序列),近年来,循环神经网络及其变体(如lstm等)在自然语言处理的多个领域取得了丰硕的成果,如在语言模型,句法解析,语义角色标注(或一般的序列标注),语义表示,图文生成,对话,机器翻译等任务上均表现优异甚至成为目前效果最好的方法。 -
![rnn](image/rnn.png)
-
图 1 循环神经网络按时间展开的示意图
-
  循环神经网络按时间展开后如图1所示:在第$t$时刻,网络读入第$t$个输入$x_t$(向量表示)及前一时刻隐藏层的输出$h_{t-1}$(向量表示,$h_0$一般初始化为$0$向量),计算得出本时刻隐藏层的值$h_t$,重复这一步骤直至读完所有输入。如果将循环神经网络所表示的函数记为$f$,则其公式可表示为: +

+
+图 2 循环神经网络按时间展开的示意图 +

+
  循环神经网络按时间展开后如图2所示:在第$t$时刻,网络读入第$t$个输入$x_t$(向量表示)及前一时刻隐藏层的输出$h_{t-1}$(向量表示,$h_0$一般初始化为$0$向量),计算得出本时刻隐藏层的值$h_t$,重复这一步骤直至读完所有输入。如果将循环神经网络所表示的函数记为$f$,则其公式可表示为: $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
  其中$W_{xh}$是输入到隐层的矩阵参数,$W_{hh}$是隐层到隐层的矩阵参数,$b_h$为隐层的偏置向量(bias)参数,$\sigma$为elementwise的sigmoid函数。在处理自然语言时,一般会先将词(one-hot表示)映射为其embedding表示,然后再作为循环神经网络每一时刻的输入$x_t$。可以根据实际需要的不同在循环神经网络的隐层上连接其它层。如,可以把一个循环神经网络的隐层输出连接至下一个循环神经网络的输入构建深层(deep or stacked)循环神经网络,或者提取最后一个时刻的隐层状态作为句子表示进而使用分类模型等等。
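  为了帮助理解上面的递推公式,下面给出一个只依赖numpy的极简前向计算示意(与PaddlePaddle的实际实现无关;词表大小、embedding维度、隐层维度以及随机初始化的参数均为演示用的假设值):

```python
import numpy as np

np.random.seed(0)
vocab_size, emb_dim, hid_dim = 1000, 8, 16               # 假设的词表大小、embedding维度、隐层维度

embedding = np.random.randn(vocab_size, emb_dim) * 0.1   # 词向量表(随机初始化,仅作演示)
W_xh = np.random.randn(emb_dim, hid_dim) * 0.1           # 输入到隐层的矩阵参数
W_hh = np.random.randn(hid_dim, hid_dim) * 0.1           # 隐层到隐层的矩阵参数
b_h = np.zeros(hid_dim)                                  # 隐层偏置

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_rnn(word_ids):
    """按时间顺序读入词id序列,返回每个时刻的隐层状态。"""
    h = np.zeros(hid_dim)                 # h_0 初始化为0向量
    states = []
    for w in word_ids:                    # 第t步读入第t个词
        x_t = embedding[w]                # 先把词(one-hot)映射为embedding,再作为本时刻输入
        h = sigmoid(x_t.dot(W_xh) + h.dot(W_hh) + b_h)   # h_t = sigma(W_xh x_t + W_hh h_{t-1} + b_h)
        states.append(h)
    return np.array(states)

states = simple_rnn([2, 45, 7, 300, 9])   # 一个长度为5的句子(词id为假设值)
sentence_vec = states[-1]                 # 取最后时刻的隐层状态作为句子表示
print(sentence_vec.shape)                 # (16,)
```

  可以看到,句子再长也只是循环次数增多,隐层状态的维度始终不变,这正是后文取最后时刻隐层状态(或做pooling)作为句子定长表示的基础。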
  可以看出,隐状态的输入来源于当前输入和前一时刻隐状态的值,这会导致很久以前的输入容易被覆盖掉。实际上,人们发现当序列很长时,循环神经网络就会表现很差(远距离依赖问题),训练过程中会出现梯度消失或爆炸现象(Bengio Y, Simard P, Frasconi P., 1994)。为了解决这一问题,Hochreiter S, Schmidhuber J. (1997)提出了lstm模型。 @@ -52,15 +56,430 @@ h_t & = o_t\odot tanh(c_t)\\\\ $$ h_t=Recurrent(x_t,h_{t-1})$$
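  为便于对照,这里用numpy给出单步LSTM计算的一个简化示意,采用常见的输入门$i_t$、遗忘门$f_t$、输出门$o_t$公式(未包含peephole连接等变体,与PaddlePaddle中`lstmemory`的具体实现细节可能不同;各维度与参数均为演示用的假设值):

```python
import numpy as np

np.random.seed(0)
x_dim, hid_dim = 8, 16                    # 假设的输入维度与隐层维度

def rand(*shape):
    return np.random.randn(*shape) * 0.1

# 输入门i、遗忘门f、输出门o、候选记忆g各自的参数(W, U, b),均为随机初始化的假设值
params = {gate: (rand(x_dim, hid_dim), rand(hid_dim, hid_dim), np.zeros(hid_dim))
          for gate in 'ifog'}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    """单步LSTM:返回本时刻的隐层状态h_t和记忆单元c_t。"""
    def affine(gate):
        W, U, b = params[gate]
        return x_t.dot(W) + h_prev.dot(U) + b
    i_t = sigmoid(affine('i'))            # 输入门
    f_t = sigmoid(affine('f'))            # 遗忘门
    o_t = sigmoid(affine('o'))            # 输出门
    g_t = np.tanh(affine('g'))            # 候选记忆
    c_t = f_t * c_prev + i_t * g_t        # 记忆单元:有选择地遗忘旧信息、写入新信息
    h_t = o_t * np.tanh(c_t)              # h_t = o_t ⊙ tanh(c_t)
    return h_t, c_t

h, c = np.zeros(hid_dim), np.zeros(hid_dim)
for x_t in np.random.randn(5, x_dim):     # 依次读入5个时刻的输入(随机数仅作演示)
    h, c = lstm_step(x_t, h, c)
print(h.shape)                            # (16,)
```

  门控机制使信息可以沿记忆单元$c_t$较长距离地保留和传播,从而缓解上文提到的远距离依赖与梯度消失问题。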
  对于正常顺序的循环神经网络而言,$h_t$包含了$t$时刻之前的输入信息,也就是上文信息。同样,为了得到下文信息,我们可以使用反方向(将输入逆序处理)的循环神经网络。结合构建深层循环神经网络的方法,我们可以构建更加强有力的深层双向循环神经网络(deep bi-directional recurrent neural networks)对时序数据进行建模。 #### 使用循环神经网络的组合进行文本分类 -
  一个简单的做法是分别使用正向lstm-rnn和反向lstm-rnn处理文本,取最后一个时刻的隐层值拼接起来做为文本的定长向量表示,将其连接至softmax得到文本分类模型。但是这样的文本分类模型是一个浅层模型。考虑到深层神经网络往往能得到更抽象和高级的特征表示,我们构建stacked lstm-rnn。如图2所示(以三层为例),奇数层lstm正向,偶数层lstm反向,高一层的lstm使用低一层lstm及之前所有层的信息作为输入,对最高层lstm序列使用max pooling over time得到文本定长向量表示。**这一表示充分融合了文本的上下文信息,并且对文本进行了深层次抽象。**最后我们将文本表示连接至softmax构建分类模型。 -
![rnn](image/stacked_lstm.jpg)
-
图 2 stacked lstm-rnn for text classification
+
  一个简单的做法是分别使用正向lstm-rnn和反向lstm-rnn处理文本,取最后一个时刻的隐层值拼接起来作为文本的定长向量表示,将其连接至softmax得到文本分类模型。但是这样的文本分类模型是一个浅层模型。考虑到深层神经网络往往能得到更抽象和高级的特征表示,我们构建stacked lstm-rnn。如图3所示(以三层为例),奇数层lstm正向,偶数层lstm反向,高一层的lstm使用低一层lstm及之前所有层的信息作为输入,对最高层lstm序列使用max pooling over time得到文本定长向量表示。**这一表示充分融合了文本的上下文信息,并且对文本进行了深层次抽象。**最后我们将文本表示连接至softmax构建分类模型。 +

+
+图 3 stacked lstm-rnn for text classification +

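  在进入数据准备之前,下面用numpy简单演示max pooling over time如何把变长的隐层状态序列变成与序列长度无关的定长向量(仅为示意,隐层状态用随机数代替;这对应后文`stacked_lstm_net`配置中对最后一层`fc_layer`和`lstmemory`的输出分别做`MaxPooling`再拼接的做法):

```python
import numpy as np

np.random.seed(0)
seq_len, hid_dim = 6, 4                       # 假设的序列长度与隐层维度

# 假设这是最高层在每个时刻输出的两路隐层状态序列(随机数仅作演示)
fc_seq = np.random.randn(seq_len, hid_dim)
lstm_seq = np.random.randn(seq_len, hid_dim)

# max pooling over time:对每一维在时间轴上取最大值,序列长度变化不影响结果维度
fc_vec = fc_seq.max(axis=0)
lstm_vec = lstm_seq.max(axis=0)

# 分类层的输入是两个定长向量的拼接,再接softmax即可得到类别概率
text_vec = np.concatenate([fc_vec, lstm_vec])
print(text_vec.shape)                         # (8,)
```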
+## 数据准备 +### 数据介绍与下载 +我们以IMDB情感分析数据集为例进行介绍。训练模型之前, 我们需要预处理数椐并构建一个字典。 首先, 你可以使用下面的脚本下载 IMDB 数椐集和[Moses](http://www.statmt.org/moses/)工具, 我们提供了一个数据预处理脚本,它不仅能够处理IMDB数据,还能处理其他用户自定义的数据。 为了使用提前编写的脚本,需要将标记的训练和测试样本移动到另一个路径,这已经在`get_imdb.sh`中完成。 +``` +./get_imdb.sh +``` +如果数椐获取成功,你将在目录```data```中看到下面的文件: +``` +aclImdb get_imdb.sh imdb mosesdecoder-master +``` +* aclImdb: 从外部网站上下载的原始数椐集。 +* imdb: 仅包含训练和测试数椐集。 +* mosesdecoder-master: Moses 工具。 +IMDB数据集包含25,000个已标注过的电影评论用于训练,25,000个用于测试。负面的评论的得分小于等于4,正面的评论的得分大于等于7,满分10分。 +### 数据预处理 +在这个例子中,我们只使用已经标注过的训练集和测试集,且默认在训练集上构建字典。训练集已经做了随机打乱排序。 Moses 工具中的脚本`tokenizer.perl` 用于切分单单词和标点符号。执行下面的命令就可以预处理数椐。 +``` +./preprocess.sh +``` +preprocess.sh: +``` +data_dir="./data/imdb" +python preprocess.py -i data_dir +``` +* data_dir: 输入数椐所在目录。 +* preprocess.py: 预处理脚本。 + +运行成功后目录`data/pre-imdb` 结构如下: + +``` +dict.txt labels.list test.list test_part_000 train.list train_part_000 +``` + +* test\_part\_000 and train\_part\_000: 所有标记的测试集和训练集, 训练集已经随机打乱。 +* train.list and test.list: 训练集和测试集文件列表。 +* dict.txt: 利用训练集生成的字典。 +* labels.txt: neg 0, pos 1, 含义:标签0表示负面的评论,标签1表示正面的评论。 + +### 提供数据给PaddlePaddle +PaddlePaddle可以读取Python写的传输数据脚本,下面dataprovider.py文件给出了完整例子,主要包括两部分: + +* hook: 定义文本信息、类别Id的数据类型。文本被定义为整数序列`integer_value_sequence`,类别被定义为整数`integer_value` +* process: yield文本信息和类别Id,和hook里定义顺序一致。process读取的文件的行为类别和评论文本,以`'\t\t'`分隔。 + +```python +from paddle.trainer.PyDataProvider2 import * + +def hook(settings, dictionary, **kwargs): + settings.word_dict = dictionary + settings.input_types = [ + integer_value_sequence(len(settings.word_dict)), integer_value(2) + ] + settings.logger.info('dict len : %d' % (len(settings.word_dict))) + + +@provider(init_hook=hook) +def process(settings, file_name): + with open(file_name, 'r') as fdata: + for line_count, line in enumerate(fdata): + label, comment = line.strip().split('\t\t') + label = int(label) + words = comment.split() + word_slot = [ + settings.word_dict[w] for w in words if w in settings.word_dict + ] + yield word_slot, label +``` + +## 模型配置说明 +`trainer_config.py` 是一个配置文件的例子。第一行从`sentiment_net.py`中导出预定义的网络。 + +trainer_config.py: + +```python +from sentiment_net import * + +data_dir = "./data/pre-imdb" +# whether this config is used for test +is_test = get_config_arg('is_test', bool, False) +# whether this config is used for prediction +is_predict = get_config_arg('is_predict', bool, False) +dict_dim, class_dim = sentiment_data(data_dir, is_test, is_predict) + +################## Algorithm Config ##################### + +settings( + batch_size=128, + learning_rate=2e-3, + learning_method=AdamOptimizer(), + regularization=L2Regularization(8e-4), + gradient_clipping_threshold=25 +) + +#################### Network Config ###################### +stacked_lstm_net(dict_dim, class_dim=class_dim, + stacked_num=3, is_predict=is_predict) +# bidirectional_lstm_net(dict_dim, class_dim=class_dim, is_predict=is_predict) +# convolution_net(dict_dim, class_dim=class_dim, is_predict=is_predict) +``` + +get\_config\_arg(): 获取通过 `--config_args=xx` 设置的命令行参数。 +### 优化算法配置 + + * 使用随机梯度下降(sgd)算法。 + * 使用 adam 优化。 + * 设置batch size大小为128。 + * 设置全局学习率。 + * 设置L2正则。 + * 设置梯度裁剪(clipping)阈值。 + +### 数据定义 +数据定义在方法sentiment_data之中,其实现在文件`sentiment_net.py`中: +```python +def sentiment_data(data_dir=None, + is_test=False, + is_predict=False, + train_list="train.list", + test_list="test.list", + dict_file="dict.txt"): + """ + Predefined data provider for sentiment analysis. + is_test: whether this config is used for test. 
+ is_predict: whether this config is used for prediction. + train_list: text file name, containing a list of training set. + test_list: text file name, containing a list of testing set. + dict_file: text file name, containing dictionary. + """ + dict_dim = len(open(join_path(data_dir, "dict.txt")).readlines()) + class_dim = len(open(join_path(data_dir, 'labels.list')).readlines()) + if is_predict: + return dict_dim, class_dim + + if data_dir is not None: + train_list = join_path(data_dir, train_list) + test_list = join_path(data_dir, test_list) + dict_file = join_path(data_dir, dict_file) + + train_list = train_list if not is_test else None + word_dict = dict() + with open(dict_file, 'r') as f: + for i, line in enumerate(open(dict_file, 'r')): + word_dict[line.split('\t')[0]] = i + + define_py_data_sources2( + train_list, + test_list, + module="dataprovider", + obj="process", + args={'dictionary': word_dict}) + + return dict_dim, class_dim +``` + +在模型配置中利用define_py_data_sources2加载数据: + +* train.list,test.list: 指定训练、测试数据 +* module="dataprovider": 数据处理Python文件名 +* obj="process": 指定生成数据的函数 +* args={"dictionary": word_dict}: 额外的参数,这里指定词典 + +### 模型结构 + + * `convolution_net`: 在`sentiment_net.py`中定义。其论述详见`文本卷积神经网络`小结。 + ```python + def convolution_net(input_dim, + class_dim=2, + emb_dim=128, + hid_dim=128, + is_predict=False): + data = data_layer("word", input_dim) # one-hot表示的词序列 + emb = embedding_layer(input=data, size=emb_dim) # 将one-hot表示的词序列映射为embedding序列 + conv_3 = sequence_conv_pool(input=emb, context_len=3, hidden_size=hid_dim) #窗口大小为3的convolution及max pooling操作 + conv_4 = sequence_conv_pool(input=emb, context_len=4, hidden_size=hid_dim) #窗口大小为4的convolution及max pooling操作 + output = fc_layer(input=[conv_3,conv_4], size=class_dim, act=SoftmaxActivation()) #将conv_3和conv_4拼接起来输入给softmax分类 + + if not is_predict: + lbl = data_layer("label", 1) #类别标签 + outputs(classification_cost(input=output, label=lbl)) + else: + outputs(output) + ``` + + 其中,我们仅用一个`sequence_conv_pool`方法就实现了convolution和pooling操作,filter的数量为hidden_size参数。`sequence_conv_pool`的实现详见`Paddle/python/paddle/trainer_config_helpers/networks.py`。 + + * `bidirectional_lstm_net`: 在`sentiment_net.py`中定义。其论述详见`使用循环神经网络的组合进行文本分类`小结。 + ```python + def bidirectional_lstm_net(input_dim, + class_dim=2, + emb_dim=128, + lstm_dim=128, + is_predict=False): + data = data_layer("word", input_dim) + emb = embedding_layer(input=data, size=emb_dim) + bi_lstm = bidirectional_lstm(input=emb, size=lstm_dim) # 双向lstm,其默认返回值为正向lstm-rnn和反向lstm-rnn最后一个时刻的隐层值的拼接。其实现详见`Paddle/python/paddle/trainer_config_helpers/networks.py` + dropout = dropout_layer(input=bi_lstm, dropout_rate=0.5) + output = fc_layer(input=dropout, size=class_dim, act=SoftmaxActivation()) + + if not is_predict: + lbl = data_layer("label", 1) + outputs(classification_cost(input=output, label=lbl)) + else: + outputs(output) + ``` + + * `stacked_lstm_net`: 在`sentiment_net.py`中定义,默认情况下使用此网络。其论述详见`使用循环神经网络的组合进行文本分类`小结。 + ```python + def stacked_lstm_net(input_dim, + class_dim=2, + emb_dim=128, + hid_dim=512, + stacked_num=3, + is_predict=False): + """ + A Wrapper for sentiment classification task. + This network uses bi-directional recurrent network, + consisting three LSTM layers. This configure is referred to + the paper as following url, but use fewer layrs. + http://www.aclweb.org/anthology/P15-1109 + + input_dim: here is word dictionary dimension. + class_dim: number of categories. + emb_dim: dimension of word embedding. + hid_dim: dimension of hidden layer. 
+ stacked_num: number of stacked lstm-hidden layer. + is_predict: is predicting or not. + Some layers is not needed in network when predicting. + """ + hid_lr = 1e-3 + assert stacked_num % 2 == 1 + + layer_attr = ExtraLayerAttribute(drop_rate=0.5) + fc_para_attr = ParameterAttribute(learning_rate=hid_lr) + lstm_para_attr = ParameterAttribute(initial_std=0., learning_rate=1.) + para_attr = [fc_para_attr, lstm_para_attr] + bias_attr = ParameterAttribute(initial_std=0., l2_rate=0.) + relu = ReluActivation() + linear = LinearActivation() + + data = data_layer("word", input_dim) + emb = embedding_layer(input=data, size=emb_dim) + + fc1 = fc_layer(input=emb, size=hid_dim, act=linear, bias_attr=bias_attr) + lstm1 = lstmemory( + input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr) #基于lstm的循环神经网络 + + inputs = [fc1, lstm1] + for i in range(2, stacked_num + 1): #由fc_layer和lstmemory构建双向stacked_lstm_net + fc = fc_layer( + input=inputs, + size=hid_dim, + act=linear, + param_attr=para_attr, + bias_attr=bias_attr) + lstm = lstmemory( + input=fc, + reverse=(i % 2) == 0, + act=relu, + bias_attr=bias_attr, + layer_attr=layer_attr) + inputs = [fc, lstm] + + fc_last = pooling_layer(input=inputs[0], pooling_type=MaxPooling()) #对最后一层fc_layer使用max pooling over time得到定长向量 + lstm_last = pooling_layer(input=inputs[1], pooling_type=MaxPooling()) #对最后一层lstmemory使用max pooling over time得到定长向量 + output = fc_layer( + input=[fc_last, lstm_last], + size=class_dim, + act=SoftmaxActivation(), + bias_attr=bias_attr, + param_attr=para_attr) + + if is_predict: + outputs(output) + else: + outputs(classification_cost(input=output, label=data_layer('label', 1))) + ``` + + +## 训练模型 +首先安装PaddlePaddle。 然后使用下面的脚本 `train.sh` 来开启本地的训练。 + +``` +./train.sh +``` + +train.sh: + +``` +config=trainer_config.py +output=./model_output +paddle train --config=$config \ + --save_dir=$output \ + --job=train \ + --use_gpu=false \ + --trainer_count=4 \ + --num_passes=10 \ + --log_period=20 \ + --dot_period=20 \ + --show_parameter_stats_period=100 \ + --test_all_data_in_one_period=1 \ + 2>&1 | tee 'train.log' +``` + +* \--config=$config: 设置网络配置。 +* \--save\_dir=$output: 设置输出路径以保存训练完成的模型。 +* \--job=train: 设置工作模式为训练。 +* \--use\_gpu=false: 使用CPU训练,如果你安装GPU版本的PaddlePaddle,并想使用GPU来训练可将此设置为true。 +* \--trainer\_count=4:设置线程数(或GPU个数)。 +* \--num\_passes=15: 设置pass,PaddlePaddle中的一个pass意味着对数据集中的所有样本进行一次训练。 +* \--log\_period=20: 每20个batch打印一次日志。 +* \--show\_parameter\_stats\_period=100: 每100个batch打印一次统计信息。 +* \--test\_all_data\_in\_one\_period=1: 每次测试都测试所有数据。 + +如果运行成功,输出日志保存在 `train.log`中,模型保存在目录`model_output/`中。 输出日志说明如下: + +``` +Batch=20 samples=2560 AvgCost=0.681644 CurrentCost=0.681644 Eval: classification_error_evaluator=0.36875 CurrentEval: classification_error_evaluator=0.36875 +... 
+Pass=0 Batch=196 samples=25000 AvgCost=0.418964 Eval: classification_error_evaluator=0.1922 +Test samples=24999 cost=0.39297 Eval: classification_error_evaluator=0.149406 +``` + +* Batch=xx: 表示训练了xx个Batch。 +* samples=xx: 表示训练了xx个样本。 +* AvgCost=xx: 从第0个batch到当前batch的平均损失。 +* CurrentCost=xx: 最新log_period个batch的损失。 +* Eval: classification\_error\_evaluator=xx: 表示第0个batch到当前batch的分类错误。 +* CurrentEval: classification\_error\_evaluator: 最新log_period个batch的分类错误。 +* Pass=0: 通过所有训练集一次称为一个Pass。 0表示第一次经过训练集。 + +默认情况下,我们使用`stacked_lstm_net`网络,如果要使用双向LSTM或卷积网络,注释相应的行即可。 +## 应用模型 +### 测试模型 + +测试模型是指使用训练出的模型评估已标记的数据集。 + +``` +./test.sh +``` + +test.sh: + +```bash +function get_best_pass() { + cat $1 | grep -Pzo 'Test .*\n.*pass-.*' | \ + sed -r 'N;s/Test.* error=([0-9]+\.[0-9]+).*\n.*pass-([0-9]+)/\1 \2/g' | \ + sort | head -n 1 +} + +log=train.log +LOG=`get_best_pass $log` +LOG=(${LOG}) +evaluate_pass="model_output/pass-${LOG[1]}" + +echo 'evaluating from pass '$evaluate_pass + +model_list=./model.list +touch $model_list | echo $evaluate_pass > $model_list +net_conf=trainer_config.py +paddle train --config=$net_conf \ + --model_list=$model_list \ + --job=test \ + --use_gpu=false \ + --trainer_count=4 \ + --config_args=is_test=1 \ + 2>&1 | tee 'test.log' +``` + +函数`get_best_pass`依据分类错误率获取最佳模型。 与训练不同,测试时需要指定`--job = test`和模型路径,即`--model_list = $model_list`。如果运行成功,日志将保存在“test.log”中。例如,在我们的测试中,最好的模型是`model_output / pass-00002`,分类误差是0.115645,如下: + +``` +Pass=0 samples=24999 AvgCost=0.280471 Eval: classification_error_evaluator=0.115645 +``` + +### 预测 +`predict.py`脚本提供了一个预测接口。在使用它之前请安装PaddlePaddle的python api。 预测IMDB的未标记评论的一个实例如下: + +``` +./predict.sh +``` +predict.sh: + +```bash +#Note the default model is pass-00002, you shold make sure the model path +#exists or change the mode path. +model=model_output/pass-00002/ +config=trainer_config.py +label=data/pre-imdb/labels.list +cat ./data/aclImdb/test/pos/10007_10.txt | python predict.py \ + --tconf=$config\ + --model=$model \ + --label=$label \ + --dict=./data/pre-imdb/dict.txt \ + --batch_size=1 +``` + +* `cat ./data/aclImdb/test/pos/10007_10.txt` : 输入预测样本。 +* `predict.py` : 预测接口脚本。 +* `--tconf=$config` : 设置网络配置。 +* `--model=$model` : 设置模型路径。 +* `--label=$label` : 设置标签类别字典,这个字典是整数标签和字符串标签的一个对应。 +* `--dict=data/pre-imdb/dict.txt` : 设置文本数据字典文件。 +* `--batch_size=1` : 预测时将batch size设置为1。 + +注意应该确保默认模型路径`model_output / pass-00002`存在或更改为其它模型路径。 + +本示例的预测结果: + +``` +Loading parameters from model_output/pass-00002/ +./data/aclImdb/test/pos/10014_7.txt: predicting label is pos +``` + +## 总结 ## 参考文献 \ No newline at end of file diff --git a/understand_sentiment/data/get_imdb.sh b/understand_sentiment/data/get_imdb.sh new file mode 100755 index 0000000000000000000000000000000000000000..7600af6fbb900ee845702f1297779c1f0ed9bf84 --- /dev/null +++ b/understand_sentiment/data/get_imdb.sh @@ -0,0 +1,51 @@ +#!/bin/bash +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -e +set -x + +DIR="$( cd "$(dirname "$0")" ; pwd -P )" +cd $DIR + +#download the dataset +echo "Downloading aclImdb..." +#http://ai.stanford.edu/%7Eamaas/data/sentiment/ +wget http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz + +echo "Downloading mosesdecoder..." +#https://github.com/moses-smt/mosesdecoder +wget https://github.com/moses-smt/mosesdecoder/archive/master.zip + +#extract package +echo "Unzipping..." +tar -zxvf aclImdb_v1.tar.gz +unzip master.zip + +#move train and test set to imdb_data directory +#in order to process when traing +mkdir -p imdb/train +mkdir -p imdb/test + +cp -r aclImdb/train/pos/ imdb/train/pos +cp -r aclImdb/train/neg/ imdb/train/neg + +cp -r aclImdb/test/pos/ imdb/test/pos +cp -r aclImdb/test/neg/ imdb/test/neg + +#remove compressed package +rm aclImdb_v1.tar.gz +rm master.zip + +echo "Done." diff --git a/understand_sentiment/dataprovider.py b/understand_sentiment/dataprovider.py new file mode 100755 index 0000000000000000000000000000000000000000..00f72cecacb454a0dd1184fa2098be4543007de7 --- /dev/null +++ b/understand_sentiment/dataprovider.py @@ -0,0 +1,35 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from paddle.trainer.PyDataProvider2 import * + + +def hook(settings, dictionary, **kwargs): + settings.word_dict = dictionary + settings.input_types = [ + integer_value_sequence(len(settings.word_dict)), integer_value(2) + ] + settings.logger.info('dict len : %d' % (len(settings.word_dict))) + + +@provider(init_hook=hook) +def process(settings, file_name): + with open(file_name, 'r') as fdata: + for line_count, line in enumerate(fdata): + label, comment = line.strip().split('\t\t') + label = int(label) + words = comment.split() + word_slot = [ + settings.word_dict[w] for w in words if w in settings.word_dict + ] + yield word_slot, label diff --git a/understand_sentiment/predict.py b/understand_sentiment/predict.py new file mode 100755 index 0000000000000000000000000000000000000000..8ec490f64691924013200a3d0038d39aa834b038 --- /dev/null +++ b/understand_sentiment/predict.py @@ -0,0 +1,150 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os, sys +import numpy as np +from optparse import OptionParser +from py_paddle import swig_paddle, DataProviderConverter +from paddle.trainer.PyDataProvider2 import integer_value_sequence +from paddle.trainer.config_parser import parse_config +""" +Usage: run following command to show help message. + python predict.py -h +""" + + +class SentimentPrediction(): + def __init__(self, train_conf, dict_file, model_dir=None, label_file=None): + """ + train_conf: trainer configure. + dict_file: word dictionary file name. + model_dir: directory of model. + """ + self.train_conf = train_conf + self.dict_file = dict_file + self.word_dict = {} + self.dict_dim = self.load_dict() + self.model_dir = model_dir + if model_dir is None: + self.model_dir = os.path.dirname(train_conf) + + self.label = None + if label_file is not None: + self.load_label(label_file) + + conf = parse_config(train_conf, "is_predict=1") + self.network = swig_paddle.GradientMachine.createFromConfigProto( + conf.model_config) + self.network.loadParameters(self.model_dir) + input_types = [integer_value_sequence(self.dict_dim)] + self.converter = DataProviderConverter(input_types) + + def load_dict(self): + """ + Load dictionary from self.dict_file. + """ + for line_count, line in enumerate(open(self.dict_file, 'r')): + self.word_dict[line.strip().split('\t')[0]] = line_count + return len(self.word_dict) + + def load_label(self, label_file): + """ + Load label. + """ + self.label = {} + for v in open(label_file, 'r'): + self.label[int(v.split('\t')[1])] = v.split('\t')[0] + + def get_index(self, data): + """ + transform word into integer index according to the dictionary. + """ + words = data.strip().split() + word_slot = [self.word_dict[w] for w in words if w in self.word_dict] + return word_slot + + def batch_predict(self, data_batch): + input = self.converter(data_batch) + output = self.network.forwardTest(input) + prob = output[0]["value"] + labs = np.argsort(-prob) + for idx, lab in enumerate(labs): + if self.label is None: + print("predicting label is %d" % (lab[0])) + else: + print("predicting label is %s" % (self.label[lab[0]])) + + +def option_parser(): + usage = "python predict.py -n config -w model_dir -d dictionary -i input_file " + parser = OptionParser(usage="usage: %s [options]" % usage) + parser.add_option( + "-n", + "--tconf", + action="store", + dest="train_conf", + help="network config") + parser.add_option( + "-d", + "--dict", + action="store", + dest="dict_file", + help="dictionary file") + parser.add_option( + "-b", + "--label", + action="store", + dest="label", + default=None, + help="dictionary file") + parser.add_option( + "-c", + "--batch_size", + type="int", + action="store", + dest="batch_size", + default=1, + help="the batch size for prediction") + parser.add_option( + "-w", + "--model", + action="store", + dest="model_path", + default=None, + help="model path") + return parser.parse_args() + + +def main(): + options, args = option_parser() + train_conf = options.train_conf + batch_size = options.batch_size + dict_file = options.dict_file + model_path = options.model_path + label = options.label + swig_paddle.initPaddle("--use_gpu=0") + predict = SentimentPrediction(train_conf, dict_file, model_path, label) + + batch = [] + for line in sys.stdin: + batch.append([predict.get_index(line)]) + if len(batch) == batch_size: + predict.batch_predict(batch) + batch = [] + if len(batch) > 0: + predict.batch_predict(batch) + + +if __name__ == '__main__': + main() diff --git a/understand_sentiment/predict.sh 
b/understand_sentiment/predict.sh new file mode 100755 index 0000000000000000000000000000000000000000..c72a8e8641516543ef267fcb4b448630246d1e8d --- /dev/null +++ b/understand_sentiment/predict.sh @@ -0,0 +1,27 @@ +#!/bin/bash +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +set -e + +#Note the default model is pass-00002, you shold make sure the model path +#exists or change the mode path. +model=model_output/pass-00002/ +config=trainer_config.py +label=data/pre-imdb/labels.list +cat ./data/aclImdb/test/pos/10007_10.txt | python predict.py \ + --tconf=$config\ + --model=$model \ + --label=$label \ + --dict=./data/pre-imdb/dict.txt \ + --batch_size=1 diff --git a/understand_sentiment/preprocess.py b/understand_sentiment/preprocess.py new file mode 100755 index 0000000000000000000000000000000000000000..29b3682b747c66574590de5ea70574981cc536bb --- /dev/null +++ b/understand_sentiment/preprocess.py @@ -0,0 +1,359 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +import random +import operator +import numpy as np +from subprocess import Popen, PIPE +from os.path import join as join_path +from optparse import OptionParser + +from paddle.utils.preprocess_util import * +""" +Usage: run following command to show help message. + python preprocess.py -h +""" + + +def save_dict(dict, filename, is_reverse=True): + """ + Save dictionary into file. + dict: input dictionary. + filename: output file name, string. + is_reverse: True, descending order by value. + False, ascending order by value. + """ + f = open(filename, 'w') + for k, v in sorted(dict.items(), key=operator.itemgetter(1),\ + reverse=is_reverse): + f.write('%s\t%s\n' % (k, v)) + f.close() + + +def tokenize(sentences): + """ + Use tokenizer.perl to tokenize input sentences. + tokenizer.perl is tool of Moses. + sentences : a list of input sentences. + return: a list of processed text. + """ + dir = './data/mosesdecoder-master/scripts/tokenizer/tokenizer.perl' + tokenizer_cmd = [dir, '-l', 'en', '-q', '-'] + assert isinstance(sentences, list) + text = "\n".join(sentences) + tokenizer = Popen(tokenizer_cmd, stdin=PIPE, stdout=PIPE) + tok_text, _ = tokenizer.communicate(text) + toks = tok_text.split('\n')[:-1] + return toks + + +def read_lines(path): + """ + path: String, file path. + return a list of sequence. 
+ """ + seqs = [] + with open(path, 'r') as f: + for line in f.readlines(): + line = line.strip() + if len(line): + seqs.append(line) + return seqs + + +class SentimentDataSetCreate(): + """ + A class to process data for sentiment analysis task. + """ + + def __init__(self, + data_path, + output_path, + use_okenizer=True, + multi_lines=False): + """ + data_path: string, traing and testing dataset path + output_path: string, output path, store processed dataset + multi_lines: whether a file has multi lines. + In order to shuffle fully, it needs to read all files into + memory, then shuffle them if one file has multi lines. + """ + self.output_path = output_path + self.data_path = data_path + + self.train_dir = 'train' + self.test_dir = 'test' + + self.train_list = "train.list" + self.test_list = "test.list" + + self.label_list = "labels.list" + self.classes_num = 0 + + self.batch_size = 50000 + self.batch_dir = 'batches' + + self.dict_file = "dict.txt" + self.dict_with_test = False + self.dict_size = 0 + self.word_count = {} + + self.tokenizer = use_okenizer + self.overwrite = False + + self.multi_lines = multi_lines + + self.train_dir = join_path(data_path, self.train_dir) + self.test_dir = join_path(data_path, self.test_dir) + self.train_list = join_path(output_path, self.train_list) + self.test_list = join_path(output_path, self.test_list) + self.label_list = join_path(output_path, self.label_list) + self.dict_file = join_path(output_path, self.dict_file) + + def data_list(self, path): + """ + create dataset from path + path: data path + return: data list + """ + label_set = get_label_set_from_dir(path) + data = [] + for lab_name in label_set.keys(): + file_paths = list_files(join_path(path, lab_name)) + for p in file_paths: + data.append({"label" : label_set[lab_name],\ + "seq_path": p}) + return data, label_set + + def create_dict(self, data): + """ + create dict for input data. + data: list, [sequence, sequnce, ...] + """ + for seq in data: + for w in seq.strip().lower().split(): + if w not in self.word_count: + self.word_count[w] = 1 + else: + self.word_count[w] += 1 + + def create_dataset(self): + """ + create file batches and dictionary of train data set. + If the self.overwrite is false and train.list already exists in + self.output_path, this function will not create and save file + batches from the data set path. + return: dictionary size, class number. + """ + out_path = self.output_path + if out_path and not os.path.exists(out_path): + os.makedirs(out_path) + + # If self.overwrite is false or self.train_list has existed, + # it will not process dataset. + if not (self.overwrite or not os.path.exists(self.train_list)): + print "%s already exists." % self.train_list + return + + # Preprocess train data. + train_data, train_lab_set = self.data_list(self.train_dir) + print "processing train set..." + file_lists = self.save_data(train_data, "train", self.batch_size, True, + True) + save_list(file_lists, self.train_list) + + # If have test data path, preprocess test data. + if os.path.exists(self.test_dir): + test_data, test_lab_set = self.data_list(self.test_dir) + assert (train_lab_set == test_lab_set) + print "processing test set..." + file_lists = self.save_data(test_data, "test", self.batch_size, + False, self.dict_with_test) + save_list(file_lists, self.test_list) + + # save labels set. + save_dict(train_lab_set, self.label_list, False) + self.classes_num = len(train_lab_set.keys()) + + # save dictionary. 
+ save_dict(self.word_count, self.dict_file, True) + self.dict_size = len(self.word_count) + + def save_data(self, + data, + prefix="", + batch_size=50000, + is_shuffle=False, + build_dict=False): + """ + Create batches for a Dataset object. + data: the Dataset object to process. + prefix: the prefix of each batch. + batch_size: number of data in each batch. + build_dict: whether to build dictionary for data + + return: list of batch names + """ + if is_shuffle and self.multi_lines: + return self.save_data_multi_lines(data, prefix, batch_size, + build_dict) + + if is_shuffle: + random.shuffle(data) + num_batches = int(math.ceil(len(data) / float(batch_size))) + batch_names = [] + for i in range(num_batches): + batch_name = join_path(self.output_path, + "%s_part_%03d" % (prefix, i)) + begin = i * batch_size + end = min((i + 1) * batch_size, len(data)) + # read a batch of data + label_list, data_list = self.get_data_list(begin, end, data) + if build_dict: + self.create_dict(data_list) + self.save_file(label_list, data_list, batch_name) + batch_names.append(batch_name) + + return batch_names + + def get_data_list(self, begin, end, data): + """ + begin: int, begining index of data. + end: int, ending index of data. + data: a list of {"seq_path": seqquence path, "label": label index} + + return a list of label and a list of sequence. + """ + label_list = [] + data_list = [] + for j in range(begin, end): + seqs = read_lines(data[j]["seq_path"]) + lab = int(data[j]["label"]) + #File may have multiple lines. + for seq in seqs: + data_list.append(seq) + label_list.append(lab) + if self.tokenizer: + data_list = tokenize(data_list) + return label_list, data_list + + def save_data_multi_lines(self, + data, + prefix="", + batch_size=50000, + build_dict=False): + """ + In order to shuffle fully, there is no need to load all data if + each file only contains one sample, it only needs to shuffle list + of file name. But one file contains multi lines, each line is one + sample. It needs to read all data into memory to shuffle fully. + This interface is mainly for data containning multi lines in each + file, which consumes more memory if there is a great mount of data. + + data: the Dataset object to process. + prefix: the prefix of each batch. + batch_size: number of data in each batch. + build_dict: whether to build dictionary for data + + return: list of batch names + """ + assert self.multi_lines + label_list = [] + data_list = [] + + # read all data + label_list, data_list = self.get_data_list(0, len(data), data) + if build_dict: + self.create_dict(data_list) + + length = len(label_list) + perm_list = np.array([i for i in xrange(length)]) + random.shuffle(perm_list) + + num_batches = int(math.ceil(length / float(batch_size))) + batch_names = [] + for i in range(num_batches): + batch_name = join_path(self.output_path, + "%s_part_%03d" % (prefix, i)) + begin = i * batch_size + end = min((i + 1) * batch_size, length) + sub_label = [label_list[perm_list[i]] for i in range(begin, end)] + sub_data = [data_list[perm_list[i]] for i in range(begin, end)] + self.save_file(sub_label, sub_data, batch_name) + batch_names.append(batch_name) + + return batch_names + + def save_file(self, label_list, data_list, filename): + """ + Save data into file. + label_list: a list of int value. + data_list: a list of sequnece. + filename: output file name. 
+ """ + f = open(filename, 'w') + print "saving file: %s" % filename + for lab, seq in zip(label_list, data_list): + f.write('%s\t\t%s\n' % (lab, seq)) + f.close() + + +def option_parser(): + parser = OptionParser(usage="usage: python preprcoess.py "\ + "-i data_dir [options]") + parser.add_option( + "-i", + "--data", + action="store", + dest="input", + help="Input data directory.") + parser.add_option( + "-o", + "--output", + action="store", + dest="output", + default=None, + help="Output directory.") + parser.add_option( + "-t", + "--tokenizer", + action="store", + dest="use_tokenizer", + default=True, + help="Whether to use tokenizer.") + parser.add_option("-m", "--multi_lines", action="store", + dest="multi_lines", default=False, + help="If input text files have multi lines and they "\ + "need to be shuffled, you should set -m True,") + return parser.parse_args() + + +def main(): + options, args = option_parser() + data_dir = options.input + output_dir = options.output + use_tokenizer = options.use_tokenizer + multi_lines = options.multi_lines + if output_dir is None: + outname = os.path.basename(options.input) + output_dir = join_path(os.path.dirname(data_dir), 'pre-' + outname) + data_creator = SentimentDataSetCreate(data_dir, output_dir, use_tokenizer, + multi_lines) + data_creator.create_dataset() + + +if __name__ == '__main__': + main() diff --git a/understand_sentiment/preprocess.sh b/understand_sentiment/preprocess.sh new file mode 100755 index 0000000000000000000000000000000000000000..19ec34d4f016365d18db01ddec559d26202b19c6 --- /dev/null +++ b/understand_sentiment/preprocess.sh @@ -0,0 +1,22 @@ +#!/bin/bash +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +set -e + +echo "Start to preprcess..." + +data_dir="./data/imdb" +python preprocess.py -i $data_dir + +echo "Done." diff --git a/understand_sentiment/sentiment_net.py b/understand_sentiment/sentiment_net.py new file mode 100644 index 0000000000000000000000000000000000000000..1a92d655148f7ed7b5085a7237e795bc05f8e7fe --- /dev/null +++ b/understand_sentiment/sentiment_net.py @@ -0,0 +1,162 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from os.path import join as join_path + +from paddle.trainer_config_helpers import * + + +def sentiment_data(data_dir=None, + is_test=False, + is_predict=False, + train_list="train.list", + test_list="test.list", + dict_file="dict.txt"): + """ + Predefined data provider for sentiment analysis. + is_test: whether this config is used for test. + is_predict: whether this config is used for prediction. + train_list: text file name, containing a list of training set. + test_list: text file name, containing a list of testing set. + dict_file: text file name, containing dictionary. + """ + dict_dim = len(open(join_path(data_dir, "dict.txt")).readlines()) + class_dim = len(open(join_path(data_dir, 'labels.list')).readlines()) + if is_predict: + return dict_dim, class_dim + + if data_dir is not None: + train_list = join_path(data_dir, train_list) + test_list = join_path(data_dir, test_list) + dict_file = join_path(data_dir, dict_file) + + train_list = train_list if not is_test else None + word_dict = dict() + with open(dict_file, 'r') as f: + for i, line in enumerate(open(dict_file, 'r')): + word_dict[line.split('\t')[0]] = i + + define_py_data_sources2( + train_list, + test_list, + module="dataprovider", + obj="process", + args={'dictionary': word_dict}) + + return dict_dim, class_dim + +def convolution_net(input_dim, + class_dim=2, + emb_dim=128, + hid_dim=128, + is_predict=False): + data = data_layer("word", input_dim) + emb = embedding_layer(input=data, size=emb_dim) + conv_3 = sequence_conv_pool(input=emb, context_len=3, hidden_size=hid_dim) + conv_4 = sequence_conv_pool(input=emb, context_len=4, hidden_size=hid_dim) + output = fc_layer(input=[conv_3,conv_4], size=class_dim, act=SoftmaxActivation()) + + if not is_predict: + lbl = data_layer("label", 1) + outputs(classification_cost(input=output, label=lbl)) + else: + outputs(output) + + +def bidirectional_lstm_net(input_dim, + class_dim=2, + emb_dim=128, + lstm_dim=128, + is_predict=False): + data = data_layer("word", input_dim) + emb = embedding_layer(input=data, size=emb_dim) + bi_lstm = bidirectional_lstm(input=emb, size=lstm_dim) + dropout = dropout_layer(input=bi_lstm, dropout_rate=0.5) + output = fc_layer(input=dropout, size=class_dim, act=SoftmaxActivation()) + + if not is_predict: + lbl = data_layer("label", 1) + outputs(classification_cost(input=output, label=lbl)) + else: + outputs(output) + + +def stacked_lstm_net(input_dim, + class_dim=2, + emb_dim=128, + hid_dim=512, + stacked_num=3, + is_predict=False): + """ + A Wrapper for sentiment classification task. + This network uses bi-directional recurrent network, + consisting three LSTM layers. This configure is referred to + the paper as following url, but use fewer layrs. + http://www.aclweb.org/anthology/P15-1109 + + input_dim: here is word dictionary dimension. + class_dim: number of categories. + emb_dim: dimension of word embedding. + hid_dim: dimension of hidden layer. + stacked_num: number of stacked lstm-hidden layer. + is_predict: is predicting or not. + Some layers is not needed in network when predicting. + """ + hid_lr = 1e-3 + assert stacked_num % 2 == 1 + + layer_attr = ExtraLayerAttribute(drop_rate=0.5) + fc_para_attr = ParameterAttribute(learning_rate=hid_lr) + lstm_para_attr = ParameterAttribute(initial_std=0., learning_rate=1.) + para_attr = [fc_para_attr, lstm_para_attr] + bias_attr = ParameterAttribute(initial_std=0., l2_rate=0.) 
+ relu = ReluActivation() + linear = LinearActivation() + + data = data_layer("word", input_dim) + emb = embedding_layer(input=data, size=emb_dim) + + fc1 = fc_layer(input=emb, size=hid_dim, act=linear, bias_attr=bias_attr) + lstm1 = lstmemory( + input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr) + + inputs = [fc1, lstm1] + for i in range(2, stacked_num + 1): + fc = fc_layer( + input=inputs, + size=hid_dim, + act=linear, + param_attr=para_attr, + bias_attr=bias_attr) + lstm = lstmemory( + input=fc, + reverse=(i % 2) == 0, + act=relu, + bias_attr=bias_attr, + layer_attr=layer_attr) + inputs = [fc, lstm] + + fc_last = pooling_layer(input=inputs[0], pooling_type=MaxPooling()) + lstm_last = pooling_layer(input=inputs[1], pooling_type=MaxPooling()) + output = fc_layer( + input=[fc_last, lstm_last], + size=class_dim, + act=SoftmaxActivation(), + bias_attr=bias_attr, + param_attr=para_attr) + + if is_predict: + outputs(output) + else: + outputs(classification_cost(input=output, label=data_layer('label', 1))) diff --git a/understand_sentiment/test.sh b/understand_sentiment/test.sh new file mode 100755 index 0000000000000000000000000000000000000000..8af827c3388c8df88a872bd87d121a4f9631c3ff --- /dev/null +++ b/understand_sentiment/test.sh @@ -0,0 +1,39 @@ +#!/bin/bash +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +set -e + +function get_best_pass() { + cat $1 | grep -Pzo 'Test .*\n.*pass-.*' | \ + sed -r 'N;s/Test.* classification_error_evaluator=([0-9]+\.[0-9]+).*\n.*pass-([0-9]+)/\1 \2/g' |\ + sort -n | head -n 1 +} + +log=train.log +LOG=`get_best_pass $log` +LOG=(${LOG}) +evaluate_pass="model_output/pass-${LOG[1]}" + +echo 'evaluating from pass '$evaluate_pass + +model_list=./model.list +touch $model_list | echo $evaluate_pass > $model_list +net_conf=trainer_config.py +paddle train --config=$net_conf \ + --model_list=$model_list \ + --job=test \ + --use_gpu=false \ + --trainer_count=4 \ + --config_args=is_test=1 \ + 2>&1 | tee 'test.log' diff --git a/understand_sentiment/train.sh b/understand_sentiment/train.sh new file mode 100755 index 0000000000000000000000000000000000000000..5ce8bf4b997d962b9b61593cec0954d76c4874bc --- /dev/null +++ b/understand_sentiment/train.sh @@ -0,0 +1,29 @@ +#!/bin/bash +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+set -e + +config=trainer_config.py +output=./model_output +paddle train --config=$config \ + --save_dir=$output \ + --job=train \ + --use_gpu=false \ + --trainer_count=4 \ + --num_passes=10 \ + --log_period=10 \ + --dot_period=20 \ + --show_parameter_stats_period=100 \ + --test_all_data_in_one_period=1 \ + 2>&1 | tee 'train.log' diff --git a/understand_sentiment/trainer_config.py b/understand_sentiment/trainer_config.py new file mode 100644 index 0000000000000000000000000000000000000000..42deac405921fe229550d51e9b83300fbab55f1a --- /dev/null +++ b/understand_sentiment/trainer_config.py @@ -0,0 +1,40 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from sentiment_net import * +from paddle.trainer_config_helpers import * + +# whether this config is used for test +is_test = get_config_arg('is_test', bool, False) +# whether this config is used for prediction +is_predict = get_config_arg('is_predict', bool, False) + +data_dir = "./data/pre-imdb" +dict_dim, class_dim = sentiment_data(data_dir, is_test, is_predict) + +################## Algorithm Config ##################### + +settings( + batch_size=128, + learning_rate=2e-3, + learning_method=AdamOptimizer(), + average_window=0.5, + regularization=L2Regularization(8e-4), + gradient_clipping_threshold=25) + +#################### Network Config ###################### +stacked_lstm_net( + dict_dim, class_dim=class_dim, stacked_num=3, is_predict=is_predict) +# bidirectional_lstm_net(dict_dim, class_dim=class_dim, is_predict=is_predict) +# convolution_net(dict_dim, class_dim=class_dim, is_predict=is_predict) \ No newline at end of file