[DON'T MERGE] verify test for understand_sentiment chapter (!587) · Merge request · PaddlePaddle / book

Created by: chenwhql

Five pos and five neg samples were taken from the imdb test directory and run through the trained models. The results are as follows:

infer_conv.py: 9 correct, 1 wrong

10897_10.txt positive:  0.83449405 negative:  0.16550599
11837_9.txt positive:  0.39713657 negative:  0.60286343 //Wrong
153_8.txt positive:  0.98495966 negative:  0.01504028
1547_9.txt positive:  0.9090364 negative:  0.09096363
5305_7.txt positive:  0.62061876 negative:  0.37938124
5674_1.txt positive:  0.12702845 negative:  0.87297153
6530_4.txt positive:  0.17245957 negative:  0.82754034
6539_2.txt positive:  0.04734974 negative:  0.9526503
7393_3.txt positive:  0.40734676 negative:  0.59265333
7403_2.txt positive:  0.24649067 negative:  0.7535093

infer_stacked_lstm.py: 9 correct, 1 wrong

10897_10.txt positive:  0.85451925 negative:  0.14548075
11837_9.txt positive:  0.32060853 negative:  0.67939144 //Wrong
153_8.txt positive:  0.82614267 negative:  0.17385729
1547_9.txt positive:  0.9435877 negative:  0.056412343
5305_7.txt positive:  0.9677125 negative:  0.032287553
5674_1.txt positive:  0.101652615 negative:  0.8983474
6530_4.txt positive:  0.13586038 negative:  0.8641397
6539_2.txt positive:  0.10351077 negative:  0.8964892
7393_3.txt positive:  0.17400761 negative:  0.8259924
7403_2.txt positive:  0.23748593 negative:  0.76251405

These results indicate that the models' predictions are reasonable.
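The correct/wrong counts above can be re-derived from the printed lines. The following is a minimal sketch, not part of this merge request: it assumes the standard IMDB file naming `<id>_<rating>.txt`, where a rating of 7 or higher marks a positive review and 4 or lower a negative one. The helper name `score` and the regular expression are hypothetical.

```python
import re

# Prediction lines copied from the two runs above.
CONV_OUTPUT = """\
10897_10.txt positive:  0.83449405 negative:  0.16550599
11837_9.txt positive:  0.39713657 negative:  0.60286343
153_8.txt positive:  0.98495966 negative:  0.01504028
1547_9.txt positive:  0.9090364 negative:  0.09096363
5305_7.txt positive:  0.62061876 negative:  0.37938124
5674_1.txt positive:  0.12702845 negative:  0.87297153
6530_4.txt positive:  0.17245957 negative:  0.82754034
6539_2.txt positive:  0.04734974 negative:  0.9526503
7393_3.txt positive:  0.40734676 negative:  0.59265333
7403_2.txt positive:  0.24649067 negative:  0.7535093
"""

LSTM_OUTPUT = """\
10897_10.txt positive:  0.85451925 negative:  0.14548075
11837_9.txt positive:  0.32060853 negative:  0.67939144
153_8.txt positive:  0.82614267 negative:  0.17385729
1547_9.txt positive:  0.9435877 negative:  0.056412343
5305_7.txt positive:  0.9677125 negative:  0.032287553
5674_1.txt positive:  0.101652615 negative:  0.8983474
6530_4.txt positive:  0.13586038 negative:  0.8641397
6539_2.txt positive:  0.10351077 negative:  0.8964892
7393_3.txt positive:  0.17400761 negative:  0.8259924
7403_2.txt positive:  0.23748593 negative:  0.76251405
"""

# Hypothetical pattern for one output line; trailing text is ignored.
LINE_RE = re.compile(
    r"^(?P<name>\d+_(?P<rating>\d+)\.txt)\s+"
    r"positive:\s+(?P<pos>[\d.eE+-]+)\s+"
    r"negative:\s+(?P<neg>[\d.eE+-]+)"
)


def score(output):
    """Return (number correct, list of misclassified file names)."""
    correct, wrong = 0, []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip blank or unrelated lines
        # Assumption: rating >= 7 means the review is labeled positive.
        expected_positive = int(m.group("rating")) >= 7
        predicted_positive = float(m.group("pos")) > float(m.group("neg"))
        if expected_positive == predicted_positive:
            correct += 1
        else:
            wrong.append(m.group("name"))
    return correct, wrong


if __name__ == "__main__":
    for label, output in (("infer_conv.py", CONV_OUTPUT),
                          ("infer_stacked_lstm.py", LSTM_OUTPUT)):
        print(label, score(output))
```

Deriving the label from the file name avoids hand-marking each line, which matters if the sample is later extended beyond ten files.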

Explanation of the earlier experiment:

Results on the 10 short sentences in review_strs:

CNN: test accuracy 0.88; 7 predicted correctly, 3 wrong

LSTM: test accuracy 0.86; 5 predicted correctly, 5 wrong

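These 10 test cases are very short (around 10 words each), so they differ considerably in feature richness from the imdb training and test samples (more than 100 words each). In my view, predictions at this level are a normal result for such inputs.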

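Explanation of Daming's experiment

In Daming's experiment, only the first sentence of each of the 10 test texts was used, which may be slightly inappropriate. The full text of a review forms one complete opinion or evaluation, while the first sentence alone may not reflect the reviewer's view. I believe this is the main reason the predictions were inaccurate. I suggest rerunning the experiment with the full review text.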
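To illustrate the point about truncation, here is a small sketch. The review text is invented for illustration and is not one of the 10 test files, and `first_sentence` is a hypothetical helper with a naive split rule; it is not necessarily how the first sentence was extracted in that experiment.

```python
import re


def first_sentence(text):
    """Return the text up to and including the first '.', '!' or '?'."""
    m = re.search(r"[.!?]", text)
    return text if m is None else text[: m.end()]


# Invented example: the opening sentence carries no clear sentiment,
# while the verdict only appears later in the review.
review = ("I went in expecting very little. "
          "The first half drags, but the ending is superb and I loved it.")

truncated = first_sentence(review)
print("full review words:   ", len(review.split()))
print("first sentence words:", len(truncated.split()))
print("first sentence:      ", truncated)
```

Feeding both `review` and `truncated` to the same inference script would show whether the predicted probabilities shift once the rest of the review is included.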