diff --git a/image_classification/README.md b/image_classification/README.md
index 4fcce75135631977e6e1bbdcd2b816c5571c3322..e62dd19bb434a55a9168cadcd0d77dce80ce31e7 100644
--- a/image_classification/README.md
+++ b/image_classification/README.md
@@ -3,26 +3,24 @@
 ## Background
 
-Compared with text, images provide more vivid, easier to understand, more interesting, and more artistic information; they are an important means by which people convey and exchange information. In this tutorial we focus on one of the important problems in image recognition - image classification.
+Compared with text, images provide more vivid, easier to understand, and more artistic information; they are an important means by which people convey and exchange information. In this tutorial we focus on an important problem in image recognition, namely image classification.
 
-Image classification distinguishes images of different categories according to their semantic information. It is a fundamental problem in computer vision and the basis of other high-level vision tasks such as object detection, image segmentation, object tracking, and behavior analysis.
+Image classification distinguishes images of different categories according to their semantic information. It is a fundamental problem in computer vision and the basis of other high-level vision tasks such as object detection, image segmentation, object tracking, and behavior analysis. Image classification is widely applied in many fields: face recognition and intelligent video analysis in security, traffic scene recognition in transportation, content-based image retrieval and automatic album categorization on the web, and image recognition in medicine.
 
-Image classification is widely applied in many fields: face recognition and intelligent video analysis in security, traffic scene recognition in transportation, content-based image retrieval and automatic album categorization on the web, and image recognition in medicine.
+In general, image classification describes an entire image with hand-crafted or learned features and then determines the object category with a classifier, so how to extract image features is crucial. Before deep learning, the dominant approach to object classification was the bag-of-words model, whose basic pipeline consists of several stages: low-level feature extraction, feature encoding, spatial constraints, classifier design, and model fusion.
 
-In general, image classification describes an entire image with hand-crafted or learned features and then determines the object category with a classifier, so how to extract image features is crucial. Before deep learning, the dominant approach to object classification was the bag-of-words model, whose basic pipeline consists of several stages: low-level feature extraction, feature encoding, spatial constraints, classifier design, and model fusion.
+Deep-learning-based image classification instead learns hierarchical feature descriptions in a supervised or unsupervised way, describing an object from low level to high level. The Convolution Neural Network (CNN), one family of deep learning models, has achieved remarkable results on images in recent years. A CNN takes raw image pixels directly as input, preserving as much of the input information as possible; it extracts features and high-level abstractions through convolutions, and its output is directly the recognition result. This end-to-end "input-to-output" learning approach has achieved excellent results and is widely used.
 
-Deep-learning-based image classification instead learns hierarchical feature descriptions in a supervised or unsupervised way, describing an object from low level to high level. The convolutional neural network (CNN), one family of deep learning models, has achieved remarkable results on images in recent years. A CNN takes raw image pixels directly as input, preserving as much of the input information as possible; it extracts features and high-level abstractions through convolutions, and its output is directly the recognition result. This end-to-end "input-to-output" learning approach has achieved excellent results and is widely used.
-
-This tutorial mainly introduces image classification models and how to train a CNN model with PaddlePaddle.
+This tutorial mainly introduces deep learning models for image classification and how to train a CNN model with PaddlePaddle.
 
 ## Demonstration
 
 Image classification includes general image classification, fine-grained image classification, and so on. The figure below shows general image classification: the model correctly recognizes the main object in the image.

-Figure 1. General object classification
+Figure 1. General image classification

@@ -30,7 +28,7 @@

Figure 2. Fine-grained image classification

@@ -44,57 +42,72 @@
 ## Model Overview
 
-The CNN model [1] that Alex Krizhevsky proposed on the dataset (ImageNet) of the 2012 Large Scale Visual Recognition Challenge (ILSVRC 2012) achieved a historic breakthrough. It outperformed traditional methods by a large margin and won the ILSVRC 2012 championship. The model is called AlexNet, and it was the first time deep learning was used for large-scale image classification and the first time GPUs were used to accelerate model training. Since then a series of CNN models have emerged and kept setting new records on ImageNet. As models grow deeper, the Top-5 error rate keeps dropping and is now around 3.5%, while on the same ImageNet dataset the human error rate is about 5.1%; in other words, current deep learning models already surpass the human eye at this recognition task.
+A large share of the research results in image recognition are built on public datasets such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) and [ImageNet](http://image-net.org/), which usually serve as benchmarks for image recognition. PASCAL VOC is a visual challenge established in 2005; ImageNet is the dataset of the Large Scale Visual Recognition Challenge (ILSVRC) established in 2010. We introduce the image classification models through a number of papers built around these two datasets.
+
+In traditional image classification models before 2012, low-level feature extraction is the first step: large numbers of local feature descriptors are extracted from the image at fixed strides and scales. Commonly used local features include SIFT (Scale-Invariant Feature Transform) \[[1](#references)\], HOG (Histogram of Oriented Gradients) \[[2](#references)\], and LBP (Local Binary Patterns) \[[3](#references)\]; traditional object classification methods usually combine several descriptors so as not to lose too much useful information. Because low-level features contain a lot of redundancy and noise, a feature transform is applied to encode them more robustly, which is called feature encoding. Feature encoding algorithms include vector quantization \[[4](#references)\], sparse coding \[[5](#references)\], locality-constrained linear coding \[[6](#references)\], Fisher vector encoding \[[7](#references)\], and others. Encoding is usually followed by spatial feature constraints, also known as feature pooling, which takes the maximum or the average of each feature dimension to obtain a representation with some invariance. A commonly used pooling method, spatial pyramid matching, divides the image evenly into blocks and pools the features within each block. Once these steps are complete, an image can be described by a vector of fixed dimension, and a classifier is then applied; commonly used classifiers include the SVM (Support Vector Machine) and random forests, with the kernel SVM the most widely used and very effective on traditional image classification tasks. For example, the winner of the ILSVRC 2010 image classification task used SIFT and LBP features, two nonlinear encoders, and an SVM classifier \[[8](#references)\].
+
+The CNN model that Alex Krizhevsky proposed at ILSVRC 2012 \[[9](#references)\] achieved a historic breakthrough. It outperformed traditional methods by a large margin and won the ILSVRC 2012 championship. The model is called AlexNet, and it was the first time deep learning was used for large-scale image classification and the first time GPUs were used to accelerate model training. Since then a series of CNN models have emerged and kept setting new records on ImageNet, as Figure 4 shows. As models grow deeper and their structures more refined, the Top-5 error rate keeps dropping and is now around 3.5%, while on the same ImageNet dataset the human error rate is about 5.1%; in other words, current deep learning models already surpass the human eye at this recognition task.


Figure 4. Top-5 error rate of image classification on ILSVRC

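+To make the traditional pipeline above concrete, here is a minimal sketch of bag-of-words classification. It is illustrative only: it assumes scikit-learn is available, random vectors stand in for real SIFT descriptors, and `bow_encode` is a hypothetical helper, not part of this tutorial's code.
+
+```python
+import numpy as np
+from sklearn.cluster import KMeans
+from sklearn.svm import SVC
+
+# Stand-ins for local descriptors (e.g. 128-D SIFT), one array per image.
+train_feats = [np.random.rand(200, 128) for _ in range(20)]
+labels = np.random.randint(0, 2, 20)
+
+# Feature encoding: vector quantization against a learned visual codebook.
+codebook = KMeans(n_clusters=64).fit(np.vstack(train_feats))
+
+def bow_encode(local_feats):
+    words = codebook.predict(local_feats)
+    # Feature pooling: a normalized histogram of visual words turns a
+    # variable number of local descriptors into a fixed-length vector.
+    hist = np.bincount(words, minlength=codebook.n_clusters)
+    return hist / float(hist.sum())
+
+X = np.array([bow_encode(f) for f in train_feats])
+clf = SVC(kernel='rbf').fit(X, labels)  # kernel SVM classifier
+```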
-
-In this tutorial we mainly use the VGG and ResNet network structures. Before introducing these two models, we first briefly introduce the CNN structure.
-
-### CNN
-
-A convolutional neural network is a feed-forward neural network that uses convolutional layers and is well suited to building models that understand image content. A typical network is shown below:
+A typical convolutional neural network is shown in the figure below. Let us first look at some basic components of a CNN. AlexNet contains all of these basic components, and it laid the foundation for later networks.


Figure 5. Example of a CNN network


+- Convolutional layer: extracts features from low level to high level through convolutions, exploiting the local-correlation and spatial-invariance properties of images.
+- Pooling layer: a down-sampling operation that takes the maximum (Max-Pooling) or the average (Avg-Pooling) of local blocks of the convolutional feature maps, gaining some invariance in the process.
+- Fully connected layer: the neurons from the input layer to the hidden layer are fully connected.
+- Nonlinearity: convolutional and fully connected layers are usually followed by a nonlinear activation such as Sigmoid, Tanh, or ReLU to strengthen the network's expressive power; ReLU is the most commonly used activation in CNNs.
+- Dropout \[[10](#references)\]: randomly deactivates some hidden units during training, which improves generalization and helps prevent overfitting.
 
-A convolutional neural network contains the following layers:
-
-- Convolutional layer: extracts features from low level to high level through convolutions while preserving the spatial information of the image.
-- Pooling layer: a down-sampling operation that takes the maximum or the average of local blocks of the convolutional feature maps, gaining some invariance in the process.
-- Fully connected layer: makes the neurons from the input layer to the hidden layer fully connected.
-- Nonlinearity: sigmoid, tanh, relu, and other nonlinearities strengthen the network's expressive power.
-
-Convolutional neural networks perform astonishingly well on image classification because they exploit two kinds of important information in images: local correlation and spatial invariance. By alternating convolution and pooling, a CNN can represent both kinds of information well.
+
+Besides these basic components, another development well worth mentioning is the Batch Normalization (BN) algorithm \[[14](#references)\] proposed in 2015. Its authors point out that because each layer's parameters are constantly updated during training, the input distribution of the following layer keeps shifting, which forces careful hyperparameter design. BN normalizes the features of every layer of the network within each batch, so that the distribution of each layer stays relatively stable; it acts as a mild regularizer and reduces the burden of hyperparameter design. Experiments confirm that BN accelerates training, and it has been widely used in later, deeper models.
+
+Next we mainly introduce the VGG, GoogleNet, and ResNet network structures.
 
 ### VGG
 
-The core of the [VGG](https://arxiv.org/abs/1405.3531) model is five groups of convolutions, with max-pooling spatial down-sampling between every two groups. Each group applies several consecutive 3x3 convolutions; the number of kernels grows from 64 in the shallowest group to 512 in the deepest, and is the same within a group. The convolutions are followed by two fully connected layers and then the classification layer. The VGG model is computationally expensive and converges slowly.
+The model proposed at ILSVRC 2014 by the Visual Geometry Group (VGG) of Oxford University is known as the VGG model \[[11](#references)\]. Compared with earlier models it further widens and deepens the network. Its core is five groups of convolutions, with Max-Pooling spatial down-sampling between every two groups. Each group applies several consecutive 3x3 convolutions; the number of kernels grows from 64 in the shallowest group to 512 in the deepest, and is the same within a group. The convolutions are followed by two fully connected layers and then the classification layer. Depending on the number of convolutional layers per group there are 11-, 13-, 16-, and 19-layer variants; the figure below shows a 16-layer network. The VGG model is computationally expensive and has many parameters.


-Figure 6. VGG16 network example based on ImageNet
+Figure 6. VGG16 model based on ImageNet

+### GoogleNet
+
+GoogleNet \[[12](#references)\] won ILSVRC 2014. Before introducing it, let us first look at the NIN (Network in Network) model \[[13](#references)\], since GoogleNet borrows some of NIN's ideas.
+
+NIN introduces multi-layer perceptron convolutions (MLPconv) in place of single linear convolutions. An MLPconv is a tiny multi-layer convolutional network that appends 1x1 convolutions to a linear convolution; since a 1x1 convolution is equivalent to a fully connected layer, MLPconv can extract highly nonlinear features. In addition, the last few layers of a traditional CNN are usually fully connected layers with many parameters; NIN replaces them with global average pooling (Avg-Pooling), so that each feature map produced by the last convolutional layer corresponds to one output category.
+
+GoogleNet is composed of stacked groups of Inception modules. An Inception module is shown in Figure 8 below: its output concatenates the features of four groups of convolutions with different kernel sizes. The three yellow 1x1 convolutions perform dimensionality reduction, that is, they reduce the number of channels. They are introduced because, without them, Max-Pooling does not change the number of channels, so concatenating the three blue convolutions with the Max-Pooling branch would yield features with a large number of channels; after stacking a few such modules, the channel count, and with it the parameters and computation, would keep growing. The 1x1 convolutions are therefore used for dimensionality reduction, and, as noted in the NIN model, they can also rectify linear features. Furthermore, GoogleNet does not end with the traditional stack of fully connected layers; like NIN it uses Avg-Pooling, but unlike NIN it follows the Avg-Pooling with one fully connected layer that maps to the number of categories. Besides the Inception modules and the Avg-Pooling, since the features of intermediate layers are also discriminative, two auxiliary classifiers are attached to intermediate layers to strengthen the gradients during back-propagation and add regularization. Overall, GoogleNet starts with 3 ordinary convolutional layers, followed by three groups of sub-networks containing 2, 5, and 2 Inception modules respectively, then Avg-Pooling and a fully connected layer, for a total of 22 layers.
+

+Figure 8. Inception module

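+As a rough sketch, an Inception module could be written with the same `paddle.trainer_config_helpers` primitives used later in this tutorial, assuming `img_conv_layer`, `img_pool_layer`, and `concat_layer` behave as in the configurations below; the filter counts are placeholder arguments, not the values from the paper.
+
+```python
+from paddle.trainer_config_helpers import *
+
+def inception_module(ipt, ch_1x1, ch_3x3r, ch_3x3, ch_5x5r, ch_5x5, ch_proj):
+    # branch 1: 1x1 convolution
+    b1 = img_conv_layer(input=ipt, filter_size=1, num_filters=ch_1x1,
+                        stride=1, padding=0)
+    # branch 2: 1x1 reduction (fewer channels), then 3x3 convolution
+    b2 = img_conv_layer(input=ipt, filter_size=1, num_filters=ch_3x3r,
+                        stride=1, padding=0)
+    b2 = img_conv_layer(input=b2, filter_size=3, num_filters=ch_3x3,
+                        stride=1, padding=1)
+    # branch 3: 1x1 reduction, then 5x5 convolution
+    b3 = img_conv_layer(input=ipt, filter_size=1, num_filters=ch_5x5r,
+                        stride=1, padding=0)
+    b3 = img_conv_layer(input=b3, filter_size=5, num_filters=ch_5x5,
+                        stride=1, padding=2)
+    # branch 4: 3x3 max-pooling, then 1x1 projection to limit channels
+    b4 = img_pool_layer(input=ipt, pool_size=3, stride=1, padding=1,
+                        pool_type=MaxPooling())
+    b4 = img_conv_layer(input=b4, filter_size=1, num_filters=ch_proj,
+                        stride=1, padding=0)
+    # the four branches are concatenated along the channel dimension
+    return concat_layer(input=[b1, b2, b3, b4])
+```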
+
+The above is the first version of GoogleNet (GoogleNet-v1). GoogleNet-v2 \[[14](#references)\] introduces BN layers; GoogleNet-v3 \[[16](#references)\] factorizes some of the convolutions, further deepening the network and strengthening its nonlinearity; GoogleNet-v4 \[[17](#references)\] adopts the ResNet design idea introduced below. Each version improved accuracy over the previous one; the v2-v4 structures are not detailed here.
+
+
 ### ResNet
 
-[ResNet](https://arxiv.org/abs/1512.03385) won the 2015 ImageNet classification, localization, and detection competitions. To address the drop in accuracy observed when deepening a convolutional network, it proposes residual learning. On top of existing design ideas (Batch Norm, small kernels, fully convolutional networks), it introduces the residual module. Each residual module contains two paths: one is a direct shortcut of the input features, and the other applies two to three convolutions to compute the residual of those features; the features on the two paths are finally added together. The residual module is shown in Figure 7: the left is the basic connection, the right is the bottleneck connection. Figure 8 shows the 50-152 layer networks of the paper [2]. ResNet successfully trained networks with hundreds and even nearly a thousand layers; training converges quickly, and it is also faster than VGG.
+ResNet \[[15](#references)\] won the 2015 ImageNet classification, localization, and detection competitions. To address the drop in accuracy observed when deepening a convolutional network, it proposes residual learning. On top of existing design ideas (BN, small kernels, fully convolutional networks), it introduces the residual module. Each residual module contains two paths: one is a direct shortcut of the input features, and the other applies two to three convolutions to compute the residual of those features; the features on the two paths are finally added together. The residual module is shown in Figure 9: the left is the basic connection, the right is the bottleneck connection. Figure 10 shows the connection diagrams of the 50-152 layer networks. ResNet successfully trained networks with hundreds and even nearly a thousand layers; training converges quickly, and it is also faster than VGG.


-Figure 7. Residual module
+Figure 9. Residual module


-Figure 8. ResNet model based on ImageNet
+Figure 10. ResNet model based on ImageNet

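+Stripped of the layer helpers, the computation a basic residual block performs is just the following; `transform1`/`transform2` are hypothetical stand-ins for the conv+BN operations described above.
+
+```python
+import numpy as np
+
+def relu(x):
+    return np.maximum(x, 0)
+
+def basic_residual_block(x, transform1, transform2):
+    # residual path: two stacked transformations of the input
+    residual = transform2(relu(transform1(x)))
+    # identity shortcut: the input is added back to the residual
+    return relu(residual + x)
+
+# toy check with identity "transformations": the shape is preserved
+x = np.random.randn(16, 8, 8)
+y = basic_residual_block(x, lambda t: t, lambda t: t)
+assert y.shape == x.shape
+```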
@@ -106,7 +119,7 @@ Alex Krizhevsky在2012年大规模视觉识别竞赛(ILSVRC 2012)的数据集(Im


-Figure 3. CIFAR10 dataset
+Figure 11. CIFAR10 dataset

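+As a quick sanity check of the data format, a batch file of the standard CIFAR10 python release can be inspected as below. This is a sketch assuming the layout of the upstream `cifar-10-batches-py` files, which may differ from what `data/get_data.sh` produces locally.
+
+```python
+import cPickle
+import numpy as np
+
+with open('cifar-10-batches-py/data_batch_1', 'rb') as f:
+    batch = cPickle.load(f)
+
+images = np.array(batch['data'])    # shape (10000, 3072): 3 channels x 32 x 32 pixels
+labels = np.array(batch['labels'])  # 10000 integer labels in [0, 9]
+print images.shape, labels.min(), labels.max()
+```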
@@ -116,7 +129,7 @@ Alex Krizhevsky在2012年大规模视觉识别竞赛(ILSVRC 2012)的数据集(Im
 ./data/get_data.sh
 ```
 
-### Data Provider
+### Providing Data to PaddlePaddle
 
 We use the Python interface to pass data to the system; the `dataprovider.py` below gives a complete example for the CIFAR10 data.
 
@@ -126,6 +139,10 @@
 
 ```python
+import numpy as np
+import cPickle
+from paddle.trainer.PyDataProvider2 import *
+
 def initializer(settings, mean_path, is_train, **kwargs):
     settings.is_train = is_train
     settings.input_size = 3 * 32 * 32
@@ -157,15 +174,19 @@ def process(settings, file_list):
 
 ### Data Definition
 
-In the model configuration, `define_py_data_sources2` is used to read data from the dataprovider, where args specifies the path of the mean file.
+In the model configuration, `define_py_data_sources2` is used to read data from the dataprovider, where args specifies the path of the mean file. If this configuration file is used for prediction, the data definition part is not needed.
 
 ```python
-define_py_data_sources2(
-    train_list='data/train.list',
-    test_list='data/test.list',
-    module='dataprovider',
-    obj='process',
-    args={'mean_path': 'data/mean.meta'})
+from paddle.trainer_config_helpers import *
+
+is_predict = get_config_arg("is_predict", bool, False)
+if not is_predict:
+    define_py_data_sources2(
+        train_list='data/train.list',
+        test_list='data/test.list',
+        module='dataprovider',
+        obj='process',
+        args={'mean_path': 'data/mean.meta'})
 ```
 
 ### Algorithm Configuration
 
@@ -176,20 +197,37 @@
 settings(
     batch_size=128,
     learning_rate=0.1 / 128.0,
+    learning_rate_decay_a=0.1,
+    learning_rate_decay_b=50000 * 100,
+    learning_rate_schedule='discexp',
     learning_method=MomentumOptimizer(0.9),
-    regularization=L2Regularization(0.0005 * 128))
+    regularization=L2Regularization(0.0005 * 128),)
 ```
 
+The 'discexp' schedule decays the learning rate discretely and exponentially: lr = learning_rate * learning_rate_decay_a^(floor(num_samples / learning_rate_decay_b)).
+
 ### Model Architecture
 
-The Model Overview section already introduced the VGG and ResNet models, and this tutorial provides network configurations for both. Since CIFAR10 images are far smaller and fewer than ImageNet, the models here are adapted to the CIFAR10 data. We first introduce the VGG model; on the CIFAR10 dataset, the convolutional part introduces Batch Norm and Dropout operations.
+The Model Overview section already introduced the VGG and ResNet models, and this tutorial provides network configurations for both. Since CIFAR10 images are far smaller and fewer than ImageNet, the models here are adapted to the CIFAR10 data.
 
-1. First a group of convolutional layers, conv_block, is predefined. The `img_conv_group` used is a predefined module consisting of several groups of `Conv->BatchNorm->Relu->Dropout` and one `Pooling`, with 3x3 convolution kernels. In the definition below, groups decides how many consecutive convolutions there are.
+#### VGG
 
-2. Five groups of convolutions, i.e. 5 conv_blocks: the first and second groups use two consecutive convolutions, the third, fourth, and fifth use three.
+We first introduce the VGG model; on the CIFAR10 dataset, the convolutional part introduces BN and Dropout operations.
 
-3. Two 512-dimensional fully connected layers and one classification layer.
+1. Define the input data and its dimensions
 
+The network input is defined by `data_layer` (the data layer); for image classification this is the pixel information of the image. CIFAR10 images are RGB color images with 3 channels and a size of 32x32, so the input data size is 3072 (3x32x32) and the number of categories is 10.
+
+```python
+datadim = 3 * 32 * 32
+classdim = 10
+data = data_layer(name='image', size=datadim)
+```
+
+2. Define the core VGG module
+
+```python
+net = vgg_bn_drop(data)
+```
+
+The input of the core VGG module is the data layer. `vgg_bn_drop` defines a 16-layer VGG structure in which each convolutional layer is followed by a BN layer and a Dropout layer. Its definition is as follows:
 
 ```python
 def vgg_bn_drop(input, num_channels):
@@ -224,11 +262,115 @@ def vgg_bn_drop(input, num_channels):
                        input=tmp,
                        size=512,
                        act=LinearActivation())
-    tmp = fc_layer(input=tmp, size=10, act=SoftmaxActivation())
     return tmp
 ```
 
+2.1. First a group of convolutional layers, conv_block, is defined. The kernels are 3x3 and the pooling window is 2x2 with a stride of 2; groups decides how many consecutive convolutions each VGG block contains, and dropouts specifies the Dropout probabilities. The `img_conv_group` used here is predefined in the module `paddle.trainer_config_helpers` and consists of several groups of `Conv->BN->Relu->Dropout` and one `Pooling`.
+
+2.2. Five groups of convolutions, i.e. 5 conv_blocks: the first and second groups use two consecutive convolutions, the third, fourth, and fifth use three. The Dropout probability after the last convolution of each group is 0, i.e. no Dropout is applied there.
+
+2.3. Finally, two 512-dimensional fully connected layers are attached.
+
+3. Define the classifier
+
+The VGG network above extracts high-level features, which a fully connected layer maps to a vector whose size is the number of categories; Softmax normalization then yields the probability of each category. This part can also be called the classifier.
+
+```python
+out = fc_layer(input=net, size=classdim, act=SoftmaxActivation())
+```
+
+4. Define the loss function and network outputs
+
+Supervised training needs the category of each input image, which is likewise defined through `data_layer`. During training the multi-class cross-entropy is used as the loss function and as the network's output; in the prediction stage the network's output is defined as the classifier's probabilities.
+
+```python
+if not is_predict:
+    lbl = data_layer(name="label", size=classdim)
+    cost = classification_cost(input=out, label=lbl)
+    outputs(cost)
+else:
+    outputs(out)
+```
+
+### ResNet
+
+Steps 1, 3, and 4 of the ResNet model are the same as for the VGG model and are not repeated here. We mainly describe step 2, the core ResNet module on the CIFAR10 dataset.
+
+```python
+net = resnet_cifar10(data, depth=56)
+```
+
+At the bottom, resnet_cifar10 connects the input to a `conv_bn_layer`, i.e. a convolutional layer, a BN layer, and a ReLU activation. Then come three groups of parameterized modules, the three `layer_warp` calls configured below. Each group is composed of residual modules of the kind shown on the left of Figure 9, i.e. two 3x3 convolutional layers, each followed by a BN layer and then a ReLU activation; each group stacks several such residual modules. Finally, Avg-Pooling is applied and that layer is returned. The returned network is connected to a fully connected layer in step 3; therefore, excluding the first convolutional layer and the last fully connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6, i.e. the depth of `resnet_cifar10` must satisfy $(depth - 2) \% 6 == 0$.
+
+```python
+def conv_bn_layer(input,
+                  ch_out,
+                  filter_size,
+                  stride,
+                  padding,
+                  active_type=ReluActivation(),
+                  ch_in=None):
+    tmp = img_conv_layer(
+        input=input,
+        filter_size=filter_size,
+        num_channels=ch_in,
+        num_filters=ch_out,
+        stride=stride,
+        padding=padding,
+        act=LinearActivation(),
+        bias_attr=False)
+    return batch_norm_layer(input=tmp, act=active_type)
+
+
+def shortcut(ipt, n_in, n_out, stride):
+    if n_in != n_out:
+        return conv_bn_layer(ipt, n_out, 1, stride, 0, LinearActivation())
+    else:
+        return ipt
+
+def basicblock(ipt, ch_out, stride):
+    ch_in = ipt.num_filters
+    tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1)
+    tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, LinearActivation())
+    short = shortcut(ipt, ch_in, ch_out, stride)
+    # sum the residual branch and the shortcut, then apply ReLU
+    return addto_layer(input=[tmp, short], act=ReluActivation())
+
+def bottleneck(ipt, ch_out, stride):
+    ch_in = ipt.num_filters
+    tmp = conv_bn_layer(ipt, ch_out, 1, stride, 0)
+    tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1)
+    tmp = conv_bn_layer(tmp, ch_out * 4, 1, 1, 0, LinearActivation())
+    short = shortcut(ipt, ch_in, ch_out, stride)
+    return addto_layer(input=[tmp, short], act=ReluActivation())
+
+def layer_warp(block_func, ipt, features, count, stride):
+    tmp = block_func(ipt, features, stride)
+    for i in range(1, count):
+        tmp = block_func(tmp, features, 1)
+    return tmp
+
+def resnet_cifar10(ipt, depth=56):
+    assert (depth - 2) % 6 == 0, \
+        'depth should be one of 20, 32, 44, 56, 110, 1202'
+    n = (depth - 2) / 6
+    tmp = conv_bn_layer(ipt,
+                        ch_in=3,
+                        ch_out=16,
+                        filter_size=3,
+                        stride=1,
+                        padding=1)
+    tmp = layer_warp(basicblock, tmp, 16, n, 1)
+    tmp = layer_warp(basicblock, tmp, 32, n, 2)
+    tmp = layer_warp(basicblock, tmp, 64, n, 2)
+    tmp = img_pool_layer(input=tmp,
+                         pool_size=8,
+                         stride=1,
+                         pool_type=AvgPooling())
+    return tmp
+```
+
+
 ## Model Training
 
 ``` bash
 sh train.sh
 ```
 
@@ -240,7 +382,7 @@ sh train.sh
 ```bash
 #cfg=models/resnet.py
 cfg=models/vgg.py
-output=./output
+output=output
 log=train.log
 
 paddle train \
@@ -262,38 +404,94 @@ paddle train \
 
 An example training log is shown below: after 1 pass, the average error is 0.79958 on the training set and 0.7858 on the test set.
 
 ```text
-I1226 12:33:20.257822 25576 TrainerInternal.cpp:165] Batch=300 samples=38400 AvgCost=2.07708 CurrentCost=1.96158 Eval: classification_error_evaluator=0.81151 CurrentEval: classification_error_evaluator=0.789297
-.........I1226 12:33:37.720484 25576 TrainerInternal.cpp:181] Pass=0 Batch=391 samples=50000 AvgCost=2.03348 Eval: classification_error_evaluator=0.79958
-I1226 12:33:42.413450 25576 Tester.cpp:115] Test samples=10000 cost=1.99246 Eval: classification_error_evaluator=0.7858
+TrainerInternal.cpp:165] Batch=300 samples=38400 AvgCost=2.07708 CurrentCost=1.96158 Eval: classification_error_evaluator=0.81151 CurrentEval: classification_error_evaluator=0.789297
+TrainerInternal.cpp:181] Pass=0 Batch=391 samples=50000 AvgCost=2.03348 Eval: classification_error_evaluator=0.79958
+Tester.cpp:115] Test samples=10000 cost=1.99246 Eval: classification_error_evaluator=0.7858
 ```
 
-Below is the classification error curve during training:
-![Training and testing curves.](image/plot.png)
+Below is the classification error curve during training:
+
+Figure 12. Classification error of the VGG model on the CIFAR10 dataset

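+A curve like the one above can be recovered from `train.log`; the snippet below is one possible way to parse the end-of-pass evaluator values, not a script shipped with this tutorial.
+
+```python
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import re
+
+# match the per-pass summary lines, e.g.
+# "Pass=0 Batch=391 ... classification_error_evaluator=0.79958"
+pattern = re.compile(r'Pass=\d+ Batch=\d+ .*?classification_error_evaluator=([\d.]+)')
+
+errors = []
+with open('train.log') as f:
+    for line in f:
+        m = pattern.search(line)
+        if m:
+            errors.append(float(m.group(1)))
+
+plt.plot(errors)
+plt.xlabel('pass')
+plt.ylabel('training classification error')
+plt.savefig('plot.png')
+```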
## Model Application
 
-After training completes, the model and its parameters are saved under `./output/pass-%05d`. For example, the model of the 300th pass is saved at `./output/pass-00299`.
+After training completes, the model is saved under `output/pass-%05d`; for example, the model of the 300th pass is saved at `output/pass-00299`. The script `classify.py` can use a model to predict an image's label or to extract features; note that by default it uses `models/vgg.py` as the model configuration.
 
-To classify an image, we can use `predict.sh`, which prints the predicted label:
-```
-sh predict.sh
-```
-predict.sh:
-```
-model=output/pass-00299/
-image=data/cifar-out/test/airplane/seaplane_s_000978.png
-use_gpu=1
-python prediction.py $model $image $use_gpu
-```
+### Prediction
+
+An image's category can be predicted as follows. Prediction uses the GPU by default; to predict on the CPU, simply append the `-c` flag.
+
+```bash
+python classify.py --job=predict --model=output/pass-00299 --data=image/dog.png # -c
+```
+
+The prediction result is:
+
+```text
+Label of image/dog.png is: 5
+```
+
+### Feature Extraction
+
+Features can be extracted from an image as follows. Unlike prediction, the job type is set to extract and the layer to extract from must be specified. By default, `classify.py` extracts the features of the first convolutional layer as an example and renders a visualization. The first convolutional layer of the VGG model has 64 channels, and Figure 13 shows a grayscale image for each channel.
+
+```bash
+python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png # -c
+```

+
+Figure 13. Visualization of the features of the first convolutional layer

+
+## Summary
+
+Traditional image classification methods involve multiple stages and fairly complex frameworks, while an end-to-end CNN does it all in one step and greatly improves classification accuracy. In this tutorial we introduced three classic models, VGG, GoogleNet, and ResNet, and showed how to configure and train CNN models with PaddlePaddle on the CIFAR10 dataset, especially the VGG and ResNet models. Finally, we showed how to predict an image's label and extract its features through PaddlePaddle's API. The configuration and training procedure are the same for other datasets, such as ImageNet.
+
 ## References
 
-[1]. K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC, 2014.
+[1] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
+
+[2] N. Dalal, B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
+
+[3] T. Ahonen, A. Hadid, M. Pietikäinen. Face description with local binary patterns: Application to face recognition. PAMI, 28, 2006.
+
+[4] J. Sivic, A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pp. 1470-1478, 2003.
+
+[5] B. Olshausen, D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, vol. 37, pp. 3311-3325, 1997.
+
+[6] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
+
+[7] F. Perronnin, J. Sánchez, T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV (4), 2010.
+
+[8] Y. Lin, F. Lv, L. Cao, S. Zhu, M. Yang, T. Cour, K. Yu, T. Huang. Large-scale image classification: Fast feature extraction and SVM training. In CVPR, 2011.
+
+[9] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
+
+[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
+
+[11] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC, 2014.
+
+[12] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
+
+[13] M. Lin, Q. Chen, S. Yan. Network in network. In ICLR, 2014.
+
+[14] S. Ioffe, C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
+
+[15] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
+
+[16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
+
+[17] C. Szegedy, S. Ioffe, V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261, 2016.
+
+
+
+
-[2]. K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
\ No newline at end of file
diff --git a/image_classification/classify.py b/image_classification/classify.py
new file mode 100644
index 0000000000000000000000000000000000000000..b853991b238868efad0dfa5cb780660ace894f67
--- /dev/null
+++ b/image_classification/classify.py
@@ -0,0 +1,223 @@
+# Copyright (c) 2016 Baidu, Inc. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os, sys
+import cPickle
+import numpy as np
+from PIL import Image
+from optparse import OptionParser
+
+import paddle.utils.image_util as image_util
+from py_paddle import swig_paddle, DataProviderConverter
+from paddle.trainer.PyDataProvider2 import dense_vector
+from paddle.trainer.config_parser import parse_config
+
+import logging
+logging.basicConfig(
+    format='[%(levelname)s %(asctime)s %(filename)s:%(lineno)s] %(message)s')
+logging.getLogger().setLevel(logging.INFO)
+
+
+def vis_square(data, fname):
+    """Take an array of shape (n, height, width) or (n, height, width, 3)
+    and visualize each (height, width) slice in a grid of size
+    approx. sqrt(n) by sqrt(n)."""
+    import matplotlib
+    matplotlib.use('Agg')
+    import matplotlib.pyplot as plt
+
+    # normalize data for display
+    data = (data - data.min()) / (data.max() - data.min())
+    # force the number of filters to be square
+    n = int(np.ceil(np.sqrt(data.shape[0])))
+    padding = (((0, n ** 2 - data.shape[0]),
+               (0, 1), (0, 1))                 # add some space between filters
+               + ((0, 0),) * (data.ndim - 3))  # don't pad the last dimension (if there is one)
+    data = np.pad(data, padding, mode='constant', constant_values=1)  # pad with ones (white)
+    # tile the filters into an image
+    data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(range(4, data.ndim + 1)))
+    data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])
+    plt.imshow(data, cmap='gray')
+    plt.savefig(fname)
+    plt.axis('off')
+
+class ImageClassifier():
+    def __init__(self,
+                 train_conf,
+                 resize_dim,
+                 crop_dim,
+                 model_dir=None,
+                 use_gpu=True,
+                 mean_file=None,
+                 oversample=False,
+                 is_color=True):
+        self.train_conf = train_conf
+        self.model_dir = model_dir
+        if model_dir is None:
+            self.model_dir = os.path.dirname(train_conf)
+
+        self.resize_dim = resize_dim
+        self.crop_dims = [crop_dim, crop_dim]
+        self.oversample = oversample
+        self.is_color = is_color
+
+        self.transformer = image_util.ImageTransformer(is_color=is_color)
+        self.transformer.set_transpose((2, 0, 1))
+        self.transformer.set_channel_swap((2, 1, 0))
+
+        self.mean_file = mean_file
+        if self.mean_file is not None:
+            mean = np.load(self.mean_file)['mean']
+            mean = mean.reshape(3, self.crop_dims[0], self.crop_dims[1])
+            self.transformer.set_mean(mean)  # mean pixel
+        else:
+            # If no mean file is given, fall back to per-channel mean values;
+            # these three values are calculated from ImageNet.
+            self.transformer.set_mean(np.array([103.939, 116.779, 123.68]))
+
+        conf_args = "use_gpu=%d,is_predict=1" % (int(use_gpu))
+        conf = parse_config(train_conf, conf_args)
+        swig_paddle.initPaddle("--use_gpu=%d" % (int(use_gpu)))
+        self.network = swig_paddle.GradientMachine.createFromConfigProto(conf.model_config)
+        assert isinstance(self.network, swig_paddle.GradientMachine)
+        self.network.loadParameters(self.model_dir)
+
+        dim = 3 * self.crop_dims[0] * self.crop_dims[1]
+        slots = [dense_vector(dim)]
+        self.converter = DataProviderConverter(slots)
+
+    def get_data(self, img_path):
+        """
+        1. load image from img_path.
+        2. resize or oversampling.
+        3. transform data: transpose, channel swap, subtract mean.
+        return K x H x W ndarray.
+
+        img_path: image path.
+        """
+        image = image_util.load_image(img_path, self.is_color)
+        # Oversampled features can also be extracted by cropping and
+        # averaging from a large feature map computed on a larger input
+        # image; that approach reduces the computation.
+        if self.oversample:
+            image = image_util.resize_image(image, self.resize_dim)
+            image = np.array(image)
+            input = np.zeros((1, image.shape[0], image.shape[1], 3),
+                             dtype=np.float32)
+            input[0] = image.astype(np.float32)
+            input = image_util.oversample(input, self.crop_dims)
+        else:
+            image = image.resize(self.crop_dims, Image.ANTIALIAS)
+            input = np.zeros((1, self.crop_dims[0], self.crop_dims[1], 3),
+                             dtype=np.float32)
+            input[0] = np.array(image).astype(np.float32)
+
+        data_in = []
+        for img in input:
+            img = self.transformer.transformer(img).flatten()
+            data_in.append([img.tolist()])
+        return data_in
+
+    def forward(self, data, output_layer):
+        input = self.converter(data)
+        self.network.forwardTest(input)
+        output = self.network.getLayerOutputs(output_layer)
+        res = {}
+        if isinstance(output_layer, basestring):
+            output_layer = [output_layer]
+        for name in output_layer:
+            # For oversampling, average predictions across crops.
+            # If not, the shape of output[name] is (1, class_number);
+            # the mean is applicable in that case too.
+            res[name] = output[name].mean(0)
+        return res
+
+def option_parser():
+    usage = "%prog -c config -i data_list -w model_dir [options]"
+    parser = OptionParser(usage="usage: %s" % usage)
+    parser.add_option("--job",
+                      action="store",
+                      dest="job_type",
+                      choices=['predict', 'extract',],
+                      default='predict',
+                      help="The job type. \
+                            predict: predicting, \
+                            extract: extract features")
+    parser.add_option("--conf",
+                      action="store",
+                      dest="train_conf",
+                      default='models/vgg.py',
+                      help="network config")
+    parser.add_option("--data",
+                      action="store",
+                      dest="data_file",
+                      default='image/dog.png',
+                      help="image list")
+    parser.add_option("--model",
+                      action="store",
+                      dest="model_path",
+                      default=None,
+                      help="model path")
+    parser.add_option("-c",
+                      dest="cpu_gpu",
+                      action="store_false",
+                      help="Use cpu mode.")
+    parser.add_option("-g",
+                      dest="cpu_gpu",
+                      default=True,
+                      action="store_true",
+                      help="Use gpu mode.")
+    parser.add_option("--mean",
+                      action="store",
+                      dest="mean",
+                      default='data/mean.meta',
+                      help="The mean file.")
+    parser.add_option("--multi_crop",
+                      action="store_true",
+                      dest="multi_crop",
+                      default=False,
+                      help="Whether to use multiple crops on image.")
+    return parser.parse_args()
+
+def main():
+    options, args = option_parser()
+    mean = 'data/mean.meta' if not options.mean else options.mean
+    conf = 'models/vgg.py' if not options.train_conf else options.train_conf
+    obj = ImageClassifier(conf,
+                          32, 32,
+                          options.model_path,
+                          use_gpu=options.cpu_gpu,
+                          mean_file=mean,
+                          oversample=options.multi_crop)
+    image_path = options.data_file
+    if options.job_type == 'predict':
+        output_layer = '__fc_layer_2__'
+        data = obj.get_data(image_path)
+        prob = obj.forward(data, output_layer)
+        lab = np.argsort(-prob[output_layer])
+        logging.info("Label of %s is: %d", image_path, lab[0])
+
+    elif options.job_type == "extract":
+        output_layer = '__conv_0__'
+        data = obj.get_data(options.data_file)
+        features = obj.forward(data, output_layer)
+        dshape = (64, 32, 32)
+        fea = features[output_layer].reshape(dshape)
+        vis_square(fea, 'fea_conv0.png')
+
+if __name__ == '__main__':
+    main()
diff --git a/image_classification/dataprovider.py b/image_classification/dataprovider.py
index b4ba430c63539e17214bea5102bb91b90b226519..f9921bd025d19d8102b320b8075e1643ed4e18c2 100644
--- a/image_classification/dataprovider.py
+++ b/image_classification/dataprovider.py
@@ -14,7 +14,6 @@
 
 import numpy as np
 import cPickle
-
 from paddle.trainer.PyDataProvider2 import *
 
 def initializer(settings, mean_path, is_train, **kwargs):
diff --git a/image_classification/extract.sh b/image_classification/extract.sh
new file mode 100755
index 0000000000000000000000000000000000000000..93004788c57922033f58818031b63caf42e32765
--- /dev/null
+++ b/image_classification/extract.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+set -e
+
+python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png # -c
diff --git a/image_classification/image/dog.png b/image_classification/image/dog.png
new file mode 100644
index 0000000000000000000000000000000000000000..ca8f858a902ea723d886d2b88c2c0a1005301c50
Binary files /dev/null and b/image_classification/image/dog.png differ
diff --git a/image_classification/image/image_classification.png b/image_classification/image/dog_cat.png
similarity index 100%
rename from image_classification/image/image_classification.png
rename to image_classification/image/dog_cat.png
diff --git a/image_classification/image/fea_conv0.png b/image_classification/image/fea_conv0.png
new file mode 100644
index 0000000000000000000000000000000000000000..2341003802d3c6ace0b35172e06bce22fea33710
Binary files /dev/null and b/image_classification/image/fea_conv0.png differ
diff --git a/image_classification/image/inception.png b/image_classification/image/inception.png
new file mode 100644
index 0000000000000000000000000000000000000000..2b7449fd999ef5ca34c55fb22c81c11e0748bbf1
Binary files /dev/null and b/image_classification/image/inception.png differ
diff --git a/image_classification/image/resnet.png b/image_classification/image/resnet.png
index 3714fe23bb550c00700b6803c98716ea8595026b..bce224b7e66ac071111867f2a6e653fe93b46412 100644
Binary files a/image_classification/image/resnet.png and b/image_classification/image/resnet.png differ
diff --git a/image_classification/models/resnet.py b/image_classification/models/resnet.py
index 97ab084440f2aab7239f2e11a4091fa3268edc60..a621e2ad25598914bf65dc9b89ec1a6581d3bc73 100644
--- a/image_classification/models/resnet.py
+++ b/image_classification/models/resnet.py
@@ -14,13 +14,33 @@
 
 from paddle.trainer_config_helpers import *
 
+is_predict = get_config_arg("is_predict", bool, False)
+if not is_predict:
+    args = {'meta': 'data/mean.meta'}
+    define_py_data_sources2(
+        train_list='data/train.list',
+        test_list='data/test.list',
+        module='dataprovider',
+        obj='process',
+        args=args)
+
+settings(
+    batch_size=128,
+    learning_rate=0.1 / 128.0,
+    learning_rate_decay_a=0.1,
+    learning_rate_decay_b=50000 * 100,
+    learning_rate_schedule='discexp',
+    learning_method=MomentumOptimizer(0.9),
+    regularization=L2Regularization(0.0001 * 128))
+
+
 def conv_bn_layer(input,
                   ch_out,
                   filter_size,
                   stride,
                   padding,
-                  ch_in=None,
-                  active_type=ReluActivation()):
+                  active_type=ReluActivation(),
+                  ch_in=None):
     tmp = img_conv_layer(
         input=input,
         filter_size=filter_size,
@@ -35,16 +55,16 @@ def conv_bn_layer(input,
 
 def shortcut(ipt, n_in, n_out, stride):
     if n_in != n_out:
-        return conv_bn_layer(ipt, n_out, 1, stride=stride, LinearActivation())
+        return conv_bn_layer(ipt, n_out, 1, stride, 0, LinearActivation())
     else:
         return ipt
 
 def basicblock(ipt, ch_out, stride):
-    ch_in = ipt.num_filter
+    ch_in = ipt.num_filters
     tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1)
     tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, LinearActivation())
     short = shortcut(ipt, ch_in, ch_out, stride)
-    return addto_layer(input=[input, short], act=ReluActivation())
+    return addto_layer(input=[tmp, short], act=ReluActivation())
 
 def bottleneck(ipt, ch_out, stride):
     ch_in = ipt.num_filter
@@ -52,13 +72,13 @@ def bottleneck(ipt, ch_out, stride):
     tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1)
     tmp = conv_bn_layer(tmp, ch_out * 4, 1, 1, 0, LinearActivation())
     short = shortcut(ipt, ch_in, ch_out, stride)
-    return addto_layer(input=[input, short], act=ReluActivation())
+    return addto_layer(input=[tmp, short], act=ReluActivation())
 
 def layer_warp(block_func, ipt, features, count, stride):
-    tmp = block_func(tmp, features, stride)
+    tmp = block_func(ipt, features, stride)
     for i in range(1, count):
         tmp = block_func(tmp, features, 1)
-    return tmp
+    return tmp
 
 def resnet_imagenet(ipt, depth=50):
     cfg = {18 : ([2,2,2,1], basicblock),
@@ -96,42 +116,23 @@ def resnet_cifar10(ipt, depth=56):
                         filter_size=3,
                         stride=1,
                         padding=1)
-    tmp = layer_warp(basicblock, tmp, 16, n)
+    tmp = layer_warp(basicblock, tmp, 16, n, 1)
     tmp = layer_warp(basicblock, tmp, 32, n, 2)
     tmp = layer_warp(basicblock, tmp, 64, n, 2)
     tmp = img_pool_layer(input=tmp,
                          pool_size=8,
                          stride=1,
                          pool_type=AvgPooling())
-    tmp = fc_layer(input=tmp, size=10, act=SoftmaxActivation())
     return tmp
 
-is_predict = get_config_arg("is_predict", bool, False)
-if not is_predict:
-    args = {'meta': 'data/mean.meta'}
-    define_py_data_sources2(
-        train_list='data/train.list',
-        test_list='data/test.list',
-        module='dataprovider',
-        obj='process',
-        args=args)
-
-settings(
-    batch_size=128,
-    learning_rate=0.1 / 128.0,
-    learning_rate_decay_a=0.1,
-    learning_rate_decay_b=50000 * 100,
-    learning_rate_schedule='discexp',
-    learning_method=MomentumOptimizer(0.9),
-    regularization=L2Regularization(0.0005 * 128))
-
-data_size = 3 * 32 * 32
-class_num = 10
-data = data_layer(name='image', size=data_size)
-out = resnet_cifar10(data, depth=50)
+datadim = 3 * 32 * 32
+classdim = 10
+data = data_layer(name='image', size=datadim)
+net = resnet_cifar10(data, depth=56)
+out = fc_layer(input=net, size=classdim, act=SoftmaxActivation())
 if not is_predict:
-    lbl = data_layer(name="label", size=class_num)
+    lbl = data_layer(name="label", size=classdim)
     outputs(classification_cost(input=out, label=lbl))
 else:
     outputs(out)
diff --git a/image_classification/models/vgg.py b/image_classification/models/vgg.py
index 104be53556aae807566292e293993c736aa9fdb6..cc05b676608411a4927d9eef782dd34a35ec9a7a 100644
--- a/image_classification/models/vgg.py
+++ b/image_classification/models/vgg.py
@@ -14,11 +14,30 @@
 
 from paddle.trainer_config_helpers import *
 
-def vgg_bn_drop(input, num_channels):
-    def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
+is_predict = get_config_arg("is_predict", bool, False)
+if not is_predict:
+    define_py_data_sources2(
+        train_list='data/train.list',
+        test_list='data/test.list',
+        module='dataprovider',
+        obj='process',
+        args={'mean_path': 'data/mean.meta'})
+
+settings(
+    batch_size=128,
+    learning_rate=0.1 / 128.0,
+    learning_rate_decay_a=0.1,
+    learning_rate_decay_b=50000 * 100,
+    learning_rate_schedule='discexp',
+    learning_method=MomentumOptimizer(0.9),
+    regularization=L2Regularization(0.0005 * 128),)
+
+
+def vgg_bn_drop(input):
+    def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
         return img_conv_group(
             input=ipt,
-            num_channels=num_channels_,
+            num_channels=num_channels,
             pool_size=2,
             pool_stride=2,
             conv_num_filter=[num_filter] * groups,
@@ -28,7 +47,7 @@ def vgg_bn_drop(input, num_channels):
             conv_batchnorm_drop_rate=dropouts,
             pool_type=MaxPooling())
 
-    tmp = conv_block(input, 64, 2, [0.3, 0], num_channels)
+    tmp = conv_block(input, 64, 2, [0.3, 0], 3)
     tmp = conv_block(tmp, 128, 2, [0.4, 0])
     tmp = conv_block(tmp, 256, 3, [0.4, 0.4, 0])
     tmp = conv_block(tmp, 512, 3, [0.4, 0.4, 0])
@@ -46,33 +65,16 @@ def vgg_bn_drop(input, num_channels):
                        input=tmp,
                        size=512,
                        act=LinearActivation())
-    tmp = fc_layer(input=tmp, size=10, act=SoftmaxActivation())
     return tmp
 
-is_predict = get_config_arg("is_predict", bool, False)
-if not is_predict:
- define_py_data_sources2( - train_list='data/train.list', - test_list='data/test.list', - module='dataprovider', - obj='process', - args={'mean_path': 'data/mean.meta'}) - -settings( - batch_size=128, - learning_rate=0.1 / 128.0, - learning_rate_decay_a=0.1, - learning_rate_decay_b=50000 * 100, - learning_rate_schedule='discexp', - learning_method=MomentumOptimizer(0.9), - regularization=L2Regularization(0.0005 * 128),) - -data_size = 3 * 32 * 32 -class_num = 10 -data = data_layer(name='image', size=data_size) -out = vgg_bn_drop(data, 3) +datadim = 3 * 32 * 32 +classdim = 10 +data = data_layer(name='image', size=datadim) +net = vgg_bn_drop(data) +out = fc_layer(input=net, size=classdim, act=SoftmaxActivation()) if not is_predict: - lbl = data_layer(name="label", size=class_num) - outputs(classification_cost(input=out, label=lbl)) + lbl = data_layer(name="label", size=classdim) + cost = classification_cost(input=out, label=lbl) + outputs(cost) else: outputs(out) diff --git a/image_classification/predict.sh b/image_classification/predict.sh index 3bba2a94f70d3a687d8526bf10ba0554f9a1bd38..2e76191bea57c373f6e8f98c5e66b02e6eccf6b1 100755 --- a/image_classification/predict.sh +++ b/image_classification/predict.sh @@ -14,7 +14,4 @@ # limitations under the License. set -e -model=output/pass-00299/ -image=data/cifar-out/test/airplane/seaplane_s_000978.png -use_gpu=1 -python prediction.py $model $image $use_gpu +python classify.py --job=predict --model=output/pass-00299 --data=image/dog.png # -c diff --git a/image_classification/prediction.py b/image_classification/prediction.py deleted file mode 100755 index 519b864221c490de71fbeeebcc607935540db208..0000000000000000000000000000000000000000 --- a/image_classification/prediction.py +++ /dev/null @@ -1,159 +0,0 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os, sys -import numpy as np -import logging -from PIL import Image -from optparse import OptionParser - -import paddle.utils.image_util as image_util - -from py_paddle import swig_paddle, DataProviderConverter -from paddle.trainer.PyDataProvider2 import dense_vector -from paddle.trainer.config_parser import parse_config - -logging.basicConfig( - format='[%(levelname)s %(asctime)s %(filename)s:%(lineno)s] %(message)s') -logging.getLogger().setLevel(logging.INFO) - - -class ImageClassifier(): - def __init__(self, - train_conf, - use_gpu=True, - model_dir=None, - resize_dim=None, - crop_dim=None, - mean_file=None, - oversample=False, - is_color=True): - """ - train_conf: network configure. - model_dir: string, directory of model. - resize_dim: int, resized image size. - crop_dim: int, crop size. - mean_file: string, image mean file. - oversample: bool, oversample means multiple crops, namely five - patches (the four corner patches and the center - patch) as well as their horizontal reflections, - ten crops in all. 
- """ - self.train_conf = train_conf - self.model_dir = model_dir - if model_dir is None: - self.model_dir = os.path.dirname(train_conf) - - self.resize_dim = resize_dim - self.crop_dims = [crop_dim, crop_dim] - self.oversample = oversample - self.is_color = is_color - - self.transformer = image_util.ImageTransformer(is_color=is_color) - self.transformer.set_transpose((2, 0, 1)) - - self.mean_file = mean_file - mean = np.load(self.mean_file)['data_mean'] - mean = mean.reshape(3, self.crop_dims[0], self.crop_dims[1]) - self.transformer.set_mean(mean) # mean pixel - gpu = 1 if use_gpu else 0 - conf_args = "is_test=1,use_gpu=%d,is_predict=1" % (gpu) - conf = parse_config(train_conf, conf_args) - swig_paddle.initPaddle("--use_gpu=%d" % (gpu)) - self.network = swig_paddle.GradientMachine.createFromConfigProto( - conf.model_config) - assert isinstance(self.network, swig_paddle.GradientMachine) - self.network.loadParameters(self.model_dir) - - data_size = 3 * self.crop_dims[0] * self.crop_dims[1] - slots = [dense_vector(data_size)] - self.converter = DataProviderConverter(slots) - - def get_data(self, img_path): - """ - 1. load image from img_path. - 2. resize or oversampling. - 3. transformer data: transpose, sub mean. - return K x H x W ndarray. - img_path: image path. - """ - image = image_util.load_image(img_path, self.is_color) - if self.oversample: - # image_util.resize_image: short side is self.resize_dim - image = image_util.resize_image(image, self.resize_dim) - image = np.array(image) - input = np.zeros( - (1, image.shape[0], image.shape[1], 3), dtype=np.float32) - input[0] = image.astype(np.float32) - input = image_util.oversample(input, self.crop_dims) - else: - image = image.resize(self.crop_dims, Image.ANTIALIAS) - input = np.zeros( - (1, self.crop_dims[0], self.crop_dims[1], 3), dtype=np.float32) - input[0] = np.array(image).astype(np.float32) - - data_in = [] - for img in input: - img = self.transformer.transformer(img).flatten() - data_in.append([img.tolist()]) - return data_in - - def forward(self, input_data): - in_arg = self.converter(input_data) - return self.network.forwardTest(in_arg) - - def forward(self, data, output_layer): - """ - input_data: py_paddle input data. - output_layer: specify the name of probability, namely the layer with - softmax activation. - return: the predicting probability of each label. - """ - input = self.converter(data) - self.network.forwardTest(input) - output = self.network.getLayerOutputs(output_layer) - # For oversampling, average predictions across crops. - # If not, the shape of output[name]: (1, class_number), - # the mean is also applicable. 
- return output[output_layer].mean(0) - - def predict(self, image=None, output_layer=None): - assert isinstance(image, basestring) - assert isinstance(output_layer, basestring) - data = self.get_data(image) - prob = self.forward(data, output_layer) - lab = np.argsort(-prob) - logging.info("Label of %s is: %d", image, lab[0]) - - -if __name__ == '__main__': - image_size = 32 - crop_size = 32 - multi_crop = True - config = "vgg_16_cifar.py" - output_layer = "__fc_layer_1__" - mean_path = "data/batches.meta" - model_path = sys.argv[1] - image = sys.argv[2] - use_gpu = bool(int(sys.argv[3])) - - obj = ImageClassifier( - train_conf=config, - model_dir=model_path, - resize_dim=image_size, - crop_dim=crop_size, - mean_file=mean_path, - use_gpu=use_gpu, - oversample=multi_crop) - obj.predict(image, output_layer) diff --git a/image_classification/train.sh b/image_classification/train.sh index 3fac171eafa3e5d9f4e440dca488b8de43219dc5..c8e82a4eb3f4c24e7a2f2cb42afd20bd0585c776 100755 --- a/image_classification/train.sh +++ b/image_classification/train.sh @@ -14,9 +14,9 @@ # limitations under the License. set -e -#config=models/resnet.py -config=models/vgg.py -output=./output +config=models/resnet.py +#config=models/vgg.py +output=output log=train.log paddle train \ @@ -26,4 +26,4 @@ paddle train \ --log_period=100 \ --num_passes=300 \ --save_dir=$output - 2>&1 | tee $log + #2>&1 | tee $log