From 2c60213ebf5a9f10cc4ebac712675df5ec9ca644 Mon Sep 17 00:00:00 2001 From: julie Date: Sat, 16 Sep 2017 14:49:19 -0700 Subject: [PATCH] Single Shot MultiBox Detector English README --- ssd/README.md | 117 +++++++++++++++++++++++++------------------------ ssd/index.html | 117 +++++++++++++++++++++++++------------------------ 2 files changed, 120 insertions(+), 114 deletions(-) diff --git a/ssd/README.md b/ssd/README.md index 9467c438..666f5572 100644 --- a/ssd/README.md +++ b/ssd/README.md @@ -1,46 +1,47 @@ -# SSD目标检测 -## 概述 -SSD全称为Single Shot MultiBox Detector,是目标检测领域较新且效果较好的检测算法之一,具体参见论文\[[1](#引用)\]。SSD算法主要特点是检测速度快且检测精度高。PaddlePaddle已集成SSD算法,本示例旨在介绍如何使用PaddlePaddle中的SSD模型进行目标检测。下文展开顺序为:首先简要介绍SSD原理,然后介绍示例包含文件及作用,接着介绍如何在PASCAL VOC数据集上训练、评估及检测,最后简要介绍如何在自有数据集上使用SSD。 -## SSD原理 -SSD使用一个卷积神经网络实现“端到端”的检测,所谓“端到端”指输入为原始图像,输出为检测结果,无需借助外部工具或流程进行特征提取、候选框生成等。论文中SSD的基础模型为VGG16\[[2](#引用)\],不同于原始VGG16网络模型,SSD做了一些改变: +# Single Shot MultiBox Detector (SSD) Object Detection -1. 将最后的fc6、fc7全连接层变为卷积层,卷积层参数通过对原始fc6、fc7参数采样得到。 -2. 将pool5层的参数由2x2-s2(kernel大小为2x2,stride size为2)更改为3x3-s1-p1(kernel大小为3x3,stride size为1,padding size为1)。 -3. 在conv4\_3、conv7、conv8\_2、conv9\_2、conv10\_2及pool11层后面接了priorbox层,priorbox层的主要目的是根据输入的特征图(feature map)生成一系列的矩形候选框。关于SSD的更详细的介绍可以参考论文\[[1](#引用)\]。 +## Introduction +Single Shot MultiBox Detector (SSD) is one of the new and enhanced detection algorithms detecting objects in images [ 1 ]. SSD algorithm is characterized by rapid detection and high detection accuracy. PaddlePaddle has an integrated SSD algorithm! This example demonstrates how to use the SSD model in PaddlePaddle for object detection. We first provide a brief introduction to the SSD principle. Then we describe how to train, evaluate and test on the PASCAL VOC data set, and finally on how to use SSD on custom data set. -下图为模型(300x300)的总体结构: +## SSD Architecture +SSD uses a convolutional neural network to achieve end-to-end detection. The term "End-to-end" is used because it uses the input as the original image and the output for the test results, without the use of external tools or processes for feature extraction. One popular model of SSD is VGG16 [ 2 ]. SSD differs from VGG16 network model as in following. + +1. The final fc6, fc7 full connection layer into a convolution layer, convolution layer parameters through the original fc6, fc7 parameters obtained. +2. Change the parameters of the pool5 layer from 2x2-s2 (kernel size 2x2, stride size to 2) to 3x3-s1-p1 (kernel size is 3x3, stride size is 1, padding size is 1). +3. The initial layers are composed of conv4\_3、conv7、conv8\_2、conv9\_2、conv10\_2, and pool11 layers. The main purpose of the priorbox layer is to generate a series of rectangular candidates based on the input feature map. A more detailed introduction to SSD can be found in the paper\[[1](#References)\]。 + +Below is the overall structure of the model (300x300)


图1. SSD网络结构

-图中每个矩形盒子代表一个卷积层,最后的两个矩形框分别表示汇总各卷积层输出结果和后处理阶段。具体地,在预测阶段网络会输出一组候选矩形框,每个矩形包含两类信息:位置和类别得分,图中倒数第二个矩形框即表示网络的检测结果的汇总处理,由于候选矩形框数量较多且很多矩形框重叠严重,这时需要经过后处理来筛选出质量较高的少数矩形框,这里的后处理主要指非极大值抑制(Non-maximum Suppression)。 - -从SSD的网络结构可以看出,候选矩形框在多个特征图(feature map上)生成,不同的feature map具有的感受野不同,这样可以在不同尺度扫描图像,相对于其他检测方法可以生成更丰富的候选框,从而提高检测精度;另一方面SSD对VGG16的扩展部分以较小的代价实现对候选框的位置和类别得分的计算,整个过程只需要一个卷积神经网络完成,所以速度较快。 +Each box in the figure represents a convolution layer, and the last two rectangles represent the summary of each convolution layer output and the post-processing phase. Specifically, the network will output a set of candidate rectangles in the prediction phase. Each rectangle contains two types of information: the position and the category score. The network produces thousands of predictions at various scales and aspect ratios before performing non-maximum suppression, resulting in a handful of final tags. -## 示例总览 -本示例共包含如下文件: +## Example Overview +This example contains the following files: - - - - - - - - - - - + + + + + + + + + + +
表1. 示例文件
文件用途
train.py训练脚本
eval.py评估脚本,用于评估训好模型
infer.py检测脚本,给定图片及模型,实施检测
visual.py检测结果可视化
image_util.py图像预处理所需公共函数
data_provider.py数据处理脚本,生成训练、评估或检测所需数据
config/pascal_voc_conf.py神经网络超参数配置文件
data/label_list类别列表
data/prepare_voc_data.py准备训练PASCAL VOC数据列表
Table 1. Directory structure
FileDescription
train.pyTraining script
eval.pyEvaluation
infer.pyPrediction using the trained model
visual.pyVisualization of the test results
image_util.pyImage preprocessing required common function
data_provider.pyData processing scripts, generate training, evaluate or detect the required data
config/pascal_voc_conf.py Neural network hyperparameter configuration file
data/label_listLabel list
data/prepare_voc_data.pyPrepare training PASCAL VOC data list
-训练阶段需要对数据做预处理,包括裁剪、采样等,这部分操作在```image_util.py```和```data_provider.py```中完成。值得注意的是,```config/vgg_config.py```为参数配置文件,包括训练参数、神经网络参数等,本配置文件包含参数是针对PASCAL VOC数据配置的,当训练自有数据时,需要仿照该文件配置新的参数。```data/prepare_voc_data.py```脚本用来生成文件列表,包括切分训练集和测试集,使用时需要用户事先下载并解压数据,默认采用VOC2007和VOC2012。 +The training phase requires pre-processing of the data, including clipping, sampling, etc. This is done in ```image_util.py``` and ```data_provider.py```.```config/vgg_config.py```. ```data/prepare_voc_data.py``` is used to generate a list of files, including the training set and test set, the need to use the user to download and extract data, the default use of VOC2007 and VOC2012. -## PASCAL VOC数据集 -### 数据准备 -首先需要下载数据集,VOC2007\[[3](#引用)\]和VOC2012\[[4](#引用)\],VOC2007包含训练集和测试集,VOC2012只包含训练集,将下载好的数据解压,目录结构为```data/VOCdevkit/VOC2007```和```data/VOCdevkit/VOC2012```。进入```data```目录,运行```python prepare_voc_data.py```即可生成```trainval.txt```和```test.txt```。核心函数为: +## PASCAL VOC Data set + +### Data Preparation +First download the data set. VOC2007\[[3](#References)\] contains both training and test data set, and VOC2012\[[4](#References)\] contains only training set. Downloaded data are stored in ```data/VOCdevkit/VOC2007``` and ```data/VOCdevkit/VOC2012```. Next, run ```data/prepare_voc_data.py``` to generate ```trainval.txt``` and ```test.txt```. The relevant function is as following: ```python def prepare_filelist(devkit_dir, years, output_dir): @@ -60,7 +61,7 @@ def prepare_filelist(devkit_dir, years, output_dir): ftest.write(item[0] + ' ' + item[1] + '\n') ``` -该函数首先对每一年(year)的数据进行处理,然后将训练图像的文件路径列表进行随机打乱,最后保存训练文件列表和测试文件列表。默认```prepare_voc_data.py```和```VOCdevkit```在相同目录下,且生成的文件列表也在该目录。需注意```trainval.txt```既包含VOC2007的训练数据,也包含VOC2012的训练数据,```test.txt```只包含VOC2007的测试数据。我们这里提供```trainval.txt```前几行输入作为样例: +The data in ```trainval.txt``` will look like: ``` VOCdevkit/VOC2007/JPEGImages/000005.jpg VOCdevkit/VOC2007/Annotations/000005.xml @@ -68,12 +69,14 @@ VOCdevkit/VOC2007/JPEGImages/000007.jpg VOCdevkit/VOC2007/Annotations/000007.xml VOCdevkit/VOC2007/JPEGImages/000009.jpg VOCdevkit/VOC2007/Annotations/000009.xml ``` -文件共两个字段,第一个字段为图像文件的相对路径,第二个字段为对应标注文件的相对路径。 +The first field is the relative path of the image file, and the second field is the relative path of the corresponding label file. + -### 预训练模型准备 -下载预训练的VGG-16模型,我们提供了一个转换好的模型,具体下载地址为:http://paddlepaddle.bj.bcebos.com/model_zoo/detection/ssd_model/vgg_model.tar.gz ,下载好模型后,放置路径为```vgg/vgg_model.tar.gz```。 -### 模型训练 -直接执行```python train.py```即可进行训练。需要注意本示例仅支持CUDA GPU环境,无法在CPU上训练,主要因为使用CPU训练速度很慢,实践中一般使用GPU来处理图像任务,这里实现采用硬编码方式使用cuDNN,不提供CPU版本。```train.py```的一些关键执行逻辑: +### To Use Pre-trained Model +We also provide a pre-trained model using VGG-16 with good performance. To use the model, download the file http://paddlepaddle.bj.bcebos.com/model_zoo/detection/ssd_model/vgg_model.tar.gz, and place it as ```vgg/vgg_model.tar.gz```。 + +### Training +Next, run ```python train.py``` to train the model. Note that this example only supports the CUDA GPU environment, and can not be trained on only CPU. This is mainly because the training is very slow using CPU only. ```python paddle.init(use_gpu=True, trainer_count=4) @@ -89,14 +92,14 @@ train(train_file_list='./data/trainval.txt', init_model_path='./vgg/vgg_model.tar.gz') ``` -主要包括: +Below is a description about this script: -1. 调用```paddle.init```指定使用4卡GPU训练。 -2. 调用```data_provider.Settings```配置数据预处理所需参数,其中```cfg.IMG_HEIGHT```和```cfg.IMG_WIDTH```在配置文件```config/vgg_config.py```中设置,这里均为300,300x300是一个典型配置,兼顾效率和检测精度,也可以通过修改配置文件扩展到512x512。 -3. 调用```train```执行训练,其中```train_file_list```指定训练数据列表,```dev_file_list```指定评估数据列表,```init_model_path```指定预训练模型位置。 -4. 训练过程中会打印一些日志信息,每训练1个batch会输出当前的轮数、当前batch的cost及mAP(mean Average Precision,平均精度均值),每训练一个pass,会保存一次模型,默认保存在```checkpoints```目录下(注:需事先创建)。 +1. Call ```paddle.init``` with 4 GPUs. +2. ```data_provider.Settings()``` is to pass configuration parameters. For ```config/vgg_config.py``` setting,300x300 is a typical configuration for both the accuracy and efficiency. It can be extended to 512x512 by modifying the configuration file. +3. In ```train()```执 function, ```train_file_list``` specifies the training data list, and ```dev_file_list``` specifies the evaluation data list, and ```init_model_path``` specifies the pre-training model location. +4. During the training process will print some log information, each training a batch will output the current number of rounds, the current batch cost and mAP (mean Average Precision. Each training pass will be saved a model to the default saved directory ```heckpoints``` (Need to be created in advance). -下面给出SDD300x300在VOC数据集(train包括07+12,test为07)上的mAP曲线,迭代140轮mAP可达到71.52%。 +The following shows the SDD300x300 in the VOC data set.


@@ -104,8 +107,8 @@ train(train_file_list='./data/trainval.txt',

-### 模型评估 -执行```python eval.py```即可对模型进行评估,```eval.py```的关键执行逻辑如下: +### Model Assessment +Next, run ```python eval.py``` to evaluate the model. ```python paddle.init(use_gpu=True, trainer_count=4) # use 4 gpus @@ -124,10 +127,8 @@ eval( model_path='models/pass-00000.tar.gz') ``` -调用```paddle.init```指定使用4卡GPU评估;```data_provider.Settings```参见训练阶段的配置;调用```eval```执行评估,其中```eval_file_list```指定评估数据列表,```batch_size```指定评估时batch size的大小,```model_path ```指定模型位置。评估结束会输出```loss```信息和```mAP```信息。 - -### 图像检测 -执行```python infer.py```即可使用训练好的模型对图片实施检测,```infer.py```关键逻辑如下: +### Obejct Detection +Run ```python infer.py``` to perform the object detection using the trained model. ```python infer( @@ -139,7 +140,9 @@ infer( threshold=0.3) ``` -其中```eval_file_list```指定图像路径列表;```save_path```指定预测结果保存路径;```data_args```如上;```batch_size```为每多少样本预测一次;```model_path```指模型的位置;```threshold```为置信度阈值,只有得分大于或等于该值的才会输出。下面给出```infer.res```的一些输出样例: + +Here ```eval_file_list``` specified image path list, ```save_path``` specifies directory to save the prediction result. + ``` VOCdevkit/VOC2007/JPEGImages/006936.jpg 12 0.997844 131.255611777 162.271582842 396.475315094 334.0 @@ -150,19 +153,19 @@ VOCdevkit/VOC2007/JPEGImages/006936.jpg 14 0.372522 187.543615699 133.727034628 一共包含4个字段,以tab分割,第一个字段是检测图像路径,第二字段为检测矩形框内类别,第三个字段是置信度,第四个字段是4个坐标值(以空格分割)。 -示例还提供了一个可视化脚本,直接运行```python visual.py```即可,须指定输出检测结果路径及输出目录,默认可视化后图像保存在```./visual_res```,下面是用训练好的模型infer部分图像并可视化的效果: +Below is the example after running ```python visual.py``` to visualize the model result. The default visualization of the image saved in the ```./visual_res```.


-图3. SSD300x300 检测可视化示例 +Figure 3. SSD300x300 Visualization Example

-## 自有数据集 -在自有数据上训练PaddlePaddle SSD需要完成两个关键准备,首先需要适配网络可以接受的输入格式,这里提供一个推荐的结构,以```train.txt```为例 +## To Use Custo Data set +In PaddlePaddle, using the custom data set to train SSD model is also easy! Just input the format that ```train.txt``` can understand. Below is a recommended structure to input for ```train.txt```. ``` image00000_file_path image00000_annotation_file_path @@ -171,7 +174,7 @@ image00002_file_path image00002_annotation_file_path ... ``` -文件共两列,以空白符分割,第一列为图像文件的路径,第二列为对应标注数据的文件路径。对图像文件的读取比较直接,略微复杂的是对标注数据的解析,本示例中标注数据使用xml文件存储,所以需要在```data_provider.py```中对xml解析,核心逻辑如下: +The first column is for the image file path, and the second column for the corresponding marked data file path. In the case of using xml file format, ```data_provider.py``` can be used to process the data as follows. ```python bbox_labels = [] @@ -191,7 +194,7 @@ for object in root.findall('object'): bbox_labels.append(bbox_sample) ``` -这里一条标注数据包括:label、xmin、ymin、xmax、ymax和is\_difficult,is\_difficult表示该object是否为难例,实际中如果不需要,只需把该字段置零即可。自有数据也需要提供对应的解析逻辑,假设标注数据(比如image00000\_annotation\_file\_path)存储格式如下: +Now the marked data(e.g. image00000\_annotation\_file\_path)is as follows: ``` label1 xmin1 ymin1 xmax1 ymax1 @@ -199,7 +202,7 @@ label2 xmin2 ymin2 xmax2 ymax2 ... ``` -每行对应一个物体,共5个字段,第一个为label(注背景为0,需从1编号),剩余4个为坐标,对应的解析逻辑可更改为如下: +Here each row corresponds to an object for 5 fields. The first is for the label (note the background 0, need to be numbered from 1), and the remaining four are for the coordinates. ``` bbox_labels = [] @@ -217,9 +220,9 @@ with open(label_path) as flabel: bbox_labels.append(bbox_sample) ``` -另一个重要的事情就是根据图像大小及检测物体的大小等更改网络结构的配置,主要是仿照```config/vgg_config.py```创建自己的配置文件,参数设置经验请参照论文\[[1](#引用)\]。 +Another important thing is to change the size of the image and the size of the object to change the configuration of the network structure. Use ```config/vgg_config.py``` to create the custom configuration file. For more details, please refer to \[[1](#References)\]。 -## 引用 +## References 1. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. [SSD: Single shot multibox detector](https://arxiv.org/abs/1512.02325). European conference on computer vision. Springer, Cham, 2016. 2. Simonyan, Karen, and Andrew Zisserman. [Very deep convolutional networks for large-scale image recognition](https://arxiv.org/abs/1409.1556). arXiv preprint arXiv:1409.1556 (2014). 3. [The PASCAL Visual Object Classes Challenge 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html) diff --git a/ssd/index.html b/ssd/index.html index c31c2188..88a0216c 100644 --- a/ssd/index.html +++ b/ssd/index.html @@ -40,49 +40,50 @@