Commit 9f29754f authored by doordiey

Part 2 of issue twenty.

Parent: 944b84c9
# **Improving Real-Time Object Detection with YOLO**
Original article: [Improving Real-Time Object Detection with YOLO](https://blog.statsbot.co/real-time-object-detection-yolo-cd348527b9b7?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
## A new perspective for real-time object detection
In recent years, the field of object detection has seen tremendous progress, aided by the advent of deep learning. Object detection is the task of identifying objects in an image and drawing bounding boxes around them, i.e. localizing them. It's a very important problem in computer vision due to its numerous applications, from self-driving cars to security and tracking.
Prior approaches to object detection have generally proposed pipelines composed of separate sequential stages. This causes a disconnect between what each stage accomplishes and the final objective, which is drawing a tight bounding box around the objects in an image. An end-to-end framework that optimizes the detection error jointly would be a better solution, not just for training the model to better accuracy but also for improving detection speed.
This is where the You Only Look Once (or YOLO) approach comes into play. Varun Agrawal told the [Statsbot](https://statsbot.co/?utm_source=blog&utm_medium=article&utm_campaign=yolo) team why YOLO is the better option compared to other approaches in object detection.
![img](https://cdn-images-1.medium.com/max/2000/1*PSFl5og1c9HIKXlMIJV8-Q.png)
[Illustration source](https://arxiv.org/abs/1506.02640)
Deep learning has proven to be a powerful tool for image classification, achieving human-level capability on this task. Earlier detection approaches leveraged this power to transform the problem of object detection into one of classification: recognizing what category of object an image region belongs to.
The way this was done was via a 2-stage process:
1. The first stage generated tens of thousands of proposals: specific rectangular areas of the image, also known as bounding boxes, around what the system believed to be object-like things. A proposal might or might not surround an actual object in the image, and filtering these out was the objective of the second stage.
2. In the second stage, an image classifier classified the sub-image inside each bounding box proposal, deciding whether it was a particular object type or simply a non-object or background.
While immensely accurate, this 2-step process suffered from flaws such as inefficiency, due to the immense number of proposals being generated, and a lack of joint optimization over both proposal generation and classification. This left each stage unable to see the bigger picture, siloed in its own mini-problem, and thus limited in performance.
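To make the contrast with YOLO concrete, here is a minimal sketch of such a two-stage pipeline. The helper names `propose_regions` and `classify_patch` are hypothetical stand-ins for a proposal generator (e.g. selective search) and a CNN classifier, not any real library's API:

```python
def detect_two_stage(image, propose_regions, classify_patch, threshold=0.5):
    """Sketch of a pre-YOLO two-stage detector (hypothetical helpers).

    propose_regions(image) -> iterable of (x, y, w, h) candidate boxes.
    classify_patch(patch)  -> (label, score) for one cropped sub-image.
    """
    detections = []
    for (x, y, w, h) in propose_regions(image):   # stage 1: ~10^4 proposals
        patch = image[y:y + h, x:x + w]           # crop, assuming a NumPy-style image
        label, score = classify_patch(patch)      # stage 2: one classifier pass per crop
        if label != "background" and score > threshold:
            detections.append(((x, y, w, h), label, score))
    return detections  # note: the two stages never share a training objective
```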
### **What YOLO is all about**
This is where YOLO comes in. YOLO, which stands for You Only Look Once, is a deep learning based object detection algorithm [developed by Joseph Redmon and Ali Farhadi](https://arxiv.org/abs/1506.02640) at the University of Washington in 2016.
The rationale behind calling the system YOLO is that rather than passing in multiple sub-images of potential objects, you pass the whole image to the deep learning system just once. You then get all the bounding boxes as well as the object category classifications in one go. This is the fundamental design decision of YOLO and is what makes it a refreshing new perspective on the task of object detection.
The way YOLO works is that it subdivides the image into an N×N grid, or more specifically, a 7×7 grid in the original paper. Each grid cell, also known as an anchor, represents a classifier that is responsible for generating K bounding boxes around potential objects whose ground-truth center falls within that cell (K is 2 in the paper) and for classifying them as the correct object.
> *Note that the bounding box is not restricted to lie within the grid cell; it can expand within the boundaries of the image to accommodate the object it believes it is responsible for detecting. This means that in the current version of YOLO, the system generates 98 bounding boxes of varying sizes to accommodate the various objects in the scene.*
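In code, those numbers work out as follows. The sketch below decodes a raw 7×7×30 output tensor (S×S×(B·5 + C) with S = 7, B = 2, and C = 20 Pascal VOC classes, as in the paper) into the 98 candidate boxes; the exact memory layout here is one plausible choice for illustration, not the paper's reference implementation:

```python
import numpy as np

S, B, C = 7, 2, 20            # grid size, boxes per cell, Pascal VOC classes
assert S * S * B == 98        # the 98 boxes mentioned in the note above

def decode(output, conf_threshold=0.2):
    """Turn a raw S x S x (B*5 + C) YOLO output tensor into candidate boxes.

    Each cell stores B boxes as (x, y, w, h, confidence), with x, y relative
    to the cell and w, h relative to the whole image, followed by C class
    probabilities shared by the cell, as the paper describes.
    """
    output = output.reshape(S, S, B * 5 + C)
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                if conf < conf_threshold:
                    continue              # prune low-confidence boxes early
                cx = (col + x) / S        # box center in whole-image coordinates
                cy = (row + y) / S
                cls = int(np.argmax(class_probs))
                boxes.append((cx, cy, w, h, conf * class_probs[cls], cls))
    return boxes

# e.g. decode(np.random.rand(S, S, B * 5 + C)) yields up to 98 candidates
```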
### **Performance and Results**
For more dense object detection, a user could set K or N to a higher number based on their needs. However, with the current configuration, we have a system that is able to output a large number of bounding boxes around objects as well as classify them into one of various object categories, based on the spatial layout of the image.
This is done in a single pass through the image at inference time. Thus, the joint detection and classification leads to better optimization of the learning objective (the loss function) as well as real-time performance.
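Concretely, that single learning objective is the multi-part sum-squared loss from the original paper, where the indicator \(\mathbb{1}_{ij}^{\text{obj}}\) selects the predictor \(j\) in cell \(i\) responsible for an object, and the paper sets \(\lambda_{\text{coord}} = 5\) and \(\lambda_{\text{noobj}} = 0.5\):

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right] \\
  &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i-\hat{C}_i\right)^2
   + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
```

The square roots on width and height make an error on a small box count for more than the same absolute error on a large one.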
Indeed, the results of YOLO are very promising. On the challenging [Pascal VOC detection challenge dataset](http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham15.pdf), YOLO manages to achieve a mean average precision, or mAP, of 63.4 (out of 100) while running at 45 frames per second. In comparison, the state-of-the-art model, Faster R-CNN VGG 16, achieves an mAP of 73.2 but runs at only 7 frames per second at most, a roughly 6x decrease in speed.
You can see comparisons of YOLO to other detection frameworks in the table below.
![img](https://cdn-images-1.medium.com/max/1600/1*rZR8fU2sIz2DSIJqkBb4iA.png)
> *If one lets YOLO sacrifice some more accuracy, it can run at 155 frames per second, though only at an mAP of 52.7.*
Thus, the main selling point for YOLO is its promise of good detection performance at real-time speeds. That allows its use in systems such as robots, self-driving cars, and drones, where response time is critical.
### **YOLOv2 framework**
Recently, the same group of researchers released the new YOLOv2 framework, which leverages recent results in deep network design to build a more efficient network, and adopts the anchor-box idea from Faster R-CNN to ease the learning problem for the network.
![img](https://cdn-images-1.medium.com/max/1200/0*X3S2jCdO6bcgCdyc.)
[Illustration source](http://www.pjreddie.com/)
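As a rough illustration of that anchor-box idea, the sketch below applies the YOLOv2 box parameterization (center offsets squashed by a sigmoid, width and height scaling a prior) to one network output. The prior sizes and the normalization by grid size here are placeholder choices; YOLOv2 actually learns its priors by clustering the training boxes:

```python
import math

def decode_anchor(tx, ty, tw, th, col, row, prior_w, prior_h, grid_size=13):
    """YOLOv2-style decoding of one anchor in one grid cell.

    The network predicts offsets (tx, ty, tw, th) relative to the cell and
    to an anchor prior (prior_w, prior_h); the sigmoid keeps the predicted
    center inside its cell, which stabilizes early training.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (col + sigmoid(tx)) / grid_size   # center x, normalized to [0, 1]
    by = (row + sigmoid(ty)) / grid_size   # center y, normalized to [0, 1]
    bw = prior_w * math.exp(tw)            # width scales the prior up or down
    bh = prior_h * math.exp(th)            # height scales the prior up or down
    return bx, by, bw, bh

# placeholder prior of 0.2 x 0.3 of the image, not the paper's clustered values:
print(decode_anchor(0.0, 0.0, 0.1, -0.2, col=5, row=3, prior_w=0.2, prior_h=0.3))
```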
The result is an even better detection system, achieving state-of-the-art performance of 78.6 mAP on the Pascal VOC detection dataset, while other systems, such as the improved version of Faster R-CNN (Faster-RCNN ResNet) and [SSD500](https://arxiv.org/pdf/1512.02325.pdf), achieve only 76.4 mAP and 76.8 mAP on the same test dataset.
> *The key differentiator, though, is speed. The best-performing YOLOv2 model runs at 40 FPS, compared to 5 FPS for Faster-RCNN ResNet.*
Although SSD500 runs at 45 FPS, a lower-resolution version of YOLOv2 with an mAP of 76.8 (the same as SSD500) runs at 67 FPS, showing the high performance that YOLOv2's design choices make possible.
### Final thoughts
In conclusion, YOLO has demonstrated significant performance gains while running in real time, an important middle ground in the era of resource-hungry deep learning algorithms. As we march on towards a more automation-ready future, systems like YOLO and SSD500 are poised to usher in large strides of progress and enable the big AI dream.
### **Important reading from the article**
- [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640)
- [The PASCAL Visual Objects Challenge: A Retrospective](http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham15.pdf)
- [SSD: Single Shot Multibox Detector](https://arxiv.org/pdf/1512.02325.pdf)
| Title | Description |
| ------------------------------------------------------------ | ------------------------------ |
| [A Neural Network in 11 lines of Python (Part 1)](http://iamtrask.github.io/2015/07/12/basic-python-network/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com) | |
| [AlphaGo Zero and capability amplification](https://ai-alignment.com/alphago-zero-and-capability-amplification-ede767bb8446?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com) | An analysis of AlphaGo Zero. |
| [Improving Real-Time Object Detection with YOLO](https://blog.statsbot.co/real-time-object-detection-yolo-cd348527b9b7?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com) | An introduction to YOLO for object detection. |
# Improving Real-Time Object Detection with YOLO
Original article: [Improving Real-Time Object Detection with YOLO](https://blog.statsbot.co/real-time-object-detection-yolo-cd348527b9b7?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
## A new perspective on real-time object detection
In recent years, with the help of deep learning, the field of object detection has made tremendous progress. Object detection is the task of identifying the objects in an image and drawing boxes around them, that is, localizing them. It is a very important problem in computer vision because of its many applications, from self-driving cars to security and tracking.
Earlier approaches to object detection usually consisted of pipelines of separate sequential stages, which disconnects what each stage accomplishes from the final goal of drawing a tight bounding box around each object in the image. An end-to-end framework that jointly optimizes the detection error is a better solution: it not only trains the model for higher accuracy but also improves its detection speed.
This is where the You Only Look Once (YOLO) algorithm comes into play. Varun Agrawal told the [Statsbot](https://statsbot.co/?utm_source=blog&utm_medium=article&utm_campaign=yolo) team why YOLO is a better fit than other algorithms for the object detection problem.
![img](https://cdn-images-1.medium.com/max/2000/1*PSFl5og1c9HIKXlMIJV8-Q.png)
[Illustration source](https://arxiv.org/abs/1506.02640)
Deep learning has proven to be a powerful tool for image classification, reaching human-level capability on this task. Earlier detection methods leveraged this ability to turn object detection into a classification problem: recognizing which category of object an image region belongs to.
This was done via a 2-step process:
1. The first step generated tens of thousands of proposals. These are simply specific rectangular regions of the image, also known as bounding boxes, around what the system believes may be objects. A proposal may or may not surround an actual object in the image; filtering these out is the goal of the second step.
2. In the second step, an image classifier classifies the sub-image inside each bounding box proposal, judging whether it is a particular object type or merely a non-object or background.
Although very accurate, this 2-step process has flaws, such as inefficiency caused by the huge number of proposals generated, and the lack of joint optimization over proposal generation and classification. This leaves each stage unable to truly understand the bigger picture; each is trapped in its own small problem, limiting overall performance.
### **What YOLO is all about**
This is where YOLO comes in. YOLO, which stands for You Only Look Once, is a deep-learning-based object detection algorithm [developed by Joseph Redmon and Ali Farhadi](https://arxiv.org/abs/1506.02640) at the University of Washington in 2016.
The rationale behind the YOLO system is that, instead of passing in many sub-images of potential objects as before, the whole image is passed through the deep learning system only once. You then obtain all the bounding boxes and the object category classifications in one go. This is YOLO's fundamental design decision, and a brand-new perspective on the object detection task.
The way YOLO works is to subdivide the image into an N×N grid, or more specifically, the 7×7 grid of the original paper. Each grid cell, also called an anchor, represents a classifier responsible for generating K bounding boxes around potential objects whose ground-truth center falls inside that cell (K is 2 in the paper) and for classifying them as the correct object.
> *Note that a bounding box is not restricted to its grid cell; it can extend within the boundaries of the image to accommodate the object it believes it is responsible for detecting. This means that in the current version of YOLO, the system generates 98 bounding boxes of varying sizes to accommodate the various objects in the scene.*
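Many of those 98 candidates overlap the same object, so duplicates have to be pruned; the paper does this with non-maximum suppression (NMS). Below is a minimal, self-contained NMS sketch with boxes given as (x1, y1, x2, y2, score) tuples; real implementations are vectorized and typically run per class:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop others that overlap it too much."""
    detections = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    for det in detections:
        if all(iou(det[:4], k[:4]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```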
### Performance and Results
For denser object detection, a user can set K or N to a higher value according to their needs. With the current configuration, however, we already have a system that can output a large number of bounding boxes around objects and classify them into various categories based on the spatial layout of the image.
All of this happens in a single pass through the image at inference time. The joint detection and classification therefore allows better optimization of the learning objective (the loss function) as well as real-time performance.
Indeed, YOLO's results are very promising. On the challenging [Pascal VOC detection challenge dataset](http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham15.pdf), YOLO achieves a mean average precision, or mAP, of 63.4 (out of 100) while running at 45 frames per second. By comparison, the state-of-the-art model, Faster-RCNN VGG 16, achieves an mAP of 73.2 but runs at only 7 frames per second at most, a 6x drop in efficiency.
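For reference, mAP here is the Pascal VOC metric: a detection counts as correct when its IoU with a ground-truth box is at least 0.5, per-class average precision is the area under the precision-recall curve (using the 11-point interpolation of the VOC 2007 benchmark), and mAP averages that over all classes:

```latex
\text{AP} = \frac{1}{11} \sum_{r \in \{0,\, 0.1,\, \dots,\, 1\}} \max_{\tilde{r} \ge r} p(\tilde{r}),
\qquad
\text{mAP} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \text{AP}_c
```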
You can see how YOLO compares with other detection frameworks in the table below.
![img](https://cdn-images-1.medium.com/max/1600/1*rZR8fU2sIz2DSIJqkBb4iA.png)
> *If YOLO is allowed to sacrifice some more accuracy, it can run at 155 frames per second, though only at an mAP of 52.7.*
Thus, YOLO's main advantage is its consistently good object detection performance at real-time speeds. This allows it to be used in systems such as robots, self-driving cars, and drones, where response time is critical.
### The YOLOv2 framework
Recently, the same researchers released the new YOLOv2 framework, which leverages recent results in deep network design to build a more efficient network, and uses the anchor-box idea from Faster-RCNN to ease the network's learning problem.
![img](https://cdn-images-1.medium.com/max/1200/0*X3S2jCdO6bcgCdyc.)
[Illustration source](http://www.pjreddie.com/)
YOLOv2 achieves even better detection results, reaching state-of-the-art performance of 78.6 mAP on the Pascal VOC detection dataset, while other systems, such as the improved Faster-RCNN (Faster-RCNN ResNet) and [SSD500](https://arxiv.org/pdf/1512.02325.pdf), achieve only 76.4 mAP and 76.8 mAP on the same test data.
> *The key difference is speed: the best-performing YOLOv2 model runs at 40 FPS, compared with 5 FPS for Faster-RCNN ResNet.*
Although SSD500 runs at 45 FPS, a lower-resolution version of YOLOv2 with an mAP of 76.8 (the same as SSD500) runs at 67 FPS, which shows us the high performance made possible by YOLOv2's design choices.
### Final thoughts
To sum up, YOLO demonstrates significant performance gains while running in real time, an important middle ground in this era of resource-hungry deep learning algorithms. As we move toward a more automated future, systems like YOLO and SSD500 are poised to bring great strides of progress and help realize the big AI dream.
### Important reading from the article
- [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640)
- [The PASCAL Visual Objects Challenge: A Retrospective](http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham15.pdf)
- [SSD: Single Shot Multibox Detector](https://arxiv.org/pdf/1512.02325.pdf)