未验证 提交 5d8e2b19 编写于 作者: L littletomatodonkey 提交者: GitHub

Merge pull request #158 from cuicheng01/master

add english doc of modes
# DPN and DenseNet series
## Overview
DenseNet is a new network structure proposed in 2017 and was the best paper of CVPR. The network has designed a new cross-layer connected block called dense-block. Compared to the bottleneck in ResNet, dense-block has designed a more aggressive dense connection module, that is, connecting all the layers to each other, and each layer will accept all the layers in front of it as its additional input. DenseNet stacks all dense-blocks into a densely connected network. The dense connection makes DenseNet easier to backpropagate, making the network easier to train and converge. The full name of DPN is Dual Path Networks, which is a network composed of DenseNet and ResNeXt, which proves that DenseNet can extract new features from the previous level, and ResNeXt essentially reuses the extracted features . The author further analyzes and finds that ResNeXt has high reuse rate for features, but low redundancy, while DenseNet can create new features, but with high redundancy. Combining the advantages of the two structures, the author designed the DPN network. In the end, the DPN network achieved better results than ResNeXt and DenseNet under the same FLOPS and parameters.
The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
![](../../images/models/T4_benchmark/t4.fp32.bs4.DPN.flops.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.DPN.params.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.DPN.png)
![](../../images/models/T4_benchmark/t4.fp16.bs4.DPN.png)
The pretrained models of these two types of models (a total of 10) are open sourced in PaddleClas at present. The indicators are shown in the figure above. It is easy to observe that under the same FLOPS and parameters, DPN has higher accuracy than DenseNet. However,because DPN has more branches, its inference speed is slower than DenseNet. Since DenseNet264 has the deepest layers in all DenseNet networks, it has the largest parameters,DenseNet161 has the largest width, resulting the largest FLOPs and the highest accuracy in this series. From the perspective of inference speed, DenseNet161, which has a large FLOPs and high accuracy, has a faster speed than DenseNet264, so it has a greater advantage than DenseNet264.
For DPN series networks, the larger the model's FLOPs and parameters, the higher the model's accuracy. Among them, since the width of DPN107 is the largest, it has the largest number of parameters and FLOPs in this series of networks.
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| DenseNet121 | 0.757 | 0.926 | 0.750 | | 5.690 | 7.980 |
| DenseNet161 | 0.786 | 0.941 | 0.778 | | 15.490 | 28.680 |
| DenseNet169 | 0.768 | 0.933 | 0.764 | | 6.740 | 14.150 |
| DenseNet201 | 0.776 | 0.937 | 0.775 | | 8.610 | 20.010 |
| DenseNet264 | 0.780 | 0.939 | 0.779 | | 11.540 | 33.370 |
| DPN68 | 0.768 | 0.934 | 0.764 | 0.931 | 4.030 | 10.780 |
| DPN92 | 0.799 | 0.948 | 0.793 | 0.946 | 12.540 | 36.290 |
| DPN98 | 0.806 | 0.951 | 0.799 | 0.949 | 22.220 | 58.460 |
| DPN107 | 0.809 | 0.953 | 0.802 | 0.951 | 35.060 | 82.970 |
| DPN131 | 0.807 | 0.951 | 0.801 | 0.949 | 30.510 | 75.360 |
## Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|-------------|-----------|-------------------|--------------------------|
| DenseNet121 | 224 | 256 | 4.371 |
| DenseNet161 | 224 | 256 | 8.863 |
| DenseNet169 | 224 | 256 | 6.391 |
| DenseNet201 | 224 | 256 | 8.173 |
| DenseNet264 | 224 | 256 | 11.942 |
| DPN68 | 224 | 256 | 11.805 |
| DPN92 | 224 | 256 | 17.840 |
| DPN98 | 224 | 256 | 21.057 |
| DPN107 | 224 | 256 | 28.685 |
| DPN131 | 224 | 256 | 28.083 |
## Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
| DenseNet121 | 224 | 256 | 4.16436 | 7.2126 | 10.50221 | 4.40447 | 9.32623 | 15.25175 |
| DenseNet161 | 224 | 256 | 9.27249 | 14.25326 | 20.19849 | 10.39152 | 22.15555 | 35.78443 |
| DenseNet169 | 224 | 256 | 6.11395 | 10.28747 | 13.68717 | 6.43598 | 12.98832 | 20.41964 |
| DenseNet201 | 224 | 256 | 7.9617 | 13.4171 | 17.41949 | 8.20652 | 17.45838 | 27.06309 |
| DenseNet264 | 224 | 256 | 11.70074 | 19.69375 | 24.79545 | 12.14722 | 26.27707 | 40.01905 |
| DPN68 | 224 | 256 | 11.7827 | 13.12652 | 16.19213 | 11.64915 | 12.82807 | 18.57113 |
| DPN92 | 224 | 256 | 18.56026 | 20.35983 | 29.89544 | 18.15746 | 23.87545 | 38.68821 |
| DPN98 | 224 | 256 | 21.70508 | 24.7755 | 40.93595 | 21.18196 | 33.23925 | 62.77751 |
| DPN107 | 224 | 256 | 27.84462 | 34.83217 | 60.67903 | 27.62046 | 52.65353 | 100.11721 |
| DPN131 | 224 | 256 | 28.58941 | 33.01078 | 55.65146 | 28.33119 | 46.19439 | 89.24904 |
# EfficientNet and ResNeXt101_wsl series
## Overview
EfficientNet is a lightweight NAS-based network released by Google in 2019. EfficientNetB7 refreshed the classification accuracy of ImageNet-1k at that time. In this paper, the author points out that the traditional methods to improve the performance of neural networks mainly start with the width of the network, the depth of the network, and the resolution of the input picture.
However, the author found that balancing these three dimensions is essential for improving accuracy and efficiency through experiments.
Therefore, the author summarized how to balance the three dimensions at the same time through a series of experiments.
At the same time, based on this scaling method, the author built a total of 7 networks B1-B7 in the EfficientNet series on the basis of EfficientNetB0, and with the same FLOPS and parameters, the accuracy reached state-of-the-art effect.
ResNeXt is an improved version of ResNet that proposed by Facebook in 2016. In 2019, Facebook researchers studied the accuracy limit of the series network on ImageNet through weakly-supervised-learning. In order to distinguish the previous ResNeXt network, the suffix of this series network is WSL, where WSL is the abbreviation of weakly-supervised-learning. In order to have stronger feature extraction capability, the researchers further enlarged the network width, among which the largest ResNeXt101_32x48d_wsl has 800 million parameters. It was trained under 940 million weak-labeled images, and the results were finetune trained on imagenet-1k. Finally, the acc-1 of imagenet-1k reaches 85.4%, which is also the network with the highest precision under the resolution of 224x224 on imagenet-1k so far. In Fix-ResNeXt, the author used a larger image resolution, made a special Fix strategy for the inconsistency of image data preprocessing in training and testing, and made ResNeXt101_32x48d_wsl have a higher accuracy. Since it used the Fix strategy, it was named Fix-ResNeXt101_32x48d_wsl.
The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
![](../../images/models/T4_benchmark/t4.fp32.bs4.EfficientNet.flops.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.EfficientNet.params.png)
![](../../images/models/T4_benchmark/t4.fp32.bs1.EfficientNet.png)
![](../../images/models/T4_benchmark/t4.fp16.bs1.EfficientNet.png)
At present, there are a total of 14 pretrained models of the two types of models that PaddleClas open source. It can be seen from the above figure that the advantages of the EfficientNet series network are very obvious. The ResNeXt101_wsl series model uses more data, and the final accuracy is also higher. EfficientNet_B0_small removes SE_block based on EfficientNet_B0, which has faster inference speed.
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| ResNeXt101_<br>32x8d_wsl | 0.826 | 0.967 | 0.822 | 0.964 | 29.140 | 78.440 |
| ResNeXt101_<br>32x16d_wsl | 0.842 | 0.973 | 0.842 | 0.972 | 57.550 | 152.660 |
| ResNeXt101_<br>32x32d_wsl | 0.850 | 0.976 | 0.851 | 0.975 | 115.170 | 303.110 |
| ResNeXt101_<br>32x48d_wsl | 0.854 | 0.977 | 0.854 | 0.976 | 173.580 | 456.200 |
| Fix_ResNeXt101_<br>32x48d_wsl | 0.863 | 0.980 | 0.864 | 0.980 | 354.230 | 456.200 |
| EfficientNetB0 | 0.774 | 0.933 | 0.773 | 0.935 | 0.720 | 5.100 |
| EfficientNetB1 | 0.792 | 0.944 | 0.792 | 0.945 | 1.270 | 7.520 |
| EfficientNetB2 | 0.799 | 0.947 | 0.803 | 0.950 | 1.850 | 8.810 |
| EfficientNetB3 | 0.812 | 0.954 | 0.817 | 0.956 | 3.430 | 11.840 |
| EfficientNetB4 | 0.829 | 0.962 | 0.830 | 0.963 | 8.290 | 18.760 |
| EfficientNetB5 | 0.836 | 0.967 | 0.837 | 0.967 | 19.510 | 29.610 |
| EfficientNetB6 | 0.840 | 0.969 | 0.842 | 0.968 | 36.270 | 42.000 |
| EfficientNetB7 | 0.843 | 0.969 | 0.844 | 0.971 | 72.350 | 64.920 |
| EfficientNetB0_<br>small | 0.758 | 0.926 | | | 0.720 | 4.650 |
## Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|-------------------------------|-----------|-------------------|--------------------------|
| ResNeXt101_<br>32x8d_wsl | 224 | 256 | 19.127 |
| ResNeXt101_<br>32x16d_wsl | 224 | 256 | 23.629 |
| ResNeXt101_<br>32x32d_wsl | 224 | 256 | 40.214 |
| ResNeXt101_<br>32x48d_wsl | 224 | 256 | 59.714 |
| Fix_ResNeXt101_<br>32x48d_wsl | 320 | 320 | 82.431 |
| EfficientNetB0 | 224 | 256 | 2.449 |
| EfficientNetB1 | 240 | 272 | 3.547 |
| EfficientNetB2 | 260 | 292 | 3.908 |
| EfficientNetB3 | 300 | 332 | 5.145 |
| EfficientNetB4 | 380 | 412 | 7.609 |
| EfficientNetB5 | 456 | 488 | 12.078 |
| EfficientNetB6 | 528 | 560 | 18.381 |
| EfficientNetB7 | 600 | 632 | 27.817 |
| EfficientNetB0_<br>small | 224 | 256 | 1.692 |
## Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|---------------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
| ResNeXt101_<br>32x8d_wsl | 224 | 256 | 18.19374 | 21.93529 | 34.67802 | 18.52528 | 34.25319 | 67.2283 |
| ResNeXt101_<br>32x16d_wsl | 224 | 256 | 18.52609 | 36.8288 | 62.79947 | 25.60395 | 71.88384 | 137.62327 |
| ResNeXt101_<br>32x32d_wsl | 224 | 256 | 33.51391 | 70.09682 | 125.81884 | 54.87396 | 160.04337 | 316.17718 |
| ResNeXt101_<br>32x48d_wsl | 224 | 256 | 50.97681 | 137.60926 | 190.82628 | 99.01698256 | 315.91261 | 551.83695 |
| Fix_ResNeXt101_<br>32x48d_wsl | 320 | 320 | 78.62869 | 191.76039 | 317.15436 | 160.0838242 | 595.99296 | 1151.47384 |
| EfficientNetB0 | 224 | 256 | 3.40122 | 5.95851 | 9.10801 | 3.442 | 6.11476 | 9.3304 |
| EfficientNetB1 | 240 | 272 | 5.25172 | 9.10233 | 14.11319 | 5.3322 | 9.41795 | 14.60388 |
| EfficientNetB2 | 260 | 292 | 5.91052 | 10.5898 | 17.38106 | 6.29351 | 10.95702 | 17.75308 |
| EfficientNetB3 | 300 | 332 | 7.69582 | 16.02548 | 27.4447 | 7.67749 | 16.53288 | 28.5939 |
| EfficientNetB4 | 380 | 412 | 11.55585 | 29.44261 | 53.97363 | 12.15894 | 30.94567 | 57.38511 |
| EfficientNetB5 | 456 | 488 | 19.63083 | 56.52299 | - | 20.48571 | 61.60252 | - |
| EfficientNetB6 | 528 | 560 | 30.05911 | - | - | 32.62402 | - | - |
| EfficientNetB7 | 600 | 632 | 47.86087 | - | - | 53.93823 | - | - |
| EfficientNetB0_small | 224 | 256 | 2.39166 | 4.36748 | 6.96002 | 2.3076 | 4.71886 | 7.21888 |
# HRNet series
## Overview
HRNet is a brand new neural network proposed by Microsoft research Asia in 2019. Different from the previous convolutional neural network, this network can still maintain high resolution in the deep layer of the network, so the heat map of the key points predicted is more accurate, and it is also more accurate in space. In addition, the network performs particularly well in other visual tasks sensitive to resolution, such as detection and segmentation.
The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
![](../../images/models/T4_benchmark/t4.fp32.bs4.HRNet.flops.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.HRNet.params.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.HRNet.png)
![](../../images/models/T4_benchmark/t4.fp16.bs4.HRNet.png)
At present, there are 7 pretrained models of such models open-sourced by PaddleClas, and their indicators are shown in the figure. Among them, the reason why the accuracy of the HRNet_W48_C indicator is abnormal may be due to fluctuations in training.
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| HRNet_W18_C | 0.769 | 0.934 | 0.768 | 0.934 | 4.140 | 21.290 |
| HRNet_W30_C | 0.780 | 0.940 | 0.782 | 0.942 | 16.230 | 37.710 |
| HRNet_W32_C | 0.783 | 0.942 | 0.785 | 0.942 | 17.860 | 41.230 |
| HRNet_W40_C | 0.788 | 0.945 | 0.789 | 0.945 | 25.410 | 57.550 |
| HRNet_W44_C | 0.790 | 0.945 | 0.789 | 0.944 | 29.790 | 67.060 |
| HRNet_W48_C | 0.790 | 0.944 | 0.793 | 0.945 | 34.580 | 77.470 |
| HRNet_W64_C | 0.793 | 0.946 | 0.795 | 0.946 | 57.830 | 128.060 |
## Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|-------------|-----------|-------------------|--------------------------|
| HRNet_W18_C | 224 | 256 | 7.368 |
| HRNet_W30_C | 224 | 256 | 9.402 |
| HRNet_W32_C | 224 | 256 | 9.467 |
| HRNet_W40_C | 224 | 256 | 10.739 |
| HRNet_W44_C | 224 | 256 | 11.497 |
| HRNet_W48_C | 224 | 256 | 12.165 |
| HRNet_W64_C | 224 | 256 | 15.003 |
## Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
| HRNet_W18_C | 224 | 256 | 6.79093 | 11.50986 | 17.67244 | 7.40636 | 13.29752 | 23.33445 |
| HRNet_W30_C | 224 | 256 | 8.98077 | 14.08082 | 21.23527 | 9.57594 | 17.35485 | 32.6933 |
| HRNet_W32_C | 224 | 256 | 8.82415 | 14.21462 | 21.19804 | 9.49807 | 17.72921 | 32.96305 |
| HRNet_W40_C | 224 | 256 | 11.4229 | 19.1595 | 30.47984 | 12.12202 | 25.68184 | 48.90623 |
| HRNet_W44_C | 224 | 256 | 12.25778 | 22.75456 | 32.61275 | 13.19858 | 32.25202 | 59.09871 |
| HRNet_W48_C | 224 | 256 | 12.65015 | 23.12886 | 33.37859 | 13.70761 | 34.43572 | 63.01219 |
| HRNet_W64_C | 224 | 256 | 15.10428 | 27.68901 | 40.4198 | 17.57527 | 47.9533 | 97.11228 |
# Inception series
## Overview
GoogLeNet is a new neural network structure designed by Google in 2014, which, together with VGG network, became the twin champions of the ImageNet challenge that year. GoogLeNet introduces the Inception structure for the first time, and stacks the Inception structure in the network so that the number of network layers reaches 22, which is also the mark of the convolutional network exceeding 20 layers for the first time. Since 1x1 convolution is used in the Inception structure to reduce the dimension of channel number, and Global pooling is used to replace the traditional method of processing features in multiple fc layers, the final GoogLeNet network has much less FLOPS and parameters than VGG network, which has become a beautiful scenery of neural network design at that time.
Xception is another improvement to InceptionV3 that Google proposed after Inception. In Xception, the author used the depthwise separable convolution to replace the traditional convolution operation, which greatly saved the network FLOPS and the number of parameters, but improved the accuracy. In DeeplabV3+, the author further improved the Xception and increased the number of Xception layers, and designed the network of Xception65 and Xception71.
InceptionV4 is a new neural network designed by Google in 2016, when residual structure were all the rage, but the authors believe that high performance can be achieved using only Inception structure. InceptionV4 uses more Inception structure to achieve even greater precision on Imagenet-1k.
The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
![](../../images/models/T4_benchmark/t4.fp32.bs4.Inception.flops.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.Inception.params.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.Inception.png)
![](../../images/models/T4_benchmark/t4.fp16.bs4.Inception.png)
The figure above reflects the relationship between the accuracy of Xception series and InceptionV4 and other indicators. Among them, Xception_deeplab is consistent with the structure of the paper, and Xception is an improved model developed by PaddleClas, which improves the accuracy by about 0.6% when the inference speed is basically unchanged. Details of the improved model are being updated, so stay tuned.
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| GoogLeNet | 0.707 | 0.897 | 0.698 | | 2.880 | 8.460 |
| Xception41 | 0.793 | 0.945 | 0.790 | 0.945 | 16.740 | 22.690 |
| Xception41<br>_deeplab | 0.796 | 0.944 | | | 18.160 | 26.730 |
| Xception65 | 0.810 | 0.955 | | | 25.950 | 35.480 |
| Xception65<br>_deeplab | 0.803 | 0.945 | | | 27.370 | 39.520 |
| Xception71 | 0.811 | 0.955 | | | 31.770 | 37.280 |
| InceptionV4 | 0.808 | 0.953 | 0.800 | 0.950 | 24.570 | 42.680 |
## Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|------------------------|-----------|-------------------|--------------------------|
| GoogLeNet | 224 | 256 | 1.807 |
| Xception41 | 299 | 320 | 3.972 |
| Xception41_<br>deeplab | 299 | 320 | 4.408 |
| Xception65 | 299 | 320 | 6.174 |
| Xception65_<br>deeplab | 299 | 320 | 6.464 |
| Xception71 | 299 | 320 | 6.782 |
| InceptionV4 | 299 | 320 | 11.141 |
## Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|--------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
| GoogLeNet | 299 | 320 | 1.75451 | 3.39931 | 4.71909 | 1.88038 | 4.48882 | 6.94035 |
| Xception41 | 299 | 320 | 2.91192 | 7.86878 | 15.53685 | 4.96939 | 17.01361 | 32.67831 |
| Xception41_<br>deeplab | 299 | 320 | 2.85934 | 7.2075 | 14.01406 | 5.33541 | 17.55938 | 33.76232 |
| Xception65 | 299 | 320 | 4.30126 | 11.58371 | 23.22213 | 7.26158 | 25.88778 | 53.45426 |
| Xception65_<br>deeplab | 299 | 320 | 4.06803 | 9.72694 | 19.477 | 7.60208 | 26.03699 | 54.74724 |
| Xception71 | 299 | 320 | 4.80889 | 13.5624 | 27.18822 | 8.72457 | 31.55549 | 69.31018 |
| InceptionV4 | 299 | 320 | 9.50821 | 13.72104 | 20.27447 | 12.99342 | 25.23416 | 43.56121 |
# Mobile and Embedded Vision Applications Network series
## Overview
MobileNetV1 is a network launched by Google in 2017 for use on mobile devices or embedded devices. The network replaces the depthwise separable convolution with the traditional convolution operation, that is, the combination of depthwise convolution and pointwise convolution. Compared with the traditional convolution operation, this combination can greatly save the number of parameters and computation. At the same time, MobileNetV1 can also be used for object detection, image segmentation and other visual tasks.
MobileNetV2 is a lightweight network proposed by Google following MobileNetV1. Compared with MobileNetV1, MobileNetV2 proposed Linear bottlenecks and Inverted residual block as a basic network structures, to constitute MobileNetV2 network architecture through stacking these basic module a lot. In the end, higher classification accuracy was achieved when FLOPS was only half of MobileNetV1.
The ShuffleNet series network is the lightweight network structure proposed by MEGVII. So far, there are two typical structures in this series network, namely, ShuffleNetV1 and ShuffleNetV2. A Channel Shuffle operation in ShuffleNet can exchange information between groups and perform end-to-end training. In the paper of ShuffleNetV2, the author proposes four criteria for designing lightweight networks, and designs the ShuffleNetV2 network according to the four criteria and the shortcomings of ShuffleNetV1.
MobileNetV3 is a new and lightweight network based on NAS proposed by Google in 2019. In order to further improve the effect, the activation functions of relu and sigmoid were replaced with hard_swish and hard_sigmoid activation functions, and some improved strategies were introduced to reduce the amount of network computing.
![](../../images/models/mobile_arm_top1.png)
![](../../images/models/mobile_arm_storage.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.mobile_trt.flops.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.mobile_trt.params.png)
Currently there are 32 pretrained models of the mobile series open source by PaddleClas, and their indicators are shown in the figure below. As you can see from the picture, newer lightweight models tend to perform better, and MobileNetV3 represents the latest lightweight neural network architecture. In MobileNetV3, the author used 1x1 convolution after global-avg-pooling in order to obtain higher accuracy,this operation significantly increases the number of parameters but has little impact on the amount of computation, so if the model is evaluated from a storage perspective of excellence, MobileNetV3 does not have much advantage, but because of its smaller computation, it has a faster inference speed. In addition, the SSLD distillation model in our model library performs excellently, refreshing the accuracy of the current lightweight model from various perspectives. Due to the complex structure and many branches of the MobileNetV3 model, which is not GPU friendly, the GPU inference speed is not as good as that of MobileNetV1.
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| MobileNetV1_x0_25 | 0.514 | 0.755 | 0.506 | | 0.070 | 0.460 |
| MobileNetV1_x0_5 | 0.635 | 0.847 | 0.637 | | 0.280 | 1.310 |
| MobileNetV1_x0_75 | 0.688 | 0.882 | 0.684 | | 0.630 | 2.550 |
| MobileNetV1 | 0.710 | 0.897 | 0.706 | | 1.110 | 4.190 |
| MobileNetV1_ssld | 0.779 | 0.939 | | | 1.110 | 4.190 |
| MobileNetV2_x0_25 | 0.532 | 0.765 | | | 0.050 | 1.500 |
| MobileNetV2_x0_5 | 0.650 | 0.857 | 0.654 | 0.864 | 0.170 | 1.930 |
| MobileNetV2_x0_75 | 0.698 | 0.890 | 0.698 | 0.896 | 0.350 | 2.580 |
| MobileNetV2 | 0.722 | 0.907 | 0.718 | 0.910 | 0.600 | 3.440 |
| MobileNetV2_x1_5 | 0.741 | 0.917 | | | 1.320 | 6.760 |
| MobileNetV2_x2_0 | 0.752 | 0.926 | | | 2.320 | 11.130 |
| MobileNetV2_ssld | 0.7674 | 0.9339 | | | 0.600 | 3.440 |
| MobileNetV3_large_<br>x1_25 | 0.764 | 0.930 | 0.766 | | 0.714 | 7.440 |
| MobileNetV3_large_<br>x1_0 | 0.753 | 0.923 | 0.752 | | 0.450 | 5.470 |
| MobileNetV3_large_<br>x0_75 | 0.731 | 0.911 | 0.733 | | 0.296 | 3.910 |
| MobileNetV3_large_<br>x0_5 | 0.692 | 0.885 | 0.688 | | 0.138 | 2.670 |
| MobileNetV3_large_<br>x0_35 | 0.643 | 0.855 | 0.642 | | 0.077 | 2.100 |
| MobileNetV3_small_<br>x1_25 | 0.707 | 0.895 | 0.704 | | 0.195 | 3.620 |
| MobileNetV3_small_<br>x1_0 | 0.682 | 0.881 | 0.675 | | 0.123 | 2.940 |
| MobileNetV3_small_<br>x0_75 | 0.660 | 0.863 | 0.654 | | 0.088 | 2.370 |
| MobileNetV3_small_<br>x0_5 | 0.592 | 0.815 | 0.580 | | 0.043 | 1.900 |
| MobileNetV3_small_<br>x0_35 | 0.530 | 0.764 | 0.498 | | 0.026 | 1.660 |
| MobileNetV3_large_<br>x1_0_ssld | 0.790 | 0.945 | | | 0.450 | 5.470 |
| MobileNetV3_large_<br>x1_0_ssld_int8 | 0.761 | | | | | |
| MobileNetV3_small_<br>x1_0_ssld | 0.713 | 0.901 | | | 0.123 | 2.940 |
| ShuffleNetV2 | 0.688 | 0.885 | 0.694 | | 0.280 | 2.260 |
| ShuffleNetV2_x0_25 | 0.499 | 0.738 | | | 0.030 | 0.600 |
| ShuffleNetV2_x0_33 | 0.537 | 0.771 | | | 0.040 | 0.640 |
| ShuffleNetV2_x0_5 | 0.603 | 0.823 | 0.603 | | 0.080 | 1.360 |
| ShuffleNetV2_x1_5 | 0.716 | 0.902 | 0.726 | | 0.580 | 3.470 |
| ShuffleNetV2_x2_0 | 0.732 | 0.912 | 0.749 | | 1.120 | 7.320 |
| ShuffleNetV2_swish | 0.700 | 0.892 | | | 0.290 | 2.260 |
## Inference speed and storage size based on SD855
| Models | Batch Size=1(ms) | Storage Size(M) |
|:--:|:--:|:--:|
| MobileNetV1_x0_25 | 3.220 | 1.900 |
| MobileNetV1_x0_5 | 9.580 | 5.200 |
| MobileNetV1_x0_75 | 19.436 | 10.000 |
| MobileNetV1 | 32.523 | 16.000 |
| MobileNetV1_ssld | 32.523 | 16.000 |
| MobileNetV2_x0_25 | 3.799 | 6.100 |
| MobileNetV2_x0_5 | 8.702 | 7.800 |
| MobileNetV2_x0_75 | 15.531 | 10.000 |
| MobileNetV2 | 23.318 | 14.000 |
| MobileNetV2_x1_5 | 45.624 | 26.000 |
| MobileNetV2_x2_0 | 74.292 | 43.000 |
| MobileNetV2_ssld | 23.318 | 14.000 |
| MobileNetV3_large_x1_25 | 28.218 | 29.000 |
| MobileNetV3_large_x1_0 | 19.308 | 21.000 |
| MobileNetV3_large_x0_75 | 13.565 | 16.000 |
| MobileNetV3_large_x0_5 | 7.493 | 11.000 |
| MobileNetV3_large_x0_35 | 5.137 | 8.600 |
| MobileNetV3_small_x1_25 | 9.275 | 14.000 |
| MobileNetV3_small_x1_0 | 6.546 | 12.000 |
| MobileNetV3_small_x0_75 | 5.284 | 9.600 |
| MobileNetV3_small_x0_5 | 3.352 | 7.800 |
| MobileNetV3_small_x0_35 | 2.635 | 6.900 |
| MobileNetV3_large_x1_0_ssld | 19.308 | 21.000 |
| MobileNetV3_large_x1_0_ssld_int8 | 14.395 | 10.000 |
| MobileNetV3_small_x1_0_ssld | 6.546 | 12.000 |
| ShuffleNetV2 | 10.941 | 9.000 |
| ShuffleNetV2_x0_25 | 2.329 | 2.700 |
| ShuffleNetV2_x0_33 | 2.643 | 2.800 |
| ShuffleNetV2_x0_5 | 4.261 | 5.600 |
| ShuffleNetV2_x1_5 | 19.352 | 14.000 |
| ShuffleNetV2_x2_0 | 34.770 | 28.000 |
| ShuffleNetV2_swish | 16.023 | 9.100 |
## Inference speed based on T4 GPU
| Models | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-----------------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
| MobileNetV1_x0_25 | 0.68422 | 1.13021 | 1.72095 | 0.67274 | 1.226 | 1.84096 |
| MobileNetV1_x0_5 | 0.69326 | 1.09027 | 1.84746 | 0.69947 | 1.43045 | 2.39353 |
| MobileNetV1_x0_75 | 0.6793 | 1.29524 | 2.15495 | 0.79844 | 1.86205 | 3.064 |
| MobileNetV1 | 0.71942 | 1.45018 | 2.47953 | 0.91164 | 2.26871 | 3.90797 |
| MobileNetV1_ssld | 0.71942 | 1.45018 | 2.47953 | 0.91164 | 2.26871 | 3.90797 |
| MobileNetV2_x0_25 | 2.85399 | 3.62405 | 4.29952 | 2.81989 | 3.52695 | 4.2432 |
| MobileNetV2_x0_5 | 2.84258 | 3.1511 | 4.10267 | 2.80264 | 3.65284 | 4.31737 |
| MobileNetV2_x0_75 | 2.82183 | 3.27622 | 4.98161 | 2.86538 | 3.55198 | 5.10678 |
| MobileNetV2 | 2.78603 | 3.71982 | 6.27879 | 2.62398 | 3.54429 | 6.41178 |
| MobileNetV2_x1_5 | 2.81852 | 4.87434 | 8.97934 | 2.79398 | 5.30149 | 9.30899 |
| MobileNetV2_x2_0 | 3.65197 | 6.32329 | 11.644 | 3.29788 | 7.08644 | 12.45375 |
| MobileNetV2_ssld | 2.78603 | 3.71982 | 6.27879 | 2.62398 | 3.54429 | 6.41178 |
| MobileNetV3_large_x1_25 | 2.34387 | 3.16103 | 4.79742 | 2.35117 | 3.44903 | 5.45658 |
| MobileNetV3_large_x1_0 | 2.20149 | 3.08423 | 4.07779 | 2.04296 | 2.9322 | 4.53184 |
| MobileNetV3_large_x0_75 | 2.1058 | 2.61426 | 3.61021 | 2.0006 | 2.56987 | 3.78005 |
| MobileNetV3_large_x0_5 | 2.06934 | 2.77341 | 3.35313 | 2.11199 | 2.88172 | 3.19029 |
| MobileNetV3_large_x0_35 | 2.14965 | 2.7868 | 3.36145 | 1.9041 | 2.62951 | 3.26036 |
| MobileNetV3_small_x1_25 | 2.06817 | 2.90193 | 3.5245 | 2.02916 | 2.91866 | 3.34528 |
| MobileNetV3_small_x1_0 | 1.73933 | 2.59478 | 3.40276 | 1.74527 | 2.63565 | 3.28124 |
| MobileNetV3_small_x0_75 | 1.80617 | 2.64646 | 3.24513 | 1.93697 | 2.64285 | 3.32797 |
| MobileNetV3_small_x0_5 | 1.95001 | 2.74014 | 3.39485 | 1.88406 | 2.99601 | 3.3908 |
| MobileNetV3_small_x0_35 | 2.10683 | 2.94267 | 3.44254 | 1.94427 | 2.94116 | 3.41082 |
| MobileNetV3_large_x1_0_ssld | 2.20149 | 3.08423 | 4.07779 | 2.04296 | 2.9322 | 4.53184 |
| MobileNetV3_small_x1_0_ssld | 1.73933 | 2.59478 | 3.40276 | 1.74527 | 2.63565 | 3.28124 |
| ShuffleNetV2 | 1.95064 | 2.15928 | 2.97169 | 1.89436 | 2.26339 | 3.17615 |
| ShuffleNetV2_x0_25 | 1.43242 | 2.38172 | 2.96768 | 1.48698 | 2.29085 | 2.90284 |
| ShuffleNetV2_x0_33 | 1.69008 | 2.65706 | 2.97373 | 1.75526 | 2.85557 | 3.09688 |
| ShuffleNetV2_x0_5 | 1.48073 | 2.28174 | 2.85436 | 1.59055 | 2.18708 | 3.09141 |
| ShuffleNetV2_x1_5 | 1.51054 | 2.4565 | 3.41738 | 1.45389 | 2.5203 | 3.99872 |
| ShuffleNetV2_x2_0 | 1.95616 | 2.44751 | 4.19173 | 2.15654 | 3.18247 | 5.46893 |
| ShuffleNetV2_swish | 2.50213 | 2.92881 | 3.474 | 2.5129 | 2.97422 | 3.69357 |
# Other networks
## Overview
In 2012, AlexNet network proposed by Alex et al. won the ImageNet competition by far surpassing the second place, and the convolutional neural network and even deep learning attracted wide attention. AlexNet used relu as the activation function of CNN to solve the gradient dispersion problem of sigmoid when the network is deep. During the training, Dropout was used to randomly lose a part of the neurons, avoiding the overfitting of the model. In the network, overlapping maximum pooling is used to replace the average pooling commonly used in CNN, which avoids the fuzzy effect of average pooling and improves the feature richness. In a sense, AlexNet has exploded the research and application of neural networks.
SqueezeNet achieved the same precision as AlexNet on Imagenet-1k, but only with 1/50 parameters. The core of the network is the Fire module, which used the convolution of 1x1 to achieve channel dimensionality reduction, thus greatly saving the number of parameters. The author created SqueezeNet by stacking a large number of Fire modules.
VGG is a convolutional neural network developed by researchers at Oxford University's Visual Geometry Group and DeepMind. The network explores the relationship between the depth of the convolutional neural network and its performance. By repeatedly stacking the small convolutional kernel of 3x3 and the maximum pooling layer of 2x2, the multi-layer convolutional neural network is successfully constructed and has achieved good convergence accuracy. In the end, VGG won the runner-up of ILSVRC 2014 classification and the champion of positioning.
DarkNet53 is designed for object detection by YOLO author in the paper. The network is basically composed of 1x1 and 3x3 kernel, with a total of 53 layers, named DarkNet53.
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| AlexNet | 0.567 | 0.792 | 0.5720 | | 1.370 | 61.090 |
| SqueezeNet1_0 | 0.596 | 0.817 | 0.575 | | 1.550 | 1.240 |
| SqueezeNet1_1 | 0.601 | 0.819 | | | 0.690 | 1.230 |
| VGG11 | 0.693 | 0.891 | | | 15.090 | 132.850 |
| VGG13 | 0.700 | 0.894 | | | 22.480 | 133.030 |
| VGG16 | 0.720 | 0.907 | 0.715 | 0.901 | 30.810 | 138.340 |
| VGG19 | 0.726 | 0.909 | | | 39.130 | 143.650 |
| DarkNet53 | 0.780 | 0.941 | 0.772 | 0.938 | 18.580 | 41.600 |
| ResNet50_ACNet | 0.767 | 0.932 | | | 10.730 | 33.110 |
| ResNet50_ACNet<br>_deploy | 0.767 | 0.932 | | | 8.190 | 25.550 |
## Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|---------------------------|-----------|-------------------|----------------------|
| AlexNet | 224 | 256 | 1.176 |
| SqueezeNet1_0 | 224 | 256 | 0.860 |
| SqueezeNet1_1 | 224 | 256 | 0.763 |
| VGG11 | 224 | 256 | 1.867 |
| VGG13 | 224 | 256 | 2.148 |
| VGG16 | 224 | 256 | 2.616 |
| VGG19 | 224 | 256 | 3.076 |
| DarkNet53 | 256 | 256 | 3.139 |
| ResNet50_ACNet<br>_deploy | 224 | 256 | 5.626 |
## Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-----------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
| AlexNet | 224 | 256 | 1.06447 | 1.70435 | 2.38402 | 1.44993 | 2.46696 | 3.72085 |
| SqueezeNet1_0 | 224 | 256 | 0.97162 | 2.06719 | 3.67499 | 0.96736 | 2.53221 | 4.54047 |
| SqueezeNet1_1 | 224 | 256 | 0.81378 | 1.62919 | 2.68044 | 0.76032 | 1.877 | 3.15298 |
| VGG11 | 224 | 256 | 2.24408 | 4.67794 | 7.6568 | 3.90412 | 9.51147 | 17.14168 |
| VGG13 | 224 | 256 | 2.58589 | 5.82708 | 10.03591 | 4.64684 | 12.61558 | 23.70015 |
| VGG16 | 224 | 256 | 3.13237 | 7.19257 | 12.50913 | 5.61769 | 16.40064 | 32.03939 |
| VGG19 | 224 | 256 | 3.69987 | 8.59168 | 15.07866 | 6.65221 | 20.4334 | 41.55902 |
| DarkNet53 | 256 | 256 | 3.18101 | 5.88419 | 10.14964 | 4.10829 | 12.1714 | 22.15266 |
| ResNet50_ACNet | 256 | 256 | 3.89002 | 4.58195 | 9.01095 | 5.33395 | 10.96843 | 18.70368 |
| ResNet50_ACNet_deploy | 224 | 256 | 2.6823 | 5.944 | 7.16655 | 3.49161 | 7.78374 | 13.94361 |
# ResNet and ResNet_vd series
## Overview
The ResNet series model was proposed in 2015 and won the championship in the ILSVRC2015 competition with a top5 error rate of 3.57%. The network innovatively proposed the residual structure, and built the ResNet network by stacking multiple residual structures. Experiments show that using residual blocks can improve the convergence speed and accuracy effectively.
Joyce Xu of Stanford university calls ResNet one of three architectures that "really redefine the way we think about neural networks." Due to the outstanding performance of ResNet, more and more scholars and engineers from academia and industry have improved its structure. The well-known ones include wide-resnet, resnet-vc, resnet-vd, Res2Net, etc. The number of parameters and FLOPs of resnet-vc and resnet-vd are almost the same as those of ResNet, so we hereby unified them into the ResNet series.
The models of the ResNet series released this time include 14 pre-trained models including ResNet50, ResNet50_vd, ResNet50_vd_ssld, and ResNet200_vd. At the training level, ResNet adopted the standard training process for training ImageNet, while the rest of the improved model adopted more training strategies, such as cosine decay for the decline of learning rate and the regular label smoothing method,mixup was added to the data preprocessing, and the total number of iterations increased from 120 epoches to 200 epoches.
Among them, ResNet50_vd_v2 and ResNet50_vd_ssld adopted knowledge distillation, which further improved the accuracy of the model while keeping the structure unchanged. Specifically, the teacher model of ResNet50_vd_v2 is ResNet152_vd (top1 accuracy 80.59%), the training set is imagenet-1k, the teacher model of ResNet50_vd_ssld is ResNeXt101_32x16d_wsl (top1 accuracy 84.2%), and the training set is the combination of 4 million data mined by imagenet-22k and ImageNet-1k . The specific methods of knowledge distillation are being continuously updated.
The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.flops.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.params.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.png)
![](../../images/models/T4_benchmark/t4.fp16.bs4.ResNet.png)
As can be seen from the above curves, the higher the number of layers, the higher the accuracy, but the corresponding number of parameters, calculation and latency will increase. ResNet50_vd_ssld further improves the accuracy of top-1 of the ImageNet-1k validation set by using stronger teachers and more data, reaching 82.39%, refreshing the accuracy of ResNet50 series models.
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| ResNet18 | 0.710 | 0.899 | 0.696 | 0.891 | 3.660 | 11.690 |
| ResNet18_vd | 0.723 | 0.908 | | | 4.140 | 11.710 |
| ResNet34 | 0.746 | 0.921 | 0.732 | 0.913 | 7.360 | 21.800 |
| ResNet34_vd | 0.760 | 0.930 | | | 7.390 | 21.820 |
| ResNet50 | 0.765 | 0.930 | 0.760 | 0.930 | 8.190 | 25.560 |
| ResNet50_vc | 0.784 | 0.940 | | | 8.670 | 25.580 |
| ResNet50_vd | 0.791 | 0.944 | 0.792 | 0.946 | 8.670 | 25.580 |
| ResNet50_vd_v2 | 0.798 | 0.949 | | | 8.670 | 25.580 |
| ResNet101 | 0.776 | 0.936 | 0.776 | 0.938 | 15.520 | 44.550 |
| ResNet101_vd | 0.802 | 0.950 | | | 16.100 | 44.570 |
| ResNet152 | 0.783 | 0.940 | 0.778 | 0.938 | 23.050 | 60.190 |
| ResNet152_vd | 0.806 | 0.953 | | | 23.530 | 60.210 |
| ResNet200_vd | 0.809 | 0.953 | | | 30.530 | 74.740 |
| ResNet50_vd_ssld | 0.824 | 0.961 | | | 8.670 | 25.580 |
| ResNet50_vd_ssld_v2 | 0.830 | 0.964 | | | 8.670 | 25.580 |
| Fix_ResNet50_vd_ssld_v2 | 0.840 | 0.970 | | | 17.696 | 25.580 |
| ResNet101_vd_ssld | 0.837 | 0.967 | | | 16.100 | 44.570 |
* Note: `ResNet50_vd_ssld_v2` is obtained by adding AutoAugment in training process on the basis of `ResNet50_vd_ssld` training strategy.`Fix_ResNet50_vd_ssld_v2` stopped all parameter updates of `ResNet50_vd_ssld_v2` except the FC layer,and fine-tuned on ImageNet1k dataset, the resolution is 320x320.
## Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|------------------|-----------|-------------------|--------------------------|
| ResNet18 | 224 | 256 | 1.499 |
| ResNet18_vd | 224 | 256 | 1.603 |
| ResNet34 | 224 | 256 | 2.272 |
| ResNet34_vd | 224 | 256 | 2.343 |
| ResNet50 | 224 | 256 | 2.939 |
| ResNet50_vc | 224 | 256 | 3.041 |
| ResNet50_vd | 224 | 256 | 3.165 |
| ResNet50_vd_v2 | 224 | 256 | 3.165 |
| ResNet101 | 224 | 256 | 5.314 |
| ResNet101_vd | 224 | 256 | 5.252 |
| ResNet152 | 224 | 256 | 7.205 |
| ResNet152_vd | 224 | 256 | 7.200 |
| ResNet200_vd | 224 | 256 | 8.885 |
| ResNet50_vd_ssld | 224 | 256 | 3.165 |
| ResNet101_vd_ssld | 224 | 256 | 5.252 |
## Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
| ResNet18 | 224 | 256 | 1.3568 | 2.5225 | 3.61904 | 1.45606 | 3.56305 | 6.28798 |
| ResNet18_vd | 224 | 256 | 1.39593 | 2.69063 | 3.88267 | 1.54557 | 3.85363 | 6.88121 |
| ResNet34 | 224 | 256 | 2.23092 | 4.10205 | 5.54904 | 2.34957 | 5.89821 | 10.73451 |
| ResNet34_vd | 224 | 256 | 2.23992 | 4.22246 | 5.79534 | 2.43427 | 6.22257 | 11.44906 |
| ResNet50 | 224 | 256 | 2.63824 | 4.63802 | 7.02444 | 3.47712 | 7.84421 | 13.90633 |
| ResNet50_vc | 224 | 256 | 2.67064 | 4.72372 | 7.17204 | 3.52346 | 8.10725 | 14.45577 |
| ResNet50_vd | 224 | 256 | 2.65164 | 4.84109 | 7.46225 | 3.53131 | 8.09057 | 14.45965 |
| ResNet50_vd_v2 | 224 | 256 | 2.65164 | 4.84109 | 7.46225 | 3.53131 | 8.09057 | 14.45965 |
| ResNet101 | 224 | 256 | 5.04037 | 7.73673 | 10.8936 | 6.07125 | 13.40573 | 24.3597 |
| ResNet101_vd | 224 | 256 | 5.05972 | 7.83685 | 11.34235 | 6.11704 | 13.76222 | 25.11071 |
| ResNet152 | 224 | 256 | 7.28665 | 10.62001 | 14.90317 | 8.50198 | 19.17073 | 35.78384 |
| ResNet152_vd | 224 | 256 | 7.29127 | 10.86137 | 15.32444 | 8.54376 | 19.52157 | 36.64445 |
| ResNet200_vd | 224 | 256 | 9.36026 | 13.5474 | 19.0725 | 10.80619 | 25.01731 | 48.81399 |
| ResNet50_vd_ssld | 224 | 256 | 2.65164 | 4.84109 | 7.46225 | 3.53131 | 8.09057 | 14.45965 |
| ResNet50_vd_ssld_v2 | 224 | 256 | 2.65164 | 4.84109 | 7.46225 | 3.53131 | 8.09057 | 14.45965 |
| Fix_ResNet50_vd_ssld_v2 | 320 | 320 | 3.42818 | 7.51534 | 13.19370 | 5.07696 | 14.64218 | 27.01453 |
| ResNet101_vd_ssld | 224 | 256 | 5.05972 | 7.83685 | 11.34235 | 6.11704 | 13.76222 | 25.11071 |
# SEResNeXt and Res2Net series
## Overview
ResNeXt, one of the typical variants of ResNet, was presented at the CVPR conference in 2017. Prior to this, the methods to improve the model accuracy mainly focused on deepening or widening the network, which increased the number of parameters and calculation, and slowed down the inference speed accordingly. The concept of cardinality was proposed in ResNeXt structure. The author found that increasing the number of channel groups was more effective than increasing the depth and width through experiments. It can improve the accuracy without increasing the parameter complexity and reduce the number of parameters at the same time, so it is a more successful variant of ResNet.
SENet is the winner of the 2017 ImageNet classification competition. It proposes a new SE structure that can be migrated to any other network. It controls the scale to enhance the important features between each channel, and weaken the unimportant features. So that the extracted features are more directional.
Res2Net is a brand-new improvement of ResNet proposed in 2019. The solution can be easily integrated with other excellent modules. Without increasing the amount of calculation, the performance on ImageNet, CIFAR-100 and other data sets exceeds ResNet. Res2Net, with its simple structure and superior performance, further explores the multi-scale representation capability of CNN at a more fine-grained level. Res2Net reveals a new dimension to improve model accuracy, called scale, which is an essential and more effective factor in addition to the existing dimensions of depth, width, and cardinality. The network also performs well in other visual tasks such as object detection and image segmentation.
The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
![](../../images/models/T4_benchmark/t4.fp32.bs4.SeResNeXt.flops.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.SeResNeXt.params.png)
![](../../images/models/T4_benchmark/t4.fp32.bs4.SeResNeXt.png)
![](../../images/models/T4_benchmark/t4.fp16.bs4.SeResNeXt.png)
At present, there are a total of 24 pretrained models of the three categories open sourced by PaddleClas, and the indicators are shown in the figure. It can be seen from the diagram that under the same Flops and Params, the improved model tends to have higher accuracy, but the inference speed is often inferior to the ResNet series. On the other hand, Res2Net performed better. Compared with group operation in ResNeXt and SE structure operation in SEResNet, Res2Net tended to have better accuracy in the same Flops, Params and inference speed.
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Res2Net50_26w_4s | 0.793 | 0.946 | 0.780 | 0.936 | 8.520 | 25.700 |
| Res2Net50_vd_26w_4s | 0.798 | 0.949 | | | 8.370 | 25.060 |
| Res2Net50_14w_8s | 0.795 | 0.947 | 0.781 | 0.939 | 9.010 | 25.720 |
| Res2Net101_vd_26w_4s | 0.806 | 0.952 | | | 16.670 | 45.220 |
| Res2Net200_vd_26w_4s | 0.812 | 0.957 | | | 31.490 | 76.210 |
| ResNeXt50_32x4d | 0.778 | 0.938 | 0.778 | | 8.020 | 23.640 |
| ResNeXt50_vd_32x4d | 0.796 | 0.946 | | | 8.500 | 23.660 |
| ResNeXt50_64x4d | 0.784 | 0.941 | | | 15.060 | 42.360 |
| ResNeXt50_vd_64x4d | 0.801 | 0.949 | | | 15.540 | 42.380 |
| ResNeXt101_32x4d | 0.787 | 0.942 | 0.788 | | 15.010 | 41.540 |
| ResNeXt101_vd_32x4d | 0.803 | 0.951 | | | 15.490 | 41.560 |
| ResNeXt101_64x4d | 0.784 | 0.945 | 0.796 | | 29.050 | 78.120 |
| ResNeXt101_vd_64x4d | 0.808 | 0.952 | | | 29.530 | 78.140 |
| ResNeXt152_32x4d | 0.790 | 0.943 | | | 22.010 | 56.280 |
| ResNeXt152_vd_32x4d | 0.807 | 0.952 | | | 22.490 | 56.300 |
| ResNeXt152_64x4d | 0.795 | 0.947 | | | 43.030 | 107.570 |
| ResNeXt152_vd_64x4d | 0.811 | 0.953 | | | 43.520 | 107.590 |
| SE_ResNet18_vd | 0.733 | 0.914 | | | 4.140 | 11.800 |
| SE_ResNet34_vd | 0.765 | 0.932 | | | 7.840 | 21.980 |
| SE_ResNet50_vd | 0.795 | 0.948 | | | 8.670 | 28.090 |
| SE_ResNeXt50_32x4d | 0.784 | 0.940 | 0.789 | 0.945 | 8.020 | 26.160 |
| SE_ResNeXt50_vd_32x4d | 0.802 | 0.949 | | | 10.760 | 26.280 |
| SE_ResNeXt101_32x4d | 0.791 | 0.942 | 0.793 | 0.950 | 15.020 | 46.280 |
| SENet154_vd | 0.814 | 0.955 | | | 45.830 | 114.290 |
## Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|-----------------------|-----------|-------------------|--------------------------|
| Res2Net50_26w_4s | 224 | 256 | 4.148 |
| Res2Net50_vd_26w_4s | 224 | 256 | 4.172 |
| Res2Net50_14w_8s | 224 | 256 | 5.113 |
| Res2Net101_vd_26w_4s | 224 | 256 | 7.327 |
| Res2Net200_vd_26w_4s | 224 | 256 | 12.806 |
| ResNeXt50_32x4d | 224 | 256 | 10.964 |
| ResNeXt50_vd_32x4d | 224 | 256 | 7.566 |
| ResNeXt50_64x4d | 224 | 256 | 13.905 |
| ResNeXt50_vd_64x4d | 224 | 256 | 14.321 |
| ResNeXt101_32x4d | 224 | 256 | 14.915 |
| ResNeXt101_vd_32x4d | 224 | 256 | 14.885 |
| ResNeXt101_64x4d | 224 | 256 | 28.716 |
| ResNeXt101_vd_64x4d | 224 | 256 | 28.398 |
| ResNeXt152_32x4d | 224 | 256 | 22.996 |
| ResNeXt152_vd_32x4d | 224 | 256 | 22.729 |
| ResNeXt152_64x4d | 224 | 256 | 46.705 |
| ResNeXt152_vd_64x4d | 224 | 256 | 46.395 |
| SE_ResNet18_vd | 224 | 256 | 1.694 |
| SE_ResNet34_vd | 224 | 256 | 2.786 |
| SE_ResNet50_vd | 224 | 256 | 3.749 |
| SE_ResNeXt50_32x4d | 224 | 256 | 8.924 |
| SE_ResNeXt50_vd_32x4d | 224 | 256 | 9.011 |
| SE_ResNeXt101_32x4d | 224 | 256 | 19.204 |
| SENet154_vd | 224 | 256 | 50.406 |
## Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-----------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
| Res2Net50_26w_4s | 224 | 256 | 3.56067 | 6.61827 | 11.41566 | 4.47188 | 9.65722 | 17.54535 |
| Res2Net50_vd_26w_4s | 224 | 256 | 3.69221 | 6.94419 | 11.92441 | 4.52712 | 9.93247 | 18.16928 |
| Res2Net50_14w_8s | 224 | 256 | 4.45745 | 7.69847 | 12.30935 | 5.4026 | 10.60273 | 18.01234 |
| Res2Net101_vd_26w_4s | 224 | 256 | 6.53122 | 10.81895 | 18.94395 | 8.08729 | 17.31208 | 31.95762 |
| Res2Net200_vd_26w_4s | 224 | 256 | 11.66671 | 18.93953 | 33.19188 | 14.67806 | 32.35032 | 63.65899 |
| ResNeXt50_32x4d | 224 | 256 | 7.61087 | 8.88918 | 12.99674 | 7.56327 | 10.6134 | 18.46915 |
| ResNeXt50_vd_32x4d | 224 | 256 | 7.69065 | 8.94014 | 13.4088 | 7.62044 | 11.03385 | 19.15339 |
| ResNeXt50_64x4d | 224 | 256 | 13.78688 | 15.84655 | 21.79537 | 13.80962 | 18.4712 | 33.49843 |
| ResNeXt50_vd_64x4d | 224 | 256 | 13.79538 | 15.22201 | 22.27045 | 13.94449 | 18.88759 | 34.28889 |
| ResNeXt101_32x4d | 224 | 256 | 16.59777 | 17.93153 | 21.36541 | 16.21503 | 19.96568 | 33.76831 |
| ResNeXt101_vd_32x4d | 224 | 256 | 16.36909 | 17.45681 | 22.10216 | 16.28103 | 20.25611 | 34.37152 |
| ResNeXt101_64x4d | 224 | 256 | 30.12355 | 32.46823 | 38.41901 | 30.4788 | 36.29801 | 68.85559 |
| ResNeXt101_vd_64x4d | 224 | 256 | 30.34022 | 32.27869 | 38.72523 | 30.40456 | 36.77324 | 69.66021 |
| ResNeXt152_32x4d | 224 | 256 | 25.26417 | 26.57001 | 30.67834 | 24.86299 | 29.36764 | 52.09426 |
| ResNeXt152_vd_32x4d | 224 | 256 | 25.11196 | 26.70515 | 31.72636 | 25.03258 | 30.08987 | 52.64429 |
| ResNeXt152_64x4d | 224 | 256 | 46.58293 | 48.34563 | 56.97961 | 46.7564 | 56.34108 | 106.11736 |
| ResNeXt152_vd_64x4d | 224 | 256 | 47.68447 | 48.91406 | 57.29329 | 47.18638 | 57.16257 | 107.26288 |
| SE_ResNet18_vd | 224 | 256 | 1.61823 | 3.1391 | 4.60282 | 1.7691 | 4.19877 | 7.5331 |
| SE_ResNet34_vd | 224 | 256 | 2.67518 | 5.04694 | 7.18946 | 2.88559 | 7.03291 | 12.73502 |
| SE_ResNet50_vd | 224 | 256 | 3.65394 | 7.568 | 12.52793 | 4.28393 | 10.38846 | 18.33154 |
| SE_ResNeXt50_32x4d | 224 | 256 | 9.06957 | 11.37898 | 18.86282 | 8.74121 | 13.563 | 23.01954 |
| SE_ResNeXt50_vd_32x4d | 224 | 256 | 9.25016 | 11.85045 | 25.57004 | 9.17134 | 14.76192 | 19.914 |
| SE_ResNeXt101_32x4d | 224 | 256 | 19.34455 | 20.6104 | 32.20432 | 18.82604 | 25.31814 | 41.97758 |
| SENet154_vd | 224 | 256 | 49.85733 | 54.37267 | 74.70447 | 53.79794 | 66.31684 | 121.59885 |
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册