diff --git a/docs/en/ImageNet_models_en.md b/docs/en/algorithm_introduction/ImageNet_models_en.md
similarity index 100%
rename from docs/en/ImageNet_models_en.md
rename to docs/en/algorithm_introduction/ImageNet_models_en.md
diff --git a/docs/en/models/DLA.md b/docs/en/models/DLA_en.md
similarity index 92%
rename from docs/en/models/DLA.md
rename to docs/en/models/DLA_en.md
index 176d6d1af77631ffb455ab0ad8bd3d4fbe47555c..fc5d75c0a50aa5b843c490e6768d404622305421 100644
--- a/docs/en/models/DLA.md
+++ b/docs/en/models/DLA_en.md
@@ -1,11 +1,17 @@
# DLA series
+---
+## Catalogue
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
## Overview
DLA (Deep Layer Aggregation). Visual recognition requires rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse. Even with the depth of features in a convolutional network, a layer in isolation is not enough: compounding and aggregating these representations improves inference of what and where. Although skip connections have been incorporated to combine layers, these connections have been "shallow" themselves, and only fuse by simple, one-step operations. The authors augment standard architectures with deeper aggregation to better fuse information across layers. Deep layer aggregation structures iteratively and hierarchically merge the feature hierarchy to make networks with better accuracy and fewer parameters. Experiments across architectures and tasks show that deep layer aggregation improves recognition and resolution compared to existing branching and merging schemes. [paper](https://arxiv.org/abs/1707.06484)
-
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Model | Params (M) | FLOPs (G) | Top-1 (%) | Top-5 (%) |
|:-----------------:|:----------:|:---------:|:---------:|:---------:|
diff --git a/docs/en/models/DPN_DenseNet_en.md b/docs/en/models/DPN_DenseNet_en.md
index 3e6aac761af870cf3ba89c46616642a7037c3bdb..7447d7a1e185f853a41f9d7487d78cd907843a6b 100644
--- a/docs/en/models/DPN_DenseNet_en.md
+++ b/docs/en/models/DPN_DenseNet_en.md
@@ -1,6 +1,14 @@
# DPN and DenseNet series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPs and Parameters](#2)
+* [3. Inference speed based on V100 GPU](#3)
+* [4. Inference speed based on T4 GPU](#4)
+
+
+## 1. Overview
DenseNet is a new network structure proposed in 2017 and was the best paper of CVPR. The network has designed a new cross-layer connected block called dense-block. Compared to the bottleneck in ResNet, dense-block has designed a more aggressive dense connection module, that is, connecting all the layers to each other, and each layer will accept all the layers in front of it as its additional input. DenseNet stacks all dense-blocks into a densely connected network. The dense connection makes DenseNet easier to backpropagate, making the network easier to train and converge. The full name of DPN is Dual Path Networks, which is a network composed of DenseNet and ResNeXt, which proves that DenseNet can extract new features from the previous level, and ResNeXt essentially reuses the extracted features . The author further analyzes and finds that ResNeXt has high reuse rate for features, but low redundancy, while DenseNet can create new features, but with high redundancy. Combining the advantages of the two structures, the author designed the DPN network. In the end, the DPN network achieved better results than ResNeXt and DenseNet under the same FLOPS and parameters.
@@ -18,10 +26,10 @@ The pretrained models of these two types of models (a total of 10) are open sour
For DPN series networks, the larger the model's FLOPs and parameters, the higher the model's accuracy. Among them, since the width of DPN107 is the largest, it has the largest number of parameters and FLOPs in this series of networks.
+
+## 2. Accuracy, FLOPs and Parameters
-## Accuracy, FLOPS and Parameters
-
-| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPs<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| DenseNet121 | 0.757 | 0.926 | 0.750 | | 5.690 | 7.980 |
| DenseNet161 | 0.786 | 0.941 | 0.778 | | 15.490 | 28.680 |
@@ -36,8 +44,8 @@ For DPN series networks, the larger the model's FLOPs and parameters, the higher
-
-## Inference speed based on V100 GPU
+
+## 3. Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|-------------|-----------|-------------------|--------------------------|
@@ -53,8 +61,8 @@ For DPN series networks, the larger the model's FLOPs and parameters, the higher
| DPN131 | 224 | 256 | 28.083 |
-
-## Inference speed based on T4 GPU
+
+## 4. Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
diff --git a/docs/en/models/ESNet_en.md b/docs/en/models/ESNet_en.md
new file mode 100644
index 0000000000000000000000000000000000000000..77219229115efa3f4e7fee456cfe4b6b56d1a5f3
--- /dev/null
+++ b/docs/en/models/ESNet_en.md
@@ -0,0 +1,23 @@
+# ESNet Series
+---
+## Catalogue
+
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
+## 1. Overview
+
+ESNet (Enhanced ShuffleNet) is a lightweight network developed by Baidu. Building on ShuffleNetV2, it combines the advantages of MobileNetV3, GhostNet, and PPLCNet to form a faster and more accurate network on ARM devices. Because of its excellent performance, [PP-PicoDet](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.3/configs/picodet) in PaddleDetection uses this model as its backbone; combined with a stronger object detection algorithm, its final mAP set a new SOTA for object detection models on ARM devices.
+
+
+## 2. Accuracy, FLOPS and Parameters
+
+| Models | Top1 | Top5 | FLOPs<br>(M) | Params<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|
+| ESNet_x0_25 | 62.48 | 83.46 | 30.9 | 2.83 |
+| ESNet_x0_5 | 68.82 | 88.04 | 67.3 | 3.25 |
+| ESNet_x0_75 | 72.24 | 90.45 | 123.7 | 3.87 |
+| ESNet_x1_0 | 73.92 | 91.40 | 197.3 | 4.64 |
+
+Please stay tuned for further information such as inference speed.
diff --git a/docs/en/models/EfficientNet_and_ResNeXt101_wsl_en.md b/docs/en/models/EfficientNet_and_ResNeXt101_wsl_en.md
index 07dff3dac27bd227dcaa992eed2be1443bb9b256..6b25f69df1417699665abc8588555e7edd769970 100644
--- a/docs/en/models/EfficientNet_and_ResNeXt101_wsl_en.md
+++ b/docs/en/models/EfficientNet_and_ResNeXt101_wsl_en.md
@@ -1,6 +1,14 @@
# EfficientNet and ResNeXt101_wsl series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+* [3. Inference speed based on V100 GPU](#3)
+* [4. Inference speed based on T4 GPU](#4)
+
+
+## 1. Overview
EfficientNet is a lightweight NAS-based network released by Google in 2019. EfficientNetB7 refreshed the classification accuracy of ImageNet-1k at that time. In this paper, the author points out that the traditional methods to improve the performance of neural networks mainly start with the width of the network, the depth of the network, and the resolution of the input picture.
However, the author found that balancing these three dimensions is essential for improving accuracy and efficiency through experiments.
@@ -21,7 +29,8 @@ The FLOPS, parameters, and inference time on the T4 GPU of this series of models
At present, there are a total of 14 pretrained models of the two types of models that PaddleClas open source. It can be seen from the above figure that the advantages of the EfficientNet series network are very obvious. The ResNeXt101_wsl series model uses more data, and the final accuracy is also higher. EfficientNet_B0_small removes SE_block based on EfficientNet_B0, which has faster inference speed.
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
@@ -40,8 +49,8 @@ At present, there are a total of 14 pretrained models of the two types of models
| EfficientNetB7 | 0.843 | 0.969 | 0.844 | 0.971 | 72.350 | 64.920 |
| EfficientNetB0_<br>small | 0.758 | 0.926 | | | 0.720 | 4.650 |
-
-## Inference speed based on V100 GPU
+
+## 3. Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|-------------------------------|-----------|-------------------|--------------------------|
@@ -61,8 +70,8 @@ At present, there are a total of 14 pretrained models of the two types of models
| EfficientNetB0_<br>small | 224 | 256 | 1.692 |
-
-## Inference speed based on T4 GPU
+
+## 4. Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|---------------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
diff --git a/docs/en/models/HRNet_en.md b/docs/en/models/HRNet_en.md
index 971aa6778e9cf47cbc95cf5008359c130e0cff04..847f849a85f245f01d230b8bb6c718ea2efd882a 100644
--- a/docs/en/models/HRNet_en.md
+++ b/docs/en/models/HRNet_en.md
@@ -1,6 +1,14 @@
# HRNet series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+* [3. Inference speed based on V100 GPU](#3)
+* [4. Inference speed based on T4 GPU](#4)
+
+
+## 1. Overview
HRNet is a brand new neural network proposed by Microsoft research Asia in 2019. Different from the previous convolutional neural network, this network can still maintain high resolution in the deep layer of the network, so the heat map of the key points predicted is more accurate, and it is also more accurate in space. In addition, the network performs particularly well in other visual tasks sensitive to resolution, such as detection and segmentation.
@@ -16,8 +24,8 @@ The FLOPS, parameters, and inference time on the T4 GPU of this series of models
At present, there are 7 pretrained models of such models open-sourced by PaddleClas, and their indicators are shown in the figure. Among them, the reason why the accuracy of the HRNet_W48_C indicator is abnormal may be due to fluctuations in training.
-
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
@@ -32,8 +40,8 @@ At present, there are 7 pretrained models of such models open-sourced by PaddleC
| HRNet_W64_C | 0.793 | 0.946 | 0.795 | 0.946 | 57.830 | 128.060 |
| SE_HRNet_W64_C_ssld | 0.847 | 0.973 | | | 57.830 | 128.970 |
-
-## Inference speed based on V100 GPU
+
+## 3. Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|-------------|-----------|-------------------|--------------------------|
@@ -49,8 +57,8 @@ At present, there are 7 pretrained models of such models open-sourced by PaddleC
-
-## Inference speed based on T4 GPU
+
+## 4. Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
diff --git a/docs/en/models/HarDNet.md b/docs/en/models/HarDNet_en.md
similarity index 86%
rename from docs/en/models/HarDNet.md
rename to docs/en/models/HarDNet_en.md
index 4201cdba289bc992053061c12395a1223fb21090..ba1c2e5e014b31331cbbf3db30f8d53c36701d1f 100644
--- a/docs/en/models/HarDNet.md
+++ b/docs/en/models/HarDNet_en.md
@@ -1,10 +1,17 @@
# HarDNet series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
+## 1. Overview
HarDNet(Harmonic DenseNet)is a brand new neural network proposed by National Tsing Hua University in 2019, which to achieve high efficiency in terms of both low MACs and memory traffic. The new network achieves 35%, 36%, 30%, 32%, and 45% inference time reduction compared with FC-DenseNet-103, DenseNet-264, ResNet-50, ResNet-152, and SSD-VGG, respectively. We use tools including Nvidia profiler and ARM Scale-Sim to measure the memory traffic and verify that the inference latency is indeed proportional to the memory traffic consumption and the proposed network consumes low memory traffic. [Paper](https://arxiv.org/abs/1909.00948).
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Model | Params (M) | FLOPs (G) | Top-1 (%) | Top-5 (%) |
|:---------------------:|:----------:|:---------:|:---------:|:---------:|
diff --git a/docs/en/models/Inception_en.md b/docs/en/models/Inception_en.md
index 1291f992fa6b1615daad018857733f031f048a16..9b3123923645dc24ba36860aed4465bb88bfe273 100644
--- a/docs/en/models/Inception_en.md
+++ b/docs/en/models/Inception_en.md
@@ -1,6 +1,14 @@
# Inception series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+* [3. Inference speed based on V100 GPU](#3)
+* [4. Inference speed based on T4 GPU](#4)
+
+
+## 1. Overview
GoogLeNet is a new neural network structure designed by Google in 2014, which, together with VGG network, became the twin champions of the ImageNet challenge that year. GoogLeNet introduces the Inception structure for the first time, and stacks the Inception structure in the network so that the number of network layers reaches 22, which is also the mark of the convolutional network exceeding 20 layers for the first time. Since 1x1 convolution is used in the Inception structure to reduce the dimension of channel number, and Global pooling is used to replace the traditional method of processing features in multiple fc layers, the final GoogLeNet network has much less FLOPS and parameters than VGG network, which has become a beautiful scenery of neural network design at that time.
@@ -22,8 +30,8 @@ The FLOPS, parameters, and inference time on the T4 GPU of this series of models
The figure above reflects the relationship between the accuracy of Xception series and InceptionV4 and other indicators. Among them, Xception_deeplab is consistent with the structure of the paper, and Xception is an improved model developed by PaddleClas, which improves the accuracy by about 0.6% when the inference speed is basically unchanged. Details of the improved model are being updated, so stay tuned.
-
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
@@ -37,8 +45,8 @@ The figure above reflects the relationship between the accuracy of Xception seri
| InceptionV4 | 0.808 | 0.953 | 0.800 | 0.950 | 24.570 | 42.680 |
-
-## Inference speed based on V100 GPU
+
+## 3. Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
|------------------------|-----------|-------------------|--------------------------|
@@ -51,8 +59,8 @@ The figure above reflects the relationship between the accuracy of Xception seri
| InceptionV4 | 299 | 320 | 11.141 |
-
-## Inference speed based on T4 GPU
+
+## 4. Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|--------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
diff --git a/docs/en/models/LeViT_en.md b/docs/en/models/LeViT_en.md
index 7fd953aca91947cb3acd134c3119dcd0fbf5d2df..4d7e5dbbba8c5a274acca67e80bab66479cdc034 100644
--- a/docs/en/models/LeViT_en.md
+++ b/docs/en/models/LeViT_en.md
@@ -1,9 +1,16 @@
# LeViT series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
+## 1. Overview
LeViT is a fast inference hybrid neural network for image classification tasks. Its design considers the performance of the network model on different hardware platforms, so it can better reflect the real scenarios of common applications. Through a large number of experiments, the author found a better way to combine the convolutional neural network and the Transformer system, and proposed an attention-based method to integrate the position information encoding in the Transformer. [Paper](https://arxiv.org/abs/2104.01136)。
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(M) | Params<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
diff --git a/docs/en/models/MixNet_en.md b/docs/en/models/MixNet_en.md
index 0734e843a4a7e374230cc70c8110c2ccf14636d5..a0faa3dcfa71103855009f85b0e126aded604d1c 100644
--- a/docs/en/models/MixNet_en.md
+++ b/docs/en/models/MixNet_en.md
@@ -1,6 +1,12 @@
# MixNet series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
+## 1. Overview
MixNet is a lightweight network proposed by Google. The main idea of MixNet is to explore the combination of different size of kernels. The author found that the current network has the following two problems:
@@ -9,7 +15,8 @@ MixNet is a lightweight network proposed by Google. The main idea of MixNet is t
In order to solve the above two problems, MDConv(mixed depthwise convolution) is proposed. In this method, different size of kernels are mixed in a convolution operation block. And based on AutoML, a series of networks called MixNets are proposed, which have achieved good results on Imagenet. [paper](https://arxiv.org/pdf/1907.09595.pdf)
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | FLOPS<br>(M) | Params<br>(G |
| :------: | :---: | :---: | :---------------: | :----------: | ------------- |
diff --git a/docs/en/models/Mobile_en.md b/docs/en/models/Mobile_en.md
index 6bd7c94ced2f9ea6c9d94825b1cf23e11e41da93..543bfb9e84c77f528899c89ea1be21d70c9038c5 100644
--- a/docs/en/models/Mobile_en.md
+++ b/docs/en/models/Mobile_en.md
@@ -1,6 +1,14 @@
# Mobile and Embedded Vision Applications Network series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+* [3. Inference speed and storage size based on SD855](#3)
+* [4. Inference speed based on T4 GPU](#4)
+
+
+## 1. Overview
MobileNetV1 is a network launched by Google in 2017 for use on mobile devices or embedded devices. The network replaces the depthwise separable convolution with the traditional convolution operation, that is, the combination of depthwise convolution and pointwise convolution. Compared with the traditional convolution operation, this combination can greatly save the number of parameters and computation. At the same time, MobileNetV1 can also be used for object detection, image segmentation and other visual tasks.
@@ -22,7 +30,8 @@ GhosttNet is a brand-new lightweight network structure proposed by Huawei in 202
Currently there are 32 pretrained models of the mobile series open source by PaddleClas, and their indicators are shown in the figure below. As you can see from the picture, newer lightweight models tend to perform better, and MobileNetV3 represents the latest lightweight neural network architecture. In MobileNetV3, the author used 1x1 convolution after global-avg-pooling in order to obtain higher accuracy,this operation significantly increases the number of parameters but has little impact on the amount of computation, so if the model is evaluated from a storage perspective of excellence, MobileNetV3 does not have much advantage, but because of its smaller computation, it has a faster inference speed. In addition, the SSLD distillation model in our model library performs excellently, refreshing the accuracy of the current lightweight model from various perspectives. Due to the complex structure and many branches of the MobileNetV3 model, which is not GPU friendly, the GPU inference speed is not as good as that of MobileNetV1.
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
@@ -64,8 +73,8 @@ Currently there are 32 pretrained models of the mobile series open source by Pad
| GhostNet_x1_3 | 0.757 | 0.925 | 0.757 | 0.927 | 0.440 | 7.300 |
| GhostNet_x1_3_ssld | 0.794 | 0.945 | 0.757 | 0.927 | 0.440 | 7.300 |
-
-## Inference speed and storage size based on SD855
+
+## 3. Inference speed and storage size based on SD855
| Models | Batch Size=1(ms) | Storage Size(M) |
|:--:|:--:|:--:|
@@ -107,8 +116,8 @@ Currently there are 32 pretrained models of the mobile series open source by Pad
| GhostNet_x1_3 | 19.982 | 29.000 |
| GhostNet_x1_3_ssld | 19.982 | 29.000 |
-
-## Inference speed based on T4 GPU
+
+## 4. Inference speed based on T4 GPU
| Models | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-----------------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
diff --git a/docs/en/models/Others_en.md b/docs/en/models/Others_en.md
index 4511ddb47578a5bef913701d2c45f827e2b6b4c8..e8101b5ed32408d81815a7953f750b919c343128 100644
--- a/docs/en/models/Others_en.md
+++ b/docs/en/models/Others_en.md
@@ -1,6 +1,14 @@
# Other networks
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+* [3. Inference speed and storage size based on SD855](#3)
+* [4. Inference speed based on T4 GPU](#4)
+
+
+## 1. Overview
In 2012, AlexNet network proposed by Alex et al. won the ImageNet competition by far surpassing the second place, and the convolutional neural network and even deep learning attracted wide attention. AlexNet used relu as the activation function of CNN to solve the gradient dispersion problem of sigmoid when the network is deep. During the training, Dropout was used to randomly lose a part of the neurons, avoiding the overfitting of the model. In the network, overlapping maximum pooling is used to replace the average pooling commonly used in CNN, which avoids the fuzzy effect of average pooling and improves the feature richness. In a sense, AlexNet has exploded the research and application of neural networks.
@@ -11,8 +19,8 @@ VGG is a convolutional neural network developed by researchers at Oxford Univers
DarkNet53 is designed for object detection by YOLO author in the paper. The network is basically composed of 1x1 and 3x3 kernel, with a total of 53 layers, named DarkNet53.
-
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
@@ -26,8 +34,8 @@ DarkNet53 is designed for object detection by YOLO author in the paper. The netw
| DarkNet53 | 0.780 | 0.941 | 0.772 | 0.938 | 18.580 | 41.600 |
-
-## Inference speed based on V100 GPU
+
+## 3. Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
@@ -41,7 +49,8 @@ DarkNet53 is designed for object detection by YOLO author in the paper. The netw
| VGG19 | 224 | 256 | 3.076 |
| DarkNet53 | 256 | 256 | 3.139 |
-## Inference speed based on T4 GPU
+
+## 4. Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
|-----------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
diff --git a/docs/en/models/PP-LCNet_en.md b/docs/en/models/PP-LCNet_en.md
index 57e34151d0e2306bbbdf9f756fffc81889e4a947..bc1e075963d5e3afb0375e2a810e2e61b8369164 100644
--- a/docs/en/models/PP-LCNet_en.md
+++ b/docs/en/models/PP-LCNet_en.md
@@ -1,26 +1,49 @@
# PP-LCNet Series
+---
-## Abstract
+
+## Catalogue
+
+- [1. Abstract](#1)
+- [2. Introduction](#2)
+- [3. Method](#3)
+ - [3.1 Better Activation Function](#3.1)
+ - [3.2 SE Modules at Appropriate Positions](#3.2)
+ - [3.3 Larger Convolution Kernels](#3.3)
+ - [3.4 Larger Dimensional 1 × 1 Conv Layer after GAP](#3.4)
+- [4. Experiments](#4)
+ - [4.1 Image Classification](#4.1)
+ - [4.2 Object Detection](#4.2)
+ - [4.3 Semantic Segmentation](#4.3)
+- [5. Conclusion](#5)
+- [6. Reference](#6)
+
+
+## 1. Abstract
In the field of computer vision, the quality of backbone network determines the outcome of the whole vision task. In previous studies, researchers generally focus on the optimization of FLOPs or Params, but inference speed actually serves as an importance indicator of model quality in real-world scenarios. Nevertheless, it is difficult to balance inference speed and accuracy. In view of various CPU-based applications in industry, we are now working to raise the adaptability of the backbone network to Intel CPU, so as to obtain a faster and more accurate lightweight backbone network. At the same time, the performance of downstream vision tasks such as object detection and semantic segmentation are also improved.
-## Introduction
+
+## 2. Introduction
Recent years witnessed the emergence of many lightweight backbone networks. In past two years, in particular, there were abundant networks searched by NAS that either enjoy advantages on FLOPs or Params, or have an edge in terms of inference speed on ARM devices. However, few of them dedicated to specified optimization of Intel CPU, resulting their imperfect inference speed on the intel CPU side. Based on this, we specially design the backbone network PP-LCNet for Intel CPU devices with its acceleration library MKLDNN. Compared with other lightweight SOTA models, this backbone network can further improve the performance of the model without increasing the inference time, significantly outperforming the existing SOTA models. A comparison chart with other models is shown below.
-## Method
+
+## 3. Method
The overall structure of the network is shown in the figure below.
Build on extensive experiments, we found that many seemingly less time-consuming operations will increase the latency on Intel CPU-based devices, especially when the MKLDNN acceleration library is enabled. Therefore, we finally chose a block with the leanest possible structure and the fastest possible speed to form our BaseNet (similar to MobileNetV1). Based on BaseNet, we summarized four strategies that can improve the accuracy of the model without increasing the latency, and we combined these four strategies to form PP-LCNet. Each of these four strategies is introduced as below:
-### Better Activation Function
+
+### 3.1 Better Activation Function
Since the adoption of ReLU activation function by convolutional neural network, the network performance has been improved substantially, and variants of the ReLU activation function have appeared in recent years, such as Leaky-ReLU, P-ReLU, ELU, etc. In 2017, Google Brain searched to obtain the swish activation function, which performs well on lightweight networks. In 2019, the authors of MobileNetV3 further optimized this activation function to H-Swish, which removes the exponential operation, leading to faster speed and an almost unaffected network accuracy. After many experiments, we also recognized its excellent performance on lightweight networks. Therefore, this activation function is adopted in PP-LCNet.
-### SE Modules at Appropriate Positions
+
+### 3.2 SE Modules at Appropriate Positions
The SE module is a channel attention mechanism proposed by SENet, which can effectively improve the accuracy of the model. However, on the Intel CPU side, the module also presents a large latency, leaving us the task of balancing accuracy and speed. The search of the location of the SE module in NAS search-based networks such as MobileNetV3 brings no general conclusions, but we found through our experiments that the closer the SE module is to the tail of the network the greater the improvement in model accuracy. The following table also shows some of our experimental results:
@@ -33,12 +56,13 @@ The SE module is a channel attention mechanism proposed by SENet, which can effe
The option in the third row of the table was chosen for the location of the SE module in PP-LCNet.
-### Larger Convolution Kernels
+
+### 3.3 Larger Convolution Kernels
In the paper of MixNet, the author analyzes the effect of convolutional kernel size on model performance and concludes that larger convolutional kernels within a certain range can improve the performance of the model, but beyond this range will be detrimental to the model’s performance. So the author forms MixConv with split-concat paradigm combined, which can improve the performance of the model but is not conducive to inference. We experimentally summarize the role of some larger convolutional kernels at different positions that are similar to those of the SE module, and find that larger convolutional kernels display more prominent roles in the middle and tail of the network. The following table shows the effect of the position of the 5x5 convolutional kernels on the accuracy:
-| SE Location | Top-1 Acc(\%) | Latency(ms) |
-|-------------------|---------------|-------------|
+| Larger Convolution Location | Top-1 Acc(\%) | Latency(ms) |
+|----------------------------|---------------|-------------|
| 1111111111111 | 63.22 | 2.08 |
| 1111111000000 | 62.70 | 2.07 |
| 0000001111111 | 63.14 | 2.05 |
@@ -46,7 +70,8 @@ In the paper of MixNet, the author analyzes the effect of convolutional kernel s
Experiments show that a larger convolutional kernel placed at the middle and tail of the network can achieve the same accuracy as placed at all positions, coupled with faster inference. The option in the third row of the table was the final choice of PP-LCNet.
-### Larger Dimensional 1 × 1 Conv Layer after GAP
+
+### 3.4 Larger Dimensional 1 × 1 Conv Layer after GAP
Since the introduction of GoogLeNet, GAP (Global-Average-Pooling) is often directly followed by a classification layer, which fails to result in further integration and processing of features extracted after GAP in the lightweight network. If a larger 1x1 convolutional layer (equivalent to the FC layer) is used after GAP, the extracted features, instead of directly passing through the classification layer, will first be integrated, and then classified. This can greatly improve the accuracy rate without affecting the inference speed of the model. The above four improvements were made to BaseNet to obtain PP-LCNet. The following table further illustrates the impact of each scheme on the results:
@@ -58,10 +83,11 @@ Since the introduction of GoogLeNet, GAP (Global-Average-Pooling) is often direc
| 1 | 1 | 1 | 0 | 59.91 | 1.85 |
| 1 | 1 | 1 | 1 | 63.14 | 2.05 |
+
+## 4. Experiments
-## Experiments
-
-### Image Classification
+
+### 4.1 Image Classification
For image classification, ImageNet dataset is adopted. Compared with the current mainstream lightweight network, PP-LCNet can obtain faster inference speed with the same accuracy. When using Baidu’s self-developed SSLD distillation strategy, the accuracy is further improved, with the Top-1 Acc of ImageNet exceeding 80% at an inference speed of about 5ms on the Intel CPU side.
@@ -75,9 +101,9 @@ For image classification, ImageNet dataset is adopted. Compared with the current
| PP-LCNet-1.5x | 4.5 | 342 | 73.71 | 91.53 | 3.19 |
| PP-LCNet-2x | 6.5 | 590 | 75.18 | 92.27 | 4.27 |
| PP-LCNet-2.5x | 9.0 | 906 | 76.60 | 93.00 | 5.39 |
-| PP-LCNet-0.25x\* | 1.9 | 47 | 66.10 | 86.46 | 2.05 |
-| PP-LCNet-0.25x\* | 3.0 | 161 | 74.39 | 92.09 | 2.46 |
-| PP-LCNet-0.25x\* | 9.0 | 906 | 80.82 | 95.33 | 5.39 |
+| PP-LCNet-0.5x\* | 1.9 | 47 | 66.10 | 86.46 | 2.05 |
+| PP-LCNet-1.0x\* | 3.0 | 161 | 74.39 | 92.09 | 2.46 |
+| PP-LCNet-2.5x\* | 9.0 | 906 | 80.82 | 95.33 | 5.39 |
\* denotes the model after using SSLD distillation.
@@ -98,8 +124,8 @@ Performance comparison with other lightweight networks:
| MobileNetV3-small-1.25x | 3.6 | 100 | 70.67 | 89.51 | 3.95 |
| PP-LCNet-1x | 3.0 | 161 | 71.32 | 90.03 | 2.46 |
-
-### Object Detection
+
+### 4.2 Object Detection
For object detection, we adopt Baidu’s self-developed PicoDet, which focuses on lightweight object detection scenarios. The following table shows the comparison between the results of PP-LCNet and MobileNetV3 on the COCO dataset. PP-LCNet has an obvious advantage in both accuracy and speed.
@@ -110,8 +136,8 @@ MobileNetV3-large-0.35x | 19.2 | 8.1 |
MobileNetV3-large-0.75x | 25.8 | 11.1 |
PP-LCNet-1x | 26.9 | 7.9 |
-
-### Semantic Segmentation
+
+### 4.3 Semantic Segmentation
For semantic segmentation, DeeplabV3+ is adopted. The following table presents the comparison between PP-LCNet and MobileNetV3 on the Cityscapes dataset, and PP-LCNet also stands out in terms of accuracy and speed.
@@ -122,11 +148,13 @@ MobileNetV3-large-0.5x | 55.42 | 135 |
MobileNetV3-large-0.75x | 64.53 | 151 |
PP-LCNet-1x | 66.03 | 96 |
-## Conclusion
+
+## 5. Conclusion
Rather than pursuing ideal FLOPs and parameter counts, as academic work often does, PP-LCNet focuses on analyzing which Intel CPU-friendly modules can be added to improve model performance, so as to better balance accuracy and inference time. The experimental conclusions are also useful to other researchers designing network structures, and offer NAS researchers a smaller search space and general conclusions. The finished PP-LCNet can also be more readily adopted in industry.
-## Reference
+
+## 6. Reference
Reference to cite when you use PP-LCNet in a paper:
```
diff --git a/docs/en/models/ReXNet_en.md b/docs/en/models/ReXNet_en.md
index df9f2ed4c8b3ded56a01b5354e9255163dec0984..ab9cd01de2f9883584121493d3b9c01c8738e831 100644
--- a/docs/en/models/ReXNet_en.md
+++ b/docs/en/models/ReXNet_en.md
@@ -1,9 +1,16 @@
# ReXNet series
+---
+## Catalogue
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
## Overview
ReXNet was proposed by NAVER AI Lab and is based on new network design principles. It targets the representational bottleneck problem in existing networks. The authors argue that conventional designs produce representational bottlenecks, which hurt model performance. To investigate the representational bottleneck, the authors study the matrix rank of the features generated by ten thousand random networks. The channel configuration across all layers is also studied in order to design more accurate network architectures. In the end, the authors propose a set of simple and effective design principles to mitigate the representational bottleneck. [paper](https://arxiv.org/pdf/2007.00992.pdf)
+
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference top1 | FLOPS (G) | Params (M) |
diff --git a/docs/en/models/RedNet.md b/docs/en/models/RedNet_en.md
similarity index 83%
rename from docs/en/models/RedNet.md
rename to docs/en/models/RedNet_en.md
index b93607f2724c375674b8ffa7d579cce6a4dc11a4..b9b5c8edc61592aeef48696e07fccaf2e0480382 100644
--- a/docs/en/models/RedNet.md
+++ b/docs/en/models/RedNet_en.md
@@ -1,11 +1,17 @@
# RedNet series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
-In the backbone of ResNet and in all bottleneck positions of backbone, the convolution is replaced by Involution, but all convolutions are reserved for channel mapping and fusion. These carefully redesigned entities combine to form a new efficient backbone network, called Rednet. [paper](https://arxiv.org/abs/2103.06255).
+
+## 1. Overview
+In the ResNet backbone, the convolution at all bottleneck positions is replaced by involution, while the convolutions used for channel mapping and fusion are retained. These carefully redesigned entities combine to form a new, efficient backbone network called RedNet. [paper](https://arxiv.org/abs/2103.06255).
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Model | Params (M) | FLOPs (G) | Top-1 (%) | Top-5 (%) |
|:---------------------:|:----------:|:---------:|:---------:|:---------:|
diff --git a/docs/en/models/RepVGG_en.md b/docs/en/models/RepVGG_en.md
index f2171a8fcdd8abb4f36f54566017b7c9ee9206fe..0a028706ebacb927a59453ca39723f8baac61efc 100644
--- a/docs/en/models/RepVGG_en.md
+++ b/docs/en/models/RepVGG_en.md
@@ -1,10 +1,17 @@
# RepVGG series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
+## 1. Overview
RepVGG (Making VGG-style ConvNets Great Again) series model is a simple but powerful convolutional neural network architecture proposed by Tsinghua University (Guiguang Ding's team), MEGVII Technology (Jian Sun et al.), HKUST and Aberystwyth University in 2021. The architecture has an inference-time body similar to VGG, composed of a stack of 3x3 convolutions and ReLU, while the training-time model has a multi-branch topology. The decoupling of the training-time and inference-time architectures is realized by a re-parameterization technique, hence the name RepVGG. [paper](https://arxiv.org/abs/2101.03697).
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference top1 | FLOPS (G) |
|:--:|:--:|:--:|:--:|:--:|
diff --git a/docs/en/models/ResNeSt_RegNet_en.md b/docs/en/models/ResNeSt_RegNet_en.md
index a2203ad9881dd89da1ee4bf9a5f9fb1a1b1be2a2..748d075b90e4d0d4a52b0b3f41c4c1128ff66539 100644
--- a/docs/en/models/ResNeSt_RegNet_en.md
+++ b/docs/en/models/ResNeSt_RegNet_en.md
@@ -1,10 +1,20 @@
-## Overview
+# ResNeSt and RegNet series
+---
+## Catalogue
+
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+* [3. Inference speed based on T4 GPU](#3)
+
+
+## 1. Overview
The ResNeSt series was proposed in 2020. It improves the original ResNet structure by introducing K groups and adding an attention module similar to SEBlock in the different groups. Its accuracy is higher than that of the baseline ResNet, while the number of parameters and the FLOPs are almost the same as those of the baseline ResNet.
RegNet was proposed in 2020 by Facebook to deepen the concept of design space. Based on AnyNetX, the model performance is gradually improved by strategies such as a shared bottleneck ratio, a shared group width, and adjusting the network depth or width. What's more, the design space structure is simplified and its interpretability is improved. The quality of the design space is improved while its diversity is maintained. Under similar conditions, the designed RegNet models perform better than EfficientNet and are up to 5 times faster.
-## Accuracy, FLOPs and Parameters
+
+## 2. Accuracy, FLOPs and Parameters
| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Parameters (M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
@@ -12,8 +22,8 @@ RegNet was proposed in 2020 by Facebook to deepen the concept of design space. B
| ResNeSt50 | 0.8083 | 0.9542| 0.8113 | -| 10.78 | 27.5 |
| RegNetX_4GF | 0.7850 | 0.9416| 0.7860 | -| 8.0 | 22.1 |
-
-## Inference speed based on T4 GPU
+
+## 3. Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16 Batch Size=1 (ms) | FP16 Batch Size=4 (ms) | FP16 Batch Size=8 (ms) | FP32 Batch Size=1 (ms) | FP32 Batch Size=4 (ms) | FP32 Batch Size=8 (ms) |
|--------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
diff --git a/docs/en/models/ResNet_and_vd_en.md b/docs/en/models/ResNet_and_vd_en.md
index 5a947081561f43d82a3322256a9fda22b68fe380..3ffeb292d64c62e9031d006c3747023ccfe931f7 100644
--- a/docs/en/models/ResNet_and_vd_en.md
+++ b/docs/en/models/ResNet_and_vd_en.md
@@ -1,6 +1,14 @@
# ResNet and ResNet_vd series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+* [3. Inference speed based on V100 GPU](#3)
+* [4. Inference speed based on T4 GPU](#4)
+
+
+## 1. Overview
The ResNet series model was proposed in 2015 and won the championship in the ILSVRC2015 competition with a top5 error rate of 3.57%. The network innovatively proposed the residual structure, and built the ResNet network by stacking multiple residual structures. Experiments show that using residual blocks can improve the convergence speed and accuracy effectively.
@@ -23,8 +31,8 @@ The FLOPS, parameters, and inference time on the T4 GPU of this series of models
As can be seen from the above curves, the higher the number of layers, the higher the accuracy, but the number of parameters, the amount of computation, and the latency increase accordingly. ResNet50_vd_ssld further improves top-1 accuracy on the ImageNet-1k validation set to 82.39% by using stronger teachers and more data, setting a new high for the ResNet50 series models.
-
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Parameters (M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
@@ -49,8 +57,8 @@ As can be seen from the above curves, the higher the number of layers, the highe
* Note: `ResNet50_vd_ssld_v2` is obtained by adding AutoAugment to the training process on top of the `ResNet50_vd_ssld` training strategy. `Fix_ResNet50_vd_ssld_v2` is obtained by freezing all parameters of `ResNet50_vd_ssld_v2` except the FC layer and fine-tuning on the ImageNet1k dataset at a resolution of 320x320.
-
-## Inference speed based on V100 GPU
+
+## 3. Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32 Batch Size=1 (ms) |
|------------------|-----------|-------------------|--------------------------|
@@ -71,8 +79,8 @@ As can be seen from the above curves, the higher the number of layers, the highe
| ResNet50_vd_ssld | 224 | 256 | 3.165 |
| ResNet101_vd_ssld | 224 | 256 | 5.252 |
-
-## Inference speed based on T4 GPU
+
+## 4. Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16 Batch Size=1 (ms) | FP16 Batch Size=4 (ms) | FP16 Batch Size=8 (ms) | FP32 Batch Size=1 (ms) | FP32 Batch Size=4 (ms) | FP32 Batch Size=8 (ms) |
|-------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
diff --git a/docs/en/models/SEResNext_and_Res2Net_en.md b/docs/en/models/SEResNext_and_Res2Net_en.md
index 4ccbce59121f5790757e665a56f4924a2fbbbba8..e574fd973d9202f0b4f0706b6ece6932917cbbee 100644
--- a/docs/en/models/SEResNext_and_Res2Net_en.md
+++ b/docs/en/models/SEResNext_and_Res2Net_en.md
@@ -1,6 +1,14 @@
# SEResNeXt and Res2Net series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+* [3. Inference speed based on V100 GPU](#3)
+* [4. Inference speed based on T4 GPU](#4)
+
+
+## 1. Overview
ResNeXt, one of the typical variants of ResNet, was presented at the CVPR conference in 2017. Prior to this, methods for improving model accuracy mainly focused on deepening or widening the network, which increased the number of parameters and the amount of computation, and slowed down inference accordingly. ResNeXt introduced the concept of cardinality. Through experiments, the authors found that increasing the number of channel groups was more effective than increasing the depth or width. It improves accuracy without increasing parameter complexity while also reducing the number of parameters, so it is a more successful variant of ResNet.
@@ -23,8 +31,8 @@ The FLOPS, parameters, and inference time on the T4 GPU of this series of models
At present, there are a total of 24 pretrained models of the three categories open sourced by PaddleClas, and the indicators are shown in the figure. It can be seen from the diagram that under the same Flops and Params, the improved model tends to have higher accuracy, but the inference speed is often inferior to the ResNet series. On the other hand, Res2Net performed better. Compared with group operation in ResNeXt and SE structure operation in SEResNet, Res2Net tended to have better accuracy in the same Flops, Params and inference speed.
-
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Parameters (M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
@@ -57,8 +65,8 @@ At present, there are a total of 24 pretrained models of the three categories op
| SENet154_vd | 0.814 | 0.955 | | | 45.830 | 114.290 |
-
-## Inference speed based on V100 GPU
+
+## 3. Inference speed based on V100 GPU
| Models | Crop Size | Resize Short Size | FP32 Batch Size=1 (ms) |
|-----------------------|-----------|-------------------|--------------------------|
@@ -87,8 +95,8 @@ At present, there are a total of 24 pretrained models of the three categories op
| SE_ResNeXt101_32x4d | 224 | 256 | 19.204 |
| SENet154_vd | 224 | 256 | 50.406 |
-
-## Inference speed based on T4 GPU
+
+## 4. Inference speed based on T4 GPU
| Models | Crop Size | Resize Short Size | FP16 Batch Size=1 (ms) | FP16 Batch Size=4 (ms) | FP16 Batch Size=8 (ms) | FP32 Batch Size=1 (ms) | FP32 Batch Size=4 (ms) | FP32 Batch Size=8 (ms) |
|-----------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
diff --git a/docs/en/models/SwinTransformer_en.md b/docs/en/models/SwinTransformer_en.md
index 11d45d6c401c57f31d586b6c740d968b304c3574..95afaaf65ccab342b86a8508739b7e8505b02cee 100644
--- a/docs/en/models/SwinTransformer_en.md
+++ b/docs/en/models/SwinTransformer_en.md
@@ -1,10 +1,16 @@
# SwinTransformer
+---
+## Catalogue
-## Overview
-Swin Transformer a new vision Transformer, that capably serves as a general-purpose backbone for computer vision. It is a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. [Paper](https://arxiv.org/abs/2103.14030)。
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+## 1. Overview
+Swin Transformer is a new vision Transformer that capably serves as a general-purpose backbone for computer vision. It is a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. [Paper](https://arxiv.org/abs/2103.14030).
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Params (M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
diff --git a/docs/en/models/TNT.md b/docs/en/models/TNT_en.md
similarity index 85%
rename from docs/en/models/TNT.md
rename to docs/en/models/TNT_en.md
index 7e20edab4d5309653e15c7fbd84004e49bb83d81..abdcfbaa0531911ecb404a04d52a8d0dbd972344 100644
--- a/docs/en/models/TNT.md
+++ b/docs/en/models/TNT_en.md
@@ -1,12 +1,18 @@
# TNT series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
-TNT(Transformer-iN-Transformer) series models were proposed by Huawei-Noah in 2021 for modeling both patch-level and pixel-level representation. In each TNT block, an outer transformer block is utilized to process patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embedding by a linear transformation layer and then added into the patch. By stacking the TNT blocks, we build the TNT model for image recognition. Experiments on ImageNet benchmark and downstream tasks demonstrate the superiority and efficiency of the proposed TNT architecture. For example, our TNT achieves 81.3% top-1 accuracy on ImageNet which is 1.5% higher than that of DeiT with similar computational cost. [Paper](https://arxiv.org/abs/2103.00112).
+
+## 1. Overview
+TNT (Transformer-iN-Transformer) series models were proposed by Huawei-Noah in 2021 for modeling both patch-level and pixel-level representation. In each TNT block, an outer transformer block is utilized to process patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embedding by a linear transformation layer and then added into the patch. The TNT model for image recognition is built by stacking the TNT blocks. Experiments on ImageNet benchmark and downstream tasks demonstrate the superiority and efficiency of the proposed TNT architecture. For example, TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5% higher than that of DeiT with similar computational cost. [Paper](https://arxiv.org/abs/2103.00112).
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Model | Params (M) | FLOPs (G) | Top-1 (%) | Top-5 (%) |
|:---------------------:|:----------:|:---------:|:---------:|:---------:|
diff --git a/docs/en/models/Twins.md b/docs/en/models/Twins_en.md
similarity index 88%
rename from docs/en/models/Twins.md
rename to docs/en/models/Twins_en.md
index ccd83e44a47c99ed3c95481c30a682068cb17ff6..f86f537c7a91abdbc79968f6934fd979c90565f2 100644
--- a/docs/en/models/Twins.md
+++ b/docs/en/models/Twins_en.md
@@ -1,9 +1,16 @@
# Twins
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
+## 1. Overview
The Twins network includes Twins-PCPVT and Twins-SVT, which focus on the meticulous design of the spatial attention mechanism, resulting in a simple but more effective solution. Since the architecture only involves matrix multiplication, and current deep learning frameworks are highly optimized for matrix multiplication, the architecture is very efficient and easy to implement. Moreover, this architecture achieves excellent performance in a variety of downstream vision tasks such as image classification, object detection, and semantic segmentation. [Paper](https://arxiv.org/abs/2104.13840).
-## Accuracy, FLOPs and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPs (G) | Params (M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
diff --git a/docs/en/models/ViT_and_DeiT_en.md b/docs/en/models/ViT_and_DeiT_en.md
index ac275d9b4a5e9c653d0bd30c1a322505440f441c..789ad86c0ccf91213aa0bc4d2a225c0dfad43a06 100644
--- a/docs/en/models/ViT_and_DeiT_en.md
+++ b/docs/en/models/ViT_and_DeiT_en.md
@@ -1,13 +1,19 @@
# ViT and DeiT series
+---
+## Catalogue
-## Overview
+* [1. Overview](#1)
+* [2. Accuracy, FLOPS and Parameters](#2)
+
+
+## 1. Overview
ViT (Vision Transformer) series models were proposed by Google in 2020. These models use only the standard transformer structure and completely abandon the convolution structure: the image is split into multiple patches, which are then fed into the transformer, showing the potential of transformers in the CV field. [Paper](https://arxiv.org/abs/2010.11929).
DeiT (Data-efficient Image Transformers) series models were proposed by Facebook at the end of 2020. To address the problem that ViT models need large-scale datasets for training, DeiT improves on them and achieves 83.1% Top1 accuracy on ImageNet. More importantly, by using a convolutional model as the teacher and performing knowledge distillation on these models, a Top1 accuracy of 85.2% can be achieved on the ImageNet dataset.
-
-## Accuracy, FLOPS and Parameters
+
+## 2. Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Params (M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
diff --git a/docs/en/tutorials/config_description_en.md b/docs/en/models_training/config_description_en.md
similarity index 94%
rename from docs/en/tutorials/config_description_en.md
rename to docs/en/models_training/config_description_en.md
index d510df753038f8cc2e766ee1f7a12c05ba5ce677..d0025476e8d2a04bb900e7e4e1556d3168f73539 100644
--- a/docs/en/tutorials/config_description_en.md
+++ b/docs/en/models_training/config_description_en.md
@@ -8,11 +8,35 @@ The parameters in the PaddleClas configuration file(`ppcls/configs/*.yaml`)are d
## Details
+### Catalogue
+
+- [1. Classification model](#1)
+ - [1.1 Global Configuration](#1.1)
+ - [1.2 Architecture](#1.2)
+ - [1.3 Loss function](#1.3)
+ - [1.4 Optimizer](#1.4)
+ - [1.5 Data reading module(DataLoader)](#1.5)
+ - [1.5.1 dataset](#1.5.1)
+ - [1.5.2 sampler](#1.5.2)
+ - [1.5.3 loader](#1.5.3)
+ - [1.6 Evaluation metric](#1.6)
+ - [1.7 Inference](#1.7)
+- [2. Distillation model](#2)
+ - [2.1 Architecture](#2.1)
+ - [2.2 Loss function](#2.2)
+ - [2.3 Evaluation metric](#2.3)
+- [3. Recognition model](#3)
+ - [3.1 Architecture](#3.1)
+ - [3.2 Evaluation metric](#3.2)
+
+
+
### 1. Classification model
Here the configuration of `ResNet50_vd` on`ImageNet-1k`is used as an example to explain each parameter in detail. [Configure Path](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml).
-#### 1.1Global Configuration
+
+#### 1.1 Global Configuration
| Parameter name | Specific meaning | Default value | Optional value |
| ------------------ | ------------------------------------------------------- | ---------------- | ----------------- |
@@ -31,6 +55,7 @@ Here the configuration of `ResNet50_vd` on`ImageNet-1k`is used as an example to
**Note**: The http address of a pre-trained model can be filled in as `pretrained_model`.
+
#### 1.2 Architecture
| Parameter name | Specific meaning | Default value | Optional value |
@@ -41,6 +66,7 @@ Here the configuration of `ResNet50_vd` on`ImageNet-1k`is used as an example to
**Note**: Here `pretrained` can be set to True or False, or to the path of the weights. In addition, `pretrained` is disabled when `Global.pretrained_model` is also set to a path.
+
#### 1.3 Loss function
| Parameter name | Specific meaning | Default value | Optional value |
@@ -49,6 +75,7 @@ Here the configuration of `ResNet50_vd` on`ImageNet-1k`is used as an example to
| CELoss.weight | The weight of CELoss in the whole Loss | 1.0 | float |
| CELoss.epsilon | The epsilon value of label_smooth in CELoss | 0.1 | float, between 0 and 1 |
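
To make the role of `CELoss.epsilon` concrete, here is a minimal sketch of label smoothing in its common uniform-prior form. It is an illustration only: the function name `smooth_labels` is hypothetical, and whether PaddleClas computes the smoothed target exactly this way is an assumption.

```python
# Illustrative sketch of uniform label smoothing; `smooth_labels` is a
# hypothetical helper, not a PaddleClas function.
def smooth_labels(label, num_classes, epsilon=0.1):
    """Return a smoothed one-hot target for a single integer label."""
    off_value = epsilon / num_classes        # mass spread over every class
    on_value = (1.0 - epsilon) + off_value   # remaining mass on the true class
    return [on_value if i == label else off_value for i in range(num_classes)]


target = smooth_labels(label=2, num_classes=4, epsilon=0.1)
print(target)
```

With `epsilon` set to 0 the target reduces to the ordinary one-hot vector; larger values move more probability mass to the other classes.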
+
#### 1.4 Optimizer
| Parameter name | Specific meaning | Default value | Optional value |
@@ -73,8 +100,10 @@ Here the configuration of `ResNet50_vd` on`ImageNet-1k`is used as an example to
Refer to [learning_rate.py](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/optimizer/learning_rate.py) for how to add a method and its parameters.
-#### 1.5 Data reading module(DataLoader)
+
+#### 1.5 Data reading module(DataLoader)
+
##### 1.5.1 dataset
| Parameter name | Specific meaning | Default value | Optional value |
@@ -106,6 +135,7 @@ The parameter meaning of batch_transform_ops:
| ------------- | -------------- | --------------------------------------- |
| MixupOperator | alpha | Mixup parameter value; the larger the value, the stronger the augmentation |
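
The effect of `alpha` can be sketched with the standard Mixup formulation, in which the mixing coefficient is drawn from a Beta(alpha, alpha) distribution. The code below is a plain-Python illustration under that assumption; `mixup_pair` and the value `alpha=0.2` are hypothetical and are not taken from the PaddleClas source.

```python
import random


# Illustrative sketch of Mixup on two samples; `mixup_pair` is a hypothetical
# helper, and alpha=0.2 is an arbitrary example value.
def mixup_pair(x_a, x_b, y_a, y_b, alpha=0.2, rng=random):
    """Blend two inputs and their one-hot labels with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_x, mixed_y, lam


rng = random.Random(0)  # seeded generator so the sketch is repeatable
mixed_x, mixed_y, lam = mixup_pair([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], rng=rng)
print(lam, mixed_x, mixed_y)
```

For small `alpha`, the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the originals; larger `alpha` pushes the coefficient toward 0.5, which blends the two samples more strongly.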
+
##### 1.5.2 sampler
| Parameter name | Specific meaning | Default value | Optional value |
@@ -114,7 +144,7 @@ The parameter meaning of batch_transform_ops:
| batch_size | batch size | 64 | int |
| drop_last | Whether to drop the last batch if it does not reach the batch size | False | bool |
| shuffle | Whether to shuffle the data | True | bool |
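
The interaction of `batch_size`, `drop_last` and `shuffle` can be sketched in plain Python as follows. This is only an illustration of the semantics described in the table; `make_batches` and its `seed` argument are hypothetical and do not correspond to a PaddleClas API.

```python
import random


# Illustrative sketch of a batch sampler; `make_batches` is a hypothetical
# helper, not the sampler class used by PaddleClas.
def make_batches(num_samples, batch_size=64, drop_last=False, shuffle=True, seed=0):
    """Split sample indices into batches, optionally shuffled."""
    indices = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    batches = [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]
    # Drop the trailing batch only when it is smaller than batch_size.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


print(len(make_batches(130, drop_last=False)))  # 3 batches; the last holds 2 samples
print(len(make_batches(130, drop_last=True)))   # 2 batches; the incomplete one is dropped
```
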
-
+
##### 1.5.3 loader
| Parameter name | Specific meaning | Default value | Optional value |
@@ -122,12 +152,14 @@ The parameter meaning of batch_transform_ops:
| num_workers | Number of data read threads | 4 | int |
| use_shared_memory | Whether to use shared memory | True | bool |
+
#### 1.6 Evaluation metric
| Parameter name | Specific meaning | Default value | Optional value |
| -------------- | ---------------- | --------------- | ---------------- |
| TopkAcc | TopkAcc | [1, 5] | list, int |
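
As a reminder of what a `[1, 5]` setting measures, the sketch below computes top-k accuracy for a small hand-made batch. It is a plain-Python illustration; `topk_accuracy` is a hypothetical helper, not the PaddleClas metric class.

```python
# Illustrative sketch of top-k accuracy; `topk_accuracy` is a hypothetical
# helper and the scores below are made-up example data.
def topk_accuracy(scores, labels, topk=(1, 5)):
    """Fraction of samples whose true label is among the k highest scores."""
    results = {}
    for k in topk:
        hits = 0
        for row, label in zip(scores, labels):
            ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
            if label in ranked[:k]:
                hits += 1
        results[k] = hits / len(labels)
    return results


scores = [
    [0.1, 0.6, 0.3],  # true label 1: correct at top-1
    [0.5, 0.2, 0.3],  # true label 2: correct only at top-2
    [0.2, 0.3, 0.5],  # true label 0: missed at both top-1 and top-2
]
labels = [1, 2, 0]
accuracy = topk_accuracy(scores, labels, topk=(1, 2))
print(accuracy)
```
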
+
#### 1.7 Inference
| Parameter name | Specific meaning | Default value | Optional value |
@@ -140,10 +172,12 @@ The parameter meaning of batch_transform_ops:
**Note**: For the interpretation of `transforms` in the Infer module, refer to the interpretation of `transform_ops` in the dataset part of the data reading module.
-### 2.Distillation model
+
+### 2. Distillation model
**Note**: Here the training configuration of `MobileNetV3_large_x1_0` distilling `MobileNetV3_small_x1_0` on `ImageNet-1k` is used as an example to explain the meaning of each parameter in detail. [Configure path](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/configs/ImageNet/Distillation/mv3_large_x1_0_distill_mv3_small_x1_0.yaml). Only parameters that are distinct from the classification model are introduced here.
+
#### 2.1 Architecture
| Parameter name | Specific meaning | Default value | Optional value |
@@ -169,6 +203,7 @@ The parameter meaning of batch_transform_ops:
2. Student's parameters are similar and will not be repeated.
+
#### 2.2 Loss function
| Parameter name | Specific meaning | Default value | Optional value |
@@ -180,6 +215,7 @@ The parameter meaning of batch_transform_ops:
| DistillationGTCELoss.weight | Loss weight | 1.0 | float |
| DistillationCELoss.model_names | Model names with real label for cross-entropy | ["Student"] | —— |
+
#### 2.3 Evaluation metric
| Parameter name | Specific meaning | Default value | Optional value |
@@ -190,10 +226,12 @@ The parameter meaning of batch_transform_ops:
**Note**: `DistillationTopkAcc` has the same meaning as `TopkAcc`, except that it is only used in distillation tasks.
+
### 3. Recognition model
**Note**: The training configuration of `ResNet50` on `LogoDet-3k` is used here as an example to explain the meaning of each parameter in detail. [configure path](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/configs/Logo/ResNet50_ReID.yaml). Only parameters that are distinct from the classification model are presented here.
+
#### 3.1 Architecture
| Parameter name | Specific meaning | Default value | Optional value |
@@ -223,7 +261,7 @@ The parameter meaning of batch_transform_ops:
-
+
#### 3.2 Evaluation metric
| Parameter name | Specific meaning | Default value | Optional value |
diff --git a/docs/en/competition_support_en.md b/docs/en/others/competition_support_en.md
similarity index 100%
rename from docs/en/competition_support_en.md
rename to docs/en/others/competition_support_en.md
diff --git a/docs/en/update_history_en.md b/docs/en/others/update_history_en.md
similarity index 100%
rename from docs/en/update_history_en.md
rename to docs/en/others/update_history_en.md
diff --git a/docs/zh_CN/models_training/config_description.md b/docs/zh_CN/models_training/config_description.md
index 6c73b838aedde6976ab892e2115709389d2bd0bb..8c51d7abcfe5baeb19cad1043f73df10832dcada 100644
--- a/docs/zh_CN/models_training/config_description.md
+++ b/docs/zh_CN/models_training/config_description.md
@@ -22,6 +22,7 @@
- [1.5.2 sampler](#1.5.2)
- [1.5.3 loader](#1.5.3)
 - [1.6 Evaluation metric (Metric)](#1.6)
+ - [1.7 Inference](#1.7)
- [2. Distillation model](#2)
 - [2.1 Architecture (Arch)](#2.1)
 - [2.2 Loss function (Loss)](#2.2)