diff --git a/docs/en/models/LeViT_en.md b/docs/en/models/LeViT_en.md
new file mode 100644
index 0000000000000000000000000000000000000000..7fd953aca91947cb3acd134c3119dcd0fbf5d2df
--- /dev/null
+++ b/docs/en/models/LeViT_en.md
@@ -0,0 +1,17 @@
+# LeViT series
+
+## Overview
+LeViT is a hybrid neural network for fast-inference image classification. Its design accounts for how the model performs on different hardware platforms, so it better reflects real application scenarios. Through extensive experiments, the authors found an effective way to combine convolutional networks with Transformers, and proposed an attention-based method for integrating positional information into the Transformer. [Paper](https://arxiv.org/abs/2104.01136).
+
+## Accuracy, FLOPS and Parameters
+
+| Models | Top1 | Top5 | Reference
top1 | Reference
top5 | FLOPS
(M) | Params
(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| LeViT-128S | 0.7598 | 0.9269 | 0.766 | 0.929 | 305 | 7.8 |
+| LeViT-128 | 0.7810 | 0.9371 | 0.786 | 0.940 | 406 | 9.2 |
+| LeViT-192 | 0.7934 | 0.9446 | 0.800 | 0.947 | 658 | 11 |
+| LeViT-256 | 0.8085 | 0.9497 | 0.816 | 0.954 | 1120 | 19 |
+| LeViT-384 | 0.8191 | 0.9551 | 0.826 | 0.960 | 2353 | 39 |
+
+
+**Note**: The difference in accuracy from the Reference is due to differences in data preprocessing and to not using the distillation head as output.
diff --git a/docs/en/models/Twins.md b/docs/en/models/Twins.md
new file mode 100644
index 0000000000000000000000000000000000000000..69e7054486cfc9fb22415c4438a9c02e9eae3a4a
--- /dev/null
+++ b/docs/en/models/Twins.md
@@ -0,0 +1,17 @@
+# Twins
+
+## Overview
+The Twins family comprises Twins-PCPVT and Twins-SVT. It focuses on a careful design of the spatial attention mechanism and arrives at a simple but effective solution. Because the architecture involves only matrix multiplications, for which current deep learning frameworks are highly optimized, it is both efficient and easy to implement. It also delivers excellent performance on a range of downstream vision tasks such as image classification, object detection, and semantic segmentation. [Paper](https://arxiv.org/abs/2104.13840).
+
+## Accuracy, FLOPS and Parameters
+
+| Models | Top1 | Top5 | Reference
top1 | Reference
top5 | FLOPS
(G) | Params
(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| pcpvt_small | 0.8082 | 0.9552 | 0.812 | - | 3.7 | 24.1 |
+| pcpvt_base | 0.8242 | 0.9619 | 0.827 | - | 6.4 | 43.8 |
+| pcpvt_large | 0.8273 | 0.9650 | 0.831 | - | 9.5 | 60.9 |
+| alt_gvt_small | 0.8140 | 0.9546 | 0.817 | - | 2.8 | 24 |
+| alt_gvt_base | 0.8294 | 0.9621 | 0.832 | - | 8.3 | 56 |
+| alt_gvt_large | 0.8331 | 0.9642 | 0.837 | - | 14.8 | 99.2 |
+
+**Note**: The difference in accuracy from the Reference is due to differences in data preprocessing.
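
The Twins overview claims that the architecture reduces to matrix multiplications, which is why it maps well onto the optimized GEMM kernels in current frameworks. A minimal NumPy sketch of single-head scaled dot-product attention illustrates the claim; this is an illustrative sketch only, not the PaddleClas implementation, and the function name and shapes are assumptions:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: two matrix multiplications plus a row-wise softmax.

    q, k, v: arrays of shape (seq_len, dim).
    """
    scores = (q @ k.T) / np.sqrt(q.shape[-1])       # (seq_len, seq_len) GEMM
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (seq_len, dim) GEMM

# Example: 196 tokens (a 14x14 feature map) with 64 channels.
rng = np.random.default_rng(0)
q = rng.standard_normal((196, 64))
k = rng.standard_normal((196, 64))
v = rng.standard_normal((196, 64))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (196, 64)
```

Everything outside the softmax is a plain matrix product, so the per-block cost is dominated by the two GEMMs, which is the efficiency argument made above.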