Add ViT and DeiT (#579)

* Update the description of ViT and DeiT
* Fix the erroneous file
* Unify the data format

3a572f95 · Tingquan Gao · GitHub · 1c81cc7d · 3a572f95 · 3a572f95
5 changed files
--- a/README.md
+++ b/README.md
@@ -7,6 +7,7 @@
 PaddleClas is a toolset for image classification tasks prepared for the industry and academia. It helps users train better computer vision models and apply them in real scenarios.

 **Recent update**
+- 2021.01.27 Add `ViT` and `DeiT` pretrained models. On the ImageNet-1k dataset, `ViT` reaches a Top-1 Acc of 81.05% and `DeiT` reaches 85.5%.
 - 2021.01.08 Add support for whl package and its usage, Model inference can be done by simply install paddleclas using pip.
 - 2020.12.16 Add support for TensorRT when using cpp inference to obain more obvious acceleration.
 - 2020.12.06 Add `SE_HRNet_W64_C_ssld` pretrained model, whose Top-1 Acc on ImageNet-1k dataset reaches 84.75%.
@@ -66,6 +67,7 @@ PaddleClas is a toolset for image classification tasks prepared for the industry
    - [Inception series](#Inception_series)
    - [EfficientNet and ResNeXt101_wsl series](#EfficientNet_and_ResNeXt101_wsl_series)
    - [ResNeSt and RegNet series](#ResNeSt_and_RegNet_series)
+    - [Transformer series](#Transformer)
    - [Others](#Others)
    - HS-ResNet: arxiv link: [https://arxiv.org/pdf/2010.07621.pdf](https://arxiv.org/pdf/2010.07621.pdf). Code and models are coming soon!
 - Model training/evaluation
@@ -351,6 +353,37 @@ Accuracy and inference time metrics of ResNeSt and RegNet series models are show
 | RegNetX_4GF            | 0.785     | 0.9416    |    6.46478              |      11.19862           | 8        | 22.1      | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/RegNetX_4GF_pretrained.pdparams)            |


+<a name="Transformer"></a>
+### Transformer series
+
+Accuracy and inference time metrics of the ViT and DeiT series models are shown below. For more details, see the [Transformer series tutorial](./docs/en/models/Transformer.md).
+
+
+| Model                    | Top-1 Acc | Top-5 Acc | time(ms)<br>bs=1 | time(ms)<br>bs=4 | Flops(G) | Params(M) | Download Address |
+|------------------------|-----------|-----------|------------------|------------------|----------|------------------------|------------------------|
+| ViT_small_<br/>patch16_224 | 0.7727  | 0.9319   | -                | -                |      |  | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_small_patch16_224_pretrained.pdparams) |
+| ViT_base_<br/>patch16_224 | 0.8176   | 0.9613   | -    | -                |     | 86 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_base_patch16_224_pretrained.pdparams) |
+| ViT_base_<br/>patch16_384 | 0.8393  | 0.9710   |    -              |      -           |         |  | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_base_patch16_384_pretrained.pdparams) |
+| ViT_base_<br/>patch32_384 | 0.8124   | 0.9598   | - | - |  |  | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_base_patch32_384_pretrained.pdparams) |
+| ViT_large_<br/>patch16_224 | 0.8325  | 0.9658   | - | - |  | 307 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_large_patch16_224_pretrained.pdparams) |
+| ViT_large_<br/>patch16_384 | 0.8507  | 0.9741  | - | - |  |  | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_large_patch16_384_pretrained.pdparams) |
+| ViT_large_<br/>patch32_384 | 0.8105   | 0.9596  | - | - |  |  | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_large_patch32_384_pretrained.pdparams) |
+|                            |           |           |                  |                  |          |           |                                                              |
+
+
+| Model                    | Top-1 Acc | Top-5 Acc | time(ms)<br>bs=1 | time(ms)<br>bs=4 | Flops(G) | Params(M) | Download Address |
+|------------------------|-----------|-----------|------------------|------------------|----------|------------------------|------------------------|
+| DeiT_tiny_<br>patch16_224 | 0.709 | 0.906 | -                | -                |      | 5 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_tiny_patch16_224_pretrained.pdparams) |
+| DeiT_small_<br>patch16_224 | 0.794 | 0.948 | -    | -                |     | 22 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_small_patch16_224_pretrained.pdparams) |
+| DeiT_base_<br>patch16_224 | 0.816 | 0.955 |    -              |      -           |         | 86 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_base_patch16_224_pretrained.pdparams) |
+| DeiT_base_<br>patch16_384 | 0.831 | 0.962 | - | - |  | 87 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_base_patch16_384_pretrained.pdparams) |
+| DeiT_tiny_<br>distilled_patch16_224 | 0.736 | 0.915 | - | - |  | 6 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_tiny_distilled_patch16_224_pretrained.pdparams) |
+| DeiT_small_<br>distilled_patch16_224 | 0.810 | 0.953 | - | - |  | 22 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_small_distilled_patch16_224_pretrained.pdparams) |
+| DeiT_base_<br>distilled_patch16_224 | 0.830 | 0.963 | - | - |  | 87 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_base_distilled_patch16_224_pretrained.pdparams) |
+| DeiT_base_<br>distilled_patch16_384 | 0.855 | 0.974 | - | - |  | 88 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_base_distilled_patch16_384_pretrained.pdparams) |
+|  |  |  |  |  |  |  |  |
+
+
 <a name="Others"></a>

 ### Others

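The download links in the two tables above all appear to follow one pattern: a common base URL, then the model name with the table's `<br>` line breaks removed, then `_pretrained.pdparams`. A minimal Python sketch of that pattern, assuming it holds for every row. `pretrained_url` is a hypothetical helper for illustration, not part of PaddleClas:

```python
import re

# Base URL copied from the download links in the tables above.
BASE_URL = "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph"


def pretrained_url(table_name: str) -> str:
    """Build a weights URL from a model name as written in the table.

    Hypothetical helper: it assumes every link follows the
    <base>/<model>_pretrained.pdparams pattern seen in the tables.
    """
    # Table cells break long names with <br> or <br/>; strip those tags.
    model = re.sub(r"<br\s*/?>", "", table_name).strip()
    return f"{BASE_URL}/{model}_pretrained.pdparams"


if __name__ == "__main__":
    print(pretrained_url("ViT_base_<br/>patch16_224"))
    print(pretrained_url("DeiT_base_<br>distilled_patch16_384"))
```

The helper only builds the string; it does not check that the file exists at that address.
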
--- a/README_cn.md
+++ b/README_cn.md
@@ -8,6 +8,7 @@


 **近期更新**
+- 2021.01.27 添加`ViT`与`DeiT`模型，在ImageNet-1k上，`ViT`模型Top-1 Acc可达81.05%，`DeiT`模型可达85.5%。
 - 2021.01.08 添加whl包及其使用说明，直接安装paddleclas whl包，即可快速完成模型预测。
 - 2020.12.16 添加对cpp预测的tensorRT支持，预测加速更明显。
 - 2020.12.06 添加`SE_HRNet_W64_C_ssld`模型，在ImageNet-1k上Top-1 Acc可达84.75%。
@@ -66,6 +67,7 @@
    - [Inception系列](#Inception系列)
    - [EfficientNet与ResNeXt101_wsl系列](#EfficientNet与ResNeXt101_wsl系列)
    - [ResNeSt与RegNet系列](#ResNeSt与RegNet系列)
+    - [Transformer系列](#Transformer系列)
    - [其他模型](#其他模型)
    - HS-ResNet: arxiv文章链接: [https://arxiv.org/pdf/2010.07621.pdf](https://arxiv.org/pdf/2010.07621.pdf)。 代码和预训练模型即将开源，敬请期待。
 - 模型训练/评估
@@ -296,7 +298,7 @@ HRNet系列模型的精度、速度指标如下表所示，更多关于该系列
 | HRNet_W48_C | 0.7895    | 0.9442    | 13.70761         | 34.43572         | 34.58    | 77.47     | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/HRNet_W48_C_pretrained.pdparams) |
 | HRNet_W48_C_ssld | 0.8363    | 0.9682    | 13.70761         | 34.43572         | 34.58    | 77.47     | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/HRNet_W48_C_ssld_pretrained.pdparams) |
 | HRNet_W64_C | 0.7930    | 0.9461    | 17.57527         | 47.9533          | 57.83    | 128.06    | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/HRNet_W64_C_pretrained.pdparams) |
-| SE_HRNet_W64_C_ssld | 0.8475    |  0.9726    |    31.69770      |     94.99546      | 57.83    | 128.97    | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/SE_HRNet_W64_C_ssld_pretrained.pdparams) |
+| SE_HRNet_W64_C_ssld | 0.8475    |  0.9726    |    31.69770      |     94.99546      | 57.83    | 128.97    | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/SE_HRNet_W64_C_ssld_pretrained.pdparams) |


 <a name="Inception系列"></a>
@@ -352,6 +354,38 @@ ResNeSt与RegNet系列模型的精度、速度指标如下表所示，更多关
 | ResNeSt50              | 0.8083    | 0.9542    | 6.69042    | 8.01664                | 10.78    | 27.5      | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ResNeSt50_pretrained.pdparams)              |
 | RegNetX_4GF            | 0.785     | 0.9416    |    6.46478              |      11.19862           | 8        | 22.1      | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/RegNetX_4GF_pretrained.pdparams)            |

+
+<a name="Transformer系列"></a>
+### Transformer系列
+
+ViT（Vision Transformer）与DeiT（Data-efficient Image Transformers）系列模型的精度、速度指标如下表所示. 更多关于该系列模型的介绍可以参考： [Transformer系列模型文档](./docs/zh_CN/models/Transformer.md)。
+
+
+| 模型                  | Top-1 Acc | Top-5 Acc | time(ms)<br>bs=1 | time(ms)<br>bs=4 | Flops(G) | Params(M) | 下载地址 |
+|------------------------|-----------|-----------|------------------|------------------|----------|------------------------|------------------------|
+| ViT_small_<br/>patch16_224 | 0.7727  | 0.9319   | -                | -                |      |  | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_small_patch16_224_pretrained.pdparams) |
+| ViT_base_<br/>patch16_224 | 0.8176   | 0.9613   | -    | -                |     | 86 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_base_patch16_224_pretrained.pdparams) |
+| ViT_base_<br/>patch16_384 | 0.8393  | 0.9710   |    -              |      -           |         |  | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_base_patch16_384_pretrained.pdparams) |
+| ViT_base_<br/>patch32_384 | 0.8124   | 0.9598   | - | - |  |  | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_base_patch32_384_pretrained.pdparams) |
+| ViT_large_<br/>patch16_224 | 0.8325  | 0.9658   | - | - |  | 307 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_large_patch16_224_pretrained.pdparams) |
+| ViT_large_<br/>patch16_384 | 0.8507  | 0.9741  | - | - |  |  | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_large_patch16_384_pretrained.pdparams) |
+| ViT_large_<br/>patch32_384 | 0.8105   | 0.9596  | - | - |  |  | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_large_patch32_384_pretrained.pdparams) |
+|                            |           |           |                  |                  |          |           |                                                              |
+
+
+| 模型                  | Top-1 Acc | Top-5 Acc | time(ms)<br>bs=1 | time(ms)<br>bs=4 | Flops(G) | Params(M) | 下载地址 |
+|------------------------|-----------|-----------|------------------|------------------|----------|------------------------|------------------------|
+| DeiT_tiny_<br>patch16_224 | 0.709 | 0.906 | -                | -                |      | 5 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_tiny_patch16_224_pretrained.pdparams) |
+| DeiT_small_<br>patch16_224 | 0.794 | 0.948 | -    | -                |     | 22 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_small_patch16_224_pretrained.pdparams) |
+| DeiT_base_<br>patch16_224 | 0.816 | 0.955 |    -              |      -           |         | 86 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_base_patch16_224_pretrained.pdparams) |
+| DeiT_base_<br>patch16_384 | 0.831 | 0.962 | - | - |  | 87 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_base_patch16_384_pretrained.pdparams) |
+| DeiT_tiny_<br>distilled_patch16_224 | 0.736 | 0.915 | - | - |  | 6 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_tiny_distilled_patch16_224_pretrained.pdparams) |
+| DeiT_small_<br>distilled_patch16_224 | 0.810 | 0.953 | - | - |  | 22 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_small_distilled_patch16_224_pretrained.pdparams) |
+| DeiT_base_<br>distilled_patch16_224 | 0.830 | 0.963 | - | - |  | 87 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_base_distilled_patch16_224_pretrained.pdparams) |
+| DeiT_base_<br>distilled_patch16_384 | 0.855 | 0.974 | - | - |  | 88 | [下载链接](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/DeiT_base_distilled_patch16_384_pretrained.pdparams) |
+|  |  |  |  |  |  |  |  |
+
+
 <a name="其他模型"></a>

 ### 其他模型

--- a/configs/DeiT/ViT_small_patch16_224.yaml
+++ b/configs/DeiT/ViT_small_patch16_224.yaml
-mode: 'train'
-ARCHITECTURE:
-    name: 'ViT_small_patch16_224'
-
-pretrained_model: ""
-model_save_dir: "./output/"
-classes_num: 1000
-total_images: 1281167
-save_interval: 1
-validate: True
-valid_interval: 1
-epochs: 120
-topk: 5
-image_shape: [3, 224, 224]
-
-use_mix: False
-ls_epsilon: -1
-
-LEARNING_RATE:
-    function: 'Cosine'          
-    params:                   
-        lr: 0.01               
-
-OPTIMIZER:
-    function: 'Momentum'
-    params:
-        momentum: 0.9
-    regularizer:
-        function: 'L2'
-        factor: 0.000100
-
-TRAIN:
-    batch_size: 64
-    num_workers: 4
-    file_list: "./dataset/ILSVRC2012/train_list.txt"
-    data_dir: "./dataset/ILSVRC2012/"
-    shuffle_seed: 0
-    transforms:
-        - DecodeImage:
-            to_rgb: True
-            to_np: False
-            channel_first: False
-        - RandCropImage:
-            size: 224
-        - RandFlipImage:
-            flip_code: 1
-        - NormalizeImage:
-            scale: 1./255.
-            mean: [0.485, 0.456, 0.406]
-            std: [0.229, 0.224, 0.225]
-            order: ''
-        - ToCHWImage:
-
-VALID:
-    batch_size: 64
-    num_workers: 4
-    file_list: "./dataset/ILSVRC2012/val_list.txt"
-    data_dir: "./dataset/ILSVRC2012/"
-    shuffle_seed: 0
-    transforms:
-        - DecodeImage:
-            to_rgb: True
-            to_np: False
-            channel_first: False
-        - ResizeImage:
-            size: 248
-        - CropImage:
-            size: 224
-        - NormalizeImage:
-            scale: 1.0/255.0
-            mean: [0.485, 0.456, 0.406]
-            std: [0.229, 0.224, 0.225]
-            order: ''
-        - ToCHWImage:
--- a/docs/en/models/Transformer.md
+++ b/docs/en/models/Transformer.md
+# ViT and DeiT series
+
+## Overview
+
+ViT (Vision Transformer) series models were proposed by Google in 2020. They use only the standard Transformer structure and abandon convolution entirely: the image is split into multiple patches, which are then fed into the Transformer. The results show the potential of Transformers in the CV field. [Paper](https://arxiv.org/abs/2010.11929).
+
+DeiT (Data-efficient Image Transformers) series models were proposed by Facebook at the end of 2020. They address ViT's need for large-scale training datasets and reach 83.1% Top1 accuracy on ImageNet. Moreover, with a convolutional model as the teacher, knowledge distillation raises the Top1 accuracy on ImageNet to 85.2%. [Paper](https://arxiv.org/abs/2012.12877).
+
+
+## Accuracy, FLOPs and Parameters
+
+| Models           | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) |
+|:--:|:--:|:--:|:--:|:--:|:--:|
+| ViT_small_patch16_224 | 0.7727 | 0.9319 | 0.7785 | 0.9342 |      |
+| ViT_base_patch16_224 | 0.8176 | 0.9613 | 0.8178 | 0.9613 |      |
+| ViT_base_patch16_384 | 0.8393 | 0.9710 | 0.8420 | 0.9722 |      |
+| ViT_base_patch32_384 | 0.8124 | 0.9598 | 0.8166 | 0.9613 |  |
+| ViT_large_patch16_224 | 0.8325 | 0.9658 | 0.8306 | 0.9644 |  |
+| ViT_large_patch16_384 | 0.8507 | 0.9741 | 0.8517 | 0.9736 |  |
+| ViT_large_patch32_384 | 0.8105 | 0.9596 | 0.815 | - |  |
+|                       |        |        |                   |                   |              |
+
+
+| Models           | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) |
+|:--:|:--:|:--:|:--:|:--:|:--:|
+| DeiT_tiny_patch16_224        | 0.709 | 0.906 | 0.722 | 0.911 |      |
+| DeiT_small_patch16_224        | 0.794 | 0.948 | 0.799 | 0.950 |      |
+| DeiT_base_patch16_224        | 0.816 | 0.955 | 0.818 | 0.956 |      |
+| DeiT_base_patch16_384 | 0.831 | 0.962 | 0.829 | 0.972 |  |
+| DeiT_tiny_distilled_patch16_224 | 0.736 | 0.915 | 0.745 | 0.919 |  |
+| DeiT_small_distilled_patch16_224 | 0.810 | 0.953 | 0.812 | 0.954 |  |
+| DeiT_base_distilled_patch16_224 | 0.830 | 0.963 | 0.834 | 0.965 |  |
+| DeiT_base_distilled_patch16_384 | 0.855 | 0.974 | 0.852 | 0.972 |  |
+|  |  | |  | |  |
+
+
+Params, FLOPs, Inference speed and other information are coming soon.
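
The overview above says that ViT splits the image into patches before passing them to the Transformer. The sketch below works out how many patches that gives for the models in the tables. It assumes, from the naming convention alone, that a name ending in `patch<P>_<S>` means a patch size of P pixels and a square input of S pixels. `patch_grid` is a hypothetical helper, and the count excludes any extra tokens a model may add.

```python
import re
from typing import Tuple


def patch_grid(model_name: str) -> Tuple[int, int]:
    """Return (patches_per_side, total_patches) parsed from a model name.

    Assumption: names end in patch<P>_<S>, where P is the patch size and
    S is the square input resolution, both in pixels.
    """
    match = re.search(r"patch(\d+)_(\d+)$", model_name)
    if match is None:
        raise ValueError(f"unrecognised model name: {model_name}")
    patch, size = int(match.group(1)), int(match.group(2))
    if size % patch != 0:
        raise ValueError("input size must be a multiple of the patch size")
    per_side = size // patch
    return per_side, per_side * per_side


if __name__ == "__main__":
    for name in ("ViT_base_patch16_224",
                 "ViT_base_patch32_384",
                 "ViT_large_patch16_384"):
        side, total = patch_grid(name)
        print(f"{name}: {side} x {side} = {total} patches")
```

Under that assumption, a 224-pixel input with 16-pixel patches yields a 14 x 14 grid, while the 384-pixel variants see a 24 x 24 grid at patch size 16 and a 12 x 12 grid at patch size 32.
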
--- a/docs/zh_CN/models/Transformer.md
+++ b/docs/zh_CN/models/Transformer.md
+# ViT与DeiT系列
+
+## 概述
+
+ViT（Vision Transformer）系列模型是Google在2020年提出的，该模型仅使用标准的Transformer结构，完全抛弃了卷积结构，将图像拆分为多个patch后再输入到Transformer中，展示了Transformer在CV领域的潜力。[论文地址](https://arxiv.org/abs/2010.11929)。
+
+DeiT（Data-efficient Image Transformers）系列模型是由FaceBook在2020年底提出的，针对ViT模型需要大规模数据集训练的问题进行了改进，最终在ImageNet上取得了83.1%的Top1精度。并且使用卷积模型作为教师模型，针对该模型进行知识蒸馏，在ImageNet数据集上可以达到85.2%的Top1精度。[论文地址](https://arxiv.org/abs/2012.12877)。
+
+
+
+
+## 精度、FLOPS和参数量
+
+| Models           | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) |
+|:--:|:--:|:--:|:--:|:--:|:--:|
+| ViT_small_patch16_224 | 0.7727 | 0.9319 | 0.7785 | 0.9342 |      |
+| ViT_base_patch16_224 | 0.8176 | 0.9613 | 0.8178 | 0.9613 |      |
+| ViT_base_patch16_384 | 0.8393 | 0.9710 | 0.8420 | 0.9722 |      |
+| ViT_base_patch32_384 | 0.8124 | 0.9598 | 0.8166 | 0.9613 |  |
+| ViT_large_patch16_224 | 0.8325 | 0.9658 | 0.8306 | 0.9644 |  |
+| ViT_large_patch16_384 | 0.8507 | 0.9741 | 0.8517 | 0.9736 |  |
+| ViT_large_patch32_384 | 0.8105 | 0.9596 | 0.815 | - |  |
+|                       |        |        |                   |                   |              |
+
+
+| Models           | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) |
+|:--:|:--:|:--:|:--:|:--:|:--:|
+| DeiT_tiny_patch16_224        | 0.709 | 0.906 | 0.722 | 0.911 |      |
+| DeiT_small_patch16_224        | 0.794 | 0.948 | 0.799 | 0.950 |      |
+| DeiT_base_patch16_224        | 0.816 | 0.955 | 0.818 | 0.956 |      |
+| DeiT_base_patch16_384 | 0.831 | 0.962 | 0.829 | 0.972 |  |
+| DeiT_tiny_distilled_patch16_224 | 0.736 | 0.915 | 0.745 | 0.919 |  |
+| DeiT_small_distilled_patch16_224 | 0.810 | 0.953 | 0.812 | 0.954 |  |
+| DeiT_base_distilled_patch16_224 | 0.830 | 0.963 | 0.834 | 0.965 |  |
+| DeiT_base_distilled_patch16_384 | 0.855 | 0.974 | 0.852 | 0.972 |  |
+|  |  | |  | |  |
+
+关于Params、FLOPs、Inference speed等信息，敬请期待。
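
The accuracy tables list a `Top1` column next to a `Reference top1` column. A short Python sketch of how the gap between the two can be tabulated, using the seven ViT rows copied from the table above. The interpretation of the reference column as externally reported figures is an assumption, since the document does not define it; `top1_gaps` is a hypothetical helper.

```python
# (model, Top1, Reference top1), copied from the ViT table above.
VIT_TOP1 = [
    ("ViT_small_patch16_224", 0.7727, 0.7785),
    ("ViT_base_patch16_224", 0.8176, 0.8178),
    ("ViT_base_patch16_384", 0.8393, 0.8420),
    ("ViT_base_patch32_384", 0.8124, 0.8166),
    ("ViT_large_patch16_224", 0.8325, 0.8306),
    ("ViT_large_patch16_384", 0.8507, 0.8517),
    ("ViT_large_patch32_384", 0.8105, 0.815),
]


def top1_gaps(rows):
    """Map each model to (Top1 - reference Top1), rounded to 4 decimals."""
    return {name: round(ours - ref, 4) for name, ours, ref in rows}


if __name__ == "__main__":
    gaps = top1_gaps(VIT_TOP1)
    for name, gap in gaps.items():
        print(f"{name}: {gap:+.4f}")
    worst = max(gaps, key=lambda n: abs(gaps[n]))
    print("largest absolute gap:", worst)
```

A negative value means the listed Top1 is below the reference figure. The same function applies unchanged to the DeiT rows.
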