## 1. PLSC-ViT Introduction


PLSC-ViT is a reimplementation of the ViT model from Google's repository. The model works as follows. The input image is split into fixed-size patches, each patch is linearly projected, and position embeddings are added. The resulting sequence is fed into a standard Transformer encoder. To perform classification, an extra learnable "classification token" is added to the sequence, which is the standard approach.

![Figure 1 from paper](https://github.com/google-research/vision_transformer/raw/main/vit_figure.png)
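
The patch-splitting step described above can be illustrated with a minimal pure-Python sketch. It is not the PLSC implementation: the function names `patchify` and `sequence_length` are hypothetical, the image is a plain nested list rather than a tensor, and the linear projection and position embeddings are left out.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            # Flatten one patch_size x patch_size window, pixel by pixel.
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    patch.extend(pixel)
            patches.append(patch)
    return patches


def sequence_length(image_size, patch_size):
    """Tokens fed to the encoder: one per patch plus the classification token."""
    return (image_size // patch_size) ** 2 + 1


# A toy 4x4 single-channel image whose pixel values are 0..15.
toy_image = [[[4 * r + c] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, 2)
print(len(toy_patches), len(toy_patches[0]))
print(sequence_length(224, 16), sequence_length(384, 16))
```

With 16x16 patches, a 224x224 input gives 14 x 14 = 196 patches plus one classification token, and a 384x384 input gives 24 x 24 = 576 patches plus one.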


## 2. Model Effects and Application Scenarios

| Model | Phase | Dataset | GPU | img/sec | Top1 Acc | Official Top1 Acc |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-B_16_224 | pretrain | ImageNet2012 | A100*N1C8 | 3583 | 0.75196 | 0.7479 |
| ViT-B_16_384 | finetune | ImageNet2012 | A100*N1C8 | 719 | 0.77972 | 0.7791 |
| ViT-L_16_224 | pretrain | ImageNet21K | A100*N4C32 | 5256 | - | - |
| ViT-L_16_384 | finetune | ImageNet2012 | A100*N4C32 | 934 | 0.85030 | 0.8505 |

## 3. How to Use the Model

### 3.1 Install PLSC

```shell
git clone https://github.com/PaddlePaddle/PLSC.git
cd /path/to/PLSC/
# [optional] pip install -r requirements.txt
python setup.py develop
```

### 3.2 Model Training

1. Enter the task directory

```shell
cd task/classification/vit
```

2. Prepare the data

Organize the data into the following format:

```text
dataset/
└── ILSVRC2012
 ├── train
 ├── val
 ├── train_list.txt
 └── val_list.txt
```
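
Before launching training, it can help to confirm that the layout above is in place. The following is a small sketch using only the standard library; the helper name `missing_entries` is hypothetical and not part of PLSC, and it assumes the `dataset/` directory sits in the current working directory.

```python
from pathlib import Path

# Entries expected under dataset/ILSVRC2012, taken from the layout above.
EXPECTED = ("train", "val", "train_list.txt", "val_list.txt")


def missing_entries(dataset_root="dataset"):
    """Return the expected paths that do not exist under <dataset_root>/ILSVRC2012."""
    base = Path(dataset_root) / "ILSVRC2012"
    return [str(base / name) for name in EXPECTED if not (base / name).exists()]


missing = missing_entries()
print("layout OK" if not missing else "missing: " + ", ".join(missing))
```

This only checks that the four entries exist. It does not validate the contents of the list files or the images.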

3. Run the command

```shell
export PADDLE_NNODES=1
export PADDLE_MASTER="127.0.0.1:12538"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m paddle.distributed.launch \
 --nnodes=$PADDLE_NNODES \
 --master=$PADDLE_MASTER \
 --devices=$CUDA_VISIBLE_DEVICES \
 plsc-train \
 -c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml
```

For more details on model training, see the [ViT README](https://github.com/PaddlePaddle/PLSC/blob/master/task/classification/vit/README.md).

### 3.3 Model Inference

1. Download the pretrained model

```shell
mkdir -p pretrained/vit/ViT_base_patch16_224/
wget -O ./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224.pdparams https://plsc.bj.bcebos.com/models/vit/v2.4/imagenet2012-ViT-B_16-224.pdparams
```

2. Export the model for inference

```shell
plsc-export \
 -c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml \
 -o Global.pretrained_model=./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224 \
 -o Model.data_format=NCHW \
 -o FP16.level=O0
```

## 4. Related papers and citations

```text
@article{dosovitskiy2020,
 title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
 author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
 journal={arXiv preprint arXiv:2010.11929},
 year={2020}
}
```