{ "cells": [ { "cell_type": "markdown", "id": "ae69ce68", "metadata": {}, "source": [ "## 1. PLSC-ViT Introduction\n" ] }, { "cell_type": "markdown", "id": "35485bc6", "metadata": {}, "source": [ "PLSC-ViT reimplemented Google's repository for the ViT model. The overview of the model is as follows. The input image is splited into fixed-size patches, then linear projection and position embeddings are applied. The resulting sequence are feed into a standard Transformer encoder. In order to perform classification, the standard approach of adding an extra learnable \"classification token\" is utilized to the sequence. \n", "\n", "![Figure 1 from paper](https://github.com/google-research/vision_transformer/raw/main/vit_figure.png)\n" ] }, { "cell_type": "markdown", "id": "97e174e6", "metadata": {}, "source": [ "## 2. Model Effects and Application Scenarios" ] }, { "cell_type": "markdown", "id": "67ae978f", "metadata": {}, "source": [ "| Model | Phase | Dataset | gpu | img/sec | Top1 Acc | Official |\n", "| --- | --- | --- | --- | --- | --- | --- |\n", "| ViT-B_16_224 |pretrain |ImageNet2012 |A100*N1C8 | 3583| 0.75196 | 0.7479 |\n", "| ViT-B_16_384 |finetune | ImageNet2012 | A100*N1C8 | 719 | 0.77972 | 0.7791 |\n", "| ViT-L_16_224 | pretrain | ImageNet21K | A100*N4C32 | 5256 | - | - | |\n", "|ViT-L_16_384 |finetune | ImageNet2012 | A100*N4C32 | 934 | 0.85030 | 0.8505 |" ] }, { "cell_type": "markdown", "id": "ace3c48d", "metadata": {}, "source": [ "## 3. How to use the Model" ] }, { "cell_type": "markdown", "id": "186a0c17", "metadata": {}, "source": [ "### 3.1 Install PLSC\n", "\n", "```shell\n", "git clone https://github.com/PaddlePaddle/PLSC.git\n", "cd /path/to/PLSC/\n", "# [optional] pip install -r requirements.txt\n", "python setup.py develop\n", "```" ] }, { "cell_type": "markdown", "id": "6b22824d", "metadata": {}, "source": [ "### 3.2 Model Training" ] }, { "cell_type": "markdown", "id": "a562bf23", "metadata": {}, "source": [ "1. Enter into the task directory\n", "\n", "```shell\n", "cd task/classification/vit\n", "```" ] }, { "cell_type": "markdown", "id": "de109245", "metadata": {}, "source": [ "2. Prepare the data\n", "\n", "Organize the data into the following format:\n", "\n", "```text\n", "dataset/\n", "└── ILSVRC2012\n", " ├── train\n", " ├── val\n", " ├── train_list.txt\n", " └── val_list.txt\n", "```" ] }, { "cell_type": "markdown", "id": "ec78efdf", "metadata": {}, "source": [ "3. Run the command\n", "\n", "```shell\n", "export PADDLE_NNODES=1\n", "export PADDLE_MASTER=\"127.0.0.1:12538\"\n", "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7\n", "\n", "python -m paddle.distributed.launch \\\n", " --nnodes=$PADDLE_NNODES \\\n", " --master=$PADDLE_MASTER \\\n", " --devices=$CUDA_VISIBLE_DEVICES \\\n", " plsc-train \\\n", " -c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml\n", "```\n", "\n", "More courses about model training can be learned here [ViT readme](https://github.com/PaddlePaddle/PLSC/blob/master/task/classification/vit/README.md)" ] }, { "cell_type": "markdown", "id": "05ba38c3", "metadata": {}, "source": [ "### 3.3 Model Inference" ] }, { "cell_type": "markdown", "id": "7a3ce1ab", "metadata": {}, "source": [ "1. Download pretrained model\n", "\n", "```shell\n", "mkdir -p pretrained/vit/ViT_base_patch16_224/\n", "wget -O ./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224.pdparams https://plsc.bj.bcebos.com/models/vit/v2.4/imagenet2012-ViT-B_16-224.pdparams\n", "```" ] }, { "cell_type": "markdown", "id": "cff5ac83", "metadata": {}, "source": [ "2. 
{ "cell_type": "markdown", "id": "d375934d", "metadata": {}, "source": [ "## 4. Related Papers and Citations\n", "\n", "```text\n", "@article{dosovitskiy2020,\n", " title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},\n", " author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},\n", " journal={arXiv preprint arXiv:2010.11929},\n", " year={2020}\n", "}\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }