introduction_cn.ipynb 5.4 KB
Notebook
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ae69ce68",
   "metadata": {},
   "source": [
    "## 1. PLSC-ViT模型简介\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35485bc6",
   "metadata": {},
   "source": [
    "PLSC-ViT实现了基于Transformer的视觉分类模型。ViT对图像进行切分成patch,之后基于patch拉平的sequence进行线性embedding,并且添加了position embeddings和classfication token,然后将patch序列输入到标准的transformer编码器,最终经过一个MLP进行分类。模型结构如下,\n",
    "\n",
    "![Figure 1 from paper](https://github.com/google-research/vision_transformer/raw/main/vit_figure.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97e174e6",
   "metadata": {},
   "source": [
    "## 2. 模型效果 "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78137a72",
   "metadata": {},
   "source": [
    "| Model | Phase | Dataset | gpu | img/sec | Top1 Acc | Official |\n",
    "| --- | --- | --- | --- | --- | --- | --- |\n",
    "| ViT-B_16_224 |pretrain  |ImageNet2012  |A100*N1C8  |  3583| 0.75196 | 0.7479 |\n",
    "| ViT-B_16_384 |finetune  | ImageNet2012 | A100*N1C8 | 719 | 0.77972 | 0.7791 |\n",
    "| ViT-L_16_224 | pretrain | ImageNet21K | A100*N4C32 | 5256 | - | - |  |\n",
    "|ViT-L_16_384  |finetune  | ImageNet2012 | A100*N4C32 | 934 | 0.85030 | 0.8505 |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ace3c48d",
   "metadata": {},
   "source": [
    "## 3. 模型如何使用"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a97a5f56",
   "metadata": {},
   "source": [
    "### 3.1 安装PLSC"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "492fa769-2fe0-4220-b6d9-bbc32f8cca10",
   "metadata": {},
   "source": [
    "```\n",
    "git clone https://github.com/PaddlePaddle/PLSC.git\n",
    "cd /path/to/PLSC/\n",
    "# [optional] pip install -r requirements.txt\n",
    "python setup.py develop\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b22824d",
   "metadata": {},
   "source": [
    "### 3.2 模型训练"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d68ca5fb",
   "metadata": {},
   "source": [
    "1. 进入任务目录\n",
    "\n",
    "```\n",
    "cd task/classification/vit\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9048df01",
   "metadata": {},
   "source": [
    "2. 准备数据\n",
    "\n",
    "将数据整理成以下格式:\n",
    "```text\n",
    "dataset/\n",
    "└── ILSVRC2012\n",
    "    ├── train\n",
    "    ├── val\n",
    "    ├── train_list.txt\n",
    "    └── val_list.txt\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bea743ea",
   "metadata": {},
   "source": [
    "3. 执行训练命令\n",
    "\n",
    "```shell\n",
    "export PADDLE_NNODES=1\n",
    "export PADDLE_MASTER=\"127.0.0.1:12538\"\n",
    "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7\n",
    "\n",
    "python -m paddle.distributed.launch \\\n",
    "    --nnodes=$PADDLE_NNODES \\\n",
    "    --master=$PADDLE_MASTER \\\n",
    "    --devices=$CUDA_VISIBLE_DEVICES \\\n",
    "    plsc-train \\\n",
    "    -c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml\n",
    "```\n",
    "\n",
    "更多模型的训练教程可参考文档:[ViT训练文档](https://github.com/PaddlePaddle/PLSC/blob/master/task/classification/vit/README.md)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "186a0c17",
   "metadata": {},
   "source": [
    "### 3.3 模型推理"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e97c527c",
   "metadata": {},
   "source": [
    "1. 下载预训练模型\n",
    "\n",
    "```shell\n",
    "mkdir -p pretrained/vit/ViT_base_patch16_224/\n",
    "wget -O ./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224.pdparams https://plsc.bj.bcebos.com/models/vit/v2.4/imagenet2012-ViT-B_16-224.pdparams\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a07c6549",
   "metadata": {},
   "source": [
    "2. 导出推理模型\n",
    "\n",
    "```shell\n",
    "plsc-export -c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml -o Global.pretrained_model=./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224 -o Model.data_format=NCHW -o FP16.level=O0\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d375934d",
   "metadata": {},
   "source": [
    "## 4. 相关论文及引用信息\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29f05b07-d323-45e4-b00d-0728eafb5af7",
   "metadata": {},
   "source": [
    "```text\n",
    "@article{dosovitskiy2020,\n",
    "  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},\n",
    "  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},\n",
    "  journal={arXiv preprint arXiv:2010.11929},\n",
    "  year={2020}\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}