Commit be6a22be authored by Yang Nie, committed by Tingquan Gao

add MobileViTv2

Parent 9f621279
# MobileViTv2
-----
## Contents

- [1. Introduction](#1)
  - [1.1 Model Overview](#1.1)
  - [1.2 Model Metrics](#1.2)
- [2. Quick Start](#2)
- [3. Training, Evaluation and Prediction](#3)
- [4. Inference and Deployment](#4)
  - [4.1 Preparing the Inference Model](#4.1)
  - [4.2 Inference with the Python Prediction Engine](#4.2)
  - [4.3 Inference with the C++ Prediction Engine](#4.3)
  - [4.4 Serving Deployment](#4.4)
  - [4.5 On-Device Deployment](#4.5)
  - [4.6 Paddle2ONNX Model Conversion and Prediction](#4.6)
<a name='1'></a>
## 1. Introduction
<a name='1.1'></a>
### 1.1 Model Overview
MobileViTv2 is a lightweight model that combines CNNs and ViTs for mobile vision tasks. The MobileViTv2 block replaces the multi-headed self-attention of the MobileViTv1 block with a separable self-attention whose cost grows linearly with the number of tokens, which removes the main scaling bottleneck of MobileViTv1 and simplifies the learning task. Scaling the network width yields the MobileViTv2-0.5, 0.75, 1.0 (and larger) models, which outperform MobileViTv1 on the ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets. [Paper](https://arxiv.org/abs/2206.02680)
<a name='1.2'></a>
### 1.2 Model Metrics

| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPs<br>(M) | Params<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| MobileViTv2_x0_5 | 0.7017 | 0.89884 | 0.7018 | - | 480.46 | 1.37 |
| MobileViTv2_x1_0 | 0.7813 | 0.94172 | 0.7809 | - | 1843.81 | 4.90 |
| MobileViTv2_x1_5 | 0.8034 | 0.95094 | 0.8038 | - | 4090.07 | 10.60 |
| MobileViTv2_x2_0 | 0.8116 | 0.95370 | 0.8117 | - | 7219.23 | 18.45 |

**Note:** The pretrained weights of this series provided by PaddleClas are converted from the weights officially released by the authors.
<a name="2"></a>
## 2. 模型快速体验
安装 paddlepaddle 和 paddleclas 即可快速对图片进行预测,体验方法可以参考[ResNet50 模型快速体验](./ResNet.md#2)
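As a minimal sketch, assuming the `paddleclas` whl package is installed and the MobileViTv2 weights are reachable through the whl interface (this availability is an assumption), a single image can be classified from the command line roughly as follows; the image path is the demo image shipped with the repository:

```shell
# install the prerequisites
pip3 install paddlepaddle paddleclas

# predict one image by model name (whl availability of MobileViTv2_x1_0 is an assumption;
# see the ResNet50 quick-start document for the general usage)
paddleclas --model_name=MobileViTv2_x1_0 --infer_imgs=docs/images/inference_deployment/whl_demo.jpg
```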
<a name="3"></a>
## 3. 模型训练、评估和预测
此部分内容包括训练环境配置、ImageNet数据的准备、该模型在 ImageNet 上的训练、评估、预测等内容。在 `ppcls/configs/ImageNet/MobileViTv2/` 中提供了该模型的训练配置,启动训练方法可以参考:[ResNet50 模型训练、评估和预测](./ResNet.md#3-模型训练评估和预测)
**备注:** 由于 MobileViT 系列模型默认使用的 GPU 数量为 8 个,所以在训练时,需要指定8个GPU,如`python3 -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" tools/train.py -c xxx.yaml`, 如果使用 4 个 GPU 训练,默认学习率需要减小一半,精度可能有损。
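A sketch of such a 4-GPU launch for MobileViTv2_x1_0, halving `Optimizer.lr.learning_rate` and `Optimizer.lr.eta_min` relative to the 8-GPU defaults of the provided config (the override values are illustrative, not tuned results):

```shell
# 8-GPU defaults in MobileViTv2_x1_0.yaml: learning_rate=0.0075, eta_min=0.00075
python3 -m paddle.distributed.launch --gpus="0,1,2,3" tools/train.py \
    -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x1_0.yaml \
    -o Optimizer.lr.learning_rate=0.00375 \
    -o Optimizer.lr.eta_min=0.000375
```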
<a name="4"></a>
## 4. 模型推理部署
<a name="4.1"></a>
### 4.1 推理模型准备
Paddle Inference 是飞桨的原生推理库, 作用于服务器端和云端,提供高性能的推理能力。相比于直接基于预训练模型进行预测,Paddle Inference可使用 MKLDNN、CUDNN、TensorRT 进行预测加速,从而实现更优的推理性能。更多关于Paddle Inference推理引擎的介绍,可以参考[Paddle Inference官网教程](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/infer/inference/inference_cn.html)
Inference 的获取可以参考 [ResNet50 推理模型准备](./ResNet.md#4.1)
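A sketch of exporting the inference model, assuming training has produced weights at `./output/MobileViTv2_x1_0/best_model` (both paths below are illustrative):

```shell
python3 tools/export_model.py \
    -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x1_0.yaml \
    -o Global.pretrained_model=./output/MobileViTv2_x1_0/best_model \
    -o Global.save_inference_dir=deploy/models/MobileViTv2_x1_0_infer
```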
<a name="4.2"></a>
### 4.2 基于 Python 预测引擎推理
PaddleClas 提供了基于 python 预测引擎推理的示例。您可以参考[ResNet50 基于 Python 预测引擎推理](./ResNet.md#4.2) 完成模型的推理预测。
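Because MobileViTv2 is evaluated at 256x256 with 0-1 normalization (see the Eval/Infer transforms in its config), the default preprocessing of `predict_cls.py` needs to be overridden. A sketch, assuming the inference model exported above:

```shell
cd deploy
python3 python/predict_cls.py \
    -c configs/inference_cls.yaml \
    -o Global.inference_model_dir=models/MobileViTv2_x1_0_infer \
    -o PreProcess.transform_ops.0.ResizeImage.resize_short=288 \
    -o PreProcess.transform_ops.1.CropImage.size=256 \
    -o PreProcess.transform_ops.2.NormalizeImage.mean=[0.,0.,0.] \
    -o PreProcess.transform_ops.2.NormalizeImage.std=[1.,1.,1.]
```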
<a name="4.3"></a>
### 4.3 基于 C++ 预测引擎推理
PaddleClas 提供了基于 C++ 预测引擎推理的示例,您可以参考[服务器端 C++ 预测](../../deployment/image_classification/cpp/linux.md)来完成相应的推理部署。如果您使用的是 Windows 平台,可以参考[基于 Visual Studio 2019 Community CMake 编译指南](../../deployment/image_classification/cpp/windows.md)完成相应的预测库编译和模型预测工作。
<a name="4.4"></a>
### 4.4 服务化部署
Paddle Serving 提供高性能、灵活易用的工业级在线推理服务。Paddle Serving 支持 RESTful、gRPC、bRPC 等多种协议,提供多种异构硬件和多种操作系统环境下推理解决方案。更多关于Paddle Serving 的介绍,可以参考[Paddle Serving 代码仓库](https://github.com/PaddlePaddle/Serving)
PaddleClas 提供了基于 Paddle Serving 来完成模型服务化部署的示例,您可以参考[模型服务化部署](../../deployment/image_classification/paddle_serving.md)来完成相应的部署工作。
<a name="4.5"></a>
### 4.5 端侧部署
Paddle Lite 是一个高性能、轻量级、灵活性强且易于扩展的深度学习推理框架,定位于支持包括移动端、嵌入式以及服务器端在内的多硬件平台。更多关于 Paddle Lite 的介绍,可以参考[Paddle Lite 代码仓库](https://github.com/PaddlePaddle/Paddle-Lite)
PaddleClas 提供了基于 Paddle Lite 来完成模型端侧部署的示例,您可以参考[端侧部署](../../deployment/image_classification/paddle_lite.md)来完成相应的部署工作。
<a name="4.6"></a>
### 4.6 Paddle2ONNX 模型转换与预测
Paddle2ONNX 支持将 PaddlePaddle 模型格式转化到 ONNX 模型格式。通过 ONNX 可以完成将 Paddle 模型到多种推理引擎的部署,包括TensorRT/OpenVINO/MNN/TNN/NCNN,以及其它对 ONNX 开源格式进行支持的推理引擎或硬件。更多关于 Paddle2ONNX 的介绍,可以参考[Paddle2ONNX 代码仓库](https://github.com/PaddlePaddle/Paddle2ONNX)
PaddleClas 提供了基于 Paddle2ONNX 来完成 inference 模型转换 ONNX 模型并作推理预测的示例,您可以参考[Paddle2ONNX 模型转换与预测](../../deployment/image_classification/paddle2onnx.md)来完成相应的部署工作。
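A sketch of the conversion, assuming the inference model exported in section 4.1 and the standard paddle2onnx command-line flags (the output path is illustrative):

```shell
paddle2onnx --model_dir=deploy/models/MobileViTv2_x1_0_infer \
    --model_filename=inference.pdmodel \
    --params_filename=inference.pdiparams \
    --save_file=deploy/models/MobileViTv2_x1_0_infer/inference.onnx \
    --opset_version=10 \
    --enable_onnx_checker=True
```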
@@ -798,13 +798,17 @@ Accuracy and speed metrics of the DeiT (Data-efficient Image Transformers) series models
## MobileViT Series <sup>[[42](#ref42)][[51](#ref51)][[52](#ref52)]</sup>
The accuracy and speed metrics of the MobileViT series models are shown in the table below. For more details, see the [MobileViT series documentation](MobileViT.md), [MobileViTv2 series documentation](MobileViTv2.md) and [MobileViTv3 series documentation](MobileViTv3.md).
| Model | Top-1 Acc | Top-5 Acc | time(ms)<br>bs=1 | time(ms)<br>bs=4 | time(ms)<br/>bs=8 | FLOPs(M) | Params(M) | Pretrained Model Download Link | Inference Model Download Link |
| ---------- | --------- | --------- | ---------------- | ---------------- | -------- | --------- | --------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| MobileViT_XXS | 0.6867 | 0.8878 | - | - | - | 337.24 | 1.28 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/MobileViT_XXS_pretrained.pdparams) | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/inference/MobileViT_XXS_infer.tar) |
| MobileViT_XS | 0.7454 | 0.9227 | - | - | - | 930.75 | 2.33 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/MobileViT_XS_pretrained.pdparams) | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/inference/MobileViT_XS_infer.tar) |
| MobileViT_S | 0.7814 | 0.9413 | - | - | - | 1849.35 | 5.59 | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/MobileViT_S_pretrained.pdparams) | [Download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/inference/MobileViT_S_infer.tar) |
| MobileViTv2_x0_5 | 0.7017 | 0.8988 | - | - | - | 480.46 | 1.37 | [Download link]() | [Download link]() |
| MobileViTv2_x1_0 | 0.7813 | 0.9417 | - | - | - | 1843.81 | 4.90 | [Download link]() | [Download link]() |
| MobileViTv2_x1_5 | 0.8034 | 0.9509 | - | - | - | 4090.07 | 10.60 | [Download link]() | [Download link]() |
| MobileViTv2_x2_0 | 0.8116 | 0.9537 | - | - | - | 7219.23 | 18.45 | [Download link]() | [Download link]() |
| MobileViTv3_XXS | 0.7087 | 0.8976 | - | - | - | 289.02 | 1.25 | [Download link]() | [Download link]() |
| MobileViTv3_XS | 0.7663 | 0.9332 | - | - | - | 926.98 | 2.49 | [Download link]() | [Download link]() |
| MobileViTv3_S | 0.7928 | 0.9454 | - | - | - | 1841.39 | 5.76 | [Download link]() | [Download link]() |
@@ -920,4 +924,8 @@ TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE.
<a name="ref50">[50]</a>Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. Swin Transformer V2: Scaling Up Capacity and Resolution
<a name="ref51">[51]</a>Sachin Mehta, Mohammad Rastegari. Separable Self-attention for Mobile Vision Transformers
<a name="ref52">[52]</a>Shakti N. Wadekar, Abhishek Chaurasia. MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features
@@ -78,6 +78,7 @@ from .model_zoo.cae import cae_base_patch16_224, cae_large_patch16_224
from .model_zoo.cvt import CvT_13_224, CvT_13_384, CvT_21_224, CvT_21_384, CvT_W24_384
from .model_zoo.micronet import MicroNet_M0, MicroNet_M1, MicroNet_M2, MicroNet_M3
from .model_zoo.mobilenext import MobileNeXt_x0_35, MobileNeXt_x0_5, MobileNeXt_x0_75, MobileNeXt_x1_0, MobileNeXt_x1_4
from .model_zoo.mobilevit_v2 import MobileViTv2_x0_5, MobileViTv2_x0_75, MobileViTv2_x1_0, MobileViTv2_x1_25, MobileViTv2_x1_5, MobileViTv2_x1_75, MobileViTv2_x2_0
from .model_zoo.mobilevit_v3 import MobileViTv3_XXS, MobileViTv3_XS, MobileViTv3_S, MobileViTv3_XXS_L2, MobileViTv3_XS_L2, MobileViTv3_S_L2, MobileViTv3_x0_5, MobileViTv3_x0_75, MobileViTv3_x1_0
from .variant_models.resnet_variant import ResNet50_last_stage_stride1
...
# copyright (c) 2023 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Code was based on https://github.com/apple/ml-cvnets/blob/7be93d3debd45c240a058e3f34a9e88d33c07a7d/cvnets/models/classification/mobilevit_v2.py
# reference: https://arxiv.org/abs/2206.02680
from functools import partial
from typing import Dict, Optional, Tuple, Union
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from ....utils.save_load import load_dygraph_pretrain, load_dygraph_pretrain_from_url
MODEL_URLS = {
"MobileViTv2_x0_5": "",
"MobileViTv2_x0_75": "",
"MobileViTv2_x1_0": "",
"MobileViTv2_x1_25": "",
"MobileViTv2_x1_5": "",
"MobileViTv2_x1_75": "",
"MobileViTv2_x2_0": "",
}
layer_norm_2d = partial(nn.GroupNorm, num_groups=1)
def make_divisible(v, divisor=8, min_value=None):
if min_value is None:
min_value = divisor
new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
if new_v < 0.9 * v:
new_v += divisor
return new_v
class InvertedResidual(nn.Layer):
"""
Inverted residual block (MobileNetv2): https://arxiv.org/abs/1801.04381
"""
def __init__(self,
in_channels: int,
out_channels: int,
stride: int,
expand_ratio: Union[int, float],
dilation: int=1,
skip_connection: Optional[bool]=True) -> None:
assert stride in [1, 2]
super(InvertedResidual, self).__init__()
self.stride = stride
hidden_dim = make_divisible(int(round(in_channels * expand_ratio)), 8)
self.use_res_connect = self.stride == 1 and in_channels == out_channels and skip_connection
block = nn.Sequential()
if expand_ratio != 1:
block.add_sublayer(
name="exp_1x1",
sublayer=nn.Sequential(
('conv', nn.Conv2D(
in_channels, hidden_dim, 1, bias_attr=False)),
('norm', nn.BatchNorm2D(hidden_dim)), ('act', nn.Silu())))
block.add_sublayer(
name="conv_3x3",
sublayer=nn.Sequential(
('conv', nn.Conv2D(
hidden_dim,
hidden_dim,
3,
bias_attr=False,
stride=stride,
padding=dilation,
dilation=dilation,
groups=hidden_dim)), ('norm', nn.BatchNorm2D(hidden_dim)),
('act', nn.Silu())))
block.add_sublayer(
name="red_1x1",
sublayer=nn.Sequential(
('conv', nn.Conv2D(
hidden_dim, out_channels, 1, bias_attr=False)),
('norm', nn.BatchNorm2D(out_channels))))
self.block = block
self.in_channels = in_channels
self.out_channels = out_channels
self.exp = expand_ratio
self.dilation = dilation
def forward(self, x, *args, **kwargs):
if self.use_res_connect:
return x + self.block(x)
else:
return self.block(x)
class LinearSelfAttention(nn.Layer):
def __init__(self, embed_dim, attn_dropout=0.0, bias=True):
super().__init__()
self.embed_dim = embed_dim
self.qkv_proj = nn.Conv2D(
embed_dim, 1 + (2 * embed_dim), 1, bias_attr=bias)
self.attn_dropout = nn.Dropout(p=attn_dropout)
self.out_proj = nn.Conv2D(embed_dim, embed_dim, 1, bias_attr=bias)
def forward(self, x):
# [B, C, P, N] --> [B, h + 2d, P, N]
qkv = self.qkv_proj(x)
# Project x into query, key and value
# Query --> [B, 1, P, N]
# value, key --> [B, d, P, N]
query, key, value = paddle.split(
qkv, [1, self.embed_dim, self.embed_dim], axis=1)
# apply softmax along N dimension
context_scores = F.softmax(query, axis=-1)
# Uncomment below line to visualize context scores
# self.visualize_context_scores(context_scores=context_scores)
context_scores = self.attn_dropout(context_scores)
# Compute context vector
# [B, d, P, N] x [B, 1, P, N] -> [B, d, P, N]
context_vector = key * context_scores
# [B, d, P, N] --> [B, d, P, 1]
context_vector = paddle.sum(context_vector, axis=-1, keepdim=True)
# combine context vector with values
# [B, d, P, N] * [B, d, P, 1] --> [B, d, P, N]
out = F.relu(value) * context_vector
out = self.out_proj(out)
return out
class LinearAttnFFN(nn.Layer):
def __init__(self,
embed_dim: int,
ffn_latent_dim: int,
attn_dropout: Optional[float]=0.0,
dropout: Optional[float]=0.1,
ffn_dropout: Optional[float]=0.0,
norm_layer: Optional[str]=layer_norm_2d) -> None:
super().__init__()
attn_unit = LinearSelfAttention(
embed_dim=embed_dim, attn_dropout=attn_dropout, bias=True)
self.pre_norm_attn = nn.Sequential(
norm_layer(num_channels=embed_dim),
attn_unit,
nn.Dropout(p=dropout))
self.pre_norm_ffn = nn.Sequential(
norm_layer(num_channels=embed_dim),
nn.Conv2D(embed_dim, ffn_latent_dim, 1),
nn.Silu(),
nn.Dropout(p=ffn_dropout),
nn.Conv2D(ffn_latent_dim, embed_dim, 1),
nn.Dropout(p=dropout))
def forward(self, x):
# self-attention
x = x + self.pre_norm_attn(x)
# Feed forward network
x = x + self.pre_norm_ffn(x)
return x
class MobileViTv2Block(nn.Layer):
"""
This class defines the `MobileViTv2 block`
"""
def __init__(self,
in_channels: int,
attn_unit_dim: int,
ffn_multiplier: float=2.0,
n_attn_blocks: Optional[int]=2,
attn_dropout: Optional[float]=0.0,
dropout: Optional[float]=0.0,
ffn_dropout: Optional[float]=0.0,
patch_h: Optional[int]=8,
patch_w: Optional[int]=8,
conv_ksize: Optional[int]=3,
dilation: Optional[int]=1,
attn_norm_layer: Optional[str]=layer_norm_2d):
cnn_out_dim = attn_unit_dim
padding = (conv_ksize - 1) // 2 * dilation
conv_3x3_in = nn.Sequential(
('conv', nn.Conv2D(
in_channels,
in_channels,
conv_ksize,
bias_attr=False,
padding=padding,
dilation=dilation,
groups=in_channels)), ('norm', nn.BatchNorm2D(in_channels)),
('act', nn.Silu()))
conv_1x1_in = nn.Sequential(('conv', nn.Conv2D(
in_channels, cnn_out_dim, 1, bias_attr=False)))
super().__init__()
self.local_rep = nn.Sequential(conv_3x3_in, conv_1x1_in)
self.global_rep, attn_unit_dim = self._build_attn_layer(
d_model=attn_unit_dim,
ffn_mult=ffn_multiplier,
n_layers=n_attn_blocks,
attn_dropout=attn_dropout,
dropout=dropout,
ffn_dropout=ffn_dropout,
attn_norm_layer=attn_norm_layer)
self.conv_proj = nn.Sequential(
('conv', nn.Conv2D(
cnn_out_dim, in_channels, 1, bias_attr=False)),
('norm', nn.BatchNorm2D(in_channels)))
self.patch_h = patch_h
self.patch_w = patch_w
def _build_attn_layer(self,
d_model: int,
ffn_mult: float,
n_layers: int,
attn_dropout: float,
dropout: float,
ffn_dropout: float,
attn_norm_layer: nn.Layer):
# ensure that dims are multiple of 16
ffn_dims = [ffn_mult * d_model // 16 * 16] * n_layers
global_rep = [
LinearAttnFFN(
embed_dim=d_model,
ffn_latent_dim=ffn_dims[block_idx],
attn_dropout=attn_dropout,
dropout=dropout,
ffn_dropout=ffn_dropout,
norm_layer=attn_norm_layer) for block_idx in range(n_layers)
]
global_rep.append(attn_norm_layer(num_channels=d_model))
return nn.Sequential(*global_rep), d_model
def unfolding(self, feature_map):
batch_size, in_channels, img_h, img_w = feature_map.shape
# [B, C, H, W] --> [B, C, P, N]
patches = F.unfold(
feature_map,
kernel_sizes=[self.patch_h, self.patch_w],
strides=[self.patch_h, self.patch_w])
n_patches = img_h * img_w // (self.patch_h * self.patch_w)
patches = patches.reshape(
[batch_size, in_channels, self.patch_h * self.patch_w, n_patches])
return patches, (img_h, img_w)
def folding(self, patches, output_size: Tuple[int, int]):
batch_size, in_dim, patch_size, n_patches = patches.shape
# [B, C, P, N]
patches = patches.reshape([batch_size, in_dim * patch_size, n_patches])
feature_map = F.fold(
patches,
output_size,
kernel_sizes=[self.patch_h, self.patch_w],
strides=[self.patch_h, self.patch_w])
return feature_map
def forward(self, x):
fm = self.local_rep(x)
# convert feature map to patches
patches, output_size = self.unfolding(fm)
# learn global representations on all patches
patches = self.global_rep(patches)
# [B x Patch x Patches x C] --> [B x C x Patches x Patch]
fm = self.folding(patches=patches, output_size=output_size)
fm = self.conv_proj(fm)
return fm
class MobileViTv2(nn.Layer):
"""
MobileViTv2
"""
def __init__(self,
mobilevit_config: Dict,
class_num=1000,
output_stride=None):
super().__init__()
self.round_nearest = 8
self.dilation = 1
dilate_l4 = dilate_l5 = False
if output_stride == 8:
dilate_l4 = True
dilate_l5 = True
elif output_stride == 16:
dilate_l5 = True
# store model configuration in a dictionary
in_channels = mobilevit_config["layer0"]["img_channels"]
out_channels = mobilevit_config["layer0"]["out_channels"]
self.conv_1 = nn.Sequential(
('conv', nn.Conv2D(
in_channels,
out_channels,
3,
bias_attr=False,
stride=2,
padding=1)), ('norm', nn.BatchNorm2D(out_channels)),
('act', nn.Silu()))
in_channels = out_channels
self.layer_1, out_channels = self._make_layer(
input_channel=in_channels, cfg=mobilevit_config["layer1"])
in_channels = out_channels
self.layer_2, out_channels = self._make_layer(
input_channel=in_channels, cfg=mobilevit_config["layer2"])
in_channels = out_channels
self.layer_3, out_channels = self._make_layer(
input_channel=in_channels, cfg=mobilevit_config["layer3"])
in_channels = out_channels
self.layer_4, out_channels = self._make_layer(
input_channel=in_channels,
cfg=mobilevit_config["layer4"],
dilate=dilate_l4)
in_channels = out_channels
self.layer_5, out_channels = self._make_layer(
input_channel=in_channels,
cfg=mobilevit_config["layer5"],
dilate=dilate_l5)
self.conv_1x1_exp = nn.Identity()
self.classifier = nn.Sequential()
self.classifier.add_sublayer(
name="global_pool",
sublayer=nn.Sequential(nn.AdaptiveAvgPool2D(1), nn.Flatten()))
self.classifier.add_sublayer(
name="fc", sublayer=nn.Linear(out_channels, class_num))
# weight initialization
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Conv2D):
fan_in = m.weight.shape[1] * m.weight.shape[2] * m.weight.shape[3]
bound = 1.0 / fan_in**0.5
nn.initializer.Uniform(-bound, bound)(m.weight)
if m.bias is not None:
nn.initializer.Uniform(-bound, bound)(m.bias)
elif isinstance(m, (nn.BatchNorm2D, nn.GroupNorm)):
nn.initializer.Constant(1)(m.weight)
nn.initializer.Constant(0)(m.bias)
elif isinstance(m, nn.Linear):
nn.initializer.XavierUniform()(m.weight)
if m.bias is not None:
nn.initializer.Constant(0)(m.bias)
def _make_layer(self, input_channel, cfg, dilate=False):
block_type = cfg.get("block_type", "mobilevit")
if block_type.lower() == "mobilevit":
return self._make_mit_layer(
input_channel=input_channel, cfg=cfg, dilate=dilate)
else:
return self._make_mobilenet_layer(
input_channel=input_channel, cfg=cfg)
def _make_mit_layer(self, input_channel, cfg, dilate=False):
prev_dilation = self.dilation
block = []
stride = cfg.get("stride", 1)
if stride == 2:
if dilate:
self.dilation *= 2
stride = 1
layer = InvertedResidual(
in_channels=input_channel,
out_channels=cfg.get("out_channels"),
stride=stride,
expand_ratio=cfg.get("mv_expand_ratio", 4),
dilation=prev_dilation)
block.append(layer)
input_channel = cfg.get("out_channels")
block.append(
MobileViTv2Block(
in_channels=input_channel,
attn_unit_dim=cfg["attn_unit_dim"],
ffn_multiplier=cfg.get("ffn_multiplier"),
n_attn_blocks=cfg.get("attn_blocks", 1),
ffn_dropout=0.,
attn_dropout=0.,
dilation=self.dilation,
patch_h=cfg.get("patch_h", 2),
patch_w=cfg.get("patch_w", 2)))
return nn.Sequential(*block), input_channel
def _make_mobilenet_layer(self, input_channel, cfg):
output_channels = cfg.get("out_channels")
num_blocks = cfg.get("num_blocks", 2)
expand_ratio = cfg.get("expand_ratio", 4)
block = []
for i in range(num_blocks):
stride = cfg.get("stride", 1) if i == 0 else 1
layer = InvertedResidual(
in_channels=input_channel,
out_channels=output_channels,
stride=stride,
expand_ratio=expand_ratio)
block.append(layer)
input_channel = output_channels
return nn.Sequential(*block), input_channel
def extract_features(self, x):
x = self.conv_1(x)
x = self.layer_1(x)
x = self.layer_2(x)
x = self.layer_3(x)
x = self.layer_4(x)
x = self.layer_5(x)
x = self.conv_1x1_exp(x)
return x
def forward(self, x):
x = self.extract_features(x)
x = self.classifier(x)
return x
def _load_pretrained(pretrained, model, model_url, use_ssld=False):
if pretrained is False:
pass
elif pretrained is True:
load_dygraph_pretrain_from_url(model, model_url, use_ssld=use_ssld)
elif isinstance(pretrained, str):
load_dygraph_pretrain(model, pretrained)
else:
raise RuntimeError(
"pretrained type is not available. Please use `string` or `boolean` type."
)
def get_configuration(width_multiplier) -> Dict:
ffn_multiplier = 2
mv2_exp_mult = 2 # max(1.0, min(2.0, 2.0 * width_multiplier))
layer_0_dim = max(16, min(64, 32 * width_multiplier))
layer_0_dim = int(make_divisible(layer_0_dim, divisor=8, min_value=16))
config = {
"layer0": {
"img_channels": 3,
"out_channels": layer_0_dim,
},
"layer1": {
"out_channels": int(make_divisible(64 * width_multiplier, divisor=16)),
"expand_ratio": mv2_exp_mult,
"num_blocks": 1,
"stride": 1,
"block_type": "mv2",
},
"layer2": {
"out_channels": int(make_divisible(128 * width_multiplier, divisor=8)),
"expand_ratio": mv2_exp_mult,
"num_blocks": 2,
"stride": 2,
"block_type": "mv2",
},
"layer3": { # 28x28
"out_channels": int(make_divisible(256 * width_multiplier, divisor=8)),
"attn_unit_dim": int(make_divisible(128 * width_multiplier, divisor=8)),
"ffn_multiplier": ffn_multiplier,
"attn_blocks": 2,
"patch_h": 2,
"patch_w": 2,
"stride": 2,
"mv_expand_ratio": mv2_exp_mult,
"block_type": "mobilevit",
},
"layer4": { # 14x14
"out_channels": int(make_divisible(384 * width_multiplier, divisor=8)),
"attn_unit_dim": int(make_divisible(192 * width_multiplier, divisor=8)),
"ffn_multiplier": ffn_multiplier,
"attn_blocks": 4,
"patch_h": 2,
"patch_w": 2,
"stride": 2,
"mv_expand_ratio": mv2_exp_mult,
"block_type": "mobilevit",
},
"layer5": { # 7x7
"out_channels": int(make_divisible(512 * width_multiplier, divisor=8)),
"attn_unit_dim": int(make_divisible(256 * width_multiplier, divisor=8)),
"ffn_multiplier": ffn_multiplier,
"attn_blocks": 3,
"patch_h": 2,
"patch_w": 2,
"stride": 2,
"mv_expand_ratio": mv2_exp_mult,
"block_type": "mobilevit",
},
"last_layer_exp_factor": 4,
}
return config
def MobileViTv2_x2_0(pretrained=False, use_ssld=False, **kwargs):
width_multiplier = 2.0
model = MobileViTv2(get_configuration(width_multiplier), **kwargs)
_load_pretrained(
pretrained, model, MODEL_URLS["MobileViTv2_x2_0"], use_ssld=use_ssld)
return model
def MobileViTv2_x1_75(pretrained=False, use_ssld=False, **kwargs):
width_multiplier = 1.75
model = MobileViTv2(get_configuration(width_multiplier), **kwargs)
_load_pretrained(
pretrained, model, MODEL_URLS["MobileViTv2_x1_75"], use_ssld=use_ssld)
return model
def MobileViTv2_x1_5(pretrained=False, use_ssld=False, **kwargs):
width_multiplier = 1.5
model = MobileViTv2(get_configuration(width_multiplier), **kwargs)
_load_pretrained(
pretrained, model, MODEL_URLS["MobileViTv2_x1_5"], use_ssld=use_ssld)
return model
def MobileViTv2_x1_25(pretrained=False, use_ssld=False, **kwargs):
width_multiplier = 1.25
model = MobileViTv2(get_configuration(width_multiplier), **kwargs)
_load_pretrained(
pretrained, model, MODEL_URLS["MobileViTv2_x1_25"], use_ssld=use_ssld)
return model
def MobileViTv2_x1_0(pretrained=False, use_ssld=False, **kwargs):
width_multiplier = 1.0
model = MobileViTv2(get_configuration(width_multiplier), **kwargs)
_load_pretrained(
pretrained, model, MODEL_URLS["MobileViTv2_x1_0"], use_ssld=use_ssld)
return model
def MobileViTv2_x0_75(pretrained=False, use_ssld=False, **kwargs):
width_multiplier = 0.75
model = MobileViTv2(get_configuration(width_multiplier), **kwargs)
_load_pretrained(
pretrained, model, MODEL_URLS["MobileViTv2_x0_75"], use_ssld=use_ssld)
return model
def MobileViTv2_x0_5(pretrained=False, use_ssld=False, **kwargs):
width_multiplier = 0.5
model = MobileViTv2(get_configuration(width_multiplier), **kwargs)
_load_pretrained(
pretrained, model, MODEL_URLS["MobileViTv2_x0_5"], use_ssld=use_ssld)
return model
# global configs
Global:
checkpoints: null
pretrained_model: null
output_dir: ./output/
device: gpu
save_interval: 1
eval_during_train: True
eval_interval: 1
epochs: 300
print_batch_step: 10
use_visualdl: False
# used for static mode and model export
image_shape: [3, 256, 256]
save_inference_dir: ./inference
use_dali: False
# mixed precision training
AMP:
scale_loss: 65536
use_dynamic_loss_scaling: True
# O1: mixed fp16
level: O1
# model ema
EMA:
decay: 0.9995
# model architecture
Arch:
name: MobileViTv2_x0_5
class_num: 1000
# loss function config for training/eval process
Loss:
Train:
- CELoss:
weight: 1.0
epsilon: 0.1
Eval:
- CELoss:
weight: 1.0
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
epsilon: 1e-8
weight_decay: 0.004
one_dim_param_no_weight_decay: True
lr:
# for 8 cards
name: Cosine
learning_rate: 0.009
eta_min: 0.0009
warmup_epoch: 16 # 20000 iterations
warmup_start_lr: 1e-6
# by_epoch: True
clip_norm: 10
# data loader for train and eval
DataLoader:
Train:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/train_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
backend: pil
- RandCropImage:
size: 256
interpolation: bicubic
backend: pil
use_log_aspect: True
- RandFlipImage:
flip_code: 1
- RandAugmentV3:
num_layers: 2
interpolation: bicubic
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
- RandomErasing:
EPSILON: 0.25
sl: 0.02
sh: 1.0/3.0
r1: 0.3
attempt: 10
use_log_aspect: True
mode: const
batch_transform_ops:
- OpSampler:
MixupOperator:
alpha: 0.2
prob: 0.25
CutmixOperator:
alpha: 1.0
prob: 0.25
sampler:
name: DistributedBatchSampler
batch_size: 128
drop_last: False
shuffle: True
loader:
num_workers: 4
use_shared_memory: True
Eval:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/val_list.txt
transform_ops:
- DecodeImage:
to_np: False
channel_first: False
backend: pil
- ResizeImage:
resize_short: 288
interpolation: bicubic
backend: pil
- CropImage:
size: 256
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
sampler:
name: DistributedBatchSampler
batch_size: 128
drop_last: False
shuffle: False
loader:
num_workers: 4
use_shared_memory: True
Infer:
infer_imgs: docs/images/inference_deployment/whl_demo.jpg
batch_size: 10
transforms:
- DecodeImage:
to_np: False
channel_first: False
backend: pil
- ResizeImage:
resize_short: 288
interpolation: bicubic
backend: pil
- CropImage:
size: 256
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
- ToCHWImage:
PostProcess:
name: Topk
topk: 5
class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
Metric:
Train:
- TopkAcc:
topk: [1, 5]
Eval:
- TopkAcc:
topk: [1, 5]
# global configs
Global:
checkpoints: null
pretrained_model: null
output_dir: ./output/
device: gpu
save_interval: 1
eval_during_train: True
eval_interval: 1
epochs: 300
print_batch_step: 10
use_visualdl: False
# used for static mode and model export
image_shape: [3, 256, 256]
save_inference_dir: ./inference
use_dali: False
# mixed precision training
AMP:
scale_loss: 65536
use_dynamic_loss_scaling: True
# O1: mixed fp16
level: O1
# model ema
EMA:
decay: 0.9995
# model architecture
Arch:
name: MobileViTv2_x1_0
class_num: 1000
# loss function config for training/eval process
Loss:
Train:
- CELoss:
weight: 1.0
epsilon: 0.1
Eval:
- CELoss:
weight: 1.0
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
epsilon: 1e-8
weight_decay: 0.013
one_dim_param_no_weight_decay: True
lr:
# for 8 cards
name: Cosine
learning_rate: 0.0075
eta_min: 0.00075
warmup_epoch: 16 # 20000 iterations
warmup_start_lr: 1e-6
# by_epoch: True
clip_norm: 10
# data loader for train and eval
DataLoader:
Train:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/train_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
backend: pil
- RandCropImage:
size: 256
interpolation: bicubic
backend: pil
use_log_aspect: True
- RandFlipImage:
flip_code: 1
- RandAugmentV3:
num_layers: 2
interpolation: bicubic
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
- RandomErasing:
EPSILON: 0.25
sl: 0.02
sh: 1.0/3.0
r1: 0.3
attempt: 10
use_log_aspect: True
mode: const
batch_transform_ops:
- OpSampler:
MixupOperator:
alpha: 0.2
prob: 0.25
CutmixOperator:
alpha: 1.0
prob: 0.25
sampler:
name: DistributedBatchSampler
batch_size: 128
drop_last: False
shuffle: True
loader:
num_workers: 4
use_shared_memory: True
Eval:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/val_list.txt
transform_ops:
- DecodeImage:
to_np: False
channel_first: False
backend: pil
- ResizeImage:
resize_short: 288
interpolation: bicubic
backend: pil
- CropImage:
size: 256
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
sampler:
name: DistributedBatchSampler
batch_size: 128
drop_last: False
shuffle: False
loader:
num_workers: 4
use_shared_memory: True
Infer:
infer_imgs: docs/images/inference_deployment/whl_demo.jpg
batch_size: 10
transforms:
- DecodeImage:
to_np: False
channel_first: False
backend: pil
- ResizeImage:
resize_short: 288
interpolation: bicubic
backend: pil
- CropImage:
size: 256
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
- ToCHWImage:
PostProcess:
name: Topk
topk: 5
class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
Metric:
Train:
- TopkAcc:
topk: [1, 5]
Eval:
- TopkAcc:
topk: [1, 5]
# global configs
Global:
checkpoints: null
pretrained_model: null
output_dir: ./output/
device: gpu
save_interval: 1
eval_during_train: True
eval_interval: 1
epochs: 300
print_batch_step: 10
use_visualdl: False
# used for static mode and model export
image_shape: [3, 256, 256]
save_inference_dir: ./inference
use_dali: False
update_freq: 2 # for 4 gpus
# mixed precision training
AMP:
scale_loss: 65536
use_dynamic_loss_scaling: True
# O1: mixed fp16
level: O1
# model ema
EMA:
decay: 0.9995
# model architecture
Arch:
name: MobileViTv2_x1_5
class_num: 1000
# loss function config for training/eval process
Loss:
Train:
- CELoss:
weight: 1.0
epsilon: 0.1
Eval:
- CELoss:
weight: 1.0
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
epsilon: 1e-8
weight_decay: 0.029
one_dim_param_no_weight_decay: True
lr:
# for 8 cards
name: Cosine
learning_rate: 0.0035 # for total batch size 1024
eta_min: 0.00035
warmup_epoch: 16 # 20000 iterations
warmup_start_lr: 1e-6
# by_epoch: True
clip_norm: 10
# data loader for train and eval
DataLoader:
Train:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/train_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
backend: pil
- RandCropImage:
size: 256
interpolation: bicubic
backend: pil
use_log_aspect: True
- RandFlipImage:
flip_code: 1
- RandAugmentV3:
num_layers: 2
interpolation: bicubic
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
- RandomErasing:
EPSILON: 0.25
sl: 0.02
sh: 1.0/3.0
r1: 0.3
attempt: 10
use_log_aspect: True
mode: const
batch_transform_ops:
- OpSampler:
MixupOperator:
alpha: 0.2
prob: 0.25
CutmixOperator:
alpha: 1.0
prob: 0.25
sampler:
name: DistributedBatchSampler
batch_size: 128
drop_last: False
shuffle: True
loader:
num_workers: 4
use_shared_memory: True
Eval:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/val_list.txt
transform_ops:
- DecodeImage:
to_np: False
channel_first: False
backend: pil
- ResizeImage:
resize_short: 288
interpolation: bicubic
backend: pil
- CropImage:
size: 256
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
sampler:
name: DistributedBatchSampler
batch_size: 128
drop_last: False
shuffle: False
loader:
num_workers: 4
use_shared_memory: True
Infer:
infer_imgs: docs/images/inference_deployment/whl_demo.jpg
batch_size: 10
transforms:
- DecodeImage:
to_np: False
channel_first: False
backend: pil
- ResizeImage:
resize_short: 288
interpolation: bicubic
backend: pil
- CropImage:
size: 256
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
- ToCHWImage:
PostProcess:
name: Topk
topk: 5
class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
Metric:
Train:
- TopkAcc:
topk: [1, 5]
Eval:
- TopkAcc:
topk: [1, 5]
# global configs
Global:
checkpoints: null
pretrained_model: null
output_dir: ./output/
device: gpu
save_interval: 1
eval_during_train: True
eval_interval: 1
epochs: 300
print_batch_step: 10
use_visualdl: False
# used for static mode and model export
image_shape: [3, 256, 256]
save_inference_dir: ./inference
use_dali: False
# mixed precision training
AMP:
scale_loss: 65536
use_dynamic_loss_scaling: True
# O1: mixed fp16
level: O1
# model ema
EMA:
decay: 0.9995
# model architecture
Arch:
name: MobileViTv2_x2_0
class_num: 1000
# loss function config for training/eval process
Loss:
Train:
- CELoss:
weight: 1.0
epsilon: 0.1
Eval:
- CELoss:
weight: 1.0
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
epsilon: 1e-8
weight_decay: 0.05
one_dim_param_no_weight_decay: True
lr:
# for 8 cards
name: Cosine
learning_rate: 0.002
eta_min: 0.0002
warmup_epoch: 16 # 20000 iterations
warmup_start_lr: 1e-6
# by_epoch: True
clip_norm: 10
# data loader for train and eval
DataLoader:
Train:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/train_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
backend: pil
- RandCropImage:
size: 256
interpolation: bicubic
backend: pil
use_log_aspect: True
- RandFlipImage:
flip_code: 1
- RandAugmentV3:
num_layers: 2
interpolation: bicubic
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
- RandomErasing:
EPSILON: 0.25
sl: 0.02
sh: 1.0/3.0
r1: 0.3
attempt: 10
use_log_aspect: True
mode: const
batch_transform_ops:
- OpSampler:
MixupOperator:
alpha: 0.2
prob: 0.25
CutmixOperator:
alpha: 1.0
prob: 0.25
sampler:
name: DistributedBatchSampler
batch_size: 128
drop_last: False
shuffle: True
loader:
num_workers: 4
use_shared_memory: True
Eval:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/val_list.txt
transform_ops:
- DecodeImage:
to_np: False
channel_first: False
backend: pil
- ResizeImage:
resize_short: 288
interpolation: bicubic
backend: pil
- CropImage:
size: 256
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
sampler:
name: DistributedBatchSampler
batch_size: 128
drop_last: False
shuffle: False
loader:
num_workers: 4
use_shared_memory: True
Infer:
infer_imgs: docs/images/inference_deployment/whl_demo.jpg
batch_size: 10
transforms:
- DecodeImage:
to_np: False
channel_first: False
backend: pil
- ResizeImage:
resize_short: 288
interpolation: bicubic
backend: pil
- CropImage:
size: 256
- NormalizeImage:
scale: 1.0/255.0
mean: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
order: ''
- ToCHWImage:
PostProcess:
name: Topk
topk: 5
class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
Metric:
Train:
- TopkAcc:
topk: [1, 5]
Eval:
- TopkAcc:
topk: [1, 5]
@@ -16,6 +16,7 @@ from ppcls.data.preprocess.ops.autoaugment import ImageNetPolicy as RawImageNetP
from ppcls.data.preprocess.ops.randaugment import RandAugment as RawRandAugment
from ppcls.data.preprocess.ops.randaugment import RandomApply
from ppcls.data.preprocess.ops.randaugment import RandAugmentV2 as RawRandAugmentV2
from ppcls.data.preprocess.ops.randaugment import RandAugmentV3 as RawRandAugmentV3
from ppcls.data.preprocess.ops.timm_autoaugment import RawTimmAutoAugment
from ppcls.data.preprocess.ops.cutout import Cutout
@@ -58,6 +59,7 @@ import numpy as np
from PIL import Image
import random


def transform(data, ops=[]):
    """ transform """
    for op in ops:
@@ -122,6 +124,25 @@ class RandAugmentV2(RawRandAugmentV2):
        return img
class RandAugmentV3(RawRandAugmentV3):
""" RandAugmentV3 wrapper to auto fit different img types """
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def __call__(self, img):
if not isinstance(img, Image.Image):
img = np.ascontiguousarray(img)
img = Image.fromarray(img)
img = super().__call__(img)
if isinstance(img, Image.Image):
img = np.asarray(img)
return img
class TimmAutoAugment(RawTimmAutoAugment):
    """ TimmAutoAugment wrapper to auto fit different img types. """
...
===========================train_params===========================
model_name:MobileViTv2_x0_5
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_lite_infer=2|whole_train_whole_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x0_5.yaml -o Global.seed=1234 -o DataLoader.Train.sampler.shuffle=False -o DataLoader.Train.loader.num_workers=0 -o DataLoader.Train.loader.use_shared_memory=False -o Global.print_batch_step=1 -o Global.eval_during_train=False -o Global.save_interval=2
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x0_5.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x0_5.yaml
quant_export:null
fpgm_export:null
distill_export:null
kl_quant:null
export2:null
inference_dir:null
infer_model:../inference/
infer_export:True
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml -o PreProcess.transform_ops.0.ResizeImage.resize_short=288 -o PreProcess.transform_ops.1.CropImage.size=256 -o PreProcess.transform_ops.2.NormalizeImage.mean=[0.,0.,0.] -o PreProcess.transform_ops.2.NormalizeImage.std=[1.,1.,1.]
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:False
-o Global.cpu_num_threads:1
-o Global.batch_size:1
-o Global.use_tensorrt:False
-o Global.use_fp16:False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val/ILSVRC2012_val_00000001.JPEG
-o Global.save_log_path:null
-o Global.benchmark:False
null:null
null:null
===========================disable_train_benchmark==========================
batch_size:128
fp_items:fp32
epoch:1
model_type:norm_train
--profiler_options:batch_range=[10,20];state=GPU;tracer_option=Default;profile_path=model.profile
flags:FLAGS_eager_delete_tensor_gb=0.0;FLAGS_fraction_of_gpu_memory_to_use=0.98;FLAGS_conv_workspace_size_limit=4096
===========================infer_benchmark_params==========================
random_infer_input:[{float32,[3,256,256]}]
===========================train_params===========================
model_name:MobileViTv2_x1_0
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_lite_infer=2|whole_train_whole_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x1_0.yaml -o Global.seed=1234 -o DataLoader.Train.sampler.shuffle=False -o DataLoader.Train.loader.num_workers=0 -o DataLoader.Train.loader.use_shared_memory=False -o Global.print_batch_step=1 -o Global.eval_during_train=False -o Global.save_interval=2
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x1_0.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x1_0.yaml
quant_export:null
fpgm_export:null
distill_export:null
kl_quant:null
export2:null
inference_dir:null
infer_model:../inference/
infer_export:True
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml -o PreProcess.transform_ops.0.ResizeImage.resize_short=288 -o PreProcess.transform_ops.1.CropImage.size=256 -o PreProcess.transform_ops.2.NormalizeImage.mean=[0.,0.,0.] -o PreProcess.transform_ops.2.NormalizeImage.std=[1.,1.,1.]
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:False
-o Global.cpu_num_threads:1
-o Global.batch_size:1
-o Global.use_tensorrt:False
-o Global.use_fp16:False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val/ILSVRC2012_val_00000001.JPEG
-o Global.save_log_path:null
-o Global.benchmark:False
null:null
null:null
===========================disable_train_benchmark==========================
batch_size:128
fp_items:fp32
epoch:1
model_type:norm_train
--profiler_options:batch_range=[10,20];state=GPU;tracer_option=Default;profile_path=model.profile
flags:FLAGS_eager_delete_tensor_gb=0.0;FLAGS_fraction_of_gpu_memory_to_use=0.98;FLAGS_conv_workspace_size_limit=4096
===========================infer_benchmark_params==========================
random_infer_input:[{float32,[3,256,256]}]
===========================train_params===========================
model_name:MobileViTv2_x1_5
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_lite_infer=2|whole_train_whole_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x1_5.yaml -o Global.seed=1234 -o DataLoader.Train.sampler.shuffle=False -o DataLoader.Train.loader.num_workers=0 -o DataLoader.Train.loader.use_shared_memory=False -o Global.print_batch_step=1 -o Global.eval_during_train=False -o Global.save_interval=2
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x1_5.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x1_5.yaml
quant_export:null
fpgm_export:null
distill_export:null
kl_quant:null
export2:null
inference_dir:null
infer_model:../inference/
infer_export:True
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml -o PreProcess.transform_ops.0.ResizeImage.resize_short=288 -o PreProcess.transform_ops.1.CropImage.size=256 -o PreProcess.transform_ops.2.NormalizeImage.mean=[0.,0.,0.] -o PreProcess.transform_ops.2.NormalizeImage.std=[1.,1.,1.]
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:False
-o Global.cpu_num_threads:1
-o Global.batch_size:1
-o Global.use_tensorrt:False
-o Global.use_fp16:False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val/ILSVRC2012_val_00000001.JPEG
-o Global.save_log_path:null
-o Global.benchmark:False
null:null
null:null
===========================disable_train_benchmark==========================
batch_size:128
fp_items:fp32
epoch:1
model_type:norm_train
--profiler_options:batch_range=[10,20];state=GPU;tracer_option=Default;profile_path=model.profile
flags:FLAGS_eager_delete_tensor_gb=0.0;FLAGS_fraction_of_gpu_memory_to_use=0.98;FLAGS_conv_workspace_size_limit=4096
===========================infer_benchmark_params==========================
random_infer_input:[{float32,[3,256,256]}]
===========================train_params===========================
model_name:MobileViTv2_x2_0
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_lite_infer=2|whole_train_whole_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x2_0.yaml -o Global.seed=1234 -o DataLoader.Train.sampler.shuffle=False -o DataLoader.Train.loader.num_workers=0 -o DataLoader.Train.loader.use_shared_memory=False -o Global.print_batch_step=1 -o Global.eval_during_train=False -o Global.save_interval=2
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x2_0.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/MobileViTv2/MobileViTv2_x2_0.yaml
quant_export:null
fpgm_export:null
distill_export:null
kl_quant:null
export2:null
inference_dir:null
infer_model:../inference/
infer_export:True
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml -o PreProcess.transform_ops.0.ResizeImage.resize_short=288 -o PreProcess.transform_ops.1.CropImage.size=256 -o PreProcess.transform_ops.2.NormalizeImage.mean=[0.,0.,0.] -o PreProcess.transform_ops.2.NormalizeImage.std=[1.,1.,1.]
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:False
-o Global.cpu_num_threads:1
-o Global.batch_size:1
-o Global.use_tensorrt:False
-o Global.use_fp16:False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val/ILSVRC2012_val_00000001.JPEG
-o Global.save_log_path:null
-o Global.benchmark:False
null:null
null:null
===========================disable_train_benchmark==========================
batch_size:128
fp_items:fp32
epoch:1
model_type:norm_train
--profiler_options:batch_range=[10,20];state=GPU;tracer_option=Default;profile_path=model.profile
flags:FLAGS_eager_delete_tensor_gb=0.0;FLAGS_fraction_of_gpu_memory_to_use=0.98;FLAGS_conv_workspace_size_limit=4096
===========================infer_benchmark_params==========================
random_infer_input:[{float32,[3,256,256]}]