- Since PaddleClas was open-sourced, we have collected frequently asked questions from issues and user groups and provide brief answers here, hoping to offer a reference for everyone and save you from detours.
- Image classification, recognition and retrieval are fast-moving fields with many talented researchers and quickly updated models and papers. The answers here mainly rely on our limited project experience, so they cannot cover every aspect. We sincerely hope that knowledgeable readers will help supplement and correct the content; thanks a lot.
## PaddleClas FAQ Summary
- [1. 30 Questions About Image Classification](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#1)
- [1.4 Model Inference and Prediction](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#1.4)
- [2. Application of PaddleClas](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#2)
## 1. 30 Questions About Image Classification
### 1.1 Basic Knowledge
- Q: What classification metrics are commonly used in the field of image classification?
- A:
- For single-label image classification (containing only one foreground category plus background), the evaluation metrics are Accuracy, Precision, Recall, F-score, etc. Let TP (True Positive) denote positive samples predicted as positive, FP (False Positive) negative samples predicted as positive, TN (True Negative) negative samples predicted as negative, and FN (False Negative) positive samples predicted as negative. Then Accuracy = (TP + TN) / NUM, Precision = TP / (TP + FP), Recall = TP / (TP + FN).
- For image classification problems with more than one class, the evaluation metrics are Accuracy and Class-wise Accuracy. Accuracy indicates the percentage of correctly predicted images over all classes out of the total number of images; Class-wise Accuracy is obtained by calculating the Accuracy for each class separately and then averaging over all classes. These metrics are illustrated in the short sketch below.
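A minimal sketch of the metrics above, computed from confusion-matrix counts (the counts are illustrative, not from any real experiment):
```
# Computing Accuracy / Precision / Recall / F-score from confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 45, 5          # illustrative values
num = tp + fp + tn + fn

accuracy = (tp + tn) / num              # Accuracy  = (TP + TN) / NUM
precision = tp / (tp + fp)              # Precision = TP / (TP + FP)
recall = tp / (tp + fn)                 # Recall    = TP / (TP + FN)
f_score = 2 * precision * recall / (precision + recall)  # F1-score

print(accuracy, precision, recall, f_score)
```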
> >
- Q: How to choose the right model for training according to your own task?
- A: If you want to deploy on the server side with a high requirement for accuracy but no strict limit on model storage size or prediction speed, then server-side models such as ResNet_vd, Res2Net_vd, DenseNet, Xception, etc. are recommended. If you want to deploy on the mobile side, then MobileNetV3 and GhostNet are recommended. Meanwhile, we suggest referring to the speed-accuracy metrics chart in the [Model Library](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/models_intro.md) when choosing models.
> >
- Q: How to initialize the parameters and what kind of initialization can speed up the convergence of the model?
- A: It is well known that the initialization of parameters can affect the final performance of the model. In general, if the target dataset is not very large, it is recommended to use a pre-trained model trained on ImageNet-1k for initialization. If the network is designed manually or there are no pre-trained weights based on ImageNet-1k, you can use Xavier initialization or MSRA initialization; the former was proposed for the Sigmoid function and is less friendly to ReLU. The deeper the network is, the smaller the variance of each layer's input becomes, and the harder the network is to train. So when more ReLU activation functions are used in the neural network, MSRA initialization is the better choice.
> >
- Q: What are the better solutions to the problem of parameter redundancy in deep neural networks?
- A: There are several major approaches to compressing models and reducing parameter redundancy, such as pruning, quantization, and knowledge distillation. Model pruning refers to removing relatively unimportant weights from the weight matrix and then fine-tuning the network again. Model quantization refers to converting floating-point computation into low-bit fixed-point computation, e.g. 8-bit or 4-bit, which can effectively reduce the computational intensity, parameter size, and memory consumption of the model. Knowledge distillation refers to using a teacher model to guide a student model in learning a specific task, so that the small model, with its number of parameters unchanged, gains a large performance improvement or even reaches accuracy similar to that of the large model.
> >
- Q: How to choose the right classification model as a backbone network in other tasks, such as object detection, image segmentation, keypoint detection, etc.?
- A:
If inference speed is not a concern, it is recommended to use pre-trained models and backbone networks with higher accuracy. A series of SSLD knowledge distillation pre-trained models are open-sourced in PaddleClas, such as ResNet50_vd_ssld and Res2Net200_vd_26w_4s_ssld, which excel in both accuracy and speed. For specific tasks, such as image segmentation or keypoint detection, which require higher image resolution, it is recommended to use neural network models such as HRNet that take both network depth and resolution into account. PaddleClas also provides HRNet SSLD distillation series pre-trained models, including HRNet_W18_C_ssld and HRNet_W48_C_ssld, which have very high accuracy. You can use these models and backbones to improve the accuracy of your own models on other tasks.
> >
- Q: What is the attention mechanism? What are the common methods of it?
- A: The Attention Mechanism (AM) originated from the study of human vision. Using the mechanism on computer vision tasks can effectively capture the useful regions in the images and thus improve the overall network performance. Currently, the most commonly used ones are [SE block](https://arxiv.org/abs/1709.01507), [SK-block](https://arxiv.org/abs/1903.06586), [Non-local block](https://arxiv.org/abs/1711.07971), [GC block](https://arxiv.org/abs/1904.11492), [CBAM](https://arxiv.org/abs/1807.06521), etc. The core idea is to learn the importance of feature maps in different regions or different channels, so that the network can pay more attention to the salient regions.
### 1.2 Model Training
> >
- Q: What will happen if a model with 10 million classes is trained during the image classification with deep convolutional networks?
- A: Because of the large number of parameters in the FC layer, the memory/video memory/model storage usage will increase significantly; the model convergence speed will also be slower. In this case, it is recommended to add a layer of FC with a smaller dimension before the last FC layer, which can drastically reduce the storage size of the model.
> >
- Q: What are the possible reasons if the model converges poorly during the training process?
- A: There are several points that can be investigated: (1) Check the data annotation to ensure that there are no problems with the labels of the training and validation sets. (2) Try to adjust the learning rate (initially by a factor of 10). A learning rate that is too large (training oscillation) or too small (slow convergence) may lead to poor convergence. (3) If the amount of data is huge while the model is too small, the model may not be able to learn all the features of the data. (4) Check whether normalization is used in the data preprocessing; convergence may be slower without it. (5) If the amount of data is relatively small, you can try to load one of the pre-trained models based on the ImageNet-1k dataset provided in PaddleClas, which can greatly speed up training convergence. (6) If there is a long-tail problem in the dataset, you can refer to the [solution to the long-tail problem of data](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#long_tail).
> >
- Q: How to choose the right optimizer when training image classification tasks?
- A: Since the emergence of deep learning, there has been a lot of research on optimizers, which aim to minimize the loss function to find the right weights for a given task. Currently, the main optimizers used in the industry are SGD, RMSProp, Adam, AdaDelta, etc. Among them, the SGD optimizer with momentum is widely used in academia and industry (for classification tasks), so most of the models we publish also adopt this optimizer to perform gradient descent on the loss function. It has two disadvantages: slow convergence, and reliance on experience when setting the initial learning rate. However, if the initial learning rate is set properly and the number of iterations is sufficient, this optimizer often stands out among the others, obtaining higher accuracy on the validation set. Optimizers with adaptive learning rates, such as Adam and RMSProp, tend to converge faster, but their final accuracy is slightly worse. If you pursue faster convergence, we recommend these adaptive learning rate optimizers; for higher final accuracy, SGD with momentum is recommended.
- Q: What are the current mainstream learning rate decay strategies? How to choose?
- A: The learning rate is the hyperparameter that controls how fast the network weights are adjusted along the gradient of the loss function. The lower the learning rate, the more slowly the loss changes. While a low learning rate ensures that no local minima are missed, it also means slower convergence, especially when trapped in a plateau region. We cannot use the same learning rate throughout training, otherwise the optimum cannot be reached, so the learning rate needs to be adjusted during training. In the initial stage, since the weights are randomly initialized and the loss decreases quickly, a larger learning rate can be used. In the later stage, since the weights are close to the optimum, a large learning rate can no longer find a better value, so a smaller learning rate is the better choice. As for the decay strategy, many researchers or practitioners use piecewise_decay (step decay), i.e. a stepwise decaying learning rate. There are also other strategies, such as polynomial_decay, exponential_decay and cosine_decay. Among them, cosine_decay requires no hyperparameter tuning and is more robust, so it has become the preferred decay method for improving model accuracy. The learning rate curves of cosine_decay and piecewise_decay are shown in the following figure: cosine_decay keeps a relatively large learning rate throughout training, so it converges more slowly, but its final accuracy is better than that of piecewise_decay.[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/models/lr_decay.jpeg)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/models/lr_decay.jpeg)
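As a framework-agnostic sketch, the two decay strategies can be written as simple functions of the current epoch (the boundaries, base learning rate and total epochs below are illustrative values):
```
import math

def piecewise_decay(epoch, base_lr=0.1, boundaries=(30, 60, 90), gamma=0.1):
    # Step decay: divide the learning rate by 10 at each boundary epoch.
    lr = base_lr
    for b in boundaries:
        if epoch >= b:
            lr *= gamma
    return lr

def cosine_decay(epoch, base_lr=0.1, total_epochs=120):
    # Cosine decay: keep a relatively large learning rate for most of training,
    # then decay smoothly towards 0 at the end.
    return base_lr * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
```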
> >
- Q: What is the Warmup strategy? Where is it applied?
- A: The warmup strategy, as the name implies, warms up the learning rate: instead of using the maximum learning rate from the very beginning, the network is first trained with a learning rate that gradually increases to the maximum, after which the learning rate is decayed. When training a neural network with a large batch_size, it is recommended to use the warmup strategy. Experiments show that warmup can steadily improve the accuracy of the model when the batch_size is large. For example, when training MobileNetV3, we set the warmup epochs to 5 by default, i.e., the learning rate is first increased from 0 to the maximum value over 5 epochs and then decayed as usual.
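In PaddleClas configuration files, warmup is typically enabled through the `warmup_epoch` field of the learning rate setting. The snippet below is a sketch following the PaddleClas 2.x config convention; the concrete values are illustrative, not taken from a specific released config:
```
Optimizer:
  name: Momentum
  momentum: 0.9
  lr:
    name: Cosine
    learning_rate: 0.1
    warmup_epoch: 5   # linearly increase the lr from 0 to 0.1 within the first 5 epochs
  regularizer:
    name: 'L2'
    coeff: 0.0001
```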
> >
- Q: What is `batch size`? How to choose the appropriate `batch size` during training?
- A: `batch size` is an important hyperparameter in neural network training, whose value determines how much data is fed into the network at a time. According to the paper [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677), when the learning rate is scaled linearly with the `batch size`, the convergence accuracy is almost unaffected. When training on ImageNet-1k, most neural networks choose an initial learning rate of 0.1 and a `batch size` of 256. Therefore, depending on the actual model size and available memory, the learning rate can be set to 0.1 * k when the batch_size is set to 256 * k. This setting can also serve as the initial parameters for further tuning the learning rate in real tasks to obtain better performance.
> >
- Q: What is weight_decay? How to choose it?
- A: Overfitting is a common term in machine learning, which simply means that a model performs well on the training data but less satisfactorily on the test data. Image classification also suffers from overfitting, and many regularization methods have been proposed to avoid it, among which weight_decay is one of the most widely used. When using the SGD optimizer, weight_decay is equivalent to adding an L2 regularization term to the final loss function, which makes the network weights tend towards smaller values, so eventually the parameter values in the whole network tend towards 0 and the generalization performance of the model improves accordingly. In the implementations of major deep learning frameworks, this value is the coefficient of the L2 regularization term, which is called L2Decay in the PaddlePaddle framework. The larger the coefficient, the stronger the regularization, and the more the model tends to underfit. When training on ImageNet, most networks set this parameter to 1e-4, and some smaller networks such as the MobileNet series set it between 1e-5 and 4e-5 to avoid underfitting. Of course, the setting of this value is also related to the specific dataset: when the dataset is large, the network tends to underfit and the value should be reduced appropriately; when it is small, the network tends to overfit and the value should be increased. The following table shows the accuracy of MobileNetV1_x0_25 on ImageNet-1k with different l2_decay values (a schematic SGD step follows the table). Since MobileNetV1_x0_25 is a relatively small network, too large an l2_decay makes it underfit, so 3e-5 is a better choice than 1e-4 for this network.
| Model | L2_decay | Train acc1/acc5 | Test acc1/acc5 |
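To see why this coefficient acts as a "decay" on the weights, consider a single SGD step on one scalar weight (a schematic example with illustrative numbers):
```
# With loss L(w) + (coeff / 2) * w**2, the gradient gains an extra `coeff * w` term,
# so every update shrinks ("decays") the weight in addition to the normal gradient step.
lr, coeff = 0.1, 1e-4
w, grad = 0.5, 0.02                 # illustrative scalar weight and loss gradient
w = w - lr * (grad + coeff * w)     # equivalent to: w *= (1 - lr * coeff); w -= lr * grad
```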
- Q: What does label smoothing (label_smoothing) refer to? What is the effect? What kind of scenarios does it usually apply to?
- A: Label_smoothing is a regularization method in deep learning, whose full name is Label Smoothing Regularization (LSR). In the traditional classification task, the loss function is the cross-entropy between the ground-truth one-hot label and the output of the neural network, while label_smoothing smooths the one-hot label, so that the label learned by the network is no longer a hard label but a soft label with probability values, where the probability at the position of the ground-truth category is the largest and the others are small. See the paper[2] for the detailed calculation method. In label_smoothing, the epsilon parameter describes the degree of label softening: the larger the value, the smaller the probability at the ground-truth position of the smoothed label vector and the smoother the label, and vice versa. The value is usually set to 0.1 in experiments on ImageNet-1k, and for models of ResNet50 size and above there is a steady increase in accuracy after using label_smoothing. At the same time, since label_smoothing can be regarded as a regularization method, the accuracy improvement is not obvious, or the accuracy even decreases, on relatively small models. The following table shows the accuracy of ResNet50_vd and ResNet18 on ImageNet-1k before and after using label_smoothing (a small code sketch follows the table); it is clear that the accuracy of ResNet18 drops after using label_smoothing.
| Model | Use_label_smoothing | Test acc1 |
| ----------- | ------------------- | --------- |
| ResNet50_vd | 0 | 77.9% |
| ResNet50_vd | 1 | 78.4% |
| ResNet18 | 0 | 71.0% |
| ResNet18 | 1 | 70.8% |
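Label smoothing itself is simple to express; the sketch below builds a smoothed label vector from a class index (the number of classes and epsilon are illustrative):
```
import numpy as np

def smooth_label(label, num_classes=10, epsilon=0.1):
    # Soft label: (1 - epsilon) on the ground-truth class, with epsilon spread
    # uniformly over all classes.
    one_hot = np.zeros(num_classes, dtype=np.float32)
    one_hot[label] = 1.0
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

print(smooth_label(3))   # the ground-truth position gets 0.91, the others 0.01
```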
> >
- Q: How to determine the tuning strategy by the accuracy or loss of the training and validation sets during training?
- A: During training, the accuracy of the training set and validation set is usually printed for each epoch, which portrays the performance of the model on both datasets. Generally speaking, the training set accuracy should be comparable to, or slightly higher than, the validation set accuracy. If the training set accuracy is much higher than that of the validation set, it means the model has overfitted the training set and stronger regularization is needed, such as increasing the value of L2Decay, adding more data augmentation strategies, or introducing label_smoothing. If the training set accuracy is lower than that of the validation set, it means the model is probably underfitted, and the regularization should be weakened during training, for example by reducing the value of L2Decay, using fewer data augmentation methods, increasing the cropped area of the images, weakening the image stretching, or removing label_smoothing.
> >
- Q: How can pre-trained models be used to improve the accuracy on my own dataset?
- A: At this stage, loading pre-trained models to train one's own task has become common practice in the image recognition field, and it can often improve the accuracy of a particular task compared with training from random initialization. In general, the pre-trained models widely used in the industry are obtained by training on the ImageNet-1k dataset of 1.28 million images in 1000 classes. The fc-layer weights of such a pre-trained model form a k * 1000 matrix, where k is the number of neurons before the fc layer, and it is not necessary to load the fc-layer weights when loading the pre-trained weights. In terms of the learning rate, if your dataset is particularly small (e.g., fewer than 1,000 images), we recommend adopting a small initial learning rate, e.g., 0.001 (batch_size: 256, the same below), so as not to corrupt the pre-trained weights with a large learning rate. If your training dataset is relatively large (more than 100,000 images), we suggest trying a larger initial learning rate, such as 0.01 or above.
### 1.3 Data
> >
- Q: What are the general steps involved in the data pre-processing for image classification?
- A: When training ResNet50 on the ImageNet-1k dataset, an input image goes through the following steps: image decoding, random cropping, random horizontal flipping, normalization, data rearrangement, and batching before being fed into the network. Image decoding refers to reading the image file into memory; random cropping refers to randomly stretching and cropping the decoded image to a 224 x 224 image; random horizontal flipping refers to flipping the cropped image horizontally with a probability of 0.5; normalization refers to subtracting the mean and dividing by the standard deviation for each channel so that the data approximately follows an `N(0,1)` normal distribution; data rearrangement refers to changing the data layout from `[224,224,3]` to `[3,224,224]`; and batching refers to grouping multiple images into a batch and feeding them into the network for training.
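In PaddleClas these steps correspond to the `transform_ops` of the training dataset in the configuration file. The following is a sketch of a typical ImageNet-1k training pipeline, assuming the PaddleClas 2.x operator names; the values shown are the common defaults:
```
Train:
  dataset:
    name: ImageNetDataset
    image_root: ./dataset/ILSVRC2012/
    cls_label_path: ./dataset/ILSVRC2012/train_list.txt
    transform_ops:
      - DecodeImage:        # image decoding
          to_rgb: True
          channel_first: False
      - RandCropImage:      # random cropping / stretching to 224 x 224
          size: 224
      - RandFlipImage:      # random horizontal flipping
          flip_code: 1
      - NormalizeImage:     # per-channel normalization
          scale: 1.0/255.0
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: ''
```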
> >
- Q: How does random-crop affect the performance of small model training?
- A: In the standard preprocessing of ImageNet-1k data, the random_crop function defines two parameters, scale and ratio, which respectively determine the size of the crop and the degree of image stretching; the default range of the former is 0.08-1 (lower_scale-upper_scale), and that of the latter is 3/4-4/3 (lower_ratio-upper_ratio). For very small networks, this kind of data augmentation can lead to underfitting and decreased accuracy. To this end, the data augmentation can be weakened by increasing the cropped area of the image or decreasing the stretching, i.e. by increasing the value of lower_scale or reducing the gap between lower_ratio and upper_ratio, respectively. The following table shows the accuracy of training MobileNetV2_x0_25 with different lower_scale values, and we can see that both training accuracy and validation accuracy improve when the cropped area of the images is increased.
| Model | Range of Scale | Train_acc1/acc5 | Test_acc1/acc5 |
- Q: What are the common data augmentation methods currently available to increase the richness of training samples when the amount of data is insufficient?
- A: PaddleClas classifies data augmentation methods into three categories: image transformation, image cropping, and image mixing. Image transformation mainly includes AutoAugment and RandAugment, image cropping contains CutOut, RandErasing, HideAndSeek, and GridMask, and image mixing comprises Mixup and Cutmix. A more detailed introduction to data augmentation can be found in the [Data Augmentation](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/algorithm_introduction/DataAugmentation.md) chapter.
> >
- Q: For image classification scenarios where occlusion is common, what data augmentation methods should be used to improve the accuracy of the model?
- A: During the training, you can try to adopt cropping data augmentations including CutOut, RandErasing, HideAndSeek and GridMask on the training set, so that the model can learn not only the significant regions but also the non-significant regions, thus better performing the recognition task.
> >
- Q: What data augmentation methods should be used to improve model accuracy in the case of complex color transformations?
- A: Consider using the data augmentation strategies of AutoAugment or RandAugment, both of which include rich color transformations such as sharpening and histogram equalization, allowing the model to be more robust to these transformations during the training process.
> >
- Q: How do Mixup and Cutmix work? Why are they effective methods of data augmentation?
- A: Mixup generates a new image by linearly blending two images, and the corresponding labels are blended in the same proportion for training, while Cutmix crops a random region of interest (ROI) from one image and pastes it over the corresponding region of the current image, with the labels mixed in proportion to the pasted area. Both effectively generate new samples and labels that differ from the original training set for the network to learn from, thus enriching the training data.
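A minimal Mixup sketch on a batch (NumPy, with an illustrative Beta-distribution parameter) may help clarify the idea:
```
import numpy as np

def mixup_batch(images, labels_onehot, alpha=0.2):
    # Linearly blend each image (and its soft label) with a randomly
    # permuted partner from the same batch.
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(images.shape[0])
    mixed_images = lam * images + (1.0 - lam) * images[idx]
    mixed_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[idx]
    return mixed_images, mixed_labels
```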
> >
- Q: What is the size of the training dataset for an image classification task that does not require high accuracy?
- A: The amount of training data is related to the complexity of the problem to be solved. The greater the difficulty and the higher the accuracy requirement, the larger the dataset needs to be, and in practice more training data generally yields better results. Of course, in general, 10-20 images per category with a pre-trained model can guarantee a basic classification effect; without a pre-trained model, at least 100-200 images per category are needed.
> >
- Q: What are the common methods currently used for datasets with long-tailed distributions?
- A: (1) The categories with fewer samples can be resampled to increase their sampling probability; (2) the loss can be modified to give a larger weight to images from the under-represented categories; (3) transfer learning can be used to learn generic knowledge from the common categories and then transfer it to the categories with fewer samples.
### 1.4 Model Inference and Prediction
> >
- Q: How to deal with the poor recognition performance when the original image is taken for classification with only a small part of the image being the foreground object of interest?
- A: A mainbody detection model can be added before classification to detect the foreground objects, which can greatly improve the final recognition results. If time cost is not a concern, multi-crop can also be used to fuse all the predictions to determine the final category.
> >
- Q: What are the currently recommended model inference methods?
- A: After the model is trained, it is recommended to export it as an inference model and run inference with the Paddle inference engine, which currently supports Python and C++ inference. If you want to deploy the inference model as a service, PaddleServing is recommended.
> >
- Q: What are the appropriate inference methods to further improve the model accuracy after training?
- A:(1) A larger inference scale can be used, e.g., 224 for training, then 288 or 320 for inference, which will directly bring about a 0.5% improvement in accuracy. (2) Test Time Augmentation (TTA) can be used to create multiple copies of the test set by rotating, flipping, color transforming, and so on, and then fuse all the prediction results, which can greatly improve its accuracy and robustness. (3) Of course, a multi-model fusion strategy can also be adopted to fuse the prediction results of multiple models for the same images.
> >
- Q: How to choose the best model for the fusion of multiple models?
- A: Without considering the inference speed, models with the highest possible accuracy are recommended; it is also suggested to choose models with different structures or from different series for fusion. For example, the fusion result of ResNet50_vd and Xception65 tends to be better than that of ResNet50_vd and ResNet101_vd, even though the latter two have similar accuracy.
> >
- Q: What are the common acceleration methods when using a fixed model for inference?
- A: (1) Using a GPU with better performance; (2) increasing the batch size; (3) using TensorRT and FP16 half-precision floating-point inference.
## 2. Application of PaddleClas
> >
- Q: Why can't I import parameters even though I have specified the address of the folder where the pre-trained model is located during evaluation and inference?
- A: When loading a pretrained model, you need to specify its prefix. For example, if the folder where the pretrained model parameters are located is `output/ResNet50_vd/19` and the pretrained parameter file is `output/ResNet50_vd/19/ppcls.pdparams`, then the `pretrained_model` parameter needs to be set to `output/ResNet50_vd/19/ppcls`, and PaddleClas will automatically append the `.pdparams` suffix.
> >
- Q: Why is the final accuracy always about 0.3% lower than the official one when evaluating the `EfficientNetB0_small` model?
- A: The `EfficientNet` series networks use `cubic interpolation` for resizing (the `interpolation` parameter of resize is set to 2), while for other models it is None by default, so the interpolation method needs to be explicitly specified during training and evaluation. Specifically, you can refer to the `ResizeImage` parameter in the preprocessing of the following configuration.
```
Eval:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/val_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
- ResizeImage:
resize_short: 256
interpolation: 2
- CropImage:
size: 224
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
```
> >
- Q: Why is `TypeError: __init__() missing 1 required positional argument: 'sync_cycle'` reported when using VisualDL under Python 2?
- A: Currently VisualDL only supports running under Python 3 and requires version 2.0 or higher. If your VisualDL version is not correct, you can install it as follows: `pip3 install visualdl -i https://mirror.baidu.com/pypi/simple`
> >
- Q: Why is the single-image inference speed of ResNet50_vd much lower than the benchmark provided on the official website, and why is the CPU faster than the GPU in this case?
- A: Model inference needs to be initialized, and the initialization is time-consuming. Therefore, when measuring inference speed, we need to run a batch of images, discard the inference time of the first few images, and then average the remaining times. The GPU appears slower than the CPU when testing a single image because GPU initialization is much slower than CPU initialization.
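A rough benchmarking sketch (the `predict` callable is a placeholder for your own inference function) that discards the first warm-up runs before averaging:
```
import time

def benchmark(predict, image, warmup=10, repeats=100):
    # Discard the first `warmup` runs (engine / GPU initialization), then
    # average the remaining ones to estimate the real per-image latency.
    for _ in range(warmup):
        predict(image)
    start = time.time()
    for _ in range(repeats):
        predict(image)
    return (time.time() - start) / repeats
```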
> >
- Q: Can grayscale images be used for model training?
- A: Grayscale images can also be used for model training, but the input shape of the model needs to be modified to `[1, 224, 224]` and the data augmentation needs to be adapted accordingly. However, to make better use of the PaddleClas code, it is recommended to convert the grayscale image to a 3-channel image (with equal pixel values in the RGB channels) for training.
> >
- Q: How to train the model on Windows or CPU?
- A: You can refer to the [Getting Started Tutorial](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models_training/classification.md) for detailed tutorials on model training, evaluation and inference in Linux, Windows, CPU, and other environments.
> >
- Q: How to use label smoothing in model training?
- A: This can be set in the `Loss` field of the configuration file as follows. `epsilon: 0.1` means setting the value to 0.1; if the `epsilon` field is not set, `label smoothing` will not be used.
```
Loss:
Train:
- CELoss:
weight: 1.0
epsilon: 0.1
```
> >
- Q: Is the 100,000-class (10W) image classification pre-trained model provided by PaddleClas available for model inference?
- A: This 100,000-class image classification pre-trained model does not provide the parameters of the final fully connected (fc) layer, so it cannot currently be used for inference directly, but it is available for model fine-tuning.
> >
- Q: Why is `Error: Pass tensorrt_subgraph_pass has not been registered` reported when using `deploy/python/predict_cls.py` for model prediction?
- A: If you want to use TensorRT for model prediction and inference, you need to install or compile PaddlePaddle with TensorRT yourself. For Linux, Windows, and macOS users, you can refer to [download inference library](https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html). If the required version is not available, you need to compile and install it locally, which is detailed in [source code compilation](https://paddleinference.paddlepaddle.org.cn/user_guides/source_compile.html).
> >
- Q: How to train with Automatic Mixed Precision (AMP) during training?
- A: You can refer to [ResNet50_fp16.yaml](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml). Specifically, if you want your configuration file to support automatic mixed precision during model training, you can add the following information to the file.
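The AMP-related fields to add are of the following form (the same snippet appears again in Q3.5 later in this document):
```
# mixed precision training
AMP:
  scale_loss: 128.0
  use_dynamic_loss_scaling: True
  use_pure_fp16: &use_pure_fp16 True
```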
PaddleClas is an image recognition toolset for industry and academia, helping users train better computer vision models and apply them in real scenarios. Specifically, it contains the following core features.
- Practical image recognition system: Integrate detection, feature learning, and retrieval modules to be applicable to all types of image recognition tasks. Four sample solutions are provided, including product recognition, vehicle recognition, logo recognition, and animation character recognition.
- Rich library of pre-trained models: Provide a total of 175 ImageNet pre-trained models of 36 series, among which 7 selected series of models support fast structural modification.
- Comprehensive and easy-to-use feature learning components: 12 metric learning methods are integrated and can be combined and switched at will through configuration files.
- SSLD knowledge distillation: The 14 classification pre-trained models generally improved their accuracy by more than 3%; among them, the ResNet50_vd model achieved a Top-1 accuracy of 84.0% on the ImageNet-1k dataset and the Res2Net200_vd pre-trained model achieved a Top-1 accuracy of 85.1%.
- Data augmentation: Provide 8 data augmentation algorithms such as AutoAugment, Cutout, Cutmix, etc. with the detailed introduction, code replication, and evaluation of effectiveness in a unified experimental environment.
For more information about the quick start of image recognition, algorithm details, model training and evaluation, and prediction and deployment methods, please refer to the [README Tutorial](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/README_ch.md) on home page.
## 1. Overview
The feature map is the feature representation of the input image in a convolutional network, and studying it can help us understand and design the model better. Therefore, this tool is provided to visualize feature maps based on the dynamic graph mode.
## 2. Preparation
The first step is to select the model to be studied; here we choose ResNet50. Copy the model networking code [resnet.py](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/arch/backbone/) into the [feature map visualization directory](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/utils/feature_maps_visualization) and download the [ResNet50 pre-training model](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ResNet50_pretrained.pdparams), e.g. with `wget`, to the same directory.
For other pre-trained models and the corresponding network structure code, please refer to the [model library](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/arch/backbone) and [pre-trained models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/models_intro.md).
## 3. Model Modification
Having found the location of the needed feature map, store it in a variable (here `fm`) so that it can be returned and fetched. In this example we take the feature map after the stem layer of ResNet50.
Specify the feature map to be visualized in the forward function of ResNet50:
```
    def forward(self, x):
        with paddle.static.amp.fp16_guard():
            if self.data_format == "NHWC":
                x = paddle.transpose(x, [0, 2, 3, 1])
                x.stop_gradient = True
            x = self.stem(x)
            fm = x                      # capture the feature map right after the stem layer
            x = self.max_pool(x)
            x = self.blocks(x)
            x = self.avg_pool(x)
            x = self.flatten(x)
            x = self.fc(x)
        return x, fm
```
Then modify the code in [fm_vis.py](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/utils/feature_maps_visualization/fm_vis.py) to import `ResNet50` and instantiate the `net` object:
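A sketch of the relevant lines (the weight path is illustrative; point it to wherever the pre-trained parameters were downloaded):
```
import paddle
from resnet import ResNet50            # the networking code copied in the preparation step

net = ResNet50()                        # instantiate the modified network
net.set_state_dict(paddle.load("./ResNet50_pretrained.pdparams"))  # load pre-trained weights
net.eval()
```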
- [Common Datasets for Image Classification](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/classification_dataset.md#图像分类任务常见数据集介绍)
PaddleClas adopts `txt` files to assign the training and test sets. Taking the `ImageNet1k` dataset as an example, where `train_list.txt` and `val_list.txt` have the following formats:
```
# Separate the image path and annotation with "space" for each line
# train_list.txt has the following format
train/n01440764/n01440764_10026.JPEG 0
...
# val_list.txt has the following format
val/ILSVRC2012_val_00000001.JPEG 65
...
```
## 2 Common Datasets for Image Classification
Here we present a compilation of commonly used image classification datasets, which will be continuously updated, and we welcome your contributions.
### 2.1 ImageNet1k
[ImageNet](https://image-net.org/) is a large visual database for visual object recognition research with over 14 million manually labeled images. ImageNet-1k is a subset of the ImageNet dataset, which contains 1000 categories with 1,281,167 images in the training set and 50,000 in the validation set. Since 2010, the ImageNet project has held an annual image classification competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), with ImageNet-1k as its specified dataset. To date, ImageNet-1k has become one of the most significant contributors to the development of computer vision, and numerous initial models for downstream computer vision tasks are trained on it.
| Dataset | Size of Training Set | Size of Test Set | Number of Categories | Note |
| ---------- | -------------------- | ---------------- | -------------------- | ---- |
| ImageNet1k | 1.28M | 50k | 1000 | |
### 2.2 CIFAR-10 / CIFAR-100
The CIFAR-10 dataset comprises 60,000 color images of 10 classes at a 32x32 resolution, with 6,000 images per class: 5,000 in the training set and 1,000 in the validation set. The 10 classes are airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The CIFAR-100 dataset is an extension of CIFAR-10 and consists of 60,000 color images of 100 classes at a 32x32 resolution, with 600 images per class: 500 in the training set and 100 in the validation set.
### 2.3 MNIST
MNIST is a renowned dataset for handwritten digit recognition and is used as an introductory example for deep learning in many tutorials. It contains 70,000 images, 60,000 in the training set and 10,000 in the validation set, each with a size of 28 x 28.
Website: http://yann.lecun.com/exdb/mnist/
### 2.5 NUS-WIDE
NUS-WIDE is a multi-label dataset. It contains 269,648 images and 81 categories, and each image is labeled as belonging to one or more of the 81 categories.
## Issue 1
### Q1.1: What can PaddleClas be used for?
**A**: PaddleClas is an image recognition toolset for industry and academia, helping users train better computer vision models and apply them in real scenarios.
It provides the whole process of model training, evaluation, inference, and deployment based on image classification to facilitate users' efficient learning. Specifically, PaddleClas contains the following features.
- PaddleClas provides 36 families of classification network structures (ResNet, ResNet_vd, MobileNetV3, Res2Net, HRNet, etc.) and training configurations, 175 pre-trained models, and performance evaluation and inference for free choice and application.
- PaddleClas provides a variety of inference deployment solutions such as TensorRT inference, python inference, c++ inference, Paddle-Lite inference deployment, PaddleServing, PaddleHub, etc., to facilitate inference deployment in multiple environments.
- PaddleClas provides a simple SSLD knowledge distillation scheme, based on which the recognition accuracy of distillation models register a general improvement of more than 3%.
- PaddleClas provides 8 data augmentation algorithms such as AutoAugment, Cutout, Cutmix, etc. with a detailed introduction, code replication, and evaluation of effectiveness in a unified experimental environment.
- PaddleClas supports CPU/GPU-based adoption in Windows/Linux/MacOS environments.
### Q1.2: What is the ResNet series model? What are they? Why are they so popular on the server side?
**A**: ResNet was the first to introduce the residual structure, and the ResNet network is constructed by stacking multiple residual blocks. Experiments show that residual blocks can effectively improve convergence speed and accuracy. In PaddleClas, ResNet is provided with 18, 34, 50, 101, 152, and 200 layers. These models, proposed in 2015, have been validated in different application scenarios such as classification, detection and segmentation, have long been optimized by the industry, and hold obvious advantages in terms of speed and accuracy, with good support for TensorRT and FP16 inference. Therefore, the ResNet series models are recommended; considering their large storage footprint, they are often used on the server side. For more information about ResNet models, please refer to the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385).
### Q1.3: What's the difference between the structure of ResNet_vd, ResNet and ResNet_vc?
**A**: The structures of ResNet_va through ResNet_vd are shown in the figure below. ResNet was first proposed with the va structure: in the left feature transformation path (Path A) of the downsampling residual module, the first 1x1 convolution performs the downsampling, which leads to information loss (the kernel size of the convolution is 1 and the stride is 2, so some features in the input feature map are not involved in the convolution). In the vb structure, the downsampling is moved from the first 1x1 convolution to the middle 3x3 convolution, thus avoiding this loss of information; the default ResNet model in PaddleClas is ResNet_vb. The vc structure replaces the initial 7x7 convolution with three 3x3 convolutions, which keeps the receptive field and almost the same computation and storage while improving accuracy. The vd structure modifies the feature path (Path B) on the right side of the downsampling residual module, replacing the strided downsampling with average pooling. This collection of improvements (va->vd), with little extra inference time and combined with appropriate training strategies such as label smoothing and mixup data augmentation, can improve the accuracy by up to 2.7%.
### Q1.4 How to choose appropriate ResNet models for the actual scenario?
**A**:
Among the ResNet series models, the ResNet_vd model is recommended because it brings a significant improvement in accuracy with almost unchanged inference speed compared with the others. When the batch size is 4, the inference time, FLOPs, Params and accuracy of the different models on a T4 GPU are demonstrated in [ResNet and its vd series models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/ResNet_and_vd.md). If you want the smallest possible model storage or the fastest inference speed, please use the ResNet18_vd model; if you want the highest possible accuracy, we recommend the ResNet152_vd or ResNet200_vd models. For more information about the ResNet series models, please refer to [ResNet and its vd series models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/ResNet_and_vd.md).
### Q1.5 Is conv-bn-relu a fixed form in a block of the network?
**A**:
Before the advent of batch-norm, mainstream convolutional neural networks were fixed in the form of conv-relu. At present, conv-bn-relu is the standard block form in most convolutional networks and is a relatively robust design. In addition, the blocks in DenseNet adopt the form bn-relu-conv, which is the same combination used in ResNet-V2. In MobileNetV2, some layers in the middle of the blocks use conv-bn without a ReLU activation to avoid information loss.
### Q1.6 What's the difference between ResNet34 and ResNet50?
**A**:
There are two different kinds of blocks in the ResNet series, basic-block and bottleneck-block, and the ResNet network is constructed by the stacking of such blocks. basic-block is a stack of two 3x3 convolutional kernels with shortcut, while bottleneck-block is a stack of a 1x1 convolutional kernel, 3x3 convolutional kernel, and 1x1 convolutional kernel with shortcut, so there are two layers in the former one and three in the latter. The number of blocks stacked in ResNet34 and ResNet50 is the same, but the types of stacking are basic-block and bottleneck-block, respectively.
### Q1.7 Do large convolution kernels necessarily lead to positive returns?
**A**:
Not necessarily: enlarging all the convolutional kernels in a network may not lead to performance improvement and may even hurt it. The paper [MixConv: Mixed Depthwise Convolutional Kernels](https://arxiv.org/abs/1907.09595) points out that increasing the size of convolutional kernels within a certain range plays a positive role in accuracy, but beyond that range it may cause accuracy loss. Therefore, considering model size and computation, very large convolutional kernels are generally avoided when designing networks. There are also experiments on large convolution kernels in the article [PP-LCNet](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/PP-LCNet.md).
## Issue 2
### Q2.1: How does PaddleClas train its backbone?
**A**: The process is as follows (a minimal sketch is given after this list):
- First, create a new model structure file under the folder ppcls/arch/backbone/model_zoo/, i.e. your own backbone. You can refer to resnet.py for model construction;
- Then add your own backbone class in ppcls/arch/backbone/__init__.py;
- Next, configure the yaml file for training, here you can refer to ppcls/configs/ImageNet/ResNet/ResNet50.yaml;
- Now you can start the training.
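A minimal sketch of the first two steps; the file and class names here are hypothetical:
```
# ppcls/arch/backbone/model_zoo/my_net.py  (hypothetical file)
import paddle
import paddle.nn as nn

class MyNet(nn.Layer):
    def __init__(self, class_num=1000):
        super().__init__()
        # a deliberately tiny backbone, just to illustrate the structure
        self.stem = nn.Sequential(
            nn.Conv2D(3, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2D(32),
            nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2D(1)
        self.fc = nn.Linear(32, class_num)

    def forward(self, x):
        x = self.stem(x)
        x = self.pool(x)
        x = paddle.flatten(x, start_axis=1)
        return self.fc(x)

# Then expose it in ppcls/arch/backbone/__init__.py, e.g.:
# from ppcls.arch.backbone.model_zoo.my_net import MyNet
```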
### Q2.2: How to transfer the existing models and weights to your own classification tasks?
**A**: The process is as follows:
- First, a good pre-training model tends to be better transferred, so it is recommended to adopt a pre-training model with higher accuracy, for instance, series of industry-leading pre-training models provided by PaddleClas;
- Second, determine the training hyperparameters based on the size of the dataset to be transferred; these need to be tuned to find a local optimum. If you have no relevant experience, it is recommended to start with the learning rate: in general, a smaller dataset requires a smaller learning rate, such as 0.001. In addition, the warmup strategy is suggested for the learning rate to avoid damaging the pre-trained weights with a large learning rate. During transfer, different learning rates can also be set for different layers of the backbone, and gradually reducing the learning rate from the head to the tail of the network often works better. For small datasets, data augmentation strategies can also be useful, and PaddleClas offers 8 powerful data augmentation strategies for higher accuracy.
- After training, the above process can be iterated repeatedly until a local optimal value is found.
### Q2.3: Is the default parameter under configs in PaddleClas available for all datasets?
**A**:
The default parameter of the configuration file under ppcls/configs/ImageNet/ in PaddleClas is the training parameter of ImageNet-1k, which is not suitable for all datasets, and the specific datasets need to be further debugged on this basis.
### Q2.4 The resolution varies for different models in PaddleClas, so what is the standard?
**A**:
PaddleClas strictly follows the resolutions used by the authors of the papers. Since AlexNet in 2012, most convolutional neural networks trained on ImageNet have used a resolution of 224x224. Google adjusted the resolution to 299x299 when designing InceptionV3 to fit the network structure, and the same resolution was later used for Xception and InceptionV4. In addition, in EfficientNet, the authors argue that networks of different sizes should use different resolutions, which is what this series does. In practical scenarios, it is recommended to adopt the default resolution, but networks with deeper layers or larger widths can also try larger ones.
### Q2.5 There are many ssld models available in PaddleClas, what is the value of their application?
**A**:
There are many ssld pre-trained models available in PaddleClas, which obtain better pre-trained weights through semi-supervised knowledge distillation, so that accuracy can be improved in transfer tasks or downstream vision tasks simply by replacing the pre-trained weights with the more accurate ssld ones, without changing the structure files. For example, in PaddleSeg, [HRNet](https://github.com/PaddlePaddle/PaddleSeg/blob/release/v0.7.0/docs/model_zoo.md) with the ssld pre-trained weights achieves much better accuracy than comparable models in the industry; in PaddleDetection, [PP-YOLO](https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.4/configs/ppyolo/README_cn.md) with ssld pre-trained weights further improves on an already high baseline. Transfer learning for classification with ssld pre-trained weights also yields impressive results; the benefits of knowledge distillation for classification transfer tasks are detailed in the [SSLD Distillation Strategy](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/advanced_tutorials/knowledge_distillation.md).
## Issue 3
### Q3.1: What is the improvement of DenseNet model over ResNet? What are the features or application scenarios?
**A**:
Compared with ResNet, DenseNet is designed with a more aggressive dense connectivity mechanism, which further reduces the number of parameters through feature reuse and bypass settings and mitigates the gradient vanishing problem to some extent. Moreover, thanks to the dense connectivity, the model is easier to train and gains some regularization effect. DenseNet is a good choice in image classification scenarios where the amount of data is limited. More information about DenseNet and this series can be found in [DenseNet Models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/DPN_DenseNet.md).
### Q3.2: What are the improvements of the DPN network over DenseNet?
**A**:
The full name of DPN is Dual Path Networks, or Dual Channel Networks. It is a combination of DenseNet and ResNeXt, which demonstrates that DenseNet can extract new features from the previous layers, while ResNeXt is essentially reuse of features already extracted from the previous layers. The authors further analyze and find that ResNeXt has a high reuse rate for features but low redundancy, while DenseNet can create new features but has high redundancy. Combining the advantages of both structures, the DPN network is designed. Finally, the DPN network achieves better results than ResNeXt and DenseNet with the same FLOPS and number of parameters. More introduction and series models of DPN can be found in [DPN Models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/DPN_DenseNet.md).
### Q3.3: How to use multiple models for inference fusion?
**A**:
When using multiple models for inference, it is recommended to first export each pre-trained model as an inference model to remove the dependence on the network structure definition; you can refer to the [model export script](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/tools/export_model.py). Then refer to the [inference script for the inference model](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/deploy/python/), where you need to create multiple predictors according to the number of models employed.
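A sketch of fusing two exported inference models with the Paddle Inference Python API; the model paths are placeholders, and here the outputs are simply averaged:
```
import numpy as np
from paddle.inference import Config, create_predictor

def build_predictor(model_file, params_file):
    config = Config(model_file, params_file)
    return create_predictor(config)

def run(predictor, image_batch):
    # Feed the batch, run the model and fetch the first output.
    input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
    input_handle.copy_from_cpu(image_batch)
    predictor.run()
    output_handle = predictor.get_output_handle(predictor.get_output_names()[0])
    return output_handle.copy_to_cpu()

# One predictor per exported model (placeholder paths).
predictors = [
    build_predictor("ResNet50_vd_infer/inference.pdmodel", "ResNet50_vd_infer/inference.pdiparams"),
    build_predictor("Xception65_infer/inference.pdmodel", "Xception65_infer/inference.pdiparams"),
]

batch = np.random.rand(1, 3, 224, 224).astype("float32")       # illustrative preprocessed input
probs = np.mean([run(p, batch) for p in predictors], axis=0)   # average the predictions
print(probs.argmax(axis=1))
```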
### Q3.4: How to add your own data augmentation methods in PaddleClas?
**A**:
- For single-image augmentation, you can refer to the [single-image based data augmentation script](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/data/preprocess/ops). Following the data operators `ResizeImage` or `CropImage`, create a new class and implement the corresponding augmentation in its `__call__` method (a minimal sketch is given after this list).
- For a batch of images, you can refer to the [batch data-based data augmentation script](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/data/preprocess/batch_ops). Following the data operators `MixupOperator` or `CutmixOperator`, create a new class and implement the corresponding augmentation in its `__call__` method.
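For the single-image case, a new operator is just a class with a `__call__` method. The example below is hypothetical (it is not an operator shipped with PaddleClas) and assumes the image arrives as an HWC uint8 NumPy array, as produced by `DecodeImage` with `channel_first: False`:
```
import random
import numpy as np

class RandomGrayscale(object):
    """Hypothetical single-image operator: convert to grayscale with probability p."""

    def __init__(self, p=0.2):
        self.p = p

    def __call__(self, img):
        # `img` is an HWC uint8 numpy array.
        if random.random() < self.p:
            gray = img.mean(axis=2, keepdims=True)
            img = np.repeat(gray, 3, axis=2).astype(img.dtype)
        return img
```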
### Q3.5: How to further accelerate the model training?
**A**:
- You can adopt auto-mixed precision training, which gains a significantly faster speed with almost zero precision loss. Taking ResNet50 as an example, the configuration file for auto-mixed precision training in PaddleClas can be found at [ResNet50_fp16.yml](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml). The main step is to add the following lines to the standard configuration file.
```
# mixed precision training
AMP:
scale_loss: 128.0
use_dynamic_loss_scaling: True
use_pure_fp16: &use_pure_fp16 True
```
- You can turn on DALI to run the data preprocessing on the GPU. When the model is relatively small (the reader accounts for a higher percentage of the time consumption), turning on DALI brings an obvious speedup; it can be enabled by adding `-o Global.use_dali=True` during training. You can refer to the [dali installation tutorial](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html#nightly-builds) for more details.
## Issue 4
### Q4.1: How many types of model files are there in PaddlePaddle?
**A**:
- There are two types of model-related files saved in PaddlePaddle.
- One kind is used for *inference deployment*, with the suffixes "`pdmodel`" and "`pdiparams`": the "`pdmodel`" file stores the model network structure information, and the "`pdiparams`" file stores the model parameter information. Inference deployment files are saved and loaded with the `paddle.jit.save` and `paddle.jit.load` interfaces.
- The other kind is used in the *training tuning* process, with the suffixes "`pdparams`" and "`pdopt`": the "`pdparams`" file stores the model parameters during training, and the "`pdopt`" file stores the optimizer information. Training tuning files are saved and loaded with the `paddle.save` and `paddle.load` interfaces.
- The inference deployment file enables you to build the model network structure and load the model parameters for inference, while the training tuning file allows you to load the model parameters and optimizer information for resuming the training process.
### Q4.2: What are the innovations of HRNet?
**A**:
- In the field of image classification, most neural networks are designed with the idea of extracting high-dimensional features of images. Specifically, the input image usually has a high spatial resolution, and through multiple layers of convolution and pooling, a feature map with lower spatial resolution but higher dimension is gradually obtained and used for scenarios such as classification.
- However, the authors of *HRNet* believe that this idea of gradually decreasing the spatial resolution is not suitable for scenarios such as object detection (region-level classification) and semantic segmentation (pixel-level classification), because a lot of information is lost in the process and the final learned features can hardly represent the information of the original high-resolution image, while both region-level and pixel-level classification tasks are very sensitive to spatial precision.
- Therefore, the authors of *HRNet* propose keeping feature maps of different spatial resolutions in parallel, whereas networks such as *VGG* cascade feature maps of decreasing spatial resolution through successive convolution and pooling layers. Moreover, *HRNet* connects feature maps of equal depth but different spatial resolutions so that information can be fully exchanged. The specific network structure is shown in the figure below.
### Q4.3: In HRNet, how are connections made between feature graphs with different spatial resolutions?
**A**:
- First, in *HRNet*, a *3 × 3* convolution with a *stride* of *2* is used to obtain a feature map with lower spatial resolution but higher dimension; conversely, for a low-spatial-resolution feature map, a *1 × 1* convolution is first used to match the number of channels, and then nearest-neighbor interpolation is used for upsampling to obtain a feature map with the same spatial resolution and number of channels as the high-resolution one. For feature maps with the same spatial resolution, an identity mapping is applied directly. The details are shown in the following figure.
### Q4.4: What does the "SE" in a model name (e.g. SE_ResNet50_vd) indicate, and how does the SE structure work?
**A**:
- SE indicates that the model uses an SE structure, which is derived from the winning solution of the 2017 ImageNet classification competition, *Squeeze-and-Excitation Networks (SENet)*, and can be transferred to any other network. The *scale* vector has the same dimension as the number of channels in the feature map, and the value in each dimension of the learned *scale* vector indicates how much the corresponding feature channel is enhanced or weakened, so that important feature channels are strengthened and unimportant ones weakened, making the extracted features more discriminative.
- The *SE* structure is shown in the figure above. First, *Ftr* represents the regular convolution operation, and *X* and *U* are the input and output feature maps of *Ftr*. After obtaining the feature map *U*, operate the *Fsq* and *Fex* to obtain the *scale* vector, which has a dimension of *C*, the same as the number of *U* channels, so it can be applied to *U* by multiplying, and then obtain *X~*.
- Specifically, *Fsq* is the *Global Average Pooling* operation, which is called *Squeeze* by *SENet* authors because it compresses *U* from *C × H × W* to *C × 1 × 1*, and then conduct the *Fex* operation on the output of *Fsq*.
- The *Fex* operation consists of two fully connected layers, which the authors refer to as *Excitation*. The first fully connected layer compresses the vector from *1 × 1 × C* to *1 × 1 × C/r*, followed by *ReLU*, and the second fully connected layer restores the dimension to *C*. The purpose of this bottleneck is to reduce computation. The *SENet* authors conclude experimentally that *r = 16* gives a good balance between gain and computational cost.
- For *Fsq*, the key is to obtain a *C*-dimensional vector, so the operation is not limited to *Global Average Pooling*. The *SENet* authors believe that the final *scale* is applied to *U* channel by channel, so the *scale* of each channel should be computed from the information of the corresponding channel; the simplest *Global Average Pooling* is therefore adopted, and the final *scale* vector represents the distribution across different channels while ignoring the distribution within a channel.
- For *Fex*, its role is to approximate, through training on each *mini-batch*, the *scale* distribution of the whole training set. Since training is performed on *mini-batches* while the ideal *scale* would be based on all training data, training *Fex* on each *mini-batch* yields a more reliable *scale*.
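Below is a minimal sketch of the SE block described above (*Fsq* as global average pooling, *Fex* as two fully connected layers), assuming PaddlePaddle's `paddle.nn` API; `reduction=16` corresponds to *r=16*.
```python
import paddle
import paddle.nn as nn

class SEBlock(nn.Layer):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2D(1)           # Fsq: C x H x W -> C x 1 x 1
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # compress to C/r
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),  # restore to C
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[0], x.shape[1]
        scale = self.squeeze(x).reshape([b, c])              # C-dimensional vector per sample
        scale = self.excitation(scale).reshape([b, c, 1, 1])
        return x * scale                                     # re-weight each channel of U
```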
## Issue 5
### Q5.1 How to choose an optimizer?
**A**:
Since the emergence of deep learning, there has been a lot of research on optimizers, which aim to minimize the loss function and find suitable weights for a given task. Currently, the main optimizers used in industry are SGD, RMSProp, Adam, AdaDelta, etc. Since the SGD optimizer with momentum is widely used in academia and industry (for classification tasks only), most of the models we publish also adopt this optimizer for gradient descent on the loss function. It has two disadvantages: slow convergence, and reliance on experience when setting the initial learning rate. However, if the initial learning rate is set properly and the number of iterations is sufficient, this optimizer will stand out among many others and obtain higher accuracy on the validation set. Optimizers with adaptive learning rates, such as Adam and RMSProp, tend to converge faster, but their final convergence accuracy is slightly worse. If you pursue faster convergence, we recommend these adaptive learning rate optimizers; for higher convergence accuracy, use the SGD optimizer with momentum. The specific information of the dataset is as follows:
- ImageNet-1k: It is recommended to use the SGD optimizer with momentum only.
- Other datasets (ImageNet-1k pre-training by default): When loading the pre-training model, you can consider an optimizer such as Adam (which may work better), but the SGD optimizer with momentum is definitely a good solution.
In addition, to further speed up the training, Lookahead optimizer is also a good choice. On ImageNet-1k, it can guarantee the same convergence accuracy at a faster rate, but the performance is less stable on some datasets and requires further tuning.
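As a reference, here is a hedged sketch of constructing the two kinds of optimizers discussed above with PaddlePaddle; the hyperparameters and the placeholder network are illustrative, not the exact PaddleClas defaults.
```python
import paddle

model = paddle.nn.Linear(10, 2)   # placeholder network for illustration

# SGD with momentum: slower convergence but usually the highest final accuracy
sgd = paddle.optimizer.Momentum(
    learning_rate=0.1, momentum=0.9, parameters=model.parameters())

# Adam: adaptive learning rate, faster convergence, often slightly lower final accuracy
adam = paddle.optimizer.Adam(
    learning_rate=0.001, parameters=model.parameters())
```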
### Q5.2 How to set the initial learning rate and the learning rate decay strategy?
**A**: The choice of learning rate is often related to the optimizer as well as the data and the task. The learning rate determines how quickly the network weights are updated. The lower the learning rate, the slower the loss function changes. While using a low learning rate ensures that no local minima are missed, it also means that convergence takes longer, especially if training gets trapped in a plateau region.
Throughout the whole training process, we cannot use the same learning rate to update the weights, otherwise the optimal point cannot be reached, so we need to adjust the learning rate during training. In the initial stage of training, since the weights are randomly initialized and the loss function decreases fast, a larger learning rate can be set. In the later stage of training, since the weights are close to the optimal value, a larger learning rate cannot find a better optimum, so a smaller learning rate is a better choice. As for the learning rate decay strategy, many researchers and practitioners use piecewise_decay (step_decay), which decays the learning rate stepwise. In addition, other methods have been proposed, such as polynomial_decay, exponential_decay, and cosine_decay. Among them, cosine_decay requires no tuning of extra hyperparameters and is more robust, so it has emerged as the preferred learning rate decay strategy for improving model accuracy.
The learning rates of cosine_decay and piecewise_decay are shown in the following figure. It is easy to observe that cosine_decay keeps a relatively large learning rate throughout most of the training, so its convergence is slower, but its final accuracy is better than that of piecewise_decay.
In addition, it is also observed that only a few rounds in cosine_decay use a small learning rate, which affects the final accuracy. So it is recommended to iterate more rounds for better results.
Finally, when training a neural network with a large batch_size, it is recommended to use the warmup strategy, which, as the name implies, warms up the learning rate: instead of using the maximum learning rate from the very beginning, the network is trained with a gradually increasing learning rate, and the learning rate is then decayed after it peaks. Experiments show that warmup can steadily improve accuracy when the batch_size is large. The specific information of the dataset is as follows:
- ImageNet-1k: The recommended batch_size is 256, the initial learning rate is 0.1, and cosine_decay is used to decrease the learning rate.
- Other datasets (ImageNet-1k pre-training loaded by default): the larger the dataset, the larger the initial learning rate, preferably not exceeding 0.1 (when the batch_size is 256); the smaller the dataset, the smaller the initial learning rate. When the dataset is small, using warmup also brings some accuracy improvement, and cosine_decay is still recommended as the learning rate decay strategy.
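A minimal sketch of combining warmup with cosine decay in PaddlePaddle is given below; the epoch and step counts are illustrative.
```python
import paddle

# cosine decay over the whole training, preceded by 5 epochs of linear warmup
steps_per_epoch, epochs, warmup_epochs = 5000, 120, 5
cosine = paddle.optimizer.lr.CosineAnnealingDecay(
    learning_rate=0.1, T_max=epochs * steps_per_epoch)
scheduler = paddle.optimizer.lr.LinearWarmup(
    learning_rate=cosine, warmup_steps=warmup_epochs * steps_per_epoch,
    start_lr=0.0, end_lr=0.1)

optimizer = paddle.optimizer.Momentum(
    learning_rate=scheduler, momentum=0.9,
    parameters=paddle.nn.Linear(10, 2).parameters())  # placeholder network
# call scheduler.step() after each iteration during training
```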
### Q5.3 How to set the batch-size?
**A**:
Batch_size is an important hyperparameter in neural network training, and its value determines how much data is fed into the neural network for training at a time. Previous researchers have found experimentally that when the batch_size is scaled linearly together with the learning rate, the convergence accuracy is almost unaffected (see the small helper after this list). When training ImageNet-1k, most neural networks choose an initial learning rate of 0.1 and a batch_size of 256. The specific information of the dataset is as follows:
- ImageNet-1k: learning rate is set to 0.1*k, batch_size is set to 256*k.
- Other datasets (ImageNet-1k pre-training by default): can be set according to the actual situation (e.g. a smaller learning rate), but when adjusting the learning rate or the batch size, the other one should be adjusted proportionally at the same time.
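A small helper illustrating the linear scaling rule above, using the 0.1 / 256 ImageNet-1k baseline mentioned in the text:
```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Keep learning_rate / batch_size constant, as described above."""
    return base_lr * batch_size / base_batch_size

print(scaled_lr(512))   # 0.2: doubling the batch_size doubles the learning rate
```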
### Q5.4 What is weight_decay and how to set it?
**A**:
Overfitting is a common term in machine learning, which simply means that the model performs well on training data but poorly on test data. In image classification, overfitting also occurs, and many regularization methods have been proposed to avoid it, among which weight_decay is one of the most widely used. When using the SGD optimizer, weight_decay is equivalent to adding an L2 regularization term to the final loss function, which makes the weights of the network tend toward smaller values, so the parameter values of the whole network tend toward 0 and the generalization performance of the model improves accordingly. In major deep learning frameworks, this value is the coefficient of the L2 regularization term, which is called L2Decay in the PaddlePaddle framework. The larger the coefficient, the stronger the regularization and the more the model tends to underfit. The specific information of the dataset is as follows:
- ImageNet-1k: Most networks set this parameter to 1e-4; in some smaller networks such as the MobileNet series, the value is set between 1e-5 and 4e-5 to avoid underfitting. The following table shows the accuracy of MobileNetV1_x0_25 on ImageNet-1k with different L2Decay values. Since MobileNetV1_x0_25 is a relatively small network, an overly large L2Decay tends to underfit it, so 3e-5 is a better choice than 1e-4 for this network.
| Model | L2Decay | Train acc1/acc5 | Test acc1/acc5 |
In addition, the setting of this value is also related to whether other regularization is used during training. If the data preprocessing is complicated, which amounts to a harder training task, the value can be reduced appropriately. The following table shows the accuracy of ResNet50 on ImageNet-1k with different L2Decay values after RandAugment preprocessing. It is easy to observe that a smaller L2Decay helps to improve model accuracy on a harder task.
| Model | L2Decay | Train acc1/acc5 | Test acc1/acc5 |
- Other datasets (ImageNet-1k pre-training loaded by default): When transferring, it is better to keep the L2Decay value used when training on ImageNet-1k (i.e. the L2Decay value of the pre-trained model; the value for each backbone can be found in the corresponding yaml file), and changing the learning rate alone is usually enough for general datasets.
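A minimal sketch of setting L2Decay on a PaddlePaddle optimizer; the 1e-4 coefficient is the common ImageNet-1k setting mentioned above, and the network is a placeholder.
```python
import paddle
from paddle.regularizer import L2Decay

model = paddle.nn.Linear(10, 2)   # placeholder network for illustration
optimizer = paddle.optimizer.Momentum(
    learning_rate=0.1,
    momentum=0.9,
    parameters=model.parameters(),
    weight_decay=L2Decay(1e-4),   # larger coefficient -> stronger regularization
)
```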
### Q5.5 Should I use label_smoothing and how to set the value of the parameter?
**A**:
Label_smoothing is a regularization method in deep learning, whose full name is Label Smoothing Regularization (LSR). In the traditional classification task, the loss function is the cross entropy between the real one-hot label and the output of the neural network, while label_smoothing smooths the real one-hot label, so that the label learned by the network is no longer a hard label but a soft label with probability values, where the probability at the position of the true category is the largest and the others are small. In label_smoothing, the epsilon parameter describes the degree of smoothing: the larger the value, the smaller the probability at the true-label position of the smoothed label vector and the smoother the label, and vice versa. The specific information of the dataset is as follows:
- ImageNet-1k: This value is usually set to 0.1 in ImageNet-1k experiments, and there is a steady accuracy increase for models of ResNet50 size and above after using label_smoothing. The following table shows the accuracy metrics of ResNet50_vd before and after using label_smoothing.
At the same time, since label_smoothing can be regarded as a regularization method, the accuracy improvement is not obvious, or the accuracy even decreases, on relatively small models. The following table shows the accuracy metrics of ResNet18 before and after using label_smoothing on ImageNet-1k. It is clear that the accuracy drops after using label_smoothing.
| Model | Use_label_smoohing(0.1) | Train acc1/acc5 | Test acc1/acc5 |
Here is a trick to make label_smoothing effective even in smaller models: add a fully connected layer with a size of 1000-2000 after the Global-Average-Pool, which works better together with label_smoothing.
- Other datasets (ImageNet-1k pre-training loaded by default): using label_smoothing tends to improve accuracy, and the smaller the dataset, the larger the epsilon value can be. On some small fine-grained image datasets, the best model is usually obtained with the value set to 0.4-0.5.
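A minimal sketch of the label-smoothed cross entropy described above; epsilon=0.1 follows the ImageNet-1k setting, and this is a self-written version rather than the exact PaddleClas implementation.
```python
import paddle
import paddle.nn.functional as F

def label_smoothing_ce(logits, labels, epsilon=0.1):
    """Cross entropy with a soft label: (1 - epsilon) on the true class,
    plus epsilon / num_classes spread over all classes."""
    num_classes = logits.shape[-1]
    one_hot = F.one_hot(labels, num_classes)
    soft_label = one_hot * (1 - epsilon) + epsilon / num_classes
    log_prob = F.log_softmax(logits, axis=-1)
    return -(soft_label * log_prob).sum(axis=-1).mean()
```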
### Q5.6 Is random_crop still adjustable in the default image preprocessing? How?
**A**:
In the standard preprocessing of ImageNet-1k data, the random_crop function defines two values, scale and ratio, which respectively determine the size of the image crop and the degree of image stretching; the default value of the former is 0.08-1 (lower_scale-upper_scale), and that of the latter is 3/4-4/3 (lower_ratio-upper_ratio). For very small networks, this kind of data augmentation can lead to underfitting and decreased accuracy. To this end, the data augmentation can be weakened by increasing the crop area of the image or decreasing the stretching of the image, which can be achieved by increasing the value of lower_scale or reducing the gap between lower_ratio and upper_ratio, respectively. The specific information of the dataset is as follows:
- ImageNet-1k: It is recommended to use the default values for networks that are not particularly small, to increase the value of lower_scale (to increase the crop area) or narrow the range of ratio values (to weaken the image stretching) for particularly small networks, and to do the opposite for particularly large networks. The following table shows the accuracy of training MobileNetV2_x0_25 with different lower_scale values; we can see that both the training accuracy and the validation accuracy are improved by increasing the crop area of the images.
| Model | Range of Scale | Train_acc1/acc5 | Test_acc1/acc5 |
- Other datasets (ImageNet-1k pre-training loaded by default): it is recommended to use the default values; if overfitting is severe, consider decreasing the value of lower_scale (to reduce the minimum crop area) or widening the range of ratio values (to enhance the image stretching).
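A hedged example of weakening this augmentation by raising `lower_scale` and narrowing the ratio range, using `paddle.vision.transforms`; the concrete values are illustrative.
```python
from paddle.vision import transforms

# default augmentation: scale 0.08-1.0, ratio 3/4-4/3
default_crop = transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3. / 4, 4. / 3))

# weaker augmentation for very small networks: larger minimum crop area, less stretching
weak_crop = transforms.RandomResizedCrop(224, scale=(0.40, 1.0), ratio=(0.9, 1.1))
```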
### Q5.7 What are the common data augmentation methods? How to choose?
**A**:
In general, the size of the dataset is crucial to performance, but image annotation is often expensive, so annotated images are usually scarce, which highlights the importance of data augmentation. In the standard data augmentation for training ImageNet-1k, two methods, Random_Crop and Random_Flip, are mainly adopted. However, in recent years, an increasing number of data augmentation methods have been proposed, such as cutout, mixup, cutmix, AutoAugment, etc. Experiments show that these methods can effectively improve the accuracy of the model. The specific information of the dataset is as follows:
- ImageNet-1k: The following table lists the performance of ResNet50 adopting 8 different data augmentation methods. It can be seen that all of them are beneficial compared to baseline, with cutmix being the most effective data augmentation so far. For more information about data augmentation, please refer to the chapter of [**Data Augmentation**](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/advanced_tutorials/DataAugmentation.md).
- Other datasets (ImageNet-1k pre-training loaded by default): these methods generally bring an accuracy gain on other datasets as well, with the exception of Auto-Augment. Auto-Augment searches an independent hyperparameter for each dataset that determines how the data is processed, so the default ImageNet-1k hyperparameters are not suitable for all datasets; you can use Random-Augmentation instead of Auto-Augmentation. Other strategies can be used normally, but for harder tasks or smaller networks, strong data augmentation is not recommended.
In addition, multiple data augmentations can be overlaid to further improve accuracy when the dataset is simple or small, as sketched below.
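Here is a minimal mixup sketch in PaddlePaddle as an illustration of one such augmentation; alpha=0.2 is a common but illustrative choice.
```python
import numpy as np
import paddle

def mixup(images, one_hot_labels, alpha=0.2):
    """Mix each sample with a randomly chosen partner; labels are mixed the same way."""
    lam = float(np.random.beta(alpha, alpha))
    index = paddle.randperm(images.shape[0])
    mixed_images = lam * images + (1 - lam) * paddle.index_select(images, index, axis=0)
    mixed_labels = lam * one_hot_labels + (1 - lam) * paddle.index_select(one_hot_labels, index, axis=0)
    return mixed_images, mixed_labels
```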
### Q5.8 How to determine the tuning strategy by train_acc and test_acc?
**A**:
In the process of training a network, the accuracy of the training set and the validation set are usually printed for each epoch, which reflects the performance of the model on both datasets. Generally speaking, the training set accuracy is measured on data after Random-Crop, and since such data is often harder, the training set accuracy and the validation set accuracy are not directly comparable.
- ImageNet-1k: Generally speaking, it is good for the training set accuracy to be comparable to or slightly higher than the validation set accuracy. If the training set accuracy is much higher than the validation set accuracy, it means the training set is overfitted and more regularization is needed, such as increasing the value of L2Decay, adding more data augmentation strategies, or introducing label_smoothing. If the training set accuracy is lower than the validation set accuracy, it means the model is probably underfitted, and the regularization should be weakened during training, for example by reducing the value of L2Decay, using fewer data augmentation methods, increasing the crop area, weakening the image stretching, or removing label_smoothing.
- Other datasets (ImageNet-1k pre-training loaded by default): basically the same as for ImageNet-1k training; in addition, if the model tends to overfit on other datasets (train acc much larger than test acc), you can also use better pre-training weights. PaddleClas provides SSLD distillation pre-training weights for common networks, which are better than the ImageNet-1k ones, and we recommend giving them a try.
- **[Note]** It is not recommended to readjust the training strategy according to the loss. After using different data augmentation strategies, the magnitude of the train loss varies greatly. For example, after using Cutmix or RandAugment, the train loss will exceed the test loss, and when the data augmentation strategy is weakened, the train loss will become smaller than the test loss, which makes adjustment by loss more difficult.
### Q5.9 How to improve the accuracy on my own dataset by using pre-trained models?
**A**:
At this stage, it has become common practice in the image recognition field to load pre-trained models when training your own task, which usually improves the accuracy of the specific task compared with training from random initialization. In general, the pre-trained models widely used in industry are obtained by training on the ImageNet-1k dataset of 1.28 million images in 1000 classes. The fc layer weights of such a pre-trained model form a matrix of k*1000, where k is the number of neurons before the fc layer, and it is not necessary to load the fc layer weights when loading the pre-trained weights. Regarding the learning rate, if your dataset is particularly small (e.g., fewer than 1,000 images), we recommend a small initial learning rate, e.g., 0.001 (batch_size: 256, the same below), so as not to corrupt the pre-trained weights with a large learning rate. If your training dataset is relatively large (more than 100,000 images), we suggest trying a larger initial learning rate, such as 0.01 or above. If the target dataset is small, you can also freeze some shallow weights. Also, if you want to train a small dataset for a specific vertical domain, you can first train a pre-trained weight on a related large dataset and then fine-tune the model with a smaller learning rate from that weight.
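A minimal sketch of loading pre-trained weights while skipping the fc layer, assuming PaddlePaddle; `MyNet`, the file name, and the `"fc"` prefix are illustrative and depend on the actual model definition.
```python
import paddle

model = MyNet(class_num=10)                                  # hypothetical custom network
state_dict = paddle.load("ResNet50_vd_pretrained.pdparams")  # illustrative weight file
# drop the final fc weights (k x 1000), since the class number differs
state_dict = {k: v for k, v in state_dict.items() if not k.startswith("fc")}
model.set_state_dict(state_dict)
```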
### Q5.10 Existing strategies have saturated the accuracy of the model, how can the accuracy of a particular model be further improved?
**A**: If the existing strategy cannot further improve the accuracy of the model, it means that the model has almost reached saturation with the existing dataset and strategy, and two methods are provided here.
- Mining relevant data: Use the model trained on the existing dataset to make predictions on the relevant data, label the data with higher confidence and add it to the training set for further training. Repeat the steps above to further improve the accuracy of the model.
- Knowledge distillation: You can use a larger model to train a teacher model with higher accuracy on the dataset, and then use the teacher model to teach a student model, where the student model is the target model. PaddleClas provides Baidu's own SSLD knowledge distillation scheme, which can steadily deliver an improvement of more than 3% even on a challenging classification task like ImageNet-1k. For the chapter on SSLD knowledge distillation, please refer to [**SSLD Knowledge Distillation**](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/advanced_tutorials/knowledge_distillation.md).
## Issue 6
### Q6.1: What are the differences between the several branches of PaddleClas? How should I choose?
**A**: PaddleClas currently has 3 branches:
- Develop: the develop branch is the development branch of PaddleClas as well as the most actively updated branch. All new features and changes land on this branch first. If you want to keep track of the latest progress of PaddleClas, you can follow this branch. This branch mainly supports dynamic graphs and is updated along with the version of PaddlePaddle.
- Stable release (e.g. release/2.1.3): Fast updates keep followers informed of the latest progress, but they can also bring instability. Therefore, at critical points, we pull branches from the develop branch to provide stable releases, and the latest stable branch is also the default branch. Note that we only maintain the latest stable release, and generally fix bugs only without updating new features and models unless there are special circumstances.
- Static: the static branch is mainly for old users and adopts the static graph version; it receives only simple maintenance, with no new features or models added. It is not recommended for new users, and those who still use it are advised to switch to the dynamic graph branch or the latest stable release branch if conditions permit.
In general, it is recommended to choose the develop branch if you want to keep up with the latest developments of PaddleClas, and the latest stable release branch if you need a stable version.
### Q6.2: What is the static graph mode?
**A**:
The static graph mode is declarative programming, which was initially adopted by many deep learning frameworks such as TensorFlow, MXNet, etc. In static graph mode, you need to define the model structure first, and then the framework compiles and optimizes the model structure to build the "computational graph". It can be simply understood that in this mode the "computational graph" is static and unchanging. The advantage is that the compiler generally builds the graph only once, which is relatively efficient; the disadvantage is that it is not flexible enough and troublesome to debug. For example, if you run a static graph model in Paddle, you need to complete all the operations and then extract the output according to a specific key, which means you cannot get intermediate results in real time.
### Q6.3: What is the dynamic graph mode?
**A**:
Dynamic graph mode is imperative programming: users do not need to pre-define the network structure, and each line of code can be run directly to get the result. Compared with static graph mode, it is more user-friendly and easier to debug. In addition, the structure design in dynamic graph mode is more flexible, and the structure can be adjusted at any time during execution.
PaddleClas currently uses dynamic graph mode for its continuously updated develop branch and the stable release branch. If you are a new user, it is recommended to use dynamic graph mode for development and training. If there is a performance requirement for inference and prediction, you can convert the dynamic graph model to a static one after training to improve efficiency, as sketched below.
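A minimal sketch of converting a trained dynamic-graph model to a static one for inference with PaddlePaddle's `paddle.jit` API; the network and input shape are illustrative.
```python
import paddle

model = paddle.vision.models.resnet50()   # stand-in for a trained dynamic-graph model
model.eval()
static_model = paddle.jit.to_static(
    model,
    input_spec=[paddle.static.InputSpec(shape=[None, 3, 224, 224], dtype="float32")])
paddle.jit.save(static_model, "inference/model")   # exports the static graph for deployment
```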
### Q6.5: When building a classification dataset, how to build the data of the "background" category?
**A**:
In practice, it is often necessary to construct your own classification dataset for training purposes. In addition to the required category data, an additional category is needed, i.e., a "background" category. For example, if we create a cat and dog classification, with cats as one category and dogs as another, then an input image of a rabbit will be forced into one of these two categories. Therefore, during training, we should add some data from non-target categories as the "background" category data.
When constructing the data for the "background" category, the first step is to consider the actual requirements. For example, if the actual test data are all animals, the "background" category data should include some animals other than dogs and cats. If the test data contains more categories, such as a tree, then the "background" category should be enriched. To put it simply, the data in the "background" category should be collected according to the situations that may occur in the actual scenario. The more situations included, the more types of data need to be covered, and the more difficult the task will be. Therefore, in practice, it is better to limit the problem to avoid the waste of resources and computing power.
### Q1.1: Why is the prediction accuracy of the exported inference model very low?
**A**:You can check the following aspects:
- Check whether the path of the pre-training model is correct or not.
- The default class number is 1000 when exporting the model. If the pre-trained model has a custom class number, you need to specify the parameter `--class_num=k` when exporting, where k is the custom class number.
- Compare the output class id and score of `tools/infer/infer.py` and `tools/infer/predict.py` for the same input. If they are exactly the same, the pre-trained model may have poor accuracy itself.
### Q1.2: How to deal with the unbalanced categories of training samples?
**A**:There are several commonly used methods.
- From the perspective of sampling
- The samples can be sampled dynamically by category, with a different sampling probability for each category, ensuring that the number of training samples of different categories in the same minibatch or epoch is basically the same or in the desired proportion.
- You can use the oversampling method to oversample the categories with a small number of images.
- From the perspective of loss function
- The OHEM (online hard example mining) method can be used to filter hard examples based on the loss of the samples, and use them for gradient backpropagation and parameter update of the model.
- The Focal loss method can be used to assign a smaller weight to the loss of easy samples and a larger weight to the loss of hard samples, so that the loss of easy samples contributes to the overall loss without dominating it (see the sketch below).
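A minimal sketch of the Focal loss idea for the binary case; the gamma and alpha values are common but illustrative, and this is not the exact PaddleClas implementation.
```python
import paddle
import paddle.nn.functional as F

def focal_loss(logits, labels, gamma=2.0, alpha=0.25):
    """Down-weight easy samples by (1 - p_t)^gamma so hard samples dominate the loss."""
    prob = F.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    p_t = prob * labels + (1 - prob) * (1 - labels)
    alpha_t = alpha * labels + (1 - alpha) * (1 - labels)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```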
### Q1.3 When training in docker, the data path and configuration are fine, but it keeps reporting `SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception`, why is this?
**A**:
This may be caused by the small shared memory inside docker. When a docker container is created, the default size of `/dev/shm` is 64M; if you use multiple processes to read data, the shared memory may fall short, so you need to allocate more space to `/dev/shm`. When creating the container, add `--shm-size=8g` to allocate 8G to `/dev/shm`, which is usually enough.
### Q1.4 Where can I download the 100,000-class image classification pre-trained model provided by PaddleClas and how to use it?
**A**:
Based on ResNet50_vd, Baidu open-sourced its own large-scale classification pre-trained model with 100,000 categories and 43 million images. It is available at [download address](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_10w_pretrained.tar). Note that this pre-trained model does not provide the final FC layer parameters and thus cannot be used directly for inference; however, it can be used as a pre-trained model to fine-tune on your own dataset. It has been verified that this pre-trained model brings a more significant accuracy gain, up to 30%, on different datasets than the ResNet50_vd pre-trained model based on the ImageNet-1k dataset.
### Q1.5 How to accelerate when using C++ for inference deployment?
**A**:You can speed up the inference in the following ways.
1. For CPU inference, you can turn on mkldnn and increase the number of threads (cpu_math_library_num_threads, in `tools/config.txt`), which is usually 6~10.
2. For GPU inference, you can enable TensorRT inference and FP16 inference if the hardware allows, which can further improve the speed.
3. If the memory or video memory is sufficient, the batch size of inference can be increased.
4. The image preprocessing logic (mainly resize, crop, normalize, etc.) can be run on the GPU, which can further speed up the process.
You are welcome to add more tips on inference deployment acceleration.
## Issue 2
### Q2.1: Does PaddleClas have to start from 0 when setting labels, and does class_num have to equal the number of classes in the dataset?
**A**:
In PaddleClas, labels start from 0 by default, so try to set labels starting from 0. Of course, it is possible to start from other values, but this results in a larger class_num, which in turn leads to more FC layer parameters in the classifier, so the weight file takes up more storage space. In the case of a continuous set of classes, set class_num equal to the number of classes in the dataset (it is also acceptable to set it greater than the number of classes in the dataset, which even yields higher accuracy on many datasets, but again results in a larger FC layer); in the case of a discontinuous set of classes, class_num should be equal to the largest class_id+1 in the dataset.
### Q2.2: How to address the problem of a large weight file caused by a huge FC layer when the number of classes is large?
**A**:
The final FC weight is a large matrix of size C*class_num, where C is the number of neurons in the layer before the FC, e.g., C is 2048 in ResNet50. The size of the FC weight can be further reduced by decreasing the value of C: for example, an FC layer with a smaller dimension can be added after GAP, which can greatly reduce the weight size of the final classification layer, as sketched below.
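A minimal sketch of this idea in PaddlePaddle; the 2048/512 dimensions and the 100,000-class head are illustrative.
```python
import paddle.nn as nn

neck = nn.Sequential(
    nn.AdaptiveAvgPool2D(1),     # GAP
    nn.Flatten(),
    nn.Linear(2048, 512),        # extra fc layer with a smaller dimension
)
head = nn.Linear(512, 100000)    # final classifier: 512 x 100000 instead of 2048 x 100000
```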
### Q2.3: Why does SSLD distillation training on a custom dataset with PaddleClas fail to meet expectations?
**A**:
First, ensure the accuracy of the Teacher model. Second, ensure that the Student model is successfully loaded with the ImageNet-1k pre-trained weights and the Teacher model is successfully loaded with the weights trained on the custom dataset. Finally, ensure that the initial learning rate is not too large, or at least smaller than the value used for ImageNet-1k training.
### Q2.4: Which networks have advantages on mobile or embedded side?
**A**:
It is recommended to use the Mobile series networks; details can be found in [Introduction to Mobile Series Network Structure](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/Mobile.md). If the speed of the task is the priority, the MobileNetV3 series can be considered; if the model size is more important, the specific structure can be determined based on the StorageSize - Accuracy chart in the Introduction to Mobile Series Network Structure.
### Q2.5: Why use a network with large number of parameters and computation such as ResNet when the mobile network is so fast?
**A**:
Different network structures have different speed advantages on different devices. On the mobile side, mobile series networks run faster than server-side networks, but on the server side, networks with specific optimizations such as ResNet have greater advantages at the same accuracy. So the specific network structure needs to be decided on a case-by-case basis.
## Issue 3
### Q3.1: What are the characteristics of a double (multi)-branch structure and a Plain structure, respectively?
**A**:
Plain networks, represented by VGG, have evolved into multi-branch network structures, represented by the ResNet series (with residual modules) and the Inception series (with multiple convolutional kernels in parallel). It is found that the multi-branch structure is more friendly during model training: a larger network width brings stronger feature-fitting ability, while the residual structure avoids the vanishing gradient problem of deep networks. However, in the inference phase, models with a multi-branch structure have no speed advantage: even though their FLOPs are lower, their computational density is also unsatisfactory. For example, the FLOPs of the VGG16 model are much larger than those of EfficientNetB3, but the inference speed of VGG16 is significantly faster than that of EfficientNetB3. Therefore, the multi-branch structure is more friendly during training, while the Plain structure is more suitable for inference. Starting from this, we can use a multi-branch network structure in the training phase, trading a larger training time cost for a model with better feature-fitting ability, and convert the multi-branch structure to a Plain structure in the inference phase to obtain shorter inference time. The conversion from a multi-branch structure to a Plain structure can be achieved with the structural re-parameterization technique.
In addition, the Plain structure is more friendly for pruning operations.
Note: The term "Plain structure" and "structural re-parameterization" are from the paper "RepVGG: Making VGG-style ConvNets Great Again". Plain structure network model means that there is no branching structure in the whole network, i.e., the input of layer `i` is the output of layer `i-1` and the output of layer `i` is the input of layer `i+1`.
### Q3.2: What are the main innovations of ACNet?
**A**: ACNet means "Asymmetric Convolution Block", and the idea comes from the paper "ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks", which proposes an "ACB" structure consisting of three convolution kernels to replace the traditional square convolution kernels of existing convolutional neural networks during the training phase.
Assume the size of the square convolution kernel is `d*d`, i.e., both the width and the height are `d`. The ACB structure that replaces this convolution kernel consists of three convolution kernels of size `d*d`, `1*d`, and `d*1`, whose outputs are added directly to obtain an output of the same size as that of the original square convolution kernel. After training, the ACB structure is replaced by a single square convolution kernel whose parameters are the direct sum of the parameters of the three ACB convolution kernels (see `Q3.4`), so the model structure used for inference is the same as before; the ACB structure is only used in the training phase.
During the training, the network width of the model is improved by the ACB structure, and more features are extracted using the two asymmetric convolution kernels of `1*d` and `d*1` to enrich the information of the feature maps extracted by the `d*d` convolution kernels. In the inference stage, this design idea does not bring additional parameters and computational overhead. The following figure shows the form of convolutional kernels for the training phase and the inference deployment phase, respectively.
Experiments by the authors of the article show that the model capability can be significantly improved by using ACNet structures in the training of the original network model, as explained by the original authors as follows.
1. Experiments show that for a `d*d` convolution kernel, removing the parameters at the skeleton positions (e.g., the `skeleton` positions of the convolution kernel in the figure above) hurts model accuracy much more than removing the parameters at the corner positions (e.g., the `corners` positions in the figure above), so the parameters at the skeleton positions of the convolution kernel are essential. The two asymmetric convolution kernels in the ACB structure enhance the weight of the parameters at the skeleton positions of the square convolution kernel, making them more significant. As for whether this summation might weaken the role of the skeleton-position parameters because positive and negative values cancel out, the authors found through experiments that the training of the network always moves in the direction of strengthening the role of the skeleton-position parameters, and no weakening due to cancellation occurs.
2. The asymmetric convolution kernels are more robust to flipped images. As shown in the following figure, the horizontal asymmetric convolution kernel is more robust to up-down flipped images: the features it extracts at semantically identical positions before and after the flip are the same, which is better than the square convolution kernel.
### Q3.3: What are the main innovations of RepVGG?
**A**:
Through Q3.1 and Q3.2, it may occur to us to let ACNet-style ideas decouple the training phase and the inference phase: use a multi-branch structure in the training phase and a Plain structure in the inference phase. This is exactly the innovation of RepVGG. The following figure compares the network structures of ResNet and RepVGG in the training and inference phases.
First, RepVGG in the training phase adopts a multi-branch structure, which can be regarded as a residual structure with a `1*1` convolution and an identity mapping added on top of the traditional VGG network, while RepVGG in the inference phase degenerates to a VGG structure. The transformation of the network structure from training-phase RepVGG to inference-phase RepVGG is implemented with the "structural re-parameterization" technique.
The identity mapping can be regarded as the output of a `1*1` convolution kernel with parameter `1` acting on the input feature map, so the convolution module of RepVGG in the training phase can be regarded as two `1*1` convolutions and one `3*3` convolution, and the parameters of a `1*1` convolution can be directly added to the parameters at the center of the `3*3` convolution kernel (this operation is similar to adding the parameters of the asymmetric convolution kernels to the parameters at the skeleton positions of the square convolution kernel in ACNet). In this way, the identity mapping, `1*1` convolution, and `3*3` convolution branches of the network structure can be merged into one `3*3` convolution in the inference phase, as described in `Q3.4`.
### Q3.4: What are the similarities and differences between the structural re-parameterization in ACNet and RepVGG?
**A**:
From the above, it can be simply understood that RepVGG is an extreme version of ACNet. The re-parameterization operation in ACNet is shown in the following figure.
Take `conv2` as an example: the asymmetric convolution can be regarded as a `3*3` square convolution kernel in which the six parameters of the upper and lower rows are `0`, and the same holds for `conv3`. On top of that, summing the results of `conv1`, `conv2`, and `conv3` is equivalent to summing the three convolution kernels first and then convolving. With `Conv` denoting the convolution operation and `+` denoting matrix addition, we have `Conv1(A)+Conv2(A)+Conv3(A) == Convk(A)`, where `Conv1`, `Conv2`, and `Conv3` have convolution kernels `Kernel1`, `Kernel2`, and `Kernel3` respectively, and `Convk` has the convolution kernel `Kernel1 + Kernel2 + Kernel3`.
The RepVGG network is the same as ACNet, except that the `1*d` asymmetric convolution of ACNet becomes a `1*1` convolution, and the position where the `1*1` convolution is summed becomes the center of the `3*3` convolution kernel.
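A minimal numpy sketch of the kernel summation described above (BN folding is omitted); for RepVGG the same idea applies with the `1*1` kernel added to the centre of the `3*3` kernel.
```python
import numpy as np

def fuse_acb(k_square, k_hor, k_ver):
    """Fold the 1 x d and d x 1 kernels into the d x d kernel.
    Shapes: k_square (Cout, Cin, d, d), k_hor (Cout, Cin, 1, d), k_ver (Cout, Cin, d, 1)."""
    d = k_square.shape[-1]
    fused = k_square.copy()
    fused[:, :, d // 2, :] += k_hor[:, :, 0, :]   # horizontal kernel -> centre row
    fused[:, :, :, d // 2] += k_ver[:, :, :, 0]   # vertical kernel -> centre column
    return fused
```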
### Q3.5: What are the factors that affect the computation speed of a model? Does a model with a larger number of parameters necessarily have a slower computation speed?
**A**:
There are many factors that affect the computation speed of a model, and the number of parameters is only one of them. Specifically, without considering hardware differences, the computation speed of a model is mainly affected by the following aspects.
1. Number of parameters: the number of parameters is used to measure the size of the model; the larger it is, the higher the memory (video memory) requirements of the model during computation. However, the memory (video memory) footprint does not depend entirely on the number of parameters. In the figure below, assuming that the memory footprint of the input feature map is `1` unit, the peak memory footprint of the residual structure on the left during computation is twice that of the Plain structure on the right, because the results of the two branches need to be kept and then added together.
2. FLOPs: Note that FLOPs should be distinguished from floating point operations per second (FLOPS); FLOPs can be simply understood as the amount of computation and are usually adopted to measure the computational complexity of a model. Taking the common convolution operation as an example, and ignoring the batch size, activation function, stride, and bias, assume that the input feature map size is `Min*Min` with `Cin` channels, the output feature map size is `Mout*Mout` with `Cout` channels, and the conv kernel size is `K*K`. The FLOPs of a single convolution can then be calculated as follows (a small helper implementing this count is sketched after this list).
1. The number of feature points contained in the output feature map is: `Cout * Mout * Mout`.
2. For the convolution operation for each feature point in the output feature map: the number of multiplication calculations is: `Cin * K * K`; the number of addition calculations is: `Cin * K * K - 1`.
3. So the total number of computations is: `Cout * Mout * Mout * (Cin * K * K + Cin * K * K - 1)`, i.e. `Cout * Mout * Mout * (2Cin * K * K - 1)`.
3. Memory Access Cost (MAC): The computer needs to read the data from memory (general memory, including video memory) to the operator's Cache before performing operations on data (such as multiplication and addition), and the memory access is very time-consuming. Take grouped convolution as an example, suppose it is divided into `g` groups, although the number of parameters and FLOPs of the model remain unchanged after grouping, the number of memory accesses for grouped convolution becomes `g` times of the previous one (this is a simple calculation without considering multi-level Cache), so the MAC increases significantly and the computation speed of the model slows down accordingly.
4. Parallelism: The term parallelism often includes data parallelism and model parallelism, in this case, model parallelism. Take convolutional operation as an example, the number of parameters in a convolutional layer is usually very large, so if the matrix in the convolutional layer is chunked and then handed over to multiple GPUs separately, the purpose of acceleration can be achieved. Even some network layers with too many parameters for a single GPU memory may be divided into multiple GPUs, but whether they can be divided into multiple GPUs in parallel depends not only on hardware conditions, but also on the specific form of operation. Of course, the higher the degree of parallelism, the faster the model can run.
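A small helper implementing the FLOPs count derived in item 2 above (no bias, batch size ignored); the example layer sizes are illustrative.
```python
def conv_flops(c_in, c_out, m_out, k):
    """FLOPs of one convolution layer: Cout * Mout * Mout * (2 * Cin * K * K - 1)."""
    return c_out * m_out * m_out * (2 * c_in * k * k - 1)

print(conv_flops(c_in=64, c_out=128, m_out=56, k=3))   # illustrative layer
```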
## Issue 4
### Q4.1: There is certain synthetic data in image classification task, is it necessary to use sample equalization?
**A**:
1. If the number of samples of different categories varies greatly, and the samples of one category are expanded by synthetic data to several times more than the other categories, it is necessary to reduce the weight of that category appropriately.
2. If some categories are synthetic and some are half synthetic and half real, equalization is not needed as long as the numbers are of the same order of magnitude. Or you can simply try training and test whether the samples of the synthetic categories can be accurately identified.
3. If increasing the synthetic data degrades the performance on categories whose data comes from other sources, you need to consider whether the synthetic dataset contains noise or hard samples; you can also appropriately increase the weight of the affected categories to obtain better recognition performance on them.
### Q4.2: What new opportunities and challenges will be brought by the introduction of Vision Transformer (ViT) into the field of image classification by academia? What are the advantages over CNN?
Paper address: [AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE](https://openreview.net/pdf?id=YicbFdNTTy)
**A**:
1. The dependence on CNNs for image tasks is not necessary, and the computational efficiency and scalability of the Transformer make it possible to train very large models without saturation as the model and dataset grow. Inspired by the Transformer in NLP, when used for image classification tasks, an image is split into a sequence of patches, which are linearly embedded and fed into the Transformer as input.
2. On medium-sized datasets such as ImageNet1k and ImageNet21k, the visual Transformer model is several percentage points lower than ResNet of the same size. It is speculated that this is because the Transformer lacks the Locality and Spatial Invariance that CNNs have, and it is difficult to outperform convolutional networks when the amount of data is not large enough. For this problem, the data augmentation adopted by [DeiT](https://arxiv.org/abs/2012.12877) to some extent addresses the reliance of the Vision Transformer on very large datasets for training.
3. This approach can go beyond local information and model more long-range dependencies when training on super large-scale datasets (14M-300M images), while CNNs can better focus on local information but are weak in capturing global information.
4. The Transformer once reigned in the field of NLP, but was also questioned as not applicable to the CV field. Several recent papers in the vision field also deliver performance competitive with CNN SOTA. We believe that a joint Vision-Language or multimodal model will be proposed that can solve both visual and linguistic problems.
### Q4.3: For the Vision Transformer model, how is the image converted into sequence information for the Encoder?
**A**:
1. The Transformer model mainly relies on the attention mechanism, which was designed for scenarios with sequential, semantic embedding information. However, image classification is not closely related to the semantic information of sequences, so the Vision Transformer has its own unique design, and it is precisely the goal of ViT to use the attention mechanism instead of CNNs.
2. Consider the input form of the Encoder in Transformer, as shown below:
- (1) Variable-length sequential input: in an RNN structure the number of words in a sentence varies. In an NLP scenario, a change of word order affects the semantics little, but the position of an image region matters a lot, since connecting different regions in a different order can cause great misunderstanding.
- (2) Each patch is transformed into a vector of fixed dimension: the Encoder input is the embedding of the patch pixel information, combined (concatenated) with a fixed position vector, to form a vector of fixed dimension that carries position information.
3. Consider the following question: how is an image passed to the Encoder?
- As the following figure shows, suppose the input image is [224, 224, 3]; it is cut into many patches in order from left to right and top to bottom, with a patch size of [p, p, 3] (p can be 16 or 32). Each patch is converted into a feature vector by the Linear Projection of Flattened Patches module, concatenated with a position vector, and fed into the Encoder.
4. As shown above, given an image of `H×W×C` and a patch size `P`, the image can be divided into `N` patches of size `P×P×C`, where `N=H×W/(P×P)`. After obtaining the patches, a linear transformation converts each of them into a D-dimensional feature vector, to which the position encoding vector is added. Similar to BERT, ViT also prepends a classification flag bit to the sequence, denoted as `[CLS]`. The ViT input sequence `z` is shown in the following equation, where `x` represents an image patch.
5. The ViT model is basically the same as the Transformer: the input sequence is passed into ViT, and the final output feature corresponding to the `[CLS]` flag is used for classification. ViT consists mainly of MSA (multi-head self-attention) and MLP (a two-layer fully connected network with the GELU activation function), with LayerNorm applied before MSA and MLP and residual connections around them.
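A minimal sketch of the patch-embedding step described above in PaddlePaddle; the strided convolution is a common equivalent of "cut into P×P patches and linearly project", and all sizes are illustrative.
```python
import paddle
import paddle.nn as nn

class PatchEmbed(nn.Layer):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = H*W / (P*P)
        self.proj = nn.Conv2D(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = self.create_parameter([1, 1, embed_dim])
        self.pos_embed = self.create_parameter([1, self.num_patches + 1, embed_dim])

    def forward(self, x):                                         # x: [B, 3, 224, 224]
        x = self.proj(x).flatten(2).transpose([0, 2, 1])          # [B, N, D]
        cls = self.cls_token.expand([x.shape[0], 1, x.shape[2]])
        x = paddle.concat([cls, x], axis=1)                       # prepend [CLS]
        return x + self.pos_embed                                 # add position encoding
```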
### Q4.4: How to understand Inductive Bias?
**A**:
1. In machine learning, certain assumptions are made about the problem to be solved, and this kind of assumption is called inductive bias (inductive preference). Certain a priori rules are induced from phenomena observed in real life, and then certain constraints are placed on the model, thus playing the role of model selection. In CNNs, it is assumed that features have the properties of Locality and Spatial Invariance, that is, adjacent features are related while distant ones are not, and fusing adjacent features makes it easier to produce a "solution"; the attention mechanism is likewise a rule induced from human intuition and life experience.
2. The Vision Transformer utilizes an inductive bias linked to Sequentiality and Time Invariance, i.e., the time interval in sequence order, and therefore also achieves better performance than CNN-like models on larger datasets. From "Unlike prior works using self-attention in computer vision, we do not introduce any image-specific inductive biases into the architecture" in the Conclusion and "We find that large scale training trumps inductive bias" in the Introduction of the paper, we can conclude that, intuitively, introducing inductive bias degrades performance when the amount of data is large, and it should be discarded whenever possible.
### Q4.5: Why does ViT add a [CLS] flag bit? Why is the vector corresponding to the [CLS] used as the semantic representation of the whole sequence?
**A**:
1. Similar to BERT, ViT adds a `[CLS]` flag bit before the first patch, and the output vector corresponding to this flag bit at the last layer can be used as a semantic representation of the whole image, and thus for downstream classification tasks, etc. In this way, the whole embedding group characterizes the features of the image at different locations.
2. The vector corresponding to the `[CLS]` flag bit is used as the semantic representation of the whole image because, compared with the other patches, this symbol with no obvious semantic information will "fairly" integrate the semantic information of each patch in the image, and thus better represent the semantics of the whole image.
## Issue 5
### Q5.1: What is included in the PaddleClas training profile? How to modify during the model training?
**A**: PaddleClas configures 6 modules, namely: global configuration, network architecture, learning rate, optimizer, training and validation.
The global configuration contains information about the configuration of the task, such as the number of categories, the amount of data in the training set, the number of epochs to train, the size of the network input, etc. If you want to train a custom task or use your own training set, please pay attention to this section.
The configuration of the network structure defines the network to be used. In practice, the first step is to select the appropriate configuration file, so this part of the configuration is usually not modified. Modifications are only made when customizing the network structure, or when there are special requirements for the task.
It is recommended to use the default configuration for the learning rate and optimizer. These parameters are the ones that are already tuned. Fine-tuning can also be done if the changes to the task are significant.
The training and inference configurations include batch_size, dataset, data transforms (transforms), number of workers (num_workers) and other important settings, which should be modified according to the actual environment. Note that the batch_size in PaddleClas is a per-card setting; in multi-card training, the total batch_size is a multiple of the one set in the configuration file. For example, if the configuration file sets a batch_size of 64 and 4 cards are used, the total batch_size is 4*64=256. num_workers defines the number of processes on a single card, i.e., if num_workers is 8 with 4 cards for training, there are actually 32 workers.
### Q5.2: How to quickly modify the configuration in the command line?
**A**:
During training, we often need to fine-tune individual configurations without frequently modifying the configuration file. This can be done with `-o`: write the name of the configuration to be changed level by level, split the levels with dots, and then append the value to be set. For example, to modify the batch_size, we can add `-o DataLoader.TRAIN.sampler.batch_size=512` after the training command.
### Q5.3: How to choose the right model according to the accuracy curve of PaddleClas?
**A**:
PaddleClas provides benchmarks for several models and plots performance curves. There are mainly three kinds: the accuracy-inference time curve, the accuracy-parameter count curve, and the accuracy-FLOPs curve, with accuracy on the vertical axis and the other quantity on the horizontal axis. In general, different models perform consistently across the three plots. Models of the same series are represented by the same symbol and connected by a curve.
Taking the accuracy-inference time curve as an example, a higher point indicates higher accuracy, and a point further to the left indicates faster speed. For example, a model in the upper-left region is fast and accurate, while the leftmost points close to the vertical axis are lightweight models. When using these plots, you can choose the right model by considering both accuracy and time. As an example, if we want the most accurate model that runs under 10ms, first draw a vertical line at 10ms on the horizontal axis, and then find the highest point on the left side of that line, which is the model that meets the requirements.
In practice, the number of parameters and FLOPs of a model are constant, while the running time varies under different hardware and software conditions. If you want to choose the model more accurately, you can run the test in your own environment and obtain the corresponding performance graph.
### Q5.4: If I want to add two classes to ImageNet, can I fix the parameters of the existing fully connected layer and only train the new two?
**A**:
This idea works in theory, but it will probably not work well. If only the fully connected layers are fixed while the parameters of the preceding convolutional layers change, there is no guarantee that these fully connected layers will work the same as they did at the beginning. If the parameters of the entire network are kept constant and only the fully connected layers for the two new categories are trained, it is also difficult to obtain the desired results.
If you really need the original 1000 categories to remain accurate, you can add the data of the new categories to the original training set and then fine-tune from the pre-trained model. If you only need a few of the 1000 categories, you can pick out that part of the data, mix it with the new data, and then fine-tune.
### Q5.5: When using classification models as pre-training models for other tasks, which layers should be selected as features?
**A**:
There are many strategies for using classification models as the backbone of other tasks, and a fairly basic approach is presented here. First, remove the final fully connected layer, which mainly contains the classification information of the original task. If the task is relatively simple, just use the output of the previous layer as the feature map and add the structure corresponding to the task on top of it. If the task involves multiple scales, and anchors of different scales need to be selected, such as in some detection models, then the output of the layer before each downsampling can be selected as the feature map.
-[Common Datasets for Image Recognition](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#图像识别任务常见数据集介绍)
-[2.1 General Datasets](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#通用图像识别数据集)
-[2.2.1 Animation Character Recognition](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#动漫人物识别)
## 1. Dataset Format Description
The dataset for vector search, unlike those for classification tasks, is divided into the following three parts:
- Train dataset: Used to train the model to learn the image features involved.
- Gallery dataset: Used to provide the gallery data in the vector search task. It can either be the same as the train or query datasets or different, and when it is the same as the train dataset, the category system of the query dataset and train dataset should be the same.
- Query dataset: Used to test the performance of the model. It usually extracts features from each query image of the dataset, followed by distance matching with those in the gallery dataset to get the recognition results, based on which the metrics of the whole query dataset are calculated.
The above three datasets all adopt `txt` files for assignment. Taking the `CUB_200_2011` dataset as an example, the `train_list.txt` of the train dataset has the following format:
```
# Use "space" as the separator
...
train/99/Ovenbird_0136_92859.jpg 99 2
...
train/99/Ovenbird_0128_93366.jpg 99 6
...
```
The `test_list.txt` of the query dataset (which serves as both the gallery dataset and the query dataset in `CUB_200_2011`) has the same format: each row is separated by a space, and the three columns stand for the path, label information, and unique id of the data.
**Note**:
1. When the gallery dataset and query dataset are the same, to remove the first retrieved result (the image itself requires no evaluation), each record should have its unique id (ensuring that each image has a different id, which can be represented by the row number) for subsequent evaluation of mAP, recall@1, and other metrics. In this case, the dataset class configured in the yaml file is `VeriWild`.
2. When the gallery dataset and query dataset are different, there is no need to add a unique id. Both `query_list.txt` and `gallery_list.txt` contain two columns, which are the path and label information of the data. In this case, the dataset class configured in the yaml file is `ImageNetDataset`.
## 2. Common Datasets for Image Recognition
Here we present a compilation of commonly used image recognition datasets, which is continuously updated; supplements are welcome.
### 2.1 General Datasets
- SOP: The SOP dataset is a common product dataset in general recognition and metric learning research, which contains 120,053 images of 22,634 products downloaded from eBay.com. There are 59,551 images of 11,318 categories in the training set and 60,502 images of 11,316 categories in the validation set.
- Cars196: The Cars dataset contains 16,185 images of 196 categories of cars. The data is split into 8,144 training images and 8,041 query images, with each category divided roughly 50-50. The categories are normally defined by the make, model, and year of the car, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe.
- CUB_200_2011: The CUB_200_2011 dataset is a fine-grained dataset proposed by the California Institute of Technology (Caltech) and is currently the benchmark image dataset for fine-grained classification and recognition research. It contains 11,788 bird images in 200 subclasses, including 5,994 images in the train dataset and 5,794 images in the query dataset. Each image provides label information, the bounding box of the bird, the key part information of the bird, and the attributes of the bird.
- In-shop Clothes: In-shop Clothes is one of the 4 subsets of the DeepFashion dataset. It is a seller-show image dataset in which multi-angle images of each product id are collected in the same folder. The dataset contains 7,982 items and 52,712 images, each with 463 attributes, bounding boxes, landmarks, and store descriptions.
- iCartoonFace: iCartoonFace, developed by iQiyi (an online video platform), is the world's largest manually labeled detection and recognition dataset for cartoon characters, which contains more than 5,013 cartoon characters and 389,678 high-quality images. Compared with other datasets, it boasts large scale, high quality, rich diversity, and challenging difficulty, making it one of the most commonly used datasets for studying cartoon character recognition.
- Manga109: Manga109 is a dataset released in May 2020 for the study of cartoon character detection and recognition, which contains 21,142 images and is officially not allowed for commercial use. Manga109-s, a subset of this dataset, is available for industrial use, mainly for tasks such as text detection, sketch line-based search, and character image generation.
Website: http://www.manga109.org/en/
- IIT-CFW: The IIT-CFW dataset contains a total of 8,928 labeled cartoon portraits of celebrity characters, covering 100 characters with varying numbers of portraits for each. In addition, it provides 1,000 real face photos (10 real portraits for each of the 100 public figures). This dataset can be employed to study both animation character recognition and cross-modal search tasks.
- AliProduct: The AliProduct dataset is the largest open source product dataset. As an SKU-level image classification dataset, it contains 50,000 categories and 3 million images, ranking the first in both aspects in the industry. This dataset covers a large number of household goods, food, etc. Due to its lack of manual annotation, the data is messy and unevenly distributed with many similar product images.
- Product-10k: All images in the Products-10k dataset come from Jingdong Mall, covering 10,000 frequently purchased SKUs that are organized into a hierarchy. In total, there are nearly 190,000 images. In real application scenarios, the distribution of image volume is uneven. All images are manually checked/labeled by a team of production experts.
- DeepFashion-Inshop: The same dataset as In-shop Clothes described in the general datasets above.
### 2.2.3 Logo Recognition
- Logo-2K+: Logo-2K+ is a dataset exclusively for logo image recognition, which contains 10 major categories, 2341 minor categories, and 167,140 images.
- Tsinghua-Tencent 100K: This is a large traffic sign benchmark dataset based on 100,000 Tencent Street View panoramas. It provides 100,000 images containing 30,000 traffic sign instances and covers a wide range of illumination and weather conditions. Each traffic sign in the benchmark is labeled with its category, bounding box, and pixel mask. A total of 222 categories (class 0 for background plus 221 traffic sign classes) are incorporated.
- CompCars: The dataset contains 136,726 images of whole cars and 27,618 images of car parts, mainly collected from the web and from surveillance data. The web data covers 163 vehicle manufacturers and 1,716 vehicle models and includes bounding boxes, viewing angles, and 5 attributes (maximum speed, displacement, number of doors, number of seats, and vehicle type). The surveillance data comprises 50,000 front-view images.
- BoxCars: The dataset contains a total of 21,250 vehicles, 63,750 images, 27 vehicle manufacturers, and 148 subcategories. All of them are derived from surveillance data.
Website: https://github.com/JakubSochor/BoxCars
- PKU-VD Dataset: The dataset contains two large vehicle datasets (VD1 and VD2) that capture images from real-world unrestricted scenes in two cities. VD1 is obtained from high-resolution traffic cameras, while images in VD2 are acquired from surveillance videos. The authors performed vehicle detection on the raw data to ensure that each image contains only one vehicle. Due to privacy constraints, all license plates have been obscured with black overlays. All images are captured from the front view, and diverse attribute annotations are provided for each image, including identification numbers, accurate vehicle models, and colors. VD1 originally contained 1,097,649 images, 1,232 vehicle models, and 11 vehicle colors; 846,358 images of 141,756 vehicles remain after removing images containing multiple vehicles and images taken from the rear of the vehicle. VD2 contains 807,260 images of 79,763 vehicles, 1,112 vehicle models, and 11 vehicle colors.
- This document describes the models currently supported by Kunlun and how to train these models on Kunlun devices. To install PaddlePaddle with Kunlun support, please refer to [install_kunlun](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/paddle/install/install_Kunlun_zh.md).
## 2. Training of Kunlun
- See [quick_start](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/quick_start/quick_start_classification_new_user.md) for data sources and pre-trained models. Training results on Kunlun devices are aligned with those on CPU/GPU.
-[1.1 Basic Knowledge of PaddleClas](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s2.md#1.1)
-[1.2 Backbone Network and Pre-trained Model Library](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s2.md#1.2)
-[2.1 Common Problems in Training and Evaluation](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s2.md#2.1)
-[2.6 Model Inference Deployment](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s2.md#2.6)
## Recent Updates
#### Q2.1.7: How to tackle the reported error `ERROR: Unexpected segmentation fault encountered in DataLoader workers.` during training?
**A**:
Try setting the field `num_workers` in the training configuration file to `0`; try reducing the field `batch_size` in the configuration file; and ensure that the dataset format and the dataset path in the configuration file are correct.
#### Q2.1.8: How to use `Mixup` and `Cutmix` during training?
**A**:
- For `Mixup`, please refer to [Mixup](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65); for `Cutmix`, please refer to [Cutmix](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65).
- The training accuracy (Acc) metric cannot be calculated when using `Mixup` or `Cutmix` for training, so you need to remove the field `Metric.Train.TopkAcc` in the configuration file, please refer to [Metric.Train.TopkAcc](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128) for more details.
#### Q2.1.9: What are the fields `Global.pretrain_model` and `Global.checkpoints` used for in the training configuration file yaml?
**A**:
- When `fine-tune` is required, the path of the file of pre-training model weights can be configured via the field `Global.pretrain_model`, which usually has the suffix `.pdparams`.
- During training, the training program automatically saves the breakpoint information at the end of each epoch, including the optimizer information `.pdopt` and model weights information `.pdparams`. If the training process is unexpectedly interrupted and needs to be resumed, the breakpoint information file saved during training can be configured via the field `Global.checkpoints`, for example by configuring `checkpoints: ./output/ResNet18/epoch_18` to restore the breakpoint information saved at the end of epoch 18. PaddleClas will automatically load `epoch_18.pdopt` and `epoch_18.pdparams` and continue training from epoch 19.
#### Q2.6.3: How to convert the model to `ONNX` format?
**A**: Paddle supports two ways of conversion, both relying on the `paddle2onnx` tool, so `paddle2onnx` needs to be installed first.
```
pip install paddle2onnx
```
- From inference model to ONNX format model.
Take the `combined` format inference model (containing `.pdmodel` and `.pdiparams` files) exported from the dynamic graph as an example; the model format is converted with the `paddle2onnx` command, whose main parameters are as follows:
-`model_dir`: this parameter needs to contain `.pdmodel` and `.pdiparams` files.
-`model_filename`: this parameter is used to specify the path of the `.pdmodel` file under the parameter `model_dir`.
-`params_filename`: this parameter is used to specify the path of the `.pdiparams` file under the parameter `model_dir`.
-`save_file`: this parameter is used to specify the path of the file where the converted ONNX model is saved.
For the conversion of a non-`combined` format inference model exported from a static graph (usually containing the file `__model__` and multiple parameter files), and for more parameter descriptions, please refer to the official documentation of [paddle2onnx](https://github.com/PaddlePaddle/Paddle2ONNX/blob/develop/README_zh.md).
- Exporting ONNX format models directly from the model networking code.
Take the model networking code of a dynamic graph as an example: the model class is a subclass that inherits from `paddle.nn.Layer`, and the export uses the `InputSpec()` and `paddle.onnx.export()` functions.
-`InputSpec()` function is used to describe the signature information of the model input, including the `shape`, `type` and `name` of the input data (can be omitted).
- The `paddle.onnx.export()` function needs to specify the model networking object `net`, the save path of the exported model `save_path`, and the description of the model's input data `input_spec`.
Note that `paddlepaddle` `2.0.0` or above is required. See [paddle.onnx.export](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/onnx/) for more details on the parameters of the `paddle.onnx.export()` function.
#### Q2.5.4: How to set the parameter `pq_size` when building the search index for the base library?
**A**:
`pq_size` is a parameter of the PQ search algorithm, which can be roughly understood as a multi-level ("tiered") search algorithm, with `pq_size` being the "capacity" of each tier, so the setting of this parameter affects search performance. However, when the total data volume of the base library is not too large (less than 10,000 entries), this parameter has little impact on performance, so for most application scenarios there is no need to modify it when building the base library. For more details on the PQ search algorithm, see the related [paper](https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf).
## Selected Questions
## 1. Theory
### 1.1 Basic Knowledge of PaddleClas
#### Q1.1.1 Differences between PaddleClas and PaddleDetection
**A**:PaddleClas is an image recognition repo that integrates mainbody detection, image classification, and image retrieval to solve most image recognition problems. It can be easily adopted by users to solve small sample and multi-category issues in the field. PaddleDetection provides the ability of target detection, keypoint detection, multi-target tracking, etc., which is convenient for users to locate the points and regions of interest in images, and is widely used in industrial quality inspection, remote sensing image detection, unmanned inspection and other projects.
#### Q1.1.3: What does the parameter momentum mean in the Momentum optimizer?
**A**:
Momentum optimizer is based on SGD optimizer and introduces the concept of "momentum". In the SGD optimizer, the update of the parameter `w` at the time `t+1` can be expressed as
```
w_t+1 = w_t - lr * grad
```
`lr` is the learning rate and `grad` is the gradient of the parameter `w` at this point. With the introduction of momentum, the update of the parameter `w` can be expressed as
```
v_t+1 = m * v_t + lr * grad
w_t+1 = w_t - v_t+1
```
Here `m` is the `momentum`, which is the weighted value of the cumulative momentum, generally taken as `0.9`. And when the value is less than `1`, the earlier the gradient is, the smaller the impact on the current. For example, when the momentum parameter `m` takes `0.9`, the weighted value of the gradient of `t-5` is `0.9 ^ 5 = 0.59049` at time `t`, while the value at time `t-2` is `0.9 ^ 2 = 0.81`. Therefore, it is intuitive that gradient information that is too "far away" is of little significance for the current reference, while "recent" historical gradient information matters more.
By introducing the concept of momentum, the effect of historical updates is taken into account in parameter updates, thus speeding up convergence and reducing the loss oscillation caused by the `SGD` optimizer.
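As a quick illustration (plain Python, not PaddleClas code), the update rule above and the geometric decay of old gradients can be simulated as follows; the learning rate, momentum value, and toy objective are only illustrative:
```
# Toy simulation of the Momentum update rule described above.
lr, m = 0.1, 0.9      # learning rate and momentum coefficient
w, v = 1.0, 0.0       # parameter value and accumulated "velocity"

def grad(w):
    # gradient of the toy objective f(w) = 0.5 * w ** 2
    return w

for t in range(5):
    v = m * v + lr * grad(w)   # v_{t+1} = m * v_t + lr * grad
    w = w - v                  # w_{t+1} = w_t - v_{t+1}
    print(f"step {t}: w = {w:.4f}")

# A gradient from k steps ago is weighted by m ** k,
# e.g. 0.9 ** 5 = 0.59049, so very old gradients contribute little.
print(0.9 ** 5)
```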
#### Q1.1.4: Does PaddleClas have an implementation of the paper `Fixing the train-test resolution discrepancy`?
**A**: Currently, it is not implemented. If needed, you can try to modify the code yourself. In brief, the idea proposed in this paper is to fine-tune the final FC layer of the trained model using a larger input resolution. Specifically, train the model network on the lower-resolution dataset first, then set the parameter `stop_gradient=True` for the weights of all layers of the network except the final FC layer, and finally fine-tune the network with larger-resolution inputs.
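A minimal sketch of this idea with the PaddlePaddle API is given below; the `resnet50` backbone, the `fc`-prefixed parameter names, and the 320x320 resolution are only illustrative assumptions:
```
import paddle
from paddle.vision.models import resnet50  # illustrative backbone; any paddle.nn.Layer works

model = resnet50(num_classes=1000)

# Freeze all layers except the final FC layer (parameter names starting with "fc" here).
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.stop_gradient = True

# Then fine-tune with a larger input resolution, e.g. 320x320 instead of 224x224.
x = paddle.randn([1, 3, 320, 320])
print(model(x).shape)
```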
### 1.2 Backbone Network and Pre-trained Model Library
### 1.3 Image Classification
#### Q1.3.1: Does PaddleClas provide data enhancement for adjusting image brightness, contrast, saturation, hue, etc.?
**A**:
PaddleClas provides a variety of data augmentation methods, which can be divided into 3 categories.
Among them, RandAugment provides a variety of random combinations of data augmentation methods, which can meet the needs for brightness, contrast, saturation, hue, and other adjustments.
### 1.4 General Detection
#### Q1.4.1 Does the mainbody detection only export one subject detection box at a time?
**A**: The number of outputs of the mainbody detection is configurable through the configuration file: `Global.threshold` controls the detection threshold, so detection boxes with confidence below this value are discarded, and `Global.max_det_results` controls the maximum number of returned results. Together they determine the number of output detection boxes.
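For example, the relevant fields in the inference configuration file might look like the following sketch (the values are illustrative, not recommendations):
```
Global:
  threshold: 0.2        # detection boxes with confidence below this value are discarded
  max_det_results: 5    # return at most 5 detection boxes
```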
#### Q1.4.2 How is the data selected for training the mainbody detection model? Will it harm the accuracy to switch to a smaller model?
**A**:
The training data is a randomly selected subset of publicly available datasets such as COCO, Object365, RPC, and LogoDet. We are currently introducing an ultra-lightweight mainbody detection model in version 2.3, which can be found in [Mainbody Detection](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/image_recognition_pipeline/mainbody_detection.md).
#### Q1.4.3: Are there any false detections in some scenarios with the current mainbody detection model?
**A**: The current mainbody detection model is trained using publicly available datasets such as COCO, Object365, RPC, and LogoDet. If the data to be detected differs greatly from common categories, as in industrial quality inspection, it is necessary to fine-tune the current detection model on that data.
### 1.5 Image Recognition
#### Q1.5.1 Is `triplet loss` needed for `circle loss` ?
**A**:
`circle loss` is a unified formulation of pair-wise learning and classification learning; if the classification-learning form is used, `triplet loss` can be added.
#### Q1.5.2 Which recognition model should be used if the images to be recognized do not belong to the four open-source recognition domains?
**A**:
The product recognition model is recommended. First, the range of products is wider, so the probability that the image to be recognized is a product is higher. Second, the product recognition model is trained on data from 50,000 categories, so it has better generalization ability and more robust features.
#### Q1.5.3 Why is a 512-dimensional vector adopted instead of 1024 or another dimension?
**A**:
Vectors of relatively low dimension should be adopted; in practice, 128 dimensions or even fewer may be used to speed up computation. In general, a dimension of 512 is large enough to adequately represent the features.
### 1.6 Vector Search
#### Q1.6.1 Does the Möbius vector search algorithm currently used by PaddleClas support index.add() similar to the one used by faiss? Also, do I have to train every time I build a new graph? Is the train here to speed up the search or to build similar graphs?
**A**: The faiss retrieval module is supported in the release/2.3 branch, and Möbius is no longer supported. Möbius provides a graph-based approximate nearest neighbor search algorithm and currently supports two types of distance calculation: inner product and L2 distance. However, Möbius does not support the index.add function provided in faiss, so if you need to add content to the search library, you need to rebuild the index from scratch. The search algorithm internally performs a train-like process each time the index is built, which is different from the train interface provided by faiss. Therefore, if you need the faiss module, you can use the release/2.3 branch; if you need Möbius, you need to fall back to the release/2.2 branch for now.
#### Q1.6.2: What exactly are the `Query` and `Gallery` configurations used for in the PaddleClas image recognition for Eval configuration file?
**A**:
Both `Query` and `Gallery` are dataset configurations, where `Gallery` is used to configure the base library data and `Query` is used to configure the validation set. During Eval, the model first performs forward computation on the `Gallery` base library data to extract feature vectors, which are used to construct the base library; then the model performs forward computation on the data in the `Query` validation set to extract feature vectors, and metrics such as recall rate are computed against the base library.
## 2. Practice
### 2.1 Common Problems in Training and Evaluation
#### Q2.1.1 Where is the `train_log` file in PaddleClas?
**A**:`train.log` is stored in the path where the weights are stored.
#### Q2.1.2 Why does the model output nan during training?
**A**: 1. Make sure the pre-trained model is loaded correctly, the easiest way is to add the parameter `-o Arch.pretrained=True`; 2. When fine-tuning the model, the learning rate should not be too large, e.g. 0.001.
#### Q2.1.3 Is it possible to perform frame-by-frame prediction in a video?
**A**:Yes, but currently PaddleClas does not support video input. You can try to modify the code of PaddleClas or store the video frame by frame before using PaddleClas.
#### Q2.1.4: In data preprocessing, what setting can be adopted without cropping the input data? Or how to set the size of the crop?
**A**: The data preprocessing operators supported by PaddleClas can be viewed in `ppcls/data/preprocess/__init__.py`, and all supported operators can be configured in the configuration file. The name of the operator needs to be the same as the operator class name, and the parameters need to be the same as the constructor parameters of the corresponding operator class. If you do not need to crop the image, you can remove `CropImage` and `RandCropImage` and replace them with `ResizeImage`, whose parameters offer different resize behaviors: the `size` parameter scales the image directly to a fixed size, while the `resize_short` parameter scales the image while keeping its aspect ratio. To set the crop size, use the `size` parameter of the `CropImage` operator or the `size` parameter of the `RandCropImage` operator, as illustrated in the sketch below.
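The following is a sketch of how the transform list in a training configuration file could be changed from cropping to plain resizing; the surrounding structure and values are illustrative and should be adapted to your own configuration file:
```
DataLoader:
  Train:
    dataset:
      transform_ops:
        - DecodeImage:
            to_rgb: True
            channel_first: False
        # - RandCropImage:       # remove the crop operator ...
        #     size: 224
        - ResizeImage:           # ... and resize instead
            size: 224            # scale directly to 224 x 224
            # resize_short: 256  # or keep the aspect ratio and only resize the short side
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
```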
#### Q2.1.5: Why do I get a usage error after PaddlePaddle installation and cannot import any modules under paddle (import paddle.xxx)?
**A**:
You can first test if Paddle is installed correctly by using the following code.
```
import paddle
paddle.utils.install_check.run_check()
```
If it is installed correctly, the following message will be displayed.
```
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
```
Otherwise, the relevant errors will be reported. Also, if you have installed both the CPU and the GPU versions of Paddle, you will need to uninstall both and reinstall the required version, because the two versions conflict with each other.
#### Q2.1.6: How to save the optimal model during training?
**A**:
PaddleClas saves/updates the following three types of models during training.
1. the latest model (`latest.pdopt`, `latest.pdparams`, `latest.pdstates`), which can be used to resume training when it is unexpectedly interrupted.
2. the best model (`best_model.pdopt`, `best_model.pdparams`, `best_model.pdstates`).
3. breakpoints at the end of an epoch during training (`epoch_xxx.pdopt`, `epoch_xxx.pdparams`, `epoch_xxx.pdstates`). The `Global.save_interval` field in the training profile indicates the save interval for this model. If you make it larger than the total number of epochs, intermediate breakpoint models will no longer be saved.
#### Q2.1.7: How to address the `ERROR: Unexpected segmentation fault encountered in DataLoader workers.` during training?
**A**:Try setting the field `num_workers` in the training profile to `0`; try making the field `batch_size` in the training profile smaller; ensure that the dataset format and the dataset path in the profile are correct.
#### Q2.1.8: How to use `Mixup` and `Cutmix` during training?
**A**:
- For `Mixup`, please refer to [Mixup](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65); for `Cutmix`, please refer to [Cutmix](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65).
- The training accuracy (Acc) metric cannot be calculated when using `Mixup` or `Cutmix` for training, so you need to remove the `Metric.Train.TopkAcc` field in the configuration file; please refer to [Metric.Train.TopkAcc](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128) for more details.
#### Q2.1.9: What are the fields `Global.pretrain_model` and `Global.checkpoints` used for in the training configuration file yaml?
**A**:
- When `fine-tune` is required, the path of the file of pre-training model weights can be configured via the field `Global.pretrain_model`, which usually has the suffix `.pdparams`.
- During training, the training program automatically saves the breakpoint information at the end of each epoch, including the optimizer information `.pdopt` and model weights information `.pdparams`. If the training process is unexpectedly interrupted and needs to be resumed, the breakpoint information file saved during training can be configured via the field `Global.checkpoints`, for example by configuring `checkpoints: ./output/ResNet18/epoch_18` to restore the breakpoint information saved at the end of epoch 18. PaddleClas will automatically load `epoch_18.pdopt` and `epoch_18.pdparams` and continue training from epoch 19.
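For example, resuming from the breakpoint saved at the end of epoch 18 can be configured roughly as follows (the path is illustrative):
```
Global:
  checkpoints: ./output/ResNet18/epoch_18   # epoch_18.pdopt and epoch_18.pdparams will be loaded
```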
### 2.2 Image Classification
#### Q2.2.1 In SSLD, how is the small model distilled after pre-training the large model on the 5-million-image dataset, and how is it then fine-tuned on the 1-million-image dataset?
**A**:The steps are as follows:
1. Obtain the `ResNet50-vd` model based on the distillation of the open-source `ResNeXt101-32x16d-wsl` model from Facebook.
2. Use this `ResNet50-vd` to distill `MobileNetV3` on the 5-million-image dataset.
3. Considering that the distribution of the 5-million-image dataset is not exactly the same as that of the 1-million-image data, the model is fine-tuned on the 1-million-image data to slightly improve the accuracy.
#### Q2.2.2 nan appears in loss when training SwinTransformer
**A**: When training SwinTransformer, please use `Paddle` `2.1.1` or above and load the pre-trained model we provide. Also, the learning rate should be kept at an appropriate level.
### 2.3 General Detection
#### Q2.3.1 Why are there some images that are detected as the original image?
**A**: The mainbody detection model returns detection boxes, but in fact, in order to make the subsequent recognition model more accurate, the original image is also returned along with the detection boxes. Subsequently, the original image and the detection boxes are each ranked according to their similarity with the images in the library, and the label of the library image with the highest similarity becomes the label of the recognized image.
#### Q2.3.2: In a live-broadcast scenario, is it possible to provide real-time recognition that locates the target object and draws a box around it within a delay of a few seconds?
**A**: Real-time detection places high demands on detection speed. PP-YOLO is a lightweight target detection model provided by the Paddle team, which strikes a good balance between detection speed and accuracy, so you can try PP-YOLO for detection. For the use of PP-YOLO, you can refer to [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/configs/ppyolo/README_cn.md).
#### Q2.3.3: For unknown labels, adding them to the gallery dataset allows subsequent recognition without retraining; but if the preceding detection model cannot locate and detect such unknown-label objects, is it still necessary to retrain the detection model?
**A**:If the detection model does not perform well on your own dataset, you need to finetune it again on your own detection dataset.
### 2.4 Image Recognition
#### Q2.4.1: Why is `Illegal instruction` reported during the recognition inference?
**A**: If you are using the release/2.2 branch, it is recommended to update it to the release/2.3 branch, where we replaced the Möbius search module with the faiss search module, as described in the [Vector Search Tutorial](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/deploy/vector_search/README.md). If you still have problems, you can contact us in the WeChat group or raise an issue on GitHub.
#### Q2.4.2: How can recognition models be fine-tuned to train on the basis of pre-trained models?
**A**: The fine-tuning training of the recognition model is similar to that of the classification model. The recognition model can be loaded with a pre-trained product recognition model, and the training process can be found in [recognition model training](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models_training/recognition.md); we will continue to refine the documentation.
#### Q2.4.3: Why does it fail to run all mini-batches in each epoch when training metric learning?
**A**: When training metric learning, the sampler used is DistributedRandomIdentitySampler, which does not sample all the images, so each epoch only samples part of the data. It is therefore normal that not all mini-batches shown in the log are run. This issue has been optimized in the release/2.3 branch; please update to release/2.3 to use it.
#### Q2.4.4: Why do some images have no recognition results?
**A**:In the configuration file (e.g. inference_product.yaml), `IndexProcess.score_thres` controls the minimum value of cosine similarity of the recognized image to the image in the library. When the cosine similarity is less than this value, the result will not be printed. You can adjust this value according to your actual data.
### 2.5 Vector Search
#### Q2.5.1: Why is the error `assert text_num >= 2` reported after adding an image to the index?
**A**: Make sure that the image path and the image name in data_file.txt are separated by a single tab instead of a space.
#### Q2.5.2: Do I need to rebuild the index to add new base data?
**A**: Starting from the release/2.3 branch, we have replaced the Möbius search module with the faiss search module, which supports adding base data without rebuilding the base library, as described in the [Vector Search Tutorial](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/deploy/vector_search/README.md).
#### Q2.5.3: How to deal with the reported error clang: error: unsupported option '-fopenmp' when recompiling index.so in Mac?
**A**:
If you are using the release/2.2 branch, it is recommended to update it to the release/2.3 branch, where we replaced the Möbius search module with the faiss search module, as described in the [Vector Search Tutorial](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/deploy/vector_search/README.md). If you still have problems, you can contact us in the user WeChat group or raise an issue on GitHub.
#### Q2.5.4: How to set the parameter `pq_size` when building the search index for the base library?
**A**:
`pq_size` is a parameter of the PQ search algorithm, which can be roughly understood as a multi-level ("tiered") search algorithm, with `pq_size` being the "capacity" of each tier, so the setting of this parameter affects search performance. However, when the total data volume of the base library is not too large (less than 10,000 entries), this parameter has little impact on performance, so for most application scenarios there is no need to modify it when building the base library. For more details on the PQ search algorithm, see the related [paper](https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf).
### 2.6 Model Inference Deployment
#### Q2.6.1: How to add the parameter of a module that is enabled by hub serving?
**A**:See [hub serving parameters](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/deploy/hubserving/clas/params.py) for more details.
#### Q2.6.2: Why is the result not accurate enough when exporting the inference model for inference deployment?
**A**:
This problem is usually caused by the incorrect loading of the model parameters when exporting. First check the export log for something like the following.
```
UserWarning: Skip loading for ***. *** is not found in the provided dict.
```
If it exists, the model weights were not loaded successfully. Please further check whether the `Global.pretrained_model` field in the configuration file correctly points to the model weights file. The suffix of the model weights file is usually `pdparams`; note that the file suffix should not be included when configuring this path.
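For example (the path is illustrative; note that the `.pdparams` suffix is omitted):
```
Global:
  pretrained_model: ./output/ResNet50_vd/best_model
```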
#### Q2.6.3: How to convert the model to `ONNX` format?
**A**:
Paddle supports two ways of conversion, both relying on the `paddle2onnx` tool, so `paddle2onnx` needs to be installed first.
```
pip install paddle2onnx
```
- From inference model to ONNX format model:
Take the `combined` format inference model (containing `.pdmodel` and `.pdiparams` files) exported from the dynamic graph as an example, run the following command to convert the model format:
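A typical invocation might look like the following sketch; the directory and file names are placeholders, and the exact options should be checked against the paddle2onnx documentation:
```
paddle2onnx --model_dir ./inference \
            --model_filename inference.pdmodel \
            --params_filename inference.pdiparams \
            --save_file ./inference.onnx
```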
-`model_dir`: this parameter needs to contain `.pdmodel` and `.pdiparams` files.
-`model_filename`: this parameter is used to specify the path of the `.pdmodel` file under the parameter `model_dir`.
-`params_filename`: this parameter is used to specify the path of the `.pdiparams` file under the parameter `model_dir`.
-`save_file`: this parameter is used to specify the path of the file where the converted ONNX model is saved.
For the conversion of a non-`combined` format inference model exported from a static graph (usually containing the file `__model__` and multiple parameter files), and for more parameter descriptions, please refer to the official documentation of [paddle2onnx](https://github.com/PaddlePaddle/Paddle2ONNX/blob/develop/README_zh.md).
- Exporting ONNX format models directly from the model networking code.
Take the model networking code of dynamic graphs as an example, the model class is a subclass that inherits from `paddle.nn.Layer` and the code is shown below:
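A minimal sketch is given below; the toy network `SimpleNet`, the input shape, and the save path are illustrative assumptions rather than PaddleClas code:
```
import paddle
from paddle.static import InputSpec

# Illustrative toy network; replace it with your own paddle.nn.Layer subclass.
class SimpleNet(paddle.nn.Layer):
    def __init__(self, class_num=1000):
        super().__init__()
        self.conv = paddle.nn.Conv2D(3, 8, kernel_size=3, padding=1)
        self.pool = paddle.nn.AdaptiveAvgPool2D(1)
        self.fc = paddle.nn.Linear(8, class_num)

    def forward(self, x):
        x = self.conv(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)

net = SimpleNet()
# InputSpec describes the signature of the model input: shape, dtype and (optional) name.
x_spec = InputSpec(shape=[None, 3, 224, 224], dtype="float32", name="x")
# Export the networking object `net` to the given save path; the ".onnx" suffix is appended.
paddle.onnx.export(net, "simple_net", input_spec=[x_spec])
```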
-`InputSpec()` function is used to describe the signature information of the model input, including the `shape`, `type` and `name` of the input data (can be omitted).
- The `paddle.onnx.export()` function needs to specify the model networking object `net`, the save path of the exported model `save_path`, and the description of the model's input data `input_spec`.
Note that `paddlepaddle` `2.0.0` or above is required. See [paddle.onnx.export](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/onnx/) for more details on the parameters of the `paddle.onnx.export()` function.
- Add pre-training weights for lightweight models, including detection models and feature models
- Release the PP-LCNet series of models, which are self-developed models designed to run on CPU
- Enable SwinTransformer, Twins, and DeiT to support direct training from scratch to achieve the accuracy reported in the papers.
- Basic framework capabilities
- Add DeepHash module, which supports feature model to directly export binary features
- Add PKSampler, which tackles the problem that feature models cannot be trained by multiple machines and cards
- Support PaddleSlim: support quantization, pruning training, and offline quantization of classification models and feature models
- Enable legendary models to support intermediate model output
- Support multi-label classification training
- Inference Deployment
- Replace the original feature retrieval library with Faiss to improve platform adaptability
- Support PaddleServing: support the deployment of classification models and image recognition process
- Recommended library versions
- python: 3.7
- PaddlePaddle: 2.1.3
- PaddleSlim: 2.2.0
- PaddleServing: 0.6.1
## 2. v2.2
- Model Updates
- Add models including LeViT, Twins, TNT, DLA, HardNet, RedNet, and SwinTransformer
- Basic framework capabilities
- Divide the classification models into two categories
- legendary models: introduce TheseusLayer base class, add the interface to modify the network function, and support the networking data truncation and output
- model zoo: other common classification models
- Add the support of Metric Learning algorithm
- Add a variety of related loss algorithms, and the basic network module gears (allow the combination with backbone and loss) for convenient use
- Support both the general classification and metric learning-related training
- Support static graph training
- Classification training with dali acceleration supported
- Support fp16 training
- Application Updates
- Add specific application cases and related models of product recognition, vehicle recognition (vehicle fine-grained classification, vehicle ReID), logo recognition, animation character recognition
- Add a complete pipeline for image recognition, including detection module, feature extraction module, and vector search module
- Inference Deployment
- Add Möbius, Baidu's self-developed vector search module, to support the inference deployment of the image recognition system
- Image recognition, build feature library that allows batch_size>1