From 307424b6f997beb2dc7f4951851896f14520831e Mon Sep 17 00:00:00 2001 From: gaotingquan Date: Tue, 21 Dec 2021 06:29:48 +0000 Subject: [PATCH] docs: fix link --- .../classification_dataset_en.md | 34 +++++----- .../recognition_dataset_en.md | 38 ++++++----- docs/en/faq_series/faq_2020_s1_en.md | 64 +++++++++---------- docs/en/faq_series/faq_2021_s1_en.md | 60 ++++++++--------- docs/en/faq_series/faq_selected_30_en.md | 39 +++++------ docs/en/introduction/function_intro_en.md | 4 +- docs/en/others/feature_visiualization_en.md | 30 +++++---- docs/en/others/train_on_xpu_en.md | 28 ++++---- docs/en/others/versions_en.md | 8 +-- 9 files changed, 160 insertions(+), 145 deletions(-) diff --git a/docs/en/data_preparation/classification_dataset_en.md b/docs/en/data_preparation/classification_dataset_en.md index c1f84dd7..b2394cc3 100644 --- a/docs/en/data_preparation/classification_dataset_en.md +++ b/docs/en/data_preparation/classification_dataset_en.md @@ -6,21 +6,21 @@ This document elaborates on the dataset format adopted by PaddleClas for image c ## Contents -- [Dataset Format](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/classification_dataset.md#数据集格式说明) -- [Common Datasets for Image Classification](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/classification_dataset.md#图像分类任务常见数据集介绍) - - [2.1 ImageNet1k](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/classification_dataset.md#ImageNet1k) - - [2.2 Flowers102](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/classification_dataset.md#Flowers102) - - [2.3 CIFAR10 / CIFAR100](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/classification_dataset.md#CIFAR10/CIFAR100) - - [2.4 MNIST](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/classification_dataset.md#MNIST) - - [2.5 NUS-WIDE](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/classification_dataset.md#NUS-WIDE) +- [1. Dataset Format](#1) +- [2. Common Datasets for Image Classification](#2) + - [2.1 ImageNet1k](#2.1) + - [2.2 Flowers102](#2.2) + - [2.3 CIFAR10 / CIFAR100](#2.3) + - [2.4 MNIST](#2.4) + - [2.5 NUS-WIDE](#2.5) + - -## 1 Dataset Format +## 1. Dataset Format PaddleClas adopts `txt` files to assign the training and test sets. Taking the `ImageNet1k` dataset as an example, where `train_list.txt` and `val_list.txt` have the following formats: -``` +```shell # Separate the image path and annotation with "space" for each line # train_list.txt has the following format @@ -32,12 +32,14 @@ val/ILSVRC2012_val_00000001.JPEG 65 ... ``` + - -## 2 Common Datasets for Image Classification +## 2. Common Datasets for Image Classification Here we present a compilation of commonly used image classification datasets, which is continuously updated and expects your supplement. + + ### 2.1 ImageNet1k [ImageNet](https://image-net.org/) is a large visual database for visual target recognition research with over 14 million manually labeled images. ImageNet-1k is a subset of the ImageNet dataset, which contains 1000 categories with 1281167 images for the training set and 50000 for the validation set. Since 2010, ImageNet began to hold an annual image classification competition, namely, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with ImageNet-1k as its specified dataset. 
To date, ImageNet-1k has become one of the most significant contributors to the development of computer vision, based on which numerous initial models of downstream computer vision tasks are trained. @@ -67,7 +69,7 @@ PaddleClas/dataset/ILSVRC2012/ |_ val_list.txt ``` - + ### 2.2 Flowers102 @@ -104,7 +106,7 @@ PaddleClas/dataset/flowers102/ |_ val_list.txt ``` - + ### 2.3 CIFAR10 / CIFAR100 @@ -112,7 +114,7 @@ The CIFAR-10 dataset comprises 60,000 color images of 10 classes with 32x32 imag Website:http://www.cs.toronto.edu/~kriz/cifar.html - + ### 2.4 MNIST @@ -120,7 +122,7 @@ MMNIST is a renowned dataset for handwritten digit recognition and is used as an Website:http://yann.lecun.com/exdb/mnist/ - + ### 2.5 NUS-WIDE diff --git a/docs/en/data_preparation/recognition_dataset_en.md b/docs/en/data_preparation/recognition_dataset_en.md index 5ed88d04..a1df5190 100644 --- a/docs/en/data_preparation/recognition_dataset_en.md +++ b/docs/en/data_preparation/recognition_dataset_en.md @@ -6,18 +6,18 @@ This document elaborates on the dataset format adopted by PaddleClas for image r ## Contents -- [Dataset Format](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#数据集格式说明) -- [Common Datasets for Image Recognition](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#图像识别任务常见数据集介绍) - - [2.1 General Datasets](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#通用图像识别数据集) - - [2.2 Vertical Datasets](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#垂类图像识别数据集) - - [2.2.1 Animation Character Recognition](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#动漫人物识别) - - [2.2.2 Product Recognition](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#商品识别) - - [2.2.3 Logo Recognition](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#Logo识别) - - [2.2.4 Vehicle Recognition](https://github.com/paddlepaddle/paddleclas/blob/release%2F2.3/docs/zh_CN/data_preparation/recognition_dataset.md#车辆识别) +- [1. Dataset Format](#1) +- [2. Common Datasets for Image Recognition](#2) + - [2.1 General Datasets](#2.1) + - [2.2 Vertical Datasets](#2.2) + - [2.2.1 Animation Character Recognition](#2.2.1) + - [2.2.2 Product Recognition](#2.2.2) + - [2.2.3 Logo Recognition](#2.2.3) + - [2.2.4 Vehicle Recognition](#2.2.4) + - -## 1 Dataset Format +## 1. Dataset Format The dataset for the vector search, unlike those for classification tasks, is divided into the following three parts: @@ -27,7 +27,7 @@ The dataset for the vector search, unlike those for classification tasks, is div The above three datasets all adopt `txt` files for assignment. Taking the `CUB_200_2011` dataset as an example, the `train_list.txt` of the train dataset has the following format: -``` +```shell # Use "space" as the separator ... train/99/Ovenbird_0136_92859.jpg 99 2 @@ -38,7 +38,7 @@ train/99/Ovenbird_0128_93366.jpg 99 6 The `test_list.txt` of the query dataset (both gallery dataset and query dataset in`CUB_200_2011`) has the following format: -``` +```shell # Use "space" as the separator ... 
test/200/Common_Yellowthroat_0126_190407.jpg 200 1 @@ -55,12 +55,14 @@ Each row of data is separated by "space", and the three columns of data stand fo 2. When the gallery dataset and query dataset are different, there is no need to add a unique id. Both `query_list.txt` and `gallery_list.txt` contain two columns, which are the path and label information of the training data. The dataset of yaml configuration file is ` ImageNetDataset`. - + ## 2. Common Datasets for Image Recognition Here we present a compilation of commonly used image recognition datasets, which is continuously updated and expects your supplement. + + ### 2.1 General Datasets - SOP: The SOP dataset is a common product dataset in general recognition research and MetricLearning technology research, which contains 120,053 images of 22,634 products downloaded from eBay.com. There are 59,551 images of 11,318 in the training set and 60,502 images of 11,316 categories in the validation set. @@ -77,11 +79,11 @@ Here we present a compilation of commonly used image recognition datasets, which Website: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html - + ### 2.2 Vertical Datasets - + #### 2.2.1 Animation Character Recognition @@ -97,7 +99,7 @@ Here we present a compilation of commonly used image recognition datasets, which Website: http://cvit.iiit.ac.in/research/projects/cvit-projects/cartoonfaces - + #### 2.2.2 Product Recognition @@ -111,7 +113,7 @@ Here we present a compilation of commonly used image recognition datasets, which - DeepFashion-Inshop: The same as the common datasets In-shop Clothes. - + ### 2.2.3 Logo Recognition @@ -123,6 +125,8 @@ Here we present a compilation of commonly used image recognition datasets, which Website: https://cg.cs.tsinghua.edu.cn/traffic-sign/ + + ### 2.2.4 Vehicle Recognition - CompCars: The images, 136,726 images of the whole car and 27,618 partial ones, are mainly from network and surveillance data. The network data contains 163 vehicle manufacturers and 1,716 vehicle models and includes the bounding box, viewing angle, and 5 attributes (maximum speed, displacement, number of doors, number of seats, and vehicle type). And the surveillance data comprises 50,000 front view images. diff --git a/docs/en/faq_series/faq_2020_s1_en.md b/docs/en/faq_series/faq_2020_s1_en.md index 9a49b77b..0c81cda8 100644 --- a/docs/en/faq_series/faq_2020_s1_en.md +++ b/docs/en/faq_series/faq_2020_s1_en.md @@ -2,14 +2,14 @@ ## Contents -- [1. Issue 1](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2020_s1.md#1)(2020.11.03) -- [2. Issue 2](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2020_s1.md#2)(2020.11.11) -- [3. Issue 3](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2020_s1.md#3)(2020.11.18) -- [4. Issue 4](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2020_s1.md#4)(2020.12.07) -- [5. Issue 5](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2020_s1.md#5)(2020.12.17) -- [6. Issue 6](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2020_s1.md#6)(2020.12.30) - +- [1. Issue 1](#1)(2020.11.03) +- [2. Issue 2](#2)(2020.11.11) +- [3. Issue 3](#3)(2020.11.18) +- [4. Issue 4](#4)(2020.12.07) +- [5. Issue 5](#5)(2020.12.17) +- [6. 
Issue 6](#6)(2020.12.30) + ## Issue 1 @@ -33,25 +33,25 @@ It provides the whole process of model training, evaluation, inference, and depl **A**: The structure of ResNet_va to vd is shown in the figure below. ResNet was first proposed as va structure, in the left feature transformation path (Path A) of the downsampling residual module, the first 1x1 convolution is downsampled, which leads to information loss (the kernel size of the convolution is 1, stride is 2, some features in the input feature graph are not involved in the calculation of convolution). In the vb structure, the downsampling step is adjusted from the first 1x1 convolution at the beginning to the 3x3 convolution in the middle, thus avoiding the loss of information, and the default ResNet model in PaddleClas is ResNet_vb. The vc structure turns the initial 7x7 convolution into 3 3x3 convolutions with almost the same computation and storage size and improved accuracy when the perceptual field remains unchanged. The vd structure is a modification of the feature path (Path B) on the right side of the downsampling residual module, replacing the downsampling with average pooling. This collection of improvements (va->vd), with little extra inference time, and combined with appropriate training strategies, such as label smoothing and mixup data augmentation, can improve the accuracy by up to 2.7%. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/ResNet_vabcd_structure.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/ResNet_vabcd_structure.png) +![](../../images/faq/ResNet_vabcd_structure.png) ### Q1.4 How to choose appropriate ResNet models for the actual scenario? **A**: -Among the ResNet series model, the ResNet_vd model is recommended for it has a significant improvement in accuracy with almost constant inference speed compared to other models. When the batch size=4, the variation of inference time, FLOPs, Params and accuracy for different models on T4 GPU are demonstrated in the [ResNet and its vd series models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/ResNet_and_vd.md). If you want the smallest possible model storage or the fastest inference speed, please use ResNet18_vd model, and if you want to get the highest possible accuracy, we recommend the ResNet152_vd or ResNet200_vd models. For more information about the ResNet series model, please refer to [ResNet and its vd series models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/ResNet_ and_vd.md) +Among the ResNet series model, the ResNet_vd model is recommended for it has a significant improvement in accuracy with almost constant inference speed compared to other models. When the batch size=4, the variation of inference time, FLOPs, Params and accuracy for different models on T4 GPU are demonstrated in the [ResNet and its vd series models](../models/ResNet_and_vd_en.md). If you want the smallest possible model storage or the fastest inference speed, please use ResNet18_vd model, and if you want to get the highest possible accuracy, we recommend the ResNet152_vd or ResNet200_vd models. 
For more information about the ResNet series model, please refer to [ResNet and its vd series models](../models/ResNet_and_vd_en.md) - Variation of precision-inference speed -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.png) +![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.png) - Variation of precision-params -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.params.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.params.png) +![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.params.png) - Variation of precision-flops -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.flops.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.flops.png) +![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.flops.png) ### Q1.5 Is conv-bn-relu a fixed form in a block of the network? @@ -69,9 +69,9 @@ There are two different kinds of blocks in the ResNet series, basic-block and bo **A**: -Not really, increasing all the convolutional kernels in the network may not lead to performance improvement or even the opposite. In the paper [MixConv: Mixed Depthwise Convolutional Kernels](https://arxiv.org/abs/1907.09595), it is pointed out that increasing the size of the convolutional kernels within a certain range plays a positive role in the accuracy improvement, but a size beyond may lead to accuracy loss. Therefore, considering the size of the model and the computation, large convolutional kernels are generally abandoned to design the network. Also, there are experiments on large convolution kernels in the article [PP-LCNet](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/PP-LCNet.md). - +Not really, increasing all the convolutional kernels in the network may not lead to performance improvement or even the opposite. In the paper [MixConv: Mixed Depthwise Convolutional Kernels](https://arxiv.org/abs/1907.09595), it is pointed out that increasing the size of the convolutional kernels within a certain range plays a positive role in the accuracy improvement, but a size beyond may lead to accuracy loss. Therefore, considering the size of the model and the computation, large convolutional kernels are generally abandoned to design the network. Also, there are experiments on large convolution kernels in the article [PP-LCNet](../models/PP-LCNet_en.md). + ## Issue 2 @@ -80,7 +80,7 @@ Not really, increasing all the convolutional kernels in the network may not lead **A**:The process is as follows: - First, create a new model structure file under the folder ppcls/arch/backbone/model_zoo/, i.e. your own backbone. You can refer to resnet.py for model construction; -- Then add your own backbone class in ppcls/arch/backbone/__init__.py; +- Then add your own backbone class in ppcls/arch/backbone/\__init__.py; - Next, configure the yaml file for training, here you can refer to ppcls/configs/ImageNet/ResNet/ResNet50.yaml; - Now you can start the training. @@ -108,9 +108,9 @@ PaddleClas strictly follows the resolution used by the authors of the paper. 
Sin **A**: -There are many ssld pre-training models available in PaddleClas, which obtain better pre-training weights by semi-supervised knowledge distillation, so that the accuracy can be improved by replacing the ssld pre-training models with higher accuracy in transfer tasks or downstream vision tasks without replacing the structure files. For example, in PaddleSeg, [HRNet](https://github.com/PaddlePaddle/PaddleSeg/blob/release/v0.7.0/docs/model_zoo.md), with the weight of the ssld pre-training model, achieves much better accuracy than other same models in the industry; In PaddleDetection, [PP- YOLO](https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.4/configs/ppyolo/README_cn.md) with ssld pre-training weights has further improvement in the already high baseline. The transfer of classification with ssld pre-training weights also yields impressive results, and the benefits of knowledge distillation for the transfer of classification task is detailed in [SSLD Distillation Strategy](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/advanced_tutorials/knowledge_distillation.md) - +There are many ssld pre-training models available in PaddleClas, which obtain better pre-training weights by semi-supervised knowledge distillation, so that the accuracy can be improved by replacing the ssld pre-training models with higher accuracy in transfer tasks or downstream vision tasks without replacing the structure files. For example, in PaddleSeg, [HRNet](../models/HRNet_en.md) , with the weight of the ssld pre-training model, achieves much better accuracy than other same models in the industry; In PaddleDetection, [PP- YOLO](https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.4/configs/ppyolo/README_cn.md)with ssld pre-training weights has further improvement in the already high baseline. The transfer of classification with ssld pre-training weights also yields impressive results, and the benefits of knowledge distillation for the transfer of classification task is detailed in [SSLD Distillation Strategy](../advanced_tutorials/knowledge_distillation_en.md) + ## Issue 3 @@ -118,32 +118,32 @@ There are many ssld pre-training models available in PaddleClas, which obtain be **A**: -DenseNet is designed with a more aggressive dense connectivity mechanism compared to ResNet, which further reduces the number of parameters by considering feature reuse and bypass settings, and mitigates the gradient dispersion problem to some extent. What's more, the model is easier to train and equipped with some regularization effect due to the introduction of dense connectivity. DenseNet is a good choice in image classification scenarios where the amount of data is limited. More information about DenseNet and this series can be found in [DenseNet Models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/DPN_DenseNet. md). +DenseNet is designed with a more aggressive dense connectivity mechanism compared to ResNet, which further reduces the number of parameters by considering feature reuse and bypass settings, and mitigates the gradient dispersion problem to some extent. What's more, the model is easier to train and equipped with some regularization effect due to the introduction of dense connectivity. DenseNet is a good choice in image classification scenarios where the amount of data is limited. More information about DenseNet and this series can be found in [DenseNet Models](../models/DPN_DenseNet_en.md). 
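
To make the idea of dense connectivity concrete, below is a minimal NumPy sketch of a dense block in which every layer receives the channel-wise concatenation of all earlier feature maps and contributes `growth_rate` new channels. This is only a schematic illustration of the connectivity pattern discussed above, not the PaddleClas DenseNet code; the function name `dense_block`, the random 1x1-convolution weights, and the default sizes are assumptions for demonstration.

```python
import numpy as np

def dense_block(x, num_layers=4, growth_rate=32, seed=0):
    """Schematic DenseNet-style dense connectivity: every layer sees the
    concatenation of all earlier feature maps (feature reuse) and adds
    `growth_rate` new channels. H_l is reduced to a random 1x1 conv + ReLU."""
    rng = np.random.default_rng(seed)
    features = [x]                                  # x: (C, H, W)
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)      # dense connection: reuse all earlier maps
        w = rng.standard_normal((growth_rate, inp.shape[0])).astype("float32") * 0.01
        new = np.maximum(np.einsum("oc,chw->ohw", w, inp), 0.0)   # 1x1 conv + ReLU
        features.append(new)
    return np.concatenate(features, axis=0)

out = dense_block(np.random.rand(64, 56, 56).astype("float32"))
print(out.shape)  # (64 + 4 * 32, 56, 56) = (192, 56, 56)
```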
### Q3.2: What are the improvements of the DPN network over DenseNet? **A**: -The full name of DPN is Dual Path Networks, or Dual Channel Networks. It is a combination of DenseNet and ResNeXt, which demonstrates that DenseNet can extract new features from the previous layers, while ResNeXt is essentially reuse of features already extracted from the previous layers. The authors further analyze and find that ResNeXt has a high reuse rate for features but low redundancy, while DenseNet can create new features but has high redundancy. Combining the advantages of both structures, the DPN network is designed. Finally, the DPN network achieves better results than ResNeXt and DenseNet with the same FLOPS and number of parameters. More introduction and series models of DPN can be found in [DPN Models](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/DPN_DenseNet.md). +The full name of DPN is Dual Path Networks, or Dual Channel Networks. It is a combination of DenseNet and ResNeXt, which demonstrates that DenseNet can extract new features from the previous layers, while ResNeXt is essentially reuse of features already extracted from the previous layers. The authors further analyze and find that ResNeXt has a high reuse rate for features but low redundancy, while DenseNet can create new features but has high redundancy. Combining the advantages of both structures, the DPN network is designed. Finally, the DPN network achieves better results than ResNeXt and DenseNet with the same FLOPS and number of parameters. More introduction and series models of DPN can be found in [DPN Models](../models/DPN_DenseNet_en.md). ### Q3.3: How to use multiple models for inference fusion? **A**: -When adopting multiple models for inference, it is recommended to first export the pre-training model as an inference model to get rid of the dependence on the network structure definition, you can refer to [model export script](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ tools/export_model.py) for model exporting, and then see [inference script for the inference model](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/deploy/python/), where you need to create multiple predictors according to the number of employed models. +When adopting multiple models for inference, it is recommended to first export the pre-training model as an inference model to get rid of the dependence on the network structure definition, you can refer to [model export script](../../../tools/export_model.py) for model exporting, and then see [inference script for the inference model](../../../deploy/python/predict_cls.py), where you need to create multiple predictors according to the number of employed models. ### Q3.4: How to add your own data augmentation methods in PaddleClas? **A**: -- For single-image augmentation, you can refer to [Single-image based data augmentation script](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/data/preprocess/ops). Learning from the data operator ` ResizeImage ` or `CropImage` to create a new class, and then implement the corresponding augmentation method in `__call__`. -- For a batch image, you can refer to the [batch data-based data augmentation script](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/data/preprocess/batch_ops ). Learning from the data operator `MixupOperator` or `CutmixOperator` to create a new class, and then implement the corresponding augmentation method in `__call__`. 
+- For single-image augmentation, you can refer to [Single-image based data augmentation script](../../../ppcls/data/preprocess/ops). Learning from the data operator ` ResizeImage ` or `CropImage` to create a new class, and then implement the corresponding augmentation method in `__call__`. +- For a batch image, you can refer to the [batch data-based data augmentation script](../../../ppcls/data/preprocess/batch_ops). Learning from the data operator `MixupOperator` or `CutmixOperator` to create a new class, and then implement the corresponding augmentation method in `__call__`. ## Q3.5: How to further accelerate the model training? **A**: -- You can adopt auto-mixed precision training, which can gain a significantly faster speed with almost zero precision loss. Take ResNet50 as an example, the configuration file of auto-mixed precision training in PaddleClas can be found at: [ResNet50_fp16.yml](https://github.com/ PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml). The main step is to add the following lines to the standard configuration file. +- You can adopt auto-mixed precision training, which can gain a significantly faster speed with almost zero precision loss. Take ResNet50 as an example, the configuration file of auto-mixed precision training in PaddleClas can be found at: [ResNet50_fp16.yml](../../../ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml). The main step is to add the following lines to the standard configuration file. ``` # mixed precision training @@ -155,7 +155,7 @@ AMP: - You can turn on dali to run the data preprocessing method on GPU. When the model is relatively small (reader accounts for a higher percentage of time consumption), an obviously faster speed can be obtained with dali on, which could be employed by adding `-o Global.use_dali=True` during training. You can refer to [dali installation tutorial](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html#nightly-builds) for more details. - + ## Issue 4 @@ -176,7 +176,7 @@ AMP: - However, the authors of *HRNet* believe that the idea of gradually decreasing spatial resolution is not suitable for scenarios such as target detection (classification task of image region-level) and semantic segmentation (classification task of image pixel-level), because a lot of information is lost in this process and the final learned features can hardly represent the information of the original image at the high spatial resolution, while both the task of region-level and pixel-level classification are very sensitive to spatial accuracy. - Therefore, the authors of *HRNet* propose the idea of paralleling feature graphs with different spatial resolutions, while in contrast, neural networks such as *VGG* cascade feature graphs with different spatial resolutions by different convolutional pooling layers. Moreover, *HRNet* connects feature graphs of equal depth and disparate spatial resolutions, so that the information can be fully exchanged. The specific network structure is shown in the figure below. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/HRNet.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/HRNet.png) +![](../../images/faq/HRNet.png) ### Q4.3: In HRNet, how are connections made between feature graphs with different spatial resolutions? 
@@ -184,7 +184,7 @@ AMP: - First, in *HRNet*, the *3 × 3* convolution with *stride* of *2* can be used to obtain a feature graph with low spatial resolution but higher dimension; and for those with low spatial resolution, the *1 × 1* convolution is first used to match the number of channels, and then the nearest neighbor interpolation is used for upsampling to obtain a feature graph with the same spatial resolution and number of channels as the high spatial resolution graph. And for the feature map with the same spatial resolution, the constant mapping can be performed directly. The details are shown in the following figure. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/HRNet_block.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/HRNet_block.png) +![](../../images/faq/HRNet_block.png) ### Q4.4: What does "SE" in the model stand for? @@ -194,7 +194,7 @@ AMP: ### Q4.5: How does the SE structure implemented? -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/SE_structure.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/SE_structure.png) +![](../../images/faq/SE_structure.png) **A**: @@ -204,7 +204,7 @@ AMP: - For *Fsq*, the key is to obtain the vector of *C* dimension, so it is not limited to the *Global Average Pooling*. The *SENet* authors believe that the final *scale* is applied to *U* separately by channel, so it is necessary to calculate the corresponding *scale* based on the information of the corresponding channel, so the simplest *Global Average Pooling* is adopted, and the final *scale* vector represents the distribution between different channels, ignoring the situation in the same channel. - For *Fex*, its role is to find the distribution based on all the training data through training on each *mini batch*. Since our training is performed on *mini batches* and the *scale* based on all training data is the best, we can adopt the *Fex* to train on each *mini batch* to obtain a more reliable *scale*. - + ## Issue 5 @@ -227,7 +227,7 @@ Throughout the whole training process, we cannot adopt the same learning rate to The learning rates of cosine_decay and piecewise_decay are shown in the following figure. It is easy to observe that cosine_decay keeps a large learning rate throughout the training, so it is slow in convergence, but its final effect is better than peicewise_decay. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/models/lr_decay.jpeg)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/models/lr_decay.jpeg) +![](../../images/models/lr_decay.jpeg) In addition, it is also observed that only a few rounds in cosine_decay use a small learning rate, which affects the final accuracy. So it is recommended to iterate more rounds for better results. @@ -312,7 +312,7 @@ In the standard preprocessing of ImageNet-1k data, the random_crop function defi In general, the size of the dataset is crucial to the performance, but the annotation of images is often expensive, hence there are rare annotated images, which highlight the importance of data augmentation. In the standard data augmentation for training ImageNet-1k, two methods, Random_Crop and Random_Flip, are mainly adopted. However, in recent years, an increasing number of data augmentation methods have been proposed, such as cutout, mixup, cutmix, AutoAugment, etc. 
Experiments show that these methods can effectively improve the accuracy of the model. The specific information of the dataset is as follows: -- ImageNet-1k: The following table lists the performance of ResNet50 adopting 8 different data augmentation methods. It can be seen that all of them are beneficial compared to baseline, with cutmix being the most effective data augmentation so far. For more information about data augmentation, please refer to the chapter of [**Data Augmentation**](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/advanced_tutorials/DataAugmentation.md). +- ImageNet-1k: The following table lists the performance of ResNet50 adopting 8 different data augmentation methods. It can be seen that all of them are beneficial compared to baseline, with cutmix being the most effective data augmentation so far. For more information about data augmentation, please refer to the chapter of [**Data Augmentation**](../advanced_tutorials/DataAugmentation_en.md). | Model | Date Augmentation Method | Test top-1 | | -------- | ------------------------ | ---------- | @@ -351,9 +351,9 @@ At this stage, it has become a common practice in the image recognition field to **A**: If the existing strategy cannot further improve the accuracy of the model, it means that the model has almost reached saturation with the existing dataset and strategy, and two methods are provided here. - Mining relevant data: Use the model trained on the existing dataset to make predictions on the relevant data, label the data with higher confidence and add it to the training set for further training. Repeat the steps above to further improve the accuracy of the model. -- Knowledge distillation: You can use a larger model to train a teacher model with higher accuracy on the dataset, and then adopt the teacher model to teach a Student model, where the Student model is the target model. PaddleClas provides Baidu's own SSLD knowledge distillation scheme, which can steadily improve by more than 3% even on such a challenging classification task as ImageNet-1k. For the chapter on SSLD knowledge distillation, please refer to [**SSLD Knowledge Distillation**](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3 /docs/zh_CN/advanced_tutorials/knowledge_distillation.md). - +- Knowledge distillation: You can use a larger model to train a teacher model with higher accuracy on the dataset, and then adopt the teacher model to teach a Student model, where the Student model is the target model. PaddleClas provides Baidu's own SSLD knowledge distillation scheme, which can steadily improve by more than 3% even on such a challenging classification task as ImageNet-1k. For the chapter on SSLD knowledge distillation, please refer to [**SSLD Knowledge Distillation**](../advanced_tutorials/knowledge_distillation_en.md). + ## Issue 6 diff --git a/docs/en/faq_series/faq_2021_s1_en.md b/docs/en/faq_series/faq_2021_s1_en.md index 6d2e6f85..0a7dddd3 100644 --- a/docs/en/faq_series/faq_2021_s1_en.md +++ b/docs/en/faq_series/faq_2021_s1_en.md @@ -2,13 +2,13 @@ ## Contents -- [1. Issue 1](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s1.md#1)(2021.01.05) -- [2. Issue 2](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s1.md#2)(2021.01.14) -- [3. Issue 3](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s1.md#3)(2020.01.21) -- [4. 
Issue 4](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s1.md#4)(2021.01.28) -- [5. Issue 5](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_2021_s1.md#5)(2021.02.03) - +- [1. Issue 1](#1)(2021.01.05) +- [2. Issue 2](#2)(2021.01.14) +- [3. Issue 3](#3)(2020.01.21) +- [4. Issue 4](#4)(2021.01.28) +- [5. Issue 5](#5)(2021.02.03) + ## Issue 1 @@ -26,10 +26,10 @@ - From the perspective of sampling - The samples can be sampled dynamically according to the categories, with different sampling probabilities for each category and ensure that the number of training samples in different categories is basically the same or in the desired proportion in the same minibatch or epoch. - - You can use the oversampling method to oversample the categories with a small number of images. + - You can use the oversampling method to oversample the categories with a small number of images. - From the perspective of loss function - The OHEM (online hard example miniing) method can be used to filter the hard example based on the loss of the samples for gradient backpropagation and parameter update of the model. - - The Focal loss method can be used to assign a smaller weight to the loss of some easy samples and a larger weight to the loss of hard samples, so that the loss of easy samples contributes to the overall loss of the network without dominating the loss. + - The Focal loss method can be used to assign a smaller weight to the loss of some easy samples and a larger weight to the loss of hard samples, so that the loss of easy samples contributes to the overall loss of the network without dominating the loss. ### Q1.3 When training in docker, the data path and configuration are fine, but it keeps reporting `SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception`, why is this? @@ -54,7 +54,7 @@ Based on ResNet50_vd, Baidu open-sourced its own large-scale classification pre- You are welcomed to add more tips on inference deployment acceleration. - + ## Issue 2 @@ -76,13 +76,13 @@ First, it is necessary to ensure that the accuracy of the Teacher model. Second, ### Q2.4: Which networks have advantages on mobile or embedded side? -It is recommended to use the Mobile Series network, and the details can be found in [Introduction to Mobile Series Network Structure](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/Mobile.md). If the speed of the task is the priority, MobileNetV3 series can be considered, and if the model size is more important, the specific structure can be determined based on the StorageSize - Accuracy in the Introduction to the Mobile Series Network Structure. +It is recommended to use the Mobile Series network, and the details can be found in [Introduction to Mobile Series Network Structure](../models/Mobile_en.md). If the speed of the task is the priority, MobileNetV3 series can be considered, and if the model size is more important, the specific structure can be determined based on the StorageSize - Accuracy in the Introduction to the Mobile Series Network Structure. ### Q2.5: Why use a network with large number of parameters and computation such as ResNet when the mobile network is so fast? Different network structures have various speed advantages running on disparate devices. 
On the mobile side, mobile series networks run faster than server-side networks, but on the server side, networks with specific optimizations such as ResNet have greater advantages for the same accuracy. So the specific network structure needs to be decided on a case-by-case basis. - + ## Issue 3 @@ -104,16 +104,16 @@ The size of the square convolution kernel is assumed to be `d*d`, i.e., the widt During the training, the network width of the model is improved by the ACB structure, and more features are extracted using the two asymmetric convolution kernels of `1*d` and `d*1` to enrich the information of the feature maps extracted by the `d*d` convolution kernels. In the inference stage, this design idea does not bring additional parameters and computational overhead. The following figure shows the form of convolutional kernels for the training phase and the inference deployment phase, respectively. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/TrainingtimeACNet.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/TrainingtimeACNet.png) +![](../../images/faq/TrainingtimeACNet.png) -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/DeployedACNet.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/DeployedACNet.png) +![](../../images/faq/DeployedACNet.png) Experiments by the authors of the article show that the model capability can be significantly improved by using ACNet structures in the training of the original network model, as explained by the original authors as follows. 1. Experiments show that for a `d*d` convolution kernel, the parameters of the skeleton position (e.g., the `skeleton` position of the convolution kernel in the above figure) have a greater impact on the model accuracy than the parameters of the corner position (e.g., the `corners` position of the convolution kernel in the above figure) of the eliminated convolution kernel, so the parameters of the skeleton position of the convolution kernel are essential. And the two asymmetric convolution kernels in the ACB structure enhance the weight of the square convolution kernel skeleton position parameter, making it more significant. About whether this summation will weaken the role of the skeleton position parameter due to the offsetting effect of positive and negative numbers, the authors found through experiments that the training of the network always goes in the direction of increasing the role of the skeleton position parameter, and there is no weakening due to the offsetting effect. 2. The asymmetric convolution kernel is more robust for flipped images, as shown in the following figure, the horizontal asymmetric convolution kernel is more robust for flipped images up and down. The feature maps extracted by the asymmetric convolution kernel are the same for semantically the same position in the image before and after the flip, which is better than the square convolution kernel. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/HorizontalKernel.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/HorizontalKernel.png) +![](../../images/faq/HorizontalKernel.png) ### Q3.3: What are the main innovations of RepVGG? 
@@ -121,7 +121,7 @@ Experiments by the authors of the article show that the model capability can be Through Q3.1 and Q3.2, it may occur to us to decouple the training phase and inference phase by ACNet, and use multi-branch structure in the training phase and Plain structure in inference phase, which is the innovation of RepVGG. The following figure shows the comparison of the network structures of ResNet and RepVGG in the training and inference phases. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/RepVGG.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/RepVGG.png) +![](../../images/faq/RepVGG.png) First, the RepVGG in the training phase adopts a multi-branch structure, which can be regarded as a residual structure with `1*1` convolution and constant mapping on top of the traditional VGG network, while the RepVGG in the inference phase degenerates to a VGG structure. The transformation of the network structure from RepVGG in the training phase to RepVGG in the inference phase is implemented using the "structural reparameterization" technique. @@ -133,7 +133,7 @@ The constant mapping can be regarded as the output of the `1*1` convolution kern From the above, it can be simply understood that RepVGG is the extreme version of ACNet. Re-parameters operation in ACNet is shown in the following figure. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/ACNetReParams.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/ACNetReParams.png) +![](../../images/faq/ACNetReParams.png) Take `conv2` as an example, the asymmetric convolution can be regarded as a square convolution kernel of `3*3`, except that the upper and lower six parameters of the square convolution kernel are `0`, and it is the same for `conv3`. On top of that, the sum of the results of `conv1`, `conv2`, and `conv3` is equivalent to the sum of three convolution kernels followed by convolution. With `Conv` denoting the convolution operation and `+` denoting the addition operation of the matrix, then: `Conv1(A)+Conv2(A)+Conv3(A) == Convk(A)`, where `Conv1`, ` Conv2`, `Conv3` have convolution kernels `Kernel1`, `kernel2`, `kernel3`, and `Convk` has convolution kernels `Kernel1 + kernel2 + kernel3`, respectively. @@ -147,20 +147,20 @@ There are many factors that affect the computation speed of the model, and the n 1. Number of parameters: the number of parameters used to measure the model, the larger the number of parameters of the model, the higher the memory (video memory) requirements of the model during computation. However, the size of the memory (video memory) footprint does not depend entirely on the number of parameters. In the figure below, assuming that the input feature map memory footprint size is `1` unit, for the residual structure on the left, the peak memory footprint during computation is twice as large as that of the Plain structure on the right, because the results of the two branches need to be recorded and then added together. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/MemoryOccupation.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/MemoryOccupation.png) +![](../../images/faq/MemoryOccupation.png) -1. 2. 
FLOPs: Note that FLOPs are distinguished from floating point operations per second (FLOPS), which can be simply understood as the amount of computation and is usually adopted to measure the computational complexity of a model. Taking the common convolution operation as an example, considering no batch size, activation function, stride operation, and bias, assuming that the input future map size is `Min*Min` and the number of channels is `Cin`, the output future map size is `Mout*Mout` and the number of channels is `Cout`, and the conv kernel size is `K*K`, the FLOPs for a single convolution can be calculated as follows. - 1. The number of feature points contained in the output feature map is: `Cout * Mout * Mout`. - 2. For the convolution operation for each feature point in the output feature map: the number of multiplication calculations is: `Cin * K * K`; the number of addition calculations is: `Cin * K * K - 1`. - 3. So the total number of computations is: `Cout * Mout * Mout * (Cin * K * K + Cin * K * K - 1)`, i.e. `Cout * Mout * Mout * (2Cin * K * K - 1)`. + 1. For the convolution operation for each feature point in the output feature map: the number of multiplication calculations is: `Cin * K * K`; the number of addition calculations is: `Cin * K * K - 1`. + 1. So the total number of computations is: `Cout * Mout * Mout * (Cin * K * K + Cin * K * K - 1)`, i.e. `Cout * Mout * Mout * (2Cin * K * K - 1)`. + 3. Memory Access Cost (MAC): The computer needs to read the data from memory (general memory, including video memory) to the operator's Cache before performing operations on data (such as multiplication and addition), and the memory access is very time-consuming. Take grouped convolution as an example, suppose it is divided into `g` groups, although the number of parameters and FLOPs of the model remain unchanged after grouping, the number of memory accesses for grouped convolution becomes `g` times of the previous one (this is a simple calculation without considering multi-level Cache), so the MAC increases significantly and the computation speed of the model slows down accordingly. -4. Parallelism: The term parallelism often includes data parallelism and model parallelism, in this case, model parallelism. Take convolutional operation as an example, the number of parameters in a convolutional layer is usually very large, so if the matrix in the convolutional layer is chunked and then handed over to multiple GPUs separately, the purpose of acceleration can be achieved. Even some network layers with too many parameters for a single GPU memory may be divided into multiple GPUs, but whether they can be divided into multiple GPUs in parallel depends not only on hardware conditions, but also on the specific form of operation. Of course, the higher the degree of parallelism, the faster the model can run. +4. Parallelism: The term parallelism often includes data parallelism and model parallelism, in this case, model parallelism. Take convolutional operation as an example, the number of parameters in a convolutional layer is usually very large, so if the matrix in the convolutional layer is chunked and then handed over to multiple GPUs separately, the purpose of acceleration can be achieved. Even some network layers with too many parameters for a single GPU memory may be divided into multiple GPUs, but whether they can be divided into multiple GPUs in parallel depends not only on hardware conditions, but also on the specific form of operation. 
Of course, the higher the degree of parallelism, the faster the model can run. + ## Issue 4 @@ -179,7 +179,7 @@ Paper address: [AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITIO **A**: 1. The dependence of images on CNNs is unnecessary, and the computational efficiency and scalability of the Transformer allow for training very large models without saturation as the model and dataset grow. Inspired by the Transformer for NLP, when being used in image classification tasks, images are divided into sequential patches that are fed into a linear unit embedding as input to the transformer. -2. In medium-sized datasets such as ImageNet1k, ImageNet21k, the visual Transformer model is several percentage points lower than ResNet of the same size. It is speculated that this is because the transformer lacks the Locality and Spatial Invariance that CNNs have, and it is difficult to outperform convolutional networks when the amount of data is not large enough. But for this problem, the data augmentation adopted by [DeiT](https://arxiv.org/abs/ 2012.12877) to some extent addresses the reliance of Vision Transformer on very large datasets for training. +2. In medium-sized datasets such as ImageNet1k, ImageNet21k, the visual Transformer model is several percentage points lower than ResNet of the same size. It is speculated that this is because the transformer lacks the Locality and Spatial Invariance that CNNs have, and it is difficult to outperform convolutional networks when the amount of data is not large enough. But for this problem, the data augmentation adopted by [DeiT](https://arxiv.org/abs/2012.12877) to some extent addresses the reliance of Vision Transformer on very large datasets for training. 3. This approach can go beyond local information and model more long-range dependencies when training on super large-scale datasets 14M-300M, while CNN can better focus on local information but is weak in capturing global information. 4. Transformer once reigned in the field of NLP, but was also questioned as not applicable to the CV field. The current several pieces of visual field articles also deliver competitive performance as the SOTA of CNN. We believe that a joint Vision-Language or multimodal model will be proposed that can solve both visual and linguistic problems. @@ -192,19 +192,19 @@ Paper address: [AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITIO - (1) variable-length sequential input, because it is RNN structure with various amounts of words in one sentence. If it is an NLP scene, the change of word order affects little of the semantics, but the position of the image means a lot since great misunderstanding can be caused when different regions are connected in a different order. - (2) Single patch position is transformed into a vector with fixed dimension. Encoder input is patch pixel information embedding, combined with some fixed position vector concate to synthesize a vector with fixed dimension and position information in it. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/Transformer_input.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/Transformer_input.png) +![](../../images/faq/Transformer_input.png) -1. Consider the following question: How to pass an image to an encoder? +3. Consider the following question: How to pass an image to an encoder? - As the following figure shows. 
Suppose the input image is [224,224,3], which is cut into many patches in order from left to right and top to bottom, and the patch size can be [p,p,3] (p can be 16, 32). Convert it into a feature vector using the Linear Projection of Flattened Patches module and concat a position vector into the Encoder. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/ViT_structure.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/ViT_structure.png) +![](../../images/faq/ViT_structure.png) -1. As shown above, given an image of `H×W×C` and a block size P, the image can be divided into `N` blocks of `P×P×C`, `N=H×W/(P×P)`. After getting the blocks, we have to use linear transformation to convert them into D-dimensional feature vectors, and then add the position encoding vectors. Similar to BERT, ViT also adds a classification flag bit before the sequence, denoted as `[CLS]`. The ViT input sequence `z` is shown in the following equation, where `x` represents an image block. +4. As shown above, given an image of `H×W×C` and a block size P, the image can be divided into `N` blocks of `P×P×C`, `N=H×W/(P×P)`. After getting the blocks, we have to use linear transformation to convert them into D-dimensional feature vectors, and then add the position encoding vectors. Similar to BERT, ViT also adds a classification flag bit before the sequence, denoted as `[CLS]`. The ViT input sequence `z` is shown in the following equation, where `x` represents an image block. -[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/faq/ViT.png)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/faq/ViT.png) +![](../../images/faq/ViT.png) -1. ViT model is basically the same as Transformer, where the input sequence is passed into ViT and then the final output features are classified using the `[CLS]` flags. viT consists mainly of MSA (multiheaded self-attentive) and MLP (two-layer fully connected network using GELU activation function), with LayerNorm and residual connections before MSA and MLP +5. ViT model is basically the same as Transformer, where the input sequence is passed into ViT and then the final output features are classified using the `[CLS]` flags. viT consists mainly of MSA (multiheaded self-attentive) and MLP (two-layer fully connected network using GELU activation function), with LayerNorm and residual connections before MSA and MLP ### Q4.4: How to understand Inductive Bias? @@ -220,7 +220,7 @@ Paper address: [AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITIO 1. Similar to BERT, ViT adds a `[CLS]` flag bit before the first patch, and the vector corresponding to the last end flag bit can be used as a semantic representation of the whole image, and thus for downstream classification tasks, etc. Therefore, the whole embedding group can characterize the features of the image at different locations. 2. The vector corresponding to the `[CLS]` flag bit is used as the semantic representation of the whole image because this symbol with no obvious semantic information will "fairly" integrate the semantic information of each patch in the image compared with other patches, and thus better represent the semantic of the whole image. 
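
To make the patch-embedding pipeline of Q4.3 and the role of the `[CLS]` token more concrete, here is a minimal, framework-agnostic NumPy sketch of the input construction only (splitting, flattening, linear projection, `[CLS]` prepending, position embedding). It is not the PaddleClas ViT implementation; the function name `vit_embed`, the random placeholder weights, and the default sizes are assumptions for illustration.

```python
import numpy as np

def vit_embed(image, patch_size=16, embed_dim=768, seed=0):
    """Toy sketch of the ViT input pipeline: cut an H x W x C image into
    N = (H*W)/(P*P) patches, flatten and linearly project each patch,
    prepend a [CLS] token and add position embeddings. All weights here
    are random placeholders standing in for learned parameters."""
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by the patch size"
    N = (H // P) * (W // P)

    # split into non-overlapping P x P x C patches and flatten each one
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(N, P * P * C)

    # linear projection of flattened patches (the patch embedding)
    w_proj = rng.standard_normal((P * P * C, embed_dim)).astype("float32") * 0.02
    tokens = patches @ w_proj                                       # (N, D)

    # prepend the [CLS] token and add position embeddings
    cls_token = rng.standard_normal((1, embed_dim)).astype("float32") * 0.02
    pos_embed = rng.standard_normal((N + 1, embed_dim)).astype("float32") * 0.02
    return np.concatenate([cls_token, tokens], axis=0) + pos_embed  # (N + 1, D)

seq = vit_embed(np.random.rand(224, 224, 3).astype("float32"))
print(seq.shape)  # (197, 768): 196 patch tokens + 1 [CLS] token
```

The resulting `(N + 1, D)` sequence is what the transformer encoder consumes, and the output position corresponding to the `[CLS]` token is the one used for classification, as described in Q4.5.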
- + ## Issue 5 diff --git a/docs/en/faq_series/faq_selected_30_en.md b/docs/en/faq_series/faq_selected_30_en.md index e7f7768b..b442c4b9 100644 --- a/docs/en/faq_series/faq_selected_30_en.md +++ b/docs/en/faq_series/faq_selected_30_en.md @@ -5,21 +5,19 @@ - We collect some frequently asked questions in issues and user groups since PaddleClas is open-sourced and provide brief answers, aiming to give some reference for the majority to save you from twists and turns. - There are many talents in the field of image classification, recognition and retrieval with quickly updated models and papers, and the answers here mainly rely on our limited project practice, so it is not possible to cover all facets. We sincerely hope that the man of insight will help to supplement and correct the content, thanks a lot. -## PaddleClas FAQ Summary - -- [1. 30 Questions About Image Classification](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#1) - - [1.1 Basic Knowledge](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#1.1) - - [1.2 Model Training](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#1.2) - - [1.3 Data](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#1.3) - - [1.4 Model Inference and Prediction](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#1.4) -- [2. Application of PaddleClas](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md#2) - +## Contents +- [1. 30 Questions About Image Classification](#1) + - [1.1 Basic Knowledge](#1.1) + - [1.2 Model Training](#1.2) + - [1.3 Data](#1.3) + - [1.4 Model Inference and Prediction](#1.4) +- [2. Application of PaddleClas](#2) + ## 1. 30 Questions About Image Classification - - + ### 1.1 Basic Knowledge - Q: How many classification metrics are commonly used in the field of image classification? @@ -30,7 +28,7 @@ > > - Q: 怎样根据自己的任务选择合适的模型进行训练?How to choose the right training model? -- A: If you want to deploy on the server with a high requirement for accuracy but not model storage size or prediction speed, then it is recommended to use ResNet_vd, Res2Net_vd, DenseNet, Xception, etc., which are suitable for server-side models. If you want to deploy on the mobile side, then it is recommended to use MobileNetV3 and GhostNet. Meanwhile, we suggest you refer to the speed-accuracy metrics chart in [Model Library](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/models_intro.md) when choosing models. +- A: If you want to deploy on the server with a high requirement for accuracy but not model storage size or prediction speed, then it is recommended to use ResNet_vd, Res2Net_vd, DenseNet, Xception, etc., which are suitable for server-side models. If you want to deploy on the mobile side, then it is recommended to use MobileNetV3 and GhostNet. Meanwhile, we suggest you refer to the speed-accuracy metrics chart in [Model Library](../models/models_intro_en.md) when choosing models. > > @@ -56,7 +54,7 @@ - A: The Attention Mechanism (AM) originated from the study of human vision. Using the mechanism on computer vision tasks can effectively capture the useful regions in the images and thus improve the overall network performance. 
Currently, the most commonly used ones are [SE block](https://arxiv.org/abs/1709.01507), [SK-block](https://arxiv.org/abs/1903.06586), [Non-local block](https://arxiv.org/abs/1711.07971), [GC block](https://arxiv.org/abs/1904.11492), [CBAM](https://arxiv.org/abs/1807.06521), etc. The core idea is to learn the importance of feature maps in different regions or different channels, so that the network can pay more attention to the salient regions.

- 
+ 
### 1.2 Model Training

> >

@@ -67,7 +65,7 @@

> >

- Q: What are the possible reasons if the model converges poorly during the training process?
-- A: There are several points that can be investigated: (1) The data annotation should be checked to ensure that there are no problems with the labeling of the training and validation sets. (2) Try to adjust the learning rate (initially by a factor of 10). A learning rate that is too large (training oscillation) or too small (slow convergence) may lead to poor convergence. (3) Huge amount of data and an overly small model may prevent it from learning all the features of the data. (4) See if normalization is used in the data preprocessing process. It may be slower without normalization operation. (5) If the amount of data is relatively small, you can try to load the pre-trained model based on ImageNet-1k dataset provided in PaddleClas, which can greatly improve the training convergence speed. (6) There is a long tail problem in the dataset, you can refer to the [solution to the long tail problem of data](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/faq_series/faq_selected_30.md #long_tail).
+- A: There are several points that can be investigated: (1) The data annotation should be checked to ensure that there are no problems with the labeling of the training and validation sets. (2) Try to adjust the learning rate (initially by a factor of 10). A learning rate that is too large (training oscillation) or too small (slow convergence) may lead to poor convergence. (3) A huge amount of data with an overly small model may prevent it from learning all the features of the data. (4) Check whether normalization is used in data preprocessing; convergence may be slower without it. (5) If the amount of data is relatively small, you can try to load the pre-trained model based on the ImageNet-1k dataset provided in PaddleClas, which can greatly improve the training convergence speed. (6) If there is a long-tail problem in the dataset, you can refer to the [solution to the long tail problem of data](#long_tail).

> >

@@ -75,7 +73,9 @@

- A: Since the emergence of deep learning, there has been a lot of research on optimizers, which aim to minimize the loss function to find the right weights for a given task. Currently, the main optimizers used in the industry are SGD, RMSProp, Adam, AdaDelta, etc. Among them, since the SGD optimizer with momentum is widely used in academia and industry (only for classification tasks), most of the models we published also adopt this optimizer to achieve gradient descent of the loss function. It has two disadvantages: one is the slow convergence speed, and the other is its reliance on experience for setting the initial learning rate. However, if the initial learning rate is set properly with a sufficient number of iterations, the optimizer will also stand out among many other optimizers, obtaining higher accuracy on the validation set.
Some optimizers with adaptive learning rates, such as Adam and RMSProp, tend to converge fast, but the final convergence accuracy will be slightly worse. If you pursue faster convergence speed, we recommend using these adaptive learning rate optimizers, and SGD optimizers with momentum for higher convergence accuracy.

- Q: What are the current mainstream learning rate decay strategies? How to choose?
-- A: The learning rate is the speed at which the hyperparameters of the network weights are adjusted by the gradient of the loss function. The lower the learning rate, the slower the loss function will change. While using a low learning rate ensures that no local minimal values are missed, it also means that it takes longer to converge, especially if trapped in a plateau region. Throughout the whole training process, we cannot adopt the same learning rate to update the weights, otherwise, the optimal point cannot be reached, So we need to adjust the learning rate during the training. In the initial stage of training, since the weights are in a random initialization state and the loss function decreases fast, a larger learning rate can be set. And in the later stage of training, since the weights are close to the optimal value, a larger learning rate cannot further find the optimal value, so a smaller learning rate needs is a better choice. As for the learning rate decay strategy, many researchers or practitioners use piecewise_decay (step_decay), which is a stepwise decay learning rate. In addition, there are also other methods proposed by researchers, such as polynomial_decay, exponential_ decay, cosine_decay, etc. Among them, cosine_decay requires no adjustment of hyperparameters and has higher robustness, thus emerging as the preferred learning rate decay method to improve model accuracy. The learning rates of cosine_decay and piecewise_decay are shown in the following figure. It is easy to observe that cosine_decay keeps a large learning rate throughout the training, so it is slow in convergence, but its final effect is better than peicewise_decay.[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/models/lr_decay.jpeg)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/models/lr_decay.jpeg)
+- A: The learning rate is the speed at which the hyperparameters of the network weights are adjusted by the gradient of the loss function. The lower the learning rate, the slower the loss function will change. While using a low learning rate ensures that no local minima are missed, it also means that it takes longer to converge, especially if trapped in a plateau region. Throughout the whole training process, we cannot adopt the same learning rate to update the weights, otherwise, the optimal point cannot be reached, so we need to adjust the learning rate during the training. In the initial stage of training, since the weights are in a random initialization state and the loss function decreases fast, a larger learning rate can be set. In the later stage of training, since the weights are close to the optimal value, a larger learning rate cannot further find the optimal value, so a smaller learning rate is a better choice. As for the learning rate decay strategy, many researchers or practitioners use piecewise_decay (step_decay), which is a stepwise decay learning rate. In addition, there are also other methods proposed by researchers, such as polynomial_decay, exponential_decay, cosine_decay, etc.
Among them, cosine_decay requires no adjustment of hyperparameters and has higher robustness, thus emerging as the preferred learning rate decay method to improve model accuracy. The learning rates of cosine_decay and piecewise_decay are shown in the following figure. It is easy to observe that cosine_decay keeps a large learning rate throughout the training, so it is slow in convergence, but its final effect is better than piecewise_decay.
+
+![](../../images/models/lr_decay.jpeg)

> >

@@ -119,6 +119,7 @@

- Q: How to improve the accuracy of my own dataset by pre-training the model?
- A: At this stage, it has become a common practice in the image recognition field to load pre-trained models to train one's own tasks, which can often improve the accuracy of a particular task compared to training from random initialization. In general, the pre-training model widely used in the industry is obtained by training the ImageNet-1k dataset of 1.28 million images of 1000 classes. The fc layer weights of this pre-training model are a matrix of k*1000, where k is the number of neurons before the fc layer, and it is not necessary to load the fc layer weights when loading the pre-training weights. In terms of the learning rate, if your dataset is particularly small (e.g., less than 1,000 images), we recommend adopting a small initial learning rate, e.g., 0.001 (batch_size: 256, the same below), so as not to corrupt the pre-training weights with a larger learning rate. If your training dataset is relatively large (>100,000), we suggest you try a larger initial learning rate, such as 0.01 or above.

+ 
### 1.3 Data

> >

@@ -139,7 +140,7 @@

> >

- Q: What are the common data augmentation methods currently available to increase the richness of training samples when the amount of data is insufficient?
-- A: PaddleClas classifies data augmentation methods into three categories, which are image transformation, image cropping and image aliasing. Image transformation mainly includes AutoAugment and RandAugment, image cropping contains CutOut, RandErasing, HideAndSeek and GridMask, and image aliasing comprises Mixup and Cutmix. More detailed introduction to data augmentation can be found in the chapter of [Data Augmentation ](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/ algorithm_introduction/DataAugmentation.md).
+- A: PaddleClas classifies data augmentation methods into three categories, which are image transformation, image cropping and image aliasing. Image transformation mainly includes AutoAugment and RandAugment, image cropping contains CutOut, RandErasing, HideAndSeek and GridMask, and image aliasing comprises Mixup and Cutmix. A more detailed introduction to data augmentation can be found in the chapter [Data Augmentation](../algorithm_introduction/DataAugmentation_en.md).

> >

@@ -164,12 +165,12 @@

> >

- 
+ 
- Q: What are the common methods currently used for datasets with long-tailed distributions?
- A: (1) The categories with fewer data can be resampled to increase the probability of their occurrence; (2) the loss can be modified to increase the loss weight of images in categories corresponding to fewer images; (3) the method of transfer learning can be borrowed to learn generic knowledge from common categories and then migrate it to the categories with fewer samples.

- 
+ 
### 1.4 Model Inference and Prediction

> >

@@ -198,7 +199,7 @@

- A: (1) Using a GPU with better performance; (2) increasing the batch size; (3) using TensorRT and FP16 half-precision floating-point methods.

- 
+ 
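As a rough illustration of option (3) in the answer above, a Paddle Inference predictor can be configured to use TensorRT with FP16 along the following lines. This is a hedged sketch, not PaddleClas deployment code: the model and parameter file names, memory pool size, and batch size are placeholder assumptions, and the exact arguments may differ across Paddle versions.

```python
from paddle.inference import Config, PrecisionType, create_predictor

# Hypothetical paths to an exported inference model; replace with your own files.
config = Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(500, 0)                 # ~500 MB initial GPU memory pool, GPU id 0
config.enable_tensorrt_engine(
    workspace_size=1 << 30,                   # 1 GB TensorRT workspace
    max_batch_size=8,                         # larger batches amortize per-image overhead
    min_subgraph_size=3,
    precision_mode=PrecisionType.Half,        # FP16 half precision
    use_static=False,
    use_calib_mode=False)
predictor = create_predictor(config)
```

Combined with a faster GPU and a larger batch size, this kind of configuration is typically where most of the prediction-speed gains come from.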
## 2. Application of PaddleClas

> >

diff --git a/docs/en/introduction/function_intro_en.md b/docs/en/introduction/function_intro_en.md
index 4561a4dd..1a1a54b1 100644
--- a/docs/en/introduction/function_intro_en.md
+++ b/docs/en/introduction/function_intro_en.md
@@ -8,6 +8,6 @@ PaddleClas is an image recognition toolset for industry and academia, helping us

- SSLD knowledge distillation: The 14 classification pre-training models generally improved their accuracy by more than 3%; among them, the ResNet50_vd model achieved a Top-1 accuracy of 84.0% on the ImageNet-1k dataset and the Res2Net200_vd pre-training model achieved a Top-1 accuracy of 85.1%.
- Data augmentation: Provide 8 data augmentation algorithms such as AutoAugment, Cutout, Cutmix, etc. with the detailed introduction, code replication, and evaluation of effectiveness in a unified experimental environment.

-[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/recognition.gif)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/recognition.gif)
+![](../../images/recognition.gif)

-For more information about the quick start of image recognition, algorithm details, model training and evaluation, and prediction and deployment methods, please refer to the [README Tutorial](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/README_ch.md) on home page.
+For more information about the quick start of image recognition, algorithm details, model training and evaluation, and prediction and deployment methods, please refer to the [README Tutorial](../../../README_ch.md) on the home page.
diff --git a/docs/en/others/feature_visiualization_en.md b/docs/en/others/feature_visiualization_en.md
index 0056666b..7c87791b 100644
--- a/docs/en/others/feature_visiualization_en.md
+++ b/docs/en/others/feature_visiualization_en.md
@@ -4,26 +4,32 @@

## Contents

-- [1. Overview](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/feature_visiualization.md#1)
-- [2. Prepare Work](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/feature_visiualization.md#2)
-- [3. Model Modification](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/feature_visiualization.md#3)
-- [4. Results](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/feature_visiualization.md#4)
+- [1. Overview](#1)
+- [2. Prepare Work](#2)
+- [3. Model Modification](#3)
+- [4. Results](#4)
+
+ 

## 1. Overview

The feature graph is the feature representation of the input image in the convolutional network, and studying it can benefit our understanding and design of the model. Therefore, we employ this tool to visualize the feature graph based on the dynamic graph.

+
+ 

## 2. Prepare Work

-The first step is to select the model to be studied, here we choose ResNet50. Copy the model networking code [resnet.py](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/arch/backbone/) to [directory](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/utils/feature_maps_ visualization) and download the [ResNet50 pre-training model](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ResNet50_pretrained.pdparams) or follow the command below.
+The first step is to select the model to be studied; here we choose ResNet50.
Copy the model networking code [resnet.py](../../../ppcls/arch/backbone/legendary_models/resnet.py) to [directory](../../../ppcls/utils/feature_maps_visualization/) and download the [ResNet50 pre-training model](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ResNet50_pretrained.pdparams) or follow the command below.

-```
+```bash
wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ResNet50_pretrained.pdparams
```

-For other pre-training models and codes of network structure, please download [model library](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/arch/backbone) and [pre-training models](https:// github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/models/models_intro.md).
+For other pre-training models and codes of network structure, please download the [model library](../../../ppcls/arch/backbone/) and [pre-training models](../models/models_intro_en.md).
+
+ 

## 3. Model Modification

@@ -31,7 +37,7 @@ Having found the location of the needed feature graph, set self.fm to fetch it o

Specify the feature graph to be visualized in the forward function of ResNet50

-```
+```python
    def forward(self, x):
        with paddle.static.amp.fp16_guard():
            if self.data_format == "NHWC":
@@ -47,7 +53,7 @@ Specify the feature graph to be visualized in the forward function of ResNet50
        return x, fm
```

-Then modify the code [fm_vis.py](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/ppcls/utils/feature_maps_visualization/fm_vis.py) to import `ResNet50`,instantiating the `net` object:
+Then modify the code [fm_vis.py](../../../ppcls/utils/feature_maps_visualization/fm_vis.py) to import `ResNet50` and instantiate the `net` object:

```
from resnet import ResNet50
@@ -75,13 +81,13 @@ Parameters:

- `--save_path`: save path, such as `./tools/`
- `--use_gpu`: whether to enable GPU inference, default value: True

- 
+ 

## 4. Results

- Import the Image:

-[![img](https://github.com/PaddlePaddle/PaddleClas/raw/release/2.3/docs/images/feature_maps/feature_visualization_input.jpg)](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/images/feature_maps/feature_visualization_input.jpg)
+![](../../images/feature_maps/feature_visualization_input.jpg)

- Run the following script of feature graph visualization

```
python tools/feature_maps_visualization/fm_vis.py \
@@ -97,3 +103,5 @@ python tools/feature_maps_visualization/fm_vis.py \
```

- Save the output feature graph as `output.png`, as shown below.
+
+![](../../images/feature_maps/feature_visualization_output.jpg)
diff --git a/docs/en/others/train_on_xpu_en.md b/docs/en/others/train_on_xpu_en.md
index 2f33ac64..cc3bcd0e 100644
--- a/docs/en/others/train_on_xpu_en.md
+++ b/docs/en/others/train_on_xpu_en.md
@@ -4,26 +4,26 @@

## Contents

-- [1. Foreword](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/train_on_xpu.md#1)
-- [2. Training of Kunlun](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/train_on_xpu.md#2)
-  - [2.1 ResNet50](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/train_on_xpu.md#2.1)
-  - [2.2 MobileNetV3](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/train_on_xpu.md#2.2)
-  - [2.3 HRNet](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/train_on_xpu.md#2.3)
-  - [2.4 VGG16/19](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/train_on_xpu.md#2.4)
-
+- [1. Foreword](#1)
+- [2. 
Training of Kunlun](#2)
+  - [2.1 ResNet50](#2.1)
+  - [2.2 MobileNetV3](#2.2)
+  - [2.3 HRNet](#2.3)
+  - [2.4 VGG16/19](#2.4)
+ 

## 1. Foreword

-- This document describes the models currently supported by Kunlun and how to train these models on Kunlun devices. To install PaddlePaddle that supports Kunlun, please refer to install_kunlun(https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/paddle/install/install_Kunlun_zh.md)
- 
+- This document describes the models currently supported by Kunlun and how to train these models on Kunlun devices. To install PaddlePaddle that supports Kunlun, please refer to [install_kunlun](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/09_hardware_support/xpu_docs/paddle_install_cn.html)
+ 

## 2. Training of Kunlun

-- See [quick_start](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.3/docs/zh_CN/quick_start/quick_start_ classification_new_user.md) for data sources and pre-trained models. The training effect of Kunlun is aligned with CPU/GPU.
- 
+- See [quick_start](../quick_start/quick_start_classification_new_user_en.md) for data sources and pre-trained models. The training effect of Kunlun is aligned with CPU/GPU.
+ 

### 2.1 ResNet50

@@ -39,7 +39,7 @@ python3.7 ppcls/static/train.py \

The difference with cpu/gpu training lies in the addition of -o use_xpu=True, indicating that the execution is on a Kunlun device.

- 
+ 

### 2.2 MobileNetV3

@@ -53,7 +53,7 @@ python3.7 ppcls/static/train.py \
    -o is_distributed=False
```

- 
+ 

### 2.3 HRNet

@@ -67,7 +67,7 @@ python3.7 ppcls/static/train.py \
    -o use_gpu=False
```

- 
+ 

### 2.4 VGG16/19

diff --git a/docs/en/others/versions_en.md b/docs/en/others/versions_en.md
index 73c3cdcc..aeb2f814 100644
--- a/docs/en/others/versions_en.md
+++ b/docs/en/others/versions_en.md
@@ -4,10 +4,10 @@

## Contents

-- [1. v2.3](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/versions.md#1)
-- [2. v2.2](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/docs/zh_CN/others/versions.md#2)
-
+- [1. v2.3](#1)
+- [2. v2.2](#2)
+ 

## 1. v2.3

@@ -30,7 +30,7 @@

- PaddleSlim: 2.2.0
- PaddleServing: 0.6.1

- 
+ 

## 2. v2.2
-- GitLab