Commit 099b3356 authored by gaotingquan, committed by Tingquan Gao

Update the instructions

Parent 07a1478c
@@ -6,7 +6,7 @@
Since deep learning relies on a large amount of data in the training stage, the data need to be loaded and preprocessed. These operations are usually executed on the CPU, which limits further improvement of the training speed; especially when the batch_size is large, data reading can become the speed bottleneck. DALI can use the GPU to accelerate these operations, thereby further improving the training speed.
## Installing DALI
- DALI only supports Linux x64, and CUDA 10.0 or later is required.
+ DALI only supports Linux x64, and CUDA 10.2 or later is required.
* For CUDA 10:
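The collapsed hunk hides the actual install command. As a hedged sketch, DALI wheels are served from NVIDIA's package index; the package name `nvidia-dali-cuda100` targets CUDA 10.x builds and is an assumption, not the elided line of this diff:

```shell
# Install DALI from NVIDIA's extra package index; the package name is an
# assumption based on NVIDIA's published DALI install instructions.
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100
```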
@@ -25,7 +25,7 @@ PaddleClas supports training with DALI in static-graph mode. Since DALI only supports
# set the GPUs that can be seen
export CUDA_VISIBLE_DEVICES="0"
- # set the fraction of GPU memory used for neural network training, generally 0.8 or 0.7
+ # set the fraction of GPU memory used for neural network training, generally 0.8 or 0.7; the remaining GPU memory is reserved for DALI
export FLAGS_fraction_of_gpu_memory_to_use=0.80
python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True
@@ -37,7 +37,7 @@ And you can train with multiple GPUs:
# set the GPUs that can be seen
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
- # set the fraction of GPU memory used for neural network training, generally 0.8 or 0.7
+ # set the fraction of GPU memory used for neural network training, generally 0.8 or 0.7; the remaining GPU memory is reserved for DALI
export FLAGS_fraction_of_gpu_memory_to_use=0.80
python -m paddle.distributed.launch \
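The hunk ends at the line continuation above, so the rest of the command is elided. A hedged sketch of a typical full invocation (the `--gpus` flag and argument layout are assumptions, not the hidden lines of this diff):

```shell
# Launch one trainer process per visible GPU; paths mirror the single-GPU command above.
python -m paddle.distributed.launch \
    --gpus="0,1,2,3,4,5,6,7" \
    tools/static/train.py \
        -c configs/ResNet/ResNet50.yaml \
        -o use_dali=True
```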
@@ -54,5 +54,3 @@ On the basis of the above, using FP16 half-precision can further improve the training speed. You can
```shell
python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True -o AMP.use_pure_fp16=True
```
- Using FP16 half-precision may reduce training accuracy or slow convergence.
@@ -26,16 +26,24 @@ Among them, `-c` is used to specify the path of the configuration file, `-o` is
Of course, you can also directly modify the configuration file to update the configuration. For specific configuration parameters, please refer to [Configuration Document](config_en.md).
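As a hedged illustration of the `-o` override syntax described above, nested config keys are addressed with dots; the keys shown are taken from commands elsewhere in this diff and do not imply anything about the rest of the schema:

```shell
# -o overrides a yaml value from the command line without editing the file;
# AMP.use_pure_fp16 shows the dotted form for a nested key.
python tools/static/train.py \
    -c configs/ResNet/ResNet50.yaml \
    -o use_dali=True \
    -o AMP.use_pure_fp16=True
```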
* The output log examples are as follows:
- * If mixup or cutmix is used in training, only loss, lr (learning rate) and the training time of the minibatch will be printed in the log.
+ * If mixup or cutmix is used in training, top-1 and top-k (k=5 by default) accuracy will not be printed in the log:
```
- train step:890 loss: 6.8473 lr: 0.100000 elapse: 0.157s
+ ...
+ epoch:0 , train step:20 , loss: 4.53660, lr: 0.003750, batch_cost: 1.23101 s, reader_cost: 0.74311 s, ips: 25.99489 images/sec, eta: 0:12:43
+ ...
+ END epoch:1 valid top1: 0.01569, top5: 0.06863, loss: 4.61747, batch_cost: 0.26155 s, reader_cost: 0.16952 s, batch_cost_sum: 10.72348 s, ips: 76.46772 images/sec.
+ ...
```
- * If mixup or cutmix is not used during training, then in addition to loss, lr (learning rate) and the training time of the minibatch, top-1 and top-k (k=5 by default) will also be printed in the log.
+ * If mixup or cutmix is not used during training, then in addition to the above information, top-1 and top-k (k=5 by default) will also be printed in the log:
```
- epoch:0 train step:13 loss:7.9561 top1:0.0156 top5:0.1094 lr:0.100000 elapse:0.193s
+ ...
+ epoch:0 , train step:30 , top1: 0.06250, top5: 0.09375, loss: 4.62766, lr: 0.003728, batch_cost: 0.64089 s, reader_cost: 0.18857 s, ips: 49.93080 images/sec, eta: 0:06:18
+ ...
+ END epoch:0 train top1: 0.01310, top5: 0.04738, loss: 4.65124, batch_cost: 0.64089 s, reader_cost: 0.18857 s, batch_cost_sum: 13.45863 s, ips: 49.93080 images/sec.
+ ...
```
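Given the log formats above, a hedged one-liner for watching per-step metrics as they are written (the log path is hypothetical; it assumes training output is redirected to `train.log`):

```shell
# Stream only the per-minibatch lines; --line-buffered keeps grep from batching output.
tail -f train.log | grep --line-buffered "train step"
```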
During training, you can view loss changes in real time through `VisualDL`; see [VisualDL](../extension/VisualDL_en.md) for details.
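A hedged sketch of starting the VisualDL web UI (the log directory and port are assumptions, not values from this doc):

```shell
# Serve the recorded scalars at http://localhost:8040; --logdir must point
# to the directory the training run writes VisualDL logs into.
visualdl --logdir ./output/vdl/ --port 8040
```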
@@ -6,7 +6,7 @@
Deep learning programs rely on a large amount of data during training; the data must be loaded and preprocessed before being fed to the training program. These operations are usually done on the CPU, which limits further improvement of the training speed, and data reading may become the bottleneck, especially when the batch_size is large. DALI implements data loading and preprocessing on the GPU, exploiting its high parallelism to further improve the training speed.
## Installing DALI
- Currently, DALI only supports the Linux x64 platform, and the CUDA version must be 10.0 or later.
+ Currently, DALI only supports the Linux x64 platform, and the CUDA version must be 10.2 or later.
* For CUDA 10:
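As in the English version above, the install command itself is collapsed in this hunk. Once installed, a hedged sanity check (the module path is DALI's public Python package name):

```shell
# Confirm the DALI Python bindings import cleanly and print the installed version.
python -c "import nvidia.dali as dali; print(dali.__version__)"
```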
@@ -25,7 +25,7 @@ PaddleClas supports DALI acceleration in static-graph training; since DALI only supports
# set the GPU card numbers used for training
export CUDA_VISIBLE_DEVICES="0"
- # set the fraction of GPU memory used for neural network training; it can be set according to the situation, generally 0.8 or 0.7
+ # set the fraction of GPU memory used for neural network training, generally 0.8 or 0.7; the remaining GPU memory is reserved for DALI
export FLAGS_fraction_of_gpu_memory_to_use=0.80
python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True
@@ -37,7 +37,7 @@ python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True
# set the GPU card numbers used for training
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
- # set the GPU memory used for neural network training; it can be set according to the situation, generally 0.8 or 0.7
+ # set the fraction of GPU memory used for neural network training, generally 0.8 or 0.7; the remaining GPU memory is reserved for DALI
export FLAGS_fraction_of_gpu_memory_to_use=0.80
python -m paddle.distributed.launch \
@@ -54,5 +54,3 @@ python -m paddle.distributed.launch \
```shell
python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True -o AMP.use_pure_fp16=True
```
- Using FP16 half-precision training may reduce training accuracy or slow convergence.
@@ -34,16 +34,23 @@ python tools/train.py \
Run the above command and you can see the output log; examples are shown below:
- * If mixup or cutmix data augmentation is used in training, only loss, lr (learning rate) and the training time of the minibatch are printed in the log.
+ * If mixup or cutmix data augmentation is used in training, top-1 and top-k (k=5 by default) will not be printed in the log:
```
- train step:890 loss: 6.8473 lr: 0.100000 elapse: 0.157s
+ ...
+ epoch:0 , train step:20 , loss: 4.53660, lr: 0.003750, batch_cost: 1.23101 s, reader_cost: 0.74311 s, ips: 25.99489 images/sec, eta: 0:12:43
+ ...
+ END epoch:1 valid top1: 0.01569, top5: 0.06863, loss: 4.61747, batch_cost: 0.26155 s, reader_cost: 0.16952 s, batch_cost_sum: 10.72348 s, ips: 76.46772 images/sec.
+ ...
```
- * If mixup or cutmix data augmentation is not used during training, then in addition to loss, lr (learning rate) and the minibatch training time, top-1 and top-k (k=5 by default) information is also printed in the log.
+ * If mixup or cutmix data augmentation is not used during training, then in addition to the above information, top-1 and top-k (k=5 by default) are also printed in the log:
```
- epoch:0 train step:13 loss:7.9561 top1:0.0156 top5:0.1094 lr:0.100000 elapse:0.193s
+ ...
+ epoch:0 , train step:30 , top1: 0.06250, top5: 0.09375, loss: 4.62766, lr: 0.003728, batch_cost: 0.64089 s, reader_cost: 0.18857 s, ips: 49.93080 images/sec, eta: 0:06:18
+ ...
+ END epoch:0 train top1: 0.01310, top5: 0.04738, loss: 4.65124, batch_cost: 0.64089 s, reader_cost: 0.18857 s, batch_cost_sum: 13.45863 s, ips: 49.93080 images/sec.
+ ...
```
During training, you can also observe loss changes in real time through VisualDL; see [VisualDL](../extension/VisualDL.md) for details.