Add check nan inf tools doc (#1658)

42108fb0 · WangXi · GitHub · d742ff07 · 42108fb0 · 42108fb0
7 changed files
--- a/doc/fluid/advanced_guide/flags/check_nan_inf_cn.md
+++ b/doc/fluid/advanced_guide/flags/check_nan_inf_cn.md
+# check nan inf工具
+
+check nan inf工具用于检查Operator的结果是否含有nan(not a number，非有效数)或inf(infinite，无穷大数)。支持float32、double、float16三类浮点型，整型由于不存在nan、inf不作检查。
+
+## <span id="use">使用</span>
+
+#### 1. 使用方法
+设置环境变量FLAGS_check_nan_inf为True或者1即可。
+```
+export FLAGS_check_nan_inf=1   # 或者=True
+```
+
+#### 2. 进阶使用
+添加上述环境变量后，可以通过设置环境变量跳过op、op类型及op变量的检查。设置的格式如下：
+```
+PADDLE_INF_NAN_SKIP_OP="op0,op1,op2"
+PADDLE_INF_NAN_SKIP_ROLE="role1,role2,role3"
+PADDLE_INF_NAN_SKIP_VAR="op0:var0,op0:var1,op1:var0"
+```
+其中上面三个环境变量分别表示跳过op、op类型和op里变量的检查。
+##### 2.1 跳过op检查
+如下设置中前一个只跳过mul op的nan inf检查，后一个设置则跳过mul、softmax_with_cross_entropy这两个op的检查。
+`注意`：op跳过只接受精准匹配，要跳过softmax_with_cross_entropy的检查，不能设置环境变量为softmax_with或者with_cross进行模糊匹配，必须设置softmax_with_cross_entropy全名。
+```
+export PADDLE_INF_NAN_SKIP_OP="mul"
+export PADDLE_INF_NAN_SKIP_OP="mul,softmax_with_cross_entropy"
+```
+##### 2.2 跳过op类型检查
+目前接受的类型有: forward、backward、optimize、rpc、dist、lrsched、loss、default。正常fp32训练中，不需要跳过op类型进行nan inf检查。但在`fp16`中，在反向过程出现inf会对其进行修正，所以一般需要跳过backward的检查，这也是添加该功能的缘由。
+如下设置中前一个只跳过backward的检查，后一个设置跳过backward、optimize两种类型的检查。同上，op类型跳过也只支持精准匹配。
+```
+export PADDLE_INF_NAN_SKIP_ROLE="backward"
+export PADDLE_INF_NAN_SKIP_ROLE="backward,optimize"
+```
+##### 2.3 跳过指定op中变量的检查
+如下设置中前一个跳过mul op中fc_0.tmp_0变量，后一个设置则跳过mul op中fc_0.tmp_0和fc_0.tmp_1变量及 dropout op的new_relative变量。
+```
+export PADDLE_INF_NAN_SKIP_VAR="mul:fc_0.tmp_0"
+export PADDLE_INF_NAN_SKIP_VAR="mul:fc_0.tmp_0,mul:fc_0.tmp_1,dropout:new_relative"
+```
+`注意`：指定op变量检查中，对于op只接受精准匹配，对于变量则为模糊匹配，如上述的mul op中的fc_0.tmp_0和fc_0.tmp_1变量可用c_0.tmp进行匹配。
+
+## <span id="test">试用</span>
+可以使用单测中的[check_nan_inf_base.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/check_nan_inf_base.py)文件进行试用。该脚本已设置FLAGS_check_nan_inf=1打开check nan inf功能。直接python check_nan_inf_base.py执行即可。
+#### 1. GPU日志信息
+其中GPU的check nan信息由于在GPU中打印，所以nan inf信息会出现在出错信息栈前面。工具中会打印出现inf、nan的op及tensor名称，每个block会打印nan、inf、num中的3个值，并打印各自block中nan、inf、num的数量。
+![gpu_nan_inf.png](check_nan_inf_files/gpu_nan_inf.png)
+#### 2. CPU日志信息
+CPU中打印的nan、inf、num会在出错信息栈前面显示，同样打印了nan、inf、num中的三个值，并打印nan、inf、num的数量。check nan信息中op及tensor的名称会在最后显示。
+
+![cpu_nan_inf.png](check_nan_inf_files/cpu_nan_inf.png)
+
+![cpu_nan_inf_op_var.png](check_nan_inf_files/cpu_nan_inf_op_var.png)
+
+## <span id="speed">速度</span>
+测试环境：v100 32G单卡测试，Resnet50模型，imagenet数据集。`不同环境模型数据集下速度可能不同，以下速度仅供参考`
+>不检查nan inf速度，每张卡307.7 images/s。
+检查nan inf速度，每张卡250.2 images/s。
+
+## <span id="principle">原理</span>
+#### 1. 工具原理
+对于浮点类型操作，正常数值num，无穷大inf，非数值nan有如下运行关系。更详细可查看[INF, NAN, and NULL](https://wiki.analytica.com/index.php?title=INF,_NAN,_and_NULL_-_Exception_values&title=INF,_NAN,_and_NULL_-_Exception_values)
+```
+nan - nan = nan, inf - inf = nan, num - num = 0,
+nan + nan = nan, inf + inf = inf, nan + 0 = nan,
+inf + 0 = inf, nan + inf = nan, 0 + 0 = 0
+```
+基于此使用如下操作仅需最后检查sum是否为nan或者inf就行了。
+```
+for(value:values): sum += (value-value)
+```
+
+***`注意`：本文档的进阶使用、速度、原理目前仅在develop版本的paddle生效，并将随1.7版本的paddle发布。
+此前版本的check nan inf工具在GPU上不推荐使用，旧工具速度为0.25 images/s，测试会拖慢1000多倍的训练速度。***
\ No newline at end of file
--- a/doc/fluid/advanced_guide/flags/check_nan_inf_en.md
+++ b/doc/fluid/advanced_guide/flags/check_nan_inf_en.md
+# check nan inf tool
+
+The check nan inf tool checks whether the output of an Operator contains nan (not a number) or inf (infinity).
+It supports the float32, double, and float16 floating-point types. Integer types are not checked because they cannot hold nan or inf.
+
+## <span id="use">Use</span>
+#### 1. Method of use
+Set the environment variable FLAGS_check_nan_inf to True or 1.
+```
+export FLAGS_check_nan_inf=1  # or set =True
+```
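The flag can also be set from Python rather than from the shell. The sketch below assumes that the flag is read from the process environment when Paddle initializes, so it has to be set before Paddle is imported. The helper name `enable_nan_inf_check` is hypothetical and not part of Paddle.

```python
import os


def enable_nan_inf_check(env=None):
    """Set FLAGS_check_nan_inf=1 in the given mapping (default: os.environ).

    Hypothetical helper for illustration only. Call it before importing
    Paddle, assuming the flag is read from the environment at start-up.
    """
    target = os.environ if env is None else env
    target["FLAGS_check_nan_inf"] = "1"  # "True" is also accepted per the text above
    return target


# Build an environment for a child process without modifying os.environ.
child_env = enable_nan_inf_check(dict(os.environ))
```

Passing a copy of the environment keeps the flag scoped to one child process, which is convenient when only a single debugging run should pay the cost of the check.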
+
+#### 2. Advanced use
+After setting the environment variable above, you can skip the check for specific ops, op roles, or op variables by setting additional environment variables.
+The format is as follows:
+```
+PADDLE_INF_NAN_SKIP_OP="op0,op1,op2"
+PADDLE_INF_NAN_SKIP_ROLE="role1,role2,role3"
+PADDLE_INF_NAN_SKIP_VAR="op0:var0,op0:var1,op1:var0"
+```
+These three environment variables skip the checks of ops, op roles, and variables within ops, respectively.
+##### 2.1 Skip op check
+Of the following two settings, the first skips the nan inf check only for the mul op, and the second skips it for both the mul and softmax_with_cross_entropy ops.
+`Note`: Op skipping only accepts exact matches. To skip the softmax_with_cross_entropy check, you cannot set the environment variable to a partial name such as softmax_with or with_cross.
+You must use the full name softmax_with_cross_entropy.
+```
+export PADDLE_INF_NAN_SKIP_OP="mul"
+export PADDLE_INF_NAN_SKIP_OP="mul,softmax_with_cross_entropy"
+```
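The exact-match rule can be pictured with a small sketch. It only illustrates the behavior described above and is not Paddle's implementation; the function names `parse_skip_ops` and `is_op_skipped` are hypothetical.

```python
def parse_skip_ops(value):
    """Split a PADDLE_INF_NAN_SKIP_OP-style string into a set of op names."""
    return {name for name in value.split(",") if name}


def is_op_skipped(op_type, skip_ops):
    # Exact match only: a partial name such as "softmax_with" matches nothing.
    return op_type in skip_ops


skip_ops = parse_skip_ops("mul,softmax_with_cross_entropy")
print(is_op_skipped("softmax_with_cross_entropy", skip_ops))  # True
print(is_op_skipped("softmax_with_cross_entropy", parse_skip_ops("softmax_with")))  # False
```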
+##### 2.2 Skip op role check
+The currently accepted roles are: forward, backward, optimize, rpc, dist, lrsched, loss, default.
+In normal fp32 training, there is no need to skip the nan inf check for any op role.
+In `fp16` training, however, an inf that appears in the backward pass is corrected, so the backward check generally needs to be skipped. This is why the feature was added.
+Of the following two settings, the first skips only the backward check, and the second skips both the backward and optimize checks.
+As above, op role skipping only supports exact matching.
+```
+export PADDLE_INF_NAN_SKIP_ROLE="backward"
+export PADDLE_INF_NAN_SKIP_ROLE="backward,optimize"
+```
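Because only exact role names are accepted, a misspelled role would presumably be ignored without any warning. The sketch below validates role names against the list given above before building the variable's value. `ACCEPTED_ROLES` and `build_skip_role` are hypothetical names introduced for illustration, not Paddle APIs.

```python
# Role names copied from the list of accepted roles above.
ACCEPTED_ROLES = (
    "forward", "backward", "optimize", "rpc",
    "dist", "lrsched", "loss", "default",
)


def build_skip_role(roles):
    """Return a PADDLE_INF_NAN_SKIP_ROLE value, rejecting unknown role names."""
    unknown = [role for role in roles if role not in ACCEPTED_ROLES]
    if unknown:
        raise ValueError("unknown op role(s): " + ", ".join(unknown))
    return ",".join(roles)


# Environment for an fp16 run that skips the backward check.
fp16_env = {
    "FLAGS_check_nan_inf": "1",
    "PADDLE_INF_NAN_SKIP_ROLE": build_skip_role(["backward"]),
}
```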
+##### 2.3 Skip the checking of variables in the specified op
+Of the following two settings, the first skips the fc_0.tmp_0 variable in the mul op, and the second skips the fc_0.tmp_0 and fc_0.tmp_1 variables in the mul op and the new_relative variable in the dropout op.
+```
+export PADDLE_INF_NAN_SKIP_VAR="mul:fc_0.tmp_0"
+export PADDLE_INF_NAN_SKIP_VAR="mul:fc_0.tmp_0,mul:fc_0.tmp_1,dropout:new_relative"
+```
+`Note`: When skipping variables in a specified op, the op name must match exactly, while the variable name is matched fuzzily.
+For example, the fc_0.tmp_0 and fc_0.tmp_1 variables in the mul op above can both be matched by c_0.tmp.
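The two matching rules can be sketched as follows. This sketch assumes that "fuzzy matching" means substring matching, which is consistent with the c_0.tmp example but is not confirmed by the text. The function names are hypothetical and this is not Paddle's implementation.

```python
def parse_skip_vars(value):
    """Parse "op0:var0,op0:var1" into {op: [variable_pattern, ...]}."""
    rules = {}
    for item in value.split(","):
        op, _, pattern = item.partition(":")
        if not op or not pattern:
            continue  # ignore empty or malformed entries
        rules.setdefault(op, []).append(pattern)
    return rules


def is_var_skipped(op_type, var_name, rules):
    # Op name: exact match. Variable name: substring match (an assumption).
    return any(pattern in var_name for pattern in rules.get(op_type, []))


rules = parse_skip_vars("mul:c_0.tmp,dropout:new_relative")
print(is_var_skipped("mul", "fc_0.tmp_0", rules))     # True
print(is_var_skipped("mul", "fc_0.tmp_1", rules))     # True
print(is_var_skipped("matmul", "fc_0.tmp_0", rules))  # False
```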
+
+## <span id="test">Test</span>
+You can try the tool with the unit test file [check_nan_inf_base.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/check_nan_inf_base.py).
+The script already sets FLAGS_check_nan_inf=1 to enable the nan inf check, so just run `python check_nan_inf_base.py`.
+
+#### 1. GPU log information
+On GPU, the check information is printed from the device, so the nan inf information appears before the error stack.
+The tool prints the names of the op and tensor in which inf or nan was found. Each block prints three values from among nan, inf, and num,
+and also prints the number of nan, inf, and num values in that block.
+![gpu_nan_inf.png](check_nan_inf_files/gpu_nan_inf.png)
+
+#### 2. CPU log information
+On CPU, the nan, inf, and num values are printed before the error stack.
+As on GPU, three values from among nan, inf, and num are printed, along with the number of nan, inf, and num values.
+The names of the op and tensor that contain nan or inf are displayed at the end.
+![cpu_nan_inf.png](check_nan_inf_files/cpu_nan_inf.png)
+![cpu_nan_inf_op_var.png](check_nan_inf_files/cpu_nan_inf_op_var.png)
+
+## <span id="speed">Speed</span>
+Test environment: single V100 32G card, ResNet50 model, ImageNet dataset.
+`Speed may vary with the environment, model, and dataset. The following figures are for reference only.`
+> Without the nan inf check: 307.7 images/s per card.
+> With the nan inf check: 250.2 images/s per card.
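
The overhead implied by these two figures can be worked out directly. The comparison with the old tool assumes that the 0.25 images/s figure quoted in the closing note was measured under the same setup, which the document does not state explicitly.

```python
baseline = 307.7  # images/s without the nan inf check
checked = 250.2   # images/s with the nan inf check
old_tool = 0.25   # images/s with the previous tool (from the closing note)

throughput_loss = 1.0 - checked / baseline  # fraction of throughput lost
slowdown = baseline / checked               # slowdown factor of the new tool
old_slowdown = baseline / old_tool          # slowdown factor of the old tool

print(f"throughput loss: {throughput_loss:.1%}")  # 18.7%
print(f"slowdown: {slowdown:.2f}x")               # 1.23x
print(f"old tool slowdown: {old_slowdown:.0f}x")  # 1231x
```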
+
+## <span id="printciple">Principle</span>
+#### 1. Tool principle
+For floating-point operations, num (a normal number), inf (infinity), and nan (not a number) obey the following relations.
+More details can be found in [INF, NAN, and NULL](https://wiki.analytica.com/index.php?title=INF,_NAN,_and_NULL_-_Exception_values&title=INF,_NAN,_and_NULL_-_Exception_values).
+```
+nan - nan = nan, inf - inf = nan, num - num = 0,
+nan + nan = nan, inf + inf = inf, nan + 0 = nan,
+inf + 0 = inf, nan + inf = nan, 0 + 0 = 0
+```
+Based on these relations, it is enough to accumulate value - value over all values and then check once whether the final sum is nan or inf.
+```
+for(value:values): sum += (value-value)
+```
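The pseudocode above can be written as a minimal runnable sketch in plain Python. It only illustrates the arithmetic trick and is not the tool's actual CPU or GPU kernel; the function name `has_nan_or_inf` is hypothetical. Unlike the tool, it reports only whether a bad value is present and does not count nan, inf, and num separately.

```python
import math


def has_nan_or_inf(values):
    """Return True if any element of values is nan or inf."""
    total = 0.0
    for value in values:
        # value - value is 0 for a normal number and nan for nan or inf,
        # and nan propagates through every later addition.
        total += value - value
    return math.isnan(total) or math.isinf(total)


print(has_nan_or_inf([1.0, 2.5, -3.0]))         # False
print(has_nan_or_inf([1.0, float("inf")]))      # True
print(has_nan_or_inf([float("nan"), 0.0]))      # True
```

The point of the design is that the loop body contains no branch per element: the test for nan or inf happens once, after the accumulation.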
+
+***`Note`: The Advanced use, Speed, and Principle sections of this document currently apply only to the develop version of Paddle, and will be released with Paddle 1.7.
+Using earlier versions of the check nan inf tool on GPU is not recommended: the old tool runs at 0.25 images/s, which slows training down by more than 1000 times.***
\ No newline at end of file
--- a/doc/fluid/advanced_guide/flags/check_nan_inf_files/cpu_nan_inf.png
+++ b/doc/fluid/advanced_guide/flags/check_nan_inf_files/cpu_nan_inf.png
--- a/doc/fluid/advanced_guide/flags/check_nan_inf_files/cpu_nan_inf_op_var.png
+++ b/doc/fluid/advanced_guide/flags/check_nan_inf_files/cpu_nan_inf_op_var.png
--- a/doc/fluid/advanced_guide/flags/check_nan_inf_files/gpu_nan_inf.png
+++ b/doc/fluid/advanced_guide/flags/check_nan_inf_files/gpu_nan_inf.png
--- a/doc/fluid/advanced_guide/flags/debug_cn.rst
+++ b/doc/fluid/advanced_guide/flags/debug_cn.rst
@@ -79,4 +79,9 @@ FLAGS_reader_queue_speed_test_mode=True - 启用pyreader测试模式。

 注意
 -------
-仅当使用py_reader时该flag才有效。
\ No newline at end of file
+仅当使用py_reader时该flag才有效。
+
+..	toctree::
+	:hidden:
+
+	check_nan_inf_cn.md
--- a/doc/fluid/advanced_guide/flags/debug_en.rst
+++ b/doc/fluid/advanced_guide/flags/debug_en.rst
@@ -79,3 +79,8 @@ FLAGS_reader_queue_speed_test_mode=True will enable the pyreader test mode.
 Note
 -------
 This flag will work only when you are using py_reader.
+
+..	toctree::
+	:hidden:
+
+	check_nan_inf_en.md
\ No newline at end of file