Refine memory-related flags docs (#1167)

* refine flags doc, test=develop
* follow comments, test=develop

9ea76ed4 · Zeng Jinle · GitHub · f0cd2958
4 changed files
--- a/doc/fluid/flags/cudnn_cn.rst
+++ b/doc/fluid/flags/cudnn_cn.rst
@@ -11,7 +11,7 @@ FLAGS_conv_workspace_size_limit

 Values accepted
 ---------------
-Uint64. The default value is 4096, i.e. a 4 GB memory workspace.
+Uint64. The default value is 512, i.e. a 512 MB GPU memory workspace.

 Example
 -------

--- a/doc/fluid/flags/cudnn_en.rst
+++ b/doc/fluid/flags/cudnn_en.rst
@@ -11,7 +11,7 @@ The workspace limit size in MB unit for choosing cuDNN convolution algorithms. T

 Values accepted
 ---------------
-Uint64. The default value is 4096. That is to say, 4G memory workspace.
+Uint64. The default value is 512. That is to say, 512MB memory workspace.

 Example
 -------

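The unit matters when reading this change: the flag is expressed in MB, so moving the documented default from 4096 to 512 shrinks the workspace limit from 4 GB to 512 MB. The sketch below only illustrates that arithmetic and one common way of passing a flag (an environment variable set before the framework is imported). `workspace_limit_bytes` is a hypothetical helper, not a PaddlePaddle API, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import os

MB = 1024 * 1024  # assumption: the flag's "MB" means 2**20 bytes


def workspace_limit_bytes(limit_mb: int) -> int:
    """Convert a FLAGS_conv_workspace_size_limit value (in MB) to bytes.

    Hypothetical helper for illustration only.
    """
    if limit_mb < 0:
        raise ValueError("the flag is a Uint64 and cannot be negative")
    return limit_mb * MB


old_default = workspace_limit_bytes(4096)  # previously documented default
new_default = workspace_limit_bytes(512)   # default documented by this change

# Set the flag in the environment before the framework is imported.
os.environ["FLAGS_conv_workspace_size_limit"] = "1024"
```

The environment variable has to be in place before the import, since flags passed this way are typically read once at start-up.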
--- a/doc/fluid/flags/memory_cn.rst
+++ b/doc/fluid/flags/memory_cn.rst
@@ -7,17 +7,17 @@ FLAGS_allocator_strategy
 ********************
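 (since 1.2)

-Selects the allocator strategy of PaddlePaddle. The allocator strategy is under development, and the non-legacy allocator is not stable yet.
+Selects the allocator strategy of PaddlePaddle. The auto_growth strategy is not stable yet.

 Values accepted
 ---------------
-String, one of ['legacy', 'naive_best_fit']. The default value is 'legacy'.
+String, one of ['naive_best_fit', 'auto_growth']. The default value is 'naive_best_fit'.

 Example
 --------
-FLAGS_allocator_strategy=legacy - use the legacy allocator.
+FLAGS_allocator_strategy=naive_best_fit - use the pre-allocated best fit allocator.

-FLAGS_allocator_strategy=naive_best_fit - use the newly designed allocator.
+FLAGS_allocator_strategy=auto_growth - use the auto growth allocator.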


 FLAGS_eager_delete_scope
@@ -39,15 +39,15 @@ FLAGS_eager_delete_tensor_gb
 *******************************************
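 (since 1.0.0)

-Whether to use the garbage collection strategy to optimize the memory usage of the network. If FLAGS_eager_delete_tensor_gb >= 0, the garbage collection strategy is enabled and memory garbage is collected while the network runs, which helps reduce memory usage. It is only useful when you run a program with Executor, a compiled program, or a data-parallel compiled program. If FLAGS_eager_delete_tensor_gb < 0, the garbage collection strategy is disabled. The garbage collector does not release memory garbage until its size reaches FLAGS_eager_delete_tensor_gb GB.
+Whether to use the garbage collection strategy to optimize the memory usage of the network. If FLAGS_eager_delete_tensor_gb < 0, the garbage collection strategy is disabled. If FLAGS_eager_delete_tensor_gb >= 0, the garbage collection strategy is enabled and memory garbage is collected while the network runs, which helps reduce memory usage. It is only useful when you run a program with Executor, a compiled program, or a data-parallel compiled program. The garbage collector does not release memory garbage until its size reaches FLAGS_eager_delete_tensor_gb GB.

 Values accepted
 ---------------
-Double, in GB. The default value is -1.0.
+Double, in GB. The default value is 0.0.

 Example
 -------
-FLAGS_eager_delete_tensor_gb=0.0 - release memory garbage as soon as it is no longer used.
+FLAGS_eager_delete_tensor_gb=0.0 - release memory garbage when its size reaches 0.0 GB, i.e. as soon as any garbage appears.

 FLAGS_eager_delete_tensor_gb=1.0 - release memory garbage when its size reaches 1.0 GB.

@@ -58,72 +58,70 @@ FLAGS_eager_delete_tensor_gb=-1.0 - disable the garbage collection strategy.
 It is recommended to set FLAGS_eager_delete_tensor_gb=0.0 to enable the garbage collection strategy when training large networks.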


-FLAGS_enable_inplace_whitelist
+FLAGS_fast_eager_deletion_mode
 *******************************************
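-(since 1.4)
+(since 1.3)

-This flag is for debugging; it disables in-place memory reuse in some ops. When set, some ops will not perform the in-place reuse optimization to save memory. These ops include: sigmoid, exp, relu, tanh, sqrt, ceil, floor, reciprocal, relu6, soft_relu, hard_sigmoid, batch_norm, batch_norm_grad, sum, sum_grad, scale, reshape, elementwise_add, and elementwise_add_grad.
+Whether to use the fast garbage collection strategy. If not set, GPU memory is released when the CUDA kernel ends. Otherwise, GPU memory is released without waiting for the CUDA kernel to end, which makes the garbage collection strategy faster. Only effective when the garbage collection strategy is enabled.

 Values accepted
 ---------------
-Bool. The default value is False.
+Bool. The default value is True.

 Example
 -------
-FLAGS_enable_inplace_whitelist=True - disable the in-place memory reuse optimization on certain ops.
+FLAGS_fast_eager_deletion_mode=True - enable the fast garbage collection strategy.

+FLAGS_fast_eager_deletion_mode=False - disable the fast garbage collection strategy.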

-FLAGS_fast_eager_deletion_mode
+
+FLAGS_fraction_of_cpu_memory_to_use
 *******************************************
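-(since 1.3)
+(since 1.2.0)

-Whether to use the fast garbage collection strategy. If not set, GPU memory is released when the CUDA kernel ends. Otherwise, GPU memory is released without waiting for the CUDA kernel to end, which makes the garbage collection strategy faster. Only effective when the garbage collection strategy is enabled.
+The fraction of the total CPU memory size to allocate as a memory chunk. Future memory usage will be allocated from this chunk. If the chunk does not have enough CPU memory, new chunks of the same size will be requested from the CPU until the CPU has no memory left.

 Values accepted
 ---------------
-Bool. The default value is True.
+Double in range [0, 1], the fraction of CPU memory taken by the initially allocated chunk. The default value is 1.0.

 Example
 -------
-FLAGS_fast_eager_deletion_mode=True - enable the fast garbage collection strategy.
-
-FLAGS_fast_eager_deletion_mode=False - disable the fast garbage collection strategy.
+FLAGS_fraction_of_cpu_memory_to_use=0.1 - allocate 10% of the total CPU memory size as the initial CPU memory chunk.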


-FLAGS_fraction_of_gpu_memory_to_use
+FLAGS_fraction_of_cuda_pinned_memory_to_use
 *******************************************
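 (since 1.2.0)

-The fraction of the total GPU memory size to allocate as a memory chunk. Future memory usage will be allocated from this chunk. If the chunk does not have enough GPU memory, new chunks of the same size will be requested from the GPU until the GPU has no memory left.
+The fraction of the total CPU memory size to allocate as a CUDA pinned memory chunk. Future CUDA pinned memory usage will be allocated from this chunk. If the chunk does not have enough CPU memory, new chunks of the same size will be requested from the CPU until the CPU has no memory left.

 Values accepted
 ---------------
-Uint64, greater than 0, the fraction of GPU memory taken by the initially allocated chunk.
+Double in range [0, 1], the fraction of CPU memory taken by the initially allocated chunk. The default value is 0.5.

 Example
 -------
-FLAGS_fraction_of_gpu_memory_to_use=0.1 - allocate 10% of the total GPU memory size as the initial GPU memory chunk.
-
-Note
-------
-On Windows platforms FLAGS_fraction_of_gpu_memory_to_use defaults to 0.5; on Linux it defaults to 0.92.
+FLAGS_fraction_of_cuda_pinned_memory_to_use=0.1 - allocate 10% of the total CPU memory size as the initial CUDA pinned memory chunk.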


-FLAGS_free_idle_memory
+FLAGS_fraction_of_gpu_memory_to_use
 *******************************************
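-(since 0.15.0)
+(since 1.2.0)

-Whether to free idle memory pre-allocated from the system at runtime. When set, idle memory is released if the pre-allocated allocator holds too much of it.
+The fraction of the total available GPU memory size to allocate as a GPU memory chunk. Future GPU memory usage will be allocated from this chunk. If the chunk does not have enough GPU memory, new chunks of the same size will be requested from the GPU until the GPU has no memory left.

 Values accepted
 ---------------
-Bool. The default value is False.
+Double in range [0, 1], the fraction of available GPU memory taken by the initially allocated chunk.

 Example
 -------
-FLAGS_free_idle_memory=True - free idle memory when there is too much of it.
+FLAGS_fraction_of_gpu_memory_to_use=0.1 - allocate 10% of the total available GPU memory size as the initial GPU memory chunk.

-FLAGS_free_idle_memory=False - do not free idle memory.
+Note
+-------
+On Windows platforms FLAGS_fraction_of_gpu_memory_to_use defaults to 0.5; on Linux it defaults to 0.92.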


 FLAGS_fuse_parameter_groups_size
@@ -207,21 +205,6 @@ FLAGS_initial_gpu_memory_in_mb=4096 - allocate 4 GB as the initial GPU memory chunk size.
 If this flag is set, the memory size set by FLAGS_fraction_of_gpu_memory_to_use is overridden by this flag. If it is not set, PaddlePaddle uses FLAGS_fraction_of_gpu_memory_to_use to allocate GPU memory.


-FLAGS_limit_of_tmp_allocation
-*******************************************
-(since 1.3)
-
-FLAGS_limit_of_tmp_allocation is the upper limit of the temporary_allocation size, in bytes. If FLAGS_limit_of_tmp_allocation is -1, the temporary_allocation size is unlimited.
-
-Values accepted
---------------
-Int64. The default value is -1.
-
-Example
-------
-FLAGS_limit_of_tmp_allocation=1024 - set the upper limit of the temporary_allocation size to 1024 bytes.
-
-
 FLAGS_memory_fraction_of_eager_deletion
 *******************************************
 (始于1.4)
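@@ -261,21 +244,6 @@ FLAGS_reallocate_gpu_memory_in_mb=1024 - if the allocated GPU memory chunk is exhausted,
 If this flag is set, PaddlePaddle reallocates GPU memory of the size specified by this flag. Otherwise it allocates the fraction of GPU memory specified by FLAGS_fraction_of_gpu_memory_to_use.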


-FLAGS_times_excess_than_required_tmp_allocation
-*******************************************
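-(since 1.3)
-
-FLAGS_times_excess_than_required_tmp_allocation is the maximum size the TemporaryAllocator can return. For example, if the required memory size is N and times_excess_than_required_tmp_allocation is 2.0, the TemporaryAllocator returns an available allocation whose size is in the range N to 2*N.
-
-Values accepted
---------------
-Int64. The default value is 2.
-
-Example
-------
-FLAGS_times_excess_than_required_tmp_allocation=1024 - set the maximum size the TemporaryAllocator can return to 1024*N.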
-
-
 FLAGS_use_pinned_memory
 *******************************************
 (since 0.12.0)

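The threshold semantics of FLAGS_eager_delete_tensor_gb described in the diff above (negative disables collection, zero releases as soon as any garbage exists, a positive value waits until that many GB have accumulated) can be modelled with a toy predicate. This is only a sketch of the documented behaviour, not PaddlePaddle's garbage collector; `should_release` is a hypothetical name.

```python
def should_release(garbage_gb: float, eager_delete_tensor_gb: float) -> bool:
    """Toy model of the documented FLAGS_eager_delete_tensor_gb threshold.

    garbage_gb: size of memory garbage accumulated so far, in GB.
    eager_delete_tensor_gb: the flag value (hypothetical stand-in).
    """
    if eager_delete_tensor_gb < 0:
        # A negative value disables the garbage collection strategy.
        return False
    if garbage_gb <= 0:
        # Nothing to release yet.
        return False
    # Release once the accumulated garbage reaches the threshold.
    return garbage_gb >= eager_delete_tensor_gb


# New documented default (0.0): release as soon as any garbage exists.
release_with_new_default = should_release(0.001, 0.0)
# Old documented default (-1.0): never release.
release_with_old_default = should_release(0.001, -1.0)
```

Under this model, the default change from -1.0 to 0.0 switches the behaviour from "never release" to "release immediately", which matches the note recommending 0.0 for large networks.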
--- a/doc/fluid/flags/memory_en.rst
+++ b/doc/fluid/flags/memory_en.rst
@@ -7,17 +7,17 @@ FLAGS_allocator_strategy
 **************************************
 (since 1.2)

-Use to choose allocator strategy of PaddlePaddle. The allocator strategy is under development, and the non-legacy allocator is not stable yet.
+Used to choose the allocator strategy of PaddlePaddle. The auto growth allocator is not stable yet.

 Values accepted
 ---------------
-String, enum in ['legacy', 'naive_best_fit']. The default value is 'legacy'.
+String, enum in ['naive_best_fit', 'auto_growth']. The default value is 'naive_best_fit'.

 Example
 --------
-FLAGS_allocator_strategy=legacy would use the legacy allocator.
+FLAGS_allocator_strategy=naive_best_fit would use the pre-allocated best fit allocator.

-FLAGS_allocator_strategy=naive_best_fit would use the new-designed allocator.
+FLAGS_allocator_strategy=auto_growth would use the auto growth allocator.



@@ -40,15 +40,15 @@ FLAGS_eager_delete_tensor_gb
 *******************************************
 (since 1.0.0)

-Whether to use garbage collection strategy to optimize the memory usage of network. If FLAGS_eager_delete_tensor_gb >= 0, garbage collection strategy would be enabled, and collect memory garbages when running network, which is beneficial to saving memory usage. It is only useful when you use Executor to run program, or compile program, or compile program with data parallel. If FLAGS_eager_delete_tensor_gb < 0, garbage collection strategy is disabled. Garbage collector would not release memory garbages until the memory size of garbages reaches FLAGS_eager_delete_tensor_gb GB.
+Whether to use garbage collection strategy to optimize the memory usage of network. If FLAGS_eager_delete_tensor_gb < 0, garbage collection strategy is disabled. If FLAGS_eager_delete_tensor_gb >= 0, garbage collection strategy would be enabled, and collect memory garbages when running network, which is beneficial to saving memory usage. It is only useful when you use Executor to run program, or compile program, or compile program with data parallel. Garbage collector would not release memory garbages until the memory size of garbages reaches FLAGS_eager_delete_tensor_gb GB.

 Values accepted
 ---------------
-Double, in GB unit. The default value is -1.0.
+Double, in GB unit. The default value is 0.0.

 Example
 -------
-FLAGS_eager_delete_tensor_gb=0.0 would make memory garbage release immediately once it is not used. 
+FLAGS_eager_delete_tensor_gb=0.0 would release memory garbage once its size reaches 0.0GB, i.e., immediately once there is any garbage.

 FLAGS_eager_delete_tensor_gb=1.0 would make memory garbage release till the memory size of garbages reaches 1.0GB. 

@@ -59,75 +59,70 @@ Note
 It is recommended that users enable garbage collection strategy by setting FLAGS_eager_delete_tensor_gb=0.0 when training large network.


-
-FLAGS_enable_inplace_whitelist
+FLAGS_fast_eager_deletion_mode
 *******************************************
-(since 1.4)
+(since 1.3)

-Debug use to disable memory in-place in some ops. If set, some ops would not perform in-place optimization to save memory. These ops include: sigmoid, exp, relu, tanh, sqrt, ceil, floor, reciprocal, relu6, soft_relu, hard_sigmoid, batch_norm, batch_norm_grad, sum, sum_grad, scale, reshape, elementwise_add, and elementwise_add_grad.
+Whether to use fast garbage collection strategy. If not set, gpu memory would be released when CUDA kernel ends. Otherwise, gpu memory would be released without waiting for the CUDA kernel to end, making garbage collection strategy faster. Only valid when garbage collection strategy is enabled.

 Values accepted
 ---------------
-Bool. The default value is False.
+Bool. The default value is True.

 Example
 -------
-FLAGS_enable_inplace_whitelist=True would disable memory in-place optimization on certain ops.
-
+FLAGS_fast_eager_deletion_mode=True would turn on fast garbage collection strategy. 

+FLAGS_fast_eager_deletion_mode=False would turn off fast garbage collection strategy.

-FLAGS_fast_eager_deletion_mode
+FLAGS_fraction_of_cpu_memory_to_use
 *******************************************
-(since 1.3)
+(since 1.2.0)

-Whether to use fast garbage collection strategy. If not set, gpu memory would be released when CUDA kernel ends. Otherwise, gpu memory would be released without waiting CUDA kernel ends, making garbage collection strategy faster. Only valid when garbage collection strategy is enabled.
+Allocate a chunk of cpu memory that is this fraction of the total cpu memory size. Future memory usage will be allocated from the chunk. If the chunk doesn't have enough cpu memory, additional chunks of the same size will be requested from cpu until the cpu has no memory left for another chunk.

 Values accepted
 ---------------
-Bool. The default value is True.
+Double value in range [0, 1] which is the initial CPU memory percentage. The default value is 1.0.

 Example
 -------
-FLAGS_fast_eager_deletion_mode=True would turn on fast garbage collection strategy. 
+FLAGS_fraction_of_cpu_memory_to_use=0.1 will allocate 10% total cpu memory size as initial CPU chunk.

-FLAGS_fast_eager_deletion_mode=False would turn off fast garbage collection strategy.

-
-FLAGS_fraction_of_gpu_memory_to_use
+FLAGS_fraction_of_cuda_pinned_memory_to_use
 *******************************************
 (since 1.2.0)

-Allocate a chunk of gpu memory that is this fraction of the total gpu memory size. Future memory usage will be allocated from the chunk. If the chunk doesn't have enough gpu memory, additional chunks of the same size will be requested from gpu until the gpu has no memory left for another chunk.
+Allocate a chunk of CUDA pinned memory that is this fraction of the total cpu memory size. Future memory usage will be allocated from the chunk. If the chunk doesn't have enough cpu memory, additional chunks of the same size will be requested from cpu until the cpu has no memory left for another chunk.

 Values accepted
 ---------------
-Uint64 value greater than 0 which is the initial GPU memory percentage.
+Double value in range [0, 1] which is the initial CUDA pinned memory percentage. The default value is 0.5.

 Example
 -------
-FLAGS_fraction_of_gpu_memory_to_use=0.1 will allocate 10% total gpu memory size as initial GPU chunk.
+FLAGS_fraction_of_cuda_pinned_memory_to_use=0.1 will allocate 10% total cpu memory size as initial CUDA Pinned chunk.

-Note
-------
-Windows series platform will set FLAGS_fraction_of_gpu_memory_to_use to 0.5 by default.
-Linux will set FLAGS_fraction_of_gpu_memory_to_use to 0.92 by default.

-
-FLAGS_free_idle_memory
+FLAGS_fraction_of_gpu_memory_to_use
 *******************************************
-(since 0.15.0)
+(since 1.2.0)

-Whether to free idle memory pre-allocated from system during runtime. If set, free idle memory would be released if there is too much free idle memory in the pre-allocated allocator.
+Allocate a chunk of gpu memory that is this fraction of the available gpu memory size. Future memory usage will be allocated from the chunk. If the chunk doesn't have enough gpu memory, additional chunks of the same size will be requested from gpu until the gpu has no memory left for another chunk.

 Values accepted
 ---------------
-Bool. The default value is False.
+Double value in range [0, 1] which is the initial GPU memory percentage.

 Example
 -------
-FLAGS_free_idle_memory=True will free idle memory when there is too much of it. 
+FLAGS_fraction_of_gpu_memory_to_use=0.1 will allocate 10% available gpu memory size as initial GPU chunk.

-FLAGS_free_idle_memory=False will not free idle memory.
+Note
+-------
+Windows series platform will set FLAGS_fraction_of_gpu_memory_to_use to 0.5 by default.
+Linux will set FLAGS_fraction_of_gpu_memory_to_use to 0.92 by default.


 FLAGS_fuse_parameter_groups_size
@@ -213,20 +208,6 @@ If you set this flag, the memory size set by FLAGS_fraction_of_gpu_memory_to_use
 If you don't set this flag, PaddlePaddle will use FLAGS_fraction_of_gpu_memory_to_use to allocate gpu memory.


-FLAGS_limit_of_tmp_allocation
-*******************************************
-(since 1.3)
-
-The FLAGS_limit_of_tmp_allocation indicates the up limit of temporary_allocation size, the unit is byte. If the FLAGS_limit_of_tmp_allocation is -1, the size of temporary_allocation will not be limited.
-
-Values accepted
---------------
-Int64. The default value is -1.
-
-Example
-------
-FLAGS_limit_of_tmp_allocation=1024 will set the up limit of temporary_allocation size to 1024 bytes.
-

 FLAGS_memory_fraction_of_eager_deletion
 *******************************************
@@ -268,21 +249,6 @@ If this flag is set, PaddlePaddle will reallocate the gpu memory with size speci
 Else PaddlePaddle will reallocate with size set by FLAGS_fraction_of_gpu_memory_to_use.


-FLAGS_times_excess_than_required_tmp_allocation
-*******************************************
-(since 1.3)
-
-The FLAGS_times_excess_than_required_tmp_allocation indicates the max size the TemporaryAllocator can return. For Example
-, if the required memory size is N, and FLAGS_times_excess_than_required_tmp_allocation is 2.0, the TemporaryAllocator will return the available allocation that the range of size is N ~ 2*N.
-
-Values accepted
---------------
-Int64. The default value is 2.
-
-Example
-------
-FLAGS_times_excess_than_required_tmp_allocation=1024 will set the max size of the TemporaryAllocator can return to 1024*N.
-

 FLAGS_use_pinned_memory
 *******************************************
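
The three `FLAGS_fraction_of_*_memory_to_use` entries in this diff share one rule: the first chunk is a fraction of the memory in question, and further chunks of the same size are requested until there is no room for another one. A minimal sketch of that rule follows, assuming memory is measured in whole bytes and that every possible chunk ends up being requested. `chunk_plan` is a hypothetical helper; a real allocator requests chunks on demand, so this is an upper bound rather than a description of the implementation.

```python
from typing import List


def chunk_plan(available_bytes: int, fraction: float) -> List[int]:
    """Return the sizes of all same-size chunks that could be requested.

    available_bytes: memory the fraction is applied to (assumed value).
    fraction: the flag value, documented as a Double in [0, 1].
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    chunk = int(available_bytes * fraction)
    if chunk == 0:
        return []
    chunks = []
    remaining = available_bytes
    # Keep requesting chunks of the initial size until one no longer fits.
    while remaining >= chunk:
        chunks.append(chunk)
        remaining -= chunk
    return chunks


quarter_plan = chunk_plan(1024, 0.25)  # four chunks of 256 bytes
half_plan = chunk_plan(1000, 0.5)      # two chunks of 500 bytes
```

A small fraction gives many small chunks and leaves more memory for other processes early on, while a fraction close to 1 reserves almost everything in the first request.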