add flags file (#898)

* add flag files * add flags file * delete space

15957eec · xsrobin · GitHub · 85a250c9 · 15957eec · 15957eec
4 changed files
--- a/doc/fluid/api/index_en.rst
+++ b/doc/fluid/api/index_en.rst
@@ -5,6 +5,7 @@ API Reference
 ..  toctree::
    :maxdepth: 1
+    ../flags_en.rst
    ../api_guides/index_en.rst
    fluid.rst
    average.rst

--- a/doc/fluid/api_cn/index_cn.rst
+++ b/doc/fluid/api_cn/index_cn.rst
@@ -5,6 +5,7 @@ API
 ..  toctree::
    :maxdepth: 1
+    ../flags_cn.rst
    ../api_guides/index_cn.rst
    fluid_cn.rst
    average_cn.rst

--- a/doc/fluid/flags_cn.rst
+++ b/doc/fluid/flags_cn.rst
+环境变量FLAGS
+==================
+allocator_strategy
+********************
+(始于1.2)
+用于选择PaddlePaddle的分配器策略。 分配器策略正在开发中，且非legacy分配器尚未稳定。
+取值范围
+---------------
+String型，['legacy', 'naive_best_fit']中的一个。缺省值为'legacy'。
+示例
+--------
+FLAGS_allocator_strategy=legacy - 使用legacy分配器。
+FLAGS_allocator_strategy=naive_best_fit - 使用新设计的分配器。
+benchmark
+********************
+(始于0.12.0)
+用于基准测试。设置后，它将使局域删除同步，添加一些内存使用日志，并在内核启动后同步所有cuda内核。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_benchmark=True -  同步以测试基准。
+check_nan_inf
+********************
+(始于0.13.0)
+用于调试。它用于检查Operator的结果是否含有Nan或Inf。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_check_nan_inf=True - 检查Operator的结果是否含有Nan或Inf。
+communicator_fake_rpc
+**********************
+(始于1.5.0)
+当设为True时，通信器不会实际进行rpc调用，因此速度不会受到网络通信的影响。该flag用于调试。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_communicator_fake_rpc=True - 启用通信器fake模式。
+注释
+-------
+该flag仅用于paddlepaddle的开发者，普通用户不应对其设置。
+communicator_independent_recv_thread
+**************************************
+(始于1.5.0)
+使用独立线程以从参数服务器接收参数。
+取值范围
+---------------
+Bool型，缺省值为True。
+示例
+-------
+FLAGS_communicator_independent_recv_thread=True - 使用独立线程以从参数服务器接收参数。
+注释
+-------
+开发者使用该flag进行框架的调试与优化，普通用户不应对其设置。
+communicator_max_merge_var_num
+**************************************
+(始于1.5.0)
+要通过通信器合并为一个梯度并发送的最大梯度数。训练器将所有梯度放入队列，然后通信器将从队列中取出梯度并在合并后发送。
+取值范围
+---------------
+Int32型，缺省值为20。
+示例
+-------
+FLAGS_communicator_max_merge_var_num=16 - 将要通过通信器合并为一个梯度并发送的最大梯度数设为16。
+注释
+-------
+该flag和训练器线程数有着密切关联，缺省值应和线程数一致。
+communicator_min_send_grad_num_before_recv
+*******************************************
+(始于1.5.0)
+在通信器中，有一个发送线程向参数服务器发送梯度，一个接收线程从参数服务器接收参数，且它们之间彼此独立。该flag用于控制接收线程的频率。 仅当发送线程至少发送communicator_min_send_grad_num_before_recv数量的梯度时，接收线程才会从参数服务器接收参数。
+取值范围
+---------------
+Int32型，缺省值为20。
+示例
+-------
+FLAGS_communicator_min_send_grad_num_before_recv=10 - 在接收线程从参数服务器接收参数之前，发送线程发送的梯度数为10。
+注释
+-------
+由于该flag和训练器的训练线程数强相关，而每个训练线程都会发送其梯度，所以缺省值应和线程数一致。
+communicator_send_queue_size
+*******************************************
+(始于1.5.0)
+每个梯度的队列大小。训练器将梯度放入队列，然后通信器将其从队列中取出并发送出去。 当通信器很慢时，队列可能会满，训练器在队列有空间之前被持续阻塞。它用于避免训练比通信快得多，以致太多的梯度没有及时发出的情况。
+取值范围
+---------------
+Int32型，缺省值为20。
+示例
+-------
+FLAGS_communicator_send_queue_size=10 - 设置每个梯度的队列大小为10。
+注释
+-------
+该flag会影响训练速度，若队列大小过大，速度会变快但结果可能会变差。
+communicator_send_wait_times
+*******************************************
+(始于1.5.0)
+合并数没有达到max_merge_var_num的情况下发送线程等待的次数。
+取值范围
+---------------
+Int32型，缺省值为5。
+示例
+-------
+FLAGS_communicator_send_wait_times=5 - 将合并数没有达到max_merge_var_num的情况下发送线程等待的次数设为5。
+communicator_thread_pool_size
+*******************************************
+(始于1.5.0)
+设置用于发送梯度和接收参数的线程池大小。
+取值范围
+---------------
+Int32型，缺省值为5。
+示例
+-------
+FLAGS_communicator_thread_pool_size=10 - 设置线程池大小为10。
+注释
+-------
+大部分情况下，用户不需要设置该flag。
+conv_workspace_size_limit
+*******************************************
+(始于0.13.0)
+用于选择cuDNN卷积算法的工作区限制大小（单位为MB）。cuDNN的内部函数在这个内存限制范围内获得速度最快的匹配算法。通常，在较大的工作区内可以选择更快的算法，但同时也会显著增加内存空间。用户需要在内存和速度之间进行权衡。
+取值范围
+---------------
+Uint64型，缺省值为4096。即4G内存工作区。
+示例
+-------
+FLAGS_conv_workspace_size_limit=1024 - 将用于选择cuDNN卷积算法的工作区限制大小设置为1024MB。
+cpu_deterministic
+*******************************************
+(始于0.15.0)
+该flag用于调试。它表示是否在CPU侧确定计算结果。 在某些情况下，不同求和次序的结果可能不同，例如，`a+b+c+d` 的结果可能与 `c+a+b+d` 的结果不同。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_cpu_deterministic=True - 在CPU侧确定计算结果。
+cudnn_batchnorm_spatial_persistent
+*******************************************
+(始于1.4.0)
+表示是否在batchnorm中使用新的批量标准化模式CUDNN_BATCHNORM_SPATIAL_PERSISTENT函数。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_cudnn_batchnorm_spatial_persistent=True - 开启CUDNN_BATCHNORM_SPATIAL_PERSISTENT模式。
+注释
+-------
+此模式在某些任务中可以更快，因为将为CUDNN_DATA_FLOAT和CUDNN_DATA_HALF数据类型选择优化路径。我们默认将其设置为False的原因是此模式可能使用原子整数缩减(scaled atomic integer reduction)而导致某些输入数据范围的数字溢出。
+cudnn_deterministic
+*******************************************
+(始于0.13.0)
+cuDNN对于同一操作有几种算法，一些算法结果是非确定性的，如卷积算法。该flag用于调试。它表示是否选择cuDNN中的确定性函数。 
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_cudnn_deterministic=True - 选择cuDNN中的确定性函数。
+注释
+-------
+现在，在cuDNN卷积和池化Operator中启用此flag。确定性算法速度可能较慢，因此该flag通常用于调试。
+cudnn_exhaustive_search
+*******************************************
+(始于1.2.0)
+表示是否使用穷举搜索方法来选择卷积算法。在cuDNN中有两种搜索方法，启发式搜索和穷举搜索。穷举搜索尝试所有cuDNN算法以选择其中最快的算法。此方法非常耗时，所选择的算法将针对给定的层规格进行缓存。 一旦更改了图层规格（如batch大小，feature map大小），它将再次搜索。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_cudnn_exhaustive_search=True - 使用穷举搜索方法来选择卷积算法。
+dist_threadpool_size
+*******************************************
+(始于1.0.0)
+控制用于分布式模块的线程数。如果未设置，则将其设置为硬件线程数。
+取值范围
+---------------
+Int32型，缺省值为0。
+示例
+-------
+FLAGS_dist_threadpool_size=10 - 将用于分布式模块的最大线程数设为10。
+eager_delete_scope
+*******************************************
+(始于0.12.0)
+同步局域删除。设置后，它将降低GPU内存使用量，但同时也会减慢销毁变量的速度（性能损害约1％）。
+取值范围
+---------------
+Bool型，缺省值为True。
+示例
+-------
+FLAGS_eager_delete_scope=True - 同步局域删除。
+eager_delete_tensor_gb
+*******************************************
+(始于1.0.0)
+表示是否使用垃圾回收策略来优化网络的内存使用。如果FLAGS_eager_delete_tensor_gb >= 0，则启用垃圾回收策略，并在运行网络时回收内存垃圾，这有利于节省内存使用量。它仅在您使用Executor运行程序、编译程序或使用并行数据编译程序时才有用。如果FLAGS_eager_delete_tensor_gb < 0，则禁用垃圾回收策略。垃圾回收器直到垃圾的内存大小达到FLAGS_eager_delete_tensor_gb GB时才会释放内存垃圾。
+取值范围
+---------------
+Double型，单位为GB，缺省值为-1.0。
+示例
+-------
+FLAGS_eager_delete_tensor_gb=0.0 - 一旦不再使用即释放内存垃圾。
+FLAGS_eager_delete_tensor_gb=1.0 - 垃圾占用内存大小达到1.0GB时释放内存垃圾。
+FLAGS_eager_delete_tensor_gb=-1.0 - 禁用垃圾回收策略。    
+注释
+-------
+建议用户在训练大型网络时设置FLAGS_eager_delete_tensor_gb=0.0以启用垃圾回收策略。
+enable_cublas_tensor_op_math
+*******************************************
+(始于1.2.0)
+该flag表示是否使用Tensor Core，但可能会因此降低部分精确度。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_enable_cublas_tensor_op_math=True - 使用Tensor Core。
+enable_inplace_whitelist
+*******************************************
+(始于1.4)
+该flag用于调试，在某些ops中禁止内存原位复用。设置后，一些ops不会执行原位复用优化以节省内存。这些Ops包括：sigmoid, exp, relu, tanh, sqrt, ceil, floor, reciprocal, relu6, soft_relu, hard_sigmoid, batch_norm, batch_norm_grad, sum, sum_grad, scale, reshape, elementwise_add和elementwise_add_grad。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_enable_inplace_whitelist=True - 在特定op上禁止内存原位复用优化。
+enable_parallel_graph
+*******************************************
+(始于1.2.0)
+该flag用于ParallelExecutor以禁用并行图执行模式。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_enable_parallel_graph=False - 通过ParallelExecutor强制禁用并行图执行模式。
+enable_rpc_profiler
+*******************************************
+(始于1.0.0)
+是否启用RPC分析器。
+取值范围
+----------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_enable_rpc_profiler=True - 启用RPC分析器并在分析器文件中记录时间线。
+fast_eager_deletion_mode
+*******************************************
+(始于1.3)
+是否使用快速垃圾回收策略。如果未设置，则在CUDA内核结束时释放gpu内存。否则gpu内存将在CUDA内核尚未结束的情况下被释放，从而使垃圾回收策略更快。仅在启用垃圾回收策略时有效。
+取值范围
+---------------
+Bool型，缺省值为True。
+示例
+-------
+FLAGS_fast_eager_deletion_mode=True - 启用快速垃圾回收策略。
+FLAGS_fast_eager_deletion_mode=False - 禁用快速垃圾回收策略。
+fraction_of_gpu_memory_to_use
+*******************************************
+(始于1.2.0)
+表示分配的内存块占GPU总内存大小的比例。将来的内存使用将从该内存块分配。 如果内存块没有足够的gpu内存，将从gpu请求分配与内存块同样大小的新的内存块，直到gpu没有足够的内存为止。
+取值范围
+---------------
+Uint64型，大于0，表示初始分配的内存块占GPU内存的比例。
+示例
+-------
+FLAGS_fraction_of_gpu_memory_to_use=0.1 - 分配总GPU内存大小的10%作为初始GPU 内存块。
+注释
+-------
+Windows系列平台会将FLAGS_fraction_of_gpu_memory_to_use默认设为0.5，Linux则会默认设为0.92。
+free_idle_memory
+*******************************************
+(始于0.15.0)
+是否在运行时释放从系统预分配的空闲内存。设置后，如果预分配的分配器中有太多空闲内存，则释放空闲内存。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_free_idle_memory=True - 空闲内存太多时释放。
+FLAGS_free_idle_memory=False - 不释放空闲内存。
+fuse_parameter_groups_size
+*******************************************
+(始于1.4.0)
+FLAGS_fuse_parameter_groups_size表示每一组中参数的个数。缺省值是一个经验性的结果。如果fuse_parameter_groups_size为1，则表示组的大小和参数梯度的数目一致。 如果fuse_parameter_groups_size为-1，则表示只有一个组。缺省值为3，这只是一个经验值。
+取值范围
+---------------
+Int32型，缺省值为3。
+示例
+-------
+FLAGS_fuse_parameter_groups_size=3 - 将单组参数的梯度大小设为3。
+fuse_parameter_memory_size
+*******************************************
+(始于1.5.0)
+FLAGS_fuse_parameter_memory_size表示作为通信调用输入（例如NCCLAllReduce）的单组参数梯度的上限内存大小。默认值为-1.0，表示不根据memory_size设置组。单位是MB。
+取值范围
+---------------
+Double型，缺省值为-1.0。
+示例
+-------
+FLAGS_fuse_parameter_memory_size=16 - 将单组参数梯度的上限大小设为16MB。
+init_allocated_mem
+*******************************************
+(始于0.15.0)
+是否对分配的内存进行非零值初始化。该flag用于调试，以防止某些Ops假定已分配的内存都是初始化为零的。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_init_allocated_mem=True - 对分配的内存进行非零初始化。
+FLAGS_init_allocated_mem=False - 不会对分配的内存进行非零初始化。
+initial_cpu_memory_in_mb
+*******************************************
+(始于0.14.0)
+初始PaddlePaddle分配器的CPU内存块大小，单位为MB。分配器将FLAGS_initial_cpu_memory_in_mb和FLAGS_fraction_of_cpu_memory_to_use*（总物理内存）的最小值作为内存块大小。
+取值范围
+---------------
+Uint64型，缺省值为500，单位为MB。
+示例
+-------
+FLAGS_initial_cpu_memory_in_mb=100 - 在FLAGS_fraction_of_cpu_memory_to_use*（总物理内存）大于100MB的情况下，首次提出分配请求时，分配器预先分配100MB内存，并在预分配的内存耗尽时再次分配100MB。
+initial_gpu_memory_in_mb
+*******************************************
+(始于1.4.0)
+分配一块指定大小的GPU内存块。之后的内存使用将从该内存块分配。如果内存块没有足够的gpu内存，将从gpu请求大小为FLAGS_reallocate_gpu_memory_in_mb的内存块，直到gpu没有剩余内存为止。
+取值范围
+---------------
+Uint64型，大于0，为初始GPU内存大小，单位为MB。
+示例
+-------
+FLAGS_initial_gpu_memory_in_mb=4096 - 分配4GB作为初始GPU内存块大小。
+注释
+-------
+如果设置该flag，则FLAGS_fraction_of_gpu_memory_to_use设置的内存大小将被该flag覆盖。如果未设置该flag，PaddlePaddle将使用FLAGS_fraction_of_gpu_memory_to_use分配GPU内存。
+inner_op_parallelism
+*******************************************
+(始于1.3.0)
+大多数Operators都在单线程模式下工作，但对于某些Operators，使用多线程更合适。 例如，优化稀疏梯度的优化Op使用多线程工作会更快。该flag用于设置Op内的线程数。
+取值范围
+---------------
+Int32型，缺省值为0，这意味着operator将不会在多线程模式下运行。
+示例
+-------
+FLAGS_inner_op_parallelism=5 - 将operator内的线程数设为5。
+注释
+-------
+目前只有稀疏的adam op支持inner_op_parallelism。
+limit_of_tmp_allocation
+*******************************************
+(始于1.3)
+FLAGS_limit_of_tmp_allocation表示temporary_allocation大小的上限，单位为字节。如果FLAGS_limit_of_tmp_allocation为-1，temporary_allocation的大小将没有限制。
+取值范围
+---------------
+Int64型，缺省值为-1。
+示例
+-------
+FLAGS_limit_of_tmp_allocation=1024 - 将temporary_allocation大小的上限设为1024字节。
+max_body_size
+*******************************************
+(始于1.0.0)
+控制BRPC中的最大消息大小。
+取值范围
+---------------
+Int32型，缺省值为2147483647。
+示例
+-------
+FLAGS_max_body_size=2147483647 - 将BRPC消息大小设为2147483647。
+memory_fraction_of_eager_deletion
+*******************************************
+(始于1.4)
+垃圾回收策略释放变量的内存大小百分比。如果FLAGS_memory_fraction_of_eager_deletion = 1.0，则将释放网络中的所有临时变量。如果FLAGS_memory_fraction_of_eager_deletion = 0.0，则不会释放网络中的任何临时变量。如果0.0<FLAGS_memory_fraction_of_eager_deletion<1.0，则所有临时变量将根据其内存大小降序排序，并且仅释放具有最大内存大小的FLAGS_memory_fraction_of_eager_deletion比例的变量。该flag仅在运行并行数据编译程序时有效。
+取值范围
+---------------
+Double型，范围为[0.0, 1.0]，缺省值为1.0。
+示例
+-------
+FLAGS_memory_fraction_of_eager_deletion=0 - 保留所有临时变量，也就是禁用垃圾回收策略。
+FLAGS_memory_fraction_of_eager_deletion=1 - 释放所有临时变量。   
+FLAGS_memory_fraction_of_eager_deletion=0.5 - 仅释放50%比例的占用内存最多的变量。
+multiple_of_cupti_buffer_size
+*******************************************
+(始于1.4.0)
+该flag用于分析。它表示CUPTI设备缓冲区大小的倍数。如果在profiler过程中程序挂掉或者在chrome://tracing中加载timeline文件时出现异常，请尝试增大此值。
+取值范围
+---------------
+Int32型，缺省值为1。
+示例
+-------
+FLAGS_multiple_of_cupti_buffer_size=1 - 将CUPTI设备缓冲区大小的倍数设为1。
+paddle_num_threads
+*******************************************
+(始于0.15.0)
+控制每个paddle实例的线程数。
+取值范围
+---------------
+Int32型，缺省值为1。
+示例
+-------
+FLAGS_paddle_num_threads=2 - 将每个实例的最大线程数设为2。
+pe_profile_fname
+*******************************************
+(始于1.3.0)
+该flag用于ParallelExecutor的调试。ParallelExecutor会通过gperftools生成配置文件结果，并将结果存储在FLAGS_pe_profile_fname指定的文件中。仅在编译选项选择 `WITH_PROFILER=ON` 时有效。如果禁用则设为empty。
+取值范围
+---------------
+String型，缺省值为empty ("")。
+示例
+-------
+FLAGS_pe_profile_fname="./parallel_executor.perf" - 将配置文件结果存储在parallel_executor.perf中。
+print_sub_graph_dir
+*******************************************
+(始于1.2.0)
+该flag用于调试。如果程序中转换图的某些子图失去连接，则结果可能会出错。我们可以将这些断开连接的子图打印到该flag指定的文件中。如果禁用则设为empty。
+取值范围
+---------------
+String型，缺省值为empty ("")。
+示例
+-------
+FLAGS_print_sub_graph_dir="./sub_graphs.txt" - 将断开连接的子图打印到"./sub_graphs.txt"。
+reader_queue_speed_test_mode
+*******************************************
+(始于1.1.0)
+将pyreader数据队列设置为测试模式。在测试模式下，pyreader将缓存一些数据，然后执行器将读取缓存的数据，因此阅读器不会成为瓶颈。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_reader_queue_speed_test_mode=True - 启用pyreader测试模式。
+注释
+-------
+仅当使用py_reader时该flag才有效。
+reallocate_gpu_memory_in_mb
+*******************************************
+(始于1.4.0)
+如果耗尽了分配的GPU内存块，则重新分配额外的GPU内存块。
+取值范围
+---------------
+Int64型，大于0，单位为MB。
+示例
+-------
+FLAGS_reallocate_gpu_memory_in_mb=1024 - 如果耗尽了分配的GPU内存块，重新分配1GB。
+注释
+-------
+如果设置了该flag，PaddlePaddle将重新分配该flag指定大小的gpu内存。否则分配FLAGS_fraction_of_gpu_memory_to_use指定比例的gpu内存。
+rpc_deadline
+*******************************************
+(始于1.0.0)
+它控制rpc通信的deadline超时。
+取值范围
+---------------
+Int32型，缺省值为180000，单位为ms。
+示例
+-------
+FLAGS_rpc_deadline=180000 - 将deadline超时设为3分钟。
+rpc_disable_reuse_port
+*******************************************
+(始于1.2.0)
+rpc_disable_reuse_port为True时，grpc的 GRPC_ARG_ALLOW_REUSEPORT会被设置为False以禁用SO_REUSEPORT。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_rpc_disable_reuse_port=True - 禁用SO_REUSEPORT。
+rpc_get_thread_num
+*******************************************
+(始于1.0.0)
+它控制用于从参数服务器获取参数的线程数。
+取值范围
+---------------
+Int32型，缺省值为12。
+示例
+-------
+FLAGS_rpc_get_thread_num=6 - 将从参数服务器获取参数的线程数设为6。
+rpc_send_thread_num
+*******************************************
+(始于1.0.0)
+它控制用于发送rpc的线程数。
+取值范围
+---------------
+Int32型，缺省值为12。
+示例
+-------
+FLAGS_rpc_send_thread_num=6 - 将用于发送的线程数设为6。
+rpc_server_profile_path
+*******************************************
+(始于0.15.0)
+设置分析器输出日志文件路径前缀。完整路径为rpc_server_profile_path_listener_id，其中listener_id为随机数。 
+取值范围
+---------------
+String型，缺省值为"./profile_ps"。
+示例
+-------
+FLAGS_rpc_server_profile_path="/tmp/pserver_profile_log" - 在"/tmp/pserver_profile_log_listener_id"中生成配置日志文件。
+selected_gpus
+*******************************************
+(始于1.3)
+设置用于训练或预测的GPU设备。
+取值范围
+---------------
+以逗号分隔的设备ID列表，其中每个设备ID是一个非负整数，且应小于您的机器拥有的GPU设备总数。
+示例
+-------
+FLAGS_selected_gpus=0,1,2,3,4,5,6,7 - 令0-7号GPU设备用于训练和预测。
+注释
+-------
+使用该flag的原因是我们希望在GPU设备之间使用聚合通信，但通过CUDA_VISIBLE_DEVICES只能使用共享内存。
+sync_nccl_allreduce
+*******************************************
+(始于1.3)
+如果FLAGS_sync_nccl_allreduce为True，则会在allreduce_op_handle中调用 `cudaStreamSynchronize(nccl_stream)` ，这种模式在某些情况下可以获得更好的性能。
+取值范围
+---------------
+Bool型，缺省值为True。
+示例
+-------
+FLAGS_sync_nccl_allreduce=True - 在allreduce_op_handle中调用 `cudaStreamSynchronize(nccl_stream)` 。
+times_excess_than_required_tmp_allocation
+*******************************************
+(始于1.3)
+FLAGS_times_excess_than_required_tmp_allocation表示TemporaryAllocator可以返回的最大大小。例如，如果所需的内存大小为N，且times_excess_than_required_tmp_allocation为2.0，则TemporaryAllocator将返回大小范围为N~2*N的可用分配。
+取值范围
+---------------
+Int64型，缺省值为2。
+示例
+-------
+FLAGS_times_excess_than_required_tmp_allocation=1024 - 设置TemporaryAllocator可以返回的最大大小为1024*N。
+tracer_profile_fname
+*******************************************
+(始于1.4.0)
+FLAGS_tracer_profile_fname表示由gperftools生成的命令式跟踪器的分析器文件名。仅在编译选项选择 `WITH_PROFILER=ON` 时有效。如果禁用则设为empty。
+取值范围
+---------------
+String型，缺省值为("gperf")。
+示例
+-------
+FLAGS_tracer_profile_fname="gperf_profile_file" - 将命令式跟踪器的分析器文件名设为"gperf_profile_file"。
+use_mkldnn
+*******************************************
+(始于0.13.0)
+在预测或训练过程中，可以通过该选项选择使用Intel MKL-DNN（https://github.com/intel/mkl-dnn）库运行。
+“用于深度神经网络的英特尔（R）数学核心库（Intel(R) MKL-DNN）”是一个用于深度学习应用程序的开源性能库。该库加速了英特尔（R）架构上的深度学习应用程序和框架。Intel MKL-DNN包含矢量化和线程化构建块，您可以使用它们来实现具有C和C++接口的深度神经网络（DNN）。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_use_mkldnn=True - 开启使用MKL-DNN运行。
+注释
+-------
+FLAGS_use_mkldnn仅用于python训练和预测脚本。要在CAPI中启用MKL-DNN，请设置选项 -DWITH_MKLDNN=ON。
+英特尔MKL-DNN支持英特尔64架构和兼容架构。
+该库对基于以下设备的系统进行了优化：
+英特尔SSE4.1支持的英特尔凌动（R）处理器；
+第4代，第5代，第6代，第7代和第8代英特尔（R）Core（TM）处理器；
+英特尔（R）Xeon（R）处理器E3，E5和E7系列（原Sandy Bridge，Ivy Bridge，Haswell和Broadwell）；
+英特尔（R）Xeon（R）可扩展处理器（原Skylake和Cascade Lake）；
+英特尔（R）Xeon Phi（TM）处理器（原Knights Landing和Knights Mill）；
+兼容处理器。
+use_ngraph
+*******************************************
+(始于1.4.0)
+在预测或训练过程中，可以通过该选项选择使用英特尔nGraph（https://github.com/NervanaSystems/ngraph）引擎。它将在英特尔Xeon CPU上获得很大的性能提升。
+取值范围
+---------------
+Bool型，缺省值为False。
+示例
+-------
+FLAGS_use_ngraph=True - 开启使用nGraph运行。
+注释
+-------
+英特尔nGraph目前仅在少数模型中支持。我们只验证了 `ResNet-50 <https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/image_classification/README_ngraph.md>`_ 的训练和预测。
+use_pinned_memory
+*******************************************
+(始于0.12.0)
+是否使用pinned memory。设为True后，CPU分配器将调用mlock来锁定内存页。
+取值范围
+---------------
+Bool型，缺省值为True。
+示例
+-------
+FLAGS_use_pinned_memory=True - 锁定分配的CPU内存页面。
--- a/doc/fluid/flags_en.rst
+++ b/doc/fluid/flags_en.rst
+==================
+FLAGS
+==================
+allocator_strategy
+**************************************
+(since 1.2)
+Selects the allocator strategy of PaddlePaddle. The allocator strategy is under development, and the non-legacy allocator is not yet stable.
+Values accepted
+---------------
+String, one of ['legacy', 'naive_best_fit']. The default value is 'legacy'.
+Example
+--------
+FLAGS_allocator_strategy=legacy uses the legacy allocator.
+FLAGS_allocator_strategy=naive_best_fit uses the newly designed allocator.
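The examples in this file are written as FLAGS_name=value assignments. A minimal sketch of setting them from Python follows, assuming the flags are read from environment variables when the framework starts. The helper `set_paddle_flag` is hypothetical and exists only for illustration.

```python
import os


def set_paddle_flag(name, value, env=None):
    """Export one flag as the FLAGS_<name> environment variable.

    Hypothetical helper for illustration; it is not part of PaddlePaddle.
    """
    env = os.environ if env is None else env
    key = "FLAGS_" + name
    env[key] = str(value)
    return key


# Assumption: flags are read from the environment at framework start-up,
# so they are set here before any framework import.
set_paddle_flag("allocator_strategy", "legacy")
set_paddle_flag("benchmark", True)
```

The same effect can be had from a shell by exporting the variables before launching the training script.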
+benchmark
+**************************************
+(since 0.12.0)
+Used for benchmarking. If set, it makes scope deletion synchronous, adds some memory usage logs, and synchronizes all CUDA kernels after they are launched.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_benchmark=True enables the synchronizations used for benchmarking.
+check_nan_inf
+**************************************
+(since 0.13.0)
+This flag is used for debugging. It checks whether the result of an Operator contains NaN or Inf.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_check_nan_inf=True checks whether the result of an Operator contains NaN or Inf.
+communicator_fake_rpc
+**************************************
+(since 1.5.0)
+When set to True, the communicator does not actually make RPC calls, so the speed is not affected by network communication. This flag is used for debugging.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_communicator_fake_rpc=True enables the communicator's fake mode.
+Note
+-------
+This flag is only for PaddlePaddle developers; users should not set it.
+communicator_independent_recv_thread
+**************************************
+(since 1.5.0)
+Use an independent thread to receive parameters from the parameter server.
+Values accepted
+---------------
+Bool. The default value is True.
+Example
+-------
+FLAGS_communicator_independent_recv_thread=True uses an independent thread to receive parameters from the parameter server.
+Note
+-------
+This flag is for developers to debug and optimize the framework. Users should not set it.
+communicator_max_merge_var_num
+**************************************
+(since 1.5.0)
+The maximum number of gradients that the communicator merges and sends as one gradient. The trainer puts all gradients into a queue; the communicator then takes the gradients out of the queue and merges them before sending.
+Values accepted
+---------------
+Int32. The default value is 20.
+Example
+-------
+FLAGS_communicator_max_merge_var_num=16 sets the maximum number of gradients merged and sent as one gradient to 16.
+Note
+-------
+This flag is closely related to the number of trainer threads. The default value should equal the number of threads.
+communicator_min_send_grad_num_before_recv
+*******************************************
+(since 1.5.0)
+In the communicator, one send thread sends gradients to the parameter server and one receive thread receives parameters from the parameter server. They work independently. This flag controls how often the receive thread runs: the receive thread receives parameters from the parameter server only after the send thread has sent at least communicator_min_send_grad_num_before_recv gradients.
+Values accepted
+---------------
+Int32. The default value is 20.
+Example
+-------
+FLAGS_communicator_min_send_grad_num_before_recv=10 requires the send thread to send 10 gradients before the receive thread receives parameters from the parameter server.
+Note
+-------
+This flag is closely related to the number of training threads of the trainer, because each training thread sends its own gradients. The default value should therefore equal the number of training threads.
+communicator_send_queue_size
+*******************************************
+(since 1.5.0)
+The queue size for each gradient. The trainer puts gradients into a queue, and the communicator takes them out of the queue and sends them. When the communicator is slow, the queue may fill up, and the trainer is then blocked until the queue has space. This prevents training from running much faster than communication, which would leave too many gradients unsent.
+Values accepted
+---------------
+Int32. The default value is 20.
+Example
+-------
+FLAGS_communicator_send_queue_size=10 sets the queue size for each gradient to 10.
+Note
+-------
+This flag affects training speed: a larger queue may make training faster, but may make the result worse.
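The back-pressure described for communicator_send_queue_size can be sketched with a bounded queue from the Python standard library. This is an illustrative model only, not the communicator's implementation; `SEND_QUEUE_SIZE`, `grad_queue`, and `trainer_put` are hypothetical names.

```python
import queue

# Stands in for FLAGS_communicator_send_queue_size (kept small for the demo).
SEND_QUEUE_SIZE = 2

grad_queue = queue.Queue(maxsize=SEND_QUEUE_SIZE)


def trainer_put(grad):
    """Try to enqueue a gradient; return False when the trainer would block."""
    try:
        grad_queue.put_nowait(grad)
        return True
    except queue.Full:
        return False


# The trainer produces three gradients while the communicator sends nothing.
accepted = [trainer_put(g) for g in ("g0", "g1", "g2")]

# The communicator sends one gradient, which frees a slot for the trainer.
sent = grad_queue.get_nowait()
accepted_after_send = trainer_put("g2")
```

A real trainer would call a blocking `put` and wait rather than give up; the non-blocking variant is used here only so that the full-queue case is visible.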
+communicator_send_wait_times
+*******************************************
+(since 1.5.0)
+The number of times the send thread waits if the merge number has not reached max_merge_var_num.
+Values accepted
+---------------
+Int32. The default value is 5.
+Example
+-------
+FLAGS_communicator_send_wait_times=5 sets the number of times the send thread waits, when the merge number has not reached max_merge_var_num, to 5.
+communicator_thread_pool_size
+*******************************************
+(since 1.5.0)
+Sets the size of the thread pool used to send gradients and receive parameters.
+Values accepted
+---------------
+Int32. The default value is 5.
+Example
+-------
+FLAGS_communicator_thread_pool_size=10 sets the thread pool size to 10.
+Note
+-------
+Most of the time, users do not need to set this flag.
+conv_workspace_size_limit
+*******************************************
+(since 0.13.0)
+The workspace size limit, in MB, for choosing cuDNN convolution algorithms. The inner function of cuDNN obtains the fastest suitable algorithm that fits within this memory limit. Usually, a larger workspace allows faster algorithms to be chosen, but significantly increases memory usage. Users need to trade off memory against speed.
+Values accepted
+---------------
+Uint64. The default value is 4096, that is, a 4 GB memory workspace.
+Example
+-------
+FLAGS_conv_workspace_size_limit=1024 sets the workspace size limit for choosing cuDNN convolution algorithms to 1024 MB.
+cpu_deterministic
+*******************************************
+(since 0.15.0)
+This flag is used for debugging. It indicates whether to make the result of computation deterministic on the CPU side. In some cases, summing in a different order may give a different result; for example, the result of `a+b+c+d` may differ from the result of `c+a+b+d`.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_cpu_deterministic=True makes the result of computation deterministic on the CPU side.
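The order dependence mentioned for cpu_deterministic comes from floating-point rounding. The following sketch shows it with plain Python floats (IEEE 754 doubles); the values are chosen for illustration and are not taken from PaddlePaddle.

```python
# Four values whose exact mathematical sum is 2.0.
a, b, c, d = 1e16, 1.0, -1e16, 1.0

# a + b + c + d, evaluated left to right: the first 1.0 is lost when it is
# added to 1e16, because neighbouring doubles near 1e16 are 2.0 apart.
sum_abcd = ((a + b) + c) + d

# c + a + b + d: the large values cancel first, so both 1.0 terms survive.
sum_cabd = ((c + a) + b) + d

print(sum_abcd, sum_cabd)
```

The two sums differ even though they add the same four numbers, which is why a fixed summation order is needed for reproducible results.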
+cudnn_batchnorm_spatial_persistent
+*******************************************
+(since 1.4.0)
+Indicates whether batchnorm uses the new batch normalization mode CUDNN_BATCHNORM_SPATIAL_PERSISTENT.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_cudnn_batchnorm_spatial_persistent=True enables the CUDNN_BATCHNORM_SPATIAL_PERSISTENT mode.
+Note
+-------
+This mode can be faster in some tasks because an optimized path is selected for the CUDNN_DATA_FLOAT and CUDNN_DATA_HALF data types. It is set to False by default because this mode may use scaled atomic integer reduction, which may cause numerical overflow for some input data ranges.
+cudnn_deterministic
+*******************************************
+(since 0.13.0)
+cuDNN has several algorithms for the same operation, and some of them, such as convolution algorithms, produce non-deterministic results. This flag is used for debugging. It indicates whether to choose the deterministic functions in cuDNN.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_cudnn_deterministic=True chooses the deterministic functions in cuDNN.
+Note
+-------
+Currently this flag takes effect in the cuDNN convolution and pooling operators. The deterministic algorithms may be slower, so this flag is generally used for debugging.
+cudnn_exhaustive_search
+*******************************************
+(since 1.2.0)
+Whether to use the exhaustive search method to choose convolution algorithms. cuDNN has two search methods, heuristic search and exhaustive search. Exhaustive search tries all cuDNN algorithms and chooses the fastest one. This method is time-consuming, so the chosen algorithm is cached for the given layer specifications. Once the layer specifications (such as batch size or feature map size) change, it searches again.
+Values accepted
+---------------
+Bool. The default value is False. 
+Example
+-------
+FLAGS_cudnn_exhaustive_search=True uses the exhaustive search method to choose convolution algorithms.
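The search-then-cache behaviour described for cudnn_exhaustive_search can be sketched as a dictionary keyed by layer specification. This is a simplified model, not cuDNN or PaddlePaddle code; `pick_algorithm`, `fake_benchmark`, and the algorithm names are hypothetical.

```python
algo_cache = {}   # layer specification -> name of the fastest algorithm
search_count = 0  # how many exhaustive searches have actually run


def pick_algorithm(layer_spec, benchmark):
    """Return the fastest algorithm for layer_spec, searching only on a cache miss.

    benchmark(layer_spec) is a hypothetical callback returning a dict that
    maps each algorithm name to its measured time in seconds.
    """
    global search_count
    if layer_spec not in algo_cache:
        search_count += 1
        timings = benchmark(layer_spec)
        algo_cache[layer_spec] = min(timings, key=timings.get)
    return algo_cache[layer_spec]


def fake_benchmark(layer_spec):
    """Made-up timings: algo_a speeds up with batch size, algo_b is constant."""
    batch, _height, _width = layer_spec
    return {"algo_a": 1.0 / batch, "algo_b": 0.05}


first = pick_algorithm((8, 32, 32), fake_benchmark)    # cache miss: searches
again = pick_algorithm((8, 32, 32), fake_benchmark)    # cache hit: no search
changed = pick_algorithm((64, 32, 32), fake_benchmark) # new batch size: searches
```

Changing the batch size produces a new cache key, so the search runs again, which matches the description of the layer specifications changing.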
+dist_threadpool_size
+*******************************************
+(since 1.0.0)
+Controls the number of threads used for the distributed module. If it is not set, it is set to the number of hardware threads.
+Values accepted
+---------------
+Int32. The default value is 0.
+Example
+-------
+FLAGS_dist_threadpool_size=10 sets the maximum number of threads used for the distributed module to 10.
+eager_delete_scope
+*******************************************
+(since 0.12.0)
+Makes scope deletion synchronous. If set, it reduces GPU memory usage but slows down the destruction of variables (around 1% performance loss).
+Values accepted
+---------------
+Bool. The default value is True.
+Example
+-------
+FLAGS_eager_delete_scope=True makes scope deletion synchronous.
+eager_delete_tensor_gb
+*******************************************
+(since 1.0.0)
+Whether to use the garbage collection strategy to optimize the memory usage of the network. If FLAGS_eager_delete_tensor_gb >= 0, the garbage collection strategy is enabled and memory garbage is collected while the network runs, which helps save memory. It is only useful when you use Executor to run a program, a compiled program, or a compiled program with data parallel. If FLAGS_eager_delete_tensor_gb < 0, the garbage collection strategy is disabled. The garbage collector does not release memory garbage until its total size reaches FLAGS_eager_delete_tensor_gb GB.
+Values accepted
+---------------
+Double, in GB unit. The default value is -1.0.
+Example
+-------
+FLAGS_eager_delete_tensor_gb=0.0 releases memory garbage as soon as it is no longer used.
+FLAGS_eager_delete_tensor_gb=1.0 releases memory garbage once its total size reaches 1.0GB.
+FLAGS_eager_delete_tensor_gb=-1.0 disables the garbage collection strategy.
+Note
+-------
+Users are recommended to enable the garbage collection strategy by setting FLAGS_eager_delete_tensor_gb=0.0 when training large networks.
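The three cases of FLAGS_eager_delete_tensor_gb can be summarized as a threshold test. The function below is an illustrative model of the rule as described above, not the garbage collector's actual code; `gc_should_release` and its parameters are hypothetical names.

```python
def gc_should_release(pending_garbage_gb, eager_delete_tensor_gb):
    """Model of the release rule: True when pending garbage would be freed.

    pending_garbage_gb is the size of unused tensors waiting to be freed;
    eager_delete_tensor_gb stands in for FLAGS_eager_delete_tensor_gb.
    """
    if eager_delete_tensor_gb < 0:
        # A negative value disables the garbage collection strategy.
        return False
    # Otherwise garbage is held until it reaches the threshold; a threshold
    # of 0.0 therefore releases garbage as soon as it exists.
    return pending_garbage_gb >= eager_delete_tensor_gb


immediate = gc_should_release(0.01, 0.0)   # threshold 0.0
below = gc_should_release(0.5, 1.0)        # threshold not reached yet
reached = gc_should_release(1.0, 1.0)      # threshold reached
disabled = gc_should_release(8.0, -1.0)    # strategy disabled
```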
+enable_cublas_tensor_op_math
+*******************************************
+(since 1.2.0)
+This flag indicates whether to use Tensor Core; using it may lose some precision.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_enable_cublas_tensor_op_math=True uses Tensor Core.
+enable_inplace_whitelist
+*******************************************
+(since 1.4)
+Used for debugging, to disable in-place memory reuse in some ops. If set, some ops do not perform the in-place optimization that saves memory. These ops include: sigmoid, exp, relu, tanh, sqrt, ceil, floor, reciprocal, relu6, soft_relu, hard_sigmoid, batch_norm, batch_norm_grad, sum, sum_grad, scale, reshape, elementwise_add, and elementwise_add_grad.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_enable_inplace_whitelist=True disables the in-place memory optimization on certain ops.
+enable_parallel_graph
+*******************************************
+(since 1.2.0)
+This flag is used by ParallelExecutor to disable the parallel graph execution mode.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_enable_parallel_graph=False forces ParallelExecutor to disable the parallel graph execution mode.
+enable_rpc_profiler
+*******************************************
+(since 1.0.0)
+Whether to enable the RPC profiler.
+Values accepted
+----------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_enable_rpc_profiler=True enables the RPC profiler and records the timeline to the profiler file.
+fast_eager_deletion_mode
+*******************************************
+(since 1.3)
+Whether to use the fast garbage collection strategy. If not set, GPU memory is released when the CUDA kernel ends. Otherwise, GPU memory is released without waiting for the CUDA kernel to end, which makes the garbage collection strategy faster. Only valid when the garbage collection strategy is enabled.
+Values accepted
+---------------
+Bool. The default value is True.
+Example
+-------
+FLAGS_fast_eager_deletion_mode=True turns on the fast garbage collection strategy.
+FLAGS_fast_eager_deletion_mode=False turns off the fast garbage collection strategy.
+fraction_of_gpu_memory_to_use
+*******************************************
+(since 1.2.0)
+Allocates a chunk of GPU memory whose size is this fraction of the total GPU memory size. Future memory usage is allocated from the chunk. If the chunk does not have enough GPU memory, additional chunks of the same size are requested from the GPU until it has no memory left for another chunk.
+Values accepted
+---------------
+Uint64 value greater than 0, giving the initial GPU memory chunk as a fraction of the total GPU memory.
+Example
+-------
+FLAGS_fraction_of_gpu_memory_to_use=0.1 allocates 10% of the total GPU memory size as the initial GPU chunk.
+Note
+-------
+On Windows platforms, FLAGS_fraction_of_gpu_memory_to_use defaults to 0.5.
+On Linux, FLAGS_fraction_of_gpu_memory_to_use defaults to 0.92.
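The chunk arithmetic described for fraction_of_gpu_memory_to_use can be worked through in a few lines. This is a sketch of the sizing rule only, under the assumption that every chunk has the same size; `gpu_chunk_bytes` and `chunks_needed` are hypothetical helpers, and the 16 GiB device is an example value.

```python
GIB = 2 ** 30


def gpu_chunk_bytes(total_gpu_bytes, fraction):
    """Size of one chunk: the given fraction of total GPU memory."""
    return int(total_gpu_bytes * fraction)


def chunks_needed(requested_bytes, chunk_bytes):
    """Number of equally sized chunks required to cover a request."""
    return -(-requested_bytes // chunk_bytes)  # ceiling division


total = 16 * GIB                       # example device with 16 GiB of memory
chunk = gpu_chunk_bytes(total, 0.25)   # stands in for the flag value 0.25
needed = chunks_needed(10 * GIB, chunk)
```

With these example numbers each chunk is 4 GiB, so a workload that needs 10 GiB ends up requesting three chunks.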
+free_idle_memory
+*******************************************
+(since 0.15.0)
+Whether to free idle memory pre-allocated from the system during runtime. If set, idle memory is released when there is too much of it in the pre-allocated allocator.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_free_idle_memory=True frees idle memory when there is too much of it.
+FLAGS_free_idle_memory=False does not free idle memory.
+fuse_parameter_groups_size
+*******************************************
+(since 1.4.0)
+FLAGS_fuse_parameter_groups_size is the size of one group of parameters' gradients. If fuse_parameter_groups_size is 1, the number of groups equals the number of parameters' gradients. If fuse_parameter_groups_size is -1, there is only one group. The default value is 3, which is an empirical value.
+Values accepted
+---------------
+Int32. The default value is 3.
+Example
+-------
+FLAGS_fuse_parameter_groups_size=3 will put 3 parameter gradients in each group.
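+The grouping rule can be sketched as follows. This is a minimal model that assumes gradients are grouped in order; `group_gradients` and the gradient names are hypothetical and do not come from PaddlePaddle.
```python
def group_gradients(grads, group_size):
    """Split a list of gradient names into groups of `group_size` entries."""
    if group_size == -1:
        # -1 means everything goes into a single group.
        return [list(grads)]
    return [grads[i:i + group_size] for i in range(0, len(grads), group_size)]


grads = ["w1@GRAD", "b1@GRAD", "w2@GRAD", "b2@GRAD", "w3@GRAD"]
print(group_gradients(grads, 3))   # groups of up to 3 gradients
print(group_gradients(grads, 1))   # one group per gradient
print(group_gradients(grads, -1))  # a single group
```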
+fuse_parameter_memory_size
+*******************************************
+(since 1.5.0)
+FLAGS_fuse_parameter_memory_size indicates the upper limit of the memory size of one group of parameter gradients, which is the input of a communication call (e.g. NCCLAllReduce). The default value is -1.0, which means that groups are not formed according to memory size. The unit is megabytes.
+Values accepted
+---------------
+Double. The default value is -1.0.
+Example
+-------
+FLAGS_fuse_parameter_memory_size=16 will set the upper limit of the memory size of one group of parameter gradients to 16 megabytes.
+init_allocated_mem
+*******************************************
+(since 0.15.0)
+Whether to initialize the allocated memory with some non-zero values. This flag is for debugging: it helps catch ops that wrongly assume the allocated memory is initialized to zero.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_init_allocated_mem=True will initialize the allocated memory with non-zero values.
+FLAGS_init_allocated_mem=False will not initialize the allocated memory.
+initial_cpu_memory_in_mb
+*******************************************
+(since 0.14.0)
+Initial CPU memory chunk size, in MB, of the PaddlePaddle allocator. The allocator takes the smaller of FLAGS_initial_cpu_memory_in_mb and FLAGS_fraction_of_cpu_memory_to_use*(total physical memory) as the memory chunk size.
+Values accepted
+---------------
+Uint64. The default value is 500 with unit MB.
+Example
+-------
+FLAGS_initial_cpu_memory_in_mb=100: if FLAGS_fraction_of_cpu_memory_to_use*(total physical memory) > 100MB, the allocator will pre-allocate 100MB when the first allocation request arrives, and allocate another 100MB when the pre-allocated memory is exhausted.
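+The chunk size rule above is a simple minimum. A minimal sketch, with a hypothetical helper name and hypothetical machine sizes:
```python
def cpu_chunk_mb(initial_cpu_memory_in_mb, fraction_of_cpu_memory_to_use, total_physical_mb):
    """Chunk size is the smaller of the MB flag and fraction * physical memory."""
    return min(initial_cpu_memory_in_mb,
               fraction_of_cpu_memory_to_use * total_physical_mb)


# On a hypothetical 16384 MB machine the 100 MB flag is the smaller value.
print(cpu_chunk_mb(100, 0.5, 16384))
# On a hypothetical 512 MB machine the fraction term wins instead.
print(cpu_chunk_mb(500, 0.5, 512))
```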
+initial_gpu_memory_in_mb
+*******************************************
+(since 1.4.0)
+Allocate a chunk of GPU memory whose size in MB is specified by the flag. Future memory usage will be allocated from the chunk. If the chunk doesn't have enough GPU memory, additional chunks with the size specified by FLAGS_reallocate_gpu_memory_in_mb will be requested from the GPU until the GPU has no memory left for an additional chunk.
+Values accepted
+---------------
+Uint64 value greater than 0 which is the initial GPU memory size in MB.
+Example
+-------
+FLAGS_initial_gpu_memory_in_mb=4096 will allocate 4 GB as the initial GPU chunk.
+Note
+-------
+If you set this flag, the memory size set by FLAGS_fraction_of_gpu_memory_to_use will be overridden by this flag.
+If you don't set this flag, PaddlePaddle will use FLAGS_fraction_of_gpu_memory_to_use to allocate GPU memory.
+inner_op_parallelism
+*******************************************
+(since 1.3.0)
+Most operators work in single-thread mode, but for some operators, using multiple threads is more suitable. For example, an optimization op that optimizes sparse gradients will be much faster with multiple threads. This flag is used to set the number of threads inside an operator.
+Values accepted
+---------------
+Int32. The default value is 0, which means that operators will not run in multi-thread mode.
+Example
+-------
+FLAGS_inner_op_parallelism=5 will set the number of threads inside an operator to 5.
+Note
+-------
+Currently, only the sparse adam op supports inner_op_parallelism.
+limit_of_tmp_allocation
+*******************************************
+(since 1.3)
+FLAGS_limit_of_tmp_allocation indicates the upper limit of the temporary_allocation size, in bytes. If FLAGS_limit_of_tmp_allocation is -1, the size of temporary_allocation will not be limited.
+Values accepted
+---------------
+Int64. The default value is -1.
+Example
+-------
+FLAGS_limit_of_tmp_allocation=1024 will set the upper limit of the temporary_allocation size to 1024 bytes.
+max_body_size
+*******************************************
+(since 1.0.0)
+It controls the max message size in BRPC.
+Values accepted
+---------------
+Int32. The default value is 2147483647.
+Example
+-------
+FLAGS_max_body_size=2147483647 will set the max BRPC message size to 2147483647.
+memory_fraction_of_eager_deletion
+*******************************************
+(since 1.4)
+A memory size percentage used when the garbage collection strategy decides which variables should be released. If FLAGS_memory_fraction_of_eager_deletion=1.0, all temporary variables in the network will be released. If FLAGS_memory_fraction_of_eager_deletion=0.0, no temporary variables in the network will be released. If 0.0<FLAGS_memory_fraction_of_eager_deletion<1.0, all temporary variables are sorted in descending order of memory size, and only the FLAGS_memory_fraction_of_eager_deletion of variables with the largest memory size will be released. This flag is only valid when running a compiled program with data parallelism.
+Values accepted
+---------------
+Double, inside [0.0, 1.0]. The default value is 1.0.
+Example
+-------
+FLAGS_memory_fraction_of_eager_deletion=0 will keep all temporary variables, that is, disable the garbage collection strategy.
+FLAGS_memory_fraction_of_eager_deletion=1 will release all temporary variables.
+FLAGS_memory_fraction_of_eager_deletion=0.5 will only release the 50% of variables with the largest memory size.
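+The selection rule can be sketched as below. The description leaves open whether the fraction applies to the number of variables or to their total memory; this sketch assumes the memory reading (release the largest variables until the freed bytes reach the fraction of the total). `select_released` and the variable names are hypothetical, and this is not PaddlePaddle's implementation.
```python
def select_released(var_sizes, fraction):
    """Pick variables to release, largest first, under the memory-fraction reading."""
    if fraction <= 0.0:
        return []  # 0.0 keeps every temporary variable
    budget = sum(var_sizes.values()) * fraction
    released, freed = [], 0
    # Sort by memory size in descending order, as the description states.
    for name, size in sorted(var_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if freed >= budget:
            break
        released.append(name)
        freed += size
    return released


sizes = {"a": 400, "b": 300, "c": 200, "d": 100}  # bytes, hypothetical
print(select_released(sizes, 0.5))
```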
+multiple_of_cupti_buffer_size
+*******************************************
+(since 1.4.0)
+This flag is used for profiling. It indicates the multiple of the CUPTI device buffer size. When you are profiling, if the program crashes or errors occur when loading the timeline file in chrome://tracing, try increasing this value.
+Values accepted
+---------------
+Int32. The default value is 1.
+Example
+-------
+FLAGS_multiple_of_cupti_buffer_size=1 will set the multiple of the CUPTI device buffer size to 1.
+paddle_num_threads
+*******************************************
+(since 0.15.0)
+Control the number of threads of each paddle instance.
+Values accepted
+---------------
+Int32. The default value is 1.
+Example
+-------
+FLAGS_paddle_num_threads=2 will allow at most 2 threads for each instance.
+pe_profile_fname
+*******************************************
+(since 1.3.0)
+This flag is used for debugging ParallelExecutor. ParallelExecutor will generate the profile result with gperftools, and the profile result will be stored in the file specified by FLAGS_pe_profile_fname. Only valid when compiled `WITH_PROFILER=ON`. Empty if disabled.
+Values accepted
+---------------
+String. The default value is empty ("").
+Example
+-------
+FLAGS_pe_profile_fname="./parallel_executor.perf" will store the profile result to parallel_executor.perf.
+print_sub_graph_dir
+*******************************************
+(since 1.2.0)
+This flag is used for debugging. If some subgraphs of the transformed graph from the program are disconnected, the result may be problematic. We can print these disconnected subgraphs to a file specified by the flag. Empty if disabled.
+Values accepted
+---------------
+String. The default value is empty ("").
+Example
+-------
+FLAGS_print_sub_graph_dir="./sub_graphs.txt" will print the disconnected subgraphs to "./sub_graphs.txt".
+reader_queue_speed_test_mode
+*******************************************
+(since 1.1.0)
+Set the pyreader data queue to test mode. In test mode, pyreader will cache some data and the executor will then read the cached data, so the reader will not be the bottleneck.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_reader_queue_speed_test_mode=True will enable the pyreader test mode.
+Note
+-------
+This flag will work only when you are using py_reader.
+reallocate_gpu_memory_in_mb
+*******************************************
+(since 1.4.0)
+Re-allocate an additional GPU chunk if the allocated GPU memory chunk runs out.
+Values accepted
+---------------
+Int64 value greater than 0, in MB.
+Example
+-------
+FLAGS_reallocate_gpu_memory_in_mb=1024 will re-allocate 1 GB if the GPU memory chunk runs out.
+Note
+-------
+If this flag is set, PaddlePaddle will reallocate GPU memory with the size specified by this flag.
+Otherwise, PaddlePaddle will reallocate with the size set by FLAGS_fraction_of_gpu_memory_to_use.
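+The interplay between FLAGS_initial_gpu_memory_in_mb, FLAGS_reallocate_gpu_memory_in_mb and FLAGS_fraction_of_gpu_memory_to_use can be modeled as below. This is only a sketch of the notes in this document: the helper name is hypothetical, and using `None` to stand for "flag not set" is an assumption rather than how the flags are actually stored.
```python
def gpu_chunk_sizes_mb(total_gpu_mb, fraction, initial_mb=None, reallocate_mb=None):
    """Return (initial chunk size, re-allocation chunk size) in MB."""
    by_fraction = int(total_gpu_mb * fraction)  # fallback derived from the fraction flag
    initial = initial_mb if initial_mb is not None else by_fraction
    realloc = reallocate_mb if reallocate_mb is not None else by_fraction
    return initial, realloc


# Hypothetical 16384 MB device.
print(gpu_chunk_sizes_mb(16384, 0.5))                                      # both from the fraction
print(gpu_chunk_sizes_mb(16384, 0.5, initial_mb=4096, reallocate_mb=1024)) # both overridden
```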
+rpc_deadline
+*******************************************
+(since 1.0.0)
+It controls the deadline timeout of the rpc communication.
+Values accepted
+---------------
+Int32. The default value is 180000, in ms.
+Example
+-------
+FLAGS_rpc_deadline=180000 will set the deadline timeout to 3 minutes.
+rpc_disable_reuse_port
+*******************************************
+(since 1.2.0)
+When rpc_disable_reuse_port is true, the gRPC flag GRPC_ARG_ALLOW_REUSEPORT will be set to false to disable the use of SO_REUSEPORT if it's available.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_rpc_disable_reuse_port=True will disable the use of SO_REUSEPORT.
+rpc_get_thread_num
+*******************************************
+(since 1.0.0)
+It controls the number of threads used to get parameters from the parameter server.
+Values accepted
+---------------
+Int32. The default value is 12.
+Example
+-------
+FLAGS_rpc_get_thread_num=6 will use 6 threads to get parameters from the parameter server.
+rpc_send_thread_num
+*******************************************
+(since 1.0.0)
+It controls the number of threads used for sending rpc.
+Values accepted
+---------------
+Int32. The default value is 12.
+Example
+-------
+FLAGS_rpc_send_thread_num=6 will set the number of threads used for sending to 6.
+rpc_server_profile_path
+*******************************************
+(since 0.15.0)
+Set the profiler output log file path prefix. The complete path will be rpc_server_profile_path_listener_id, where listener_id is a random number.
+Values accepted
+---------------
+String. The default value is "./profile_ps".
+Example
+-------
+FLAGS_rpc_server_profile_path="/tmp/pserver_profile_log" will generate the profile log file at "/tmp/pserver_profile_log_listener_id".
+selected_gpus
+*******************************************
+(since 1.3)
+Set the GPU devices used for training or inference.
+Values accepted
+---------------
+A comma-separated list of device IDs, where each device ID is a nonnegative integer less than the number of GPU devices your machine has.
+Example
+-------
+FLAGS_selected_gpus=0,1,2,3,4,5,6,7 makes GPU devices 0-7 available for training or inference.
+Note
+-------
+The reason for using this flag is that we want to use collective communication between GPU devices, whereas with CUDA_VISIBLE_DEVICES only shared memory can be used.
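+The accepted value format can be checked with a small parser like the one below. It is an illustrative sketch only: `parse_selected_gpus` is a hypothetical helper, and the device count is passed in by hand instead of being queried from the machine.
```python
def parse_selected_gpus(value, device_count):
    """Parse a comma-separated device ID list and validate each ID."""
    ids = [int(item) for item in value.split(",")]
    for device_id in ids:
        # Each ID must be nonnegative and less than the number of GPUs.
        if not 0 <= device_id < device_count:
            raise ValueError("invalid GPU device ID: %d" % device_id)
    return ids


print(parse_selected_gpus("0,1,2,3,4,5,6,7", device_count=8))
```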
+sync_nccl_allreduce
+*******************************************
+(since 1.3)
+If FLAGS_sync_nccl_allreduce is true, `cudaStreamSynchronize(nccl_stream)` will be called in allreduce_op_handle. This mode can get better performance in some scenarios.
+Values accepted
+---------------
+Bool. The default value is True.
+Example
+-------
+FLAGS_sync_nccl_allreduce=True will call `cudaStreamSynchronize(nccl_stream)` in allreduce_op_handle.
+times_excess_than_required_tmp_allocation
+*******************************************
+(since 1.3)
+FLAGS_times_excess_than_required_tmp_allocation indicates the max size the TemporaryAllocator can return. For example, if the required memory size is N and times_excess_than_required_tmp_allocation is 2.0, the TemporaryAllocator will return an available allocation whose size is in the range N ~ 2*N.
+Values accepted
+---------------
+Int64. The default value is 2.
+Example
+-------
+FLAGS_times_excess_than_required_tmp_allocation=1024 will set the max size the TemporaryAllocator can return to 1024*N.
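+The size range described above amounts to a simple bounds check on a cached allocation. A minimal sketch, with a hypothetical helper name and sizes in bytes; it models the description, not the TemporaryAllocator source.
```python
def can_return(cached_size, required_size, times_excess=2):
    """True if a cached allocation lies within [N, times_excess * N]."""
    return required_size <= cached_size <= times_excess * required_size


print(can_return(1500, 1024))  # within 1024 ~ 2048
print(can_return(4096, 1024))  # too large for a factor of 2
print(can_return(512, 1024))   # smaller than the request
```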
+tracer_profile_fname
+*******************************************
+(since 1.4.0)
+FLAGS_tracer_profile_fname indicates the profiler filename for the imperative tracer, which is generated by gperftools. Only valid when compiled `WITH_PROFILER=ON`. Empty if disabled.
+Values accepted
+---------------
+String. The default value is ("gperf").
+Example
+-------
+FLAGS_tracer_profile_fname="gperf_profile_file" will set the profiler filename for the imperative tracer to "gperf_profile_file".
+use_mkldnn
+*******************************************
+(since 0.13.0)
+Whether to run with the Intel MKL-DNN (https://github.com/intel/mkl-dnn) library for inference or training.
+Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN) is an open-source performance library for deep-learning applications. The library accelerates deep-learning applications and frameworks on Intel(R) architecture. Intel MKL-DNN contains vectorized and threaded building blocks that you can use to implement deep neural networks (DNN) with C and C++ interfaces.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_use_mkldnn=True will enable running with MKL-DNN support.
+Note
+-------
+FLAGS_use_mkldnn is only used for Python training and inference scripts. To enable MKL-DNN in CAPI, set the build option -DWITH_MKLDNN=ON.
+Intel MKL-DNN supports Intel 64 architecture and compatible architectures. The library is optimized for the systems based on:
+Intel Atom(R) processor with Intel SSE4.1 support
+4th, 5th, 6th, 7th, and 8th generation Intel(R) Core(TM) processor
+Intel(R) Xeon(R) processor E3, E5, and E7 family (formerly Sandy Bridge, Ivy Bridge, Haswell, and Broadwell)
+Intel(R) Xeon(R) Scalable processors (formerly Skylake and Cascade Lake)
+Intel(R) Xeon Phi(TM) processors (formerly Knights Landing and Knights Mill)
+and compatible processors.
+use_ngraph
+*******************************************
+(since 1.4.0)
+Whether to run with the Intel nGraph (https://github.com/NervanaSystems/ngraph) engine for inference or training. This can bring a large performance boost on Intel Xeon CPUs.
+Values accepted
+---------------
+Bool. The default value is False.
+Example
+-------
+FLAGS_use_ngraph=True will enable running with nGraph support.
+Note
+-------
+Intel nGraph is supported in only a few models so far. We have only verified `ResNet-50 <https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/image_classification/README_ngraph.md>`_ training and inference.
+use_pinned_memory
+*******************************************
+(since 0.12.0)
+Whether to use CPU pinned memory. If set, the CPU allocator calls mlock to lock pages.
+Values accepted
+---------------
+Bool. The default value is True.
+Example
+-------
+FLAGS_use_pinned_memory=True will lock the pages of allocated CPU memory.