提交 17c643e7 编写于 作者: Z zhiqiu

add en doc of memory optimization, test=develop

上级 07ecde52
......@@ -5,7 +5,7 @@ Practice Improving
.. toctree::
:maxdepth: 1
singlenode_training_improving/memory_optimize_en.rst
multinode_training_improving/cpu_train_best_practice_en.rst
multinode_training_improving/gpu_training_with_recompute_en.rst
inference_improving/paddle_tensorrt_infer_en.md
......
.. _api_guide_memory_optimize_en:
###########
Memory Allocation and Optimization
###########
1. Memory Allocation Strategy
===========================
1.1. AutoGrowth Strategy
--------------------------
Since version 1.6+, PaddlePaddle supports the AutoGrowth strategy, which allocates memory on demand.
AutoGrowth strategy has been enabled by default in version 1.7+, making it convenient for users to
run multiple tasks on the same GPU card at the same time.
Because the native CUDA system calls: :code:`cudaMalloc` and :code:`cudaFree` are synchronous operations,
which are very time-consuming, the AutoGrowth strategy will cache the allocated memory for subsequent allocation.
The specific methods are as follows:
- In the first few memory allocations, PaddlePaddle framework will call :code:`cudaMalloc` and allocate memory on demand.
When releasing the allocated memory, it will not call :code:`cudaFree` to return the memory to GPU or memory,
but cache the memory inside the framework.
- In the subsequent allocations, PaddlePaddle framework will first check if there is a fit block
(block size larger than the required memory size) in the cached memory.
If there is, it will split the required memory from the fit block and return. Otherwise, it will call :code:`cudaMalloc` to
allocate memory from GPU. The allocated memory are also cached when being released for subsequent allocation.
Therefore, the AutoGrowth strategy may slow the speed in the first few batches of model training,
but will not affect the speed in the subsequent training process.
1.2. Pre-Allocation Strategy
----------------
In addition to the AutoGrowth strategy, paddlepaddle also provides a Pre-Allocation strategy,
which is the default memory allocation strategy before paddlepaddle 1.7.
The Pre-Allocation strategy allocates a large size chunk at the first allocation, and the subsequent memory allocation is mostly obtained from the pre allocated memory chunk.
Among them, the chunk size is determined by the environment variable :code:`FLAGS_fraction_of_gpu_memory_to_use`, and the calculation formula of chunk size is:
.. code-block:: python
chunk_size = FLAGS_fraction_of_gpu_memory_to_use * number of current available memory of a single GPU card
The default value of :code:`FLAGS_fraction_of_gpu_memory_to_use` is 0.92,that is, the framework will pre allocates
92% of the currently available memory of the GPU card。
The specific way of Pre-Allocation strategy to allocate GPU memory is:
- When allocating memory of requested_size,
- If requested_size <= chunk_size,the framework will first allocate a memory chunk of chunk_size,
then split a block of requested_size and return the block。Every subsequent memory allocation will be performed on the chunk.
- If requested_size > chunk_size,the framework will call :code:`cudaMalloc` to allocate memory block of requested_size and return.
- When freeing memory of requested_size,
- If free_size <= chunk_size,the framework will put the memory block back into the pre-allocated chunk,instead of returning back to GPU。
- If free_size > chunk_size,the framework will call :code:`cudaFree` and return the memory back to GPU.
若你的GPU卡上有其他任务占用显存,你可以适当将 :code:`FLAGS_fraction_of_gpu_memory_to_use` 减少,保证框架能预分配到合适的显存块,例如:
If there are other tasks on your GPU card that occupy the memory, you can appropriately decrease :code:`FLAGS_fraction_of_gpu_memory_to_use`
to ensure that the framework can pre-allocate the memory block of appropriate size, for example
.. code-block:: shell
export FLAGS_fraction_of_gpu_memory_to_use=0.4 # Pre-allocate 40% memory of a single GPU card
If :code:`FLAGS_fraction_of_gpu_memory_to_use` is set to 0,the framework will call
:code:`cudaMalloc` and :code:`cudaFree` every time the memory is allocated and released,which will seriously affect the performance and is not recommended.
Only when you want to measure the actual memory usage of the network, you could set :code:`FLAGS_fraction_of_gpu_memory_to_use` to 0, and observe the memory
usage of command nvidia-smi display.
1.3. Configuration of memory allocation strategy
-----------------------
Since version 1.6+, PaddlePaddle supports both the AutoGrowth strategy and the Pre-Allocation Strategy, and control the strategy used in framework by
the environment variable :code:`FLAGS_allocator_strategy` .
Use AutoGrowth strategy:
.. code-block:: shell
export FLAGS_allocator_strategy=auto_growth # Use AutoGrowth strategy
Use Pre-Allocation strategy:
.. code-block:: shell
export FLAGS_allocator_strategy=naive_best_fit # Use Pre-Allocation strategy
2. Memory Optimization Strategy
===========================
Paddlepaddle provides several general memory optimization methods to optimize the memory usage of your network (including general memory and GPU memory).
2.1. GC Strategy: memory garbage eager collection
-------------------------
The principle of GC(Garbage Collection)is to release the memory space of useless variables eagerly during network running,
in order to save memory space. GC is suitable training and inderence using Executor or ParallelExecutor, but it is not suitable for C++ inference library.
**Since version 1.6+, GC Strategy is enabled by default. **
GC Strategy is controled by 3 environment variable:
- :code:`FLAGS_eager_delete_tensor_gb`
GC enable variable,its data type is double. The default value is -1 in PaddlePaddle with version < 1.6,
and is 0 in PaddlePaddle with version >= 1.6. GC Strategy will cache a certain amount of memory garbage and release it uniformly.
:code:`FLAGS_eager_delete_tensor_gb` means the threshold of cached memory garbage, the unit of which is GB。**It is recommended to set** :code:`FLAGS_eager_delete_tensor_gb=0` .
If :code:`FLAGS_eager_delete_tensor_gb=0` , once there is memory garbage, it will be collected immediately to save memory.
If :code:`FLAGS_eager_delete_tensor_gb=1` ,the memory garbage is collected when the cached amount of garbage reaches 1GB.
If :code:`FLAGS_eager_delete_tensor_gb<0` ,GC Strategy is disabled.
- :code:`FLAGS_memory_fraction_of_eager_deletion`
GC Strategy control flag,its data type is double. The default value is 1,range [0,1]. It is only suitable for ParallelExecutor or CompiledProgram+with_data_parallel。
GC will sort the variables in descending order according to the memory space occupied by the variables,
and only collect the memory space of top :code:`FLAGS_memory_fraction_of_eager_deletion` variables.
**It is recommended to remain default value**, that is :code:`FLAGS_memory_fraction_of_eager_deletion=1` .
If :code:`FLAGS_memory_fraction_of_eager_deletion=0.6`, top 60% variables will be collected.
若 :code:`FLAGS_memory_fraction_of_eager_deletion=0`, no variable will be collected, GC Strategy is disabled.
若 :code:`FLAGS_memory_fraction_of_eager_deletion=1` ,all variables will be collected.
- :code:`FLAGS_fast_eager_deletion_mode`
Fast GC Strategy enable variable,its type is bool. The default value is True, which means use fast GC Strategy.
Fast GC Strategy will collect the memory garbage immediately instead of waiting for CUDA Kernel finish. **It is recommended to remain default value**,that is :code:`FLAGS_fast_eager_deletion_mode=True` .
2.2. Inplace Strategy: output reuses input inside operator
----------------------------------
The principle of Inplace strategy is that the output of some operators can reuses the meory space of input.
For example, the output and input of operator :core:`reshape` can reuse the same memory space.
Inplace Strategy is suitable for ParallelExecutor or CompiledProgram+with_data_parallel, which can be set through :code:`BuildStrategy`.
The Strategy is not suitable for Executor+Program or C++ inference library.
**Since version 1.6+, Inplace Strategy is enabled by default. **
The specific way of Inplace strategy is::
.. code-block:: python
build_strategy = fluid.BuildStrategy()
build_strategy.enable_inplace = True # Enable Inplace Strategy
compiled_program = fluid.CompiledProgram(train_program)
.with_data_parallel(loss_name=loss.name, build_strategy=build_strategy)
In PaddlePaddle with version < 1.6, due to of some design problems, when the Inplace Strategy is enabled,
the variable in fetch_list in the subsequent :code:`exe.run` must be persistent.
That is, if you the variables you want to fetch are loss and acc, you must set:
.. code-block:: python
loss.persistable = True
acc.persistable = True
**Since version 1.6+, setting variables in fetch_list to persistable is not needed. **
3. Memory Optimization Best Practice
=======================
We recommend the best memory optimization strategy as:
- Enable GC strategy:set :code:`FLAGS_eager_delete_tensor_gb=0` .
- Enable Inplace strategy:set :code:`build_strategy.enable_inplace = True` ,and set variables in fetch_list to persistable
using :code:`var.persistable = True` when the version of PaddlePaddle < 1.6 .
**Since version 1.6+, the above optimal strategy have been enabled by default and setting variables in fetch_list to persistable is not needed. **
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册