[New Feature] Release device memory held by Paddle
Created by: yiakwy
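Background
In environments such as AIStudio (and in AIStudio itself), a Paddle program allocates device memory for the lifetime of the process. As a result, when a Jupyter code block is run a second time, the memory from the previous run has not been released. In some cases the only way to free it is to restart the kernel.
Steps to reproduce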
nvidia-smi
nvidia-smi --gpu-reset -i 0 # https://devtalk.nvidia.com/default/topic/958159/cuda-programming-and-performance/11-gb-of-gpu-ram-used-and-no-process-listed-by-nvidia-smi/
Fri Jun 7 22:03:38 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 33C P0 54W / 300W | 14912MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
PCI device 00000000:00:08.0 must be reset with GPU 0 (00000000:00:07.0).
One or more incomplete sets of NVLink GPUs were specified.
GPU Reset couldn't run because the specified GPUs could not be validated for NVLink reset.
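The output shows that about 14 of the 16 GB of GPU memory (14912MiB / 16160MiB) are in use, yet no process ID is listed. The memory cannot be cleared from within Paddle either, so the only option is to restart the Jupyter kernel manually.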
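To check the same figures from inside a notebook cell, the memory counters can be read in machine-friendly form. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_memory_csv` and `query_gpu_memory` are hypothetical and not part of Paddle.

```python
import subprocess


def parse_memory_csv(text):
    """Parse lines of 'used, total' (MiB), one line per GPU, into int pairs."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used/total memory per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_csv(out)


# Sample line built from the figures in the report above, so the
# snippet also runs on a machine without a GPU.
used, total = parse_memory_csv("14912, 16160\n")[0]
print(f"{used} MiB of {total} MiB in use ({used / total:.0%})")
```

Calling `query_gpu_memory()` before and after re-running a cell would show whether the allocation from the previous run is still held.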
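Comparison
- Older versions allocated GPU memory for the lifetime of the process; newer versions of TensorFlow can release GPU memory through the GPU allocation options of the TensorFlow API
- Judging from past experience, Colab does not seem to have a similar problem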
Possible solutions and related issues
- Clearing GPU memory: https://stackoverflow.com/questions/39758094/clearing-tensorflow-gpu-memory-after-model-execution
- Possibly related to the NVIDIA driver version: https://github.com/tensorflow/tensorflow/issues/1727
- Working around it with multiple processes: https://github.com/tensorflow/tensorflow/issues/17048
- Using numba.cuda: https://github.com/tensorflow/tensorflow/issues/17048
- Using the TensorFlow GPU memory growth config: https://github.com/keras-team/keras/issues/12625
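The multi-process workaround listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not a Paddle API: the GPU work runs in a separate Python interpreter, so the driver reclaims its device memory when that process exits and the Jupyter kernel never holds the allocation. The helper `run_in_subprocess` and the `JOB` string are hypothetical; `JOB` stands in for a real training script.

```python
import subprocess
import sys


def run_in_subprocess(source, timeout=None):
    """Execute Python source in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return result.stdout


# Hypothetical stand-in for a training job. In practice this would
# import paddle and run the model; here it only does arithmetic so the
# sketch runs anywhere.
JOB = "print(sum(i * i for i in range(10)))"

output = run_in_subprocess(JOB)
print(output.strip())
```

The trade-off is that results have to be passed back through stdout or files rather than shared in memory, and each run pays the start-up cost of a new interpreter.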