Low GPU utilization, slow training
Created by: yazone
-
Title: Low GPU utilization, high CPU utilization, slow training
-
Version and environment information: 1) PaddlePaddle version: 1.7.1 2) CPU: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz 3) GPU: 2080TI 4) System environment: Ubuntu 16.04, Python 3.7.6, CUDA 10.2, cuDNN 7.6, installed via conda
-
Training information: 1) Single machine, multiple GPUs 2) GPU memory information 3) Operator information
-
Reproduction information: training the face detection model from PaddleDetection on 4 GPUs (fruit detection behaves the same way):
CUDA_VISIBLE_DEVICES=1,2,3,4 python -u tools/train.py -c configs/face_detection/blazeface.yml
-
Problem description: training is slow, and GPU utilization stays in the single digits (it does not go up when running on a single GPU either).
W0414 23:31:04.069887 36272 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.0
W0414 23:31:04.073050 36272 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-04-14 23:31:05,911-INFO: 12880 samples in file dataset/wider_face/wider_face_split/wider_face_train_bbx_gt.txt
2020-04-14 23:31:09,320-INFO: places would be ommited when DataLoader is not iterable
I0414 23:31:09.339357 36272 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 4 cards are used, so 4 programs are executed in parallel.
W0414 23:31:16.488627 36272 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 122. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 80.
I0414 23:31:16.497570 36272 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0414 23:31:16.756584 36272 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0414 23:31:16.820021 36272 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
2020-04-14 23:31:23,780-INFO: iter: 0, lr: 0.001000, 'loss': '52.732460', time: 0.000, eta: 0:00:10
2020-04-14 23:32:18,247-INFO: iter: 20, lr: 0.001000, 'loss': '16.188927', time: 3.244, eta: 12 days, 0:18:25
2020-04-14 23:33:09,343-INFO: iter: 40, lr: 0.001000, 'loss': '13.720903', time: 2.631, eta: 9 days, 17:51:18
2020-04-14 23:34:03,918-INFO: iter: 60, lr: 0.001000, 'loss': '13.864499', time: 2.686, eta: 9 days, 22:40:33
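As a cross-check on the reported per-iteration time, the wall-clock timestamps of the INFO lines can be differenced. This is only a minimal sketch, assuming the log format shown above; the names `parse_iters` and `seconds_per_iter` are illustrative helpers and are not part of PaddleDetection.

```python
import re
from datetime import datetime

# Three INFO lines copied verbatim from the training log above.
LOG = """\
2020-04-14 23:32:18,247-INFO: iter: 20, lr: 0.001000, 'loss': '16.188927', time: 3.244, eta: 12 days, 0:18:25
2020-04-14 23:33:09,343-INFO: iter: 40, lr: 0.001000, 'loss': '13.720903', time: 2.631, eta: 9 days, 17:51:18
2020-04-14 23:34:03,918-INFO: iter: 60, lr: 0.001000, 'loss': '13.864499', time: 2.686, eta: 9 days, 22:40:33
"""

# Assumed line shape: "<date> <time>,<millis>-INFO: iter: <n>, ..."
PATTERN = re.compile(
    r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d{3})-INFO: iter: (\d+),"
)


def parse_iters(log_text):
    """Return (timestamp, iteration) pairs for every matching log line."""
    points = []
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            points.append((stamp, int(match.group(2))))
    return points


def seconds_per_iter(points):
    """Wall-clock seconds per iteration between consecutive log points."""
    rates = []
    for (t0, i0), (t1, i1) in zip(points, points[1:]):
        rates.append((t1 - t0).total_seconds() / (i1 - i0))
    return rates


if __name__ == "__main__":
    for rate in seconds_per_iter(parse_iters(LOG)):
        print(f"{rate:.3f} s/iter")
```

On the three lines used here this gives roughly 2.55 and 2.73 seconds per iteration, which is in line with the logged `time` values of 2.631 and 2.686.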
GPU information:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 27%   21C    P8    16W / 250W |   1328MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 27%   29C    P2    61W / 250W |   4287MiB / 11019MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 27%   29C    P2    44W / 250W |   3588MiB / 11019MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:3E:00.0 Off |                  N/A |
| 27%   28C    P2    57W / 250W |   3930MiB / 11019MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:88:00.0 Off |                  N/A |
| 27%   28C    P2    58W / 250W |   3650MiB / 11019MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:89:00.0 Off |                  N/A |
| 27%   23C    P8    19W / 250W |     11MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 27%   23C    P8     9W / 250W |     11MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 208...  Off  | 00000000:B2:00.0 Off |                  N/A |
| 46%   54C    P2   233W / 250W |  10536MiB / 11019MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
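A single nvidia-smi snapshot only shows one instant, so it can help to sample utilization in machine-readable form while training runs. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; `parse_gpu_csv`, `sample_gpus` and `underused` are hypothetical helper names, and the 10% threshold is an arbitrary choice for illustration.

```python
import csv
import io
import subprocess

# Query fields and format options provided by nvidia-smi.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, util, mem' CSV rows into a list of dicts.

    Assumes every field is a plain integer; rows with a different
    number of fields (for example blank lines) are skipped.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != 3:
            continue
        index, util, mem = (field.strip() for field in record)
        rows.append(
            {"index": int(index), "util_pct": int(util), "mem_mib": int(mem)}
        )
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def underused(rows, threshold=10):
    """Indices of GPUs whose utilization is below the threshold (percent)."""
    return [row["index"] for row in rows if row["util_pct"] < threshold]


if __name__ == "__main__":
    # Hand-written sample in the CSV shape of the query, with values
    # taken from GPUs 1-4 in the table above (not real command output).
    sample = "1, 4, 4287\n2, 4, 3588\n3, 9, 3930\n4, 4, 3650\n"
    print(underused(parse_gpu_csv(sample)))
```

Calling `sample_gpus()` in a loop with a short sleep would give a utilization trace over time instead of a single reading.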
CPU utilization:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
36499 trainer 20 0 34.181g 6.040g 3.554g R 948.8 2.4 2144:57 python
36506 trainer 20 0 34.173g 6.040g 3.553g R 775.2 2.4 2114:05 python
36500 trainer 20 0 34.329g 6.120g 3.556g R 689.4 2.4 2147:13 python
36495 trainer 20 0 34.340g 6.154g 3.550g R 675.6 2.4 2105:47 python
36496 trainer 20 0 34.162g 6.058g 3.583g R 517.2 2.4 2117:44 python
36497 trainer 20 0 34.180g 6.042g 3.556g R 393.7 2.4 2137:54 python
36498 trainer 20 0 34.302g 6.082g 3.550g R 368.3 2.4 2072:33 python
36507 trainer 20 0 34.317g 6.101g 3.549g R 358.7 2.4 2073:01 python
36272 trainer 20 0 45.777g 0.011t 7.767g S 25.4 4.6 125:23.97 python
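To put a number on the CPU load, the %CPU column above can be summed. This is a small sketch assuming the default top column order shown in the header line (PID first, %CPU ninth) and that 100% corresponds to one fully busy core; `cpu_percentages` is an illustrative helper name.

```python
# Process rows copied verbatim from the top output above.
TOP_ROWS = """\
36499 trainer 20 0 34.181g 6.040g 3.554g R 948.8 2.4 2144:57 python
36506 trainer 20 0 34.173g 6.040g 3.553g R 775.2 2.4 2114:05 python
36500 trainer 20 0 34.329g 6.120g 3.556g R 689.4 2.4 2147:13 python
36495 trainer 20 0 34.340g 6.154g 3.550g R 675.6 2.4 2105:47 python
36496 trainer 20 0 34.162g 6.058g 3.583g R 517.2 2.4 2117:44 python
36497 trainer 20 0 34.180g 6.042g 3.556g R 393.7 2.4 2137:54 python
36498 trainer 20 0 34.302g 6.082g 3.550g R 368.3 2.4 2072:33 python
36507 trainer 20 0 34.317g 6.101g 3.549g R 358.7 2.4 2073:01 python
36272 trainer 20 0 45.777g 0.011t 7.767g S 25.4 4.6 125:23.97 python
"""


def cpu_percentages(top_text):
    """Map PID -> %CPU for each process row in the top output."""
    usage = {}
    for line in top_text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 12-column process row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        usage[int(fields[0])] = float(fields[8])
    return usage


if __name__ == "__main__":
    usage = cpu_percentages(TOP_ROWS)
    total = sum(usage.values())
    print(f"{len(usage)} processes, {total:.1f}% CPU in total")
    print(f"about {total / 100:.1f} cores busy")
```

For the rows above the total is about 4752%, that is, roughly 47.5 cores' worth of CPU time. Almost all of it comes from the eight processes other than PID 36272, the ID that also appears in the training log lines.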
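- Additionally: on a single machine with four 1080TI GPUs (PaddlePaddle 1.7.1, CUDA 9.0, cuDNN 7.5, Python 3.6.8, installed via conda), single-GPU training of Unet from the PaddleSeg suite shows the same behavior (other models have not been tried).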