diff --git a/doc/fluid/advanced_guide/performance_improving/device_switching/device_switching.md b/doc/fluid/advanced_guide/performance_improving/device_switching/device_switching.md
new file mode 100644
index 0000000000000000000000000000000000000000..5cc262b40c62ee703e19b51744829243e265a51a
--- /dev/null
+++ b/doc/fluid/advanced_guide/performance_improving/device_switching/device_switching.md
@@ -0,0 +1,197 @@
+# Runtime Device Switching
+
+Paddle provides [fluid.CUDAPlace](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/CUDAPlace_cn.html) and [fluid.CPUPlace](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/CPUPlace_cn.html) to specify the device used at runtime. These two interfaces set the device globally. Starting from version 2.0, Paddle also provides [device_guard](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/fluid_cn/device_guard_cn.html) to specify the device on which individual OPs run. This tutorial introduces the scenarios in which `device_guard` is useful and shows how to use it to optimize a model.
+
+If the global execution device is set with `fluid.CUDAPlace`, the framework places OPs on the GPU whenever possible, so you may run out of GPU memory. `device_guard` lets you set the execution device of individual OPs: by running some layers on the CPU, you can take advantage of the large host memory and avoid exceeding GPU memory.
+
+Sometimes, even though the global execution device is set to GPU, the framework may still place some OPs on the CPU when it assigns execution devices automatically. In addition, a few OPs store their outputs on the CPU. In these cases, data transfers between devices often occur and may hurt model performance. Using `device_guard` can avoid such unnecessary transfers at runtime. The following sections describe in detail how to analyze the data-transfer overhead with the [profile](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/profiler_cn.html) tool and how to use `device_guard` to eliminate unnecessary transfers and improve model performance.
+
+## How to avoid running out of GPU memory
+
+In the example below, the `size` argument of the `embedding` layer contains two elements: the first is `vocab_size` (the vocabulary size) and the second is `emb_size` (the `embedding` dimension). In practice the vocabulary can be very large; in this example it is set to 10000000. When running in GPU mode, the weight matrix created by this layer has shape (10000000, 150), so this single layer already needs 5.59 GB of GPU memory. If the vocabulary grows any further, it is very likely to exceed the available GPU memory.
+
+```python
+import paddle.fluid as fluid
+
+data = fluid.layers.fill_constant(shape=[1], value=128, dtype='int64')
+label = fluid.layers.fill_constant(shape=[1, 150], value=0.5, dtype='float32')
+emb = fluid.embedding(input=data, size=(10000000, 150), dtype='float32')
+out = fluid.layers.l2_normalize(x=emb, axis=-1)
+
+cost = fluid.layers.square_error_cost(input=out, label=label)
+avg_cost = fluid.layers.mean(cost)
+sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
+sgd_optimizer.minimize(avg_cost)
+
+place = fluid.CUDAPlace(0)
+exe = fluid.Executor(place)
+exe.run(fluid.default_startup_program())
+result = exe.run(fluid.default_main_program(), fetch_list=[avg_cost])
+```
+
+`embedding` simply looks up the rows of the `embedding` matrix that correspond to the `id`s in `input`, so its speed is acceptable even when it is computed on the CPU. You can therefore use `device_guard`, as shown below, to place the `embedding` layer on the CPU and make use of host memory. All other layers still run on the GPU.
+
+```python
+import paddle.fluid as fluid
+
+data = fluid.layers.fill_constant(shape=[1], value=128, dtype='int64')
+label = fluid.layers.fill_constant(shape=[1, 150], value=0.5, dtype='float32')
+with fluid.device_guard("cpu"):
+    emb = fluid.embedding(input=data, size=(10000000, 150), dtype='float32')
+out = fluid.layers.l2_normalize(x=emb, axis=-1)
+
+cost = fluid.layers.square_error_cost(input=out, label=label)
+avg_cost = fluid.layers.mean(cost)
+sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
+sgd_optimizer.minimize(avg_cost)
+
+place = fluid.CUDAPlace(0)
+exe = fluid.Executor(place)
+exe.run(fluid.default_startup_program())
+result = exe.run(fluid.default_main_program(), fetch_list=[avg_cost])
+```
+
+If GPU memory is sufficient, this setting is not necessary.
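+
+If you want to double-check which OPs were actually pinned to the CPU, one option is to inspect the OPs of the built program. The sketch below is only an illustration and relies on an assumption that is not stated in this document: that `device_guard` records its choice in each OP's `op_device` attribute (the attribute name, and whether it is present at all, may differ across Paddle versions, so verify it against your installation).
+
+```python
+# Hedged sketch: print each OP together with the device recorded for it.
+# Assumes the attribute is named "op_device"; an empty value is taken to mean
+# that the OP simply follows the global place passed to the Executor.
+for op in fluid.default_main_program().global_block().ops:
+    device = op.attr("op_device") if op.has_attr("op_device") else ""
+    print(op.type, device or "<global place>")
+```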
+
+## How to reduce data transfers
+### Use the profile tool to confirm whether data transfers happen
+First, analyze the model's performance data to find out why data transfers happen. As shown in the code below, the [profile](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/profiler_cn.html) tool can be used for this analysis.
+
+```python
+import paddle.fluid as fluid
+import paddle.fluid.profiler as profiler
+
+data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
+data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
+shape = fluid.layers.shape(data2)
+shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
+out = fluid.layers.crop_tensor(data1, shape=shape)
+place = fluid.CUDAPlace(0)
+exe = fluid.Executor(place)
+exe.run(fluid.default_startup_program())
+with profiler.profiler('All', 'total') as prof:
+    for i in range(10):
+        result = exe.run(fetch_list=[out])
+```
+
+When the program finishes, a profile report is printed automatically. In the report below, the `GpuMemCpy Summary` lists the time spent in two kinds of data-transfer calls. During OP execution, a `GpuMemcpySync` occurs whenever the device holding an input Tensor differs from the device the OP runs on; this is usually the item we can optimize directly. Looking further, both `slice` and `crop_tensor` trigger a `GpuMemcpySync`. Although the program is set to run in GPU mode, some OPs in the framework, such as `shape`, place their output on the CPU.
+
+```text
+-------------------------> Profiling Report <-------------------------
+
+Note! This Report merge all thread info into one.
+Place: All
+Time unit: ms
+Sorted by event first end time in descending order in the same thread
+
+Total time: 24.0922
+  Computation time Total: 3.60143 Ratio: 14.9485%
+  Framework overhead Total: 20.4908 Ratio: 85.0515%
+
+------------------------- GpuMemCpy Summary -------------------------
+
+GpuMemcpy Calls: 30 Total: 1.44377 Ratio: 5.99267%
+  GpuMemcpyAsync Calls: 10 Total: 0.459803 Ratio: 1.90851%
+  GpuMemcpySync Calls: 20 Total: 0.983967 Ratio: 4.08416%
+
+------------------------- Event Summary -------------------------
+
+Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
+fill_constant 20 2.03147 1.995597 (0.982342) 0.035872 (0.017658) 0.064199 0.379822 0.101573 0.0843204
+shape 10 0.466503 0.466503 (1.000000) 0.000000 (0.000000) 0.021165 0.207393 0.0466503 0.0193632
+eager_deletion 30 0.28398 0.283980 (1.000000) 0.000000 (0.000000) 0.004668 0.028065 0.009466 0.0117872
+slice 10 1.53533 1.505664 (0.980679) 0.029664 (0.019321) 0.1312 0.259446 0.153533 0.0637271
+  GpuMemcpySync:CPU->GPU 10 0.41714 0.408532 (0.979364) 0.008608 (0.020636) 0.038545 0.054022 0.041714 0.0173143
+crop_tensor 10 1.49584 1.438558 (0.961707) 0.057280 (0.038293) 0.129106 0.246395 0.149584 0.0620879
+  GpuMemcpySync:GPU->CPU 10 0.566827 0.543787 (0.959353) 0.023040 (0.040647) 0.047598 0.097705 0.0566827 0.0235274
+Fetch 10 0.921333 0.897141 (0.973742) 0.024192 (0.026258) 0.077059 0.177223 0.0921333 0.0382419
+  GpuMemcpyAsync:GPU->CPU 10 0.459803 0.435611 (0.947386) 0.024192 (0.052614) 0.039321 0.073849 0.0459803 0.0190851
+ParallelExecutor::Run 10 17.3578 17.345797 (0.999309) 0.012000 (0.000691) 0.705361 10.3389 1.73578 0.720472
+  InitLocalVars 1 0.084954 0.084954 (1.000000) 0.000000 (0.000000) 0.084954 0.084954 0.084954 0.0035262
+  ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.040771 0.040771 (1.000000) 0.000000 (0.000000) 0.003653 0.00543 0.0040771 0.00169229
+  FastThreadedSSAGraphExecutorPrepare 10 8.64291 8.630914 (0.998612) 0.012000 (0.001388) 0.033383 8.29818 0.864291 0.358743
+  ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.252618 0.252618 (1.000000) 0.000000 (0.000000) 0.022696 0.041439 0.0252618 0.0104854
+```
+### Use the log to locate where the data transfers happen
+
+The example program above is simple enough that the profile report alone tells us which operators trigger the data transfers. When the model is more complex, however, you may need more detailed debugging information, and printing the runtime log can pinpoint exactly where the transfers happen. Taking the same program as an example, running `GLOG_vmodule=operator=3 python test_case.py` produces the log below, which shows that two data transfers occur:
+
+- The output of `shape` is on the CPU, so when `slice` runs, the output of `shape` is copied to the GPU.
+- The result of `slice` is on the GPU, so when `crop_tensor` runs, it is copied to the CPU.
+
+```text
+I0406 14:56:23.286592 17516 operator.cc:180] CUDAPlace(0) Op(shape), inputs:{Input[fill_constant_1.tmp_0:float[1, 3, 5, 5]({})]}, outputs:{Out[shape_0.tmp_0:int[4]({})]}.
+I0406 14:56:23.286628 17516 eager_deletion_op_handle.cc:107] Erase variable fill_constant_1.tmp_0 on CUDAPlace(0)
+I0406 14:56:23.286725 17516 operator.cc:1210] Transform Variable shape_0.tmp_0 from data_type[int]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[int]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
+I0406 14:56:23.286763 17516 scope.cc:169] Create variable shape_0.tmp_0
+I0406 14:56:23.286784 17516 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
+I0406 14:56:23.286867 17516 tensor_util.cu:129] TensorCopySync 4 from CPUPlace to CUDAPlace(0)
+I0406 14:56:23.287099 17516 operator.cc:180] CUDAPlace(0) Op(slice), inputs:{EndsTensor[], EndsTensorList[], Input[shape_0.tmp_0:int[4]({})], StartsTensor[], StartsTensorList[]}, outputs:{Out[slice_0.tmp_0:int[4]({})]}.
+I0406 14:56:23.287140 17516 eager_deletion_op_handle.cc:107] Erase variable shape_0.tmp_0 on CUDAPlace(0)
+I0406 14:56:23.287220 17516 tensor_util.cu:129] TensorCopySync 4 from CUDAPlace(0) to CPUPlace
+I0406 14:56:23.287473 17516 operator.cc:180] CUDAPlace(0) Op(crop_tensor), inputs:{Offsets[], OffsetsTensor[], Shape[slice_0.tmp_0:int[4]({})], ShapeTensor[], X[fill_constant_0.tmp_0:float[1, 3, 8, 8]({})]}, outputs:{Out[crop_tensor_0.tmp_0:float[1, 3, 5, 5]({})]}.
+```
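+
+If prefixing the command with the environment variable is inconvenient (for example on a platform without the inline `VAR=value` shell syntax), a possible alternative is to set it from Python before Paddle is imported. This is a sketch based on an assumption, not something stated in this document: glog flags are normally read when the native library is initialized, so the assignment has to come before the first `import paddle`, and setting it later in an already running process may have no effect.
+
+```python
+# Hedged sketch: enable the verbose operator log from inside the script itself.
+# Assumption: GLOG_vmodule is only picked up if it is set before paddle's
+# native core is loaded, i.e. before the first `import paddle`.
+import os
+os.environ["GLOG_vmodule"] = "operator=3"
+
+import paddle.fluid as fluid  # imported only after the flag has been set
+```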
+
+### Use device_guard to avoid unnecessary data transfers
+
+In the example above, the output of `shape` is a 1-D Tensor, so the amount of computation done by `slice` is tiny. In this case, placing `slice` on the CPU avoids both data transfers. The modified program is shown below:
+
+```python
+import paddle.fluid as fluid
+import paddle.fluid.profiler as profiler
+
+data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
+data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
+shape = fluid.layers.shape(data2)
+with fluid.device_guard("cpu"):
+    shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
+out = fluid.layers.crop_tensor(data1, shape=shape)
+
+place = fluid.CUDAPlace(0)
+exe = fluid.Executor(place)
+exe.run(fluid.default_startup_program())
+with profiler.profiler('All', 'total') as prof:
+    for i in range(10):
+        result = exe.run(fetch_list=[out])
+```
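+
+Besides reading the profile report, a coarse wall-clock comparison of the two versions can also confirm whether the change pays off. The sketch below is only a rough illustration: it reuses `exe` and `out` from the program above, ignores warm-up effects, and simply averages a fixed number of runs; run it once for the original program and once for the `device_guard` version and compare the averages.
+
+```python
+import time
+
+# Rough timing sketch: average the end-to-end latency over repeated runs.
+runs = 100
+start = time.time()
+for _ in range(runs):
+    exe.run(fetch_list=[out])
+print("average time per run: %.4f ms" % ((time.time() - start) / runs * 1000.0))
+```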
+
+Looking again at the `GpuMemCpy Summary` in the profile report, the `GpuMemcpySync` calls have been eliminated. In a real model, if `GpuMemcpySync` calls take up a noticeable share of the time and can be avoided by setting `device_guard`, doing so brings a corresponding performance gain.
+
+```text
+-------------------------> Profiling Report <-------------------------
+
+Note! This Report merge all thread info into one.
+Place: All
+Time unit: ms
+Sorted by total time in descending order in the same thread
+
+Total time: 14.5345
+  Computation time Total: 4.47587 Ratio: 30.7948%
+  Framework overhead Total: 10.0586 Ratio: 69.2052%
+
+------------------------- GpuMemCpy Summary -------------------------
+
+GpuMemcpy Calls: 10 Total: 0.457033 Ratio: 3.14447%
+  GpuMemcpyAsync Calls: 10 Total: 0.457033 Ratio: 3.14447%
+
+------------------------- Event Summary -------------------------
+
+Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
+FastThreadedSSAGraphExecutorPrepare 10 7.70113 7.689066 (0.998433) 0.012064 (0.001567) 0.032657 7.39363 0.770113 0.529852
+fill_constant 20 2.62299 2.587022 (0.986287) 0.035968 (0.013713) 0.071097 0.342082 0.13115 0.180466
+shape 10 1.93504 1.935040 (1.000000) 0.000000 (0.000000) 0.026774 1.6016 0.193504 0.133134
+Fetch 10 0.880496 0.858512 (0.975032) 0.021984 (0.024968) 0.07392 0.140896 0.0880496 0.0605797
+  GpuMemcpyAsync:GPU->CPU 10 0.457033 0.435049 (0.951898) 0.021984 (0.048102) 0.037836 0.071424 0.0457033 0.0314447
+crop_tensor 10 0.705426 0.671506 (0.951916) 0.033920 (0.048084) 0.05841 0.123901 0.0705426 0.0485346
+slice 10 0.324241 0.324241 (1.000000) 0.000000 (0.000000) 0.024299 0.07213 0.0324241 0.0223084
+eager_deletion 30 0.250524 0.250524 (1.000000) 0.000000 (0.000000) 0.004171 0.016235 0.0083508 0.0172365
+ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.047794 0.047794 (1.000000) 0.000000 (0.000000) 0.003344 0.014131 0.0047794 0.00328831
+InitLocalVars 1 0.034629 0.034629 (1.000000) 0.000000 (0.000000) 0.034629 0.034629 0.034629 0.00238254
+ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.032231 0.032231 (1.000000) 0.000000 (0.000000) 0.002952 0.004076 0.0032231 0.00221755
+```
+
+### Summary
+
+- Profile the model with the profile tool and check whether any time is spent in GpuMemcpySync calls. If so, analyze further to find out why the data transfers happen.
+- The profile report shows which OPs trigger GpuMemcpySync. If needed, print the runtime log to find the exact place where the GpuMemcpySync happens.
+- Try using `device_guard` to set the execution device of some OPs in order to reduce the GpuMemcpySync calls.
+- Finally, compare the profile reports of the model before and after the change, or any other metric used to measure performance, to confirm that the change actually brings a performance gain.
diff --git a/doc/fluid/advanced_guide/performance_improving/index_cn.rst b/doc/fluid/advanced_guide/performance_improving/index_cn.rst
index 9103496255d6637161065237ac53a856f033a835..a40594d13bf74398518f23ad923900ad1bed81d8
--- a/doc/fluid/advanced_guide/performance_improving/index_cn.rst
+++ b/doc/fluid/advanced_guide/performance_improving/index_cn.rst
@@ -7,6 +7,7 @@
 
     singlenode_training_improving/training_best_practice.rst
     singlenode_training_improving/memory_optimize.rst
+    device_switching/device_switching.md
    multinode_training_improving/cpu_train_best_practice.rst
    multinode_training_improving/dist_training_gpu.rst
    multinode_training_improving/gpu_training_with_recompute.rst