optimize assign op to avoid copy data from GPU to GPU (!21181) · Merge request · PaddlePaddle / Paddle

Created by: zhangting2020

Source of the issue: https://github.com/PaddlePaddle/benchmark/issues/205#issuecomment-542488429

The description of assign given there: during GPU training, when the input is on the CPU, the input x is first transferred by data transform to tmp_x on the GPU, and a GPU -> GPU transfer then copies tmp_x to the output out. The GPU -> GPU copy is redundant.

Analysis: the assign op copies an input Tensor or numpy array to the output. While the op runs, it enters PrepareData, which checks whether the place of kernel_type_for_var matches the place of expected_kernel_key; if they differ, a transform is required.
https://github.com/PaddlePaddle/Paddle/blob/8da0cd537ae5f6fa60ac3ecd1d42aed7c730f423/paddle/fluid/framework/operator.cc#L1068-L1073

The original assign op uses the default GetKernelTypeForVar function, which returns an OpKernelType whose place is the place of the input Tensor. If the input Tensor is on the CPU and the kernel to be executed is the CUDAKernel of Assign, a CPU->GPU data copy takes place.
https://github.com/PaddlePaddle/Paddle/blob/8da0cd537ae5f6fa60ac3ecd1d42aed7c730f423/paddle/fluid/framework/operator.cc#L1127-L1129

Inside the assign op kernel a second copy then takes place, this time GPU->GPU.
https://github.com/PaddlePaddle/Paddle/blob/cfdd1fc2cd3673a6e0de53e257056ad21ba6c75a/paddle/fluid/operators/assign_op.h#L59-L65
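The place check described above can be illustrated with a small pure-Python sketch. It does not call Paddle: `KernelType` and `needs_transform` are hypothetical stand-ins for OpKernelType and the comparison done in PrepareData, and the model compares only the place field, which is an assumption made for illustration.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for OpKernelType: only two of the
# fields that appear in the logs are modelled here.
KernelType = namedtuple("KernelType", ["data_type", "place"])


def needs_transform(kernel_type_for_var, expected_kernel_key):
    """Toy version of the check: transform when the two places differ."""
    return kernel_type_for_var.place != expected_kernel_key.place


expected = KernelType("double", "CUDAPlace(0)")

# Default behaviour: the kernel type for the variable carries the
# place of the input tensor itself.
cpu_input = KernelType("double", "CPUPlace")
gpu_input = KernelType("double", "CUDAPlace(0)")

print(needs_transform(cpu_input, expected))  # CPU input, CUDA kernel
print(needs_transform(gpu_input, expected))  # input already on the GPU
```

In this model a CPU input headed for a CUDA kernel triggers a transform, while an input already on the GPU does not.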

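Optimization: override GetKernelTypeForVar so that it returns an OpKernelType whose place matches the place of expected_kernel_key, which avoids the CPU->GPU copy during data transform. The assign op then uses TensorCopy to copy the source data on the CPU directly to the destination address on the GPU.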
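The effect of the override on the number of copies can be sketched with a toy model in plain Python. Nothing here is Paddle code: `assign_copies` and its parameters are hypothetical names, and the model assumes that the only things that matter are the input place, the expected place, and whether GetKernelTypeForVar is overridden.

```python
def assign_copies(input_place, expected_place, override_kernel_type_for_var):
    """Return the (src, dst) copies one assign run performs in this toy model."""
    copies = []
    # GetKernelTypeForVar: the default reports the input's own place; the
    # override reports the place of expected_kernel_key instead.
    var_place = expected_place if override_kernel_type_for_var else input_place

    current_place = input_place
    if var_place != expected_place:
        # PrepareData sees a mismatch and transforms the input to a temporary.
        copies.append((current_place, expected_place))
        current_place = expected_place

    # The assign kernel itself always copies its input to the output.
    copies.append((current_place, expected_place))
    return copies


before = assign_copies("CPUPlace", "CUDAPlace(0)", override_kernel_type_for_var=False)
after = assign_copies("CPUPlace", "CUDAPlace(0)", override_kernel_type_for_var=True)
print(before)
print(after)
```

Under these assumptions the default path performs two copies (CPU to GPU, then GPU to GPU), while the overridden path performs a single CPU to GPU copy inside the kernel.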

The following code is used to verify the execution flow:

import paddle.fluid as fluid

# Create the input on the CPU even though the program runs on the GPU.
data = fluid.layers.fill_constant(shape=[3, 2], value=2.5, dtype='float64', force_cpu=True)
result = fluid.layers.create_tensor(name='result', dtype='float64')
# Copy data into result through the assign op.
fluid.layers.assign(data, result)

# Run on GPU 0 so that the CUDA kernel of assign is selected.
exe = fluid.Executor(fluid.core.CUDAPlace(0))
exe.run(fluid.default_startup_program())
out = exe.run(fluid.default_main_program(), fetch_list=[result])

Before the change:

I1114 04:08:43.568965 24523 operator.cc:172] CUDAPlace(0) Op(fill_constant), inputs:{ShapeTensor[], ShapeTensorList[]}, outputs:{Out[fill_constant_0.tmp_0:double[3, 2]({})]}.
I1114 04:08:43.569031 24523 operator.cc:982] expected_kernel_key:data_type[double]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1114 04:08:43.569059 24523 operator.cc:1077] Transform Variable fill_constant_0.tmp_0 from data_type[double]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[double]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1114 04:08:43.569080 24523 scope.cc:164] Create variable fill_constant_0.tmp_0
I1114 04:08:43.569105 24523 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
I1114 04:08:43.569167 24523 tensor_util.cu:122] TensorCopySync 3, 2 from CPUPlace to CUDAPlace(0)
I1114 04:08:43.571846 24523 tensor_util.cu:28] TensorCopy 3, 2 from CUDAPlace(0) to CUDAPlace(0)
I1114 04:08:43.572003 24523 operator.cc:172] CUDAPlace(0) Op(assign), inputs:{X[fill_constant_0.tmp_0:double[3, 2]({})]}, outputs:{Out[result:double[3, 2]({})]}.

After the change:

I1114 03:59:26.250339 10889 operator.cc:172] CUDAPlace(0) Op(fill_constant), inputs:{ShapeTensor[], ShapeTensorList[]}, outputs:{Out[fill_constant_0.tmp_0:double[3, 2]({})]}.
I1114 03:59:26.250404 10889 operator.cc:982] expected_kernel_key:data_type[double]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1114 03:59:26.250453 10889 tensor_util.cu:28] TensorCopy 3, 2 from CPUPlace to CUDAPlace(0)
I1114 03:59:26.253456 10889 operator.cc:172] CUDAPlace(0) Op(assign), inputs:{X[fill_constant_0.tmp_0:double[3, 2]({})]}, outputs:{Out[result:double[3, 2]({})]}.
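To compare the two logs above mechanically, the copy events can be pulled out with a short script. This is only a sketch: `copy_events` and the regular expression are illustrative helpers written for the message format visible in these logs, not part of Paddle, and the log text is pasted in from the copy lines above.

```python
import re

# Matches the copy messages seen in the logs, for example
# "TensorCopySync 3, 2 from CPUPlace to CUDAPlace(0)".
COPY_RE = re.compile(r"(TensorCopy(?:Sync)?) .*? from (\S+) to (\S+)")


def copy_events(log_text):
    """Return (function, src_place, dst_place) for every copy line in the log."""
    events = []
    for line in log_text.splitlines():
        match = COPY_RE.search(line)
        if match:
            events.append(match.groups())
    return events


before_log = """\
I1114 04:08:43.569167 24523 tensor_util.cu:122] TensorCopySync 3, 2 from CPUPlace to CUDAPlace(0)
I1114 04:08:43.571846 24523 tensor_util.cu:28] TensorCopy 3, 2 from CUDAPlace(0) to CUDAPlace(0)
"""
after_log = """\
I1114 03:59:26.250453 10889 tensor_util.cu:28] TensorCopy 3, 2 from CPUPlace to CUDAPlace(0)
"""

print(copy_events(before_log))
print(copy_events(after_log))
```

The before log yields two copy events and the after log yields one, which matches the intent of the change.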
