Created by: zhangting2020
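Problem source: https://github.com/PaddlePaddle/benchmark/issues/205#issuecomment-542488429
Description of the assign issue: during GPU training, when the input is on the CPU, the input x is first transferred by data transform to tmp_x on the GPU, and a second GPU -> GPU transfer then copies tmp_x to the output out. The GPU -> GPU copy is redundant.
Problem analysis: the assign op copies an input Tensor or numpy array to the output. While the op runs, it enters PrepareData, which checks whether the place of kernel_type_for_var matches the place of expected_kernel_key. If they differ, a transform is required.
https://github.com/PaddlePaddle/Paddle/blob/8da0cd537ae5f6fa60ac3ecd1d42aed7c730f423/paddle/fluid/framework/operator.cc#L1068-L1073
The original assign op uses the default GetKernelTypeForVar function, which returns an OpKernelType whose place is the place of the input Tensor. If the input Tensor is on the CPU and the kernel to be executed is the CUDAKernel of assign, the places differ and a CPU -> GPU data copy takes place.
https://github.com/PaddlePaddle/Paddle/blob/8da0cd537ae5f6fa60ac3ecd1d42aed7c730f423/paddle/fluid/framework/operator.cc#L1127-L1129
The kernel of the assign op then performs a second copy, this time GPU -> GPU.
https://github.com/PaddlePaddle/Paddle/blob/cfdd1fc2cd3673a6e0de53e257056ad21ba6c75a/paddle/fluid/operators/assign_op.h#L59-L65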
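Optimization: override GetKernelTypeForVar so that it returns an OpKernelType whose place matches the place of expected_kernel_key. PrepareData then sees no mismatch, so no CPU -> GPU copy happens during data transform. Inside the assign op, TensorCopy copies the source data on the CPU directly to the destination address on the GPU, leaving a single copy in total.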
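The effect of the override on the number of copies can be sketched with a small toy model. This is not Paddle code: KernelType, run_assign, and the two get_kernel_type_for_var functions are hypothetical names that only mimic the place comparison in PrepareData and the unconditional copy in the assign kernel, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelType:
    """Toy stand-in for OpKernelType; only the place matters here."""
    place: str  # e.g. "CPUPlace" or "CUDAPlace(0)"


def default_kernel_type_for_var(tensor_place, expected):
    # Default behaviour: report the place of the input tensor itself.
    return KernelType(place=tensor_place)


def overridden_kernel_type_for_var(tensor_place, expected):
    # Proposed override: report the expected place, so no mismatch is seen.
    return KernelType(place=expected.place)


def run_assign(tensor_place, expected, get_kernel_type_for_var):
    """Return the list of (src_place, dst_place) copies the toy op performs."""
    copies = []
    src_place = tensor_place
    var_type = get_kernel_type_for_var(tensor_place, expected)
    if var_type.place != expected.place:
        # Models the data transform: the input is moved to the expected place.
        copies.append((src_place, expected.place))
        src_place = expected.place
    # Models the assign kernel, which always copies its input to the output.
    copies.append((src_place, expected.place))
    return copies


expected = KernelType("CUDAPlace(0)")
before = run_assign("CPUPlace", expected, default_kernel_type_for_var)
after = run_assign("CPUPlace", expected, overridden_kernel_type_for_var)
print("before:", before)
print("after: ", after)
```

In this model the default function yields two copies (CPU -> GPU, then GPU -> GPU), while the override yields a single CPU -> GPU copy performed by the kernel itself.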
The following code was used to verify the execution flow:
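import paddle.fluid as fluid

# Create the input on the CPU even though the program runs on the GPU.
data = fluid.layers.fill_constant(shape=[3, 2], value=2.5, dtype='float64', force_cpu=True)
result = fluid.layers.create_tensor(name='result', dtype='float64')
# assign copies `data` into `result`; this is the op under test.
fluid.layers.assign(data, result)

# Run on GPU 0 so that the CUDA kernel of assign is selected.
exe = fluid.Executor(fluid.core.CUDAPlace(0))
exe.run(fluid.default_startup_program())
out = exe.run(fluid.default_main_program(), fetch_list=[result])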
Before the change:
I1114 04:08:43.568965 24523 operator.cc:172] CUDAPlace(0) Op(fill_constant), inputs:{ShapeTensor[], ShapeTensorList[]}, outputs:{Out[fill_constant_0.tmp_0:double[3, 2]({})]}.
I1114 04:08:43.569031 24523 operator.cc:982] expected_kernel_key:data_type[double]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1114 04:08:43.569059 24523 operator.cc:1077] Transform Variable fill_constant_0.tmp_0 from data_type[double]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[double]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1114 04:08:43.569080 24523 scope.cc:164] Create variable fill_constant_0.tmp_0
I1114 04:08:43.569105 24523 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
I1114 04:08:43.569167 24523 tensor_util.cu:122] TensorCopySync 3, 2 from CPUPlace to CUDAPlace(0)
I1114 04:08:43.571846 24523 tensor_util.cu:28] TensorCopy 3, 2 from CUDAPlace(0) to CUDAPlace(0)
I1114 04:08:43.572003 24523 operator.cc:172] CUDAPlace(0) Op(assign), inputs:{X[fill_constant_0.tmp_0:double[3, 2]({})]}, outputs:{Out[result:double[3, 2]({})]}.
After the change:
I1114 03:59:26.250339 10889 operator.cc:172] CUDAPlace(0) Op(fill_constant), inputs:{ShapeTensor[], ShapeTensorList[]}, outputs:{Out[fill_constant_0.tmp_0:double[3, 2]({})]}.
I1114 03:59:26.250404 10889 operator.cc:982] expected_kernel_key:data_type[double]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1114 03:59:26.250453 10889 tensor_util.cu:28] TensorCopy 3, 2 from CPUPlace to CUDAPlace(0)
I1114 03:59:26.253456 10889 operator.cc:172] CUDAPlace(0) Op(assign), inputs:{X[fill_constant_0.tmp_0:double[3, 2]({})]}, outputs:{Out[result:double[3, 2]({})]}.
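To compare the two logs without reading them line by line, the copy entries can be extracted with a short script. This is only a sketch: list_copies and COPY_RE are hypothetical helpers, and the pattern assumes the TensorCopy / TensorCopySync message format seen in the logs above.

```python
import re

# Matches e.g. "TensorCopySync 3, 2 from CPUPlace to CUDAPlace(0)".
COPY_RE = re.compile(r"(TensorCopy(?:Sync)?) ([\d, ]+?) from (\S+) to (\S+)")


def list_copies(log_text):
    """Return (function, src_place, dst_place) for every tensor copy logged."""
    return [(m.group(1), m.group(3), m.group(4))
            for m in COPY_RE.finditer(log_text)]


# Copy-related lines taken verbatim from the logs above.
BEFORE_LOG = """\
I1114 04:08:43.569105 24523 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
I1114 04:08:43.569167 24523 tensor_util.cu:122] TensorCopySync 3, 2 from CPUPlace to CUDAPlace(0)
I1114 04:08:43.571846 24523 tensor_util.cu:28] TensorCopy 3, 2 from CUDAPlace(0) to CUDAPlace(0)
"""

AFTER_LOG = """\
I1114 03:59:26.250404 10889 operator.cc:982] expected_kernel_key:data_type[double]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1114 03:59:26.250453 10889 tensor_util.cu:28] TensorCopy 3, 2 from CPUPlace to CUDAPlace(0)
"""

copies_before = list_copies(BEFORE_LOG)
copies_after = list_copies(AFTER_LOG)
print(len(copies_before), copies_before)
print(len(copies_after), copies_after)
```

On these excerpts the script finds two copies before the change (CPU -> GPU in the data transform, then GPU -> GPU in the kernel) and one copy after it (CPU -> GPU in the kernel).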