Created by: qili93
PR types
Performance optimization
PR changes
OPs
Describe
Address review comments in PR https://github.com/PaddlePaddle/Paddle/pull/24562
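paddle/fluid/operators/histogram_op.cu
1.1 Rename variables, for example bVal => b_val
1.2 Use shared memory in the kernel to optimize the CudaAtomicAdd calls
python/paddle/tensor/linalg.py
2.1 Add the required blank line below code-block
2.2 Confirm that the English documentation previews correctly
python/paddle/fluid/tests/unittests/test_histogram_op.py
3.1 Add unit tests for the error cases
3.2 Require fp64 as the input data type in the unit tests
3.3 Add a floating-point example to explain how floating-point inputs are binned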
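Items 1.2 and 3.3 can be illustrated with a small pure-Python model. This is only a sketch and not the code in histogram_op.cu: the names bin_index, histogram_direct and histogram_privatized are hypothetical, and the binning rule (truncate (value - min) * bins / (max - min), ignore values outside [min, max], put value == max in the last bin) is an assumption about the operator's behavior. The "privatized" variant mimics the shared memory idea: each block accumulates into its own local counts and merges them into the global result once.

```python
def bin_index(value, bins, lo, hi):
    """Return the bin for value, or None if it lies outside [lo, hi].

    Assumed rule (not taken from the PR): truncate the scaled offset,
    and clamp value == hi into the last bin.
    """
    if value < lo or value > hi:
        return None
    idx = int((value - lo) * bins / (hi - lo))
    return min(idx, bins - 1)


def histogram_direct(values, bins, lo, hi):
    """Every element updates the global counts (the non-shared-memory pattern)."""
    counts = [0] * bins
    for v in values:
        idx = bin_index(v, bins, lo, hi)
        if idx is not None:
            counts[idx] += 1
    return counts


def histogram_privatized(values, bins, lo, hi, block_size=2):
    """Each block fills a private histogram, then merges it once.

    This models the shared memory approach: most updates hit the
    block-local counts, and only `bins` additions per block reach
    the global counts.
    """
    total = [0] * bins
    for start in range(0, len(values), block_size):
        local = histogram_direct(values[start:start + block_size], bins, lo, hi)
        total = [t + c for t, c in zip(total, local)]
    return total


# Integer-valued input: 1 and 2 scale to 1.33 and 2.67, so they land in bins 1 and 2.
print(histogram_direct([1, 2, 1], bins=4, lo=0, hi=3))  # [0, 2, 1, 0]

# Floating-point input: 3.0 equals the upper bound and is clamped into the last bin.
floats = [0.5, 1.5, 2.25, 3.0]
print(histogram_direct(floats, bins=4, lo=0.0, hi=3.0))      # [1, 0, 1, 2]
print(histogram_privatized(floats, bins=4, lo=0.0, hi=3.0))  # [1, 0, 1, 2]
```

Both variants return the same counts. The difference is only in how many updates touch the global result, which is where the contended atomic adds occur on the GPU.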
After the CUDA code changes, the CUDA kernel run times compare as follows.
Original approach without shared memory, input data shape = shape x shape:
shape = 512, time = 0.23
shape = 1024, time = 0.86
shape = 4096, time = 3.37
shape = 8192, time = 13.43
Current approach with shared memory, input data shape = shape x shape:
shape = 512, time = 0.03
shape = 1024, time = 0.16
shape = 4096, time = 0.58
shape = 8192, time = 4.27
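To make the comparison easier to read, the speedup per shape can be computed directly from the reported timings. The snippet below is a small sketch that uses only the numbers listed above. The names TIMINGS and speedups are hypothetical, and the time unit is left unspecified because the description does not state it.

```python
# shape -> (time without shared memory, time with shared memory),
# copied from the timings reported in this description.
TIMINGS = {
    512: (0.23, 0.03),
    1024: (0.86, 0.16),
    4096: (3.37, 0.58),
    8192: (13.43, 4.27),
}


def speedups(timings):
    """Return shape -> (old time / new time)."""
    return {shape: before / after for shape, (before, after) in timings.items()}


for shape, ratio in speedups(TIMINGS).items():
    print(f"shape = {shape}: {ratio:.1f}x faster")
```

This gives roughly 7.7x, 5.4x, 5.8x and 3.1x for shapes 512, 1024, 4096 and 8192, so the gain from shared memory is smallest at the largest measured shape.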