Created by: zhangting2020
PR types: Performance optimization
PR changes: Others
Describe: Update the cub library to a recent version, 1.9.8.
Update the cub library to a recent version:
- Updating to 1.9.10 causes the following compile error under CUDA 9, although CUDA 10 builds without problems:
util_allocator.cuh(409): error: argument of type "__nv_bool" is incompatible with parameter of type "cudaError_t"
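- This PR therefore updates cub to 1.9.8.
- On a V100 with CUDA 10, the performance of the OPs that call cub was measured before and after the upgrade. Each OP was run 10000 times and the average compute time was taken. The timing of the mean op is unstable; it remained unstable with both 10000 and 100000 repetitions.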
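The "repeat N times and take the average" procedure can be sketched as below. This is a minimal illustration, not the harness used for the measurements in this PR: `average_time_ms` and its `warmup` parameter are hypothetical names, and the sketch measures wall-clock time on the host. For a GPU OP, the callable passed in would have to wait for the device to finish before returning, otherwise only the kernel launch cost is measured.

```python
import time


def average_time_ms(fn, repeat=10000, warmup=10):
    """Return the mean wall-clock time of fn() in milliseconds.

    Hypothetical helper for illustration only. For GPU work, fn must
    synchronize with the device before it returns.
    """
    # Warm-up runs are excluded so one-time setup cost does not skew the mean.
    for _ in range(warmup):
        fn()

    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    elapsed_s = time.perf_counter() - start

    # Convert seconds to milliseconds and average over the timed runs.
    return elapsed_s * 1000.0 / repeat


if __name__ == "__main__":
    # Stand-in CPU workload; a real benchmark would call the OP under test.
    print(average_time_ms(lambda: sum(range(1000)), repeat=1000))
```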
OP | Input shape | Before upgrade | After upgrade | Performance change |
---|---|---|---|---|
mean | [1, 3, 128, 128] | 0.0472014 ms | 0.0483602 ms | -2.5% |
mean_grad | [1, 3, 128, 128] | 0.0171568 ms | 0.0158518 ms | +7.6% |
argsort | [1700971, 1] | 24.590 ms | 23.4357 ms | +4.7% |
topk | [16, 1000] | 0.172232 ms | 0.170051 ms | +1.3% |
topk | [16, 3] | 0.164381 ms | 0.163783 ms | +0.4% |
instance_norm | [16, 64, 128, 128] | 0.619216 ms | 0.616683 ms | +0.4% |
instance_norm_grad | [16, 64, 128, 128] | 0.651986 ms | 0.651012 ms | +0.1% |
instance_norm | [16, 128, 64, 64] | 0.177824 ms | 0.179421 ms | -0.9% |
instance_norm_grad | [16, 128, 64, 64] | 0.312913 ms | 0.312026 ms | +0.3% |
instance_norm | [16, 256, 32, 32] | 0.149429 ms | 0.150154 ms | -0.5% |
instance_norm_grad | [16, 256, 32, 32] | 0.359864 ms | 0.362011 ms | -0.6% |
batch_norm | [16, 2048] | 0.0803907 ms | 0.0805798 ms | -0.2% |
batch_norm_grad | [16, 2048] | 0.0500745 ms | 0.0503249 ms | -0.5% |
batch_norm | [16, 1536, 33, 33] | 0.560359 ms | 0.559413 ms | +0.2% |
batch_norm_grad | [16, 1536, 33, 33] | 0.755874 ms | 0.759712 ms | -0.5% |
batch_norm | [16, 32, 256, 256] | 0.769818 ms | 0.767086 ms | +0.4% |
batch_norm_grad | [16, 32, 256, 256] | 1.58815 ms | 1.58004 ms | +0.5% |
layer_norm | [16, 16, 1024] | 0.0289321 ms | 0.0282497 ms | +2.4% |
layer_norm_grad | [16, 16, 1024] | 0.140934 ms | 0.139902 ms | +0.7% |
softmax_with_cross_entropy | [128, 100] | 0.0458516 ms | 0.0469107 ms | -2.3% |
softmax_with_cross_entropy_grad | [128, 100] | 0.0502409 ms | 0.0507099 ms | -0.9% |
sigmoid_cross_entropy_with_logits | [16, 3862] | 0.0215579 ms | 0.0206079 ms | +2.8% |
sigmoid_cross_entropy_with_logits_grad | [16, 3862] | 0.019415 ms | 0.0187059 ms | +3.7% |
sigmoid_cross_entropy_with_logits | [16, 63504] | 0.0428302 ms | 0.0426433 ms | +0.4% |
sigmoid_cross_entropy_with_logits_grad | [16, 63504] | 0.0439403 ms | 0.0440053 ms | -0.1% |
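
The last column of the table appears to be the relative reduction in time, `(before - after) / before`, with a positive value meaning the OP is faster after the upgrade. That sign convention is inferred from the table and is not stated in the PR. A small sketch that recomputes a few rows under this assumption (`perf_change_percent` is a hypothetical helper name):

```python
def perf_change_percent(before_ms, after_ms):
    """Relative time reduction in percent; positive means faster after the upgrade.

    The formula is an assumption inferred from the table values.
    """
    return (before_ms - after_ms) / before_ms * 100.0


# (OP, time before upgrade in ms, time after upgrade in ms), copied from the table.
rows = [
    ("mean", 0.0472014, 0.0483602),
    ("mean_grad", 0.0171568, 0.0158518),
    ("argsort", 24.590, 23.4357),
]

for name, before_ms, after_ms in rows:
    # The "+" flag prints an explicit sign, matching the table's notation.
    print(f"{name}: {perf_change_percent(before_ms, after_ms):+.1f}%")
```

For these three rows the recomputed values round to the -2.5%, +7.6% and +4.7% reported in the table.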