Created by: NHZlX
PR types
Performance optimization
PR changes
OPs
Describe
The current arg_max and arg_min implementations are based on Eigen and instantiate a large number of templates. This inflates the size of the inference lib (about 60M for each op). This PR reimplements both ops with CUDA cub instead.
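For reviewers unfamiliar with cub, here is a minimal sketch of what a cub-based reduction over the last axis can look like. It is an illustration only and not the code in this PR: the kernel name `ArgMaxLastAxis`, the functor `ArgMaxOp`, the one-block-per-row layout, and the block size of 128 are all assumptions made for the example.

```cuda
#include <climits>
#include <cmath>
#include <cstdio>
#include <cub/cub.cuh>

// Pair of (index, value) carried through the reduction.
using KV = cub::KeyValuePair<int, float>;

// Hypothetical reduction functor: keeps the larger value and, on ties,
// the smaller index.
struct ArgMaxOp {
  __device__ KV operator()(const KV& a, const KV& b) const {
    if (b.value > a.value || (b.value == a.value && b.key < a.key)) return b;
    return a;
  }
};

// Hypothetical kernel: one thread block per row, reducing along the last axis.
// Assumes width >= 1 and a row-major [rows, width] input.
template <int BlockDim>
__global__ void ArgMaxLastAxis(const float* in, long long* out, int width) {
  using BlockReduce = cub::BlockReduce<KV, BlockDim>;
  __shared__ typename BlockReduce::TempStorage temp;

  const float* row = in + static_cast<size_t>(blockIdx.x) * width;
  ArgMaxOp op;
  // Identity element: loses against any real element, including -inf ties.
  KV best(INT_MAX, -INFINITY);
  // Each thread scans a strided slice of the row.
  for (int i = threadIdx.x; i < width; i += BlockDim) {
    best = op(best, KV(i, row[i]));
  }
  // Combine per-thread candidates; the result is valid in thread 0 only.
  best = BlockReduce(temp).Reduce(best, op);
  if (threadIdx.x == 0) out[blockIdx.x] = best.key;
}

int main() {
  const int rows = 2, width = 4;
  const float h_in[rows * width] = {1.f, 5.f, 3.f, 5.f, -2.f, -7.f, -1.f, -9.f};
  float* d_in = nullptr;
  long long* d_out = nullptr;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, rows * sizeof(long long));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

  ArgMaxLastAxis<128><<<rows, 128>>>(d_in, d_out, width);

  long long h_out[rows];
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  for (int r = 0; r < rows; ++r) printf("row %d -> %lld\n", r, h_out[r]);
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

A kernel of this shape is instantiated once per element type and block size, which is far fewer instantiations than a heavily templated expression library needs. A reduction over a non-last axis reads memory with a stride, so it would need a different indexing scheme than the one sketched here.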
The tables below compare run time and lib size of the cub implementation against the Eigen one.
Input shape, axis | Eigen | CUDA cub |
---|---|---|
(1000, 10) axis = -1 | 0.068563ms | 0.044079ms |
(1000, 100) axis = -1 | 0.101589ms | 0.042587ms |
(1000, 1000) axis = -1 | 0.947479ms | 0.170187ms |
(1000, 10000) axis = -1 | 9.41424ms | 0.336944ms |

Input shape, axis | Eigen | CUDA cub |
---|---|---|
(1000, 10, 10) axis = 1 | 0.12406ms | 0.13495ms |
(1000, 100, 10) axis = 1 | 0.121825ms | 0.185797ms |
(1000, 1000, 10) axis = 1 | 0.406775ms | 1.35506ms |
(1000, 10000, 10) axis = 1 | 3.57772ms | 3.65801ms |

Op (lib size) | Eigen | CUDA cub |
---|---|---|
ArgMin | 60M | 1.3M |
ArgMax | 60M | 1.3M |