Created by: Xreki
stack_op does the same thing as concat, except for the output's dims, so we can call the concat functor to optimize it. The current code is implemented with thrust, which may trigger calls to cudaMalloc, cudaFree and Wait. See the timeline of ernie (https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-521134541).
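
The claim that stack only differs from concat in the output dims can be sketched with NumPy. This is an illustration of the idea, not the CUDA functor used in this PR; the helper name `stack_via_concat` is hypothetical. Each input is viewed as a 2-D `(pre, post)` matrix, the matrices are concatenated along the second axis, and the result is reshaped to the stacked dims.

```python
import math
import numpy as np


def stack_via_concat(arrays, axis=0):
    """Stack equally shaped arrays using only concatenate plus reshape.

    Hypothetical helper for illustration; assumes all inputs share one shape.
    """
    arrays = [np.asarray(a) for a in arrays]
    shape = arrays[0].shape
    if axis < 0:
        # The output has one more dim than the inputs.
        axis += len(shape) + 1

    # Collapse dims before `axis` into `pre` and the rest into `post`.
    pre = math.prod(shape[:axis])
    post = math.prod(shape[axis:])
    flat = [a.reshape(pre, post) for a in arrays]

    # Concatenating along axis 1 gives a (pre, n * post) buffer whose memory
    # layout already matches the stacked (pre, n, post) result.
    joined = np.concatenate(flat, axis=1)

    # Only the dims differ from a plain concat.
    out_shape = shape[:axis] + (len(arrays),) + shape[axis:]
    return joined.reshape(out_shape)


xs = [np.arange(6).reshape(2, 3) + 10 * i for i in range(4)]
result = stack_via_concat(xs, axis=1)
```

Here `result` has shape `(2, 4, 3)` and matches `np.stack(xs, axis=1)`; the same holds for the other valid axes.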