Created by: reyoung
`concat_op` invokes `cudaMemcpyAsync` many times. Even though `cudaMemcpyAsync` is asynchronous, each call still pays a launch cost on the host, so the total launch time across all the calls is huge.
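
To give a rough sense of how the call count can grow, here is a minimal host-side sketch in Python. It is not the actual `concat_op` code. It assumes one copy is issued per contiguous chunk of each input, which is an assumption about the implementation, and the names `concat_copy_plan` and `concat_flat` are hypothetical. Each list slice assignment stands in for one `cudaMemcpyAsync` call.

```python
from math import prod


def concat_copy_plan(shapes, axis):
    """List the contiguous chunk copies needed to concatenate row-major
    tensors along `axis`.

    Each entry is (dst_offset, input_index, src_offset, length), in elements.
    Assumption: one copy per contiguous chunk per input.
    """
    outer = prod(shapes[0][:axis])        # number of slices before `axis`
    inner = prod(shapes[0][axis + 1:])    # elements per step along `axis`
    chunk = [s[axis] * inner for s in shapes]  # contiguous run per input
    row = sum(chunk)                      # output elements per outer slice
    plan = []
    for o in range(outer):
        col = 0
        for i, c in enumerate(chunk):
            plan.append((o * row + col, i, o * c, c))
            col += c
    return plan


def concat_flat(inputs, shapes, axis):
    """Concatenate flat row-major lists and report how many copies were issued."""
    plan = concat_copy_plan(shapes, axis)
    out = [None] * sum(len(x) for x in inputs)
    for dst, i, src, n in plan:
        # Stand-in for a single cudaMemcpyAsync of n elements.
        out[dst:dst + n] = inputs[i][src:src + n]
    return out, len(plan)


if __name__ == "__main__":
    # Eight inputs of shape (128, 16), concatenated along axis 1.
    shapes = [(128, 16)] * 8
    print(len(concat_copy_plan(shapes, axis=1)), "copies along axis 1")
    print(len(concat_copy_plan(shapes, axis=0)), "copies along axis 0")
```

Under this assumption the number of copies is the number of inputs times the product of the dimensions before the concat axis, so concatenating along an inner axis issues far more calls than concatenating along axis 0. A single kernel that performs all chunk copies in one launch would avoid paying the per-call cost repeatedly.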