Forked from PaddlePaddle / Paddle
* remove cudaDeviceContext
* remove more template
* fix rocm compile
* add fused_seqpool_cvm op;test=develop