Forked from PaddlePaddle / Paddle
* add CUDA implementation of cast kernel
* remove bfloat16 when PADDLE_WITH_HIP is defined
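
For readers unfamiliar with the change, here is a minimal host-side C++ sketch of the two ideas in the commit message: an element-wise cast driven by a functor, and a bfloat16 path that is compiled out when `PADDLE_WITH_HIP` is defined. The names `BF16Sketch`, `CastFunctorSketch`, and `CastSketch` are hypothetical and are not taken from the Paddle source; the actual kernel runs on the GPU, where each thread would handle the loop body for its own indices.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a bfloat16 type (not Paddle's real class).
// It is only compiled when PADDLE_WITH_HIP is NOT defined, mirroring
// the guard described in the commit message.
#ifndef PADDLE_WITH_HIP
struct BF16Sketch {
  uint16_t bits;
  BF16Sketch() : bits(0) {}
  explicit BF16Sketch(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    bits = static_cast<uint16_t>(u >> 16);  // keep upper 16 bits (truncation)
  }
  explicit operator float() const {
    uint32_t u = static_cast<uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
  }
};
#endif

// Element-wise conversion from InT to OutT.
template <typename InT, typename OutT>
struct CastFunctorSketch {
  OutT operator()(InT x) const { return static_cast<OutT>(x); }
};

// Host loop standing in for the GPU kernel: on a device, each thread
// would apply the functor to the elements it owns instead of looping.
template <typename InT, typename OutT>
std::vector<OutT> CastSketch(const std::vector<InT>& in) {
  std::vector<OutT> out(in.size());
  CastFunctorSketch<InT, OutT> cast;
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = cast(in[i]);
  }
  return out;
}
```

Calling `CastSketch<float, int>(values)` converts each element with `static_cast`; `CastSketch<float, BF16Sketch>(values)` is only available in builds where `PADDLE_WITH_HIP` is not defined.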