Forked from PaddlePaddle / Paddle
* fix fleet for multi-stream
* fix memcpy for ncclid
* use sync to solve move operation
* add tensor_indices in AssignGroupBySize
* add rebuild group in reducer