[performance] Schedule send/recv ops from GPU0 to all devices to overlap memcpy
Created by: Yancey1989
In the current distributed training implementation, all send/recv ops are scheduled on GPU0. I ran some experiments that schedule the send/recv ops on different devices so that their memcpys overlap (see the sketch below).
This change improves performance by about 9%.
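Here is a minimal sketch of the scheduling idea; the helper and op names below are hypothetical, not the actual ParallelExecutor/transpiler code:

```python
from itertools import cycle

def place_comm_ops(comm_ops, device_ids):
    """Hypothetical helper: map each send/recv op to a device round-robin
    instead of pinning all of them to GPU0."""
    devices = cycle(device_ids)
    return {op: next(devices) for op in comm_ops}

# With every comm op on GPU0, its host<->device copies serialize on a single
# device; spreading the ops lets the copies run on different devices' streams
# and overlap with each other and with computation.
comm_ops = ["send@fc_0.w_0@GRAD", "send@fc_1.w_0@GRAD",
            "recv@fc_0.w_0", "recv@fc_1.w_0"]
for op, dev in place_comm_ops(comm_ops, device_ids=range(8)).items():
    print(op, "-> GPU", dev)
```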
Setup: 2 trainers + 4 pservers, 8 GPUs per trainer, ParallelExecutor (num_threads=1)
1. develop branch

```
Pass = 0, Elapsed = 160, Training performance = 38.328424 imgs/s, Train accuracy = 0.011654, Test accuracy = 0.011765
Pass = 1, Elapsed = 156, Training performance = 39.400293 imgs/s, Train accuracy = 0.009553, Test accuracy = 0.009804
```

2. overlap memcpy branch

```
Pass = 0, Elapsed = 146, Training performance = 41.994200 imgs/s, Train accuracy = 0.010613, Test accuracy = 0.009649
Pass = 1, Elapsed = 144, Training performance = 42.671882 imgs/s, Train accuracy = 0.008629, Test accuracy = 0.010768
```
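For reference: Pass 0 throughput goes from 38.33 to 41.99 imgs/s (41.99 / 38.33 ≈ 1.096, +9.6%) and Pass 1 from 39.40 to 42.67 imgs/s (≈ +8.3%), which matches the ~9% improvement claimed above.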
Please see the latest experiment details in the comments of PR #11221.