[performance] Schedule send/recv ops from GPU0 to all devices to overlap memcpy
Created by: Yancey1989
In the current distributed training implementation, all send/recv ops are scheduled on GPU0. I ran some experiments that schedule the send/recv ops on different devices so that their memcpys overlap (see the sketch below).
This change improves performance by about 9%.
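Here is a minimal sketch of the scheduling idea; the helper and op names below are hypothetical, not the actual ParallelExecutor/transpiler code:

```python
from itertools import cycle

def place_comm_ops(comm_ops, device_ids):
    """Hypothetical helper: map each send/recv op to a device round-robin
    instead of pinning all of them to GPU0."""
    devices = cycle(device_ids)
    return {op: next(devices) for op in comm_ops}

# With every comm op on GPU0, its host<->device copies serialize on a single
# device; spreading the ops lets the copies run on different devices' streams
# and overlap with each other and with computation.
comm_ops = ["send@fc_0.w_0@GRAD", "send@fc_1.w_0@GRAD",
            "recv@fc_0.w_0", "recv@fc_1.w_0"]
for op, dev in place_comm_ops(comm_ops, device_ids=range(8)).items():
    print(op, "-> GPU", dev)
```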
Setup: 2 trainers + 4 pservers, 8 GPUs per trainer, ParallelExecutor (num_threads=1)
1. develop branch

```
Pass = 0, Elapsed = 160, Training performance = 38.328424 imgs/s, Train accuracy = 0.011654, Test accuracy = 0.011765
Pass = 1, Elapsed = 156, Training performance = 39.400293 imgs/s, Train accuracy = 0.009553, Test accuracy = 0.009804
```

2. overlap memcpy branch

```
Pass = 0, Elapsed = 146, Training performance = 41.994200 imgs/s, Train accuracy = 0.010613, Test accuracy = 0.009649
Pass = 1, Elapsed = 144, Training performance = 42.671882 imgs/s, Train accuracy = 0.008629, Test accuracy = 0.010768
```
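For reference: Pass 0 throughput goes from 38.33 to 41.99 imgs/s (41.99 / 38.33 ≈ 1.096, +9.6%) and Pass 1 from 39.40 to 42.67 imgs/s (≈ +8.3%), which matches the ~9% improvement claimed above.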
Please see the latest experiment details in the comments of PR #11221.