ParallelDo performance on VGG
Created by: tonyyang-svail
Major takeaways:
- Parameter copy is still a big bottleneck: for a large net like VGG16, Memcpy takes up to 80% of the total time.
- We do need multiple streams: the AllReduce kernel takes about 70% of the total kernel time.
- NCCLInit should not be called on every iteration; it takes about 70 ms for one GPU and 90 ms for four GPUs (see the sketch after this list).
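The last two points suggest creating the NCCL communicators and a dedicated per-GPU communication stream once, outside the training loop, and issuing the gradient AllReduce on those streams. Below is a minimal CUDA/NCCL sketch of that structure, not the actual parallel_do implementation; the device count, gradient buffer size, and loop skeleton are illustrative assumptions.

```cpp
// Sketch only: NCCL communicators and per-GPU streams created once,
// outside the training loop; gradient AllReduce issued on a dedicated
// communication stream per GPU. Assumes 4 GPUs and a flat fp32 gradient
// buffer (sizes are illustrative, not the measured VGG16 setup).
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

int main() {
  const int num_gpus = 4;
  const size_t grad_count = 100 * 1024 * 1024;  // ~400 MB of fp32 gradients

  std::vector<int> dev_ids = {0, 1, 2, 3};
  std::vector<ncclComm_t> comms(num_gpus);
  std::vector<cudaStream_t> comm_streams(num_gpus);
  std::vector<float*> grads(num_gpus);

  // NCCL init done ONCE, not per iteration (it costs ~90 ms on 4 GPUs
  // according to the measurements above).
  ncclCommInitAll(comms.data(), num_gpus, dev_ids.data());

  for (int i = 0; i < num_gpus; ++i) {
    cudaSetDevice(dev_ids[i]);
    cudaStreamCreate(&comm_streams[i]);  // dedicated communication stream
    cudaMalloc(&grads[i], grad_count * sizeof(float));
  }

  const int num_iters = 100;
  for (int iter = 0; iter < num_iters; ++iter) {
    // ... forward/backward would run here on each GPU's compute stream ...

    // AllReduce gradients across GPUs; the group calls let NCCL launch
    // all four operations together instead of serializing them.
    ncclGroupStart();
    for (int i = 0; i < num_gpus; ++i) {
      ncclAllReduce(grads[i], grads[i], grad_count, ncclFloat, ncclSum,
                    comms[i], comm_streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < num_gpus; ++i) {
      cudaSetDevice(dev_ids[i]);
      cudaStreamSynchronize(comm_streams[i]);
    }
    // ... apply gradients ...
  }

  for (int i = 0; i < num_gpus; ++i) {
    ncclCommDestroy(comms[i]);
    cudaSetDevice(dev_ids[i]);
    cudaFree(grads[i]);
    cudaStreamDestroy(comm_streams[i]);
  }
  return 0;
}
```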
Background
- Net: VGG16
- Model size: 409M (the original definition of the VGG16 net is incorrect, see https://github.com/PaddlePaddle/Paddle/issues/8718)
- Batch size: 16 per GPU
- BatchNorm: OFF, since parallel_do does not support it
- Inputs are randomly generated on each GPU, so there is no overhead from copying training data to the devices (see the sketch below)
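To illustrate the last point, here is a minimal cuRAND sketch that fills each GPU's input batch directly in device memory, so no training data crosses the host-device boundary. It assumes the standard 3x224x224 VGG input and batch size 16 per GPU; everything else (seeds, loop structure) is illustrative.

```cpp
// Sketch only: generate random input batches on each GPU with cuRAND,
// avoiding any host-to-device copy of training data.
#include <cuda_runtime.h>
#include <curand.h>
#include <vector>

int main() {
  const int num_gpus = 4;
  const size_t batch_elems = 16UL * 3 * 224 * 224;  // batch 16 per GPU

  std::vector<float*> inputs(num_gpus);
  std::vector<curandGenerator_t> gens(num_gpus);

  for (int i = 0; i < num_gpus; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&inputs[i], batch_elems * sizeof(float));
    curandCreateGenerator(&gens[i], CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gens[i], 1234 + i);
    // Fill the batch in device memory; nothing is copied from the host.
    curandGenerateUniform(gens[i], inputs[i], batch_elems);
  }

  for (int i = 0; i < num_gpus; ++i) {
    cudaSetDevice(i);
    cudaDeviceSynchronize();
    curandDestroyGenerator(gens[i]);
    cudaFree(inputs[i]);
  }
  return 0;
}
```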
Result
Time unit: milliseconds.
# GPUs | copy weights (ms) | forward and backward (ms) | merge gradient (ms) | apply gradient (ms) | total (ms) |
---|---|---|---|---|---|
1 | N/A | 130 | N/A | 5 | |
1 (NCCL in backward) | N/A | 220 | N/A | 5 | |
4 | 350 | 130 | 350 | 5 | |
4 (NCCL in backward) | 350 | 650 (AllReduce takes about 70%) | N/A | 5 | |