ParallelDo performance on VGG
Created by: tonyyang-svail
Major takeaways:
- Parameter copy is still a big bottleneck: for a large net like VGG16, Memcpy takes up to 80% of the total time.
- We do need multiple streams: the AllReduce kernel takes about 70% of the total kernel time.
- NCCLInit should not be called on every iteration; it takes about 70 ms for one GPU and 90 ms for four GPUs (see the sketch after this list).
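The last two points suggest creating the NCCL communicators and a dedicated per-GPU communication stream once, outside the training loop, and issuing the gradient AllReduce on those streams. Below is a minimal CUDA/NCCL sketch of that structure, not the actual parallel_do implementation; the device count, gradient buffer size, and loop skeleton are illustrative assumptions.

```cpp
// Sketch only: NCCL communicators and per-GPU streams created once,
// outside the training loop; gradient AllReduce issued on a dedicated
// communication stream per GPU. Assumes 4 GPUs and a flat fp32 gradient
// buffer (sizes are illustrative, not the measured VGG16 setup).
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

int main() {
  const int num_gpus = 4;
  const size_t grad_count = 100 * 1024 * 1024;  // ~400 MB of fp32 gradients

  std::vector<int> dev_ids = {0, 1, 2, 3};
  std::vector<ncclComm_t> comms(num_gpus);
  std::vector<cudaStream_t> comm_streams(num_gpus);
  std::vector<float*> grads(num_gpus);

  // NCCL init done ONCE, not per iteration (it costs ~90 ms on 4 GPUs
  // according to the measurements above).
  ncclCommInitAll(comms.data(), num_gpus, dev_ids.data());

  for (int i = 0; i < num_gpus; ++i) {
    cudaSetDevice(dev_ids[i]);
    cudaStreamCreate(&comm_streams[i]);  // dedicated communication stream
    cudaMalloc(&grads[i], grad_count * sizeof(float));
  }

  const int num_iters = 100;
  for (int iter = 0; iter < num_iters; ++iter) {
    // ... forward/backward would run here on each GPU's compute stream ...

    // AllReduce gradients across GPUs; the group calls let NCCL launch
    // all four operations together instead of serializing them.
    ncclGroupStart();
    for (int i = 0; i < num_gpus; ++i) {
      ncclAllReduce(grads[i], grads[i], grad_count, ncclFloat, ncclSum,
                    comms[i], comm_streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < num_gpus; ++i) {
      cudaSetDevice(dev_ids[i]);
      cudaStreamSynchronize(comm_streams[i]);
    }
    // ... apply gradients ...
  }

  for (int i = 0; i < num_gpus; ++i) {
    ncclCommDestroy(comms[i]);
    cudaSetDevice(dev_ids[i]);
    cudaFree(grads[i]);
    cudaStreamDestroy(comm_streams[i]);
  }
  return 0;
}
```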
Background
- Net: VGG16
- Model size: 409M (the original definition of the VGG16 net is incorrect, see https://github.com/PaddlePaddle/Paddle/issues/8718)
- Batch size: 16 per GPU
- BatchNorm: OFF, since parallel_do does not support it
- Inputs are randomly generated on each GPU, so there is no overhead from copying training data to the devices (see the sketch below)
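To illustrate the last point, here is a minimal cuRAND sketch that fills each GPU's input batch directly in device memory, so no training data crosses the host-device boundary. It assumes the standard 3x224x224 VGG input and batch size 16 per GPU; everything else (seeds, loop structure) is illustrative.

```cpp
// Sketch only: generate random input batches on each GPU with cuRAND,
// avoiding any host-to-device copy of training data.
#include <cuda_runtime.h>
#include <curand.h>
#include <vector>

int main() {
  const int num_gpus = 4;
  const size_t batch_elems = 16UL * 3 * 224 * 224;  // batch 16 per GPU

  std::vector<float*> inputs(num_gpus);
  std::vector<curandGenerator_t> gens(num_gpus);

  for (int i = 0; i < num_gpus; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&inputs[i], batch_elems * sizeof(float));
    curandCreateGenerator(&gens[i], CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gens[i], 1234 + i);
    // Fill the batch in device memory; nothing is copied from the host.
    curandGenerateUniform(gens[i], inputs[i], batch_elems);
  }

  for (int i = 0; i < num_gpus; ++i) {
    cudaSetDevice(i);
    cudaDeviceSynchronize();
    curandDestroyGenerator(gens[i]);
    cudaFree(inputs[i]);
  }
  return 0;
}
```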
Result
Time unit: milliseconds.
# GPUs | copy weights (ms) | forward and backward (ms) | merge gradient (ms) | apply gradient (ms) | total (ms) |
---|---|---|---|---|---|
1 | N/A | 130 | N/A | 5 | |
1 (NCCL in backward) | N/A | 220 | N/A | 5 | |
4 | 350 | 130 | 350 | 5 | |
4 (NCCL in backward) | 350 | 650 (AllReduce takes about 70%) | N/A | 5 | |