Benchmark of convnets on multiple graphics cards
Created by: tonyyang-svail
Background
- Machine: TitanX x 8
- benchmark scripts
- command
  ```bash
  python benchmark.py --use_data_parallel --batch_size=1024 --label_size=100 --iterations=5
  ```
Model overview:
- Input: 1024 x 1 x 100 x 100 (batch_size x channels x height x width)
- 5 layers of `conv2d(stride=2, num_filters=50, filter_size=5)`
- 2 layers of `fc(size=1000, act="softmax")`
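For reference, the network can be written roughly as follows with the legacy `paddle.fluid` layers API; everything beyond the layer arguments listed above (the input name, dtype, and the helper name `conv_fc_net`) is an assumption, not taken from the benchmark script.

```python
# Minimal sketch of the benchmarked network using legacy paddle.fluid layers.
# Layer arguments not listed in the model overview are assumptions.
import paddle.fluid as fluid

def conv_fc_net():
    # 1024 x 1 x 100 x 100 input: batch_size x channels x height x width
    img = fluid.layers.data(name='img', shape=[1, 100, 100], dtype='float32')
    hidden = img
    # 5 conv layers: stride=2, num_filters=50, filter_size=5
    for _ in range(5):
        hidden = fluid.layers.conv2d(
            input=hidden, num_filters=50, filter_size=5, stride=2)
    # 2 fc layers: size=1000, softmax activation
    for _ in range(2):
        hidden = fluid.layers.fc(input=hidden, size=1000, act='softmax')
    return hidden
```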
Results in milliseconds
number of GPUs | CPU to GPU | split input and copy weights | forward | backward | merge gradients | apply gradients | total |
---|---|---|---|---|---|---|---|
1 | 34 | / | 46 | 1209 | / | 5 | 1350 |
2 | 35 | 65 | 21 | 593 | 1 | 5 | 750 |
4 | 32 | 35 | 10 | 231 | 3 | 2 | 330 |
8 | 32 | 32 | 7 | 160 | 5 | 2 | 250 |
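Each column above corresponds to one stage of a data-parallel iteration. The toy NumPy sketch below only illustrates what those stages do, using a plain linear model; it is not the actual `parallel_do` implementation, and all names in it are illustrative.

```python
# Conceptual NumPy sketch of one data-parallel iteration (toy linear model).
# Stage comments mirror the table columns; this is not parallel_do itself.
import numpy as np

num_gpus = 4
batch, in_dim, out_dim, lr = 1024, 100, 10, 0.01
weights = np.random.randn(in_dim, out_dim) * 0.01      # master copy of parameters

x = np.random.randn(batch, in_dim)                      # "CPU to GPU": the batch arrives on device
y = np.random.randn(batch, out_dim)

# split input and copy weights: each device gets one shard and a weight replica
x_shards = np.split(x, num_gpus)
y_shards = np.split(y, num_gpus)
replicas = [weights.copy() for _ in range(num_gpus)]

# forward + backward per device (written sequentially here; devices run in parallel)
grads = []
for xs, ys, w in zip(x_shards, y_shards, replicas):
    pred = xs @ w                                       # forward
    grads.append(xs.T @ (pred - ys) / len(xs))          # backward: dL/dw of a squared loss

# merge gradients: average the per-device gradients
merged = sum(grads) / num_gpus

# apply gradients: SGD update on the master parameters
weights -= lr * merged
```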
Takeaway
- Two bottlenecks:
  - MemcpyH2D: the CPU-to-GPU copy can be hidden by the data feeder (see the prefetching sketch after this list)
  - backward: dominated by elementwise_add_grad, as described in #7902 (closed) and #7862 (closed)
- If the model is computation-intensive, e.g. many convolutions with a large batch size, parallel_do doing everything sequentially looks fine.
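One way the CPU-to-GPU copy can be hidden is to let a feeder thread stage the next batch while the current iteration is computing. Below is a minimal double-buffering sketch in plain Python; `copy_to_gpu` and `run_iteration` are placeholders for the real transfer and training step, not Paddle APIs.

```python
# Minimal prefetching sketch: overlap "copy next batch to GPU" with compute.
# copy_to_gpu / run_iteration are placeholders, not real Paddle/Fluid calls.
import queue
import threading

def copy_to_gpu(batch):
    # placeholder for the MemcpyH2D transfer of one batch
    return batch

def run_iteration(gpu_batch):
    # placeholder for forward/backward/update on an already-resident batch
    pass

def feeder(batches, q):
    for batch in batches:
        q.put(copy_to_gpu(batch))      # H2D copy happens here, ahead of use
    q.put(None)                        # sentinel: no more data

def train(batches):
    q = queue.Queue(maxsize=2)         # small buffer: next batch staged while training
    threading.Thread(target=feeder, args=(batches, q), daemon=True).start()
    while True:
        gpu_batch = q.get()
        if gpu_batch is None:
            break
        run_iteration(gpu_batch)       # compute overlaps with the feeder's next copy

train([object()] * 5)
```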