Benchmark of convnets on multiple graphics cards
Created by: tonyyang-svail
Background
- Machine: TitanX x 8
- benchmark scripts
- command
  ```bash
  python benchmark.py --use_data_parallel --batch_size=1024 --label_size=100 --iterations=5
  ```
Model overview:
- Input: 1024 x 1 x 100 x 100 (batch_size x channels x height x width)
- 5 layers of `conv2d(stride=2, num_filters=50, filter_size=5)`
- 2 layers of `fc(size=1000, act="softmax")`
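For reference, the network can be written roughly as follows with the legacy `paddle.fluid` layers API; everything beyond the layer arguments listed above (the input name, dtype, and the helper name `conv_fc_net`) is an assumption, not taken from the benchmark script.

```python
# Minimal sketch of the benchmarked network using legacy paddle.fluid layers.
# Layer arguments not listed in the model overview are assumptions.
import paddle.fluid as fluid

def conv_fc_net():
    # 1024 x 1 x 100 x 100 input: batch_size x channels x height x width
    img = fluid.layers.data(name='img', shape=[1, 100, 100], dtype='float32')
    hidden = img
    # 5 conv layers: stride=2, num_filters=50, filter_size=5
    for _ in range(5):
        hidden = fluid.layers.conv2d(
            input=hidden, num_filters=50, filter_size=5, stride=2)
    # 2 fc layers: size=1000, softmax activation
    for _ in range(2):
        hidden = fluid.layers.fc(input=hidden, size=1000, act='softmax')
    return hidden
```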
Results in milliseconds
number of GPUs | CPU to GPU | split input and copy weights | forward | backward | merge gradients | apply gradients | total |
---|---|---|---|---|---|---|---|
1 | 34 | / | 46 | 1209 | / | 5 | 1350 |
2 | 35 | 65 | 21 | 593 | 1 | 5 | 750 |
4 | 32 | 35 | 10 | 231 | 3 | 2 | 330 |
8 | 32 | 32 | 7 | 160 | 5 | 2 | 250 |
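Each column above corresponds to one stage of a data-parallel iteration. The toy NumPy sketch below only illustrates what those stages do, using a plain linear model; it is not the actual `parallel_do` implementation, and all names in it are illustrative.

```python
# Conceptual NumPy sketch of one data-parallel iteration (toy linear model).
# Stage comments mirror the table columns; this is not parallel_do itself.
import numpy as np

num_gpus = 4
batch, in_dim, out_dim, lr = 1024, 100, 10, 0.01
weights = np.random.randn(in_dim, out_dim) * 0.01      # master copy of parameters

x = np.random.randn(batch, in_dim)                      # "CPU to GPU": the batch arrives on device
y = np.random.randn(batch, out_dim)

# split input and copy weights: each device gets one shard and a weight replica
x_shards = np.split(x, num_gpus)
y_shards = np.split(y, num_gpus)
replicas = [weights.copy() for _ in range(num_gpus)]

# forward + backward per device (written sequentially here; devices run in parallel)
grads = []
for xs, ys, w in zip(x_shards, y_shards, replicas):
    pred = xs @ w                                       # forward
    grads.append(xs.T @ (pred - ys) / len(xs))          # backward: dL/dw of a squared loss

# merge gradients: average the per-device gradients
merged = sum(grads) / num_gpus

# apply gradients: SGD update on the master parameters
weights -= lr * merged
```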
Takeaway
- Two bottlenecks:
  - MemcpyH2D: the CPU-to-GPU copy can be hidden by the data feeder (see the prefetching sketch after this list)
  - backward: dominated by elementwise_add_grad, as described in #7902 (closed) and #7862 (closed)
- If the model is computation-intensive, e.g. many convolutions with a large batch size, parallel_do doing everything sequentially looks fine.
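One way the CPU-to-GPU copy can be hidden is to let a feeder thread stage the next batch while the current iteration is computing. Below is a minimal double-buffering sketch in plain Python; `copy_to_gpu` and `run_iteration` are placeholders for the real transfer and training step, not Paddle APIs.

```python
# Minimal prefetching sketch: overlap "copy next batch to GPU" with compute.
# copy_to_gpu / run_iteration are placeholders, not real Paddle/Fluid calls.
import queue
import threading

def copy_to_gpu(batch):
    # placeholder for the MemcpyH2D transfer of one batch
    return batch

def run_iteration(gpu_batch):
    # placeholder for forward/backward/update on an already-resident batch
    pass

def feeder(batches, q):
    for batch in batches:
        q.put(copy_to_gpu(batch))      # H2D copy happens here, ahead of use
    q.put(None)                        # sentinel: no more data

def train(batches):
    q = queue.Queue(maxsize=2)         # small buffer: next batch staged while training
    threading.Thread(target=feeder, args=(batches, q), daemon=True).start()
    while True:
        gpu_batch = q.get()
        if gpu_batch is None:
            break
        run_iteration(gpu_batch)       # compute overlaps with the feeder's next copy

train([object()] * 5)
```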