Created by: tonyyang-svail
DO NOT MERGE THIS PR!
This PR serves as a baseline for https://github.com/PaddlePaddle/Paddle/pull/9080. The main difference:
In this PR, each thread is bound to one GPU. Each thread launches all Ops sequentially on the computation stream and launches AllReduce on the io stream; a CUDA event is used to coordinate the two streams.
In https://github.com/PaddlePaddle/Paddle/pull/9080, dependency analysis is used to schedule ready Ops onto a thread pool.
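The coordination described above (record an event on the computation stream after a gradient Op, make the io stream wait on it before AllReduce) can be sketched with a CPU-side analogue. This is a minimal conceptual simulation, not PaddlePaddle code: the op names, `NUM_GPUS`, and the use of `threading.Event` in place of `cudaEventRecord`/`cudaStreamWaitEvent` are all illustrative assumptions, and each simulated stream runs as its own thread for clarity.

```python
import threading

NUM_GPUS = 2                      # hypothetical device count
OPS = ["conv_grad", "fc_grad"]    # hypothetical gradient Ops, launched in order

# One event per (gpu, op), set when the compute "stream" finishes that op.
# This mirrors cudaEventRecord on the computation stream.
done = {(g, op): threading.Event() for g in range(NUM_GPUS) for op in OPS}

results = []                      # (gpu, op) pairs in AllReduce completion order
lock = threading.Lock()

def compute_worker(gpu):
    """Simulates the computation stream: launches all Ops sequentially."""
    for op in OPS:
        # ... run the op's kernel here ...
        done[(gpu, op)].set()     # cudaEventRecord analogue

def allreduce_worker(gpu):
    """Simulates the io stream: waits on each event, then AllReduces."""
    for op in OPS:
        done[(gpu, op)].wait()    # cudaStreamWaitEvent analogue
        with lock:
            results.append((gpu, op))   # stand-in for the AllReduce launch

threads = []
for g in range(NUM_GPUS):
    threads.append(threading.Thread(target=compute_worker, args=(g,)))
    threads.append(threading.Thread(target=allreduce_worker, args=(g,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After joining, every (gpu, op) pair appears in `results`, and for each GPU the AllReduces happen in Op order, since each waits on its own event before launching.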
machine: 250
test_script: test_parallel_executor.py (in this PR)
test_commands:
CUDA_VISIBLE_DEVICES=3 python -m unittest test_parallel_executor.TestResnet
CUDA_VISIBLE_DEVICES=3,4,5,6 python -m unittest test_parallel_executor.TestResnet
model: SE_ResNeXt152
batch_size: 16 per GPU
model size: 1382651904
peak memory: 7351879168
1 GPU: 20.7775 instances per second
4 GPUs: 60.8804 instances per second