Created by: tonyyang-svail
DO NOT MERGE THIS PR!
This PR serves as a baseline for https://github.com/PaddlePaddle/Paddle/pull/9080. The main difference:
In this PR, each thread is bound to one GPU. Each thread launches all Ops sequentially on the computation stream and launches AllReduce on the io stream; a CUDA event is used to coordinate the two streams.
In https://github.com/PaddlePaddle/Paddle/pull/9080, dependency analysis is used to schedule ready Ops onto a thread pool.
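The coordination described above (record an event on the computation stream after a gradient Op, make the io stream wait on it before AllReduce) can be sketched with a CPU-side analogue. This is a minimal conceptual simulation, not PaddlePaddle code: the op names, `NUM_GPUS`, and the use of `threading.Event` in place of `cudaEventRecord`/`cudaStreamWaitEvent` are all illustrative assumptions, and each simulated stream runs as its own thread for clarity.

```python
import threading

NUM_GPUS = 2                      # hypothetical device count
OPS = ["conv_grad", "fc_grad"]    # hypothetical gradient Ops, launched in order

# One event per (gpu, op), set when the compute "stream" finishes that op.
# This mirrors cudaEventRecord on the computation stream.
done = {(g, op): threading.Event() for g in range(NUM_GPUS) for op in OPS}

results = []                      # (gpu, op) pairs in AllReduce completion order
lock = threading.Lock()

def compute_worker(gpu):
    """Simulates the computation stream: launches all Ops sequentially."""
    for op in OPS:
        # ... run the op's kernel here ...
        done[(gpu, op)].set()     # cudaEventRecord analogue

def allreduce_worker(gpu):
    """Simulates the io stream: waits on each event, then AllReduces."""
    for op in OPS:
        done[(gpu, op)].wait()    # cudaStreamWaitEvent analogue
        with lock:
            results.append((gpu, op))   # stand-in for the AllReduce launch

threads = []
for g in range(NUM_GPUS):
    threads.append(threading.Thread(target=compute_worker, args=(g,)))
    threads.append(threading.Thread(target=allreduce_worker, args=(g,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After joining, every (gpu, op) pair appears in `results`, and for each GPU the AllReduces happen in Op order, since each waits on its own event before launching.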
machine: 250
test_script: test_parallel_executor.py (in this PR)
test_commands:
CUDA_VISIBLE_DEVICES=3 python -m unittest test_parallel_executor.TestResnet
CUDA_VISIBLE_DEVICES=3,4,5,6 python -m unittest test_parallel_executor.TestResnet
model: SE_ResNeXt152
batch_size: 16 per GPU
model size: 1382651904
peak memory: 7351879168
1 GPU: 20.7775 instances per second
4 GPUs: 60.8804 instances per second