GPU under-utilized while waiting for the CPU to launch kernels
Created by: panyx0718
Currently we run Ops one by one, synchronously. For Ops that the GPU finishes quickly, the CPU cannot launch kernels fast enough to keep the GPU busy. Hence, in many cases, the GPU is under-utilized.
To mitigate this, we need to schedule Ops in parallel, based on their dependency information, so that Ops with no unfinished dependencies can be launched concurrently. This would let us better utilize both CPUs and GPUs.
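A minimal sketch of the idea, not the actual executor: it assumes each Op can be represented as a zero-argument callable and that dependencies are given as a plain dict. The function name `run_ops_in_parallel`, the `ops`/`deps` layout, and the use of a Python thread pool are all hypothetical and only illustrate dependency-based scheduling; a real implementation would launch GPU kernels from multiple CPU threads or streams.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_ops_in_parallel(ops, deps, max_workers=4):
    """Run each op as soon as all of its dependencies have finished.

    ops:  dict of op name -> zero-argument callable (hypothetical stand-in
          for a kernel launch)
    deps: dict of op name -> list of op names it depends on
    Returns the op names in the order they completed.
    """
    # Number of unfinished dependencies for each op.
    pending = {name: len(deps.get(name, [])) for name in ops}
    # Reverse edges: ops that get closer to ready when `name` finishes.
    dependents = {name: [] for name in ops}
    for name, parents in deps.items():
        for parent in parents:
            dependents[parent].append(name)

    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Start with every op that has no dependencies.
        running = {pool.submit(ops[n]): n for n in ops if pending[n] == 0}
        while running:
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the op
                finished.append(name)
                # Submit dependents whose last dependency just finished.
                for child in dependents[name]:
                    pending[child] -= 1
                    if pending[child] == 0:
                        running[pool.submit(ops[child])] = child

    if len(finished) != len(ops):
        raise ValueError("dependency cycle detected")
    return finished


if __name__ == "__main__":
    log = []
    ops = {n: (lambda n=n: log.append(n)) for n in "abcd"}
    # "a" and "b" are independent; "c" needs both; "d" needs "c".
    print(run_ops_in_parallel(ops, {"c": ["a", "b"], "d": ["c"]}))
```

In this sketch, independent Ops such as `a` and `b` can be in flight at the same time, while `c` is only submitted after both complete. The relative order of `a` and `b` is not deterministic.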