Created by: panyx0718
Currently, I find that ParallelExecutor can hang on some models. When I use a separate thread pool for each device and set num_threads=1, the problem goes away. (However, we should still fix it.)
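
For anyone unfamiliar with the workaround, here is a rough sketch of the "one thread pool per device, one thread each" layout. It uses only Python's standard `concurrent.futures` and is not the actual ParallelExecutor code. The device names and the helpers `make_device_pools`, `run_ops`, and `demo` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical device names, chosen only for this illustration.
DEVICES = ["gpu:0", "gpu:1"]


def make_device_pools(devices, num_threads=1):
    """Create one independent thread pool per device."""
    return {d: ThreadPoolExecutor(max_workers=num_threads) for d in devices}


def run_ops(pools, ops):
    """Submit each (device, fn) pair to that device's own pool.

    Results are returned in submission order.
    """
    futures = [pools[device].submit(fn) for device, fn in ops]
    return [f.result() for f in futures]


def demo():
    pools = make_device_pools(DEVICES, num_threads=1)
    try:
        # Three dummy ops per device; each just reports where it ran.
        ops = [
            (d, (lambda d=d, i=i: f"{d}:op{i}"))
            for i in range(3)
            for d in DEVICES
        ]
        return run_ops(pools, ops)
    finally:
        for pool in pools.values():
            pool.shutdown()


if __name__ == "__main__":
    print(demo())
```

With a single worker per pool, ops submitted to the same device run one at a time in submission order, and no device waits for a thread held by another device. Whether that is why the hang disappears in ParallelExecutor is an open question; this only shows the layout.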