Design Doc: Execute the Program with Multi CPU

Abstract

This Design Doc propose an approach to make the user-defined Op graph running with multi-CPU, we will use an auto transpiler to convert the user-defined Op graph to a multi-CPU Op graph, and run ParallelDo Op to run the graph.

Transpiler

After converted:

Implement

  • Multi-CPU Transpiler will convert the graph to a multi-CPU graph which would be executed with multi-threads.

  • BlockingCounter will Init/Decrement an atomic counter, and Blocking Wait for the atomic counter become 0:

    BlockingCounter bc(thread_count);
    for (int i = 0; i < thread_count; ++i) {
      thread_pool->Start([&bc] {bc.DecrementCount(); })
    }
    bc.Wait();
    
  • ParallelDo Operator

    • Initialize a thread pool which is a Singleton.
    • Use a block id as the input, and create run the specify Block on independent scope with multi-threads.
    • Initialize a BlockingCounter instance and wait until all threads are done.
  • Split Operator will split the Input Tensor into a TensorArray.

  • Merge merge all the gradients which calculated in different threads with mean/sum/max/min... method, and then run the Optimizer Op to optimize W.

TODO

  • Improve the optimizer stage with multi-threads, since we could assign the parameters to the different threads and execute optimizer with multi-threads.