# Design Doc: Execute the Program with Multi CPU

## Abstract

This design doc proposes an approach to run a user-defined Op graph
on multiple CPUs: an automatic transpiler converts the user-defined
Op graph into a multi-CPU Op graph, which is then executed by the `ParallelDo` Op.

## Transpiler

<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/single-thread@3x.png" width="300">

After conversion:

<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/multi-threads@3x.png" width="1000">

## Implementation

- `Multi-CPU Transpiler` converts the graph into a multi-CPU graph
  that is executed with multiple threads.
- `BlockingCounter` wraps an atomic counter: it is initialized with a count,
  `DecrementCount` decreases it, and `Wait` blocks until the counter reaches `0`:
  ```cpp
  BlockingCounter bc(thread_count);
  for (int i = 0; i < thread_count; ++i) {
    thread_pool->Start([&bc] { bc.DecrementCount(); });
  }
  bc.Wait();
  ```
- `ParallelDo` Operator
  - Initializes a thread pool, which is a singleton.
  - Takes a block id as input, and runs the specified Block in an independent scope
    on each thread.
  - Initializes a `BlockingCounter` instance and waits until all threads are done.
- The `Split` Operator splits the input Tensor into a TensorArray.
- `Merge` merges the gradients calculated in different threads
  with a `mean/sum/max/min...` method, then runs the Optimizer Op to update `W`.

## TODO

- Parallelize the optimizer stage with multiple threads: assign the
  parameters to different threads and run the optimizer Ops
  concurrently.