1. Make it easy for external contributors to write new elementary computation operations.
1. Make the codebase clean and readable.
1. Introduce a new design of computation representation -- a computation graph of operators and variables.
1. The graph representation helps implement auto-scalable and fault-recoverable distributed computing.
## Computation Graphs
1. PaddlePaddle represents the computation -- the training and inference of DL models -- as [computation graphs](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md).
1. Users write Python programs to describe the graphs and run them (locally or remotely).
1. A graph is composed of *variables* and *operators*.
1. The description of graphs must be serializable/deserializable, so that it
   1. can be sent to the cloud for distributed execution, and
   1. can be sent to clients for mobile or enterprise deployment.
1. The Python program does
   1. *compilation*: generates a protobuf message representation of the graph and sends it to
      1. the C++ library `libpaddle.so` for local execution,
      1. the master process of a distributed training job for training, or
      1. the server process of a Kubernetes serving job for distributed serving.
   1. *execution*: according to the protobuf message, constructs instances of class `Variable` and `OperatorBase`, and runs them (a sketch follows this list).
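To make the two steps concrete, here is a minimal C++ sketch of the execution step, in which a deserialized op description is turned into an operator object and run. `OpDesc`, `OperatorBase`, `ScaleOp`, and the creator map are simplified stand-ins for illustration, not Paddle's actual classes.

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for the protobuf-generated op description.
struct OpDesc {
  std::string type;  // e.g. "scale"
};

class OperatorBase {
 public:
  virtual ~OperatorBase() = default;
  virtual void Run() = 0;  // the real Run also receives a scope and a device
};

class ScaleOp : public OperatorBase {
 public:
  void Run() override { std::cout << "running scale\n"; }
};

// Hypothetical type-name -> constructor map; the registry sections below
// explain how such a map is populated.
std::map<std::string, std::function<std::unique_ptr<OperatorBase>()>>
    g_creators = {{"scale", [] { return std::make_unique<ScaleOp>(); }}};

void Execute(const std::vector<OpDesc>& graph) {
  for (const auto& desc : graph) {
    auto op = g_creators.at(desc.type)();  // realize the description
    op->Run();                             // run it
  }
}

int main() { Execute({{"scale"}}); }
```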
## Description and Realization
At compile time, the Python program generates a protobuf message representation of the graph, i.e., the description of the graph.
At runtime, the C++ program realizes the graph and runs it.
* `OpWithKernel::Run` gets the device's kernel and invokes `OpKernel::Compute`.
* `OpKernelKey` is the map key. It contains only the device place for now, but may include the data type later.
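A minimal sketch of this dispatch, assuming a simplified `OpKernelKey` that is just the device place (the real classes carry more state):

```cpp
#include <map>
#include <memory>
#include <stdexcept>

// Simplified stand-ins; the real OpKernelKey currently contains only the
// device place, but may gain a data type field later.
enum class Place { kCPU, kGPU };
using OpKernelKey = Place;

class OpKernel {
 public:
  virtual ~OpKernel() = default;
  virtual void Compute() const = 0;  // the real Compute takes a context
};

class OpWithKernel {
 public:
  // Run looks the current device up in the kernel map and invokes the
  // registered kernel's Compute.
  void Run(Place place) const {
    auto it = kernels_.find(place);
    if (it == kernels_.end()) throw std::runtime_error("no kernel for place");
    it->second->Compute();
  }

 protected:
  std::map<OpKernelKey, std::unique_ptr<OpKernel>> kernels_;
};
```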
---
# Why separate Kernel and Operator
* Separate GPU and CPU code.
* Make it possible to run Paddle without a GPU.
* Allow one operator (the user-facing interface) to contain many implementations.
  * The same `mul` op can have different FP16 and FP32 kernels, and different MKL and Eigen kernels.
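As an illustration of the one-operator, many-kernels idea, here is a hedged sketch using a templated kernel specialized per device and data type. `CPUPlace`, `GPUPlace`, and `MulKernel` are hypothetical names, and the element-wise multiply stands in for whatever math the real kernel does:

```cpp
#include <iostream>

// Hypothetical device tags.
struct CPUPlace {};
struct GPUPlace {};

// One user-visible "mul" operator, many kernels: the template is
// specialized per device and per data type (e.g. float vs double).
template <typename Place, typename T>
struct MulKernel;

template <typename T>
struct MulKernel<CPUPlace, T> {
  void Compute(const T* x, const T* y, T* out, int n) const {
    for (int i = 0; i < n; ++i) out[i] = x[i] * y[i];  // e.g. Eigen on CPU
  }
};

// A GPUPlace specialization would live in a .cu file and launch a CUDA
// kernel; it is omitted here because gcc cannot compile GPU code.

int main() {
  float x[2] = {1.f, 2.f}, y[2] = {3.f, 4.f}, out[2];
  MulKernel<CPUPlace, float>{}.Compute(x, y, out, 2);
  std::cout << out[0] << " " << out[1] << "\n";  // prints: 3 8
}
```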
---
# Libraries for Kernel development
* `Eigen::Tensor` contains basic math and element-wise functions.
  * Note that `Eigen::Tensor` has a broadcast implementation.
  * Limit the number of `tensor.device(dev) =` statements in your code (see the Eigen sketch after this list).
* `thrust::transform` and `std::transform`.
  * `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement a customized element-wise kernel (see the transform sketch after this list).
  * `thrust` also has more complex APIs, like `scan`, `reduce`, and `reduce_by_key`.
* Hand-written `GPUKernel` and CPU code.
  * Do not put kernels in `.h` files. CPU kernels should go in `.cc` files and GPU kernels in `.cu` files. (`GCC` cannot compile GPU code.)
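A small, hedged `Eigen::Tensor` example showing element-wise math with a broadcast; only the `unsupported/Eigen/CXX11/Tensor` header is assumed:

```cpp
#include <iostream>
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  Eigen::Tensor<float, 2> x(2, 3), b(1, 3);
  x.setConstant(1.f);
  b.setValues({{10.f, 20.f, 30.f}});

  // Broadcast the 1x3 row over both rows of x, then evaluate the whole
  // element-wise expression in a single assignment: Eigen is lazy, so
  // one `=` means one pass over the data.
  Eigen::array<Eigen::Index, 2> bcast = {2, 1};
  Eigen::Tensor<float, 2> y = (x + b.broadcast(bcast)).exp();

  // On a device one writes `y.device(dev) = expr;` instead; each such
  // statement is one launch, hence the advice to limit their number.
  std::cout << y << "\n";
}
```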
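And a sketch of the `transform` pattern for a customized element-wise kernel. It is written with `std::transform` so it runs anywhere; the `thrust` spelling, noted in the comments, is nearly identical:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  std::vector<float> x = {1.f, 2.f, 3.f};
  std::vector<float> y(x.size());

  // A customized element-wise kernel is just a transform plus a functor.
  // With thrust it would be thrust::transform over thrust::device_vector
  // iterators, with the lambda replaced by a __device__-capable functor.
  std::transform(x.begin(), x.end(), y.begin(),
                 [](float v) { return v * v; });

  for (float v : y) std::cout << v << ' ';  // prints: 1 4 9
}
```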
---
# Operator Registration
## Why is registration necessary?
We need a method to build mappings between Op type names and Op classes.
## How is registration done?
Maintain a map whose key is the type name and whose value is the corresponding Op constructor.
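A minimal sketch of that pattern, with a hypothetical `REGISTER_OP` macro whose file-scope registrar object inserts the creator before `main()` runs (the real Paddle macros take more arguments):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

class OperatorBase {
 public:
  virtual ~OperatorBase() = default;
};

using OpCreator = std::function<std::unique_ptr<OperatorBase>()>;

// The registry: type name -> Op constructor. A function-local static
// avoids initialization-order problems across translation units.
std::map<std::string, OpCreator>& Registry() {
  static std::map<std::string, OpCreator> r;
  return r;
}

// A file-scope Registrar runs before main() and inserts the creator, so
// each operator's .cc file registers itself simply by being linked in.
struct Registrar {
  Registrar(const std::string& type, OpCreator creator) {
    Registry().emplace(type, std::move(creator));
  }
};

#define REGISTER_OP(type, OpClass)      \
  static Registrar registrar_##OpClass( \
      type, [] { return std::make_unique<OpClass>(); })

// Usage in an operator's .cc file:
class ScaleOp : public OperatorBase {};
REGISTER_OP("scale", ScaleOp);

int main() { auto op = Registry().at("scale")(); }
```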
---
# The Registry Map
### `OpInfoMap`
`op_type(string)` -> `OpInfo`
`OpInfo`:
- **`creator`**: The Op constructor.
- **`grad_op_type`**: The type of the gradient Op.
- **`proto`**: The Op's Protobuf, including inputs, outputs, and required attributes.
- **`checker`**: Used to check attributes.
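In C++ terms, the entry could look roughly like the following; the field types here are simplified stand-ins, not the real declarations:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

class OperatorBase {};   // base class of all operators
class OpProto {};        // protobuf message describing the op's interface
class OpAttrChecker {};  // validates attribute values

// A sketch of the per-type entry; field names follow the list above.
struct OpInfo {
  std::function<std::unique_ptr<OperatorBase>()> creator;  // Op constructor
  std::string grad_op_type;  // type name of the gradient Op
  OpProto* proto;            // inputs, outputs, and required attributes
  OpAttrChecker* checker;    // used to check attributes
};

// op_type (string) -> OpInfo
using OpInfoMap = std::map<std::string, OpInfo>;
```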
---
# Related Concepts
### Op_Maker
Its constructor takes `proto` and `checker`, which are completed during Op_Maker's construction. ([ScaleOpMaker](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37))
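A hedged sketch of the pattern, with simplified stand-ins for `OpProto`, `OpAttrChecker`, and the base maker class (the linked `ScaleOpMaker` is the authoritative version):

```cpp
#include <string>

class OpProto {};        // records inputs, outputs, and attributes
class OpAttrChecker {};  // records attribute constraints

class OpProtoAndCheckerMaker {
 public:
  OpProtoAndCheckerMaker(OpProto* proto, OpAttrChecker* checker)
      : proto_(proto), checker_(checker) {}

 protected:
  // In the real class these record the declarations into proto_/checker_.
  void AddInput(const std::string& name, const std::string& comment) {}
  void AddOutput(const std::string& name, const std::string& comment) {}
  template <typename T>
  void AddAttr(const std::string& name, const std::string& comment) {}

  OpProto* proto_;
  OpAttrChecker* checker_;
};

// The maker's constructor receives `proto` and `checker` and fills them
// in; once construction finishes, they fully describe the op's interface.
class ScaleOpMaker : public OpProtoAndCheckerMaker {
 public:
  ScaleOpMaker(OpProto* proto, OpAttrChecker* checker)
      : OpProtoAndCheckerMaker(proto, checker) {
    AddInput("X", "The input tensor of the scale operator.");
    AddOutput("Out", "The output tensor of the scale operator.");
    AddAttr<float>("scale", "The scaling factor.");
  }
};

int main() {
  OpProto proto;
  OpAttrChecker checker;
  ScaleOpMaker maker(&proto, &checker);  // proto/checker now "completed"
}
```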