The word *graph* is interchangeable with *block* in this document. A graph represents computation steps and local variables, much like a C++/Java program block, that is, a pair of { and }.
## Compilation and Execution
1. Run an application Python program to describe the graph. In particular,
   1. create VarDesc to represent local/intermediate variables,
   1. create operators and set attributes,
   1. validate attribute values,
   1. infer the type and the shape of variables,
   1. plan memory reuse for variables,
   1. generate the backward and optimization parts of the graph,
   1. possibly split the graph for distributed training.
1. Invoking `train` or `infer` in the application Python program does the following:
   1. create a new Scope instance in the [scope hierarchy]( for each run of a block,
      1. realize local variables defined in the BlockDesc message in the new scope,
      1. a scope is similar to the stack frame in programming languages,
   1. create an instance of class `Block`, in which,
      1. realize operators in the BlockDesc message,
   1. run the Block by calling
      1. `Block::Eval(vector<Variable>* targets)` for forward and backward computations, or
      1. `Block::Eval(vector<Operator>* targets)` for optimization.
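The comparison between a scope and a stack frame can be illustrated with a minimal sketch. It assumes a scope owns its local variables and falls back to its parent on lookup; the `Variable` struct and the `NewVar`/`FindVar` method names are hypothetical stand-ins, not the framework's confirmed API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a runtime variable; a real one would hold a tensor.
struct Variable {
  float value = 0.0f;
};

// A scope owns its local variables and falls back to its parent on lookup,
// much like a stack frame that can still see names from enclosing blocks.
class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  // Create the local variable with the given name, or return the existing one.
  Variable* NewVar(const std::string& name) {
    assert(!name.empty());
    auto& slot = vars_[name];
    if (!slot) slot = std::make_unique<Variable>();
    return slot.get();
  }

  // Look the name up locally first, then walk up the parent chain.
  Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return it->second.get();
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  const Scope* parent_;
  std::unordered_map<std::string, std::unique_ptr<Variable>> vars_;
};
```

Under these assumptions, a child scope created for one run of a block can read parameters held by its parent, while its own temporaries stay invisible to the parent and disappear when the child is destroyed.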
## Intermediate Representation (IR)
Compile Time -> IR -> Runtime
### Benefit
- Optimization: Compile Time -> IR -> Optimized IR -> Runtime
- Send automatically partitioned IR to different nodes.
* `OpWithKernel::Run` gets the kernel for the device and invokes `OpKernel::Compute`.
* `OpKernelKey` is the map key. It currently holds only the device place, but may include the data type later.
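The dispatch described above can be sketched roughly as follows. Only the names `OpWithKernel::Run`, `OpKernel::Compute`, and `OpKernelKey` come from the text; the `Place` enum, `AddKernel`, `CPUMulKernel`, and the string return value are assumptions made to keep the example small.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical device tags; the text only says the key holds a device place.
enum class Place { kCPU, kGPU };

// The map key. Only the place for now; a data type field could be added later.
struct OpKernelKey {
  Place place;
  bool operator<(const OpKernelKey& other) const { return place < other.place; }
};

// Base class of all kernels. Compute returns a string here only for illustration.
class OpKernel {
 public:
  virtual ~OpKernel() = default;
  virtual std::string Compute() const = 0;
};

// Hypothetical CPU implementation of a mul operator.
class CPUMulKernel : public OpKernel {
 public:
  std::string Compute() const override { return "mul on CPU"; }
};

class OpWithKernel {
 public:
  // Attach one implementation for the given key.
  void AddKernel(OpKernelKey key, std::unique_ptr<OpKernel> kernel) {
    assert(kernel != nullptr);
    kernels_[key] = std::move(kernel);
  }

  // Find the kernel for the requested place and invoke its Compute.
  std::string Run(Place place) const {
    auto it = kernels_.find(OpKernelKey{place});
    if (it == kernels_.end()) {
      throw std::runtime_error("no kernel registered for this place");
    }
    return it->second->Compute();
  }

 private:
  std::map<OpKernelKey, std::unique_ptr<OpKernel>> kernels_;
};
```

In this sketch, extending the key with a data type would only require another field in `OpKernelKey` and a matching change to `operator<`; the lookup in `Run` would stay the same.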
# Why separate Kernel and Operator
* Separate GPU and CPU code.
  * Allow Paddle to run without a GPU.
* Allow one operator (which is the user interface) to have many implementations.
  * For example, the same mul op can have different FP16 and FP32 kernels, or different MKL and Eigen kernels.
# Libraries for Kernel development
* `Eigen::Tensor` contains basic math and element-wise functions.
  * Note that `Eigen::Tensor` has a broadcast implementation.
  * Limit the number of `tensor.device(dev) = ` assignments in your code.
* `thrust::transform` and `std::transform`.
  * `thrust` has the same API as the C++ standard library. Using `transform`, you can quickly implement a customized element-wise kernel.
  * `thrust` also has more complex APIs, such as `scan`, `reduce`, and `reduce_by_key`.
* Hand-writing `GPUKernel` and `CPU` code
  * Do not write kernels in `.h` files. CPU kernels should be in `.cc` files and GPU kernels should be in `.cu` files. (`GCC` cannot compile GPU code.)
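As an illustration of the `transform` approach, here is a minimal CPU-only sketch of a custom element-wise kernel built on `std::transform`. The `ScaleShift` functor and the `ScaleShiftCPU` function are hypothetical names. `thrust::transform` takes the same iterator-range-plus-functor arguments, so a GPU variant would look similar, but it is not shown here because it needs a CUDA toolchain.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Functor for a hypothetical "scale and shift" element-wise op: y = a * x + b.
struct ScaleShift {
  float a;
  float b;
  float operator()(float x) const { return a * x + b; }
};

// CPU kernel: apply the functor to every input element with std::transform.
// The caller must size the output to match the input.
void ScaleShiftCPU(const std::vector<float>& in, float a, float b,
                   std::vector<float>* out) {
  assert(out != nullptr && out->size() == in.size());
  std::transform(in.begin(), in.end(), out->begin(), ScaleShift{a, b});
}
```

Keeping the arithmetic in a small functor means the same per-element logic could, in principle, be shared between the CPU and GPU versions.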
# Operator Registration
## Why is registration necessary?
We need a method to build mappings between Op type names and Op classes.
## How is registration done?
Maintain a map whose key is the type name and whose value is the corresponding Op constructor.
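A minimal sketch of such a map, assuming a common operator base class: `OperatorBase`, `MulOp`, `Registry`, `RegisterOp`, and `CreateOp` are hypothetical names chosen for illustration and are not taken from the framework.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical base class shared by all operators.
class OperatorBase {
 public:
  virtual ~OperatorBase() = default;
  virtual std::string Type() const = 0;
};

// Hypothetical concrete operator.
class MulOp : public OperatorBase {
 public:
  std::string Type() const override { return "mul"; }
};

using OpCreator = std::function<std::unique_ptr<OperatorBase>()>;

// The registry: Op type name -> Op constructor.
std::map<std::string, OpCreator>& Registry() {
  static std::map<std::string, OpCreator> registry;
  return registry;
}

// Register OpType under the given name; each name may be registered only once.
template <typename OpType>
void RegisterOp(const std::string& type) {
  bool inserted =
      Registry()
          .emplace(type, [] { return std::unique_ptr<OperatorBase>(new OpType()); })
          .second;
  assert(inserted && "op type registered twice");
  (void)inserted;  // silence the unused-variable warning when NDEBUG is set
}

// Build an operator from its type name, or return nullptr if it is unknown.
std::unique_ptr<OperatorBase> CreateOp(const std::string& type) {
  auto it = Registry().find(type);
  if (it == Registry().end()) return nullptr;
  return it->second();
}
```

With this in place, code that only knows a type name as a string, for example one read from a serialized graph description, can construct the matching operator without referring to its class.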
# The Registry Map
### `OpInfoMap`
`op_type(string)` -> `OpInfo`
- **`creator`**: The Op constructor.
- **`grad_op_type`**: The type of the gradient Op.
- **`proto`**: The Op's Protobuf, including inputs, outputs and required attributes.
- **`checker`**: Used to check attributes.
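The four fields can be pictured with the following sketch. It is a simplification under stated assumptions: `OpProto` is a plain struct standing in for the Protobuf message, attributes are float-only, and the `scale` op, `MakeExampleMap`, and `CreateOp` are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified attribute storage: float-only attributes keyed by name.
using AttributeMap = std::map<std::string, float>;

// Plain-struct stand-in for the Op's Protobuf message.
struct OpProto {
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
  std::vector<std::string> required_attrs;
};

// Simplified operator instance.
struct Op {
  std::string type;
  AttributeMap attrs;
};

struct OpInfo {
  std::function<std::unique_ptr<Op>(const AttributeMap&)> creator;
  std::string grad_op_type;
  OpProto proto;
  std::function<void(const AttributeMap&)> checker;  // throws on invalid attributes
};

using OpInfoMap = std::map<std::string, OpInfo>;

// Fill in one entry for a hypothetical "scale" op with one required attribute.
OpInfoMap MakeExampleMap() {
  OpInfo info;
  info.grad_op_type = "scale_grad";
  info.proto = OpProto{{"X"}, {"Out"}, {"scale"}};
  const std::vector<std::string> required = info.proto.required_attrs;
  info.checker = [required](const AttributeMap& attrs) {
    for (const auto& name : required) {
      if (attrs.count(name) == 0) {
        throw std::invalid_argument("missing attribute: " + name);
      }
    }
  };
  info.creator = [](const AttributeMap& attrs) {
    return std::unique_ptr<Op>(new Op{"scale", attrs});
  };
  OpInfoMap map;
  map.emplace("scale", std::move(info));
  return map;
}

// Look up the type, check the attributes, then construct the Op.
std::unique_ptr<Op> CreateOp(const OpInfoMap& map, const std::string& type,
                             const AttributeMap& attrs) {
  auto it = map.find(type);
  if (it == map.end()) throw std::out_of_range("unregistered op type: " + type);
  assert(it->second.checker && it->second.creator);
  it->second.checker(attrs);
  return it->second.creator(attrs);
}
```

Running the checker before the creator, as done here, is one possible ordering; it keeps an operator with invalid attributes from ever being constructed.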
# Related Concepts
### Op_Maker
Its constructor takes `proto` and `checker`. Both are completed during Op_Maker's construction.