Design Doc: Refactorization Overview
The goals of refactoring include:
- Making it easy for external contributors to write new elementary computation operations.
- Making the codebase clean and readable.
- Designing a new computation representation – a computation graph of operators and variables.
- Implementing auto-scalable and auto fault-recoverable distributed computing with the help of computation graphs.
Computation Graphs
- PaddlePaddle represents the computation, training, and inference of Deep Learning models by computation graphs.
- Please refer to computation graphs for a concrete example.
- Users write Python programs to describe the graphs and run them (locally or remotely).
- A graph is composed of variables and operators.
- The description of graphs must be capable of being serialized/deserialized, so that
  - it can be sent to the cloud for distributed execution, and
  - it can be sent to clients for mobile or enterprise deployment.
- The Python program does the following steps:
  - compilation: run a Python program to generate a protobuf message representation of the graph and send it to
    - the C++ library `libpaddle.so` for local execution,
    - the master process of a distributed training job for training, or
    - the server process of a Kubernetes serving job for distributed serving.
  - execution: execute the graph by constructing instances of class `Variable` and `OperatorBase`, according to the protobuf message.
Description and Realization of Computation Graph
At compile time, the Python program generates a protobuf message representation of the graph, or the description of the graph.
At runtime, the C++ program realizes the graph and runs it.
| | Representation (protobuf messages) | Realization (C++ class objects) |
|---|---|---|
| Data | `VarDesc` | `Variable` |
| Operation | `OpDesc` | `Operator` |
| Block | `BlockDesc` | `Block` |
The word graph is interchangeable with block in this document. A graph represents computation steps and local variables, similar to a C++/Java program block or a pair of curly braces (`{` and `}`).
Compilation and Execution
- Run an application Python program to describe the graph. In particular, the Python application program does the following:
  - Create `VarDesc` to represent local/intermediate variables,
  - Create operators and set attributes,
  - Validate attribute values,
  - Infer the type and the shape of variables,
  - Plan memory-reuse for variables,
  - Generate the backward graph,
  - Optimize the computation graph,
  - Potentially, split the graph for distributed training.
- The invocation of `train` or `infer` methods in the application Python program does the following (a simplified sketch of this run-time realization follows this list):
  - Create a new Scope instance in the scope hierarchy for each run of a block,
    - realize local variables defined in the BlockDesc message in the new scope,
    - a scope is similar to the stack frame in programming languages,
  - Create an instance of class `Block`, in which,
    - realize operators in the BlockDesc message,
  - Run the Block by calling
    - `Block::Eval(vector<Variable>* targets)` for forward and backward computations, or
    - `Block::Eval(vector<Operator>* targets)` for optimization.
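To make the compile-time description versus run-time realization concrete, here is a minimal, self-contained C++ sketch. The structs below (`VarDesc`, `OpDesc`, `BlockDesc`, `Variable`, `Scope`) are simplified stand-ins for illustration, not the actual PaddlePaddle classes.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Simplified stand-ins for the protobuf descriptions (compile time).
struct VarDesc { std::string name; };
struct OpDesc  { std::string type; std::vector<std::string> inputs, outputs; };
struct BlockDesc { std::vector<VarDesc> vars; std::vector<OpDesc> ops; };

// Simplified stand-ins for the run-time realization.
struct Variable { std::vector<float> data; };
using Scope = std::map<std::string, Variable>;

int main() {
  // "Compilation": build the description of the graph.
  BlockDesc block;
  block.vars = {{"x"}, {"y"}, {"out"}};
  block.ops  = {{"add", {"x", "y"}, {"out"}}};

  // "Execution": realize variables and operators from the description.
  Scope scope;
  for (const VarDesc& v : block.vars) scope[v.name] = Variable{{1.f, 2.f}};

  for (const OpDesc& op : block.ops) {
    if (op.type == "add") {  // a trivial hand-rolled kernel for the sketch
      const auto& a = scope[op.inputs[0]].data;
      const auto& b = scope[op.inputs[1]].data;
      auto& out = scope[op.outputs[0]].data;
      out.resize(a.size());
      for (size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    }
  }
  std::cout << "out[0] = " << scope["out"].data[0] << "\n";  // prints 2
}
```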
Intermediate Representation (IR)
Compile Time -> IR -> Runtime
Benefits of IR
- Optimization
  Compile Time -> IR -> Optimized IR -> Runtime
- Automatically send partitioned IR to different nodes.
  - Automatic Data Parallelism (a rough partitioning sketch follows this list)
    Compile Time
    |-> Single GPU IR
        |-> [trainer-IR-0, trainer-IR-1, pserver-IR]
            |-> Node-0 (runs trainer-IR-0)
            |-> Node-1 (runs trainer-IR-1)
            |-> Node-2 (runs pserver-IR)
  - Automatic Model Parallelism (planned for future)
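The exact partitioning pass is beyond the scope of this document, but the idea can be sketched roughly: replicate forward/backward operators into each trainer program and move parameter-update operators into the parameter-server program. Everything below (struct layout, function name, the set of optimizer op types) is an illustrative assumption.

```cpp
#include <string>
#include <vector>

struct OpDesc { std::string type; };
struct ProgramDesc { std::vector<OpDesc> ops; };

// Hypothetical split: optimizer ops go to the pserver program,
// everything else is replicated into each trainer program.
void SplitForDataParallelism(const ProgramDesc& single_device,
                             std::vector<ProgramDesc>* trainers,
                             ProgramDesc* pserver) {
  for (const OpDesc& op : single_device.ops) {
    if (op.type == "sgd" || op.type == "adam") {
      pserver->ops.push_back(op);
    } else {
      for (ProgramDesc& t : *trainers) t.ops.push_back(op);
    }
  }
}
```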
Operator/OpWithKernel/OpKernel
Operator
- `Operator` is the fundamental building block of the user interface (a simplified sketch follows this list).
  - Operator stores input/output variable names and attributes.
  - The `InferShape` interface is used to infer the shapes of the output variables based on the shapes of the input variables.
  - Use `Run` to compute the `output` variables from the `input` variables.
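For illustration only, a stripped-down version of what such an operator interface could look like; these are simplified stand-ins, not the real PaddlePaddle classes.

```cpp
#include <map>
#include <string>
#include <vector>

struct Scope;  // holds named variables at run time (see the Scope section below)

// Simplified operator interface: names of inputs/outputs plus attributes.
class OperatorBase {
 public:
  virtual ~OperatorBase() = default;
  virtual void InferShape(Scope* scope) const = 0;  // derive output shapes
  virtual void Run(Scope* scope) const = 0;         // compute outputs from inputs

 protected:
  std::vector<std::string> inputs_;
  std::vector<std::string> outputs_;
  std::map<std::string, float> attrs_;  // attributes, simplified to float-valued
};
```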
OpWithKernel/Kernel
- `OpWithKernel` inherits `Operator`.
- `OpWithKernel` contains a kernel map (a simplified sketch of the dispatch follows this list).
  - `OpWithKernel::Run` gets the device's kernel and invokes `OpKernel::Compute`.
  - `OpKernelKey` is the map key. It currently contains only the device place, but may include the data type later.
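A hedged sketch of the kernel-map idea; the key, map, and class definitions here are simplified assumptions rather than the actual ones.

```cpp
#include <map>

struct Scope;

enum class Place { kCPU, kGPU };

// Simplified kernel key: only the device place for now; a data type
// field could be added later, as noted above.
using OpKernelKey = Place;

class OpKernel {
 public:
  virtual ~OpKernel() = default;
  virtual void Compute(Scope* scope) const = 0;
};

class OpWithKernel {
 public:
  void Run(Scope* scope, Place place) const {
    kernels_.at(place)->Compute(scope);  // dispatch to the device's kernel
  }

 protected:
  std::map<OpKernelKey, OpKernel*> kernels_;  // one kernel per place
};
```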
Why separate Kernel and Operator
- Separate GPU and CPU code.
  - Make Paddle capable of running without GPU.
- Make one operator (which is a user-facing interface) and create many implementations.
  - For example, the same multiplication op can have different kernel implementations, such as an FP16 kernel, an FP32 kernel, an MKL kernel, and an Eigen kernel.
Libraries for Kernel development
- `Eigen::Tensor` contains basic math and element-wise functions (a short usage sketch follows this list).
  - Note that `Eigen::Tensor` has a broadcast implementation.
  - Limit the number of `tensor.device(dev) =` assignments in your code.
- `thrust::transform` and `std::transform` (a short usage sketch follows this list).
  - `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement customized element-wise kernels.
  - `thrust` also has more complex APIs, like `scan`, `reduce`, and `reduce_by_key`.
- Hand-writing `GPUKernel` and `CPU` code.
  - Do not write kernels in header (`.h`) files. CPU kernels should be in C++ source (`.cc`) files and GPU kernels should be in CUDA (`.cu`) files. (GCC cannot compile GPU code.)
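As a quick illustration of the `Eigen::Tensor` style (standalone Eigen code, independent of any Paddle wrappers):

```cpp
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
  Eigen::Tensor<float, 1> x(4), y(4);
  x.setConstant(1.f);
  y.setConstant(2.f);

  // Element-wise math is expressed directly on tensor expressions.
  Eigen::Tensor<float, 1> z = x * y + x.square();

  // Evaluation can also be assigned onto a device; keep the number of
  // `tensor.device(dev) =` statements small, as the list above advises.
  Eigen::DefaultDevice cpu;
  Eigen::Tensor<float, 1> w(4);
  w.device(cpu) = x + y;

  std::cout << z(0) << " " << w(0) << "\n";  // prints "3 3"
  return 0;
}
```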
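And a minimal Thrust example of a customized element-wise kernel via `transform` (plain Thrust, not tied to Paddle's kernel classes; compile with nvcc as a `.cu` file):

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <iostream>

// A custom element-wise functor: out[i] = a[i] * a[i] + b[i]
struct SquarePlus {
  __host__ __device__ float operator()(float a, float b) const {
    return a * a + b;
  }
};

int main() {
  thrust::device_vector<float> a(4, 2.f), b(4, 1.f), out(4);
  thrust::transform(a.begin(), a.end(), b.begin(), out.begin(), SquarePlus());
  std::cout << out[0] << std::endl;  // prints 5
  return 0;
}
```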
Operator Registration
Why is registration necessary?
We need a method to build mappings between Op type names and Op classes.
How is registration implemented?
Maintain a map whose key is the type name and whose value is the corresponding Op constructor.
The Registry Map
OpInfoMap
`op_type` (string) -> `OpInfo`
`OpInfo` contains:
- `creator`: The Op constructor.
- `grad_op_type`: The type of the gradient Op.
- `proto`: The Op's Protobuf message, including inputs, outputs and required attributes.
- `checker`: Used to check attributes.
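A rough sketch of such a registry; the field and container types below are simplified assumptions, not the actual definitions.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

class OperatorBase;  // see the Operator section above

// Simplified version of the per-op registry entry.
struct OpInfo {
  std::function<OperatorBase*()> creator;  // constructs the Op
  std::string grad_op_type;                // type name of the gradient Op
  std::string proto;                       // serialized proto describing I/O and attrs
  std::function<bool(const std::string&)> checker;  // validates attributes
};

// op_type (string) -> OpInfo
using OpInfoMap = std::unordered_map<std::string, OpInfo>;
```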
Registration Process
- Write an Op class and its gradient Op class, if required.
- Write an Op maker class. In the constructor of this class, describe the inputs, outputs and attributes of the operator.
- Invoke the macro `REGISTER_OP` (a sketch of the underlying static-registration pattern follows this list). This macro will
  - call the maker class to complete the `proto` and the `checker`, and
  - using the completed `proto` and `checker`, add a new key-value pair to the `OpInfoMap`.
- Invoke the `USE` macro in the code where the Op is used, to make sure that it is linked.
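The usual way such a macro works is via a static registrar object whose constructor inserts into the map at program start. Below is a minimal, self-contained sketch of that pattern; it is not the real `REGISTER_OP` definition, and all names are illustrative.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

struct OpInfo { std::function<void*()> creator; };

// Global registry, keyed by op type name.
inline std::unordered_map<std::string, OpInfo>& Registry() {
  static std::unordered_map<std::string, OpInfo> instance;
  return instance;
}

// A registrar whose constructor runs before main() and fills the map.
struct OpRegistrar {
  OpRegistrar(const std::string& type, std::function<void*()> creator) {
    Registry()[type] = OpInfo{std::move(creator)};
  }
};

// Sketch of a registration macro: defines a uniquely named static registrar.
#define REGISTER_OP_SKETCH(op_type, op_class) \
  static OpRegistrar registrar_##op_class(#op_type, [] { return (void*)new op_class; });

// Example usage:
struct MulOp { /* ... */ };
REGISTER_OP_SKETCH(mul, MulOp);
```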
Backward Module (2/2)
Build Backward Network
- Input: a graph of forward operators
- Output: a graph of backward operators
- Corner cases in construction (a rough sketch of the traversal follows this list):
  - Shared variables => insert an `Add` operator to combine gradients
  - No gradient => insert a `fill_zero_grad` operator
  - Recursive NetOp => call `Backward` recursively
  - RNN Op => recursively call `Backward` on the stepnet
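A rough, self-contained sketch of the reverse traversal and the shared-variable corner case. Names such as `GradOpFor`, the `@GRAD`/`@RENAME` suffixes, and the `add` handling are illustrative assumptions, not the actual implementation.

```cpp
#include <map>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs, outputs;
};

// Hypothetical helper: build the gradient op of a forward op.
OpDesc GradOpFor(const OpDesc& fwd) {
  OpDesc grad;
  grad.type = fwd.type + "_grad";
  for (const auto& o : fwd.outputs) grad.inputs.push_back(o + "@GRAD");
  for (const auto& i : fwd.inputs)  grad.outputs.push_back(i + "@GRAD");
  return grad;
}

std::vector<OpDesc> BuildBackward(const std::vector<OpDesc>& forward) {
  std::vector<OpDesc> backward;
  std::map<std::string, int> grad_count;  // how many ops produce each gradient

  // Walk the forward graph in reverse and emit gradient ops.
  for (auto it = forward.rbegin(); it != forward.rend(); ++it) {
    OpDesc grad = GradOpFor(*it);
    for (auto& g : grad.outputs) {
      int n = grad_count[g]++;
      // Shared variable: keep later partial gradients under renamed outputs.
      if (n > 0) g += "@RENAME_" + std::to_string(n);
    }
    backward.push_back(grad);
  }

  // Shared variables: combine the partial gradients with an Add op.
  for (const auto& kv : grad_count) {
    if (kv.second > 1) {
      OpDesc add{"add", {}, {kv.first}};
      add.inputs.push_back(kv.first);
      for (int i = 1; i < kv.second; ++i)
        add.inputs.push_back(kv.first + "@RENAME_" + std::to_string(i));
      backward.push_back(add);
    }
  }
  return backward;
}
```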
Scope, Variable, Tensor
- `Tensor` is an n-dimensional array with a type.
  - Only dims and data pointers are stored in `Tensor`.
  - All operations on `Tensor` are written in `Operator` or global functions.
  - Variable-length Tensor design: LoDTensor
- `Variable` instances are the inputs and the outputs of an operator, not just `Tensor`.
  - `step_scopes` in RNN is a variable and not a tensor.
- `Scope` is where variables are stored (a simplified sketch of all three classes follows this list).
  - `map<string /* var name */, Variable>`
  - `Scope` has a hierarchical structure. The local scope can get variables from its parent scope.
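For illustration, a stripped-down sketch of these three classes and the hierarchical lookup; these are simplified stand-ins, not the real definitions.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Tensor: only dims and a data pointer, as described above.
struct Tensor {
  std::vector<int64_t> dims;
  std::shared_ptr<float> data;  // element type simplified to float
};

// Variable: an operator input/output; may hold a Tensor or something else
// (e.g. RNN step scopes), simplified here to just a Tensor.
struct Variable {
  Tensor tensor;
};

// Scope: name -> Variable, with a parent pointer for hierarchical lookup.
class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  Variable* NewVar(const std::string& name) { return &vars_[name]; }

  // Look up locally first, then fall back to the parent scope.
  const Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return &it->second;
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  std::map<std::string, Variable> vars_;
  const Scope* parent_;
};
```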
Block (in design)
The difference from the original RNNOp
- As an operator, it is more intuitive than `RNNOp`.
- Offers a new interface `Eval(targets)` to deduce the minimal block to `Run` (a sketch of the target-pruning idea follows this list).
- Fits the compile-time / run-time separation design paradigm.
  - During compilation, `SymbolTable` stores `VarDesc`s and `OpDesc`s and serializes them to a `BlockDesc`.
  - When the graph executes, a Block with a `BlockDesc` is passed. It then creates `Op` and `Var` instances and then invokes `Run`.
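The interesting part of `Eval(targets)` is pruning: only the operators that the targets transitively depend on need to run. A rough sketch of that dependency walk, using simplified structures and hypothetical names:

```cpp
#include <set>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs, outputs;
};

// Return the ops (in original order) needed to compute `targets`.
std::vector<OpDesc> PruneForTargets(const std::vector<OpDesc>& ops,
                                    const std::vector<std::string>& targets) {
  std::set<std::string> needed(targets.begin(), targets.end());
  std::vector<bool> keep(ops.size(), false);

  // Walk backwards: an op is kept if it produces any needed variable,
  // and its inputs then become needed in turn.
  for (int i = static_cast<int>(ops.size()) - 1; i >= 0; --i) {
    for (const auto& out : ops[i].outputs) {
      if (needed.count(out)) { keep[i] = true; break; }
    }
    if (keep[i]) {
      for (const auto& in : ops[i].inputs) needed.insert(in);
    }
  }

  std::vector<OpDesc> pruned;
  for (size_t i = 0; i < ops.size(); ++i) {
    if (keep[i]) pruned.push_back(ops[i]);
  }
  return pruned;
}
```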
Milestone
- Take Paddle/books as the main line; the requirements of the models motivate the framework refactoring.
- Model migration
  - Framework development gives priority support to model migration, for example,
    - the MNIST demo needs a Python interface,
    - the RNN models require the framework to support `LoDTensor`.
  - Determine some timelines.
  - Frequently used Ops need to be migrated first.
  - Different models can be migrated in parallel.
- Improve the framework at the same time.
- Accept imperfection; concentrate on solving the specific problem at the right price.
Control the migration quality
- Compare the performance of migrated models with the old ones.
- Follow the Google C++ style guide.
- Build an automatic workflow for generating the Python/C++ documentation.
  - The documentation of layers and ops should be written inside the code.
  - Take the documentation quality into account when submitting pull requests.
  - Preview the documentation, and read and improve it from a user's perspective.