# Design Doc: Refactorization Overview
The goals of refactoring include:
- Making it easy for external contributors to write new elementary computation operations.
- Making the codebase clean and readable.
- Designing a new computation representation – a computation graph of operators and variables.
- Implementing auto-scalable and automatically fault-recoverable distributed computing with the help of computation graphs.
## Computation Graphs
- PaddlePaddle represents the computation (training and inference) of deep learning models as computation graphs.
- Please refer to computation graphs for a concrete example.
- Users write Python programs to describe the graphs and run them (locally or remotely).
- A graph is composed of variables and operators.
- The description of graphs must be serializable/deserializable, so that:
  - it can be sent to the cloud for distributed execution, and
  - it can be sent to clients for mobile or enterprise deployment.
- The Python program does two things:
  - Compilation runs a Python program to generate a protobuf message representation of the graph and sends it to
    - the C++ library `libpaddle.so` for local execution,
    - the master process of a distributed training job for training, or
    - the server process of a Kubernetes serving job for distributed serving.
  - Execution executes the graph by constructing instances of class `Variable` and `OperatorBase`, according to the protobuf message.
## Description and Realization of Computation Graph
At compile time, the Python program generates a protobuf message representation of the graph, or a description of the graph.
At runtime, the C++ program realizes the graph and runs it.
|           | Representation (protobuf messages) | Realization (C++ class objects) |
|-----------|------------------------------------|---------------------------------|
| Data      | VarDesc                            | Variable                        |
| Operation | OpDesc                             | Operator                        |
| Block     | BlockDesc                          | Block                           |
The word graph is interchangeable with block in this document. A graph consists of computation steps and local variables, similar to a C++/Java program block or a pair of curly braces ({ and }).
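To make the table above concrete, here is a hedged sketch of the parallel description/realization types. The struct layouts are illustrative simplifications: in the real framework the `VarDesc`/`OpDesc`/`BlockDesc` types are protobuf messages, and `Variable`/`Operator`/`Block` are full-featured C++ classes.

```cpp
// Illustrative only: the *Desc types stand in for protobuf messages here.
#include <string>
#include <vector>

// Compile-time descriptions (serializable).
struct VarDesc   { std::string name; std::vector<int> shape; };
struct OpDesc    { std::string type; std::vector<std::string> inputs, outputs; };
struct BlockDesc { std::vector<VarDesc> vars; std::vector<OpDesc> ops; };

// Run-time realizations (C++ objects built from the descriptions).
class Variable { /* holds a Tensor or another payload */ };
class Operator { /* built from an OpDesc; exposes InferShape() and Run() */ };
class Block    { /* built from a BlockDesc; owns Operators and a Scope */ };
```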
## Compilation and Execution
- Run a Python program to describe the graph. In particular, the Python application program does the following:
  - Create `VarDesc` to represent local/intermediate variables,
  - Create operators and set attributes,
  - Validate attribute values,
  - Infer the type and the shape of variables,
  - Plan memory-reuse for variables,
  - Generate the backward graph,
  - Add optimization operators to the computation graph,
  - Optionally, split the graph for distributed training.
- The invocation of the `train` or `infer` method in the Python program does the following (see the sketch after this list):
  - Create a new Scope instance in the scope hierarchy for each run of a block,
    - realize local variables defined in the BlockDesc message in the new scope,
    - a scope is similar to the stack frame in programming languages,
  - Create an instance of class `Block`, in which,
    - realize operators in the BlockDesc message,
  - Run the Block by calling
    - `Block::Eval(vector<Variable>* targets)` for forward and backward computations, or
    - `Block::Eval(vector<Operator>* targets)` for optimization.
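A minimal sketch of the execution side of this flow, under simplifying assumptions (a flat scope, attribute-free ops, and a stub operator factory); the real classes and signatures differ, and the real `Eval` takes the target list described above:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Variable {};                                        // would hold a Tensor, etc.
struct VarDesc   { std::string name; };
struct OpDesc    { std::string type; };
struct BlockDesc { std::vector<VarDesc> vars; std::vector<OpDesc> ops; };

using Scope = std::map<std::string, Variable>;             // simplified flat scope

class OperatorBase {
 public:
  virtual ~OperatorBase() = default;
  virtual void Run(Scope* scope) const = 0;
};

struct NoOp : OperatorBase { void Run(Scope*) const override {} };

// Hypothetical factory; the real framework looks the constructor up in a
// registry keyed by the op type (see "Operator Registration" below).
std::unique_ptr<OperatorBase> CreateOperator(const OpDesc&) {
  return std::unique_ptr<OperatorBase>(new NoOp);
}

class Block {
 public:
  explicit Block(const BlockDesc& desc) {
    for (const auto& v : desc.vars) scope_.emplace(v.name, Variable{});        // realize variables
    for (const auto& o : desc.ops) ops_.push_back(CreateOperator(o));          // realize operators
  }
  // The real Eval prunes the op list to what the requested targets need.
  void Eval() {
    for (const auto& op : ops_) op->Run(&scope_);
  }

 private:
  Scope scope_;
  std::vector<std::unique_ptr<OperatorBase>> ops_;
};
```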
## Intermediate Representation (IR)
Compile Time -> IR -> Runtime
### Benefits of IR
- Optimization: Compile Time -> IR -> Optimized IR -> Runtime
- Automatically send partitioned IR to different nodes.
  - Automatic Data Parallelism:

        Compile Time
        |-> Single GPU IR
            |-> [trainer-IR-0, trainer-IR-1, pserver-IR]
                |-> Node-0 (runs trainer-IR-0)
                |-> Node-1 (runs trainer-IR-1)
                |-> Node-2 (runs pserver-IR)

  - Automatic Model Parallelism (planned for future)
## Operator/OpWithKernel/OpKernel
### Operator
- `Operator` is the fundamental building block of the user interface (a simplified sketch follows this list).
  - Operator stores input/output variable names and attributes.
  - The `InferShape` interface is used to infer the shape of the output variables based on the shapes of the input variables.
  - Use `Run` to compute the output variables from the input variables.
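A hedged sketch of this interface with simplified signatures; the real `Run` also receives a device context, and attributes are not limited to floats:

```cpp
#include <map>
#include <string>
#include <vector>

class Scope;  // maps variable names to Variables; see "Scope, Variable, Tensor"

class OperatorBase {
 public:
  virtual ~OperatorBase() = default;
  // Derive output shapes from input shapes, before any data is computed.
  virtual void InferShape(const Scope& scope) const = 0;
  // Compute the output variables from the input variables.
  virtual void Run(const Scope& scope) const = 0;

 protected:
  std::vector<std::string> inputs_;     // names of input variables
  std::vector<std::string> outputs_;    // names of output variables
  std::map<std::string, float> attrs_;  // simplified attribute storage
};
```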
 
### OpWithKernel/Kernel
- `OpWithKernel` inherits `Operator`.
- `OpWithKernel` contains a kernel map (sketched below).
  - `OpWithKernel::Run` gets the device's kernel and invokes `OpKernel::Compute`.
  - `OpKernelKey` is the map key. It currently contains only the device place, but may include the data type later.
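A minimal sketch of this dispatch, assuming a bare-bones `Place` enum as the kernel key and a `Compute` method with no arguments (the real kernels receive an execution context):

```cpp
#include <map>
#include <memory>
#include <utility>

enum class Place { kCPU, kGPU };
using OpKernelKey = Place;  // today only the place; may later include the data type

class OpKernel {
 public:
  virtual ~OpKernel() = default;
  virtual void Compute() const = 0;
};

class OpWithKernel {
 public:
  void AddKernel(OpKernelKey key, std::unique_ptr<OpKernel> kernel) {
    kernels_[key] = std::move(kernel);
  }
  // Pick the kernel registered for the current device and run it.
  void Run(Place place) const { kernels_.at(place)->Compute(); }

 private:
  std::map<OpKernelKey, std::unique_ptr<OpKernel>> kernels_;
};
```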
 
### Why separate Kernel and Operator
- Separate GPU and CPU code.
  - Make Paddle capable of running without a GPU.
- Make one operator (which is a user interface) and create many implementations.
  - For example, the same multiplication op can have different implementation kernels, such as FP16, FP32, MKL, and Eigen kernels.
### Libraries for Kernel development
- `Eigen::Tensor` contains basic math and element-wise functions (see the example below).
  - Note that `Eigen::Tensor` has a broadcast implementation.
  - Limit the number of `tensor.device(dev) =` statements in your code.
- `thrust::transform` and `std::transform`.
  - `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement customized element-wise kernels.
  - `thrust`, in addition, supports more complex APIs, like `scan`, `reduce`, and `reduce_by_key`.
- Hand-written `GPUKernel` and CPU code.
  - Do not write it in header (.h) files. CPU kernels should be in .cc source files, and GPU kernels should be in .cu files. (GCC cannot compile GPU code.)
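A small element-wise CPU kernel written with `Eigen::Tensor`, as suggested above; the function name and the use of `Eigen::DefaultDevice` are illustrative, not the framework's own helpers:

```cpp
#include <unsupported/Eigen/CXX11/Tensor>

// out[i] = x[i] + y[i], expressed as a single `.device(dev) = ...` assignment.
void AddCPU(float* x, float* y, float* out, int n) {
  Eigen::TensorMap<Eigen::Tensor<float, 1>> X(x, n), Y(y, n), Out(out, n);
  Eigen::DefaultDevice dev;
  Out.device(dev) = X + Y;
}

int main() {
  float x[3] = {1, 2, 3}, y[3] = {4, 5, 6}, out[3];
  AddCPU(x, y, out, 3);  // out is now {5, 7, 9}
  return 0;
}
```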
## Operator Registration
### Why is registration necessary?
We need a method to build mappings between Op type names and Op classes.
### How is registration implemented?
Maintain a map whose key is the Op type name and whose value is the corresponding Op constructor.
### The Registry Map
#### OpInfoMap
op_type(string) -> OpInfo
OpInfo:
- creator: The Op constructor.
- grad_op_type: The type of the gradient Op.
- proto: The Op’s Protobuf, including inputs, outputs and required attributes.
- checker: Used to check attributes.
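The fields listed above map naturally onto a small struct plus a global map. The following is a hedged sketch with placeholder types; `OpProto`, `OpAttrChecker`, and the creator signature are assumptions, not the exact framework types:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

class OperatorBase;   // the Op base class, defined elsewhere
class OpProto;        // protobuf describing inputs, outputs and attributes
class OpAttrChecker;  // validates attribute values

struct OpInfo {
  std::function<std::unique_ptr<OperatorBase>()> creator;  // the Op constructor
  std::string grad_op_type;                                // type name of the gradient Op
  const OpProto* proto = nullptr;                          // inputs, outputs, attributes
  const OpAttrChecker* checker = nullptr;                  // attribute checking
};

// op_type (string) -> OpInfo
using OpInfoMap = std::map<std::string, OpInfo>;
```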
### Registration Process
- Write an Op class and its gradient Op class, if required.
- Write an Op maker class. In the constructor of this class, describe the inputs, outputs and attributes of the operator.
- Invoke the macro REGISTER_OP (a sketch of the underlying mechanism follows this list). This macro will
  - call the maker class to complete the `proto` and the `checker`, and
  - use the completed `proto` and `checker` to add a new key-value pair to the `OpInfoMap`.
- Invoke the `USE` macro in the code where the Op is used, to make sure that it is linked.
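Registration macros of this kind are commonly implemented with a function-local static map and a static registrar object whose constructor inserts the key-value pair at program start. The following is a generic sketch of that pattern, not the literal PaddlePaddle macro expansion:

```cpp
#include <map>
#include <string>

struct OpInfo { /* creator, grad_op_type, proto, checker, ... */ };

// The global OpInfoMap, created on first use.
std::map<std::string, OpInfo>& OpRegistry() {
  static std::map<std::string, OpInfo> instance;
  return instance;
}

// A REGISTER_OP-style macro can expand to one uniquely named static object
// of this type per registered operator.
struct OpRegistrar {
  OpRegistrar(const std::string& op_type, OpInfo info) {
    OpRegistry()[op_type] = info;  // add the key-value pair
  }
};

// e.g. registering a hypothetical "mul" operator:
static OpRegistrar mul_registrar("mul", OpInfo{/* filled in by the maker */});

// A USE-style macro then references a symbol defined next to the registrar,
// so the linker keeps this translation unit in the final binary.
```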
## Backward Module (2/2)
### Build Backward Network
- Input: a graph of forward operators
- Output: a graph of backward operators
- Corner cases in construction:
  - Shared Variables => insert an `Add` operator to combine gradients (sketched below)
  - No Gradient => insert a `fill_zero_grad` operator
  - Recursive NetOp => call `Backward` recursively
  - RNN Op => recursively call `Backward` on stepnet
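A hedged sketch of the shared-variable corner case: when more than one backward operator produces a gradient for the same variable, the duplicates are renamed and an add operator is appended to combine them. The `OpDesc` shape, function name, and `@RENAME@` suffix are illustrative assumptions:

```cpp
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs, outputs;
};

// If several ops in `backward_ops` write `grad_name`, rename each duplicate
// output and append an "add" op that sums the renamed copies into grad_name.
void CombineSharedGradient(std::vector<OpDesc>* backward_ops,
                           const std::string& grad_name) {
  int producers = 0;
  for (const auto& op : *backward_ops)
    for (const auto& out : op.outputs)
      if (out == grad_name) ++producers;
  if (producers <= 1) return;  // nothing is shared, nothing to combine

  std::vector<std::string> renamed;
  int i = 0;
  for (auto& op : *backward_ops)
    for (auto& out : op.outputs)
      if (out == grad_name) {
        out = grad_name + "@RENAME@" + std::to_string(i++);
        renamed.push_back(out);
      }
  backward_ops->push_back(OpDesc{"add", renamed, {grad_name}});
}
```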
## Scope, Variable, Tensor
- `Tensor` is an n-dimension array with type.
  - Only dims and data pointers are stored in `Tensor`.
  - All operations on `Tensor` are written in `Operator` or global functions.
  - Variable-length `Tensor` design: LoDTensor.
- `Variable` instances are the inputs and the outputs of an operator, not just `Tensor`.
  - `step_scopes` in RNN is a variable and not a tensor.
- `Scope` is where variables are stored (see the sketch after this list).
  - `map<string /* var name */, Variable>`
  - `Scope` has a hierarchical structure. The local scope can get variables from its parent scope.
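A minimal sketch of such a hierarchical scope, assuming only the two operations that matter here (creating a variable locally and looking a name up through the parent chain):

```cpp
#include <map>
#include <string>

class Variable { /* holds a Tensor, step_scopes, ... */ };

class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  // Create (or fetch) a variable in this local scope.
  Variable* NewVar(const std::string& name) { return &vars_[name]; }

  // Look locally first, then walk up the scope hierarchy.
  const Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return &it->second;
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  std::map<std::string, Variable> vars_;  // map<string /* var name */, Variable>
  const Scope* parent_;
};
```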
## Block (in design)
### The difference between original RNNOp and Block
- As an operator, Block is more intuitive than RNNOp.
- Offers a new interface `Eval(targets)` to deduce the minimal block to `Run` (sketched below).
- Fits the compile-time/runtime separation design paradigm.
  - During compilation, `SymbolTable` stores `VarDesc`s and `OpDesc`s and serializes them to a `BlockDesc`.
  - When the graph executes, a Block with a `BlockDesc` is passed. It then creates `Op` and `Var` instances and then invokes `Run`.
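A hedged sketch of what an `Eval(targets)`-style interface can do to deduce the minimal block: walk the operator list backwards from the target variable names and keep only the operators whose outputs are transitively needed. The `OpDesc` shape and the function name are illustrative:

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs, outputs;
};

// Return the operators (in execution order) needed to compute `targets`.
std::vector<OpDesc> MinimalBlock(const std::vector<OpDesc>& ops,
                                 const std::vector<std::string>& targets) {
  std::set<std::string> needed(targets.begin(), targets.end());
  std::vector<OpDesc> kept;
  for (auto it = ops.rbegin(); it != ops.rend(); ++it) {
    bool produces_needed = false;
    for (const auto& out : it->outputs)
      if (needed.count(out)) produces_needed = true;
    if (!produces_needed) continue;
    kept.push_back(*it);                                   // keep this op
    needed.insert(it->inputs.begin(), it->inputs.end());   // its inputs become needed
  }
  std::reverse(kept.begin(), kept.end());                  // restore execution order
  return kept;
}
```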
## Milestone
- Take Paddle/books as the main line; the requirements of the models motivate the framework refactoring.
- Model migration
  - Framework development gives priority support to model migration; for example,
    - the MNIST demo needs a Python interface,
    - the RNN models require the framework to support LoDTensor.
  - Determine some timelines.
  - Frequently used Ops need to be migrated first.
  - Different models can be migrated in parallel.
- Improve the framework at the same time.
- Accept imperfection; concentrate on solving the specific problem at the right price.
### Control the migration quality
- Compare the performance of migrated models with the old ones.
- Follow the Google C++ style guide.
- Build an automatic workflow for generating Python/C++ documentation.
  - The documentation of layers and ops should be written inside the code.
  - Take the documentation quality into account when submitting pull requests.
  - Preview the documentation, and read and improve it from a user's perspective.