Design Doc: Refactorization Overview¶
The goals of the refactorization include:
- Make it easy for external contributors to write new elementary computation operations.
- Make the codebase clean and readable.
- Introduce a new design of computation representation: a computation graph of operators and variables.
- The graph representation helps implement auto-scalable and auto fault-recoverable distributed computing.
Computation Graphs¶
- PaddlePaddle represents the computation, training, and inference of deep learning (DL) models by computation graphs.
  - Please refer to computation graphs for a concrete example.
- Users write Python programs to describe the graphs and run them (locally or remotely).
- A graph is composed of variables and operators.
- The description of a graph must be serializable and deserializable, so that it can be
  - sent to the cloud for distributed execution, and
  - sent to clients for mobile or enterprise deployment.
- The Python program does two things:
  - compilation: runs a Python program to generate a protobuf message representation of the graph and sends it to
    - the C++ library libpaddle.so for local execution,
    - the master process of a distributed training job for training, or
    - the server process of a Kubernetes serving job for distributed serving.
  - execution: according to the protobuf message, constructs instances of class Variable and OperatorBase, and runs them.
Description and Realization¶
At compile time, the Python program generates a protobuf message representation of the graph, that is, the description of the graph.
At runtime, the C++ program realizes the graph and runs it.
| | Representation (protobuf messages) | Realization (C++ class objects) |
|---|---|---|
| Data | VarDesc | Variable |
| Operation | OpDesc | Operator |
| Block | BlockDesc | Block |
The word graph is interchangeable with block in this document. A graph represents computation steps and local variables, similar to a C++/Java program block, or a pair of { and }.
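The split between description and realization can be sketched in a few lines. This is a minimal illustration only: the structs below are hypothetical plain-C++ stand-ins for the protobuf messages, and RealizeVars is an assumed helper name, not part of the actual library.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the protobuf messages (the description side).
struct VarDesc {
  std::string name;
  std::vector<int> shape;
};
struct OpDesc {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};
struct BlockDesc {
  std::vector<VarDesc> vars;
  std::vector<OpDesc> ops;
};

// Hypothetical realization-side object: it owns actual storage.
struct Variable {
  std::vector<float> data;
};

// Realize every VarDesc as a Variable whose storage matches the declared shape.
std::map<std::string, Variable> RealizeVars(const BlockDesc& desc) {
  std::map<std::string, Variable> vars;
  for (const VarDesc& v : desc.vars) {
    std::size_t numel = 1;
    for (int d : v.shape) numel *= static_cast<std::size_t>(d);
    vars[v.name].data.assign(numel, 0.0f);
  }
  return vars;
}
```

The description carries only names and shapes, so it is cheap to serialize and send elsewhere; memory is allocated only when the description is realized.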
Compilation and Execution¶
- Run an application Python program to describe the graph. In particular, the program does the following:
  - create VarDesc to represent local/intermediate variables,
  - create operators and set attributes,
  - validate attribute values,
  - infer the type and the shape of variables,
  - plan memory reuse for variables,
  - generate the backward and optimization parts of the graph,
  - possibly split the graph for distributed training.
- The invocation of train or infer in the application Python program does the following:
  - create a new Scope instance in the scope hierarchy for each run of a block,
    - realize local variables defined in the BlockDesc message in the new scope,
    - a scope is similar to the stack frame in programming languages,
  - create an instance of class Block, in which,
    - realize operators in the BlockDesc message,
  - run the Block by calling
    - Block::Eval(vector<Variable>* targets) for forward and backward computations, or
    - Block::Eval(vector<Operator>* targets) for optimization.
Intermediate Representation (IR)¶
Compile Time -> IR -> Runtime
Benefit¶
- Optimization
  Compile Time -> IR -> Optimized IR -> Runtime
- Send automatically partitioned IR to different nodes.
  - Automatic data parallelism
    Compile Time
    |-> Single GPU IR
        |-> [trainer-IR-0, trainer-IR-1, pserver-IR]
            |-> Node-0 (runs trainer-IR-0)
            |-> Node-1 (runs trainer-IR-1)
            |-> Node-2 (runs pserver-IR)
  - Automatic model parallelism (planned for future)
Operator/OpWithKernel/OpKernel¶
Operator¶
- Operator is the fundamental building block of the user interface.
  - Operator stores input/output variable names and attributes.
  - The InferShape interface infers the shapes of the output variables from the shapes of the input variables.
  - Use Run to compute the output variables from the input variables.
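To make the role of InferShape concrete, here is a minimal sketch of shape inference for a 2-D matrix multiply. The Shape alias and the free function InferMulShape are assumptions made for illustration; the real interface is a method on the operator class.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Infer the output shape of a 2-D matrix multiply: [M, K] x [K, N] -> [M, N].
// Throws if the inputs are not 2-D or the inner dimensions disagree.
Shape InferMulShape(const Shape& x, const Shape& y) {
  if (x.size() != 2 || y.size() != 2) {
    throw std::invalid_argument("mul expects two 2-D inputs");
  }
  if (x[1] != y[0]) {
    throw std::invalid_argument("inner dimensions do not match");
  }
  return Shape{x[0], y[1]};
}
```

Because only shapes are inspected, this check can run at compile time on the description, before any tensor memory exists.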
OpWithKernel/Kernel¶
- OpWithKernel inherits Operator.
- OpWithKernel contains a Kernel map.
  - OpWithKernel::Run gets the kernel for the device and invokes OpKernel::Compute.
  - OpKernelKey is the map key. It currently holds only the device place, but may include the data type later.
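The kernel map described above might look roughly like the following. This is a simplified sketch under stated assumptions: the Place enum, RegisterKernel, and the string returned by Compute are hypothetical, and the real classes take an execution context rather than a bare place.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

enum class Place { kCPU, kGPU };

// Base class for device-specific implementations of one operator.
struct OpKernel {
  virtual ~OpKernel() {}
  virtual std::string Compute() const = 0;
};

struct CPUMulKernel : OpKernel {
  std::string Compute() const override { return "mul on CPU"; }
};

// The map key: only the device place for now, as in the text above.
struct OpKernelKey {
  Place place;
  bool operator<(const OpKernelKey& other) const { return place < other.place; }
};

class OpWithKernel {
 public:
  void RegisterKernel(OpKernelKey key, std::shared_ptr<OpKernel> kernel) {
    kernels_[key] = kernel;
  }

  // Look up the kernel for the requested device and delegate to it.
  std::string Run(Place place) const {
    auto it = kernels_.find(OpKernelKey{place});
    if (it == kernels_.end()) {
      throw std::runtime_error("no kernel registered for this place");
    }
    return it->second->Compute();
  }

 private:
  std::map<OpKernelKey, std::shared_ptr<OpKernel>> kernels_;
};
```

Adding the data type to the key later would only require extending OpKernelKey and its comparison; Run itself would not change.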
Why separate Kernel and Operator¶
- Separate GPU and CPU code.
  - Make Paddle capable of running without a GPU.
- Allow one operator (which is the user interface) to have many implementations.
  - The same mul op can have different FP16 and FP32 kernels, or different MKL and Eigen kernels.
Libraries for Kernel development¶
- Eigen::Tensor contains basic math and element-wise functions.
  - Note that Eigen::Tensor has a broadcast implementation.
  - Limit the number of tensor.device(dev) = assignments in your code.
- thrust::transform and std::transform.
  - thrust has the same API as the C++ standard library. Using transform, one can quickly implement a customized element-wise kernel.
  - thrust also has more complex APIs, like scan, reduce, and reduce_by_key.
- Hand-writing GPUKernel and CPU code
  - Do not write kernel code in header (.h) files. CPU kernels should be in .cc files and GPU kernels should be in .cu files. (GCC cannot compile GPU code.)
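As an illustration of the transform approach, here is a CPU-only sketch of an element-wise kernel written with std::transform. The function name Relu and the choice of ReLU are assumptions for the example; a GPU version would call thrust::transform with the same argument order on device iterators, and would live in a .cu file.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Element-wise ReLU on the CPU: out[i] = max(x[i], 0).
std::vector<float> Relu(const std::vector<float>& x) {
  std::vector<float> out(x.size());
  std::transform(x.begin(), x.end(), out.begin(),
                 [](float v) { return v > 0.0f ? v : 0.0f; });
  return out;
}
```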
Operator Register¶
Why is registration necessary?¶
We need a method to build mappings between Op type names and Op classes.
How is registration done?¶
Maintain a map whose key is the type name and whose value is the corresponding Op constructor.
The Registry Map¶
OpInfoMap¶
op_type(string) -> OpInfo
OpInfo:
- creator: The Op constructor.
- grad_op_type: The type of the gradient Op.
- proto: The Op’s Protobuf, including inputs, outputs and required attributes.
- checker: Used to check attributes.
Register Process¶
- Write an Op class, as well as its gradient Op class if there is one.
- Write an Op maker class. In its constructor, describe the Op's inputs, outputs, and attributes.
- Invoke the macro REGISTER_OP. The macro will
  - call the maker class to complete proto and checker,
  - with the completed proto and checker, add a new key-value pair to the OpInfoMap.
- Invoke the USE macro where the Op is used to make sure it is linked.
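The registry idea can be sketched without macros. In the sketch below, GetOpInfoMap, RegisterOp, and CreateOp are hypothetical helper names standing in for what REGISTER_OP and the framework do internally, and OpInfo is reduced to the creator and grad_op_type fields (proto and checker are omitted).

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct OperatorBase {
  virtual ~OperatorBase() {}
  virtual std::string Type() const = 0;
};

// Reduced OpInfo: only the constructor and the gradient Op type.
struct OpInfo {
  std::function<std::unique_ptr<OperatorBase>()> creator;
  std::string grad_op_type;
};

// A function-local static avoids initialization-order problems across files.
std::map<std::string, OpInfo>& GetOpInfoMap() {
  static std::map<std::string, OpInfo> info_map;
  return info_map;
}

struct MulOp : OperatorBase {
  std::string Type() const override { return "mul"; }
};

// Returns false if the type name was already registered.
bool RegisterOp(const std::string& type, OpInfo info) {
  return GetOpInfoMap().emplace(type, std::move(info)).second;
}

// Build an Op instance from its type name, or return nullptr if unknown.
std::unique_ptr<OperatorBase> CreateOp(const std::string& type) {
  auto it = GetOpInfoMap().find(type);
  if (it == GetOpInfoMap().end()) return nullptr;
  return it->second.creator();
}
```

A registration macro typically expands to a static object whose constructor performs a call like RegisterOp. That is why a USE-style macro is needed: the linker would otherwise drop an object file that nothing references.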
Backward Module (2/2)¶
Build Backward Network¶
- Input: a graph of forward operators
- Output: a graph of backward operators
- Corner cases in construction
  - shared variable => insert an Add operator
  - no gradient => insert a fill_zero_grad operator
  - recursive NetOp => call Backward recursively
  - RNN Op => recursively call Backward on stepnet
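The shared-variable corner case can be sketched as follows. When a variable feeds more than one forward operator, each corresponding backward operator produces a partial gradient, and an add operator sums them. The OpDesc struct, the "@GRAD" naming scheme, and BuildBackward are assumptions for illustration only; the no-gradient and recursive cases are not covered.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Hypothetical naming convention for gradient variables.
std::string GradName(const std::string& var) { return var + "@GRAD"; }

std::vector<OpDesc> BuildBackward(const std::vector<OpDesc>& forward) {
  // Count how many times each variable is read by forward ops.
  std::map<std::string, int> use_count;
  for (const OpDesc& op : forward) {
    for (const std::string& in : op.inputs) ++use_count[in];
  }

  std::vector<OpDesc> backward;
  std::map<std::string, std::vector<std::string>> partials;
  // Walk the forward ops in reverse to emit gradient ops.
  for (auto it = forward.rbegin(); it != forward.rend(); ++it) {
    OpDesc grad_op;
    grad_op.type = it->type + "_grad";
    for (const std::string& out : it->outputs) {
      grad_op.inputs.push_back(GradName(out));
    }
    std::vector<std::string> completed;
    for (const std::string& in : it->inputs) {
      std::string name = GradName(in);
      if (use_count[in] > 1) {
        // Shared variable: write to a uniquely named partial gradient.
        std::vector<std::string>& parts = partials[in];
        name += "@" + std::to_string(parts.size());
        parts.push_back(name);
        if (static_cast<int>(parts.size()) == use_count[in]) {
          completed.push_back(in);
        }
      }
      grad_op.outputs.push_back(name);
    }
    backward.push_back(grad_op);
    // Once every partial exists, sum them into the final gradient.
    for (const std::string& var : completed) {
      backward.push_back(OpDesc{"add", partials[var], {GradName(var)}});
    }
  }
  return backward;
}
```

The add operator is emitted right after the last partial gradient is produced, so any later backward operator that reads the summed gradient sees the complete value.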
Scope, Variable, Tensor¶
- Tensor is an n-dimensional array with a type.
  - Only dims and data pointers are stored in Tensor.
  - All operations on Tensor are written in Operator or global functions.
  - Variable-length Tensor design: LoDTensor
- Variable is the input or output of an operator, and is not just a Tensor.
  - step_scopes in RNN is a variable and not a tensor.
- Scope is where variables are stored.
  - map<string /* var name */, Variable>
  - Scope has a hierarchical structure. The local scope can get variables from its parent scope.
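The hierarchical lookup can be illustrated with a small sketch. The method names NewVar and FindVar and the single-float Variable are assumptions for this example, not necessarily the real interface.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified variable: a real Variable can hold a Tensor or other types.
struct Variable {
  float value = 0.0f;
};

class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  // Create (or fetch) a variable owned by this scope.
  Variable* NewVar(const std::string& name) { return &vars_[name]; }

  // Look up locally first, then walk up the parent chain.
  const Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return &it->second;
    return parent_ != nullptr ? parent_->FindVar(name) : nullptr;
  }

 private:
  const Scope* parent_;
  std::map<std::string, Variable> vars_;
};
```

As with a stack frame, a child scope sees its parent's variables, while the parent cannot see variables created in the child.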
Block (in design)¶
The difference from the original RNNOp¶
- Block as an operator is more intuitive than RNNOp,
- offers a new interface Eval(targets) to deduce the minimal block to Run,
- fits the compile-time/runtime separation design.
  - during the compilation, SymbolTable stores VarDescs and OpDescs and serializes them to a BlockDesc,
  - when the graph executes, a Block with a BlockDesc passed in creates Op and Var instances and then calls Run.
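One way to read "deduce the minimal block to Run" is as a backward reachability pass over the operator list. The sketch below is an assumption about how such pruning could work, not the actual Eval implementation. MinimalOps and OpDesc are hypothetical names, and the operators are assumed to be stored in execution order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Return the indices of the ops needed to compute `targets`, in execution order.
std::vector<std::size_t> MinimalOps(const std::vector<OpDesc>& ops,
                                    const std::vector<std::string>& targets) {
  std::set<std::string> needed(targets.begin(), targets.end());
  std::vector<std::size_t> keep;
  // Scan from the last op to the first, keeping ops that produce a needed variable.
  for (std::size_t i = ops.size(); i-- > 0;) {
    bool produces_needed = false;
    for (const std::string& out : ops[i].outputs) {
      if (needed.count(out) > 0) produces_needed = true;
    }
    if (!produces_needed) continue;
    keep.push_back(i);
    // The inputs of a kept op become needed in turn.
    needed.insert(ops[i].inputs.begin(), ops[i].inputs.end());
  }
  std::reverse(keep.begin(), keep.end());
  return keep;
}
```

Operators that do not contribute to any requested target are never run, so a single block can serve both training and inference with different targets.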
Milestone¶
- take Paddle/books as the main line; the requirements of the models motivate framework refactoring,
- model migration
  - framework development gives priority support to model migration, for example,
    - the MNIST demo needs a Python interface,
    - the RNN models require the framework to support LoDTensor.
  - determine some timelines,
  - heavily relied-upon Ops need to be migrated first,
  - different models can be migrated in parallel.
- improve the framework at the same time
- accept imperfection, and concentrate on solving the specific problem at the right price.
Control the migration quality¶
- compare the performance of migrated models with old ones.
- follow the Google C++ style guide
- build an automatic workflow for generating Python/C++ documentation
  - the documentation of layers and ops should be written inside the code
  - take the documentation quality into account when reviewing PRs
  - preview the documentation, then read and improve it from the users’ perspective