# Design Doc: Refactorization Overview
The goals of refactoring include:
- Making it easy for external contributors to write new elementary computation operations.
- Making the codebase clean and readable.
- Designing a new computation representation – a computation graph of operators and variables.
- Implementing auto-scalability and automatic fault recovery for distributed computing, with the help of computation graphs.
## Computation Graphs
- PaddlePaddle represents the computation (training and inference) of deep learning models as computation graphs.
- Please refer to computation graphs for a concrete example.
- Users write Python programs to describe the graphs and run them (locally or remotely).
- A graph is composed of variables and operators.
- The description of graphs must be capable of being serialized/deserialized, so that
  - it can be sent to the cloud for distributed execution, and
  - it can be sent to clients for mobile or enterprise deployment.
- The Python program does the following two steps:
  - compilation: run the Python program to generate a protobuf message representation of the graph and send it to
    - the C++ library `libpaddle.so` for local execution,
    - the master process of a distributed training job for training, or
    - the server process of a Kubernetes serving job for distributed serving.
  - execution: execute the graph by constructing instances of class `Variable` and `OperatorBase`, according to the protobuf message.
## Description and Realization of Computation Graph
At compile time, the Python program generates a protobuf message representation of the graph, or the description of the graph.
At runtime, the C++ program realizes the graph and runs it.
|   | Representation (protobuf messages) | Realization (C++ class objects) |
|---|---|---|
| Data | VarDesc | Variable |
| Operation | OpDesc | Operator |
| Block | BlockDesc | Block |
The word graph is interchangeable with block in this document. A graph represents computation steps and local variables, similar to a C++/Java program block or a pair of curly braces (`{` and `}`).
## Compilation and Execution
- Run an application Python program to describe the graph. In particular, the Python application program does the following:
  - Create `VarDesc` to represent local/intermediate variables,
  - Create operators and set attributes,
  - Validate attribute values,
  - Infer the type and the shape of variables,
  - Plan memory-reuse for variables,
  - Generate the backward graph,
  - Optimize the computation graph,
  - Potentially, split the graph for distributed training.
- The invocation of `train` or `infer` methods in the application Python program does the following:
  - Create a new Scope instance in the scope hierarchy for each run of a block,
    - realize local variables defined in the BlockDesc message in the new scope,
    - a scope is similar to the stack frame in programming languages,
  - Create an instance of class `Block`, in which,
    - realize operators in the BlockDesc message,
  - Run the Block by calling `Block::Eval(vector<Variable>* targets)` for forward and backward computations, or `Block::Eval(vector<Operator>* targets)` for optimization. (A toy C++ sketch of this execution step follows this list.)
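To make the execution step concrete, here is a self-contained toy sketch of how a block description could be realized at runtime: variables are created in a fresh scope, operators are constructed from their descriptions, and then run in order. The names `Variable`, `Scope`, `OperatorBase`, `OpDesc`, and `BlockDesc` come from this document, but all the class bodies below, plus the `Execute` helper (standing in for `Block::Eval`) and the `AddOp` example, are simplified assumptions rather than Paddle's actual code.

```cpp
// Toy realization of a block: create variables in a Scope, build operators
// from their descriptions, and run them. Simplified stand-in classes only.
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Variable { std::vector<float> data; };        // realization of a VarDesc

struct Scope {                                        // variables for one run
  std::map<std::string, Variable> vars;
  Variable* Var(const std::string& name) { return &vars[name]; }
};

struct OpDesc {                                       // description of an op
  std::string type;
  std::vector<std::string> inputs, outputs;
};

struct BlockDesc {                                    // description of a block
  std::vector<std::string> var_names;
  std::vector<OpDesc> ops;
};

struct OperatorBase {                                 // realization of an OpDesc
  std::vector<std::string> inputs, outputs;
  virtual void Run(Scope* scope) const = 0;
  virtual ~OperatorBase() = default;
};

struct AddOp : OperatorBase {                         // toy element-wise add
  void Run(Scope* scope) const override {
    const auto& x = scope->Var(inputs[0])->data;
    const auto& y = scope->Var(inputs[1])->data;
    auto& out = scope->Var(outputs[0])->data;
    out.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) out[i] = x[i] + y[i];
  }
};

// The execution step: realize variables, construct operators, run the block.
void Execute(const BlockDesc& block, Scope* scope) {
  for (const auto& name : block.var_names) scope->Var(name);
  for (const auto& desc : block.ops) {
    std::unique_ptr<OperatorBase> op(new AddOp);      // real code dispatches on desc.type
    op->inputs = desc.inputs;
    op->outputs = desc.outputs;
    op->Run(scope);
  }
}

int main() {
  BlockDesc block{{"x", "y", "out"}, {{"add", {"x", "y"}, {"out"}}}};
  Scope scope;
  scope.Var("x")->data = {1.f, 2.f, 3.f};
  scope.Var("y")->data = {10.f, 20.f, 30.f};
  Execute(block, &scope);
  std::cout << scope.Var("out")->data[2] << "\n";     // prints 33
}
```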
## Intermediate Representation (IR)
Compile Time -> IR -> Runtime
### Benefits of IR
- Optimization

  Compile Time -> IR -> Optimized IR -> Runtime

- Automatically send partitioned IR to different nodes:
  - Automatic Data Parallelism

    ```text
    Compile Time
    |-> Single GPU IR
    |-> [trainer-IR-0, trainer-IR-1, pserver-IR]
        |-> Node-0 (runs trainer-IR-0)
        |-> Node-1 (runs trainer-IR-1)
        |-> Node-2 (runs pserver-IR)
    ```

  - Automatic Model Parallelism (planned for the future)
## Operator/OpWithKernel/OpKernel
### Operator

- `Operator` is the fundamental building block of the user interface.
  - An operator stores input/output variable names and attributes.
  - The `InferShape` interface infers the shapes of the output variables from the shapes of the input variables.
  - Use `Run` to compute the `output` variables from the `input` variables.
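For intuition, the sketch below shows what such an operator could look like in stripped-down form: the operator holds only variable names and attributes, `InferShape` sizes the outputs, and `Run` computes them inside a scope. The `ScaleOp` example, the flat `Scope` layout, and the `float`-valued attributes are illustrative assumptions, not Paddle's actual definitions.

```cpp
// A stripped-down operator interface: the op stores input/output *names* and
// attributes; InferShape sizes outputs, Run computes them inside a Scope.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Scope {                                        // simplified name -> data store
  std::map<std::string, std::vector<float>> vars;
};

class OperatorBase {
 public:
  OperatorBase(std::vector<std::string> ins, std::vector<std::string> outs,
               std::map<std::string, float> attrs)
      : inputs_(std::move(ins)), outputs_(std::move(outs)), attrs_(std::move(attrs)) {}
  virtual ~OperatorBase() = default;
  virtual void InferShape(Scope* scope) const = 0;    // shapes only, no data
  virtual void Run(Scope* scope) const = 0;           // actual computation

 protected:
  std::vector<std::string> inputs_, outputs_;         // variable names
  std::map<std::string, float> attrs_;                // operator attributes
};

class ScaleOp : public OperatorBase {
 public:
  using OperatorBase::OperatorBase;
  void InferShape(Scope* scope) const override {      // out has the same shape as in
    scope->vars[outputs_[0]].resize(scope->vars[inputs_[0]].size());
  }
  void Run(Scope* scope) const override {             // out = scale * in
    const auto& x = scope->vars[inputs_[0]];
    auto& out = scope->vars[outputs_[0]];
    const float k = attrs_.at("scale");
    for (size_t i = 0; i < x.size(); ++i) out[i] = k * x[i];
  }
};

int main() {
  Scope scope;
  scope.vars["x"] = {1.f, 2.f, 3.f};
  ScaleOp op({"x"}, {"y"}, {{"scale", 2.f}});
  op.InferShape(&scope);
  op.Run(&scope);
  std::cout << scope.vars["y"][2] << "\n";             // prints 6
}
```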
### OpWithKernel/Kernel

- `OpWithKernel` inherits `Operator`.
- `OpWithKernel` contains a kernel map.
  - `OpWithKernel::Run` looks up the kernel for the current device and invokes `OpKernel::Compute`.
  - `OpKernelKey` is the map key. Currently the key is only the device place, but it may also include the data type later.
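The sketch below illustrates the dispatch idea: a map keyed by `OpKernelKey` (here just a device place) holds one kernel per device, and `Run` picks the matching kernel and calls `Compute`. The `Place` enum, `AddKernel` helper, and `CpuAddKernel` are hypothetical, chosen only to make the dispatch pattern concrete.

```cpp
// Hypothetical kernel dispatch: OpWithKernel keeps a map from OpKernelKey
// (currently just the device place) to kernels; Run() picks one and computes.
#include <map>
#include <memory>
#include <stdexcept>

enum class Place { kCPU, kGPU };                      // simplified "device place"

struct OpKernelKey {
  Place place;
  bool operator<(const OpKernelKey& o) const { return place < o.place; }
};

struct OpKernel {
  virtual void Compute() const = 0;                   // real kernels get an execution context
  virtual ~OpKernel() = default;
};

class OpWithKernel {
 public:
  void AddKernel(Place place, std::unique_ptr<OpKernel> kernel) {
    kernels_[OpKernelKey{place}] = std::move(kernel);
  }
  // Run() looks up the kernel for the current place and calls Compute().
  void Run(Place place) const {
    auto it = kernels_.find(OpKernelKey{place});
    if (it == kernels_.end()) throw std::runtime_error("no kernel for this place");
    it->second->Compute();
  }

 private:
  std::map<OpKernelKey, std::unique_ptr<OpKernel>> kernels_;
};

struct CpuAddKernel : OpKernel {
  void Compute() const override { /* CPU implementation would go here */ }
};

int main() {
  OpWithKernel add;
  add.AddKernel(Place::kCPU, std::unique_ptr<OpKernel>(new CpuAddKernel));
  add.Run(Place::kCPU);                               // finds and calls the CPU kernel
  return 0;
}
```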
### Why separate Kernel and Operator?
- Separate GPU and CPU code.
  - Make Paddle capable of running without a GPU.
- Make one operator (which is the user interface) with many implementations.
  - For example, the same multiplication op can have different implementation kernels, such as FP16, FP32, MKL, and Eigen kernels.
### Libraries for Kernel development

- `Eigen::Tensor` contains basic math and element-wise functions.
  - Note that `Eigen::Tensor` has a broadcast implementation.
  - Limit the number of `tensor.device(dev) =` assignments in your code.
- `thrust::transform` and `std::transform`.
  - `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement customized element-wise kernels (see the sketch after this list).
  - `thrust` also has more complex APIs, like `scan`, `reduce`, and `reduce_by_key`.
- Hand-written `GPUKernel` and `CPU` code.
  - Do not write code in header (`.h`) files. CPU kernels should go in C++ source (`.cc`) files and GPU kernels in CUDA (`.cu`) files, because GCC cannot compile GPU code.
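To make the element-wise pattern concrete, here is a small CPU-only sketch: one fused `device(dev) =` assignment with `Eigen::Tensor`, and the same computation with `std::transform`, whose call shape matches `thrust::transform` on GPU. It assumes Eigen's unsupported Tensor module (`<unsupported/Eigen/CXX11/Tensor>`) is available; it is an illustration, not Paddle kernel code.

```cpp
// Element-wise kernels on CPU: a single fused Eigen device-assignment, plus
// the std::transform pattern (thrust::transform on GPU has the same shape).
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  // Eigen::Tensor: build the whole expression, then assign once via device().
  Eigen::Tensor<float, 1> x(3), y(3), out(3);
  x.setValues({1.f, 2.f, 3.f});
  y.setValues({10.f, 20.f, 30.f});
  Eigen::DefaultDevice cpu;
  out.device(cpu) = x * y + x;            // one fused element-wise kernel

  // std::transform: the pattern thrust::transform uses for custom GPU kernels.
  std::vector<float> a{1.f, 2.f, 3.f}, b{10.f, 20.f, 30.f}, c(3);
  std::transform(a.begin(), a.end(), b.begin(), c.begin(), std::plus<float>());

  std::cout << out(0) << " " << c[0] << "\n";  // prints 11 11
  return 0;
}
```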
## Operator Registration
### Why is registration necessary?
We need a method to build mappings between Op type names and Op classes.
### How is registration implemented?
By maintaining a map whose key is the Op type name and whose value is the corresponding Op constructor.
### The Registry Map

#### `OpInfoMap`

`op_type (string)` -> `OpInfo`

`OpInfo` contains:

- `creator`: the Op constructor.
- `grad_op_type`: the type of the gradient Op.
- `proto`: the Op's protobuf message, including inputs, outputs and required attributes.
- `checker`: used to check attributes.
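A rough picture of such a registry entry is sketched below. The field names follow the list above, but the concrete types (the `std::function` creator, `shared_ptr` members, and the empty stand-in structs) are illustrative assumptions, not Paddle's actual definitions.

```cpp
// A simplified picture of a registry entry: OpInfoMap maps an op type name to
// an OpInfo holding the creator, gradient op type, proto, and checker.
#include <functional>
#include <map>
#include <memory>
#include <string>

struct OperatorBase {};      // stand-in for the runtime operator class
struct OpProto {};           // stand-in for the protobuf describing the op
struct AttributeChecker {};  // stand-in for the attribute validator

struct OpInfo {
  std::function<std::unique_ptr<OperatorBase>()> creator;  // the Op constructor
  std::string grad_op_type;                                // type of the gradient Op
  std::shared_ptr<OpProto> proto;                          // inputs, outputs, attributes
  std::shared_ptr<AttributeChecker> checker;               // attribute checking
};

using OpInfoMap = std::map<std::string, OpInfo>;           // op_type -> OpInfo
```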
### Registration Process

- Write an Op class and its gradient Op class, if required.
- Write an Op maker class. In the constructor of this class, describe the inputs, outputs and attributes of the operator.
- Invoke the macro `REGISTER_OP`. This macro will
  - call the maker class to complete the `proto` and the `checker`, and
  - use the completed `proto` and `checker` to add a new key-value pair to the `OpInfoMap`.
- Invoke the `USE` macro in the code where the Op is used, to make sure that it is linked.
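The usual C++ trick behind registration macros of this kind is a file-scope static object whose constructor inserts the entry into the registry before `main()` runs. The sketch below illustrates that idea; `MY_REGISTER_OP`, `OpRegistrar`, and the simplified `OpInfo` are hypothetical and do not reproduce the real `REGISTER_OP`/`USE` definitions.

```cpp
// Illustration of the static-registration trick such macros typically expand
// to: a namespace-scope object whose constructor fills the global registry.
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct OperatorBase {
  virtual ~OperatorBase() = default;
};

struct OpInfo {
  std::function<std::unique_ptr<OperatorBase>()> creator;
};

std::map<std::string, OpInfo>& OpInfoMap() {
  static std::map<std::string, OpInfo> m;              // op_type -> OpInfo
  return m;
}

struct OpRegistrar {
  OpRegistrar(const std::string& type, OpInfo info) { OpInfoMap()[type] = std::move(info); }
};

// What a REGISTER_OP-style macro could expand to (illustrative only).
#define MY_REGISTER_OP(op_type, OpClass)                                     \
  static OpRegistrar registrar_##OpClass(                                    \
      #op_type, OpInfo{[] { return std::unique_ptr<OperatorBase>(new OpClass); }})

struct MulOp : OperatorBase {};
MY_REGISTER_OP(mul, MulOp);   // runs before main(), filling the registry

int main() {
  std::cout << "registered ops: " << OpInfoMap().size() << "\n";  // 1
  auto op = OpInfoMap()["mul"].creator();                         // create a MulOp by name
  return op ? 0 : 1;
}
```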
## Backward Module (2/2)

### Build Backward Network

- Input: a graph of forward operators
- Output: a graph of backward operators
- Corner cases in construction:
  - Shared variables => insert an `Add` operator to combine gradients (a toy sketch of this case follows this list)
  - No gradient => insert a `fill_zero_grad` operator
  - Recursive NetOp => call `Backward` recursively
  - RNN Op => recursively call `Backward` on the stepnet
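As a rough illustration of the shared-variable corner case, the toy sketch below reverses a list of forward op descriptions and, when a gradient variable would be produced by more than one backward op, renames the duplicates and appends an add-style op to combine them. The `BuildBackward` function and the `@GRAD`/`@RENAME` naming scheme are simplifications for illustration, not Paddle's actual `Backward` implementation.

```cpp
// Toy backward builder: reverse the forward ops; when a gradient is produced
// by several backward ops (shared forward variable), rename the extra outputs
// and insert an "add" op that sums them.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs, outputs;
};

std::vector<OpDesc> BuildBackward(const std::vector<OpDesc>& forward) {
  std::vector<OpDesc> backward;
  std::map<std::string, int> grad_count;   // how many ops produce each gradient
  for (auto it = forward.rbegin(); it != forward.rend(); ++it) {
    OpDesc grad_op{it->type + "_grad", {}, {}};
    for (const auto& in : it->inputs) {
      const std::string g = in + "@GRAD";
      const int n = grad_count[g]++;
      // If the forward variable is shared, write to a renamed gradient and
      // remember to combine the pieces later.
      grad_op.outputs.push_back(n == 0 ? g : g + "@RENAME" + std::to_string(n));
    }
    for (const auto& out : it->outputs) grad_op.inputs.push_back(out + "@GRAD");
    backward.push_back(grad_op);
  }
  // Insert one add op per shared gradient to combine the renamed pieces.
  for (const auto& kv : grad_count) {
    if (kv.second > 1) {
      OpDesc add{"add", {kv.first}, {kv.first}};
      for (int i = 1; i < kv.second; ++i)
        add.inputs.push_back(kv.first + "@RENAME" + std::to_string(i));
      backward.push_back(add);
    }
  }
  return backward;
}

int main() {
  // `x` feeds two forward ops, so x@GRAD is shared and needs an add op.
  std::vector<OpDesc> fwd = {{"mul", {"x", "w"}, {"h"}},
                             {"relu", {"x"}, {"y"}}};
  for (const auto& op : BuildBackward(fwd)) std::cout << op.type << "\n";
}
```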
## Scope, Variable, Tensor

- `Tensor` is an n-dimensional array with a type.
  - Only dims and data pointers are stored in `Tensor`.
  - All operations on `Tensor` are written in `Operator` or global functions.
  - Variable-length Tensor design: LoDTensor
- `Variable` instances are the inputs and the outputs of an operator, not just `Tensor`.
  - `step_scopes` in RNN is a variable and not a tensor.
- `Scope` is where variables are stored.
  - `Scope` holds a `map<string /* variable name */, Variable>`.
  - `Scope` has a hierarchical structure. A local scope can get variables from its parent scope (see the sketch after this list).
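A minimal sketch of the hierarchical lookup is shown below: each scope owns a name-to-variable map and falls back to its parent when a local lookup misses, much like a stack frame. The `NewVar`/`FindVar` method names and the empty `Variable` are simplifications, not the real classes.

```cpp
// A simplified hierarchical Scope: local map first, then the parent chain.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Variable {};  // placeholder for the real Variable

class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  Variable* NewVar(const std::string& name) {
    auto& slot = vars_[name];
    if (!slot) slot.reset(new Variable);
    return slot.get();
  }

  // Look locally first, then walk up the parent chain (like a stack frame).
  Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return it->second.get();
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  std::map<std::string, std::unique_ptr<Variable>> vars_;
  const Scope* parent_;
};

int main() {
  Scope global;                         // e.g. holds model parameters
  global.NewVar("w");
  Scope local(&global);                 // created for one run of a block
  local.NewVar("tmp");
  std::cout << (local.FindVar("w") != nullptr) << "\n";    // 1: found in parent
  std::cout << (global.FindVar("tmp") != nullptr) << "\n"; // 0: not visible upward
}
```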
## Block (in design)

### The difference from the original RNNOp

- A block used as an operator is more intuitive than `RNNOp`.
- A block offers a new interface `Eval(targets)` to deduce the minimal block to `Run`.
- A block fits the compile-time/runtime separation design paradigm.
  - During compilation, `SymbolTable` stores `VarDesc`s and `OpDesc`s and serializes them to a `BlockDesc`.
  - When the graph executes, a Block with the `BlockDesc` is passed. It then creates `Op` and `Var` instances and invokes `Run`.
## Milestone

- Take Paddle/books as the main line; the requirements of the models motivate the framework refactoring.
- Model migration
  - Framework development gives priority support to model migration. For example,
    - the MNIST demo needs a Python interface,
    - the RNN models require the framework to support `LoDTensor`.
  - Determine some timelines:
    - frequently used Ops need to be migrated first,
    - different models can be migrated in parallel.
- Improve the framework at the same time.
- Accept imperfection; concentrate on solving the specific problem at a reasonable cost.
### Control the migration quality

- Compare the performance of migrated models with the old ones.
- Follow the Google C++ style guide.
- Build an automatic workflow for generating Python/C++ documentation.
  - The documentation of layers and ops should be written inside the code.
  - Take documentation quality into account when submitting pull requests.
  - Preview the documentation, and read and improve it from a user's perspective.