Commit 04a02b1e authored by Travis CI

Deploy to GitHub Pages: 7508d52b

Parent 5394b772
# Memory Optimization
## Problem
In a lecture, Andrew Ng attributes the recent success of AI to a combination of:
- availability of Big Data
- supercomputing power to process this Big Data over very large neural networks
- modern algorithms
The following graph shows the details:
![](images/deep_learning.png)
A larger model usually brings better performance. However, GPU memory is limited; for example, a GTX TITAN X has only 12GB of memory. To train complex and large models, we have to manage memory usage carefully. Memory optimization is also necessary for both online and mobile inference.
## Solution
### Basic Strategy
There are some basic strategies for memory optimization, including in-place operations and memory sharing.
#### In-place Operation
In a relu activation operator:
$y = \max(x, 0)$
If variable x is not used by any other operator, we can perform the operation in place. In other words, variables y and x share the same memory block. An in-place operation immediately saves 50% of the memory footprint.
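As a minimal sketch of the idea (using NumPy as a stand-in for the real tensor implementation), the output can be written directly into the input's buffer:
```python
import numpy as np

def relu_inplace(x):
    # Writes max(x, 0) back into x's own buffer, so y and x share memory.
    np.maximum(x, 0, out=x)
    return x

x = np.random.randn(4)
y = relu_inplace(x)
assert y is x  # only one memory block is used instead of two
```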
#### Memory Sharing
Not all operators support in-place operations. Memory sharing is a more general strategy.
The following is an example:
```
a = op1(b, c);
d = op2(a)
e = op3(d, f)
```
In this case, variable a is no longer used, and op2 does not support in-place operation. After op2 finishes, we can return the memory of variable a to a memory pool. Variable e can then reuse the memory of variable a from the pool.
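A toy sketch of this pooling idea, assuming a simple free list keyed by shape (not the actual Paddle allocator):
```python
import numpy as np

free_pool = []  # memory blocks whose owning variables are no longer live

def release(block):
    # Called once a variable (e.g. a, after op2) is known to be dead.
    free_pool.append(block)

def allocate(shape):
    # Prefer reusing a dead variable's block of the same shape.
    for i, block in enumerate(free_pool):
        if block.shape == shape:
            return free_pool.pop(i)
    return np.empty(shape)

a = np.empty((1024,))
release(a)              # a is dead after op2
e = allocate((1024,))   # e reuses a's block from the pool
assert e is a
```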
### Live Variable Analysis
Basic strategies alone are not enough. The prerequisite for memory optimization is knowing whether a variable is still "live" after an operation.
In our design, the neural network topology is defined as a program. Luckily, [live variable analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a classic problem in compilers that is used in many stages, such as register allocation.
In compilers, the front end translates programs into an intermediate language with an unbounded number of temporaries. This program must run on a machine with a bounded number of registers. Two temporaries a and b can fit into the same register if a and b are never "in use" at the same time. Thus, many temporaries can fit in few registers; if they don't all fit, the excess temporaries can be kept in memory.
Therefore, the compiler needs to analyze the intermediate-representation program to determine which temporaries are in use at the same time. We say a variable is "live" if it holds a value that may be needed in the future, so this analysis is called liveness analysis.
We can learn these techniques from compilers. Live variable analysis consists of two main stages:
- construct a control flow graph
- solve the dataflow equations
#### Control Flow Graph
To perform analyses on a program, it is often useful to make a control flow graph. A [control flow graph](https://en.wikipedia.org/wiki/Control_flow_graph) (CFG) in computer science is a representation, using graph notation, of all paths that might be traversed through a program during its execution. Each statement in the program is a node in the flow graph; if statement x can be followed by statement y, there is an edge from x to y.
The following is the flow graph of a simple loop.
![](images/control_flow_graph.png)
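Assuming the figure follows the classic loop example from Appel's book (which is consistent with the *succ*/*pred* and *def*/*use* sets quoted in the Dataflow Analysis section below), the corresponding program is roughly:
```
1: a := 0
2: b := a + 1
3: c := c + b
4: a := b * 2
5: if a < N goto 2
6: return c
```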
#### Dataflow Analysis
Liveness of a variable "flows" along the edges of the control flow graph; determining the live range of each variable is an example of a dataflow problem. [Dataflow analysis](https://en.wikipedia.org/wiki/Data-flow_analysis) is a technique for gathering information about the possible set of values calculated at various points in a computer program.
A simple way to perform data-flow analysis of programs is to set up dataflow equations for each node of the control flow graph and solve them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes.
- Flow Graph Terminology
A flow graph node has out-edges that lead to successor nodes, and in-edges that come from predecessor nodes. The set *pred[n]* is the set of all predecessors of node n, and *succ[n]* is the set of successors.
In the control flow graph above, the out-edges of node 5 are 5 --> 6 and 5 --> 2, so *succ[5]* = {2, 6}. The in-edges of node 2 are 5 --> 2 and 1 --> 2, so *pred[2]* = {1, 5}.
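A small Python sketch of building these sets from the edge list of the graph above (the remaining edges are assumed to form the straight-line body of the loop):
```python
from collections import defaultdict

edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 2), (5, 6)]

succ = defaultdict(set)
pred = defaultdict(set)
for src, dst in edges:
    succ[src].add(dst)  # src has an out-edge to dst
    pred[dst].add(src)  # dst has an in-edge from src

assert succ[5] == {2, 6}
assert pred[2] == {1, 5}
```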
- Uses and Defs
An assignment to a variable or temporary defines that variable. An occurrence of a variable on the right-hand side of an assignment (or in other expressions) uses the variable. We can speak of the *def* of a variable as the set of graph nodes that define it, or the *def* of a graph node as the set of variables that it defines, and similarly for the *use* of a variable or graph node. In the control flow graph above, *def(3)* = {c} and *use(3)* = {b, c}.
- Liveness
A variable is *live* on an edge if there is a directed path from that edge to a *use* of the variable that does not go through any *def*. A variable is *live-in* at a node if it is live on any of the in-edges of that node; it is *live-out* at a node if it is live on any of the out-edges of the node.
The calculation of liveness can be solved by iteration until a fixed point is reached. The following is the recursive formula:
![](images/dataflow_equations.png)
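For reference, the standard liveness equations (as given in Appel's book) are:

$in[n] = use[n] \cup (out[n] - def[n])$

$out[n] = \bigcup_{s \in succ[n]} in[s]$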
### Memory optimization transpiler
Finally, we combine the basic strategies with the liveness analysis techniques learned from compilers to implement our memory optimization transpiler.
#### add in-place attribute
In-place support is a built-in attribute of an operator. Since we treat in-place operators and other operators differently, we have to add an in-place attribute to every operator.
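As a purely hypothetical illustration (the names below are assumptions, not the actual Paddle attribute or registry API), the transpiler could query the attribute like this:
```python
# Hypothetical table; in practice the flag would come from each operator's
# built-in in-place attribute in its registration/OpDesc.
INPLACE_OPS = {"relu", "scale"}

def support_inplace(op_type):
    return op_type in INPLACE_OPS
```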
#### construct control flow graph
The following is the ProgramDesc protobuf of the [machine translation](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/tests/book/test_machine_translation.py) example.
- Block0:
```
lookup_table
mul
...
while(sub-block idx 1)
...
array_to_lod_tensor
cross_entropy
...
while_grad(sub-block idx 2)
read_from_array
array_to_lod_tensor
...
```
- Block1
```
read_from_array
read_from_array
...
write_to_array
increment
write_to_array
less_than
```
- Block2
```
read_from_array
increment
...
write_to_array
write_to_array
```
We can traverse all the operators and variables in the ProgramDesc to build a control flow graph.
```python
from collections import defaultdict


class ControlFlowGraph(object):
    def __init__(self, program):
        self._successors = defaultdict(set)
        self._predecessors = defaultdict(set)
        self._uses = defaultdict(set)
        self._defs = defaultdict(set)
        self._live_in = defaultdict(set)
        self._live_out = defaultdict(set)
        self._program = program

    def build(self):
        pass

    def dataflow_analysis(self):
        pass

    def memory_optimization(self):
        pass

    def get_program(self):
        return self._program
```
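A sketch of how the transpiler might drive this class (the method names are those of the skeleton above; `program` is the ProgramDesc to be optimized):
```python
def memory_optimize(program):
    graph = ControlFlowGraph(program)
    graph.build()                 # construct succ/pred and use/def sets
    graph.dataflow_analysis()     # solve the liveness equations
    graph.memory_optimization()   # rewrite the program to share memory
    return graph.get_program()
```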
#### make dataflow analysis
We follow the approach used in compilers and solve the dataflow equations to get the liveness of every variable. If the live-in set of an operator node differs from its live-out set, we can apply memory sharing.
For example:
```
a = op1(b, c);
d = op2(a)
e = op3(d, f)
```
The dataflow analysis result is:
```
live_in(op1) = {b, c, f}
live_out(op1) = {a, f}
live_in(op2) = {a, f}
live_out(op2) = {d, f}
live_in(op3) = {d, f}
live_out(op3) = {}
```
After op1, the memory of variables b and c can be released to the pool; after op2, variable a can be released; after op3, variables d and f can be released.
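A minimal Python sketch of the fixed-point iteration on this three-operator chain (assuming b, c and f are external inputs defined before op1); it reproduces the result above:
```python
# use/def sets of each operator node, taken from the example above
ops = ["op1", "op2", "op3"]
uses = {"op1": {"b", "c"}, "op2": {"a"}, "op3": {"d", "f"}}
defs = {"op1": {"a"}, "op2": {"d"}, "op3": {"e"}}
succ = {"op1": ["op2"], "op2": ["op3"], "op3": []}

live_in = {op: set() for op in ops}
live_out = {op: set() for op in ops}

changed = True
while changed:  # iterate until a fixed point is reached
    changed = False
    for op in reversed(ops):  # backward order speeds up convergence
        out = set().union(*[live_in[s] for s in succ[op]])
        new_in = uses[op] | (out - defs[op])
        if (new_in, out) != (live_in[op], live_out[op]):
            live_in[op], live_out[op] = new_in, out
            changed = True

assert live_in["op1"] == {"b", "c", "f"} and live_out["op1"] == {"a", "f"}
assert live_out["op3"] == set()
```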
#### memory sharing policy
A memory pool is maintained during the memory optimization stage. Each operator node is scanned to determine whether memory optimization can be applied. If an operator satisfies the requirement, the following policy is used to handle its input/output variables.
```
if op.support_inplace():
    i --> pool
    pool --> o
else:
    pool --> o
    i --> pool
```
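A sketch of the two branches in Python, reusing the toy `release`/`allocate` pool helpers from the Memory Sharing section (hypothetical names):
```python
def handle_variables(op, dead_inputs, output_shape):
    if op.support_inplace():
        # i --> pool first, so the output can pick up the freed block
        # and the computation happens in place.
        for block in dead_inputs:
            release(block)
        return allocate(output_shape)   # pool --> o
    else:
        # pool --> o first, because the output must not overlap its
        # inputs; only afterwards are the dead inputs returned.
        out = allocate(output_shape)
        for block in dead_inputs:
            release(block)              # i --> pool
        return out
```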
## References
- [Lecture Notes From Artificial Intelligence Is The New Electricity By Andrew Ng](https://manavsehgal.com/lecture-notes-from-artificial-intelligence-is-the-new-electricity-by-andrew-ng-4712dcbf26e5)
- Modern Compiler Implementation in ML, by Andrew W. Appel
- [Optimizing Memory Consumption in Deep Learning](https://mxnet.incubator.apache.org/architecture/note_memory.html)
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Memory Optimization &mdash; PaddlePaddle documentation</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="index" title="Index"
href="../genindex.html"/>
<link rel="search" title="Search" href="../search.html"/>
<link rel="top" title="PaddlePaddle documentation" href="../index.html"/>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/override.css" type="text/css" />
<script>
var _hmt = _hmt || [];
(function() {
var hm = document.createElement("script");
hm.src = "//hm.baidu.com/hm.js?b9a314ab40d04d805655aab1deee08ba";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
</script>
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<header class="site-header">
<div class="site-logo">
<a href="/"><img src="../_static/images/PP_w.png"></a>
</div>
<div class="site-nav-links">
<div class="site-menu">
<a class="fork-on-github" href="https://github.com/PaddlePaddle/Paddle" target="_blank"><i class="fa fa-github"></i>Fork me on Github</a>
<div class="language-switcher dropdown">
<a type="button" data-toggle="dropdown">
<span>English</span>
<i class="fa fa-angle-up"></i>
<i class="fa fa-angle-down"></i>
</a>
<ul class="dropdown-menu">
<li><a href="/doc_cn">中文</a></li>
<li><a href="/doc">English</a></li>
</ul>
</div>
<ul class="site-page-links">
<li><a href="/">Home</a></li>
</ul>
</div>
<div class="doc-module">
<ul>
<li class="toctree-l1"><a class="reference internal" href="../getstarted/index_en.html">GET STARTED</a></li>
<li class="toctree-l1"><a class="reference internal" href="../howto/index_en.html">HOW TO</a></li>
<li class="toctree-l1"><a class="reference internal" href="../api/index_en.html">API</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mobile/index_en.html">MOBILE</a></li>
</ul>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
</div>
</header>
<div class="main-content-wrap">
<nav class="doc-menu-vertical" role="navigation">
<ul>
<li class="toctree-l1"><a class="reference internal" href="../getstarted/index_en.html">GET STARTED</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/pip_install_en.html">Install Using pip</a></li>
<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/docker_install_en.html">Run in Docker Containers</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/dev/build_en.html">Build using Docker</a></li>
<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/build_from_source_en.html">Build from Sources</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../howto/index_en.html">HOW TO</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../howto/usage/cmd_parameter/index_en.html">Set Command-line Parameters</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/use_case_en.html">Use Case</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/arguments_en.html">Argument Outline</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/detail_introduction_en.html">Detail Description</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../howto/usage/cluster/cluster_train_en.html">Distributed Training</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/fabric_en.html">fabric</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/openmpi_en.html">openmpi</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/k8s_en.html">kubernetes</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/k8s_aws_en.html">kubernetes on AWS</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../howto/dev/new_layer_en.html">Write New Layers</a></li>
<li class="toctree-l2"><a class="reference internal" href="../howto/dev/contribute_to_paddle_en.html">Contribute Code</a></li>
<li class="toctree-l2"><a class="reference internal" href="../howto/dev/write_docs_en.html">Contribute Documentation</a></li>
<li class="toctree-l2"><a class="reference internal" href="../howto/deep_model/rnn/index_en.html">RNN Models</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/rnn_config_en.html">RNN Configuration</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../howto/optimization/gpu_profiling_en.html">Tune GPU Performance</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../api/index_en.html">API</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../api/v2/model_configs.html">Model Configuration</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/activation.html">Activation</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/layer.html">Layers</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/evaluators.html">Evaluators</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/optimizer.html">Optimizer</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/pooling.html">Pooling</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/networks.html">Networks</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/attr.html">Parameter Attribute</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../api/v2/data.html">Data Reader Interface and DataSets</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/data_reader.html">Data Reader Interface</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/image.html">Image Interface</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/dataset.html">Dataset</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../api/v2/run_logic.html">Training and Inference</a></li>
<li class="toctree-l2"><a class="reference internal" href="../api/v2/fluid.html">Fluid</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/layers.html">Layers</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/data_feeder.html">DataFeeder</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/executor.html">Executor</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/initializer.html">Initializer</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/evaluator.html">Evaluator</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/nets.html">Nets</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/optimizer.html">Optimizer</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/param_attr.html">ParamAttr</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/profiler.html">Profiler</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/regularizer.html">Regularizer</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../mobile/index_en.html">MOBILE</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_android_en.html">Build PaddlePaddle for Android</a></li>
<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_ios_en.html">Build PaddlePaddle for iOS</a></li>
<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_raspberry_en.html">Build PaddlePaddle for Raspberry Pi</a></li>
</ul>
</li>
</ul>
</nav>
<section class="doc-content-wrap">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li>Memory Optimization</li>
</ul>
</div>
<div class="wy-nav-content" id="doc-content">
<div class="rst-content">
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="memory-optimization">
<span id="memory-optimization"></span><h1>Memory Optimization<a class="headerlink" href="#memory-optimization" title="Permalink to this headline"></a></h1>
<div class="section" id="problem">
<span id="problem"></span><h2>Problem<a class="headerlink" href="#problem" title="Permalink to this headline"></a></h2>
<p>In a lecture from Andrew Ng, he attributes the recent sucess of AI due to a combination of these:</p>
<ul class="simple">
<li>availability of Big Data</li>
<li>supercomputing power to process this Big Data over very large neural networks</li>
<li>modern algorithms</li>
</ul>
<p>Following graph shows the details:</p>
<p><img alt="" src="../_images/deep_learning.png" /></p>
<p>Larger model usually brings better performance. However, GPU memory is certain limited. For example, the memory size of a GTX TITAN X is only 12GB. To train complex and large model, we have to take care of memory using. Besides, memory optimization is also necessary in both online/mobile inference.</p>
</div>
<div class="section" id="solution">
<span id="solution"></span><h2>Solution<a class="headerlink" href="#solution" title="Permalink to this headline"></a></h2>
<div class="section" id="basic-strategy">
<span id="basic-strategy"></span><h3>Basic Strategy<a class="headerlink" href="#basic-strategy" title="Permalink to this headline"></a></h3>
<p>There are some basic strategies to make memory optimization, including in-place operation and memory sharing.</p>
<div class="section" id="in-place-operation">
<span id="in-place-operation"></span><h4>In-place Operation<a class="headerlink" href="#in-place-operation" title="Permalink to this headline"></a></h4>
<p>In a relu activation operator:</p>
<p>$y = \max(x, 0)$</p>
<p>If the variable x is not used in any other operator, we can make an in-place operation. In other words, the memory block of variable y and variable x are the same. In-place operation will save 50% memory occupancy immediately.</p>
</div>
<div class="section" id="memory-sharing">
<span id="memory-sharing"></span><h4>Memory Sharing<a class="headerlink" href="#memory-sharing" title="Permalink to this headline"></a></h4>
<p>Not all operators support in-place operations. Memory sharing is a more general strategy.</p>
<p>Following is an example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">a</span> <span class="o">=</span> <span class="n">op1</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">);</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">op2</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">e</span> <span class="o">=</span> <span class="n">op3</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<p>In this case, variable a is no longer used, and op2 does not support in-place operation. After op2 finished, we can put the memory of variable a to a memory pool. Then, variable e can share the memory of variable a from the pool.</p>
</div>
</div>
<div class="section" id="live-variable-analysis">
<span id="live-variable-analysis"></span><h3>Live Variable Analysis<a class="headerlink" href="#live-variable-analysis" title="Permalink to this headline"></a></h3>
<p>It&#8217;s not enough to only have some basic strategies. The prerequisite of memory optimization is to know if a variable is still &#8220;live&#8221; after an operation.</p>
<p>In our design, the neural network topology is defined as a program. Luckily, <a class="reference external" href="https://en.wikipedia.org/wiki/Live_variable_analysis">live variable analysis</a> is a classic problem in compilers which can be used in many stages, such as register allocation.</p>
<p>In compilers, the front end of the compilers translates programs into an intermediate language with an unbounded number of temporaries. This program must run on a machine with a bounded number of registers. Two temporaries a and b can fit into the same register, if a and b are never &#8220;in use&#8221; at the same time. Thus, many temporaries can fit in few registers; if they don&#8217;t all fit, the excess temporaries can be kept in memory.</p>
<p>Therefore, the compiler needs to analyze the intermediate-representation program to determine which temporaries are in use at the same time. We say a variable is &#8220;live&#8221; if it holds a value that may be needed in the future, so this analysis is called liveness analysis.</p>
<p>We can leran these techniques from compilers. There are mainly two stages to make live variable analysis:</p>
<ul class="simple">
<li>construct a control flow graph</li>
<li>solve the dataflow equations</li>
</ul>
<div class="section" id="control-flow-graph">
<span id="control-flow-graph"></span><h4>Control Flow Graph<a class="headerlink" href="#control-flow-graph" title="Permalink to this headline"></a></h4>
<p>To preform analyses on a program, it is often useful to make a control flow graph. A <a class="reference external" href="https://en.wikipedia.org/wiki/Control_flow_graph">control flow graph</a> (CFG) in computer science is a representation, using graph notation, of all paths that might be traversed through a program during its execution. Each statement in the program is a node in the flow graph; if statemment x can be followed by statement y, there is an egde from x to y.</p>
<p>Following is the flow graph for a simple loop.</p>
<p><img alt="" src="../_images/control_flow_graph.png" /></p>
</div>
<div class="section" id="dataflow-analysis">
<span id="dataflow-analysis"></span><h4>Dataflow Analysis<a class="headerlink" href="#dataflow-analysis" title="Permalink to this headline"></a></h4>
<p>liveness of variable &#8220;flows&#8221; around the edges of the control flow graph; determining the live range of each variable is an example of a dataflow problem. <a class="reference external" href="https://en.wikipedia.org/wiki/Data-flow_analysis">Dataflow analysis</a> is a technique for gathering information about the possible set of values calculated at various points in a computer program.</p>
<p>A simple way to perform data-flow analysis of programs is to set up dataflow equations for each node of the control flow graph and solve them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes.</p>
<ul class="simple">
<li>Flow Graph Terminology</li>
</ul>
<p>A flow graph node has out-edges that lead to sucessor nodes, and in-edges that come from presucessor nodes. The set <em>pred[n]</em> is all the predecessors of node n, and <em>succ[n]</em> is the set of sucessors.
In former control flow graph, the out-edges of node 5 are 5 &#8211;&gt; 6 and 5 &#8211;&gt; 2, and <em>succ[5]</em> = {2, 6}. The in-edges of 2 are 5 &#8211;&gt; 2 and 1 &#8211;&gt; 2, and <em>pred[2]</em> = {1, 5}.</p>
<ul class="simple">
<li>Uses and Defs</li>
</ul>
<p>An assignmemt to a variable or temporary defines that variable. An occurence of a variable on the right-hand side of an assginment(or in other expressions) uses the variable. We can speak the <em>def</em> of a variable as the set of graph nodes that define it; or the <em>def</em> of a graph node as the set of variables that it defines; and the similarly for the <em>use</em> of a variable or graph node. In former control flow graph, <em>def(3)</em> = {c}, <em>use(3)</em> = {b, c}.</p>
<ul class="simple">
<li>Liveness</li>
</ul>
<p>A variable is <em>live</em> on an edge if there is a directed path from that edge to a <em>use</em> of the variable that does not go through any <em>def</em>. A variable is <em>live-in</em> at a node if it is live on any of the in-edges of that node; it is <em>live-out</em> at a node if it is live on any of the out-edges of the node.</p>
<p>The calcution of liveness can be solved by iteration until a fixed pointer is reached. Following is the recursive formula:</p>
<p><img alt="" src="../_images/dataflow_equations.png" /></p>
</div>
</div>
<div class="section" id="memory-optimization-transpiler">
<span id="memory-optimization-transpiler"></span><h3>Memory optimization transpiler<a class="headerlink" href="#memory-optimization-transpiler" title="Permalink to this headline"></a></h3>
<p>At last, we take basic strategy and liveness analysis techniques learning from compilers to implement our memory optimization transpiler.</p>
<div class="section" id="add-in-place-attribute">
<span id="add-in-place-attribute"></span><h4>add in-place attribute<a class="headerlink" href="#add-in-place-attribute" title="Permalink to this headline"></a></h4>
<p>In-place is a built-in attribute of an operator. Since we treat in-place and other operators differently, we have to add an in-place attribute for every operator.</p>
</div>
<div class="section" id="contruct-control-flow-graph">
<span id="contruct-control-flow-graph"></span><h4>contruct control flow graph<a class="headerlink" href="#contruct-control-flow-graph" title="Permalink to this headline"></a></h4>
<p>Following is the ProgramDesc protobuf of <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/tests/book/test_machine_translation.py">machine translation</a> example.</p>
<ul class="simple">
<li>Block0:</li>
</ul>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">lookup_table</span>
<span class="n">mul</span>
<span class="o">...</span>
<span class="k">while</span><span class="p">(</span><span class="n">sub</span><span class="o">-</span><span class="n">block</span> <span class="n">idx</span> <span class="mi">1</span><span class="p">)</span>
<span class="o">...</span>
<span class="n">array_to_lod_tensor</span>
<span class="n">cross_entropy</span>
<span class="o">...</span>
<span class="n">while_grad</span><span class="p">(</span><span class="n">sub</span><span class="o">-</span><span class="n">block</span> <span class="n">idx</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">read_from_array</span>
<span class="n">array_to_lod_tensor</span>
<span class="o">...</span>
</pre></div>
</div>
<ul class="simple">
<li>Block1</li>
</ul>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">read_from_array</span>
<span class="n">read_from_array</span>
<span class="o">...</span>
<span class="n">write_to_array</span>
<span class="n">increment</span>
<span class="n">write_to_array</span>
<span class="n">less_than</span>
</pre></div>
</div>
<ul class="simple">
<li>Block2</li>
</ul>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">read_from_array</span>
<span class="n">increment</span>
<span class="o">...</span>
<span class="n">write_to_array</span>
<span class="n">write_to_array</span>
</pre></div>
</div>
<p>We can transfer all the operators and variables in ProgramDesc to build a control flow graph.</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">ControlFlowGraph</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">Program</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_sucessors</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_presucessors</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_uses</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_defs</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_live_in</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_live_out</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_program</span> <span class="o">=</span> <span class="n">Program</span>
<span class="k">def</span> <span class="nf">build</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">dataflow_analysis</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">memory_optimization</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">get_program</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_program</span>
</pre></div>
</div>
</div>
<div class="section" id="make-dataflow-analysis">
<span id="make-dataflow-analysis"></span><h4>make dataflow analysis<a class="headerlink" href="#make-dataflow-analysis" title="Permalink to this headline"></a></h4>
<p>We follow guide from compilers and try to solve the dataflow equation to get liveness of every variable. If the live-in of an operator node is different from the live-out, then we can make memory sharing.</p>
<p>For example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">a</span> <span class="o">=</span> <span class="n">op1</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">);</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">op2</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">e</span> <span class="o">=</span> <span class="n">op3</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<p>The dataflow analysis result is:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">live_in</span><span class="p">(</span><span class="n">op1</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_out</span><span class="p">(</span><span class="n">op1</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">a</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_in</span><span class="p">(</span><span class="n">op2</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">a</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_out</span><span class="p">(</span><span class="n">op2</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">d</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_in</span><span class="p">(</span><span class="n">op3</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">d</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_out</span><span class="p">(</span><span class="n">op3</span><span class="p">)</span> <span class="o">=</span> <span class="p">{}</span>
</pre></div>
</div>
<p>After op1, we can process variable b and variable c; After op2, we can process variable a. After op3, we can process variable d and variable f.</p>
</div>
<div class="section" id="memory-sharing-policy">
<span id="memory-sharing-policy"></span><h4>memory sharing policy<a class="headerlink" href="#memory-sharing-policy" title="Permalink to this headline"></a></h4>
<p>A memory pool will be mantained in the stage of memory optimization. Each operator node will be scanned to determine memory optimization is done or not. If an operator satifies the requirement, following policy will be taken to handle input/output variables.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">support_inplace</span><span class="p">():</span>
<span class="n">i</span> <span class="o">--&gt;</span> <span class="n">pool</span>
<span class="n">pool</span> <span class="o">--&gt;</span> <span class="n">o</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">pool</span> <span class="o">--&gt;</span> <span class="n">o</span>
<span class="n">i</span> <span class="o">--&gt;</span> <span class="n">pool</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="section" id="reference">
<span id="reference"></span><h2>Reference<a class="headerlink" href="#reference" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><a class="reference external" href="https://manavsehgal.com/lecture-notes-from-artificial-intelligence-is-the-new-electricity-by-andrew-ng-4712dcbf26e5">Lecture Notes From Artificial Intelligence Is The New Electricity By Andrew Ng</a></li>
<li>Modern compiler implementation in ML, by Andrew W. Appel</li>
<li><a class="reference external" href="https://mxnet.incubator.apache.org/architecture/note_memory.html">Optimizing Memory Consumption in Deep learning</a></li>
</ul>
</div>
</div>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2016, PaddlePaddle developers.
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT:'../',
VERSION:'',
COLLAPSE_INDEX:false,
FILE_SUFFIX:'.html',
HAS_SOURCE: true,
SOURCELINK_SUFFIX: ".txt",
};
</script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/js/perfect-scrollbar.jquery.min.js"></script>
<script src="../_static/js/paddle_doc_init.js"></script>
</body>
</html>
\ No newline at end of file
因为 它太大了无法显示 source diff 。你可以改为 查看blob
# Memory Optimization
## Problem
In a lecture from Andrew Ng, he attributes the recent sucess of AI due to a combination of these:
- availability of Big Data
- supercomputing power to process this Big Data over very large neural networks
- modern algorithms
Following graph shows the details:
![](images/deep_learning.png)
Larger model usually brings better performance. However, GPU memory is certain limited. For example, the memory size of a GTX TITAN X is only 12GB. To train complex and large model, we have to take care of memory using. Besides, memory optimization is also necessary in both online/mobile inference.
## Solution
### Basic Strategy
There are some basic strategies to make memory optimization, including in-place operation and memory sharing.
#### In-place Operation
In a relu activation operator:
$y = \max(x, 0)$
If the variable x is not used in any other operator, we can make an in-place operation. In other words, the memory block of variable y and variable x are the same. In-place operation will save 50% memory occupancy immediately.
#### Memory Sharing
Not all operators support in-place operations. Memory sharing is a more general strategy.
Following is an example:
```
a = op1(b, c);
d = op2(a)
e = op3(d, f)
```
In this case, variable a is no longer used, and op2 does not support in-place operation. After op2 finished, we can put the memory of variable a to a memory pool. Then, variable e can share the memory of variable a from the pool.
### Live Variable Analysis
It's not enough to only have some basic strategies. The prerequisite of memory optimization is to know if a variable is still "live" after an operation.
In our design, the neural network topology is defined as a program. Luckily, [live variable analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a classic problem in compilers which can be used in many stages, such as register allocation.
In compilers, the front end of the compilers translates programs into an intermediate language with an unbounded number of temporaries. This program must run on a machine with a bounded number of registers. Two temporaries a and b can fit into the same register, if a and b are never "in use" at the same time. Thus, many temporaries can fit in few registers; if they don't all fit, the excess temporaries can be kept in memory.
Therefore, the compiler needs to analyze the intermediate-representation program to determine which temporaries are in use at the same time. We say a variable is "live" if it holds a value that may be needed in the future, so this analysis is called liveness analysis.
We can leran these techniques from compilers. There are mainly two stages to make live variable analysis:
- construct a control flow graph
- solve the dataflow equations
#### Control Flow Graph
To preform analyses on a program, it is often useful to make a control flow graph. A [control flow graph](https://en.wikipedia.org/wiki/Control_flow_graph) (CFG) in computer science is a representation, using graph notation, of all paths that might be traversed through a program during its execution. Each statement in the program is a node in the flow graph; if statemment x can be followed by statement y, there is an egde from x to y.
Following is the flow graph for a simple loop.
![](images/control_flow_graph.png)
#### Dataflow Analysis
liveness of variable "flows" around the edges of the control flow graph; determining the live range of each variable is an example of a dataflow problem. [Dataflow analysis](https://en.wikipedia.org/wiki/Data-flow_analysis) is a technique for gathering information about the possible set of values calculated at various points in a computer program.
A simple way to perform data-flow analysis of programs is to set up dataflow equations for each node of the control flow graph and solve them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes.
- Flow Graph Terminology
A flow graph node has out-edges that lead to sucessor nodes, and in-edges that come from presucessor nodes. The set *pred[n]* is all the predecessors of node n, and *succ[n]* is the set of sucessors.
In former control flow graph, the out-edges of node 5 are 5 --> 6 and 5 --> 2, and *succ[5]* = {2, 6}. The in-edges of 2 are 5 --> 2 and 1 --> 2, and *pred[2]* = {1, 5}.
- Uses and Defs
An assignmemt to a variable or temporary defines that variable. An occurence of a variable on the right-hand side of an assginment(or in other expressions) uses the variable. We can speak the *def* of a variable as the set of graph nodes that define it; or the *def* of a graph node as the set of variables that it defines; and the similarly for the *use* of a variable or graph node. In former control flow graph, *def(3)* = {c}, *use(3)* = {b, c}.
- Liveness
A variable is *live* on an edge if there is a directed path from that edge to a *use* of the variable that does not go through any *def*. A variable is *live-in* at a node if it is live on any of the in-edges of that node; it is *live-out* at a node if it is live on any of the out-edges of the node.
The calcution of liveness can be solved by iteration until a fixed pointer is reached. Following is the recursive formula:
![](images/dataflow_equations.png)
### Memory optimization transpiler
At last, we take basic strategy and liveness analysis techniques learning from compilers to implement our memory optimization transpiler.
#### add in-place attribute
In-place is a built-in attribute of an operator. Since we treat in-place and other operators differently, we have to add an in-place attribute for every operator.
#### contruct control flow graph
Following is the ProgramDesc protobuf of [machine translation](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/tests/book/test_machine_translation.py) example.
- Block0:
```
lookup_table
mul
...
while(sub-block idx 1)
...
array_to_lod_tensor
cross_entropy
...
while_grad(sub-block idx 2)
read_from_array
array_to_lod_tensor
...
```
- Block1
```
read_from_array
read_from_array
...
write_to_array
increment
write_to_array
less_than
```
- Block2
```
read_from_array
increment
...
write_to_array
write_to_array
```
We can transfer all the operators and variables in ProgramDesc to build a control flow graph.
```python
class ControlFlowGraph(object):
def __init__(self, Program):
self._sucessors = defaultdict(set)
self._presucessors = defaultdict(set)
self._uses = defaultdict(set)
self._defs = defaultdict(set)
self._live_in = defaultdict(set)
self._live_out = defaultdict(set)
self._program = Program
def build(self):
pass
def dataflow_analysis(self):
pass
def memory_optimization(self):
pass
def get_program(self):
return self._program
```
#### make dataflow analysis
We follow guide from compilers and try to solve the dataflow equation to get liveness of every variable. If the live-in of an operator node is different from the live-out, then we can make memory sharing.
For example:
```
a = op1(b, c);
d = op2(a)
e = op3(d, f)
```
The dataflow analysis result is:
```
live_in(op1) = {b, c, f}
live_out(op1) = {a, f}
live_in(op2) = {a, f}
live_out(op2) = {d, f}
live_in(op3) = {d, f}
live_out(op3) = {}
```
After op1, we can process variable b and variable c; After op2, we can process variable a. After op3, we can process variable d and variable f.
#### memory sharing policy
A memory pool will be mantained in the stage of memory optimization. Each operator node will be scanned to determine memory optimization is done or not. If an operator satifies the requirement, following policy will be taken to handle input/output variables.
```
if op.support_inplace():
i --> pool
pool --> o
else:
pool --> o
i --> pool
```
## Reference
- [Lecture Notes From Artificial Intelligence Is The New Electricity By Andrew Ng](https://manavsehgal.com/lecture-notes-from-artificial-intelligence-is-the-new-electricity-by-andrew-ng-4712dcbf26e5)
- Modern compiler implementation in ML, by Andrew W. Appel
- [Optimizing Memory Consumption in Deep learning](https://mxnet.incubator.apache.org/architecture/note_memory.html)
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Memory Optimization &mdash; PaddlePaddle 文档</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="index" title="索引"
href="../genindex.html"/>
<link rel="search" title="搜索" href="../search.html"/>
<link rel="top" title="PaddlePaddle 文档" href="../index.html"/>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/override.css" type="text/css" />
<script>
var _hmt = _hmt || [];
(function() {
var hm = document.createElement("script");
hm.src = "//hm.baidu.com/hm.js?b9a314ab40d04d805655aab1deee08ba";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
</script>
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<header class="site-header">
<div class="site-logo">
<a href="/"><img src="../_static/images/PP_w.png"></a>
</div>
<div class="site-nav-links">
<div class="site-menu">
<a class="fork-on-github" href="https://github.com/PaddlePaddle/Paddle" target="_blank"><i class="fa fa-github"></i>Fork me on Github</a>
<div class="language-switcher dropdown">
<a type="button" data-toggle="dropdown">
<span>English</span>
<i class="fa fa-angle-up"></i>
<i class="fa fa-angle-down"></i>
</a>
<ul class="dropdown-menu">
<li><a href="/doc_cn">中文</a></li>
<li><a href="/doc">English</a></li>
</ul>
</div>
<ul class="site-page-links">
<li><a href="/">Home</a></li>
</ul>
</div>
<div class="doc-module">
<ul>
<li class="toctree-l1"><a class="reference internal" href="../getstarted/index_cn.html">新手入门</a></li>
<li class="toctree-l1"><a class="reference internal" href="../howto/index_cn.html">进阶指南</a></li>
<li class="toctree-l1"><a class="reference internal" href="../api/index_cn.html">API</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faq/index_cn.html">FAQ</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mobile/index_cn.html">MOBILE</a></li>
</ul>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
</div>
</header>
<div class="main-content-wrap">
<nav class="doc-menu-vertical" role="navigation">
<ul>
<li class="toctree-l1"><a class="reference internal" href="../getstarted/index_cn.html">新手入门</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../getstarted/build_and_install/index_cn.html">安装与编译</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/pip_install_cn.html">使用pip安装</a></li>
<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/docker_install_cn.html">使用Docker安装运行</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/dev/build_cn.html">用Docker编译和测试PaddlePaddle</a></li>
<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/build_from_source_cn.html">从源码编译</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../getstarted/concepts/use_concepts_cn.html">基本使用概念</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../howto/index_cn.html">进阶指南</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../howto/usage/cmd_parameter/index_cn.html">设置命令行参数</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/use_case_cn.html">使用案例</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/arguments_cn.html">参数概述</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/detail_introduction_cn.html">细节描述</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../howto/usage/cluster/cluster_train_cn.html">分布式训练</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/fabric_cn.html">fabric集群</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/openmpi_cn.html">openmpi集群</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/k8s_cn.html">kubernetes单机</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/k8s_distributed_cn.html">kubernetes distributed分布式</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cluster/k8s_aws_cn.html">AWS上运行kubernetes集群训练</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../howto/dev/contribute_to_paddle_cn.html">如何贡献代码</a></li>
<li class="toctree-l2"><a class="reference internal" href="../howto/dev/write_docs_cn.html">如何贡献/修改文档</a></li>
<li class="toctree-l2"><a class="reference internal" href="../howto/deep_model/rnn/index_cn.html">RNN相关模型</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/rnn_config_cn.html">RNN配置</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/recurrent_group_cn.html">Recurrent Group教程</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/hierarchical_layer_cn.html">支持双层序列作为输入的Layer</a></li>
<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/hrnn_rnn_api_compare_cn.html">单双层RNN API对比介绍</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../howto/optimization/gpu_profiling_cn.html">GPU性能分析与调优</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../api/index_cn.html">API</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../api/v2/model_configs.html">模型配置</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/activation.html">Activation</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/layer.html">Layers</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/evaluators.html">Evaluators</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/optimizer.html">Optimizer</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/pooling.html">Pooling</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/networks.html">Networks</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/attr.html">Parameter Attribute</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../api/v2/data.html">数据访问</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/data_reader.html">Data Reader Interface</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/image.html">Image Interface</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/dataset.html">Dataset</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../api/v2/run_logic.html">训练与应用</a></li>
<li class="toctree-l2"><a class="reference internal" href="../api/v2/fluid.html">Fluid</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/layers.html">Layers</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/data_feeder.html">DataFeeder</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/executor.html">Executor</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/initializer.html">Initializer</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/evaluator.html">Evaluator</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/nets.html">Nets</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/optimizer.html">Optimizer</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/param_attr.html">ParamAttr</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/profiler.html">Profiler</a></li>
<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/regularizer.html">Regularizer</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../faq/index_cn.html">FAQ</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../faq/build_and_install/index_cn.html">编译安装与单元测试</a></li>
<li class="toctree-l2"><a class="reference internal" href="../faq/model/index_cn.html">模型配置</a></li>
<li class="toctree-l2"><a class="reference internal" href="../faq/parameter/index_cn.html">参数设置</a></li>
<li class="toctree-l2"><a class="reference internal" href="../faq/local/index_cn.html">本地训练与预测</a></li>
<li class="toctree-l2"><a class="reference internal" href="../faq/cluster/index_cn.html">集群训练与预测</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../mobile/index_cn.html">MOBILE</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_android_cn.html">Android平台编译指南</a></li>
<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_ios_cn.html">iOS平台编译指南</a></li>
<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_raspberry_cn.html">Raspberry Pi平台编译指南</a></li>
</ul>
</li>
</ul>
</nav>
<section class="doc-content-wrap">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li>Memory Optimization</li>
</ul>
</div>
<div class="wy-nav-content" id="doc-content">
<div class="rst-content">
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="memory-optimization">
<span id="memory-optimization"></span><h1>Memory Optimization<a class="headerlink" href="#memory-optimization" title="永久链接至标题"></a></h1>
<div class="section" id="problem">
<span id="problem"></span><h2>Problem<a class="headerlink" href="#problem" title="永久链接至标题"></a></h2>
<p>In a lecture from Andrew Ng, he attributes the recent sucess of AI due to a combination of these:</p>
<ul class="simple">
<li>availability of Big Data</li>
<li>supercomputing power to process this Big Data over very large neural networks</li>
<li>modern algorithms</li>
</ul>
<p>Following graph shows the details:</p>
<p><img alt="" src="../_images/deep_learning.png" /></p>
<p>Larger model usually brings better performance. However, GPU memory is certain limited. For example, the memory size of a GTX TITAN X is only 12GB. To train complex and large model, we have to take care of memory using. Besides, memory optimization is also necessary in both online/mobile inference.</p>
</div>
<div class="section" id="solution">
<span id="solution"></span><h2>Solution<a class="headerlink" href="#solution" title="永久链接至标题"></a></h2>
<div class="section" id="basic-strategy">
<span id="basic-strategy"></span><h3>Basic Strategy<a class="headerlink" href="#basic-strategy" title="永久链接至标题"></a></h3>
<p>There are some basic strategies to make memory optimization, including in-place operation and memory sharing.</p>
<div class="section" id="in-place-operation">
<span id="in-place-operation"></span><h4>In-place Operation<a class="headerlink" href="#in-place-operation" title="永久链接至标题"></a></h4>
<p>Consider a relu activation operator:</p>
<p>$y = \max(x, 0)$</p>
<p>If variable x is not used by any other operator, we can perform the operation in place; in other words, variable y and variable x share the same memory block. An in-place operation immediately saves 50% of the memory occupancy.</p>
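<p>As a minimal illustration (a NumPy sketch, not the actual operator kernel), an in-place relu writes the result back into the input&#8217;s buffer:</p>
<div class="highlight-python"><div class="highlight"><pre>import numpy as np

x = np.random.randn(4).astype("float32")
# Write max(x, 0) back into x's buffer: y and x now share the same memory block.
y = np.maximum(x, 0, out=x)
assert y is x  # no extra memory was allocated for the output
</pre></div>
</div>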
</div>
<div class="section" id="memory-sharing">
<span id="memory-sharing"></span><h4>Memory Sharing<a class="headerlink" href="#memory-sharing" title="永久链接至标题"></a></h4>
<p>Not all operators support in-place operations. Memory sharing is a more general strategy.</p>
<p>The following is an example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">a</span> <span class="o">=</span> <span class="n">op1</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">);</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">op2</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">e</span> <span class="o">=</span> <span class="n">op3</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<p>In this case, variable a is no longer used after op2, and op2 does not support in-place operation. After op2 finishes, we can put the memory of variable a into a memory pool. Then, variable e can reuse the memory of variable a from the pool.</p>
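<p>A minimal sketch of this idea (a hypothetical size-keyed pool, not Fluid&#8217;s actual allocator) looks like this:</p>
<div class="highlight-python"><div class="highlight"><pre>from collections import defaultdict

pool = defaultdict(list)  # maps a tensor size to a list of free memory blocks

def release(block, size):
    pool[size].append(block)      # the block of variable a enters the pool after op2

def acquire(size):
    if pool[size]:
        return pool[size].pop()   # variable e reuses the block of variable a
    return bytearray(size)        # otherwise allocate a fresh block
</pre></div>
</div>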
</div>
</div>
<div class="section" id="live-variable-analysis">
<span id="live-variable-analysis"></span><h3>Live Variable Analysis<a class="headerlink" href="#live-variable-analysis" title="永久链接至标题"></a></h3>
<p>It&#8217;s not enough to have only these basic strategies. The prerequisite of memory optimization is knowing whether a variable is still &#8220;live&#8221; after an operation.</p>
<p>In our design, the neural network topology is defined as a program. Luckily, <a class="reference external" href="https://en.wikipedia.org/wiki/Live_variable_analysis">live variable analysis</a> is a classic problem in compilers which can be used in many stages, such as register allocation.</p>
<p>In compilers, the front end of the compilers translates programs into an intermediate language with an unbounded number of temporaries. This program must run on a machine with a bounded number of registers. Two temporaries a and b can fit into the same register, if a and b are never &#8220;in use&#8221; at the same time. Thus, many temporaries can fit in few registers; if they don&#8217;t all fit, the excess temporaries can be kept in memory.</p>
<p>Therefore, the compiler needs to analyze the intermediate-representation program to determine which temporaries are in use at the same time. We say a variable is &#8220;live&#8221; if it holds a value that may be needed in the future, so this analysis is called liveness analysis.</p>
<p>We can learn these techniques from compilers. There are mainly two stages in live variable analysis:</p>
<ul class="simple">
<li>construct a control flow graph</li>
<li>solve the dataflow equations</li>
</ul>
<div class="section" id="control-flow-graph">
<span id="control-flow-graph"></span><h4>Control Flow Graph<a class="headerlink" href="#control-flow-graph" title="永久链接至标题"></a></h4>
<p>To perform analyses on a program, it is often useful to build a control flow graph. A <a class="reference external" href="https://en.wikipedia.org/wiki/Control_flow_graph">control flow graph</a> (CFG) in computer science is a representation, using graph notation, of all paths that might be traversed through a program during its execution. Each statement in the program is a node in the flow graph; if statement x can be followed by statement y, there is an edge from x to y.</p>
<p>The following is the flow graph for a simple loop.</p>
<p><img alt="" src="../_images/control_flow_graph.png" /></p>
</div>
<div class="section" id="dataflow-analysis">
<span id="dataflow-analysis"></span><h4>Dataflow Analysis<a class="headerlink" href="#dataflow-analysis" title="永久链接至标题"></a></h4>
<p>The liveness of a variable &#8220;flows&#8221; along the edges of the control flow graph; determining the live range of each variable is an example of a dataflow problem. <a class="reference external" href="https://en.wikipedia.org/wiki/Data-flow_analysis">Dataflow analysis</a> is a technique for gathering information about the possible set of values calculated at various points in a computer program.</p>
<p>A simple way to perform data-flow analysis of programs is to set up dataflow equations for each node of the control flow graph and solve them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes.</p>
<ul class="simple">
<li>Flow Graph Terminology</li>
</ul>
<p>A flow graph node has out-edges that lead to successor nodes, and in-edges that come from predecessor nodes. The set <em>pred[n]</em> is the set of all predecessors of node n, and <em>succ[n]</em> is the set of all successors.
In the control flow graph above, the out-edges of node 5 are 5 -&gt; 6 and 5 -&gt; 2, so <em>succ[5]</em> = {2, 6}. The in-edges of node 2 are 5 -&gt; 2 and 1 -&gt; 2, so <em>pred[2]</em> = {1, 5}.</p>
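<p>As a small sketch (the full edge list below is assumed from the figure; only the edges named in the text are certain), <em>pred</em> and <em>succ</em> can be built directly from the edge set:</p>
<div class="highlight-python"><div class="highlight"><pre>from collections import defaultdict

# Assumed edge list for the simple loop; only the edges 5 to 2, 5 to 6 and
# 1 to 2 are stated explicitly in the text above.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 2), (5, 6)]

succ = defaultdict(set)
pred = defaultdict(set)
for src, dst in edges:
    succ[src].add(dst)
    pred[dst].add(src)

print(succ[5])  # {2, 6}
print(pred[2])  # {1, 5}
</pre></div>
</div>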
<ul class="simple">
<li>Uses and Defs</li>
</ul>
<p>An assignment to a variable or temporary defines that variable. An occurrence of a variable on the right-hand side of an assignment (or in other expressions) uses the variable. We can speak of the <em>def</em> of a variable as the set of graph nodes that define it, or the <em>def</em> of a graph node as the set of variables that it defines; similarly for the <em>use</em> of a variable or graph node. In the control flow graph above, <em>def(3)</em> = {c} and <em>use(3)</em> = {b, c}.</p>
<ul class="simple">
<li>Liveness</li>
</ul>
<p>A variable is <em>live</em> on an edge if there is a directed path from that edge to a <em>use</em> of the variable that does not go through any <em>def</em>. A variable is <em>live-in</em> at a node if it is live on any of the in-edges of that node; it is <em>live-out</em> at a node if it is live on any of the out-edges of the node.</p>
<p>The calculation of liveness can be solved by iterating until a fixed point is reached. The following are the recursive dataflow equations:</p>
<p><img alt="" src="../_images/dataflow_equations.png" /></p>
</div>
</div>
<div class="section" id="memory-optimization-transpiler">
<span id="memory-optimization-transpiler"></span><h3>Memory optimization transpiler<a class="headerlink" href="#memory-optimization-transpiler" title="永久链接至标题"></a></h3>
<p>Finally, we combine the basic strategies with the liveness analysis techniques learned from compilers to implement our memory optimization transpiler.</p>
<div class="section" id="add-in-place-attribute">
<span id="add-in-place-attribute"></span><h4>add in-place attribute<a class="headerlink" href="#add-in-place-attribute" title="永久链接至标题"></a></h4>
<p>In-place is a built-in attribute of an operator. Since we treat in-place operators differently from other operators, we have to add an in-place attribute to every operator.</p>
</div>
<div class="section" id="contruct-control-flow-graph">
<span id="contruct-control-flow-graph"></span><h4>contruct control flow graph<a class="headerlink" href="#contruct-control-flow-graph" title="永久链接至标题"></a></h4>
<p>The following is the ProgramDesc protobuf of the <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/tests/book/test_machine_translation.py">machine translation</a> example.</p>
<ul class="simple">
<li>Block0:</li>
</ul>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">lookup_table</span>
<span class="n">mul</span>
<span class="o">...</span>
<span class="k">while</span><span class="p">(</span><span class="n">sub</span><span class="o">-</span><span class="n">block</span> <span class="n">idx</span> <span class="mi">1</span><span class="p">)</span>
<span class="o">...</span>
<span class="n">array_to_lod_tensor</span>
<span class="n">cross_entropy</span>
<span class="o">...</span>
<span class="n">while_grad</span><span class="p">(</span><span class="n">sub</span><span class="o">-</span><span class="n">block</span> <span class="n">idx</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">read_from_array</span>
<span class="n">array_to_lod_tensor</span>
<span class="o">...</span>
</pre></div>
</div>
<ul class="simple">
<li>Block1</li>
</ul>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">read_from_array</span>
<span class="n">read_from_array</span>
<span class="o">...</span>
<span class="n">write_to_array</span>
<span class="n">increment</span>
<span class="n">write_to_array</span>
<span class="n">less_than</span>
</pre></div>
</div>
<ul class="simple">
<li>Block2</li>
</ul>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">read_from_array</span>
<span class="n">increment</span>
<span class="o">...</span>
<span class="n">write_to_array</span>
<span class="n">write_to_array</span>
</pre></div>
</div>
<p>We can traverse all the operators and variables in the ProgramDesc to build a control flow graph.</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">ControlFlowGraph</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">Program</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_sucessors</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_presucessors</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_uses</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_defs</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_live_in</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_live_out</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_program</span> <span class="o">=</span> <span class="n">Program</span>
<span class="k">def</span> <span class="nf">build</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">dataflow_analysis</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">memory_optimization</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">get_program</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_program</span>
</pre></div>
</div>
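<p>As a rough sketch of what <code>build()</code> might do (a toy <code>Op</code> record and a hypothetical <code>build_graph</code> helper stand in here for the real ProgramDesc traversal), the use/def sets and sequential edges can be collected in one pass:</p>
<div class="highlight-python"><div class="highlight"><pre>from collections import defaultdict, namedtuple

# Toy operator record; the real transpiler reads inputs/outputs from the ProgramDesc.
Op = namedtuple("Op", ["name", "inputs", "outputs"])

def build_graph(ops):
    """Build succ/pred and use/def sets for a straight-line list of operators."""
    succ, pred = defaultdict(set), defaultdict(set)
    uses, defs = defaultdict(set), defaultdict(set)
    for i, op in enumerate(ops):
        uses[i] = set(op.inputs)
        defs[i] = set(op.outputs)
    for i in range(len(ops) - 1):
        succ[i].add(i + 1)      # sequential edge; while/while_grad sub-blocks
        pred[i + 1].add(i)      # would contribute extra edges here
    return succ, pred, uses, defs

program = [Op("op1", ["b", "c"], ["a"]),
           Op("op2", ["a"], ["d"]),
           Op("op3", ["d", "f"], ["e"])]
succ, pred, uses, defs = build_graph(program)
</pre></div>
</div>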
</div>
<div class="section" id="make-dataflow-analysis">
<span id="make-dataflow-analysis"></span><h4>make dataflow analysis<a class="headerlink" href="#make-dataflow-analysis" title="永久链接至标题"></a></h4>
<p>We follow the approach used in compilers and solve the dataflow equations to get the liveness of every variable. If the live-in of an operator node is different from its live-out, then we can apply memory sharing.</p>
<p>For example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">a</span> <span class="o">=</span> <span class="n">op1</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">);</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">op2</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">e</span> <span class="o">=</span> <span class="n">op3</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<p>The dataflow analysis result is:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">live_in</span><span class="p">(</span><span class="n">op1</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_out</span><span class="p">(</span><span class="n">op1</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">a</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_in</span><span class="p">(</span><span class="n">op2</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">a</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_out</span><span class="p">(</span><span class="n">op2</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">d</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_in</span><span class="p">(</span><span class="n">op3</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span><span class="n">d</span><span class="p">,</span> <span class="n">f</span><span class="p">}</span>
<span class="n">live_out</span><span class="p">(</span><span class="n">op3</span><span class="p">)</span> <span class="o">=</span> <span class="p">{}</span>
</pre></div>
</div>
<p>After op1, the memory of variables b and c can be released into the pool; after op2, the memory of variable a can be released; after op3, the memory of variables d and f can be released.</p>
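<p>The following sketch (toy dictionaries standing in for the control flow graph; not the transpiler&#8217;s actual code) iterates the equations until the fixed point shown above is reached:</p>
<div class="highlight-python"><div class="highlight"><pre>ops = ["op1", "op2", "op3"]
succ = {"op1": ["op2"], "op2": ["op3"], "op3": []}
use = {"op1": {"b", "c"}, "op2": {"a"}, "op3": {"d", "f"}}
defs = {"op1": {"a"}, "op2": {"d"}, "op3": {"e"}}

live_in = {n: set() for n in ops}
live_out = {n: set() for n in ops}

changed = True
while changed:                   # iterate until a fixed point is reached
    changed = False
    for n in reversed(ops):      # backward order converges quickly for liveness
        out = set()
        for s in succ[n]:
            out |= live_in[s]
        new_in = use[n] | (out - defs[n])
        if out != live_out[n] or new_in != live_in[n]:
            live_out[n], live_in[n] = out, new_in
            changed = True

print(live_in["op1"], live_out["op1"])  # live_in(op1) = {b, c, f}, live_out(op1) = {a, f}
print(live_in["op3"], live_out["op3"])  # live_in(op3) = {d, f},    live_out(op3) = {}
</pre></div>
</div>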
</div>
<div class="section" id="memory-sharing-policy">
<span id="memory-sharing-policy"></span><h4>memory sharing policy<a class="headerlink" href="#memory-sharing-policy" title="永久链接至标题"></a></h4>
<p>A memory pool is maintained during the memory optimization stage. Each operator node is scanned to determine whether memory optimization can be applied. If an operator satisfies the requirement, the following policy is applied to its input (i) and output (o) variables.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">support_inplace</span><span class="p">():</span>
<span class="n">i</span> <span class="o">--&gt;</span> <span class="n">pool</span>
<span class="n">pool</span> <span class="o">--&gt;</span> <span class="n">o</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">pool</span> <span class="o">--&gt;</span> <span class="n">o</span>
<span class="n">i</span> <span class="o">--&gt;</span> <span class="n">pool</span>
</pre></div>
</div>
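<p>A self-contained sketch of this policy (a hypothetical <code>apply_policy</code> helper, not the transpiler&#8217;s actual code) makes the ordering explicit:</p>
<div class="highlight-python"><div class="highlight"><pre>def apply_policy(supports_inplace, dying_inputs, outputs, pool):
    """Map output variables to reused blocks, following the policy above."""
    reuse = {}
    if supports_inplace:
        pool.extend(dying_inputs)     # i enters the pool first, so o may alias it
        for o in outputs:             # pool hands a block to o
            if pool:
                reuse[o] = pool.pop()
    else:
        for o in outputs:             # pool hands a block to o first, so o can
            if pool:                  # never alias this operator's own input
                reuse[o] = pool.pop()
        pool.extend(dying_inputs)     # i enters the pool afterwards
    return reuse

# Example from the earlier snippet: after op2, variable a is no longer live,
# so op3's output e picks up a's block from the pool.
pool = []
apply_policy(False, ["a"], ["d"], pool)              # op2: a enters the pool
print(apply_policy(False, ["d", "f"], ["e"], pool))  # {'e': 'a'}
</pre></div>
</div>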
</div>
</div>
</div>
<div class="section" id="reference">
<span id="reference"></span><h2>Reference<a class="headerlink" href="#reference" title="永久链接至标题"></a></h2>
<ul class="simple">
<li><a class="reference external" href="https://manavsehgal.com/lecture-notes-from-artificial-intelligence-is-the-new-electricity-by-andrew-ng-4712dcbf26e5">Lecture Notes From Artificial Intelligence Is The New Electricity By Andrew Ng</a></li>
<li>Modern compiler implementation in ML, by Andrew W. Appel</li>
<li><a class="reference external" href="https://mxnet.incubator.apache.org/architecture/note_memory.html">Optimizing Memory Consumption in Deep learning</a></li>
</ul>
</div>
</div>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2016, PaddlePaddle developers.
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT:'../',
VERSION:'',
COLLAPSE_INDEX:false,
FILE_SUFFIX:'.html',
HAS_SOURCE: true,
SOURCELINK_SUFFIX: ".txt",
};
</script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/translations.js"></script>
<script type="text/javascript" src="https://cdn.bootcss.com/mathjax/2.7.0/MathJax.js"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/js/perfect-scrollbar.jquery.min.js"></script>
<script src="../_static/js/paddle_doc_init.js"></script>
</body>
</html>