Commit dac32632 authored by Travis CI

Deploy to GitHub Pages: 605b3e44

Parent 7045fad1
This tutorial introduces techniques we used to profile and tune the
CPU performance of PaddlePaddle. We will use the Python packages
`cProfile` and `yep`, and Google `perftools`.

Profiling is the process that reveals the performance bottlenecks,
which can be very different from what the developers had in mind.
Performance tuning fixes those bottlenecks. Performance optimization
repeats the steps of profiling and tuning alternately.

PaddlePaddle users program AI applications by calling the Python API,
which in turn calls into `libpaddle.so`, written in C++. In this
tutorial, we focus on the profiling and tuning of

1. the Python code and
1. the mixture of Python and C++ code.
## Profiling the Python Code

### Generate the Performance Profiling File

We can use the Python standard library
package [`cProfile`](https://docs.python.org/2/library/profile.html)
to generate a Python profiling file. For example:
```bash
python -m cProfile -o profile.out main.py
```
where `main.py` is the program we are going to profile and `-o` specifies
the output file. Without `-o`, `cProfile` prints the statistics to standard
output, which is inconvenient for later processing.
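If we only want to profile part of a program, `cProfile` can also be driven
from inside the script. The sketch below is a minimal, generic example and
not PaddlePaddle-specific; `train()` is a hypothetical stand-in for the code
we care about:

```python
import cProfile


def train():
    # hypothetical stand-in for the code we want to profile
    total = 0
    for i in range(1000000):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
train()
profiler.disable()

# Writes the same binary format as `python -m cProfile -o profile.out`,
# so the file can be inspected with cprofilev or pstats as shown below.
profiler.dump_stats("profile.out")
```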
### Look into the Profiling File

`cProfile` generates `profile.out` after `main.py` completes. We can
use [`cprofilev`](https://github.com/ymichael/cprofilev), a third-party
Python package installable with `pip install cprofilev`, to look into
the details. It starts an HTTP service that renders the result as a Web page:
```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```
where `-a` specifies the IP address the HTTP service binds to (`0.0.0.0`
makes it reachable from other machines), `-p` specifies the port, `-f`
specifies the profiling file, and `main.py` is the profiled source file.

Point a Web browser at that IP and port, and we will see output like
the following:
```text
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/executor.py:20(run)
     4696   12.040    0.003   12.040    0.003 {built-in method run}
        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```
where each line corresponds to a Python function, and the meaning of
each column is as follows:

| column | meaning |
| --- | --- |
| ncalls | the number of calls into a function |
| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
| percall | tottime divided by ncalls |
| cumtime | the total execution time of the function, including the execution time of other functions being called |
| percall | cumtime divided by ncalls |
| filename:lineno(function) | where the function is defined |
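If starting an HTTP service is inconvenient, the standard library module
`pstats` can print the same columns directly from `profile.out`; a minimal
sketch:

```python
import pstats

stats = pstats.Stats("profile.out")
stats.strip_dirs()              # drop the long directory prefixes shown above
stats.sort_stats("cumulative")  # sort by cumtime; use "time" for tottime
stats.print_stats(10)           # print the 10 most expensive entries
```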
### Identify Performance Bottlenecks
Usually, `tottime` and the related `percall` time are what we want to
focus on; they reflect the time a function itself actually spends running.
We can sort the profiling result above by `tottime`:
```text
     4696   12.040    0.003   12.040    0.003 {built-in method run}
   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:219(__init__)
     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)
        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/__init__.py:1(<module>)
```
We can see that the most time-consuming function is the `built-in
method run`, which is a C++ function in `libpaddle.so`. We will
explain how to profile C++ code in the next section. For the moment,
let's look into `sync_with_cpp`, a pure Python function whose total
and per-call times are both large. We can click it to understand
more about it:
```text
Called By:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>

Function                                                                                                  was called by...
                                                                                                              ncalls  tottime  cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)  <-    4697    0.626    2.291  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:562(sync_with_cpp)  <-    4696    0.019    2.316  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:487(clone)
                                                                                                                   1    0.000    0.001  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:534(append_backward)


Called:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```
The list of callers and callees of `sync_with_cpp`, together with the
corresponding source lines, helps us locate the problematic code. After
making a fix, we can profile again to check whether it actually improves
the performance.
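The same caller/callee breakdown can be produced without the Web UI by
`pstats`, using the function name as a restriction pattern; a minimal
sketch:

```python
import pstats

stats = pstats.Stats("profile.out")
stats.sort_stats("time")              # "time" is the tottime column
stats.print_callers("sync_with_cpp")  # who calls sync_with_cpp, and how often
stats.print_callees("sync_with_cpp")  # what sync_with_cpp itself calls
```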
## Profiling Python and C++ Code
### Generate the Profiling File
To profile a mixture of Python and C++ code, we can use the Python
package `yep`, which works with Google's `perftools`, a commonly used
profiler for C/C++ code.

On Ubuntu systems, we can install `yep` and `perftools` by running the
following commands:
```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```
Then we can run the following command
```bash
python -m yep -v main.py
```
to generate the profiling file. The default filename is
`main.py.prof`.
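Besides the `python -m yep` command line, `yep` can also be used inside the
program to profile only a selected region. The sketch below assumes `yep`
exposes `start()`/`stop()` wrappers around perftools' `ProfilerStart`/
`ProfilerStop` (please check the package documentation for the exact names);
`train_loop()` is a hypothetical placeholder:

```python
import yep


def train_loop():
    # hypothetical placeholder for the region we want to profile
    pass


# Assumption: yep.start(filename) / yep.stop() wrap perftools'
# ProfilerStart / ProfilerStop; the output can then be fed to pprof.
yep.start("train_loop.prof")
train_loop()
yep.stop()
```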
Please be aware of the `-v` command line option, which prints the
analysis results after generating the profiling file. A quick glance
at the printed result tells us whether the debug information was
stripped from `libpaddle.so` at build time. The following hints
help make sure that the analysis results are readable:
1. Use the GCC command line option `-g` when building `libpaddle.so` so as
   to include the debug information. The standard build system of
   PaddlePaddle is CMake, so you might want to set
   `CMAKE_BUILD_TYPE=RelWithDebInfo`.
1. Use the GCC command line option `-O2` or `-O3` to generate optimized
   binary code. It doesn't make sense to profile `libpaddle.so` without
   optimization, because it would run slowly anyway.
1. Profile the single-threaded binary before the multi-threaded version,
   because the latter often produces a tangled profiling result. You
   might want to set the environment variable `OMP_NUM_THREADS=1` to
   prevent OpenMP from automatically starting multiple threads, as shown
   in the sketch after this list.
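For the third hint, exporting `OMP_NUM_THREADS=1` in the shell before
launching Python is the safest option. A sketch of setting it at the very
top of the script instead; the `paddle.v2` import only mirrors the paths
shown in the outputs above and is illustrative:

```python
import os

# Must be set before any library that uses OpenMP is imported; otherwise
# the OpenMP runtime may already have decided how many threads to start.
os.environ["OMP_NUM_THREADS"] = "1"

import paddle.v2 as paddle  # noqa: E402  -- intentionally after the env var
```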
### Look into the Profiling File

The tool we use to look into the profiling file generated by
`perftools` is [`pprof`](https://github.com/google/pprof), which
provides a Web-based GUI like `cprofilev`.

We can rely on the standard Go toolchain to retrieve the source code
of `pprof` and build it:
```bash
go get github.com/google/pprof
```
Then we can use it to serve the `main.py.prof` file generated in the
previous section over HTTP:
```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```
where `-http` specifies the IP and port of the HTTP service,
`` `which python` `` expands to the full path of the current Python
interpreter, and `./main.py.prof` is the profiling file. Directing our
Web browser to the service, we would see something like the following:
![result](./pprof_1.png)
### Identifying the Performance Bottlenecks
Similar to how we work with `cprofilev`, we'd focus on `tottime` and
`cumtime`. In addition, the call graph rendered by `pprof` helps us spot
problems. For example, in the following figure,
![kernel_perf](./pprof_2.png)
we can see that in one training run the multiplication and the computation
of its gradient take 2% to 4% of the total running time, while
`MomentumOp` takes about 17%. Obviously, we'd want to optimize
`MomentumOp`.

`pprof` marks the performance-critical paths of the program in red. It's
a good idea to examine those paths first and the rest afterwards.
此教程会介绍如何使用Python的cProfile包、Python库yep、Google perftools来进行性能分析 (profiling) 与调优(performance tuning)。
Profling 指发现性能瓶颈。系统中的瓶颈可能和程序员开发过程中想象的瓶颈相去甚远。Tuning 指消除瓶颈。性能优化的过程通常是不断重复地 profiling 和 tuning。
PaddlePaddle 用户一般通过调用 Python API 编写深度学习程序。大部分 Python API 调用用 C++ 写的 libpaddle.so。所以 PaddlePaddle 的性能分析与调优分为两个部分:
* Python 代码的性能分析
* Python 与 C++ 混合代码的性能分析
## Python代码的性能分析
### 生成性能分析文件
Python标准库中提供了性能分析的工具包,[cProfile](https://docs.python.org/2/library/profile.html)。生成Python性能分析的命令如下:
```bash
python -m cProfile -o profile.out main.py
```
其中 `main.py` 是我们要分析的程序,`-o`标识了一个输出的文件名,用来存储本次性能分析的结果。如果不指定这个文件,`cProfile`会打印到标准输出。
### 查看性能分析文件
`cProfile` 在main.py 运行完毕后输出`profile.out`。我们可以使用[`cprofilev`](https://github.com/ymichael/cprofilev)来查看性能分析结果。`cprofilev`是一个Python的第三方库。使用它会开启一个HTTP服务,将性能分析结果以网页的形式展示出来:
```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```
其中`-a`标识HTTP服务绑定的IP。使用`0.0.0.0`允许外网访问这个HTTP服务。`-p`标识HTTP服务的端口。`-f`标识性能分析的结果文件。`main.py`标识被性能分析的源文件。
用Web浏览器访问对应网址,即可显示性能分析的结果:
```
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.284 0.284 29.514 29.514 main.py:1(<module>)
4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/executor.py:20(run)
4696 12.040 0.003 12.040 0.003 {built-in method run}
1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```
每一列的含义是:
| 列名 | 含义 |
| --- | --- |
| ncalls | 函数的调用次数 |
| tottime | 函数实际使用的总时间。该时间去除掉本函数调用其他函数的时间 |
| percall | tottime的每次调用平均时间 |
| cumtime | 函数总时间。包含这个函数调用其他函数的时间 |
| percall | cumtime的每次调用平均时间 |
| filename:lineno(function) | 文件名, 行号,函数名 |
### Identify the Performance Bottlenecks

`tottime` and `cumtime` are usually the key indicators for locating bottlenecks; they reflect how much time a function really consumes.

Sorting the profiling result by `tottime` gives:
```text
4696 12.040 0.003 12.040 0.003 {built-in method run}
300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:219(__init__)
4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)
1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/__init__.py:1(<module>)
```
The most time-consuming function is the C++ `run` method; tuning it requires the mixed Python and C++ profiling described in the next section. `sync_with_cpp` also has a long total time and a long per-call time, so we can click its entry to inspect its call relationships.
```text
Called By:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
Function was called by...
ncalls tottime cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp) <- 4697 0.626 2.291 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:562(sync_with_cpp) <- 4696 0.019 2.316 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:487(clone)
1 0.000 0.001 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:534(append_backward)
Called:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```
Looking at how the hot functions call each other, together with the corresponding lines of code, usually tells us where the problematic code is. After applying a fix, profiling again shows whether the change actually improved performance.
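The caller view shown above can also be reproduced from the saved profile with `pstats`, which is handy when no HTTP service is running. A minimal sketch, using the `sync_with_cpp` name from the listing above:

```python
import pstats

stats = pstats.Stats("profile.out")

# Which functions call sync_with_cpp, and how much time do those calls cost?
stats.print_callers("sync_with_cpp")

# Which functions does sync_with_cpp itself call?
stats.print_callees("sync_with_cpp")
```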
## Profiling Python and C++ Code

### Generate the Profiling File

There are many profilers for C++, such as `gprof`, `valgrind`, and `google-perftools`. However, profiling a dynamic library loaded into Python is considerably more complicated than profiling a stand-alone binary. Fortunately, the third-party Python package `yep` offers a convenient way to drive `google-perftools`, so we use `yep` here to profile the mixed Python and C++ code.

Before using `yep`, install `google-perftools` and the `yep` package. On Ubuntu:
```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```
Once both packages are installed, we can run
```bash
python -m yep -v main.py
```
to generate the profiling file. The output file is `main.py.prof`.
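If only a region of the program should be profiled, `yep` also exposes a start/stop API. The sketch below is an assumption-laden illustration: `train()` is a hypothetical function, and the `yep.start`/`yep.stop` helpers are the ones described in the `yep` documentation.

```python
import yep


def train():
    # Hypothetical placeholder for the code region to profile.
    pass


# Record a perftools profile only around the interesting region.
yep.start("train.prof")
train()
yep.stop()

# train.prof can then be opened with pprof, just like main.py.prof.
```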
The `-v` flag asks `yep` to print the analysis result on the command line after the profiling file has been written, which gives a quick first look. Unlike Python, the C++ code may have been compiled without debug information, and multi-threaded execution can produce a confusing, hard-to-read profile. The following measures help generate a readable result:

1. Compile with `-g` to keep debug information. With CMake, set `CMAKE_BUILD_TYPE` to `RelWithDebInfo`.
2. Always compile with optimization enabled. A plain `Debug` build performs very differently from `-O2` or `-O3`, so profiling a `Debug` build is meaningless.
3. Start profiling with a single thread, then move to multiple threads and finally to multiple machines; single-threaded runs are much easier to reason about. Setting the environment variable `OMP_NUM_THREADS=1` turns off OpenMP parallelism (see the sketch after this list).
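One way to enforce single-threaded runs is to export `OMP_NUM_THREADS=1` in the shell before launching `python -m yep`. Setting it from inside the script also works, provided it happens before the C++ library creates its OpenMP thread pool; the snippet below is a sketch under that assumption.

```python
import os

# OMP_NUM_THREADS must be set before the native library spins up its
# OpenMP thread pool, so set it before importing the framework.
os.environ["OMP_NUM_THREADS"] = "1"

# ... import paddle and build/run the program after this point ...
```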
### Look into the Profiling File

The profiling run produces a result file that can be visualized with [`pprof`](https://github.com/google/pprof). Note that we use the `pprof` rewritten in Go, because it provides a web UI and much better rendering.

`pprof` is installed like any other Go program:
```bash
go get github.com/google/pprof
```
Then we can start an HTTP service that serves the result:
```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```
Here `-http` starts the HTTP service on the given IP and port, `which python` expands to the full path of the current Python interpreter (the executable the profile belongs to), and `./main.py.prof` is the profiling result to load.

Opening the address in a web browser shows the result, as in the figure below:
![result](./pprof_1.png)
### Identifying the Performance Bottlenecks

As with pure Python code, `tottime` and `cumtime` are the key indicators. The call graph rendered by `pprof` also helps to spot problems.

For example, consider the figure below:
![kernel_perf](./pprof_2.png)
In one training run, the computation of the multiplication and of its gradient takes about 2% to 4% of the total time, while `MomentumOp` takes about 17%. Obviously, `MomentumOp` has a performance problem.

`pprof` marks the performance-critical paths in red. Checking the critical path first and the rest afterwards keeps the optimization work well ordered.