Commit dac32632 authored by Travis CI

Deploy to GitHub Pages: 605b3e44

Parent 7045fad1
This tutorial introduces techniques we used to profile and tune the
CPU performance of PaddlePaddle. We will use the Python packages
`cProfile` and `yep`, and Google `perftools`.

Profiling is the process that reveals the performance bottlenecks,
which can be very different from what the developers had in mind.
Performance tuning fixes those bottlenecks. Performance optimization
repeats the steps of profiling and tuning alternately.

PaddlePaddle users program AI applications by calling the Python API,
which in turn calls into `libpaddle.so`, written in C++. In this
tutorial, we focus on the profiling and tuning of

1. the Python code and
1. the mixture of Python and C++ code.
## Profiling the Python Code

### Generate the Performance Profiling File

We can use the Python standard library
package [`cProfile`](https://docs.python.org/2/library/profile.html)
to generate a Python profiling file. For example:
```bash
python -m cProfile -o profile.out main.py
```
where `main.py` is the program we are going to profile and `-o` specifies
the output file. Without `-o`, `cProfile` prints the statistics to standard
output, which is inconvenient for later processing.
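If we only want to profile part of a program, `cProfile` can also be driven
from inside the script. The sketch below is a minimal, generic example and
not PaddlePaddle-specific; `train()` is a hypothetical stand-in for the code
we care about:

```python
import cProfile


def train():
    # hypothetical stand-in for the code we want to profile
    total = 0
    for i in range(1000000):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
train()
profiler.disable()

# Writes the same binary format as `python -m cProfile -o profile.out`,
# so the file can be inspected with cprofilev or pstats as shown below.
profiler.dump_stats("profile.out")
```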
### Look into the Profiling File

`cProfile` generates `profile.out` after `main.py` completes. We can
use [`cprofilev`](https://github.com/ymichael/cprofilev), a third-party
Python package installable with `pip install cprofilev`, to look into
the details. It starts an HTTP service that renders the result as a Web page:
```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```
where `-a` specifies the IP address the HTTP service binds to (`0.0.0.0`
makes it reachable from other machines), `-p` specifies the port, `-f`
specifies the profiling file, and `main.py` is the profiled source file.

Point a Web browser at that IP and port, and we will see output like
the following:
```text
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/executor.py:20(run)
     4696   12.040    0.003   12.040    0.003 {built-in method run}
        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```
where each line corresponds to a Python function, and the meaning of
each column is as follows:

| column | meaning |
| --- | --- |
| ncalls | the number of calls into a function |
| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
| percall | tottime divided by ncalls |
| cumtime | the total execution time of the function, including the execution time of other functions being called |
| percall | cumtime divided by ncalls |
| filename:lineno(function) | where the function is defined |
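If starting an HTTP service is inconvenient, the standard library module
`pstats` can print the same columns directly from `profile.out`; a minimal
sketch:

```python
import pstats

stats = pstats.Stats("profile.out")
stats.strip_dirs()              # drop the long directory prefixes shown above
stats.sort_stats("cumulative")  # sort by cumtime; use "time" for tottime
stats.print_stats(10)           # print the 10 most expensive entries
```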
### Identify Performance Bottlenecks
Usually, `tottime` and the related `percall` time are what we want to
focus on; they reflect the time a function itself actually spends running.
We can sort the profiling result above by `tottime`:
```text
     4696   12.040    0.003   12.040    0.003 {built-in method run}
   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:219(__init__)
     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)
        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/__init__.py:1(<module>)
```
We can see that the most time-consuming function is the `built-in
method run`, which is a C++ function in `libpaddle.so`. We will
explain how to profile C++ code in the next section. For the moment,
let's look into `sync_with_cpp`, a pure Python function whose total
and per-call times are both large. We can click it to understand
more about it:
```text
Called By:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>

Function                                                                                                  was called by...
                                                                                                              ncalls  tottime  cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)  <-    4697    0.626    2.291  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:562(sync_with_cpp)  <-    4696    0.019    2.316  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:487(clone)
                                                                                                                   1    0.000    0.001  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:534(append_backward)


Called:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```
The list of callers and callees of `sync_with_cpp`, together with the
corresponding source lines, helps us locate the problematic code. After
making a fix, we can profile again to check whether it actually improves
the performance.
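The same caller/callee breakdown can be produced without the Web UI by
`pstats`, using the function name as a restriction pattern; a minimal
sketch:

```python
import pstats

stats = pstats.Stats("profile.out")
stats.sort_stats("time")              # "time" is the tottime column
stats.print_callers("sync_with_cpp")  # who calls sync_with_cpp, and how often
stats.print_callees("sync_with_cpp")  # what sync_with_cpp itself calls
```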
## Profiling Python and C++ Code
### Generate the Profiling File
To profile a mixture of Python and C++ code, we can use the Python
package `yep`, which works with Google's `perftools`, a commonly used
profiler for C/C++ code.

On Ubuntu systems, we can install `yep` and `perftools` by running the
following commands:
```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```
Then we can run the following command
```bash
python -m yep -v main.py
```
to generate the profiling file. The default filename is
`main.py.prof`.
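Besides the `python -m yep` command line, `yep` can also be used inside the
program to profile only a selected region. The sketch below assumes `yep`
exposes `start()`/`stop()` wrappers around perftools' `ProfilerStart`/
`ProfilerStop` (please check the package documentation for the exact names);
`train_loop()` is a hypothetical placeholder:

```python
import yep


def train_loop():
    # hypothetical placeholder for the region we want to profile
    pass


# Assumption: yep.start(filename) / yep.stop() wrap perftools'
# ProfilerStart / ProfilerStop; the output can then be fed to pprof.
yep.start("train_loop.prof")
train_loop()
yep.stop()
```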
Please be aware of the `-v` command line option, which prints the
analysis results after generating the profiling file. A quick glance
at the printed result tells us whether the debug information was
stripped from `libpaddle.so` at build time. The following hints
help make sure that the analysis results are readable:
1. Use the GCC command line option `-g` when building `libpaddle.so` so as
   to include the debug information. The standard build system of
   PaddlePaddle is CMake, so you might want to set
   `CMAKE_BUILD_TYPE=RelWithDebInfo`.
1. Use the GCC command line option `-O2` or `-O3` to generate optimized
   binary code. It doesn't make sense to profile `libpaddle.so` without
   optimization, because it would run slowly anyway.
1. Profile the single-threaded binary before the multi-threaded version,
   because the latter often produces a tangled profiling result. You
   might want to set the environment variable `OMP_NUM_THREADS=1` to
   prevent OpenMP from automatically starting multiple threads, as shown
   in the sketch after this list.
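For the third hint, exporting `OMP_NUM_THREADS=1` in the shell before
launching Python is the safest option. A sketch of setting it at the very
top of the script instead; the `paddle.v2` import only mirrors the paths
shown in the outputs above and is illustrative:

```python
import os

# Must be set before any library that uses OpenMP is imported; otherwise
# the OpenMP runtime may already have decided how many threads to start.
os.environ["OMP_NUM_THREADS"] = "1"

import paddle.v2 as paddle  # noqa: E402  -- intentionally after the env var
```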
### Look into the Profiling File

The tool we use to look into the profiling file generated by
`perftools` is [`pprof`](https://github.com/google/pprof), which
provides a Web-based GUI like `cprofilev`.

We can rely on the standard Go toolchain to retrieve the source code
of `pprof` and build it:
```bash
go get github.com/google/pprof
```
Then we can use it to serve the `main.py.prof` file generated in the
previous section over HTTP:
```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```
where `-http` specifies the IP and port of the HTTP service,
`` `which python` `` expands to the full path of the current Python
interpreter, and `./main.py.prof` is the profiling file. Directing our
Web browser to the service, we would see something like the following:
![result](./pprof_1.png)
### Identifying the Performance Bottlenecks
Similar to how we work with `cprofilev`, we'd focus on `tottime` and
`cumtime`. In addition, the call graph rendered by `pprof` helps us spot
problems. For example, in the following figure,
![kernel_perf](./pprof_2.png)
we can see that in one training run the multiplication and the computation
of its gradient take 2% to 4% of the total running time, while
`MomentumOp` takes about 17%. Obviously, we'd want to optimize
`MomentumOp`.

`pprof` marks the performance-critical paths of the program in red. It's
a good idea to examine those paths first and the rest afterwards.
此教程会介绍如何使用Python的cProfile包、Python库yep、Google perftools来进行性能分析 (profiling) 与调优(performance tuning)。
Profling 指发现性能瓶颈。系统中的瓶颈可能和程序员开发过程中想象的瓶颈相去甚远。Tuning 指消除瓶颈。性能优化的过程通常是不断重复地 profiling 和 tuning。
PaddlePaddle 用户一般通过调用 Python API 编写深度学习程序。大部分 Python API 调用用 C++ 写的 libpaddle.so。所以 PaddlePaddle 的性能分析与调优分为两个部分:
* Python 代码的性能分析
* Python 与 C++ 混合代码的性能分析
## Python代码的性能分析
### 生成性能分析文件
Python标准库中提供了性能分析的工具包,[cProfile](https://docs.python.org/2/library/profile.html)。生成Python性能分析的命令如下:
```bash
python -m cProfile -o profile.out main.py
```
其中 `main.py` 是我们要分析的程序,`-o`标识了一个输出的文件名,用来存储本次性能分析的结果。如果不指定这个文件,`cProfile`会打印到标准输出。
### 查看性能分析文件
`cProfile` 在main.py 运行完毕后输出`profile.out`。我们可以使用[`cprofilev`](https://github.com/ymichael/cprofilev)来查看性能分析结果。`cprofilev`是一个Python的第三方库。使用它会开启一个HTTP服务,将性能分析结果以网页的形式展示出来:
```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```
其中`-a`标识HTTP服务绑定的IP。使用`0.0.0.0`允许外网访问这个HTTP服务。`-p`标识HTTP服务的端口。`-f`标识性能分析的结果文件。`main.py`标识被性能分析的源文件。
用Web浏览器访问对应网址,即可显示性能分析的结果:
```
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.284 0.284 29.514 29.514 main.py:1(<module>)
4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/executor.py:20(run)
4696 12.040 0.003 12.040 0.003 {built-in method run}
1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```
每一列的含义是:
| 列名 | 含义 |
| --- | --- |
| ncalls | 函数的调用次数 |
| tottime | 函数实际使用的总时间。该时间去除掉本函数调用其他函数的时间 |
| percall | tottime的每次调用平均时间 |
| cumtime | 函数总时间。包含这个函数调用其他函数的时间 |
| percall | cumtime的每次调用平均时间 |
| filename:lineno(function) | 文件名, 行号,函数名 |
### Identify the Performance Bottlenecks

`tottime` and `cumtime` are usually the key indicators for locating bottlenecks; they reflect how much time a function really consumes.

Sorting the profiling result by `tottime` gives:
```text
4696 12.040 0.003 12.040 0.003 {built-in method run}
300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:219(__init__)
4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)
1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/__init__.py:1(<module>)
```
The most time-consuming function is the C++ `run` method; tuning it requires the mixed Python and C++ profiling described in the next section. `sync_with_cpp` also has a long total time and a long per-call time, so we can click its entry to inspect its call relationships.
```text
Called By:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
Function was called by...
ncalls tottime cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp) <- 4697 0.626 2.291 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:562(sync_with_cpp) <- 4696 0.019 2.316 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:487(clone)
1 0.000 0.001 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:534(append_backward)
Called:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```
Looking at how the hot functions call each other, together with the corresponding lines of code, usually tells us where the problematic code is. After applying a fix, profiling again shows whether the change actually improved performance.
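The caller view shown above can also be reproduced from the saved profile with `pstats`, which is handy when no HTTP service is running. A minimal sketch, using the `sync_with_cpp` name from the listing above:

```python
import pstats

stats = pstats.Stats("profile.out")

# Which functions call sync_with_cpp, and how much time do those calls cost?
stats.print_callers("sync_with_cpp")

# Which functions does sync_with_cpp itself call?
stats.print_callees("sync_with_cpp")
```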
## Profiling Python and C++ Code

### Generate the Profiling File

There are many profilers for C++, such as `gprof`, `valgrind`, and `google-perftools`. However, profiling a dynamic library loaded into Python is considerably more complicated than profiling a stand-alone binary. Fortunately, the third-party Python package `yep` offers a convenient way to drive `google-perftools`, so we use `yep` here to profile the mixed Python and C++ code.

Before using `yep`, install `google-perftools` and the `yep` package. On Ubuntu:
```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```
Once both packages are installed, we can run
```bash
python -m yep -v main.py
```
to generate the profiling file. The output file is `main.py.prof`.
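If only a region of the program should be profiled, `yep` also exposes a start/stop API. The sketch below is an assumption-laden illustration: `train()` is a hypothetical function, and the `yep.start`/`yep.stop` helpers are the ones described in the `yep` documentation.

```python
import yep


def train():
    # Hypothetical placeholder for the code region to profile.
    pass


# Record a perftools profile only around the interesting region.
yep.start("train.prof")
train()
yep.stop()

# train.prof can then be opened with pprof, just like main.py.prof.
```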
The `-v` flag asks `yep` to print the analysis result on the command line after the profiling file has been written, which gives a quick first look. Unlike Python, the C++ code may have been compiled without debug information, and multi-threaded execution can produce a confusing, hard-to-read profile. The following measures help generate a readable result:

1. Compile with `-g` to keep debug information. With CMake, set `CMAKE_BUILD_TYPE` to `RelWithDebInfo`.
2. Always compile with optimization enabled. A plain `Debug` build performs very differently from `-O2` or `-O3`, so profiling a `Debug` build is meaningless.
3. Start profiling with a single thread, then move to multiple threads and finally to multiple machines; single-threaded runs are much easier to reason about. Setting the environment variable `OMP_NUM_THREADS=1` turns off OpenMP parallelism (see the sketch after this list).
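One way to enforce single-threaded runs is to export `OMP_NUM_THREADS=1` in the shell before launching `python -m yep`. Setting it from inside the script also works, provided it happens before the C++ library creates its OpenMP thread pool; the snippet below is a sketch under that assumption.

```python
import os

# OMP_NUM_THREADS must be set before the native library spins up its
# OpenMP thread pool, so set it before importing the framework.
os.environ["OMP_NUM_THREADS"] = "1"

# ... import paddle and build/run the program after this point ...
```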
### Look into the Profiling File

The profiling run produces a result file that can be visualized with [`pprof`](https://github.com/google/pprof). Note that we use the `pprof` rewritten in Go, because it provides a web UI and much better rendering.

`pprof` is installed like any other Go program:
```bash
go get github.com/google/pprof
```
Then we can start an HTTP service that serves the result:
```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```
Here `-http` starts the HTTP service on the given IP and port, `which python` expands to the full path of the current Python interpreter (the executable the profile belongs to), and `./main.py.prof` is the profiling result to load.

Opening the address in a web browser shows the result, as in the figure below:
![result](./pprof_1.png)
### Identifying the Performance Bottlenecks

As with pure Python code, `tottime` and `cumtime` are the key indicators. The call graph rendered by `pprof` also helps to spot problems.

For example, consider the figure below:
![kernel_perf](./pprof_2.png)
In one training run, the computation of the multiplication and of its gradient takes about 2% to 4% of the total time, while `MomentumOp` takes about 17%. Obviously, `MomentumOp` has a performance problem.

`pprof` marks the performance-critical paths in red. Checking the critical path first and the rest afterwards keeps the optimization work well ordered.