<p>This tutorial introduces techniques we used to profile and tune the
CPU performance of PaddlePaddle. We will use Python packages
<code class="docutils literal"><span class="pre">cProfile</span></code> and <code class="docutils literal"><span class="pre">yep</span></code>, and Google <code class="docutils literal"><span class="pre">perftools</span></code>.</p>
<p>Profiling is the process that reveals performance bottlenecks,
which can be very different from where developers expect them to be.
Performance tuning fixes those bottlenecks. Performance optimization
alternates between profiling and tuning.</p>
<p>PaddlePaddle users program AI by calling the Python API, which calls
into <code class="docutils literal"><span class="pre">libpaddle.so</span></code> written in C++. In this tutorial, we focus on
profiling both the Python code and the mixed Python/C++ code.</p>
<span id="profiling-the-python-code"></span><h1>Profiling the Python Code<a class="headerlink" href="#profiling-the-python-code" title="Permalink to this headline">¶</a></h1>
<span id="generate-the-performance-profiling-file"></span><h2>Generate the Performance Profiling File<a class="headerlink" href="#generate-the-performance-profiling-file" title="Permalink to this headline">¶</a></h2>
<p>where <code class="docutils literal"><span class="pre">main.py</span></code> is the program we are going to profile, and <code class="docutils literal"><span class="pre">-o</span></code> specifies
the output file. Without <code class="docutils literal"><span class="pre">-o</span></code>, <code class="docutils literal"><span class="pre">cProfile</span></code> would write to standard
output.</p>
</div>
<div class="section" id="">
<span id="look-into-the-profiling-file"></span><h2>Look into the Profiling File<a class="headerlink" href="#look-into-the-profiling-file" title="Permalink to this headline">¶</a></h2>
<p><code class="docutils literal"><span class="pre">cProfile</span></code> generates <code class="docutils literal"><span class="pre">profile.out</span></code> after <code class="docutils literal"><span class="pre">main.py</span></code> completes. We can
use <a class="reference external" href="https://github.com/ymichael/cprofilev"><code class="docutils literal"><span class="pre">cprofilev</span></code></a> to look into
the details:</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>cprofilev -a <span class="m">0</span>.0.0.0 -p <span class="m">3214</span> -f profile.out main.py
</pre></div>
</div>
<span id="identify-performance-bottlenecks"></span><h2>Identify Performance Bottlenecks<a class="headerlink" href="#identify-performance-bottlenecks" title="Permalink to this headline">¶</a></h2>
<p>Usually, <code class="docutils literal"><span class="pre">tottime</span></code> and the related <code class="docutils literal"><span class="pre">percall</span></code> time are what we want to
focus on. We can sort the above profiling file by <code class="docutils literal"><span class="pre">tottime</span></code>:</p>
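<p>The sorting step above can be sketched with the standard <code class="docutils literal"><span class="pre">pstats</span></code> module (a small profile is regenerated here so the snippet is self-contained):</p>

```python
import cProfile
import io
import pstats

# Generate a small profile so the example is self-contained.
cProfile.run("sorted(range(200000), key=lambda i: -i)", "demo.prof")

# Sort by tottime, as in cprofilev's view, and show the top entries.
buf = io.StringIO()
stats = pstats.Stats("demo.prof", stream=buf)
stats.sort_stats("tottime").print_stats(5)
assert "tottime" in buf.getvalue()
```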
<p>We can see that the most time-consuming function is the <code class="docutils literal"><span class="pre">built-in</span> <span class="pre">method</span> <span class="pre">run</span></code>, which is a C++ function in <code class="docutils literal"><span class="pre">libpaddle.so</span></code>. We will
explain how to profile C++ code in the next section. For now,
let’s look into the third function, <code class="docutils literal"><span class="pre">sync_with_cpp</span></code>, which is a
Python function. We can click it to learn more about it:</p>
<span id="profiling-python-and-c-code"></span><h1>Profiling Python and C++ Code<a class="headerlink" href="#profiling-python-and-c-code" title="Permalink to this headline">¶</a></h1>
<span id="generate-the-profiling-file"></span><h2>Generate the Profiling File<a class="headerlink" href="#generate-the-profiling-file" title="Permalink to this headline">¶</a></h2>
<p>To profile a mixture of Python and C++ code, we can use a Python
package, <code class="docutils literal"><span class="pre">yep</span></code>, that can work with Google’s <code class="docutils literal"><span class="pre">perftools</span></code>, which is a
commonly-used profiler for C/C++ code.</p>
<p>In Ubuntu systems, we can install <code class="docutils literal"><span class="pre">yep</span></code> and <code class="docutils literal"><span class="pre">perftools</span></code> with <code class="docutils literal"><span class="pre">apt-get</span></code> and <code class="docutils literal"><span class="pre">pip</span></code>. When building <code class="docutils literal"><span class="pre">libpaddle.so</span></code> for profiling, note the following:</p>
<ol>
<li>Use the GCC command line option <code class="docutils literal"><span class="pre">-g</span></code> when building <code class="docutils literal"><span class="pre">libpaddle.so</span></code> so as to
include the debug information. The standard build system of
PaddlePaddle is CMake, so you might want to set <code class="docutils literal"><span class="pre">CMAKE_BUILD_TYPE=RelWithDebInfo</span></code>.</li>
<li>Use the GCC command line option <code class="docutils literal"><span class="pre">-O2</span></code> or <code class="docutils literal"><span class="pre">-O3</span></code> to generate optimized
binary code. It doesn’t make sense to profile <code class="docutils literal"><span class="pre">libpaddle.so</span></code>
without optimization, because it would run slowly anyway.</li>
<li>Profile the single-threaded binary before the
multi-threaded version, because the latter often produces tangled
profiling results. You might want to set the environment
variable <code class="docutils literal"><span class="pre">OMP_NUM_THREADS=1</span></code> to prevent OpenMP from automatically
starting multiple threads.</li>
</ol>
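<p>The installation and profiling steps above can be sketched as shell commands; the Ubuntu package names and the <code class="docutils literal"><span class="pre">yep</span></code> invocation are assumptions and may vary by release:</p>

```shell
# Install Google perftools and the yep Python package (names assumed):
sudo apt-get install google-perftools libgoogle-perftools-dev
pip install yep

# Run the script under the yep module; this writes main.py.prof,
# a perftools-format profile covering both Python and C++ frames:
python -m yep -- main.py
```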
</div>
<div class="section" id="">
<span id="id1"></span><h2>Look into the Profiling File<a class="headerlink" href="#look-into-the-profiling-file" title="Permalink to this headline">¶</a></h2>
<p>The tool we use to look into the profiling file generated by
<code class="docutils literal"><span class="pre">perftools</span></code> is <a class="reference external" href="https://github.com/google/pprof"><code class="docutils literal"><span class="pre">pprof</span></code></a>, which
provides a Web-based GUI like <code class="docutils literal"><span class="pre">cprofilev</span></code>.</p>
<p>We can rely on the standard Go toolchain to retrieve the source code
of <code class="docutils literal"><span class="pre">pprof</span></code> and build it:</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>go get github.com/google/pprof
</pre></div>
</div>
<p>Then we can use the following command to start an HTTP service:</p>
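<p>A sketch of that command; the <code class="docutils literal"><span class="pre">-http</span></code> flag and the need to pass the profiled binary are assumptions based on current <code class="docutils literal"><span class="pre">pprof</span></code> releases:</p>

```shell
# Serve an interactive web UI for the perftools profile on port 8080;
# the Python binary is passed so pprof can resolve C++ symbols:
pprof -http=0.0.0.0:8080 "$(which python)" main.py.prof
```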
<p>Then we can use it to profile <code class="docutils literal"><span class="pre">main.py.prof</span></code> generated in the previous
section.</p>
<span id="identifying-the-performance-bottlenecks"></span><h2>Identifying the Performance Bottlenecks<a class="headerlink" href="#identifying-the-performance-bottlenecks" title="Permalink to this headline">¶</a></h2>
<p>Similar to how we work with <code class="docutils literal"><span class="pre">cprofilev</span></code>, we’d focus on <code class="docutils literal"><span class="pre">tottime</span></code> and
the cumulative time of each function.</p>