This tutorial introduces techniques we use to profile and tune the
CPU performance of PaddlePaddle.  We will use the Python packages
`cProfile` and `yep`, and Google's `perftools`.

Profiling is the process of revealing performance bottlenecks,
which can be very different from what developers expect.
Performance tuning is done to fix these bottlenecks. Performance optimization
repeats the steps of profiling and tuning alternately.

PaddlePaddle users program AI applications by calling the Python API, which calls
into `libpaddle.so`, written in C++.  In this tutorial, we focus on
the profiling and tuning of

1. the Python code and
1. the mixture of Python and C++ code.

## Profiling the Python Code

### Generate the Performance Profiling File

We can use the standard Python
package [`cProfile`](https://docs.python.org/2/library/profile.html)
to generate the profiling file.  For example:

```bash
python -m cProfile -o profile.out main.py
```

where `main.py` is the program we are going to profile and `-o` specifies
the output file.  Without `-o`, `cProfile` would write to the standard
output.
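
The same kind of file can also be produced programmatically, which is handy when only part of a script needs profiling.  A minimal sketch, where the workload is a hypothetical stand-in for whatever `main.py` does:

```python
import cProfile

profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(200000))  # stand-in for the real work in main.py
profiler.disable()

# writes the same kind of file as `python -m cProfile -o profile.out main.py`
profiler.dump_stats("profile.out")
```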

### Look into the Profiling File

`cProfile` generates `profile.out` after `main.py` completes. We can
use [`cprofilev`](https://github.com/ymichael/cprofilev) to look into
the details:

```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```

where `-a` specifies the HTTP IP, `-p` specifies the port, `-f`
specifies the profiling file, and `main.py` is the source file.
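
If you prefer to stay in the terminal, the standard `pstats` module can read the same file.  A self-contained sketch (the `cProfile.run` call is a stand-in for the earlier `python -m cProfile -o profile.out main.py` step):

```python
import cProfile
import pstats

# stand-in: normally profile.out comes from profiling main.py on the command line
cProfile.run("sum(i * i for i in range(100000))", "profile.out")

stats = pstats.Stats("profile.out")
stats.sort_stats("tottime")  # the same column cprofilev sorts on
stats.print_stats(5)         # show the 5 most expensive functions
```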

Open a Web browser and point it to the specified IP address and
port; we will see output like the following:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
     4696   12.040    0.003   12.040    0.003 {built-in method run}
        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```

where each line corresponds to a Python function, and the meaning of
each column is as follows:

| column | meaning |
| --- | --- |
| ncalls | the number of calls into a function |
| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
| percall | tottime divided by ncalls |
| cumtime | the total execution time of the function, including the execution time of other functions being called |
| percall | cumtime divided by ncalls |
| filename:lineno(function) | where the function is defined |
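
As a quick sanity check on the sample rows above, each `percall` column is simply the corresponding total divided by `ncalls`:

```python
# values taken from the `built-in method run` row of the sample output
ncalls = 4696
tottime = 12.040

percall = tottime / ncalls
print(round(percall, 3))  # matches the 0.003 shown in the listing
```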

### Identify Performance Bottlenecks

Usually, `tottime` and the related `percall` time are what we want to
focus on. We can sort the above profiling output by `tottime`:

```text
     4696   12.040    0.003   12.040    0.003 {built-in method run}
   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
```

We can see that the most time-consuming function is the `built-in
method run`, which is a C++ function in `libpaddle.so`.  We will
explain how to profile C++ code in the next section.  For now,
let's look into the third function, `sync_with_cpp`, which is a
Python function.  We can click it to understand more about it:

```
Called By:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>

Function                                                                                                 was called by...
                                                                                                             ncalls  tottime  cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)  <-    4697    0.626    2.291  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)  <-    4696    0.019    2.316  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
                                                                                                                  1    0.000    0.001  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)


Called:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```

The list of callers of `sync_with_cpp` might help us understand
how to improve this function.

## Profiling Python and C++ Code

### Generate the Profiling File

To profile a mixture of Python and C++ code, we can use the Python
package `yep`, which works with Google's `perftools`, a
commonly used profiler for C/C++ code.

On Ubuntu systems, we can install `yep` and `perftools` by running the
following commands:

```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```

Then we can run the following command

```bash
python -m yep -v main.py
```

to generate the profiling file.  The default filename is
`main.py.prof`.

Please be aware of the `-v` command line option, which prints the
analysis results after generating the profiling file.  By examining the
printed result, we can tell whether debug information was stripped
from `libpaddle.so` at build time.  The following hints
help make sure that the analysis results are readable:

1. Use the GCC command line option `-g` when building `libpaddle.so` so as to
   include the debug information.  The standard build system of
   PaddlePaddle is CMake, so you might want to set
   `CMAKE_BUILD_TYPE=RelWithDebInfo`.

1. Use the GCC command line option `-O2` or `-O3` to generate optimized
   binary code. It doesn't make sense to profile `libpaddle.so`
   without optimization, because it would run slowly anyway.

1. Profile the single-threaded binary before the
   multi-threaded version, because the latter often produces a tangled
   profiling result.  You might want to set the environment
   variable `OMP_NUM_THREADS=1` to prevent OpenMP from automatically
   starting multiple threads.

### Examining the Profiling File

The tool we use to examine the profiling file generated by
`perftools` is [`pprof`](https://github.com/google/pprof), which
provides a Web-based GUI like `cprofilev`.

We can rely on the standard Go toolchain to retrieve the source code
of `pprof` and build it:

```bash
go get github.com/google/pprof
```

Then we can use it to profile `main.py.prof` generated in the previous
section:

```bash
pprof -http=0.0.0.0:3213 `which python`  ./main.py.prof
```

where `-http` specifies the IP address and port of the HTTP service.
Directing our Web browser to the service, we will see something like
the following:

![result](./pprof_1.png)

### Identifying the Performance Bottlenecks

Similar to how we work with `cprofilev`, we'd focus on `tottime` and
`cumtime`.

![kernel_perf](./pprof_2.png)

We can see that the execution of multiplication and the computation
of its gradient take 2% to 4% of the total running
time, while `MomentumOp` takes about 17%. Obviously, we'd want to
optimize `MomentumOp`.

`pprof` marks performance-critical parts of the program in
red. It's a good idea to follow these hints.