This tutorial introduces techniques we use to profile and tune the
CPU performance of PaddlePaddle.  We will use the Python packages
`cProfile` and `yep`, and Google's `perftools`.

Profiling is the process that reveals performance bottlenecks,
which can be very different from what developers expect.
Performance tuning is done to fix these bottlenecks. Performance optimization
alternates between profiling and tuning.

PaddlePaddle users program AI applications by calling the Python API, which calls
into `libpaddle.so`, written in C++.  In this tutorial, we focus on
the profiling and tuning of

1. the Python code and
1. the mixture of Python and C++ code.

## Profiling the Python Code

### Generate the Performance Profiling File

We can use the Python standard library module
[`cProfile`](https://docs.python.org/2/library/profile.html)
to generate a Python profiling file.  For example:

```bash
python -m cProfile -o profile.out main.py
```

where `main.py` is the program we are going to profile and `-o` specifies
the output file.  Without `-o`, `cProfile` writes its report to standard
output.

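When only part of a program is of interest, the same kind of profiling file can be produced from inside Python. The following is a minimal sketch, not part of the original workflow: `workload` is a hypothetical stand-in for the code in `main.py`, and the output goes to a temporary directory rather than the working directory.

```python
import cProfile
import os
import tempfile


def workload():
    """Hypothetical stand-in for the code that main.py would run."""
    return sum(i * i for i in range(100000))


# Plays the role of `-o profile.out`; a temporary directory keeps the sketch tidy.
profile_path = os.path.join(tempfile.mkdtemp(), "profile.out")

profiler = cProfile.Profile()
profiler.enable()                    # start collecting statistics
result = workload()
profiler.disable()                   # stop collecting statistics
profiler.dump_stats(profile_path)    # write the statistics to a file

print("wrote", profile_path, os.path.getsize(profile_path), "bytes")
```

The resulting file can be inspected with the same tools as a file written by `python -m cProfile -o`.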
### Look into the Profiling File

`cProfile` generates `profile.out` after `main.py` completes. We can
use [`cprofilev`](https://github.com/ymichael/cprofilev) to look into
the details:

```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```

where `-a` specifies the IP address on which the HTTP server listens, `-p`
specifies the port, `-f` specifies the profiling file, and `main.py` is the
source file.

Open a Web browser and point it to the local IP and the specified
port; we will see output like the following:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
     4696   12.040    0.003   12.040    0.003 {built-in method run}
        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```

where each line corresponds to a Python function, and the meaning of
each column is as follows:

| column | meaning |
| --- | --- |
| ncalls | the number of calls into a function |
| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
| percall | tottime divided by ncalls |
| cumtime | the total execution time of the function, including the execution time of other functions being called |
| percall | cumtime divided by ncalls |
| filename:lineno(function) | where the function is defined |

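The difference between `tottime` and `cumtime` is easiest to see on a toy program. The sketch below is illustrative only: `inner` and `outer` are hypothetical functions, and it assumes Python 3.9 or newer, where `pstats.Stats.get_stats_profile()` is available. `outer` does almost no work itself, so its `tottime` stays small while its `cumtime` includes everything spent in `inner`.

```python
import cProfile
import pstats


def inner(n):
    """Does the arithmetic itself, so most time is its own (tottime)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def outer():
    """Mostly delegates to inner: small tottime, large cumtime."""
    results = []
    for _ in range(50):
        results.append(inner(20000))
    return results


profiler = cProfile.Profile()
profiler.runcall(outer)

# Assumption: Python 3.9+, which provides get_stats_profile().
profiles = pstats.Stats(profiler).get_stats_profile().func_profiles
for name in ("outer", "inner"):
    p = profiles[name]
    print("%s: ncalls=%s tottime=%.3f cumtime=%.3f"
          % (name, p.ncalls, p.tottime, p.cumtime))
```

The exact timings depend on the machine, so none are shown here; the relationship between the columns is what matters.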
### Identify Performance Bottlenecks

Usually, `tottime` and the related `percall` time are what we want to
focus on. We can sort the above profiling output by `tottime`:

```text
     4696   12.040    0.003   12.040    0.003 {built-in method run}
   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
```

We can see that the most time-consuming function is the `built-in
method run`, which is a C++ function in `libpaddle.so`.  We will
explain how to profile C++ code in the next section.  For now,
let's look into the fourth function, `sync_with_cpp`, which is a
Python function.  We can click it to learn more about it:

```
Called By:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>

Function                                                                                                 was called by...
                                                                                                             ncalls  tottime  cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)  <-    4697    0.626    2.291  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)  <-    4696    0.019    2.316  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
                                                                                                                  1    0.000    0.001  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)


Called:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```

The list of callers of `sync_with_cpp` shows where the calls come
from, which might help us understand how to improve the function.

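The sorted view and the caller list can also be produced without a browser, using the standard `pstats` module. This is a sketch under stated assumptions rather than the workflow used above: `helper`, `caller_a`, `caller_b`, and `main` are hypothetical functions standing in for the real program, and the report is captured in a string so it can be searched or saved.

```python
import cProfile
import io
import pstats


def helper(n):
    return sum(range(n))


def caller_a():
    for _ in range(200):
        helper(1000)


def caller_b():
    return helper(10)


def main():
    caller_a()
    caller_b()


profiler = cProfile.Profile()
profiler.runcall(main)

out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("tottime").print_stats(5)  # top five rows by tottime
stats.print_callers("helper")               # which functions called helper()?
report = out.getvalue()
print(report)
```

To analyze a file written by `python -m cProfile -o profile.out main.py` instead, pass the filename to `pstats.Stats` in place of the profiler object.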
## Profiling Python and C++ Code

### Generate the Profiling File

To profile a mixture of Python and C++ code, we can use the Python
package `yep`, which works with Google's `perftools`, a
commonly used profiler for C/C++ code.

On Ubuntu systems, we can install `yep` and `perftools` by running the
following commands:

```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```

Then we can run the following command

```bash
python -m yep -v main.py
```

to generate the profiling file.  The default filename is
`main.py.prof`.

Note the `-v` command line option, which prints the
analysis results after generating the profiling file.  By examining
the printed results, we can tell whether debug
information was stripped from `libpaddle.so` at build time.  The following hints
help make sure that the analysis results are readable:

1. Use GCC command line option `-g` when building `libpaddle.so` to
   include the debug information.  The standard build system of
   PaddlePaddle is CMake, so you might want to set
   `CMAKE_BUILD_TYPE=RelWithDebInfo`.

1. Use GCC command line option `-O2` or `-O3` to generate optimized
   binary code. It doesn't make sense to profile `libpaddle.so`
   without optimization, because it would run slowly anyway.

1. Profile the single-threaded binary before the
   multi-threaded version, because the latter often generates tangled
   profiling analysis results.  You might want to set the environment
   variable `OMP_NUM_THREADS=1` to prevent OpenMP from automatically
   starting multiple threads.

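The last hint can be combined with the `yep` command shown earlier. The sketch below only assembles the command line and environment; `single_thread_profile_command` is a hypothetical helper, not part of `yep` or PaddlePaddle, and actually launching the command assumes that `yep` and `perftools` are installed.

```python
import os
import sys


def single_thread_profile_command(script):
    """Return (argv, env) for profiling `script` with OpenMP limited to one thread."""
    env = dict(os.environ)          # copy, so the current process is unaffected
    env["OMP_NUM_THREADS"] = "1"    # keep OpenMP from starting multiple threads
    argv = [sys.executable, "-m", "yep", "-v", script]
    return argv, env


argv, env = single_thread_profile_command("main.py")
print("OMP_NUM_THREADS=%s %s" % (env["OMP_NUM_THREADS"], " ".join(argv)))

# To launch it for real (assumes yep and perftools are installed):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

Passing a copied environment to the child process keeps the setting local to the profiling run.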
### Examining the Profiling File

The tool we use to examine the profiling file generated by
`perftools` is [`pprof`](https://github.com/google/pprof), which
provides a Web-based GUI like `cprofilev`.

We can rely on the standard Go toolchain to retrieve the source code
of `pprof` and build it:

```bash
# Go 1.17 and later; older toolchains use `go get github.com/google/pprof`
go install github.com/google/pprof@latest
```

Then we can use it to examine the `main.py.prof` file generated in the previous
section:

```bash
pprof -http=0.0.0.0:3213 `which python`  ./main.py.prof
```

where `-http` specifies the IP and port of the HTTP service.
Directing our Web browser to the service, we would see something like
the following:

![result](./pprof_1.png)

### Identifying the Performance Bottlenecks

As with `cprofilev`, we focus on `tottime` and `cumtime`, which
`pprof` labels `flat` and `cum`.

![kernel_perf](./pprof_2.png)

We can see that multiplication and the computation of its gradient
take 2% to 4% of the total running time, while `MomentumOp` takes
about 17%. Obviously, we'd want to optimize `MomentumOp`.

`pprof` marks performance-critical parts of the program in
red. It's a good idea to follow these hints.