diff --git a/doc/optimization/gpu_profiling.rst b/doc/optimization/gpu_profiling.rst
index 583c2d6caee460331aea366d4d4f65be81e553b0..efdf5552c351fac15cd1f257e664dfdec0252e33 100644
--- a/doc/optimization/gpu_profiling.rst
+++ b/doc/optimization/gpu_profiling.rst
@@ -8,6 +8,7 @@ This tutorial will guide you step-by-step through how to conduct profiling and p
 - How to do profiling?
 - Profile tools
 - Hands-on Tutorial
+- Profiling tips
 
 What's profiling?
 =================
@@ -68,10 +69,59 @@ respectively to avoid program crashes when CPU version of PaddlePaddle invokes t
 Hands-on Approach
 =================
 
+To use the command-line profiler :code:`nvprof`, simply run:
+
+.. code-block:: bash
+
+    nvprof ./paddle/math/tests/test_GpuProfiler
+
+This produces the following profiling result:
+
 .. image:: nvprof.png
   :align: center
   :scale: 30%
 
+For the visual profiler :code:`nvvp`, you can either import the output of :code:`nvprof -o ...` or
+run the application through the GUI.
+
 .. image:: nvvp1.png
   :align: center
-  :scale: 30%
\ No newline at end of file
+  :scale: 30%
+
+At the level of individual kernel functions, :code:`nvvp` can even show why an operation takes a long time.
+As shown in the following figure, the kernel's block usage, register usage, and shared memory usage reported by :code:`nvvp`
+help us tune the kernel to fully utilize all warps on the GPU.
+
+.. image:: nvvp2.png
+  :align: center
+  :scale: 30%
+
+At the application level, :code:`nvvp` can suggest ways to address performance bottlenecks.
+For instance, the advice on data movement and compute utilization in the figures below can guide performance tuning.
+
+.. image:: nvvp3.png
+  :align: center
+  :scale: 30%
+
+.. image:: nvvp4.png
+  :align: center
+  :scale: 30%
+
+Profiling tips
+==============
+
+- The :code:`nvprof` and :code:`nvvp` output is a very good place to start.
+- The timeline is a good place to go next.
+- Only dig deep into a kernel if it’s taking a significant amount of your time.
+- Where possible, try to match profiler output with theory.
+  1) For example, if I know I’m moving 1 GB, and my kernel takes 10 ms, I expect the profiler to report 100 GB/s.
+  2) Discrepancies are likely to mean your application isn’t doing what you thought it was.
+- Know your hardware: if your GPU can do 6 TFLOPS, and you’re already doing 5.5 TFLOPS, you won’t go much faster!
+
+
+Profiling is a key step in optimization. Sometimes quite simple changes can lead to big improvements in performance.
+Your mileage may vary!
+
+Reference
+=========
+Jeremy Appleyard, `GPU Profiling for Deep Learning `_, 2015
diff --git a/doc/optimization/nvvp2.png b/doc/optimization/nvvp2.png
new file mode 100644
index 0000000000000000000000000000000000000000..177c9db708da6863d1075f3e615f5962dbe18b29
Binary files /dev/null and b/doc/optimization/nvvp2.png differ
diff --git a/doc/optimization/nvvp3.png b/doc/optimization/nvvp3.png
new file mode 100644
index 0000000000000000000000000000000000000000..d8f393667d6569b6f1e61ffccac43fae5888b6db
Binary files /dev/null and b/doc/optimization/nvvp3.png differ
diff --git a/doc/optimization/nvvp4.png b/doc/optimization/nvvp4.png
new file mode 100644
index 0000000000000000000000000000000000000000..51f2f3e183295de6cf8ddaf2b3b8a0862aa35f01
Binary files /dev/null and b/doc/optimization/nvvp4.png differ