diff --git a/doc/optimization/gpu_profiling.rst b/doc/optimization/gpu_profiling.rst
index 583c2d6caee460331aea366d4d4f65be81e553b0..efdf5552c351fac15cd1f257e664dfdec0252e33 100644
--- a/doc/optimization/gpu_profiling.rst
+++ b/doc/optimization/gpu_profiling.rst
@@ -8,6 +8,7 @@ This tutorial will guide you step-by-step through how to conduct profiling and p
 - How to do profiling?
 - Profile tools
 - Hands-on Tutorial
+- Profiling tips
 
 What's profiling?
 =================
@@ -68,10 +69,59 @@ respectively to avoid program crashes when CPU version of PaddlePaddle invokes t
 Hands-on Approach
 =================
 
+To use the command-line profiler :code:`nvprof`, simply run it on the test binary:
+
+.. code-block:: bash
+
+    nvprof  ./paddle/math/tests/test_GpuProfiler
+
+This produces the following profiling result:
+
 ..  image:: nvprof.png
     :align: center
     :scale: 30%
 
+For the visual profiler :code:`nvvp`, you can either import the output of :code:`nvprof -o ...` or
+run the application through the GUI.
+
 ..  image:: nvvp1.png
     :align: center
-    :scale: 30%
\ No newline at end of file
+    :scale: 30%
+
+From the perspective of kernel functions, :code:`nvvp` can even show why an operation takes a long time.
+As shown in the following figure, the kernel's block usage, register usage, and shared memory usage reported by
+:code:`nvvp` help us tune the kernel so that all warps on the GPU are fully utilized.
+
+..  image:: nvvp2.png
+    :align: center
+    :scale: 30%
+
+From the perspective of the application, :code:`nvvp` can suggest ways to address performance bottlenecks.
+For instance, the advice on data movement and compute utilization in the figures below can guide your performance tuning.
+
+..  image:: nvvp3.png
+    :align: center
+    :scale: 30%
+
+..  image:: nvvp4.png
+    :align: center
+    :scale: 30%
+
+Profiling tips
+==============
+
+- The :code:`nvprof` and :code:`nvvp` output is a very good place to start.
+- The timeline is a good place to go next.
+- Only dig deep into a kernel if it’s taking a significant amount of your time.
+- Where possible, try to match profiler output with theory. For example, if you know you are moving 1 GB
+  and your kernel takes 10 ms, you should expect the profiler to report about 100 GB/s. A discrepancy
+  likely means your application isn’t doing what you thought it was.
+- Know your hardware: If your GPU can do 6 TFLOPs, and you’re already doing 5.5 TFLOPs, you won’t go much faster!
+
+
+Profiling is a key step in optimisation. Sometimes quite simple changes can lead to big improvements in performance.
+Your mileage may vary!
+
+Reference
+=========
+Jeremy Appleyard, `GPU Profiling for Deep Learning <http://www.robots.ox.ac.uk/~seminars/seminars/Extra/2015_10_08_JeremyAppleyard.pdf>`_, 2015
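
The "match profiler output with theory" tip above can be turned into a small sanity check. The sketch below is only an illustration and is not part of PaddlePaddle or the NVIDIA tools: the helper names `expected_bandwidth_gb_s` and `relative_gap` are hypothetical, the `reported` value is a made-up reading typed in by hand, and 1 GB is assumed to mean 1e9 bytes.

```python
def expected_bandwidth_gb_s(bytes_moved: float, seconds: float) -> float:
    """Return the effective bandwidth in GB/s, assuming 1 GB = 1e9 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1e9


def relative_gap(expected: float, reported: float) -> float:
    """Fractional difference between the expected and the reported value."""
    return abs(reported - expected) / expected


# Example from the tips: 1 GB moved by a kernel that runs for 10 ms.
expected = expected_bandwidth_gb_s(1e9, 10e-3)

# Hypothetical profiler reading, entered by hand for illustration only.
reported = 62.0

# Hardware headroom from the tips: 5.5 TFLOPs achieved on a 6 TFLOPs GPU.
utilization = 5.5 / 6.0

print(f"expected {expected:.1f} GB/s, reported {reported:.1f} GB/s, "
      f"gap {relative_gap(expected, reported):.0%}")
print(f"compute utilization: {utilization:.0%}")
```

A large gap between the expected and reported bandwidth suggests the kernel is moving more (or less) data than assumed, while a utilization already close to 100% suggests there is little left to gain on that kernel.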
diff --git a/doc/optimization/nvvp2.png b/doc/optimization/nvvp2.png
new file mode 100644
index 0000000000000000000000000000000000000000..177c9db708da6863d1075f3e615f5962dbe18b29
Binary files /dev/null and b/doc/optimization/nvvp2.png differ
diff --git a/doc/optimization/nvvp3.png b/doc/optimization/nvvp3.png
new file mode 100644
index 0000000000000000000000000000000000000000..d8f393667d6569b6f1e61ffccac43fae5888b6db
Binary files /dev/null and b/doc/optimization/nvvp3.png differ
diff --git a/doc/optimization/nvvp4.png b/doc/optimization/nvvp4.png
new file mode 100644
index 0000000000000000000000000000000000000000..51f2f3e183295de6cf8ddaf2b3b8a0862aa35f01
Binary files /dev/null and b/doc/optimization/nvvp4.png differ