Crayon鑫 / Paddle (forked from PaddlePaddle / Paddle; in sync with the fork source)
Commit 2c84c1ec
Authored Nov 17, 2016 by liaogang
Add profiler object and update docs
Parent: ff6205dc
Showing 6 changed files with 180 additions and 37 deletions
CMakeLists.txt (+5 -0)
doc/optimization/gpu_profiling.rst (+142 -33)
paddle/cuda/src/hl_cuda_device.cc (+2 -0)
paddle/math/tests/test_GpuProfiler.cpp (+10 -4)
paddle/utils/Stat.cpp (+1 -0)
paddle/utils/Stat.h (+20 -0)
CMakeLists.txt
...
...
@@ -36,6 +36,7 @@ option(WITH_RDMA "Compile PaddlePaddle with rdma support" OFF)
 option(WITH_GLOG "Compile PaddlePaddle use glog, otherwise use a log implement internally" ${LIBGLOG_FOUND})
 option(WITH_GFLAGS "Compile PaddlePaddle use gflags, otherwise use a flag implement internally" ${GFLAGS_FOUND})
 option(WITH_TIMER "Compile PaddlePaddle use timer" OFF)
+option(WITH_PROFILER "Compile PaddlePaddle use gpu profiler" OFF)
 option(WITH_TESTING "Compile and run unittest for PaddlePaddle" ${GTEST_FOUND})
 option(WITH_DOC "Compile PaddlePaddle with documentation" OFF)
 option(WITH_SWIG_PY "Compile PaddlePaddle with py PaddlePaddle prediction api" ${SWIG_FOUND})
...
...
@@ -134,6 +135,10 @@ if(NOT WITH_TIMER)
     add_definitions(-DPADDLE_DISABLE_TIMER)
 endif(NOT WITH_TIMER)
+if(NOT WITH_PROFILER)
+    add_definitions(-DPADDLE_DISABLE_PROFILER)
+endif(NOT WITH_PROFILER)
 if(WITH_AVX)
     set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${AVX_FLAG}")
     set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${AVX_FLAG}")
...
...
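Note on the CMake change: with `WITH_PROFILER` left `OFF`, CMake passes `-DPADDLE_DISABLE_PROFILER`, and `Stat.h` in this commit then defines `REGISTER_GPU_PROFILER` as an empty macro. The sketch below shows that kind of compile-time switch in isolation. `SKETCH_DISABLE_PROFILER`, `SKETCH_REGISTER_PROFILER`, `profilerActivations`, `runRegion`, and `checkSketch` are hypothetical names chosen for illustration, not PaddlePaddle identifiers.

```cpp
#include <cassert>

// Hypothetical stand-in for the real switch (PADDLE_DISABLE_PROFILER), which
// CMake defines via add_definitions() when WITH_PROFILER is OFF.
// Build with -DSKETCH_DISABLE_PROFILER to compile the profiler hook out.

// Counts how many times the profiler hook actually ran.
inline int& profilerActivations() {
  static int activations = 0;
  return activations;
}

#ifdef SKETCH_DISABLE_PROFILER
// Disabled build: the macro expands to nothing, so call sites cost nothing.
#define SKETCH_REGISTER_PROFILER(name)
#else
// Enabled build: the macro expands to real work (here, just a counter bump).
#define SKETCH_REGISTER_PROFILER(name) ++profilerActivations()
#endif

// A region of interest; returns the activation count after running once.
inline int runRegion() {
  SKETCH_REGISTER_PROFILER("region");
  return profilerActivations();
}

// Usage sketch: the counter only moves in an enabled build.
inline void checkSketch() {
  const int before = profilerActivations();
  const int after = runRegion();
#ifdef SKETCH_DISABLE_PROFILER
  assert(after == before);
#else
  assert(after == before + 1);
#endif
}
```

The call site is identical in both builds, which is why the commit can leave `REGISTER_GPU_PROFILER` in the test unconditionally.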
doc/optimization/gpu_profiling.rst
-GPU Profiling
-=============
+Profiling on PaddlePaddle
+=========================
-This tutorial will guide you step-by-step through how to conduct profiling and performance tuning using :code:`nvprof` and :code:`nvvp`.
+This tutorial will guide you step-by-step through how to conduct profiling and performance tuning using built-in timer, **nvprof** and **nvvp**.
 - What is profiling?
 - Why we need profiling?
...
...
@@ -45,73 +45,182 @@ Profiler Tools
 ==============
 For general GPU profiling, a bunch of tools are provided from both NVIDIA and third party.
-:code:`nvprof` is Nvidia profiler and :code:`nvvp` is (GUI based) Nvidia visual profiler.
+**nvprof** is Nvidia profiler and **nvvp** is (GUI based) Nvidia visual profiler.
 In this tutorial, we will focus on nvprof and nvvp.
 :code:`test_GpuProfiler` from :code:`paddle/math/tests` directory will be used to evaluate
 above profilers.
-.. code-block:: c++
+.. literalinclude:: ../../paddle/math/tests/test_GpuProfiler.cpp
+   :language: c++
+   :lines: 107-121
+   :linenos:
-   TEST(Profiler, BilinearFwdBwd) {
-     hl_profiler_start();
-     auto numSamples = 10;
-     auto channels = 16;
-     auto imgSize = 64;
-     testBilinearFwdBwd(numSamples, imgSize, imgSize, channels);
-     hl_profiler_end();
-   }
+The above code snippet includes two methods, you can use any of them to profile the regions of interest.
-:code:`hl_profiler_start` and :code:`hl_profiler_end` can be used to profile only regions of interest
-in PaddlePaddle. They are wrapper functions of :code:`cudaProfilerStart` and :code:`cudaProfilerStop`
-respectively to avoid program crashes when CPU version of PaddlePaddle invokes them.
+1. :code:`REGISTER_TIMER_INFO` is a built-in timer wrapper which can calculate the time overhead of both cpu functions and cuda kernels.
+2. :code:`REGISTER_GPU_PROFILER` is a general purpose wrapper object of :code:`cudaProfilerStart` and :code:`cudaProfilerStop` to avoid
+program crashes when CPU version of PaddlePaddle invokes them.
+You can find all the gory details about how to use both of them in the next session.
 Hands-on Approach
 =================
-To use this command line profiler :code:`nvprof`, you can simply issue the command:
+Built-in Timer
+--------------
-.. code-block:: bash
+To enable built-in timer in PaddlePaddle, first you have to add :code:`REGISTER_TIMER_INFO` into the regions of you interest.
+Then, all information could be stamped in the console via :code:`printStatus` or :code:`printAllStatus` function.
+As a simple example, consider the following:
+1. Add :code:`REGISTER_TIMER_INFO` and :code:`printStatus` functions (see the emphasize-lines).
+   .. literalinclude:: ../../paddle/math/tests/test_GpuProfiler.cpp
+      :language: c++
+      :lines: 107-121
+      :emphasize-lines: 10-11,14
+      :linenos:
+2. Configure cmake with **WITH_TIMER** and recompile PaddlePaddle.
+   .. code-block:: bash
+      cmake .. -DWITH_TIMER=ON
+      make
+3. Execute your code and observe the results (see the emphasize-lines).
+   .. code-block:: bash
+      :emphasize-lines: 1,12-15
+      > ./paddle/math/tests/test_GpuProfiler
+      I1117 11:13:42.313065 2522362816 Util.cpp:155] commandline: ./paddle/math/tests/test_GpuProfiler
+      I1117 11:13:42.845065 2522362816 Util.cpp:130] Calling runInitFunctions
+      I1117 11:13:42.845208 2522362816 Util.cpp:143] Call runInitFunctions done.
+      [==========] Running 1 test from 1 test case.
+      [----------] Global test environment set-up.
+      [----------] 1 test from Profiler
+      [ RUN ] Profiler.BilinearFwdBwd
+      I1117 11:13:42.845310 2522362816 test_GpuProfiler.cpp:114] Enable GPU Profiler Stat: [testBilinearFwdBwd] "numSamples = 10, channels = 16, im
+      gSizeX = 64, imgSizeY = 64"
+      I1117 11:13:42.850154 2522362816 ThreadLocal.cpp:37] thread use undeterministic rand seed:20659751
+      I1117 11:13:42.981501 2522362816 Stat.cpp:130] ======= StatSet: [GlobalStatInfo] status ======
+      I1117 11:13:42.981539 2522362816 Stat.cpp:133] Stat=testBilinearFwdBwd total=136.141 avg=136.141 max=136.141 min=136.141 count=1
+      I1117 11:13:42.981572 2522362816 Stat.cpp:141] ======= BarrierStatSet status ======
+      I1117 11:13:42.981575 2522362816 Stat.cpp:154] --------------------------------------------------
+      [ OK ] Profiler.BilinearFwdBwd (136 ms)
+      [----------] 1 test from Profiler (136 ms total)
+      [----------] Global test environment tear-down
+      [==========] 1 test from 1 test case ran. (136 ms total)
+      [ PASSED ] 1 test.
+nvprof profiler
+---------------
-    nvprof ./paddle/math/tests/test_GpuProfiler
+To use this command line profiler **nvprof**, you can simply issue the following command:
+1. Add :code:`REGISTER_GPU_PROFILER` function (see the emphasize-lines).
+   .. literalinclude:: ../../paddle/math/tests/test_GpuProfiler.cpp
+      :language: c++
+      :lines: 107-121
+      :emphasize-lines: 7-8
+      :linenos:
+2. Configure cmake with **WITH_PROFILER** and recompile PaddlePaddle.
+   .. code-block:: bash
+      cmake .. -DWITH_PROFILER=ON
+      make
+3. Use Nvidia profiler **nvprof** to profile the binary.
+   .. code-block:: bash
+      nvprof ./paddle/math/tests/test_GpuProfiler
 Then, you can get the following profiling result:
-.. image:: nvprof.png
-   :align: center
-   :scale: 30%
+.. code-block:: bash
-For visual profiler :code:`nvvp`, you can either import the output of :code:`nvprof –o ...` or
+   ==78544== Profiling application: ./paddle/math/tests/test_GpuProfiler
+   ==78544== Profiling result:
+   Time(%) Time Calls Avg Min Max Name
+   27.60% 9.6305ms 5 1.9261ms 3.4560us 6.4035ms [CUDA memcpy HtoD]
+   26.07% 9.0957ms 1 9.0957ms 9.0957ms 9.0957ms KeBilinearInterpBw
+   23.78% 8.2977ms 1 8.2977ms 8.2977ms 8.2977ms KeBilinearInterpFw
+   22.55% 7.8661ms 2 3.9330ms 1.5798ms 6.2863ms [CUDA memcpy DtoH]
+   ==78544== API calls:
+   Time(%) Time Calls Avg Min Max Name
+   46.85% 682.28ms 8 85.285ms 12.639us 682.03ms cudaStreamCreateWithFlags
+   39.83% 580.00ms 4 145.00ms 302ns 550.27ms cudaFree
+   9.82% 143.03ms 9 15.892ms 8.7090us 142.78ms cudaStreamCreate
+   1.23% 17.983ms 7 2.5690ms 23.210us 6.4563ms cudaMemcpy
+   1.23% 17.849ms 2 8.9247ms 8.4726ms 9.3768ms cudaStreamSynchronize
+   0.66% 9.5969ms 7 1.3710ms 288.43us 2.4279ms cudaHostAlloc
+   0.13% 1.9530ms 11 177.54us 7.6810us 591.06us cudaMalloc
+   0.07% 1.0424ms 8 130.30us 1.6970us 453.72us cudaGetDevice
+   0.04% 527.90us 40 13.197us 525ns 253.99us cudaEventCreateWithFlags
+   0.03% 435.73us 348 1.2520us 124ns 42.704us cuDeviceGetAttribute
+   0.03% 419.36us 1 419.36us 419.36us 419.36us cudaGetDeviceCount
+   0.02% 260.75us 2 130.38us 129.32us 131.43us cudaGetDeviceProperties
+   0.02% 222.32us 2 111.16us 106.94us 115.39us cudaLaunch
+   0.01% 214.06us 4 53.514us 28.586us 77.655us cuDeviceGetName
+   0.01% 115.45us 4 28.861us 9.8250us 44.526us cuDeviceTotalMem
+   0.01% 83.988us 4 20.997us 578ns 77.760us cudaSetDevice
+   0.00% 38.918us 1 38.918us 38.918us 38.918us cudaEventCreate
+   0.00% 34.573us 31 1.1150us 279ns 12.784us cudaDeviceGetAttribute
+   0.00% 17.767us 1 17.767us 17.767us 17.767us cudaProfilerStart
+   0.00% 15.228us 2 7.6140us 3.5460us 11.682us cudaConfigureCall
+   0.00% 14.536us 2 7.2680us 1.1490us 13.387us cudaGetLastError
+   0.00% 8.6080us 26 331ns 173ns 783ns cudaSetupArgument
+   0.00% 5.5470us 6 924ns 215ns 2.6780us cuDeviceGet
+   0.00% 5.4090us 6 901ns 328ns 3.3320us cuDeviceGetCount
+   0.00% 4.1770us 3 1.3920us 1.0630us 1.8300us cuDriverGetVersion
+   0.00% 3.4650us 3 1.1550us 1.0810us 1.2680us cuInit
+   0.00% 830ns 1 830ns 830ns 830ns cudaRuntimeGetVersion
+nvvp profiler
+-------------
+For visual profiler **nvvp**, you can either import the output of :code:`nvprof –o ...` or
 run application through GUI.
+**Note: nvvp also support CPU profiling** (Click the box in nvvp to enable profile execution on CPU).
 .. image:: nvvp1.png
    :align: center
-   :scale: 30%
+   :scale: 33%
-From the perspective of kernel functions, :code:`nvvp` can even illustrate why does an operation take a long time?
+From the perspective of kernel functions, **nvvp** can even illustrate why does an operation take a long time?
 As shown in the following figure, kernel's block usage, register usage and shared memory usage from :code:`nvvp`
-allow us to fully utilize all warps on the GPU.
+allow us to fully utilize all warps on the GPU.
 .. image:: nvvp2.png
    :align: center
-   :scale: 30%
+   :scale: 33%
-From the perspective of application, :code:`nvvp` can give you some suggestions to address performance bottleneck.
+From the perspective of application, **nvvp** can give you some suggestions to address performance bottleneck.
 For instance, some advice in data movement and compute utilization from the below figure can guide you to tune performance.
 .. image:: nvvp3.png
    :align: center
-   :scale: 30%
+   :scale: 33%
 .. image:: nvvp4.png
    :align: center
-   :scale: 30%
+   :scale: 33%
 Profiling tips
 ==============
-- The :code:`nvprof` and :code:`nvvp` output is a very good place to start
-- The timeline is a good place to go next
+- The **nvprof** and **nvvp** output is a very good place to start.
+- The timeline is a good place to go next.
 - Only dig deep into a kernel if it’s taking a significant amount of your time.
 - Where possible, try to match profiler output with theory.
     1) For example, if I know I’m moving 1GB, and my kernel takes 10ms, I expect the profiler to report 100GB/s.
...
...
@@ -119,7 +228,7 @@ Profiling tips
 - Know your hardware: If your GPU can do 6 TFLOPs, and you’re already doing 5.5 TFLOPs, you won’t go much faster!
-Profiling is a key step in optimisation. Sometimes quite simple changes can lead to big improvements in performance.
+Profiling is a key step in optimization. Sometimes quite simple changes can lead to big improvements in performance.
 Your mileage may vary!
 Reference
...
...
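Note on the "match profiler output with theory" tip: the 1 GB in 10 ms example is plain arithmetic, and a small helper makes the expected figure easy to compare against what nvprof reports. This is a minimal sketch; `effectiveBandwidthGBps` and `relativeGap` are hypothetical helpers, not part of PaddlePaddle, and the sketch assumes decimal gigabytes (1 GB = 1e9 bytes).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of PaddlePaddle): effective bandwidth in GB/s,
// assuming decimal gigabytes (1 GB = 1e9 bytes).
double effectiveBandwidthGBps(double bytesMoved, double seconds) {
  assert(seconds > 0.0);
  return bytesMoved / seconds / 1e9;
}

// Relative gap between a measured and an expected figure, e.g. 0.1 means 10%.
double relativeGap(double measured, double expected) {
  assert(expected != 0.0);
  return std::fabs(measured - expected) / std::fabs(expected);
}
```

For the doc's example, `effectiveBandwidthGBps(1e9, 10e-3)` evaluates to about 100 GB/s; a large `relativeGap` against the profiler's number is a hint that the kernel moves more data than assumed or is limited by something other than bandwidth.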
paddle/cuda/src/hl_cuda_device.cc
...
...
@@ -762,6 +762,8 @@ bool hl_cuda_event_is_ready(hl_event_t event) {
void hl_profiler_start() { CHECK_CUDA(dynload::cudaProfilerStart()); }
void hl_profiler_end() { CHECK_CUDA(dynload::cudaProfilerStop()); }
paddle/math/tests/test_GpuProfiler.cpp
...
...
@@ -20,7 +20,6 @@ limitations under the License. */
 #include <gtest/gtest.h>
 #include "paddle/gserver/tests/TestUtil.h"
 #include "paddle/utils/Stat.h"
-#include "hl_cuda.h"
 using namespace paddle;  // NOLINT
 using namespace std;     // NOLINT
...
...
@@ -106,12 +105,19 @@ void testBilinearFwdBwd(int numSamples, int imgSizeH, int imgSizeW,
 }
 TEST(Profiler, BilinearFwdBwd) {
-  hl_profiler_start();
   auto numSamples = 10;
   auto channels = 16;
   auto imgSize = 64;
-  testBilinearFwdBwd(numSamples, imgSize, imgSize, channels);
-  hl_profiler_end();
+  {
+    // nvprof: GPU Proflier
+    REGISTER_GPU_PROFILER("testBilinearFwdBwd",
+        "numSamples = 10, channels = 16, imgSizeX = 64, imgSizeY = 64");
+    // Paddle built-in timer
+    REGISTER_TIMER_INFO("testBilinearFwdBwd",
+        "numSamples = 10, channels = 16, imgSizeX = 64, imgSizeY = 64");
+    testBilinearFwdBwd(numSamples, imgSize, imgSize, channels);
+  }
+  globalStat.printStatus("testBilinearFwdBwd");
 }
 int main(int argc, char** argv) {
...
...
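Note on the test change: the test places `REGISTER_TIMER_INFO` inside a brace block and then prints the totals with `globalStat.printStatus`, which suggests a timer bound to the enclosing scope. The general idea can be sketched with `std::chrono`. This is not PaddlePaddle's `Stat` implementation; `TimerStat`, `timerRegistry`, `ScopedTimer`, and `demoCount` are hypothetical names. The sketch measures host wall-clock time only, so timing asynchronous GPU work this way would additionally need a device synchronization before the timer stops (an assumption about usage, not something this commit shows).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical names throughout; this is not PaddlePaddle's Stat implementation.
struct TimerStat {
  double totalMs = 0.0;     // accumulated wall-clock time in milliseconds
  std::uint64_t count = 0;  // number of completed timed scopes
};

// Accumulated statistics keyed by region name.
inline std::map<std::string, TimerStat>& timerRegistry() {
  static std::map<std::string, TimerStat> registry;
  return registry;
}

// Starts timing on construction and records the elapsed time on destruction.
class ScopedTimer {
 public:
  explicit ScopedTimer(std::string name)
      : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}

  ~ScopedTimer() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    TimerStat& stat = timerRegistry()[name_];
    stat.totalMs += std::chrono::duration<double, std::milli>(elapsed).count();
    stat.count += 1;
  }

  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  std::string name_;
  std::chrono::steady_clock::time_point start_;
};

// Usage sketch: time a region twice and read back the accumulated count.
inline std::uint64_t demoCount() {
  for (int i = 0; i < 2; ++i) {
    ScopedTimer timer("demoRegion");  // stops when this iteration's scope ends
  }
  assert(timerRegistry()["demoRegion"].totalMs >= 0.0);
  return timerRegistry()["demoRegion"].count;
}
```

Binding the measurement to a scope is what lets the test drop the explicit start and end calls: leaving the brace block ends the measurement.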
paddle/utils/Stat.cpp
...
...
@@ -65,6 +65,7 @@ std::ostream& operator<<(std::ostream& outPut, const Stat& stat) {
   auto showStat = [&](const StatInfo* info, pid_t tid, bool isFirst = true) {
     uint64_t average = 0;
     if (info->count_ > 0) {
+      outPut << std::setfill(' ') << std::left;
       if (!isFirst) {
         outPut << std::setw(42) << " ";
       }
...
...
paddle/utils/Stat.h
...
...
@@ -28,6 +28,7 @@ limitations under the License. */
 #include "Locks.h"
 #include "ThreadLocal.h"
 #include "BarrierStat.h"
+#include "hl_gpu.h"
 namespace paddle {
...
...
@@ -280,4 +281,23 @@ inline StatSet& registerTimerArg2(uint64_t threshold = -1,
 #endif  // DISABLE_TIMER
+class GpuProfiler final {
+public:
+  GpuProfiler() {
+    hl_profiler_start();
+  }
+  ~GpuProfiler() {
+    hl_profiler_end();
+  }
+};
+#ifdef PADDLE_DISABLE_PROFILER
+#define REGISTER_GPU_PROFILER(statName, ...)
+#else
+#define REGISTER_GPU_PROFILER(statName, ...) \
+  LOG(INFO) << "Enable GPU Profiler Stat: [" \
+    << statName << "] " << #__VA_ARGS__; \
+  GpuProfiler __gpuProfiler;
+#endif  // DISABLE_PROFILER
 }  // namespace paddle
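Note on `GpuProfiler`: the class starts profiling in its constructor and stops it in its destructor, so the profiled range follows the enclosing scope, including on early exit. Below is a minimal sketch of that pattern with counters standing in for `hl_profiler_start` and `hl_profiler_end`, so it builds without CUDA. `stubProfilerStart`, `stubProfilerEnd`, `ProfilerGuard`, `profiledRegion`, and `demo` are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-ins for hl_profiler_start/hl_profiler_end so the sketch builds without
// CUDA. In the commit these wrap cudaProfilerStart/cudaProfilerStop.
static int g_startCalls = 0;
static int g_stopCalls = 0;
void stubProfilerStart() { ++g_startCalls; }
void stubProfilerEnd() { ++g_stopCalls; }

// Scope guard: profiling starts at construction and stops at destruction.
class ProfilerGuard final {
 public:
  ProfilerGuard() { stubProfilerStart(); }
  ~ProfilerGuard() { stubProfilerEnd(); }
  ProfilerGuard(const ProfilerGuard&) = delete;
  ProfilerGuard& operator=(const ProfilerGuard&) = delete;
};

// Returns true if the region threw; the guard stops the profiler either way.
bool profiledRegion(bool shouldThrow) {
  try {
    ProfilerGuard guard;
    if (shouldThrow) throw std::runtime_error("simulated failure");
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}

// Usage sketch: start and stop calls stay balanced on both paths.
void demo() {
  profiledRegion(false);
  profiledRegion(true);
  assert(g_startCalls == g_stopCalls);
}
```

Because `REGISTER_GPU_PROFILER` expands to two statements (a `LOG` line and the guard declaration), it is safest inside a braced block, as the test in this commit does; as the body of an unbraced `if`, only the `LOG` statement would be conditional.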