Skip to content
体验新版
项目
组织
正在加载...
登录
切换导航
打开侧边栏
PaddlePaddle
FluidDoc
提交
377cc37f
F
FluidDoc
项目概览
PaddlePaddle
/
FluidDoc
通知
10
Star
2
Fork
0
代码
文件
提交
分支
Tags
贡献者
分支图
Diff
Issue
23
列表
看板
标记
里程碑
合并请求
111
Wiki
0
Wiki
分析
仓库
DevOps
项目成员
Pages
F
FluidDoc
项目概览
项目概览
详情
发布
仓库
仓库
文件
提交
分支
标签
贡献者
分支图
比较
Issue
23
Issue
23
列表
看板
标记
里程碑
合并请求
111
合并请求
111
Pages
分析
分析
仓库分析
DevOps
Wiki
0
Wiki
成员
成员
收起侧边栏
关闭侧边栏
动态
分支图
创建新Issue
提交
Issue看板
未验证
提交
377cc37f
编写于
4月 23, 2020
作者:
Z
Zhang Ting
提交者:
GitHub
4月 23, 2020
浏览文件
操作
浏览文件
下载
电子邮件补丁
差异文件
add device_switching doc (#1995)
* add device_switching doc, test=develop
上级
af6da442
变更
2
显示空白变更内容
内联
并排
Showing
2 changed file
with
198 addition
and
0 deletion
+198
-0
doc/fluid/advanced_guide/performance_improving/device_switching/device_switching.md
...erformance_improving/device_switching/device_switching.md
+197
-0
doc/fluid/advanced_guide/performance_improving/index_cn.rst
doc/fluid/advanced_guide/performance_improving/index_cn.rst
+1
-0
未找到文件。
doc/fluid/advanced_guide/performance_improving/device_switching/device_switching.md
0 → 100644
浏览文件 @
377cc37f
# 运行时设备切换
Paddle提供了
[
fluid.CUDAPlace
](
https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/CUDAPlace_cn.html
)
以及
[
fluid.CPUPlace
](
https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/CPUPlace_cn.html
)
用于指定运行时的设备。这两个接口用于指定全局的设备,从2.0版本开始,Paddle提供了
[
device_guard
](
https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/fluid_cn/device_guard_cn.html
)
接口,用于指定部分OP的运行设备,此教程会介绍device_guard的使用场景,以及如何使用该接口对模型进行优化。
如果使用了
`fluid.CUDAPlace`
设置了全局的执行设备,框架将尽可能地将OP设置在GPU上执行,因此有可能会遇到显存不够的情况。
`device_guard`
可以用于设置OP的执行设备,如果将部分层设置在CPU上运行,就能够充分利用CPU大内存的优势,避免显存超出。
有时尽管指定了全局的执行设备为GPU,但框架在自动分配OP执行设备时,可能会将部分OP设置在CPU上执行。另外,个别OP会将输出存储在CPU上。在以上的场景中,常常会发生不同设备间的数据传输,可能会影响模型的性能。使用
`device_guard`
可以避免模型运行中不必要的数据传输。在下面的内容中,将会详细介绍如何通过
[
profile
](
https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/profiler_cn.html
)
工具分析数据传输开销,以及如何使用
`device_guard`
避免不必要的数据传输,从而提升模型性能。
## 如何避免显存超出
下面示例代码中的
`embedding`
层,其参数
`size`
包含两个元素,第一个元素为
`vocab_size`
(词表大小), 第二个为
`emb_size`
(
`embedding`
层维度)。实际场景中,词表可能会非常大。示例代码中,词表大小被设置为10000000。如果在GPU模式下运行,该层创建的权重矩阵的大小为(10000000, 150),仅这一层就需要5.59G的显存,如果词表大小继续增加,极有可能会导致显存超出。
```
python
import
paddle.fluid
as
fluid
data
=
fluid
.
layers
.
fill_constant
(
shape
=
[
1
],
value
=
128
,
dtype
=
'int64'
)
label
=
fluid
.
layers
.
fill_constant
(
shape
=
[
1
,
150
],
value
=
0.5
,
dtype
=
'float32'
)
emb
=
fluid
.
embedding
(
input
=
data
,
size
=
(
10000000
,
150
),
dtype
=
'float32'
)
out
=
fluid
.
layers
.
l2_normalize
(
x
=
emb
,
axis
=-
1
)
cost
=
fluid
.
layers
.
square_error_cost
(
input
=
out
,
label
=
label
)
avg_cost
=
fluid
.
layers
.
mean
(
cost
)
sgd_optimizer
=
fluid
.
optimizer
.
SGD
(
learning_rate
=
0.001
)
sgd_optimizer
.
minimize
(
avg_cost
)
place
=
fluid
.
CUDAPlace
(
0
)
exe
=
fluid
.
Executor
(
place
)
exe
.
run
(
fluid
.
default_startup_program
())
result
=
exe
.
run
(
fluid
.
default_main_program
(),
fetch_list
=
[
avg_cost
])
```
`embedding`
是根据
`input`
中的
`id`
信息从
`embedding`
矩阵中查询对应
`embedding`
信息,在CPU上进行计算,其速度也是可接受的。因此,可以参考如下代码,使用
`device_guard`
将
`embedding`
层设置在CPU上,以利用CPU内存资源。那么,除了
`embedding`
层,其他各层都会在GPU上运行。
```
python
import
paddle.fluid
as
fluid
data
=
fluid
.
layers
.
fill_constant
(
shape
=
[
1
],
value
=
128
,
dtype
=
'int64'
)
label
=
fluid
.
layers
.
fill_constant
(
shape
=
[
1
,
150
],
value
=
0.5
,
dtype
=
'float32'
)
with
fluid
.
device_guard
(
"cpu"
):
emb
=
fluid
.
embedding
(
input
=
data
,
size
=
(
10000000
,
150
),
dtype
=
'float32'
)
out
=
fluid
.
layers
.
l2_normalize
(
x
=
emb
,
axis
=-
1
)
cost
=
fluid
.
layers
.
square_error_cost
(
input
=
out
,
label
=
label
)
avg_cost
=
fluid
.
layers
.
mean
(
cost
)
sgd_optimizer
=
fluid
.
optimizer
.
SGD
(
learning_rate
=
0.001
)
sgd_optimizer
.
minimize
(
avg_cost
)
place
=
fluid
.
CUDAPlace
(
0
)
exe
=
fluid
.
Executor
(
place
)
exe
.
run
(
fluid
.
default_startup_program
())
result
=
exe
.
run
(
fluid
.
default_main_program
(),
fetch_list
=
[
avg_cost
])
```
在显存足够的情况下,可不必进行这样的设置。
## 如何减少数据传输
### 使用profile工具确认是否发生了数据传输
首先对模型的性能数据进行分析,找到发生数据传输的原因。如下列代码所示,可以利用
[
profile
](
https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/profiler_cn.html
)
工具进行分析。
```
python
import
paddle.fluid
as
fluid
import
paddle.fluid.profiler
as
profiler
data1
=
fluid
.
layers
.
fill_constant
(
shape
=
[
1
,
3
,
8
,
8
],
value
=
0.5
,
dtype
=
'float32'
)
data2
=
fluid
.
layers
.
fill_constant
(
shape
=
[
1
,
3
,
5
,
5
],
value
=
0.5
,
dtype
=
'float32'
)
shape
=
fluid
.
layers
.
shape
(
data2
)
shape
=
fluid
.
layers
.
slice
(
shape
,
axes
=
[
0
],
starts
=
[
0
],
ends
=
[
4
])
out
=
fluid
.
layers
.
crop_tensor
(
data1
,
shape
=
shape
)
place
=
fluid
.
CUDAPlace
(
0
)
exe
=
fluid
.
Executor
(
place
)
exe
.
run
(
fluid
.
default_startup_program
())
with
profiler
.
profiler
(
'All'
,
'total'
)
as
prof
:
for
i
in
range
(
10
):
result
=
exe
.
run
(
fetch_list
=
[
out
])
```
在程序运行结束后,将会自动地打印出profile report。在下面的profile report中,可以看到
`GpuMemCpy Summary`
中给出了2项数据传输的调用耗时。在OP执行过程中,如果输入Tensor所在的设备与OP执行的设备不同,就会发生
`GpuMemcpySync`
,通常我们可以直接优化的就是这一项。进一步分析,可以看到
`slice`
和
`crop_tensor`
执行中都发生了
`GpuMemcpySync`
。尽管我们在程序中设置了GPU模式运行,但是框架中有些OP,例如shape,会将输出结果放在CPU上。
```
text
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by event first end time in descending order in the same thread
Total time: 24.0922
Computation time Total: 3.60143 Ratio: 14.9485%
Framework overhead Total: 20.4908 Ratio: 85.0515%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 30 Total: 1.44377 Ratio: 5.99267%
GpuMemcpyAsync Calls: 10 Total: 0.459803 Ratio: 1.90851%
GpuMemcpySync Calls: 20 Total: 0.983967 Ratio: 4.08416%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
fill_constant 20 2.03147 1.995597 (0.982342) 0.035872 (0.017658) 0.064199 0.379822 0.101573 0.0843204
shape 10 0.466503 0.466503 (1.000000) 0.000000 (0.000000) 0.021165 0.207393 0.0466503 0.0193632
eager_deletion 30 0.28398 0.283980 (1.000000) 0.000000 (0.000000) 0.004668 0.028065 0.009466 0.0117872
slice 10 1.53533 1.505664 (0.980679) 0.029664 (0.019321) 0.1312 0.259446 0.153533 0.0637271
GpuMemcpySync:CPU->GPU 10 0.41714 0.408532 (0.979364) 0.008608 (0.020636) 0.038545 0.054022 0.041714 0.0173143
crop_tensor 10 1.49584 1.438558 (0.961707) 0.057280 (0.038293) 0.129106 0.246395 0.149584 0.0620879
GpuMemcpySync:GPU->CPU 10 0.566827 0.543787 (0.959353) 0.023040 (0.040647) 0.047598 0.097705 0.0566827 0.0235274
Fetch 10 0.921333 0.897141 (0.973742) 0.024192 (0.026258) 0.077059 0.177223 0.0921333 0.0382419
GpuMemcpyAsync:GPU->CPU 10 0.459803 0.435611 (0.947386) 0.024192 (0.052614) 0.039321 0.073849 0.0459803 0.0190851
ParallelExecutor::Run 10 17.3578 17.345797 (0.999309) 0.012000 (0.000691) 0.705361 10.3389 1.73578 0.720472
InitLocalVars 1 0.084954 0.084954 (1.000000) 0.000000 (0.000000) 0.084954 0.084954 0.084954 0.0035262
ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.040771 0.040771 (1.000000) 0.000000 (0.000000) 0.003653 0.00543 0.0040771 0.00169229
FastThreadedSSAGraphExecutorPrepare 10 8.64291 8.630914 (0.998612) 0.012000 (0.001388) 0.033383 8.29818 0.864291 0.358743
ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.252618 0.252618 (1.000000) 0.000000 (0.000000) 0.022696 0.041439 0.0252618 0.0104854
```
### 通过log查看发生数据传输的具体位置
以上的示例程序比较简单,我们只用看profile report就能知道具体是哪些算子发生了数据传输。但是当模型比较复杂时,可能需要去查看更加详细的调试信息,可以打印出运行时的log去确定发生数据传输的具体位置。依然以上述程序为例,执行
`GLOG_vmodule=operator=3 python test_case.py`
,会得到如下log信息,会发现发生了2次数据传输:
-
`shape`
输出的结果在CPU上,在
`slice`
运行时,
`shape`
的输出被拷贝到GPU上
-
`slice`
执行完的结果在GPU上,当
`crop_tensor`
执行时,它会被拷贝到CPU上。
```
text
I0406 14:56:23.286592 17516 operator.cc:180] CUDAPlace(0) Op(shape), inputs:{Input[fill_constant_1.tmp_0:float[1, 3, 5, 5]({})]}, outputs:{Out[shape_0.tmp_0:int[4]({})]}.
I0406 14:56:23.286628 17516 eager_deletion_op_handle.cc:107] Erase variable fill_constant_1.tmp_0 on CUDAPlace(0)
I0406 14:56:23.286725 17516 operator.cc:1210] Transform Variable shape_0.tmp_0 from data_type[int]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[int]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0406 14:56:23.286763 17516 scope.cc:169] Create variable shape_0.tmp_0
I0406 14:56:23.286784 17516 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
I0406 14:56:23.286867 17516 tensor_util.cu:129] TensorCopySync 4 from CPUPlace to CUDAPlace(0)
I0406 14:56:23.287099 17516 operator.cc:180] CUDAPlace(0) Op(slice), inputs:{EndsTensor[], EndsTensorList[], Input[shape_0.tmp_0:int[4]({})], StartsTensor[], StartsTensorList[]}, outputs:{Out[slice_0.tmp_0:int[4]({})]}.
I0406 14:56:23.287140 17516 eager_deletion_op_handle.cc:107] Erase variable shape_0.tmp_0 on CUDAPlace(0)
I0406 14:56:23.287220 17516 tensor_util.cu:129] TensorCopySync 4 from CUDAPlace(0) to CPUPlace
I0406 14:56:23.287473 17516 operator.cc:180] CUDAPlace(0) Op(crop_tensor), inputs:{Offsets[], OffsetsTensor[], Shape[slice_0.tmp_0:int[4]({})], ShapeTensor[], X[fill_constant_0.tmp_0:float[1, 3, 8, 8]({})]}, outputs:{Out[crop_tensor_0.tmp_0:float[1, 3, 5, 5]({})]}.
```
### 使用device_guard避免不必要的数据传输
在上面的例子中,
`shape`
输出的是一个1-D的Tensor,因此对于
`slice`
而言计算量很小。这种情况下如果将
`slice`
设置在CPU上运行,就可以避免2次数据传输。修改后的程序如下:
```
python
import
paddle.fluid
as
fluid
import
paddle.fluid.profiler
as
profiler
data1
=
fluid
.
layers
.
fill_constant
(
shape
=
[
1
,
3
,
8
,
8
],
value
=
0.5
,
dtype
=
'float32'
)
data2
=
fluid
.
layers
.
fill_constant
(
shape
=
[
1
,
3
,
5
,
5
],
value
=
0.5
,
dtype
=
'float32'
)
shape
=
fluid
.
layers
.
shape
(
data2
)
with
fluid
.
device_guard
(
"cpu"
):
shape
=
fluid
.
layers
.
slice
(
shape
,
axes
=
[
0
],
starts
=
[
0
],
ends
=
[
4
])
out
=
fluid
.
layers
.
crop_tensor
(
data1
,
shape
=
shape
)
place
=
fluid
.
CUDAPlace
(
0
)
exe
=
fluid
.
Executor
(
place
)
exe
.
run
(
fluid
.
default_startup_program
())
with
profiler
.
profiler
(
'All'
,
'total'
)
as
prof
:
for
i
in
range
(
10
):
result
=
exe
.
run
(
fetch_list
=
[
out
])
```
再次观察profile report中
`GpuMemCpy Summary`
的内容,可以看到
`GpuMemCpySync`
已经被消除。在实际的模型中,若
`GpuMemCpySync`
调用耗时占比较大,并且可以通过设置
`device_guard`
避免,那么就能够带来一定的性能提升。
```
text
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 14.5345
Computation time Total: 4.47587 Ratio: 30.7948%
Framework overhead Total: 10.0586 Ratio: 69.2052%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 10 Total: 0.457033 Ratio: 3.14447%
GpuMemcpyAsync Calls: 10 Total: 0.457033 Ratio: 3.14447%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
FastThreadedSSAGraphExecutorPrepare 10 7.70113 7.689066 (0.998433) 0.012064 (0.001567) 0.032657 7.39363 0.770113 0.529852
fill_constant 20 2.62299 2.587022 (0.986287) 0.035968 (0.013713) 0.071097 0.342082 0.13115 0.180466
shape 10 1.93504 1.935040 (1.000000) 0.000000 (0.000000) 0.026774 1.6016 0.193504 0.133134
Fetch 10 0.880496 0.858512 (0.975032) 0.021984 (0.024968) 0.07392 0.140896 0.0880496 0.0605797
GpuMemcpyAsync:GPU->CPU 10 0.457033 0.435049 (0.951898) 0.021984 (0.048102) 0.037836 0.071424 0.0457033 0.0314447
crop_tensor 10 0.705426 0.671506 (0.951916) 0.033920 (0.048084) 0.05841 0.123901 0.0705426 0.0485346
slice 10 0.324241 0.324241 (1.000000) 0.000000 (0.000000) 0.024299 0.07213 0.0324241 0.0223084
eager_deletion 30 0.250524 0.250524 (1.000000) 0.000000 (0.000000) 0.004171 0.016235 0.0083508 0.0172365
ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.047794 0.047794 (1.000000) 0.000000 (0.000000) 0.003344 0.014131 0.0047794 0.00328831
InitLocalVars 1 0.034629 0.034629 (1.000000) 0.000000 (0.000000) 0.034629 0.034629 0.034629 0.00238254
ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.032231 0.032231 (1.000000) 0.000000 (0.000000) 0.002952 0.004076 0.0032231 0.00221755
```
### 总结
-
使用profile工具对模型进行分析,看是否存在GpuMemcpySync的调用耗时。若存在,则进一步分析发生数据传输的原因。
-
可以通过profile report找到发生GpuMemcpySync的OP。如果需要,可以通过打印log,找到GpuMemcpySync发生的具体位置。
-
尝试使用
`device_guard`
设置部分OP的运行设备,来减少GpuMemcpySync的调用。
-
最后可以通过比较修改前后模型的profile report,或者其他用来衡量性能的指标,确认修改后是否带来了性能提升。
doc/fluid/advanced_guide/performance_improving/index_cn.rst
浏览文件 @
377cc37f
...
...
@@ -7,6 +7,7 @@
singlenode_training_improving/training_best_practice.rst
singlenode_training_improving/memory_optimize.rst
device_switching/device_switching.md
multinode_training_improving/cpu_train_best_practice.rst
multinode_training_improving/dist_training_gpu.rst
multinode_training_improving/gpu_training_with_recompute.rst
...
...
编辑
预览
Markdown
is supported
0%
请重试
或
添加新附件
.
添加附件
取消
You are about to add
0
people
to the discussion. Proceed with caution.
先完成此消息的编辑!
取消
想要评论请
注册
或
登录