Commit bf1378f3 authored by ghostplant, committed by François Chollet

In-place split to avoid inter-device duplication (#10230)

New benchmarks with in-place input splitting:

>> keras.applications.ResNet50 224x224x3 (NCHW; NVIDIA Tesla P100 x 4)
 input_shape = 3x224x224, batch_size =  96 x 4: 392(images/sec) => 417(images/sec)
 input_shape = 3x299x299, batch_size =  64 x 4: 229(images/sec) => 244(images/sec)
 input_shape = 3x224x224, batch_size =   8 x 4: 148(images/sec) => 163(images/sec)

>> keras.applications.InceptionV3 (NCHW; NVIDIA Tesla P100 x 4)
 input_shape = 3x224x224, batch_size = 128 x 4: 488(images/sec) => 526(images/sec)
 input_shape = 3x299x299, batch_size =  96 x 4: 270(images/sec) => 294(images/sec)
 input_shape = 3x224x224, batch_size =   8 x 4: 146(images/sec) => 158(images/sec)
Signed-off-by: CUI Wei <ghostplant@qq.com>
Parent 14ff5175
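For context, a minimal sketch of how the benchmarked setup above could be driven through `multi_gpu_model`. The channels_last data format, SGD optimizer, synthetic data, and single epoch are assumptions made to keep the sketch self-contained (the benchmark itself used NCHW); the commit only changes how `multi_gpu_model` slices its inputs, not how it is called.

```python
# Hypothetical driver for the ResNet50 "batch_size = 96 x 4" row above.
import numpy as np
from keras.applications import ResNet50
from keras.utils import multi_gpu_model

num_gpus = 4
per_gpu_batch = 96

# Template model whose weights are shared by all replicas.
model = ResNet50(weights=None, input_shape=(224, 224, 3))

# Replicate across 4 GPUs; with this commit, each input is sliced in place
# on the device that holds it instead of being copied whole to every GPU.
parallel_model = multi_gpu_model(model, gpus=num_gpus)
parallel_model.compile(optimizer='sgd', loss='categorical_crossentropy')

# Synthetic data sized to the global batch (96 * 4 = 384 images).
x = np.random.rand(per_gpu_batch * num_gpus, 224, 224, 3).astype('float32')
y = np.random.rand(per_gpu_batch * num_gpus, 1000).astype('float32')
parallel_model.fit(x, y, batch_size=per_gpu_batch * num_gpus, epochs=1)
```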
@@ -210,12 +210,16 @@ def multi_gpu_model(model, gpus=None, cpu_merge=True, cpu_relocation=False):
                 inputs = []
                 # Retrieve a slice of the input.
                 for x in model.inputs:
-                    input_shape = K.int_shape(x)[1:]
-                    slice_i = Lambda(get_slice,
-                                     output_shape=input_shape,
-                                     arguments={'i': i,
-                                                'parts': num_gpus})(x)
-                    inputs.append(slice_i)
+                    # In-place input splitting which is not only
+                    # 5% ~ 12% faster but also less GPU memory
+                    # duplication.
+                    with tf.device(x.device):
+                        input_shape = K.int_shape(x)[1:]
+                        slice_i = Lambda(get_slice,
+                                         output_shape=input_shape,
+                                         arguments={'i': i,
+                                                    'parts': num_gpus})(x)
+                        inputs.append(slice_i)
 
                 # Apply model on slice
                 # (creating a model replica on the target device).
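The point of `with tf.device(x.device)` in the hunk above is to pin the slicing ops to the device that already holds the full batch, so each GPU receives only its slice rather than a copy of the whole input. Below is a minimal standalone sketch of that idea in TF1-style graph code; the device names, placeholder shape, and the naive equal split are illustrative and do not reproduce Keras' actual `get_slice` helper.

```python
import tensorflow as tf

num_gpus = 4

with tf.device('/cpu:0'):
    # Full input batch, produced (e.g. fed) on the CPU.
    x = tf.placeholder(tf.float32, shape=(None, 224, 224, 3), name='full_batch')

slices = []
for i in range(num_gpus):
    # Pin the slicing ops to the device of the full batch. Without this,
    # they would default to the replica's GPU, forcing a copy of the
    # entire batch onto every GPU before slicing.
    with tf.device(x.device):
        batch_size = tf.shape(x)[0]
        step = batch_size // num_gpus
        start = i * step
        # Give the last replica whatever remains, so uneven batches work.
        end = start + step if i < num_gpus - 1 else batch_size
        slices.append(x[start:end])

# Each entry of `slices` can now feed the model replica on '/gpu:i';
# only that slice is transferred to the corresponding device.
```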