Commit bf1378f3 authored by ghostplant, committed by François Chollet

In-place split to avoid inter-device duplication (#10230)

New benchmarks with in-place input splitting:

>> keras.applications.ResNet50 224x224x3 (NCHW; NVIDIA Tesla P100 x 4)
 input_shape = 3x224x224, batch_size =  96 x 4: 392(images/sec) => 417(images/sec)
 input_shape = 3x299x299, batch_size =  64 x 4: 229(images/sec) => 244(images/sec)
 input_shape = 3x224x224, batch_size =   8 x 4: 148(images/sec) => 163(images/sec)

>> keras.applications.InceptionV3 (NCHW; NVIDIA Tesla P100 x 4)
 input_shape = 3x224x224, batch_size = 128 x 4: 488(images/sec) => 526(images/sec)
 input_shape = 3x299x299, batch_size =  96 x 4: 270(images/sec) => 294(images/sec)
 input_shape = 3x224x224, batch_size =   8 x 4: 146(images/sec) => 158(images/sec)
Signed-off-by: CUI Wei <ghostplant@qq.com>
Parent 14ff5175
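For context, a minimal sketch of how the benchmarked setup above could be driven through `multi_gpu_model`. The channels_last data format, SGD optimizer, synthetic data, and single epoch are assumptions made to keep the sketch self-contained (the benchmark itself used NCHW); the commit only changes how `multi_gpu_model` slices its inputs, not how it is called.

```python
# Hypothetical driver for the ResNet50 "batch_size = 96 x 4" row above.
import numpy as np
from keras.applications import ResNet50
from keras.utils import multi_gpu_model

num_gpus = 4
per_gpu_batch = 96

# Template model whose weights are shared by all replicas.
model = ResNet50(weights=None, input_shape=(224, 224, 3))

# Replicate across 4 GPUs; with this commit, each input is sliced in place
# on the device that holds it instead of being copied whole to every GPU.
parallel_model = multi_gpu_model(model, gpus=num_gpus)
parallel_model.compile(optimizer='sgd', loss='categorical_crossentropy')

# Synthetic data sized to the global batch (96 * 4 = 384 images).
x = np.random.rand(per_gpu_batch * num_gpus, 224, 224, 3).astype('float32')
y = np.random.rand(per_gpu_batch * num_gpus, 1000).astype('float32')
parallel_model.fit(x, y, batch_size=per_gpu_batch * num_gpus, epochs=1)
```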
@@ -210,12 +210,16 @@ def multi_gpu_model(model, gpus=None, cpu_merge=True, cpu_relocation=False):
                 inputs = []
                 # Retrieve a slice of the input.
                 for x in model.inputs:
-                    input_shape = K.int_shape(x)[1:]
-                    slice_i = Lambda(get_slice,
-                                     output_shape=input_shape,
-                                     arguments={'i': i,
-                                                'parts': num_gpus})(x)
-                    inputs.append(slice_i)
+                    # In-place input splitting which is not only
+                    # 5% ~ 12% faster but also less GPU memory
+                    # duplication.
+                    with tf.device(x.device):
+                        input_shape = K.int_shape(x)[1:]
+                        slice_i = Lambda(get_slice,
+                                         output_shape=input_shape,
+                                         arguments={'i': i,
+                                                    'parts': num_gpus})(x)
+                        inputs.append(slice_i)
 
                 # Apply model on slice
                 # (creating a model replica on the target device).
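The point of `with tf.device(x.device)` in the hunk above is to pin the slicing ops to the device that already holds the full batch, so each GPU receives only its slice rather than a copy of the whole input. Below is a minimal standalone sketch of that idea in TF1-style graph code; the device names, placeholder shape, and the naive equal split are illustrative and do not reproduce Keras' actual `get_slice` helper.

```python
import tensorflow as tf

num_gpus = 4

with tf.device('/cpu:0'):
    # Full input batch, produced (e.g. fed) on the CPU.
    x = tf.placeholder(tf.float32, shape=(None, 224, 224, 3), name='full_batch')

slices = []
for i in range(num_gpus):
    # Pin the slicing ops to the device of the full batch. Without this,
    # they would default to the replica's GPU, forcing a copy of the
    # entire batch onto every GPU before slicing.
    with tf.device(x.device):
        batch_size = tf.shape(x)[0]
        step = batch_size // num_gpus
        start = i * step
        # Give the last replica whatever remains, so uneven batches work.
        end = start + step if i < num_gpus - 1 else batch_size
        slices.append(x[start:end])

# Each entry of `slices` can now feed the model replica on '/gpu:i';
# only that slice is transferred to the corresponding device.
```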