From 5dbd537048f153863d1cac0fbb433bb62fe561b4 Mon Sep 17 00:00:00 2001 From: Yancey Date: Fri, 12 Jan 2018 11:25:53 +0800 Subject: [PATCH] Fluid distributed training benchmark (#7410) * add cluster training bencharmk design * update by comment * update by comment --- benchmark/cluster/README.md | 78 +++++++++++++++++++++++++++++++++++++ 1 file changed, 78 insertions(+) create mode 100644 benchmark/cluster/README.md diff --git a/benchmark/cluster/README.md b/benchmark/cluster/README.md new file mode 100644 index 0000000000..b619613ea7 --- /dev/null +++ b/benchmark/cluster/README.md @@ -0,0 +1,78 @@ +# Cluster Training Benchmark + +## Setup + +- Platform + - Kubernetes: v1.6.2 + - Linux Kernel: v3.10.0 + +- Resource + - CPU: 10 Cores per Pod + - Memory: 5GB per Pod + +- Docker Image + + We use different base Docker Image to run the benchmark on Kubernetes: + - PaddlePaddle v2: paddlepaddle/paddle:0.11.0 + - PaddlePaddle Fluid: paddlepaddle/paddle:[commit-id] + - TensorFlow: tensorflow/tensorflow:1.5.0-rc0 + +- Model + vgg16 is used in this benchmark. + +## Cases + +- Variable + - Batch Size of training data. + - PServer count of the training job. + - The number of trainers. + +- Invariant + - The resource of trainer/pserver Pod. + +### Measure the Performance for Different Batch Size + +- PServer Count: 40 +- Trainer Count: 100 +- Metrics: mini-batch / sec + +| Batch Size | 32 | 64 | 128 | 256 | +| -- | -- | -- | -- | -- | +| PaddlePaddle Fluid | - | - | - | - | +| PaddlePaddle v2 | - | - | - | - | +| TensorFlow | - | - | - | - | + +### Measure the Performance for Different PServer Count + +- Trainer Count: 100 +- Batch Size: 64 +- Metrics: mini-batch / sec + +| PServer Count | 10 | 20 | 40 | 60 | +| -- | -- | -- | -- | -- | +| PaddlePaddle Fluid | - | - | - | - | +| PaddlePaddle v2 | - | - | - | - | +| TensorFlow | - | - | - | - | + +### Measure Parallel Efficiency By Increasing Trainer Count + +- PServer Count: 20 +- Batch Size: 64 +- Metrics: + +$S = \div(T1, TN)$ + +which S is the ratio of T1 over TN, training time of 1 and N trainers. +The parallel efficiency is: + +$E = \div(S, N)$ + +| Trainer Counter | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | +| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - | +| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - | - | +| TensorFlow | - | - | - | - | - | - | - | - | - | - | - | - | - | + +## Reproduce the benchmark + +TODO -- GitLab