diff --git a/benchmark/cluster/README.md b/benchmark/cluster/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b619613ea7a5b6e940ec735314e8e47338b2c600
--- /dev/null
+++ b/benchmark/cluster/README.md
@@ -0,0 +1,78 @@
+# Cluster Training Benchmark
+
+## Setup
+
+- Platform
+  - Kubernetes: v1.6.2
+  - Linux Kernel: v3.10.0
+
+- Resource
+  - CPU: 10 Cores per Pod
+  - Memory: 5 GB per Pod
+
+- Docker Image
+
+  We use a different base Docker image for each framework to run the benchmark on Kubernetes:
+  - PaddlePaddle v2: paddlepaddle/paddle:0.11.0
+  - PaddlePaddle Fluid: paddlepaddle/paddle:[commit-id]
+  - TensorFlow: tensorflow/tensorflow:1.5.0-rc0
+
+- Model
+  vgg16 is used in this benchmark.
+
+## Cases
+
+- Variable
+  - Batch size of the training data.
+  - PServer count of the training job.
+  - Trainer count of the training job.
+
+- Invariant
+  - The resources of each trainer/pserver Pod.
+
+### Measure the Performance for Different Batch Sizes
+
+- PServer Count: 40
+- Trainer Count: 100
+- Metric: mini-batches / sec
+
+| Batch Size | 32 | 64 | 128 | 256 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - |
+| TensorFlow | - | - | - | - |
+
+### Measure the Performance for Different PServer Counts
+
+- Trainer Count: 100
+- Batch Size: 64
+- Metric: mini-batches / sec
+
+| PServer Count | 10 | 20 | 40 | 60 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - |
+| TensorFlow | - | - | - | - |
+
+### Measure Parallel Efficiency by Increasing Trainer Count
+
+- PServer Count: 20
+- Batch Size: 64
+- Metrics:
+
+$S = \frac{T_1}{T_N}$
+
+where $S$ is the speedup, the ratio of $T_1$ to $T_N$, the training times with 1 and $N$ trainers respectively.
+The parallel efficiency is:
+
+$E = \frac{S}{N}$
+
+| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
+| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
+| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |
+
+## Reproduce the benchmark
+
+TODO
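
The parallel-efficiency case in this README defines speedup S as the training time with 1 trainer divided by the training time with N trainers, and efficiency E as S divided by N. A minimal Python sketch of that arithmetic follows. It is not part of the patch; the function names and the timings are hypothetical, chosen only to illustrate the formulas, and are not measured results.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup S = T_1 / T_N, where T_1 and T_N are training times
    measured with 1 trainer and with N trainers."""
    if t1 <= 0 or tn <= 0:
        raise ValueError("training times must be positive")
    return t1 / tn


def parallel_efficiency(t1: float, tn: float, n: int) -> float:
    """Parallel efficiency E = S / N for a run with N trainers."""
    if n < 1:
        raise ValueError("trainer count must be at least 1")
    return speedup(t1, tn) / n


if __name__ == "__main__":
    # Hypothetical timings in seconds, keyed by trainer count.
    # These are illustrative values, NOT benchmark results.
    example_times = {1: 1000.0, 10: 125.0, 20: 80.0}
    t1 = example_times[1]
    for n, tn in example_times.items():
        s = speedup(t1, tn)
        e = parallel_efficiency(t1, tn, n)
        print(f"N={n}: S={s:.2f}, E={e:.2f}")
```

An efficiency of 1.0 corresponds to perfectly linear scaling; values below 1.0 indicate the fraction of the ideal speedup actually achieved at that trainer count.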