# Cluster Training Benchmark
## Setup
- Platform
  - Kubernetes: v1.6.2
  - Linux Kernel: v3.10.0
- Resource (see the Pod spec sketch below this list)
  - CPU: 10 cores per Pod
  - Memory: 5 GB per Pod
- Docker Image

  We use different base Docker images to run the benchmark on Kubernetes:
  - PaddlePaddle v2: paddlepaddle/paddle:0.11.0
  - PaddlePaddle Fluid: paddlepaddle/paddle:[commit-id]
  - TensorFlow: tensorflow/tensorflow:1.5.0-rc0
- Model

  VGG-16 is used in this benchmark.
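The per-Pod CPU and memory limits listed under Resource could be expressed as in the minimal sketch below, which uses the official Kubernetes Python client. The container name is hypothetical and the use of the Python client is an assumption for illustration; it is not how the benchmark jobs are necessarily submitted.

```python
# Minimal sketch (not from the benchmark repo): expressing the per-Pod
# resource limits from the Setup section with the Kubernetes Python client.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "10", "memory": "5Gi"},  # 10 cores, ~5 GB per Pod
    limits={"cpu": "10", "memory": "5Gi"},
)

trainer_container = client.V1Container(
    name="trainer",                      # hypothetical container name
    image="paddlepaddle/paddle:0.11.0",  # PaddlePaddle v2 image from Setup
    resources=resources,
)
```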
## Cases
- Variables
  - Batch size of the training data.
  - Number of PServers in the training job.
  - Number of trainers.
- Invariants
  - Resources of the trainer/pserver Pods.
### Measure the Performance for Different Batch Sizes
- PServer Count: 40
- Trainer Count: 100
- Metrics: mini-batches / sec
| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
### Measure the Performance for Different PServer Counts
- Trainer Count: 100
- Batch Size: 64
- Metrics: mini-batches / sec
| PServer Count | 10 | 20 | 40 | 60 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
### Measure Parallel Efficiency By Increasing Trainer Count
- PServer Count: 20
- Batch Size: 64
- Metrics:

  $S = \frac{T_1}{T_N}$

  where $S$ is the speedup: the ratio of $T_1$, the training time with 1 trainer, to $T_N$, the training time with $N$ trainers.

  The parallel efficiency is:

  $E = \frac{S}{N}$
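As a worked reference for the formulas above, here is a minimal sketch of computing $S$ and $E$ from measured training times. The helper functions and the numbers in the example are illustrative placeholders, not part of the benchmark code or its results.

```python
# Sketch: speedup S = T1 / TN and parallel efficiency E = S / N,
# computed from measured training times.
def speedup(t_1: float, t_n: float) -> float:
    """S = T1 / TN, where T1 is the training time with 1 trainer and TN with N trainers."""
    return t_1 / t_n

def parallel_efficiency(t_1: float, t_n: float, n: int) -> float:
    """E = S / N for N trainers."""
    return speedup(t_1, t_n) / n

# Illustrative placeholder values only: 1 trainer takes 4000 s, 40 trainers take 120 s.
print(parallel_efficiency(4000.0, 120.0, 40))  # ~0.83
```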
| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |
## Reproduce the benchmark
TODO