Benchmark of Fluid CPU Multi-Threading
Created by: luotao1
Environment
- Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 sockets, 40 cores in total
- Model: ResNet50
- Input: 3 * 224 * 224
- Docker images:
  - paddlepaddle/paddle:latest (for MKLML)
  - paddlepaddle/paddle:latest-openblas (for OpenBLAS)
- v2 results: https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/IntelOptimizedPaddle.md
- scripts: https://github.com/chengduoZH/benchmark/tree/add_resnet_50_v2/fluid/ResNet_50
Training
Ensure that CPU_NUM * batch_size_per_trainer = batch_size. For example, with batch_size=64 and 8 threads:
export CPU_NUM=8
python train_resnet.py --display_step=1 --warmup=0 --use_gpu=false --number_iteration=20 --skip_first_steps=5 --batch_size_per_trainer=8
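The constraint CPU_NUM * batch_size_per_trainer = batch_size can be checked before launching a run. The following is a minimal sketch; `per_trainer_batch_size` is a hypothetical helper written for illustration and is not part of `train_resnet.py`.

```python
def per_trainer_batch_size(batch_size: int, cpu_num: int) -> int:
    """Return the --batch_size_per_trainer value for a given total batch size.

    Hypothetical helper: enforces cpu_num * batch_size_per_trainer == batch_size.
    """
    if batch_size <= 0 or cpu_num <= 0:
        raise ValueError("batch_size and cpu_num must be positive")
    if batch_size % cpu_num != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by CPU_NUM={cpu_num}"
        )
    return batch_size // cpu_num


# Example from the text: batch_size=64 with 8 threads.
print(per_trainer_batch_size(64, 8))
```

A batch size that does not divide evenly across threads raises an error instead of silently running with a different total batch size.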
The results (images/second) are as follows:
Cores | 1 | 8 | 16 | 32 | 40 |
---|---|---|---|---|---|
OpenBLAS (Fluid) | 1.6342 | 10.7300 | 18.9921 | 30.3432 | --- |
MKLML (Fluid) | 2.6408 | 15.874 | 28.2036 | 33.9912 | --- |
OpenBLAS (V2-0.11.0) | --- | --- | --- | --- | 25.22 |
MKLML (V2-0.11.0) | --- | --- | --- | --- | 32.52 |
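One way to read the Fluid rows of the training table is as speedup over the 1-core run and the resulting parallel efficiency (speedup divided by core count). The sketch below uses only the numbers from the table. It assumes the 1-core result is a fair baseline, which this document does not state, so treat the output as a rough indicator. The function and variable names are hypothetical.

```python
from typing import Dict, Tuple

# Training throughput (images/second) for Fluid, copied from the table above.
FLUID_TRAINING: Dict[str, Dict[int, float]] = {
    "OpenBLAS": {1: 1.6342, 8: 10.7300, 16: 18.9921, 32: 30.3432},
    "MKLML": {1: 2.6408, 8: 15.874, 16: 28.2036, 32: 33.9912},
}


def scaling(throughput: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Map core count -> (speedup over 1 core, parallel efficiency)."""
    baseline = throughput[1]
    result = {}
    for cores, images_per_sec in sorted(throughput.items()):
        speedup = images_per_sec / baseline
        result[cores] = (speedup, speedup / cores)
    return result


for backend, throughput in FLUID_TRAINING.items():
    for cores, (speedup, efficiency) in scaling(throughput).items():
        print(f"{backend:8s} cores={cores:2d} "
              f"speedup={speedup:6.2f} efficiency={efficiency:5.2f}")
```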
Inference
Ensure that --batch_size_per_trainer=1 and change only CPU_NUM. For example, with batch_size=8 and 8 threads:
export CPU_NUM=8
python train_resnet.py --display_step=1 --warmup=0 --use_gpu=false --number_iteration=20 --skip_first_steps=5 --batch_size_per_trainer=1 --with_test=True
The results (images/second) are as follows:
BatchSize | 1 | 2 | 4 | 8 | 16 |
---|---|---|---|---|---|
OpenBLAS (Fluid) | 4.7254 | 8.6016 | 16.7441 | 33.9707 | 59.2398 |
MKLML (Fluid) | 8.4839 | 13.60 | 24.8657 | 48.3597 | 74.5651 |
OpenBLAS (V2-0.11.0) | 3.31 | 6.72 | 11.59 | 13.17 | 9.27 |
MKLML (V2-0.11.0) | 6.33 | 12.02 | 22.88 | 40.53 | 63.09 |
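To compare Fluid against V2-0.11.0 at each batch size, the inference throughputs can be divided row by row. The sketch below uses only the numbers from the inference table; the names are hypothetical and the ratio is simply Fluid images/second divided by V2 images/second.

```python
from typing import Dict, List

BATCH_SIZES: List[int] = [1, 2, 4, 8, 16]

# Inference throughput (images/second), copied from the table above.
INFERENCE: Dict[str, Dict[str, List[float]]] = {
    "OpenBLAS": {
        "fluid": [4.7254, 8.6016, 16.7441, 33.9707, 59.2398],
        "v2": [3.31, 6.72, 11.59, 13.17, 9.27],
    },
    "MKLML": {
        "fluid": [8.4839, 13.60, 24.8657, 48.3597, 74.5651],
        "v2": [6.33, 12.02, 22.88, 40.53, 63.09],
    },
}


def fluid_over_v2(backend: str) -> Dict[int, float]:
    """Map batch size -> Fluid throughput divided by V2 throughput."""
    rows = INFERENCE[backend]
    return {
        batch: fluid / v2
        for batch, fluid, v2 in zip(BATCH_SIZES, rows["fluid"], rows["v2"])
    }


for backend in INFERENCE:
    for batch, ratio in fluid_over_v2(backend).items():
        print(f"{backend:8s} batch_size={batch:2d} fluid/v2={ratio:5.2f}")
```

A ratio above 1 means Fluid processed more images per second than V2 for that backend and batch size.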