# Benchmark

Machine:

- Server: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket
- Laptop: TBD

System: CentOS release 6.3 (Final), Docker 1.12.1.

PaddlePaddle: (TODO: will rerun after 0.11.0)
- paddlepaddle/paddle:latest (for MKLML and MKL-DNN)
  - MKL-DNN tag v0.11
  - MKLML 2018.0.1.20171007
- paddlepaddle/paddle:latest-openblas (for OpenBLAS)
  - OpenBLAS v0.2.20
	 
On each machine, we test and compare single-node training performance using MKL-DNN, MKLML, and OpenBLAS respectively.

## Benchmark Model

### Server

#### Training
Tested with batch sizes 64, 128, and 256 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz.
Note that the speeds below include forward, backward, and parameter update time, so they cannot be compared directly with the results of the Caffe `time` [command](https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/caffe/image/run.sh#L9), which covers only the forward and backward passes. Parameter updates become very expensive when the weights are large, especially for AlexNet.

Input image size: 3 * 224 * 224. Unit: images/second.
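
Because the tables report throughput rather than latency, it can help to convert a figure into time per iteration before setting it beside a tool that reports milliseconds. The sketch below is illustrative only and is not part of the benchmark scripts; the function name `batch_time_ms` is hypothetical, and the sample numbers are the VGG-19 MKL-DNN training results from this section. The derived time still includes the parameter update, so it is not equivalent to a forward/backward-only measurement.

```python
def batch_time_ms(batch_size: int, images_per_second: float) -> float:
    """Average wall-clock time for one iteration, in milliseconds.

    Derived purely from throughput: time = batch_size / images_per_second.
    """
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    return 1000.0 * batch_size / images_per_second


# VGG-19 training throughput with MKL-DNN (images/second), keyed by batch size.
vgg19_mkldnn = {64: 28.46, 128: 29.83, 256: 30.44}

for batch_size, throughput in vgg19_mkldnn.items():
    ms = batch_time_ms(batch_size, throughput)
    print(f"batch {batch_size}: {ms:.0f} ms per iteration")
```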

- VGG-19

| BatchSize    | 64    | 128  | 256     |
|--------------|-------| -----| --------|
| OpenBLAS     | 7.80  | 9.00  | 10.80  | 
| MKLML        | 12.12 | 13.70 | 16.18  |
| MKL-DNN      | 28.46 | 29.83 | 30.44  |

<img src="figs/vgg-cpu-train.png" width="500">

- ResNet-50

| BatchSize    | 64    | 128   | 256    |
|--------------|-------| ------| -------|
| OpenBLAS     | 25.22 | 25.68 | 27.12  | 
| MKLML        | 32.52 | 31.89 | 33.12  |
| MKL-DNN      | 81.69 | 82.35 | 84.08  |

<img src="figs/resnet-cpu-train.png" width="500">

- GoogLeNet

| BatchSize    | 64    | 128   | 256    |
|--------------|-------| ------| -------|
| OpenBLAS     | 89.52 | 96.97 | 108.25 | 
| MKLML        | 128.46| 137.89| 158.63 |
| MKL-DNN      | 250.46| 264.83| 269.50 |

<img src="figs/googlenet-cpu-train.png" width="500">

- AlexNet

| BatchSize    | 64     | 128    | 256    |
|--------------|--------| ------ | -------|
| OpenBLAS     | 2.13   | 2.45   | 2.68   | 
| MKLML        | 66.37  | 105.60 | 144.04 |
| MKL-DNN      | 399.00 | 498.94 | 626.53 | 

chart TBD
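
Relative speedups follow directly from the training tables above. As a minimal sketch (not part of the original benchmark tooling), the snippet below divides one backend's throughput by another's at batch size 256. The names `train_bs256` and `speedup` are hypothetical; the numbers are copied from the 256 columns of the four training tables.

```python
# Training throughput (images/second) at batch size 256, from the tables above.
train_bs256 = {
    "VGG-19":    {"OpenBLAS": 10.80,  "MKLML": 16.18,  "MKL-DNN": 30.44},
    "ResNet-50": {"OpenBLAS": 27.12,  "MKLML": 33.12,  "MKL-DNN": 84.08},
    "GoogLeNet": {"OpenBLAS": 108.25, "MKLML": 158.63, "MKL-DNN": 269.50},
    "AlexNet":   {"OpenBLAS": 2.68,   "MKLML": 144.04, "MKL-DNN": 626.53},
}


def speedup(results, target="MKL-DNN", baseline="OpenBLAS"):
    """Return target/baseline throughput ratio for every model in `results`."""
    return {model: row[target] / row[baseline] for model, row in results.items()}


for baseline in ("OpenBLAS", "MKLML"):
    print(f"MKL-DNN vs {baseline}:")
    for model, ratio in speedup(train_bs256, baseline=baseline).items():
        print(f"  {model}: {ratio:.2f}x")
```

The AlexNet ratio against OpenBLAS is far larger than the others, which is consistent with the earlier note that parameter updates dominate when the weights are large.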

#### Inference
Tested with batch sizes 1, 2, 4, 8, and 16 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz.
- VGG-19

| BatchSize | 1     | 2     | 4     | 8     | 16    |
|-----------|-------|-------|-------|-------|-------|
| OpenBLAS  | 1.07  | 1.08  | 1.06  | 0.88  | 0.65  |
| MKLML     | 5.58  | 9.80  | 15.15 | 21.21 | 28.67 |
| MKL-DNN   | 75.07 | 88.64 | 82.58 | 92.29 | 96.75 |

- ResNet-50

| BatchSize | 1     | 2      | 4      | 8      | 16     |
|-----------|-------|--------|--------|--------|--------|
| OpenBLAS  | 3.35  | 3.19   | 3.09   | 2.55   | 1.96   |
| MKLML     | 6.33  | 12.02  | 22.88  | 40.53  | 63.09  |
| MKL-DNN   | 107.83| 148.84 | 177.78 | 189.35 | 217.69 |


- GoogLeNet

| BatchSize | 1      | 2      | 4      | 8      | 16     |
|-----------|--------|--------|--------|--------|--------|
| OpenBLAS  | 12.04  | 11.31  | 10.00  | 9.07   | 4.34   |
| MKLML     | 22.74  | 41.56  | 81.22  | 133.47 | 210.53 |
| MKL-DNN   | 175.10 | 272.92 | 450.70 | 512.00 | 600.94 |


### Laptop
TBD