# Benchmark

Machine: 

- CPU: 12-core Intel(R) Xeon(R) CPU E5-2620 v2 @2.10GHz
- GPU: Tesla K40m
- cuDNN: v5.1
- System: Docker 1.12.1. All platforms are tested in a Docker environment.

Platforms: 

- PaddlePaddle: paddledev/paddle:gpu-devel-v0.9.0a0 
- TensorFlow: gcr.io/tensorflow/tensorflow:0.11.0rc0-gpu
- Caffe: kaixhin/cuda-caffe

Several convolutional and recurrent neural networks are used as benchmark models.

## Image

### Benchmark Model

The benchmark models are AlexNet, GoogleNet, and a small network (SmallNet) taken from the Caffe examples.

- [AlexNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet): with the convolution group size set to one.

- [GoogleNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet): with loss1 and loss2 removed for benchmarking.

- [SmallNet](https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_quick_train_test.prototxt)


### Single-GPU

- AlexNet:  input - 3 * 227 * 227,  Time: ms/batch

| BatchSize    | 64  | 128  | 256   | 512  |
|--------------|-----| -----| ------| -----|
| PaddlePaddle | 195 | 334  | 602   | 1629 |
| TensorFlow   | 223 | 364  | 645   | 1235 |
| Caffe        | 324 | 627  | 1232  | 2513 |
 
**Note**

All platforms use cuDNN v5.1. Caffe is slower in this experiment because the workspace size limit it passes to the cuDNN convolution interface is 8 * 1024 * 1024 bytes, which is smaller than the limit used in PaddlePaddle and TensorFlow. Caffe would be faster with a larger workspace limit.
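The ms/batch figures in the AlexNet table above can also be read as throughput. The following is a minimal sketch of that conversion; the names `ALEXNET_MS_PER_BATCH` and `images_per_second` are illustrative and are not part of any benchmark script.

```python
# AlexNet single-GPU timings copied from the table above (ms per batch).
ALEXNET_MS_PER_BATCH = {
    "PaddlePaddle": {64: 195, 128: 334, 256: 602, 512: 1629},
    "TensorFlow": {64: 223, 128: 364, 256: 645, 512: 1235},
    "Caffe": {64: 324, 128: 627, 256: 1232, 512: 2513},
}


def images_per_second(batch_size, ms_per_batch):
    """Convert a ms/batch timing into throughput in images per second."""
    return batch_size * 1000.0 / ms_per_batch


if __name__ == "__main__":
    for platform, timings in ALEXNET_MS_PER_BATCH.items():
        row = ", ".join(
            "bs=%d: %.1f img/s" % (bs, images_per_second(bs, ms))
            for bs, ms in sorted(timings.items())
        )
        print("%s -> %s" % (platform, row))
```

For instance, 195 ms for a batch of 64 corresponds to roughly 328 images per second.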
 
- GoogleNet: input - 3 * 224 * 224, Time: ms/batch


| BatchSize    | 64    |   128  | 256     |
|--------------|-------| -------| --------|
| PaddlePaddle | 613   | 1149   | 2348    |
| TensorFlow   | 644   | 1176   | 2219    |
| Caffe        | 694   | 1364   | out of memory   |

- SmallNet: input - 3 * 32 * 32, Time: ms/batch

| BatchSize    | 64     |   128    | 256     | 512     |
|--------------|--------| -------- | --------|---------|
| PaddlePaddle | 10.463 | 18.184   | 33.113  |  63.039 |
| TensorFlow   | 9     | 15       | 28      | 59       |
| Caffe        | 9.373  | 16.6606  | 31.4797 | 59.719  |

**Note**

All Caffe experiments are run with `caffe time`, which does not include the parameter update time, whereas the PaddlePaddle and TensorFlow timings do include it. Compared with the total time, however, the parameter update time is relatively small on a single machine.

TensorFlow implements its own algorithm search method instead of using the algorithm search interface in cuDNN.

### Multi-GPU: 4 GPUs

- AlexNet,  ms / batch

| total-BatchSize | 128 * 4  | 256 * 4    |
|------------------|----------| -----------|
| PaddlePaddle     | 347      | 622        |
| TensorFlow       | 377      | 675        |
| Caffe            | 1229     | 2435       |

For example, if `total-BatchSize = 128 * 4`, the speedup ratio is calculated as follows (using the PaddlePaddle timings):

```
  time_at_1gpu_batch_128 * 4 / time_at_4gpu_total_batch_512 
= (334 * 4)/347 
= 3.85
``` 
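The same calculation can be repeated for every platform and batch size. The sketch below applies the formula above to the AlexNet timings reported in this document; the names `SINGLE_GPU_MS`, `FOUR_GPU_MS`, and `speedup` are illustrative only.

```python
# AlexNet timings in ms/batch, copied from the tables in this document.
# Keys of the inner dicts are the per-GPU batch size.
SINGLE_GPU_MS = {
    "PaddlePaddle": {128: 334, 256: 602},
    "TensorFlow": {128: 364, 256: 645},
    "Caffe": {128: 627, 256: 1232},
}
FOUR_GPU_MS = {
    "PaddlePaddle": {128: 347, 256: 622},
    "TensorFlow": {128: 377, 256: 675},
    "Caffe": {128: 1229, 256: 2435},
}
NUM_GPUS = 4


def speedup(platform, per_gpu_batch):
    """Speedup = (1-GPU time * number of GPUs) / 4-GPU time at 4x the batch."""
    t_single = SINGLE_GPU_MS[platform][per_gpu_batch]
    t_multi = FOUR_GPU_MS[platform][per_gpu_batch]
    return t_single * NUM_GPUS / t_multi


if __name__ == "__main__":
    for platform in SINGLE_GPU_MS:
        for batch in sorted(SINGLE_GPU_MS[platform]):
            print("%s, %d * 4: %.2f" % (platform, batch, speedup(platform, batch)))
```

A ratio close to 4 means near-linear scaling across the 4 GPUs.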

<img src="figs/alexnet-4gpu.png" width="420">


- GoogleNet, ms / batch

| total-BatchSize  | 128 * 4      |  256 * 4    |
|-------------------|--------------| ----------- |
| PaddlePaddle      | 1178         | 2367        |
| TensorFlow        | 1210         | 2292        |
| Caffe             | 2007         | out of memory  |

<img src="figs/googlenet-4gpu.png" width="420">


## RNN
We use an LSTM network for text classification as the benchmark.

### Dataset
-  [IMDB](http://www.iro.umontreal.ca/~lisa/deep/data/imdb.pkl)
- Sequence length is 100. PaddlePaddle supports training with variable-length sequences, but TensorFlow requires padding, so for a fair comparison we also pad sequences to length 100 in PaddlePaddle.
- Dictionary size=30000 
- Peephole connection is used in `lstmemory` by default in PaddlePaddle. It is also configured in TensorFlow.
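The fixed-length preprocessing described in the list above can be sketched as follows. This is an illustration, not the benchmark's actual data pipeline: the padding id `0`, the truncation of longer sequences, and the helper name `pad_or_truncate` are all assumptions.

```python
MAX_LEN = 100  # fixed sequence length used in this benchmark
PAD_ID = 0     # assumption: word id 0 is reserved for padding


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return a list of exactly max_len ids.

    Shorter sequences are right-padded with pad_id. Longer sequences are
    truncated, which is an assumption; the document only mentions padding.
    """
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


if __name__ == "__main__":
    short = pad_or_truncate([12, 7, 431])
    print(len(short), short[:5])
```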

### Single-GPU

#### LSTM in Text Classification

We test a network of 2 LSTM layers followed by a fully connected layer (`2 lstm layer + fc`) with different hidden sizes and batch sizes.
  
- Batch size = 64, ms / batch
 
| hidden_size  | 256   | 512    |  1280   |
|--------------|-------| -------| --------|
| PaddlePaddle | 83    | 184    | 641     |
| TensorFlow   | 175   | 280    | 818     |

- Batch size = 128, ms / batch
 
| hidden_size  | 256    | 512    |  1280   |
|--------------|------- | -------| --------|
| PaddlePaddle | 110    | 261    | 1007    |
| TensorFlow   | 181    | 361    | 1237    |


- Batch size = 256, ms / batch
 
| hidden_size  | 256   | 512    |  1280   |
|--------------|-------| -------| --------|
| PaddlePaddle | 170   | 414    | 1655    |
| TensorFlow   | 238   | 536    | 1905    |

<img src="figs/rnn_lstm_cls.png" width="600">
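One way to read the three single-GPU LSTM tables together is as the ratio of TensorFlow time to PaddlePaddle time for each configuration. The sketch below only re-derives that ratio from the numbers reported above; the names `LSTM_MS` and `relative_time` are illustrative.

```python
# Single-GPU LSTM timings in ms/batch, copied from the tables above.
# Layout: batch size -> platform -> hidden size -> ms/batch.
LSTM_MS = {
    64: {
        "PaddlePaddle": {256: 83, 512: 184, 1280: 641},
        "TensorFlow": {256: 175, 512: 280, 1280: 818},
    },
    128: {
        "PaddlePaddle": {256: 110, 512: 261, 1280: 1007},
        "TensorFlow": {256: 181, 512: 361, 1280: 1237},
    },
    256: {
        "PaddlePaddle": {256: 170, 512: 414, 1280: 1655},
        "TensorFlow": {256: 238, 512: 536, 1280: 1905},
    },
}


def relative_time(batch_size, hidden_size):
    """TensorFlow ms/batch divided by PaddlePaddle ms/batch."""
    timings = LSTM_MS[batch_size]
    return timings["TensorFlow"][hidden_size] / float(timings["PaddlePaddle"][hidden_size])


if __name__ == "__main__":
    for batch_size in sorted(LSTM_MS):
        for hidden_size in (256, 512, 1280):
            ratio = relative_time(batch_size, hidden_size)
            print("batch=%d hidden=%d: %.2fx" % (batch_size, hidden_size, ratio))
```

In these measurements the ratio shrinks as the hidden size and batch size grow.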

#### Seq2Seq

A benchmark for the sequence-to-sequence network will be added later.
 

### Multi-GPU: 4 GPUs

#### LSTM in Text Classification

- hidden_size = 256, ms / batch
 
| batch_size   | 256    |  512    |
|--------------| -------| --------|
| PaddlePaddle | 90     | 118     |
| TensorFlow   | 226    | 118     |


- hidden_size = 512, ms / batch
 
| batch_size   | 256    |  512    |
|--------------| -------| --------|
| PaddlePaddle | 189    | 268     |
| TensorFlow   | 297    | 383     |


<img src="figs/rnn_lstm_4gpus.png" width="420">

#### Seq2Seq

A benchmark for the sequence-to-sequence network will be added later.