This repository provides OneFlow deep learning benchmark examples for CV, CTR and NLP. More models are on the way and will be added here when ready.
## [Convolutional Networks](./Classification/cnns) for Computer Vision Classification
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
```
## Test Descriptions
Four groups of tests were performed with different batch sizes per device: 32, 64 and 96 for BERT base, and 4 for BERT large.
Each group includes 6 tests with different numbers of devices: 1, 2, 4, 8, 16, 32.
`Throughput` (samples/sec) and `GPU Memory Usage` were logged and recorded.
All tests use the `Float32` data type; XLA is not applied.
## Test Scripts
Please clone or download the `BERT` folder from the [OneFlow-Benchmark repository](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/master/LanguageModeling/BERT).
We create two bash scripts alongside the `BERT` folder for this test:
1. `local_run.sh` - launches a local OneFlow instance with the specified number of nodes and GPUs per node
Normally, the first `throughput` value, e.g. `52.257`, is discarded because the start time of the first batch is not correct. We average the remaining `throughput` values and report the result as the throughput of this test.
## BERT base Pretrain Test Results
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/oneflow_bert_benchmark_logs.tgz)
### Group: batch size per device = 32
BERT Base Pretrain, batch size per device=32, dtype=float32, without XLA
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage | Throughput | Speedup |
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
```
## Test Descriptions
Two groups of tests were performed with different batch sizes per device: 128 and 160.
Each group includes 6 tests with different numbers of devices: 1, 2, 4, 8, 16, 32.
`Throughput` (images/sec) and `GPU Memory Usage` were logged and recorded.
All tests use the `Float32` data type; XLA is not applied.
## Test Scripts
Please clone or download the `cnns` folder from the [OneFlow-Benchmark repository](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/master/Classification/cnns).
We create two bash scripts alongside the `cnns` folder for this test:
1. `local_run.sh` - launches a local OneFlow instance with the specified number of nodes and GPUs per node
Note: Please make sure all servers can log in to each other automatically using SSH keys.
### Test Command Example
```
# test on 1 node with 4 gpus
./launch_all.sh 1 4
# test on 4 nodes with 8 gpus per node
./launch_all.sh 4 8
```
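The two arguments in the commands above are the node count and the number of GPUs per node, so the total device count is their product. As a rough sketch, the six device counts used in each test group could be mapped to launch commands as follows. The helper `node_layout` is hypothetical, and the limit of 8 GPUs per node is an assumption taken from the `4 8` example rather than a documented constraint.

```python
def node_layout(total_gpus, max_gpus_per_node=8):
    """Return (node_num, gpus_per_node) for a total device count.

    Assumption: a node holds at most `max_gpus_per_node` GPUs, and
    multi-node runs use fully populated nodes.
    """
    if total_gpus < 1:
        raise ValueError("total_gpus must be at least 1")
    if total_gpus <= max_gpus_per_node:
        # Everything fits on a single node.
        return 1, total_gpus
    if total_gpus % max_gpus_per_node != 0:
        raise ValueError("multi-node runs must use full nodes")
    return total_gpus // max_gpus_per_node, max_gpus_per_node


# Print one launch command per device count in a test group.
for total in (1, 2, 4, 8, 16, 32):
    nodes, gpus = node_layout(total)
    print(f"./launch_all.sh {nodes} {gpus}")
```

With these assumptions, 4 devices map to `./launch_all.sh 1 4` and 32 devices map to `./launch_all.sh 4 8`, matching the two examples above.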
### Calculate `Throughput` from Test Results
`Throughput(samples/s)` information, as well as `loss` and `top-k`, can be found in the `oneflow_temp` folder in the first node's home directory. There are two files:
1. `oneflow.log` - redirected stdout
2. `log/summary.csv` - same information in csv format
Taking `oneflow.log` as an instance, here is an example:
Normally, the first `samples/s` value, e.g. `288.088`, is discarded because the start time of the first batch is not correct. We average the remaining `samples/s` values and report the result as the throughput of this test.
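The averaging rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `average_throughput` and its `warmup` parameter are hypothetical, and it assumes the `samples/s` readings have already been extracted from `oneflow.log` or `log/summary.csv` into a list, in log order.

```python
from statistics import mean


def average_throughput(values, warmup=1):
    """Average throughput readings after dropping the first `warmup` values.

    `values` is a sequence of samples/s readings in the order they were logged.
    """
    steady = list(values)[warmup:]
    if not steady:
        raise ValueError("need at least one reading after the discarded values")
    return mean(steady)


# Usage sketch: only 288.088 comes from the text above; the other
# readings are made-up placeholders, not measured results.
# average_throughput([288.088, 300.0, 310.0, 320.0])
```

Keeping the number of discarded readings as a parameter makes it easy to drop more than one warm-up value if the first few batches turn out to be unstable.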
## Test Results
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/oneflow_resnet50_logs.tgz)
### Group: batch size per device = 128
ResNet50 V1.5, batch size per device=128, dtype=float32, without XLA
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage | Throughput | Speedup |