Unverified commit 2cf4d973, authored by Lyon, committed by GitHub

Merge pull request #104 from Oneflow-Inc/cnn_bert_report

Cnn bert report
......@@ -3,19 +3,24 @@
This repository provides OneFlow deep learning benchmark examples for CV, CTR and NLP, and more models are on the way and will be provided here when ready.
## [Convolutional Networks](./Classification/cnns) for Computer Vision Classification
- [ResNet-50](./Classification/cnns/resnet_model.py)
- [ResNeXt-50-32*4d](./Classification/cnns/resnext_model.py)
- [VGG-16](./Classification/cnns/vgg_model.py)
- [Inception-V3](./Classification/cnns/inception_model.py)
- [AlexNet](./Classification/cnns/alexnet_model.py)
- [MobileNet-V2](./Classification/cnns/mobilenet_v2_model.py)
- [ResNet-50](./Classification/cnns)
- [ResNeXt-50-32*4d](./Classification/cnns)
- [VGG-16](./Classification/cnns)
- [Inception-V3](./Classification/cnns)
- [AlexNet](./Classification/cnns)
- [MobileNet-V2](./Classification/cnns)
## [Wide Deep Learning](./ClickThroughRate/WideDeepLearning) for Click-Through-Rate (CTR) Recommender Systems
- [OneFlow-WDL](./ClickThroughRate/WideDeepLearning)
## [BERT](./LanguageModeling/BERT) for Natural Language Processing
- [BERT Pretrain for Language Modeling](./LanguageModeling/BERT/run_pretraining.py)
- [SQuAD for Question Answering](./LanguageModeling/BERT/run_squad.py)
- [CoLA and MRPC of GLUE](./LanguageModeling/BERT/run_classifier.py)
- [BERT Pretrain for Language Modeling](./LanguageModeling/BERT)
- [SQuAD for Question Answering](./LanguageModeling/BERT)
- [CoLA and MRPC of GLUE](./LanguageModeling/BERT)
## OneFlow Benchmark Test Reports
| Model | DType | XLA | Throughput | Speedup on 32 devices |
| ----- | ----- | --- | ---------- | ------- |
| [ResNet50-V1.5](./reports/resnet50_v15_fp32_report.md) | Float32 | No | 11.6k images/sec | 30.4 |
| [BERT base Pretrain](./reports/bert_fp32_report.md) | Float32 | No | 530k tokens/sec | 28.54 |
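
The 530k tokens/sec figure in the BERT row is consistent with the 32-GPU, batch-size-96 result in the BERT report (4140.439 samples/s) multiplied by the sequence length of 128 used in that test. The conversion below is an inference from those two numbers rather than a formula stated in the reports, and the variable names are illustrative.

```python
# Assumption: tokens/sec = samples/sec * sequence length.
samples_per_sec = 4140.439  # BERT base, 32 GPUs, batch size 96 per device
seq_length = 128            # --seq_length used for the BERT base tests

tokens_per_sec = samples_per_sec * seq_length
print(f"{tokens_per_sec / 1000:.0f}k tokens/sec")
```
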
# OneFlow BERT Pretrain Benchmark Test Report
This document reports OneFlow BERT Pretrain benchmark test results on Aug 9 2020.
## Test Environment
All tests were performed on 4 GPU servers, each with 8x Tesla V100-SXM2-16GB. The main hardware and software configuration of each server is as follows:
- Tesla V100-SXM2-16GB x 8
- InfiniBand 100 Gb/sec (4X EDR), Mellanox Technologies MT27700 Family
- 48 CPU(s), Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
- Memory 384G
- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- CUDA Version: 10.2, Driver Version: 440.33.01
- OneFlow: v0.1.8, fix_infer_out_logical_blob_desc@17a2bdc9b
- OneFlow-Benchmark: master@892f87e6
- `nvidia-smi topo -m`

            GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  CPU Affinity
      GPU0  X     NV1   NV1   NV2   NV2   SYS   SYS   SYS   NODE    0-11,24-35
      GPU1  NV1   X     NV2   NV1   SYS   NV2   SYS   SYS   NODE    0-11,24-35
      GPU2  NV1   NV2   X     NV2   SYS   SYS   NV1   SYS   PIX     0-11,24-35
      GPU3  NV2   NV1   NV2   X     SYS   SYS   SYS   NV1   PIX     0-11,24-35
      GPU4  NV2   SYS   SYS   SYS   X     NV1   NV1   NV2   SYS     12-23,36-47
      GPU5  SYS   NV2   SYS   SYS   NV1   X     NV2   NV1   SYS     12-23,36-47
      GPU6  SYS   SYS   NV1   SYS   NV1   NV2   X     NV2   SYS     12-23,36-47
      GPU7  SYS   SYS   SYS   NV1   NV2   NV1   NV2   X     SYS     12-23,36-47

      Legend:
        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks

## Test Descriptions
Four groups of tests were performed with different batch sizes per device: 32, 64 and 96 for BERT base, and 4 for BERT large.
Each group includes 6 tests with different numbers of devices: 1, 2, 4, 8, 16 and 32.
`Throughput` (samples/sec) and `GPU Memory Usage` were logged and recorded.
All tests use the `Float32` data type; XLA is not applied.
## Test Scripts
Please clone or download the `BERT` folder from the [OneFlow-Benchmark repository](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/master/LanguageModeling/BERT).
We created two bash scripts alongside the `BERT` folder for this test:
1. `local_run.sh` - launches OneFlow locally with the given number of nodes and GPUs per node

       # local_run.sh
       # Arguments passed by launch_all.sh: <num_nodes> <gpu_num_per_node>
       NUM_NODES=$1
       GPU_NUM_PER_NODE=$2
       # Assumed: BENCH_ROOT_DIR, DATA_ROOT and BSZ_PER_DEVICE are set beforehand;
       # their values are not shown in this report.
       rm -rf ./log
       mkdir ./log
       python3 ./$BENCH_ROOT_DIR/run_pretraining.py \
         --gpu_num_per_node=$GPU_NUM_PER_NODE \
         --num_nodes=$NUM_NODES \
         --node_ips='','','','' \
         --learning_rate=1e-4 \
         --batch_size_per_device=$BSZ_PER_DEVICE \
         --iter_num=200 \
         --loss_print_every_n_iter=20 \
         --seq_length=128 \
         --max_predictions_per_seq=20 \
         --num_hidden_layers=12 \
         --num_attention_heads=12 \
         --max_position_embeddings=512 \
         --type_vocab_size=2 \
         --vocab_size=30522 \
         --attention_probs_dropout_prob=0.1 \
         --hidden_dropout_prob=0.1 \
         --hidden_size_per_head=64 \
         --data_dir=$DATA_ROOT \
         --data_part_num=32 \
         --log_dir=./log \
         --model_save_every_n_iter=10000 \
         --save_last_snapshot=False

2. `launch_all.sh` - launches OneFlow on all remote nodes with the given number of nodes and GPUs per node

       # launch_all.sh
       # Usage: ./launch_all.sh <num_nodes> <gpu_num_per_node>
       NUM_NODES=$1
       GPU_NUM_PER_NODE=$2
       # Assumed values; the source does not show these assignments.
       LOCAL_RUN=local_run.sh
       BENCH_ROOT_DIR=BERT
       #0 prepare the host list for training
       #comment unused hosts with `#`
       #or use first arg to limit the hosts number
       declare -a host_list=("" "" "" "")
       if [ -n "$1" ]; then
         host_num=$1
       else
         host_num=${#host_list[@]}
       fi
       if [ ${host_num} -gt ${#host_list[@]} ]; then
         host_num=${#host_list[@]}
       fi
       hosts=("${host_list[@]:0:${host_num}}")
       echo "Working on hosts:${hosts[@]}"
       #1 prepare oneflow_temp folder on each host
       for host in "${hosts[@]}"; do
         ssh $USER@$host "mkdir -p ~/oneflow_temp"
       done
       #2 copy files to each host and start work
       for host in "${hosts[@]}"; do
         echo "start training on ${host}"
         ssh $USER@$host 'rm -rf ~/oneflow_temp/*'
         scp -r ./$BENCH_ROOT_DIR ./$LOCAL_RUN $USER@$host:~/oneflow_temp
         ssh $USER@$host "cd ~/oneflow_temp; nohup ./$LOCAL_RUN $NUM_NODES $GPU_NUM_PER_NODE 1>oneflow.log 2>&1 </dev/null &"
       done

Note: Make sure all servers can log in to each other automatically using SSH keys.
### Test Command Example
    # test on 1 node with 4 gpus
    ./launch_all.sh 1 4
    # test on 4 nodes with 8 gpus per node
    ./launch_all.sh 4 8
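
The result tables in this report use 1, 2, 4 and 8 GPUs on a single node, then 2 and 4 full nodes for 16 and 32 GPUs. As a minimal sketch, the corresponding `launch_all.sh` arguments can be derived from the total device count as follows. The helper `launch_args` is hypothetical and not part of the benchmark repository.

```python
GPUS_PER_SERVER = 8  # each test server has 8 GPUs


def launch_args(num_devices, gpus_per_server=GPUS_PER_SERVER):
    """Return (num_nodes, gpu_num_per_node) for a total device count."""
    if num_devices <= gpus_per_server:
        return 1, num_devices
    if num_devices % gpus_per_server != 0:
        # Assumption: multi-node runs always use every GPU of each server.
        raise ValueError("multi-node runs are assumed to use full servers")
    return num_devices // gpus_per_server, gpus_per_server


commands = [
    "./launch_all.sh {} {}".format(*launch_args(n)) for n in (1, 2, 4, 8, 16, 32)
]
for command in commands:
    print(command)
```
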
### Calculate `Throughput` from Test Results
`Throughput(samples/s)` and `loss` can be found in the `oneflow_temp` folder in the first node's home directory, in two files:
1. `oneflow.log` - redirected stdout
2. `log/summary.csv` - same information in csv format
Taking `oneflow.log` as an example:

    step: 19, total_loss: 11.078, mlm_loss: 10.407, nsp_loss: 0.671, throughput: 52.257
    step: 39, total_loss: 10.884, mlm_loss: 10.190, nsp_loss: 0.694, throughput: 142.735
    step: 59, total_loss: 10.592, mlm_loss: 9.915, nsp_loss: 0.677, throughput: 142.636
    step: 79, total_loss: 10.335, mlm_loss: 9.659, nsp_loss: 0.676, throughput: 142.391
    step: 99, total_loss: 10.157, mlm_loss: 9.479, nsp_loss: 0.678, throughput: 142.565
    step: 119, total_loss: 10.046, mlm_loss: 9.361, nsp_loss: 0.686, throughput: 142.397
    step: 139, total_loss: 9.915, mlm_loss: 9.237, nsp_loss: 0.678, throughput: 142.298
    step: 159, total_loss: 9.851, mlm_loss: 9.168, nsp_loss: 0.683, throughput: 142.383
    step: 179, total_loss: 9.784, mlm_loss: 9.104, nsp_loss: 0.680, throughput: 142.270
    step: 199, total_loss: 9.640, mlm_loss: 8.960, nsp_loss: 0.680, throughput: 142.579

Normally, the first `throughput` value (e.g. `52.257`) is discarded because the start time of the first batch is not accurate. The remaining `throughput` values are averaged to give the throughput of the test.
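
The averaging step can be sketched in a few lines of Python. This is an illustration only: the function name `average_throughput` is hypothetical, and the regular expression assumes the `throughput: <value>` log format shown in the example (only the first three log lines are reproduced here).

```python
import re

LOG = """\
step: 19, total_loss: 11.078, mlm_loss: 10.407, nsp_loss: 0.671, throughput: 52.257
step: 39, total_loss: 10.884, mlm_loss: 10.190, nsp_loss: 0.694, throughput: 142.735
step: 59, total_loss: 10.592, mlm_loss: 9.915, nsp_loss: 0.677, throughput: 142.636
"""


def average_throughput(log_text):
    """Average all throughput values except the first (warm-up) one."""
    values = [float(v) for v in re.findall(r"throughput: ([0-9.]+)", log_text)]
    if len(values) < 2:
        raise ValueError("need at least two throughput values")
    steady = values[1:]  # drop the first value, whose timing is unreliable
    return sum(steady) / len(steady)


print(average_throughput(LOG))
```
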
## BERT base Pretrain Test Results
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/oneflow_bert_benchmark_logs.tgz)
### Group: batch size per device = 32
BERT Base Pretrain, batch size per device=32, dtype=float32, without XLA
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage | Throughput (samples/s) | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 32 | 6207 | 140.034 | 1 |
| 1 | 2 | 2 | 32 | 7081 | 254.304 | 1.82 |
| 1 | 4 | 4 | 32 | 7255 | 506.989 | 3.62 |
| 1 | 8 | 8 | 32 | 7323 | 1010.446 | 7.22 |
| 2 | 8 | 16 | 32 | 7145 | 1571.088 | 11.22 |
| 4 | 8 | 32 | 32 | 7185 | 3136.797 | 22.40 |
### Group: batch size per device = 64
BERT Base Pretrain, batch size per device=64, dtype=float32, without XLA
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage | Throughput (samples/s) | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 64 | 9989 | 145.148 | 1 |
| 1 | 2 | 2 | 64 | 10947 | 277.880 | 1.91 |
| 1 | 4 | 4 | 64 | 10955 | 552.843 | 3.81 |
| 1 | 8 | 8 | 64 | 11029 | 1103.102 | 7.60 |
| 2 | 8 | 16 | 64 | 10957 | 2023.743 | 13.94 |
| 4 | 8 | 32 | 64 | 10981 | 3947.739 | 27.20 |
### Group: batch size per device = 96
BERT Base Pretrain, batch size per device=96, dtype=float32, without XLA
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage | Throughput (samples/s) | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 96 | 13771 | 145.095 | 1 |
| 1 | 2 | 2 | 96 | 14757 | 282.984 | 1.95 |
| 1 | 4 | 4 | 96 | 14851 | 559.011 | 3.85 |
| 1 | 8 | 8 | 96 | 14815 | 1121.632 | 7.73 |
| 2 | 8 | 16 | 96 | 14815 | 2132.490 | 14.70 |
| 4 | 8 | 32 | 96 | 14687 | 4140.439 | 28.54 |
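
The `Speedup` column appears to be each throughput divided by the single-GPU throughput of the same group. The sketch below recomputes it for the batch-size-96 group under that assumption; the variable names are illustrative.

```python
# (gpu count, throughput in samples/s) for batch size per device = 96
results = [
    (1, 145.095),
    (2, 282.984),
    (4, 559.011),
    (8, 1121.632),
    (16, 2132.490),
    (32, 4140.439),
]

baseline = results[0][1]  # single-GPU throughput
speedups = {gpus: round(throughput / baseline, 2) for gpus, throughput in results}

for gpus, speedup in speedups.items():
    print(f"{gpus:>2} GPUs: {speedup:.2f}x")
```
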
## BERT Large Pretrain Test Results
BERT large was tested in the same environment. Some arguments in `local_run.sh` need to be modified to match the BERT large pretrain configuration:

    # local_run.sh for bert large
    # Arguments passed by launch_all.sh: <num_nodes> <gpu_num_per_node>
    NUM_NODES=$1
    GPU_NUM_PER_NODE=$2
    # Assumed: BENCH_ROOT_DIR, DATA_ROOT and BSZ_PER_DEVICE are set beforehand;
    # their values are not shown in this report.
    rm -rf ./log
    mkdir ./log
    python3 ./$BENCH_ROOT_DIR/run_pretraining.py \
      --gpu_num_per_node=$GPU_NUM_PER_NODE \
      --num_nodes=$NUM_NODES \
      --node_ips='','','','' \
      --learning_rate=1e-4 \
      --batch_size_per_device=$BSZ_PER_DEVICE \
      --iter_num=200 \
      --loss_print_every_n_iter=20 \
      --seq_length=512 \
      --max_predictions_per_seq=80 \
      --num_hidden_layers=24 \
      --num_attention_heads=16 \
      --max_position_embeddings=512 \
      --type_vocab_size=2 \
      --vocab_size=30522 \
      --attention_probs_dropout_prob=0.1 \
      --hidden_dropout_prob=0.1 \
      --hidden_size_per_head=64 \
      --data_dir=$DATA_ROOT \
      --data_part_num=32 \
      --log_dir=./log \
      --model_save_every_n_iter=10000 \
      --save_last_snapshot=False

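
For reference, the arguments that differ between the BERT base and BERT large scripts can be listed side by side. The sketch below copies the values from the two `local_run.sh` listings; the `hidden_size` helper assumes the hidden size is the number of attention heads times the per-head size, as the flag name `hidden_size_per_head` suggests.

```python
BERT_BASE = {
    "seq_length": 128,
    "max_predictions_per_seq": 20,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size_per_head": 64,
}
BERT_LARGE = {
    "seq_length": 512,
    "max_predictions_per_seq": 80,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "hidden_size_per_head": 64,
}

# Arguments whose values change between the two configurations.
changed = {
    key: (BERT_BASE[key], BERT_LARGE[key])
    for key in BERT_BASE
    if BERT_BASE[key] != BERT_LARGE[key]
}


def hidden_size(config):
    # Assumption: hidden size = attention heads * size per head.
    return config["num_attention_heads"] * config["hidden_size_per_head"]


for key, (base, large) in changed.items():
    print(f"--{key}: {base} -> {large}")
print("hidden size:", hidden_size(BERT_BASE), "->", hidden_size(BERT_LARGE))
```
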
Here is the result:
BERT Large Pretrain, batch size per device=4, dtype=float32, without XLA
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage | Throughput (samples/s) | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 4 | 12087 | 8.839 | 1 |
| 1 | 2 | 2 | 4 | 14593 | 16.405 | 1.86 |
| 1 | 4 | 4 | 4 | 14713 | 33.158 | 3.75 |
| 1 | 8 | 8 | 4 | 14765 | 64.519 | 7.30 |
| 2 | 8 | 16 | 4 | 14661 | 74.224 | 8.40 |
| 4 | 8 | 32 | 4 | 14673 | 143.232 | 16.21 |
| 1 | 1 | 1 | 6 | 15779 | 9.180 | 1.04 |
# OneFlow ResNet50-V1.5 Benchmark Test Report
This document reports OneFlow ResNet50-V1.5 benchmark test results on Aug 8 2020.
## Test Environment
All tests were performed on 4 GPU servers, each with 8x Tesla V100-SXM2-16GB. The main hardware and software configuration of each server is as follows:
- Tesla V100-SXM2-16GB x 8
- InfiniBand 100 Gb/sec (4X EDR), Mellanox Technologies MT27700 Family
- 48 CPU(s), Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
- Memory 384G
- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- CUDA Version: 10.2, Driver Version: 440.33.01
- OneFlow: v0.1.8, fix_infer_out_logical_blob_desc@17a2bdc9b
- OneFlow-Benchmark: master@892f87e6
- `nvidia-smi topo -m`

            GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  CPU Affinity
      GPU0  X     NV1   NV1   NV2   NV2   SYS   SYS   SYS   NODE    0-11,24-35
      GPU1  NV1   X     NV2   NV1   SYS   NV2   SYS   SYS   NODE    0-11,24-35
      GPU2  NV1   NV2   X     NV2   SYS   SYS   NV1   SYS   PIX     0-11,24-35
      GPU3  NV2   NV1   NV2   X     SYS   SYS   SYS   NV1   PIX     0-11,24-35
      GPU4  NV2   SYS   SYS   SYS   X     NV1   NV1   NV2   SYS     12-23,36-47
      GPU5  SYS   NV2   SYS   SYS   NV1   X     NV2   NV1   SYS     12-23,36-47
      GPU6  SYS   SYS   NV1   SYS   NV1   NV2   X     NV2   SYS     12-23,36-47
      GPU7  SYS   SYS   SYS   NV1   NV2   NV1   NV2   X     SYS     12-23,36-47

      Legend:
        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks

## Test Descriptions
Two groups of tests were performed with different batch sizes per device: 128 and 160.
Each group includes 6 tests with different numbers of devices: 1, 2, 4, 8, 16 and 32.
`Throughput` (images/sec) and `GPU Memory Usage` were logged and recorded.
All tests use the `Float32` data type; XLA is not applied.
## Test Scripts
Please clone or download the `cnns` folder from the [OneFlow-Benchmark repository](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/master/Classification/cnns).
We created two bash scripts alongside the `cnns` folder for this test:
1. `local_run.sh` - launches OneFlow locally with the given number of nodes and GPUs per node

       # local_run.sh
       # Arguments passed by launch_all.sh: <num_nodes> <gpu_num_per_node>
       NUM_NODES=$1
       GPU_NUM_PER_NODE=$2
       # Assumed: BENCH_ROOT_DIR, DATA_ROOT, NUM_EXAMPLES and BSZ_PER_DEVICE are
       # set beforehand; their values are not shown in this report.
       rm -rf ./log
       mkdir ./log
       python3 ./$BENCH_ROOT_DIR/of_cnn_train_val.py \
         --num_examples=$NUM_EXAMPLES \
         --train_data_dir=$DATA_ROOT/train \
         --train_data_part_num=44 \
         --num_nodes=$NUM_NODES \
         --gpu_num_per_node=$GPU_NUM_PER_NODE \
         --model_update="momentum" \
         --learning_rate=0.001 \
         --loss_print_every_n_iter=20 \
         --batch_size_per_device=$BSZ_PER_DEVICE \
         --val_batch_size_per_device=125 \
         --num_epoch=1 \
         --log_dir=./log \
         --node_ips='','','',''

2. `launch_all.sh` - launches OneFlow on all remote nodes with the given number of nodes and GPUs per node

       # launch_all.sh
       # Usage: ./launch_all.sh <num_nodes> <gpu_num_per_node>
       NUM_NODES=$1
       GPU_NUM_PER_NODE=$2
       # Assumed values; the source does not show these assignments.
       LOCAL_RUN=local_run.sh
       BENCH_ROOT_DIR=cnns
       #0 prepare the host list for training
       #comment unused hosts with `#`
       #or use first arg to limit the hosts number
       declare -a host_list=("" "" "" "")
       if [ -n "$1" ]; then
         host_num=$1
       else
         host_num=${#host_list[@]}
       fi
       if [ ${host_num} -gt ${#host_list[@]} ]; then
         host_num=${#host_list[@]}
       fi
       hosts=("${host_list[@]:0:${host_num}}")
       echo "Working on hosts:${hosts[@]}"
       #1 prepare oneflow_temp folder on each host
       for host in "${hosts[@]}"; do
         ssh $USER@$host "mkdir -p ~/oneflow_temp"
       done
       #2 copy files to each host and start work
       for host in "${hosts[@]}"; do
         echo "start training on ${host}"
         ssh $USER@$host 'rm -rf ~/oneflow_temp/*'
         scp -r ./$BENCH_ROOT_DIR ./$LOCAL_RUN $USER@$host:~/oneflow_temp
         ssh $USER@$host "cd ~/oneflow_temp; nohup ./$LOCAL_RUN $NUM_NODES $GPU_NUM_PER_NODE 1>oneflow.log 2>&1 </dev/null &"
       done

Note: Make sure all servers can log in to each other automatically using SSH keys.
### Test Command Example
    # test on 1 node with 4 gpus
    ./launch_all.sh 1 4
    # test on 4 nodes with 8 gpus per node
    ./launch_all.sh 4 8
### Calculate `Throughput` from Test Results
`Throughput(samples/s)`, `loss` and `top-k` can be found in the `oneflow_temp` folder in the first node's home directory, in two files:
1. `oneflow.log` - redirected stdout
2. `log/summary.csv` - same information in csv format
Taking `oneflow.log` as an example:

    train: epoch 0, iter 20, loss: 6.505637, top_1: 0.000000, top_k: 0.000000, samples/s: 288.088
    train: epoch 0, iter 40, loss: 5.736447, top_1: 0.020313, top_k: 0.117578, samples/s: 385.628
    train: epoch 0, iter 60, loss: 4.274485, top_1: 0.817969, top_k: 0.991797, samples/s: 386.264
    train: epoch 0, iter 80, loss: 2.331075, top_1: 1.000000, top_k: 1.000000, samples/s: 385.723
    train: epoch 0, iter 100, loss: 1.236110, top_1: 1.000000, top_k: 1.000000, samples/s: 384.622
    train: epoch 0, iter 120, loss: 1.078446, top_1: 1.000000, top_k: 1.000000, samples/s: 385.367
    train: epoch 0, iter 140, loss: 1.054016, top_1: 1.000000, top_k: 1.000000, samples/s: 384.704
    train: epoch 0, iter 160, loss: 1.048110, top_1: 1.000000, top_k: 1.000000, samples/s: 384.927
    train: epoch 0, iter 180, loss: 1.050786, top_1: 1.000000, top_k: 1.000000, samples/s: 384.109
    train: epoch 0, iter 200, loss: 1.047857, top_1: 1.000000, top_k: 1.000000, samples/s: 384.517

Normally, the first `samples/s` value (e.g. `288.088`) is discarded because the start time of the first batch is not accurate. The remaining `samples/s` values are averaged to give the throughput of the test.
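
As a rough sketch, the training log lines can be parsed into structured records before averaging. The pattern below assumes the `train: epoch ..., samples/s: ...` format shown in the example; the names `LINE` and `parse_train_log` are hypothetical, and only the first three log lines are reproduced.

```python
import re

LINE = re.compile(
    r"train: epoch (?P<epoch>\d+), iter (?P<iter>\d+), loss: (?P<loss>[\d.]+), "
    r"top_1: (?P<top_1>[\d.]+), top_k: (?P<top_k>[\d.]+), "
    r"samples/s: (?P<rate>[\d.]+)"
)

LOG = """\
train: epoch 0, iter 20, loss: 6.505637, top_1: 0.000000, top_k: 0.000000, samples/s: 288.088
train: epoch 0, iter 40, loss: 5.736447, top_1: 0.020313, top_k: 0.117578, samples/s: 385.628
train: epoch 0, iter 60, loss: 4.274485, top_1: 0.817969, top_k: 0.991797, samples/s: 386.264
"""


def parse_train_log(log_text):
    """Return one dict per training log line found in the text."""
    return [
        {
            "iter": int(m["iter"]),
            "loss": float(m["loss"]),
            "top_1": float(m["top_1"]),
            "samples_per_sec": float(m["rate"]),
        }
        for m in LINE.finditer(log_text)
    ]


rows = parse_train_log(LOG)
steady = [row["samples_per_sec"] for row in rows[1:]]  # skip the first record
throughput = sum(steady) / len(steady)
print(throughput)
```
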
## Test Results
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/oneflow_resnet50_logs.tgz)
### Group: batch size per device = 128
ResNet50 V1.5, batch size per device=128, dtype=float32, without XLA
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage | Throughput (samples/s) | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 128 | 12565 | 383.760 | 1 |
| 1 | 2 | 2 | 128 | 12839 | 747.295 | 1.95 |
| 1 | 4 | 4 | 128 | 12987 | 1497.618 | 3.90 |
| 1 | 8 | 8 | 128 | 13051 | 2942.321 | 7.67 |
| 2 | 8 | 16 | 128 | 12871 | 5839.054 | 15.22 |
| 4 | 8 | 32 | 128 | 12871 | 11548.451 | 30.09 |
### Group: batch size per device = 160
ResNet50 V1.5, batch size per device=160, dtype=float32, without XLA
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage | Throughput (samples/s) | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 160 | 15509 | 382.324 | 1 |
| 1 | 2 | 2 | 160 | 15785 | 755.956 | 1.98 |
| 1 | 4 | 4 | 160 | 15881 | 1494.733 | 3.91 |
| 1 | 8 | 8 | 160 | 15701 | 3016.431 | 7.89 |
| 2 | 8 | 16 | 160 | 15817 | 5877.289 | 15.37 |
| 4 | 8 | 32 | 160 | 15879 | 11623.889 | 30.40 |
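
One way to read these tables is through scaling efficiency, taken here as the measured throughput divided by the ideal throughput (GPU count times single-GPU throughput). This metric is not reported in the tables themselves; the sketch below derives it from the throughput values above, with illustrative names.

```python
# {batch size per device: [(gpu count, throughput in samples/s), ...]}
RESULTS = {
    128: [(1, 383.760), (2, 747.295), (4, 1497.618),
          (8, 2942.321), (16, 5839.054), (32, 11548.451)],
    160: [(1, 382.324), (2, 755.956), (4, 1494.733),
          (8, 3016.431), (16, 5877.289), (32, 11623.889)],
}


def scaling_efficiency(rows):
    """Map gpu count to measured / ideal throughput for one test group."""
    baseline = rows[0][1]  # single-GPU throughput
    return {gpus: throughput / (gpus * baseline) for gpus, throughput in rows}


efficiency = {bsz: scaling_efficiency(rows) for bsz, rows in RESULTS.items()}
for bsz, per_gpu_count in efficiency.items():
    print(f"bsz={bsz}: 32-GPU efficiency {per_gpu_count[32]:.1%}")
```
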