Unverified commit bfdbae75 authored by Lyon, committed by GitHub

Merge pull request #108 from Oneflow-Inc/dev_modify_cnn_training_param

Dev modify cnn training param
......@@ -30,6 +30,7 @@ def _default_config(args):
def get_train_config(args):
train_config = _default_config(args)
train_config.train.primary_lr(args.learning_rate)
train_config.cudnn_conv_heuristic_search_algo(False)
train_config.prune_parallel_cast_ops(True)
......
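For readers skimming this hunk, a hedged sketch of how these options fit together, assuming OneFlow's 0.1.x `flow.function_config()` API; `_default_config` is only stubbed here, and the comments describe a plausible reading of the two new switches rather than confirmed semantics:
```
import oneflow as flow

def _default_config(args):
    # Minimal stand-in for the project's real _default_config.
    config = flow.function_config()
    config.default_data_type(flow.float)
    return config

def get_train_config(args):
    train_config = _default_config(args)
    train_config.train.primary_lr(args.learning_rate)
    # Turn off cuDNN's heuristic conv-algorithm search (assumption: a more
    # exhaustive search is then used to pick conv kernels).
    train_config.cudnn_conv_heuristic_search_algo(False)
    # Allow the graph compiler to prune redundant parallel-cast ops.
    train_config.prune_parallel_cast_ops(True)
    return train_config
```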
......@@ -80,6 +80,7 @@ def TrainNet():
else:
loss = flow.nn.sparse_softmax_cross_entropy_with_logits(labels, logits, name="softmax_loss")
loss = flow.math.reduce_mean(loss)
flow.losses.add_loss(loss)
predictions = flow.nn.softmax(logits)
outputs = {"loss": loss, "predictions": predictions, "labels": labels}
......@@ -124,7 +125,7 @@ def main():
save_summary_steps=num_val_steps, batch_size=val_batch_size)
for i in range(num_val_steps):
InferenceNet().async_get(metric.metric_cb(epoch, i))
snapshot.save('epoch_{}'.format(epoch))
# snapshot.save('epoch_{}'.format(epoch))
if __name__ == "__main__":
......
#!/bin/sh
rm -rf core.*
rm -rf ./output/snapshots/*
# training with synthetic data
python3 of_cnn_train_val.py \
--num_examples=50 \
--num_val_examples=50 \
......@@ -13,4 +14,29 @@ python3 of_cnn_train_val.py \
--batch_size_per_device=16 \
--val_batch_size_per_device=16 \
--num_epoch=10 \
--model="resnet50"
\ No newline at end of file
--model="resnet50"
# # training with imagenet
# DATA_ROOT=/datasets/ImageNet/ofrecord
# LOG_FOLDER=../logs
# mkdir -p $LOG_FOLDER
# LOGFILE=$LOG_FOLDER/resnet_training.log
# python3 of_cnn_train_val.py \
# --train_data_dir=$DATA_ROOT/train \
# --train_data_part_num=256 \
# --val_data_dir=$DATA_ROOT/validation \
# --val_data_part_num=256 \
# --num_nodes=1 \
# --gpu_num_per_node=4 \
# --model_update="momentum" \
# --learning_rate=0.256 \
# --loss_print_every_n_iter=100 \
# --batch_size_per_device=64 \
# --val_batch_size_per_device=50 \
# --num_epoch=90 \
# --model="resnet50" 2>&1 | tee ${LOGFILE}
# echo "Writting log to ${LOGFILE}"
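One relationship worth spelling out for the commented ImageNet command: in a data-parallel run, the effective global batch size is the per-device batch size multiplied by the total number of devices. A tiny sketch, using the values above:
```
def global_batch_size(batch_size_per_device, gpu_num_per_node, num_nodes):
    # Effective batch size seen by the optimizer across all devices.
    return batch_size_per_device * gpu_num_per_node * num_nodes

# Values from the commented ImageNet command: 64 per device, 4 GPUs, 1 node.
print(global_batch_size(64, 4, 1))  # -> 256
```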
# OneFlow BERT Pretrain Benchmark Test Report
This document reports OneFlow BERT Pretrain benchmark test results on Aug 13 2020.
This document reports OneFlow BERT Pretrain benchmark test results on Aug 9 2020.
## Test Environment
All tests were performed on 4 GPU servers, each with 8x Tesla V100-SXM2-16GB. The main hardware and software configuration of each server is as follows:
......@@ -9,7 +9,7 @@ All tests were performed on 4 GPU Servers with 8x Tesla V100-SXM2-16GB and follo
- Memory 384G
- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- CUDA Version: 10.2, Driver Version: 440.33.01
- OneFlow: v0.1.8, master@4d44113e2 with NCCL 2.4.8
- OneFlow: v0.1.8, fix_infer_out_logical_blob_desc@17a2bdc9b
- OneFlow-Benchmark: master@892f87e6
- `nvidia-smi topo -m`
```
......@@ -171,40 +171,39 @@ step: 199, total_loss: 9.640, mlm_loss: 8.960, nsp_loss: 0.680, throughput: 142.
```
Normally, the first `throughput` value, e.g. `52.257`, is discarded because the start time of the first batch is not accurate. We average the remaining `throughput` values as the throughput of this test.
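As an illustration, a minimal sketch of that averaging, assuming log lines in the `step: ..., throughput: ...` format shown above (the log path is hypothetical):
```
import re

def average_throughput(log_path):
    # Collect every "throughput: <value>" figure from the log and average
    # them, discarding the first one because its timing includes warm-up.
    with open(log_path) as f:
        values = [float(v) for v in re.findall(r"throughput: ([\d.]+)", f.read())]
    return sum(values[1:]) / len(values[1:])

print(average_throughput("../logs/bert_base_pretrain.log"))  # hypothetical path
```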
## BERT Base Pretrain Test Results
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/of_leinao_benchmark_log_0813.tar.gz)
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/oneflow_bert_benchmark_logs.tgz)
### Group: batch size per device = 32
BERT Base Pretrain, batch size per device=32, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 32 | 137.17 | 1.00 | 6205 |
| 1 | 2 | 32 | 250.41 | 1.83 | 7071 |
| 1 | 4 | 32 | 502.70 | 3.66 | 7139 |
| 1 | 8 | 32 | 990.87 | 7.22 | 7215 |
| 2 | 16 | 32 | 1573.31 | 11.47 | 7135 |
| 4 | 32 | 32 | 3081.96 | 22.47 | 7149 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 32 | 6207 | 140.034 | 1 |
| 1 | 2 | 2 | 32 | 7081 | 254.304 | 1.82 |
| 1 | 4 | 4 | 32 | 7255 | 506.989 | 3.62 |
| 1 | 8 | 8 | 32 | 7323 | 1010.446 | 7.22 |
| 2 | 8 | 16 | 32 | 7145 | 1571.088 | 11.22 |
| 4 | 8 | 32 | 32 | 7185 | 3136.797 | 22.40 |
### Group: batch size per device = 64
BERT Base Pretrain, batch size per device=64, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 64 | 145.55 | 1.00 | 9987 |
| 1 | 2 | 64 | 277.03 | 1.90 | 10847 |
| 1 | 4 | 64 | 551.78 | 3.79 | 10923 |
| 1 | 8 | 64 | 1105.13 | 7.59 | 11057 |
| 2 | 16 | 64 | 2016.09 | 13.85 | 10937 |
| 4 | 32 | 64 | 3911.90 | 26.88 | 10963 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 64 | 9989 | 145.148 | 1 |
| 1 | 2 | 2 | 64 | 10947 | 277.880 | 1.91 |
| 1 | 4 | 4 | 64 | 10955 | 552.843 | 3.81 |
| 1 | 8 | 8 | 64 | 11029 | 1103.102 | 7.60 |
| 2 | 8 | 16 | 64 | 10957 | 2023.743 | 13.94 |
| 4 | 8 | 32 | 64 | 10981 | 3947.739 | 27.20 |
### Group: batch size per device = 96
BERT Base Pretrain, batch size per device=96, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 96 | 148.34 | 1.00 | 13769 |
| 1 | 2 | 96 | 286.24 | 1.93 | 14735 |
| 1 | 4 | 96 | 573.85 | 3.87 | 14809 |
| 1 | 8 | 96 | 1147.47 | 7.74 | 14893 |
| 2 | 16 | 96 | 2169.65 | 14.63 | 14763 |
| 4 | 32 | 96 | 4238.85 | 28.58 | 14795 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 96 | 13771 | 145.095 | 1 |
| 1 | 2 | 2 | 96 | 14757 | 282.984 | 1.95 |
| 1 | 4 | 4 | 96 | 14851 | 559.011 | 3.85 |
| 1 | 8 | 8 | 96 | 14815 | 1121.632 | 7.73 |
| 2 | 8 | 16 | 96 | 14815 | 2132.490 | 14.70 |
| 4 | 8 | 32 | 96 | 14687 | 4140.439 | 28.54 |
## BERT Large Pretrain Test Results
BERT Large was tested in the same environment. Some arguments in `local_run.sh` need to be modified to match the BERT Large pretrain configuration.
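For reference, the standard published BERT Base vs. BERT Large model sizes that such a modification targets; the exact `local_run.sh` flag names are not shown in this diff, so the keys below are assumptions:
```
# Standard BERT model sizes (Devlin et al., 2018); flag names are assumed.
BERT_BASE = dict(num_hidden_layers=12, hidden_size=768, num_attention_heads=12)
BERT_LARGE = dict(num_hidden_layers=24, hidden_size=1024, num_attention_heads=16)
```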
......
# OneFlow ResNet50-V1.5 Benchmark Test Report
This document reports OneFlow ResNet50-V1.5 benchmark test results on Aug 13 2020.
This document reports OneFlow ResNet50-V1.5 benchmark test results on Aug 8 2020.
## Test Environment
All tests were performed on 4 GPU servers, each with 8x Tesla V100-SXM2-16GB. The main hardware and software configuration of each server is as follows:
......@@ -9,7 +9,7 @@ All tests were performed on 4 GPU Servers with 8x Tesla V100-SXM2-16GB and follo
- Memory 384G
- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- CUDA Version: 10.2, Driver Version: 440.33.01
- OneFlow: v0.1.8, master@4d44113e2 with NCCL 2.4.8
- OneFlow: v0.1.8, fix_infer_out_logical_blob_desc@17a2bdc9b
- OneFlow-Benchmark: master@892f87e6
- `nvidia-smi topo -m`
```
......@@ -163,26 +163,26 @@ train: epoch 0, iter 200, loss: 1.047857, top_1: 1.000000, top_k: 1.000000, samp
```
Normally, the first `samples/s` value, e.g. `288.088`, is discarded because the start time of the first batch is not accurate. We average the remaining `samples/s` values as the throughput of this test.
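The `Speedup` column in the tables below is the averaged throughput of each configuration divided by the single-GPU throughput of the same batch-size group; a minimal sketch, using values from the bsz=128 group:
```
def speedup(throughput, baseline_throughput):
    # Throughput relative to the single-GPU baseline of the same group.
    return throughput / baseline_throughput

# 32-GPU vs. 1-GPU throughput from the batch-size-per-device=128 group below.
print(round(speedup(11548.451, 383.760), 2))  # -> 30.09
```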
## Test Results
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/of_leinao_benchmark_log_0813.tar.gz)
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/oneflow_resnet50_logs.tgz)
### Group: batch size per device = 128
ResNet50 V1.5, batch size per device=128, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 128 | 405.53 | 1.00 | 12553 |
| 1 | 2 | 128 | 795.19 | 1.96 | 12993 |
| 1 | 4 | 128 | 1589.44 | 3.92 | 12941 |
| 1 | 8 | 128 | 3160.44 | 7.79 | 12943 |
| 2 | 16 | 128 | 6273.50 | 15.47 | 13617 |
| 4 | 32 | 128 | 12230.47 | 30.16 | 13643 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 128 | 12565 | 383.760 | 1 |
| 1 | 2 | 2 | 128 | 12839 | 747.295 | 1.95 |
| 1 | 4 | 4 | 128 | 12987 | 1497.618 | 3.90 |
| 1 | 8 | 8 | 128 | 13051 | 2942.321 | 7.67 |
| 2 | 8 | 16 | 128 | 12871 | 5839.054 | 15.22 |
| 4 | 8 | 32 | 128 | 12871 | 11548.451 | 30.09 |
### Group: batch size per device = 160
ResNet50 V1.5, batch size per device=160, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 160 | 405.46 | 1.00 | 15583 |
| 1 | 2 | 160 | 797.04 | 1.97 | 15765 |
| 1 | 4 | 160 | 1593.81 | 3.93 | 15837 |
| 1 | 8 | 160 | 3177.84 | 7.84 | 15889 |
| 2 | 16 | 160 | 6282.58 | 15.49 | 15963 |
| 4 | 32 | 160 | 12370.62 | 30.51 | 15965 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 160 | 15509 | 382.324 | 1 |
| 1 | 2 | 2 | 160 | 15785 | 755.956 | 1.98 |
| 1 | 4 | 4 | 160 | 15881 | 1494.733 | 3.91 |
| 1 | 8 | 8 | 160 | 15701 | 3016.431 | 7.89 |
| 2 | 8 | 16 | 160 | 15817 | 5877.289 | 15.37 |
| 4 | 8 | 32 | 160 | 15879 | 11623.889 | 30.40 |