Unverified commit bfdbae75 authored by Lyon, committed by GitHub

Merge pull request #108 from Oneflow-Inc/dev_modify_cnn_training_param

Dev modify cnn training param
......@@ -30,6 +30,7 @@ def _default_config(args):
def get_train_config(args):
train_config = _default_config(args)
train_config.train.primary_lr(args.learning_rate)
train_config.cudnn_conv_heuristic_search_algo(False)
train_config.prune_parallel_cast_ops(True)
......
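For readers skimming this hunk, a hedged sketch of how these options fit together, assuming OneFlow's 0.1.x `flow.function_config()` API; `_default_config` is only stubbed here, and the comments describe a plausible reading of the two new switches rather than confirmed semantics:
```
import oneflow as flow

def _default_config(args):
    # Minimal stand-in for the project's real _default_config.
    config = flow.function_config()
    config.default_data_type(flow.float)
    return config

def get_train_config(args):
    train_config = _default_config(args)
    train_config.train.primary_lr(args.learning_rate)
    # Turn off cuDNN's heuristic conv-algorithm search (assumption: a more
    # exhaustive search is then used to pick conv kernels).
    train_config.cudnn_conv_heuristic_search_algo(False)
    # Allow the graph compiler to prune redundant parallel-cast ops.
    train_config.prune_parallel_cast_ops(True)
    return train_config
```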
......@@ -80,6 +80,7 @@ def TrainNet():
else:
loss = flow.nn.sparse_softmax_cross_entropy_with_logits(labels, logits, name="softmax_loss")
loss = flow.math.reduce_mean(loss)
flow.losses.add_loss(loss)
predictions = flow.nn.softmax(logits)
outputs = {"loss": loss, "predictions": predictions, "labels": labels}
......@@ -124,7 +125,7 @@ def main():
save_summary_steps=num_val_steps, batch_size=val_batch_size)
for i in range(num_val_steps):
InferenceNet().async_get(metric.metric_cb(epoch, i))
snapshot.save('epoch_{}'.format(epoch))
# snapshot.save('epoch_{}'.format(epoch))
if __name__ == "__main__":
......
#!/bin/sh
rm -rf core.*
rm -rf ./output/snapshots/*
# training with synthetic data
python3 of_cnn_train_val.py \
--num_examples=50 \
--num_val_examples=50 \
......@@ -13,4 +14,29 @@ python3 of_cnn_train_val.py \
--batch_size_per_device=16 \
--val_batch_size_per_device=16 \
--num_epoch=10 \
--model="resnet50"
\ No newline at end of file
--model="resnet50"
# # training with imagenet
# DATA_ROOT=/datasets/ImageNet/ofrecord
# LOG_FOLDER=../logs
# mkdir -p $LOG_FOLDER
# LOGFILE=$LOG_FOLDER/resnet_training.log
# python3 of_cnn_train_val.py \
# --train_data_dir=$DATA_ROOT/train \
# --train_data_part_num=256 \
# --val_data_dir=$DATA_ROOT/validation \
# --val_data_part_num=256 \
# --num_nodes=1 \
# --gpu_num_per_node=4 \
# --model_update="momentum" \
# --learning_rate=0.256 \
# --loss_print_every_n_iter=100 \
# --batch_size_per_device=64 \
# --val_batch_size_per_device=50 \
# --num_epoch=90 \
# --model="resnet50" 2>&1 | tee ${LOGFILE}
# echo "Writting log to ${LOGFILE}"
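One relationship worth spelling out for the commented ImageNet command: in a data-parallel run, the effective global batch size is the per-device batch size multiplied by the total number of devices. A tiny sketch, using the values above:
```
def global_batch_size(batch_size_per_device, gpu_num_per_node, num_nodes):
    # Effective batch size seen by the optimizer across all devices.
    return batch_size_per_device * gpu_num_per_node * num_nodes

# Values from the commented ImageNet command: 64 per device, 4 GPUs, 1 node.
print(global_batch_size(64, 4, 1))  # -> 256
```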
# OneFlow BERT Pretrain Benchmark Test Report
This document reports OneFlow BERT Pretrain benchmark test results on Aug 13 2020.
This document reports OneFlow BERT Pretrain benchmark test results on Aug 9 2020.
## Test Environment
All tests were performed on 4 GPU servers, each with 8x Tesla V100-SXM2-16GB. The main hardware and software configuration of each server is as follows:
......@@ -9,7 +9,7 @@ All tests were performed on 4 GPU Servers with 8x Tesla V100-SXM2-16GB and follo
- Memory 384G
- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- CUDA Version: 10.2, Driver Version: 440.33.01
- OneFlow: v0.1.8, master@4d44113e2 with NCCL 2.4.8
- OneFlow: v0.1.8, fix_infer_out_logical_blob_desc@17a2bdc9b
- OneFlow-Benchmark: master@892f87e6
- `nvidia-smi topo -m`
```
......@@ -171,40 +171,39 @@ step: 199, total_loss: 9.640, mlm_loss: 8.960, nsp_loss: 0.680, throughput: 142.
```
Normally, the first `throughput` value, e.g. `52.257`, is discarded because the start time of the first batch is not accurate. We average the remaining `throughput` values as the throughput of this test.
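As an illustration, a minimal sketch of that averaging, assuming log lines in the `step: ..., throughput: ...` format shown above (the log path is hypothetical):
```
import re

def average_throughput(log_path):
    # Collect every "throughput: <value>" figure from the log and average
    # them, discarding the first one because its timing includes warm-up.
    with open(log_path) as f:
        values = [float(v) for v in re.findall(r"throughput: ([\d.]+)", f.read())]
    return sum(values[1:]) / len(values[1:])

print(average_throughput("../logs/bert_base_pretrain.log"))  # hypothetical path
```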
## BERT Base Pretrain Test Results
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/of_leinao_benchmark_log_0813.tar.gz)
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/oneflow_bert_benchmark_logs.tgz)
### Group: batch size per device = 32
BERT Base Pretrain, batch size per device=32, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 32 | 137.17 | 1.00 | 6205 |
| 1 | 2 | 32 | 250.41 | 1.83 | 7071 |
| 1 | 4 | 32 | 502.70 | 3.66 | 7139 |
| 1 | 8 | 32 | 990.87 | 7.22 | 7215 |
| 2 | 16 | 32 | 1573.31 | 11.47 | 7135 |
| 4 | 32 | 32 | 3081.96 | 22.47 | 7149 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 32 | 6207 | 140.034 | 1 |
| 1 | 2 | 2 | 32 | 7081 | 254.304 | 1.82 |
| 1 | 4 | 4 | 32 | 7255 | 506.989 | 3.62 |
| 1 | 8 | 8 | 32 | 7323 | 1010.446 | 7.22 |
| 2 | 8 | 16 | 32 | 7145 | 1571.088 | 11.22 |
| 4 | 8 | 32 | 32 | 7185 | 3136.797 | 22.40 |
### Group: batch size per device = 64
BERT Base Pretrain, batch size per device=64, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 64 | 145.55 | 1.00 | 9987 |
| 1 | 2 | 64 | 277.03 | 1.90 | 10847 |
| 1 | 4 | 64 | 551.78 | 3.79 | 10923 |
| 1 | 8 | 64 | 1105.13 | 7.59 | 11057 |
| 2 | 16 | 64 | 2016.09 | 13.85 | 10937 |
| 4 | 32 | 64 | 3911.90 | 26.88 | 10963 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 64 | 9989 | 145.148 | 1 |
| 1 | 2 | 2 | 64 | 10947 | 277.880 | 1.91 |
| 1 | 4 | 4 | 64 | 10955 | 552.843 | 3.81 |
| 1 | 8 | 8 | 64 | 11029 | 1103.102 | 7.60 |
| 2 | 8 | 16 | 64 | 10957 | 2023.743 | 13.94 |
| 4 | 8 | 32 | 64 | 10981 | 3947.739 | 27.20 |
### Group: batch size per device = 96
BERT Base Pretrain, batch size per device=96, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 96 | 148.34 | 1.00 | 13769 |
| 1 | 2 | 96 | 286.24 | 1.93 | 14735 |
| 1 | 4 | 96 | 573.85 | 3.87 | 14809 |
| 1 | 8 | 96 | 1147.47 | 7.74 | 14893 |
| 2 | 16 | 96 | 2169.65 | 14.63 | 14763 |
| 4 | 32 | 96 | 4238.85 | 28.58 | 14795 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 96 | 13771 | 145.095 | 1 |
| 1 | 2 | 2 | 96 | 14757 | 282.984 | 1.95 |
| 1 | 4 | 4 | 96 | 14851 | 559.011 | 3.85 |
| 1 | 8 | 8 | 96 | 14815 | 1121.632 | 7.73 |
| 2 | 8 | 16 | 96 | 14815 | 2132.490 | 14.70 |
| 4 | 8 | 32 | 96 | 14687 | 4140.439 | 28.54 |
## BERT Large Pretrain Test Results
BERT Large was tested in the same environment. Some arguments in `local_run.sh` need to be modified to match the BERT Large pretrain configuration.
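For reference, the standard published BERT Base vs. BERT Large model sizes that such a modification targets; the exact `local_run.sh` flag names are not shown in this diff, so the keys below are assumptions:
```
# Standard BERT model sizes (Devlin et al., 2018); flag names are assumed.
BERT_BASE = dict(num_hidden_layers=12, hidden_size=768, num_attention_heads=12)
BERT_LARGE = dict(num_hidden_layers=24, hidden_size=1024, num_attention_heads=16)
```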
......
# OneFlow ResNet50-V1.5 Benchmark Test Report
This document reports OneFlow ResNet50-V1.5 benchmark test results on Aug 13 2020.
This document reports OneFlow ResNet50-V1.5 benchmark test results on Aug 8 2020.
## Test Environment
All tests were performed on 4 GPU servers, each with 8x Tesla V100-SXM2-16GB. The main hardware and software configuration of each server is as follows:
......@@ -9,7 +9,7 @@ All tests were performed on 4 GPU Servers with 8x Tesla V100-SXM2-16GB and follo
- Memory 384G
- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- CUDA Version: 10.2, Driver Version: 440.33.01
- OneFlow: v0.1.8, master@4d44113e2 with NCCL 2.4.8
- OneFlow: v0.1.8, fix_infer_out_logical_blob_desc@17a2bdc9b
- OneFlow-Benchmark: master@892f87e6
- `nvidia-smi topo -m`
```
......@@ -163,26 +163,26 @@ train: epoch 0, iter 200, loss: 1.047857, top_1: 1.000000, top_k: 1.000000, samp
```
Normally, the first `samples/s` value, e.g. `288.088`, is discarded because the start time of the first batch is not accurate. We average the remaining `samples/s` values as the throughput of this test.
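The `Speedup` column in the tables below is the averaged throughput of each configuration divided by the single-GPU throughput of the same batch-size group; a minimal sketch, using values from the bsz=128 group:
```
def speedup(throughput, baseline_throughput):
    # Throughput relative to the single-GPU baseline of the same group.
    return throughput / baseline_throughput

# 32-GPU vs. 1-GPU throughput from the batch-size-per-device=128 group below.
print(round(speedup(11548.451, 383.760), 2))  # -> 30.09
```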
## Test Results
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/of_leinao_benchmark_log_0813.tar.gz)
All test logs can be found [here](https://oneflow-public.oss-cn-beijing.aliyuncs.com/OF_benchmark_logs/oneflow_resnet50_logs.tgz)
### Group: batch size per device = 128
ResNet50 V1.5, batch size per device=128, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 128 | 405.53 | 1.00 | 12553 |
| 1 | 2 | 128 | 795.19 | 1.96 | 12993 |
| 1 | 4 | 128 | 1589.44 | 3.92 | 12941 |
| 1 | 8 | 128 | 3160.44 | 7.79 | 12943 |
| 2 | 16 | 128 | 6273.50 | 15.47 | 13617 |
| 4 | 32 | 128 | 12230.47 | 30.16 | 13643 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 128 | 12565 | 383.760 | 1 |
| 1 | 2 | 2 | 128 | 12839 | 747.295 | 1.95 |
| 1 | 4 | 4 | 128 | 12987 | 1497.618 | 3.90 |
| 1 | 8 | 8 | 128 | 13051 | 2942.321 | 7.67 |
| 2 | 8 | 16 | 128 | 12871 | 5839.054 | 15.22 |
| 4 | 8 | 32 | 128 | 12871 | 11548.451 | 30.09 |
### Group: batch size per device = 160
ResNet50 V1.5, batch size per device=160, dtype=float32, without XLA
| node num | device num | bsz per device | throughput | speedup | memory(MiB) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 160 | 405.46 | 1.00 | 15583 |
| 1 | 2 | 160 | 797.04 | 1.97 | 15765 |
| 1 | 4 | 160 | 1593.81 | 3.93 | 15837 |
| 1 | 8 | 160 | 3177.84 | 7.84 | 15889 |
| 2 | 16 | 160 | 6282.58 | 15.49 | 15963 |
| 4 | 32 | 160 | 12370.62 | 30.51 | 15965 |
| node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput | Speedup |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | 1 | 1 | 160 | 15509 | 382.324 | 1 |
| 1 | 2 | 2 | 160 | 15785 | 755.956 | 1.98 |
| 1 | 4 | 4 | 160 | 15881 | 1494.733 | 3.91 |
| 1 | 8 | 8 | 160 | 15701 | 3016.431 | 7.89 |
| 2 | 8 | 16 | 160 | 15817 | 5877.289 | 15.37 |
| 4 | 8 | 32 | 160 | 15879 | 11623.889 | 30.40 |