Dev mindspore bert FP16 with dynamic loss scale (#122)

* update README
* update README of resnet50v1.5
* update scripts
* update README
* update README

0e26548d · YongtaoShi · GitHub · 87124338
6 changed files
--- a/MindSpore/bert/README.md
+++ b/MindSpore/bert/README.md
--- a/MindSpore/bert/run_pretrain.py
+++ b/MindSpore/bert/run_pretrain.py
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+"""
+#################pre_train bert example on zh-wiki########################
+python run_pretrain.py
+"""
+
+import os
+import argparse
+import mindspore.communication.management as D
+from mindspore.communication.management import get_rank
+import mindspore.common.dtype as mstype
+from mindspore import context
+from mindspore.train.model import Model
+from mindspore.context import ParallelMode
+from mindspore.nn.wrap.loss_scale import DynamicLossScaleUpdateCell
+from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, TimeMonitor
+from mindspore.train.serialization import load_checkpoint, load_param_into_net
+from mindspore.nn.optim import Lamb, Momentum, AdamWeightDecay
+from mindspore import log as logger
+from mindspore.common import set_seed
+from src import BertNetworkWithLoss, BertTrainOneStepCell, BertTrainOneStepWithLossScaleCell, \
+                BertTrainAccumulateStepsWithLossScaleCell, BertTrainOneStepWithLossScaleCellForAdam, \
+                AdamWeightDecayForBert
+from src.dataset import create_bert_dataset
+from src.config import cfg, bert_net_cfg
+from src.utils import LossCallBack, BertLearningRate
+_current_dir = os.path.dirname(os.path.realpath(__file__))
+
+
+def _set_bert_all_reduce_split(device_target='Ascend', enable_graph_kernel=False):
+    """set bert all_reduce fusion split, support num_hidden_layers is 12 and 24."""
+    if bert_net_cfg.num_hidden_layers == 12:
+        if bert_net_cfg.use_relative_positions:
+            context.set_auto_parallel_context(all_reduce_fusion_config=[29, 58, 87, 116, 145, 174, 203, 217])
+        else:
+            context.set_auto_parallel_context(all_reduce_fusion_config=[28, 55, 82, 109, 136, 163, 190, 205])
+            if device_target == 'GPU' and enable_graph_kernel:
+                context.set_auto_parallel_context(all_reduce_fusion_config=[180, 205])
+    elif bert_net_cfg.num_hidden_layers == 24:
+        if bert_net_cfg.use_relative_positions:
+            context.set_auto_parallel_context(all_reduce_fusion_config=[30, 90, 150, 210, 270, 330, 390, 421])
+        else:
+            context.set_auto_parallel_context(all_reduce_fusion_config=[38, 93, 148, 203, 258, 313, 368, 397])
+
+
+def _get_optimizer(args_opt, network):
+    """get bert optimizer, support Lamb, Momentum, AdamWeightDecay."""
+    if cfg.optimizer == 'Lamb':
+        lr_schedule = BertLearningRate(learning_rate=cfg.Lamb.learning_rate,
+                                       end_learning_rate=cfg.Lamb.end_learning_rate,
+                                       warmup_steps=cfg.Lamb.warmup_steps,
+                                       decay_steps=args_opt.train_steps,
+                                       power=cfg.Lamb.power)
+        params = network.trainable_params()
+        decay_params = list(filter(cfg.Lamb.decay_filter, params))
+        other_params = list(filter(lambda x: not cfg.Lamb.decay_filter(x), params))
+        group_params = [{'params': decay_params, 'weight_decay': cfg.Lamb.weight_decay},
+                        {'params': other_params},
+                        {'order_params': params}]
+        optimizer = Lamb(group_params, learning_rate=lr_schedule, eps=cfg.Lamb.eps)
+    elif cfg.optimizer == 'Momentum':
+        optimizer = Momentum(network.trainable_params(), learning_rate=cfg.Momentum.learning_rate,
+                             momentum=cfg.Momentum.momentum)
+    elif cfg.optimizer == 'AdamWeightDecay':
+        lr_schedule = BertLearningRate(learning_rate=cfg.AdamWeightDecay.learning_rate,
+                                       end_learning_rate=cfg.AdamWeightDecay.end_learning_rate,
+                                       warmup_steps=cfg.AdamWeightDecay.warmup_steps,
+                                       decay_steps=args_opt.train_steps,
+                                       power=cfg.AdamWeightDecay.power)
+        params = network.trainable_params()
+        decay_params = list(filter(cfg.AdamWeightDecay.decay_filter, params))
+        other_params = list(filter(lambda x: not cfg.AdamWeightDecay.decay_filter(x), params))
+        group_params = [{'params': decay_params, 'weight_decay': cfg.AdamWeightDecay.weight_decay},
+                        {'params': other_params, 'weight_decay': 0.0},
+                        {'order_params': params}]
+        if args_opt.enable_lossscale == "true" and args_opt.device_target == 'GPU':
+            optimizer = AdamWeightDecayForBert(group_params, learning_rate=lr_schedule, eps=cfg.AdamWeightDecay.eps)
+        else:
+            optimizer = AdamWeightDecay(group_params, learning_rate=lr_schedule, eps=cfg.AdamWeightDecay.eps)
+    else:
+        raise ValueError("Don't support optimizer {}, only support [Lamb, Momentum, AdamWeightDecay]".
+                         format(cfg.optimizer))
+    return optimizer
+
+
+def _auto_enable_graph_kernel(device_target, graph_kernel_mode):
+    """Judge whether is suitable to enable graph kernel."""
+    return graph_kernel_mode in ("auto", "true") and device_target == 'GPU' and \
+        cfg.bert_network == 'base' and cfg.optimizer == 'AdamWeightDecay'
+
+
+def run_pretrain():
+    """pre-train bert_clue"""
+    parser = argparse.ArgumentParser(description='bert pre_training')
+    parser.add_argument('--device_target', type=str, default='Ascend', choices=['Ascend', 'GPU'],
+                        help='Device on which the code will run. (Default: Ascend)')
+    parser.add_argument("--distribute", type=str, default="false", choices=["true", "false"],
+                        help="Run distributed training, default is false.")
+    parser.add_argument("--epoch_size", type=int, default=1, help="Epoch size, default is 1.")
+    parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
+    parser.add_argument("--device_num", type=int, default=1, help="Number of devices to use, default is 1.")
+    parser.add_argument("--enable_save_ckpt", type=str, default="true", choices=["true", "false"],
+                        help="Enable save checkpoint, default is true.")
+    parser.add_argument("--enable_lossscale", type=str, default="true", choices=["true", "false"],
+                        help="Use lossscale or not, default is true.")
+    parser.add_argument("--do_shuffle", type=str, default="true", choices=["true", "false"],
+                        help="Enable shuffle for dataset, default is true.")
+    parser.add_argument("--enable_data_sink", type=str, default="true", choices=["true", "false"],
+                        help="Enable data sink, default is true.")
+    parser.add_argument("--data_sink_steps", type=int, default=1, help="Sink steps for each epoch, default is 1.")
+    parser.add_argument("--accumulation_steps", type=int, default=1,
+                        help="Accumulating gradients N times before weight update, default is 1.")
+    parser.add_argument("--save_checkpoint_path", type=str, default="", help="Save checkpoint path")
+    parser.add_argument("--load_checkpoint_path", type=str, default="", help="Load checkpoint file path")
+    parser.add_argument("--save_checkpoint_steps", type=int, default=1000, help="Save checkpoint steps, "
+                                                                                "default is 1000.")
+    parser.add_argument("--train_steps", type=int, default=-1, help="Training Steps, default is -1, "
+                                                                    "meaning run all steps according to epoch number.")
+    parser.add_argument("--save_checkpoint_num", type=int, default=1, help="Save checkpoint numbers, default is 1.")
+    parser.add_argument("--data_dir", type=str, default="", help="Data path, it is better to use absolute path")
+    parser.add_argument("--schema_dir", type=str, default="", help="Schema path, it is better to use absolute path")
+    parser.add_argument("--enable_graph_kernel", type=str, default="auto", choices=["auto", "true", "false"],
+                        help="Accelerate by graph kernel, default is auto.")
+
+    parser.add_argument("--optimizer", type=str, default="AdamWeightDecay", choices=["AdamWeightDecay", "Lamb", "Momentum"],
+                        help="Optimizer, default is AdamWeightDecay.")
+    parser.add_argument("--enable_global_norm", type=str, default="true", choices=["true", "false"],
+                        help="Enable global norm for grad clip, default is true.")
+    parser.add_argument("--batch_size", type=int, default=32, help="Batch size, default is 32.")
+    parser.add_argument("--dtype", type=str, default="fp32", choices=["fp32", "fp16"],
+                        help="dtype, default is fp32.")
+
+    args_opt = parser.parse_args()
+    cfg.optimizer = args_opt.optimizer
+    cfg.batch_size = args_opt.batch_size
+    cfg.enable_global_norm = args_opt.enable_global_norm == "true"
+    bert_net_cfg.compute_type = mstype.float32 if args_opt.dtype == "fp32" else mstype.float16
+    logger.warning("\nargs_opt: {}".format(args_opt))
+    logger.warning("\ncfg: {}".format(cfg))
+
+    context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=args_opt.device_id)
+    context.set_context(reserve_class_name_in_scope=False)
+    is_auto_enable_graph_kernel = _auto_enable_graph_kernel(args_opt.device_target, args_opt.enable_graph_kernel)
+    if args_opt.enable_graph_kernel == "true" or is_auto_enable_graph_kernel:
+        context.set_context(enable_graph_kernel=True)
+    ckpt_save_dir = args_opt.save_checkpoint_path
+    if args_opt.distribute == "true":
+        if args_opt.device_target == 'Ascend':
+            D.init()
+            device_num = args_opt.device_num
+            rank = args_opt.device_id % device_num
+        else:
+            D.init()
+            device_num = D.get_group_size()
+            rank = D.get_rank()
+        ckpt_save_dir = args_opt.save_checkpoint_path + 'ckpt_' + str(get_rank()) + '/'
+
+        context.reset_auto_parallel_context()
+        context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
+                                          device_num=device_num)
+        _set_bert_all_reduce_split(args_opt.device_target, context.get_context('enable_graph_kernel'))
+    else:
+        rank = 0
+        device_num = 1
+
+    if args_opt.accumulation_steps > 1:
+        logger.info("accumulation steps: {}".format(args_opt.accumulation_steps))
+        logger.info("global batch size: {}".format(cfg.batch_size * args_opt.accumulation_steps))
+        if args_opt.enable_data_sink == "true":
+            args_opt.data_sink_steps *= args_opt.accumulation_steps
+            logger.info("data sink steps: {}".format(args_opt.data_sink_steps))
+        if args_opt.enable_save_ckpt == "true":
+            args_opt.save_checkpoint_steps *= args_opt.accumulation_steps
+            logger.info("save checkpoint steps: {}".format(args_opt.save_checkpoint_steps))
+
+    ds = create_bert_dataset(device_num, rank, args_opt.do_shuffle, args_opt.data_dir, args_opt.schema_dir)
+    net_with_loss = BertNetworkWithLoss(bert_net_cfg, True)
+
+    new_repeat_count = args_opt.epoch_size * ds.get_dataset_size() // args_opt.data_sink_steps
+    if args_opt.train_steps > 0:
+        train_steps = args_opt.train_steps * args_opt.accumulation_steps
+        new_repeat_count = min(new_repeat_count, train_steps // args_opt.data_sink_steps)
+    else:
+        args_opt.train_steps = args_opt.epoch_size * ds.get_dataset_size() // args_opt.accumulation_steps
+        logger.info("train steps: {}".format(args_opt.train_steps))
+
+    optimizer = _get_optimizer(args_opt, net_with_loss)
+    callback = [TimeMonitor(args_opt.data_sink_steps), LossCallBack(ds.get_dataset_size())]
+    if args_opt.enable_save_ckpt == "true" and args_opt.device_id % min(8, device_num) == 0:
+        config_ck = CheckpointConfig(save_checkpoint_steps=args_opt.save_checkpoint_steps,
+                                     keep_checkpoint_max=args_opt.save_checkpoint_num)
+        ckpoint_cb = ModelCheckpoint(prefix='checkpoint_bert',
+                                     directory=None if ckpt_save_dir == "" else ckpt_save_dir, config=config_ck)
+        callback.append(ckpoint_cb)
+
+    if args_opt.load_checkpoint_path:
+        param_dict = load_checkpoint(args_opt.load_checkpoint_path)
+        load_param_into_net(net_with_loss, param_dict)
+
+    if args_opt.enable_lossscale == "true":
+        update_cell = DynamicLossScaleUpdateCell(loss_scale_value=cfg.loss_scale_value,
+                                                 scale_factor=cfg.scale_factor,
+                                                 scale_window=cfg.scale_window)
+
+        if args_opt.accumulation_steps <= 1:
+            if cfg.optimizer == 'AdamWeightDecay' and args_opt.device_target == 'GPU':
+                net_with_grads = BertTrainOneStepWithLossScaleCellForAdam(net_with_loss, optimizer=optimizer,
+                                                                          scale_update_cell=update_cell)
+            else:
+                net_with_grads = BertTrainOneStepWithLossScaleCell(net_with_loss, optimizer=optimizer,
+                                                                   scale_update_cell=update_cell)
+        else:
+            accumulation_steps = args_opt.accumulation_steps
+            net_with_grads = BertTrainAccumulateStepsWithLossScaleCell(net_with_loss, optimizer=optimizer,
+                                                                       scale_update_cell=update_cell,
+                                                                       accumulation_steps=accumulation_steps,
+                                                                       enable_global_norm=cfg.enable_global_norm)
+    else:
+        net_with_grads = BertTrainOneStepCell(net_with_loss, optimizer=optimizer)
+
+    model = Model(net_with_grads)
+    model.train(new_repeat_count, ds, callbacks=callback,
+                dataset_sink_mode=(args_opt.enable_data_sink == "true"), sink_size=args_opt.data_sink_steps)
+
+
+if __name__ == '__main__':
+    set_seed(0)
+    run_pretrain()
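
The `enable_lossscale` branch above wires a `DynamicLossScaleUpdateCell` into the training cell. As a rough illustration of the policy such an updater follows (shrink the scale when gradients overflow, grow it again after `scale_window` consecutive clean steps), here is a minimal pure-Python sketch. The class `DynamicLossScaleSketch`, its field names, and the sample values are hypothetical and are not the MindSpore implementation; the values of `cfg.loss_scale_value`, `cfg.scale_factor`, and `cfg.scale_window` are not shown in this diff.

```python
from dataclasses import dataclass


@dataclass
class DynamicLossScaleSketch:
    """Hypothetical stand-in for a dynamic loss-scale updater (not the MindSpore class)."""
    loss_scale: float     # current multiplier applied to the loss before backprop
    scale_factor: float   # factor used both to shrink and to grow the scale
    scale_window: int     # clean steps required before the scale grows again
    good_steps: int = 0   # consecutive steps without overflow

    def update(self, overflow: bool) -> bool:
        """Adjust the scale after one step; return True if the step should be skipped."""
        if overflow:
            # Gradients hit inf/nan: shrink the scale (never below 1) and restart the count.
            self.loss_scale = max(self.loss_scale / self.scale_factor, 1.0)
            self.good_steps = 0
            return True
        self.good_steps += 1
        if self.good_steps >= self.scale_window:
            # A full window without overflow: try a larger scale.
            self.loss_scale *= self.scale_factor
            self.good_steps = 0
        return False


# Illustrative values only; the real ones come from cfg.
scaler = DynamicLossScaleSketch(loss_scale=1024.0, scale_factor=2.0, scale_window=3)
history = []
for overflow in [False, False, True, False, False, False]:
    scaler.update(overflow)
    history.append(scaler.loss_scale)
print(history)
```

In FP16 runs the overflow flag would come from checking the scaled gradients for inf/nan; the sketch takes it as an input so that only the update policy is shown.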
--- a/MindSpore/bert/scripts/run_distributed_pretrain_for_gpu.sh
+++ b/MindSpore/bert/scripts/run_distributed_pretrain_for_gpu.sh
@@ -27,6 +27,11 @@ else
    echo "Invalid node num."
 fi

+ENABLE_LOSSSCALE="false"
+if [ "${DTYPE}" = "fp16" ] ; then
+  ENABLE_LOSSSCALE="true"
+fi
+
 export CUDA_VISIBLE_DEVICES=$DEVICE_ID
 export GLOG_logtostderr=1
 export GLOG_v=2
@@ -46,7 +51,7 @@ mpirun --allow-run-as-root \
    --distribute="true"        \
    --epoch_size=1    \
    --enable_save_ckpt="false"    \
-    --enable_lossscale="false"    \
+    --enable_lossscale=$ENABLE_LOSSSCALE \
    --enable_data_sink="true"    \
    --data_sink_steps=10        \
    --train_steps=$NUM_STEP \

--- a/MindSpore/bert/scripts/run_standalone_pretrain_for_gpu.sh
+++ b/MindSpore/bert/scripts/run_standalone_pretrain_for_gpu.sh
@@ -7,8 +7,12 @@ NUM_STEP=${4:-120}
 ENABLE_GRAPH_KERNEL=${5:-'false'}
 TEST_NUM=${6:-1}

-export CUDA_VISIBLE_DEVICES=$DEVICE_ID
+ENABLE_LOSSSCALE="false"
+if [ "${DTYPE}" = "fp16" ] ; then
+  ENABLE_LOSSSCALE="true"
+fi

+export CUDA_VISIBLE_DEVICES=$DEVICE_ID
 export GLOG_logtostderr=1
 export GLOG_v=2
 LOG_FOLDER=./logs/mindspore/bert/bz${BATCH_SIZE}/1n1g
@@ -20,7 +24,7 @@ python run_pretrain.py  \
    --distribute="false" \
    --epoch_size=1 \
    --enable_save_ckpt="false" \
-    --enable_lossscale="false" \
+    --enable_lossscale=$ENABLE_LOSSSCALE \
    --enable_data_sink="true" \
    --data_sink_steps=10 \
    --train_steps=$NUM_STEP \
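
Both scripts pass `--data_sink_steps=10` and `--train_steps=$NUM_STEP` (default 120 in the standalone script). The sketch below mirrors the arithmetic in `run_pretrain.py` that turns those flags into the number of sink iterations handed to `model.train`. The function name `sink_iterations` and the dataset size of 10000 batches are assumptions made for illustration; the real size comes from `ds.get_dataset_size()`, and data sinking is assumed to be enabled.

```python
def sink_iterations(epoch_size, dataset_size, data_sink_steps,
                    train_steps=-1, accumulation_steps=1):
    """Return the repeat count, following the logic in run_pretrain.py.

    Assumes data sinking is enabled, so data_sink_steps is scaled by
    accumulation_steps when gradients are accumulated.
    """
    if accumulation_steps > 1:
        data_sink_steps *= accumulation_steps
    # Upper bound: all batches of all epochs, grouped into sink-sized chunks.
    repeat_count = epoch_size * dataset_size // data_sink_steps
    if train_steps > 0:
        # Each optimizer step consumes accumulation_steps micro-batches.
        micro_steps = train_steps * accumulation_steps
        repeat_count = min(repeat_count, micro_steps // data_sink_steps)
    return repeat_count


# Hypothetical dataset of 10000 batches, flags as in the GPU scripts.
print(sink_iterations(epoch_size=1, dataset_size=10000,
                      data_sink_steps=10, train_steps=120))
```

With these inputs the run is capped by `train_steps` rather than by the dataset size, which is what a short benchmarking run wants.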

--- a/MindSpore/resnet50v1.5/README.md
+++ b/MindSpore/resnet50v1.5/README.md
@@ -2,7 +2,7 @@

 # Overview

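-This reproduction uses the [ResNet](https://gitee.com/mindspore/mindspore/tree/r1.1/model_zoo/official/cv/resnet) implementation from the [official MindSpore repository](https://gitee.com/mindspore/mindspore/tree/r1.1). The goal is to benchmark training speed and, based on the measured throughput, to report the speedup for 1, 2, and 4 nodes in order to evaluate the framework's horizontal scalability in distributed multi-node training.
+This reproduction uses the [ResNet](https://gitee.com/mindspore/mindspore/tree/e13c045ced043de5998f5f77acc0ebe7da4eed5c/model_zoo/official/cv/resnet) implementation from the [official MindSpore repository](https://gitee.com/mindspore/mindspore/tree/e13c045ced043de5998f5f77acc0ebe7da4eed5c). The goal is to benchmark training speed and, based on the measured throughput, to report the speedup for 1, 2, and 4 nodes in order to evaluate the framework's horizontal scalability in distributed multi-node training.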

 Currently, the tests cover FP32 and FP16 mixed precision; they will be maintained and extended with more evaluation modes.

@@ -24,6 +24,10 @@
 - CUDA 10.1.243
 - OpenMPI 4.0.3

+## Framework
+
+- **MindSpore 1.1.0**
+
 ## Feature support matrix

 | Feature            | ResNet50v1.5 MindSpore    |
@@ -36,8 +40,8 @@

 ## Project code

- [Official MindSpore repository](https://gitee.com/mindspore/mindspore/tree/r1.1)
-  - [ResNet project home page](https://gitee.com/mindspore/mindspore/tree/r1.1/model_zoo/official/cv/resnet)
+- [Official MindSpore repository](https://gitee.com/mindspore/mindspore/tree/e13c045ced043de5998f5f77acc0ebe7da4eed5c)
+  - [ResNet project home page](https://gitee.com/mindspore/mindspore/tree/e13c045ced043de5998f5f77acc0ebe7da4eed5c/model_zoo/official/cv/resnet)

 Download the official source code:

@@ -45,6 +49,7 @@
 git clone https://gitee.com/mindspore/mindspore.git
 cd mindspore/
 git checkout r1.1
+git reset e13c045ced043de5998f5f77acc0ebe7da4eed5c --hard
 cd model_zoo/official/cv/resnet/
 ```

@@ -142,7 +147,7 @@ cd model_zoo/official/cv/resnet/
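 ## Container

 This benchmark uses the official Docker image provided by MindSpore; you can
-refer to the GPU section of the [official MindSpore documentation](https://gitee.com/mindspore/mindspore/tree/r1.1/#docker%E9%95%9C%E5%83%8F)
+refer to the GPU section of the [official MindSpore documentation](https://gitee.com/mindspore/mindspore/tree/e13c045ced043de5998f5f77acc0ebe7da4eed5c/#docker%E9%95%9C%E5%83%8F)
 to **obtain the project image**.

 For the `GPU` backend, make sure `nvidia-container-toolkit` is installed beforehand. The following is an installation guide for `Ubuntu` users: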
@@ -201,7 +206,7 @@ docker run -it \

 ## Dataset

-The dataset uses JPEG images directly; see the ImageNet2012 section of the [official MindSpore repository instructions](https://gitee.com/mindspore/mindspore/tree/r1.1/model_zoo/official/cv/resnet#%E6%95%B0%E6%8D%AE%E9%9B%86).
+The dataset uses JPEG images directly; see the ImageNet2012 section of the [official MindSpore repository instructions](https://gitee.com/mindspore/mindspore/tree/e13c045ced043de5998f5f77acc0ebe7da4eed5c/model_zoo/official/cv/resnet#%E6%95%B0%E6%8D%AE%E9%9B%86).

 ## SSH configuration (optional)

@@ -376,6 +381,8 @@ extract_mindspore_logs_time.py uses the elapsed times printed in the log, excluding the first 100 it
 | 2        | 16      | 5483.22   | 14.83   |
 | 4        | 32      | 10731.78  | 29.02   |

+Note: with the batch size adjusted in steps of 32, the maximum batch size is 128; anything larger causes OOM (out of memory).
+
 ### ResNet50v1.5  FP16

 #### batch size=256
@@ -389,6 +396,8 @@ extract_mindspore_logs_time.py uses the elapsed times printed in the log, excluding the first 100 it
 | 2        | 16      | 12057.38  | 10.80   |
 | 4        | 32      | 24183.95  | 21.67   |

+Note: with the batch size adjusted in steps of 32, the maximum batch size is 256; anything larger causes OOM (out of memory).
+
 ### Full logs

 - [resnet50_fp32.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/MindSpore/resnet50/resnet50_fp32.zip) 
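
The result tables in this README report a speedup next to each throughput figure. A small sketch of how those two columns relate is given below. It assumes that the speedup is throughput divided by single-GPU throughput, and that the columns of the rows shown above are node count, GPU count, throughput, and speedup; neither the table header nor the single-GPU row appears in this diff, so the baseline is derived from the rows rather than quoted. The function names are hypothetical.

```python
def implied_baseline(throughput, speedup):
    """Single-device throughput implied by a (throughput, speedup) pair."""
    return throughput / speedup


def scaling_efficiency(speedup, num_devices):
    """Fraction of ideal linear scaling achieved on num_devices devices."""
    return speedup / num_devices


# FP32 rows visible in the diff: (nodes, GPUs, throughput, speedup).
rows = [(2, 16, 5483.22, 14.83), (4, 32, 10731.78, 29.02)]
for nodes, gpus, throughput, speedup in rows:
    base = implied_baseline(throughput, speedup)
    eff = scaling_efficiency(speedup, gpus)
    print(f"{nodes} nodes / {gpus} GPUs: baseline {base:.2f}, efficiency {eff:.1%}")
```

If both rows imply roughly the same baseline, the speedup column is consistent with a single shared single-GPU measurement.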

--- a/README.md
+++ b/README.md
@@ -15,8 +15,9 @@ Multiple deep learning frameworks are evaluated in this repository, they are:
 3. PyTorch
 4. MXNet
 5. PaddlePaddle
+6. MindSpore

-More frameworks will be included in the future, such as MindSpore, MegEngine, etc.
+More frameworks will be included in the future, such as MegEngine, etc.

 ### Evaluated Deep Neural Network models

@@ -32,7 +33,7 @@ The first type is classical deep neural network models that used to evaluate the
 1. **ResNet-50 v1.5**
 2. **BERT-Base**

-The secode type is that some models use special techniques or frameworks with unique implementations,such as implementation of [Megatron-LM](https://github.com/microsoft/DeepSpeedExamples/tree/a79272cc8b8f0c5b66c803e581a1355341eacb77/Megatron-LM) based on Microsoft's framwork deepspeed, [HugeCTR](https://github.com/NVIDIA/HugeCTR)(Designed for CTR estimation training and implemented by NVIDIA).
+The second type is that some models use special techniques or frameworks with unique implementations, such as the implementation of [Megatron-LM](https://github.com/microsoft/DeepSpeedExamples/tree/a79272cc8b8f0c5b66c803e581a1355341eacb77/Megatron-LM) based on Microsoft's framework DeepSpeed, and [HugeCTR](https://github.com/NVIDIA/HugeCTR) (designed for CTR estimation training and implemented by NVIDIA).

 In general, there are a lot of different implementations of these DNN models, we choose official benchmark source as well as [NVIDIA-DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples). In most cases, we avoid changing any scripts and codes from origin. If we have to, changes were mentioned in the documents.

@@ -92,6 +93,7 @@ To get a continuous and stable output, first several training steps are ignored.
 - `TensorFlow/`: holds the reproducible scripts and test reports for DNN models from [TensorFlow 2.x official benchmark](https://github.com/tensorflow/models/tree/r2.3.0);
 - `PyTorch/`: holds the reproducible scripts and test reports for DNN models from [PyTorch official benchmark](https://github.com/PyTorch/examples/tree/49ec0bd72b85be55579ae8ceb278c66145f593e1);
 - `MxNet/`: holds the reproducible scripts and test reports for DNN models from [gluon-nlp](https://github.com/dmlc/gluon-nlp)  and [gluon-cv](https://github.com/dmlc/gluon-cv);
+- `MindSpore/`: holds the reproducible scripts and test reports for DNN models from [MindSpore official benchmark](https://gitee.com/mindspore/mindspore/tree/r1.1/model_zoo);
 - `reports`: holds rounds of DNN's benchmark test reports.

 ## Summary of Latest Test Results(common cases)
@@ -122,6 +124,7 @@ This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1,
 | [TensorFlow 2.x](https://github.com/tensorflow/tensorflow/tree/v2.3.0) | [TensorFlow-models](https://github.com/tensorflow/models/tree/r2.3.0/official/vision/image_classification) | [9418.44](./TensorFlow/resnet50v1.5)                         | 29.27                   | 19314.31                                            | 17.96                          |
 | [PyTorch](https://github.com/pytorch/pytorch/tree/v1.6.0)    | [PyTorch-examples](https://github.com/PyTorch/examples/tree/49ec0bd72b85be55579ae8ceb278c66145f593e1/imagenet) | [10021.29](./PyTorch/resnet50v1.5)                           | 28.75                   | <sup>[2]</sup> -                                    | -                              |
 | [PaddlePaddle](https://github.com/PaddlePaddle/Paddle/tree/v1.8.3) | [PaddleCV](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleCV/image_classification) | [9348.17](./PaddlePaddle/resnet50v1.5)                       | 26.50                   | <sup>[3]</sup>10633.22<br>11617.57<sup>w/DALI</sup> | 10.2<br>13.1<sup>w/DALI</sup>  |
+| [MindSpore](https://gitee.com/mindspore/mindspore/tree/e13c045ced043de5998f5f77acc0ebe7da4eed5c) | [MindSpore-model_zoo](https://gitee.com/mindspore/mindspore/tree/e13c045ced043de5998f5f77acc0ebe7da4eed5c/model_zoo/official/cv/resnet) | [10731.78](./MindSpore/resnet50v1.5)                       | 29.02                   | 24183.95              | 21.67              |

 [1]:  AMP throughput of TensorFlow 1.x is obtained **with** or **without** XLA and using **bsz = 224**, because when bsz = 256 OOM (out of memory) will be encountered.

@@ -141,6 +144,7 @@ Our results were obtained by running the applicable training scripts on 4 nodes
 | [PaddlePaddle](https://github.com/PaddlePaddle/Paddle/tree/v1.8.3) | [PaddleNLP](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleNLP/pretrain_language_models/BERT) | [3167.68<br>bsz=96](./PaddlePaddle/bert)                     | 2073.60                   | 5452.35<br>bsz=160                | 3406.36                                                      |
 | [OneFlow](https://github.com/Oneflow-Inc/oneflow/tree/v0.2.0)<sup>W/O clip</sup> | [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/v0.2.0/LanguageModeling/BERT) | [4799.64<br/>bsz=96](./OneFlow)                              | 4019.45                   | 17210.63<br>bsz=160               | 11195.72                                                     |
 | <sup>[5]</sup>[MXNet](https://github.com/apache/incubator-mxnet/tree/1.6.0)<sup>W/O clip</sup> | [gluon-nlp](https://github.com/dmlc/gluon-nlp/tree/7b7bf60259e28b3bf1f4d70569a7e5c18e2f4b3e/scripts/bert) | [4340.89<br>bsz=64](./MxNet/BERT)                            | 3671.45                   | 14822.31<br>bsz=128               | 11269.14                                                     |
+| [MindSpore](https://gitee.com/mindspore/mindspore/tree/d9db5bf730ee7aa252eb7df41ffad09501acbe44) | [MindSpore-model_zoo](https://gitee.com/mindspore/mindspore/tree/d9db5bf730ee7aa252eb7df41ffad09501acbe44/model_zoo/official/nlp/bert) | [3051.3<br/>bsz=64](./MindSpore/bert)                     | 2457.8                   | 6068.55<br>bsz=128                | 4659.76                                                      |

 [4]: AMP throughput of TensorFlow 1.x is obtained **with** or **without** XLA.