add deepspeed-bert readme

7cf3c693 · Flowingsun007 · b1f1b503 · 7cf3c693 · 7cf3c693 · 7cf3c693
9 changed file
--- a/DeepSpeed/README.md
+++ b/DeepSpeed/README.md
+# 【DLPerf】DeepSpeed - BERT测评
+
+# Overview
+
+本次复现采用了微软[DeepSpeed官方仓库](https://github.com/microsoft/DeepSpeed/tree/1afca8f722fbcedf4b0ec0bf9e165d60564a7bba)中的[BERT-base](https://github.com/microsoft/DeepSpeedExamples/tree/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert)，目的在于速度测评，同时根据测速结果给出1机、2机器、4机情况下的加速比，评判框架在分布式多机训练情况下的横向拓展能力。
+
+目前，该测试覆盖了FP32、FP16混合精度，后续将持续维护，增加更多方式的测评。
+
+
+
+# Environment
+
+## 系统
+
+- 系统：Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
+- 显卡：Tesla V100-SXM2-16GB x 8
+- 驱动：NVIDIA 440.33.01
+- CUDA：10.2
+- cuDNN：7.6.5
+- NCCL：2.7.3
+
+## 框架
+
+- **torch 1.6.0 **
+
+## Feature support matrix
+
+| Feature                       | ResNet-50 v1.5 Paddle |
+| ----------------------------- | --------------------- |
+| Multi-node,multi-gpu training | Yes                   |
+| NVIDIA NCCL                   | Yes                   |
+| Mixed precision               | Yes                   |
+
+# Quick Start
+
+## 项目代码
+
+- 微软[DeepSpeed官方仓库](https://github.com/microsoft/DeepSpeed/tree/1afca8f722fbcedf4b0ec0bf9e165d60564a7bba)
+  - [BERT](https://github.com/microsoft/DeepSpeedExamples/tree/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert)
+
+下载官方源码：
+
+```shell
+git clone https://github.com/microsoft/DeepSpeed.git
+cd DeepSpeed && git checkout 1afca8f722fbcedf4b0ec0bf9e165d60564a7bba
+cd DeepSpeedExamples/bing_bert
+```
+
+
+将本页面中所有.json配置文件、scripts文件夹中的.sh脚本文件全部放入：`bing_bert/`路径下。
+
+
+
+## 依赖安装
+
+如果您是通过`docker pull deepspeed/deepspeed:latest`拉取了DeepSpeed官方镜像，则可通过下面命令启动容器：
+
+```
+docker pull deepspeed/deepspeed:latest
+docker run -it -d  -p 12345:22 --shm-size=16g --ulimit memlock=-1 --privileged  \
+--name deepspeed --cap-add=IPC_LOCK  \
+-v /datasets/bert/deepspeed/data/test:/datasets  \
+deepspeed/deepspeed:latest
+```
+
+并且容器内已安装好各种依赖，无需额外进行下面的安装。如果您是在物理机上，则需要进行下面的步骤来安装环境。
+
+### 创建环境
+
+```shell
+# 创建conda环境
+conda create -n deepspeed python=3.7.9
+conda activate deepspeed
+# 安装pytorch
+python3 -m pip install torch==1.6.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install torchvision==0.7.0 matplotlib==3.3.2
+sudo apt install pdsh
+```
+
+修改`DeepSpeed/requirements/requirements.txt`，将前两行注释掉：
+
+```
+# torch>=1.2
+# torchvision>=0.4.0
+```
+
+安装BERT训练相关依赖
+
+```
+# DeepSpeed主目录下执行
+python3 -m pip install -r requirements/requirements.txt
+python3 -m pip install -r requirements/requirements-dev.txt
+```
+
+### 安装deepspeed
+
+conda deepspeed环境下执行：`bash install.sh` ，命令执行后会编译并安装一系列whl包如：apex,deepspeed等，过程中可能会报错：
+
+![error.png](https://cdn.nlark.com/yuque/0/2020/png/216914/1602677523508-d71b79b8-625c-453a-9d64-c75d84afba79.png)
+
+报错原因在于，脚本中执行的pip使用了本地.local中的pip，而其版本和conda环境deepspeed下的python3.7不同，导致报错：`ERROR: apex-0.1-cp37-cp37m-linux_x86_64.whl is not a supported wheel on this platform`
+
+如报错，可修改脚本install.sh[第153行](https://github.com/microsoft/DeepSpeed/blob/1afca8f722fbcedf4b0ec0bf9e165d60564a7bba/install.sh#L153)，改为：
+
+```
+PIP_SUDO="python3  -m "
+```
+
+后重新执行`bash install.sh`
+
+编译耗时大约几分钟，成功后：
+
+![success.png](https://cdn.nlark.com/yuque/0/2020/png/216914/1602678788832-9819f36a-0b68-4a4c-a9dd-3b6b5a24fc03.png)
+
+
+
+## NCCL
+
+pytorch的分布式训练底层依赖NCCL库，需要从[NVIDIA-NCCL官网下载](https://developer.nvidia.com/nccl/nccl-download)并安装和操作系统、CUDA版本适配的NCCL。本次测试中安装2.7.3版本的NCCL：
+
+```shell
+sudo dpkg -i nccl-repo-ubuntu1604-2.7.3-ga-cuda10.2_1-1_amd64.deb
+sudo apt update
+sudo apt install libnccl2=2.7.3-1+cuda10.2 libnccl-dev=2.7.3-1+cuda10.2
+```
+
+## 数据集
+
+### 训练集
+
+本次训练使用Wikipedia数据集，并根据NVIDIA官方提供的脚本制作转换为.hdf5格式，详见：[NVIDIA-quick-start-guide](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#quick-start-guide)。
+### 词表文件
+
+由于直接运行训练，程序会自动从s3.amazonaws.com下载词表文件(vocab.txt)，但速度很慢，故我们可以手动下载词表文件并放入新建文件夹`bing_bert/data`下，（直接运行训练，程序会自动从亚马逊amazonaws自动所有文件，但速度很慢）词表文件下载链接见：[tokenization.py](https://github.com/microsoft/DeepSpeedExamples/blob/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert/pytorch_pretrained_bert/tokenization.py)。下载完成并将词表文件存入`bing_bert/data`后，注释掉[tokenization.py Line:30]([tokenization.py](https://github.com/microsoft/DeepSpeedExamples/blob/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert/pytorch_pretrained_bert/tokenization.py#L30)) 的`PRETRAINED_VOCAB_ARCHIVE_MAP{}`，修改如下：
+
+```python3
+CACHE_DIR = "/your/path/to/DeepSpeed/DeepSpeedExamples/bing_bert/data/"
+PRETRAINED_VOCAB_ARCHIVE_MAP = {
+    'bert-base-uncased':
+    CACHE_DIR+"bert-base-uncased-vocab.txt",
+    'bert-large-uncased':
+    CACHE_DIR+"bert-large-uncased-vocab.txt",
+    'bert-base-cased':
+    CACHE_DIR+"bert-base-cased-vocab.txt",
+    'bert-large-cased':
+    CACHE_DIR+"bert-large-cased-vocab.txt",
+    'bert-base-multilingual-uncased':
+    CACHE_DIR+"bert-base-multilingual-uncased-vocab.txt",
+    'bert-base-multilingual-cased':
+    CACHE_DIR+"bert-base-multilingual-cased-vocab.txt",
+    'bert-base-chinese':
+    CACHE_DIR+"bert-base-chinese-vocab.txt",
+}
+```
+
+修改完成后，程序将从本地加载所需词表文件。
+
+
+# Training
+
+集群中有4台节点：
+
+
+- NODE1=10.11.0.2
+- NODE2=10.11.0.3
+- NODE3=10.11.0.4
+- NODE4=10.11.0.5
+
+每个节点有8张显卡，这里默认设置batch size为32，分别在1机1卡～4机32卡的情况下进行了多组训练。
+
+训练前，我们还需要做一些准备工作：
+
+1.安装依赖：`python3 -m pip install boto3 h5py`（不安装直接跑bert pretraining会报错)
+
+2.修改deepdpeed_train.py[Line 197](https://github.com/microsoft/DeepSpeedExamples/blob/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert/deepspeed_train.py#L197)如下：
+
+```
+else:
+     epoch_step += 1
+     # Call DeepSpeed engine step on micro steps
+     model.network.step()
+```
+
+以使训练120iter后，程序会自动退出。
+
+## 单机
+
+`DeepSpeed/DeepSpeedExamples/bing_bert/`目录下，执行脚本：
+
+```shell
+bash run_single_node.sh
+```
+
+对单机1卡、4卡、8卡分别做5组测试，默认测试fp32精度，batch_size=32。
+
+### 混合精度
+
+可以通过参数指定fp16及batch_size：
+
+```shell
+bash run_single_node.sh 32 fp16
+```
+
+也可以自行指定精度以及batch_size：`bash run_single_node.sh 128 fp16`,`bash run_single_node.sh 64 fp32`。
+
+
+
+## 2机16卡
+
+2机、4机等多机情况下，需要在所有机器节点上相同路径准备同样的数据集、以完成分布式训练；此外，还需配置各个机器节点直接的ssh免密登录，配置完成后，我们在`deepSpeed/DeepSpeedExamples/bing_bert/`目录下新建hosts文件，例如deepspeed_hosts：
+
+```
+NODE1 slots=8
+NODE1 slots=8
+```
+
+此文件，指定了2个节点NODE1和NODE2，每个节点使用8块GPU进行训练。
+
+在主节点(NODE1节点)`deepSpeed/DeepSpeedExamples/bing_bert/`目录下执行脚本:
+
+```shell
+bash run_two_node.sh
+```
+
+即可运行2机16卡的训练，同样默认测试5组（fp32精度，batch_size=32）
+
+### 混合精度
+
+可以通过参数指定fp16及batch_size：
+
+```shell
+bash run_two_node.sh 128  fp16
+```
+
+## 4机32卡
+
+流程同上，在主节点上执行：
+
+```shell
+bash run_multi_node.sh
+```
+
+以运行4机32卡的训练，默认测试5组（fp32精度，batch_size=32）。
+
+### 混合精度
+
+可以通过参数指定fp16及batch_size：
+
+```shell
+bash run_multi_node.sh 128 fp16
+```
+
+#### 注：FP32精度下的多机训练过程中会报错：
+
+```shell
+vs003: Traceback (most recent call last):
+vs003:   File "deepspeed_train.py", line 540, in <module>
+vs003:     main()
+vs003:   File "deepspeed_train.py", line 533, in main
+vs003:     run(args, model, optimizer, start_epoch)
+vs003:   File "deepspeed_train.py", line 499, in run
+vs003:     train(args, index, model, optimizer, pretrain_dataset_provider)
+vs003:   File "deepspeed_train.py", line 187, in train
+vs003:     report_step_metrics(args, lr_this_step, unscaled_loss,
+vs003: UnboundLocalError: local variable 'lr_this_step' referenced before assignment
+```
+
+此报错为官方代码中尚未修复的bug，详见：
+
+https://github.com/microsoft/DeepSpeedExamples/issues/53
+
+https://github.com/microsoft/DeepSpeed/issues/426
+
+
+
+# Result
+
+## 吞吐率及加速比
+
+执行以下命令，即可计算各种测试配置下的吞吐率及加速比：
+
+```shell
+python extract_deepspeed_logs_time.py  --log_dir=logs/deepspeed/bert/bz32 --batch_size_per_device=32
+```
+
+输出：
+
+```shell
+python extract_deepspeed_logs.py  --log_dir=logs/deepspeed/bert/bz32 --batch_size_per_device=32
+logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_4.log {4: 4891.77}
+logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_3.log {4: 4891.77, 3: 4903.74}
+logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_5.log {4: 4891.77, 3: 4903.74, 5: 4900.76}
+logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_2.log {4: 4891.77, 3: 4903.74, 5: 4900.76, 2: 4899.37}
+logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_1.log {4: 4891.77, 3: 4903.74, 5: 4900.76, 2: 4899.37, 1: 4873.92}
+logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_4.log {4: 1150.49}
+logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_3.log {4: 1150.49, 3: 1186.46}
+logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_5.log {4: 1150.49, 3: 1186.46, 5: 1146.9}
+logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_2.log {4: 1150.49, 3: 1186.46, 5: 1146.9, 2: 1145.22}
+logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_1.log {4: 1150.49, 3: 1186.46, 5: 1146.9, 2: 1145.22, 1: 1147.9}
+logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_4.log {4: 587.55}
+logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_3.log {4: 587.55, 3: 575.07}
+logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_5.log {4: 587.55, 3: 575.07, 5: 572.11}
+logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_2.log {4: 587.55, 3: 575.07, 5: 572.11, 2: 573.84}
+logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_1.log {4: 587.55, 3: 575.07, 5: 572.11, 2: 573.84, 1: 577.14}
+logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_4.log {4: 147.81}
+logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_3.log {4: 147.81, 3: 147.93}
+logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_5.log {4: 147.81, 3: 147.93, 5: 143.8}
+logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_2.log {4: 147.81, 3: 147.93, 5: 143.8, 2: 143.61}
+logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_1.log {4: 147.81, 3: 147.93, 5: 143.8, 2: 143.61, 1: 148.42}
+logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_4.log {4: 2273.68}
+logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_3.log {4: 2273.68, 3: 2267.55}
+logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_5.log {4: 2273.68, 3: 2267.55, 5: 2368.65}
+logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_2.log {4: 2273.68, 3: 2267.55, 5: 2368.65, 2: 2264.69}
+logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_1.log {4: 2273.68, 3: 2267.55, 5: 2368.65, 2: 2264.69, 1: 2266.27}
+{'bert': {'1n1g': {'average_speed': 146.31,
+                   'batch_size_per_device': 32,
+                   'median_speed': 147.81,
+                   'speedup': 1.0},
+          '1n4g': {'average_speed': 577.14,
+                   'batch_size_per_device': 32,
+                   'median_speed': 575.07,
+                   'speedup': 3.89},
+          '1n8g': {'average_speed': 1155.39,
+                   'batch_size_per_device': 32,
+                   'median_speed': 1147.9,
+                   'speedup': 7.77},
+          '2n8g': {'average_speed': 2288.17,
+                   'batch_size_per_device': 32,
+                   'median_speed': 2267.55,
+                   'speedup': 15.34},
+          '4n8g': {'average_speed': 4893.91,
+                   'batch_size_per_device': 32,
+                   'median_speed': 4899.37,
+                   'speedup': 33.15}}}
+Saving result to ./result/bz32_result.json
+```
+
+## 计算规则
+
+### 1.测速脚本
+
+- extract_paddle_logs.py
+- extract_paddle_logs_time.py
+
+两个脚本略有不同，得到的结果稍有误差：
+
+extract_paddle_logs.py根据官方在log中打印的速度，在120个iter中，排除前20iter，取后100个iter的速度做平均；
+
+extract_paddle_logs_time.py则根据log中打印出的时间，排除前20iter取后100个iter的实际运行时间计算速度。
+
+README展示的是extract_paddle_logs.py的计算结果。
+
+### 2.均值速度和中值速度
+
+- average_speed均值速度
+
+- median_speed中值速度
+
+  每个batch size进行5次训练测试，记为一组，每一组取average_speed为均值速度，median_speed为中值速度
+
+### 3.加速比以中值速度计算
+
+脚本和表格中的 **加速比** 是以单机单卡下的中值速度为基准进行计算的。例如:
+
+单机单卡情况下速度为200(samples/s)，单机2卡速度为400，单机4卡速度为700，则加速比分别为：1.0、2.0、3.5
+
+## BERT-Base  FP32
+
+### batch size=32 & without xla
+
+| node_num | gpu_num | samples/s | speedup |
+| -------- | ------- | --------- | ------- |
+| 1        | 1       | 147.81    | 1       |
+| 1        | 4       | 575.07    | 3.89    |
+| 1        | 8       | 1147.9    | 7.77    |
+| 2        | 16      | 2267.55   | 15.34   |
+| 4        | 32      | 4899.37   | 33.15   |
+
+### batch size=64 & without xla
+
+| node_num | gpu_num | samples/s | speedup |
+| -------- | ------- | --------- | ------- |
+| 1        | 1       | 152.32    | 1       |
+| 1        | 4       | 601.64    | 3.95    |
+| 1        | 8       | 1197.91   | 7.86    |
+| 2        | 16      | 2318.82   | 15.22   |
+| 4        | 32      | 4510.15   | 29.61   |
+
+
+
+## BERT-Base  FP16
+
+### batch size=64 & without xla
+
+| node_num | gpu_num | samples/s | speedup |
+| -------- | ------- | --------- | ------- |
+| 1        | 1       | 565.3     | 1       |
+| 1        | 4       | 2271.61   | 4.02    |
+| 1        | 8       | 4512.68   | 7.98    |
+| 2        | 16      | 8944.0    | 15.82   |
+| 4        | 32      | 16401.67  | 29.01   |
+
+
+### batch size=128 & without xla
+
+| node_num | gpu_num | samples/s | speedup |
+| -------- | ------- | --------- | ------- |
+| 1        | 1       | 607.12    | 1       |
+| 1        | 4       | 2412.1    | 3.97    |
+| 1        | 8       | 4863.79   | 8.01    |
+| 2        | 16      | 9892.88   | 16.29   |
+| 4        | 32      | 16809.43  | 27.69   |
+
+### batch size=160 & without xla
+
+| node_num | gpu_num | samples/s | speedup |
+| -------- | ------- | --------- | ------- |
+| 1        | 1       | 619.73    | 1       |
+| 1        | 4       | 2528.53   | 4.08    |
+| 1        | 8       | 4953.73   | 7.99    |
+| 2        | 16      | 10122.54  | 16.33   |
+| 4        | 32      | 17751.63  | 28.64   |
+
+## 完整日志
+
+- [bert_fp32.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/DeepSpeed/bert/bert_fp32.zip)
+- [bert_fp16.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/DeepSpeed/bert/bert_fp16.zip)
--- a/DeepSpeed/bert_base.json
+++ b/DeepSpeed/bert_base.json
+{
+    "name": "bing_bert_base",
+    "bert_token_file": "bert-base-uncased",
+    "bert_model_file": "bert-base-uncased",
+    "bert_model_config": {
+        "vocab_size_or_config_json_file": 30522,
+        "hidden_size": 768,
+        "num_hidden_layers": 12,
+        "num_attention_heads": 12,
+        "intermediate_size": 3072,
+        "hidden_act": "gelu",
+        "hidden_dropout_prob": 0.1,
+        "attention_probs_dropout_prob": 0.1,
+        "max_position_embeddings": 512,
+        "type_vocab_size": 2,
+        "initializer_range": 0.02
+    },
+    "data": {
+        "flags": {
+            "pretrain_dataset": true,
+            "pretrain_type": "wiki"
+        },
+        "mixed_seq_datasets": {
+            "128": {
+                "pretrain_dataset": "hdf5/wikicorpus_en/128"
+            },
+            "512": {
+                "pretrain_dataset": "hdf5/wikicorpus_en/512"
+            }
+        }
+    },
+    "mixed_seq_training": {
+        "128": {
+            "num_epochs": 1,
+            "warmup_proportion": 1.0,
+            "learning_rate": 4e-4,
+            "num_workers": 4,
+            "async_worker": true,
+            "decay_rate": 0.99,
+            "decay_step": 520,
+            "total_training_steps": 125000
+        },
+        "512": {
+            "num_epochs": 160,
+            "warmup_proportion": 0.02,
+            "learning_rate": 1e-5,
+            "num_workers": 0,
+            "async_worker": true,
+            "decay_rate": 0.90,
+            "decay_step": 150,
+            "total_training_steps": 7500
+        }
+    },
+    "validation": {
+        "path": "validation_set/"
+    }
+}
\ No newline at end of file
--- a/DeepSpeed/deepspeed_bsz64k_adam_config_seq128.json
+++ b/DeepSpeed/deepspeed_bsz64k_adam_config_seq128.json
+{
+  "train_batch_size": 32768,
+  "train_micro_batch_size_per_gpu": 32,
+  "steps_per_print": 1,
+  "prescale_gradients": false,
+  "optimizer": {
+    "type": "ADAM",
+    "params": {
+      "lr": 1e-4,
+      "weight_decay": 0.01,
+      "bias_correction": false
+    }
+  },
+  "gradient_clipping": 1.0,
+
+  "wall_clock_breakdown": false,
+
+  "fp16": {
+    "enabled":false,
+    "loss_scale": 0
+  },
+  "sparse_attention": {
+    "mode": "fixed",
+    "block": 16,
+    "different_layout_per_head": true,
+    "num_local_blocks": 4,
+    "num_global_blocks": 1,
+    "attention": "bidirectional",
+    "horizontal_global_attention": false,
+    "num_different_global_patterns": 4
+  }
+}
--- a/DeepSpeed/extract_deepspeed_logs.py
+++ b/DeepSpeed/extract_deepspeed_logs.py
+import os
+import re
+import sys
+import glob
+import json
+import argparse
+import pprint
+
+import numpy as np
+
+pp = pprint.PrettyPrinter(indent=1)
+os.chdir(sys.path[0])
+
+parser = argparse.ArgumentParser(description="flags for benchmark")
+parser.add_argument("--log_dir", type=str, default="./logs/deepspeed/bert/bz32", required=True)
+parser.add_argument("--output_dir", type=str, default="./result", required=False)
+parser.add_argument('--warmup_batches', type=int, default=20)
+parser.add_argument('--train_batches', type=int, default=120)
+parser.add_argument('--batch_size_per_device', type=int, default=32)
+
+args = parser.parse_args()
+
+
+class AutoVivification(dict):
+    """Implementation of perl's autovivification feature."""
+
+    def __getitem__(self, item):
+        try:
+            return dict.__getitem__(self, item)
+        except KeyError:
+            value = self[item] = type(self)()
+        return value
+
+
+def extract_info_from_file(log_file, result_dict, speed_dict):
+    # extract info from file name
+    fname = os.path.basename(log_file)
+    run_case = log_file.split("/")[-2]  # eg: 1n1g
+    model = fname.split("_")[0]
+    batch_size = int(fname.split("_")[1].strip("b"))
+    pricition = fname.split("_")[2]
+    test_iter = int(fname.split("_")[3].strip(".log"))
+    node_num = int(run_case[0])
+    if len(run_case) == 4:
+        card_num = int(run_case[-2])
+    elif len(run_case) == 5:
+        card_num = int(run_case[-3:-1])
+
+    total_batch_size = node_num * card_num * batch_size
+    tmp_dict = {
+        'average_speed': 0,
+        'batch_size_per_device': batch_size,
+    }
+
+    avg_speed_list = []
+    # extract info from file content
+    with open(log_file) as f:
+        lines = f.readlines()
+        for line in lines:
+            if "SamplesPerSec" in line:
+                p1 = re.compile(r"SamplesPerSec=(.*\.?.*)\n", re.S)
+                item = re.findall(p1, line)
+                a = float(item[0].strip())
+                avg_speed_list.append(round(a, 4))
+
+    # compute avg throughoutput
+    begin_index=args.warmup_batches-2
+    avg_speed = round(np.mean(avg_speed_list[begin_index:args.train_batches]), 2)
+    tmp_dict['average_speed'] = avg_speed
+
+    result_dict[model][run_case]['average_speed'] = tmp_dict['average_speed']
+    result_dict[model][run_case]['batch_size_per_device'] = tmp_dict['batch_size_per_device']
+
+    speed_dict[model][run_case][test_iter] = avg_speed
+
+    print(log_file, speed_dict[model][run_case])
+
+
+def compute_median(iter_dict):
+    speed_list = [i for i in iter_dict.values()]
+    return round(np.median(speed_list), 2)
+    
+
+def compute_speedup(result_dict, speed_dict):
+    model_list = [key for key in result_dict]  # eg.['vgg16', 'rn50']
+    for m in model_list:
+        run_case = [key for key in result_dict[m]]  # eg.['4n8g', '2n8g', '1n8g', '1n4g', '1n1g']
+        for d in run_case:
+            speed_up = 1.0
+            if result_dict[m]['1n1g']['average_speed']:
+                result_dict[m][d]['average_speed'] = compute_average(speed_dict[m][d])
+                result_dict[m][d]['median_speed'] = compute_median(speed_dict[m][d])
+                speed_up = result_dict[m][d]['median_speed'] / compute_median(speed_dict[m]['1n1g'])
+            result_dict[m][d]['speedup'] = round(speed_up, 2)
+
+
+def compute_average(iter_dict):
+    i = 0
+    total_speed = 0
+    for iter in iter_dict:
+        i += 1
+        total_speed += iter_dict[iter]
+    return round(total_speed / i, 2)
+
+
+def extract_result():
+    result_dict = AutoVivification()
+    speed_dict = AutoVivification()
+    logs_list = glob.glob(os.path.join(args.log_dir, "*/*.log"))
+    for l in logs_list:
+        extract_info_from_file(l, result_dict, speed_dict)
+
+    # compute speedup
+    compute_speedup(result_dict, speed_dict)
+
+    # print result
+    pp.pprint(result_dict)
+
+    # write to file as JSON format
+    os.makedirs(args.output_dir, exist_ok=True)
+    framwork = args.log_dir.split('/')[-1]
+    result_file_name = os.path.join(args.output_dir, framwork + "_result.json")
+    print("Saving result to {}".format(result_file_name))
+    with open(result_file_name, 'w') as f:
+        json.dump(result_dict, f)
+
+
+if __name__ == "__main__":
+    extract_result()
+
--- a/DeepSpeed/extract_deepspeed_logs_time.py
+++ b/DeepSpeed/extract_deepspeed_logs_time.py
+import os
+import re
+import sys
+import glob
+import json
+import argparse
+import pprint
+import time
+import datetime
+import numpy as np
+
+pp = pprint.PrettyPrinter(indent=1)
+os.chdir(sys.path[0])
+
+parser = argparse.ArgumentParser(description="flags for benchmark")
+parser.add_argument("--log_dir", type=str, default="./logs/deepspeed/bert/bz32", required=True)
+parser.add_argument("--output_dir", type=str, default="./result", required=False)
+parser.add_argument('--warmup_batches', type=int, default=20)
+parser.add_argument('--train_batches', type=int, default=120)
+parser.add_argument('--batch_size_per_device', type=int, default=32)
+
+args = parser.parse_args()
+
+
+class AutoVivification(dict):
+    """Implementation of perl's autovivification feature."""
+
+    def __getitem__(self, item):
+        try:
+            return dict.__getitem__(self, item)
+        except KeyError:
+            value = self[item] = type(self)()
+        return value
+
+
+def extract_info_from_file(log_file, result_dict, speed_dict):
+    # extract info from file name
+    fname = os.path.basename(log_file)
+    run_case = log_file.split("/")[-2]  # eg: 1n1g
+    model = fname.split("_")[0]
+    batch_size = int(fname.split("_")[1].strip("b"))
+    pricition = fname.split("_")[2]
+    test_iter = int(fname.split("_")[3].strip(".log"))
+
+    node_num = int(run_case[0])
+    if len(run_case) == 4:
+        card_num = int(run_case[-2])
+    elif len(run_case) == 5:
+        card_num = int(run_case[-3:-1])
+
+    total_batch_size = node_num * card_num * batch_size
+
+    tmp_dict = {
+        'average_speed': 0,
+        'batch_size_per_device': batch_size,
+    }
+
+    avg_speed = 0
+    # extract info from file content, e.g. 2020-10-27 11:28:12,892
+    pt = re.compile(r"(\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{1,3})", re.S)
+
+
+    s1 = "[timer.py:157:stop] 0/" + str(args.warmup_batches)
+    s2 = "[timer.py:157:stop] 0/" + str(args.train_batches)
+    start_time = ''
+    end_time = ''
+    with open(log_file) as f:
+        lines = f.readlines()
+        for line in lines:
+            if "SamplesPerSec" in line:
+                if s1 in line:
+                    start_time = re.findall(pt, line)[0]
+                    continue
+
+                if s2 in line:
+                    end_time = re.findall(pt, line)[0]
+                    t1 = datetime.datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S,%f")
+                    t2 = datetime.datetime.strptime(end_time, "%Y-%m-%d %H:%M:%S,%f")
+                    cost_time = (t2 - t1).total_seconds()
+                    iter_num = args.train_batches - args.warmup_batches
+                    avg_speed = round(float(total_batch_size) / (cost_time / iter_num), 2)
+                    break
+
+    # compute avg throughoutput
+    tmp_dict['average_speed'] = avg_speed
+    result_dict[model][run_case]['average_speed'] = avg_speed
+    result_dict[model][run_case]['batch_size_per_device'] = tmp_dict['batch_size_per_device']
+
+    speed_dict[model][run_case][test_iter] = avg_speed
+
+    print(log_file, speed_dict[model][run_case])
+
+
+def compute_speedup(result_dict, speed_dict):
+    model_list = [key for key in result_dict]  # eg.['vgg16', 'rn50']
+    for m in model_list:
+        run_case = [key for key in result_dict[m]]  # eg.['4n8g', '2n8g', '1n8g', '1n4g', '1n1g']
+        for d in run_case:
+            speed_up = 1.0
+            if result_dict[m]['1n1g']['average_speed']:
+                result_dict[m][d]['average_speed'] = compute_average(speed_dict[m][d])
+                result_dict[m][d]['median_speed'] = compute_median(speed_dict[m][d])
+                speed_up = result_dict[m][d]['median_speed'] / compute_median(speed_dict[m]['1n1g'])
+            result_dict[m][d]['speedup'] = round(speed_up, 2)
+
+
+def compute_median(iter_dict):
+    speed_list = [i for i in iter_dict.values()]
+    return round(np.median(speed_list), 2)
+
+
+def compute_average(iter_dict):
+    i = 0
+    total_speed = 0
+    for iter in iter_dict:
+        i += 1
+        total_speed += iter_dict[iter]
+    return round(total_speed / i, 4)
+
+
+def extract_result():
+    result_dict = AutoVivification()
+    speed_dict = AutoVivification()
+    logs_list = glob.glob(os.path.join(args.log_dir, "*/*.log"))
+    for l in logs_list:
+        extract_info_from_file(l, result_dict, speed_dict)
+
+    # compute speedup
+    compute_speedup(result_dict, speed_dict)
+
+    # print result
+    pp.pprint(result_dict)
+
+    # write to file as JSON format
+    os.makedirs(args.output_dir, exist_ok=True)
+    framwork = args.log_dir.split('/')[-1]
+    result_file_name = os.path.join(args.output_dir, framwork + "_result.json")
+    print("Saving result to {}".format(result_file_name))
+    with open(result_file_name, 'w') as f:
+        json.dump(result_dict, f)
+
+
+if __name__ == "__main__":
+    # The iteration output in tensorflow log files is an integer multiple of 10
+    assert args.warmup_batches % 10 ==0 and args.train_batches % 10 ==0
+    extract_result()
--- a/DeepSpeed/scripts/multi_node_train.sh
+++ b/DeepSpeed/scripts/multi_node_train.sh
+#!/bin/bash
+OUTPUT_DIR=../output
+# Where should we save checkpoints and tensorboard events?
+rm -rf $OUTPUT_DIR
+mkdir -p $OUTPUT_DIR
+MODEL=${1:-"bert_base"}
+BATCH_SIZE=${2:-32}
+gpus=${3:-"0"}
+nodes=${4:-$NODE1}
+TEST_NUM=${5:-1}
+DTYPE=${6:-"fp32"}
+
+a=`expr ${#gpus} + 1`
+num_gpus=`expr ${a} / 2`
+num_nodes=$(echo $nodes | tr ',' '\n' | wc -l)
+train_batch_size=`expr ${BATCH_SIZE} \* 1024`
+
+LOG_FOLDER=../logs-${DTYPE}/deepspeed/bert/bz${BATCH_SIZE}/${num_nodes}n${num_gpus}g
+mkdir -p $LOG_FOLDER
+LOGFILE=${LOG_FOLDER}/bert_b${BATCH_SIZE}_${DTYPE}_$TEST_NUM.log
+
+job_name=adam_nvidia_data_${MODEL}
+config=${MODEL}.json
+deepspeed_config=deepspeed_bsz64k_adam_config_seq128.json
+# deepspeed_config=deepspeed_bsz4k_onebit_config_seq128.json
+
+if  [ ${DTYPE} == "fp16" ];then
+    enabled=true
+else
+    enabled=false
+fi
+
+sed -i "s/\"train_batch_size\":.*$/\"train_batch_size\": $train_batch_size,/" $deepspeed_config
+sed -i "s/\"train_micro_batch_size_per_gpu\":.*$/\"train_micro_batch_size_per_gpu\": $BATCH_SIZE,/" $deepspeed_config
+sed -i "s/\"enabled\":.*$/\"enabled\":$enabled,/" $deepspeed_config
+
+
+DATA_PATH_PREFIX=/datasets/bert/deepspeed/data/test
+if  [ $num_nodes -ge 2 ];then
+    NCCL_TREE_THRESHOLD=0 deepspeed   --hostfile=deepspeed_hosts \
+        --num_nodes=$num_nodes \
+        --num_gpus=$num_gpus  deepspeed_train.py \
+        --cf  ${config}  \
+        --max_seq_length 128 \
+        --output_dir $OUTPUT_DIR \
+        --deepspeed \
+        --print_steps 1 \
+        --lr_schedule "EP" \
+        --max_steps_per_epoch 120 \
+        --lr_offset 10e-4 \
+        --job_name ${job_name} \
+        --deepspeed_config  $deepspeed_config \
+        --data_path_prefix  ${DATA_PATH_PREFIX}   \
+        --use_nvidia_dataset  2>&1 | tee $LOGFILE
+else
+    NCCL_TREE_THRESHOLD=0 deepspeed  \
+        --num_nodes=$num_nodes \
+        --num_gpus=$num_gpus  deepspeed_train.py \
+        --cf  ${config}  \
+        --max_seq_length 128 \
+        --output_dir $OUTPUT_DIR \
+        --deepspeed \
+        --print_steps 1 \
+        --lr_schedule "EP" \
+        --max_steps_per_epoch 120 \
+        --lr_offset 10e-4 \
+        --job_name ${job_name} \
+        --deepspeed_config  $deepspeed_config  \
+        --data_path_prefix  ${DATA_PATH_PREFIX}   \
+        --use_nvidia_dataset  2>&1 | tee $LOGFILE
+fi
+
--- a/DeepSpeed/scripts/run_multi_node.sh
+++ b/DeepSpeed/scripts/run_multi_node.sh
+SHELL_FOLDER=$(dirname $(readlink -f "$0"))
+BATCH_SIZE=${1:-32}
+DTYPE=${2:-"fp32"}
+NODE1='10.11.0.2'     
+NODE2='10.11.0.3'
+NODE3='10.11.0.4'
+NODE4='10.11.0.5'    
+nodes=$NODE1,$NODE2,$NODE3,$NODE4
+
+
+
+nodes=$NODE1,$NODE2,$NODE3,$NODE4
+i=1
+while [ $i -le 5 ]
+do
+  bash $SHELL_FOLDER/multi_node_train.sh     "bert_base"    $BATCH_SIZE   0,1,2,3,4,5,6,7  $nodes    $i   $DTYPE
+  echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+  let i++
+  sleep 20
+done
\ No newline at end of file
--- a/DeepSpeed/scripts/run_single_node.sh
+++ b/DeepSpeed/scripts/run_single_node.sh
+SHELL_FOLDER=$(dirname $(readlink -f "$0"))
+BATCH_SIZE=${1:-32}
+DTYPE=${2:-"fp32"}
+NODE1='10.11.0.2'     
+# NODE2='10.11.0.3'
+# NODE3='10.11.0.4'
+# NODE4='10.11.0.5'    
+nodes=$NODE1
+
+i=1
+while [ $i -le 5 ]
+do
+  bash $SHELL_FOLDER/multi_node_train.sh     "bert_base"    $BATCH_SIZE   0  $nodes    $i   $DTYPE
+  echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+  let i++
+  sleep 20
+done
+
+
+i=1
+while [ $i -le 5 ]
+do
+  bash $SHELL_FOLDER/multi_node_train.sh     "bert_base"    $BATCH_SIZE   0,1,2,3  $nodes    $i   $DTYPE
+  echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+  let i++
+  sleep 20
+done
+
+
+i=1
+while [ $i -le 5 ]
+do
+  bash $SHELL_FOLDER/multi_node_train.sh     "bert_base"    $BATCH_SIZE   0,1,2,3,4,5,6,7  $nodes    $i   $DTYPE
+  echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+  let i++
+  sleep 20
+done
\ No newline at end of file
--- a/DeepSpeed/scripts/run_two_node.sh
+++ b/DeepSpeed/scripts/run_two_node.sh
+SHELL_FOLDER=$(dirname $(readlink -f "$0"))
+BATCH_SIZE=${1:-32}
+DTYPE=${2:-"fp32"}
+NODE1='10.11.0.2'     
+NODE2='10.11.0.3'  
+nodes=$NODE1,$NODE2
+
+
+i=1
+while [ $i -le 5 ]
+do
+  bash $SHELL_FOLDER/multi_node_train.sh     "bert_base"    $BATCH_SIZE   0,1,2,3,4,5,6,7  $nodes    $i   $DTYPE
+  echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+  let i++
+  sleep 20
+done
\ No newline at end of file