# 【DLPerf】TensorFlow 2.x BERT Benchmark Test
# Overview
This test uses the TF 2.x implementation of [BERT](https://github.com/tensorflow/models/tree/r2.3.0/official/nlp/bert) from the [official TensorFlow repository](https://github.com/tensorflow/models/tree/r2.3.0). The purpose is speed benchmarking: based on the measured speeds, we report the speedup for 1-, 2-, and 4-node setups, in order to judge the framework's horizontal scalability in distributed multi-node training.
At the moment the test covers only the single-node case (FP32 and FP16 mixed precision); it will be maintained going forward, adding XLA and other configurations.
# Environment
## System
- OS: Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- GPUs: Tesla V100-SXM2-16GB x 8
- Driver: NVIDIA 440.33.01
- CUDA: 10.2
- cuDNN: 7.6.5
- NCCL: 2.7.3
## Framework
- **TensorFlow 2.3.0**
# Quick Start
## Project code
- [official TensorFlow repository](https://github.com/tensorflow/models/tree/r2.3.0)
- [BERT project page](https://github.com/tensorflow/models/tree/r2.3.0/official/nlp/bert)
Download the official source:
```shell
git clone https://github.com/tensorflow/models.git
cd models && git checkout r2.3.0
cd official/nlp/bert
```
Copy the scripts from this page's scripts folder (`run_single_node.sh`, `single_node_train.sh`, and `extract_tensorflow_logs_time.py`, all reproduced below) into the `models/official/nlp/bert` directory.
## Installing the framework
```shell
python -m pip install tensorflow==2.3.0 -i https://mirror.baidu.com/pypi/simple
```
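As a quick sanity check, confirm that the expected version was installed:
```shell
python -c "import tensorflow as tf; print(tf.__version__)"
# expected output: 2.3.0
```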
## NCCL
TensorFlow's distributed training relies on the NCCL library underneath; download an NCCL build matching your OS and CUDA version from the [NVIDIA NCCL download page](https://developer.nvidia.com/nccl/nccl-download) and install it.
NCCL 2.7.3 was installed for this test:
```shell
sudo dpkg -i nccl-repo-ubuntu1604-2.7.3-ga-cuda10.2_1-1_amd64.deb
sudo apt update
sudo apt install libnccl2=2.7.3-1+cuda10.2 libnccl-dev=2.7.3-1+cuda10.2
```
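To confirm that the expected NCCL version is in place, the installed packages can be listed, e.g.:
```shell
dpkg -l | grep nccl
```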
## Dataset
The BERT pre-training dataset is built following the official TensorFlow readme: [pre-training](https://github.com/tensorflow/models/tree/r2.3.0/official/nlp/bert#pre-training), using the official script [create_pretraining_data.py](https://github.com/tensorflow/models/blob/r2.3.0/official/nlp/data/create_pretraining_data.py).
Set the corresponding parameters in create_pretraining_data.py; the flags at LINE 34, LINE 38, and LINE 41 are required:
```shell
# LINE 34
flags.DEFINE_string("input_file", None,
                    "Input raw text file (or comma-separated list of files).")

# LINE 38
flags.DEFINE_string(
    "output_file", None,
    "Output TF example file (or comma-separated list of files).")

# LINE 41
flags.DEFINE_string("vocab_file", None,
                    "The vocabulary file that the BERT model was trained on.")
```
### Parameter notes
- input_file is the raw txt input, e.g. a wiki dump or another corpus in txt form; it can be a single file or multiple files.
  Examples:
```shell
'/datasets/bert/AA/wiki_00'
'/datasets/bert/AA/wiki_00,/datasets/bert/AA/wiki_01'
'/datasets/bert/AA/wiki_00,/datasets/bert/AA/wiki_*'
```
- output_file is the resulting tfrecord output; it can likewise be one or more files, e.g. wiki_AA.tfrecord
- vocab_file is the vocabulary file, e.g. uncased_L-12_H-768_A-12/vocab.txt
### Building the tfrecord dataset
With the raw txt files prepared and the parameters set, run the following to generate the tfrecord dataset:
```shell
export PYTHONPATH=$PYTHONPATH:/your/path/to/models
cd /your/path/to/models/official/nlp/bert
python3 ../data/create_pretraining_data.py
```
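Alternatively, instead of editing the defaults inside the file, the same parameters can be passed as command-line flags. A sketch with illustrative paths (adjust to your own dataset and vocabulary locations):
```shell
python3 ../data/create_pretraining_data.py \
  --input_file='/datasets/bert/AA/wiki_00' \
  --output_file='/datasets/bert/wiki/wiki_AA.tfrecord' \
  --vocab_file='/datasets/bert/uncased_L-12_H-768_A-12/vocab.txt'
```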
# Training
The cluster has 4 nodes:
- NODE1=10.11.0.2
- NODE2=10.11.0.3
- NODE3=10.11.0.4
- NODE4=10.11.0.5
Each node has 8 GPUs. Batch sizes of 32, 48, and 64 were used, and multiple groups of training runs were carried out from 1 node x 1 GPU up to 1 node x 8 GPUs.
## Single node
Update the PYTHONPATH environment variable: `export PYTHONPATH=$PYTHONPATH:/your/path/to/models`
In the `models/official/nlp/bert` directory, set the training/configuration parameters in the `single_node_train.sh` script, then run:
```shell
bash run_single_node.sh
```
Run 5 tests each for 1, 2, 4, and 8 GPUs on a single node. The script defaults to batch size 32; a different value can be passed as an argument, e.g. `bash run_single_node.sh 48` or `bash run_single_node.sh 64` for batch size 48 or 64.
### Mixed precision
Mixed precision can be enabled either by editing the variables in `run_single_node.sh` or directly via arguments, e.g.:
```shell
bash run_single_node.sh 64 5 'fp16'
```
This enables fp16 mixed precision with batch size 64 and 5 runs per group.
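`run_single_node.sh` forwards these values to `single_node_train.sh`, whose positional arguments (see the script below) are the GPU list, per-device batch size, dtype, step count, XLA switch, and test index; a single case can therefore also be launched directly, for example:
```shell
# one fp16 run on all 8 GPUs at batch size 64, 120 steps, XLA off, logged as test case 1
bash single_node_train.sh 0,1,2,3,4,5,6,7 64 fp16 120 false 1
```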
## 多机
测试过程中我们发现,官方提供的python脚本运行多机时会报错,即使在修改代码后也只能支持
`--all_reduce_alg='ring'`模式的多机训练(cpu多机),而不能支持'nccl'模式的多gpu训练,故多机的测试暂不开展。
# Result
## Full logs
- [bert_fp32.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/Tensorflow/bert/bert_fp32.zip)
- [bert_fp16.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/Tensorflow/bert/bert_fp16.zip)
## Speedup
Run the following script to compute the speedup for each case:
```shell
python extract_tensorflow_logs_time.py --log_dir=logs/tensorflow/bert/bz64 --batch_size_per_device=64
```
Output:
```shell
logs/tensorflow/bert/bz64/1n8g/bert_b64_fp32_4.log {4: 805.41}
logs/tensorflow/bert/bz64/1n8g/bert_b64_fp32_1.log {4: 805.41, 1: 806.74}
logs/tensorflow/bert/bz64/1n8g/bert_b64_fp32_2.log {4: 805.41, 1: 806.74, 2: 805.43}
logs/tensorflow/bert/bz64/1n8g/bert_b64_fp32_3.log {4: 805.41, 1: 806.74, 2: 805.43, 3: 806.01}
logs/tensorflow/bert/bz64/1n8g/bert_b64_fp32_5.log {4: 805.41, 1: 806.74, 2: 805.43, 3: 806.01, 5: 803.36}
logs/tensorflow/bert/bz64/1n4g/bert_b64_fp32_4.log {4: 402.34}
logs/tensorflow/bert/bz64/1n4g/bert_b64_fp32_1.log {4: 402.34, 1: 399.56}
logs/tensorflow/bert/bz64/1n4g/bert_b64_fp32_2.log {4: 402.34, 1: 399.56, 2: 402.02}
logs/tensorflow/bert/bz64/1n4g/bert_b64_fp32_3.log {4: 402.34, 1: 399.56, 2: 402.02, 3: 404.06}
logs/tensorflow/bert/bz64/1n4g/bert_b64_fp32_5.log {4: 402.34, 1: 399.56, 2: 402.02, 3: 404.06, 5: 400.27}
logs/tensorflow/bert/bz64/1n1g/bert_b64_fp32_4.log {4: 112.71}
logs/tensorflow/bert/bz64/1n1g/bert_b64_fp32_1.log {4: 112.71, 1: 113.55}
logs/tensorflow/bert/bz64/1n1g/bert_b64_fp32_2.log {4: 112.71, 1: 113.55, 2: 114.95}
logs/tensorflow/bert/bz64/1n1g/bert_b64_fp32_3.log {4: 112.71, 1: 113.55, 2: 114.95, 3: 112.99}
logs/tensorflow/bert/bz64/1n1g/bert_b64_fp32_5.log {4: 112.71, 1: 113.55, 2: 114.95, 3: 112.99, 5: 111.67}
logs/tensorflow/bert/bz64/1n2g/bert_b64_fp32_4.log {4: 204.96}
logs/tensorflow/bert/bz64/1n2g/bert_b64_fp32_1.log {4: 204.96, 1: 204.3}
logs/tensorflow/bert/bz64/1n2g/bert_b64_fp32_2.log {4: 204.96, 1: 204.3, 2: 202.48}
logs/tensorflow/bert/bz64/1n2g/bert_b64_fp32_3.log {4: 204.96, 1: 204.3, 2: 202.48, 3: 204.16}
logs/tensorflow/bert/bz64/1n2g/bert_b64_fp32_5.log {4: 204.96, 1: 204.3, 2: 202.48, 3: 204.16, 5: 203.15}
{'bert': {'1n1g': {'average_speed': 113.17,
'batch_size_per_device': 64,
'median_speed': 112.99,
'speedup': 1.0},
'1n2g': {'average_speed': 203.81,
'batch_size_per_device': 64,
'median_speed': 204.16,
'speedup': 1.81},
'1n4g': {'average_speed': 401.65,
'batch_size_per_device': 64,
'median_speed': 402.02,
'speedup': 3.56},
'1n8g': {'average_speed': 805.39,
'batch_size_per_device': 64,
'median_speed': 805.43,
'speedup': 7.13}}}
Saving result to ./result/bz64_result.json
```
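The other batch-size directories are processed the same way, for example:
```shell
python extract_tensorflow_logs_time.py --log_dir=logs/tensorflow/bert/bz32 --batch_size_per_device=32
python extract_tensorflow_logs_time.py --log_dir=logs/tensorflow/bert/bz48 --batch_size_per_device=48
```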
## Calculation rules
### 1. Speed measurement script
- extract_tensorflow_logs_time.py
extract_tensorflow_logs_time.py computes speed from the timestamps printed in the log: the first 20 iterations are discarded as warm-up, and the actual running time of the following 100 iterations is used, i.e. speed = total_batch_size x 100 / (elapsed time between iteration 20 and iteration 120).
### 2. Average speed and median speed
- average_speed: mean speed
- median_speed: median speed
Each batch size is trained and tested 5 times, forming one group; for each group, average_speed is the mean and median_speed is the median of the five runs.
### 3. Speedup is computed from the median speed
The **speedup** in the script and the tables is computed relative to the median speed of the single-node single-GPU case. For example:
if 1 node x 1 GPU runs at 200 samples/s, 1 node x 2 GPUs at 400, and 1 node x 4 GPUs at 700, the speedups are 1.0, 2.0, and 3.5 respectively.
## BERT-Base batch size = 64
### FP32 & Without XLA
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 112.99 | 1.00 |
| 1 | 2 | 204.16 | 1.81 |
| 1 | 4 | 402.02 | 3.56 |
| 1 | 8 | 805.43 | 7.13 |
## BERT-Base batch size = 48
### FP32 & Without XLA
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 108.94 | 1.00 |
| 1 | 2 | 194.29 | 1.78 |
| 1 | 4 | 384.59 | 3.53 |
| 1        | 8       | 752.21    | 6.90    |
## BERT-Base batch size = 32
### FP32 & Without XLA
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 103.58 | 1.00 |
| 1 | 2 | 177.18 | 1.71 |
| 1 | 4 | 347.83 | 3.36 |
| 1 | 8 | 675.82 | 6.52 |
## BERT-Base batch size = 64
### FP16 & Without XLA
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 228.66 | 1.00 |
| 1 | 2 | 385.19 | 1.68 |
| 1        | 4       | 746.90    | 3.27    |
| 1 | 8 | 1402.41 | 6.13 |
# extract_tensorflow_logs_time.py
import os
import re
import sys
import glob
import json
import argparse
import pprint
import time
import datetime

import numpy as np

pp = pprint.PrettyPrinter(indent=1)
os.chdir(sys.path[0])

parser = argparse.ArgumentParser(description="flags for benchmark")
parser.add_argument("--log_dir", type=str, default="./logs/tensorflow/bert", required=True)
parser.add_argument("--output_dir", type=str, default="./result", required=False)
parser.add_argument('--warmup_batches', type=int, default=20)
parser.add_argument('--train_batches', type=int, default=120)
parser.add_argument('--batch_size_per_device', type=int, default=32)
args = parser.parse_args()

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""

    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

def extract_info_from_file(log_file, result_dict, speed_dict):
    # extract info from the file name, e.g. 1n8g/bert_b64_fp32_1.log
    fname = os.path.basename(log_file)
    run_case = log_file.split("/")[-2]  # e.g. 1n1g
    model = fname.split("_")[0]
    batch_size = int(fname.split("_")[1].strip("b"))
    precision = fname.split("_")[2]
    test_iter = int(fname.split("_")[3].strip(".log"))
    node_num = int(run_case[0])
    if len(run_case) == 4:
        card_num = int(run_case[-2])
    elif len(run_case) == 5:
        card_num = int(run_case[-3:-1])
    total_batch_size = node_num * card_num * batch_size

    tmp_dict = {
        'average_speed': 0,
        'batch_size_per_device': batch_size,
    }
    avg_speed = 0

    # extract the timestamps of the first and last measured iterations
    pt = re.compile(r"(\d{1,2}:\d{1,2}:\d{1,2}.\d{1,6})", re.S)
    from_index = 20 if args.warmup_batches <= 20 else args.warmup_batches
    to_index = args.train_batches
    start_time = ''
    end_time = ''
    line_num = 0
    with open(log_file) as f:
        lines = f.readlines()
        for line in lines:
            if "Train Step:" in line:
                line_num += 1
                if line_num == from_index:
                    start_time = re.findall(pt, line)[0]
                    continue
                if line_num == to_index:
                    end_time = re.findall(pt, line)[0]
                    t1 = datetime.datetime.strptime(start_time, "%H:%M:%S.%f")
                    t2 = datetime.datetime.strptime(end_time, "%H:%M:%S.%f")
                    cost_time = (t2 - t1).total_seconds()
                    iter_num = args.train_batches - args.warmup_batches
                    avg_speed = round(float(total_batch_size) / (cost_time / iter_num), 2)
                    break

    # record the average throughput
    tmp_dict['average_speed'] = avg_speed
    result_dict[model][run_case]['average_speed'] = avg_speed
    result_dict[model][run_case]['batch_size_per_device'] = tmp_dict['batch_size_per_device']
    speed_dict[model][run_case][test_iter] = avg_speed
    print(log_file, speed_dict[model][run_case])

def compute_speedup(result_dict, speed_dict):
    model_list = [key for key in result_dict]  # e.g. ['vgg16', 'rn50']
    for m in model_list:
        run_case = [key for key in result_dict[m]]  # e.g. ['4n8g', '2n8g', '1n8g', '1n4g', '1n1g']
        for d in run_case:
            speed_up = 1.0
            if result_dict[m]['1n1g']['average_speed']:
                result_dict[m][d]['average_speed'] = compute_average(speed_dict[m][d])
                result_dict[m][d]['median_speed'] = compute_median(speed_dict[m][d])
                speed_up = result_dict[m][d]['median_speed'] / compute_median(speed_dict[m]['1n1g'])
            result_dict[m][d]['speedup'] = round(speed_up, 2)

def compute_median(iter_dict):
    speed_list = [i for i in iter_dict.values()]
    return round(np.median(speed_list), 2)


def compute_average(iter_dict):
    i = 0
    total_speed = 0
    for iter in iter_dict:
        i += 1
        total_speed += iter_dict[iter]
    return round(total_speed / i, 2)

def extract_result():
    result_dict = AutoVivification()
    speed_dict = AutoVivification()
    logs_list = glob.glob(os.path.join(args.log_dir, "*/*.log"))
    for l in logs_list:
        extract_info_from_file(l, result_dict, speed_dict)

    # compute speedup
    compute_speedup(result_dict, speed_dict)

    # print result
    pp.pprint(result_dict)

    # write to file as JSON format
    os.makedirs(args.output_dir, exist_ok=True)
    framework = args.log_dir.split('/')[-1]
    result_file_name = os.path.join(args.output_dir, framework + "_result.json")
    print("Saving result to {}".format(result_file_name))
    with open(result_file_name, 'w') as f:
        json.dump(result_dict, f)


if __name__ == "__main__":
    assert args.train_batches > args.warmup_batches
    extract_result()

#!/bin/bash
# run_single_node.sh
BATCH_SIZE=${1:-32}
NUM_TESTING=${2:-5}
DTYPE=${3:-'fp32'}
USE_XLA=${4:-'false'}
SHELL_FOLDER=$(dirname $(readlink -f "$0"))
export PYTHONPATH=$PYTHONPATH:/home/leinao/tensorflow/models-2.3.0
# 1 GPU
i=1
while [ $i -le $NUM_TESTING ]
do
    bash $SHELL_FOLDER/single_node_train.sh 0 ${BATCH_SIZE} $DTYPE 120 $USE_XLA $i
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done

# 2 GPUs
# i=1
# while [ $i -le $NUM_TESTING ]
# do
#     bash $SHELL_FOLDER/single_node_train.sh 0,1 ${BATCH_SIZE} $DTYPE 120 $USE_XLA $i
#     echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
#     let i++
#     sleep 20
# done

# 4 GPUs
i=1
while [ $i -le $NUM_TESTING ]
do
    bash $SHELL_FOLDER/single_node_train.sh 0,1,2,3 ${BATCH_SIZE} $DTYPE 120 $USE_XLA $i
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done

# 8 GPUs
i=1
while [ $i -le $NUM_TESTING ]
do
    bash $SHELL_FOLDER/single_node_train.sh 0,1,2,3,4,5,6,7 ${BATCH_SIZE} $DTYPE 120 $USE_XLA $i
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done

#!/bin/bash
# single_node_train.sh
MODEL_DIR=../output
rm -rf $MODEL_DIR

gpus=${1:-"0"}
BATCH_SIZE=${2:-32}
DTYPE=${3:-'fp32'}
NUM_STEP=${4:-120}
USE_XLA=${5:-'false'}
TEST_NUM=${6:-1}

# derive the GPU count from the comma-separated device list, e.g. "0,1,2,3" -> 4
a=`expr ${#gpus} + 1`
NUM_GPU=`expr ${a} / 2`
total_batch_size=`expr ${BATCH_SIZE} \* $NUM_GPU`
echo "Use gpus: $gpus"
echo "Total batch size : $total_batch_size"

if [ "$USE_XLA" == "true" ]; then
    enable_xla='true'
else
    enable_xla='false'
fi

BERT_BASE_CONFIG_FILE='/datasets/bert/uncased_L-12_H-768_A-12/bert_config.json'
LOG_FOLDER=./logs/tensorflow/bert/bz${BATCH_SIZE}/1n${NUM_GPU}g
mkdir -p $LOG_FOLDER
LOGFILE=${LOG_FOLDER}/bert_b${BATCH_SIZE}_${DTYPE}_${TEST_NUM}.log

export CUDA_VISIBLE_DEVICES=$gpus
python run_pretraining.py \
    --input_files='/datasets/bert/wiki/*.tfrecord' \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --train_batch_size=$total_batch_size \
    --num_steps_per_epoch=$NUM_STEP \
    --num_train_epochs=1 \
    --warmup_steps=10000 \
    --use_next_sentence_label=True \
    --train_summary_interval=0 \
    --optimizer_type='adamw' \
    --num_gpus=$NUM_GPU \
    --datasets_num_private_threads=8 \
    --dtype=$DTYPE \
    --enable_xla=$enable_xla \
    --model_dir=$MODEL_DIR \
    --bert_config_file=${BERT_BASE_CONFIG_FILE} 2>&1 | tee $LOGFILE
echo "Writing log to $LOGFILE"