Commit 03ff5f33 authored by mindspore-ci-bot, committed by Gitee

!5487 add schema file for BERT and TinyBERT

Merge pull request !5487 from wanghua/r0.7
@@ -73,6 +73,60 @@ For distributed training, a hccl configuration file with JSON format needs to be
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For pretraining, the schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].
For the ner or classification task, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
For the squad task, the training schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], and the evaluation schema file contains ["input_ids", "input_mask", "segment_ids"].
`numRows` is the only option that can be set by the user; the other values must be set according to the dataset.
For example, for the cn-wiki-128 dataset, the schema file for pretraining is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "next_sentence_labels": {
            "type": "int64",
            "rank": 1,
            "shape": [1]
        },
        "masked_lm_positions": {
            "type": "int64",
            "rank": 1,
            "shape": [32]
        },
        "masked_lm_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [32]
        },
        "masked_lm_weights": {
            "type": "float32",
            "rank": 1,
            "shape": [32]
        }
    }
}
```
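A schema file like the one above is consumed directly by MindSpore's TFRecord loader. Below is a minimal sketch of loading the pretraining data with it; the file paths are placeholders, and the column list simply mirrors the pretraining fields named above:

```python
import mindspore.dataset as de

# Placeholder paths -- substitute your own TFRecord and schema files.
data_file = "/path/to/cn-wiki-128.tfrecord"
schema_file = "/path/to/schema.json"

# The schema file fixes the type/rank/shape of each column; columns_list
# selects which of the declared columns to read.
ds = de.TFRecordDataset([data_file], schema_file,
                        columns_list=["input_ids", "input_mask", "segment_ids",
                                      "next_sentence_labels", "masked_lm_positions",
                                      "masked_lm_ids", "masked_lm_weights"])
print("rows:", ds.get_dataset_size())
```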
# [Script Description](#contents)
## [Script and Sample Code](#contents)
@@ -87,11 +141,12 @@ https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
├─hyper_parameter_config.ini # hyper parameter for distributed pretraining
├─run_distribute_pretrain.py # script for distributed pretraining
├─README.md
├─run_classifier.sh # shell script for standalone classifier task on ascend or gpu
├─run_ner.sh # shell script for standalone NER task on ascend or gpu
├─run_squad.sh # shell script for standalone SQuAD task on ascend or gpu
├─run_standalone_pretrain_ascend.sh # shell script for standalone pretrain on ascend
├─run_distributed_pretrain_ascend.sh # shell script for distributed pretrain on ascend
├─run_distributed_pretrain_gpu.sh # shell script for distributed pretrain on gpu
└─run_standaloned_pretrain_gpu.sh # shell script for standalone pretrain on gpu
├─src
├─__init__.py
@@ -122,7 +177,7 @@ https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
usage: run_pretrain.py [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                       [--enable_save_ckpt ENABLE_SAVE_CKPT] [--device_target DEVICE_TARGET]
                       [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                       [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
                       [--save_checkpoint_path SAVE_CHECKPOINT_PATH]
                       [--load_checkpoint_path LOAD_CHECKPOINT_PATH]
                       [--save_checkpoint_steps N] [--save_checkpoint_num N]
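Options such as ENABLE_SAVE_CKPT, DO_SHUFFLE and ENABLE_DATA_SINK take the strings "true"/"false" rather than real booleans. A minimal argparse sketch of how flags of this shape are typically wired up (the defaults here are illustrative, not the script's actual values):

```python
import argparse

parser = argparse.ArgumentParser(description="bert pretraining")
parser.add_argument("--device_num", type=int, default=1, help="number of devices")
parser.add_argument("--enable_save_ckpt", type=str, default="true",
                    choices=["true", "false"], help="enable checkpoint saving")
parser.add_argument("--data_sink_steps", type=int, default=1,
                    help="steps per data-sink iteration")
args = parser.parse_args()

# String-valued flags are compared against "true" to obtain a boolean.
save_ckpt = args.enable_save_ckpt == "true"
```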
@@ -361,55 +416,59 @@ The result will be as follows:
## [Model Description](#contents)
## [Performance](#contents)
### Pretraining Performance
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | BERT_base | BERT_base |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Epoch | 40 | |
| Batch_size | 256*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 340ms/step | |
| Total time | 73h | |
| Params (M) | 110M | |
| Checkpoint for Fine tuning | 1.2G(.ckpt file) | |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | BERT_NEZHA | BERT_NEZHA |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/20/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Epoch | 40 | |
| Batch_size | 96*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 360ms/step | |
| Total time | 200h | |
| Params (M) | 340M | |
| Checkpoint for Fine tuning | 3.2G(.ckpt file) | |
#### Inference Performance
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Ascend 910 | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/22/2020 |
| MindSpore Version | 0.6.0 | 0.2.0 |
| Dataset | cola, 1.2W | ImageNet, 1.2W |
| batch_size | 32(1P) | 130(8P) |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] |
| Speed | 59.25ms/step | |
| Total time | 15min | |
| Model for inference | 1.2G(.ckpt file) | |
# [Description of Random Situation](#contents)
......
@@ -122,7 +122,7 @@ def distribute_pretrain():
    print("core_nums:", cmdopt)
    print("epoch_size:", str(cfg['epoch_size']))
    print("data_dir:", data_dir)
    print("log_file_dir: " + cur_dir + "/LOG" + str(device_id) + "/pretraining_log.txt")
    os.chdir(cur_dir + "/LOG" + str(device_id))
    cmd = 'taskset -c ' + cmdopt + ' nohup python ' + run_script + " "
......
@@ -112,9 +112,6 @@ def create_squad_dataset(batch_size=1, repeat_count=1, data_file_path=None, sche
    else:
        ds = de.TFRecordDataset([data_file_path], schema_file_path if schema_file_path != "" else None,
                                columns_list=["input_ids", "input_mask", "segment_ids", "unique_ids"])
    ds = ds.map(input_columns="segment_ids", operations=type_cast_op)
    ds = ds.map(input_columns="input_mask", operations=type_cast_op)
    ds = ds.map(input_columns="input_ids", operations=type_cast_op)
......
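The map calls in the hunk above rely on a `type_cast_op` defined earlier in dataset.py and not shown in this diff. It is a plain dtype cast; a sketch of the usual definition, assuming the int32 cast BERT uses for its input columns:

```python
import mindspore.common.dtype as mstype
import mindspore.dataset.transforms.c_transforms as C

# Cast each mapped column (input_ids, input_mask, segment_ids) to int32.
type_cast_op = C.TypeCast(mstype.int32)
```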
@@ -65,6 +65,38 @@ For distributed training on Ascend, a hccl configuration file with JSON format n
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For the general distill phase, the schema file contains ["input_ids", "input_mask", "segment_ids"].
For the task distill and eval phases, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
`numRows` is the only option that can be set by the user; the other values must be set according to the dataset.
For example, for the cn-wiki-128 dataset, the schema file for the general distill phase is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        }
    }
}
```
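Since every value except `numRows` is dictated by the dataset, the row count is the one number you must supply yourself. One way to obtain it, assuming TensorFlow 2.x is available and `general_distill.tfrecord` stands in for your own data file:

```python
import tensorflow as tf

# Iterate the TFRecord once and count the serialized examples; the total
# is the value to put in the schema's "numRows" field.
num_rows = sum(1 for _ in tf.data.TFRecordDataset("general_distill.tfrecord"))
print("numRows:", num_rows)
```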
# [Script Description](#contents)
## [Script and Sample Code](#contents)
@@ -117,7 +149,7 @@ options:
    --save_checkpoint_step       steps for saving checkpoint files: N, default is 1000
    --load_teacher_ckpt_path     path to load teacher checkpoint files: PATH, default is ""
    --data_dir                   path to dataset directory: PATH, default is ""
    --schema_dir                 path to schema.json file, PATH, default is ""
```
### Task Distill
@@ -132,7 +164,7 @@ usage: run_general_task.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN
                           [--load_td1_ckpt_path LOAD_TD1_CKPT_PATH]
                           [--train_data_dir TRAIN_DATA_DIR]
                           [--eval_data_dir EVAL_DATA_DIR]
                           [--task_name TASK_NAME] [--schema_dir SCHEMA_DIR]
options:
    --device_target device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
@@ -302,9 +334,9 @@ The best acc is 0.891176
## [Model Description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | TinyBERT | TinyBERT |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G, cpu:2.10GHz 64cores, memory:251G |
| uploaded Date | 08/20/2020 | 08/24/2020 |
| MindSpore Version | 0.6.0 | 0.7.0 |
@@ -321,7 +353,7 @@ The best acc is 0.891176
#### Inference Performance
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Ascend 910 | NV SMX2 V100-32G |
@@ -344,4 +376,4 @@ In run_general_distill.py, we set the random seed to make sure distribute traini
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
\ No newline at end of file