Commit c247f5cd authored by mindspore-ci-bot, committed by Gitee

!5325 add schema file introduction for BERT and TinyBERT

Merge pull request !5325 from wanghua/master
......@@ -73,6 +73,60 @@ For distributed training, a hccl configuration file with JSON format needs to be
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For pretraining, the schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].
For the NER or classification task, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
For the SQuAD task, the training schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], and the evaluation schema file contains ["input_ids", "input_mask", "segment_ids"].
`numRows` is the only option that can be set by the user; the other values must be set according to the dataset.
For example, if the dataset is cn-wiki-128, the schema file for pretraining is as follows:
{
"datasetType": "TF",
"numRows": 7680,
"columns": {
"input_ids": {
"type": "int64",
"rank": 1,
"shape": [256]
},
"input_mask": {
"type": "int64",
"rank": 1,
"shape": [256]
},
"segment_ids": {
"type": "int64",
"rank": 1,
"shape": [256]
},
"next_sentence_labels": {
"type": "int64",
"rank": 1,
"shape": [1]
},
"masked_lm_positions": {
"type": "int64",
"rank": 1,
"shape": [32]
},
"masked_lm_ids": {
"type": "int64",
"rank": 1,
"shape": [32]
},
"masked_lm_weights": {
"type": "float32",
"rank": 1,
"shape": [32]
}
}
}
```
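Since only `numRows` is user-settable and every column follows the same `{type, rank, shape}` layout, the schema above can also be generated programmatically. A minimal sketch, using only the column sizes from the example above (the helper name is hypothetical):

```python
import json

# Shapes taken from the cn-wiki-128 pretraining schema example above.
SEQ_LEN, MAX_PREDICTIONS = 256, 32

def make_column(dtype, size):
    """Build one column entry in the TFRecord schema format."""
    return {"type": dtype, "rank": 1, "shape": [size]}

schema = {
    "datasetType": "TF",
    "numRows": 7680,  # the only user-settable field
    "columns": {
        "input_ids": make_column("int64", SEQ_LEN),
        "input_mask": make_column("int64", SEQ_LEN),
        "segment_ids": make_column("int64", SEQ_LEN),
        "next_sentence_labels": make_column("int64", 1),
        "masked_lm_positions": make_column("int64", MAX_PREDICTIONS),
        "masked_lm_ids": make_column("int64", MAX_PREDICTIONS),
        "masked_lm_weights": make_column("float32", MAX_PREDICTIONS),
    },
}

# Serialize to the JSON text that would be saved as the schema file.
schema_json = json.dumps(schema, indent=4)
```

The resulting `schema_json` string matches the example above and can be written to any file path passed as `schema_file_path`.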
# [Script Description](#contents)
## [Script and Sample Code](#contents)
......@@ -87,11 +141,12 @@ https:gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
├─hyper_parameter_config.ini # hyper parameter for distributed pretraining
├─run_distribute_pretrain.py # script for distributed pretraining
├─README.md
├─run_classifier.sh # shell script for standalone classifier task
├─run_ner.sh # shell script for standalone NER task
├─run_squad.sh # shell script for standalone SQUAD task
├─run_classifier.sh # shell script for standalone classifier task on Ascend or GPU
├─run_ner.sh # shell script for standalone NER task on Ascend or GPU
├─run_squad.sh # shell script for standalone SQuAD task on Ascend or GPU
├─run_standalone_pretrain_ascend.sh # shell script for standalone pretrain on Ascend
├─run_distributed_pretrain_ascend.sh # shell script for distributed pretrain on Ascend
├─run_distributed_pretrain_gpu.sh # shell script for distributed pretrain on GPU
└─run_standaloned_pretrain_gpu.sh # shell script for standalone pretrain on GPU
├─src
├─__init__.py
......@@ -363,55 +418,59 @@ The result will be as follows:
## [Model Description](#contents)
## [Performance](#contents)
### Pretraining Performance
| Parameters | BERT | BERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | base | base |
| Model Version | BERT_base | BERT_base |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128 | ImageNet |
| Dataset | cn-wiki-128 (40M samples) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Loss | | 1.913 |
| Speed | 116.5 ms/step | 1.913 |
| Total time | | |
| Epoch | 40 | |
| Batch_size | 256*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 340ms/step | |
| Total time | 73h | |
| Params (M) | 110M | |
| Checkpoint for Fine tuning | 1.2G(.ckpt file) | |
| Parameters | BERT | BERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | NEZHA | NEZHA |
| Model Version | BERT_NEZHA | BERT_NEZHA |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/20/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128 | ImageNet |
| Dataset | cn-wiki-128 (40M samples) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Loss | | 1.913 |
| Speed | | 1.913 |
| Total time | | |
| Epoch | 40 | |
| Batch_size | 96*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 360ms/step | |
| Total time | 200h | |
| Params (M) | 340M | |
| Checkpoint for Fine tuning | 3.2G(.ckpt file) | |
#### Inference Performance
| Parameters | | | |
| -------------------------- | ----------------------------- | ------------------------- | -------------------- |
| Model Version | V1 | | |
| Resource | Huawei 910 | NV SMX2 V100-32G | Huawei 310 |
| uploaded Date | 08/22/2020 | 05/22/2020 | |
| MindSpore Version | 0.6.0 | 0.2.0 | 0.2.0 |
| Dataset | cola, 1.2W | ImageNet, 1.2W | ImageNet, 1.2W |
| batch_size | 32(1P) | 130(8P) | |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] | |
| Speed | 59.25ms/step | | |
| Total time | | | |
| Model for inference | 1.2G(.ckpt file) | | |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Ascend 910 | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/22/2020 |
| MindSpore Version | 0.6.0 | 0.2.0 |
| Dataset | cola, 12k | ImageNet, 12k |
| batch_size | 32(1P) | 130(8P) |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] |
| Speed | 59.25ms/step | |
| Total time | 15min | |
| Model for inference | 1.2G(.ckpt file) | |
# [Description of Random Situation](#contents)
......
......@@ -122,7 +122,7 @@ def distribute_pretrain():
print("core_nums:", cmdopt)
print("epoch_size:", str(cfg['epoch_size']))
print("data_dir:", data_dir)
print("log_file_dir: " + cur_dir + "/LOG" + str(device_id) + "/log.txt")
print("log_file_dir: " + cur_dir + "/LOG" + str(device_id) + "/pretraining_log.txt")
os.chdir(cur_dir + "/LOG" + str(device_id))
cmd = 'taskset -c ' + cmdopt + ' nohup python ' + run_script + " "
......
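The launch logic above pins each worker to a CPU range with `taskset` and backgrounds it with `nohup`, redirecting output to the per-device log file. A minimal stdlib-only sketch of how such a command line is assembled (the function name, script name, and argument are hypothetical, not taken from the repo):

```python
def build_launch_cmd(cmdopt, run_script, extra_args=""):
    """Assemble a CPU-pinned, backgrounded training command,
    mirroring the taskset/nohup pattern in distribute_pretrain()."""
    cmd = "taskset -c " + cmdopt + " nohup python " + run_script + " "
    if extra_args:
        cmd += extra_args + " "
    # Redirect stdout/stderr to the per-device pretraining log.
    return cmd + "> pretraining_log.txt 2>&1 &"

# Example: pin the worker to cores 0-13 (hypothetical core layout).
cmd = build_launch_cmd("0-13", "run_pretrain.py", "--epoch_size=40")
```

The command is then executed from inside the `LOG<device_id>` directory, so each device's log lands in its own folder.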
......@@ -112,9 +112,6 @@ def create_squad_dataset(batch_size=1, repeat_count=1, data_file_path=None, sche
else:
ds = de.TFRecordDataset([data_file_path], schema_file_path if schema_file_path != "" else None,
columns_list=["input_ids", "input_mask", "segment_ids", "unique_ids"])
ds = ds.map(input_columns="input_ids", operations=type_cast_op)
ds = ds.map(input_columns="input_mask", operations=type_cast_op)
ds = ds.map(input_columns="segment_ids", operations=type_cast_op)
ds = ds.map(input_columns="segment_ids", operations=type_cast_op)
ds = ds.map(input_columns="input_mask", operations=type_cast_op)
ds = ds.map(input_columns="input_ids", operations=type_cast_op)
......
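The change above only reorders the per-column `ds.map` calls; the same `type_cast_op` is applied independently to each column, so the order does not affect the result. A stdlib-only sketch of that idea, on hypothetical in-memory columns (not the MindSpore API):

```python
def type_cast(batch, columns, dtype=int):
    """Return a copy of `batch` with the named columns cast to `dtype`,
    leaving all other columns untouched."""
    out = dict(batch)
    for col in columns:
        out[col] = [dtype(v) for v in batch[col]]
    return out

# Hypothetical float-typed columns standing in for TFRecord features.
batch = {
    "input_ids": [101.0, 2054.0, 102.0],
    "input_mask": [1.0, 1.0, 1.0],
    "segment_ids": [0.0, 0.0, 0.0],
    "unique_ids": [0.0],  # not cast, matching the maps above
}
casted = type_cast(batch, ["input_ids", "input_mask", "segment_ids"])
```

Because each cast touches only its own column, applying them in any order yields the same `casted` result.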
......@@ -65,6 +65,38 @@ For distributed training on Ascend, a hccl configuration file with JSON format n
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For the general distill task, the schema file contains ["input_ids", "input_mask", "segment_ids"].
For the task distill and eval phases, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
`numRows` is the only option that can be set by the user; the other values must be set according to the dataset.
For example, if the dataset is cn-wiki-128, the schema file for the general distill phase is as follows:
{
"datasetType": "TF",
"numRows": 7680,
"columns": {
"input_ids": {
"type": "int64",
"rank": 1,
"shape": [256]
},
"input_mask": {
"type": "int64",
"rank": 1,
"shape": [256]
},
"segment_ids": {
"type": "int64",
"rank": 1,
"shape": [256]
}
}
}
```
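Which columns a TinyBERT schema file must declare depends on the phase, as listed above. A minimal validation sketch (the helper and phase names are hypothetical labels, not repo identifiers):

```python
# Required columns per phase, as described above.
REQUIRED_COLUMNS = {
    "general_distill": ["input_ids", "input_mask", "segment_ids"],
    "task_distill": ["input_ids", "input_mask", "segment_ids", "label_ids"],
}

def check_schema(schema, phase):
    """Raise ValueError if the schema lacks a column required by `phase`."""
    missing = [c for c in REQUIRED_COLUMNS[phase] if c not in schema["columns"]]
    if missing:
        raise ValueError("schema missing columns: %s" % missing)
    return True

# The general-distill schema from the example above.
schema = {
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {c: {"type": "int64", "rank": 1, "shape": [256]}
                for c in ["input_ids", "input_mask", "segment_ids"]},
}
check_schema(schema, "general_distill")
```

The same schema would fail the `task_distill` check, since it lacks the `label_ids` column required by that phase.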
# [Script Description](#contents)
## [Script and Sample Code](#contents)
......@@ -304,9 +336,9 @@ The best acc is 0.891176
## [Model Description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | TinyBERT | TinyBERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | | |
| Model Version | TinyBERT | TinyBERT |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G, cpu:2.10GHz 64cores, memory:251G |
| uploaded Date | 08/20/2020 | 08/24/2020 |
| MindSpore Version | 0.6.0 | 0.7.0 |
......@@ -323,10 +355,10 @@ The best acc is 0.891176
#### Inference Performance
| Parameters | | |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Huawei 910 | NV SMX2 V100-32G |
| Resource | Ascend 910 | NV SMX2 V100-32G |
| uploaded Date | 08/20/2020 | 08/24/2020 |
| MindSpore Version | 0.6.0 | 0.7.0 |
| Dataset | SST-2 | SST-2 |
......