Commit 03ff5f33 authored by mindspore-ci-bot, committed by Gitee

!5487 add schema file for BERT and TinyBERT

Merge pull request !5487 from wanghua/r0.7
@@ -73,6 +73,60 @@ For distributed training, a hccl configuration file with JSON format needs to be
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For pretraining, the schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].
For the ner or classification task, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
For the squad task, the training schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], and the evaluation schema file contains ["input_ids", "input_mask", "segment_ids"].
`numRows` is the only option that can be set by the user; the other values must be set according to the dataset.
For example, for the cn-wiki-128 dataset, the schema file for pretraining is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "next_sentence_labels": {
            "type": "int64",
            "rank": 1,
            "shape": [1]
        },
        "masked_lm_positions": {
            "type": "int64",
            "rank": 1,
            "shape": [32]
        },
        "masked_lm_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [32]
        },
        "masked_lm_weights": {
            "type": "float32",
            "rank": 1,
            "shape": [32]
        }
    }
}
```
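A schema file like the one above is consumed directly by MindSpore's TFRecord loader. Below is a minimal sketch of loading the pretraining data with it; the file paths are placeholders, and the column list simply mirrors the pretraining fields named above:

```python
import mindspore.dataset as de

# Placeholder paths -- substitute your own TFRecord and schema files.
data_file = "/path/to/cn-wiki-128.tfrecord"
schema_file = "/path/to/schema.json"

# The schema file fixes the type/rank/shape of each column; columns_list
# selects which of the declared columns to read.
ds = de.TFRecordDataset([data_file], schema_file,
                        columns_list=["input_ids", "input_mask", "segment_ids",
                                      "next_sentence_labels", "masked_lm_positions",
                                      "masked_lm_ids", "masked_lm_weights"])
print("rows:", ds.get_dataset_size())
```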
# [Script Description](#contents)
## [Script and Sample Code](#contents)
@@ -87,11 +141,12 @@ https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
├─hyper_parameter_config.ini # hyper parameter for distributed pretraining
├─run_distribute_pretrain.py # script for distributed pretraining
├─README.md
├─run_classifier.sh # shell script for standalone classifier task on ascend or gpu
├─run_ner.sh # shell script for standalone NER task on ascend or gpu
├─run_squad.sh # shell script for standalone SQuAD task on ascend or gpu
├─run_standalone_pretrain_ascend.sh # shell script for standalone pretrain on ascend
├─run_distributed_pretrain_ascend.sh # shell script for distributed pretrain on ascend
├─run_distributed_pretrain_gpu.sh # shell script for distributed pretrain on gpu
└─run_standaloned_pretrain_gpu.sh # shell script for standalone pretrain on gpu
├─src
├─__init__.py
@@ -122,7 +177,7 @@ https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
usage: run_pretrain.py [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                       [--enable_save_ckpt ENABLE_SAVE_CKPT] [--device_target DEVICE_TARGET]
                       [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                       [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
                       [--save_checkpoint_path SAVE_CHECKPOINT_PATH]
                       [--load_checkpoint_path LOAD_CHECKPOINT_PATH]
                       [--save_checkpoint_steps N] [--save_checkpoint_num N]
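Options such as ENABLE_SAVE_CKPT, DO_SHUFFLE and ENABLE_DATA_SINK take the strings "true"/"false" rather than real booleans. A minimal argparse sketch of how flags of this shape are typically wired up (the defaults here are illustrative, not the script's actual values):

```python
import argparse

parser = argparse.ArgumentParser(description="bert pretraining")
parser.add_argument("--device_num", type=int, default=1, help="number of devices")
parser.add_argument("--enable_save_ckpt", type=str, default="true",
                    choices=["true", "false"], help="enable checkpoint saving")
parser.add_argument("--data_sink_steps", type=int, default=1,
                    help="steps per data-sink iteration")
args = parser.parse_args()

# String-valued flags are compared against "true" to obtain a boolean.
save_ckpt = args.enable_save_ckpt == "true"
```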
@@ -361,55 +416,59 @@ The result will be as follows:
## [Model Description](#contents)
## [Performance](#contents)
### Pretraining Performance
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | BERT_base | BERT_base |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Epoch | 40 | |
| Batch_size | 256*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 340ms/step | |
| Total time | 73h | |
| Params (M) | 110M | |
| Checkpoint for Fine tuning | 1.2G(.ckpt file) | |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | BERT_NEZHA | BERT_NEZHA |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/20/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Epoch | 40 | |
| Batch_size | 96*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 360ms/step | |
| Total time | 200h | |
| Params (M) | 340M | |
| Checkpoint for Fine tuning | 3.2G(.ckpt file) | |
#### Inference Performance
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Ascend 910 | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/22/2020 |
| MindSpore Version | 0.6.0 | 0.2.0 |
| Dataset | cola, 1.2W | ImageNet, 1.2W |
| batch_size | 32(1P) | 130(8P) |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] |
| Speed | 59.25ms/step | |
| Total time | 15min | |
| Model for inference | 1.2G(.ckpt file) | |
# [Description of Random Situation](#contents)
......
@@ -122,7 +122,7 @@ def distribute_pretrain():
    print("core_nums:", cmdopt)
    print("epoch_size:", str(cfg['epoch_size']))
    print("data_dir:", data_dir)
    print("log_file_dir: " + cur_dir + "/LOG" + str(device_id) + "/pretraining_log.txt")
    os.chdir(cur_dir + "/LOG" + str(device_id))
    cmd = 'taskset -c ' + cmdopt + ' nohup python ' + run_script + " "
......
@@ -112,9 +112,6 @@ def create_squad_dataset(batch_size=1, repeat_count=1, data_file_path=None, sche
    else:
        ds = de.TFRecordDataset([data_file_path], schema_file_path if schema_file_path != "" else None,
                                columns_list=["input_ids", "input_mask", "segment_ids", "unique_ids"])
    ds = ds.map(input_columns="segment_ids", operations=type_cast_op)
    ds = ds.map(input_columns="input_mask", operations=type_cast_op)
    ds = ds.map(input_columns="input_ids", operations=type_cast_op)
......
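The map calls in the hunk above rely on a `type_cast_op` defined earlier in dataset.py and not shown in this diff. It is a plain dtype cast; a sketch of the usual definition, assuming the int32 cast BERT uses for its input columns:

```python
import mindspore.common.dtype as mstype
import mindspore.dataset.transforms.c_transforms as C

# Cast each mapped column (input_ids, input_mask, segment_ids) to int32.
type_cast_op = C.TypeCast(mstype.int32)
```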
@@ -65,6 +65,38 @@ For distributed training on Ascend, a hccl configuration file with JSON format n
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For the general distill phase, the schema file contains ["input_ids", "input_mask", "segment_ids"].
For the task distill and eval phases, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
`numRows` is the only option that can be set by the user; the other values must be set according to the dataset.
For example, for the cn-wiki-128 dataset, the schema file for the general distill phase is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        }
    }
}
```
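Since every value except `numRows` is dictated by the dataset, the row count is the one number you must supply yourself. One way to obtain it, assuming TensorFlow 2.x is available and `general_distill.tfrecord` stands in for your own data file:

```python
import tensorflow as tf

# Iterate the TFRecord once and count the serialized examples; the total
# is the value to put in the schema's "numRows" field.
num_rows = sum(1 for _ in tf.data.TFRecordDataset("general_distill.tfrecord"))
print("numRows:", num_rows)
```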
# [Script Description](#contents)
## [Script and Sample Code](#contents)
@@ -117,7 +149,7 @@ options:
    --save_checkpoint_step       steps for saving checkpoint files: N, default is 1000
    --load_teacher_ckpt_path     path to load teacher checkpoint files: PATH, default is ""
    --data_dir                   path to dataset directory: PATH, default is ""
    --schema_dir                 path to schema.json file, PATH, default is ""
```
### Task Distill
@@ -132,7 +164,7 @@ usage: run_general_task.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN
                           [--load_td1_ckpt_path LOAD_TD1_CKPT_PATH]
                           [--train_data_dir TRAIN_DATA_DIR]
                           [--eval_data_dir EVAL_DATA_DIR]
                           [--task_name TASK_NAME] [--schema_dir SCHEMA_DIR]
options:
    --device_target device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
@@ -302,9 +334,9 @@ The best acc is 0.891176
## [Model Description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | TinyBERT | TinyBERT |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G, cpu:2.10GHz 64cores, memory:251G |
| uploaded Date | 08/20/2020 | 08/24/2020 |
| MindSpore Version | 0.6.0 | 0.7.0 |
@@ -321,7 +353,7 @@ The best acc is 0.891176
#### Inference Performance
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Ascend 910 | NV SMX2 V100-32G |
@@ -344,4 +376,4 @@ In run_general_distill.py, we set the random seed to make sure distribute traini
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
\ No newline at end of file