Commit 6674a88d, authored by wanghua

add schema file for BERT and TinyBERT

Parent 7d38a1fb
......@@ -73,6 +73,60 @@ For distributed training, a hccl configuration file with JSON format needs to be
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
To set the format and parameters of the dataset, create a schema configuration file in JSON format; refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For pretraining, the schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].
For the NER or classification task, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
For the SQuAD task, the training schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], and the evaluation schema file contains ["input_ids", "input_mask", "segment_ids"].
`numRows` is the only option the user may set freely; all other values must match the dataset.
For example, for the cn-wiki-128 dataset, the schema file for pretraining is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "next_sentence_labels": {
            "type": "int64",
            "rank": 1,
            "shape": [1]
        },
        "masked_lm_positions": {
            "type": "int64",
            "rank": 1,
            "shape": [32]
        },
        "masked_lm_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [32]
        },
        "masked_lm_weights": {
            "type": "float32",
            "rank": 1,
            "shape": [32]
        }
    }
}
```
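The pretraining schema above follows a regular pattern, so it could be generated instead of written by hand. The following is a minimal sketch and not part of the repository: `build_pretrain_schema` and its parameters `seq_length`, `max_predictions`, and `num_rows` are hypothetical names. It assumes that the three token-level columns share one length and the three masked-LM columns share another, as in the example above.

```python
import json


def build_pretrain_schema(seq_length, max_predictions, num_rows):
    """Return a schema dict shaped like the pretraining example (hypothetical helper)."""

    def column(dtype, length):
        # Every column in the example is a rank-1 tensor of fixed length.
        return {"type": dtype, "rank": 1, "shape": [length]}

    return {
        "datasetType": "TF",
        "numRows": num_rows,
        "columns": {
            "input_ids": column("int64", seq_length),
            "input_mask": column("int64", seq_length),
            "segment_ids": column("int64", seq_length),
            "next_sentence_labels": column("int64", 1),
            "masked_lm_positions": column("int64", max_predictions),
            "masked_lm_ids": column("int64", max_predictions),
            "masked_lm_weights": column("float32", max_predictions),
        },
    }


# Reproduce the values used in the example above.
schema = build_pretrain_schema(seq_length=256, max_predictions=32, num_rows=7680)
schema_text = json.dumps(schema, indent=4)
```

Writing `schema_text` to a file should give a schema equivalent to the one shown above, provided the lengths passed in really match what the dataset stores.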
# [Script Description](#contents)
## [Script and Sample Code](#contents)
......@@ -87,11 +141,12 @@ https:gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
├─hyper_parameter_config.ini # hyper parameter for distributed pretraining
├─run_distribute_pretrain.py # script for distributed pretraining
├─README.md
├─run_classifier.sh # shell script for standalone classifier task
├─run_ner.sh # shell script for standalone NER task
├─run_squad.sh # shell script for standalone SQUAD task
├─run_classifier.sh # shell script for standalone classifier task on ascend or gpu
├─run_ner.sh # shell script for standalone NER task on ascend or gpu
├─run_squad.sh # shell script for standalone SQUAD task on ascend or gpu
├─run_standalone_pretrain_ascend.sh # shell script for standalone pretrain on ascend
├─run_distributed_pretrain_ascend.sh # shell script for distributed pretrain on ascend
├─run_distributed_pretrain_gpu.sh # shell script for distributed pretrain on gpu
└─run_standaloned_pretrain_gpu.sh # shell script for standalone pretrain on gpu
├─src
├─__init__.py
......@@ -122,7 +177,7 @@ https:gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
usage: run_pretrain.py [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
[--enable_save_ckpt ENABLE_SAVE_CKPT] [--device_target DEVICE_TARGET]
[--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
[--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
[--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
[--save_checkpoint_path SAVE_CHECKPOINT_PATH]
[--load_checkpoint_path LOAD_CHECKPOINT_PATH]
[--save_checkpoint_steps N] [--save_checkpoint_num N]
......@@ -361,55 +416,59 @@ The result will be as follows:
## [Model Description](#contents)
## [Performance](#contents)
### Pretraining Performance
| Parameters | BERT | BERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | base | base |
| Model Version | BERT_base | BERT_base |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128 | ImageNet |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Loss | | 1.913 |
| Speed | 116.5 ms/step | 1.913 |
| Total time | | |
| Epoch | 40 | |
| Batch_size | 256*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 340ms/step | 1.913 |
| Total time | 73h | |
| Params (M) | 110M | |
| Checkpoint for Fine tuning | 1.2G(.ckpt file) | |
| Parameters | BERT | BERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | NEZHA | NEZHA |
| Model Version | BERT_NEZHA | BERT_NEZHA |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/20/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128 | ImageNet |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Loss | | 1.913 |
| Speed | | 1.913 |
| Total time | | |
| Epoch | 40 | |
| Batch_size | 96*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 360ms/step | 1.913 |
| Total time | 200h | |
| Params (M) | 340M | |
| Checkpoint for Fine tuning | 3.2G(.ckpt file) | |
#### Inference Performance
| Parameters | | | |
| -------------------------- | ----------------------------- | ------------------------- | -------------------- |
| Model Version | V1 | | |
| Resource | Ascend 910 | NV SMX2 V100-32G | Ascend 310 |
| uploaded Date | 08/22/2020 | 05/22/2020 | |
| MindSpore Version | 0.6.0 | 0.2.0 | 0.2.0 |
| Dataset | cola, 1.2W | ImageNet, 1.2W | ImageNet, 1.2W |
| batch_size | 32(1P) | 130(8P) | |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] | |
| Speed | 59.25ms/step | | |
| Total time | | | |
| Model for inference | 1.2G(.ckpt file) | | |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Ascend 910 | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/22/2020 |
| MindSpore Version | 0.6.0 | 0.2.0 |
| Dataset | cola, 1.2W | ImageNet, 1.2W |
| batch_size | 32(1P) | 130(8P) |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] |
| Speed | 59.25ms/step | |
| Total time | 15min | |
| Model for inference | 1.2G(.ckpt file) | |
# [Description of Random Situation](#contents)
......
......@@ -122,7 +122,7 @@ def distribute_pretrain():
print("core_nums:", cmdopt)
print("epoch_size:", str(cfg['epoch_size']))
print("data_dir:", data_dir)
print("log_file_dir: " + cur_dir + "/LOG" + str(device_id) + "/log.txt")
print("log_file_dir: " + cur_dir + "/LOG" + str(device_id) + "/pretraining_log.txt")
os.chdir(cur_dir + "/LOG" + str(device_id))
cmd = 'taskset -c ' + cmdopt + ' nohup python ' + run_script + " "
......
......@@ -112,9 +112,6 @@ def create_squad_dataset(batch_size=1, repeat_count=1, data_file_path=None, sche
else:
ds = de.TFRecordDataset([data_file_path], schema_file_path if schema_file_path != "" else None,
columns_list=["input_ids", "input_mask", "segment_ids", "unique_ids"])
ds = ds.map(input_columns="input_ids", operations=type_cast_op)
ds = ds.map(input_columns="input_mask", operations=type_cast_op)
ds = ds.map(input_columns="segment_ids", operations=type_cast_op)
ds = ds.map(input_columns="segment_ids", operations=type_cast_op)
ds = ds.map(input_columns="input_mask", operations=type_cast_op)
ds = ds.map(input_columns="input_ids", operations=type_cast_op)
......
......@@ -65,6 +65,38 @@ For distributed training on Ascend, a hccl configuration file with JSON format n
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
To set the format and parameters of the dataset, create a schema configuration file in JSON format; refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For the general distill phase, the schema file contains ["input_ids", "input_mask", "segment_ids"].
For the task distill and eval phases, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
`numRows` is the only option the user may set freely; all other values must match the dataset.
For example, for the cn-wiki-128 dataset, the schema file for the general distill phase is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        }
    }
}
```
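Because the column list depends on the phase, a quick check before training may catch a mismatched schema file early. The sketch below is an illustration only: `REQUIRED_COLUMNS`, the phase keys, and `missing_columns` are hypothetical names, and the column lists are copied from the description above.

```python
import json

# Column names per phase, copied from the description above.
# The phase keys ("general_distill", "task_distill", "eval") are hypothetical labels.
REQUIRED_COLUMNS = {
    "general_distill": ["input_ids", "input_mask", "segment_ids"],
    "task_distill": ["input_ids", "input_mask", "segment_ids", "label_ids"],
    "eval": ["input_ids", "input_mask", "segment_ids", "label_ids"],
}


def missing_columns(schema, phase):
    """Return the required columns that the schema dict does not define."""
    present = schema.get("columns", {})
    return [name for name in REQUIRED_COLUMNS[phase] if name not in present]


# The general distill example above, parsed from JSON text.
example = json.loads("""
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {"type": "int64", "rank": 1, "shape": [256]},
        "input_mask": {"type": "int64", "rank": 1, "shape": [256]},
        "segment_ids": {"type": "int64", "rank": 1, "shape": [256]}
    }
}
""")

general_missing = missing_columns(example, "general_distill")
task_missing = missing_columns(example, "task_distill")
```

Here `general_missing` is empty, while `task_missing` reports `label_ids`, which suggests the general distill schema would need that extra column before being reused for the task distill or eval phase.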
# [Script Description](#contents)
## [Script and Sample Code](#contents)
......@@ -117,7 +149,7 @@ options:
--save_checkpoint_step steps for saving checkpoint files: N, default is 1000
--load_teacher_ckpt_path path to load teacher checkpoint files: PATH, default is ""
--data_dir path to dataset directory: PATH, default is ""
--schema_dir path to schema.json file, PATH, default is ""
--schema_dir path to schema.json file, PATH, default is ""
```
### Task Distill
......@@ -132,7 +164,7 @@ usage: run_general_task.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN
[--load_td1_ckpt_path LOAD_TD1_CKPT_PATH]
[--train_data_dir TRAIN_DATA_DIR]
[--eval_data_dir EVAL_DATA_DIR]
[--task_name TASK_NAME] [--schema_dir SCHEMA_DIR]
[--task_name TASK_NAME] [--schema_dir SCHEMA_DIR]
options:
--device_target device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
......@@ -302,9 +334,9 @@ The best acc is 0.891176
## [Model Description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | TinyBERT | TinyBERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | | |
| Model Version | TinyBERT | TinyBERT |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G, cpu:2.10GHz 64cores, memory:251G |
| uploaded Date | 08/20/2020 | 08/24/2020 |
| MindSpore Version | 0.6.0 | 0.7.0 |
......@@ -321,7 +353,7 @@ The best acc is 0.891176
#### Inference Performance
| Parameters | | |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Ascend 910 | NV SMX2 V100-32G |
......@@ -344,4 +376,4 @@ In run_general_distill.py, we set the random seed to make sure distribute traini
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
\ No newline at end of file
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).