# Contents
- [TinyBERT Description](#tinybert-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Dataset Preparation](#dataset-preparation)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [TinyBERT Description](#contents)
[TinyBERT](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT) is 7.5x smaller and 9.4x faster on inference than [BERT-base](https://github.com/google-research/bert) (the base version of the BERT model) and achieves competitive performance on natural language understanding tasks. It performs a novel transformer distillation at both the pre-training and task-specific learning stages.

[Paper](https://arxiv.org/abs/1909.10351):  Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu. [TinyBERT: Distilling BERT for Natural Language Understanding](https://arxiv.org/abs/1909.10351). arXiv preprint arXiv:1909.10351. 

# [Model Architecture](#contents)
The backbone of TinyBERT is the Transformer. The network contains four encoder modules; each encoder contains a self-attention module, and each self-attention module contains an attention module.

# [Dataset](#contents)
- Download the zhwiki or enwiki dataset for general distillation. Extract and clean the text with [WikiExtractor](https://github.com/attardi/wikiextractor), then convert the dataset to TFRecord format; refer to create_pretraining_data.py in the [BERT](https://github.com/google-research/bert) repository (see the sketch after this list).
- Download the glue dataset for task distillation. To convert the dataset files from json format to tfrecord format, refer to run_classifier.py in the [BERT](https://github.com/google-research/bert) repository.
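
For the first bullet, here is a minimal, hedged sketch of the general-distillation data pipeline. All paths, the vocabulary file and the parameter values are placeholders for your own environment, and the flags follow the create_pretraining_data.py script of the BERT repository:

```bash
# Extract and clean text from the wiki dump (the invocation varies with the
# WikiExtractor version; recent releases are run as a Python module).
python -m wikiextractor.WikiExtractor zhwiki-latest-pages-articles.xml.bz2 -o extracted/

# Convert the cleaned text to TFRecord with the BERT repository's script.
# /path/vocab.txt and the sequence parameters below are placeholders.
python create_pretraining_data.py \
  --input_file=extracted/AA/wiki_00 \
  --output_file=/path/data/wiki_00.tfrecord \
  --vocab_file=/path/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5
```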

# [Environment Requirements](#contents)
- Hardware (Ascend/GPU)
  - Prepare hardware environment with Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
  - [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below:
  - [MindSpore tutorials](https://www.mindspore.cn/tutorial/en/master/index.html) 
  - [MindSpore API](https://www.mindspore.cn/api/en/master/index.html)

# [Quick Start](#contents)
After installing MindSpore via the official website, you can start general distillation, task distillation and evaluation as follows:
```bash
# run standalone general distill example
bash scripts/run_standalone_gd.sh 

# Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_standalone_gd.sh file first. If running on GPU, please set `device_target=GPU`.

# For Ascend device, run distributed general distill example
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json

# Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_distributed_gd_ascend.sh file first.

# For GPU device, run distributed general distill example
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt

# run task distill and evaluation example
bash scripts/run_standalone_td.sh 

# Before running the shell script, please set the `task_name`, `load_teacher_ckpt_path`, `load_gd_ckpt_path`, `train_data_dir`, `eval_data_dir`, `schema_dir` and `dataset_type` in the run_standalone_td.sh file first.
# If running on GPU, please set `device_target=GPU`.
```

For distributed training on Ascend, an HCCL configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools
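
For example, on a single machine with 8 Ascend devices, the configuration file can typically be generated with the hccl_tools script as sketched below (the output file name depends on the tool version and the device range):

```bash
# Generate an HCCL configuration covering local devices 0-7.
python hccl_tools.py --device_num "[0,8)"
```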

For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For the general distill phase, the schema file contains ["input_ids", "input_mask", "segment_ids"].

For the task distill and eval phases, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].

`numRows` is the only option that can be set by the user; the other values must be set according to the dataset.

For example, for the cn-wiki-128 dataset, the schema file for the general distill phase is as follows:
{
	"datasetType": "TF",
	"numRows": 7680,
	"columns": {
		"input_ids": {
			"type": "int64",
			"rank": 1,
			"shape": [256]
		},
		"input_mask": {
			"type": "int64",
			"rank": 1,
			"shape": [256]
		},
		"segment_ids": {
			"type": "int64",
			"rank": 1,
			"shape": [256]
		}
	}
}
```
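
A malformed schema file only fails once the dataset is loaded, so it can save time to validate the JSON up front, for example with Python's built-in json.tool (the path is a placeholder):

```bash
# Pretty-prints the schema if it is valid JSON, otherwise reports the parse error.
python -m json.tool /path/schema.json
```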

# [Script Description](#contents)
## [Script and Sample Code](#contents)

```shell
.
└─bert
  ├─README.md
  ├─scripts
    ├─run_distributed_gd_ascend.sh       # shell script for distributed general distill phase on Ascend
    ├─run_distributed_gd_gpu.sh          # shell script for distributed general distill phase on GPU
    ├─run_standalone_gd.sh               # shell script for standalone general distill phase
    ├─run_standalone_td.sh               # shell script for standalone task distill phase
  ├─src
    ├─__init__.py
    ├─assessment_method.py               # assessment method for evaluation
    ├─dataset.py                         # data processing
    ├─fused_layer_norm.py                # fused layer norm, optimized for Ascend
    ├─gd_config.py                       # parameter configuration for general distill phase
    ├─td_config.py                       # parameter configuration for task distill phase
    ├─tinybert_for_gd_td.py              # networks with loss for general distill and task distill phases
    ├─tinybert_model.py                  # backbone code of network
    ├─utils.py                           # util function
  ├─__init__.py
  ├─run_general_distill.py               # train net for general distillation 
  ├─run_task_distill.py                  # train and eval net for task distillation 
```

## [Script Parameters](#contents)
### General Distill
``` 
usage: run_general_distill.py   [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N] 
                                [--device_target DEVICE_TARGET] [--do_shuffle DO_SHUFFLE]
                                [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N] 
                                [--save_ckpt_path SAVE_CKPT_PATH]
                                [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                                [--save_checkpoint_step N] [--max_ckpt_num N] 
                                [--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR] [--dataset_type DATASET_TYPE] [--train_steps N]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --distribute               run distributed training on several devices: "true" (training with more than 1 device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 1
    --device_id                device id: N, default is 0
    --device_num               number of used devices: N, default is 1
    --save_ckpt_path           path to save checkpoint files: PATH, default is ""    
    --max_ckpt_num             max number for saving checkpoint files: N, default is 1
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --data_sink_steps          set data sink steps: N, default is 1
    --save_checkpoint_step     steps for saving checkpoint files: N, default is 1000
    --load_teacher_ckpt_path   path to load teacher checkpoint files: PATH, default is ""
    --data_dir                 path to dataset directory: PATH, default is ""
    --schema_dir               path to schema.json file: PATH, default is ""
    --dataset_type             dataset type, either tfrecord or mindrecord; default is tfrecord
```
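
Putting the options above together, a direct single-device invocation might look like the following sketch; every path and value below is a placeholder, and the provided shell scripts remain the recommended entry point:

```bash
python run_general_distill.py \
    --distribute="false" \
    --device_target="Ascend" \
    --epoch_size=3 \
    --device_id=0 \
    --do_shuffle="true" \
    --enable_data_sink="true" \
    --data_sink_steps=100 \
    --save_ckpt_path=/path/save_ckpt/ \
    --load_teacher_ckpt_path=/path/teacher.ckpt \
    --data_dir=/path/data/ \
    --schema_dir=/path/schema.json \
    --dataset_type=tfrecord
```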
  
### Task Distill
``` 
usage: run_task_distill.py  [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL] 
                            [--td_phase1_epoch_size N] [--td_phase2_epoch_size N] 
                            [--device_id N] [--do_shuffle DO_SHUFFLE]
                            [--enable_data_sink ENABLE_DATA_SINK] [--save_ckpt_step N] 
                            [--max_ckpt_num N] [--data_sink_steps N] 
                            [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                            [--load_gd_ckpt_path LOAD_GD_CKPT_PATH]
                            [--load_td1_ckpt_path LOAD_TD1_CKPT_PATH]
                            [--train_data_dir TRAIN_DATA_DIR]
                            [--eval_data_dir EVAL_DATA_DIR]
                            [--task_name TASK_NAME] [--schema_dir SCHEMA_DIR] [--dataset_type DATASET_TYPE]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --do_train                 enable train task: "true" | "false", default is "true"
    --do_eval                  enable eval task: "true" | "false", default is "true"
    --td_phase1_epoch_size     epoch size for td phase1: N, default is 10
    --td_phase2_epoch_size     epoch size for td phase2: N, default is 3
    --device_id                device id: N, default is 0
    --do_shuffle               enable shuffle: "true" | "false", default is "true"    
    --enable_data_sink         enable data sink: "true" | "false", default is "true"    
    --save_ckpt_step           steps for saving checkpoint files: N, default is 1000
    --max_ckpt_num             max number for saving checkpoint files: N, default is 1
    --data_sink_steps          set data sink steps: N, default is 1
    --load_teacher_ckpt_path   path to load teacher checkpoint files: PATH, default is ""
    --load_gd_ckpt_path        path to load checkpoint files produced by general distill: PATH, default is ""
    --load_td1_ckpt_path       path to load checkpoint files produced by task distill phase 1: PATH, default is ""
    --train_data_dir           path to train dataset directory: PATH, default is ""
    --eval_data_dir            path to eval dataset directory: PATH, default is ""
    --task_name                classification task: "SST-2" | "QNLI" | "MNLI", default is ""
    --schema_dir               path to schema.json file: PATH, default is ""
    --dataset_type             dataset type, either tfrecord or mindrecord; default is tfrecord
```
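
Likewise, here is a hedged sketch of a direct task-distill invocation with both training and evaluation enabled; the task name, checkpoint files and data paths are placeholders:

```bash
python run_task_distill.py \
    --device_target="Ascend" \
    --do_train="true" \
    --do_eval="true" \
    --td_phase1_epoch_size=10 \
    --td_phase2_epoch_size=3 \
    --task_name="SST-2" \
    --load_teacher_ckpt_path=/path/sst2_teacher.ckpt \
    --load_gd_ckpt_path=/path/general_tinybert.ckpt \
    --train_data_dir=/path/sst2/train/ \
    --eval_data_dir=/path/sst2/eval/ \
    --schema_dir=/path/schema.json \
    --dataset_type=tfrecord
```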

## Options and Parameters
`gd_config.py` and `td_config.py` contain parameters of the BERT model and options for the optimizer and loss scale.
### Options:
```
Parameters for loss scale:
    loss_scale_value                initial value of loss scale: N, default is 2^8
    scale_factor                    factor used to update loss scale: N, default is 2
    scale_window                    steps between updates of the loss scale: N, default is 50

Parameters for optimizer:
    learning_rate                   value of learning rate: Q
    end_learning_rate               value of end learning rate: Q, must be positive
    power                           power: Q
    weight_decay                    weight decay: Q
    eps                             term added to the denominator to improve numerical stability: Q
```
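
Concretely, with the defaults above the loss is multiplied by 2^8 = 256 before backpropagation. Under MindSpore's dynamic loss scaling, if no gradient overflow occurs for `scale_window` (50) consecutive steps, the scale is multiplied by `scale_factor` (2); when an overflow occurs, it is divided by `scale_factor` instead.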

### Parameters:
```
Parameters for bert network:
    batch_size                      batch size of input dataset: N, default is 16
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of each embedding vector: N, must be consistent with the dataset you use. Default is 30522
    hidden_size                     size of bert encoder layers: N
    num_hidden_layers               number of hidden layers: N
    num_attention_heads             number of attention heads: N, default is 12
    intermediate_size               size of intermediate layer: N
    hidden_act                      activation function used: ACTIVATION, default is "gelu"
    hidden_dropout_prob             dropout probability for BertOutput: Q
    attention_probs_dropout_prob    dropout probability for BertAttention: Q
    max_position_embeddings         maximum length of sequences: N, default is 512
    save_ckpt_step                  steps for saving checkpoint files: N, default is 100
    max_ckpt_num                    maximum number for saving checkpoint: N, default is 1
    type_vocab_size                 size of token type vocab: N, default is 2
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    use_relative_positions          use relative positions or not: True | False, default is False
    input_mask_from_dataset         use the input mask loaded from the dataset or not: True | False, default is True
    token_type_ids_from_dataset     use the token type ids loaded from dataset or not: True | False, default is True
    dtype                           data type of input: mstype.float16 | mstype.float32, default is mstype.float32
    compute_type                    compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float16
    enable_fused_layernorm          use batchnorm instead of layernorm to improve performance: True | False, default is False
```
## [Training Process](#contents)
### Training
#### running on Ascend
Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set each path to the absolute full path, e.g. "/username/checkpoint_100_300.ckpt".
```
bash scripts/run_standalone_gd.sh
```
The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the script folder by default. The loss values will look as follows:
```
# grep "epoch" log.txt
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 28.2093), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 2, step: 200, outpus are (Tensor(shape=[1], dtype=Float32, 30.1724), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

#### running on GPU
Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `device_target=GPU` have been set. Please set each path to the absolute full path, e.g. "/username/checkpoint_100_300.ckpt".
```
bash scripts/run_standalone_gd.sh
```
The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the script folder by default. The loss values will look as follows:
```
# grep "epoch" log.txt
epoch: 1, step: 100, outpus are 28.2093
...
```

### Distributed Training
#### running on Ascend
Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set each path to the absolute full path, e.g. "/username/checkpoint_100_300.ckpt".
```
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json
```
The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the LOG* folder by default. The loss values will look as follows:
```
# grep "epoch" LOG*/log.txt
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 28.1478), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 30.5901), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

#### running on GPU
Please set each path to the absolute full path, e.g. "/username/checkpoint_100_300.ckpt".
```
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt
```
The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the LOG* folder by default. The loss values will look as follows:
```
# grep "epoch" LOG*/log.txt
epoch: 1, step: 1, outpus are 63.4098
...
```

## [Evaluation Process](#contents)
### Evaluation
If you want to run evaluation right after training, please set `do_train=true` and `do_eval=true`; if you want to run evaluation alone, please set `do_train=false` and `do_eval=true`. If running on GPU, please set `device_target=GPU`.
#### evaluation on SST-2 dataset  
```
bash scripts/run_standalone_td.sh
```
The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:
```bash
# grep "The best acc" log.txt
The best acc is 0.872685
The best acc is 0.893515
The best acc is 0.899305
...
The best acc is 0.902777
...
```
#### evaluation on MNLI dataset
Before running the command below, please check that the checkpoint path to load has been set. Please set it to the absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".
```
bash scripts/run_standalone_td.sh
```
The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:
```
# grep "The best acc" log.txt
The best acc is 0.803206
The best acc is 0.803308
The best acc is 0.810355
...
The best acc is 0.813929
...
```
#### evaluation on QNLI dataset
Before running the command below, please check that the checkpoint path to load has been set. Please set it to the absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".
```
bash scripts/run_standalone_td.sh
```
The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:
```
# grep "The best acc" log.txt
The best acc is 0.870772
The best acc is 0.871691
The best acc is 0.875183
...
The best acc is 0.891176
...
```
    
# [Model Description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters                 | Ascend                                                     | GPU                       |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version              | TinyBERT                                                   | TinyBERT                           |
| Resource                   | Ascend 910, cpu:2.60GHz 56cores, memory:314G               | NV SMX2 V100-32G, cpu:2.10GHz 64cores,  memory:251G         |
| uploaded Date              | 08/20/2020                                                 | 08/24/2020                |
| MindSpore Version          | 0.6.0                                                      | 0.7.0                     |
| Dataset                    | cn-wiki-128                                                | cn-wiki-128               |
| Training Parameters        | src/gd_config.py                                           | src/gd_config.py          |
| Optimizer                  | AdamWeightDecay                                            | AdamWeightDecay           |
| Loss Function              | SoftmaxCrossEntropy                                        | SoftmaxCrossEntropy       |
| outputs                    | probability                                                | probability               |
| Loss                       | 6.541583                                                   | 6.6915                    |
| Speed                      | 35.4ms/step                                                | 98.654ms/step             |
| Total time                 | 17.3h (3 epochs, 8p)                                       | 48h (3 epochs, 8p)        |
| Params (M)                 | 15M                                                        | 15M                       |
| Checkpoint for task distill| 74M(.ckpt file)                                            | 74M(.ckpt file)           |    

### Evaluation Performance

| Parameters                 | Ascend                        | GPU                       |
| -------------------------- | ----------------------------- | ------------------------- | 
| Model Version              |                               |                           |
| Resource                   | Ascend 910                    | NV SMX2 V100-32G          |
| uploaded Date              | 08/20/2020                    | 08/24/2020                |
| MindSpore Version          | 0.6.0                         | 0.7.0                     |
| Dataset                    | SST-2                         | SST-2                     |
| batch_size                 | 32                            | 32                        |
| Accuracy                   | 0.902777                      | 0.9086                    |
| Speed                      |                               |                           |
| Total time                 |                               |                           |
| Model for inference        | 74M(.ckpt file)               | 74M(.ckpt file)           |

# [Description of Random Situation](#contents)

In run_standalone_td.sh, we set do_shuffle to shuffle the dataset.

In gd_config.py and td_config.py, we set hidden_dropout_prob and attention_probs_dropout_prob to drop out some network nodes.

In run_general_distill.py, we set the random seed to make sure distributed training starts from the same initial weights.

# [ModelZoo Homepage](#contents)
 
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).