Commit 5d9f5829 authored by mindspore-ci-bot, committed by Gitee

!955 update second order optimizer for resnet50 application

Merge pull request !955 from wangmin0104/master
@@ -114,7 +114,7 @@ def create_dataset(dataset_path, do_train, repeat_num=1, batch_size=32, target="
    if target == "Ascend":
        device_num, rank_id = _get_rank_info()
    else:
-       init("nccl")
+       init()
        rank_id = get_rank()
        device_num = get_group_size()
    if device_num == 1:
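For orientation between the diff hunks, here is a minimal, self-contained sketch of the sharding logic this change touches. It is not the repository's `create_dataset`; `ImageFolderDataset` and its arguments are illustrative assumptions, and on GPU `init()` now selects the NCCL backend automatically from `device_target`.

```python
# Illustrative sketch only -- not the repository's create_dataset implementation.
import mindspore.dataset as ds
from mindspore.communication.management import init, get_rank, get_group_size

def create_dataset_sketch(dataset_path, target="GPU"):
    if target == "Ascend":
        device_num, rank_id = 1, 0  # stand-in for the repo's _get_rank_info()
    else:
        init()                      # backend (NCCL) is inferred from device_target
        rank_id = get_rank()
        device_num = get_group_size()

    if device_num == 1:
        data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        # shard the dataset so every device reads a distinct slice
        data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True,
                                         num_shards=device_num, shard_id=rank_id)
    return data_set
```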
@@ -249,7 +249,7 @@ if __name__ == "__main__":
    # learning rate setting
    lr = get_model_lr(0, config.lr_init, config.lr_decay, config.lr_end_epoch, step_size, decay_epochs=39)
    # define the optimizer
-   net_opt = opt = THOR(filter(lambda x: x.requires_grad, net.get_parameters()), Tensor(lr), config.momentum,
+   opt = THOR(filter(lambda x: x.requires_grad, net.get_parameters()), Tensor(lr), config.momentum,
               filter(lambda x: 'matrix_A' in x.name, net.get_parameters()),
               filter(lambda x: 'matrix_G' in x.name, net.get_parameters()),
               filter(lambda x: 'A_inv_max' in x.name, net.get_parameters()),
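Besides the trainable weights, THOR receives the per-layer second-order statistics (`matrix_A`, `matrix_G`, and their `*_inv_max` values) that the THOR-enabled ResNet-50 registers as network parameters; they are selected purely by name. A minimal, self-contained illustration of that filtering pattern on a hypothetical toy network (the toy network registers no `matrix_A` parameters, only the THOR ResNet does):

```python
# Toy illustration of the name-based parameter filtering used in the THOR call above.
# TinyNet is a hypothetical stand-in; only the THOR-enabled ResNet-50 actually
# registers matrix_A / matrix_G / A_inv_max / G_inv_max parameters.
import mindspore.nn as nn

class TinyNet(nn.Cell):
    def __init__(self):
        super().__init__()
        self.fc = nn.Dense(4, 2)

    def construct(self, x):
        return self.fc(x)

net = TinyNet()
trainable = [p.name for p in net.get_parameters() if p.requires_grad]      # weights handed to THOR first
matrix_a = [p.name for p in net.get_parameters() if 'matrix_A' in p.name]  # empty here, non-empty for THOR ResNet
print(trainable, matrix_a)
```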
@@ -360,20 +360,18 @@ sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
- `DATASET_PATH`: path of the training dataset.
- `DEVICE_NUM`: the actual number of devices used for training.
-When training on GPU, the `DEVICE_ID` environment variable does not need to be set, so the main training script does not need to call `int(os.getenv('DEVICE_ID'))` to obtain the physical device index, and `device_id` does not need to be passed to the `context`. Set `device_target` to GPU and call `init("nccl")` to enable NCCL.
+When training on GPU, the `DEVICE_ID` environment variable does not need to be set, so the main training script does not need to call `int(os.getenv('DEVICE_ID'))` to obtain the physical device index, and `device_id` does not need to be passed to the `context`. Set `device_target` to GPU and call `init()` to enable NCCL.
An example of the loss output printed during training:
```bash
...
-epoch: 1 step: 5004, loss is 4.3069
-epoch: 2 step: 5004, loss is 3.5695
-epoch: 3 step: 5004, loss is 3.5893
-epoch: 4 step: 5004, loss is 3.1987
-epoch: 5 step: 5004, loss is 3.3526
+epoch: 1 step: 5004, loss is 4.2546034
+epoch: 2 step: 5004, loss is 4.0819564
+epoch: 3 step: 5004, loss is 3.7005644
+epoch: 4 step: 5004, loss is 3.2668946
+epoch: 5 step: 5004, loss is 3.023509
......
-epoch: 40 step: 5004, loss is 1.9482
-epoch: 41 step: 5004, loss is 1.8950
-epoch: 42 step: 5004, loss is 1.9023
+epoch: 36 step: 5004, loss is 1.645802
...
```
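A minimal sketch of the GPU setup described in the paragraph above: no `DEVICE_ID`/`device_id` handling, `device_target` set to GPU, and `init()` to enable NCCL. The auto-parallel arguments shown here are illustrative assumptions, not the exact values used by the repository's train.py.

```python
# Minimal GPU distributed setup sketch (illustrative; not the repository's train.py).
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")  # no device_id needed on GPU
init()  # enables NCCL; the backend is inferred from device_target

context.set_auto_parallel_context(device_num=get_group_size(),
                                  parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
```

A script set up this way is typically launched on multiple GPUs via `mpirun`, which is what `run_distribute_train_gpu.sh` wraps.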
@@ -385,13 +383,13 @@ epoch: 42 step: 5004, loss is 1.9023
├─resnet-1_5004.ckpt
├─resnet-2_5004.ckpt
│      ......
-├─resnet-42_5004.ckpt
+├─resnet-36_5004.ckpt
......
├─ckpt_7
├─resnet-1_5004.ckpt
├─resnet-2_5004.ckpt
│      ......
-├─resnet-42_5004.ckpt
+├─resnet-36_5004.ckpt
```
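The directories above (`ckpt_0` through `ckpt_7`, one per rank) and file names of the form `resnet-<epoch>_5004.ckpt` come from MindSpore's checkpoint callback. A hedged sketch of how such a layout is typically configured; the concrete values (save interval, `keep_checkpoint_max`) are illustrative assumptions rather than a copy of the repository's train.py:

```python
# Sketch of a checkpoint callback that yields ckpt_<rank>/resnet-<epoch>_<step>.ckpt files.
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig
from mindspore.communication.management import get_rank

config_ck = CheckpointConfig(save_checkpoint_steps=5004,  # once per epoch: 5004 steps per epoch
                             keep_checkpoint_max=40)
ckpt_cb = ModelCheckpoint(prefix="resnet",
                          directory="./ckpt_" + str(get_rank()),
                          config=config_ck)
# ckpt_cb is then passed in the callback list of model.train(...)
```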
@@ -434,7 +432,7 @@ if __name__ == "__main__":
```
### Run Inference
-After the inference network is defined, call the shell script in the /scripts directory to run inference.
+After the inference network is defined, call the shell script in the `scripts` directory to run inference.
#### Ascend 910
On the Ascend 910 hardware platform, run inference with the following command:
```
@@ -462,5 +460,5 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
The inference result is as follows:
```
-result: {'top_5_accuracy': 0.9281169974391805, 'top_1_accuracy': 0.7593830025608195} ckpt=train_parallel0/resnet-42_5004.ckpt
+result: {'top_5_accuracy': 0.9287972151088348, 'top_1_accuracy': 0.7597031049935979} ckpt=train_parallel/resnet-36_5004.ckpt
```
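For reference, a hedged sketch of the evaluation flow that produces a result dictionary like the one above (load a checkpoint, then call `model.eval`). The loss function, metric names, and dataset variable are assumptions; `net` and `eval_dataset` are taken to be the inference network and the evaluation dataset defined earlier in the tutorial, so this is not a copy of the repository's eval.py.

```python
# Illustrative evaluation sketch; not the repository's eval.py.
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.nn import SoftmaxCrossEntropyWithLogits

param_dict = load_checkpoint("train_parallel/resnet-36_5004.ckpt")
load_param_into_net(net, param_dict)   # `net` is the ResNet-50 inference network defined earlier
net.set_train(False)

loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
model = Model(net, loss_fn=loss, metrics={'top_1_accuracy', 'top_5_accuracy'})

res = model.eval(eval_dataset)         # `eval_dataset` built by create_dataset(..., do_train=False)
print("result:", res)
```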