Commit 6a5e496f, authored by mindspore-ci-bot, committed by Gitee

!554 Fix bugs about loss function name

Merge pull request !554 from moran/master
@@ -65,32 +65,34 @@ Parameters for both training and evaluation can be set in src/config.py.
 ## Running the example
-### Train
-#### Usage
+### Running on Ascend
+#### Train
+##### Usage
 ```
 # distributed training
-Usage: ./run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+Usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # standalone training
-Usage: ./run_standalone_train.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+Usage: bash run_standalone_train.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 ```
-#### Launch
+##### Launch
 ```
 # distribute training example
-./run_distribute_train.sh rank_table.json ~/dataset_path
+bash run_distribute_train.sh rank_table.json ~/cifar-10-batches-bin
 # standalone training example
-./run_standalone_train.sh ~/dataset_path
+bash run_standalone_train.sh ~/cifar-10-batches-bin
 ```
-> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).
+> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/distributed_training_ascend.html).
-#### Result
+##### Result
 Training result will be stored in the example path, whose folder name begins with "train" or "train_parallel". Under this, you can find checkpoint file together with result like the followings in log.
@@ -103,20 +105,20 @@ epoch: 1 step: 603, loss is 2.305666
 ...
 ```
-### Evaluation
-#### Usage
+#### Evaluation
+##### Usage
 ```
 # evaluation
-Usage: ./run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
+Usage: bash run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ```
-#### Launch
+##### Launch
 ```
 # evaluation example
-./run_eval.sh ~/cifar-10-batches-bin ~/alexnet/train/alexnet-1.591.ckpt
+bash run_eval.sh ~/cifar-10-verify-bin ~/resnet50/train/alexnet-1.591.ckpt
 ```
 > checkpoint can be produced in training process.
@@ -125,11 +127,11 @@ Usage: ./run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ### Running on GPU
 ```
 # distributed training example
-./run_distribute_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+bash run_distribute_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # standalone training example
-./run_standalone_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+bash run_standalone_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # infer example
-./run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
+bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ```
@@ -57,14 +57,13 @@ fi
 ulimit -u unlimited
 export DEVICE_NUM=8
 export RANK_SIZE=$DEVICE_NUM
-export MINDSPORE_HCCL_CONFIG_PATH=$PATH1
 export RANK_TABLE_FILE=$PATH1
 start_id=0
-for((i=start_id; i<DEVICE_NUM + start_id; i++))
+for((i=0; i<${DEVICE_NUM}; i++))
 do
-export DEVICE_ID=$i
-export RANK_ID=$((i - start_id))
+export DEVICE_ID=$((i + start_id))
+export RANK_ID=$i
 rm -rf ./train_parallel$i
 mkdir ./train_parallel$i
 cp ../*.py ./train_parallel$i
......
@@ -19,9 +19,7 @@ network config setting, will be used in train.py
 from easydict import EasyDict as edict
 cfg = edict({
-{% if dataset=='MNIST' %}
-    'num_classes': 10,
-{% elif dataset=='Cifar10' %}
+{% if dataset=='Cifar10' %}
     'num_classes': 10,
 {% elif dataset=='ImageNet' %}
     'num_classes': 1001,
......
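For orientation, here is a minimal sketch (not part of the commit) of what the config template fragment above renders to once the Jinja branches are resolved, assuming the wizard is run with dataset='Cifar10'; the values shown are the ones visible in the diff.

```
# Illustrative rendering of the template above for dataset='Cifar10';
# the ImageNet branch would emit 'num_classes': 1001 instead.
from easydict import EasyDict as edict

cfg = edict({
    'num_classes': 10,
})
print(cfg.num_classes)  # -> 10
```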
@@ -50,32 +50,34 @@ Parameters for both training and evaluation can be set in src/config.py.
 ## Running the example
-### Train
-#### Usage
+### Running on Ascend
+#### Train
+##### Usage
 ```
 # distributed training
-Usage: ./run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+Usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # standalone training
-Usage: ./run_standalone_train.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+Usage: bash run_standalone_train.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 ```
-#### Launch
+##### Launch
 ```
 # distribute training example
-./run_distribute_train.sh rank_table.json ~/MNIST_data
+bash run_distribute_train.sh rank_table.json ~/MNIST_data
 # standalone training example
-./run_standalone_train.sh ~/MNIST_data
+bash run_standalone_train.sh ~/MNIST_data
 ```
-> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).
+> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/distributed_training_ascend.html).
-#### Result
+##### Result
 Training result will be stored in the example path, whose folder name begins with "train" or "train_parallel". Under this, you can find checkpoint file together with result like the followings in log.
@@ -88,20 +90,20 @@ epoch: 1 step: 603, loss is 2.305666
 ...
 ```
-### Evaluation
-#### Usage
+#### Evaluation
+##### Usage
 ```
 # evaluation
-Usage: ./run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
+Usage: bash run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ```
-#### Launch
+##### Launch
 ```
 # evaluation example
-./run_eval.sh ~/MNIST_data ~/lenet/train_parallel0/ckpt_0/checkpoint_lenet-2_937.ckpt
+bash run_eval.sh ~/MNIST_data ~/lenet/train_parallel0/ckpt_0/checkpoint_lenet-2_937.ckpt
 ```
 > checkpoint can be produced in training process.
@@ -110,11 +112,11 @@ Usage: ./run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ### Running on GPU
 ```
 # distributed training example
-./run_distribute_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+bash run_distribute_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # standalone training example
-./run_standalone_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+bash run_standalone_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # infer example
-./run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
+bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ```
@@ -60,13 +60,11 @@ export DEVICE_NUM=8
 export RANK_SIZE=$DEVICE_NUM
 export RANK_TABLE_FILE=$PATH1
-export SERVER_ID=0
-rank_start=$((DEVICE_NUM * SERVER_ID))
-for((i=0; i<DEVICE_NUM; i++))
+start_id=0
+for((i=0; i<${DEVICE_NUM}; i++))
 do
-export DEVICE_ID=$i
-export RANK_ID=$((rank_start + i))
+export DEVICE_ID=$((i + start_id))
+export RANK_ID=$i
 rm -rf ./train_parallel$i
 mkdir ./train_parallel$i
 cp ../*.py ./train_parallel$i
......
@@ -66,32 +66,34 @@ Parameters for both training and evaluation can be set in src/config.py.
 ## Running the example
-### Train
-#### Usage
+### Running on Ascend
+#### Train
+##### Usage
 ```
 # distributed training
-Usage: ./run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+Usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # standalone training
-Usage: ./run_standalone_train.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+Usage: bash run_standalone_train.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 ```
-#### Launch
+##### Launch
 ```
 # distribute training example
-./run_distribute_train.sh rank_table.json ~/dataset_path
+bash run_distribute_train.sh rank_table.json ~/cifar-10-batches-bin
 # standalone training example
-./run_standalone_train.sh ~/dataset_path
+bash run_standalone_train.sh ~/cifar-10-batches-bin
 ```
-> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).
+> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/distributed_training_ascend.html).
-#### Result
+##### Result
 Training result will be stored in the example path, whose folder name begins with "train" or "train_parallel". Under this, you can find checkpoint file together with result like the followings in log.
@@ -104,20 +106,20 @@ epoch: 1 step: 603, loss is 2.305666
 ...
 ```
-### Evaluation
-#### Usage
+#### Evaluation
+##### Usage
 ```
 # evaluation
-Usage: ./run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
+Usage: bash run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ```
-#### Launch
+##### Launch
 ```
 # evaluation example
-./run_eval.sh ~/cifar-10-batches-bin ~/resnet50/train/alexnet-1.591.ckpt
+bash run_eval.sh ~/cifar-10-verify-bin ~/resnet50/train/alexnet-1.591.ckpt
 ```
 > checkpoint can be produced in training process.
@@ -126,11 +128,11 @@ Usage: ./run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ### Running on GPU
 ```
 # distributed training example
-./run_distribute_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+bash run_distribute_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # standalone training example
-./run_standalone_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+bash run_standalone_train_gpu.sh [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
 # infer example
-./run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
+bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ```
@@ -22,7 +22,6 @@ from mindspore import context
 from mindspore import dataset as de
 from mindspore.train.model import Model
 from mindspore.train.serialization import load_checkpoint, load_param_into_net
-from src.crossentropy import CrossEntropy
 parser = argparse.ArgumentParser(description='Image classification')
@@ -67,9 +66,14 @@ if __name__ == '__main__':
     # define loss, model
 {% if dataset=='ImageNet' %}
+{% if loss=='SoftmaxCrossEntropyWithLogits' %}
     if not cfg.use_label_smooth:
         cfg.label_smooth_factor = 0.0
-    loss = CrossEntropy(smooth_factor=cfg.label_smooth_factor, num_classes=cfg.num_classes)
+    loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean',
+                                            smooth_factor=cfg.label_smooth_factor, num_classes=cfg.num_classes)
+{% elif loss=='SoftmaxCrossEntropyExpand' %}
+    loss = nn.SoftmaxCrossEntropyExpand(sparse=True)
+{% endif %}
 {% else %}
 {% if loss=='SoftmaxCrossEntropyWithLogits' %}
     loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
......
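To make the eval-side change above concrete, here is a hedged sketch of the loss construction the fixed template renders for dataset='ImageNet' with loss='SoftmaxCrossEntropyWithLogits'. The keyword arguments are the ones that appear in the diff; the cfg values below are illustrative placeholders, not repository defaults.

```
# Sketch of the rendered eval loss after the fix: the custom
# src.crossentropy.CrossEntropy import is gone and the built-in nn loss
# is used instead. cfg values here are assumptions for illustration only.
from easydict import EasyDict as edict
import mindspore.nn as nn

cfg = edict({'num_classes': 1001, 'use_label_smooth': True, 'label_smooth_factor': 0.1})

if not cfg.use_label_smooth:
    cfg.label_smooth_factor = 0.0
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean',
                                        smooth_factor=cfg.label_smooth_factor,
                                        num_classes=cfg.num_classes)
```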
@@ -57,14 +57,13 @@ fi
 ulimit -u unlimited
 export DEVICE_NUM=8
 export RANK_SIZE=$DEVICE_NUM
-export MINDSPORE_HCCL_CONFIG_PATH=$PATH1
 export RANK_TABLE_FILE=$PATH1
 start_id=0
-for((i=start_id; i<DEVICE_NUM + start_id; i++))
+for((i=0; i<${DEVICE_NUM}; i++))
 do
-export DEVICE_ID=$i
-export RANK_ID=$((i - start_id))
+export DEVICE_ID=$((i + start_id))
+export RANK_ID=$i
 rm -rf ./train_parallel$i
 mkdir ./train_parallel$i
 cp ../*.py ./train_parallel$i
......
@@ -72,5 +72,5 @@ if [ $# == 2 ]
 then
 mpirun --allow-run-as-root -n $RANK_SIZE \
 python train.py --run_distribute=True \
---device_num=$DEVICE_NUM --device_target="GPU" --dataset_path=$PATH1 --pre_trained=$PATH3 &> log &
+--device_num=$DEVICE_NUM --device_target="GPU" --dataset_path=$PATH1 --pre_trained=$PATH2 &> log &
 fi
@@ -18,9 +18,7 @@ network config setting, will be used in train.py and eval.py
 from easydict import EasyDict as ed
 cfg = ed({
-{% if dataset=='MNIST' %}
-    'num_classes': 10,
-{% elif dataset=='Cifar10' %}
+{% if dataset=='Cifar10' %}
     'num_classes': 10,
 {% elif dataset=='ImageNet' %}
     'num_classes': 1001,
@@ -48,7 +46,7 @@ cfg = ed({
 {% if dataset=='ImageNet' %}
     "warmup_epochs": 0,
     "lr_decay_mode": "cosine",
-{% else %}
+{% elif dataset=='Cifar10' %}
     "warmup_epochs": 5,
     "lr_decay_mode": "poly",
 {% endif %}
......
@@ -29,7 +29,6 @@ from mindspore.communication.management import init, get_rank, get_group_size
 import mindspore.nn as nn
 import mindspore.common.initializer as weight_init
 from src.lr_generator import get_lr, warmup_cosine_annealing_lr
-from src.crossentropy import CrossEntropy
 parser = argparse.ArgumentParser(description='Image classification')
 parser.add_argument('--run_distribute', type=bool, default=False, help='Run distribute')
@@ -122,9 +121,14 @@ if __name__ == '__main__':
     # define loss, model
     if target == "Ascend":
 {% if dataset=='ImageNet' %}
+{% if loss=='SoftmaxCrossEntropyWithLogits' %}
         if not cfg.use_label_smooth:
             cfg.label_smooth_factor = 0.0
-        loss = CrossEntropy(smooth_factor=cfg.label_smooth_factor, num_classes=cfg.num_classes)
+        loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean',
+                                                smooth_factor=cfg.label_smooth_factor, num_classes=cfg.num_classes)
+{% elif loss=='SoftmaxCrossEntropyExpand' %}
+        loss = nn.SoftmaxCrossEntropyExpand(sparse=True)
+{% endif %}
 {% else %}
 {% if loss=='SoftmaxCrossEntropyWithLogits' %}
         loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
@@ -137,6 +141,16 @@ if __name__ == '__main__':
                       amp_level="O2", keep_batchnorm_fp32=False)
     else:
         # GPU target
+{% if dataset=='ImageNet' %}
+{% if loss=='SoftmaxCrossEntropyWithLogits' %}
+        if not cfg.use_label_smooth:
+            cfg.label_smooth_factor = 0.0
+        loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, is_grad=False, reduction='mean',
+                                                smooth_factor=cfg.label_smooth_factor, num_classes=cfg.num_classes)
+{% elif loss=='SoftmaxCrossEntropyExpand' %}
+        loss = nn.SoftmaxCrossEntropyExpand(sparse=True)
+{% endif %}
+{% else %}
 {% if loss=='SoftmaxCrossEntropyWithLogits' %}
         loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, is_grad=False, reduction='mean')
 {% elif loss=='SoftmaxCrossEntropyExpand' %}
@@ -147,6 +161,7 @@ if __name__ == '__main__':
 {% else %}
         opt = nn.{{optimizer}}(filter(lambda x: x.requires_grad, net.get_parameters()), learning_rate=lr)
 {% endif %}
+{% endif %}
     model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
     # define callbacks
......
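For completeness, a small self-contained sketch of how the rendered GPU-target loss above plugs into Model, assuming dataset='Cifar10' and loss='SoftmaxCrossEntropyWithLogits'. The Dense network and Momentum optimizer are stand-ins for illustration only and are not part of this commit; the is_grad argument follows the template above (older MindSpore API).

```
# Minimal wiring sketch: the rendered loss feeds Model together with a
# placeholder network and optimizer (both assumptions for illustration).
import mindspore.nn as nn
from mindspore.train.model import Model

net = nn.Dense(32, 10)  # stand-in network, not the model-zoo network
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, is_grad=False, reduction='mean')
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
```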