In deep learning, growing datasets and model parameters prolong training time and demand more hardware resources, which becomes a training bottleneck. Parallel distributed training is an important optimization for training: it reduces the per-device hardware requirements, such as memory and computing performance. Based on the parallel principle and mode, parallelism is generally classified into the following types:
- Data parallelism: splits the data into multiple batches and allocates them to the workers, each of which computes the full model on its own share of the data (see the sharding sketch after this list).
- Model parallelism: splits the model. MindSpore supports intra-layer model parallelism, in which the parameters are split and allocated to each worker for training.
- Hybrid parallelism: combines data parallelism and model parallelism.
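The following is a minimal sketch of how a data-parallel input pipeline can be sharded across workers. It assumes MindSpore's `mindspore.dataset` API with the `num_shards`/`shard_id` arguments; the dataset path and the hard-coded rank values are hypothetical and would normally come from the communication layer.

```python
import mindspore.dataset as ds

# Suppose 8 workers participate and this process is worker 3. In a real job
# these values come from the communication layer (e.g. get_rank() and
# get_group_size() after init()).
rank_size, rank_id = 8, 3

# Each worker reads a disjoint 1/rank_size shard of the dataset, so the group
# jointly covers the whole dataset while every device runs the same model.
dataset = ds.Cifar10Dataset("/path/to/cifar-10-batches-bin",  # hypothetical path
                            num_shards=rank_size, shard_id=rank_id)
dataset = dataset.batch(32, drop_remainder=True)
```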
MindSpore also provides the parallel distributed training capability. It supports the following modes (a context-setup sketch follows the list):
- `DATA_PARALLEL`: data parallelism.
- `AUTO_PARALLEL`: automatic parallelism, which integrates data parallelism, model parallelism, and hybrid parallelism. A cost model is built automatically to select a parallel strategy for the user: it models the training time from the memory, computation, and communication overheads of the Ascend 910 chip, and efficient algorithms then search for a parallel strategy with a relatively short training time.
- `HYBRID_PARALLEL`: hybrid parallelism, in which users manually split parameters on MindSpore to implement intra-layer model parallelism.
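As a rough sketch of how a mode is selected, the snippet below configures the distributed context before building the network. It assumes a MindSpore 1.x-style API (`context.set_context`, `context.set_auto_parallel_context`, and `mindspore.communication.management.init`); import paths and argument names such as `gradients_mean` vary across releases, so check the API documentation of the version you use.

```python
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init

# Run in graph mode on the Ascend backend targeted by this tutorial.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

# Initialize the collective-communication backend (HCCL on Ascend); this call
# expects a properly configured multi-device launch environment.
init()

# Choose the distributed mode. Replacing DATA_PARALLEL with AUTO_PARALLEL lets
# the cost model derive the parallel strategy automatically; argument names
# such as gradients_mean may differ in other MindSpore releases.
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
```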
This tutorial describes how to train the ResNet-50 network in data parallel and automatic parallel modes on MindSpore.
> The example in this tutorial applies to hardware platforms based on the Ascend 910 AI processor and does not support CPU or GPU scenarios.
> Download address of the complete sample code: <https://gitee.com/mindspore/docs/blob/master/tutorials/tutorial_code/distributed_training/resnet50_distributed_training.py>