In deep learning, growing datasets and model parameters prolong training time and demand more hardware resources, which becomes a training bottleneck. Parallel distributed training is an important optimization for training: it reduces the per-device hardware requirements, such as memory and computing performance. Based on the parallel principle and mode, parallelism is generally classified into the following types:
- Data parallelism: splits the data into multiple batches and allocates them to the workers, each of which computes the full model on its own share of the data (see the sharding sketch after this list).
- Model parallelism: splits the model. MindSpore supports intra-layer model parallelism, in which the parameters are split and allocated to each worker for training.
- Hybrid parallelism: combines data parallelism and model parallelism.
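The following is a minimal sketch of how a data-parallel input pipeline can be sharded across workers. It assumes MindSpore's `mindspore.dataset` API with the `num_shards`/`shard_id` arguments; the dataset path and the hard-coded rank values are hypothetical and would normally come from the communication layer.

```python
import mindspore.dataset as ds

# Suppose 8 workers participate and this process is worker 3. In a real job
# these values come from the communication layer (e.g. get_rank() and
# get_group_size() after init()).
rank_size, rank_id = 8, 3

# Each worker reads a disjoint 1/rank_size shard of the dataset, so the group
# jointly covers the whole dataset while every device runs the same model.
dataset = ds.Cifar10Dataset("/path/to/cifar-10-batches-bin",  # hypothetical path
                            num_shards=rank_size, shard_id=rank_id)
dataset = dataset.batch(32, drop_remainder=True)
```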
MindSpore also provides the parallel distributed training capability. It supports the following modes (a context-setup sketch follows the list):
- `DATA_PARALLEL`: data parallelism.
- `AUTO_PARALLEL`: automatic parallelism, which integrates data parallelism, model parallelism, and hybrid parallelism. A cost model is built automatically to select a parallel strategy for the user: it models the training time from the memory, computation, and communication overheads of the Ascend 910 chip, and efficient algorithms then search for a parallel strategy with a relatively short training time.
- `HYBRID_PARALLEL`: hybrid parallelism, in which users manually split parameters on MindSpore to implement intra-layer model parallelism.
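As a rough sketch of how a mode is selected, the snippet below configures the distributed context before building the network. It assumes a MindSpore 1.x-style API (`context.set_context`, `context.set_auto_parallel_context`, and `mindspore.communication.management.init`); import paths and argument names such as `gradients_mean` vary across releases, so check the API documentation of the version you use.

```python
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init

# Run in graph mode on the Ascend backend targeted by this tutorial.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

# Initialize the collective-communication backend (HCCL on Ascend); this call
# expects a properly configured multi-device launch environment.
init()

# Choose the distributed mode. Replacing DATA_PARALLEL with AUTO_PARALLEL lets
# the cost model derive the parallel strategy automatically; argument names
# such as gradients_mean may differ in other MindSpore releases.
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
```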
This tutorial describes how to train the ResNet-50 network in data parallel and automatic parallel modes on MindSpore.
> The example in this tutorial applies to hardware platforms based on the Ascend 910 AI processor and does not support CPU or GPU scenarios.
> Download address of the complete sample code: <https://gitee.com/mindspore/docs/blob/master/tutorials/tutorial_code/distributed_training/resnet50_distributed_training.py>