Commit 6058db43 authored by mindspore-ci-bot, committed by Gitee

!25 update distributed_training doc

Merge pull request !25 from lichen/master
...@@ -37,7 +37,7 @@ In this tutorial, we will learn how to train the ResNet-50 network in `DATA_PARA
When distributed training is performed in the lab environment, you need to configure the networking information file for the current multi-card environment. If HUAWEI CLOUD is used, skip this section.
-The Ascend 910 AI processor and 1980 AIServer are used as an example. The JSON configuration file of a two-card environment is as follows. In this example, the configuration file is named rank_table.json.
+The Ascend 910 AI processor and AIServer are used as an example. The JSON configuration file of a two-card environment is as follows. In this example, the configuration file is named rank_table.json.
```json
{
...@@ -67,11 +67,12 @@ The Ascend 910 AI processor and AIServer are used as an example. The JSON c
```
The following parameters need to be modified based on the actual training environment:
-1. `server_num` indicates the number of hosts, and `server_id` indicates the IP address of the local host.
-2. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
-3. `rank_id` indicates the logical sequence number of a card, which is always numbered from 0. `device_id` indicates the physical sequence number of a card, that is, its actual sequence number on the host where it is located.
-4. `device_ip` indicates the IP address of the NIC. You can run the `cat /etc/hccn.conf` command on the current host to obtain the IP address of the NIC.
-5. `para_plane_nic_name` indicates the name of the corresponding NIC.
+1. `board_id` indicates the environment in which the program runs.
+2. `server_num` indicates the number of hosts, and `server_id` indicates the IP address of the local host.
+3. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
+4. `rank_id` indicates the logical sequence number of a card, which is always numbered from 0. `device_id` indicates the physical sequence number of a card, that is, its actual sequence number on the host where it is located.
+5. `device_ip` indicates the IP address of the NIC. You can run the `cat /etc/hccn.conf` command on the current host to obtain the IP address of the NIC.
+6. `para_plane_nic_name` indicates the name of the corresponding NIC.
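Taken together, the fields above suggest a minimal two-card rank_table.json along the following lines. This is a sketch only: the `instance_list`/`devices` nesting, the IP addresses, and the NIC names are illustrative assumptions, not the exact file elided from the diff above.

```json
{
    "board_id": "0x0000",
    "server_num": "1",
    "device_num": "2",
    "para_plane_nic_num": "2",
    "para_plane_nic_name": ["eth0", "eth1"],
    "instance_count": "2",
    "instance_list": [
        {
            "server_id": "10.0.0.1",
            "rank_id": "0",
            "devices": [{"device_id": "0", "device_ip": "192.168.100.101"}]
        },
        {
            "server_id": "10.0.0.1",
            "rank_id": "1",
            "devices": [{"device_id": "1", "device_ip": "192.168.100.102"}]
        }
    ]
}
```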
After the networking information file is ready, add the file path to the environment variable `MINDSPORE_HCCL_CONFIG_PATH`. In addition, the `device_id` information needs to be transferred to the script. In this example, the information is transferred by configuring the environment variable `DEVICE_ID`.
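A launch script consistent with this description might look like the sketch below; the config-file path, the script name `train.py`, and the two-card loop are placeholder assumptions.

```shell
#!/bin/bash
# Point MindSpore at the networking information file (illustrative path).
export MINDSPORE_HCCL_CONFIG_PATH=/path/to/rank_table.json

# Start one training process per card; the physical card number is handed
# to the script through the DEVICE_ID environment variable.
for i in 0 1
do
    export DEVICE_ID=$i
    python train.py > train_device_$i.log 2>&1 &   # train.py is a placeholder
done
wait
```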
......
...@@ -36,7 +36,7 @@ MindSpore supports data parallelism and auto parallelism. Auto parallelism is MindSpore's fusion of
When distributed training is performed in the lab environment, you need to configure the networking information file for the current multi-card environment. If the HUAWEI CLOUD environment is used, you can skip this section.
-Taking the Ascend 910 AI processor and 1980 AIServer as an example, a sample JSON configuration file for a two-card environment is as follows. In this example, the configuration file is named rank_table.json.
+Taking the Ascend 910 AI processor and AIServer as an example, a sample JSON configuration file for a two-card environment is as follows. In this example, the configuration file is named rank_table.json.
```json
{
...@@ -66,11 +66,13 @@ MindSpore supports data parallelism and auto parallelism. Auto parallelism is MindSpore's fusion of
```
The parameters that need to be modified based on the actual training environment are:
-1. `server_num` indicates the number of machines, and `server_id` indicates the local host's IP address.
-2. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
-3. `rank_id` indicates the logical sequence number of a card, always numbered from 0. `device_id` indicates the physical sequence number of a card, that is, its actual sequence number on the machine where it is located.
-4. `device_ip` indicates the NIC IP address, which can be obtained by running `cat /etc/hccn.conf` on the current machine.
-5. `para_plane_nic_name` indicates the name of the corresponding NIC.
+1. `board_id` indicates the current running environment.
+2. `server_num` indicates the number of machines, and `server_id` indicates the local host's IP address.
+3. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
+4. `rank_id` indicates the logical sequence number of a card, always numbered from 0. `device_id` indicates the physical sequence number of a card, that is, its actual sequence number on the machine where it is located.
+5. `device_ip` indicates the NIC IP address, which can be obtained by running `cat /etc/hccn.conf` on the current machine.
+6. `para_plane_nic_name` indicates the name of the corresponding NIC.
After the networking information file is ready, add the file path to the environment variable `MINDSPORE_HCCL_CONFIG_PATH`. In addition, the `device_id` information needs to be passed into the script; in this example, it is passed by setting the environment variable `DEVICE_ID`.
...@@ -221,7 +223,7 @@ opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), lr, mome
`context.set_auto_parallel_context()` is the interface for users to set parallel parameters. The main parameters include:
- `parallel_mode`: the distributed parallel mode. Options are data parallel (`ParallelMode.DATA_PARALLEL`) and auto parallel (`ParallelMode.AUTO_PARALLEL`).
- `mirror_mean`: during backward computation, the framework collects the gradients of the data-parallel parameters scattered across multiple machines and obtains the global gradient value before passing it to the optimizer for the update. Setting it to True corresponds to the `allreduce_mean` operation, and False corresponds to the `allreduce_sum` operation.
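The difference between the two settings can be illustrated with a toy Python sketch of the aggregation semantics. This mimics the behavior described above, not MindSpore's actual implementation, and the gradient values are made up.

```python
# Toy sketch of gradient aggregation across data-parallel workers.
# mirror_mean=True  -> allreduce_mean: average the per-worker gradients.
# mirror_mean=False -> allreduce_sum:  sum the per-worker gradients.

def allreduce_sum(worker_grads):
    # Element-wise sum of each worker's gradient vector.
    return [sum(vals) for vals in zip(*worker_grads)]

def allreduce_mean(worker_grads):
    # Element-wise average: the summed gradient divided by the worker count.
    n = len(worker_grads)
    return [s / n for s in allreduce_sum(worker_grads)]

# Two workers, each holding a local gradient for the same two parameters.
grads = [[1.0, 2.0],   # worker 0
         [3.0, 6.0]]   # worker 1

print(allreduce_sum(grads))   # [4.0, 8.0]
print(allreduce_mean(grads))  # [2.0, 4.0]
```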
......