Commit 6058db43 authored by mindspore-ci-bot, committed by Gitee

!25 update distributed_training doc

Merge pull request !25 from lichen/master
@@ -37,7 +37,7 @@ In this tutorial, we will learn how to train the ResNet-50 network in `DATA_PARA
When distributed training is performed in the lab environment, you need to configure the networking information file for the current multi-card environment. If HUAWEI CLOUD is used, skip this section.
-The Ascend 910 AI processor and 1980 AIServer are used as an example. The JSON configuration file of a two-card environment is as follows. In this example, the configuration file is named rank_table.json.
+The Ascend 910 AI processor and AIServer are used as an example. The JSON configuration file of a two-card environment is as follows. In this example, the configuration file is named rank_table.json.
```json
{
@@ -67,11 +67,12 @@ The Ascend 910 AI processor and 1980 AIServer are used as an example. The JSON c
```
The following parameters need to be modified based on the actual training environment:
-1. `server_num` indicates the number of hosts, and `server_id` indicates the IP address of the local host.
-2. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
-3. `rank_id` indicates the logical sequence number of a card, which starts from 0 fixedly. `device_id` indicates the physical sequence number of a card, that is, the actual sequence number of the host where the card is located.
-4. `device_ip` indicates the IP address of the NIC. You can run the `cat /etc/hccn.conf` command on the current host to obtain the IP address of the NIC.
-5. `para_plane_nic_name` indicates the name of the corresponding NIC.
+1. `board_id` indicates the environment in which the program runs.
+2. `server_num` indicates the number of hosts, and `server_id` indicates the IP address of the local host.
+3. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
+4. `rank_id` indicates the logical sequence number of a card, numbered from 0. `device_id` indicates the physical sequence number of a card, that is, its actual sequence number on the host where it is located.
+5. `device_ip` indicates the IP address of the NIC. You can run the `cat /etc/hccn.conf` command on the current host to obtain it.
+6. `para_plane_nic_name` indicates the name of the corresponding NIC. (A sketch of how these fields fit together follows this list.)
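For orientation, here is a minimal sketch of how such a two-card, single-host file could be assembled. It uses only the fields described in the list above; all IP addresses and NIC names are placeholders, and the exact nesting may vary with the Ascend software version:

```python
import json

# Sketch of a two-card, single-host rank_table.json built from the fields
# described above. All addresses and NIC names are placeholders.
rank_table = {
    "board_id": "0x0000",                     # environment the program runs in
    "server_num": "1",                        # number of hosts
    "device_num": "2",                        # number of cards
    "para_plane_nic_num": "2",
    "instance_count": "2",
    "para_plane_nic_name": ["eth0", "eth1"],  # NIC names
    "instance_list": [
        # rank_id: logical card number, always starting from 0.
        # device_id: physical card number on the host.
        # device_ip: NIC address taken from `cat /etc/hccn.conf`.
        {"rank_id": "0", "server_id": "10.0.0.1",
         "devices": [{"device_id": "0", "device_ip": "192.168.100.101"}]},
        {"rank_id": "1", "server_id": "10.0.0.1",
         "devices": [{"device_id": "1", "device_ip": "192.168.100.102"}]},
    ],
}

with open("rank_table.json", "w") as f:
    json.dump(rank_table, f, indent=4)
```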
After the networking information file is ready, add the file path to the environment variable `MINDSPORE_HCCL_CONFIG_PATH`. In addition, the `device_id` information needs to be passed into the script. In this example, it is passed in by configuring the environment variable `DEVICE_ID`.
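As an illustration of how a training script might consume these two values (the file path is a placeholder, and `context.set_context` is assumed to be available as in MindSpore scripts of this era):

```python
import os
from mindspore import context

# Placeholder path to the networking file prepared above.
os.environ["MINDSPORE_HCCL_CONFIG_PATH"] = "/path/to/rank_table.json"

# Pick up the card number that the launcher passed in through DEVICE_ID.
device_id = int(os.environ["DEVICE_ID"])
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend",
                    device_id=device_id)
```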
......
@@ -36,7 +36,7 @@ MindSpore supports data parallel and auto parallel. Auto parallel is MindSpore's fusion of
When performing distributed training in a lab environment, you need to configure the networking information file for the current multi-card environment. If you use the HUAWEI CLOUD environment, you can skip this section.
-Taking the Ascend 910 AI processor and 1980 AIServer as an example, a sample JSON configuration file for a two-card environment is as follows. In this example, the configuration file is named rank_table.json.
+Taking the Ascend 910 AI processor and AIServer as an example, a sample JSON configuration file for a two-card environment is as follows. In this example, the configuration file is named rank_table.json.
```json
{
@@ -66,11 +66,13 @@ MindSpore supports data parallel and auto parallel. Auto parallel is MindSpore's fusion of
```
The following parameters need to be modified based on the actual training environment:
-1. `server_num` indicates the number of machines, and `server_id` indicates the IP address of the local machine.
-2. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
-3. `rank_id` indicates the logical sequence number of a card, numbered from 0. `device_id` indicates the physical sequence number of a card, that is, its actual sequence number on the machine where it is located.
-4. `device_ip` indicates the IP address of the NIC, which you can obtain by running `cat /etc/hccn.conf` on the current machine.
-5. `para_plane_nic_name` indicates the name of the corresponding NIC.
+1. `board_id` indicates the environment in which the program runs.
+2. `server_num` indicates the number of machines, and `server_id` indicates the IP address of the local machine.
+3. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
+4. `rank_id` indicates the logical sequence number of a card, numbered from 0. `device_id` indicates the physical sequence number of a card, that is, its actual sequence number on the machine where it is located.
+5. `device_ip` indicates the IP address of the NIC, which you can obtain by running `cat /etc/hccn.conf` on the current machine.
+6. `para_plane_nic_name` indicates the name of the corresponding NIC.
After the networking information file is ready, add its path to the environment variable `MINDSPORE_HCCL_CONFIG_PATH`. In addition, the `device_id` information needs to be passed into the script; this example passes it in by configuring the environment variable `DEVICE_ID`, as in the launcher sketch below.
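One possible way to pass `DEVICE_ID` in is a small launcher that starts one training process per card. This is only a sketch under stated assumptions: `train.py` is a hypothetical training script, and the file path is a placeholder.

```python
import os
import subprocess

DEVICE_NUM = 2  # matches device_num in rank_table.json

procs = []
for device_id in range(DEVICE_NUM):
    env = dict(os.environ)
    env["MINDSPORE_HCCL_CONFIG_PATH"] = "/path/to/rank_table.json"  # placeholder
    env["DEVICE_ID"] = str(device_id)  # card number consumed by the script
    procs.append(subprocess.Popen(["python", "train.py"], env=env))

# Wait for every per-card process to finish.
for p in procs:
    p.wait()
```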
@@ -221,7 +223,7 @@ opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), lr, mome
`context.set_auto_parallel_context()` is the interface for users to set parallelism parameters. Its main parameters include:
- `parallel_mode`: the distributed parallel mode. The options are data parallel, `ParallelMode.DATA_PARALLEL`, and auto parallel, `ParallelMode.AUTO_PARALLEL`.
-- `mirror_mean`: during the backward pass, the framework internally collects the gradients of the data parallel parameters scattered across multiple machines, obtains the global gradient value, and then passes it into the optimizer for the update.
+- `mirror_mean`: during the backward pass, the framework internally collects the gradients of the data parallel parameters scattered across multiple machines, obtains the global gradient value, and then passes it into the optimizer for the update. Setting it to True corresponds to the `allreduce_mean` operation, and False corresponds to the `allreduce_sum` operation (see the sketch after this list).
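A minimal sketch of how these parameters fit together; the `ParallelMode` import path and the `init()` communication setup are assumptions based on MindSpore releases of this period and may differ across versions:

```python
from mindspore import context
from mindspore.communication.management import init  # assumed import path
from mindspore.train.model import ParallelMode       # assumed import path

# Initialize the collective communication backend, then enable data parallel
# training with gradients averaged across cards (mirror_mean=True maps to
# allreduce_mean; False would map to allreduce_sum).
init()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  mirror_mean=True)
```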
......