@@ -37,7 +37,7 @@ In this tutorial, we will learn how to train the ResNet-50 network in `DATA_PARA
When distributed training is performed in the lab environment, you need to configure the networking information file for the current multi-card environment. If HUAWEI CLOUD is used, skip this section.
The Ascend 910 AI processor and AIServer are used as an example. The JSON configuration file of a two-card environment is as follows. In this example, the configuration file is named `rank_table.json`.
```json
{
...
@@ -67,11 +67,12 @@ The Ascend 910 AI processor and 1980 AIServer are used as an example. The JSON c
```
The following parameters need to be modified based on the actual training environment:
1. `board_id` indicates the environment in which the program runs.
2. `server_num` indicates the number of hosts, and `server_id` indicates the IP address of the local host.
3. `device_num`, `para_plane_nic_num`, and `instance_count` indicate the number of cards.
4. `rank_id` indicates the logical sequence number of a card, which always starts from 0. `device_id` indicates the physical sequence number of a card, that is, the actual sequence number of the card on the host where it is located.
5. `device_ip` indicates the IP address of the NIC. You can run the `cat /etc/hccn.conf` command on the current host to obtain the IP address of the NIC.
6. `para_plane_nic_name` indicates the name of the corresponding NIC.
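
The numbering rules above can be sanity-checked before training starts. The following is a minimal sketch, not part of MindSpore: `check_rank_ids` is a hypothetical helper, and it assumes you have already extracted the `rank_id` values and the `device_num` value from your own configuration file (as strings or integers). It only checks that the number of `rank_id` values matches `device_num` and that the values run consecutively from 0.

```python
def check_rank_ids(rank_ids, device_num):
    """Return a list of problems; an empty list means the values look consistent.

    Hypothetical helper for illustration. `rank_ids` and `device_num` are
    values you copied out of your own networking information file.
    """
    problems = []
    ranks = sorted(int(r) for r in rank_ids)

    # device_num indicates the number of cards, so there should be one rank_id per card.
    if len(ranks) != int(device_num):
        problems.append(
            f"device_num is {device_num} but {len(ranks)} rank_id values were given"
        )

    # rank_id is a logical sequence number that starts from 0.
    if ranks != list(range(len(ranks))):
        problems.append(f"rank_id values {ranks} are not consecutive from 0")

    return problems


if __name__ == "__main__":
    # Two-card example: logical ranks 0 and 1.
    print(check_rank_ids(["0", "1"], "2"))
```

An empty result means the two checks passed. It does not validate IP addresses or NIC names.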
After the networking information file is ready, set the environment variable `MINDSPORE_HCCL_CONFIG_PATH` to the file path. In addition, the `device_id` information needs to be passed to the script. In this example, it is passed through the environment variable `DEVICE_ID`.
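
As an illustration of the step above, the following sketch sets both environment variables from Python and reads `DEVICE_ID` back inside a script. The helper names `configure_hccl_env` and `read_device_id` are hypothetical and introduced here only for illustration. In practice, these variables are often exported in a shell launch script instead. Only the variable names `MINDSPORE_HCCL_CONFIG_PATH` and `DEVICE_ID` come from this tutorial.

```python
import os


def configure_hccl_env(rank_table_path, device_id, env=None):
    """Store the networking file path and the card's device_id in `env`.

    Hypothetical helper. `env` defaults to the real process environment;
    passing a plain dict makes the function easy to try out.
    """
    env = os.environ if env is None else env
    # Use an absolute path so the setting does not depend on the working directory.
    env["MINDSPORE_HCCL_CONFIG_PATH"] = os.path.abspath(rank_table_path)
    # Environment variables are strings, so convert the id explicitly.
    env["DEVICE_ID"] = str(device_id)
    return env


def read_device_id(env=None):
    """Read DEVICE_ID back as an integer, as a training script might do."""
    env = os.environ if env is None else env
    return int(env["DEVICE_ID"])


if __name__ == "__main__":
    demo_env = configure_hccl_env("rank_table.json", 0, env={})
    print(demo_env["MINDSPORE_HCCL_CONFIG_PATH"])
    print(read_device_id(demo_env))
```

Each training process needs its own `DEVICE_ID` value, so a launcher would call something like this once per card before starting the corresponding process.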