export MS_SCHED_HOST=XXX.XXX.XXX.XXX # Scheduler IP address
export MS_SCHED_PORT=XXXX # Scheduler port
export MS_ROLE=MS_SCHED # The role of this process: MS_SCHED represents the scheduler, MS_WORKER represents the worker, MS_PSERVER represents the Server
```
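A missing variable tends to surface only when the processes fail to find each other, so it can be worth checking the environment before launching. The following is a minimal sketch rather than anything MindSpore provides: `require_env` is a hypothetical helper that takes variable names as arguments and reports any that are unset or empty in the current shell. It does not verify that a variable was actually exported, nor that its value is valid.

```bash
#!/bin/bash
# require_env (hypothetical helper, not part of MindSpore):
# report every named variable that is unset or empty.
# The body runs in a subshell, so its working variables do not leak out.
require_env() (
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set or empty" >&2
      missing=1
    fi
  done
  exit "$missing"
)

# Example: stop before launching if a setting is missing.
# require_env MS_SCHED_HOST MS_ROLE || exit 1
```

Calling the helper at the top of a launch script, before `python train.py`, would make a misconfigured shell fail immediately with the name of the offending variable.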
## Running Training
1. Shell scripts
Provide one shell script for each of the three roles (Worker, Server, and Scheduler) to launch training:
`Scheduler.sh`:
```bash
#!/bin/bash
export MS_SERVER_NUM=1
export MS_WORKER_NUM=1
export MS_SCHED_HOST=XXX.XXX.XXX.XXX
export MS_SCHED_PORT=XXXX
export MS_ROLE=MS_SCHED
python train.py
```
`Server.sh`:
```bash
#!/bin/bash
export MS_SERVER_NUM=1
export MS_WORKER_NUM=1
export MS_SCHED_HOST=XXX.XXX.XXX.XXX
export MS_SCHED_PORT=XXXX
export MS_ROLE=MS_PSERVER
python train.py
```
`Worker.sh`:
```bash
#!/bin/bash
export MS_SERVER_NUM=1
export MS_WORKER_NUM=1
export MS_SCHED_HOST=XXX.XXX.XXX.XXX
export MS_SCHED_PORT=XXXX
export MS_ROLE=MS_WORKER
python train.py
```
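The three scripts differ only in the value of `MS_ROLE`, so they could be folded into one launcher. The sketch below is one possible arrangement, not part of the original tutorial: the function names `ms_role_for` and `launch_role` are hypothetical, and it assumes the shared variables (server and worker counts, scheduler host and port) are already exported in the calling shell.

```bash
#!/bin/bash
# Hypothetical single launcher for all three roles.

# Map a short role name to the MS_ROLE value used by the scripts above.
ms_role_for() {
  case "$1" in
    scheduler) echo "MS_SCHED" ;;
    server)    echo "MS_PSERVER" ;;
    worker)    echo "MS_WORKER" ;;
    *)
      echo "unknown role: '$1' (expected scheduler, server, or worker)" >&2
      return 1
      ;;
  esac
}

# Start train.py in the given role. The body runs in a subshell, and MS_ROLE
# is set for the python process only; the shared variables are inherited.
launch_role() (
  role=$(ms_role_for "$1") || exit 1
  MS_ROLE="$role" python train.py
)
```

With these functions defined, `launch_role scheduler > scheduler.log 2>&1 &` would take the place of running `Scheduler.sh` in the background, and likewise for `server` and `worker`.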
Finally, run the three scripts:
```bash
sh Scheduler.sh > scheduler.log 2>&1 &
sh Server.sh > server.log 2>&1 &
sh Worker.sh > worker.log 2>&1 &
```
This starts the training.
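Starting the processes with a bare `&` discards their process IDs, which makes it awkward to tell later whether the worker finished successfully. As a sketch addressing that, the hypothetical helper `start_logged` below launches a command in the background with the same log redirection as above and records its PID in the variable `last_pid` (a name chosen here for illustration), so that `wait` can report the exit status.

```bash
#!/bin/bash
# start_logged LOGFILE COMMAND [ARGS...]  (hypothetical helper)
# Run COMMAND in the background with stdout and stderr written to LOGFILE,
# and store the PID of the background job in the global variable last_pid.
start_logged() {
  logfile=$1
  shift
  "$@" > "$logfile" 2>&1 &
  last_pid=$!
}

# Example, mirroring the three commands above:
# start_logged scheduler.log sh Scheduler.sh
# start_logged server.log    sh Server.sh
# start_logged worker.log    sh Worker.sh
# wait "$last_pid"   # blocks until the worker exits; $? is its exit status
```

Because `last_pid` is overwritten on each call, copy it into another variable right after a call if the scheduler or server PID is needed later.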
2. Viewing the results
Check `scheduler.log` for the log of the Scheduler's communication with the Server and Worker:
```
Bind to role=scheduler, id=1, ip=XXX.XXX.XXX.XXX, port=XXXX
Assign rank=8 to node role=server, ip=XXX.XXX.XXX.XXX, port=XXXX
Assign rank=9 to node role=worker, ip=XXX.XXX.XXX.XXX, port=XXXX
the scheduler is connected to 1 workers and 1 servers