# Run Distributed Training

In this article, we explain how to run distributed Paddle training jobs on clusters.  We will create the distributed version of the single-process training example, [recommendation](https://github.com/baidu/Paddle/tree/develop/demo/recommendation).

[Scripts](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train) used in this article launch distributed jobs via SSH.  They also work as a reference for users running more sophisticated cluster management systems like MPI and [Kubernetes](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/k8s).

## Prerequisite

1. The aforementioned scripts use the Python library [fabric](http://www.fabfile.org/) to run SSH commands.  We can use `pip` to install fabric:

   ```bash
   pip install fabric
   ```

1. We need to install PaddlePaddle on all nodes in the cluster.  To enable GPUs, we need to install CUDA in `/usr/local/cuda`; otherwise Paddle would report errors at runtime.

1. Set the `ROOT_DIR` variable in `cluster_train/conf.py` on all nodes.  For convenience, we often create a Unix user `paddle` on all nodes and set `ROOT_DIR=/home/paddle`.  In this way, we can write public SSH keys into `/home/paddle/.ssh/authorized_keys` so that user `paddle` can SSH to all nodes without a password.

## Prepare Job Workspace

We refer to the directory where we put dependent libraries, config files, etc., as *workspace*.

The train/test data should be prepared before launching the cluster job. Because the data may be placed in a directory different from the workspace, PaddlePaddle locates it through index files named `train.list` and `test.list`, which are referenced in the model config file. The train/test data directory therefore also contains these two list files. Every local training demo already provides scripts that create them, and under normal conditions all nodes in a cluster job process their files with the same logic.
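
As an illustration of how such an index file can be produced, here is a minimal sketch. It assumes that a list file holds one data file path per line and that data files can be selected by a filename suffix; the helper name `write_file_list` is hypothetical and is not part of the demo scripts.

```python
import os


def write_file_list(data_dir, list_path, suffix):
    """Write the paths of all files in data_dir ending with suffix, one per line.

    Returns the list of paths that were written (sorted for reproducibility).
    """
    paths = sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith(suffix)
    )
    with open(list_path, "w") as list_file:
        for path in paths:
            list_file.write(path + "\n")
    return paths
```

For example, `write_file_list("data", "data/train.list", ".train")` would index a file such as `data/ratings.dat.train`. Each node would run the same logic over the data files it holds locally.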

Generally, you can use the same model file for cluster training as for local training. Keep in mind that, when synchronous SGD is used, the `batch_size` set in the `settings` function of the model file is the batch size on each node of the cluster job, not the total batch size.
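
The arithmetic behind this can be sketched as follows. The function names and the numbers in the usage note are illustrative assumptions, not values taken from the demo.

```python
def global_batch_size(per_node_batch_size, num_nodes):
    """Samples consumed per synchronous SGD step across the whole cluster."""
    if per_node_batch_size <= 0 or num_nodes <= 0:
        raise ValueError("batch size and node count must be positive")
    return per_node_batch_size * num_nodes


def per_node_batch_size(target_global_batch_size, num_nodes):
    """Per-node batch_size that yields the desired total batch size."""
    if target_global_batch_size <= 0 or num_nodes <= 0:
        raise ValueError("batch size and node count must be positive")
    if target_global_batch_size % num_nodes != 0:
        raise ValueError("total batch size must be divisible by the node count")
    return target_global_batch_size // num_nodes
```

With a hypothetical `batch_size` of 1000 on 2 nodes, each synchronous step consumes `global_batch_size(1000, 2)`, that is, 2000 samples. To keep the total equal to a local run, divide the local batch size by the number of nodes.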

The following steps are based on the [demo/recommendation](https://github.com/PaddlePaddle/Paddle/tree/develop/demo/recommendation) demo.

Go through the demo/recommendation tutorial up to the `Train` section; you will then have the train/test data and the model configuration file. Finally, use demo/recommendation as the workspace for cluster training.

Your workspace should then look as follows:
```
.
|-- common_utils.py
|-- data
|   |-- config.json
|   |-- config_generator.py
|   |-- meta.bin
|   |-- meta_config.json
|   |-- meta_generator.py
|   |-- ml-1m
|   |-- ml_data.sh
|   |-- ratings.dat.test
|   |-- ratings.dat.train
|   |-- split.py
|   |-- test.list
|   `-- train.list
|-- dataprovider.py
|-- evaluate.sh
|-- prediction.py
|-- preprocess.sh
|-- requirements.txt
|-- run.sh
`-- trainer_config.py
```
Not all of these files are needed for cluster training, but there is no need to remove the unused ones.

`trainer_config.py`
Indicates the model config file.

`train.list` and `test.list`
The file indexes. Each one stores the relative or absolute paths of all train/test data files on the current node.

`dataprovider.py`
Used to read train/test samples. It is the same as in local training.

`data`
All files in the data directory are referenced by train.list/test.list, which are in turn read by the data provider.


## Prepare Cluster Job Configuration

The options below must be set carefully in `cluster_train/conf.py`:

`HOSTS`  the hostnames or IP addresses of all nodes that will run the cluster job. You can also add a user and an SSH port to each host, such as root@192.168.100.17:9090.

`ROOT_DIR`  the root directory under which the job workspace directory is placed.

`PADDLE_NIC`  the name of the NIC (Network Interface Card) used for the cluster communication channel, such as eth0 for Ethernet or ib0 for InfiniBand.

`PADDLE_PORT`  the port number for the cluster communication channel.

`PADDLE_PORTS_NUM`  the number of ports used for the cluster communication channel. If the cluster is small (fewer than 5 or 6 nodes), setting it to a larger value, such as 2 to 8, is recommended for better network performance.

`PADDLE_PORTS_NUM_FOR_SPARSE`  the number of ports used for the sparse updater's cluster communication channel. If sparse remote update is used, set it in the same way as `PADDLE_PORTS_NUM`.

`LD_LIBRARY_PATH`  additional `LD_LIBRARY_PATH` entries for the cluster job. You can use it to add the path to the CUDA libraries.

The default configuration is as follows:

```python
HOSTS = [
        "root@192.168.100.17",
        "root@192.168.100.18",
        ]

'''
workspace configuration
'''

#root dir for workspace
ROOT_DIR = "/home/paddle"

'''
network configuration
'''
#pserver nics
PADDLE_NIC = "eth0"
#pserver port
PADDLE_PORT = 7164
#pserver ports num
PADDLE_PORTS_NUM = 2
#pserver sparse ports num
PADDLE_PORTS_NUM_FOR_SPARSE = 2

#environments setting for all processes in cluster job
LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
```
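
To make the `HOSTS` entry format and the port options more concrete, here is a small sketch. Both helper names are hypothetical. The SSH default port of 22 is standard, but the idea that a pserver occupies consecutive ports starting at `PADDLE_PORT` is an assumption made for illustration; check it against your PaddlePaddle version before relying on it for firewall rules.

```python
def parse_host(entry, default_user=None, default_port=22):
    """Split 'user@host:port' into (user, host, port); user and port are optional.

    IPv6 literals are not handled in this sketch.
    """
    user, host = default_user, entry
    if "@" in entry:
        user, host = entry.split("@", 1)
    port = default_port
    if ":" in host:
        host, port_text = host.rsplit(":", 1)
        port = int(port_text)
    return user, host, port


def pserver_ports(base_port, ports_num, ports_num_for_sparse=0):
    """Ports to open on each node, ASSUMING consecutive allocation from base_port."""
    return list(range(base_port, base_port + ports_num + ports_num_for_sparse))
```

With the default configuration above, `parse_host("root@192.168.100.17")` gives the user `root`, the host `192.168.100.17`, and the default SSH port, and `pserver_ports(7164, 2, 2)` lists four ports starting at 7164 under the stated assumption.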

### Launching Cluster Job
`paddle.py` provides scripts that automatically start all PaddlePaddle cluster processes on the different nodes. By default, all command line options can be passed as `paddle.py` command options, and `paddle.py` transparently forwards them to the lower-level PaddlePaddle processes.

`paddle.py` provides two dedicated command options for easy job launching.

`job_dispatch_package`  set it to a local `workspace` directory; the directory will be dispatched to all nodes listed in conf.py. This helps when you modify workspace files frequently, since redeploying the workspace to many nodes by hand quickly becomes tedious.

`job_workspace`  set it to an already deployed workspace directory; `paddle.py` will skip the dispatch stage and launch the cluster job directly on all nodes. This avoids the heavy dispatch latency.
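
The choice between the two options can be sketched as a small helper that builds the corresponding command-line argument. This is only an illustration: the function name is hypothetical, the `--name=value` spelling is an assumption about how `paddle.py` parses its options, and requiring exactly one of the two is a simplification made here.

```python
def launch_options(job_dispatch_package=None, job_workspace=None):
    """Return the paddle.py argument for exactly one of the two launch modes."""
    if (job_dispatch_package is None) == (job_workspace is None):
        raise ValueError("set exactly one of job_dispatch_package or job_workspace")
    if job_dispatch_package is not None:
        # Dispatch mode: copy the local workspace to every node, then launch.
        return ["--job_dispatch_package=" + job_dispatch_package]
    # Workspace mode: the workspace is already deployed, so skip dispatching.
    return ["--job_workspace=" + job_workspace]
```

During development you would typically use the dispatch mode; once the workspace is stable on all nodes, the workspace mode saves the copy step on every launch.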

`cluster_train/run.sh` provides a sample command line for running the `demo/recommendation` cluster job. Modify `job_dispatch_package` and `job_workspace` to point to your own directories, then run:
```
sh run.sh
```

The cluster job will start in several seconds.

### Kill Cluster Job
`paddle.py` captures the `Ctrl + C` SIGINT signal and automatically kills all processes it launched, so simply stopping `paddle.py` kills the cluster job. If the program crashes, you have to kill the job manually.
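
The general pattern of cleaning up child processes on `Ctrl + C` can be sketched as below. This is not the code of `paddle.py`, which manages processes on remote nodes; it is a local illustration using only the Python standard library, and both function names are hypothetical.

```python
import signal
import subprocess
import sys


def terminate_all(processes, timeout=5.0):
    """Terminate every still-running child; kill those that do not exit in time."""
    for proc in processes:
        if proc.poll() is None:
            proc.terminate()
    for proc in processes:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def install_sigint_cleanup(processes):
    """On SIGINT (Ctrl + C), stop all children and exit with a non-zero status."""
    def handler(signum, frame):
        terminate_all(processes)
        sys.exit(1)

    # Must be called from the main thread of the launcher.
    signal.signal(signal.SIGINT, handler)
```

If the launcher itself crashes, no handler runs, which is why leftover processes then have to be killed by hand.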

### Check Cluster Training Result
Check the logs in $workspace/log for details. Each node has the same log structure.

`paddle_trainer.INFO`
It contains almost all internal output logs of training, the same as in local training. Check the runtime model convergence here.

`paddle_pserver2.INFO`
It contains the pserver running log, which can help diagnose distributed errors.

`server.log`
It contains the stderr and stdout of the pserver process. Check this error log if training crashes.

`train.log`
It contains the stderr and stdout of the trainer process. Check this error log if training crashes.
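
When a job fails, scanning these four files on a node can be automated. The following sketch assumes the log directory has been made available locally; the helper name and the keyword list are illustrative assumptions, since the exact wording of PaddlePaddle error messages is not specified here.

```python
import os

# Log file names as listed in this section.
LOG_NAMES = ("paddle_trainer.INFO", "paddle_pserver2.INFO", "server.log", "train.log")


def find_error_lines(log_dir, keywords=("error", "fatal", "traceback")):
    """Return (file name, line number, text) for log lines containing a keyword."""
    hits = []
    for name in LOG_NAMES:
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # A node may not have produced every log file.
        with open(path, errors="replace") as log_file:
            for number, line in enumerate(log_file, start=1):
                lowered = line.lower()
                if any(keyword in lowered for keyword in keywords):
                    hits.append((name, number, line.rstrip("\n")))
    return hits
```

Running it on the log directory of each node gives a quick first view of where a crash was reported before reading the full logs.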

### Check Model Output
After one pass finishes, model files are written to the `output` directory on node 0.
`nodefile` in the workspace indicates the node ID of the current cluster job.