# Cluster Training

We provide some simple scripts in ```paddle/scripts/cluster_train``` to help you launch cluster training jobs and harness PaddlePaddle's distributed training. For MPI and other cluster schedulers, use these simple scripts as a reference for implementing a more robust cluster training platform yourself.

The following cluster demo is based on the RECOMMENDATION local training demo in the PaddlePaddle ```demo/recommendation``` directory. We assume that you are in the ```paddle/scripts/cluster_train/``` directory.

## Pre-requirements

Firstly, install fabric:

```bash
pip install fabric
```

Secondly, follow the installation scripts to install PaddlePaddle on all nodes, and make sure the demo can run in local mode. For CUDA-enabled training, we assume that CUDA is installed in ```/usr/local/cuda```; otherwise, errors about missing CUDA runtime libraries may be reported at cluster runtime. In short, the local training environment must be fully prepared before using these simple scripts.

Then prepare the same ROOT_DIR directory on all nodes. ROOT_DIR is defined in cluster_train/conf.py. Assuming ROOT_DIR = /home/paddle, you can also create a ```paddle``` user account on every node, so that ```paddle.py``` can open ssh connections to all nodes as the ```paddle``` user automatically.

Finally, you can set up ssh mutual trust between all nodes for password-free ssh login; otherwise the ```password``` must be provided to ```paddle.py``` at runtime.
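
Before launching a job, it can be useful to confirm that password-free login really works on every node. The following is a minimal sketch, not part of ```paddle.py```: the helper names, the default ```paddle``` user and the 5 second timeout are assumptions made for illustration. It relies only on the standard OpenSSH options ```BatchMode``` and ```ConnectTimeout```.

```python
import subprocess


def ssh_check_command(host, user="paddle", port=22):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # never fall back to an interactive password prompt
        "-o", "ConnectTimeout=5",  # give up quickly on unreachable nodes
        "-p", str(port),
        "%s@%s" % (user, host),
        "true",                    # no-op remote command
    ]


def can_login_without_password(host, user="paddle", port=22):
    """Return True if key-based login to the node succeeds."""
    return subprocess.call(ssh_check_command(host, user, port)) == 0
```

Calling ```can_login_without_password("192.168.100.17")``` for each entry in your host list should return ```True``` once mutual trust is set up.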

## Prepare Job Workspace

A ```Job workspace``` is a package directory that contains the dependency libraries, train data, test data, model config file and all other related file dependencies.

The ```train/test``` data should be prepared before launching the cluster job. To allow train/test data to be placed in a different directory from the workspace, PaddlePaddle locates train/test data through index files named ```train.list/test.list```, which are referenced in the model config file. So the train/test data also includes the two list files train.list and test.list. All local training demos already provide scripts to help you create these two files, and under normal conditions all nodes in a cluster job handle the files with the same logic.

Generally, you can use the same model file for cluster training as for local training. Keep in mind that the ```batch_size``` set in the ```setting``` function of the model file is the batch size on ```each``` node of the cluster job, not the total batch size, if synchronous SGD is used.
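
To make the per-node meaning of ```batch_size``` concrete, here is a small arithmetic sketch. The helper names and the numbers are hypothetical; it only assumes, as stated above, that with synchronous SGD every node processes ```batch_size``` samples per step.

```python
def effective_batch_size(batch_size_per_node, num_nodes):
    """Total samples consumed per synchronous SGD step across the cluster."""
    return batch_size_per_node * num_nodes


def per_node_batch_size(total_batch_size, num_nodes):
    """Per-node batch_size needed to keep a desired total batch size."""
    if total_batch_size % num_nodes != 0:
        raise ValueError("total batch size must be divisible by the node count")
    return total_batch_size // num_nodes


# A model file with batch_size=1000, run on 2 nodes, trains on 2000 samples per step.
print(effective_batch_size(1000, 2))  # 2000
# To keep a total of 1000 samples per step on 2 nodes, set batch_size=500.
print(per_node_batch_size(1000, 2))   # 500
```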

The following steps are based on the demo/recommendation demo in the demo directory.

Go through the demo/recommendation tutorial doc up to the ```Train``` section; you will then have the train/test data and the model configuration file. Finally, just use demo/recommendation as the workspace for cluster training.

In the end, your workspace should look as follows:
```
.
|-- common_utils.py
|-- data
|   |-- config.json
|   |-- config_generator.py
|   |-- meta.bin
|   |-- meta_config.json
|   |-- meta_generator.py
|   |-- ml-1m
|   |-- ml_data.sh
|   |-- ratings.dat.test
|   |-- ratings.dat.train
|   |-- split.py
|   |-- test.list
|   `-- train.list
|-- dataprovider.py
|-- evaluate.sh
|-- prediction.py
|-- preprocess.sh
|-- requirements.txt
|-- run.sh
`-- trainer_config.py
```
Not all of these files are needed for cluster training, but there is no need to remove the unused files.

```trainer_config.py```
Indicates the model config file.

```train.list``` and ```test.list```
File indexes. They store the relative or absolute paths of all train/test data files on the current node.

```dataprovider.py```
Used to read train/test samples. It is the same as in local training.

```data```
All files in the data directory are referenced by train.list/test.list, which are in turn referenced by the data provider.
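
The demo scripts already create ```train.list``` and ```test.list```, but if the data on a node is laid out differently, the index files can be regenerated by hand. The sketch below is a hypothetical helper, not one of the demo scripts, and it assumes the list format is one file path per line.

```python
import os


def write_file_list(data_dir, list_path, prefix):
    """Write the paths of files in data_dir whose names start with prefix, one per line."""
    names = sorted(n for n in os.listdir(data_dir) if n.startswith(prefix))
    with open(list_path, "w") as f:
        for name in names:
            f.write(os.path.join(data_dir, name) + "\n")
    return len(names)  # number of data files indexed
```

For the workspace shown above, ```write_file_list("./data", "./data/train.list", "ratings.dat.train")``` would index the training split, and the same call with ```ratings.dat.test``` and ```test.list``` would index the test split.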


## Prepare Cluster Job Configuration

The options below must be set carefully in cluster_train/conf.py.

```HOSTS``` the hostnames or IPs of all nodes that will run the cluster job. You can also add a user and ssh port to a hostname, such as root@192.168.100.17:9090.

```ROOT_DIR``` the workspace ROOT directory in which the JOB workspace directory is placed.

```PADDLE_NIC``` the NIC (Network Interface Card) interface name for the cluster communication channel, such as eth0 for Ethernet or ib0 for InfiniBand.

```PADDLE_PORT``` the port number for the cluster communication channel.

```PADDLE_PORTS_NUM``` the number of ports used for the cluster communication channel. If the number of cluster nodes is small (fewer than 5~6 nodes), we recommend setting it to a larger value, such as 2 ~ 8, for better network performance.

```PADDLE_PORTS_NUM_FOR_SPARSE``` the number of ports used for the sparse updater cluster communication channel. If sparse remote update is used, set it similarly to ```PADDLE_PORTS_NUM```.

```LD_LIBRARY_PATH``` an additional LD_LIBRARY_PATH for the cluster job. You can use it to set the CUDA libraries path.
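
When nodes sit behind a firewall, it helps to know which ports a given configuration is likely to need. The sketch below rests on an assumption that this document does not confirm: that the ports are allocated consecutively starting at ```PADDLE_PORT```, dense ports first and sparse ports after them. The function name is hypothetical; check your PaddlePaddle version before relying on the result.

```python
def pserver_ports(base_port, ports_num, ports_num_for_sparse=0):
    """List candidate pserver ports, ASSUMING consecutive allocation from base_port."""
    total = ports_num + ports_num_for_sparse
    return list(range(base_port, base_port + total))


# PADDLE_PORT = 7164, PADDLE_PORTS_NUM = 2, PADDLE_PORTS_NUM_FOR_SPARSE = 2
print(pserver_ports(7164, 2, 2))  # [7164, 7165, 7166, 7167]
```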

The default configuration is as follows:

```python
HOSTS = [
        "root@192.168.100.17",
        "root@192.168.100.18",
        ]

'''
workspace configuration
'''

#root dir for workspace
ROOT_DIR = "/home/paddle"

'''
network configuration
'''
#pserver nics
PADDLE_NIC = "eth0"
#pserver port
PADDLE_PORT = 7164
#pserver ports num
PADDLE_PORTS_NUM = 2
#pserver sparse ports num
PADDLE_PORTS_NUM_FOR_SPARSE = 2

#environments setting for all processes in cluster job
LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
```
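
Since each ```HOSTS``` entry may carry an optional user and ssh port, a quick sanity check of the list can catch typos before a job is launched. The parser below is an illustrative sketch with a hypothetical name, not code from ```paddle.py```; it handles hostnames and IPv4 addresses in the ```user@host:port``` form shown above and does not attempt IPv6.

```python
def parse_host(entry, default_user=None, default_port=22):
    """Split 'user@host:port' into (user, host, port); user and port are optional."""
    user, at, rest = entry.rpartition("@")
    if not at:                      # no '@' present, so no user was given
        user = default_user
    host, colon, port = rest.partition(":")
    return user, host, int(port) if colon else default_port


print(parse_host("root@192.168.100.17:9090"))  # ('root', '192.168.100.17', 9090)
print(parse_host("192.168.100.18"))            # (None, '192.168.100.18', 22)
```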

### Launching Cluster Job
```paddle.py``` provides automated scripts to start all PaddlePaddle cluster processes on the different nodes. By default, all command line options can be set as ```paddle.py``` command options, and ```paddle.py``` will transparently and automatically pass these options down to the lower-level PaddlePaddle processes.

```paddle.py``` provides two dedicated command options for easy job launching.

```job_dispatch_package```  set it to a local ```workspace``` directory; it will be dispatched to all nodes set in conf.py. This is helpful when you frequently modify workspace files, since repeated manual multi-node workspace deployment would otherwise be tedious.

```job_workspace```  set it to an already deployed workspace directory; ```paddle.py``` will skip the dispatch stage and launch the cluster job directly on all nodes. This helps avoid heavy dispatch latency.
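
The choice between the two options can be sketched as a small helper that builds the ```paddle.py``` command line. This is an illustration only: the function name is hypothetical, and it assumes that the two options are alternatives and that they are passed in ```--name=value``` form. Any other options are passed through unchanged in ```extra_args```.

```python
def paddle_launch_args(job_dispatch_package=None, job_workspace=None, extra_args=()):
    """Build an argv list for paddle.py with exactly one workspace option set."""
    if bool(job_dispatch_package) == bool(job_workspace):
        raise ValueError("set exactly one of job_dispatch_package or job_workspace")
    args = ["python", "paddle.py"]
    if job_dispatch_package:
        # Local workspace: dispatched to every node before launch.
        args.append("--job_dispatch_package=" + job_dispatch_package)
    else:
        # Workspace already deployed on all nodes: the dispatch stage is skipped.
        args.append("--job_workspace=" + job_workspace)
    args.extend(extra_args)
    return args
```

The returned list can be handed to ```subprocess.call``` from the ```cluster_train``` directory.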

```cluster_train/run.sh``` provides a sample command line for running the ```demo/recommendation``` cluster job. Just modify ```job_dispatch_package``` and ```job_workspace``` to point to your own directories, then run:
```
sh run.sh
```

The cluster job will start in several seconds.

### Kill Cluster Job
```paddle.py``` captures the ```Ctrl + C``` SIGINT signal and automatically kills all processes it launched. So just stop ```paddle.py``` to kill the cluster job. If the program crashes, you have to kill the job manually.
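
The general pattern of cleaning up child processes on SIGINT can be sketched as follows. This is not the implementation in ```paddle.py```, which manages processes on remote nodes; it is a simplified local illustration with a hypothetical ```ProcessGroup``` class built on the standard ```signal``` and ```subprocess``` modules.

```python
import signal
import subprocess
import sys


class ProcessGroup(object):
    """Tracks child processes so that they can all be stopped together."""

    def __init__(self):
        self.children = []

    def spawn(self, argv):
        proc = subprocess.Popen(argv)
        self.children.append(proc)
        return proc

    def kill_all(self):
        for proc in self.children:
            if proc.poll() is None:  # still running
                proc.terminate()
        for proc in self.children:
            proc.wait()              # reap each child to avoid zombies

    def install_sigint_handler(self):
        def handler(signum, frame):
            self.kill_all()
            sys.exit(1)
        signal.signal(signal.SIGINT, handler)
```

A launcher would call ```install_sigint_handler()``` once, start its workers with ```spawn()```, and then wait; pressing ```Ctrl + C``` stops every tracked child. If the launcher itself crashes, the handler never runs, which is why a manual kill is needed in that case.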

### Check Cluster Training Result
Check the logs in $workspace/log for details; each node has the same log structure.

```paddle_trainer.INFO```
It provides almost all internal output logs for training, the same as in local training. Check runtime model convergence here.

```paddle_pserver2.INFO```
It provides the pserver running log, which can help to diagnose distributed errors.

```server.log```
It provides the stderr and stdout of the pserver process. Check the error log here if training crashes.

```train.log```
It provides the stderr and stdout of the trainer process. Check the error log here if training crashes.
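
When a job fails, the tail of each log is usually the first thing to look at. The helper below is a small sketch with hypothetical function names that collects the last lines of the four log files named above from one node's log directory; it assumes nothing about the log format.

```python
import os

LOG_FILES = ["paddle_trainer.INFO", "paddle_pserver2.INFO", "server.log", "train.log"]


def tail(path, max_lines=20):
    """Return the last max_lines lines of a text file."""
    with open(path) as f:
        return f.readlines()[-max_lines:]


def summarize_logs(log_dir, max_lines=20):
    """Return {file name: last lines} for the known log files present in log_dir."""
    summary = {}
    for name in LOG_FILES:
        path = os.path.join(log_dir, name)
        if os.path.isfile(path):
            summary[name] = tail(path, max_lines)
    return summary
```

Run it against the ```log``` directory inside the workspace on the node you want to inspect.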

### Check Model Output
After one pass has finished, model files are written to the ```output``` directory on node 0.
The ```nodefile``` in the workspace indicates the node id of the current cluster job.