Cluster Training Using Fabric

Prepare a Linux cluster

Running kubectl create -f ssh_servers.yaml under the directory paddle/scripts/cluster_train_v2/fabric/docker_cluster launches a demo cluster. Run kubectl get po -o wide to get the IP addresses of these nodes.
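
For example, a minimal command sketch, assuming you start from the root of the Paddle repository:

cd paddle/scripts/cluster_train_v2/fabric/docker_cluster
kubectl create -f ssh_servers.yaml   # launch the demo SSH server pods
kubectl get po -o wide               # list the pods together with their IP addresses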

Launching Cluster Job

paddle.py provides an automated way to start all PaddlePaddle cluster processes on different nodes. By default, all PaddlePaddle command-line options can be passed as paddle.py options, and paddle.py transparently and automatically forwards them to the lower-level PaddlePaddle processes.

paddle.py provides two special command options to make job launching easier.

  • job_dispatch_package: set it to a local workspace directory; the directory will be dispatched to all nodes listed in conf.py. This is convenient when you modify workspace files frequently, since it saves you from repeatedly deploying the workspace to every node by hand.
  • job_workspace: set it to an already deployed workspace directory; paddle.py will skip the dispatch stage and launch the cluster job on all nodes directly. This avoids the latency of dispatching a heavy workspace. A launch sketch using these options follows this list.
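
For illustration, here is a minimal launch sketch. The workspace path and the trainer options shown (config file, trainer count, pass count, save directory) are placeholder values that you would adapt to your own job:

python paddle.py \
  --job_dispatch_package="/path/to/local/workspace" \
  --config=./trainer_config.py \
  --use_gpu=0 \
  --trainer_count=4 \
  --num_passes=10 \
  --save_dir=./output

Replacing --job_dispatch_package with --job_workspace=<already deployed directory> skips the dispatch stage and launches the job directly.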

cluster_train/run.sh provides a sample command line for running the demo/recommendation cluster job. Modify job_dispatch_package or job_workspace to point to your own directory, then run:

sh run.sh

The cluster job will start within a few seconds.

Kill Cluster Job

paddle.py captures the SIGINT signal (Ctrl + C) and automatically kills all processes it launched, so stopping paddle.py is enough to kill the cluster job. If the program crashes, you have to kill the remaining processes manually.
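
If you do need to clean up by hand, the sketch below loops over the cluster nodes and kills the PaddlePaddle processes. The node IPs and the root login are examples only; substitute the hosts and user configured in conf.py (the process names match the log files described below):

for ip in 172.17.0.2 172.17.0.3; do            # example IPs, replace with your nodes
  ssh root@$ip 'pkill -f paddle_trainer; pkill -f paddle_pserver2'
done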

Check Cluster Training Result

Check the logs in $workspace/log for details; every node has the same log structure.

  • paddle_trainer.INFO: provides almost all of the internal training output, the same as in local training. Check runtime model convergence here.
  • paddle_pserver2.INFO: provides the parameter server log, which helps diagnose distributed errors.
  • server.log: provides the stderr and stdout of the parameter server process. Check this error log if training crashes.
  • train.log: provides the stderr and stdout of the trainer process. Check this error log if training crashes.
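
For example, to follow training progress or inspect a crash on a node (the paths follow the log layout above; adjust $workspace to your configured workspace):

tail -f $workspace/log/paddle_trainer.INFO   # follow the trainer's internal output at runtime
less $workspace/log/server.log               # inspect the parameter server's stderr/stdout after a crash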

Check Model Output

After each pass finishes, model files are written to the output directory on node 0. The nodefile in the workspace records the node id of the current cluster job.
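
For example, on node 0 you can check the node id and the saved models as follows; this assumes the save directory is named output under the workspace, as in the launch sketch above:

cat $workspace/nodefile    # the node id of this machine in the cluster job
ls $workspace/output       # per-pass model directories written on node 0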