Cluster Training Using Fabric
Prepare a Linux cluster
Running kubectl create -f ssh_servers.yaml under the directory paddle/scripts/cluster_train_v2/fabric/docker_cluster will launch a demo cluster. Run kubectl get po -o wide to get the IP addresses of these nodes.
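A minimal sketch of this step, assuming kubectl is already configured to talk to a working Kubernetes cluster and that you are inside a PaddlePaddle source checkout:

    # Launch the demo cluster from the manifest shipped with the repository.
    cd paddle/scripts/cluster_train_v2/fabric/docker_cluster
    kubectl create -f ssh_servers.yaml

    # List the pods together with their IP addresses; these IPs are what you
    # later put into the cluster configuration (conf.py).
    kubectl get po -o wide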
Launching Cluster Job
paddle.py provides automated scripts to start all PaddlePaddle cluster processes on different nodes. By default, all command line options can be set as paddle.py command options, and paddle.py will transparently and automatically pass these options through to the lower-level PaddlePaddle processes.
paddle.py provides two distinct command options for easy job launching (a usage sketch follows the list):

- job_dispatch_package: set it to a local workspace directory; the directory will be dispatched to all nodes listed in conf.py. This is helpful when you frequently modify workspace files, since repeated multi-node workspace deployment is otherwise tedious.
- job_workspace: set it to an already deployed workspace directory; paddle.py will skip the dispatch stage and directly launch the cluster job on all nodes. This helps reduce heavy dispatch latency.
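A minimal usage sketch of the two modes; the workspace path and the trainer flags below are placeholders, not values taken from this guide:

    # Mode 1: dispatch a local workspace to every node listed in conf.py.
    python paddle.py \
      --job_dispatch_package=/home/user/workspace \
      --config=./trainer_config.py \
      --use_gpu=0 \
      --trainer_count=4 \
      --save_dir=./output

    # Mode 2: reuse a workspace that is already deployed on all nodes,
    # skipping the dispatch stage entirely.
    python paddle.py \
      --job_workspace=/home/user/workspace \
      --config=./trainer_config.py \
      --use_gpu=0 \
      --trainer_count=4 \
      --save_dir=./output

Only one of the two options should be set for a given job.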
cluster_train/run.sh provides a command line sample that runs the demo/recommendation cluster job. Just set job_dispatch_package or job_workspace to your chosen directory, then:
sh run.sh
The cluster job will start in several seconds.
Kill Cluster Job
paddle.py captures the Ctrl + C (SIGINT) signal and automatically kills all processes it launched, so simply stopping paddle.py kills the cluster job. If the program crashed, you should kill the job manually.
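If a crash leaves orphaned processes behind, one hedged clean-up sketch is the following; node0 and node1 are placeholder host names, and the process names match the log files described in the next section:

    # Stop any leftover PaddlePaddle processes by hand on every node of the job.
    for host in node0 node1; do
      ssh "$host" 'pkill -f paddle_pserver2; pkill -f paddle_trainer'
    done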
Check Cluster Training Result
Check the logs in $workspace/log for details; each node has the same log structure.
paddle_trainer.INFO
It provides almost all of the internal output logs for training, the same as in local training. Check runtime model convergence here.
paddle_pserver2.INFO
It provides the parameter server's running log, which can help diagnose distributed training errors.
server.log
It provides the stderr and stdout of the parameter server process. Check this error log if training crashes.
train.log
It provides the stderr and stdout of the trainer process. Check this error log if training crashes.
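A quick way to inspect these logs on a node, assuming $workspace points at the job workspace of that node:

    # Follow training progress (cost, pass/batch counters) on this node.
    tail -f $workspace/log/paddle_trainer.INFO

    # Scan the stdout/stderr captures for obvious failures after a crash.
    grep -iE 'error|fatal|fail' $workspace/log/server.log $workspace/log/train.log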
Check Model Output
After one pass has finished, model files will be written to the output directory on node 0.
The nodefile in the workspace indicates the node id of the current cluster job.
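A short sketch for locating the result; the output path assumes the trainer was started with --save_dir=./output relative to the workspace, which is an illustrative choice rather than a documented default:

    # nodefile tells you which node id this machine has; node 0 holds the saved model.
    cat $workspace/nodefile

    # On node 0, finished passes appear as subdirectories under the save directory.
    ls $workspace/output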
