Distributed training: I am unable to execute the demo example for distributed training. I am following the procedure given in the Cluster Training exercise on the website step by step. I am able to launch the cluster, but I am unable to execute the training job.
Created by: bestfitline
Error:
```
[XYZ@172.25.1.210] run: mkdir -p home/XYZ/paddle/JOB20161013183936/log
[XYZ@172.25.1.210] run: rm -fr home/XYZ/paddle/JOB20161013183936/log/*
[XYZ@172.25.1.109] Executing task 'set_nodefile'
[XYZ@172.25.1.109] run: echo 0 > home/XYZ/paddle/JOB20161013183936/nodefile
[XYZ@172.25.1.210] Executing task 'set_nodefile'
[XYZ@172.25.1.210] run: echo 1 > home/XYZ/paddle/JOB20161013183936/nodefile
[XYZ@172.25.1.109] Executing task 'kill_process'
[XYZ@172.25.1.109] run: ps aux | grep paddle_process_by_paddle | grep -v grep | awk '{print $2}' | xargs kill > /dev/null 2>&1
Warning: run() received nonzero return code 123 while executing 'ps aux | grep paddle_process_by_paddle | grep -v grep | awk '{print $2}' | xargs kill > /dev/null 2>&1'!
[XYZ@172.25.1.210] Executing task 'kill_process'
[XYZ@172.25.1.210] run: ps aux | grep paddle_process_by_paddle | grep -v grep | awk '{print $2}' | xargs kill > /dev/null 2>&1
Warning: run() received nonzero return code 123 while executing 'ps aux | grep paddle_process_by_paddle | grep -v grep | awk '{print $2}' | xargs kill > /dev/null 2>&1'!
[XYZ@172.25.1.109] Executing task 'start_pserver'
[XYZ@172.25.1.109] run: cd home/XYZ/paddle/JOB20161013183936; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle pserver --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --ports_num_for_sparse=2 --comment=paddle_process_by_paddle > ./log/server.log 2>&1 < /dev/null &
[XYZ@172.25.1.210] Executing task 'start_pserver'
[XYZ@172.25.1.210] run: cd home/XYZ/paddle/JOB20161013183936; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle pserver --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --ports_num_for_sparse=2 --comment=paddle_process_by_paddle > ./log/server.log 2>&1 < /dev/null &
[XYZ@172.25.1.109] Executing task 'start_trainer'
[XYZ@172.25.1.109] run: cd home/XYZ/paddle/JOB20161013183936; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle train --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --comment=paddle_process_by_paddle --pservers=172.25.1.109,172.25.1.210 --ports_num_for_sparse=2 --config=./trainer_config.py --trainer_count=1 --use_gpu=true --num_passes=10 --save_dir=./output --log_period=50 --dot_period=1 --saving_period=1 --local=0 --trainer_id=0 > ./log/train.log 2>&1 < /dev/null &
[XYZ@172.25.1.210] Executing task 'start_trainer'
[XYZ@172.25.1.210] run: cd home/XYZ/paddle/JOB20161013183936; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle train --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --comment=paddle_process_by_paddle --pservers=172.25.1.109,172.25.1.210 --ports_num_for_sparse=2 --config=./trainer_config.py --trainer_count=1 --use_gpu=true --num_passes=10 --save_dir=./output --log_period=50 --dot_period=1 --saving_period=1 --local=0 --trainer_id=1 > ./log/train.log 2>&1 < /dev/null &
```
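I suspect the two `run()` warnings above are harmless: on a fresh cluster the `kill_process` pipeline matches no leftover paddle processes, so `xargs` invokes `kill` with no PIDs, `kill` fails, and `xargs` maps the failure to exit status 123. A minimal sketch reproducing that status (assuming a GNU userland, where `xargs` still runs its command once on empty input):

```python
import subprocess

# Same shape as the kill_process pipeline above: the filter produces no
# PIDs, so xargs invokes `kill` with no arguments. kill fails, and GNU
# xargs reports exit status 123 -- the code the warnings complain about.
rc = subprocess.call("echo -n | xargs kill > /dev/null 2>&1", shell=True)
print(rc)  # expected: 123 on a GNU userland
```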
Conf.py settings:
```python
HOSTS = [
    "XYZ@172.25.1.109",
    "XYZ@172.25.1.210",
]

''' workspace configuration '''
# root dir for workspace, can be set as any directory with a real user account
ROOT_DIR = "home/XYZ/paddle"

''' network configuration '''
# pserver nics
PADDLE_NIC = "eth0"
# pserver port
PADDLE_PORT = 7164
# pserver ports num
PADDLE_PORTS_NUM = 2
# pserver sparse ports num
PADDLE_PORTS_NUM_FOR_SPARSE = 2

# environments setting for all processes in cluster job
LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
```
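One thing I notice is that ROOT_DIR has no leading "/", so the remote commands in the log resolve it relative to the SSH login directory. A minimal Fabric 1.x sketch (the same `run()`-style API that appears to produce the log lines above; `check_workspace` is a hypothetical helper, not part of paddle.py) to confirm the workspace resolves to the same directory on both nodes:

```python
from fabric.api import env, execute, run

# Hosts copied from conf.py above.
env.hosts = ["XYZ@172.25.1.109", "XYZ@172.25.1.210"]
ROOT_DIR = "home/XYZ/paddle"  # relative path, resolved against $HOME over SSH

def check_workspace():
    # `test -d` fails (and Fabric aborts) if the directory is missing;
    # `pwd -P` prints what the relative ROOT_DIR actually points at remotely.
    run("test -d %s && cd %s && pwd -P" % (ROOT_DIR, ROOT_DIR))

if __name__ == "__main__":
    execute(check_workspace)
```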
run.sh settings:
```bash
python paddle.py \
  --job_dispatch_package="/home/XYZ/paddle/paddle/demo/recommendation" \
  --dot_period=1 \
  --ports_num_for_sparse=2 \
  --log_period=50 \
  --num_passes=10 \
  --trainer_count=1 \
  --saving_period=1 \
  --local=0 \
  --config=./trainer_config.py \
  --save_dir=./output \
  --use_gpu=true
```
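Since the start_pserver and start_trainer commands redirect all output into ./log/ inside the job directory, the real failure should show up in server.log and train.log on each node. A minimal sketch to pull those back (a hypothetical helper, assuming the same passwordless SSH that paddle.py already relies on; the JOB directory name is copied from the log above):

```python
import subprocess

HOSTS = ["XYZ@172.25.1.109", "XYZ@172.25.1.210"]
JOB_DIR = "home/XYZ/paddle/JOB20161013183936"  # from the log above

for host in HOSTS:
    for name in ("server.log", "train.log"):
        print("==== %s:%s/log/%s ====" % (host, JOB_DIR, name))
        # tail the last lines of each remote log over ssh
        subprocess.call(["ssh", host, "tail -n 50 %s/log/%s" % (JOB_DIR, name)])
```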
I am able to execute the model on an individual machine without any issue, and the log is recorded in the output folder as expected.