train.log reports a connection error
Created by: sarawon
I am running in cluster mode, and it hangs when executing the start_trainer task:

[root@192.168.30.131:8023] Executing task 'start_trainer'
[root@192.168.30.131:8023] run: cd /root/paddle/demo/recommendation; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle train --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --comment=paddle_process_by_paddle --pservers=192.168.30.131,192.168.30.179 --ports_num_for_sparse=2 --config=./trainer_config.py --trainer_count=4 --use_gpu=0 --num_passes=10 --save_dir=./output --log_period=50 --dot_period=10 --saving_period=1 --local=0 --trainer_id=0 > ./log/train.log 2>&1 < /dev/null &
[root@192.168.30.131:8023] out: stdin: is not a tty
[root@192.168.30.131:8023] out:

[root@192.168.30.179:8023] Executing task 'start_trainer'
[root@192.168.30.179:8023] run: cd /root/paddle/demo/recommendation; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle train --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --comment=paddle_process_by_paddle --pservers=192.168.30.131,192.168.30.179 --ports_num_for_sparse=2 --config=./trainer_config.py --trainer_count=4 --use_gpu=0 --num_passes=10 --save_dir=./output --log_period=50 --dot_period=10 --saving_period=1 --local=0 --trainer_id=1 > ./log/train.log 2>&1 < /dev/null &
[root@192.168.30.179:8023] out: stdin: is not a tty
[root@192.168.30.179:8023] out:

Contents of train.log:

[INFO 2016-11-24 07:17:26,152 networks.py:1466] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating]
[INFO 2016-11-24 07:17:26,152 networks.py:1472] The output order is [regression_cost_0]
F1124 07:17:26.942348 352 LightNetwork.cpp:379] Check failed: connect(sockfd, (sockaddr *)&serv_addr, sizeof(serv_addr)) >= 0 ERROR connecting to 192.168.30.131: Connection refused [111]
*** Check failure stack trace: ***
    @ 0x7f1604a93daa (unknown)
    @ 0x7f1604a93ce4 (unknown)
    @ 0x7f1604a936e6 (unknown)
    @ 0x7f1604a934fb (unknown)
    @ 0x7f1604a94477 (unknown)
    @ 0x69552e paddle::SocketClient::TcpClient()
    @ 0x696051 paddle::SocketClient::SocketClient()
    @ 0x7eaa76 std::vector<>::emplace_back<>()
    @ 0x7e1be3 paddle::ParameterClient2::init()
    @ 0x68e2dd paddle::RemoteParameterUpdater::init()
    @ 0x678de2 paddle::Trainer::init()
    @ 0x5132a9 main
    @ 0x7f1603c9ff45 (unknown)
    @ 0x51f2a5 (unknown)
    @ (nil) (unknown)
/usr/local/bin/paddle: line 109: 352 Aborted (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}

Contents of server.log:

F1124 07:19:03.638399 418 SocketChannel.cpp:180] Check failed: len == sizeof(header) : Success [0]
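
Since the trainer aborts on "Connection refused" while connecting to 192.168.30.131, one thing that may help narrow this down is checking from each node whether the parameter server ports accept TCP connections at all. The following is only a rough sketch, not part of Paddle: the function names `probe_ports` and `report` are made up for illustration, and it assumes the pserver listens on consecutive ports starting at `--port`, with `ports_num + ports_num_for_sparse` ports in total (7164 to 7167 for the command above).

```python
import socket


def probe_ports(host, base_port, count, timeout=2.0):
    """Attempt a TCP connect to `count` consecutive ports on `host`.

    Returns a dict mapping each port to None if the connect succeeded,
    or to the error message if it failed (e.g. "Connection refused").
    """
    results = {}
    for port in range(base_port, base_port + count):
        try:
            # create_connection opens and connects a TCP socket in one call.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = None
        except OSError as exc:
            results[port] = str(exc)
    return results


def report(hosts, base_port, ports_num, ports_num_for_sparse):
    """Print one line per (host, port) pair.

    Assumption: dense and sparse ports are laid out consecutively
    from base_port; adjust the range if your setup differs.
    """
    count = ports_num + ports_num_for_sparse
    for host in hosts:
        for port, err in probe_ports(host, base_port, count).items():
            status = "open" if err is None else "FAILED: " + err
            print("{}:{} {}".format(host, port, status))
```

Running `report(["192.168.30.131", "192.168.30.179"], 7164, 2, 2)` on each machine would show which ports refuse connections. If every port is refused, the pserver process is probably not running or is bound to a different interface or port; if only some are refused, the port range assumption above may not match the actual setup.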