Distributed training fails: SocketChannel.cpp:101] Check failed: len > 0......
Created by: Sampson1107
Hello! When I run distributed training following the instructions at http://doc.paddlepaddle.org/doc_cn/howto/usage/cluster/cluster_train_cn.html, I always hit the error shown below. I use paddle/benchmark/paddle/image/alexnet.py as the model config file.
The contents of conf.py are as follows:
HOSTS = [
    "root@192.168.100.67",
    "root@192.168.100.68",
]
'''
workspace configuration
'''
#root dir for workspace, can be set as any directory with real user account
ROOT_DIR = "/home/users/AI/paddle/benchmark/paddle/image"
'''
network configuration
'''
#pserver nics
PADDLE_NIC = "eno1"
#pserver port
PADDLE_PORT = 12345
#pserver ports num
PADDLE_PORTS_NUM = 8
#pserver sparse ports num
PADDLE_PORTS_NUM_FOR_SPARSE = 2
#environments setting for all processes in cluster job
LD_LIBRARY_PATH = "/usr/local/cuda/lib64:/usr/lib64:/home/users/cudnn/cudnn6.0/lib64:/home/users/openmpi/lib"
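To rule out a plain connectivity problem, the endpoints implied by this conf.py can be listed and probed from each node. The following is a minimal sketch, not part of the Paddle cluster scripts. It assumes each pserver listens on consecutive ports starting at PADDLE_PORT, with PADDLE_PORTS_NUM + PADDLE_PORTS_NUM_FOR_SPARSE ports in total; the helper names `pserver_endpoints`, `is_reachable`, and `unreachable_endpoints` are hypothetical.

```python
import socket

# Values copied from the conf.py in this post.
HOSTS = ["root@192.168.100.67", "root@192.168.100.68"]
PADDLE_PORT = 12345
PADDLE_PORTS_NUM = 8
PADDLE_PORTS_NUM_FOR_SPARSE = 2


def pserver_endpoints(hosts, base_port, ports_num, ports_num_for_sparse):
    """List (ip, port) pairs per host.

    ASSUMPTION: ports are consecutive, starting at base_port, and the
    dense and sparse ports together number ports_num + ports_num_for_sparse.
    """
    total = ports_num + ports_num_for_sparse
    endpoints = []
    for host in hosts:
        ip = host.split("@")[-1]  # drop the "root@" SSH user prefix
        for offset in range(total):
            endpoints.append((ip, base_port + offset))
    return endpoints


def is_reachable(ip, port, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def unreachable_endpoints(endpoints, timeout=2.0):
    """Return the endpoints that did not accept a TCP connection."""
    return [ep for ep in endpoints if not is_reachable(*ep, timeout=timeout)]


ENDPOINTS = pserver_endpoints(
    HOSTS, PADDLE_PORT, PADDLE_PORTS_NUM, PADDLE_PORTS_NUM_FOR_SPARSE
)
```

Calling `unreachable_endpoints(ENDPOINTS)` on each trainer host while the pservers are running would show whether any of the expected ports is blocked or not listening. A non-empty result only narrows things down; it does not by itself explain the `len > 0` check failure.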
run.sh is as follows:
python paddle2.py \
  --job_dispatch_package="/home/users/AI/paddle/benchmark/paddle/image" \
  --dot_period=10 \
  --ports_num_for_sparse=2 \
  --log_period=50 \
  --num_passes=10 \
  --trainer_count=4 \
  --saving_period=1 \
  --local=0 \
  --config=/home/users/AI/paddle/benchmark/paddle/image/alexnet.py \
  --save_dir=./output \
  --use_gpu=1
The error output is as follows:
I0927 10:21:46.176991 43449 ParameterClient2.cpp:379] send thread 10 started
I0927 10:21:46.177026 43450 ParameterClient2.cpp:409] recv thread 10 started
I0927 10:21:46.177057 43451 ParameterClient2.cpp:379] send thread 11 started
I0927 10:21:46.177083 43452 ParameterClient2.cpp:409] recv thread 11 started
I0927 10:21:46.177114 43453 ParameterClient2.cpp:379] send thread 12 started
I0927 10:21:46.177139 43454 ParameterClient2.cpp:409] recv thread 12 started
I0927 10:21:46.177166 43455 ParameterClient2.cpp:379] send thread 13 started
I0927 10:21:46.177192 43456 ParameterClient2.cpp:409] recv thread 13 started
I0927 10:21:46.177227 43457 ParameterClient2.cpp:379] send thread 14 started
I0927 10:21:46.177253 43458 ParameterClient2.cpp:409] recv thread 14 started
I0927 10:21:46.177284 43459 ParameterClient2.cpp:379] send thread 15 started
I0927 10:21:46.177309 43460 ParameterClient2.cpp:409] recv thread 15 started
F0927 10:21:46.177381 43306 SocketChannel.cpp:101] Check failed: len > 0 peer=192.168.100.67 curIov=0 iovCnt=4 iovs[curIov].base=0x7fff55ff7d40 iovs[curIov].iov_len=16
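When comparing logs from several nodes, it can help to pull out only the fatal record and its `key=value` fields. The following is a minimal sketch that assumes the glog-style prefix seen in the log above (severity letter, MMDD, time, thread id, `file:line]`); the helper names `split_records`, `fatal_records`, and `message_fields` are hypothetical and not part of Paddle.

```python
import re

# glog-style prefix: severity, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_PREFIX = re.compile(
    r"(?<!\S)(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s:]+:\d+)\]\s*"
)
# key=value pairs such as peer=... or iovs[curIov].iov_len=...
KEY_VALUE = re.compile(r"([\w\[\].]+)=(\S+)")


def split_records(text):
    """Split log text into records, even if the line breaks were lost."""
    matches = list(GLOG_PREFIX.finditer(text))
    records = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        record = m.groupdict()
        record["message"] = text[m.end():end].strip()
        records.append(record)
    return records


def fatal_records(text):
    """Keep only records logged at FATAL severity (prefix letter F)."""
    return [r for r in split_records(text) if r["severity"] == "F"]


def message_fields(message):
    """Extract key=value pairs from a record's message as a dict of strings."""
    return dict(KEY_VALUE.findall(message))


# The last two records from the log in this post.
SAMPLE = (
    "I0927 10:21:46.177309 43460 ParameterClient2.cpp:409] recv thread 15 started "
    "F0927 10:21:46.177381 43306 SocketChannel.cpp:101] Check failed: len > 0 "
    "peer=192.168.100.67 curIov=0 iovCnt=4 "
    "iovs[curIov].base=0x7fff55ff7d40 iovs[curIov].iov_len=16"
)

for record in fatal_records(SAMPLE):
    print(record["source"], message_fields(record["message"]))
```

Running the same extraction on the log of every node makes it easier to see which `peer` address each failing process reports.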