Cluster training DeepFM always stuck at the beginning
Created by: typhoonzero
I started a 10 node job with 10 pservers and 10 trainers. Using DeepFM model (converted to paddle v1 code), the job always stuck before the trainer can run batches.
pserver log:
commandline: ./paddle_pserver2 --num_gradient_servers=10 --nics=xgbe0 --port=7164 --ports_num=1 --ports_num_for_sparse=1 --rdma_tcp=tcp --comment=paddle_cluster_job
W1206 12:25:42.604581 29267 CpuId.h:112] PaddlePaddle wasn't compiled to use avx instructions, but these are available on your machine and could speed up CPU computations via CMAKE .. -DWITH_AVX=ON
I1206 12:25:42.605259 29267 ParameterServerController.cpp:83] number of parameterServer instances: 2
I1206 12:25:42.605267 29267 ParameterServerController.cpp:87] Starting parameterServer[0]
I1206 12:25:42.605304 29267 ParameterServerController.cpp:87] Starting parameterServer[1]
I1206 12:25:42.605325 29267 ParameterServerController.cpp:96] Waiting parameterServer[0]
I1206 12:25:42.605643 29288 LightNetwork.cpp:273] tcp server start
I1206 12:25:42.605721 29287 LightNetwork.cpp:273] tcp server start
I1206 12:36:39.935209 23676 LightNetwork.cpp:326] worker started, peer = 10.104.100.14
I1206 12:36:39.960119 23677 LightNetwork.cpp:326] worker started, peer = 10.104.100.14
I1206 12:40:23.655452 4484 LightNetwork.cpp:326] worker started, peer = 10.102.196.43
I1206 12:40:23.683399 4485 LightNetwork.cpp:326] worker started, peer = 10.102.196.43
I1206 12:40:25.714017 4485 ParameterServer2.cpp:256] pserver: setParameter
I1206 12:40:25.714077 4485 ParameterServer2.cpp:302] pserver: new cpuvector: size=1196032
I1206 12:40:25.737254 4484 ParameterServer2.cpp:256] pserver: setParameter
I1206 12:40:25.743959 4484 ParameterServer2.cpp:302] pserver: new cpuvector: size=30000
I1206 12:40:25.766773 4686 LightNetwork.cpp:326] worker started, peer = 10.102.196.43
I1206 12:40:25.810925 4688 LightNetwork.cpp:326] worker started, peer = 10.102.196.43
I1206 12:40:26.050133 23677 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:48:30.094723 8188 LightNetwork.cpp:326] worker started, peer = 10.102.196.37
I1206 12:48:30.119384 8189 LightNetwork.cpp:326] worker started, peer = 10.102.196.37
I1206 12:48:32.120713 8189 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:50:31.704797 18232 LightNetwork.cpp:326] worker started, peer = 10.102.196.39
I1206 12:50:31.731204 18233 LightNetwork.cpp:326] worker started, peer = 10.102.196.39
I1206 12:50:33.527307 18356 LightNetwork.cpp:326] worker started, peer = 10.102.196.41
I1206 12:50:33.564779 18363 LightNetwork.cpp:326] worker started, peer = 10.102.196.41
I1206 12:50:33.732800 18233 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:50:35.555511 18363 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:51:13.538336 21095 LightNetwork.cpp:326] worker started, peer = 10.104.100.12
I1206 12:51:13.562628 21097 LightNetwork.cpp:326] worker started, peer = 10.104.100.12
I1206 12:51:15.564131 21097 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:51:33.749547 22697 LightNetwork.cpp:326] worker started, peer = 10.102.196.44
I1206 12:51:33.776844 22701 LightNetwork.cpp:326] worker started, peer = 10.102.196.44
I1206 12:51:35.778558 22701 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:55:58.762526 4722 LightNetwork.cpp:326] worker started, peer = 10.102.196.38
I1206 12:55:58.790271 4723 LightNetwork.cpp:326] worker started, peer = 10.102.196.38
I1206 12:56:00.792819 4723 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:56:01.233184 4860 LightNetwork.cpp:326] worker started, peer = 10.102.196.42
I1206 12:56:01.260675 4861 LightNetwork.cpp:326] worker started, peer = 10.102.196.42
I1206 12:56:03.409593 4861 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:57:06.424779 9664 LightNetwork.cpp:326] worker started, peer = 10.104.100.11
trainer log:
./paddle_trainer --num_gradient_servers=10 --trainer_id=5 --pservers=10.102.196.43,10.102.196.42,10.102.196.41,10.104.100.12,10.102.196.37,10.104.100.11,10.102.196.44,10.102.196.38,10.104.100.14,10.102.196.39 --rdma_tcp=tcp --nics=xgbe0 --port=7164 --ports_num=1 --local=0 --job=train --dot_period=10 --saving_period=1 --log_period=20 --trainer_count=1 --num_passes=1 --ports_num_for_sparse=1 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0
W1206 12:57:04.976786 9582 CpuId.h:112] PaddlePaddle wasn't compiled to use avx instructions, but these are available on your machine and could speed up CPU computations via CMAKE .. -DWITH_AVX=ON
I1206 12:57:05.292944 9582 Trainer.cpp:166] trainer mode: SgdSparseCpuTraining
I1206 12:57:05.292971 9582 TrainerInternal.cpp:239] Sgd sparse training can not work with ConcurrentRemoteParameterUpdater, automatically reset --use_old_updater=true
I1206 12:57:05.498909 9582 PyDataProvider2.cpp:243] loading dataprovider dataprovider::process_deep
I1206 12:57:05.553319 9582 GradientMachine.cpp:94] Initing parameters..
I1206 12:57:06.414937 9582 GradientMachine.cpp:101] Init parameters done.
I1206 12:57:06.415112 9621 ParameterClient2.cpp:113] pserver 0 10.102.196.43:7165
I1206 12:57:06.415400 9621 ParameterClient2.cpp:113] pserver 1 10.102.196.42:7165
I1206 12:57:06.415513 9621 ParameterClient2.cpp:113] pserver 2 10.102.196.41:7165
I1206 12:57:06.415627 9621 ParameterClient2.cpp:113] pserver 3 10.104.100.12:7165
I1206 12:57:06.415748 9621 ParameterClient2.cpp:113] pserver 4 10.102.196.37:7165
I1206 12:57:06.415866 9621 ParameterClient2.cpp:113] pserver 5 10.104.100.11:7165
I1206 12:57:06.415930 9621 ParameterClient2.cpp:113] pserver 6 10.102.196.44:7165
I1206 12:57:06.416048 9621 ParameterClient2.cpp:113] pserver 7 10.102.196.38:7165
I1206 12:57:06.416160 9621 ParameterClient2.cpp:113] pserver 8 10.104.100.14:7165
I1206 12:57:06.416251 9621 ParameterClient2.cpp:113] pserver 9 10.102.196.39:7165
I1206 12:57:06.439986 9582 ParameterClient2.cpp:113] pserver 0 10.102.196.43:7164
I1206 12:57:06.440148 9582 ParameterClient2.cpp:113] pserver 1 10.102.196.42:7164
I1206 12:57:06.440269 9582 ParameterClient2.cpp:113] pserver 2 10.102.196.41:7164
I1206 12:57:06.440351 9582 ParameterClient2.cpp:113] pserver 3 10.104.100.12:7164
I1206 12:57:06.440423 9582 ParameterClient2.cpp:113] pserver 4 10.102.196.37:7164
I1206 12:57:06.440512 9582 ParameterClient2.cpp:113] pserver 5 10.104.100.11:7164
I1206 12:57:06.440629 9582 ParameterClient2.cpp:113] pserver 6 10.102.196.44:7164
I1206 12:57:06.440716 9582 ParameterClient2.cpp:113] pserver 7 10.102.196.38:7164
I1206 12:57:06.440809 9582 ParameterClient2.cpp:113] pserver 8 10.104.100.14:7164
I1206 12:57:06.440871 9582 ParameterClient2.cpp:113] pserver 9 10.102.196.39:7164