MPI training: has no optimize block
Created by: ninnkotora
-
Version and environment information: 1) PaddlePaddle version: 1.3 2) CPU 3) Python 2.7
-
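Problem description: app/pserver.log keeps showing the following warning: pserver [xxxxx:yyyy] has no optimize block!! The symptom is that the program stops making progress, as if it were stuck somewhere. According to the log, execution stops at the part marked in red: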
Sun May 5 18:54:25 2019[1,29]:2019-05-05 18:54:25,830 - WARNING - pserver [10.182.14.28:62005] has no optimize block!!
Sun May 5 18:54:25 2019[1,29]:I0505 18:54:25.845669 138469 grpc_server.cc:430] Server listening on 10.182.14.28:62005 selected port: 62005
Sun May 5 18:54:25 2019[1,45]:2019-05-05 18:54:25 dist train
Sun May 5 18:54:25 2019[1,45]:2019-05-05 18:54:25 run pserver
Sun May 5 18:54:25 2019[1,45]:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Sun May 5 18:54:25 2019[1,45]:2019-05-05 18:54:25,872 - WARNING - pserver [10.182.109.38:62005] has no optimize block!!
Sun May 5 18:54:25 2019[1,45]:I0505 18:54:25.883363 41628 grpc_server.cc:430] Server listening on 10.182.109.38:62005 selected port: 62005
Sun May 5 18:54:26 2019[1,3]:2019-05-05 18:54:26 dist train
Sun May 5 18:54:26 2019[1,3]:2019-05-05 18:54:26 run pserver
Sun May 5 18:54:26 2019[1,3]:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Sun May 5 18:54:26 2019[1,3]:2019-05-05 18:54:26,054 - WARNING - pserver [10.182.10.152:62005] has no optimize block!!
Sun May 5 18:54:26 2019[1,3]:I0505 18:54:26.068373 1449 grpc_server.cc:430] Server listening on 10.182.10.152:62005 selected port: 62005
Sun May 5 18:54:27 2019[1,36]:2019-05-05 18:54:27 dist train
Sun May 5 18:54:27 2019[1,36]:2019-05-05 18:54:27 run pserver
Sun May 5 18:54:27 2019[1,36]:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Sun May 5 18:54:27 2019[1,36]:2019-05-05 18:54:27,057 - WARNING - pserver [10.182.13.18:62005] has no optimize block!!
Sun May 5 18:54:27 2019[1,36]:I0505 18:54:27.068897 50413 grpc_server.cc:430] Server listening on 10.182.13.18:62005 selected port: 62005
Sun May 5 18:54:27 2019[1,19]:2019-05-05 18:54:27 dist train
Sun May 5 18:54:27 2019[1,19]:2019-05-05 18:54:27 run pserver
Sun May 5 18:54:27 2019[1,19]:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Sun May 5 18:54:27 2019[1,19]:2019-05-05 18:54:27,295 - WARNING - pserver [10.182.15.142:62005] has no optimize block!!
Finally, train.log reports the following errors:
Sun May 5 01:18:43 2019[1,12]:F0505 01:18:43.651557 33534 grpc_client.cc:418] BatchBarrierRPC name:[BATCH_BARRIER@RECV], ep:[10.182.65.144:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
Sun May 5 01:18:43 2019[1,15]:F0505 01:18:43.666119 45154 grpc_client.cc:418] BatchBarrierRPC name:[BATCH_BARRIER@RECV], ep:[10.182.65.144:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
Sun May 5 01:18:43 2019[1,13]:F0505 01:18:43.658960 30178 grpc_client.cc:418] BatchBarrierRPC name:[BATCH_BARRIER@RECV], ep:[10.182.65.144:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
Sun May 5 01:18:43 2019[1,18]:F0505 01:18:43.684461 47413 grpc_client.cc:418] BatchBarrierRPC name:[BATCH_BARRIER@RECV], ep:[10.182.65.144:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
Sun May 5 01:18:43 2019[1,37]:F0505 01:18:43.684893 21399 grpc_client.cc:418] BatchBarrierRPC name:[BATCH_BARRIER@RECV], ep:[10.182.65.144:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
Sun May 5 01:18:43 2019[1,35]:F0505 01:18:43.672283 42122 grpc_client.cc:418] BatchBarrierRPC name:[BATCH_BARRIER@RECV], ep:[10.182.65.144:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
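To compare the two logs more easily, here is a minimal sketch that pulls out which pserver endpoints print the "has no optimize block" warning and which endpoint the trainers fail to reach. It is not part of PaddlePaddle. The function name `summarize` and the regular expressions are my own assumptions, written only against the log lines quoted above, so other versions may need different patterns. For reference, gRPC error code 14 is UNAVAILABLE.

```python
import re
from collections import defaultdict

# Patterns are assumptions based only on the log lines quoted in this issue.
RANK_RE = re.compile(r"\[1,(\d+)\]:")  # MPI rank tag, e.g. "[1,29]:"
NO_OPT_RE = re.compile(r"pserver \[([\d.]+:\d+)\] has no optimize block")
GRPC_ERR_RE = re.compile(r"ep:\[([\d.]+:\d+)\].*?error_code:(\d+)")


def summarize(log_text):
    """Return (endpoints that warned 'no optimize block',
    {endpoint: set of MPI ranks that reported a gRPC error against it})."""
    no_opt = set()
    grpc_failures = defaultdict(set)
    for line in log_text.splitlines():
        rank_match = RANK_RE.search(line)
        rank = int(rank_match.group(1)) if rank_match else None
        warn = NO_OPT_RE.search(line)
        if warn:
            no_opt.add(warn.group(1))
        err = GRPC_ERR_RE.search(line)
        if err:
            grpc_failures[err.group(1)].add(rank)
    return no_opt, dict(grpc_failures)


# Two lines copied from the logs above, one from each file.
SAMPLE = "\n".join([
    "Sun May 5 18:54:25 2019[1,29]:2019-05-05 18:54:25,830 - WARNING - "
    "pserver [10.182.14.28:62005] has no optimize block!!",
    "Sun May 5 01:18:43 2019[1,12]:F0505 01:18:43.651557 33534 "
    "grpc_client.cc:418] BatchBarrierRPC name:[BATCH_BARRIER@RECV], "
    "ep:[10.182.65.144:62004], status:[-1] meets grpc error, "
    "error_code:14 error_message:Socket closed error_details:",
])

no_opt, failures = summarize(SAMPLE)
print(sorted(no_opt))
print(failures)
```

Running it over the full app/pserver.log and train.log would show whether the endpoint the trainers cannot reach (10.182.65.144:62004 in the excerpt) is one of the pservers that printed the warning. Note that the two excerpts above carry different timestamps and ports, so they may come from different runs.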