Distributed run of the official Paddle word2vec on MPI: trains normally with is_sparse=False, fails to train with is_sparse=True
Created by: ustcxiexk
As the title says: when running the word2vec model provided by Paddle (https://github.com/PaddlePaddle/models/tree/develop/PaddleRec/word2vec) distributed on an MPI cluster, training works normally when is_sparse (a parameter of the skip-gram network construction) is left at its default of False, but fails to train when is_sparse=True is passed, eventually aborting with the RPC error below (FLAGS_rpc_deadline=5000000 is already set). A sketch of how is_sparse enters the network follows the log.
W0918 13:25:01.122030 4096 parallel_executor.cc:333] The number of graph should be only one, but the current graph has 2 sub_graphs. If you want to see the nodes of the sub_graphs, you should use 'FLAGS_print_sub_graph_dir' to specify the output dir. NOTES: if you not do training, please don't pass loss_var_name.
2019-09-18 13:25:01,169-INFO: running data in ./train_data/xab
F0918 13:27:53.938683 8432 grpc_client.cc:418] FetchBarrierRPC name:[FETCH_BARRIER@RECV], ep:[10.76.57.40:62001], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
*** Check failure stack trace: ***
@ 0x7f811a7a0c0d google::LogMessage::Fail()
@ 0x7f811a7a46bc google::LogMessage::SendToLog()
@ 0x7f811a7a0733 google::LogMessage::Flush()
@ 0x7f811a7a5bce google::LogMessageFatal::~LogMessageFatal()
@ 0x7f811b3a5e0e paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f81288e68a0 execute_native_thread_routine
@ 0x7f8132cfc1c3 start_thread
@ 0x7f813232412d __clone
@ (nil) (unknown)
('corpus_size:', 269909100)
dict_size = 2699092 word_all_count = 269909100
CPU_NUM:5
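
For context, here is a minimal sketch (not the repo's exact code) of how is_sparse is typically wired into the skip-gram network's embedding layer, assuming the Fluid 1.x API that PaddleRec/word2vec targets; the function and variable names are illustrative:

```python
import paddle.fluid as fluid

def build_skipgram_embedding(dict_size, embedding_size, is_sparse):
    # Illustrative input layer; the real network also takes context/label words.
    word = fluid.layers.data(name='input_word', shape=[1], dtype='int64')
    # With is_sparse=True the embedding gradient becomes a SelectedRows tensor,
    # so the distribute transpiler sends sparse updates to the pservers over
    # gRPC -- the code path where the FETCH_BARRIER error above is raised.
    emb = fluid.layers.embedding(
        input=word,
        size=[dict_size, embedding_size],
        param_attr=fluid.ParamAttr(name='emb'),
        is_sparse=is_sparse)
    return emb
```

With is_sparse=False the same layer produces dense gradients and the parameter updates follow the dense send/recv path, which matches the observation that only the sparse configuration fails.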