CTR model throws an exception when is_sparse=True is used
Created by: codescv
In the deepctr example, it is enough to change this part of network_conf.py to:
 def embedding_layer(input):
     return fluid.layers.embedding(
         input=input,
+        is_distributed=True,
+        is_sparse=True,
         size=[sparse_feature_dim, embedding_size],
         param_attr=fluid.ParamAttr(name="SparseFeatFactors", initializer=fluid.initializer.Normal(scale=1/math.sqrt(sparse_feature_dim))))
Training with 1 worker and 1 pserver then fails with the following errors:
WORKER
2018-11-06 16:29:39,757 - INFO - run dist training
2018-11-06 16:29:39,789 - INFO - run trainer
pserver not ready, wait 3 sec to retry...
pserver not ready, wait 3 sec to retry...
F1106 16:29:51.974265 25611 grpc_client.cc:348] PrefetchRPC name:[prefetch_compress_out_tmp_0], ep:[127.0.0.1:6174], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
*** Check failure stack trace: ***
@ 0x7f650b3c254d google::LogMessage::Fail()
@ 0x7f650b3c5ffc google::LogMessage::SendToLog()
@ 0x7f650b3c2073 google::LogMessage::Flush()
@ 0x7f650b3c750e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f650bef5ac1 paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f6519d58678 execute_native_thread_routine_compat
@ 0x7f65b4c33e25 start_thread
@ 0x7f65b425834d __clone
@ (nil) (unknown)
./start_trainer.sh: line 1: 25247 Aborted python train.py --train_data_path data/raw/train.txt --is_local 0 --role trainer --endpoints 127.0.0.1:6174 --trainers 1 --trainer_id=0 --sparse_feature_dim 1000001
PS
2018-11-06 16:29:43,886 - INFO - run dist training
2018-11-06 16:29:43,917 - INFO - run pserver
get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
2018-11-06 16:29:43,921 - WARNING - distribute lookup table only support sgd optimizer, change it's optimizer to sgd instead of sgd
*** Aborted at 1541492991 (unix time) try "date -d @1541492991" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGFPE (@0x7f8a5564a644) received by PID 25435 (TID 0x7f8ac9577700) from PID 1432659524; stack trace: ***
@ 0x7f8afe2945e0 (unknown)
@ 0x7f8a5564a644 paddle::framework::SelectedRows::Get()
@ 0x7f8a5519238c paddle::operators::LookupSparseTableOp::RunImpl()
@ 0x7f8a55611f48 paddle::framework::OperatorBase::Run()
@ 0x7f8a54a604eb paddle::framework::Executor::RunPreparedContext()
@ 0x7f8a5555708e paddle::operators::distributed::RequestPrefetchHandler::Handle()
@ 0x7f8a5555e54e paddle::operators::distributed::RequestPrefetch::Process()
@ 0x7f8a5555a0ba paddle::operators::distributed::AsyncGRPCServer::HandleRequest()
@ 0x7f8a5555daff std::thread::_Impl<>::_M_run()
@ 0x7f8a633b1678 execute_native_thread_routine_compat
@ 0x7f8afe28ce25 start_thread
@ 0x7f8afd8b134d __clone
@ 0x0 (unknown)
./start_ps.sh: line 1: 25435 Floating point exception python train.py --is_local 0 --role pserver --endpoints 127.0.0.1:6174 --trainers 1 --sparse_feature_dim 1000001 --current_endpoint 127.0.0.1:6174