Core dump during a multi-machine asynchronous run
Created by: ccmeteorljh
Paddle version: 0.14
Model: se-resnext
Job configuration: 2 pservers, 4 trainers, 8 GPUs
paddlecloud job URL: http://paddlecloud.baidu-int.com:8088/paddle/jobRunInfo?jobId=job-e6c5b70eef8af715&flag=jobs&groupName=k8s_gpu_demo&groupId=c0a1f165-6279-5320-b9e7-e0218c7a87f5&currentPage=1&currentKey=1
The job ran until pass 8 and then failed with the following error:
*** Aborted at 1534174828 (unix time) try "date -d @1534174828" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 3760 (TID 0x7f64f4dfa700) from PID 0; stack trace: ***
@ 0x7f69fdbf6500 (unknown)
@ 0x7f699667abfb paddle::operators::distributed::SerializeToByteBuffer()
@ 0x7f6996666e81 _ZNSt17_Function_handlerIFSt10unique_ptrIN6paddle8platform13EnforceNotMetESt14default_deleteIS3_EEvESt17reference_wrapperISt12_Bind_simpleIFS8_IZNS1_9framework10ThreadPool18RunAndGetExceptionIZNS1_9operators11distributed10GRPCClient12AsyncSendVarERKSsRKNS2_13DeviceContextERKNSA_5ScopeESH_lEUlvE_EESt6futureIS6_ET_EUlvE_EvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f69963dcf9a std::_Function_handler<>::_M_invoke()
@ 0x7f6995939137 std::__future_base::_State_base::_M_do_set()
@ 0x7f69fdbf3b23 __pthread_once_internal
@ 0x7f6996669b64 _ZNSt13__future_base11_Task_stateIZN6paddle9framework10ThreadPool18RunAndGetExceptionIZNS1_9operators11distributed10GRPCClient12AsyncSendVarERKSsRKNS1_8platform13DeviceContextERKNS2_5ScopeES9_lEUlvE_EESt6futureISt10unique_ptrINSA_13EnforceNotMetESt14default_deleteISK_EEET_EUlvE_SaIiEFSN_vEE6_M_runEv
@ 0x7f69967511f8 paddle::framework::ThreadPool::TaskLoop()
@ 0x7f69ab6b1470 (unknown)
@ 0x7f69fdbee851 start_thread
@ 0x7f69fd2b190d clone
@ 0x0 (unknown)