Help request: error during Paddle cluster training
Created by: HugoLian
I am using the v2 0.10 version on paddle-cloud. Cluster training fails at random, and the failure does not reproduce reliably. The pserver's error output is:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
*** Aborted at 1534275366 (unix time) try "date -d @1534275366" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x1f500000221) received by PID 545 (TID 0x7f1b9d195700) from PID 545; stack trace: ***
@ 0x7f1bddbc6390 (unknown)
@ 0x7f1bdcd7b428 gsignal
@ 0x7f1bdcd7d02a abort
@ 0x7f1bdd6be84d __gnu_cxx::__verbose_terminate_handler()
@ 0x7f1bdd6bc6b6 (unknown)
@ 0x7f1bdd6bc701 std::terminate()
@ 0x7f1bdd6e7d38 (unknown)
@ 0x7f1bddbbc6ba start_thread
@ 0x7f1bdce4d3dd clone
@ 0x0 (unknown)
The trainer's error output is:
*** Aborted at 1534275366 (unix time) try "date -d @1534275366" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x8) received by PID 12308 (TID 0x7f15eb58e700) from PID 8; stack trace: ***
@ 0x7f174d584160 (unknown)
@ 0x7f16b38e70d2 paddle::ProtoClient::recv()
@ 0x7f16b56e448d paddle::ParameterClient2::waitPassStart()
@ 0x7f16b523d798 paddle::RemoteParameterUpdater::controller()
@ 0x7f173aba68a0 execute_native_thread_routine
@ 0x7f174d57c1c3 start_thread
@ 0x7f174cba412d __clone
@ 0x0 (unknown)
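Both traces report the same abort timestamp (1534275366), so the pserver and the trainer appear to have gone down within the same second. A minimal sketch for decoding that value, using only the Python standard library (the helper name `abort_time_utc` is made up for illustration):

```python
from datetime import datetime, timezone

# Timestamp copied from the "*** Aborted at ... (unix time)" line of both traces.
ABORT_TS = 1534275366


def abort_time_utc(ts: int) -> str:
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print(abort_time_utc(ABORT_TS))
```

This is equivalent to the `date -d @1534275366` hint in the log, except that it always prints UTC rather than the local time zone.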
The logs for this job are under the workspace/evn_run/output/log directory at the following URL: http://10.104.92.25:8900/fileview.html?path=/home/disk1/normandy/maybach/app-user-20180814102806-581/
What could be causing this?