[1.5] fleet in pslib mode: running a CTR model with a large batch size crashes when calling fleet.stopworker()
Created by: ccmeteorljh
Paddle version: 1.5
The job crashes with batchsize=32; with batchsize=2 it runs without problems.
- Error output:
run default_startup_program
I0612 15:58:40.458407 201989 src/brpc/server.cpp:975] Server[paddle::ps::DownpourPsService] is serving on port=8000.
I0612 15:58:40.458457 201989 src/brpc/server.cpp:978] Check out http://yq01-jpaas-ai00-let0023.yq01.baidu.com:8000 in web browser.
I0612 15:58:40.658174 201990 baidu/paddlepaddle/pslib/src/communicate/ps_client.cc:82] Create PSClient[DownpourBrpcPsClient] success
I0612 15:58:40.663541 201990 src/brpc/server.cpp:975] Server[paddle::ps::DownpourPsClientService] is serving on port=8501.
I0612 15:58:40.663560 201990 src/brpc/server.cpp:978] Check out http://yq01-jpaas-ai00-let0023.yq01.baidu.com:8501 in web browser.
I0612 15:58:40.664471 201990 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:78] Client connect success:10.255.120.16:8501,
start load_into_memory
[yq01-jpaas-ai00-let0023.yq01.baidu.com:201984] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[yq01-jpaas-ai00-let0023.yq01.baidu.com:201984] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
load_into_memory done
global_shuffle
global_shuffle done
run default_main_program
I0612 15:58:49.454741 201990 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:324] wait _async_call_num:0
finished
*** Aborted at 1560326329 (unix time) try "date -d @1560326329" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x7fd6a97faec8) received by PID 201990 (TID 0x7fd69bfff700) from PID 18446744072258301640; stack trace: ***
@ 0x7fd85f733160 (unknown)
@ 0x7fd85f72fbfa __pthread_cond_signal
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 201990 on node yq01-jpaas-ai00-let0023.yq01.baidu.com exited on signal 11 (Segmentation fault).
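
For triage it may help to line up the abort timestamp with the glog lines above. A minimal sketch, assuming the log timestamps are local time in UTC+8 (the log itself does not state a timezone); `LOG_TZ` and `abort_time` are hypothetical names used only for this illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the glog lines are written in UTC+8; the log does not say so.
LOG_TZ = timezone(timedelta(hours=8))


def abort_time(unix_seconds: int) -> str:
    """Format the '*** Aborted at <unix time>' value like a glog prefix (MMDD HH:MM:SS)."""
    return datetime.fromtimestamp(unix_seconds, tz=LOG_TZ).strftime("%m%d %H:%M:%S")


print(abort_time(1560326329))  # 0612 15:58:49
```

Under that assumption, the abort falls in the same second as the last `wait _async_call_num:0` line, that is, right after the worker prints `finished`.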