Multi-machine asynchronous training occasionally core dumps
Created by: ccmeteorljh
Paddle version: 0.14
Model: machine_translation
Code repository: https://github.com/xuezhong/transformer-nist
Job configuration: 2 pservers, 4 trainers, 8 GPUs
Launch script:
FLAGS_rpc_deadline=3000000 python -u thirdparty/model/transformer_cloud/train.py --src_vocab_fpath ./thirdparty/nist06n/cn_30001.dict --trg_vocab_fpath ./thirdparty/nist06n/en_30001.dict --train_file_pattern './train_data/part-*' --val_file_pattern './test_data/part-*' --batch_size 1024 --use_token_batch True --special_token '_GO' '_EOS' '_UNK' --pass_num=100 --iterations=1000 --local False --sync False
One of the trainers crashed. Its log is as follows:
init fluid.framework.default_startup_program
*** Aborted at 1533109074 (unix time) try "date -d @1533109074" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x10) received by PID 3524 (TID 0x7f6758b81700) from PID 16; stack trace: ***
@ 0x7f6803be1500 (unknown)
@ 0x7f67f8e14f34 std::_Rb_tree_rotate_left()
@ 0x7f67f8e150d4 std::_Rb_tree_insert_and_rebalance()
@ 0x7f67a6a7e708 _ZNSt8_Rb_treeIN5boost7variantIN6paddle8platform9CUDAPlaceENS3_8CPUPlaceENS3_15CUDAPinnedPlaceENS0_6detail7variant5void_ES9_S9_S9_S9_S9_S9_S9_S9_S9_S9_S9_S9_S9_S9_S9_S9_EESt4pairIKSA_PNS3_13DeviceContextEESt10_Select1stISF_ESt4lessISA_ESaISF_EE22_M_emplace_hint_uniqueIJRKSt21piecewise_construct_tSt5tupleIJRSC_EESQ_IJEEEEESt17_Rb_tree_iteratorISF_ESt23_Rb_tree_const_iteratorISF_EDpOT_
@ 0x7f67a6a7e2d7 paddle::framework::details::ComputationOpHandle::NeedWait()
@ 0x7f67a6aab52d paddle::framework::details::OpHandleBase::WaitInputVarGenerated()
@ 0x7f67a6a7d90f paddle::framework::details::ComputationOpHandle::RunImpl()
@ 0x7f67a6aaadc5 paddle::framework::details::OpHandleBase::Run()
@ 0x7f67a6aa208a _ZZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS0_13BlockingQueueIPNS1_13VarHandleBaseEEEPNS1_12OpHandleBaseEENKUlvE_clEv
@ 0x7f67a68f8233 std::_Function_handler<>::_M_invoke()
@ 0x7f67a5e5bba7 std::__future_base::_State_base::_M_do_set()
@ 0x7f6803bdeb23 __pthread_once_internal
@ 0x7f67a6aa0e62 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS3_13BlockingQueueIPNS4_13VarHandleBaseEEEPNS4_12OpHandleBaseEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7f67a5e5d6e4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7f67f8e61470 (unknown)
@ 0x7f6803bd9851 start_thread
@ 0x7f680329c90d clone
@ 0x0 (unknown)
*********************error messages********************