Training process hangs on the cluster
Created by: Damon-wyg
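When I run the model on the cluster, the job dies, although a single-machine test run works fine. The paddle version is v1. The relevant error output from train.log: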
Mon Jul 24 18:41:05 2017[1,3]<stderr>:*********************Shell Script Stack Trace********************
Mon Jul 24 18:41:05 2017[1,3]<stderr>: @: [./log.sh: 41] log_fatal
Mon Jul 24 18:41:05 2017[1,3]<stderr>: @: [./common.sh: 399] kill_pserver2_exit
Mon Jul 24 18:41:05 2017[1,3]<stderr>: @: [./train.sh: 210] main
Mon Jul 24 18:41:05 2017[1,3]<stderr>:
Mon Jul 24 18:41:05 2017[1,3]<stderr>:+ exit 1
Mon Jul 24 18:41:05 2017[1,25]<stderr>:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Mon Jul 24 18:41:05 2017[1,25]<stderr>:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Mon Jul 24 18:41:05 2017[1,25]<stderr>:[./common.sh : 399] [kill_pserver2_exit]
Mon Jul 24 18:41:05 2017[1,25]<stderr>:+ echo '[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit'
Mon Jul 24 18:41:05 2017[1,25]<stderr>:[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit
Mon Jul 24 18:41:05 2017[1,25]<stderr>:+ get_stack
Mon Jul 24 18:41:05 2017[1,25]<stderr>:+ set +x
Mon Jul 24 18:41:05 2017[1,6]<stderr>:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Mon Jul 24 18:41:05 2017[1,6]<stderr>:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Mon Jul 24 18:41:05 2017[1,6]<stderr>:[./common.sh : 399] [kill_pserver2_exit]
Mon Jul 24 18:41:05 2017[1,6]<stderr>:+ echo '[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit'
Mon Jul 24 18:41:05 2017[1,6]<stderr>:[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit
Mon Jul 24 18:41:05 2017[1,6]<stderr>:+ get_stack
Stack trace:
Mon Jul 24 18:41:04 2017[1,34]<stderr>:.*** Error in `./paddle_trainer': free(): invalid pointer: 0x00000000083da0c0 ***
Mon Jul 24 18:41:04 2017[1,34]<stderr>:======= Backtrace: =========
Mon Jul 24 18:41:04 2017[1,34]<stderr>:/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x7354f)[0x7fa0eb39f54f]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x78dbe)[0x7fa0eb3a4dbe]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x79a97)[0x7fa0eb3a5a97]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:./paddle_trainer[0x74a983]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:./paddle_trainer(_ZN6paddle15TrainerInternal13trainOneBatchElRKNS_9DataBatchEPSt6vectorINS_8ArgumentESaIS5_EE+0xb33)[0x74bf23]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:./paddle_trainer(_ZN6paddle7Trainer17trainOneDataBatchERNS_9DataBatchE+0x166)[0x749616]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:./paddle_trainer(_ZN6paddle7Trainer12trainOnePassEv+0x13d)[0x749abd]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:./paddle_trainer(_ZN6paddle7Trainer5trainEm+0x95)[0x74a375]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:./paddle_trainer(main+0x350)[0x5a3d70]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:/opt/compiler/gcc-4.8.2/lib/libc.so.6(__libc_start_main+0xf5)[0x7fa0eb34dbd5]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:./paddle_trainer[0x5b2169]
Mon Jul 24 18:41:04 2017[1,34]<stderr>:======= Memory map: ========
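The log above interleaves output from several processes, and the `free(): invalid pointer` abort (18:41:04) comes one second before the `[FATAL]` shutdown messages on the other processes (18:41:05). To make that ordering easier to see in the full train.log, here is a rough Python sketch that groups lines by process and picks out the earliest crash line. It assumes every line has the `timestamp[job,rank]<stream>:` prefix seen above, and that the second number in the brackets is the process rank. Both are guesses from this excerpt, and the helper names (`parse_line`, `group_by_rank`, `first_crash`) are made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed prefix layout, inferred from the excerpt above:
#   "Mon Jul 24 18:41:05 2017[1,25]<stderr>:message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})"
    r"\[(?P<job>\d+),(?P<rank>\d+)\]<(?P<stream>stdout|stderr)>:(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, rank, message), or None if the line has no prefix."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
    return ts, int(m.group("rank")), m.group("msg")


def group_by_rank(lines):
    """Map each rank to its (timestamp, message) pairs, in input order."""
    by_rank = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            ts, rank, msg = parsed
            by_rank[rank].append((ts, msg))
    return dict(by_rank)


def first_crash(lines, marker="*** Error in"):
    """Return the earliest (timestamp, rank, message) whose message contains marker."""
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and marker in parsed[2]:
            hits.append(parsed)
    return min(hits, default=None)


if __name__ == "__main__":
    # Two lines copied from the log above.
    sample = [
        "Mon Jul 24 18:41:05 2017[1,25]<stderr>:[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit",
        "Mon Jul 24 18:41:04 2017[1,34]<stderr>:.*** Error in `./paddle_trainer': free(): invalid pointer: 0x00000000083da0c0 ***",
    ]
    print(first_crash(sample))
```

Running the same functions over the whole train.log (for example `first_crash(open("train.log"))`) should show which process aborted first. The processes that only print `paddle_trainer failed kill paddle_pserver2 and exit` are most likely reacting to that abort rather than causing it.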