Job failed due to a timeout; rerunning with init_model_path= (v1 version) fails
Created by: lyp2github
The cluster job failed because of a timeout. Rerunning it with init_model_path= specified (v1 version) produces the following error:

[INFO 2017-07-21 14:47:58,207 networks.py:1482] The input order is [img_feature, pos_title, neg_title, label]
Fri Jul 21 14:47:58 2017[1,2]:[INFO 2017-07-21 14:47:58,208 networks.py:1488] The output order is [rank_cost_0]
Fri Jul 21 14:47:58 2017[1,6]:Download File: /app/ecom/fcr-opt/liuyaping/paddle/paddle_image_bidword_sim/test/part-00000_00278
Fri Jul 21 14:47:58 2017[1,0]:Download File: /app/ecom/fcr-opt/liuyaping/paddle/paddle_image_bidword_sim/test/tusou-part-00006
Fri Jul 21 14:47:59 2017[1,3]:Download File: /app/ecom/fcr-opt/liuyaping/paddle/paddle_image_bidword_sim/test/part-00000_00278
Fri Jul 21 14:47:59 2017[1,2]:F0721 14:47:59.842875 25453 Parameter.cpp:379] _fc_layer_0.w0 missing, not allowed.
Fri Jul 21 14:47:59 2017[1,2]:*** Check failure stack trace: ***
Fri Jul 21 14:47:59 2017[1,2]: @ 0x91316d google::LogMessage::Fail()
Fri Jul 21 14:47:59 2017[1,2]: @ 0x916c1c google::LogMessage::SendToLog()
Fri Jul 21 14:47:59 2017[1,2]: @ 0x912c93 google::LogMessage::Flush()
Fri Jul 21 14:47:59 2017[1,2]: @ 0x91812e google::LogMessageFatal::~LogMessageFatal()
Fri Jul 21 14:47:59 2017[1,2]: @ 0x877dac paddle::Parameter::load()
Fri Jul 21 14:47:59 2017[1,2]: @ 0x6b42ca paddle::GradientMachine::loadParameters()
Fri Jul 21 14:47:59 2017[1,2]: @ 0x75a68c paddle::ParameterUtil::loadParametersWithPath()
Fri Jul 21 14:47:59 2017[1,2]: @ 0x7490b7 paddle::Trainer::init()
Fri Jul 21 14:47:59 2017[1,2]: @ 0x5a3cbc main
Fri Jul 21 14:47:59 2017[1,2]: @ 0x7f90cb33bbd5 __libc_start_main
Fri Jul 21 14:47:59 2017[1,2]: @ 0x5b2169 (unknown)
Fri Jul 21 14:47:59 2017[1,2]: @ (nil) (unknown)
Fri Jul 21 14:48:00 2017[1,2]:./train.sh: line 207: 25453 Aborted PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Fri Jul 21 14:48:00 2017[1,2]:+ '[' 134 -ne 0 ']'

刘亚萍
The job configuration was not modified at all; the job simply ended early because of the timeout, so I added init_model_path in the hope of continuing training. In principle, an error about configuration parameters should not appear in this case...

http://nmg01-hpc-controller.nmg01.baidu.com:8090/job/i-278622/
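
Since the fatal message reports `_fc_layer_0.w0` as missing, one thing that may be worth checking is whether the directory passed to init_model_path actually contains a file for every parameter the network expects. The sketch below is only a diagnostic aid, not part of Paddle: it assumes that a saved model directory holds one file per parameter, named after the parameter, and the helper name `missing_parameters` is hypothetical.

```python
import os
import sys


def missing_parameters(model_dir, expected_names):
    """Return the expected parameter names that have no entry in model_dir.

    Assumption: the saved model directory stores one file per parameter,
    and each file is named exactly after the parameter.
    """
    if not os.path.isdir(model_dir):
        # A missing or mistyped directory means every parameter is missing.
        return sorted(expected_names)
    present = set(os.listdir(model_dir))
    return sorted(name for name in expected_names if name not in present)


if __name__ == "__main__":
    # Usage: python check_params.py <model_dir> <param_name> [<param_name> ...]
    if len(sys.argv) < 3:
        sys.exit("usage: check_params.py <model_dir> <param_name>...")
    missing = missing_parameters(sys.argv[1], sys.argv[2:])
    if missing:
        print("missing parameter files: " + ", ".join(missing))
    else:
        print("all listed parameter files are present")
```

If the check reports `_fc_layer_0.w0` as missing, the path given to init_model_path may point at the top-level save directory or at a pass that was interrupted by the timeout before all parameters were written, rather than at a completed pass.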