'Params overflow error' when running an MPI job
Created by: Bella-Zhao
When running an MPI job, the following error appears after the first pass finishes training:
Sat Nov 11 00:15:39 2017[1,12]:*Shell Script Stack Trace
Sat Nov 11 00:15:39 2017[1,12]: @: [./log.sh: 41] log_fatal
Sat Nov 11 00:15:39 2017[1,12]: @: [./common.sh: 399] kill_pserver2_exit
Sat Nov 11 00:15:39 2017[1,12]: @: [./train.sh: 242] main
Sat Nov 11 00:15:39 2017[1,12]:
Sat Nov 11 00:15:39 2017[1,12]:+ exit 1
Sat Nov 11 00:26:13 2017[1,13]:[INFO 2017-11-11 00:26:13,260 trainer_config.conf:126] Test at Pass 0, {'auc_evaluator_0': 0.7254259586334229, 'classification_error_evaluator': 0.04435134679079056}
Sat Nov 11 00:26:41 2017[1,13]:Traceback (most recent call last):
Sat Nov 11 00:26:41 2017[1,13]: File "conf/trainer_config.conf", line 182, in
Sat Nov 11 00:26:41 2017[1,13]: use_gpu=False)
Sat Nov 11 00:26:41 2017[1,13]: File "conf/trainer_config.conf", line 139, in train
Sat Nov 11 00:26:41 2017[1,13]: num_passes=num_passes)
Sat Nov 11 00:26:41 2017[1,13]: File "/home/disk1/normandy/maybach/333156/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/v2/trainer.py", line 175, in train
Sat Nov 11 00:26:41 2017[1,13]: event_handler(v2_event.EndPass(pass_id, evaluator=pass_evaluator))
Sat Nov 11 00:26:41 2017[1,13]: File "conf/trainer_config.conf", line 133, in _event_handler
Sat Nov 11 00:26:41 2017[1,13]: trainer.save_parameter_to_tar(f)
Sat Nov 11 00:26:41 2017[1,13]: File "/home/disk1/normandy/maybach/333156/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/v2/trainer.py", line 107, in save_parameter_to_tar
Sat Nov 11 00:26:41 2017[1,13]: self.parameters.to_tar(f)
Sat Nov 11 00:26:41 2017[1,13]: File "/home/disk1/normandy/maybach/333156/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/v2/parameters.py", line 271, in to_tar
Sat Nov 11 00:26:41 2017[1,13]: self.serialize(nm, buf)
Sat Nov 11 00:26:41 2017[1,13]: File "/home/disk1/normandy/maybach/333156/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/v2/parameters.py", line 253, in serialize
Sat Nov 11 00:26:41 2017[1,13]: f.write(param.tostring())
Sat Nov 11 00:26:41 2017[1,13]:OverflowError: length too large
Sat Nov 11 00:26:54 2017[1,13]:+ '[' 1 -ne 0 ']'
Sat Nov 11 00:26:54 2017[1,13]:+ kill_pserver2_exit
Sat Nov 11 00:26:54 2017[1,13]:+ ps aux
Sat Nov 11 00:26:54 2017[1,13]:+ grep paddle_pserver2
Sat Nov 11 00:26:54 2017[1,13]:+ grep paddle_cluster_job
Sat Nov 11 00:26:54 2017[1,13]:+ grep -v grep
Sat Nov 11 00:26:54 2017[1,13]:+ cut -c10-14
Sat Nov 11 00:26:54 2017[1,13]:+ xargs kill -9
Sat Nov 11 00:26:54 2017[1,13]:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Sat Nov 11 00:26:54 2017[1,13]:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Sat Nov 11 00:26:54 2017[1,13]:[./common.sh : 399] [kill_pserver2_exit]
Sat Nov 11 00:26:54 2017[1,13]:+ echo '[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit'
Sat Nov 11 00:26:54 2017[1,13]:[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit
Sat Nov 11 00:26:54 2017[1,13]:+ get_stack
Sat Nov 11 00:26:54 2017[1,13]:+ set +x
Sat Nov 11 00:26:54 2017[1,13]:
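My MPI build was installed by following http://wiki.baidu.com/pages/viewpage.action?pageId=327596461.
I found a similar issue, https://github.com/PaddlePaddle/Paddle/issues/2895, which is reportedly already fixed. Has that fix been included in the MPI version yet? If not, how should I modify the code so that it does not fail on MPI?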
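
For reference, here is a minimal sketch of one possible workaround. It assumes the OverflowError is raised because `f.write(param.tostring())` hands the whole parameter to the file object as a single very large string. That is a guess based on the traceback, not something confirmed from the fix in issue 2895. The helper name `write_in_chunks` and the `chunk_items` parameter are hypothetical and not part of Paddle. The sketch is written for Python 3 and uses a standard-library `array` as a stand-in for the parameter; under Python 2.7 with a NumPy parameter, the slices would presumably come from `param.ravel()` and use `tostring()` instead of `tobytes()`.

```python
import io
from array import array


def write_in_chunks(f, values, chunk_items=1 << 20):
    """Write a flat array to f in slices of at most chunk_items elements.

    Hypothetical helper: each f.write() call receives only a bounded
    amount of data instead of one huge string. Returns the number of
    write calls made.
    """
    writes = 0
    for start in range(0, len(values), chunk_items):
        # Serialize only the current slice, then write it out.
        f.write(values[start:start + chunk_items].tobytes())
        writes += 1
    return writes


# Small demonstration with a stand-in for a parameter array.
data = array("f", [float(i) for i in range(10)])
out = io.BytesIO()
n_writes = write_in_chunks(out, data, chunk_items=3)

# The chunked output is byte-for-byte identical to a single write.
assert out.getvalue() == data.tobytes()
```

If the assumption holds, replacing the single `f.write(...)` call in `serialize` with a loop like this would keep every write small while producing the same bytes in the tar file.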