Paddle v2 fails with an error when running on a P40 GPU
Created by: Bella-Zhao
Error message:
+ python27-gcc482/bin/python conf/trainer_config.conf
F1105 16:35:31.386123 3970 RemoteParameterUpdater.cpp:829] Check failed: !parametersArray[UPDATER_SPARSE_REMOTE].empty()
*** Check failure stack trace: ***
@ 0x7f5e5a66e9cd google::LogMessage::Fail()
@ 0x7f5e5a67247c google::LogMessage::SendToLog()
@ 0x7f5e5a66e4c3 google::LogMessage::Flush()
@ 0x7f5e5a67398e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f5e5add5969 paddle::SparseRemoteParameterUpdaterComposite::init()
@ 0x7f5e5a64e80a ParameterUpdater::init()
@ 0x7f5e5a25883b _wrap_ParameterUpdater_init
@ 0x4b4cb9 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b6c52 PyEval_EvalCode
@ 0x4e1c7d PyRun_FileExFlags
@ 0x4e3501 PyRun_SimpleFileExFlags
@ 0x4159dd Py_Main
@ 0x7f5ea5defbd5 __libc_start_main
@ 0x414b71 (unknown)
@ (nil) (unknown)
Log link: http://yq01-sys-hic-p40-0124.yq01.baidu.com:8880/output/list/59364 (the output above is from log/train.log)
Submit command:
paddle cluster_train \
--config=train_distri.py \
--time_limit=60:00:00 \
--submitter=qianjianping \
--num_nodes=1 \
--job_priority=very_high \
--fs_name=hdfs://yq01-build-hdfs.dmop.baidu.com:54310 \
--fs_ugi=colombo,colombo@build \
--num_passes=1 \
--train_data_path=${input_path} \
--init_model_path=\"\" \
--test_data_path=/user/colombo/user/zhaoyijin/haokan_feed_log/output/2018022413/test \
--output_path=${output_path} \
--thirdparty=./thirdparty \
--where=yq01-hic-p40_3_8_slurm_cluster \
--job_name=feed_production_day_paddle_mcf \
--ports_num_for_sparse=1 \
--port=5353 \
--ports_num=1 \
--use_remote_sparse=1
Relevant code snippet:
paddle.init(use_gpu=True,
            trainer_count=int(1),
            port=int(os.getenv("PADDLE_PORT", "7164")),
            ports_num=int(os.getenv("PADDLE_PORTS_NUM", "1")),
            num_gradient_servers=int(os.getenv("PADDLE_NUM_GRADIENT_SERVERS", "1")),
            trainer_id=int(os.getenv("PADDLE_TRAINER_ID", "0")),
            pservers=os.getenv("PADDLE_PSERVERS", "127.0.0.1"),
            ports_num_for_sparse=int(os.getenv("PADDLE_PORTS_NUM_FOR_SPARSE", "1")))
The same job runs normally on the CPU cluster. The model parameters are roughly 1 GB in size.
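
To compare what the trainer actually receives on the GPU cluster and on the CPU cluster, the values passed to paddle.init can be printed just before the call. The following is only a minimal sketch, not a fix for the check failure: the helper names resolve_paddle_env and format_paddle_env are hypothetical, and the defaults are copied from the snippet above. It reports whether each PADDLE_* setting came from the environment or fell back to its default (for example, the submit command passes --port=5353 while the code defaults to 7164).

```python
import os

# Defaults copied from the paddle.init call in this report.
PADDLE_ENV_DEFAULTS = {
    "PADDLE_PORT": "7164",
    "PADDLE_PORTS_NUM": "1",
    "PADDLE_NUM_GRADIENT_SERVERS": "1",
    "PADDLE_TRAINER_ID": "0",
    "PADDLE_PSERVERS": "127.0.0.1",
    "PADDLE_PORTS_NUM_FOR_SPARSE": "1",
}


def resolve_paddle_env(environ=None):
    """Return {name: (value, came_from_env)} for each PADDLE_* setting.

    Hypothetical helper: it only reads environment variables and does
    not call into Paddle.
    """
    environ = os.environ if environ is None else environ
    resolved = {}
    for name, default in PADDLE_ENV_DEFAULTS.items():
        resolved[name] = (environ.get(name, default), name in environ)
    return resolved


def format_paddle_env(resolved):
    """Render the resolved settings one per line, sorted by name."""
    lines = []
    for name in sorted(resolved):
        value, from_env = resolved[name]
        source = "env" if from_env else "default"
        lines.append("%s=%s (%s)" % (name, value, source))
    return "\n".join(lines)


if __name__ == "__main__":
    # Print this in train_distri.py just before paddle.init(...).
    print(format_paddle_env(resolve_paddle_env()))
```

Running this on both clusters and comparing the two outputs would show whether the sparse-port and pserver settings differ between the CPU and GPU environments.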