mpirun download_init_model.sh failed
Created by: Bella-Zhao
Submitting a cluster prediction job failed. logs/job.err.log contains the following errors:
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:+ echo '[./hadoop_functions.sh : 129] [hadoop_get_file]'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:[./hadoop_functions.sh : 129] [hadoop_get_file]
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:+ echo '[INFO]: download from [/app/....../video-youtube-model/model/pass-00000] to init_model_path/pass-00000 success'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:[INFO]: download from [/app/....../video-youtube-model/model/pass-00000] to init_model_path/pass-00000 success
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:+ break
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:+ return 0
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:+ check_return 'download init_model_path failed'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:+ '[' 0 -ne 0 ']'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,17]<stderr>:+ mv ./init_model_path/pass-00000/model_pass_00002.tar.gz ./init_model_path/
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ ret=0
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ '[' 0 -ne 0 ']'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ log_info 'download from [/app/....../video-youtube-model/model/pass-00000] to init_model_path/pass-00000 success'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ echo '[./hadoop_functions.sh : 129] [hadoop_get_file]'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:[./hadoop_functions.sh : 129] [hadoop_get_file]
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ echo '[INFO]: download from [/app/....../video-youtube-model/model/pass-00000] to init_model_path/pass-00000 success'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:[INFO]: download from [/app/....../video-youtube-model/model/pass-00000] to init_model_path/pass-00000 success
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ break
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ return 0
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ check_return 'download init_model_path failed'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ '[' 0 -ne 0 ']'
[12-04 10:43:40] [0] Mon Dec 4 10:43:40 2017[1,15]<stderr>:+ mv ./init_model_path/pass-00000/model_pass_00002.tar.gz ./init_model_path/
[12-04 10:43:40] [0] + check_return 'mpirun download_init_model.sh failed'
Is my init_model_path set incorrectly? Below is the script I used to submit the job:
paddle cluster_train \
--config=get_user_vector.py \
--time_limit=72:00:00 \
--submitter=zhaoyijin \
--num_nodes=20 \
--job_priority=normal \
--fs_name=hdfs://...... \
--fs_ugi=weigou-ecst,123abc \
--num_passes=10 \
--init_model_path=/app/....../video-youtube-model/model/pass-00000 \
--train_data_path=/app/....../video-youtube-model/gen-sample/sample_20171129_split \
--output_path=/app/....../video-youtube-model/gen-vector \
--thirdparty=./my_thirdparty \
--where=...... \
--job_name=paddle_dssm_zhaoyijin \
--ports_num_for_sparse=1 \
--use_remote_sparse=1
--init_model_path=/app/....../cpu/video-youtube-model/model/pass-00000
Here pass-00000 is a directory; it contains model_pass_00002.tar.gz, which is my model parameter archive.
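To make the layout concrete, here is a minimal local sketch of what the download step appears to expect, based only on the last traced command in job.err.log (the `mv` of model_pass_00002.tar.gz out of pass-00000). The temporary directory and the empty placeholder archive are assumptions for illustration. This does not reproduce the cluster environment or the actual contents of download_init_model.sh.

```shell
#!/bin/sh
# Hypothetical local reproduction of the directory layout described above.
set -eu

# Scratch directory standing in for the job's working directory (assumption).
workdir=$(mktemp -d)

# pass-00000 is a directory that holds the parameter archive.
mkdir -p "$workdir/init_model_path/pass-00000"

# Empty placeholder standing in for the real model_pass_00002.tar.gz.
: > "$workdir/init_model_path/pass-00000/model_pass_00002.tar.gz"

# Same move as the last traced command in job.err.log.
mv "$workdir/init_model_path/pass-00000/model_pass_00002.tar.gz" \
   "$workdir/init_model_path/"

# Show the resulting layout.
ls "$workdir/init_model_path"
```

With this layout the `mv` succeeds locally, so the sketch only confirms that the directory structure itself matches what the traced command expects.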