Distributed training on Kubernetes fails to run
Created by: Daemon007
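I followed https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/k8s/k8s_distributed_cn.md to run distributed training on Kubernetes.
1. Environment
Kubernetes version: v1.5.2
Image version: the cluster nodes do not support AVX, so the image is built directly on top of paddlepaddle/paddle:0.10.0rc3-noavx: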
FROM paddlepaddle/paddle:0.10.0rc3-noavx
COPY start.sh /root/
COPY start_paddle.py /root/
RUN chmod +x /root/start.sh
CMD ["bash"," -c","/root/start.sh"]
2. Scripts
Job script job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-cluster-job
spec:
  parallelism: 3
  completions: 3
  template:
    metadata:
      name: paddle-cluster-job
    spec:
      volumes:
      - name: jobpath
        hostPath:
          path: /home/work/mfs
      containers:
      - name: trainer
        image: registry.vm-1:5000/test:test
        imagePullPolicy: Always
        command: ["bin/bash", "-c", "/root/start.sh"]
        env:
        - name: JOB_NAME
          value: paddle-cluster-job
        - name: JOB_PATH
          value: /home/jobpath
        - name: JOB_NAMESPACE
          value: default
        - name: TRAIN_CONFIG_DIR
          value: recommendation
        - name: CONF_PADDLE_NIC
          value: eth0
        - name: CONF_PADDLE_PORT
          value: "7164"
        - name: CONF_PADDLE_PORTS_NUM
          value: "2"
        - name: CONF_PADDLE_PORTS_NUM_SPARSE
          value: "2"
        - name: CONF_PADDLE_GRADIENT_NUM
          value: "3"
          name: TRAINER_COUNT
          value: "9"
        volumeMounts:
        - name: jobpath
          mountPath: /home/jobpath
      restartPolicy: Never
      imagePullSecrets:
      - name: myregistrykey
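One thing that may be worth ruling out in the manifest above: the `name: TRAINER_COUNT` line has no leading `- `, so it may be read as part of the preceding `CONF_PADDLE_GRADIENT_NUM` entry rather than as its own list item. Whether that is what actually happens on this cluster is not confirmed here. The sketch below is a hypothetical helper (the function name `find_duplicate_env_keys` is not part of Kubernetes or PaddlePaddle) that scans an indented manifest for keys repeated inside a single `env:` list item. It uses only the standard library, does plain text matching, and is not a YAML parser.

```python
def find_duplicate_env_keys(manifest_text):
    """Return (line_number, line) pairs for keys repeated within one env item.

    Assumes a block-style, space-indented manifest in which every env entry
    starts with "- " and its remaining keys are indented two spaces further.
    """
    findings = []
    in_env = False
    env_indent = 0
    key_indent = None
    seen = set()
    for lineno, raw in enumerate(manifest_text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        if stripped == "env:":
            in_env, env_indent, key_indent, seen = True, indent, None, set()
            continue
        if not in_env:
            continue
        is_item_start = stripped.startswith("- ")
        # Leaving the env block: a shallower line, or a sibling key of "env:".
        if indent < env_indent or (indent == env_indent and not is_item_start):
            in_env = False
            continue
        if is_item_start:
            key_indent = indent + 2
            seen = set()
            text = stripped[2:].strip()
        elif indent == key_indent:
            text = stripped
        else:
            continue  # nested content such as valueFrom is ignored
        key = text.split(":", 1)[0].strip()
        if key in seen:
            findings.append((lineno, stripped))
        seen.add(key)
    return findings


if __name__ == "__main__":
    fragment = (
        "env:\n"
        "- name: CONF_PADDLE_GRADIENT_NUM\n"
        '  value: "3"\n'
        "  name: TRAINER_COUNT\n"
        '  value: "9"\n'
    )
    for lineno, line in find_duplicate_env_keys(fragment):
        print("line %d repeats a key in the same env item: %s" % (lineno, line))
```

Any line it reports is only a candidate for a missing `- `; the manifest still has to be checked by hand.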
3. Result
When training on the cluster, the pod logs are as follows:
node_0: server.log
ERROR: illegal value 'None' specified for int32 flag 'num_gradient_servers'
node_0: train.log
I0717 07:27:11.042790 53 Util.cpp:166] commandline:
/usr/bin/../opt/paddle/bin/paddle_trainer
--nics=eth0 --port=7164
--ports_num=2 --comment=paddle_process_by_paddle
--pservers=10.0.57.2,10.0.40.3,10.0.51.2
--ports_num_for_sparse=2
--config=trainer_config.lr.py
--log_period=50 --trainer_count=9
--num_passes=10 --use_gpu=0
--ports_num=2 --dot_period=10
--saving_period=1 --local=0 --trainer_id=0
--save_dir=/home/jobpath/paddle-cluster-job/output
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/paddle/trainer/config_parser.py", line 3607, in parse_config_and_serialize
config = parse_config(trainer_config, config_arg_str)
File "/usr/local/lib/python2.7/site-packages/paddle/trainer/config_parser.py", line 3600, in parse_config
make_config_environment(trainer_config, config_args))
IOError: [Errno 2] No such file or directory: 'trainer_config.lr.py'
F0717 07:27:11.444602 53 PythonUtil.cpp:131] Check failed: (ret) != nullptr Current PYTHONPATH: ['/usr/opt/paddle/bin', '/root', '/usr/local/lib/python27.zip', '/usr/local/lib/python2.7', '/usr/local/lib/python2.7/plat-linux2', '/usr/local/lib/python2.7/lib-tk', '/usr/local/lib/python2.7/lib-old', '/usr/local/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/site-packages']
Python Error: <type 'exceptions.IOError'> : [Errno 2] No such file or directory: 'trainer_config.lr.py'
Python Callstack:
/usr/local/lib/python2.7/site-packages/paddle/trainer/config_parser.py : 3607
/usr/local/lib/python2.7/site-packages/paddle/trainer/config_parser.py : 3600
Call Object failed.
*** Check failure stack trace: ***
@ 0x92841d google::LogMessage::Fail()
@ 0x92bf65 google::LogMessage::SendToLog()
@ 0x927f43 google::LogMessage::Flush()
@ 0x92d47e google::LogMessageFatal::~LogMessageFatal()
@ 0x88c3fa paddle::callPythonFuncRetPyObj()
@ 0x88c5bc paddle::callPythonFunc()
@ 0x795c9b paddle::TrainerConfigHelper::TrainerConfigHelper()
@ 0x7962d4 paddle::TrainerConfigHelper::createFromFlags()
@ 0x5fa4c2 main
@ 0x7f1be31a5b45 __libc_start_main
@ 0x608449 (unknown)
@ (nil) (unknown)
/usr/bin/paddle: line 113: 53 Aborted (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}
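The `IOError` in train.log shows `--config=trainer_config.lr.py` passed as a relative path, so it is resolved against whatever working directory `paddle_trainer` was started in. As a minimal sketch for narrowing this down, the Python snippet below checks a few candidate directories for the config file from inside the pod. The candidate directories are guesses built from the `JOB_PATH`, `JOB_NAME` and `TRAIN_CONFIG_DIR` values in job.yaml, not the layout that start_paddle.py is confirmed to use, and `locate_config` is a hypothetical helper name.

```python
import os


def locate_config(config_name, candidate_dirs):
    """Return the candidate directories that actually contain config_name."""
    return [
        d for d in candidate_dirs
        if os.path.isfile(os.path.join(d, config_name))
    ]


if __name__ == "__main__":
    # Defaults mirror the env values in job.yaml; they are read from the
    # environment when the script runs inside the pod.
    job_path = os.environ.get("JOB_PATH", "/home/jobpath")
    job_name = os.environ.get("JOB_NAME", "paddle-cluster-job")
    config_dir = os.environ.get("TRAIN_CONFIG_DIR", "recommendation")

    # Assumed layouts only; adjust to wherever the job files were copied.
    candidates = [
        os.getcwd(),
        os.path.join(job_path, job_name),
        os.path.join(job_path, job_name, config_dir),
        os.path.join(job_path, config_dir),
    ]
    found = locate_config("trainer_config.lr.py", candidates)
    print("searched:", candidates)
    print("found in:", found)
```

If the file shows up in none of the directories, the job files were probably not copied to the mounted volume; if it is found in a directory other than the current one, the trainer is likely being started from the wrong place.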
Is this caused by using the wrong image version, or by something else? How should I adjust the setup?