How can the image classification example be trained on multiple machines with multiple GPUs?
Created by: Sampson1107
Hello! I previously got single-machine, multi-GPU training working with the model at https://github.com/PaddlePaddle/models/tree/develop/image_classification, but I keep running into problems when training on multiple machines with multiple GPUs. Following the distributed training instructions, I set up conf.py as follows:
HOSTS = [
    "root@192.168.100.67",
    "root@192.168.100.68",
]
'''
workspace configuration
'''
#root dir for workspace, can be set as any directory with real user account
ROOT_DIR = "/home/users/AI/paddle/paddle/scripts/cluster_train"
'''
network configuration
'''
#pserver nics
PADDLE_NIC = "eno1"
#pserver port
PADDLE_PORT = 22222
#pserver ports num
PADDLE_PORTS_NUM = 2
#pserver sparse ports num
PADDLE_PORTS_NUM_FOR_SPARSE = 2
#environments setting for all processes in cluster job
LD_LIBRARY_PATH = "/usr/local/cuda/lib64:/usr/lib64:/home/users/cudnn/cudnn6.0/lib64:/home/users/openmpi/lib"
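Before launching a cluster job it can help to confirm that every HOSTS entry is well formed. The following is a minimal sketch, not part of the PaddlePaddle scripts: the helper name `split_host` is hypothetical, it assumes each entry is a `user@IP` literal as in the conf.py above, and it assumes Python 3 for the `ipaddress` module.

```python
import ipaddress

# HOSTS copied from the conf.py shown above.
HOSTS = [
    "root@192.168.100.67",
    "root@192.168.100.68",
]


def split_host(entry):
    """Split a "user@address" entry into (user, address).

    Hypothetical helper: assumes the address is an IP literal,
    which holds for the two entries in this post.
    """
    user, sep, address = entry.partition("@")
    if not sep or not user:
        raise ValueError("expected user@address, got %r" % entry)
    ipaddress.ip_address(address)  # raises ValueError for a malformed IP
    return user, address


nodes = [split_host(h) for h in HOSTS]
if len(set(nodes)) != len(nodes):
    raise ValueError("duplicate entries in HOSTS")

for user, address in nodes:
    print(user, address)
```

A check like this only validates the syntax of the entries; it says nothing about whether the machines are reachable over SSH.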
I am using the scripts from the example directory paddle/paddle/scripts/cluster_train.
run.sh is as follows:
python paddle2.py \
  --job_dispatch_package="/home/users/AI/paddle/paddle/scripts/cluster_train" \
  --dot_period=10 \
  --ports_num_for_sparse=2 \
  --log_period=50 \
  --num_passes=10 \
  --trainer_count=4 \
  --saving_period=1 \
  --local=0 \
  --config=/home/users/AI/paddle/paddle/scripts/cluster_train/train.py \
  --save_dir=./output \
  --use_gpu=1
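Because the flags are listed one per line, it may be useful to see them assembled into a single argument list. This is only a sketch: the `FLAGS` dict is an illustration built from the values in run.sh above, and nothing here is part of paddle2.py itself. It prints the command rather than running it.

```python
import shlex

# Flag values copied from run.sh above; the dict is an illustration only.
FLAGS = {
    "job_dispatch_package": "/home/users/AI/paddle/paddle/scripts/cluster_train",
    "dot_period": "10",
    "ports_num_for_sparse": "2",
    "log_period": "50",
    "num_passes": "10",
    "trainer_count": "4",
    "saving_period": "1",
    "local": "0",
    "config": "/home/users/AI/paddle/paddle/scripts/cluster_train/train.py",
    "save_dir": "./output",
    "use_gpu": "1",
}

# One "--name=value" token per flag, in the same order as run.sh.
argv = ["python", "paddle2.py"] + ["--%s=%s" % (k, v) for k, v in FLAGS.items()]

# Print a shell-safe, single-line form of the command.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the flags in one structure makes it easier to compare the values passed on each machine when debugging a multi-node run.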
Running it always fails with the following error:
Python Error: <class 'google.protobuf.message.EncodeError'> : Message paddle.TrainerConfig is missing required fields: opt_config.batch_size
Does this mean the example cannot be used as a distributed training script?
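The error line itself names the message type and the field that was never set. As a small diagnostic sketch (the function name `missing_required_fields` is hypothetical, and the pattern assumes the comma-separated wording that protobuf's Python `SerializeToString` uses for this error), the missing fields can be pulled out of the log line like this:

```python
import re

# Error line copied from the trainer output quoted above.
ERROR_LINE = (
    "Python Error: <class 'google.protobuf.message.EncodeError'> : "
    "Message paddle.TrainerConfig is missing required fields: "
    "opt_config.batch_size"
)


def missing_required_fields(line):
    """Return (message_name, [field paths]) parsed from an EncodeError line.

    Returns (None, []) when the line does not match the expected wording.
    """
    match = re.search(r"Message (\S+) is missing required fields: (.+)$", line)
    if match is None:
        return None, []
    fields = [field.strip() for field in match.group(2).split(",")]
    return match.group(1), fields


message_name, fields = missing_required_fields(ERROR_LINE)
print(message_name, fields)
```

Here the only missing field is `opt_config.batch_size`, which suggests that the file passed via `--config` never set a batch size in the optimization settings by the time the configuration was serialized.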
The full error output is as follows:
[INFO 2017-09-27 10:42:59,648 (unknown file):0] model_config {
  type: "nn"
  sub_models {
    name: "root"
    is_recurrent_layer_group: false
  }
}
opt_config {
  algorithm: "async_sgd"
  learning_rate: 1.0
  learning_rate_decay_a: 0.0
  learning_rate_decay_b: 0.0
  l1weight: 0.1
  l2weight: 0.0
  c1: 0.0001
  backoff: 0.5
  owlqn_steps: 10
  max_backoff: 5
  l2weight_zero_iter: 0
  average_window: 0
  learning_method: "momentum"
  ada_epsilon: 1e-06
  do_average_in_cpu: false
  ada_rou: 0.95
  learning_rate_schedule: "poly"
  delta_add_rate: 1.0
  shrink_parameter_value: 0
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1e-08
  learning_rate_args: ""
  async_lagged_grad_discard_ratio: 1.5
}
save_dir: "./output/model"
start_pass: 0
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/paddle/trainer/config_parser.py", line 4230, in parse_config_and_serialize
return config.SerializeToString()
File "/usr/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 1059, in SerializeToString
self.DESCRIPTOR.full_name, ','.join(self.FindInitializationErrors())))
EncodeError: Message paddle.TrainerConfig is missing required fields: opt_config.batch_size
F0927 10:42:59.650485 40923 PythonUtil.cpp:131] Check failed: (ret) != nullptr Current PYTHONPATH: ['/usr/local/bin', '/usr/lib/python2.7/site-packages/pip-9.0.1-py2.7.egg', '/usr/lib/python2.7/site-packages/pybind11-1.9.dev0-py2.7.egg', '/home/users/AI/paddle/paddle/scripts/cluster_train/JOB20170927104210', '/usr/lib64/python27.zip', '/usr/lib64/python2.7', '/usr/lib64/python2.7/plat-linux2', '/usr/lib64/python2.7/lib-tk', '/usr/lib64/python2.7/lib-old', '/usr/lib64/python2.7/lib-dynload', '/usr/lib64/python2.7/site-packages', '/usr/lib64/python2.7/site-packages/gtk-2.0', '/usr/lib/python2.7/site-packages']
Python Error: *** Check failure stack trace: ***
@ 0x61061d google::LogMessage::Fail()
@ 0x61249f google::LogMessage::SendToLog()
@ 0x610193 google::LogMessage::Flush()
@ 0x612dbe google::LogMessageFatal::~LogMessageFatal()
@ 0x97d6aa paddle::callPythonFuncRetPyObj()
@ 0x97d88c paddle::callPythonFunc()