Training works locally but fails when run on the cluster
Created by: weiyuze
Training runs without problems locally, but when the job runs on the cluster it fails with the following error:
Log file created at: 2017/06/29 13:52:44
Running on machine: nmg01-hpc-w0112.nmg01.baidu.com
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0629 13:52:44.978458 23900 TrainerConfigHelper.cpp:57] Check failed: m->conf.ParseFromString(configProtoStr)
The configuration is as follows:
lr = 0.1
lm = "adam"
bs = 1110
l1s = 64
l1at = "tanh"
################################### DATA Configuration #############################################
TrainData(PyData(files="train.list",
                 load_data_module="Py_traf_mult_task",
                 load_data_object="processData"))
TestData(PyData(files="test.list",
                load_data_module="Py_traf_mult_task",
                load_data_object="processData"))
################################### Algorithm Configuration ########################################
Settings(
    learning_rate_decay_a = 0.0,
    learning_rate_decay_b = 0.0,
    learning_rate = lr,
    batch_size = bs,
    algorithm = 'sgd',
    learning_method = lm,
    ada_epsilon = 1e-6,
    ada_rou = 0.95,
    num_batches_per_send_parameter = 1,
    num_batches_per_get_parameter = 1,
)