MPI job: works on a single node, Evaluator fails on multiple nodes
Created by: CDDB
Background: the remote-sparse-updater is enabled and momentum is used. This is a cluster job; with node=1 it runs fine, for example this job: http://yq01-hpc-wutai01-mon.dmop.baidu.com:8090/job/i-40410/
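For context, a minimal sketch of the optimizer block that would match this description (momentum plus sparse updates). It is not copied from the actual conf, and the numeric values are placeholders:
# Sketch only (my assumption; the real settings() call is not included in this report).
# Assumes the usual `from paddle.trainer_config_helpers import *` at the top of the conf.
# batch_size and learning_rate are placeholders, not the values from the actual job;
# the relevant pieces are MomentumOptimizer here plus sparse_update=True in the
# ParamAttr of the layers below, which together enable the remote sparse updater
# when the job runs in cluster mode.
settings(batch_size=1000,
         learning_rate=1e-3,
         learning_method=MomentumOptimizer(momentum=0.9))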
With node=10, two kinds of errors show up. 1. With only the AUCEvaluator enabled, both the cost and the evaluator value print as 0. Job: http://yq01-hpc-wutai01-mon.dmop.baidu.com:8090/job/i-40431/ conf:
input_0 = data_layer(name='input_fea_0', size=input_dim_0)
input_1 = data_layer(name='input_fea_1', size=input_dim_1)
input_2 = data_layer(name='input_fea_2', size=input_dim_2)
input_3 = data_layer(name='input_fea_3', size=input_dim_3)
input_4 = data_layer(name='input_fea_4', size=input_dim_4)
input_5 = data_layer(name='input_fea_5', size=input_dim_5)
input_6 = data_layer(name='input_fea_6', size=input_dim_6)
input_7 = data_layer(name='input_fea_7', size=input_dim_7)
input_8 = data_layer(name='input_fea_8', size=input_dim_8)
input_9 = data_layer(name='input_fea_9', size=input_dim_9)
input_10 = data_layer(name='input_fea_10', size=input_dim_10)
input_11 = data_layer(name='input_fea_11', size=input_dim_11)
input_12 = data_layer(name='input_fea_12', size=input_dim_12)
input_13 = data_layer(name='input_fea_13', size=input_dim_13)
mask_input_0 = data_layer(name='mask_layer_0', size=mask_dim)
mask_input_2 = data_layer(name='mask_layer_2', size=mask_dim)
mask_input_3 = data_layer(name='mask_layer_3', size=mask_dim)
label = data_layer(name="label", size=num_classes)
hidden_0 = fc_layer(input=input_0, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_1 = fc_layer(input=input_1, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_2 = fc_layer(input=input_2, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_3 = fc_layer(input=input_3, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_4 = fc_layer(input=input_4, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_5 = fc_layer(input=input_5, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_6 = fc_layer(input=input_6, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_7 = fc_layer(input=input_7, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_8 = fc_layer(input=input_8, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_9 = fc_layer(input=input_9, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_10 = fc_layer(input=input_10, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_11 = fc_layer(input=input_11, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_12 = fc_layer(input=input_12, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_13 = fc_layer(input=input_13, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden = fc_layer(input=[hidden_0, hidden_1, hidden_2, hidden_3, hidden_4, hidden_5,hidden_6,hidden_7,hidden_8,hidden_9,hidden_10,hidden_11,hidden_12,hidden_13], size=256, act=ReluActivation())
hidden = fc_layer(input=hidden, size=128, act=ReluActivation())
hidden = fc_layer(input=hidden, size=64, act=ReluActivation())
hidden = fc_layer(input=hidden, size=32, act=ReluActivation())
user_hidden = fc_layer(input=hidden, size=16, act=ReluActivation())
customer_hidden_0 = fc_layer(input=user_hidden, size=16, act=ReluActivation())
customer_hidden_2 = fc_layer(input=user_hidden, size=16, act=ReluActivation())
customer_hidden_3 = fc_layer(input=user_hidden, size=16, act=ReluActivation())
customer_hidden_0 = fc_layer(input=customer_hidden_0, size=16, act=ReluActivation())
customer_hidden_2 = fc_layer(input=customer_hidden_2, size=16, act=ReluActivation())
customer_hidden_3 = fc_layer(input=customer_hidden_3, size=16, act=ReluActivation())
with mixed_layer() as mix0:
    mix0 += dotmul_operator(a=customer_hidden_0, b=mask_input_0)
with mixed_layer() as mix2:
    mix2 += dotmul_operator(a=customer_hidden_2, b=mask_input_2)
with mixed_layer() as mix3:
    mix3 += dotmul_operator(a=customer_hidden_3, b=mask_input_3)
if not is_predict:
    prediction = fc_layer(input=[mix0, mix2, mix3], size=num_classes, act=SoftmaxActivation())
    cost = cross_entropy(prediction, label)
    eval = auc_evaluator(prediction, label)
    outputs(cost)
Log:
I0719 19:10:00.669837 15532 TrainerInternal.cpp:165] Batch=800 samples=800000 AvgCost=0 CurrentCost=0 Eval: __auc_evaluator_0__=0 CurrentEval: __auc_evaluator_0__=0
I0719 19:10:44.259759 15532 TrainerInternal.cpp:181] Pass=4 Batch=961 samples=960087 AvgCost=0 Eval: __auc_evaluator_0__=0
2. Enabling the PrecisionRecallEvaluator makes the job crash; removing the PrecisionRecallEvaluator lets it run normally. Here is a job link: http://yq01-hpc-wutai01-mon.dmop.baidu.com:8090/job/i-40417/ Compared with the conf above, the following was added:
for i in range(num_classes):
    precision_recall_evaluator(name="PreRec of label [{0}]".format(i), input=prediction, label=label, positive_label=i)
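For readability, the fatal check in the trace below (Evaluator.cpp:825) amounts to the condition sketched here in Python; since the message reports the valid range as [0, 0), statsInfo_ apparently has size 0 on the worker trainers, i.e. the evaluator behaves as if the number of classes were 0:
# Paraphrase of the CHECK printed in the trace below (my reading of the log message,
# not the actual Paddle source): positive_label must index into the per-class stats table.
def check_positive_label(positive_label, stats_info):
    assert 0 <= positive_label < len(stats_info), (
        "positive_label [%d] should be in range [0, %d)" % (positive_label, len(stats_info)))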
Wed Jul 19 14:24:21 2017[1,1]<stderr>:F0719 14:24:21.605953 25960 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,2]<stderr>:F0719 14:24:21.606295 1447 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,2]<stderr>:*** Check failure stack trace: ***
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x91316d google::LogMessage::Fail()
Wed Jul 19 14:24:21 2017[1,3]<stderr>:F0719 14:24:21.611981 24429 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,7]<stderr>:F0719 14:24:21.605876 29111 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,9]<stderr>:F0719 14:24:21.612742 4733 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,9]<stderr>:*** Check failure stack trace: ***
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x916c1c google::LogMessage::SendToLog()
Wed Jul 19 14:24:21 2017[1,4]<stderr>:F0719 14:24:21.607393 24523 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,4]<stderr>:*** Check failure stack trace: ***
Wed Jul 19 14:24:21 2017[1,5]<stderr>:F0719 14:24:21.607707 3091 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x91316d google::LogMessage::Fail()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:F0719 14:24:21.607941 11448 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x912c93 google::LogMessage::Flush()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:*** Check failure stack trace: ***
Wed Jul 19 14:24:21 2017[1,4]<stderr>: @ 0x91316d google::LogMessage::Fail()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x916c1c google::LogMessage::SendToLog()
Wed Jul 19 14:24:21 2017[1,0]<stderr>:F0719 14:24:21.607136 1839 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,4]<stderr>: @ 0x916c1c google::LogMessage::SendToLog()
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x91812e google::LogMessageFatal::~LogMessageFatal()
Wed Jul 19 14:24:21 2017[1,8]<stderr>: @ 0x91316d google::LogMessage::Fail()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x912c93 google::LogMessage::Flush()
Wed Jul 19 14:24:21 2017[1,4]<stderr>: @ 0x912c93 google::LogMessage::Flush()
Wed Jul 19 14:24:21 2017[1,8]<stderr>: @ 0x916c1c google::LogMessage::SendToLog()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x91812e google::LogMessageFatal::~LogMessageFatal()
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x6b6bda paddle::PrecisionRecallEvaluator::getStatsInfo()
Wed Jul 19 14:24:21 2017[1,4]<stderr>: @ 0x91812e google::LogMessageFatal::~LogMessageFatal()
Wed Jul 19 14:24:21 2017[1,8]<stderr>: @ 0x912c93 google::LogMessage::Flush()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x6b6bda paddle::PrecisionRecallEvaluator::getStatsInfo()
Wed Jul 19 14:24:21 2017[1,8]<stderr>: @ 0x91812e google::LogMessageFatal::~LogMessageFatal()
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x6b6ef3 paddle::PrecisionRecallEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,4]<stderr>: @ 0x6b6bda paddle::PrecisionRecallEvaluator::getStatsInfo()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x6b6ef3 paddle::PrecisionRecallEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x6a28e0 paddle::CombinedEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,8]<stderr>: @ 0x6b6bda paddle::PrecisionRecallEvaluator::getStatsInfo()
Wed Jul 19 14:24:21 2017[1,4]<stderr>: @ 0x6b6ef3 paddle::PrecisionRecallEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x6a28e0 paddle::CombinedEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x74c444 paddle::TrainerInternal::trainOneBatch()
Wed Jul 19 14:24:21 2017[1,4]<stderr>: @ 0x6a28e0 paddle::CombinedEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,8]<stderr>: @ 0x6b6ef3 paddle::PrecisionRecallEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x74c444 paddle::TrainerInternal::trainOneBatch()
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x749616 paddle::Trainer::trainOneDataBatch()
Wed Jul 19 14:24:21 2017[1,8]<stderr>: @ 0x6a28e0 paddle::CombinedEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x749616 paddle::Trainer::trainOneDataBatch()
Wed Jul 19 14:24:21 2017[1,8]<stderr>: @ 0x74c444 paddle::TrainerInternal::trainOneBatch()
Wed Jul 19 14:24:21 2017[1,2]<stderr>: @ 0x749abd paddle::Trainer::trainOnePass()
Wed Jul 19 14:24:21 2017[1,9]<stderr>: @ 0x749abd paddle::Trainer::trainOnePass()