Local training gives a correct AUC, but cluster training reports an AUC of 0 everywhere. Why?
Created by: HugoLian
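When the program trains on a single machine, the AUC is correct. When it trains on the cluster, the AUC printed for both the training set (event.metrics) and the test set (result.metrics) is always 0. Why? The training and test data are definitely fine: I have already checked them, and the labels are not laid out in long runs of consecutive 1s or 0s.
In cluster training, the program prints the AUC like this: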
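One way to double-check the data on the cluster itself is to verify, node by node, that the shard each node actually reads contains both positive and negative labels, because AUC is undefined when only one class is present. A minimal sketch, assuming the reader yields tuples with the label as the last field; `label_counts` and `check_both_classes` are hypothetical helpers, not Paddle APIs:

```python
from collections import Counter


def label_counts(reader, label_index=-1):
    """Count label values yielded by a Paddle-v2-style reader.

    Assumption: `reader` is a zero-argument callable returning an iterable
    of tuples, with the label at position `label_index` (last by default).
    Integer and float labels (1 vs 1.0) land in the same Counter bucket.
    """
    counts = Counter()
    for sample in reader():
        counts[sample[label_index]] += 1
    return counts


def check_both_classes(counts):
    """AUC needs at least one positive (1) and one negative (0) sample."""
    return counts.get(0, 0) > 0 and counts.get(1, 0) > 0


# Hypothetical usage on each node, against that node's own shard:
#   c = label_counts(cluster_data_reader(cluster_train_dir, node_id))
#   print(dict(c), check_both_classes(c))
```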
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if (event.batch_id + 1) % 400 == 0:
            print "\nPass %d Batch %d Cost %.4f %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
            result = trainer.test(
                reader=paddle.batch(cluster_data_reader(cluster_test_dir, node_id), batch_size=1024),
                feeding=feeding)
            print "\nTest %d, Cost %6f, %s" % (event.pass_id, result.cost, result.metrics)

trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(cluster_data_reader(cluster_train_dir, node_id), buf_size=102400),
        batch_size=1024),
    event_handler=event_handler,
    feeding=feeding,
    num_passes=20)
cluster_data_reader() reads the data files and yields samples; it works correctly. The trainer is defined like this:
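For reference, `paddle.batch` and `paddle.reader.shuffle` take a reader: a zero-argument callable that returns a fresh iterator of samples every time it is called. The sketch below shows one way a per-node reader like `cluster_data_reader` could be structured. The round-robin file assignment, `num_nodes`, and `parse_line` are assumptions for illustration only, not the author's implementation:

```python
import os


def sharded_file_reader(data_dir, node_id, num_nodes, parse_line):
    """Return a reader: a zero-argument callable that yields samples.

    Hypothetical sharding rule: node `node_id` reads every file whose index
    in the sorted directory listing satisfies index % num_nodes == node_id,
    so the nodes read disjoint sets of files.
    """
    def reader():
        files = sorted(os.listdir(data_dir))
        for i, name in enumerate(files):
            if i % num_nodes != node_id:
                continue  # this file belongs to another node
            with open(os.path.join(data_dir, name)) as f:
                for line in f:
                    line = line.strip()
                    if line:  # skip blank lines
                        yield parse_line(line)
    return reader
```

Because each call to `reader()` re-lists and re-opens the files, the same reader object can be passed to `trainer.train` and iterated again on every pass.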
auc_layer = paddle.layer.slope_intercept(input=inference, name='auc_layer', slope=0.5, intercept=0.5)
cost = paddle.layer.regression_cost(input=auc_layer, label=label)
parameters = paddle.parameters.create(cost)
trainer = paddle.trainer.SGD(
    cost=cost,
    extra_layers=paddle.evaluator.auc(input=auc_layer, label=label),
    parameters=parameters,
    update_equation=paddle.optimizer.Adam(learning_rate=2e-4),
    is_local=False)
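Since `slope_intercept` with slope 0.5 is a strictly increasing map, it does not change how predictions are ranked, and AUC depends only on that ranking. So one way to cross-check the evaluator is to dump some `inference` outputs and labels and compute AUC offline. The sketch below uses the Mann-Whitney (pairwise ranking) formulation in plain Python; `auc` and `slope_intercept` here are hypothetical helpers, not Paddle's evaluator, and labels are assumed to be 0/1:

```python
def auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic; tied pairs count as 0.5.

    O(n_pos * n_neg), which is fine for a spot check on a few thousand samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def slope_intercept(xs, slope=0.5, intercept=0.5):
    """Plain-Python analogue of the affine map applied by auc_layer above."""
    return [slope * x + intercept for x in xs]


# Illustrative values only: the AUC is the same before and after the affine map.
raw = [-0.8, 0.1, 0.4, 0.2, 0.9]
labels = [0, 1, 1, 0, 1]
print(auc(raw, labels))
print(auc(slope_intercept(raw), labels))
```

If the offline AUC on the same predictions is clearly nonzero while the evaluator prints 0.0, the problem lies in how the metric is computed or gathered in cluster mode rather than in the model or the data.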
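With this setup, every time the metrics are printed, the AUC is 0 for both the training set (Pass 0) and the test set (Test 0). What is going wrong? Log output: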
Thu Aug 9 15:04:48 2018[1,0]<stdout>:Pass 0 Batch 399 Cost 0.1073 {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:04:48 2018[1,3]<stdout>:
Thu Aug 9 15:04:48 2018[1,3]<stdout>:Pass 0 Batch 399 Cost 0.1220 {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:04:48 2018[1,2]<stdout>:
Thu Aug 9 15:04:48 2018[1,2]<stdout>:Pass 0 Batch 399 Cost 0.1143 {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:04:48 2018[1,1]<stdout>:
Thu Aug 9 15:04:48 2018[1,1]<stdout>:Pass 0 Batch 399 Cost 0.1155 {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:04:48 2018[1,4]<stdout>:
Thu Aug 9 15:04:48 2018[1,4]<stdout>:Pass 0 Batch 399 Cost 0.1208 {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:17:22 2018[1,2]<stdout>:
Thu Aug 9 15:17:22 2018[1,2]<stdout>:Test 0, Cost 0.114977, {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:17:26 2018[1,0]<stdout>:
Thu Aug 9 15:17:26 2018[1,0]<stdout>:Test 0, Cost 0.114977, {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:17:28 2018[1,4]<stdout>:
Thu Aug 9 15:17:28 2018[1,4]<stdout>:Test 0, Cost 0.114977, {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:17:30 2018[1,3]<stdout>:
Thu Aug 9 15:17:30 2018[1,3]<stdout>:Test 0, Cost 0.114977, {'__auc_evaluator_0__': 0.0}
Thu Aug 9 15:17:40 2018[1,1]<stdout>:
Thu Aug 9 15:17:40 2018[1,1]<stdout>:Test 0, Cost 0.114977, {'__auc_evaluator_0__': 0.0}