PaddlePaddle / Paddle
Opened Jul 20, 2017 by saxon_zh (@saxon_zh, Guest)

MPI job: works on a single node, but the Evaluator fails on multiple nodes

Created by: CDDB

Background: remote-sparse-updater is enabled, with momentum. As a cluster job it runs normally with node=1, for example: http://yq01-hpc-wutai01-mon.dmop.baidu.com:8090/job/i-40410/

With node=10, one of two errors occurs:

1. Only the AUCEvaluator is enabled, but the cost and the evaluator value are both printed as 0. Job: http://yq01-hpc-wutai01-mon.dmop.baidu.com:8090/job/i-40431/ Config:

input_0 = data_layer(name='input_fea_0', size=input_dim_0)
input_1 = data_layer(name='input_fea_1', size=input_dim_1)
input_2 = data_layer(name='input_fea_2', size=input_dim_2)
input_3 = data_layer(name='input_fea_3', size=input_dim_3)
input_4 = data_layer(name='input_fea_4', size=input_dim_4)
input_5 = data_layer(name='input_fea_5', size=input_dim_5)
input_6 = data_layer(name='input_fea_6', size=input_dim_6)
input_7 = data_layer(name='input_fea_7', size=input_dim_7)
input_8 = data_layer(name='input_fea_8', size=input_dim_8)
input_9 = data_layer(name='input_fea_9', size=input_dim_9)
input_10 = data_layer(name='input_fea_10', size=input_dim_10)
input_11 = data_layer(name='input_fea_11', size=input_dim_11)
input_12 = data_layer(name='input_fea_12', size=input_dim_12)
input_13 = data_layer(name='input_fea_13', size=input_dim_13)

mask_input_0 = data_layer(name='mask_layer_0', size=mask_dim)
mask_input_2 = data_layer(name='mask_layer_2', size=mask_dim)
mask_input_3 = data_layer(name='mask_layer_3', size=mask_dim)

label = data_layer(name="label", size=num_classes)

hidden_0 = fc_layer(input=input_0, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_1 = fc_layer(input=input_1, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_2 = fc_layer(input=input_2, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_3 = fc_layer(input=input_3, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_4 = fc_layer(input=input_4, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_5 = fc_layer(input=input_5, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_6 = fc_layer(input=input_6, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_7 = fc_layer(input=input_7, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_8 = fc_layer(input=input_8, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_9 = fc_layer(input=input_9, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_10 = fc_layer(input=input_10, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_11 = fc_layer(input=input_11, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_12 = fc_layer(input=input_12, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))
hidden_13 = fc_layer(input=input_13, size=8, act=ReluActivation(), param_attr=ParamAttr(sparse_update=True), layer_attr=ExtraLayerAttribute(drop_rate=0.3))

hidden = fc_layer(input=[hidden_0, hidden_1, hidden_2, hidden_3, hidden_4, hidden_5,hidden_6,hidden_7,hidden_8,hidden_9,hidden_10,hidden_11,hidden_12,hidden_13], size=256, act=ReluActivation())
hidden = fc_layer(input=hidden, size=128, act=ReluActivation())
hidden = fc_layer(input=hidden, size=64, act=ReluActivation())
hidden = fc_layer(input=hidden, size=32, act=ReluActivation())
user_hidden = fc_layer(input=hidden, size=16, act=ReluActivation())

customer_hidden_0 = fc_layer(input=user_hidden, size=16, act=ReluActivation())
customer_hidden_2 = fc_layer(input=user_hidden, size=16, act=ReluActivation())
customer_hidden_3 = fc_layer(input=user_hidden, size=16, act=ReluActivation())

customer_hidden_0 = fc_layer(input=customer_hidden_0, size=16, act=ReluActivation())
customer_hidden_2 = fc_layer(input=customer_hidden_2, size=16, act=ReluActivation())
customer_hidden_3 = fc_layer(input=customer_hidden_3, size=16, act=ReluActivation())

with mixed_layer() as mix0:
    mix0 += dotmul_operator(a=customer_hidden_0, b=mask_input_0)
with mixed_layer() as mix2:
    mix2 += dotmul_operator(a=customer_hidden_2, b=mask_input_2)
with mixed_layer() as mix3:
    mix3 += dotmul_operator(a=customer_hidden_3, b=mask_input_3)

if not is_predict:
    prediction = fc_layer(input=[mix0, mix2, mix3], size=num_classes, act=SoftmaxActivation())
    cost = cross_entropy(prediction, label)
    eval = auc_evaluator(prediction, label)
    outputs(cost)

Log:

I0719 19:10:00.669837 15532 TrainerInternal.cpp:165]  Batch=800 samples=800000 AvgCost=0 CurrentCost=0 Eval: __auc_evaluator_0__=0  CurrentEval: __auc_evaluator_0__=0 
I0719 19:10:44.259759 15532 TrainerInternal.cpp:181]  Pass=4 Batch=961 samples=960087 AvgCost=0 Eval: __auc_evaluator_0__=0 
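
To spot this symptom across a long multi-node log, the trainer lines can be scanned for records where every reported metric is exactly 0. The following is only a rough sketch: the helper names (`parse_pairs`, `looks_degenerate`) and the `COUNTERS` set are made up for illustration, and it assumes the log keeps the `key=value` layout shown above.

```python
import re

# key=value tokens such as "AvgCost=0" or "__auc_evaluator_0__=0".
PAIR = re.compile(r"(\w+)=(\S+)")

# Fields that count progress rather than measure quality (assumed from the log above).
COUNTERS = {"Pass", "Batch", "samples"}


def parse_pairs(line):
    """Return (name, value) tuples for each numeric key=value token, in order."""
    pairs = []
    for name, raw in PAIR.findall(line):
        try:
            pairs.append((name, float(raw)))
        except ValueError:
            continue  # skip tokens whose value is not a number
    return pairs


def looks_degenerate(line):
    """True when the line reports metrics and every one of them is exactly 0."""
    metrics = [value for name, value in parse_pairs(line) if name not in COUNTERS]
    return bool(metrics) and all(value == 0.0 for value in metrics)


if __name__ == "__main__":
    sample = (
        "I0719 19:10:00.669837 15532 TrainerInternal.cpp:165]  Batch=800 "
        "samples=800000 AvgCost=0 CurrentCost=0 Eval: __auc_evaluator_0__=0  "
        "CurrentEval: __auc_evaluator_0__=0 "
    )
    print(looks_degenerate(sample))
```

Comparing the flagged lines of the node=10 job against the node=1 job would show whether the zeros appear from the first batch or only later.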

2. With PrecisionRecallEvaluator enabled, the job crashes; removing PrecisionRecallEvaluator lets it run normally. Job: http://yq01-hpc-wutai01-mon.dmop.baidu.com:8090/job/i-40417/ Compared with the config above, the following was added:

    for i in range(num_classes):
        precision_recall_evaluator(name="PreRec of label [{0}]".format(i), input=prediction, label=label, positive_label=i)

Log:

Wed Jul 19 14:24:21 2017[1,1]<stderr>:F0719 14:24:21.605953 25960 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,2]<stderr>:F0719 14:24:21.606295  1447 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,2]<stderr>:*** Check failure stack trace: ***
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x91316d  google::LogMessage::Fail()
Wed Jul 19 14:24:21 2017[1,3]<stderr>:F0719 14:24:21.611981 24429 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,7]<stderr>:F0719 14:24:21.605876 29111 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,9]<stderr>:F0719 14:24:21.612742  4733 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,9]<stderr>:*** Check failure stack trace: ***
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x916c1c  google::LogMessage::SendToLog()
Wed Jul 19 14:24:21 2017[1,4]<stderr>:F0719 14:24:21.607393 24523 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,4]<stderr>:*** Check failure stack trace: ***
Wed Jul 19 14:24:21 2017[1,5]<stderr>:F0719 14:24:21.607707  3091 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x91316d  google::LogMessage::Fail()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:F0719 14:24:21.607941 11448 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x912c93  google::LogMessage::Flush()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:*** Check failure stack trace: ***
Wed Jul 19 14:24:21 2017[1,4]<stderr>:    @           0x91316d  google::LogMessage::Fail()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x916c1c  google::LogMessage::SendToLog()
Wed Jul 19 14:24:21 2017[1,0]<stderr>:F0719 14:24:21.607136  1839 Evaluator.cpp:825] Check failed: label >= 0 && label < (int)statsInfo_.size() positive_label [0] should be in range [0, 0)
Wed Jul 19 14:24:21 2017[1,4]<stderr>:    @           0x916c1c  google::LogMessage::SendToLog()
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x91812e  google::LogMessageFatal::~LogMessageFatal()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:    @           0x91316d  google::LogMessage::Fail()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x912c93  google::LogMessage::Flush()
Wed Jul 19 14:24:21 2017[1,4]<stderr>:    @           0x912c93  google::LogMessage::Flush()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:    @           0x916c1c  google::LogMessage::SendToLog()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x91812e  google::LogMessageFatal::~LogMessageFatal()
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x6b6bda  paddle::PrecisionRecallEvaluator::getStatsInfo()
Wed Jul 19 14:24:21 2017[1,4]<stderr>:    @           0x91812e  google::LogMessageFatal::~LogMessageFatal()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:    @           0x912c93  google::LogMessage::Flush()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x6b6bda  paddle::PrecisionRecallEvaluator::getStatsInfo()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:    @           0x91812e  google::LogMessageFatal::~LogMessageFatal()
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x6b6ef3  paddle::PrecisionRecallEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,4]<stderr>:    @           0x6b6bda  paddle::PrecisionRecallEvaluator::getStatsInfo()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x6b6ef3  paddle::PrecisionRecallEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x6a28e0  paddle::CombinedEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:    @           0x6b6bda  paddle::PrecisionRecallEvaluator::getStatsInfo()
Wed Jul 19 14:24:21 2017[1,4]<stderr>:    @           0x6b6ef3  paddle::PrecisionRecallEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x6a28e0  paddle::CombinedEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x74c444  paddle::TrainerInternal::trainOneBatch()
Wed Jul 19 14:24:21 2017[1,4]<stderr>:    @           0x6a28e0  paddle::CombinedEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:    @           0x6b6ef3  paddle::PrecisionRecallEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x74c444  paddle::TrainerInternal::trainOneBatch()
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x749616  paddle::Trainer::trainOneDataBatch()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:    @           0x6a28e0  paddle::CombinedEvaluator::printStats()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x749616  paddle::Trainer::trainOneDataBatch()
Wed Jul 19 14:24:21 2017[1,8]<stderr>:    @           0x74c444  paddle::TrainerInternal::trainOneBatch()
Wed Jul 19 14:24:21 2017[1,2]<stderr>:    @           0x749abd  paddle::Trainer::trainOnePass()
Wed Jul 19 14:24:21 2017[1,9]<stderr>:    @           0x749abd  paddle::Trainer::trainOnePass()
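
The stderr above is interleaved across MPI ranks, so no single stack trace reads top to bottom. The sketch below groups lines by rank and collects the distinct fatal messages. It assumes each line carries a `[jobid,rank]<stream>:` tag and that fatal lines follow the glog layout seen in this log; the function names are hypothetical. In the log above every rank reports the same check, and the printed range `[0, 0)` suggests that `statsInfo_.size()` was 0 when `printStats()` ran.

```python
import re
from collections import defaultdict

# "[1,2]<stderr>:" tag as it appears in the log: [jobid,rank]<stream>:payload
TAG = re.compile(r"\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")

# glog fatal line: "F0719 14:24:21.606295  1447 Evaluator.cpp:825] message"
FATAL = re.compile(r"^F\d{4} \S+\s+\d+ (\S+:\d+)\] (.*)$")


def group_by_rank(lines):
    """Map rank -> list of payload lines, preserving per-rank order."""
    by_rank = defaultdict(list)
    for line in lines:
        match = TAG.search(line)
        if match:
            by_rank[int(match.group(2))].append(match.group(4))
    return dict(by_rank)


def fatal_messages(lines):
    """Return the set of (source location, message) pairs from fatal lines."""
    found = set()
    for payloads in group_by_rank(lines).values():
        for payload in payloads:
            match = FATAL.match(payload)
            if match:
                found.add((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python group_ranks.py < job_stderr.log
    lines = sys.stdin.read().splitlines()
    for rank, payloads in sorted(group_by_rank(lines).items()):
        print("rank %d: %d lines" % (rank, len(payloads)))
    for location, message in sorted(fatal_messages(lines)):
        print(location, message)
```

If `fatal_messages` returns a single entry, all ranks failed at the same check, which points at the evaluator state rather than at one bad node.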
Reference: paddlepaddle/Paddle#2987