Opened Nov 23, 2017 by saxon_zh (@saxon_zh)

Check failed: len >= 0

Created by: Bella-Zhao

MPI training fails with the following error:

Thu Nov 23 09:57:03 2017[1,8]<stderr>:*** Aborted at 1511402223 (unix time) try "date -d @1511402223" if you are using GNU date ***
Thu Nov 23 09:57:03 2017[1,1]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:03 2017[1,1]<stderr>:PC: @                0x0 (unknown)
Thu Nov 23 09:57:03 2017[1,1]<stderr>:*** SIGSEGV (@0x8) received by PID 18318 (TID 0x7f42197fb700) from PID 8; stack trace: ***
Thu Nov 23 09:57:03 2017[1,8]<stderr>:F1123 09:57:03.537968 11345 SocketChannel.cpp:54] Check failed: len >= 0  peer=10.86.20.35: Connection reset by peer [104]
Thu Nov 23 09:57:03 2017[1,8]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e43f50160 (unknown)
Thu Nov 23 09:57:03 2017[1,8]<stderr>:PC: @                0x0 (unknown)
Thu Nov 23 09:57:03 2017[1,8]<stderr>:*** SIGSEGV (@0x8) received by PID 5775 (TID 0x7f49fd782700) from PID 8; stack trace: ***
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55e10ce160 (unknown)
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db2d627d  google::LogMessage::Fail()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db12f012 paddle::ProtoClient::recv()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db2d9d2c  google::LogMessage::SendToLog()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55dbdba17a paddle::ParameterClient2::sendParallel()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db2d5da3  google::LogMessage::Flush()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db2d5fa9  google::LogMessage::~LogMessage()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db2360ec _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db2d9257  google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55da28c8a0 execute_native_thread_routine
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3e15827d  google::LogMessage::Fail()
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3dfb1012 paddle::ProtoClient::recv()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db12dbd1  paddle::SocketChannel::read()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55e10c61c3 start_thread
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3ec3c17a paddle::ParameterClient2::sendParallel()
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3e15bd2c  google::LogMessage::SendToLog()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db12e170  paddle::SocketChannel::readMessage()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55e06ee12d __clone
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3e157da3  google::LogMessage::Flush()
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @     0x7f55db12f006  paddle::ProtoClient::recv()
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3e0b80ec _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Thu Nov 23 09:57:03 2017[1,8]<stderr>:    @                0x0 (unknown)
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3d10e8a0 execute_native_thread_routine
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e43f481c3 start_thread
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3e157fa9  google::LogMessage::~LogMessage()
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e4357012d __clone
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @     0x7f4e3e15b257  google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:03 2017[1,1]<stderr>:    @                0x0 (unknown)
Thu Nov 23 09:57:09 2017[1,12]<stderr>:F1123 09:57:09.341764 27629 SocketChannel.cpp:101] Check failed: len > 0  peer=10.86.20.34 curIov=1024 iovCnt=12730 iovs[curIov].base=0x60223800 iovs[curIov].iov_len=256: Broken pipe [32]
Thu Nov 23 09:57:09 2017[1,12]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:09 2017[1,12]<stderr>:F1123 09:57:09.337268 27602 SocketChannel.cpp:101] Check failed: len > 0  peer=10.86.20.35 curIov=1024 iovCnt=12723 iovs[curIov].base=0x6020d100 iovs[curIov].iov_len=256: Broken pipe [32]
Thu Nov 23 09:57:09 2017[1,12]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:09 2017[1,12]<stderr>:F1123 09:57:09.341764 27629 SocketChannel.cpp:101] Check failed: len > 0  peer=10.86.20.34 curIov=1024 iovCnt=12730 iovs[curIov].base=0x60223800 iovs[curIov].iov_len=256: Broken pipe [32]F1123 09:57:09.343297 27605 SocketChannel.cpp:101] Check failed: len > 0  peer=10.86.20.38 curIov=1024 iovCnt=12752 iovs[curIov].base=0x60212800 iovs[curIov].iov_len=256: Broken pipe [32]
Thu Nov 23 09:57:09 2017[1,12]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862c27d  google::LogMessage::Fail()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862c27d  google::LogMessage::Fail()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862c27d  google::LogMessage::Fail()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862fd2c  google::LogMessage::SendToLog()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862fd2c  google::LogMessage::SendToLog()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862fd2c  google::LogMessage::SendToLog()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862bda3  google::LogMessage::Flush()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862bda3  google::LogMessage::Flush()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862bda3  google::LogMessage::Flush()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862bfa9  google::LogMessage::~LogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862f257  google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862bfa9  google::LogMessage::~LogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c84839bc  paddle::readwritev<>()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862bfa9  google::LogMessage::~LogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862f257  google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c862f257  google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c84839bc  paddle::readwritev<>()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c84839bc  paddle::readwritev<>()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c84849cd  paddle::SocketChannel::writeMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c848581c  paddle::ProtoClient::send()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c84849cd  paddle::SocketChannel::writeMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c84849cd  paddle::SocketChannel::writeMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c848581c  paddle::ProtoClient::send()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c911007a  paddle::ParameterClient2::sendParallel()
Thu Nov 23 09:57:09 2017[1,12]<stderr>:    @     0x7f63c848581c  paddle::ProtoClient::send()
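The log above interleaves output from several MPI ranks (1, 8, and 12), which makes the individual stack traces hard to follow. Below is a minimal sketch for separating the lines by rank and listing the fatal `Check failed` records. It assumes only the `[job,rank]<stderr>:` prefix and the glog line layout visible in this log; `group_by_rank` and `fatal_checks` are hypothetical helper names, not part of Paddle or mpirun.

```python
import re
from collections import defaultdict

# Prefix format assumed from the log above:
#   "Thu Nov 23 09:57:03 2017[1,8]<stderr>:<message>"
PREFIX = re.compile(
    r"^(?P<ts>.+?)\[(?P<job>\d+),(?P<rank>\d+)\]<(?P<stream>stdout|stderr)>:(?P<msg>.*)$"
)

# glog fatal line layout assumed from the log above, e.g.
#   "F1123 09:57:03.537968 11345 SocketChannel.cpp:54] Check failed: len >= 0  peer=...: <error> [<errno>]"
FATAL = re.compile(
    r"F\d{4} [\d:.]+\s+\d+ (?P<file>[\w.]+):(?P<line>\d+)\] "
    r"Check failed: (?P<cond>.+?)\s+peer=(?P<peer>[\d.]+).*?: "
    r"(?P<err>[^\[\]:]+) \[(?P<errno>\d+)\]"
)


def group_by_rank(lines):
    """Map each MPI rank to its messages, in the order they appear."""
    by_rank = defaultdict(list)
    for line in lines:
        m = PREFIX.match(line.rstrip("\n"))
        if m is None:
            continue  # skip lines without the rank prefix
        by_rank[int(m.group("rank"))].append(m.group("msg"))
    return dict(by_rank)


def fatal_checks(lines):
    """Return (rank, file, line, condition, peer, error, errno) tuples, ordered by rank."""
    records = []
    for rank, msgs in sorted(group_by_rank(lines).items()):
        for msg in msgs:
            # finditer also handles two fatal records fused onto one line
            for m in FATAL.finditer(msg):
                records.append((
                    rank, m.group("file"), int(m.group("line")),
                    m.group("cond"), m.group("peer"),
                    m.group("err"), int(m.group("errno")),
                ))
    return records
```

Calling `fatal_checks(open("train.log"))` on a saved copy of the log (the file name is a placeholder) would list which rank lost its connection to which peer. In the excerpt above, rank 8 fails while reading (`SocketChannel.cpp:54`, connection reset by peer) and rank 12 fails while writing (`SocketChannel.cpp:101`, broken pipe).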

The relevant job configuration is as follows:

paddle.init(use_gpu=False,
            trainer_count=int(os.getenv("PADDLE_TRAINER_COUNT", "1")),
            port=int(os.getenv("PADDLE_PORT", "7164")),
            ports_num=int(os.getenv("PADDLE_PORTS_NUM", "1")),
            num_gradient_servers=int(os.getenv("PADDLE_NUM_GRADIENT_SERVERS", "1")),
            trainer_id=int(os.getenv("PADDLE_TRAINER_ID", "0")),
            pservers=os.getenv("PADDLE_PSERVERS", "127.0.0.1"),
            ports_num_for_sparse=int(os.getenv('PADDLE_PORTS_NUM_FOR_SPARSE', "1")))
paddle cluster_train \
  --config=train_mpi.py \
  --time_limit=50:00:00 \
  --submitter=zhaoyijin \
  --num_nodes=15 \
  --job_priority=normal \
  --fs_name=hdfs://...omitted \
  --fs_ugi=weigou,123abc \
  --num_passes=14 \
  --train_data_path=/....omitted/train \
  --test_data_path=/....omitted/test \
  --output_path=/....omitted \
  --thirdparty=./my_thirdparty \
  --where=...omitted \
  --job_name=paddle_dssm_zhaoyijin \
  --ports_num_for_sparse=1 \
  --use_remote_sparse=1
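The `paddle.init` call above reads every cluster setting from an environment variable with a single-node default, so a trainer whose environment lacks a variable would silently fall back to a value such as `127.0.0.1` for the parameter servers. The sketch below reports which variables fell back to their defaults on a given node. It assumes only the variable names and defaults shown in that call; `read_cluster_env` is a hypothetical helper and not part of Paddle. Whether a missing variable is related to the failure reported here is not established.

```python
import os

# Variable names and defaults copied from the paddle.init call above.
DEFAULTS = {
    "PADDLE_TRAINER_COUNT": "1",
    "PADDLE_PORT": "7164",
    "PADDLE_PORTS_NUM": "1",
    "PADDLE_NUM_GRADIENT_SERVERS": "1",
    "PADDLE_TRAINER_ID": "0",
    "PADDLE_PSERVERS": "127.0.0.1",
    "PADDLE_PORTS_NUM_FOR_SPARSE": "1",
}


def read_cluster_env(environ=None):
    """Return (settings, missing): resolved values and the names that used a default."""
    environ = os.environ if environ is None else environ
    settings, missing = {}, []
    for name, default in DEFAULTS.items():
        if name in environ:
            settings[name] = environ[name]
        else:
            settings[name] = default
            missing.append(name)
    return settings, missing


if __name__ == "__main__":
    settings, missing = read_cluster_env()
    for name in sorted(settings):
        flag = " (default)" if name in missing else ""
        print(f"{name}={settings[name]}{flag}")
```

Running this at the top of `train_mpi.py` on each node and comparing the output across trainers is one way to confirm that all of them see the same parameter server list and port settings.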
Reference: paddlepaddle/Paddle#5857