PaddlePaddle / models
Opened March 14, 2019 by saxon_zh (@saxon_zh, Guest)

Official example fails in single-machine distributed training (pserver mode)

Created by: qianledan

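When I run run_ps_mode.sh, the distributed training example from the official GitHub repository, I get the output below. (Does the official example need to be modified?)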

qld@ifan-W580-G20:/data/paddle/image_classification/ImageNet_train_v1.3/dist_train$ sh run_ps_mode.sh
qld@ifan-W580-G20:/data/paddle/image_classification/ImageNet_train_v1.3/dist_train$ -----------  Configuration Arguments -----------
async_mode: False
batch_size: 32
checkpoint: None
class_dim: 1000
data_dir: ../data/ILSVRC2012
enable_ce: False
fp16: False
image_shape: 3,224,224
lr: 0.1
lr_strategy: piecewise_decay
model: DistResNet
model_category: models
model_save_dir: output
multi_batch_repeat: 1
num_epochs: 120
num_threads: 8
pretrained_model: None
reduce_strategy: allreduce
scale_loss: 1.0
skip_unbalanced_data: False
split_var: True
start_test_pass: 0
total_images: 1281167
update_method: pserver
use_gpu: True
with_mem_opt: False
------------------------------------------------
----------- Configuration envs -----------
ENV PADDLE_TRAINERS_NUM:2
ENV PADDLE_TRAINER_ID:0
ENV PADDLE_CURRENT_ENDPOINT:127.0.0.1:7160
ENV PADDLE_PSERVER_ENDPOINTS:127.0.0.1:7160,127.0.0.1:7161
ENV PADDLE_TRAINING_ROLE:TRAINER
------------------------------------------------
-----------  Configuration Arguments -----------
async_mode: False
batch_size: 32
checkpoint: None
class_dim: 1000
data_dir: ../data/ILSVRC2012
enable_ce: False
fp16: False
image_shape: 3,224,224
lr: 0.1
lr_strategy: piecewise_decay
model: DistResNet
model_category: models
model_save_dir: output
multi_batch_repeat: 1
num_epochs: 120
num_threads: 8
pretrained_model: None
reduce_strategy: allreduce
scale_loss: 1.0
skip_unbalanced_data: False
split_var: True
start_test_pass: 0
total_images: 1281167
update_method: pserver
use_gpu: True
with_mem_opt: False
------------------------------------------------
----------- Configuration envs -----------
ENV PADDLE_TRAINERS_NUM:2
ENV PADDLE_CURRENT_ENDPOINT:127.0.0.1:7160
ENV PADDLE_PSERVER_ENDPOINTS:127.0.0.1:7160,127.0.0.1:7161
ENV PADDLE_TRAINING_ROLE:PSERVER
------------------------------------------------
-----------  Configuration Arguments -----------
async_mode: False
batch_size: 32
checkpoint: None
class_dim: 1000
data_dir: ../data/ILSVRC2012
enable_ce: False
fp16: False
image_shape: 3,224,224
lr: 0.1
lr_strategy: piecewise_decay
model: DistResNet
model_category: models
model_save_dir: output
multi_batch_repeat: 1
num_epochs: 120
num_threads: 8
pretrained_model: None
reduce_strategy: allreduce
scale_loss: 1.0
skip_unbalanced_data: False
split_var: True
start_test_pass: 0
total_images: 1281167
update_method: pserver
use_gpu: True
with_mem_opt: False
------------------------------------------------
----------- Configuration envs -----------
ENV PADDLE_TRAINERS_NUM:2
ENV PADDLE_TRAINER_ID:1
ENV PADDLE_CURRENT_ENDPOINT:127.0.0.1:7161
ENV PADDLE_PSERVER_ENDPOINTS:127.0.0.1:7160,127.0.0.1:7161
ENV PADDLE_TRAINING_ROLE:TRAINER
------------------------------------------------
-----------  Configuration Arguments -----------
async_mode: False
batch_size: 32
checkpoint: None
class_dim: 1000
data_dir: ../data/ILSVRC2012
enable_ce: False
fp16: False
image_shape: 3,224,224
lr: 0.1
lr_strategy: piecewise_decay
model: DistResNet
model_category: models
model_save_dir: output
multi_batch_repeat: 1
num_epochs: 120
num_threads: 8
pretrained_model: None
reduce_strategy: allreduce
scale_loss: 1.0
skip_unbalanced_data: False
split_var: True
start_test_pass: 0
total_images: 1281167
update_method: pserver
use_gpu: True
with_mem_opt: False
------------------------------------------------
----------- Configuration envs -----------
ENV PADDLE_TRAINERS_NUM:2
ENV PADDLE_CURRENT_ENDPOINT:127.0.0.1:7161
ENV PADDLE_PSERVER_ENDPOINTS:127.0.0.1:7160,127.0.0.1:7161
ENV PADDLE_TRAINING_ROLE:PSERVER
------------------------------------------------
start lr: 0.1, end lr: 0.2, decay boundaries: [600570, 1201140, 1601520]
start lr: 0.1, end lr: 0.2, decay boundaries: [600570, 1201140, 1601520]
start lr: 0.1, end lr: 0.2, decay boundaries: [600570, 1201140, 1601520]
start lr: 0.1, end lr: 0.2, decay boundaries: [600570, 1201140, 1601520]
get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
I0314 20:58:22.185073 36189 grpc_server.cc:430] Server listening on 127.0.0.1:7161 selected port: 7161
I0314 20:58:23.730594 36240 grpc_server.cc:430] Server listening on 127.0.0.1:7160 selected port: 7160
W0314 20:58:24.545244 35502 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 37, Driver API Version: 9.0, Runtime APIVersion: 9.0
W0314 20:58:24.545289 35502 device_context.cc:271] device: 0, cuDNN Version: 7.0.
W0314 20:58:24.618002 35505 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 37, Driver API Version: 9.0, Runtime APIVersion: 9.0
W0314 20:58:24.618052 35505 device_context.cc:271] device: 0, cuDNN Version: 7.0.
read images from 0, length: 640583, lines length: 640583, total: 1281167
F0314 21:01:24.640844 36401 grpc_client.cc:408] GetRPC name:[conv2d_10.w_0.block1], ep:[127.0.0.1:7161], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
*** Check failure stack trace: ***
    @     0x7f55771013fd  google::LogMessage::Fail()
    @     0x7f5577104eac  google::LogMessage::SendToLog()
    @     0x7f5577100f23  google::LogMessage::Flush()
    @     0x7f55771063be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f55785c82da  paddle::operators::distributed::GRPCClient::Proceed()
    @     0x7f5619507c80  (unknown)
    @     0x7f561bce86ba  start_thread
    @     0x7f561b30e41d  clone
    @              (nil)  (unknown)
F0314 21:01:26.682222 36295 grpc_client.cc:408] SendRPC name:[fc_0.b_0@GRAD.trainer_0], ep:[127.0.0.1:7160], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
*** Check failure stack trace: ***
    @     0x7fc9368dd3fd  google::LogMessage::Fail()
    @     0x7fc9368e0eac  google::LogMessage::SendToLog()
    @     0x7fc9368dcf23  google::LogMessage::Flush()
    @     0x7fc9368e23be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fc937da42da  paddle::operators::distributed::GRPCClient::Proceed()
    @     0x7fc9d8ce3c80  (unknown)
    @     0x7fc9db4c46ba  start_thread
    @     0x7fc9daaea41d  clone
    @              (nil)  (unknown)
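
The timestamps in the log above suggest how long the trainers waited before the first RPC failed. The following is a minimal sketch for measuring that gap; the two log lines are copied from the output above, and the helper names `glog_time` and `seconds_between` are hypothetical. The year is not printed by glog and is taken from the issue date.

```python
from datetime import datetime

# Two lines copied from the log above: the last startup message before the
# trainers go quiet, and the first fatal gRPC error.
LAST_STARTUP = "W0314 20:58:24.618052 35505 device_context.cc:271] device: 0, cuDNN Version: 7.0."
FIRST_FATAL = "F0314 21:01:24.640844 36401 grpc_client.cc:408] GetRPC name:[conv2d_10.w_0.block1]"


def glog_time(line, year=2019):
    """Parse the glog prefix '<level>MMDD HH:MM:SS.ffffff' into a datetime.

    glog omits the year; 2019 is an assumption taken from the issue date.
    """
    month_day, clock = line.split()[:2]
    # month_day looks like 'W0314'; drop the severity letter.
    return datetime.strptime(f"{year}{month_day[1:]} {clock}", "%Y%m%d %H:%M:%S.%f")


def seconds_between(earlier, later):
    """Return the elapsed seconds between two glog-formatted lines."""
    return (glog_time(later) - glog_time(earlier)).total_seconds()


elapsed = seconds_between(LAST_STARTUP, FIRST_FATAL)
print(f"{elapsed:.3f} s between startup and the first Deadline Exceeded")
```

The gap comes out at almost exactly three minutes, which looks like a fixed RPC timeout expiring rather than an immediate connection failure. As an assumption to verify against the installed Paddle version: Fluid releases of this period read the RPC timeout from the `FLAGS_rpc_deadline` environment variable (in milliseconds), so a slow first step on the trainers or pservers may be what triggers the error.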
Reference: paddlepaddle/models#1896