PaddlePaddle / Paddle · Issue #16216
Opened March 15, 2019 by saxon_zh (Guest)

Creating a distributed job fails in a k8s cluster when ports are exposed through a Service

Created by: liuliu4348

  • Version and environment info:
    1) PaddlePaddle version: 0.14
    2) CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
    3) System environment: Kubernetes v1.10.7, image: hub.baidubce.com/paddlepaddle/paddle:0.14.0
  • Training info:
    1) Multi-node, multi-card
    2) Operator info
  • Steps to reproduce: In k8s, first start 4 Services exposing port 6174, then start the 4 corresponding pods, two as PSERVER and two as TRAINER; each pod's service is reachable through its Service's clusterIP and port. Write the corresponding environment variables into each pod; for the first PSERVER pod, for example (a hypothetical layout for all four pods is sketched below):
    PADDLE_TRAINING_ROLE=PSERVER
    PADDLE_PSERVER_IPS=10.104.213.246,10.97.10.70
    PADDLE_PSERVER_PORT=6174
    PADDLE_TRAINERS=2
    PADDLE_CURRENT_IP=10.104.213.246
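For reference, a hypothetical per-pod environment layout, inferred from the first PSERVER pod above and from the variables the demo script reads (PADDLE_TRAINER_ID does not appear in the report, but the script reads it, so the TRAINER pods presumably set it):

```python
# Hypothetical layout; only the first PSERVER pod's values are given in the
# report, the rest are inferred. PADDLE_PSERVER_IPS holds the Service clusterIPs.
common = {
    "PADDLE_PSERVER_IPS": "10.104.213.246,10.97.10.70",
    "PADDLE_PSERVER_PORT": "6174",
    "PADDLE_TRAINERS": "2",
}
pods = [
    dict(common, PADDLE_TRAINING_ROLE="PSERVER", PADDLE_CURRENT_IP="10.104.213.246"),
    dict(common, PADDLE_TRAINING_ROLE="PSERVER", PADDLE_CURRENT_IP="10.97.10.70"),
    dict(common, PADDLE_TRAINING_ROLE="TRAINER", PADDLE_TRAINER_ID="0"),
    dict(common, PADDLE_TRAINING_ROLE="TRAINER", PADDLE_TRAINER_ID="1"),
]
```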

The code is the demo example from the official site:

```python
import os
import sys, getopt

import paddle
import paddle.fluid as fluid
from visualdl import LogWriter

EPOCH_NUM = 30
BATCH_SIZE = 8

train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.uci_housing.train(), buf_size=500),
    batch_size=BATCH_SIZE)


def train(save_dir, log_dir):
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)

    logwriter = LogWriter(log_dir, sync_cycle=10)
    with logwriter.mode("train") as writer:
        loss_scalar = writer.scalar("loss")

    loss = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_loss = fluid.layers.mean(loss)
    opt = fluid.optimizer.SGD(learning_rate=0.001)
    opt.minimize(avg_loss)

    # use CPU
    place = fluid.CPUPlace()
    # use GPU
    # place = fluid.CUDAPlace(0)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)

    # fetch the distributed training environment settings
    training_role = os.getenv("PADDLE_TRAINING_ROLE", None)
    port = os.getenv("PADDLE_PSERVER_PORT", "6174")
    pserver_ips = os.getenv("PADDLE_PSERVER_IPS", "")
    trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
    eplist = []
    for ip in pserver_ips.split(","):
        eplist.append(':'.join([ip, port]))
    pserver_endpoints = ",".join(eplist)
    trainers = int(os.getenv("PADDLE_TRAINERS"))
    current_endpoint = os.getenv("PADDLE_CURRENT_IP", "") + ":" + port

    t = fluid.DistributeTranspiler()
    t.transpile(
        trainer_id=trainer_id,
        pservers=pserver_endpoints,
        trainers=trainers)

    if training_role == "PSERVER":
        pserver_prog = t.get_pserver_program(current_endpoint)
        startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
        exe.run(startup_prog)
        exe.run(pserver_prog)
    elif training_role == "TRAINER":
        trainer_prog = t.get_trainer_program()
        exe.run(fluid.default_startup_program())

        for epoch in range(EPOCH_NUM):
            for batch_id, batch_data in enumerate(train_reader()):
                avg_loss_value, = exe.run(trainer_prog,
                                          feed=feeder.feed(batch_data),
                                          fetch_list=[avg_loss])
                loss_scalar.add_record(batch_id + epoch * 50, avg_loss_value[0])
                if (batch_id + 1) % 10 == 0:
                    print("Epoch: {0}, Batch: {1}, loss: {2}".format(
                        epoch, batch_id, avg_loss_value[0]))
        # destroy the resources of the current trainer node on the pserver nodes
        if trainer_id == 0:
            fluid.io.save_params(executor=exe, dirname=save_dir,
                                 main_program=trainer_prog)
        exe.close()
    else:
        raise AssertionError(
            "PADDLE_TRAINING_ROLE should be one of [TRAINER, PSERVER]")


opts, args = getopt.getopt(sys.argv[1:], "hs:l:", ["save_dir=", "log_dir="])
save_dir = "./output"
log_dir = "./log"
for op, value in opts:
    if op == "-h":
        print('test.py -s <save_dir>')
        sys.exit()
    elif op in ("-s", "--save_dir"):
        save_dir = value
    elif op in ("-l", "--log_dir"):
        log_dir = value
train(save_dir, log_dir)
```
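Given the environment above, the endpoint strings the script assembles work out as follows (a quick check of the assembly logic with the reported values; the comments note what the first PSERVER presumably does with them):

```python
# Reproduces the endpoint assembly from the script, using the reported values.
pserver_ips = "10.104.213.246,10.97.10.70"
port = "6174"
eplist = [":".join([ip, port]) for ip in pserver_ips.split(",")]
pserver_endpoints = ",".join(eplist)
print(pserver_endpoints)  # 10.104.213.246:6174,10.97.10.70:6174

current_endpoint = "10.104.213.246" + ":" + port
print(current_endpoint)   # 10.104.213.246:6174 -- the address the first PSERVER
                          # passes to get_pserver_program(), and presumably the
                          # one its gRPC server tries to bind (see the error below).
```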

  • Problem description: Running the first PSERVER fails with:

```
block map: {'fc_0.b_0': [(0L, 1L)], 'fc_0.w_0': [(0L, 13L)]}
block map: {'fc_0.w_0@GRAD': [(0L, 13L)], 'fc_0.b_0@GRAD': [(0L, 1L)]}
E0315 02:21:11.156191433 77 server_chttp2.cc:38] {"created":"@1552616471.156142167","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":305,"referenced_errors":[{"created":"@1552616471.156129936","description":"Unable to configure socket","fd":10,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":202,"referenced_errors":[{"created":"@1552616471.156123980","description":"OS Error","errno":99,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":175,"os_error":"Cannot assign requested address","syscall":"bind"}]}]}
*** Aborted at 1552616471 (unix time) try "date -d @1552616471" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x50) received by PID 25 (TID 0x7fb41cb84700) from PID 80; stack trace: ***
    @ 0x7fb5816ca390 (unknown)
    @ 0x7fb53259bf2e grpc::ServerInterface::RegisteredAsyncRequest::IssueRequest()
    @ 0x7fb53255060f paddle::operators::distributed::AsyncGRPCServer::TryToRegisterNewOne()
    @ 0x7fb532550f4e paddle::operators::distributed::AsyncGRPCServer::StartServer()
    @ 0x7fb53247c364 paddle::operators::RunServer()
    @ 0x7fb532481587 std::thread::_Impl<>::_M_run()
    @ 0x7fb54053fc80 (unknown)
    @ 0x7fb5816c06ba start_thread
    @ 0x7fb5813f641d clone
    @ 0x0 (unknown)
Segmentation fault (core dumped)
```

Note: launching a distributed job on k8s in the same way via paddle_trainer does succeed.
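The failing syscall in the log is bind with errno 99 ("Cannot assign requested address"). A minimal standalone sketch to try the same bind from inside a pod, assuming the pserver binds PADDLE_CURRENT_IP:PADDLE_PSERVER_PORT; here PADDLE_CURRENT_IP is a Service clusterIP, which is normally not assigned to any interface inside the pod:

```python
import os
import socket

# Hypothetical diagnostic: bind() succeeds only if ip is actually assigned to
# a local interface; a Service clusterIP normally is not, which yields
# errno 99 ("Cannot assign requested address"), as in the gRPC log above.
ip = os.getenv("PADDLE_CURRENT_IP", "10.104.213.246")
port = int(os.getenv("PADDLE_PSERVER_PORT", "6174"))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind((ip, port))
    print("bind OK on {}:{}".format(ip, port))
except (OSError, socket.error) as e:
    print("bind failed on {}:{}: {}".format(ip, port, e))
finally:
    s.close()
```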