Opened Dec 06, 2017 by saxon_zh (@saxon_zh)

Cluster training DeepFM always stuck at the beginning

Created by: typhoonzero

I started a 10-node job with 10 pservers and 10 trainers. Using the DeepFM model (converted to Paddle v1 code), the job always gets stuck before the trainers can run any batches.

pserver log:

commandline: ./paddle_pserver2 --num_gradient_servers=10 --nics=xgbe0 --port=7164 --ports_num=1 --ports_num_for_sparse=1 --rdma_tcp=tcp --comment=paddle_cluster_job 
W1206 12:25:42.604581 29267 CpuId.h:112] PaddlePaddle wasn't compiled to use avx instructions, but these are available on your machine and could speed up CPU computations via CMAKE .. -DWITH_AVX=ON
I1206 12:25:42.605259 29267 ParameterServerController.cpp:83] number of parameterServer instances: 2
I1206 12:25:42.605267 29267 ParameterServerController.cpp:87] Starting parameterServer[0]
I1206 12:25:42.605304 29267 ParameterServerController.cpp:87] Starting parameterServer[1]
I1206 12:25:42.605325 29267 ParameterServerController.cpp:96] Waiting parameterServer[0]
I1206 12:25:42.605643 29288 LightNetwork.cpp:273] tcp server start 
I1206 12:25:42.605721 29287 LightNetwork.cpp:273] tcp server start 
I1206 12:36:39.935209 23676 LightNetwork.cpp:326] worker started, peer = 10.104.100.14
I1206 12:36:39.960119 23677 LightNetwork.cpp:326] worker started, peer = 10.104.100.14
I1206 12:40:23.655452  4484 LightNetwork.cpp:326] worker started, peer = 10.102.196.43
I1206 12:40:23.683399  4485 LightNetwork.cpp:326] worker started, peer = 10.102.196.43
I1206 12:40:25.714017  4485 ParameterServer2.cpp:256] pserver: setParameter
I1206 12:40:25.714077  4485 ParameterServer2.cpp:302] pserver: new cpuvector: size=1196032
I1206 12:40:25.737254  4484 ParameterServer2.cpp:256] pserver: setParameter
I1206 12:40:25.743959  4484 ParameterServer2.cpp:302] pserver: new cpuvector: size=30000
I1206 12:40:25.766773  4686 LightNetwork.cpp:326] worker started, peer = 10.102.196.43
I1206 12:40:25.810925  4688 LightNetwork.cpp:326] worker started, peer = 10.102.196.43
I1206 12:40:26.050133 23677 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:48:30.094723  8188 LightNetwork.cpp:326] worker started, peer = 10.102.196.37
I1206 12:48:30.119384  8189 LightNetwork.cpp:326] worker started, peer = 10.102.196.37
I1206 12:48:32.120713  8189 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:50:31.704797 18232 LightNetwork.cpp:326] worker started, peer = 10.102.196.39
I1206 12:50:31.731204 18233 LightNetwork.cpp:326] worker started, peer = 10.102.196.39
I1206 12:50:33.527307 18356 LightNetwork.cpp:326] worker started, peer = 10.102.196.41
I1206 12:50:33.564779 18363 LightNetwork.cpp:326] worker started, peer = 10.102.196.41
I1206 12:50:33.732800 18233 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:50:35.555511 18363 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:51:13.538336 21095 LightNetwork.cpp:326] worker started, peer = 10.104.100.12
I1206 12:51:13.562628 21097 LightNetwork.cpp:326] worker started, peer = 10.104.100.12
I1206 12:51:15.564131 21097 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:51:33.749547 22697 LightNetwork.cpp:326] worker started, peer = 10.102.196.44
I1206 12:51:33.776844 22701 LightNetwork.cpp:326] worker started, peer = 10.102.196.44
I1206 12:51:35.778558 22701 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:55:58.762526  4722 LightNetwork.cpp:326] worker started, peer = 10.102.196.38
I1206 12:55:58.790271  4723 LightNetwork.cpp:326] worker started, peer = 10.102.196.38
I1206 12:56:00.792819  4723 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:56:01.233184  4860 LightNetwork.cpp:326] worker started, peer = 10.102.196.42
I1206 12:56:01.260675  4861 LightNetwork.cpp:326] worker started, peer = 10.102.196.42
I1206 12:56:03.409593  4861 ParameterServer2.cpp:564] pserver: getParameter
I1206 12:57:06.424779  9664 LightNetwork.cpp:326] worker started, peer = 10.104.100.11
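
In the excerpt above, peers connect to this pserver over a window from 12:36:39 to 12:57:06. A small script can make that easier to see when comparing logs across nodes. The following is only a sketch: the function names `first_connections` and `missing_peers` are hypothetical helpers, and the regular expression assumes the glog line layout shown in this log. It reports when each peer first connected and which expected hosts never appear; it does not identify the cause of the hang.

```python
import re
from collections import OrderedDict

# Assumed glog layout, as seen in the log above:
#   <severity><MMDD> <HH:MM:SS.micros> <thread id> <file>:<line>] <message>
PEER_RE = re.compile(
    r"^[IWEF]\d{4} (\d{2}:\d{2}:\d{2})\.\d+\s+\d+ LightNetwork\.cpp:\d+\] "
    r"worker started, peer = (\S+)"
)


def first_connections(log_text):
    """Map each peer IP to the time of its first 'worker started' line."""
    seen = OrderedDict()
    for line in log_text.splitlines():
        match = PEER_RE.match(line.strip())
        if match:
            timestamp, peer = match.groups()
            # Keep only the earliest timestamp per peer.
            seen.setdefault(peer, timestamp)
    return seen


def missing_peers(log_text, expected):
    """Return the expected IPs that never show up as a connected peer."""
    seen = first_connections(log_text)
    return [ip for ip in expected if ip not in seen]
```

Passing the pserver log text and the host list from the trainer's `--pservers` flag to `missing_peers` shows whether any node never reached this pserver.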

trainer log:

./paddle_trainer --num_gradient_servers=10 --trainer_id=5 --pservers=10.102.196.43,10.102.196.42,10.102.196.41,10.104.100.12,10.102.196.37,10.104.100.11,10.102.196.44,10.102.196.38,10.104.100.14,10.102.196.39 --rdma_tcp=tcp --nics=xgbe0 --port=7164 --ports_num=1 --local=0 --job=train --dot_period=10 --saving_period=1 --log_period=20 --trainer_count=1 --num_passes=1 --ports_num_for_sparse=1 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0 
W1206 12:57:04.976786  9582 CpuId.h:112] PaddlePaddle wasn't compiled to use avx instructions, but these are available on your machine and could speed up CPU computations via CMAKE .. -DWITH_AVX=ON
I1206 12:57:05.292944  9582 Trainer.cpp:166] trainer mode: SgdSparseCpuTraining
I1206 12:57:05.292971  9582 TrainerInternal.cpp:239] Sgd sparse training can not work with ConcurrentRemoteParameterUpdater, automatically reset --use_old_updater=true
I1206 12:57:05.498909  9582 PyDataProvider2.cpp:243] loading dataprovider dataprovider::process_deep
I1206 12:57:05.553319  9582 GradientMachine.cpp:94] Initing parameters..
I1206 12:57:06.414937  9582 GradientMachine.cpp:101] Init parameters done.
I1206 12:57:06.415112  9621 ParameterClient2.cpp:113] pserver 0 10.102.196.43:7165
I1206 12:57:06.415400  9621 ParameterClient2.cpp:113] pserver 1 10.102.196.42:7165
I1206 12:57:06.415513  9621 ParameterClient2.cpp:113] pserver 2 10.102.196.41:7165
I1206 12:57:06.415627  9621 ParameterClient2.cpp:113] pserver 3 10.104.100.12:7165
I1206 12:57:06.415748  9621 ParameterClient2.cpp:113] pserver 4 10.102.196.37:7165
I1206 12:57:06.415866  9621 ParameterClient2.cpp:113] pserver 5 10.104.100.11:7165
I1206 12:57:06.415930  9621 ParameterClient2.cpp:113] pserver 6 10.102.196.44:7165
I1206 12:57:06.416048  9621 ParameterClient2.cpp:113] pserver 7 10.102.196.38:7165
I1206 12:57:06.416160  9621 ParameterClient2.cpp:113] pserver 8 10.104.100.14:7165
I1206 12:57:06.416251  9621 ParameterClient2.cpp:113] pserver 9 10.102.196.39:7165
I1206 12:57:06.439986  9582 ParameterClient2.cpp:113] pserver 0 10.102.196.43:7164
I1206 12:57:06.440148  9582 ParameterClient2.cpp:113] pserver 1 10.102.196.42:7164
I1206 12:57:06.440269  9582 ParameterClient2.cpp:113] pserver 2 10.102.196.41:7164
I1206 12:57:06.440351  9582 ParameterClient2.cpp:113] pserver 3 10.104.100.12:7164
I1206 12:57:06.440423  9582 ParameterClient2.cpp:113] pserver 4 10.102.196.37:7164
I1206 12:57:06.440512  9582 ParameterClient2.cpp:113] pserver 5 10.104.100.11:7164
I1206 12:57:06.440629  9582 ParameterClient2.cpp:113] pserver 6 10.102.196.44:7164
I1206 12:57:06.440716  9582 ParameterClient2.cpp:113] pserver 7 10.102.196.38:7164
I1206 12:57:06.440809  9582 ParameterClient2.cpp:113] pserver 8 10.104.100.14:7164
I1206 12:57:06.440871  9582 ParameterClient2.cpp:113] pserver 9 10.102.196.39:7164
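
The trainer log ends after listing each pserver on ports 7165 and 7164. One basic check at this point is whether both ports on every pserver accept a TCP connection from the trainer host. The sketch below uses only the Python standard library; `check_endpoints` is a hypothetical helper, not part of Paddle, and a successful connect shows only that the port is reachable, not that the pserver is ready to serve parameters.

```python
import socket


def check_endpoints(hosts, ports, timeout=3.0):
    """Try a plain TCP connect to every (host, port) pair.

    Returns a dict mapping (host, port) to True if the connection
    succeeded and False if it was refused, timed out, or failed.
    """
    results = {}
    for host in hosts:
        for port in ports:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    results[(host, port)] = True
            except OSError:
                results[(host, port)] = False
    return results
```

For example, `check_endpoints(pserver_ips, [7164, 7165])`, with `pserver_ips` taken from the `--pservers` flag, lists any endpoint the trainer cannot reach.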
Reference: paddlepaddle/Paddle#6334