Opened Sep 01, 2020 by saxon_zh (@saxon_zh, Guest)

[Paper Reproduction] Per-iteration time increases during multi-GPU training

Created by: thrkingd

  • Version and environment information:
    1) PaddlePaddle version: 1.8.2
    2) GPU: script-task environment on the AI Studio platform (the 4-GPU configuration was selected for this test run)

Project link: https://aistudio.baidu.com/aistudio/clusterprojectdetail/773316 (public)

Symptom: on every run, once training reaches the second epoch, the time per iteration roughly doubles. The measured time includes the time the reader spends loading data. Further testing showed that the GPU compute time is unchanged; it is mainly the reader's data-loading time that has doubled. I do not know the cause. The details can be seen in the output log below (in the log, 验证集准确率为 means "validation accuracy" and 验证集loss为 means "validation loss"):

    Loss at epoch 0 step 5: [4.3775268], acc: [0.125],period: 16.51509690284729
    Loss at epoch 0 step 10: [3.8490183], acc: [0.40625],period: 16.0886127948761
    Loss at epoch 0 step 15: [2.9390435], acc: [0.5625],period: 16.38063144683838
    Loss at epoch 0 step 20: [2.3567476], acc: [0.65625],period: 16.37914490699768
    Loss at epoch 0 step 25: [1.5900159], acc: [0.734375],period: 16.665042877197266
    Loss at epoch 0 step 30: [1.2181798], acc: [0.734375],period: 15.94478726387024
    Loss at epoch 0 step 35: [0.6449917], acc: [0.90625],period: 16.192134141921997
    valid Loss at step 0: [0.25852293], acc: [1.]
    valid Loss at step 20: [0.46329147], acc: [0.875]
    valid Loss at step 40: [1.2772802], acc: [0.71875]
    valid Loss at step 60: [1.3179752], acc: [0.6875]
    valid Loss at step 80: [1.1119821], acc: [0.75]
    start data reader (trainers_num: 4, trainer_id: 1)
    valid Loss at step 100: [2.2030392], acc: [0.1875]
    start data reader (trainers_num: 4, trainer_id: 3)
    start data reader (trainers_num: 4, trainer_id: 2)
    验证集准确率为:0.8328919410705566
    验证集loss为:0.7212506532669067
    Epoch1 lr=0.009999999776482582
    start data reader (trainers_num: 4, trainer_id: 0)
    Loss at epoch 1 step 5: [0.55662304], acc: [0.890625],period: 36.16345572471619
    Loss at epoch 1 step 10: [0.42250103], acc: [0.875],period: 37.54211711883545
    Loss at epoch 1 step 15: [0.27849782], acc: [0.96875],period: 37.35439491271973
    Loss at epoch 1 step 20: [0.33116293], acc: [0.90625],period: 35.67357039451599
    Loss at epoch 1 step 25: [0.37910157], acc: [0.890625],period: 37.441500663757324
    Loss at epoch 1 step 30: [0.36237404], acc: [0.953125],period: 37.42715764045715
    Loss at epoch 1 step 35: [0.36434418], acc: [0.921875],period: 35.077528953552246
    valid Loss at step 0: [0.08099228], acc: [1.]
    valid Loss at step 20: [0.43469694], acc: [0.84375]
    valid Loss at step 40: [1.1833289], acc: [0.71875]
    valid Loss at step 60: [0.94777864], acc: [0.71875]
    valid Loss at step 80: [0.42069796], acc: [0.9375]
    valid Loss at step 100: [1.0385896], acc: [0.71875]
    start data reader (trainers_num: 4, trainer_id: 2)
    start data reader (trainers_num: 4, trainer_id: 3)
    start data reader (trainers_num: 4, trainer_id: 1)
    验证集准确率为:0.8781779408454895
    验证集loss为:0.44851189851760864
    Epoch2 lr=0.0010000000474974513
    start data reader (trainers_num: 4, trainer_id: 0)
    Loss at epoch 2 step 5: [0.38328272], acc: [0.921875],period: 37.521843671798706
    Loss at epoch 2 step 10: [0.3836357], acc: [0.90625],period: 36.96732306480408
    Loss at epoch 2 step 15: [0.2954501], acc: [0.9375],period: 37.78758931159973
    Loss at epoch 2 step 20: [0.28290433], acc: [0.9375],period: 36.79577970504761
    Loss at epoch 2 step 25: [0.23077166], acc: [0.984375],period: 35.68168759346008
    Loss at epoch 2 step 30: [0.303536], acc: [0.9375],period: 37.21159076690674
    Loss at epoch 2 step 35: [0.25395435], acc: [0.9375],period: 36.25998377799988
    valid Loss at step 0: [0.07491596], acc: [1.]
    valid Loss at step 20: [0.39159745], acc: [0.84375]
    valid Loss at step 40: [1.0963948], acc: [0.71875]
    valid Loss at step 60: [0.9259555], acc: [0.71875]
    valid Loss at step 80: [0.5368143], acc: [0.90625]
    valid Loss at step 100: [0.93665516], acc: [0.78125]
    验证集准确率为:0.8916842937469482
    验证集loss为:0.41894418001174927
    INFO 2020-09-01 00:58:53,138 launch.py:223] Local procs complete, POD info:rank:0 id:None addr:127.0.0.1 port:None visible_gpu:[] trainers:["gpu:['0'] endpoint:127.0.0.1:52136 rank:0", "gpu:['1'] endpoint:127.0.0.1:35241 rank:1", "gpu:['2'] endpoint:127.0.0.1:47208 rank:2", "gpu:['3'] endpoint:127.0.0.1:49438 rank:3"]
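To put a number on the slowdown reported in the log, the `period` values can be averaged per epoch. The following is only a minimal sketch: the helper name `mean_period_per_epoch` and the regular expression are illustrative assumptions, not part of the original script, and the embedded lines are the epoch 0 and epoch 1 training lines copied from the log.

```python
import re
from collections import defaultdict
from statistics import mean

# Training lines for epochs 0 and 1, copied from the issue's log.
LOG = """\
Loss at epoch 0 step 5: [4.3775268], acc: [0.125],period: 16.51509690284729
Loss at epoch 0 step 10: [3.8490183], acc: [0.40625],period: 16.0886127948761
Loss at epoch 0 step 15: [2.9390435], acc: [0.5625],period: 16.38063144683838
Loss at epoch 0 step 20: [2.3567476], acc: [0.65625],period: 16.37914490699768
Loss at epoch 0 step 25: [1.5900159], acc: [0.734375],period: 16.665042877197266
Loss at epoch 0 step 30: [1.2181798], acc: [0.734375],period: 15.94478726387024
Loss at epoch 0 step 35: [0.6449917], acc: [0.90625],period: 16.192134141921997
valid Loss at step 0: [0.25852293], acc: [1.]
Loss at epoch 1 step 5: [0.55662304], acc: [0.890625],period: 36.16345572471619
Loss at epoch 1 step 10: [0.42250103], acc: [0.875],period: 37.54211711883545
Loss at epoch 1 step 15: [0.27849782], acc: [0.96875],period: 37.35439491271973
Loss at epoch 1 step 20: [0.33116293], acc: [0.90625],period: 35.67357039451599
Loss at epoch 1 step 25: [0.37910157], acc: [0.890625],period: 37.441500663757324
Loss at epoch 1 step 30: [0.36237404], acc: [0.953125],period: 37.42715764045715
Loss at epoch 1 step 35: [0.36434418], acc: [0.921875],period: 35.077528953552246
"""

# Matches only training lines; "valid Loss ..." lines do not start with "Loss".
PATTERN = re.compile(r"Loss at epoch (\d+) step (\d+): .*?period: ([0-9.]+)")


def mean_period_per_epoch(log_text):
    """Return {epoch: mean period in seconds} for the training lines in log_text."""
    periods = defaultdict(list)
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            periods[int(match.group(1))].append(float(match.group(3)))
    return {epoch: mean(values) for epoch, values in sorted(periods.items())}


means = mean_period_per_epoch(LOG)
for epoch, value in means.items():
    print(f"epoch {epoch}: mean period {value:.2f} s")
print(f"epoch 1 / epoch 0 = {means[1] / means[0]:.2f}")
```

On these lines the ratio comes out slightly above 2, consistent with the "doubled" description; the sketch says nothing about the cause.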

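`period` is the measured run time of each iteration (including the data-reading time). The test code is as follows:

    # Imports implied by the names used below.
    import time

    import numpy as np
    import paddle.fluid as fluid

    # train_model, opt, train_reader, dev_id and the epoch index i are
    # defined elsewhere in the training script.
    cur_time = 0
    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array([x[0] for x in data]).astype('float32')
        y_data = np.array([[x[1]] for x in data]).astype('int64')

        # Time since the same point in the previous iteration: the previous
        # training step plus reading and converting this batch. The value at
        # batch_id == 0 is meaningless (cur_time starts at 0) and is never printed.
        period = time.time() - cur_time
        cur_time = time.time()

        img = fluid.dygraph.to_variable(dy_x_data)
        label = fluid.dygraph.to_variable(y_data)
        label.stop_gradient = True

        out = train_model(img)
        acc = fluid.layers.accuracy(out, label)
        loss = fluid.layers.cross_entropy(out, label)
        avg_loss = fluid.layers.mean(loss)

        # Data-parallel update: scale the loss, then all-reduce the gradients.
        avg_loss = train_model.scale_loss(avg_loss)
        avg_loss.backward()
        train_model.apply_collective_grads()

        opt.minimize(avg_loss)
        train_model.clear_gradients()

        # Log from trainer 0 only, every 5 steps. The loss is multiplied by 4
        # (the number of trainers), presumably to undo scale_loss for display.
        if dev_id == 0 and batch_id % 5 == 0 and batch_id > 0:
            print("Loss at epoch {} step {}: {}, acc: {},period: {}".format(
                i, batch_id, avg_loss.numpy() * 4, acc.numpy(), period))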
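Because `period` above lumps the previous training step together with the wait for the next batch, one way to check where the extra time goes is to time the two parts separately. The sketch below is framework-agnostic and makes several assumptions: `timed_batches`, `profile_epoch`, `slow_reader` and `fake_train_step` are hypothetical names introduced here for illustration, and the sleeps merely stand in for real data loading and a real training step.

```python
import time


def timed_batches(reader):
    """Yield (batch, wait_seconds), where wait_seconds is the time spent
    blocked on the reader while fetching that batch."""
    iterator = iter(reader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def profile_epoch(reader, train_step):
    """Run one pass over reader and accumulate read time and compute time separately."""
    read_total = 0.0
    compute_total = 0.0
    steps = 0
    for batch, wait in timed_batches(reader):
        start = time.perf_counter()
        train_step(batch)
        compute_total += time.perf_counter() - start
        read_total += wait
        steps += 1
    return {"steps": steps, "read_s": read_total, "compute_s": compute_total}


# Hypothetical stand-ins for the real reader and training step.
def slow_reader(n_batches, delay):
    for index in range(n_batches):
        time.sleep(delay)  # simulated data-loading cost
        yield [index]


def fake_train_step(batch):
    time.sleep(0.01)  # simulated forward/backward/update cost


stats = profile_epoch(slow_reader(5, 0.02), fake_train_step)
print(stats)
```

In a real loop, `profile_epoch` would be called once per epoch with `train_reader()` and a function wrapping the forward, backward and optimizer calls, so the `read_s` totals of epoch 0 and epoch 1 can be compared directly. GPU kernels are typically launched asynchronously, so the compute figure is only meaningful if the step ends with something that waits for the device, such as reading a value back with `.numpy()`.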
Reference: paddlepaddle/Paddle#26852