Opened Feb 18, 2019 by saxon_zh

Summary of issues with the ImageNet high-performance accelerated models

Created by: ccmeteorljh

Everything below was tested on a P40 cluster.

1. In PG mode, a "double free or corruption" error occurs. The model configuration is as follows:

python train.py --update_method=nccl2 --num_threads 1 --fp16 True --scale_loss 8.0
-----------  Configuration Arguments -----------
async_mode: False
batch_size: 256
checkpoint: None
class_dim: 1000
data_dir: ./data/ILSVRC2012
enable_ce: False
fp16: 1
image_shape: 3,224,224
lr: 0.1
lr_strategy: piecewise_decay
model: DistResNet
model_category: models
model_save_dir: output
multi_batch_repeat: 1
num_epochs: 120
num_threads: 1
pretrained_model: None
reduce_strategy: allreduce
scale_loss: 8.0
skip_unbalanced_data: False
split_var: True
start_test_pass: 0
total_images: 1281167
update_method: nccl2
use_gpu: True
use_visiontool: True
visiontool_workers: 16
with_mem_opt: False
------------------------------------------------
Total examples: 320512, total time: 737.57827, 434.54642 examples/sed

Pass: 0, Test Loss 41.2, test acc1: 0.08636667, test acc5: 0.21744126
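
As a sanity check on the log above, the reported throughput is the example count divided by the elapsed time, and the example count is a whole number of batches at batch_size 256. A minimal Python sketch using only numbers printed in the configuration and log; the variable names are illustrative, and the batches-per-pass figure assumes it is computed as total_images // batch_size, which matches the 5004 shown in the batch counter of the second log further down.

```python
# Numbers copied from the configuration dump and the log of this run.
total_examples = 320512
total_time_s = 737.57827
batch_size = 256
total_images = 1281167

# Throughput as printed in the "Total examples" line.
throughput = total_examples / total_time_s

# Number of full batches covered by the reported example count.
batches_seen, leftover = divmod(total_examples, batch_size)

# Assumption: batches per pass = total_images // batch_size.
batches_per_pass = total_images // batch_size

print(f"throughput: {throughput:.5f} examples/s")
print(f"batches seen: {batches_seen} (leftover examples: {leftover})")
print(f"batches per pass: {batches_per_pass}")
```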

The error is as follows:

*** Error in `python': double free or corruption (fasttop): 0x00007f2694000a30 ***
======= Backtrace: =========
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x7354f)[0x7f27b601454f]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x78dbe)[0x7f27b6019dbe]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x79a97)[0x7f27b601aa97]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework8Variable15PlaceholderImplINS_9operators15AlgorithmsCacheI31cudnnConvolutionBwdFilterAlgo_tEEED0Ev+0x6d)[0x7f275168e12d]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework8Variable10GetMutableINS_9operators15AlgorithmsCacheI31cudnnConvolutionBwdFilterAlgo_tEEEEPT_v+0x164)[0x7f2751690eb4]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNK6paddle9operators21CUDNNConvGradOpKernelINS_8platform7float16EE7ComputeERKNS_9framework16ExecutionContextE+0x1304)[0x7f275169d9e4]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm2EINS0_9operators21CUDNNConvGradOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_+0x23)[0x7f275169e973]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNK6paddle9framework18OperatorWithKernel7RunImplERKNS0_5ScopeERKN5boost7variantINS_8platform9CUDAPlaceENS7_8CPUPlaceENS7_15CUDAPinnedPlaceENS5_6detail7variant5void_ESD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_EE+0x293)[0x7f27529a10f3]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework12OperatorBase3RunERKNS0_5ScopeERKN5boost7variantINS_8platform9CUDAPlaceENS7_8CPUPlaceENS7_15CUDAPinnedPlaceENS5_6detail7variant5void_ESD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_EE+0x155)[0x7f275299e975]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x2717016)[0x7f2752823016]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x27305fd)[0x7f275283c5fd]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIFvvEE+0x325)[0x7f275283bf55]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details19ComputationOpHandle7RunImplEv+0x70)[0x7f2752822ce0]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details12OpHandleBase3RunEb+0x76)[0x7f275283cd86]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x26b451d)[0x7f27527c051d]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpERKSt10shared_ptrINS0_13BlockingQueueIPNS1_13VarHandleBaseEEEEPNS1_12OpHandleBaseE+0x51d)[0x7f27527c0f1d]
*** Aborted at 1550253252 (unix time) try "date -d @1550253252" if you are using GNU date ***
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details24ThreadedSSAGraphExecutor3RunERKSt6vectorISsSaISsEE+0x914)[0x7f27527c2b94]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x26af7b2)[0x7f27527bb7b2]
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 33450 (TID 0x7f26a8dfa700) from PID 0; stack trace: ***
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultISt6vectorIN6paddle9framework9LoDTensorESaISB_EEEES3_ESD_EEE9_M_invokeERKSt9_Any_data+0x2a)[0x7f27527be2aa]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNSt13__future_base11_State_base9_M_do_setERSt8functionIFSt10unique_ptrINS_12_Result_baseENS3_8_DeleterEEvEERb+0x27)[0x7f2751b91547]
/opt/compiler/gcc-4.8.2/lib/libpthread.so.0(pthread_once+0x53)[0x7f27b6a65973]
    @     0x7f27b6a68160 (unknown)
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x26af252)[0x7f27527bb252]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZZN10ThreadPoolC1EmENKUlvE_clEv+0x194)[0x7f2751b927d4]
/opt/compiler/gcc-4.8.2/lib/libstdc++.so.6(+0xb08a0)[0x7f27ab0de8a0]
/opt/compiler/gcc-4.8.2/lib/libpthread.so.0(+0x81c3)[0x7f27b6a601c3]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(clone+0x6d)[0x7f27b608812d]

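2. In PG mode, a `python': invalid fastbin entry error occurs. It happens on both single-node and multi-node runs, but only after a long time: sometimes after 70+ passes, sometimes after 110+ passes. The model configuration is as follows: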

start cmd is: python train.py --update_method=local --num_threads 1 --fp16 True --scale_loss 8.0
-----------  Configuration Arguments -----------
async_mode: False
batch_size: 256
checkpoint: None
class_dim: 1000
data_dir: ./data/ILSVRC2012
enable_ce: False
fp16: 1
image_shape: 3,224,224
lr: 0.1
lr_strategy: piecewise_decay
model: DistResNet
model_category: models
model_save_dir: output
multi_batch_repeat: 1
num_epochs: 120
num_threads: 1
pretrained_model: None
reduce_strategy: allreduce
scale_loss: 8.0
skip_unbalanced_data: False
split_var: True
start_test_pass: 0
total_images: 1281167
update_method: local
use_gpu: True
use_visiontool: True
visiontool_workers: 16
with_mem_opt: False
------------------------------------------------
Pass 74, batch [30/5004], loss 6.953, acc1: 0.78125, acc5: 0.90625, avg batch time 0.2261
Pass 74, batch [60/5004], loss 6.69, acc1: 0.78125, acc5: 0.93359375, avg batch time 0.2239
Pass 74, batch [90/5004], loss 8.305, acc1: 0.75, acc5: 0.90625, avg batch time 0.2247
Pass 74, batch [120/5004], loss 5.77, acc1: 0.8125, acc5: 0.93359375, avg batch time 0.2250
Pass 74, batch [150/5004], loss 8.36, acc1: 0.7734375, acc5: 0.9140625, avg batch time 0.2244
*** Error in `python': invalid fastbin entry (free): 0x00007f99808d9a60 ***
======= Backtrace: =========
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x7354f)[0x7f9b3fc6f54f]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x78dbe)[0x7f9b3fc74dbe]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x79a97)[0x7f9b3fc75a97]
/usr/local/lib/python2.7/site-packages/visreader/transformer/libpytransform.so(_ZN7vistool16ImageTransformer3getEPNS_25transformer_output_data_tE+0x227)[0x7f9ac7d11fe7]
/usr/local/lib/python2.7/site-packages/visreader/transformer/libpytransform.so(+0x168be42)[0x7f9ac7ceee42]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x46be)[0x7f9b409d880e]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f9b409da21d]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x43a1)[0x7f9b409d84f1]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f9b409da21d]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x43a1)[0x7f9b409d84f1]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f9b409da21d]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x83db5)[0x7f9b40954db5]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7f9b40923273]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1426)[0x7f9b409d5576]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x482e)[0x7f9b409d897e]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x482e)[0x7f9b409d897e]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f9b409da21d]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x83ce0)[0x7f9b40954ce0]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7f9b40923273]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x60c3d)[0x7f9b40931c3d]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7f9b40923273]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47)[0x7f9b409d3b67]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x14cbf2)[0x7f9b40a1dbf2]
/opt/compiler/gcc-4.8.2/lib/libpthread.so.0(+0x81c3)[0x7f9b406bb1c3]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(clone+0x6d)[0x7f9b3fce312d]
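
To see which library the failing free() was called from, one can skip the leading libc frames of the glibc backtrace and look at the first frame from another module. A minimal Python sketch, assuming frames have the path(symbol+offset)[address] shape shown above; the helper name first_non_libc_frame and the regular expression are illustrative and not part of Paddle or visreader.

```python
import re

# One glibc backtrace frame: /path/to/lib.so(symbol+0xoff)[0xaddr]
FRAME_RE = re.compile(
    r"^(?P<module>/[^(\s]+)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)

def first_non_libc_frame(lines):
    """Return (module, symbol) of the first frame that is not in libc.so."""
    for line in lines:
        m = FRAME_RE.match(line.strip())
        if m is None:
            continue  # skip headers such as "======= Backtrace: ========="
        module = m.group("module")
        if "/libc.so" in module:
            continue  # allocator internals, not the caller
        return module, m.group("symbol")
    return None

# The first frames of the backtrace above, copied verbatim.
backtrace = """\
======= Backtrace: =========
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x7354f)[0x7f9b3fc6f54f]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x78dbe)[0x7f9b3fc74dbe]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x79a97)[0x7f9b3fc75a97]
/usr/local/lib/python2.7/site-packages/visreader/transformer/libpytransform.so(_ZN7vistool16ImageTransformer3getEPNS_25transformer_output_data_tE+0x227)[0x7f9ac7d11fe7]
""".splitlines()

result = first_non_libc_frame(backtrace)
print(result)
```

On this backtrace the first non-libc frame is in visreader's libpytransform.so (the mangled symbol demangles to vistool::ImageTransformer::get), not in paddle/fluid/core.so as in the first report.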

3. The program exits abnormally after the run finishes:

Pass: 119, Test Loss 0.995599, test acc1: 0.75795186, test acc5: 0.92903614

F0215 15:37:24.769343 38469 grpc_client.cc:418] SendCompleteRPC name:[COMPLETE@RECV], ep:[10.255.118.18:30026], status:[-1] meets grpc error, error_code:14 error_message:OS Error error_details:
*** Check failure stack trace: ***
    @     0x7f317d9167ad  google::LogMessage::Fail()
    @     0x7f317d91a25c  google::LogMessage::SendToLog()
    @     0x7f317d9162d3  google::LogMessage::Flush()
    @     0x7f317d91b76e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f317ed7199a  paddle::operators::distributed::GRPCClient::Proceed()
    @     0x7f31d58f28a0  execute_native_thread_routine
    @     0x7f31e32741c3  start_thread
    @     0x7f31e289c12d  __clone
    @              (nil)  (unknown)
/root/paddlejob/run.sh: line 313: 38102 Aborted                 (core dumped) python train.py --update_method=nccl2 --num_threads 1
*********************error messages********************
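
The fatal log line above reports error_code:14, which in the standard gRPC status code table is UNAVAILABLE. A minimal Python sketch that pulls the endpoint and code out of such a line and names the code; the regular expression and the helper describe_grpc_failure are illustrative, and the line format is assumed to match the one logged here.

```python
import re

# Standard gRPC status codes (numeric value -> name).
GRPC_STATUS = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED",
    9: "FAILED_PRECONDITION", 10: "ABORTED", 11: "OUT_OF_RANGE",
    12: "UNIMPLEMENTED", 13: "INTERNAL", 14: "UNAVAILABLE",
    15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}

# Assumed line shape: "... ep:[host:port], ... error_code:N ..."
LINE_RE = re.compile(r"ep:\[(?P<ep>[^\]]+)\].*?error_code:(?P<code>\d+)")

def describe_grpc_failure(line):
    """Return (endpoint, code, code name), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    return m.group("ep"), code, GRPC_STATUS.get(code, "UNRECOGNIZED")

# The fatal line from the log above, split only for readability.
log_line = (
    "F0215 15:37:24.769343 38469 grpc_client.cc:418] SendCompleteRPC "
    "name:[COMPLETE@RECV], ep:[10.255.118.18:30026], status:[-1] meets grpc "
    "error, error_code:14 error_message:OS Error error_details:"
)
info = describe_grpc_failure(log_line)
print(info)
```

UNAVAILABLE generally means the peer could not be reached. That would be consistent with the remote endpoint having already shut down when the completion message was sent, but the log alone does not confirm it.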

4. In MP mode, the run ends without any error message.

After switching to a different cluster, this problem no longer occurs.

5. On P40, both PG and MP give a speedup on a single node, but multi-node results are poor:

| GPUs | Base | Fluid PG FP32 | Fluid MP FP32 | Fluid PG FP16 | Fluid MP FP16 |
| 8 | 553.31055; test acc1: 0.75941, test acc5: 0.93016 | 855; test acc1: 0.756278, test acc5: 0.928083 | 417*8; test acc1: 0.64468, test acc5: 0.85098 | 1002 | 423*8; test acc1: 0.64744, test acc5: 0.86174 |
| 16 | 464.45*2; test acc1: 0.76156, test acc5: 0.93018 | 426*2 | 11382 | | 14582 |
| 32 | 477*4; test acc1: 0.76148, test acc5: 0.92962 | 334*4; test acc1: 0.75795, test acc5: 0.929036 | 9084; test acc1: 0.63722, test acc5: 0.85278 | 434*4 | |
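
The comparison can be made concrete by turning the Base and Fluid PG FP32 entries into ratios. A rough Python sketch, assuming that an entry written as a*b means a multiplied by b (for example a per-node figure times the node count), which the issue does not state. Only the Base and PG FP32 columns are used, because every row has both.

```python
# Entries copied from the table; "a*b" is ASSUMED to mean a times b.
base = {8: "553.31055", 16: "464.45*2", 32: "477*4"}
pg_fp32 = {8: "855", 16: "426*2", 32: "334*4"}

def total(entry):
    """Evaluate 'a' or 'a*b' as a plain product of floats."""
    result = 1.0
    for factor in entry.split("*"):
        result *= float(factor)
    return result

# Ratio above 1 means PG FP32 is faster than Base at that GPU count.
ratios = {gpus: total(pg_fp32[gpus]) / total(base[gpus]) for gpus in base}

for gpus, ratio in sorted(ratios.items()):
    print(f"{gpus:>2} GPUs: PG FP32 / Base = {ratio:.2f}")
```

Under that assumption the ratio is above 1 on 8 GPUs and below 1 on 16 and 32 GPUs, which agrees with the statement that PG helps on a single node but not across nodes.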
Reference: paddlepaddle/models#1764