ImageNet high-performance acceleration model: issue summary
Created by: ccmeteorljh
All of the following was tested on a P40 cluster.
1. In PG mode, a `double free or corruption` error occurs. The model configuration is as follows:
python train.py --update_method=nccl2 --num_threads 1 --fp16 True --scale_loss 8.0
----------- Configuration Arguments -----------
async_mode: False
batch_size: 256
checkpoint: None
class_dim: 1000
data_dir: ./data/ILSVRC2012
enable_ce: False
fp16: 1
image_shape: 3,224,224
lr: 0.1
lr_strategy: piecewise_decay
model: DistResNet
model_category: models
model_save_dir: output
multi_batch_repeat: 1
num_epochs: 120
num_threads: 1
pretrained_model: None
reduce_strategy: allreduce
scale_loss: 8.0
skip_unbalanced_data: False
split_var: True
start_test_pass: 0
total_images: 1281167
update_method: nccl2
use_gpu: True
use_visiontool: True
visiontool_workers: 16
with_mem_opt: False
------------------------------------------------
Total examples: 320512, total time: 737.57827, 434.54642 examples/sed
Pass: 0, Test Loss 41.2, test acc1: 0.08636667, test acc5: 0.21744126
The error output is as follows:
*** Error in `python': double free or corruption (fasttop): 0x00007f2694000a30 ***
======= Backtrace: =========
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x7354f)[0x7f27b601454f]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x78dbe)[0x7f27b6019dbe]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x79a97)[0x7f27b601aa97]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework8Variable15PlaceholderImplINS_9operators15AlgorithmsCacheI31cudnnConvolutionBwdFilterAlgo_tEEED0Ev+0x6d)[0x7f275168e12d]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework8Variable10GetMutableINS_9operators15AlgorithmsCacheI31cudnnConvolutionBwdFilterAlgo_tEEEEPT_v+0x164)[0x7f2751690eb4]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNK6paddle9operators21CUDNNConvGradOpKernelINS_8platform7float16EE7ComputeERKNS_9framework16ExecutionContextE+0x1304)[0x7f275169d9e4]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm2EINS0_9operators21CUDNNConvGradOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_+0x23)[0x7f275169e973]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNK6paddle9framework18OperatorWithKernel7RunImplERKNS0_5ScopeERKN5boost7variantINS_8platform9CUDAPlaceENS7_8CPUPlaceENS7_15CUDAPinnedPlaceENS5_6detail7variant5void_ESD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_EE+0x293)[0x7f27529a10f3]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework12OperatorBase3RunERKNS0_5ScopeERKN5boost7variantINS_8platform9CUDAPlaceENS7_8CPUPlaceENS7_15CUDAPinnedPlaceENS5_6detail7variant5void_ESD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_SD_EE+0x155)[0x7f275299e975]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x2717016)[0x7f2752823016]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x27305fd)[0x7f275283c5fd]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIFvvEE+0x325)[0x7f275283bf55]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details19ComputationOpHandle7RunImplEv+0x70)[0x7f2752822ce0]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details12OpHandleBase3RunEb+0x76)[0x7f275283cd86]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x26b451d)[0x7f27527c051d]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpERKSt10shared_ptrINS0_13BlockingQueueIPNS1_13VarHandleBaseEEEEPNS1_12OpHandleBaseE+0x51d)[0x7f27527c0f1d]
*** Aborted at 1550253252 (unix time) try "date -d @1550253252" if you are using GNU date ***
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework7details24ThreadedSSAGraphExecutor3RunERKSt6vectorISsSaISsEE+0x914)[0x7f27527c2b94]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x26af7b2)[0x7f27527bb7b2]
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 33450 (TID 0x7f26a8dfa700) from PID 0; stack trace: ***
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultISt6vectorIN6paddle9framework9LoDTensorESaISB_EEEES3_ESD_EEE9_M_invokeERKSt9_Any_data+0x2a)[0x7f27527be2aa]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNSt13__future_base11_State_base9_M_do_setERSt8functionIFSt10unique_ptrINS_12_Result_baseENS3_8_DeleterEEvEERb+0x27)[0x7f2751b91547]
/opt/compiler/gcc-4.8.2/lib/libpthread.so.0(pthread_once+0x53)[0x7f27b6a65973]
@ 0x7f27b6a68160 (unknown)
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(+0x26af252)[0x7f27527bb252]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZZN10ThreadPoolC1EmENKUlvE_clEv+0x194)[0x7f2751b927d4]
/opt/compiler/gcc-4.8.2/lib/libstdc++.so.6(+0xb08a0)[0x7f27ab0de8a0]
/opt/compiler/gcc-4.8.2/lib/libpthread.so.0(+0x81c3)[0x7f27b6a601c3]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(clone+0x6d)[0x7f27b608812d]
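A backtrace of this length is easier to triage once it is split into library, symbol, and offset. The following is a minimal sketch, not part of the original report, that parses glibc-style frames and reports the first frame outside libc, which is usually the caller of the failing `free`. The names `FRAME_RE`, `parse_frames`, and `first_frame_outside` are hypothetical helpers introduced here for illustration; the sample frames are copied from the trace above.

```python
import re

# Matches glibc backtrace frames of the form: /path/lib.so(symbol+0xOFF)[0xADDR]
# The symbol part may be empty, as in libc.so.6(+0x7354f)[...].
FRAME_RE = re.compile(
    r"^(?P<lib>/[^()\s]+)"
    r"\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)


def parse_frames(text):
    """Return one dict per line that looks like a glibc backtrace frame."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def first_frame_outside(frames, skip_suffixes=("libc.so.6", "libpthread.so.0")):
    """Return the first frame whose library is not in skip_suffixes, or None."""
    for frame in frames:
        if not frame["lib"].endswith(skip_suffixes):
            return frame
    return None


# Three frames copied verbatim from the trace in this report.
SAMPLE = """
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x79a97)[0x7f27b601aa97]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZN6paddle9framework8Variable15PlaceholderImplINS_9operators15AlgorithmsCacheI31cudnnConvolutionBwdFilterAlgo_tEEED0Ev+0x6d)[0x7f275168e12d]
/usr/local/lib/python2.7/site-packages/paddle/fluid/core.so(_ZNK6paddle9operators21CUDNNConvGradOpKernelINS_8platform7float16EE7ComputeERKNS_9framework16ExecutionContextE+0x1304)[0x7f275169d9e4]
"""

frames = parse_frames(SAMPLE)
culprit = first_frame_outside(frames)
print(culprit["lib"])
print(culprit["symbol"])
```

On the sample, the first non-libc frame is in `core.so` and its mangled symbol mentions `AlgorithmsCache` and `cudnnConvolutionBwdFilterAlgo_t`. Passing the symbol through a demangler such as `c++filt` makes it readable.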
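2. In PG mode, a `python': invalid fastbin entry` error occurs. It happens on both single-machine and multi-machine runs, but only after a long time: sometimes after 70+ passes, sometimes after 110+ passes. The model configuration is as follows: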
start cmd is: python train.py --update_method=local --num_threads 1 --fp16 True --scale_loss 8.0
----------- Configuration Arguments -----------
async_mode: False
batch_size: 256
checkpoint: None
class_dim: 1000
data_dir: ./data/ILSVRC2012
enable_ce: False
fp16: 1
image_shape: 3,224,224
lr: 0.1
lr_strategy: piecewise_decay
model: DistResNet
model_category: models
model_save_dir: output
multi_batch_repeat: 1
num_epochs: 120
num_threads: 1
pretrained_model: None
reduce_strategy: allreduce
scale_loss: 8.0
skip_unbalanced_data: False
split_var: True
start_test_pass: 0
total_images: 1281167
update_method: local
use_gpu: True
use_visiontool: True
visiontool_workers: 16
with_mem_opt: False
------------------------------------------------
Pass 74, batch [30/5004], loss 6.953, acc1: 0.78125, acc5: 0.90625, avg batch time 0.2261
Pass 74, batch [60/5004], loss 6.69, acc1: 0.78125, acc5: 0.93359375, avg batch time 0.2239
Pass 74, batch [90/5004], loss 8.305, acc1: 0.75, acc5: 0.90625, avg batch time 0.2247
Pass 74, batch [120/5004], loss 5.77, acc1: 0.8125, acc5: 0.93359375, avg batch time 0.2250
Pass 74, batch [150/5004], loss 8.36, acc1: 0.7734375, acc5: 0.9140625, avg batch time 0.2244
*** Error in `python': invalid fastbin entry (free): 0x00007f99808d9a60 ***
======= Backtrace: =========
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x7354f)[0x7f9b3fc6f54f]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x78dbe)[0x7f9b3fc74dbe]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(+0x79a97)[0x7f9b3fc75a97]
/usr/local/lib/python2.7/site-packages/visreader/transformer/libpytransform.so(_ZN7vistool16ImageTransformer3getEPNS_25transformer_output_data_tE+0x227)[0x7f9ac7d11fe7]
/usr/local/lib/python2.7/site-packages/visreader/transformer/libpytransform.so(+0x168be42)[0x7f9ac7ceee42]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x46be)[0x7f9b409d880e]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f9b409da21d]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x43a1)[0x7f9b409d84f1]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f9b409da21d]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x43a1)[0x7f9b409d84f1]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x78613)[0x7f9b40949613]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x11b9)[0x7f9b409d5309]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f9b409da21d]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x83db5)[0x7f9b40954db5]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7f9b40923273]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1426)[0x7f9b409d5576]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x482e)[0x7f9b409d897e]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x482e)[0x7f9b409d897e]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f9b409da21d]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x83ce0)[0x7f9b40954ce0]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7f9b40923273]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x60c3d)[0x7f9b40931c3d]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7f9b40923273]
/usr/local/bin/../lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47)[0x7f9b409d3b67]
/usr/local/bin/../lib/libpython2.7.so.1.0(+0x14cbf2)[0x7f9b40a1dbf2]
/opt/compiler/gcc-4.8.2/lib/libpthread.so.0(+0x81c3)[0x7f9b406bb1c3]
/opt/compiler/gcc-4.8.2/lib/libc.so.6(clone+0x6d)[0x7f9b3fce312d]
3. The program exits abnormally after the run finishes.
Pass: 119, Test Loss 0.995599, test acc1: 0.75795186, test acc5: 0.92903614
F0215 15:37:24.769343 38469 grpc_client.cc:418] SendCompleteRPC name:[COMPLETE@RECV], ep:[10.255.118.18:30026], status:[-1] meets grpc error, error_code:14 error_message:OS Error error_details:
*** Check failure stack trace: ***
@ 0x7f317d9167ad google::LogMessage::Fail()
@ 0x7f317d91a25c google::LogMessage::SendToLog()
@ 0x7f317d9162d3 google::LogMessage::Flush()
@ 0x7f317d91b76e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f317ed7199a paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f31d58f28a0 execute_native_thread_routine
@ 0x7f31e32741c3 start_thread
@ 0x7f31e289c12d __clone
@ (nil) (unknown)
/root/paddlejob/run.sh: line 313: 38102 Aborted (core dumped) python train.py --update_method=nccl2 --num_threads 1
*********************error messages********************
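The fatal line above carries a numeric gRPC `error_code`. In the standard gRPC status code table, 14 is `UNAVAILABLE`, which is consistent with (but does not prove) the endpoint no longer being reachable when the completion message was sent. The sketch below extracts the fields from that log line and maps the code to its name; `LINE_RE` and `parse_grpc_failure` are hypothetical helpers introduced here, and the field layout is assumed from this single log line only.

```python
import re

# Standard gRPC status codes (numeric value -> name).
GRPC_STATUS_NAMES = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED",
    9: "FAILED_PRECONDITION", 10: "ABORTED", 11: "OUT_OF_RANGE",
    12: "UNIMPLEMENTED", 13: "INTERNAL", 14: "UNAVAILABLE",
    15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}

# Field layout assumed from the one fatal log line in this report.
LINE_RE = re.compile(
    r"name:\[(?P<name>[^\]]*)\], ep:\[(?P<ep>[^\]]*)\], "
    r"status:\[(?P<status>-?\d+)\].*?"
    r"error_code:(?P<code>\d+) error_message:(?P<message>.*?) error_details:"
)


def parse_grpc_failure(line):
    """Extract RPC name, endpoint, and gRPC status from a fatal log line."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "name": match.group("name"),
        "endpoint": match.group("ep"),
        "code": code,
        "code_name": GRPC_STATUS_NAMES.get(code, "UNRECOGNIZED"),
        "message": match.group("message"),
    }


# The fatal line copied from the log above.
LOG_LINE = (
    "F0215 15:37:24.769343 38469 grpc_client.cc:418] SendCompleteRPC "
    "name:[COMPLETE@RECV], ep:[10.255.118.18:30026], status:[-1] meets grpc "
    "error, error_code:14 error_message:OS Error error_details:"
)

info = parse_grpc_failure(LOG_LINE)
print("{endpoint}: {code} ({code_name}), {message}".format(**info))
```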
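4. In MP mode, the run ends without any error message.
After switching to a different cluster, this problem no longer occurs.
5. On P40, both PG and MP give a speedup on a single machine, but multi-machine results are poor.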
GPUs | Base | Fluid PG FP32 | Fluid MP FP32 | Fluid PG FP16 | Fluid MP FP16 |
---|---|---|---|---|---|
8 | 553.31055, test acc1: 0.75941, test acc5: 0.93016 | 855, test acc1: 0.756278, test acc5: 0.928083 | 417*8, test acc1: 0.64468, test acc5: 0.85098 | 1002 | 423*8, test acc1: 0.64744, test acc5: 0.86174 |
16 | 464.45*2, test acc1: 0.76156, test acc5: 0.93018 | 426*2 | 11382 | 14582 | |
32 | 477*4, test acc1: 0.76148, test acc5: 0.92962 | 334*4, test acc1: 0.75795, test acc5: 0.929036 | 9084, test acc1: 0.63722, test acc5: 0.85278 | 434*4 | |
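
To put a number on "multi-machine results are poor", the table cells can be turned into a scaling efficiency. The sketch below rests on two assumptions the report does not state: each machine has 8 GPUs, and in the Base and PG FP32 columns a cell such as `426*2` means a per-machine throughput of 426 multiplied by 2 machines. The helpers `total_throughput` and `scaling_efficiency` are hypothetical names introduced here.

```python
def total_throughput(cell):
    """Evaluate a throughput cell such as '426*2' or '855' to one number.

    Assumption: 'N*M' means N per machine times M machines.
    """
    total = 1.0
    for part in cell.replace(" ", "").split("*"):
        total *= float(part)
    return total


def scaling_efficiency(single_machine, multi_machine, machines):
    """Ratio of measured multi-machine throughput to ideal linear scaling."""
    return multi_machine / (single_machine * machines)


# Throughput cells copied from the Base and Fluid PG FP32 columns above.
# Keys are GPU counts; 8 GPUs per machine is an assumption.
TABLE = {
    "Base": {8: "553.31055", 16: "464.45*2", 32: "477*4"},
    "Fluid PG FP32": {8: "855", 16: "426*2", 32: "334*4"},
}

efficiency = {}
for column, cells in TABLE.items():
    single = total_throughput(cells[8])
    for gpus in (16, 32):
        machines = gpus // 8
        multi = total_throughput(cells[gpus])
        efficiency[(column, gpus)] = scaling_efficiency(single, multi, machines)
        print("{0}, {1} GPUs: {2:.2f}".format(column, gpus, efficiency[(column, gpus)]))
```

Under these assumptions, Base keeps well over 80% of linear scaling at 16 and 32 GPUs, while PG FP32 drops to roughly 50% at 16 GPUs and below 40% at 32 GPUs.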