ResNet trains fine on a single GPU but fails with an error on eight GPUs
Created by: ccmeteorljh
paddle_version: 1.5.0. Command used:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -u train.py --model=ResNet101 --batch_size=256 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --pretrained_model=SE_ResNext50_32x4d_pretrained/ --data_dir=data/ILSVRC2012 --with_mem_opt=False --with_inplace=True --lr_strategy=cosine_decay --lr=0.1 --l2_decay=1.2e-4 --num_epochs=2
The error output is as follows:
I0620 05:24:42.159507 21796 build_strategy.cc:285] SeqOnlyAllReduceOps:0, num_trainers:1
Pass 0, trainbatch 0, loss 7.05474, acc1 0.00000, acc5 0.00391, lr 0.10000, time 12.15 sec
Traceback (most recent call last):
File "train.py", line 498, in <module>
main()
File "train.py", line 494, in main
train(args)
File "train.py", line 393, in train
fetch_list=train_fetch_list)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 205, in run
return_numpy=return_numpy)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 580, in run
return_numpy=return_numpy)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 446, in _run_parallel
exe.run(fetch_var_names, fetch_var_name)
paddle.fluid.core.EnforceNotMet: Enforce failed. Expected infer_next_address == next_address, but received infer_next_address:0x1011e520240 != next_address:0x1011e520a40.
The address is not consistent. at [/paddle/paddle/fluid/framework/details/fused_all_reduce_op_handle.cc:142]
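A quick check on the two addresses in the message above: the sketch below parses them and computes how far apart they are. The helper name `address_gap` is hypothetical, made up for illustration. Reading the gap as padding or a separate allocation between two buffers that the fused all-reduce expects to be adjacent is an assumption drawn from the error text, not something confirmed in this report.

```python
import re

# Enforce message copied verbatim from the log above.
ERROR_LINE = (
    "Expected infer_next_address == next_address, but received "
    "infer_next_address:0x1011e520240 != next_address:0x1011e520a40."
)


def address_gap(message):
    """Hypothetical helper: return (inferred, actual, actual - inferred).

    Both addresses are parsed as hexadecimal integers from the message.
    """
    inferred = int(
        re.search(r"infer_next_address:(0x[0-9a-fA-F]+)", message).group(1), 16
    )
    # The lookbehind skips the "infer_next_address:" occurrence.
    actual = int(
        re.search(r"(?<!infer_)next_address:(0x[0-9a-fA-F]+)", message).group(1), 16
    )
    return inferred, actual, actual - inferred


inferred, actual, gap = address_gap(ERROR_LINE)
print("inferred=%#x actual=%#x gap=%d bytes (%#x)" % (inferred, actual, gap, gap))
```

For this log the actual address is 0x800 (2048 bytes) past the inferred one.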
PaddlePaddle Call Stacks:
0 0x7fd2cc104b68p void paddle::platform::EnforceNotMet::Init<std::string>(std::string, char const*, int) + 360
1 0x7fd2cc104eb7p paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int) + 87
2 0x7fd2cdb63327p paddle::framework::details::FusedAllReduceOpHandle::RunImpl() + 4567
3 0x7fd2cdb960e0p paddle::framework::details::OpHandleBase::Run(bool) + 160
4 0x7fd2cdafd7adp
5 0x7fd2cce70df3p std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) + 35