Load model bug in ParallelExecutor.
Created by: qingqing01
Loading a saved model for ParallelExecutor training fails with the following error:
  File "train.py", line 229, in train_parallel_exe
    train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_cost.name, num_threads=cards_num)
  File "/home/users/dangqingqing/.jumbo/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 120, in __init__
    allow_op_delay)
paddle.fluid.core.EnforceNotMet: Not supported at [/home/users/dangqingqing/Paddle/paddle/fluid/platform/nccl_helper.h:36]
PaddlePaddle Call Stacks:
0 0x7f98138e3e5ep paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 558
1 0x7f98139b253dp paddle::platform::ToNCCLDataType(std::type_index) + 253
2 0x7f98139af29dp paddle::framework::ParallelExecutor::BCastParamsToGPUs(std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) const + 1677
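The NCCL helper in Fluid (ToNCCLDataType in nccl_helper.h) does not currently support the int64 data type, so BCastParamsToGPUs fails when it reaches a persistable int64 variable. The saved model contains such a variable: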
vars {
  name: "@LR_DECAY_COUNTER@"
  type {
    type: LOD_TENSOR
    lod_tensor {
      tensor {
        data_type: INT64
        dims: 1
      }
      lod_level: 0
    }
  }
  persistable: true
}
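
To make the failure mode concrete, here is a minimal plain-Python sketch of the dtype lookup that appears to fail. It is not Paddle code. The table contents, the helper names (`to_nccl_dtype`, `find_unbroadcastable`), and the parameter name `fc_0.w_0` are assumptions for illustration only; the real check lives in C++ in nccl_helper.h. NCCL itself does define an `ncclInt64` type, so one possible fix is to add that mapping on the Fluid side.

```python
# Hypothetical model of the dtype -> NCCL type lookup (illustrative, not a Paddle API).
# Assumption: only these three dtypes are mapped, which matches the "Not supported"
# error raised for INT64 in the traceback above.
SUPPORTED_NCCL_DTYPES = {
    "FP32": "ncclFloat",
    "FP64": "ncclDouble",
    "INT32": "ncclInt",
}


def to_nccl_dtype(data_type):
    """Return the NCCL type name for a tensor dtype, or raise if unmapped."""
    try:
        return SUPPORTED_NCCL_DTYPES[data_type]
    except KeyError:
        raise NotImplementedError("Not supported: %s" % data_type)


def find_unbroadcastable(persistable_vars):
    """Return names of persistable variables whose dtype has no NCCL mapping."""
    return [name for name, dtype in persistable_vars
            if dtype not in SUPPORTED_NCCL_DTYPES]


# "fc_0.w_0" is a made-up parameter name; "@LR_DECAY_COUNTER@" is from the dump above.
persistable_vars = [("fc_0.w_0", "FP32"), ("@LR_DECAY_COUNTER@", "INT64")]
bad = find_unbroadcastable(persistable_vars)
print(bad)
```

In this sketch, adding an `"INT64": "ncclInt64"` entry to the table would let the counter variable pass the lookup. Skipping non-float persistable variables during the broadcast would be an alternative, but it would leave the counter uninitialized on the other devices.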