fluid.cuda_places() fails intermittently on single-GPU runs
Created by: kgresearch
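Problem: When running on a single GPU on the P40 阡陌 cluster, the call to fluid.cuda_places() intermittently raises an error. With identical code and data, roughly 1 in 10 jobs hits this problem and then fails.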
Environment: python: 3.7.6 paddle: 1.6.3 nccl2.3.7_cuda9.0
Code snippet:
def main(args):
    ernie_config = ErnieConfig(args.ernie_config_path)
    ernie_config.print_config()

    if args.use_cuda:
        dev_list = fluid.cuda_places()
        place = dev_list[0]
        dev_count = len(dev_list)
    ...
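Since the failure is intermittent, one possible stopgap is to retry the device query a few times before giving up. The sketch below is only an assumption about what might help: `call_with_retry` is a hypothetical helper, not part of Paddle, and the report does not confirm that a second call in the same process would succeed once CUDA initialization has failed.

```python
import time


def call_with_retry(fn, retries=3, delay_s=1.0, exceptions=(Exception,)):
    """Call fn() up to `retries` times, sleeping `delay_s` between attempts.

    Hypothetical helper for illustration; re-raises the last error if
    every attempt fails.
    """
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except exceptions as err:
            last_err = err
            print("attempt {}/{} failed: {}".format(attempt, retries, err))
            if attempt < retries:
                time.sleep(delay_s)
    raise last_err


# Possible usage inside main(), assuming `import paddle.fluid as fluid`:
#     dev_list = call_with_retry(fluid.cuda_places, retries=3, delay_s=2.0)
#     place = dev_list[0]
#     dev_count = len(dev_list)
```

If retries never succeed within one process, that would suggest the problem lies with the node or driver state rather than with timing, and resubmitting the job to a different node may be the more realistic workaround.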
The error message is as follows:
  File "/home/slurm/job/tmp/job-134491/env/py376paddle163/lib/python3.7/site-packages/paddle/fluid/framework.py", line 312, in cuda_places
    device_ids = _cuda_ids()
  File "/home/slurm/job/tmp/job-134491/env/py376paddle163/lib/python3.7/site-packages/paddle/fluid/framework.py", line 244, in _cuda_ids
    device_ids = six.moves.range(core.get_cuda_device_count())
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::GetCUDADeviceCount()
Error Message Summary:
Error: cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl, error code : 3, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 at (/paddle/paddle/fluid/platform/gpu_info.cc:67)
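To my understanding, error code 3 in the linked CUDA runtime enum corresponds to cudaErrorInitializationError, which would mean the CUDA runtime could not initialize on that node at all. To help narrow down whether the failing jobs land on particular nodes or see a different GPU environment, a small diagnostic like the following could be logged at job start. This is a sketch only: `collect_gpu_diagnostics` is a hypothetical helper, and it assumes `nvidia-smi` may or may not be on PATH.

```python
import os
import shutil
import socket
import subprocess


def collect_gpu_diagnostics():
    """Gather basic GPU environment info without touching the CUDA runtime.

    Hypothetical helper for illustration; returns a dict of strings (or None
    for unset environment variables).
    """
    info = {
        "hostname": socket.gethostname(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi": None,
    }
    smi = shutil.which("nvidia-smi")
    if smi is None:
        info["nvidia_smi"] = "nvidia-smi not found on PATH"
        return info
    try:
        # `nvidia-smi -L` lists the GPUs visible to the driver.
        proc = subprocess.run(
            [smi, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
        info["nvidia_smi"] = proc.stdout.strip()
    except (OSError, subprocess.TimeoutExpired) as err:
        info["nvidia_smi"] = "nvidia-smi failed: {}".format(err)
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print("{}: {}".format(key, value))
```

Comparing this output between successful and failed jobs may show whether the failures cluster on specific hosts or coincide with an empty or unexpected CUDA_VISIBLE_DEVICES value.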