fluid.cuda_places() fails intermittently on a single GPU (#23082) · Issue · PaddlePaddle / Paddle

fluid.cuda_places() fails intermittently on a single GPU

Created by: kgresearch

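Problem: when a job runs on a single P40 GPU on the 阡陌 (Qianmo) cluster, the call fluid.cuda_places() intermittently raises an error. With exactly the same code and data, roughly 1 in 10 jobs hits this error and then fails.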

Environment: Python 3.7.6, Paddle 1.6.3, nccl2.3.7_cuda9.0

The code snippet is:

def main(args):
    ernie_config = ErnieConfig(args.ernie_config_path)
    ernie_config.print_config()

    if args.use_cuda:
        dev_list = fluid.cuda_places()
        place = dev_list[0]
        dev_count = len(dev_list)

    ...
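Because the failure is intermittent, one possible stopgap while the root cause is investigated is to retry the device enumeration a few times before giving up. The following is a minimal sketch under that assumption, not a confirmed fix: `retry_call`, its parameters, and the delay values are hypothetical names introduced here for illustration, and this report does not establish whether a retry inside the same process succeeds after the failure.

```python
import time


def retry_call(fn, attempts=3, delay_s=1.0, exceptions=(Exception,)):
    """Call fn() up to `attempts` times, sleeping between failed tries.

    Hypothetical helper for illustration only; it is not part of Paddle.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            time.sleep(delay_s * attempt)  # linear backoff between tries


# Intended use in the snippet above (assumes `import paddle.fluid as fluid`):
#     dev_list = retry_call(fluid.cuda_places, attempts=3, delay_s=2.0)
```

If the retries are also exhausted, the original exception propagates unchanged, so the job still fails with the same traceback as before.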

The error message is as follows:

  File "/home/slurm/job/tmp/job-134491/env/py376paddle163/lib/python3.7/site-packages/paddle/fluid/framework.py", line 312, in cuda_places
    device_ids = _cuda_ids()
  File "/home/slurm/job/tmp/job-134491/env/py376paddle163/lib/python3.7/site-packages/paddle/fluid/framework.py", line 244, in _cuda_ids
    device_ids = six.moves.range(core.get_cuda_device_count())
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::GetCUDADeviceCount()

Error Message Summary:

Error: cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl, error code : 3, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 at (/paddle/paddle/fluid/platform/gpu_info.cc:67)
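For triage, it can help to pull the numeric code out of the summary line and name it. In the CUDA runtime's `cudaError` enum, code 3 is `cudaErrorInitializationError`, meaning the CUDA driver and runtime could not be initialized. The sketch below assumes the message keeps the `error code : N` wording shown above; `extract_cuda_error_code` and the deliberately partial `CUDA_ERROR_NAMES` table are hypothetical names introduced for illustration, not Paddle or CUDA APIs.

```python
import re

# Partial mapping of CUDA runtime error codes to enum names.
# Only the entries needed for this report are listed.
CUDA_ERROR_NAMES = {
    0: "cudaSuccess",
    3: "cudaErrorInitializationError",
}


def extract_cuda_error_code(message):
    """Return the integer after 'error code :' in message, or None."""
    match = re.search(r"error code\s*:\s*(\d+)", message)
    return int(match.group(1)) if match else None


summary = (
    "Error: cudaGetDeviceCount failed in "
    "paddle::platform::GetCUDADeviceCountImpl, error code : 3, ..."
)
code = extract_cuda_error_code(summary)
print(code, CUDA_ERROR_NAMES.get(code, "unknown"))
```

An initialization error at `cudaGetDeviceCount` suggests the problem is on the node or driver side rather than in the training script, but the report does not confirm this.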

