Created by: Xreki
This PR fixes the following issues.
1. Strengthen the checks and error reporting in the following scenario
Docker image used:
paddlepaddle/paddle latest-gpu-cuda10.0-cudnn7 d9831916b6f5 6 weeks ago 11.7GB
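With fusion_group enabled for language_model, the develop branch fails with the error below. It only shows that the call to cuModuleLoadData failed and gives no details about the cause.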
W0221 05:00:37.126574 11815 device_code.cc:218] Call cuModuleLoadData failed: 0
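This PR makes two changes:
- Improve the error messages and the execution flow. Add an IsAvaiable interface that checks whether the CUDA Driver API and NVRTC are available. If either of the following holds, they are treated as unavailable and fusion_group_pass is skipped:
  - libcuda.so or libnvrtc.so cannot be found.
  - cuDeviceGetCount is called and its return value is not CUDA_SUCCESS, which means the CUDA Driver API is unavailable.
  TODO: add this check at the very beginning of fusion_group_pass.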
I0221 12:06:36.538492 12642 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0221 12:06:36.541931 12642 device_code.cc:121] CUDA Driver Version: 10.1; NVRTC Version: 10.0
W0221 12:06:36.606364 12642 device_code.cc:93] Call cuModuleLoadData for < > failed: ... (3)
W0221 12:06:36.606411 12642 fusion_group_pass.cc:45] Disable fusion_group because CUDA Driver or NVRTC is not avaiable.
- Call cuInit in every InitDevices to initialize the CUDA Driver API environment. Whether this affects other CUDA functionality still needs to be verified.
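The availability check described in the first item can be sketched roughly as follows. This is a minimal Python illustration using ctypes, not the C++ code in this PR: the names load_library and is_jit_available are hypothetical, and calling cuInit inside the check is an assumption made so the sketch is self-contained (the PR itself calls cuInit from InitDevices). If the (3) in the log above is a CUresult value, it would correspond to CUDA_ERROR_NOT_INITIALIZED, which is consistent with the cuInit change.

```python
import ctypes

CUDA_SUCCESS = 0  # value of CUDA_SUCCESS in the CUDA Driver API


def load_library(name):
    """Return a ctypes handle for `name`, or None if it cannot be loaded."""
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def is_jit_available(loader=load_library):
    """Hypothetical stand-in for the availability check described above."""
    libcuda = loader("libcuda.so")
    libnvrtc = loader("libnvrtc.so")
    if libcuda is None or libnvrtc is None:
        return False  # case 1: a required shared library is missing

    # Assumption: initialize the Driver API here; the flags argument must be 0.
    if libcuda.cuInit(0) != CUDA_SUCCESS:
        return False

    # case 2: the Driver API is usable only if cuDeviceGetCount succeeds.
    count = ctypes.c_int(0)
    return libcuda.cuDeviceGetCount(ctypes.byref(count)) == CUDA_SUCCESS
```

The loader parameter exists only so the check can be exercised on a machine without CUDA; on such a machine is_jit_available() with the default loader simply returns False instead of raising.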
2. Avoid crashing when a dynamic library cannot be found, so that training can still run normally
Docker image used:
paddlepaddle/paddle latest-gpu-cuda10.0-cudnn7 88e1673b2cae 8 months ago 3.58GB
With fusion_group enabled for language_model, the develop branch fails with the following error.
----------------------
Error Message Summary:
----------------------
Error: Failed to find dynamic library: libnvrtc.so ( libnvrtc.so: cannot open shared object file: No such file or directory )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at (/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:177)
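After this change, if libnvrtc.so still cannot be found, the messages below are printed as warnings, fusion_group is disabled automatically, and training proceeds normally.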
I0224 06:09:39.450793 3712 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
W0224 06:09:39.451284 3712 dynamic_loader.cc:120] Can not find library: libnvrtc.so. The process maybe hang. Please try to add the lib path to LD_LIBRARY_PATH.
W0224 06:09:39.451333 3712 dynamic_loader.cc:179] Failed to find dynamic library: libnvrtc.so ( libnvrtc.so: cannot open shared object file: No such file or directory )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled.
W0224 06:09:39.451361 3712 device_code.cc:105] NVRTC and CUDA driver are need for JIT compiling of CUDA code.
W0224 06:09:39.451382 3712 fusion_group_pass.cc:44] Disable fusion_group because CUDA Driver or NVRTC is not avaiable.
W0224 06:09:39.468574 3712 fuse_optimizer_op_pass.cc:191] Find sgd operators : 7, and 7 for dense gradients. To make the speed faster, those optimization are fused during training.
I0224 06:09:39.511509 3712 build_strategy.cc:371] SeqOnlyAllReduceOps:0, num_trainers:1
I0224 06:09:39.725088 3712 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0224 06:09:39.749193 3712 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
-- Epoch:[0]; Batch:[232]; Time: 0.01777 s; ppl: 870.98950, lr: 1.00000
-- Epoch:[0]; Batch:[464]; Time: 0.01841 s; ppl: 638.87372, lr: 1.00000
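The warn-and-continue behavior in section 2 can be illustrated with a small Python sketch. It is not Paddle's pass pipeline: select_passes is a hypothetical helper, and the pass names are taken from the source file names in the logs above only for illustration. The point is that a missing JIT dependency removes one optional pass and emits a warning, rather than raising an error that aborts training.

```python
import warnings


def select_passes(requested, jit_available):
    """Drop fusion_group_pass, with a warning, when JIT compilation is unavailable."""
    if jit_available or "fusion_group_pass" not in requested:
        return list(requested)
    # Warn instead of raising, so the remaining passes and training still run.
    warnings.warn(
        "Disable fusion_group because CUDA Driver or NVRTC is not available."
    )
    return [name for name in requested if name != "fusion_group_pass"]
```

For example, select_passes(["fusion_group_pass", "fuse_optimizer_op_pass"], False) keeps only fuse_optimizer_op_pass, while passing True leaves the list unchanged.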