"Add bfloat16 data type" 引起bert fp16 模型训练挂掉
Created by: hysunflower
this issue is caused by #25402
System information 1)PaddlePaddle version:develop commitID:95e1434b 2)GPU:V100 16G CUDA10.1 CUDNN7.6.5 3)CPU:Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz 4)OS:Ubuntu 16.04.6 5) Python3.7 6) compile docker images: paddlepaddle/paddle_manylinux_devel:cuda10.1_cudnn7 7) runtime docker images: paddlepaddle/paddle:latest-gpu-cuda10.1-cudnn7
build command
- step 1
export CMAKE_BUILD_TYPE=Release
export PYTHON_ABI=cp37-cp37m
export PADDLE_VERSION=0.0.0
export WITH_DOC=OFF
export WITH_AVX=ON
export WITH_GPU=ON
export WITH_TEST=OFF
export RUN_TEST=OFF
export WITH_GOLANG=OFF
export WITH_SWIG_PY=ON
export WITH_PYTHON=ON
export WITH_C_API=OFF
export WITH_STYLE_CHECK=OFF
export WITH_TESTING=OFF
export CMAKE_EXPORT_COMPILE_COMMANDS=ON
export WITH_MKL=ON
export BUILD_TYPE=Release
export WITH_DISTRIBUTE=ON
export WITH_FLUID_ONLY=OFF
export CMAKE_VERBOSE_MAKEFILE=OFF
- step 2
bash paddle/script/paddle_build.sh build
running
- model /models/PaddleNLP/pretrain_language_models/BERT/
- command
CUDA_VISIBLE_DEVICES=1 python -u run_classifier.py
--task_name mnli
--use_cuda true
--do_train true
--do_val False
--do_test False
--batch_size 8
--in_tokens False
--init_pretraining_params /path/to/uncased_L-24_H-1024_A-16/params
--data_dir /path/to/MNLI
--vocab_path /path/to/uncased_L-24_H-1024_A-16/vocab.txt
--checkpoints /path/to/save
--save_steps 1000
--weight_decay 0.01
--warmup_proportion 0.1
--validation_steps 1000
--epoch 2
--is_profiler=0
--max_iter=1500
--max_seq_len 128
--bert_config_path /path/to/uncased_L-24_H-1024_A-16/bert_config.json
--learning_rate 5e-5
--skip_steps 100
--random_seed 1
err message
Cast parameters to float16 data format.
Traceback (most recent call last):
File "run_classifier.py", line 451, in <module>
main(args)
File "run_classifier.py", line 290, in main
use_fp16=args.use_fp16)
File "/models/PaddleNLP/pretrain_language_models/BERT/utils/init.py", line 60, in init_pretraining_params
cast_fp32_to_fp16(exe, main_program)
File "/models/PaddleNLP/pretrain_language_models/BERT/utils/init.py", line 33, in cast_fp32_to_fp16
param_t.set(np.float16(data).view(np.uint16), exe.place)
paddle.fluid.core_avx.EnforceNotMet:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
1 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
2 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
ExternalError: Cuda error(1), invalid argument.
[Advise: The device function being invoked (usually via cudaLaunchKernel()) was not previously configured via the cudaConfigureCall() function.] (at /paddle/paddle/fluid/platform/gpu_info.cc:291)