Single-machine single-GPU run works, but single-machine multi-GPU segfaults at fluid.ParallelExecutor (#110) · Issue · PaddlePaddle / ERNIE

Created by: daizh

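[Problem description] When running script/run_xnli.sh, training on a single machine with a single GPU completes normally, but with multiple GPUs the process crashes with a segmentation fault at the fluid.ParallelExecutor call.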
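Since the crash only shows up when more than one card is visible, one way to narrow it down is to rerun the script with progressively more devices and note the smallest count that fails. The helper below is a minimal sketch that only generates candidate CUDA_VISIBLE_DEVICES values; the name device_subsets and the doubling schedule are illustrative assumptions, not part of ERNIE or Paddle.

```python
def device_subsets(all_ids):
    """Yield CUDA_VISIBLE_DEVICES strings for 1, 2, 4, ... devices,
    always ending with the full list of device ids."""
    n = 1
    while n < len(all_ids):
        yield ",".join(str(i) for i in all_ids[:n])
        n *= 2
    yield ",".join(str(i) for i in all_ids)


if __name__ == "__main__":
    # The script in this report exposes devices 0 through 7.
    for value in device_subsets(list(range(8))):
        print("CUDA_VISIBLE_DEVICES=" + value)
```

Each printed value can replace the CUDA_VISIBLE_DEVICES export in script/run_xnli.sh for one trial run.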

[Error location] In run_classifier.py:

    if args.do_train:
        print("do_train:")
        exec_strategy = fluid.ExecutionStrategy()
        if args.use_fast_executor:
            exec_strategy.use_experimental_executor = True
        exec_strategy.num_threads = dev_count
        exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope

        print("init fluid.ParallelExecutor")  # the crash occurs after this point
        train_exe = fluid.ParallelExecutor(
            use_cuda=args.use_cuda,
            loss_name=graph_vars["loss"].name,
            exec_strategy=exec_strategy,
            main_program=train_program)

        print(" train_pyreader.decorate_tensor_provider(train_data_generator)")
        train_pyreader.decorate_tensor_provider(train_data_generator)

[Error message]

Theoretical memory usage in training: 9076.691 - 9508.915 MB
W0424 13:34:08.421407 8512 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.2, Runtime API Version: 9.0
W0424 13:34:08.421470 8512 device_context.cc:271] device: 0, cuDNN Version: 7.0.
W0424 13:34:08.421519 8512 device_context.cc:295] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
Load pretraining parameters from ./checkpoint/ernie_stable//params.
do_train:
init fluid.ParallelExecutor

* Aborted at 1556084061 (unix time) try "date -d @1556084061" if you are using GNU date *

PC: @ 0x0 (unknown)

* SIGSEGV (@0x0) received by PID 8512 (TID 0x7f99f42ca700) from PID 0; stack trace: *

@ 0x7f99f3a81160 (unknown)
@ 0x0 (unknown)
script/run_xnli.sh: line 44: 8512 Segmentation fault ${PYTHON} -u run_classifier.py --use_cuda true --do_train true --do_val true --do_test true --verbose true --batch_size 4192 --in_tokens true --init_pretraining_params ${MODEL_PATH}/params --train_set ${TASK_DATA_PATH}/xnli/train.tsv --dev_set ${TASK_DATA_PATH}/xnli/dev.tsv --test_set ${TASK_DATA_PATH}/xnli/test.tsv --vocab_path config/vocab.txt --label_map ${TASK_DATA_PATH}/xnli/label_map.json --ernie_config_path config/ernie_config.json --checkpoints ./checkpoints --save_steps 1000 --weight_decay 0.01 --warmup_proportion 0.0 --validation_steps 25 --epoch 3 --max_seq_len 512 --learning_rate 1e-4 --skip_steps 10 --num_iteration_per_drop_scope 1 --num_labels 3 --random_seed 1
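The native stack trace above contains only "(unknown)" frames, so it does not show which Python call was active when the signal arrived. Python's standard faulthandler module can print the Python-level traceback of every thread on SIGSEGV. The wrapper below is a minimal sketch; the function name enable_crash_traceback is a hypothetical helper, not part of run_classifier.py.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Ask CPython to dump the Python traceback of all threads when a
    fatal signal such as SIGSEGV is received. Returns True once enabled."""
    if stream is None:
        stream = sys.stderr
    if not faulthandler.is_enabled():
        # The stream must stay open for as long as the handler is enabled.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()

# Intended use: call enable_crash_traceback() near the top of the
# training entry point, before the ParallelExecutor is constructed.
```

Exporting PYTHONFAULTHANDLER=1 in script/run_xnli.sh has the same effect without editing the Python source.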

[Environment]

paddlepaddle-gpu==1.3.0.post85
cuda-9.0
cudnn_v7
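The log above reports that Paddle was compiled against cuDNN 7.3 while the machine provides cuDNN 7.0. Whether that mismatch causes the segmentation fault is not established in this report, but it is worth checking first. The sketch below extracts both versions from the warning text; the regular expression is an assumption based on this single log line, not a documented Paddle log format.

```python
import re

# Warning text copied from the error log in this report.
LOG_LINE = (
    "WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, "
    "but CUDNN version in your machine is 7.0, which may cause serious "
    "incompatible bug."
)

# Assumed pattern, derived only from the line above.
_PATTERN = re.compile(
    r"compiled with CUDNN (\d+)\.(\d+), "
    r"but CUDNN version in your machine is (\d+)\.(\d+)"
)


def parse_cudnn_mismatch(text):
    """Return ((compiled_major, compiled_minor), (local_major, local_minor)),
    or None if the warning is not present in the text."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    c_major, c_minor, l_major, l_minor = map(int, match.groups())
    return (c_major, c_minor), (l_major, l_minor)


def local_is_older(text):
    """True if the machine's cuDNN is older than the one Paddle was built with."""
    versions = parse_cudnn_mismatch(text)
    if versions is None:
        return False
    compiled, local = versions
    return local < compiled


if __name__ == "__main__":
    print(parse_cudnn_mismatch(LOG_LINE))
    print(local_is_older(LOG_LINE))
```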

[Run script (script/run_xnli.sh)]

export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export LD_LIBRARY_PATH=./env/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/work/cuda-9.0/lib64/:/home/work/cudnn/cudnn_v7/cuda/lib64:/home/work/cuda-9.0/extras/CUPTI/lib64/:$LD_LIBRARY_PATH

TASK_DATA_PATH=./task_data
MODEL_PATH=./checkpoint/ernie_stable/

PYTHON=./env/python3/bin/python3
${PYTHON} -u run_classifier.py \
                   --use_cuda true \
                   --do_train true \
                   --do_val true \
                   --do_test true \
                   --verbose true \
                   --batch_size 8192 \
                   --in_tokens true \
                   --init_pretraining_params ${MODEL_PATH}/params \
                   --train_set ${TASK_DATA_PATH}/xnli/train.tsv \
                   --dev_set ${TASK_DATA_PATH}/xnli/dev.tsv \
                   --test_set ${TASK_DATA_PATH}/xnli/test.tsv \
                   --vocab_path config/vocab.txt \
                   --label_map ${TASK_DATA_PATH}/xnli/label_map.json \
                   --ernie_config_path config/ernie_config.json \
                   --checkpoints ./checkpoints \
                   --save_steps 1000 \
                   --weight_decay  0.01 \
                   --warmup_proportion 0.0 \
                   --validation_steps 25 \
                   --epoch 3 \
                   --max_seq_len 512 \
                   --learning_rate 1e-4 \
                   --skip_steps 10 \
                   --num_iteration_per_drop_scope 1 \
                   --num_labels 3 \
                   --random_seed 1
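The script prepends several directories to LD_LIBRARY_PATH, so it may be useful to see which cuDNN libraries those directories actually contain, given the version warning in the log. The sketch below lists matching files in path order. The function name find_libs is hypothetical, and the dynamic loader also consults other locations (for example the ldconfig cache), so this is only a first check.

```python
import glob
import os


def find_libs(search_path, pattern="libcudnn.so*"):
    """List files matching `pattern` in each directory of a
    path-separator-delimited search path, preserving directory order."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        hits.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return hits


if __name__ == "__main__":
    for path in find_libs(os.environ.get("LD_LIBRARY_PATH", "")):
        # realpath resolves symlinks, which usually exposes the full version suffix.
        print(path, "->", os.path.realpath(path))
```

Running it in the same shell environment as the training script (after the export lines) shows which directory supplies the first libcudnn match.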
