GPU training process killed
Created by: arogowie-intel
Hi,
I've been trying to run ResNet50 training on GPU from models/PaddleCV/image_classification/train.py
with the following command:
cudaid=${object_detection_cudaid:=0}
export CUDA_VISIBLE_DEVICES=$cudaid
python train.py \
--model=ResNet50 \
--batch_size=4 \
--class_dim=1000 \
--image_shape=3,224,224 \
--with_mem_opt=False \
--use_gpu=True \
--total_images=1281167 \
--model_save_dir=output/ \
--lr_strategy=piecewise_decay \
--num_epochs=1 \
--lr=0.1 \
--data_dir=/root/data/ILSVRC2012
I built PaddlePaddle from source on the develop
branch inside a Docker container, using the Dockerfile from the top-level directory of the PaddlePaddle repository. It was built with GPU support.
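For reference, the build followed roughly these steps (a rough sketch; the image tag and CMake flags here are approximate, not necessarily the exact ones I used):
# build the dev image from the Dockerfile at the repo root
docker build -t paddle:dev .
# start a GPU-enabled container with the repo mounted
nvidia-docker run -it -v $PWD:/paddle paddle:dev /bin/bash
# inside the container: configure with GPU support and compile
mkdir -p /paddle/build && cd /paddle/build
cmake .. -DWITH_GPU=ON
make -j$(nproc)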
Unfortunately, every time I try to run training, the process gets killed:
Pass 0, trainbatch 2010, loss 7.22243, acc1 0.00000, acc5 0.00000, lr 0.10000, time 0.22 sec
Pass 0, trainbatch 2020, loss 6.67906, acc1 0.00000, acc5 0.00000, lr 0.10000, time 0.20 sec
Pass 0, trainbatch 2030, loss 7.07748, acc1 0.00000, acc5 0.00000, lr 0.10000, time 0.22 sec
Pass 0, trainbatch 2040, loss 6.63177, acc1 0.00000, acc5 0.25000, lr 0.10000, time 0.26 sec
Pass 0, trainbatch 2050, loss 6.74798, acc1 0.00000, acc5 0.00000, lr 0.10000, time 0.22 sec
./run_train.sh: line 21: 9815 Killed python train.py --model=ResNet50 --batch_size=4 --class_dim=1000 --image_shape=3,224,224 --with_mem_opt=False --use_gpu=True --total_images=1281167 --model_save_dir=output/ --lr_strategy=piecewise_decay --num_epochs=1 --lr=0.1 --l2_decay=1e-4 --data_dir=/root/data/ILSVRC2012
I have two 1080Ti GPUs with 10GB each. Every time, the maximum available memory is used.
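In case it helps, this is a quick way to check whether the host ran out of RAM and the kernel OOM killer terminated the process (a sketch; the exact dmesg wording varies by kernel):
# watch host RAM and GPU memory while training runs
free -h
watch -n 1 nvidia-smi
# after the process dies, look for OOM-killer messages in the kernel log
dmesg | grep -iE 'out of memory|killed process'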
Can anyone help me figure out why the training process is being killed?