GPU training process killed
Created by: arogowie-intel
Hi,
I've been trying to run ResNet50 training on GPU from models/PaddleCV/image_classification/train.py
with the following command:
cudaid=${object_detection_cudaid:=0}
export CUDA_VISIBLE_DEVICES=$cudaid
python train.py \
--model=ResNet50 \
--batch_size=4 \
--class_dim=1000 \
--image_shape=3,224,224 \
--with_mem_opt=False \
--use_gpu=True \
--total_images=1281167 \
--model_save_dir=output/ \
--lr_strategy=piecewise_decay \
--num_epochs=1 \
--lr=0.1 \
--data_dir=/root/data/ILSVRC2012
I built PaddlePaddle from source on the develop
branch inside a Docker container, using the Dockerfile from the top-level directory of the PaddlePaddle repository. It was built with GPU support.
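For reference, the build followed roughly these steps (a rough sketch; the image tag and CMake flags here are approximate, not necessarily the exact ones I used):
# build the dev image from the Dockerfile at the repo root
docker build -t paddle:dev .
# start a GPU-enabled container with the repo mounted
nvidia-docker run -it -v $PWD:/paddle paddle:dev /bin/bash
# inside the container: configure with GPU support and compile
mkdir -p /paddle/build && cd /paddle/build
cmake .. -DWITH_GPU=ON
make -j$(nproc)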
Unfortunately, every time I try to run training, the process gets killed:
Pass 0, trainbatch 2010, loss 7.22243, acc1 0.00000, acc5 0.00000, lr 0.10000, time 0.22 sec
Pass 0, trainbatch 2020, loss 6.67906, acc1 0.00000, acc5 0.00000, lr 0.10000, time 0.20 sec
Pass 0, trainbatch 2030, loss 7.07748, acc1 0.00000, acc5 0.00000, lr 0.10000, time 0.22 sec
Pass 0, trainbatch 2040, loss 6.63177, acc1 0.00000, acc5 0.25000, lr 0.10000, time 0.26 sec
Pass 0, trainbatch 2050, loss 6.74798, acc1 0.00000, acc5 0.00000, lr 0.10000, time 0.22 sec
./run_train.sh: line 21: 9815 Killed python train.py --model=ResNet50 --batch_size=4 --class_dim=1000 --image_shape=3,224,224 --with_mem_opt=False --use_gpu=True --total_images=1281167 --model_save_dir=output/ --lr_strategy=piecewise_decay --num_epochs=1 --lr=0.1 --l2_decay=1e-4 --data_dir=/root/data/ILSVRC2012
I have two 1080Ti GPUs with 10GB each. Every time, the maximum available memory is used.
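In case it helps, this is a quick way to check whether the host ran out of RAM and the kernel OOM killer terminated the process (a sketch; the exact dmesg wording varies by kernel):
# watch host RAM and GPU memory while training runs
free -h
watch -n 1 nvidia-smi
# after the process dies, look for OOM-killer messages in the kernel log
dmesg | grep -iE 'out of memory|killed process'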
Can anyone help me figure out why the training process is being killed?