Training fails when running the model fine-tuning command
Created by: vincentpengpeng
Command used to start training, with Baidu's ResNet50_vd_10w pretrained model:
set CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch --selected_gpus="0" tools/train.py -c ./configs/quick_start/ResNet50_vd_10w_finetune.yaml
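The `set` line is Windows cmd syntax, so the launch above is really two separate steps: set the environment variable, then start Python. A minimal sketch of the same launch in Python, for anyone who wants to reproduce it from a script, follows. The helper names `build_launch` and `launch` are hypothetical and not part of PaddleClas; the flags and paths are copied from the command above.

```python
import os
import subprocess
import sys

# Config path copied from the command in this report.
CONFIG = "./configs/quick_start/ResNet50_vd_10w_finetune.yaml"


def build_launch(config_path=CONFIG, gpus="0"):
    """Return (argv, env) equivalent to the two-step command above.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = gpus  # same effect as `set CUDA_VISIBLE_DEVICES=0`
    argv = [
        sys.executable, "-m", "paddle.distributed.launch",
        "--selected_gpus=" + gpus,  # the shell strips the quotes around "0"
        "tools/train.py", "-c", config_path,
    ]
    return argv, env


def launch(config_path=CONFIG, gpus="0"):
    """Run the training command from the PaddleClas root and return its exit code."""
    argv, env = build_launch(config_path, gpus)
    return subprocess.call(argv, env=env)
```

Calling `launch()` from the PaddleClas directory should behave like the command above, so it is expected to hit the same error; the sketch only makes the two steps explicit.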
Error:
Traceback (most recent call last):
  File "tools/train.py", line 150, in <module>
    main(args)
  File "tools/train.py", line 75, in main
    config, train_prog, startup_prog, is_train=True)
  File "F:\pythonproject\PaddleClas\PaddleClas\tools\program.py", line 363, in build
    optimizer.minimize(fetchs['loss'][0])
  File "F:\Anaconda3\lib\site-packages\paddle\fluid\incubate\fleet\collective\__init__.py", line 652, in minimize
    fleet.main_program = self._try_to_compile(startup_program, main_program)
  File "F:\Anaconda3\lib\site-packages\paddle\fluid\incubate\fleet\collective\__init__.py", line 562, in _try_to_compile
    self._transpile(startup_program, main_program)
  File "F:\Anaconda3\lib\site-packages\paddle\fluid\incubate\fleet\collective\__init__.py", line 489, in _transpile
    current_endpoint=current_endpoint)
  File "F:\Anaconda3\lib\site-packages\paddle\fluid\transpiler\distribute_transpiler.py", line 625, in transpile
    wait_port=self.config.wait_port)
  File "F:\Anaconda3\lib\site-packages\paddle\fluid\transpiler\distribute_transpiler.py", line 397, in _transpile_nccl2
    self.config.hierarchical_allreduce_inter_nranks
  File "F:\Anaconda3\lib\site-packages\paddle\fluid\framework.py", line 2610, in append_op
    attrs=kwargs.get("attrs", None))
  File "F:\Anaconda3\lib\site-packages\paddle\fluid\framework.py", line 1870, in __init__
    proto = OpProtoHolder.instance().get_op_proto(type)
  File "F:\Anaconda3\lib\site-packages\paddle\fluid\framework.py", line 1751, in get_op_proto
    raise ValueError("Operator \"%s\" has not been registered." % type)
ValueError: Operator "gen_nccl_id" has not been registered.
INFO 2020-06-22 11:29:30,706 utils.py:272] terminate all the procs
ERROR 2020-06-22 11:29:30,706 utils.py:416] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2020-06-22 11:29:30,706 utils.py:272] terminate all the procs
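The last frames show `OpProtoHolder.instance().get_op_proto(type)` raising `ValueError` for `gen_nccl_id`, on the `_transpile_nccl2` path that the distributed launcher goes through. One possible reading, which is an assumption and not something the log confirms, is that this Paddle build does not ship the NCCL operators. A small sketch for checking that directly, using only the lookup visible in the traceback, is below. The helper names `op_is_registered` and `check_paddle_op` are hypothetical, and the import path is taken from the `paddle\fluid\framework.py` frames, so it may not apply to other Paddle versions.

```python
def op_is_registered(op_type, get_op_proto):
    """Return True if the lookup accepts op_type, False if it raises ValueError.

    `get_op_proto` is any callable with the behaviour seen in the traceback:
    it raises ValueError for an operator that has not been registered.
    """
    try:
        get_op_proto(op_type)
    except ValueError:
        return False
    return True


def check_paddle_op(op_type="gen_nccl_id"):
    """Check op_type against the installed Paddle, or return None without Paddle.

    Assumes the Paddle 1.x layout seen in the traceback.
    """
    try:
        from paddle.fluid.framework import OpProtoHolder
    except ImportError:
        return None
    return op_is_registered(op_type, OpProtoHolder.instance().get_op_proto)
```

If `check_paddle_op()` returns `False` in the same environment, the operator is missing from the installed build, and the failure would not depend on the YAML configuration.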
The ResNet50_vd_10w_finetune.yaml file is configured as follows:

mode: 'train'
ARCHITECTURE:
    name: 'ResNet50_vd'
pretrained_model: "F:/pythonproject/PaddleClas/PaddleClas/ResNet50_vd_10w_pretrained/ResNet50_vd_10w_pretrained"
model_save_dir: "./output/"
classes_num: 5
total_images: 11745
save_interval: 1
validate: True
valid_interval: 1
epochs: 20
topk: 2
image_shape: [3, 224, 224]

LEARNING_RATE:
    function: 'Cosine'
    params:
        lr: 0.00375

OPTIMIZER:
    function: 'Momentum'
    params:
        momentum: 0.9
    regularizer:
        function: 'L2'
        factor: 0.000001

TRAIN:
    batch_size: 32
    num_workers: 4
    file_list: "F:/pythonproject\PaddleClas/PaddleClas/dataset/driver/train_list.txt"
    data_dir: "F:/pythonproject\PaddleClas/PaddleClas/dataset/driver/"
    shuffle_seed: 0
    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:

VALID:
    batch_size: 20
    num_workers: 4
    file_list: "F:/pythonproject\PaddleClas/PaddleClas/dataset/driver/val_list.txt"
    data_dir: "F:/pythonproject\PaddleClas/PaddleClas/dataset/driver/"
    shuffle_seed: 0
    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - ResizeImage:
            resize_short: 256
        - CropImage:
            size: 224
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:
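A side note on the configuration: the `file_list` and `data_dir` values mix `/` and `\` inside double-quoted strings. In YAML, a backslash inside a double-quoted string starts an escape sequence, so `\P` may not be read as a literal backslash followed by `P`. The traceback above fails before any data is loaded, so this is probably unrelated to the reported error. The following is a minimal sketch, using only the standard library, that flags such values in the config text; the function names are hypothetical, and the regex is a simplification that ignores escaped quotes.

```python
import re

# Matches simple double-quoted values on one line (simplified: no escaped quotes).
_QUOTED = re.compile(r'"([^"\n]*)"')


def find_backslash_values(config_text):
    """Return every double-quoted value in config_text that contains a backslash."""
    return [m.group(1) for m in _QUOTED.finditer(config_text) if "\\" in m.group(1)]


def to_forward_slashes(value):
    """Rewrite a Windows-style path so it uses forward slashes only."""
    return value.replace("\\", "/")
```

For example, after reading the YAML file into a string, `find_backslash_values(text)` lists the paths to review, and `to_forward_slashes` gives a form that matches the `pretrained_model` path, which already uses forward slashes throughout.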