# Getting Started

For setting up the running environment, please refer to the [installation instructions](INSTALL.md).


## Training

#### Single-GPU Training


```bash
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=$PYTHONPATH:.
python tools/train.py -c configs/faster_rcnn_r50_1x.yml
```

#### Multi-GPU Training

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTHONPATH=$PYTHONPATH:.
python tools/train.py -c configs/faster_rcnn_r50_1x.yml
```

#### CPU Training

```bash
export CPU_NUM=8
export PYTHONPATH=$PYTHONPATH:.
python tools/train.py -c configs/faster_rcnn_r50_1x.yml -o use_gpu=false
```

##### Optional arguments

- `-r` or `--resume_checkpoint`: Checkpoint path for resuming training, e.g. `-r output/faster_rcnn_r50_1x/10000`
- `--eval`: Whether to perform evaluation during training; default is `False`
- `--output_eval`: When evaluating during training, sets the evaluation directory; default is the current directory
- `-d` or `--dataset_dir`: Dataset path, same as `dataset_dir` in the config, e.g. `-d dataset/coco`
- `-c`: Select the config file; all config files are stored in `configs/`
- `-o`: Set configuration options from the config file, e.g. `-o max_iters=180000`. Options set with `-o` take priority over the file given by `-c`
- `--use_tb`: Whether to record data with [tb-paddle](https://github.com/linshuliang/tb-paddle) for display in Tensorboard; default is `False`
- `--tb_log_dir`: tb-paddle logging directory for scalars; default is `tb_log_dir/scalar`
- `--fp16`: Whether to enable mixed precision training (requires GPU); default is `False`
- `--loss_scale`: Loss scaling factor for mixed precision training; default is `8.0`
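
As a rough sketch of how the options above combine, the snippet below resumes training from the example checkpoint path and enables evaluation. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences for previewing the command; they are not part of PaddleDetection.

```bash
# Hypothetical helper (not part of PaddleDetection): print the command
# when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Preview the command first; set DRY_RUN=0 to actually start training.
DRY_RUN=1

# Resume from the example checkpoint path and evaluate during training.
run python -u tools/train.py -c configs/faster_rcnn_r50_1x.yml \
                             -r output/faster_rcnn_r50_1x/10000 \
                             --eval
```

With `DRY_RUN=1` the command is only printed, which makes it easy to check the flags before committing GPU time.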


##### Examples

- Perform evaluation in training
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTHONPATH=$PYTHONPATH:.
python -u tools/train.py -c configs/faster_rcnn_r50_1x.yml --eval
```

To alternate between training and evaluation, simply pass `--eval`; evaluation then runs at every `snapshot_iter`,
which can be changed in the configuration file. If the evaluation dataset is large and
evaluation slows training down considerably, we suggest evaluating less often or evaluating after training. When evaluation is performed during training,
the model with the highest mAP is saved at each `snapshot_iter`. `best_model` is saved under the same path as `model_final`.


- Configure dataset path
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTHONPATH=$PYTHONPATH:.
python -u tools/train.py -c configs/faster_rcnn_r50_1x.yml \
                         -d dataset/coco
```

- Fine-tune on another task

When fine-tuning a pre-trained model on another task, the pre-trained parameters to exclude from loading can be specified in either of two ways:

1. Set `finetune_exclude_pretrained_params` in the YAML config.
2. Pass `-o finetune_exclude_pretrained_params` on the command line.

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTHONPATH=$PYTHONPATH:.
python -u tools/train.py -c configs/faster_rcnn_r50_1x.yml \
                         -o pretrain_weights=output/faster_rcnn_r50_1x/model_final/ \
                            finetune_exclude_pretrained_params="['cls_score','bbox_pred']"
```

- Mixed Precision Training

Mixed precision training can be enabled with the `--fp16` flag. Currently Faster-FPN, Mask-FPN and YOLOv3 have been verified to work with little to no loss of precision (less than 0.2 mAP).

To speed up mixed precision training, it is recommended to train in multi-process mode, for example:

```bash
export PYTHONPATH=$PYTHONPATH:.
python -m paddle.distributed.launch --selected_gpus 0,1,2,3,4,5,6,7 tools/train.py --fp16 -c configs/faster_rcnn_r50_fpn_1x.yml
```

If the loss becomes `NaN` during training, try tweaking the `--loss_scale` value. Please refer to the NVIDIA [documentation](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#mptrain) on mixed precision training for details.

Also, please note mixed precision training currently requires changing `norm_type` from `affine_channel` to `bn`.


##### NOTES

- `CUDA_VISIBLE_DEVICES` can specify a different number of GPUs, e.g. `export CUDA_VISIBLE_DEVICES=0,1,2,3`. For how to adjust the settings to the number of GPUs, see the [FAQ](#faq).
- The dataset is stored in `dataset/coco` by default (configurable).
- The dataset is downloaded automatically and cached in `~/.cache/paddle/dataset` if it is not found locally.
- The pretrained model is downloaded automatically and cached in `~/.cache/paddle/weights`.
- Model checkpoints are saved in `output` by default (configurable).
- When fine-tuning, users can set `pretrain_weights` to the models published by PaddlePaddle. Parameters matched by the fields in `finetune_exclude_pretrained_params` are ignored during loading; the fields support wildcard matching. For details, please refer to [Transfer Learning](TRANSFER_LEARNING.md).
- To check the hyperparameters used, please refer to the [configs](../configs).
- Training RCNN models on CPU is not supported on PaddlePaddle<=1.5.1; this will be fixed in a later version.



## Evaluation

```bash
# run on GPU with:
export PYTHONPATH=$PYTHONPATH:.
export CUDA_VISIBLE_DEVICES=0
python tools/eval.py -c configs/faster_rcnn_r50_1x.yml
```

#### Optional arguments

- `-d` or `--dataset_dir`: Dataset path, same as `dataset_dir` in the config, e.g. `-d dataset/coco`
- `--output_eval`: Evaluation directory; default is the current directory.
- `-o`: Set configuration options from the config file, e.g. `-o weights=output/faster_rcnn_r50_1x/model_final`
- `--json_eval`: Whether to evaluate with an existing `bbox.json` or `mask.json`; default is `False`. The directory containing the JSON file is given by the `-f` argument.

#### Examples

- Evaluate with a specified weights path and dataset path
```bash
# run on GPU with:
export PYTHONPATH=$PYTHONPATH:.
export CUDA_VISIBLE_DEVICES=0
python -u tools/eval.py -c configs/faster_rcnn_r50_1x.yml \
                        -o weights=output/faster_rcnn_r50_1x/model_final \
                        -d dataset/coco
```

- Evaluate with JSON
```bash
# run on GPU with:
export PYTHONPATH=$PYTHONPATH:.
export CUDA_VISIBLE_DEVICES=0
python tools/eval.py -c configs/faster_rcnn_r50_1x.yml \
                     --json_eval \
                     -f evaluation/
```

The JSON file must be named `bbox.json` or `mask.json` and placed in the `evaluation/` directory. If the `-f` parameter is omitted, the current directory is used.
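
Since `--json_eval` depends on a correctly named file being present, a small pre-flight check can catch a wrong directory early. The following is a minimal sketch; `has_eval_json` and `EVAL_DIR` are hypothetical names introduced here for illustration and are not part of `tools/eval.py`.

```bash
# Hypothetical helper: succeed if the directory (default: current directory)
# contains bbox.json or mask.json, the two file names --json_eval expects.
has_eval_json() {
    dir=${1:-.}
    [ -f "$dir/bbox.json" ] || [ -f "$dir/mask.json" ]
}

EVAL_DIR=evaluation/
if has_eval_json "$EVAL_DIR"; then
    echo "Found a result file in $EVAL_DIR; --json_eval -f $EVAL_DIR can be used."
else
    echo "No bbox.json or mask.json found in $EVAL_DIR."
fi
```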

#### NOTES

- Checkpoint is loaded from `output` by default (configurable).
- Multi-GPU evaluation for R-CNN and SSD models is not supported at the
  moment, but it is a planned feature.


## Inference


- Run inference on a single image:

```bash
# run on GPU with:
export PYTHONPATH=$PYTHONPATH:.
export CUDA_VISIBLE_DEVICES=0
python tools/infer.py -c configs/faster_rcnn_r50_1x.yml --infer_img=demo/000000570688.jpg
```

- Multi-image inference:

```bash
# run on GPU with:
export PYTHONPATH=$PYTHONPATH:.
export CUDA_VISIBLE_DEVICES=0
python tools/infer.py -c configs/faster_rcnn_r50_1x.yml --infer_dir=demo
```

#### Optional arguments

- `--output_dir`: Directory for storing the output visualization files.
- `--draw_threshold`: Threshold for keeping a result in the visualization. Default is 0.5.
- `--save_inference_model`: Save the inference model in `output_dir` if `True`.
- `--use_tb`: Whether to record data with [tb-paddle](https://github.com/linshuliang/tb-paddle) for display in Tensorboard; default is `False`
- `--tb_log_dir`: tb-paddle logging directory for images; default is `tb_log_dir/image`

#### Examples

- Specify the output directory and set the draw threshold

```bash
# run on GPU with:
export PYTHONPATH=$PYTHONPATH:.
export CUDA_VISIBLE_DEVICES=0
python tools/infer.py -c configs/faster_rcnn_r50_1x.yml \
                      --infer_img=demo/000000570688.jpg \
                      --output_dir=infer_output/ \
                      --draw_threshold=0.5 \
                      -o weights=output/faster_rcnn_r50_1x/model_final \
                      --use_tb=True
```

The visualization files are saved in `output` by default. To specify a different path, add the `--output_dir=` flag.
`--draw_threshold` is an optional argument; the default is 0.5.
Different thresholds produce different results, depending on the outcome of [NMS](https://ieeexplore.ieee.org/document/1699659).
To run inference with a custom model, set `-o weights` to the path of that model.
`--use_tb` is also optional; if `--use_tb` is `True`, tb-paddle records data in the logging directory (`--tb_log_dir`)
so that the results can be viewed in Tensorboard.

- Save inference model

```bash
# run on GPU with:
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=$PYTHONPATH:.
python tools/infer.py -c configs/faster_rcnn_r50_1x.yml \
                      --infer_img=demo/000000570688.jpg \
                      --save_inference_model
```

Setting `--save_inference_model` saves the inference model, which can then be loaded by the PaddlePaddle prediction library.


## FAQ

**Q:**  Why do I get `NaN` loss values during single-GPU training? <br/>
**A:**  The default learning rate is tuned for multi-GPU training (8 GPUs), so it must
be adapted for single-GPU training accordingly (e.g., divided by 8).
The following settings are equivalent; choose the row that matches the number of GPUs: <br/>


| GPU number  | Learning rate  | Max_iters | Milestones       |
| :---------: | :------------: | :-------: | :--------------: |
| 2           | 0.0025         | 720000    | [480000, 640000] |
| 4           | 0.005          | 360000    | [240000, 320000] |
| 8           | 0.01           | 180000    | [120000, 160000] |
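
The rows above follow a linear scaling pattern: relative to the 8-GPU baseline, the learning rate shrinks with the GPU count while `max_iters` and the milestones grow by the same factor. The sketch below reproduces that arithmetic; `count_gpus` and `scale_schedule` are hypothetical helper names, and the baseline values are taken from the 8-GPU row of the table. The printed values still have to be written into the config by hand.

```bash
# Hypothetical helper: number of comma-separated device IDs, e.g. "0,1,2,3" -> 4.
count_gpus() {
    echo "$1" | awk -F',' '{ print NF }'
}

# Hypothetical helper: scale the 8-GPU baseline from the table
# (learning rate 0.01, max_iters 180000, milestones [120000, 160000]) to N GPUs.
# Intended for GPU counts that divide the schedule evenly (1, 2, 4, 8).
scale_schedule() {
    awk -v n="$1" 'BEGIN {
        printf "learning_rate=%g max_iters=%d milestones=[%d,%d]\n",
               0.01 * n / 8, 180000 * 8 / n, 120000 * 8 / n, 160000 * 8 / n
    }'
}

num_gpus=$(count_gpus "0,1")
scale_schedule "$num_gpus"
# prints: learning_rate=0.0025 max_iters=720000 milestones=[480000,640000]
```

For a single GPU the same rule gives a learning rate of 0.00125, which matches the "divide by 8" advice in the answer above.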

**Q:**  How can I reduce GPU memory usage? <br/>
**A:**  Setting the environment variable `FLAGS_conv_workspace_size_limit` to a smaller
number can reduce the GPU memory footprint without affecting training speed.
Taking Mask-RCNN (R50) as an example, with `export FLAGS_conv_workspace_size_limit=512`
the batch size can reach 4 per GPU (Tesla V100 16GB).


**Q:**  How can I change the data preprocessing? <br/>
**A:**  Set `sample_transform` in the configuration. Note that **the complete list of transforms** must be given in the configuration,
for example `DecodeImage`, `NormalizeImage` and `Permute` for RCNN models. For a detailed description, please refer
to [config_example](config_example).