English | [简体中文](DATA_cn.md)

# Data Pipeline

## Introduction

The data pipeline is responsible for loading and converting data. Each
resulting data sample is a tuple of np.ndarrays.
For example, Faster R-CNN training uses samples of this format: `[(im,
im_info, im_id, gt_bbox, gt_class, is_crowd), (...)]`.
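
A minimal sketch of what one such sample might look like is shown below; the shapes,
dtypes and values are illustrative assumptions, not what the actual pipeline produces:
```python
import numpy as np

# Illustrative sample only; real shapes and dtypes come from the pipeline described below.
im = np.zeros((3, 800, 1333), dtype=np.float32)           # preprocessed image, CHW
im_info = np.array([800., 1333., 1.0], dtype=np.float32)  # resized height, width and scale
im_id = np.array([42], dtype=np.int64)                     # image id within the dataset
gt_bbox = np.zeros((5, 4), dtype=np.float32)               # 5 ground-truth boxes: x1, y1, x2, y2
gt_class = np.zeros((5, 1), dtype=np.int32)                # class index for each box
is_crowd = np.zeros((5, 1), dtype=np.int32)                # COCO "crowd" flag for each box

sample = (im, im_info, im_id, gt_bbox, gt_class, is_crowd)
batch = [sample]  # a batch is a list of such tuples
```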

### Implementation

The data pipeline consists of four sub-systems: data parsing, image
pre-processing, data conversion and data feeding APIs.

Data samples are collected to form `data.Dataset`s. Usually three sets are
needed: one each for training, validation, and testing.

First, `data.source` loads the data files into memory, then
`data.transform` processes them, and lastly, the batched samples
are fetched by `data.Reader`.

Sub-system details:
1. Data parsing
Parses various data sources and creates `data.Dataset` instances. Currently, the
following data sources are supported:

- COCO data source

Loads `COCO` type datasets with a directory structure like this:

  ```
  dataset/coco/
  ├── annotations
  │   ├── instances_train2014.json
  │   ├── instances_train2017.json
  │   ├── instances_val2014.json
  │   ├── instances_val2017.json
  │   |   ...
  ├── train2017
  │   ├── 000000000009.jpg
  │   ├── 000000580008.jpg
  │   |   ...
  ├── val2017
  │   ├── 000000000139.jpg
  │   ├── 000000000285.jpg
  │   |   ...
  |   ...
  ```

- Pascal VOC data source

Loads `Pascal VOC` like datasets with a directory structure like this:

  ```
  dataset/voc/
  ├── train.txt
  ├── val.txt
  ├── test.txt
  ├── label_list.txt (optional)
  ├── VOCdevkit/VOC2007
  │   ├── Annotations
  │       ├── 001789.xml
  │       |   ...
  │   ├── JPEGImages
  │       ├── 001789.jpg
  │       |   ...
  │   ├── ImageSets
  │       |   ...
  ├── VOCdevkit/VOC2012
  │   ├── Annotations
  │       ├── 003876.xml
  │       |   ...
  │   ├── JPEGImages
  │       ├── 003876.jpg
  │       |   ...
  │   ├── ImageSets
  │       |   ...
  |   ...
  ```

**NOTE:** If you set `use_default_label=False` in yaml configs, the `label_list.txt`
of the Pascal VOC dataset will be read; otherwise, `label_list.txt` is unnecessary and
the default Pascal VOC label list defined in
[voc\_loader.py](../ppdet/data/source/voc_loader.py) will be used.

- Roidb data source
A generalized data source serialized as pickle files, which have the following
structure:
```python
(records, cname2id)
# `cname2id` is a `dict` which maps category names to class IDs
# and `records` is a list of dicts with this structure:
{
    'im_file': im_fname,    # image file name
    'im_id': im_id,         # image ID
    'h': im_h,              # height of image
    'w': im_w,              # width of image
    'is_crowd': is_crowd,   # crowd marker
    'gt_class': gt_class,   # ground truth class
    'gt_bbox': gt_bbox,     # ground truth bounding box
    'gt_poly': gt_poly,     # ground truth segmentation
}
```

We provide a tool to generate roidb data sources. To convert a `COCO` or `VOC`
like dataset, run this command:
```sh
# --type: the type of original data (xml or json)
# --annotation: the path of the file which contains the names of the annotation files
# --save-dir: the save path
# --samples: the number of samples (default is -1, which means all samples in the dataset)
python ./ppdet/data/tools/generate_data_for_training.py \
            --type=json \
            --annotation=./annotations/instances_val2017.json \
            --save-dir=./roidb \
            --samples=-1
```
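
For illustration, a file produced this way could be loaded back with plain `pickle`.
This is a minimal sketch assuming the `(records, cname2id)` layout described above;
the file name under `--save-dir` is an assumption:
```python
import pickle

# Minimal sketch: the roidb file name is an assumption; adjust it to whatever
# the tool wrote into --save-dir. The layout follows the structure shown above.
with open('./roidb/instances_val2017.roidb', 'rb') as f:
    records, cname2id = pickle.load(f)

print(len(records), 'samples,', len(cname2id), 'categories')
first = records[0]
print(first['im_file'], first['h'], first['w'])
```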

 2. Image preprocessing
The `data.transform.operator` module provides operations such as image
decoding, expanding, cropping, etc. Multiple operators are combined to form
larger processing pipelines.
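
As a toy illustration of this composition pattern, with plain functions standing in
for the real operator classes in `transform/operators.py`:
```python
import numpy as np

# Toy stand-ins for real operators: each one maps a sample dict to a new
# sample dict, so a pipeline is simply a list of operators applied in order.
def decode(sample):
    sample['image'] = np.asarray(sample['image'], dtype=np.float32)
    return sample

def normalize(sample, mean=127.5, std=128.0):
    sample['image'] = (sample['image'] - mean) / std
    return sample

pipeline = [decode, normalize]

sample = {'image': [[0, 255], [128, 64]], 'im_id': 0}
for op in pipeline:
    sample = op(sample)
print(sample['image'])
```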

 3. Data transformer
Transforms a `data.Dataset` to achieve various desired effects. Notably, the
`data.transform.parallel_map` transformer accelerates image processing with
multi-threads or multi-processes. More transformers can be found in
`data.transform.transformer`.

 4. Data feeding APIs
To facilitate data pipeline building, we combine multiple `data.Dataset`s to
form a `data.Reader` which can provide data for training, validation and
testing respectively. Users can simply call `Reader.[train|eval|infer]` to get
the corresponding data stream. Many aspects of the `Reader`, such as storage
location, preprocessing pipeline and acceleration mode, can be configured with yaml
files.
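
A rough usage sketch is shown below; `build_reader_from_config` is a hypothetical
stand-in for the yaml-driven construction of the `Reader`, not a real ppdet function:
```python
# Hypothetical sketch: `build_reader_from_config` stands in for whatever builds
# the data.Reader from the yaml configuration; it is not a real ppdet API.
reader = build_reader_from_config('configs/faster_rcnn_r50_1x.yml')

for batch in reader.train():  # likewise reader.eval() / reader.infer()
    # each batch is a list of sample tuples as described in the Introduction
    pass
```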

### APIs

The main APIs are as follows:

1. Data parsing

 - `source/coco_loader.py`: COCO dataset parser. [source](../ppdet/data/source/coco_loader.py)
 - `source/voc_loader.py`: Pascal VOC dataset parser. [source](../ppdet/data/source/voc_loader.py)
 [Note] To use a non-default label list for VOC datasets, a `label_list.txt`
 file is needed; one can use the provided label list
 (`data/pascalvoc/ImageSets/Main/label_list.txt`) or generate a custom one (with `tools/generate_data_for_training.py`). Also, the `use_default_label` option should
 be set to `false` in the configuration file.
 - `source/loader.py`: Roidb dataset parser. [source](../ppdet/data/source/loader.py)

2. Operator
 `transform/operators.py`: Contains a variety of data augmentation methods, including:
- `DecodeImage`: Read images in RGB format.
- `RandomFlipImage`: Horizontal flip.
- `RandomDistort`: Distort brightness, contrast, saturation, and hue.
- `ResizeImage`: Resize image with interpolation.
- `RandomInterpImage`: Use a random interpolation method to resize the image.
- `CropImage`: Crop image with respect to different scale, aspect ratio, and overlap.
- `ExpandImage`: Pad image to a larger size, padding filled with mean image value.
- `NormalizeImage`: Normalize image pixel values.
- `NormalizeBox`: Normalize the bounding box.
- `Permute`: Arrange the channels of the image and optionally convert image to BGR format.
- `MixupImage`: Mixup two images with given fraction<sup>[1](#mix)</sup>.

<a name="mix">[1]</a> Please refer to [this paper](https://arxiv.org/pdf/1710.09412.pdf)
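
As a toy illustration of the mixup idea (not the actual `MixupImage` operator; real
implementations also merge the ground-truth boxes of both images and typically pad
them to a common size first):
```python
import numpy as np

# Toy mixup: blend two same-sized images with a factor drawn from a Beta distribution.
image_a = np.random.rand(300, 300, 3).astype(np.float32)
image_b = np.random.rand(300, 300, 3).astype(np.float32)
lam = np.random.beta(1.5, 1.5)
mixed = lam * image_a + (1.0 - lam) * image_b
```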

`transform/arrange_sample.py`: Assemble the data samples needed by different models.
3. Transformer
`transform/post_map.py`: Transformations that operate on whole batches, mainly for:
- Padding the whole batch to given stride values
- Resizing images to multiple scales
- Randomly adjusting the image size of the batch data
`transform/transformer.py`: Data filtering and batching.
`transform/parallel_map.py`: Accelerate data processing with multi-threads/multi-processes.
4. Reader
`reader.py`: Combine source and transforms, return batch data according to `max_iter`.
`data_feed.py`: Configure default parameters for `reader.py`.


### Usage

#### Canned Datasets

Presets for common datasets, e.g., `COCO` and `Pascal VOC`, are included. In
most cases, users can simply use these canned datasets as is. Moreover, the
whole data pipeline is fully customizable through the yaml configuration files.

#### Custom Datasets

- Option 1: Convert the dataset to COCO format.
```sh
 # a small utility (`tools/x2coco.py`) is provided to convert
 # Labelme-annotated datasets or cityscape datasets to COCO format.
 python ./ppdet/data/tools/x2coco.py --dataset_type labelme \
                                --json_input_dir ./labelme_annos/ \
                                --image_input_dir ./labelme_imgs/ \
                                --output_dir ./cocome/ \
                                --train_proportion 0.8 \
                                --val_proportion 0.2 \
                                --test_proportion 0.0
 # --dataset_type: The data format which needs to be converted. Currently supported: 'labelme' and 'cityscape'
 # --json_input_dir: The path of json files which are annotated by Labelme.
 # --image_input_dir: The path of images.
 # --output_dir: The path of the converted COCO dataset.
 # --train_proportion: The train proportion of annotation data.
 # --val_proportion: The validation proportion of annotation data.
 # --test_proportion: The test proportion of annotation data.
```

- Option 2:

1. Add `source/XX_loader.py` and implement the `load` function, following the
   example of `source/coco_loader.py` and `source/voc_loader.py`.
2. Modify the `load` function in `source/loader.py` to make use of the newly
   added data loader.
3. Modify `source/__init__.py` accordingly.
```python
if data_cf['type'] in ['VOCSource', 'COCOSource', 'RoiDbSource']:
    source_type = 'RoiDbSource'
# Replace the above code with the following code:
if data_cf['type'] in ['VOCSource', 'COCOSource', 'RoiDbSource', 'XXSource']:
    source_type = 'RoiDbSource'
```
4. In the configuration file, define the `type` of `dataset` as `XXSource`.
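A minimal sketch of the relevant part of such a configuration (the surrounding keys
are omitted; they depend on the rest of your setup):
```yaml
# Minimal sketch; only the dataset type is shown here.
dataset:
  type: XXSource
```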

#### How to add data pre-processing?

- To add a pre-processing operation for a single image, refer to the classes in
  `transform/operators.py`, and implement the desired transformation with a new
  class (see the sketch after this list).
- To add pre-processing for a batch, one needs to modify the `build_post_map`
  function in `transform/post_map.py`.
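
For the single-image case, a minimal sketch of a new operator class is shown below;
the `BaseOperator` base class, `register_op` decorator and `__call__` signature follow
the pattern used in `transform/operators.py`, but should be checked against the actual
code:
```python
import numpy as np

from ppdet.data.transform.operators import BaseOperator, register_op


@register_op
class GrayscaleImage(BaseOperator):
    """Hypothetical operator: convert a decoded RGB image to 3-channel grayscale."""

    def __call__(self, sample, context=None):
        im = sample['image'].astype(np.float32)
        gray = 0.299 * im[..., 0] + 0.587 * im[..., 1] + 0.114 * im[..., 2]
        sample['image'] = np.stack([gray] * 3, axis=-1)
        return sample
```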