English | [简体中文](DATA_cn.md)

# Data Pipeline

## Introduction

The data pipeline is responsible for loading and converting data. Each
resulting data sample is a tuple of np.ndarrays.
For example, Faster R-CNN training uses samples of this format: `[(im,
im_info, im_id, gt_bbox, gt_class, is_crowd), (...)]`.

### Implementation

The data pipeline consists of four sub-systems: data parsing, image
pre-processing, data conversion, and data feeding APIs.

Data samples are collected to form `data.Dataset`s; usually three sets are
needed for training, validation, and testing respectively.

First, `data.source` loads the data files into memory, then
`data.transform` processes them, and lastly, the batched samples
are fetched by `data.Reader`.
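
Schematically, the flow can be sketched as follows (the calls below are illustrative placeholders, not the actual API; real pipelines are wired up through the yaml configuration):

```python
# Conceptual outline only; names mirror the sub-systems described above.
# dataset = data.source.load(...)            # data parsing: files -> data.Dataset
# dataset = data.transform.process(dataset)  # image pre-processing / conversion
# batches = data.Reader(dataset).train()     # data feeding: batched samples
```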

Sub-system details:
1. Data parsing
Parses various data sources and creates `data.Dataset` instances. Currently,
the following data sources are supported:

- COCO data source

Loads `COCO`-type datasets with a directory structure like this:

  ```
  dataset/coco/
  ├── annotations
  │   ├── instances_train2014.json
  │   ├── instances_train2017.json
  │   ├── instances_val2014.json
  │   ├── instances_val2017.json
  │   |   ...
  ├── train2017
  │   ├── 000000000009.jpg
  │   ├── 000000580008.jpg
  │   |   ...
  ├── val2017
  │   ├── 000000000139.jpg
  │   ├── 000000000285.jpg
  │   |   ...
  |   ...
  ```

- Pascal VOC data source

Loads `Pascal VOC`-like datasets with a directory structure like this:

  ```
  dataset/voc/
  ├── train.txt
  ├── val.txt
  ├── test.txt
  ├── label_list.txt (optional)
  ├── VOCdevkit/VOC2007
  │   ├── Annotations
  │       ├── 001789.xml
  │       |   ...
  │   ├── JPEGImages
  │       ├── 001789.jpg
  │       |   ...
  │   ├── ImageSets
  │       |   ...
  ├── VOCdevkit/VOC2012
  │   ├── Annotations
  │       ├── 003876.xml
  │       |   ...
  │   ├── JPEGImages
  │       ├── 003876.jpg
  │       |   ...
  │   ├── ImageSets
  │       |   ...
  |   ...
  ```

**NOTE:** If you set `use_default_label=False` in the yaml config, the `label_list.txt`
of the Pascal VOC dataset will be read; otherwise, `label_list.txt` is unnecessary and
the default Pascal VOC label list defined in
[voc\_loader.py](../ppdet/data/source/voc_loader.py) will be used.

- Roidb data source
A generalized data source serialized as pickle files, which have the following
structure:
```python
(records, cname2id)
# `cname2id` is a `dict` that maps category names to class IDs,
# and `records` is a list of dicts with this structure:
{
    'im_file': im_fname,    # image file name
    'im_id': im_id,         # image ID
    'h': im_h,              # height of image
    'w': im_w,              # width of image
    'is_crowd': is_crowd,   # crowd marker
    'gt_class': gt_class,   # ground truth class
    'gt_bbox': gt_bbox,     # ground truth bounding box
    'gt_poly': gt_poly,     # ground truth segmentation
}
```

We provide a tool to generate roidb data sources. To convert a `COCO` or `VOC`
like dataset, run this command:
```sh
# --type: the type of the original data (xml or json)
# --annotation: the path of the file that contains the names of the annotation files
# --save-dir: the save path
# --samples: the number of samples (default is -1, which means all data in the dataset)
python ./ppdet/data/tools/generate_data_for_training.py \
            --type=json \
            --annotation=./annotations/instances_val2017.json \
            --save-dir=./roidb \
            --samples=-1
```
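
The resulting pickle can then be inspected with plain Python. A minimal sketch (the exact file name under `--save-dir` is an assumption; use whatever the tool actually produced):

```python
import pickle

# hypothetical output name; adjust to the actual file under --save-dir
with open('./roidb/instances_val2017.roidb', 'rb') as f:
    records, cname2id = pickle.load(f)

print('%d samples, %d categories' % (len(records), len(cname2id)))
print(records[0]['im_file'], records[0]['h'], records[0]['w'])
```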

2. Image preprocessing
The `data.transform.operator` module provides operations such as image
decoding, expanding, cropping, etc. Multiple operators are combined to form
larger processing pipelines.
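
To illustrate the composition pattern, here is a self-contained sketch with stand-in operators (the real operator classes live in `transform/operators.py` and are listed in the APIs section below; the sample-dict interface is an assumption for illustration):

```python
import numpy as np

def fake_decode(sample):
    # stand-in for `DecodeImage`: fabricate an HWC image array
    sample['image'] = np.zeros((480, 640, 3), dtype=np.uint8)
    return sample

def fake_flip(sample):
    # stand-in for `RandomFlipImage` with flip probability 1.0
    sample['image'] = sample['image'][:, ::-1, :]
    return sample

# each operator feeds its output to the next, forming a larger pipeline
pipeline = [fake_decode, fake_flip]
sample = {'im_file': 'dummy.jpg'}
for op in pipeline:
    sample = op(sample)
print(sample['image'].shape)  # (480, 640, 3)
```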

3. Data transformer
Transforms a `data.Dataset` to achieve various desired effects. Notably, the
`data.transform.parallel_map` transformer accelerates image processing with
multi-threads or multi-processes. More transformers can be found in
`data.transform.transformer`.

4. Data feeding APIs
To facilitate data pipeline building, we combine multiple `data.Dataset` to
form a `data.Reader` which can provide data for training, validation and
testing respectively. Users can simply call `Reader.[train|eval|infer]` to get
the corresponding data stream. Many aspects of the `Reader`, such as storage
location, preprocessing pipeline, and acceleration mode, can be configured with
yaml files.
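
As a rough illustration of consuming such a stream, the sketch below substitutes a dummy object for a configured `Reader` (building a real one is driven by `reader.py`, `data_feed.py`, and the yaml files):

```python
import numpy as np

class DummyReader(object):
    """Stand-in for a configured `data.Reader`; yields fake Faster R-CNN batches."""

    def train(self):
        for i in range(2):
            im = np.zeros((3, 800, 1333), dtype=np.float32)          # CHW image
            im_info = np.array([800., 1333., 1.], dtype=np.float32)  # h, w, scale
            im_id = np.array([i])
            gt_bbox = np.zeros((1, 4), dtype=np.float32)
            gt_class = np.zeros((1, 1), dtype=np.int32)
            is_crowd = np.zeros((1, 1), dtype=np.int32)
            yield [(im, im_info, im_id, gt_bbox, gt_class, is_crowd)]

reader = DummyReader()
for batch in reader.train():
    for im, im_info, im_id, gt_bbox, gt_class, is_crowd in batch:
        print(im.shape, im_id)  # a real loop would feed these to the network
```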

### APIs

The main APIs are as follows:

1. Data parsing

 - `source/coco_loader.py`: COCO dataset parser. [source](../ppdet/data/source/coco_loader.py)
 - `source/voc_loader.py`: Pascal VOC dataset parser. [source](../ppdet/data/source/voc_loader.py)
 [Note] To use a non-default label list for VOC datasets, a `label_list.txt`
 file is needed; one can use the provided label list
 (`data/pascalvoc/ImageSets/Main/label_list.txt`) or generate a custom one (with
 `tools/generate_data_for_training.py`). Also, the `use_default_label` option
 should be set to `false` in the configuration file.
 - `source/loader.py`: Roidb dataset parser. [source](../ppdet/data/source/loader.py)

2. Operator
 `transform/operators.py`: Contains a variety of data augmentation methods, including:
- `DecodeImage`: Read images in RGB format.
- `RandomFlipImage`: Horizontal flip.
- `RandomDistort`: Distort brightness, contrast, saturation, and hue.
- `ResizeImage`: Resize image with interpolation.
- `RandomInterpImage`: Use a random interpolation method to resize the image.
- `CropImage`: Crop image with respect to different scale, aspect ratio, and overlap.
- `ExpandImage`: Pad image to a larger size, padding filled with mean image value.
- `NormalizeImage`: Normalize image pixel values.
- `NormalizeBox`: Normalize the bounding box.
- `Permute`: Arrange the channels of the image and optionally convert image to BGR format.
- `MixupImage`: Mixup two images with given fraction<sup>[1](#mix)</sup>.

<a name="mix">[1]</a> Please refer to [this paper](https://arxiv.org/pdf/1710.09412.pdf)

`transform/arrange_sample.py`: Assemble the data samples needed by different models.
3. Transformer
`transform/post_map.py`: Transformations that operate on whole batches, mainly for:
- Padding the whole batch to given stride values
- Resizing images to multiple scales
- Randomly adjusting the image size of the batch data

`transform/transformer.py`: Data filtering and batching.
`transform/parallel_map.py`: Accelerate data processing with multi-threads/multi-processes.
4. Reader
`reader.py`: Combine source and transforms, return batch data according to `max_iter`.
`data_feed.py`: Configure default parameters for `reader.py`.


### Usage

#### Canned Datasets

Presets for common datasets, e.g., `COCO` and `Pascal VOC`, are included. In
most cases, users can simply use these canned datasets as-is. Moreover, the
whole data pipeline is fully customizable through the yaml configuration files.

#### Custom Datasets

- Option 1: Convert the dataset to COCO format.
```sh
 # a small utility (`tools/x2coco.py`) is provided to convert
 # a Labelme-annotated dataset or a cityscape dataset to COCO format.
 python ./ppdet/data/tools/x2coco.py --dataset_type labelme \
                                --json_input_dir ./labelme_annos/ \
                                --image_input_dir ./labelme_imgs/ \
                                --output_dir ./cocome/ \
                                --train_proportion 0.8 \
                                --val_proportion 0.2 \
                                --test_proportion 0.0
 # --dataset_type: The data format to be converted. Currently supported: 'labelme' and 'cityscape'.
 # --json_input_dir: The path of the json files annotated by Labelme.
 # --image_input_dir: The path of the images.
 # --output_dir: The path of the converted COCO dataset.
 # --train_proportion: The training proportion of the annotation data.
 # --val_proportion: The validation proportion of the annotation data.
 # --test_proportion: The inference proportion of the annotation data.
```

- Option 2: Add a custom data source:

1. Add `source/XX_loader.py` and implement the `load` function, following the
   example of `source/coco_loader.py` and `source/voc_loader.py`.
2. Modify the `load` function in `source/loader.py` to make use of the newly
   added data loader.
3. Modify `source/__init__.py` accordingly, for example:
```python
if data_cf['type'] in ['VOCSource', 'COCOSource', 'RoiDbSource']:
    source_type = 'RoiDbSource'
# Replace the above code with the following code:
if data_cf['type'] in ['VOCSource', 'COCOSource', 'RoiDbSource', 'XXSource']:
    source_type = 'RoiDbSource'
```
4. In the configuration file, set the `type` of `dataset` to `XXSource`.

#### How to add data pre-processing?

- To add a pre-processing operation for a single image, refer to the classes in
  `transform/operators.py` and implement the desired transformation with a new
  class (see the sketch below).
- To add pre-processing for a batch, modify the `build_post_map` function in
  `transform/post_map.py`.
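
For the single-image case, here is a self-contained sketch of such a class (the sample-dict interface and the `'image'` key are assumptions based on the description above; a real operator should follow the base-class and registration conventions in `transform/operators.py`):

```python
import numpy as np

class RandomGrayscale(object):
    """Hypothetical operator: replace the image with 3-channel grayscale
    with probability `prob`, following the assumed sample-dict interface."""

    def __init__(self, prob=0.1):
        self.prob = prob

    def __call__(self, sample, context=None):
        if np.random.uniform(0., 1.) < self.prob:
            im = sample['image'].astype(np.float32)
            gray = im.mean(axis=2, keepdims=True)         # HWC -> HW1
            sample['image'] = np.repeat(gray, 3, axis=2)  # back to 3 channels
        return sample
```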