# Data Pipeline

## Introduction

The data pipeline is responsible for loading and converting data. Each resulting data sample is a tuple of `np.ndarray`s. For example, Faster R-CNN training uses samples of this format: `[(im, im_info, im_id, gt_bbox, gt_class, is_crowd), (...)]`.

### Implementation

The data pipeline consists of four sub-systems: data parsing, image pre-processing, data conversion, and data feeding APIs.

Data samples are collected to form `data.Dataset`s. Usually three sets are needed, for training, validation, and testing respectively.

First, `data.source` loads the data files into memory, then `data.transform` processes them, and lastly, the batched samples are fetched by `data.Reader`.

Sub-system details:

1. Data parsing

    Parses various data sources and creates `data.Dataset` instances. Currently, the following data sources are supported:

    - COCO data source

        Loads `COCO` type datasets with a directory structure like this:

        ```
        dataset/coco/
        ├── annotations
        │   ├── instances_train2014.json
        │   ├── instances_train2017.json
        │   ├── instances_val2014.json
        │   ├── instances_val2017.json
        │   ...
        ├── train2017
        │   ├── 000000000009.jpg
        │   ├── 000000580008.jpg
        │   ...
        ├── val2017
        │   ├── 000000000139.jpg
        │   ├── 000000000285.jpg
        │   ...
        ```

    - Pascal VOC data source

        Loads `Pascal VOC` like datasets with a directory structure like this:

        ```
        data/pascalvoc/
        ├── Annotations
        │   ├── 000050.xml
        │   ├── 003876.xml
        │   ...
        ├── ImageSets
        │   ├── Main
        │   │   ├── train.txt
        │   │   ├── val.txt
        │   │   ├── test.txt
        │   │   ├── dog_train.txt
        │   │   ├── dog_trainval.txt
        │   │   ├── dog_val.txt
        │   │   ├── dog_test.txt
        │   │   └── ...
        │   ├── Layout
        │   │   └── ...
        │   ├── Segmentation
        │   │   └── ...
        ├── JPEGImages
        │   ├── 000050.jpg
        │   ├── 003876.jpg
        │   ...
        ```

    - Roidb data source

        A generalized data source serialized as pickle files, which have the following structure:

        ```python
        (records, cname2id)
        # `cname2id` is a `dict` which maps category names to class IDs
        # and `records` is a list of dicts with this structure:
        {
            'im_file': im_fname,    # image file name
            'im_id': im_id,         # image ID
            'h': im_h,              # height of image
            'w': im_w,              # width of image
            'is_crowd': is_crowd,   # crowd marker
            'gt_class': gt_class,   # ground truth class
            'gt_bbox': gt_bbox,     # ground truth bounding box
            'gt_poly': gt_poly,     # ground truth segmentation
        }
        ```

        We provide a tool to generate roidb data sources. To convert a `COCO` or `VOC` like dataset, run this command:

        ```sh
        # --type: the type of the original data (xml or json)
        # --annotation: the path of the file that contains the names of the annotation files
        # --save-dir: the save path
        # --samples: the number of samples (default is -1, which means all samples in the dataset)
        python ./ppdet/data/tools/generate_data_for_training.py \
                    --type=json \
                    --annotation=./annotations/instances_val2017.json \
                    --save-dir=./roidb \
                    --samples=-1
        ```

2. Image preprocessing

    The `data.transform.operators` module provides operations such as image decoding, expanding, cropping, etc. Multiple operators are combined to form larger processing pipelines.

3. Data transformer

    Transforms a `data.Dataset` to achieve various desired effects. Notably, the `data.transform.parallel_map` transformer accelerates image processing with multiple threads or processes. More transformers can be found in `data.transform.transformer`.

4. Data feeding APIs

    To facilitate data pipeline building, we combine multiple `data.Dataset`s to form a `data.Reader`, which can provide data for training, validation, and testing respectively. Users can simply call `Reader.[train|eval|infer]` to get the corresponding data stream. Many aspects of the `Reader`, such as storage location, preprocessing pipeline, and acceleration mode, can be configured with yaml files.
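To make the roidb layout described under "Roidb data source" more concrete, here is a minimal sketch that writes and reads back a `(records, cname2id)` pickle using only the standard library. The helper names (`make_record`, `save_roidb`, `load_roidb`), the sample values, and the `[x1, y1, x2, y2]` box layout are assumptions made for illustration. Plain Python lists stand in for the arrays that a real pipeline would probably store, and this is not the code used by `generate_data_for_training.py`.

```python
import os
import pickle
import tempfile


def make_record(im_file, im_id, h, w, boxes, classes):
    """Build one record with the field names listed in the roidb structure."""
    return {
        'im_file': im_file,              # image file name
        'im_id': im_id,                  # image ID
        'h': h,                          # height of image
        'w': w,                          # width of image
        'is_crowd': [0] * len(boxes),    # crowd marker, one per box
        'gt_class': classes,             # ground truth class per box
        'gt_bbox': boxes,                # assumed [x1, y1, x2, y2] per box
        'gt_poly': [],                   # no segmentation in this toy example
    }


def save_roidb(path, records, cname2id):
    """Serialize the (records, cname2id) tuple as a pickle file."""
    with open(path, 'wb') as f:
        pickle.dump((records, cname2id), f)


def load_roidb(path):
    """Read the (records, cname2id) tuple back from a pickle file."""
    with open(path, 'rb') as f:
        records, cname2id = pickle.load(f)
    return records, cname2id


# Made-up category map and a single made-up annotation.
cname2id = {'dog': 1, 'cat': 2}
records = [
    make_record('000050.jpg', 50, 375, 500,
                boxes=[[48.0, 240.0, 195.0, 371.0]],
                classes=[cname2id['dog']]),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    roidb_path = os.path.join(tmp_dir, 'toy_roidb.pkl')
    save_roidb(roidb_path, records, cname2id)
    loaded_records, loaded_cname2id = load_roidb(roidb_path)

print(loaded_records[0]['im_file'], loaded_cname2id)
```

Keeping the category map next to the records means a single file is enough to recover both the annotations and the meaning of each class ID.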
### APIs

The main APIs are as follows:

1. Data parsing

    - `source/coco_loader.py`: COCO dataset parser. [source](../ppdet/data/source/coco_loader.py)
    - `source/voc_loader.py`: Pascal VOC dataset parser. [source](../ppdet/data/source/voc_loader.py)

      [Note] To use a non-default label list for VOC datasets, a `label_list.txt` file is needed. You can use the provided label list (`data/pascalvoc/ImageSets/Main/label_list.txt`) or generate a custom one (with `tools/generate_data_for_training.py`). Also, the `use_default_label` option should be set to `false` in the configuration file.
    - `source/loader.py`: Roidb dataset parser. [source](../ppdet/data/source/loader.py)

2. Operator

    `transform/operators.py`: Contains a variety of data augmentation methods, including:

    - `DecodeImage`: Read images in RGB format.
    - `RandomFlipImage`: Horizontal flip.
    - `RandomDistort`: Distort brightness, contrast, saturation, and hue.
    - `ResizeImage`: Resize image with interpolation.
    - `RandomInterpImage`: Use a random interpolation method to resize the image.
    - `CropImage`: Crop image with respect to different scale, aspect ratio, and overlap.
    - `ExpandImage`: Pad image to a larger size, with the padding filled with the mean image value.
    - `NormalizeImage`: Normalize image pixel values.
    - `NormalizeBox`: Normalize the bounding box.
    - `Permute`: Arrange the channels of the image and optionally convert the image to BGR format.
    - `MixupImage`: Mix up two images with a given fraction [1](#mix).

    <a name="mix">[1]</a> Please refer to [this paper](https://arxiv.org/pdf/1710.09412.pdf).

    `transform/arrange_sample.py`: Assemble the data samples needed by different models.

3. Transformer

    `transform/post_map.py`: Transformations that operate on whole batches, mainly for:

    - Padding the whole batch to given stride values
    - Resizing images to multiple scales
    - Randomly adjusting the image size of the batch data

    `transform/transformer.py`: Data filtering and batching.

    `transform/parallel_map.py`: Accelerate data processing with multiple threads or processes.

4. Reader

    `reader.py`: Combine source and transforms, and return batch data according to `max_iter`.

    `data_feed.py`: Configure default parameters for `reader.py`.

### Usage

#### Canned Datasets

Presets for common datasets, e.g., `COCO` and `Pascal VOC`, are included. In most cases, users can simply use these canned datasets as is. Moreover, the whole data pipeline is fully customizable through the yaml configuration files.

#### Custom Datasets

- Option 1: Convert the dataset to COCO or VOC format.

    ```sh
    # A small utility (`tools/labelme2coco.py`) is provided to convert
    # a Labelme-annotated dataset to COCO format.
    python ./ppdet/data/tools/labelme2coco.py --json_input_dir ./labelme_annos/ \
                                    --image_input_dir ./labelme_imgs/ \
                                    --output_dir ./cocome/ \
                                    --train_proportion 0.8 \
                                    --val_proportion 0.2 \
                                    --test_proportion 0.0
    # --json_input_dir: The path of the json files annotated by Labelme.
    # --image_input_dir: The path of the images.
    # --output_dir: The path of the converted COCO dataset.
    # --train_proportion: The proportion of annotation data used for training.
    # --val_proportion: The proportion of annotation data used for validation.
    # --test_proportion: The proportion of annotation data used for inference.
    ```

- Option 2:

    1. Add `source/XX_loader.py` and implement the `load` function, following the example of `source/coco_loader.py` and `source/voc_loader.py`.
    2. Modify the `load` function in `source/loader.py` to make use of the newly added data loader.
    3. Modify `source/__init__.py` accordingly.

        ```python
        # Original code:
        if data_cf['type'] in ['VOCSource', 'COCOSource', 'RoiDbSource']:
            source_type = 'RoiDbSource'
        # Replace the above code with the following code:
        if data_cf['type'] in ['VOCSource', 'COCOSource', 'RoiDbSource', 'XXSource']:
            source_type = 'RoiDbSource'
        ```

    4. In the configuration file, define the `type` of `dataset` as `XXSource`.

#### How to add data pre-processing?

- To add a pre-processing operation for a single image, refer to the classes in `transform/operators.py`, and implement the desired transformation with a new class.
- To add pre-processing for a batch, one needs to modify the `build_post_map` function in `transform/post_map.py`.
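
The two extension points above can be pictured with a small, library-free sketch: a per-image operator written as a callable class, and a batch-level step that pads every image to a common, stride-aligned size. Everything here is hypothetical and only illustrates the idea: the names `RandomHorizontalFlip`, `compose`, and `pad_batch_to_stride`, the `sample` dict keys, and the nested-list image format are assumptions. The real classes in `transform/operators.py` and the `build_post_map` function may use a different call signature and operate on `np.ndarray`s.

```python
import random


class RandomHorizontalFlip:
    """Hypothetical per-image operator: mirrors the image and its boxes."""

    def __init__(self, prob=0.5, rng=None):
        self.prob = prob
        self.rng = rng or random.Random()

    def __call__(self, sample):
        # Skip the flip with probability 1 - prob.
        if self.rng.random() >= self.prob:
            return sample
        width = len(sample['image'][0])
        flipped = dict(sample)  # shallow copy so the input sample is not mutated
        flipped['image'] = [row[::-1] for row in sample['image']]
        # Boxes are assumed to be [x1, y1, x2, y2]; mirror the x coordinates.
        flipped['gt_bbox'] = [[width - x2, y1, width - x1, y2]
                              for x1, y1, x2, y2 in sample['gt_bbox']]
        return flipped


def compose(operators):
    """Chain per-image operators into one callable pipeline."""
    def run(sample):
        for op in operators:
            sample = op(sample)
        return sample
    return run


def pad_batch_to_stride(images, stride, fill=0):
    """Batch-level step: pad all images to the batch maximum, rounded up to `stride`."""
    max_h = max(len(im) for im in images)
    max_w = max(len(im[0]) for im in images)
    pad_h = -(-max_h // stride) * stride  # ceiling division, then scale back up
    pad_w = -(-max_w // stride) * stride
    padded = []
    for im in images:
        rows = [row + [fill] * (pad_w - len(row)) for row in im]
        rows += [[fill] * pad_w for _ in range(pad_h - len(im))]
        padded.append(rows)
    return padded


# Toy usage: a 2x3 single-channel "image" with one box, flipped with certainty.
sample = {'image': [[1, 2, 3], [4, 5, 6]], 'gt_bbox': [[0, 0, 1, 2]]}
pipeline = compose([RandomHorizontalFlip(prob=1.0)])
flipped = pipeline(sample)

# Pad the flipped image together with a smaller 1x1 image to a stride of 4.
batch = pad_batch_to_stride([flipped['image'], [[7]]], stride=4)
```

Per-image operators only see one sample at a time, so anything that depends on the other images in a batch, such as the padding size here, has to live in the batch-level step.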