# Python Data Reader Design Doc

At training and testing time, PaddlePaddle programs need to read data. To make it easier for users to write data-reading code, we define that

- A *reader* is a function that reads data (from a file, the network, a random number generator, etc.) and yields data items.
- A *reader creator* is a function that returns a reader function.
- A *reader decorator* is a function that accepts one or more readers and returns a reader.
- A *batch reader* is a function that reads data (from a *reader*, a file, the network, a random number generator, etc.) and yields a batch of data items.

We also provide a function that converts a reader to a batch reader, along with frequently used reader creators and reader decorators.

## Data Reader Interface

In fact, a *data reader* doesn't have to be a function that reads and yields data items. It can be any function with no parameters that creates an iterable (anything that can be used in `for x in iterable`):

```
iterable = data_reader()
```
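To illustrate that any no-parameter function creating an iterable qualifies, here is a minimal sketch of two equivalent readers. The names `list_reader` and `generator_reader` and the sample values are hypothetical, chosen only for this example:

```python
# Hypothetical sample entries: each entry is a tuple of (feature, label).
SAMPLE_ENTRIES = [(0.1, 0), (0.2, 1), (0.3, 0)]

def list_reader():
    # Returns a list; a list is iterable, so this function is a valid reader.
    return list(SAMPLE_ENTRIES)

def generator_reader():
    # Yields entries one at a time; calling it creates a generator (also iterable).
    for entry in SAMPLE_ENTRIES:
        yield entry

for entry in list_reader():
    feature, label = entry  # each element is a single entry, not a mini batch
```

Both functions can be used interchangeably wherever a reader is expected, because the consumer only relies on `for x in reader()`.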

Each element produced by the iterable should be a **single** entry of data, **not** a mini batch. That entry could be a single item or a tuple of items. Each item should be of a [supported type](http://www.paddlepaddle.org/doc/ui/data_provider/pydataprovider2.html?highlight=dense_vector#input-types) (e.g., a numpy 1-D array of float32, an int, or a list of ints).

An example implementation of a reader creator for single-item entries:

```python
import numpy

def reader_creator_random_image(width, height):
    def reader():
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader
```

An example implementation of a reader creator for multiple-item entries:
```python
import numpy

def reader_creator_random_image_and_label(width, height, label):
    def reader():
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height), label
    return reader
```

## Batch Reader Interface

A *batch reader* can be any function with no parameters that creates an iterable (anything that can be used in `for x in iterable`). Each element produced by the iterable should be a batch (list) of data items. Each item inside the list must be a tuple.

Here are valid outputs:
```python
# A mini batch of three data items. Each data item consists of three columns of data, each of which is a scalar.
[(1, 1, 1),
(2, 2, 2),
(3, 3, 3)]

# A mini batch of three data items. Each data item is a list (single column).
[([1,1,1],),
([2,2,2],),
([3,3,3],)]
```

Please note that each item inside the list must be a tuple. Below is an invalid output:
```python
# Wrong: [1,1,1] needs to be inside a tuple: ([1,1,1],).
# Otherwise it's ambiguous whether [1,1,1] means a single column of data [1, 1, 1],
# or three columns of data, each of which is 1.
[[1,1,1],
[2,2,2],
[3,3,3]]
```
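The rule above can be expressed as a small check. This is only a sketch; `is_valid_batch` is a hypothetical helper name and is not part of PaddlePaddle:

```python
def is_valid_batch(batch):
    # A batch must be a list, and every data item inside it must be a tuple.
    return isinstance(batch, list) and all(isinstance(item, tuple) for item in batch)

valid_multi_column = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
valid_single_column = [([1, 1, 1],), ([2, 2, 2],), ([3, 3, 3],)]
invalid = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]  # items are lists, not tuples
```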

It's easy to convert a reader to a batch reader:
```python
mnist_train = paddle.dataset.mnist.train()
mnist_train_batch_reader = paddle.batch(mnist_train, 128)
```
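Conceptually, such a conversion groups consecutive entries into lists. The following is a minimal sketch, not PaddlePaddle's actual implementation; `simple_batch` is a hypothetical name, and the sketch assumes that the reader already yields tuples and that a trailing partial batch should still be emitted:

```python
def simple_batch(reader, batch_size):
    # Hypothetical stand-in for a reader-to-batch-reader converter.
    def batch_reader():
        current = []
        for entry in reader():
            current.append(entry)
            if len(current) == batch_size:
                yield current
                current = []
        # Assumption: emit the last, possibly smaller, batch instead of dropping it.
        if current:
            yield current
    return batch_reader

def five_entries():
    # Hypothetical reader yielding five single-column entries.
    for i in range(5):
        yield (i,)

batches = list(simple_batch(five_entries, 2)())
```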

It's also easy to create a custom batch reader:
```python
import numpy

def custom_batch_reader():
    while True:
        batch = []
        for i in range(128):
            # Note that a tuple is appended.
            batch.append((numpy.random.uniform(-1, 1, 28*28),))
        yield batch

mnist_random_image_batch_reader = custom_batch_reader
```

## Usage

A batch reader, a mapping from the item(s) read to data layers, the batch size, and the total number of passes are passed into `paddle.train`:

```python
# Two data layers are created:
image_layer = paddle.layer.data("image", ...)
label_layer = paddle.layer.data("label", ...)

# ...
batch_reader = paddle.batch(paddle.dataset.mnist.train(), 128)
paddle.train(batch_reader, {"image":0, "label":1}, 128, 10, ...)
```

## Data Reader Decorator

A *data reader decorator* takes one or more data readers and returns a new data reader. It is similar to a [python decorator](https://wiki.python.org/moin/PythonDecorators), but it does not use the `@` syntax.

Because data readers have a strict interface (no parameters, and each element produced is a single data entry), they can be used flexibly via data reader decorators. The following are a few examples:

### Prefetch Data

Reading data may take time, and training cannot proceed without data, so it is generally a good idea to prefetch data.

Use `paddle.reader.buffered` to prefetch data:

```python
buffered_reader = paddle.reader.buffered(paddle.dataset.mnist.train(), 100)
```

`buffered_reader` will try to buffer (prefetch) `100` data entries.
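
One way a prefetching decorator could work is to fill a bounded queue from a background thread. The sketch below is an assumption about a possible design, not PaddlePaddle's actual implementation; `simple_buffered` is a hypothetical name:

```python
import queue
import threading

def simple_buffered(reader, size):
    # Hypothetical prefetching decorator: buffers up to `size` entries ahead.
    end_marker = object()  # unique sentinel signalling that the reader is exhausted

    def buffered_reader():
        q = queue.Queue(maxsize=size)

        def fill():
            for entry in reader():
                q.put(entry)  # blocks while the buffer is full
            q.put(end_marker)

        # A daemon thread does not keep the program alive if the consumer stops early.
        threading.Thread(target=fill, daemon=True).start()

        while True:
            entry = q.get()
            if entry is end_marker:
                return
            yield entry

    return buffered_reader

def ten_entries():
    # Hypothetical finite reader used to exercise the decorator.
    for i in range(10):
        yield i

prefetched = list(simple_buffered(ten_entries, 3)())
```

The bounded queue keeps memory use predictable: the background thread runs at most `size` entries ahead of the consumer.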

### Compose Multiple Data Readers

For example, suppose we want to use a source of real images (reusing the mnist dataset) and a source of random images as input for [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661).

We can do:

```python
import numpy

def reader_creator_random_image(width, height):
    def reader():
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader

def reader_creator_bool(t):
    def reader():
        while True:
            yield t
    return reader

true_reader = reader_creator_bool(True)
false_reader = reader_creator_bool(False)

reader = paddle.reader.compose(paddle.dataset.mnist.train(), reader_creator_random_image(20, 20), true_reader, false_reader)
# Index 1 is skipped because paddle.dataset.mnist.train() produces two items per data entry,
# and we don't need the second item here.
paddle.train(paddle.batch(reader, 128), {"true_image":0, "fake_image": 2, "true_label": 3, "false_label": 4}, ...)
```
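The index arithmetic in the mapping above follows from how composition flattens entries. Here is a minimal sketch of that idea, assuming the composed reader concatenates the items of each source entry into one tuple and stops at the shortest source. `simple_compose` is a hypothetical name, not PaddlePaddle's actual implementation:

```python
def simple_compose(*readers):
    # Hypothetical composing decorator: merges one entry from each reader
    # into a single flat tuple.
    def to_tuple(entry):
        return entry if isinstance(entry, tuple) else (entry,)

    def composed_reader():
        # zip is lazy, so this also works when some readers are infinite.
        for entries in zip(*(r() for r in readers)):
            flat = ()
            for entry in entries:
                flat += to_tuple(entry)
            yield flat

    return composed_reader

def image_and_label():
    # Hypothetical two-item reader, standing in for a dataset reader.
    yield ("img0", 7)
    yield ("img1", 3)

def always_true():
    while True:
        yield True

composed = list(simple_compose(image_and_label, always_true)())
```

Here the two-item entries occupy indices 0 and 1, so the item from the next reader lands at index 2.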

### Shuffle

Given a shuffle buffer size `n`, `paddle.reader.shuffle` returns a data reader that buffers `n` data entries and shuffles them before a data entry is read.

Example:
```python
reader = paddle.reader.shuffle(paddle.dataset.mnist.train(), 512)
```
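A buffer-based shuffle of this kind could be sketched as follows. This is an illustration under the assumption that the buffer is filled, shuffled, and drained repeatedly; `simple_shuffle` is a hypothetical name, not PaddlePaddle's actual implementation:

```python
import random

def simple_shuffle(reader, buf_size):
    # Hypothetical shuffling decorator: shuffles entries within windows of buf_size.
    def shuffled_reader():
        buf = []
        for entry in reader():
            buf.append(entry)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for buffered_entry in buf:
                    yield buffered_entry
                buf = []
        # Drain whatever is left when the source is exhausted.
        if buf:
            random.shuffle(buf)
            for buffered_entry in buf:
                yield buffered_entry
    return shuffled_reader

def ten_entries():
    # Hypothetical finite reader used to exercise the decorator.
    for i in range(10):
        yield i

shuffled = list(simple_shuffle(ten_entries, 4)())
```

Note that with this approach an entry can only move within its own window of `buf_size` entries, so a larger buffer gives a more thorough shuffle at the cost of more memory.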

## Q & A

### Why does a reader return only a single entry, and not a mini batch?

Always returning a single entry makes reusing existing data readers much easier (e.g., if an existing reader returned 3 entries at a time instead of a single entry, the training code would be more complex because it would need to handle cases like a batch size of 2).

We provide the function `paddle.batch` to turn a (single-entry) reader into a batch reader.

### Why do we need a batch reader? Isn't it sufficient for train to take a reader and batch_size as arguments?

In most cases, having train take a reader and batch_size as arguments would be sufficient. However, sometimes users want to customize the order of data entries inside a mini batch, or even change the batch size dynamically.
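
As an illustration of a dynamically changing batch size, the sketch below builds a batch reader whose batch sizes cycle through a given list. The name `variable_batch` and the cycling policy are assumptions made for this example only:

```python
import itertools

def variable_batch(reader, batch_sizes):
    # Hypothetical batch reader creator: batch sizes cycle through batch_sizes.
    def batch_reader():
        entries = iter(reader())
        for size in itertools.cycle(batch_sizes):
            batch = list(itertools.islice(entries, size))
            if not batch:
                return  # the underlying reader is exhausted
            yield batch
    return batch_reader

def seven_entries():
    # Hypothetical reader yielding seven single-column entries.
    for i in range(7):
        yield (i,)

sizes = [len(b) for b in variable_batch(seven_entries, [2, 3])()]
```

A fixed `batch_size` argument to train could not express this, which is why the batch reader is a separate concept.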

### Why use a dictionary instead of a list to provide the mapping?

We use a dictionary (`{"image":0, "label":1}`) instead of a list (`["image", "label"]`) because the user can easily reuse an item (e.g., using `{"image_a":0, "image_b":0, "label":1}`) or skip an item (e.g., using `{"image_a":0, "label":2}`).
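
The effect of such a mapping on one data entry can be sketched as below. `select_columns` is a hypothetical helper written for this illustration; it is not how PaddlePaddle necessarily feeds data layers:

```python
def select_columns(entry, mapping):
    # Hypothetical helper: picks items out of a data entry by index,
    # keyed by data layer name.
    return {name: entry[index] for name, index in mapping.items()}

entry = ("pixels", "extra", "digit")  # hypothetical three-item entry

# Reuse: item 0 feeds two layers.
reused = select_columns(entry, {"image_a": 0, "image_b": 0, "label": 2})

# Skip: item 1 is not mapped to any layer.
skipped = select_columns(entry, {"image_a": 0, "label": 2})
```

A list of names could only express a one-to-one, in-order assignment, so neither of these two cases would be possible.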

### How to create a custom data reader creator

```python
import numpy

def image_reader_creator(image_path, label_path, n):
    def reader():
        # Open both files in binary mode; they are closed once the data is loaded.
        with open(image_path, 'rb') as f, open(label_path, 'rb') as l:
            images = numpy.fromfile(
                f, 'ubyte', count=n * 28 * 28).reshape((n, 28 * 28)).astype('float32')
            images = images / 255.0 * 2.0 - 1.0
            labels = numpy.fromfile(l, 'ubyte', count=n).astype("int")
        for i in range(n):
            yield images[i, :], labels[i] # a single entry of data is created each time
    return reader

# image_reader_creator creates a reader
reader = image_reader_creator("/path/to/image_file", "/path/to/label_file", 1024)
paddle.train(paddle.batch(reader, 128), {"image":0, "label":1}, ...)
```

### How is `paddle.train` implemented

An example implementation of paddle.train could be:

```python
def train(batch_reader, mapping, batch_size, total_pass):
    for pass_idx in range(total_pass):
        for mini_batch in batch_reader(): # this loop will never end in online learning.
            do_forward_backward(mini_batch, mapping)
```