# C++ Data Feeding

When training with the Paddle V2 API, data feeding depends entirely on Python code. To remove the dependency on the Python environment and achieve the goal of "wrapping the whole training by a while loop op" in Paddle Fluid, a C++ data feeding mechanism is required.

In this document, we show the fundamental design of a C++ data feeding process, which includes data reading, shuffling and batching.

## Overview

![](images/readers.png)

## Reader

To address the problem described above, a new concept called 'Reader' is introduced. `Reader` is a family of classes derived from a common base, which can be held by a `Variable` and are used to read or process file data.


### ReaderBase

`ReaderBase` is the abstract base class that defines the interface shared by all readers.

```cpp
class ReaderBase {
 public:
  // Reads the next batch of data. (A 'batch' can be only one instance.)
  // If the next batch doesn't exist, it throws an exception.
  virtual void ReadNext(std::vector<LoDTensor>* out) = 0;

  // Checks whether the next instance exists.
  virtual bool HasNext() const = 0;

  // Reinitializes the reader and reads the file from the beginning.
  virtual void ReInit() = 0;

  virtual ~ReaderBase();
};
```
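
To illustrate how this interface is meant to be consumed, the following is a minimal sketch of one pass over a reader. It is not Paddle code: `LoDTensor` is replaced by a small stand-in struct, and `InMemoryReader` and `CountOnePass` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Stand-in for Paddle's LoDTensor; here it only carries a few values.
struct LoDTensor {
  std::vector<float> data;
};

// Reader interface, reduced to what this sketch needs.
class ReaderBase {
 public:
  virtual void ReadNext(std::vector<LoDTensor>* out) = 0;
  virtual bool HasNext() const = 0;
  virtual void ReInit() = 0;
  virtual ~ReaderBase() = default;
};

// Hypothetical reader that serves instances from memory instead of a file.
class InMemoryReader : public ReaderBase {
 public:
  explicit InMemoryReader(std::vector<LoDTensor> instances)
      : instances_(std::move(instances)) {}

  void ReadNext(std::vector<LoDTensor>* out) override {
    if (!HasNext()) throw std::out_of_range("no more data");
    out->assign(1, instances_[pos_++]);
  }
  bool HasNext() const override { return pos_ < instances_.size(); }
  void ReInit() override { pos_ = 0; }

 private:
  std::vector<LoDTensor> instances_;
  std::size_t pos_ = 0;
};

// Runs one pass over the reader and returns the number of batches read.
std::size_t CountOnePass(ReaderBase* reader) {
  std::size_t count = 0;
  std::vector<LoDTensor> batch;
  while (reader->HasNext()) {
    reader->ReadNext(&batch);
    ++count;
  }
  reader->ReInit();  // rewind so that the next pass starts from the beginning
  return count;
}
```

Checking `HasNext()` before every `ReadNext()` keeps the loop from ever hitting the exception path, and calling `ReInit()` at the end of a pass lets the same reader serve the next pass.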

### FileReader

`FileReader` is derived from `ReaderBase`. It is still an abstract class, from which readers for specific file formats are further derived.

```cpp
class FileReader : public ReaderBase {
 public:
  explicit FileReader(const std::vector<DDim>& dims);

  // Calls ReadNextImpl() and then checks the shapes in *out against dims_.
  void ReadNext(std::vector<LoDTensor>* out) override;

 protected:
  // Format-specific reading logic, implemented by each concrete file reader.
  virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;

 private:
  // Expected shape of each output tensor.
  std::vector<DDim> dims_;
};
```

A file reader is bound to a single file and reads one data instance at a time. Each type of file reader must implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.

`ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also responsible for checking the output, making sure that the shape of each `LoDTensor` in `*out` is consistent with the corresponding entry in `dims_`.
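
The split between `ReadNext()` and `ReadNextImpl()` can be sketched as follows. This is only an illustration under simplifying assumptions: `DDim` and `LoDTensor` are stand-ins that carry nothing but a shape, the `ReaderBase` parent is omitted, a plain exception replaces Paddle's enforce macros, and `FixedShapeReader` is a hypothetical subclass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Stand-ins for Paddle's DDim and LoDTensor, reduced to shape information.
using DDim = std::vector<std::int64_t>;
struct LoDTensor {
  DDim dims;
};

class FileReader {
 public:
  explicit FileReader(const std::vector<DDim>& dims) : dims_(dims) {}
  virtual ~FileReader() = default;

  // Delegates the actual reading to the subclass, then validates the shapes.
  void ReadNext(std::vector<LoDTensor>* out) {
    ReadNextImpl(out);
    if (out->size() != dims_.size()) {
      throw std::runtime_error("unexpected number of output tensors");
    }
    for (std::size_t i = 0; i < dims_.size(); ++i) {
      if ((*out)[i].dims != dims_[i]) {
        throw std::runtime_error("output shape does not match dims_");
      }
    }
  }

 protected:
  virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;

 private:
  std::vector<DDim> dims_;
};

// Hypothetical format-specific reader that always yields one fixed shape.
class FixedShapeReader : public FileReader {
 public:
  FixedShapeReader(const std::vector<DDim>& expected, DDim actual)
      : FileReader(expected), actual_(std::move(actual)) {}

 protected:
  void ReadNextImpl(std::vector<LoDTensor>* out) override {
    out->assign(1, LoDTensor{actual_});
  }

 private:
  DDim actual_;
};
```

Keeping the check in the non-virtual part means every file format gets the same validation without each subclass having to repeat it.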

### DecoratedReader

A decorated reader takes another reader (either a file reader or a decorated reader) as its 'underlying reader'. It gets data from its underlying reader, processes them (shuffling, batching or something else), and then yields the processed data. The output of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.

```cpp
class DecoratedReader : public ReaderBase {
 public:
  explicit DecoratedReader(ReaderBase* reader) : ReaderBase(), reader_(reader) {
    PADDLE_ENFORCE_NOT_NULL(reader_);
  }

  void ReInit() override { reader_->ReInit(); }

  bool HasNext() const override { return reader_->HasNext(); }

 protected:
  ReaderBase* reader_;
};
```

Both `FileReader` and `DecoratedReader` share exactly the same interface as defined in `ReaderBase`, so readers can be decorated multiple times: we can **shuffle** a reader's outputs and then **batch** the shuffled outputs. The interface consistency also allows related ops to use readers without knowing their underlying type.
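
As a rough illustration of decoration, the sketch below stacks a batching reader on top of a source reader. It is not the actual `BatchReader`: `LoDTensor` is a stand-in holding a flat list of values, batching is modelled as simple concatenation, and `CounterReader` is a hypothetical source introduced for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for Paddle's LoDTensor: a flat list of values.
struct LoDTensor {
  std::vector<float> data;
};

class ReaderBase {
 public:
  virtual void ReadNext(std::vector<LoDTensor>* out) = 0;
  virtual bool HasNext() const = 0;
  virtual void ReInit() = 0;
  virtual ~ReaderBase() = default;
};

// Hypothetical source reader yielding the instances 0, 1, ..., n-1.
class CounterReader : public ReaderBase {
 public:
  explicit CounterReader(int n) : n_(n) {}
  void ReadNext(std::vector<LoDTensor>* out) override {
    if (!HasNext()) throw std::out_of_range("no more data");
    out->assign(1, LoDTensor{{static_cast<float>(next_++)}});
  }
  bool HasNext() const override { return next_ < n_; }
  void ReInit() override { next_ = 0; }

 private:
  int n_;
  int next_ = 0;
};

// Decorated reader that concatenates up to batch_size instances per call.
class BatchReader : public ReaderBase {
 public:
  BatchReader(ReaderBase* reader, std::size_t batch_size)
      : reader_(reader), batch_size_(batch_size) {}

  void ReadNext(std::vector<LoDTensor>* out) override {
    if (!HasNext()) throw std::out_of_range("no more data");
    out->clear();
    std::vector<LoDTensor> instance;
    for (std::size_t i = 0; i < batch_size_ && reader_->HasNext(); ++i) {
      reader_->ReadNext(&instance);
      if (out->empty()) out->resize(instance.size());
      for (std::size_t slot = 0; slot < instance.size(); ++slot) {
        std::vector<float>& dst = (*out)[slot].data;
        dst.insert(dst.end(), instance[slot].data.begin(),
                   instance[slot].data.end());
      }
    }
  }
  bool HasNext() const override { return reader_->HasNext(); }
  void ReInit() override { reader_->ReInit(); }

 private:
  ReaderBase* reader_;  // underlying reader, not owned in this sketch
  std::size_t batch_size_;
};
```

Because this `BatchReader` is itself a `ReaderBase`, it can be passed to another decorated reader in exactly the same way, which is what makes stacked decoration possible.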

### MultipleReader

Every `FileReader` is bound to a single file and is single-threaded. However, sometimes we need to read data from more than one file. In this case, having only `FileReader` and `DecoratedReader` is not enough.

So `MultipleReader` is introduced. It is also derived from `ReaderBase`. A `MultipleReader` holds several prefetching `FileReader`s, and these readers run concurrently. Another pivotal part of a `MultipleReader` is a buffer channel. The channel collects the data yielded by all prefetching readers, so that subsequent ops or decorated readers can fetch data without being concerned with how the multiple readers are scheduled.

![](images/multiple_reader.png)

This graph shows how a `MultipleReader` works with three prefetching file readers and two GPUs. There is a queue of files waiting to be read. Whenever a prefetching file reader becomes free (having finished reading one file), it fetches a new file from the queue. Each prefetching file reader runs in a separate prefetch thread and dumps its outputs to the same channel.

To the two subsequent decorated readers, the `MultipleReader` is **a single reader**. They do not need to be concerned with how the prefetch readers are scheduled. They only need to invoke `MultipleReader::ReadNext()` to get the next data from the buffer channel.
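
The scheduling described above (a shared file queue, several prefetch threads, one buffer channel) can be sketched with standard C++ threading primitives. Everything here is an assumption made for illustration: the `Channel` class, the `ReadAll` function, the channel capacity of 2, and the use of strings in place of real data instances are not taken from the Paddle implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Bounded, closable buffer shared by all prefetch threads.
class Channel {
 public:
  explicit Channel(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the buffer is full.
  void Send(std::string item) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock, [this] { return buf_.size() < capacity_; });
    buf_.push_back(std::move(item));
    not_empty_.notify_one();
  }

  // Blocks while the buffer is empty; returns false once closed and drained.
  bool Receive(std::string* item) {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [this] { return !buf_.empty() || closed_; });
    if (buf_.empty()) return false;
    *item = std::move(buf_.front());
    buf_.pop_front();
    not_full_.notify_one();
    return true;
  }

  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    not_empty_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::deque<std::string> buf_;
  std::size_t capacity_;
  bool closed_ = false;
};

// Reads all files with num_threads prefetch threads and returns everything
// that a single consumer receives from the channel.
std::vector<std::string> ReadAll(const std::vector<std::string>& files,
                                 std::size_t num_threads) {
  Channel channel(2);
  std::mutex queue_mu;
  std::deque<std::string> queue(files.begin(), files.end());

  auto prefetch = [&] {
    for (;;) {
      std::string file;
      {
        std::lock_guard<std::mutex> lock(queue_mu);
        if (queue.empty()) return;  // no file left, this thread is done
        file = std::move(queue.front());
        queue.pop_front();
      }
      // A real file reader would parse the file here; this sketch emits two
      // placeholder instances per file.
      channel.Send(file + "#0");
      channel.Send(file + "#1");
    }
  };

  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_threads; ++i) workers.emplace_back(prefetch);
  // Close the channel only after every prefetch thread has finished.
  std::thread closer([&] {
    for (std::thread& w : workers) w.join();
    channel.Close();
  });

  std::vector<std::string> received;
  std::string item;
  while (channel.Receive(&item)) received.push_back(item);
  closer.join();
  return received;
}
```

The consumer loop at the end only ever talks to the channel, which mirrors the point made above: downstream readers see a single data source regardless of how many prefetch threads feed it. The order in which instances arrive is not deterministic.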

### ReaderHolder

Different readers belong to different class types. This leads to a problem: how can we put them into `Variable`s and fetch them out by a unified method? For example, if a `Variable` holds a `BatchReader`, we cannot get it with the following code:

```cpp
var->Get<ReaderBase>("batch_reader");
```

We would have to write:

```cpp
var->Get<BatchReader>("batch_reader");
```

This means that, to get a reader from a variable, we must know the reader's exact type every time. This is nearly impossible in practice.

To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an empty decorator of `ReaderBase` that hides the reader's type. With `ReaderHolder`, we are able to fetch all types of readers by `var->Get<ReaderHolder>("...")` and treat the obtained object as a reader.
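
A minimal sketch of such a wrapper is shown below. The method names `Reset()` and `Get()`, the ownership through `std::unique_ptr`, and the exception thrown for an empty holder are assumptions of this example rather than a description of the real class. `LoDTensor` is a stand-in and `OneShotReader` is a hypothetical concrete reader.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <vector>

// Stand-in for Paddle's LoDTensor.
struct LoDTensor {
  std::vector<float> data;
};

class ReaderBase {
 public:
  virtual void ReadNext(std::vector<LoDTensor>* out) = 0;
  virtual bool HasNext() const = 0;
  virtual void ReInit() = 0;
  virtual ~ReaderBase() = default;
};

// Hypothetical concrete reader that yields a single instance per pass.
class OneShotReader : public ReaderBase {
 public:
  void ReadNext(std::vector<LoDTensor>* out) override {
    if (done_) throw std::out_of_range("no more data");
    out->assign(1, LoDTensor{{42.0f}});
    done_ = true;
  }
  bool HasNext() const override { return !done_; }
  void ReInit() override { done_ = false; }

 private:
  bool done_ = false;
};

// Wrapper with one fixed type: callers never name the concrete reader class.
class ReaderHolder {
 public:
  // Takes ownership of the reader (an assumption of this sketch).
  void Reset(ReaderBase* reader) { reader_.reset(reader); }
  ReaderBase* Get() const { return reader_.get(); }

  // Forward the reader interface to whatever reader is currently held.
  void ReadNext(std::vector<LoDTensor>* out) { Checked()->ReadNext(out); }
  bool HasNext() const { return Checked()->HasNext(); }
  void ReInit() { Checked()->ReInit(); }

 private:
  ReaderBase* Checked() const {
    if (!reader_) throw std::logic_error("ReaderHolder is empty");
    return reader_.get();
  }
  std::unique_ptr<ReaderBase> reader_;
};
```

Since every `Variable` would store the same `ReaderHolder` type, the code that fetches a reader no longer depends on whether a file reader, a `BatchReader` or any other reader sits behind it.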

## Related Operators

To create and invoke readers, some new ops are introduced:

### CreateReaderOp

Each reader has its own creation op. A file reader's creation op has no input and yields the created file reader as its output. A decorated reader's creation op takes the underlying reader as its input and yields a new decorated reader.

However, direct usage of file readers' creation ops is not recommended because a file reader can only read one file via a single thread. Using `OpenFilesOp` is a better choice.

### OpenFilesOp

The `OpenFilesOp` is the creation op of `MultipleReader`. It takes no input but requires a list of file names as one of its attributes. The newly created `MultipleReader` then creates its own prefetching readers according to the given file names.

To make sure that the created prefetching readers match the file formats, we need a name prefix rule that attaches file format tags to file names, as well as a file reader registry mechanism that maps file format tags to the constructors of the corresponding file readers.
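
The document does not fix the concrete rule or the registry API, so the following is only one possible shape of such a mechanism. The `<tag>:<path>` naming rule, the `FileReaderRegistry` class, and the `TextFileReader` type are all hypothetical, and the `FileReader` here is a stand-in that only records its format and path.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Minimal stand-in for a file reader; only format and path matter here.
class FileReader {
 public:
  virtual ~FileReader() = default;
  virtual std::string Format() const = 0;
  virtual std::string Path() const = 0;
};

// Hypothetical reader for a plain-text format.
class TextFileReader : public FileReader {
 public:
  explicit TextFileReader(std::string path) : path_(std::move(path)) {}
  std::string Format() const override { return "text"; }
  std::string Path() const override { return path_; }

 private:
  std::string path_;
};

using Creator =
    std::function<std::unique_ptr<FileReader>(const std::string& path)>;

// Maps file format tags to functions that construct the matching reader.
class FileReaderRegistry {
 public:
  void Register(const std::string& tag, Creator creator) {
    creators_[tag] = std::move(creator);
  }

  // Splits "<tag>:<path>" (a hypothetical naming rule) and builds the reader
  // registered for that tag.
  std::unique_ptr<FileReader> Create(const std::string& tagged_name) const {
    const std::string::size_type sep = tagged_name.find(':');
    if (sep == std::string::npos) {
      throw std::invalid_argument("missing format tag");
    }
    const auto it = creators_.find(tagged_name.substr(0, sep));
    if (it == creators_.end()) {
      throw std::invalid_argument("unknown format tag");
    }
    return it->second(tagged_name.substr(sep + 1));
  }

 private:
  std::map<std::string, Creator> creators_;
};
```

With a registry of this kind, a `MultipleReader` would not need to know any concrete file reader type. It would only pass each tagged file name to `Create()` and use the result through the common interface.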

### HasNextOp

`HasNextOp` is used to check whether the next data batch exists via the reader's `HasNext()` interface.

### ResetOp

`ResetOp` is used to reset a reader via its `ReInit()` interface.

### ReadOp

A reader is only a `Variable`. It cannot trigger the reading process by itself, so we add the `ReadOp` to execute it. A `ReadOp` takes a reader `Variable` as its input. Each time it runs, it invokes the reader's `ReadNext()` function and gets a new batch of data (or only one instance of data, if a file reader is used directly). The output data of a reader is in the form of `std::vector<LoDTensor>`, so the `ReadOp` also needs to split the vector and move the `LoDTensor`s to their respective output Variables.
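
The splitting step can be sketched as follows. This is a simplified illustration, not the operator's actual code: `LoDTensor` and `Variable` are stand-in structs, and `PairReader` and `RunReadOp` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Stand-ins for Paddle's LoDTensor and Variable.
struct LoDTensor {
  std::vector<float> data;
};
struct Variable {
  LoDTensor tensor;
};

class ReaderBase {
 public:
  virtual void ReadNext(std::vector<LoDTensor>* out) = 0;
  virtual ~ReaderBase() = default;
};

// Hypothetical reader that yields an (image, label) pair on every call.
class PairReader : public ReaderBase {
 public:
  void ReadNext(std::vector<LoDTensor>* out) override {
    out->clear();
    out->push_back(LoDTensor{{0.1f, 0.2f, 0.3f}});  // "image"
    out->push_back(LoDTensor{{1.0f}});              // "label"
  }
};

// Core of a read op: read once, then move tensor i into output variable i.
void RunReadOp(ReaderBase* reader, const std::vector<Variable*>& outputs) {
  std::vector<LoDTensor> tensors;
  reader->ReadNext(&tensors);
  if (tensors.size() != outputs.size()) {
    throw std::runtime_error("reader output count does not match op outputs");
  }
  for (std::size_t i = 0; i < tensors.size(); ++i) {
    outputs[i]->tensor = std::move(tensors[i]);
  }
}
```

Moving the tensors instead of copying them avoids duplicating the batch data when it is handed over to the output variables.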

## Program with Readers

A `Program` holds readers as its persistable variables. These variables are created by `CreateReaderOp` or `OpenFilesOp`. These ops should run only once, so they are placed in the `startup_program`. `HasNextOp`, `ResetOp` and `ReadOp` are required by the training loop, so they belong in the `main_program`.

The ops of a `startup_program` with readers would be like this:

```
multiple_reader = open_files_op(...)
batch_reader = create_batch_reader_op(multiple_reader)
double_buffer_reader = create_double_buffer_op(batch_reader)
... (other initializers)
```

The forwarding ops of the corresponding `main_program` would be like this:

```
while_op {
    has_next = has_next_op(double_buffer_reader)
    if_else_op(has_next) {
        batch_data = read_op(double_buffer_reader)
        ... (subsequent training ops)
    } else {
        reset_op(double_buffer_reader)
    }
}
```

Two important considerations for these programs are as follows:

1. The multiple\_reader is the batch\_reader's underlying reader, and the batch\_reader is the double\_buffer\_reader's underlying reader. `read_op`, `has_next_op` and other reader-related ops only invoke the top-most reader. In this case, it is the double\_buffer\_reader.

2. All readers exist in both `startup_program` and `main_program`, and they are persistable.