In the previous implementation of Paddle Fluid, there were two ways to feed data:
- Use `reader_op` on the C++ backend side. This method can only read data from RecordIO files and random data generators, but it supports many kinds of `decorated_readers`. For example, `double_buffer_reader` uses two threads to achieve better performance: one for time-consuming I/O operations, and the other for `Executor::Run()`. See [C++ Data Feeding](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/cpp_data_feeding.md) for details.
- Feed data directly using `DataFeeder.feed()` in Python code. This is more flexible than the first way: preprocessing can be done in Python or any other language before feeding, rather than by adding many rarely used `operators` on the C++ side. However, this method is less efficient, because the program cannot read the next mini-batch before `Executor::Run()` returns. Moreover, `decorated_readers` such as `double_buffer_reader` cannot be used to improve performance.
In this document, we design a Python data feeding process that combines the efficiency of the first way with the flexibility of the second. A data queue, `PyArrayFeedQueue`, is shared by the Python and C++ sides: Python arrays are pushed into the queue, and `reader_op` on the C++ side reads the data out of it.
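
The producer/consumer idea behind the shared queue can be illustrated with a minimal sketch. Everything here is hypothetical and is not Paddle's API: `BoundedQueue` stands in for `PyArrayFeedQueue`, a `std::thread` stands in for the Python side, and the `Close()` method and the `RunFeedingDemo` helper are assumptions added only so the example terminates cleanly.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for PyArrayFeedQueue: a blocking queue with a
// fixed capacity. Not Paddle's implementation.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Blocks while the queue is full. Returns false if the queue is closed.
  bool Push(T item) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock,
                   [this] { return closed_ || items_.size() < capacity_; });
    if (closed_) return false;
    items_.push_back(std::move(item));
    not_empty_.notify_one();
    return true;
  }

  // Blocks while the queue is empty. Returns false once the queue is
  // closed and fully drained.
  bool Pop(T* out) {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [this] { return closed_ || !items_.empty(); });
    if (items_.empty()) return false;
    *out = std::move(items_.front());
    items_.pop_front();
    not_full_.notify_one();
    return true;
  }

  // Assumed helper: wakes all waiters so both sides can finish.
  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    not_empty_.notify_all();
    not_full_.notify_all();
  }

 private:
  const size_t capacity_;
  bool closed_ = false;
  std::deque<T> items_;
  std::mutex mu_;
  std::condition_variable not_empty_;
  std::condition_variable not_full_;
};

// The producer thread (standing in for Python) pushes mini-batches while
// the caller (standing in for reader_op) pops them concurrently.
// Returns the number of mini-batches consumed.
int RunFeedingDemo(int num_batches, size_t capacity) {
  BoundedQueue<std::vector<float>> queue(capacity);
  std::thread producer([&] {
    for (int i = 0; i < num_batches; ++i) {
      queue.Push(std::vector<float>(4, static_cast<float>(i)));
    }
    queue.Close();
  });
  int consumed = 0;
  std::vector<float> batch;
  while (queue.Pop(&batch)) ++consumed;
  producer.join();
  return consumed;
}
```

With a capacity smaller than the number of mini-batches, the producer blocks in `Push()` until the consumer catches up, so the next mini-batch can be prepared while the current one is being consumed.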
## Design of PyArrayFeedQueue
`PyArrayFeedQueue` is a blocking queue with a fixed `capacity`. It accepts Python arrays whose shapes are given by `dims`.
```C++
class PyArrayFeedQueueHolder;

class PyArrayFeedQueue {
  friend class PyArrayFeedQueueHolder;

 private:
  // PyArrayFeedQueue can only be constructed by PyArrayFeedQueueHolder
  // ...
```
Several points need attention:
- `PyArrayFeedQueueHolder` should be a `Variable` in the global scope, so that `reader_op` can find it when reading data. Since `PyArrayFeedQueue` does not have a default constructor, it cannot be constructed by `Scope::Var()::GetMutable<T>()`. To solve this problem, `PyArrayFeedQueueHolder` is designed to defer construction of `PyArrayFeedQueue`.
- A `Variable` (not a `VarDesc`) of type `PyArrayFeedQueueHolder` must be created in Python code before `Executor::Run()`, so that `Executor::Run()` can get the fed data when it is called.
- `Create_reader_op` should accept the name or address of `PyArrayFeedQueueHolder` as an input or attribute.
## Design of PyArrayReader
`PyArrayReader` is a reader that holds a `PyArrayFeedQueue` object. Note that `ReInit()` is not supported because the capacity of the `PyArrayFeedQueue` object is limited.
```C++
  // ...
    PADDLE_THROW("PyArrayReader does not support ReInit()");
  }

 private:
  std::shared_ptr<PyArrayFeedQueue> queue_;
};
```
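Two ideas from the sections above, a holder that defers construction of a queue without a default constructor and a reader that rejects `ReInit()`, can be sketched together as follows. This is a simplified, single-threaded illustration under stated assumptions: `FeedQueue`, `FeedQueueHolder`, `QueueReader`, and all of their methods are hypothetical names, and standard exceptions replace `PADDLE_THROW`.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical simplified queue. Like PyArrayFeedQueue, it has no default
// constructor and can only be built by its holder.
class FeedQueue {
 public:
  // Non-blocking push for brevity. Returns false when the queue is full.
  bool TryPush(std::vector<float> batch) {
    if (items_.size() >= capacity_) return false;
    items_.push_back(std::move(batch));
    return true;
  }

  // Non-blocking pop. Returns false when the queue is empty.
  bool TryPop(std::vector<float>* out) {
    if (items_.empty()) return false;
    *out = std::move(items_.front());
    items_.pop_front();
    return true;
  }

 private:
  friend class FeedQueueHolder;
  explicit FeedQueue(size_t capacity) : capacity_(capacity) {}

  size_t capacity_;
  std::deque<std::vector<float>> items_;
};

// Default-constructible holder, so a generic "create a variable of type T"
// routine can build it. The queue itself is constructed later.
class FeedQueueHolder {
 public:
  FeedQueueHolder() = default;

  void InitOnce(size_t capacity) {
    assert(capacity > 0);
    if (queue_) throw std::logic_error("FeedQueueHolder already initialized");
    // std::make_shared cannot reach the private constructor, so use new.
    queue_.reset(new FeedQueue(capacity));
  }

  // Returns nullptr until InitOnce() has been called.
  std::shared_ptr<FeedQueue> Get() const { return queue_; }

 private:
  std::shared_ptr<FeedQueue> queue_;
};

// Reader that shares ownership of the queue and refuses to restart.
class QueueReader {
 public:
  explicit QueueReader(std::shared_ptr<FeedQueue> queue)
      : queue_(std::move(queue)) {}

  bool ReadNext(std::vector<float>* out) { return queue_->TryPop(out); }

  void ReInit() {
    throw std::runtime_error("QueueReader does not support ReInit()");
  }

 private:
  std::shared_ptr<FeedQueue> queue_;
};
```

In this sketch the holder can exist before the capacity is known, and the reader only ever sees a fully constructed queue through the shared pointer.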
## Design of CreatePyArrayReaderOp
`CreatePyArrayReaderOp` creates a `PyArrayReader` object. It requires a `feeder_name` attribute, which gives the name of the `PyArrayFeedQueueHolder` variable.
```C++
class CreatePyArrayReaderOp : public framework::OperatorBase {
  // ...
```
The design of the Python code is as follows. First, we construct a `PyArrayFeedQueueHolder` variable and initialize it with the given parameters, which returns the `PyArrayFeedQueue` object. After that, a layer wrapping `CreatePyArrayReaderOp` is constructed; it accepts the name of the `PyArrayFeedQueueHolder` variable. Both the `PyArrayFeedQueue` object and the result of the layer are returned.