# Python Data Feeding

In the previous implementation of Paddle Fluid, there were two ways to feed data:

- Use `reader_op` on the C++ backend side. This method only supports feeding data from recordio files and random data generators, but it supports many kinds of `decorated_readers`. For example, `double_buffer_reader` uses two threads to achieve better performance: one for time-consuming I/O operations, and the other for `Executor::Run()`. See [C++ Data Feeding](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/cpp_data_feeding.md) for details.

- Feed data directly using `DataFeeder.feed()` in Python code. This is more flexible than the first way: many kinds of preprocessing steps can be performed before feeding, using Python or any other language, instead of adding many uncommon `operators` on the C++ side. However, this method is less efficient: the program cannot read the next mini-batch of data before `Executor::Run()` ends. Moreover, `decorated_readers` such as `double_buffer_reader` cannot be used to improve performance.

In this document, we design a Python Data Feeding process that combines the efficiency of the first way with the flexibility of the second. A data queue, `PyArrayFeedQueue`, is shared by the Python and C++ sides: Python arrays are pushed into the queue, and `reader_op` on the C++ side reads the data out of the queue.
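The producer/consumer idea described above can be sketched in plain Python. This is only an illustration, not Paddle code: the standard-library `queue.Queue` stands in for `PyArrayFeedQueue`, `run_steps` stands in for the `reader_op` plus `Executor::Run()` loop, and the `None` end-of-data sentinel is an assumption of this sketch.

```python
import queue
import threading


def feed_batches(q, batches):
    """Producer: push each mini-batch, blocking while the queue is full."""
    for batch in batches:
        q.put(batch)
    q.put(None)  # sentinel marking end of data (an assumption of this sketch)


def run_steps(q):
    """Consumer: stand-in for the reader op and the Executor::Run() loop."""
    results = []
    while True:
        batch = q.get()  # blocks while the queue is empty
        if batch is None:
            break
        results.append(sum(batch))  # placeholder for the real computation
    return results


feed_queue = queue.Queue(maxsize=2)  # fixed capacity, like PyArrayFeedQueue
batches = [[1, 2], [3, 4], [5, 6]]

producer = threading.Thread(target=feed_batches, args=(feed_queue, batches))
producer.start()
step_results = run_steps(feed_queue)
producer.join()
```

Because the producer runs in its own thread, the next mini-batch can be prepared while the consumer is still processing the current one, which is the inefficiency of `DataFeeder.feed()` that the design aims to remove.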

## Design of PyArrayFeedQueue
`PyArrayFeedQueue` is a blocking queue with a fixed `capacity` that accepts Python arrays whose shapes are given by `dims`.
```C++
class PyArrayFeedQueueHolder;

class PyArrayFeedQueue {
  friend class PyArrayFeedQueueHolder;
 private:
  // PyArrayFeedQueue can only be constructed by PyArrayFeedQueueHolder
  PyArrayFeedQueue(size_t capacity, const std::vector<framework::DDim>& dims, const platform::Place& place);
 
 public:
  // Not copyable and not moveable
  PyArrayFeedQueue(const PyArrayFeedQueue&) = delete;
  PyArrayFeedQueue(PyArrayFeedQueue&&) = delete;
  PyArrayFeedQueue& operator = (const PyArrayFeedQueue&) = delete;
  PyArrayFeedQueue& operator = (PyArrayFeedQueue&&) = delete;

  size_t size() const; // Get the current size of the queue
  size_t capacity() const; // Get the capacity of the queue
  bool is_full() const;
  bool is_empty() const;
  
  // Convert Python array tuple to std::vector<framework::LoDTensor> and store it.
  // Block if is_full() == true
  // Use pybind11::gil_scoped_release to release GIL of Python
  void push(const pybind11::tuple& array_tuple);
  
  // Block if is_empty() == true
  // Use pybind11::gil_scoped_release to release GIL of Python
  std::vector<framework::LoDTensor> pop();
 private:
  // CircularQueue is a class like `boost::circular_buffer`
  framework::CircularQueue<std::vector<framework::LoDTensor>> queue_;
  std::vector<framework::DDim> dims_;
  platform::Place place_;
  mutable std::mutex mutex_;
  mutable std::condition_variable cv_;
};

class PyArrayFeedQueueHolder {
 public:
  PyArrayFeedQueueHolder() {}
  
  // Calls the constructor of PyArrayFeedQueue to create feeder_
  // `init_once` can only be called once; otherwise an exception is raised
  void init_once(size_t capacity, const std::vector<framework::DDim>& dims, const Place& place);
  
  PyArrayFeedQueue* feeder(); // feeder_.get()
  const PyArrayFeedQueue* feeder() const; // feeder_.get()
 private:
  std::shared_ptr<PyArrayFeedQueue> feeder_;
};
```
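The blocking behaviour of `push()` and `pop()` can be illustrated with a small Python analogue. This is a sketch under stated assumptions rather than the actual implementation: `BoundedFeedQueue` and `shape_of` are hypothetical names, nested lists stand in for Python arrays, and raising `ValueError` on a shape mismatch is a choice made for this example only.

```python
import threading
from collections import deque


def shape_of(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


class BoundedFeedQueue:
    """Hypothetical Python analogue of PyArrayFeedQueue (not Paddle code)."""

    def __init__(self, capacity, dims):
        self._capacity = capacity
        self._dims = [tuple(d) for d in dims]
        self._items = deque()
        self._cv = threading.Condition()

    def size(self):
        with self._cv:
            return len(self._items)

    def capacity(self):
        return self._capacity

    def push(self, arrays):
        # Reject tuples whose shapes do not match `dims`.
        shapes = [shape_of(a) for a in arrays]
        if shapes != self._dims:
            raise ValueError("expected shapes %s, got %s" % (self._dims, shapes))
        with self._cv:
            # Block while the queue is full.
            self._cv.wait_for(lambda: len(self._items) < self._capacity)
            self._items.append(tuple(arrays))
            self._cv.notify_all()

    def pop(self):
        with self._cv:
            # Block while the queue is empty.
            self._cv.wait_for(lambda: len(self._items) > 0)
            item = self._items.popleft()
            self._cv.notify_all()
            return item


q = BoundedFeedQueue(capacity=1, dims=[(2, 2), (2,)])
consumed = []
consumer = threading.Thread(
    target=lambda: consumed.extend(q.pop() for _ in range(2)))
consumer.start()
q.push(([[1, 2], [3, 4]], [0, 1]))
q.push(([[5, 6], [7, 8]], [1, 0]))  # waits until there is room in the queue
consumer.join()
```

With `capacity=1`, the second `push()` cannot complete until the consumer thread has popped the first item, mirroring the "block if `is_full()`" rule in the C++ interface.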

A few points require particular attention:
- `PyArrayFeedQueueHolder` should be a `Variable` in global scope, so that `reader_op` can find it when reading data. Since `PyArrayFeedQueue` does not have a default constructor, it cannot be constructed by `Scope::Var()::GetMutable<T>()`. To solve this problem, `PyArrayFeedQueueHolder` is designed to defer construction of `PyArrayFeedQueue`.
- A `Variable` holding a `PyArrayFeedQueueHolder`, rather than only a `VarDesc`, must be created in Python code before `Executor::Run()` is called, so that `Executor::Run()` can obtain the data being fed.
- `Create_reader_op` should accept the name or address of `PyArrayFeedQueueHolder` as an input or attribute.
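The deferred-construction role of the holder can be sketched as follows. All names here are hypothetical: `FeedQueueHolder` mimics `PyArrayFeedQueueHolder`, a plain dictionary stands in for the global scope, the variable name `py_array_feed_queue_0` is made up, and `queue.Queue` stands in for the real queue.

```python
import queue


class FeedQueueHolder:
    """Hypothetical analogue of PyArrayFeedQueueHolder.

    The holder itself needs no arguments to construct; the queue it wraps
    is created later by init_once().
    """

    def __init__(self):
        self._feeder = None
        self._dims = None

    def init_once(self, capacity, dims):
        if self._feeder is not None:
            raise RuntimeError("init_once() can only be called once")
        self._feeder = queue.Queue(maxsize=capacity)  # stand-in for the real queue
        self._dims = dims
        return self._feeder

    def feeder(self):
        return self._feeder


# A dict stands in for the global scope: variable name -> variable.
scope = {}
scope["py_array_feed_queue_0"] = FeedQueueHolder()  # default-constructed first
holder = scope["py_array_feed_queue_0"]
feed_queue = holder.init_once(capacity=4, dims=[(2, 2)])  # initialized later
```

The scope can create the holder without knowing the capacity or shapes, and an operator that only knows the variable name can still look up the holder and reach the queue through `feeder()`.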


## Design of PyArrayReader
`PyArrayReader` is a reader that holds a `PyArrayFeedQueue` object. Note that `ReInit()` is not supported: because the capacity of the `PyArrayFeedQueue` object is limited, data that has already been read cannot be kept for replay.
```C++
class PyArrayReader : public ReaderBase {
 public:
  explicit PyArrayReader(const std::shared_ptr<PyArrayFeedQueue>& queue);
  
  void ReadNext(std::vector<framework::LoDTensor>* out) override;
  
  void ReInit() override {
    PADDLE_THROW("PyArrayReader does not support ReInit()");
  }

  PyArrayFeedQueue* feeder();
  const PyArrayFeedQueue* feeder() const;
 private:
  std::shared_ptr<PyArrayFeedQueue> queue_;
};
```
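The reader's contract can be illustrated with a short Python stand-in. `QueueBackedReader` is a hypothetical name, `queue.Queue` stands in for `PyArrayFeedQueue`, and `NotImplementedError` is used here in place of `PADDLE_THROW`.

```python
import queue


class QueueBackedReader:
    """Hypothetical analogue of PyArrayReader (not Paddle code)."""

    def __init__(self, feed_queue):
        self._queue = feed_queue

    def read_next(self):
        # Blocks until the feeding side has pushed a mini-batch.
        return self._queue.get()

    def re_init(self):
        # Items leave the queue once they are read, so there is nothing to rewind to.
        raise NotImplementedError("QueueBackedReader does not support re_init()")

    def feeder(self):
        return self._queue


shared_queue = queue.Queue(maxsize=2)
reader = QueueBackedReader(shared_queue)
shared_queue.put(("batch-0",))
shared_queue.put(("batch-1",))
first = reader.read_next()
second = reader.read_next()
```

Mini-batches come out in the order they were pushed, and each one is removed from the queue as it is read.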

## Design of CreatePyArrayReaderOp
`CreatePyArrayReaderOp` creates a `PyArrayReader` object. It requires a `feeder_name` attribute, which gives the name of the `PyArrayFeedQueueHolder` variable.
```C++
class CreatePyArrayReaderOp : public framework::OperatorBase {
 public:
  using framework::OperatorBase::OperatorBase;
 private:
  void RunImpl(const framework::Scope& scope,
               const platform::Place& dev_place) const override {
    const std::string& feeder_name = Attr<std::string>("feeder_name");
    auto* feeder_holder_var = scope.FindVar(feeder_name);
    PADDLE_ENFORCE(feeder_holder_var != nullptr);
    auto* feeder_holder = feeder_holder_var
                    ->template GetMutable<framework::PyArrayFeedQueueHolder>();
    auto* out = scope.FindVar(Output("Out"))
                    ->template GetMutable<framework::ReaderHolder>();
    out->Reset(new PyArrayReader(feeder_holder->feeder()));
  }
};
```

## Design of Python code
The Python code is designed as follows. First, we create a `PyArrayFeedQueueHolder` variable and initialize it with the given parameters; the initialization returns the `PyArrayFeedQueue` object. After that, a `CreatePyArrayReaderOp` layer is constructed, which accepts the name of the `PyArrayFeedQueueHolder` variable. Both the `PyArrayFeedQueue` object and the output of the layer are returned.
```Python
def py_array_reader(capacity, shapes, place):
  feeder_name = unique_name.generate("py_array_feed_queue")
  var = global_scope().var(feeder_name) # create PyArrayFeedQueueHolder Variable
  feed_queue = core.init_py_array_feed_queue(var, capacity, shapes, place) # init PyArrayFeedQueue
  out = create_var()
  create_reader_op_with_feeder_name(
      type='create_py_array_reader',
      outputs={'Out': [out]},
      attrs={'feeder_name': feeder_name})
  return out, feed_queue
```