# C++ Data Feeding
When training with the Paddle V2 API, data feeding depends entirely on Python code. To remove the dependency on the Python environment and achieve the goal of "wrapping the whole training by a while loop op" in Paddle Fluid, a C++ data feeding mechanism is required.
This document presents the fundamental design of the C++ data feeding process, which includes data reading, shuffling and batching.
## Reader
A new concept named 'Reader' is introduced. `Reader` is a family of classes in one inheritance hierarchy. They can be held by our `Variable` and are used to read or process file data.
### `ReaderBase`
`ReaderBase` is the abstract base class of all readers. It defines the interface shared by all readers.
### `FileReader` and `DecoratedReader`
These two classes are derived from `ReaderBase` and are further derived by specific readers. That is, our design has two kinds of readers: file readers and decorated readers. A file reader reads from a file of some specific format and yields only one instance of data at a time, e.g. a RecordIO reader or a jpg reader. A decorated reader takes another reader (either a file reader or a decorated reader) as its 'underlying reader'. It gets data from its underlying reader, processes them (e.g. by shuffling or batching), and then yields the processed data. The output of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.
All readers share exactly the same interface defined in `ReaderBase`, so they can be decorated more than once: we can **shuffle** a reader's outputs and then **batch** the shuffled outputs. The consistent interface also allows related ops to use readers without knowing their exact types.
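
The layering described above can be sketched roughly as follows. This is a minimal illustration, not the actual Paddle implementation: the `bool ReadNext(Instance*)` signature, the `Instance` alias (plain ints standing in for `std::vector<LoDTensor>`), `InMemoryFileReader`, and `MakeShuffledBatchReader` are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>
#include <vector>

// Stand-in for std::vector<LoDTensor>: plain ints keep the sketch short.
using Instance = std::vector<int>;

// Interface shared by every reader (this signature is an assumption).
class ReaderBase {
 public:
  virtual ~ReaderBase() = default;
  // Writes the next output into `out`; returns false once data runs out.
  virtual bool ReadNext(Instance* out) = 0;
};

// Hypothetical file reader: yields one stored instance per call.
class InMemoryFileReader : public ReaderBase {
 public:
  explicit InMemoryFileReader(std::vector<Instance> data)
      : data_(std::move(data)) {}
  bool ReadNext(Instance* out) override {
    if (pos_ >= data_.size()) return false;
    *out = data_[pos_++];
    return true;
  }
 private:
  std::vector<Instance> data_;
  std::size_t pos_ = 0;
};

// Common base of decorated readers: owns the underlying reader.
class DecoratedReader : public ReaderBase {
 public:
  explicit DecoratedReader(std::unique_ptr<ReaderBase> reader)
      : reader_(std::move(reader)) {}
 protected:
  std::unique_ptr<ReaderBase> reader_;
};

// Buffers up to `buffer_size` instances, shuffles them, yields one per call.
class ShuffleReader : public DecoratedReader {
 public:
  ShuffleReader(std::unique_ptr<ReaderBase> reader, std::size_t buffer_size,
                unsigned seed)
      : DecoratedReader(std::move(reader)), buffer_size_(buffer_size), rng_(seed) {}
  bool ReadNext(Instance* out) override {
    if (buffer_.empty()) {  // refill, then reshuffle
      Instance ins;
      while (buffer_.size() < buffer_size_ && reader_->ReadNext(&ins)) {
        buffer_.push_back(ins);
      }
      std::shuffle(buffer_.begin(), buffer_.end(), rng_);
    }
    if (buffer_.empty()) return false;
    *out = std::move(buffer_.back());
    buffer_.pop_back();
    return true;
  }
 private:
  std::size_t buffer_size_;
  std::mt19937 rng_;
  std::vector<Instance> buffer_;
};

// Concatenates up to `batch_size` underlying outputs into one batch.
class BatchReader : public DecoratedReader {
 public:
  BatchReader(std::unique_ptr<ReaderBase> reader, std::size_t batch_size)
      : DecoratedReader(std::move(reader)), batch_size_(batch_size) {}
  bool ReadNext(Instance* out) override {
    out->clear();
    Instance ins;
    std::size_t n = 0;
    while (n < batch_size_ && reader_->ReadNext(&ins)) {
      out->insert(out->end(), ins.begin(), ins.end());
      ++n;
    }
    return n > 0;
  }
 private:
  std::size_t batch_size_;
};

// Chains file reader -> ShuffleReader -> BatchReader behind one interface.
std::unique_ptr<ReaderBase> MakeShuffledBatchReader(
    std::vector<Instance> data, std::size_t buffer_size, std::size_t batch_size) {
  std::unique_ptr<ReaderBase> file(new InMemoryFileReader(std::move(data)));
  std::unique_ptr<ReaderBase> shuffled(
      new ShuffleReader(std::move(file), buffer_size, /*seed=*/42));
  return std::unique_ptr<ReaderBase>(
      new BatchReader(std::move(shuffled), batch_size));
}
```

Because every layer only sees a `ReaderBase`, the caller of `MakeShuffledBatchReader` loops on `ReadNext` without caring how many decorators sit underneath.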
### `ReaderHolder`
Different readers belong to different class types. This leads to a problem: how can we put them into `Variable`s and fetch them out by a unified method? For example, if a Variable holds a `BatchReader`, we cannot get it with the following code:
```cpp
var->Get<ReaderBase>("batch_reader");
```
Instead, we have to write:
```cpp
var->Get<BatchReader>("batch_reader");
```
This means that every time we get a reader from a variable, we must know the reader's exact type, which is nearly impossible.
To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an empty decorator of `ReaderBase`, which erases the reader's type. With `ReaderHolder` we can fetch all types of readers by `var->Get<ReaderHolder>("...")` and treat the obtained object as a reader.
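
The type-erasure idea might look like the sketch below. It is only an illustration under assumptions: the `Variable` here is a simplified stand-in that matches the stored type exactly, and `CountingReader`, `Reset`, `IsType`, and `Drain` are hypothetical names introduced for the example, not part of the design above.

```cpp
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <utility>
#include <vector>

using Instance = std::vector<int>;  // stand-in for std::vector<LoDTensor>

class ReaderBase {
 public:
  virtual ~ReaderBase() = default;
  virtual bool ReadNext(Instance* out) = 0;  // assumed signature
};

// Hypothetical concrete reader: yields {0}, {1}, ..., {limit - 1}.
class CountingReader : public ReaderBase {
 public:
  explicit CountingReader(int limit) : limit_(limit) {}
  bool ReadNext(Instance* out) override {
    if (next_ >= limit_) return false;
    *out = Instance{next_++};
    return true;
  }
 private:
  int limit_;
  int next_ = 0;
};

// One concrete type that can hold any reader and forwards calls to it.
class ReaderHolder {
 public:
  void Reset(std::unique_ptr<ReaderBase> reader) { reader_ = std::move(reader); }
  bool ReadNext(Instance* out) { return reader_ && reader_->ReadNext(out); }
 private:
  std::unique_ptr<ReaderBase> reader_;
};

// Simplified Variable: it remembers the exact type it was created with, so
// the requested type must match exactly (no base-class lookup).
class Variable {
 public:
  template <typename T>
  T* GetMutable() {
    if (!IsType<T>()) {  // first use, or a different type: create a fresh T
      ptr_ = std::make_shared<T>();
      type_ = std::type_index(typeid(T));
    }
    return static_cast<T*>(ptr_.get());
  }
  template <typename T>
  bool IsType() const {
    return ptr_ != nullptr && type_ == std::type_index(typeid(T));
  }
 private:
  std::shared_ptr<void> ptr_;
  std::type_index type_{typeid(void)};
};

// Reads everything from the reader stored in `var` without knowing its type.
std::vector<int> Drain(Variable* var) {
  ReaderHolder* holder = var->GetMutable<ReaderHolder>();
  std::vector<int> values;
  Instance ins;
  while (holder->ReadNext(&ins)) {
    values.insert(values.end(), ins.begin(), ins.end());
  }
  return values;
}
```

`Drain` names only `ReaderHolder`. Whichever concrete reader was placed inside, the calling code stays the same.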
## Related Operators
To create and invoke readers, some new ops are introduced:
### `CreateReaderOp`
Each reader has its own creating op. A file reader's creating op has no input and yields the created file reader as its output. A decorated reader's creating op takes the underlying reader as its input and yields a new decorated reader.
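
As a rough sketch of the two kinds of creating ops, the functions below model an op as a plain function writing into a name-to-reader map. How real ops are registered and run is outside this document, so `Scope`, `RangeFileReader`, `CreateRangeFileReaderOp`, and `CreateBatchReaderOp` are all hypothetical names, and the `ReadNext` signature is an assumption.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Instance = std::vector<int>;  // stand-in for std::vector<LoDTensor>

class ReaderBase {
 public:
  virtual ~ReaderBase() = default;
  virtual bool ReadNext(Instance* out) = 0;  // assumed signature
};

// Hypothetical file reader: yields {0}, {1}, ..., {count - 1}.
class RangeFileReader : public ReaderBase {
 public:
  explicit RangeFileReader(int count) : count_(count) {}
  bool ReadNext(Instance* out) override {
    if (next_ >= count_) return false;
    *out = Instance{next_++};
    return true;
  }
 private:
  int count_;
  int next_ = 0;
};

// Decorated reader that shares ownership of its underlying reader, because
// the underlying reader Variable stays alive in the scope as well.
class BatchReader : public ReaderBase {
 public:
  BatchReader(std::shared_ptr<ReaderBase> reader, std::size_t batch_size)
      : reader_(std::move(reader)), batch_size_(batch_size) {}
  bool ReadNext(Instance* out) override {
    out->clear();
    Instance ins;
    std::size_t n = 0;
    while (n < batch_size_ && reader_->ReadNext(&ins)) {
      out->insert(out->end(), ins.begin(), ins.end());
      ++n;
    }
    return n > 0;
  }
 private:
  std::shared_ptr<ReaderBase> reader_;
  std::size_t batch_size_;
};

// Stand-in for a scope: reader Variables looked up by name.
using Scope = std::map<std::string, std::shared_ptr<ReaderBase>>;

// Creating op of a file reader: no reader input, one reader output.
void CreateRangeFileReaderOp(int count, const std::string& out, Scope* scope) {
  (*scope)[out] = std::make_shared<RangeFileReader>(count);
}

// Creating op of a decorated reader: underlying reader in, new reader out.
// Throws std::out_of_range if the input reader does not exist.
void CreateBatchReaderOp(const std::string& in, std::size_t batch_size,
                         const std::string& out, Scope* scope) {
  std::shared_ptr<ReaderBase> decorated =
      std::make_shared<BatchReader>(scope->at(in), batch_size);
  (*scope)[out] = decorated;
}
```

Calling `CreateRangeFileReaderOp` and then `CreateBatchReaderOp` on the same scope leaves two reader variables in it, the second decorating the first.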
### `ReadOp`
A reader is only a Variable. It cannot trigger the reading process by itself, so we add the `ReadOp` to execute it. A `ReadOp` takes a reader Variable as its input. Each time it runs, it invokes the reader's `ReadNext()` function and gets a new batch of data (or only one instance of data, if a file reader is used directly). The output data of a reader are in the form of `std::vector<LoDTensor>`, so the `ReadOp` also needs to split the vector and move the LoDTensors to their respective output Variables.
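
One run of a `ReadOp` could be sketched as below. The `LoDTensor` struct is a minimal stand-in, and the `ReadNext` signature, `PairReader`, `Scope`, and `RunReadOp` are assumptions introduced for illustration rather than the real operator code.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for LoDTensor; the real class carries much more.
struct LoDTensor {
  std::vector<float> data;
};

class ReaderBase {
 public:
  virtual ~ReaderBase() = default;
  // Assumed signature: fills `out`, returns false when no data is left.
  virtual bool ReadNext(std::vector<LoDTensor>* out) = 0;
};

// Hypothetical reader that yields two (image, label) pairs and then stops.
class PairReader : public ReaderBase {
 public:
  bool ReadNext(std::vector<LoDTensor>* out) override {
    if (step_ >= 2) return false;
    float base = static_cast<float>(step_++);
    *out = {LoDTensor{{base, base + 0.5f}}, LoDTensor{{base}}};
    return true;
  }
 private:
  int step_ = 0;
};

// Stand-in for the output Variables, looked up by name.
using Scope = std::map<std::string, LoDTensor>;

// One run of the op: read once, then move tensor i into output Variable i.
bool RunReadOp(ReaderBase* reader, const std::vector<std::string>& out_names,
               Scope* scope) {
  std::vector<LoDTensor> tensors;
  if (!reader->ReadNext(&tensors)) return false;          // data exhausted
  if (tensors.size() != out_names.size()) return false;   // arity mismatch
  for (std::size_t i = 0; i < tensors.size(); ++i) {
    (*scope)[out_names[i]] = std::move(tensors[i]);  // move, do not copy
  }
  return true;
}
```

Moving each tensor out of the vector avoids copying the data, which matches the "split the vector and move" step described above.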