prepare_steps_en.rst

..  _user_guide_prepare_steps_en:

#############
Prepare Steps
#############

Data preparation in PaddlePaddle Fluid can be separated into 3 steps.

Step 1: Define a reader to generate training/testing data
##########################################################

The generated data type can be Numpy Array or LoDTensor. According to the different data formats returned by the reader, it can be divided into Batch Reader and Sample Reader.

The batch reader yields a mini-batch data for each, while the sample reader yields a sample data for each.

If your reader yields a sample data, we provide a data augmentation and batching tool for you: :code:`Python Reader` .

Step 2: Define data layer variables in network
###############################################

Users should use :code:`fluid.data` to define data layer variables. Name, dtype and shape are required when defining. For example,

.. code-block:: python

    import paddle.fluid as fluid

    image = fluid.data(name='image', dtype='float32', shape=[None, 28, 28])
    label = fluid.data(name='label', dtype='int64', shape=[None, 1])

None means that the dimension is uncertain. In this example, None means the batch size.

Step 3: Send the data to network for training/testing
######################################################

PaddlePaddle Fluid provides 2 methods for sending data to the network: Asynchronous DataLoader API, and Synchronous Feed Method.

- Asynchronous DataLoader API

User should use :code:`fluid.io.DataLoader` to define a DataLoader object and use its setter method to set the data source.
When using DataLoader API, the process of data sending works asynchronously with network training/testing.
It is an efficient way for sending data and recommended to use.

- Synchronous Feed Method

User should create the feeding data beforehand and use :code:`executor.run(feed=...)` to send the data to :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` .
Data preparation and network training/testing work synchronously, which is less efficient.

Comparison of these 2 methods are as follows:

==========================  =================================   ======================================
Comparison item                 Synchronous Feed Method              Asynchronous DataLoader API
==========================  =================================   ======================================
API                           :code:`executor.run(feed=...)`          :code:`fluid.io.DataLoader`
Data type                       Numpy Array or LoDTensor                Numpy Array or LoDTensor
Data augmentation            use Python for data augmentation       use Python for data augmentation
Speed                                     slow                                    rapid
Recommended applications            model debugging                        industrial training
==========================  =================================   ======================================

Choose different usages for different data formats
###################################################

According to the different data formats of reader, users should choose different usages for data preparation.

Read data from sample reader
+++++++++++++++++++++++++++++

If user-defined reader is a sample reader, users should use the following steps:

Step 1. Batching
=================

Use the data reader interfaces in PaddlePaddle Fluid for data augmentation and batching. Please refer to `Python Reader <./reader.html>`_ for details.

Step 2. Sending data
=====================

If using Asynchronous DataLoader API, please use :code:`set_sample_generator` or :code:`set_sample_list_generator` to set the data source for DataLoader. Please refer to :ref:`user_guide_use_py_reader_en` for details.

If using Synchronous Feed Method, please use DataFeeder to convert the reader data to LoDTensor before sending to the network. Please refer to :ref:`api_fluid_DataFeeder` for details.

Read data from sample reader
+++++++++++++++++++++++++++++

Step 1. Batching
=================

Since the reader has been a batch reader, this step can be skipped.

Step 2. Sending data
=====================

If using Asynchronous DataLoader API, please use :code:`set_batch_generator` to set the data source for DataLoader. Please refer to :ref:`user_guide_use_py_reader_en` for details.

If using Synchronous Feed Method, please refer to :ref:`user_guide_use_numpy_array_as_train_data_en` for details.