Besides synchronous data reading, we provide DataLoader. The performance of DataLoader is better than :ref:`user_guide_use_numpy_array_as_train_data_en` , because data reading runs asynchronously with model training when DataLoader is in use, and DataLoader can cooperate with :code:`double_buffer_reader` to improve the performance of reading data. What's more, :code:`double_buffer_reader` can perform the transformation from CPU Tensor to GPU Tensor asynchronously, which improves the efficiency of reading data to some extent.
- ``capacity`` is the buffer size of the DataLoader object in batches;
- ``use_double_buffer`` is True by default, which means ``double_buffer_reader`` is used. Keeping it True is recommended because it can improve data reading speed;
- ``iterable`` is True by default, which means the DataLoader object can be iterated with a Python for-range loop. When ``iterable = True`` , the DataLoader is decoupled from the Program, which means defining the DataLoader object does not change the Program; when ``iterable = False`` , the DataLoader inserts operators related to data reading into the Program.
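In Fluid's static-graph API, a DataLoader with these parameters is typically created through :code:`fluid.io.DataLoader.from_generator` . Below is a minimal sketch; the ``image`` / ``label`` feed variables and the ``ITERABLE`` flag are only illustrative:

.. code-block:: python

    import paddle.fluid as fluid

    # Whether to run the DataLoader in iterable (for-range) mode
    ITERABLE = True

    # Feed variables that the DataLoader will fill with data
    image = fluid.layers.data(name='image', shape=[784], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    loader = fluid.io.DataLoader.from_generator(
        feed_list=[image, label],   # variables to feed
        capacity=10,                # buffer size, in batches
        use_double_buffer=True,     # use double_buffer_reader
        iterable=ITERABLE)          # decouple the loader from the Program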
Attention: ``Program.clone()`` (refer to :ref:`api_fluid_Program` ) can't copy DataLoader objects.
If you want to create multiple DataLoader objects (such as two different DataLoaders for the training and inference periods respectively), you have to define different DataLoader objects.
While using DataLoader, if you need to share the model parameters between the training and testing periods, you can use :code:`fluid.unique_name.guard()` .
Notes: Paddle uses different names to distinguish different variables, and the names are generated by the counter in the :code:`unique_name` module, which increases by one every time a variable name is generated. :code:`fluid.unique_name.guard()` resets this counter, which ensures that the variable names are the same across repeated calls to :code:`fluid.unique_name.guard()` , so that parameters can be shared.
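As a small illustration of the counter behaviour described above, the sketch below uses :code:`fluid.unique_name.generate` only to make the counter visible (real programs rarely call it directly):

.. code-block:: python

    import paddle.fluid as fluid

    with fluid.unique_name.guard():
        print(fluid.unique_name.generate('fc'))  # e.g. fc_0
        print(fluid.unique_name.generate('fc'))  # e.g. fc_1

    with fluid.unique_name.guard():
        # The counter is reset, so the same names are produced again,
        # which is what allows parameters to be shared between programs.
        print(fluid.unique_name.generate('fc'))  # e.g. fc_0 again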
An example of configuring networks during the training and testing periods by DataLoader is as follows:
.. code-block:: python

    import paddle.fluid as fluid
    import paddle.dataset.mnist as mnist
    import numpy

    def network():
        # Define the feed variables and create the DataLoader object.
        # The same function builds both the training and the testing
        # network, so it is wrapped in fluid.unique_name.guard() below
        # to share parameters between the two programs.
        img = fluid.layers.data(name='img', shape=[784], dtype='float32')
        label = fluid.layers.data(name='label', shape=[1], dtype='int64')
        loader = fluid.io.DataLoader.from_generator(
            feed_list=[img, label],
            capacity=10,
            iterable=True,
            use_double_buffer=True)
        ...
        # Here, we omit the definition of the loss of the model
        return loss, loader

    # Create main program and startup program for training.
    train_prog = fluid.Program()
    train_startup = fluid.Program()

    with fluid.program_guard(train_prog, train_startup):
        # Use fluid.unique_name.guard() to share parameters with the test network.
        with fluid.unique_name.guard():
            train_loss, train_loader = network()
            adam = fluid.optimizer.Adam(learning_rate=0.01)
            adam.minimize(train_loss)

    # Create main program and startup program for testing.
    test_prog = fluid.Program()
    test_startup = fluid.Program()

    with fluid.program_guard(test_prog, test_startup):
        # Use fluid.unique_name.guard() to share parameters with the train network.
        with fluid.unique_name.guard():
            test_loss, test_loader = network()
Configure data source of DataLoader object
##########################################
The DataLoader object sets its data source by :code:`set_sample_generator()` , :code:`set_sample_list_generator()` or :code:`set_batch_generator()` . These three methods all receive a Python generator :code:`generator` as a parameter. The differences among them are:
- :code:`generator` of :code:`set_sample_generator()` should return data of the :code:`[img_1, label_1]` form, in which ``img_1`` and ``label_1`` are one sample's data of Numpy array type.
- :code:`generator` of :code:`set_sample_list_generator()` should return data of the :code:`[(img_1, label_1), (img_2, label_2), ..., (img_n, label_n)]` form, in which ``img_i`` and ``label_i`` are one sample's data of Numpy array type, and ``n`` is the batch size.
- :code:`generator` of :code:`set_batch_generator()` should return data of the :code:`[batched_imgs, batched_labels]` form, in which ``batched_imgs`` and ``batched_labels`` are one batch's data of Numpy array or LoDTensor type.
Please note that when using DataLoader for multi-GPU card (or multi-CPU core) training, the actual total batch size is the batch size of the user-provided generator multiplied by the number of devices.
When :code:`iterable = True` (the default) for the DataLoader, the ``places`` parameter must be passed to these three methods, specifying whether to convert the data to CPU Tensor or GPU Tensor. When :code:`iterable = False` , there is no need to pass the ``places`` parameter.
For example, suppose we have two readers: ``fake_sample_reader`` , which returns one sample's data at a time, and ``fake_batch_reader`` , which returns one batch's data at a time.
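A minimal sketch of what these two readers might look like, and of how they could be attached to the ``train_loader`` and ``test_loader`` defined earlier (the MNIST-style shapes and ``BATCH_SIZE`` here are illustrative assumptions):

.. code-block:: python

    import numpy as np
    import paddle.fluid as fluid

    BATCH_SIZE = 32

    # Returns one sample at a time: [img, label]
    def fake_sample_reader():
        for _ in range(1024):
            img = np.random.random([784]).astype('float32')
            label = np.random.randint(0, 10, [1]).astype('int64')
            yield [img, label]

    # Returns one batch at a time: [batched_imgs, batched_labels]
    def fake_batch_reader():
        for _ in range(1024 // BATCH_SIZE):
            batched_imgs = np.random.random([BATCH_SIZE, 784]).astype('float32')
            batched_labels = np.random.randint(0, 10, [BATCH_SIZE, 1]).astype('int64')
            yield [batched_imgs, batched_labels]

    # When iterable=True, places decides whether the data becomes CPU or GPU Tensor.
    places = fluid.cpu_places()
    train_loader.set_sample_generator(fake_sample_reader, batch_size=BATCH_SIZE, places=places)
    test_loader.set_batch_generator(fake_batch_reader, places=places)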
Examples of using DataLoader to train and test models are as follows:
- Step 1, we need to set up the training network and the testing network, define the corresponding DataLoader objects, and configure the data sources of the DataLoader objects.
If :code:`iterable = True` , the DataLoader object is a Python generator that can be iterated over directly with a for-range loop. The results returned by the for-range loop are passed to the executor through the ``feed`` parameter of ``exe.run()``.
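A minimal sketch of this mode, assuming ``train_loader`` , ``train_prog`` , ``train_loss`` and an executor ``exe`` have been set up as above:

.. code-block:: python

    for epoch_id in range(10):
        # Each iteration yields the feed data for one batch (a list of
        # feed dicts when running on multiple devices).
        for data in train_loader():
            loss_value, = exe.run(program=train_prog,
                                  feed=data,
                                  fetch_list=[train_loss])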
If :code:`iterable = False` , call the ``start()`` method to start the DataLoader object before each epoch begins. Since ``exe.run()`` throws a ``fluid.core.EOFException`` exception at the end of each epoch, call the ``reset()`` method after catching the exception to reset the status of the DataLoader object and start the iteration of the next epoch. When :code:`iterable = False` , there is no need to pass the ``feed`` parameter to ``exe.run()``.
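A minimal sketch of this mode, again assuming ``train_loader`` , ``train_prog`` , ``train_loss`` and ``exe`` are defined as above:

.. code-block:: python

    for epoch_id in range(10):
        train_loader.start()  # start the DataLoader before each epoch
        try:
            while True:
                # No feed is needed: the reading operators inserted into
                # the Program provide the data.
                loss_value, = exe.run(program=train_prog,
                                      fetch_list=[train_loss])
        except fluid.core.EOFException:
            train_loader.reset()  # reset the DataLoader after the epoch ends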