DataLoader_cn.rst 9.2 KB
Newer Older
Z
Zeng Jinle 已提交
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
.. _cn_api_fluid_io_DataLoader:

DataLoader
-------------------------------

.. py:class:: paddle.fluid.io.DataLoader


.. py:method:: from_generator(feed_list=None, capacity=None, use_double_buffer=True, iterable=True, return_list=False)

创建一个DataLoader对象用于加载Python生成器产生的数据。数据会由Python线程预先读取,并异步送入一个队列中。

本方法创建的DataLoader对象提供了3个方法设置数据源,分别是 :code:`set_sample_generator` , :code:`set_sample_list_generator` 和
:code:`set_batch_generator` 。请查阅下述示例代码了解它们的使用方法。

如果iterable = True,本方法创建的DataLoader对象时一个Python生成器,可以for-range的方法循环迭代。

如果iterable = False,本方法创建的DataLoader对象提供 :code:`start()` 和 :code:`reset()` 方法控制数据读取过程。此模式用于兼容
``fluid.layers.py_reader`` 的使用方式。用户可使用iterable = False模式,方便地将 ``fluid.layers.py_reader`` 的代码迁移至
``fluid.io.DataLoader`` 。

参数:
    - **feed_list** (list(Variable)|tuple(Variable)) - feed变量列表,由 ``fluid.layers.data()`` 创建。
    - **capacity** (int) - DataLoader对象内部维护队列的容量大小。单位是batch数量。若reader读取速度较快,建议设置较大的capacity值。
    - **use_double_buffer** (bool) - 是否使用 ``double_buffer_reader`` 。若use_double_buffer=True,DataLoader会异步地预读取下一个batch的数据,可加速数据读取过程,但同时会占用少量的CPU/GPU存储,即一个batch输入数据的存储空间。
    - **iterable** (bool) - 所创建的DataLoader对象是否可迭代。
    - **return_list** (bool) - 每个设备上的数据是否以list形式返回。仅在iterable = True模式下有效。若return_list = False,每个设备上的返回数据均是str -> LoDTensor的映射表,其中映射表的key是每个输入变量的名称。若return_list = True,则每个设备上的返回数据均是list(LoDTensor)。推荐在静态图模式下使用return_list = False,在动态图模式下使用return_list = True。

返回: 被创建的DataLoader对象

返回类型: loader (DataLoader)

**代码示例**

.. code-block:: python

            import paddle.fluid as fluid
            import numpy as np

            BATCH_NUM = 10
            BATCH_SIZE = 16
            EPOCH_NUM = 4

            CLASS_NUM = 10

            ITERABLE = True # whether the created DataLoader object is iterable
            USE_GPU = False # whether to use GPU

            DATA_FORMAT = 'batch_generator' # data format of data source user provides

            def simple_net(image, label):
                fc_tmp = fluid.layers.fc(image, size=CLASS_NUM)
                cross_entropy = fluid.layers.softmax_with_cross_entropy(image, label)
                loss = fluid.layers.reduce_mean(cross_entropy)
                sgd = fluid.optimizer.SGD(learning_rate=1e-3)
                sgd.minimize(loss)
                return loss

            def get_random_images_and_labels(image_shape, label_shape):
                image = np.random.random(size=image_shape).astype('float32')
                label = np.random.random(size=label_shape).astype('int64')
                return image, label

            # If the data generator yields one sample each time,
            # use DataLoader.set_sample_generator to set the data source.
            def sample_generator_creator():
                def __reader__():
                    for _ in range(BATCH_NUM * BATCH_SIZE):
                        image, label = get_random_images_and_labels([784], [1])
                        yield image, label

                return __reader__

            # If the data generator yield list of samples each time,
            # use DataLoader.set_sample_list_generator to set the data source.
            def sample_list_generator_creator():
                def __reader__():
                    for _ in range(BATCH_NUM):
                        sample_list = []
                        for _ in range(BATCH_SIZE):
                            image, label = get_random_images_and_labels([784], [1])
                            sample_list.append([image, label])

                        yield sample_list

                return __reader__

            # If the data generator yields a batch each time,
            # use DataLoader.set_batch_generator to set the data source.
            def batch_generator_creator():
                def __reader__():
                    for _ in range(BATCH_NUM):
                        batch_image, batch_label = get_random_images_and_labels([BATCH_SIZE, 784], [BATCH_SIZE, 1])
                        yield batch_image, batch_label

                return __reader__

            # If DataLoader is iterable, use for loop to train the network
            def train_iterable(exe, prog, loss, loader):
                for _ in range(EPOCH_NUM):
                    for data in loader():
                        exe.run(prog, feed=data, fetch_list=[loss])

            # If DataLoader is not iterable, use start() and reset() method to control the process
            def train_non_iterable(exe, prog, loss, loader):
                for _ in range(EPOCH_NUM):
                    loader.start() # call DataLoader.start() before each epoch starts
                    try:
                        while True:
                            exe.run(prog, fetch_list=[loss])
                    except fluid.core.EOFException:
                        loader.reset() # call DataLoader.reset() after catching EOFException

            def set_data_source(loader, places):
                if DATA_FORMAT == 'sample_generator':
                    loader.set_sample_generator(sample_generator_creator(), batch_size=BATCH_SIZE, drop_last=True, places=places)
                elif DATA_FORMAT == 'sample_list_generator':
                    loader.set_sample_list_generator(sample_list_generator_creator(), places=places)
                elif DATA_FORMAT == 'batch_generator':
                    loader.set_batch_generator(batch_generator_creator(), places=places)
                else:
                    raise ValueError('Unsupported data format')

            image = fluid.layers.data(name='image', shape=[784], dtype='float32')
            label = fluid.layers.data(name='label', shape=[1], dtype='int64')

            # Define DataLoader
            loader = fluid.io.DataLoader.from_generator(feed_list=[image, label], capacity=16, iterable=ITERABLE)

            # Define network
            loss = simple_net(image, label)

            # Set data source of DataLoader
            #
            # If DataLoader is iterable, places must be given and the number of places must be the same with device number.
            #  - If you are using GPU, call `fluid.cuda_places()` to get all GPU places.
            #  - If you are using CPU, call `fluid.cpu_places()` to get all CPU places.
            #
            # If DataLoader is not iterable, places can be None.
            places = fluid.cuda_places() if USE_GPU else fluid.cpu_places()
            set_data_source(loader, places)

            exe = fluid.Executor(places[0])
            exe.run(fluid.default_startup_program())

            prog = fluid.CompiledProgram(fluid.default_main_program()).with_data_parallel(loss_name=loss.name)

            if loader.iterable:
                train_iterable(exe, prog, loss, loader)
            else:
                train_non_iterable(exe, prog, loss, loader)


            '''
            Users can use return_list = True in dygraph mode.
            '''
            with fluid.dygraph.guard(places[0]):
                loader = fluid.io.DataLoader.from_generator(capacity=2, return_list=True)
                set_data_source(loader, places[0])
                for image, label in loader():
                    relu = fluid.layers.relu(image)
                    assert image.shape == [BATCH_SIZE, 784]
                    assert label.shape == [BATCH_SIZE, 1]
                    assert relu.shape == [BATCH_SIZE, 784]


.. py:method:: from_dataset(dataset, places, drop_last=True)

创建一个DataLoader对象用于加载Dataset产生的数据。目前,Dataset仅支持Linux系统下使用。

参数:
    - **dataset** (InMemoryDataset|QueueDataset) - Dataset对象。
    - **places** (list(CUDAPlace)|list(CPUPlace)) - DataLoader对象返回数据所在的place。
    - **drop_last** (bool) - 是否丢弃最后样本数量不足batch size的batch。若drop_last = True则丢弃,若drop_last = False则不丢弃。

返回: 被创建的DataLoader对象,可以for-range的方式循环迭代

返回类型: loader (DataLoader)

**代码示例**

.. code-block:: python

            import paddle.fluid as fluid

            image = fluid.layers.data(name='image', shape=[784], dtype='float32')
            label = fluid.layers.data(name='label', shape=[1], dtype='int64')

            dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
            dataset.set_batch_size(32)
            dataset.set_filelist(['a.txt', 'b.txt', 'c.txt'])
            dataset.set_use_var([image, label])
            dataset.set_pipe_command('cat')

            loader = fluid.io.DataLoader.from_dataset(dataset, fluid.cpu_places())