From c3e4fc1b7f0b9a5682dbf74465aac15a2fb529b5 Mon Sep 17 00:00:00 2001
From: Zeng Jinle <32832641+sneaxiy@users.noreply.github.com>
Date: Wed, 9 Oct 2019 18:30:17 +0800
Subject: [PATCH] Refine prepare steps doc (#1482)

* refine prepare steps doc, test=develop

* replace layers.data with fluid.data, test=develop
---
 .../howto/prepare_data/prepare_steps.rst | 88 +++++----------
 .../howto/prepare_data/prepare_steps_en.rst | 100 ++++++++++++------
 2 files changed, 99 insertions(+), 89 deletions(-)

diff --git a/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst b/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst
index 69b39ee8d..3c6242e09 100644
--- a/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst
@@ -6,7 +6,7 @@

 Preparing data with PaddlePaddle Fluid takes three steps:

-Step1: Define a custom Reader to generate training/prediction data
+Step 1: Define a custom Reader to generate training/prediction data
 ###################################

 The generated data can be of type Numpy Array or LoDTensor. Depending on the form of the data the Reader returns, Readers are divided into Batch-level Readers and Sample-level Readers.
@@ -16,41 +16,28 @@ A Batch-level Reader returns one Batch of data at a time, a Sample-level Reader returns

 If your data is Sample-level, we provide a tool that can preprocess data and assemble batches: :code:`Python Reader` .

-Step2: Define data layer variables in the network configuration
+Step 2: Define data layer variables in the network configuration
 ###################################

-Use :code:`fluid.layers.data` to define data layer variables in the network. When defining a data layer variable, specify its name, data type (dtype), and shape. For example:
+Use :code:`fluid.data` to define data layer variables in the network. When defining a data layer variable, specify its name, data type (dtype), and shape. For example:

 .. code-block:: python

     import paddle.fluid as fluid

-    image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28])
-    label = fluid.layers.data(name='label', dtype='int64', shape=[1])
+    image = fluid.data(name='image', dtype='float32', shape=[None, 28, 28])
+    label = fluid.data(name='label', dtype='int64', shape=[None, 1])

+Here, None denotes a dimension of unknown size. In this example, None stands for the batch size.

-Note that the shape here is that of a single sample. PaddlePaddle Fluid inserts -1 at dimension 0 of the shape to represent the batch_size dimension, so in this example image.shape is [-1, 28, 28] and
-label.shape is [-1, 1].
-
-If you do not want the framework to insert -1 at dimension 0, pass append_batch_size=False:
-
-.. code-block:: python
-
-    import paddle.fluid as fluid
-
-    image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28], append_batch_size=False)
-    label = fluid.layers.data(name='label', dtype='int64', shape=[1], append_batch_size=False)
-
-In this case, image.shape is [28, 28] and label.shape is [1].
-
-Step3: Send the data to the network for training/prediction
+Step 3: Send the data to the network for training/prediction
 ###################################

-Fluid provides two ways to do this, the asynchronous PyReader interface and the synchronous Feed method, described below:
+Fluid provides two ways to do this, the asynchronous DataLoader interface and the synchronous Feed method, described below:

-- Asynchronous PyReader interface
+- Asynchronous DataLoader interface

-First use :code:`fluid.io.PyReader` to define a PyReader object, then set the data source through the object's decorate methods.
-With the PyReader interface, data feeding runs asynchronously with model training/prediction, which is more efficient; this approach is recommended.
+First use :code:`fluid.io.DataLoader` to define a DataLoader object, then set the data source through the object's set methods.
+With the DataLoader interface, data feeding runs asynchronously with model training/prediction, which is more efficient; this approach is recommended.

 - Synchronous Feed method

@@ -62,9 +49,9 @@ Fluid provides two ways: the asynchronous PyReader interface or the synchronous Feed method
 The two data preparation methods compare as follows:

 ======== ================================= =====================================
-Aspect   Synchronous Feed                  Asynchronous PyReader API
+Aspect   Synchronous Feed                  Asynchronous DataLoader API
 ======== ================================= =====================================
-API      :code:`executor.run(feed=...)`    :code:`fluid.io.PyReader`
+API      :code:`executor.run(feed=...)`    :code:`fluid.io.DataLoader`
 Format   Numpy Array or LoDTensor          Numpy Array or LoDTensor
 Augment  done with other Python libraries  done with other Python libraries
 Speed    slow                              fast
@@ -72,7 +59,7 @@ API :code:`executor.run(feed=...)` :code:`fluid.io.PyReader`
 ======== ================================= =====================================

 How the Reader data type affects usage
-###############################
+###########################

 Depending on the Reader data type, the concrete operations in the steps above differ, as described below:

@@ -81,49 +68,32 @@ How the Reader data type affects usage

 If the custom Reader returns a single sample at a time, feed the data in with the following steps:

-Step1. Assemble the data
-=============================
+Step 1. Assemble the data
+================

-Call the Reader-related interfaces provided by Fluid to assemble batches and do part of the data preprocessing. For details, see:
+Call the Reader-related interfaces provided by Fluid to assemble batches and do part of the data preprocessing. For details, see: `Data preprocessing tools <./reader_cn.html>`_ .

-.. toctree::
-   :maxdepth: 1
+Step 2. Feed in the data
+================

-   reader_cn.md
+To feed data through the asynchronous DataLoader interface, call :code:`set_sample_generator` or :code:`set_sample_list_generator` . For details, see: :ref:`user_guides_use_py_reader` .

-Step2. Feed in the data
-=================================
-
-To feed data through the asynchronous PyReader interface, call :code:`decorate_sample_generator` or :code:`decorate_sample_list_generator` . For details, see:
-
-- :ref:`user_guides_use_py_reader`
-
-To feed data through the synchronous Feed method, use the DataFeeder interface to convert the Reader data to LoDTensor format before sending it to the network. For details, see :ref:`cn_api_fluid_DataFeeder`
+To feed data through the synchronous Feed method, use the DataFeeder interface to convert the Reader data to LoDTensor format before sending it to the network. For details, see :ref:`cn_api_fluid_DataFeeder` .

 Reading data from a Batch-level Reader
-+++++++++++++++++++++++
-
-Step1. Assemble the data
-=================
-
-Since the batches are already assembled, the condition of Step1 is met and you can go straight to Step2
-
-Step2. Feed in the data
-=================================
-
-To feed data through the asynchronous PyReader interface, call PyReader's :code:`decorate_batch_generator` interface. For details, see:
+++++++++++++++++++++

-.. toctree::
-   :maxdepth: 1
+Step 1. Assemble the data
+================

-   use_py_reader.rst
+Since the batches are already assembled, the condition of Step 1 is met and you can go straight to Step 2.

-To feed data through the synchronous Feed method, see:
+Step 2. Feed in the data
+================

-.. toctree::
-   :maxdepth: 1
+To feed data through the asynchronous DataLoader interface, call DataLoader's :code:`set_batch_generator` interface. For details, see: :ref:`user_guides_use_py_reader` .

-   feeding_data.rst
+To feed data through the synchronous Feed method, see: :ref:`user_guide_use_numpy_array_as_train_data` .
diff --git a/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst b/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst
index b8c0a9948..609a6ac18 100644
--- a/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst
@@ -4,52 +4,92 @@
 Prepare Steps
 #############

-PaddlePaddle Fluid supports two methods to feed data into networks:
+Data preparation in PaddlePaddle Fluid can be separated into 3 steps.

-1. Synchronous method - Python Reader:Firstly, use :code:`fluid.layers.data` to set up data input layer. Then, feed in the training data through :code:`executor.run(feed=...)` in :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` .
+Step 1: Define a reader to generate training/testing data
+##########################################################

-2. Asynchronous method - py_reader:Firstly, use :code:`fluid.layers.py_reader` to set up data input layer. Then configure the data source with functions :code:`decorate_paddle_reader` or :code:`decorate_tensor_provider` of :code:`py_reader` . After that, call :code:`fluid.layers.read_file` to read data.
+The generated data type can be Numpy Array or LoDTensor. Depending on the format of the data it returns, a reader is either a batch reader or a sample reader.

+A batch reader yields one mini-batch at a time, while a sample reader yields one sample at a time.

+If your reader yields one sample at a time, we provide a data augmentation and batching tool for you: :code:`Python Reader` .

-Comparisons of the two methods:
+Step 2: Define data layer variables in network
+###############################################

-========================= ==================================================== ===============================================
-Aspects                   Synchronous Python Reader                            Asynchronous py_reader
-========================= ==================================================== ===============================================
-API interface             :code:`executor.run(feed=...)`                       :code:`fluid.layers.py_reader`
-data type                 Numpy Array                                          Numpy Array or LoDTensor
-data augmentation         carried out by other libraries on Python end         carried out by other libraries on Python end
-velocity                  slow                                                 rapid
-recommended applications  model debugging                                      industrial training
-========================= ==================================================== ===============================================
+Users should use :code:`fluid.data` to define data layer variables. The name, dtype and shape are required when defining a variable. For example,

-Synchronous Python Reader
-##########################
+.. code-block:: python

-Fluid provides Python Reader to feed in data.
+    import paddle.fluid as fluid

-Python Reader is a pure Python-side interface, and data feeding is synchronized with the model training/prediction process. Users can pass in data through Numpy Array. For specific operations, please refer to:
+    image = fluid.data(name='image', dtype='float32', shape=[None, 28, 28])
+    label = fluid.data(name='label', dtype='int64', shape=[None, 1])

+None means that the size of the dimension is unknown. In this example, None stands for the batch size.

-.. toctree::
-   :maxdepth: 1
+Step 3: Send the data to network for training/testing
+######################################################

-   feeding_data_en.rst
+PaddlePaddle Fluid provides 2 methods for sending data to the network: the Asynchronous DataLoader API and the Synchronous Feed Method.

-Python Reader supports advanced functions like group batch, shuffle. For specific operations, please refer to:
+- Asynchronous DataLoader API

-.. toctree::
-   :maxdepth: 1
+Users should use :code:`fluid.io.DataLoader` to define a DataLoader object and use its setter method to set the data source.
+When the DataLoader API is used, data is sent asynchronously with network training/testing.
+It is an efficient way to send data and is the recommended approach.

-   reader.md
+- Synchronous Feed Method

-Asynchronous py_reader
-########################
+Users should create the feed data beforehand and use :code:`executor.run(feed=...)` to send the data to :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` .
+Data preparation and network training/testing work synchronously, which is less efficient.

-Fluid provides asynchronous data feeding method PyReader. It is more efficient as data feeding is not synchronized with the model training/prediction process. For specific operations, please refer to:
+A comparison of these 2 methods is as follows:

-.. toctree::
-   :maxdepth: 1
+========================== ================================= ======================================
+Comparison item            Synchronous Feed Method           Asynchronous DataLoader API
+========================== ================================= ======================================
+API                        :code:`executor.run(feed=...)`    :code:`fluid.io.DataLoader`
+Data type                  Numpy Array or LoDTensor          Numpy Array or LoDTensor
+Data augmentation          use Python for data augmentation  use Python for data augmentation
+Speed                      slow                              fast
+Recommended applications   model debugging                   industrial training
+========================== ================================= ======================================

-   use_py_reader_en.rst
+Choose different usages for different data formats
+###################################################
+
+Depending on the data format of the reader, users should prepare data in different ways.
+
+Read data from sample reader
+++++++++++++++++++++++++++++++
+
+If the user-defined reader is a sample reader, follow these steps:
+
+Step 1. Batching
+=================
+
+Use the data reader interfaces in PaddlePaddle Fluid for data augmentation and batching. Please refer to `Python Reader <./reader.html>`_ for details.
+
+Step 2. Sending data
+=====================
+
+If using the Asynchronous DataLoader API, please use :code:`set_sample_generator` or :code:`set_sample_list_generator` to set the data source for DataLoader. Please refer to :ref:`user_guide_use_py_reader_en` for details.
+
+If using the Synchronous Feed Method, please use DataFeeder to convert the reader data to LoDTensor before sending it to the network. Please refer to :ref:`api_fluid_DataFeeder` for details.
+
+Read data from batch reader
+++++++++++++++++++++++++++++++
+
+Step 1. Batching
+=================
+
+Since the reader is already a batch reader, this step can be skipped.
+
+Step 2. Sending data
+=====================
+
+If using the Asynchronous DataLoader API, please use :code:`set_batch_generator` to set the data source for DataLoader. Please refer to :ref:`user_guide_use_py_reader_en` for details.
+
+If using the Synchronous Feed Method, please refer to :ref:`user_guide_use_numpy_array_as_train_data_en` for details.
\ No newline at end of file
--
GitLab
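
Note on the patch above: the docs distinguish a sample reader (one sample per iteration) from a batch reader (one mini-batch per iteration), and say a sample reader must be batched before the data is sent to the network. The sketch below illustrates that distinction in plain Python only. It does not call PaddlePaddle, and the names `sample_reader` and `batch_samples` are hypothetical helpers invented for this illustration, not part of the Fluid API.

```python
# Pure-Python sketch of "sample reader" vs "batch reader".
# Assumption: sample_reader and batch_samples are hypothetical names,
# not PaddlePaddle APIs; no PaddlePaddle import is needed to run this.

def sample_reader():
    """Sample reader: yields one (image, label) sample per iteration."""
    for i in range(5):
        image = [float(i)] * 4  # stand-in for a flattened image
        label = i % 2
        yield image, label


def batch_samples(reader, batch_size, drop_last=False):
    """Wrap a sample reader so that it yields lists of samples (batches)."""
    def batch_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        # Emit the final, smaller batch unless the caller asked to drop it.
        if batch and not drop_last:
            yield batch
    return batch_reader


batches = list(batch_samples(sample_reader, batch_size=2)())
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

With five samples and a batch size of 2, the last batch is smaller than the rest; passing `drop_last=True` discards it, which is a common choice when every batch must have the same shape.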