Commit c3e4fc1b (unverified), author: Zeng Jinle, committer: GitHub

Refine prepare steps doc (#1482)

* refine prepare steps doc, test=develop

* replace layers.data with fluid.data, test=develop
Parent 1d0b8615
@@ -6,7 +6,7 @@
 Preparing data with PaddlePaddle Fluid takes three steps:
-Step1: Define a custom Reader to generate training/prediction data
+Step 1: Define a custom Reader to generate training/prediction data
 ######################################################################
 The generated data can be Numpy Arrays or LoDTensors. Depending on the form of the data a Reader returns, Readers are divided into Batch-level Readers and Sample-level Readers.
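To make the Batch-level versus Sample-level distinction concrete, here is a minimal sketch in plain Python. The names `sample_reader` and `batch_reader`, the sample counts, and the use of nested lists in place of Numpy Arrays are illustrative assumptions, not part of the Fluid API.

```python
import random


def sample_reader(num_samples=8):
    """Sample-level reader: yields one (image, label) pair per iteration."""
    rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    for _ in range(num_samples):
        image = [[rng.random() for _ in range(28)] for _ in range(28)]  # 28 x 28
        label = [rng.randint(0, 9)]                                     # shape [1]
        yield image, label


def batch_reader(batch_size=4, num_batches=2):
    """Batch-level reader: yields a whole batch of images and labels per iteration."""
    rng = random.Random(1)
    for _ in range(num_batches):
        images = [[[rng.random() for _ in range(28)] for _ in range(28)]
                  for _ in range(batch_size)]                           # batch x 28 x 28
        labels = [[rng.randint(0, 9)] for _ in range(batch_size)]       # batch x 1
        yield images, labels
```

The only difference that matters to the rest of the pipeline is the leading batch dimension: the sample reader leaves batching to a later step, while the batch reader has already done it.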
@@ -16,41 +16,28 @@ A Batch-level Reader returns one Batch of data at a time; a Sample-level Reader returns
 If your data is Sample-level, we provide a tool for preprocessing data and assembling batches: :code:`Python Reader` .
-Step2: Define data layer variables in the network configuration
+Step 2: Define data layer variables in the network configuration
 ######################################################################
-Use :code:`fluid.layers.data` to define data layer variables in the network. Each definition must specify the layer's name, data type (dtype) and shape. For example:
+Use :code:`fluid.data` to define data layer variables in the network. Each definition must specify the layer's name, data type (dtype) and shape. For example:
 .. code-block:: python
     import paddle.fluid as fluid
-    image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28])
-    label = fluid.layers.data(name='label', dtype='int64', shape=[1])
+    image = fluid.data(name='image', dtype='float32', shape=[None, 28, 28])
+    label = fluid.data(name='label', dtype='int64', shape=[None, 1])
-Note that the shape here is the shape of a single sample. PaddlePaddle Fluid inserts -1 at dimension 0 to represent batch_size, so in this example image.shape is [-1, 28, 28] and
-label.shape is [-1, 1].
-To keep the framework from inserting -1 at dimension 0, pass append_batch_size=False:
-.. code-block:: python
-    import paddle.fluid as fluid
-    image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28], append_batch_size=False)
-    label = fluid.layers.data(name='label', dtype='int64', shape=[1], append_batch_size=False)
-In this case, image.shape is [28, 28] and label.shape is [1].
-Step3: Feed the data into the network for training/prediction
+Here None marks a dimension whose size is not known in advance. In this example, None stands for the batch size.
+Step 3: Feed the data into the network for training/prediction
 ######################################################################
-Fluid provides two ways to do this, the asynchronous PyReader interface and the synchronous Feed method, described below:
+Fluid provides two ways to do this, the asynchronous DataLoader interface and the synchronous Feed method, described below:
-- Asynchronous PyReader interface
+- Asynchronous DataLoader interface
-Create a PyReader object with :code:`fluid.io.PyReader` , then set its data source through the object's decorate methods.
+Create a DataLoader object with :code:`fluid.io.DataLoader` , then set its data source through the object's set methods.
-With the PyReader interface, data feeding runs asynchronously with model training/prediction, which is efficient; this is the recommended approach.
+With the DataLoader interface, data feeding runs asynchronously with model training/prediction, which is efficient; this is the recommended approach.
 - Synchronous Feed method
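The difference between the synchronous and asynchronous approaches in this hunk can be illustrated without Paddle at all. The sketch below is a conceptual model only, not how DataLoader is implemented: `feed_synchronously`, `feed_asynchronously`, `train_step` and `capacity` are hypothetical names, and a background thread with a bounded queue stands in for the asynchronous data pipeline.

```python
import queue
import threading


def feed_synchronously(batch_reader, train_step):
    """Load a batch, then train on it; loading and training never overlap."""
    results = []
    for batch in batch_reader():
        results.append(train_step(batch))
    return results


def feed_asynchronously(batch_reader, train_step, capacity=2):
    """Load batches in a background thread while the main thread trains."""
    buffer = queue.Queue(maxsize=capacity)  # bounded prefetch buffer
    end = object()                          # sentinel marking the end of the data

    def producer():
        for batch in batch_reader():
            buffer.put(batch)               # blocks while the buffer is full
        buffer.put(end)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        batch = buffer.get()                # blocks until a batch is ready
        if batch is end:
            break
        results.append(train_step(batch))
    return results
```

Both functions produce the same results in the same order. The asynchronous version can hide data loading time behind computation, which is the reason the documentation recommends that style for training.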
@@ -62,9 +49,9 @@ Fluid provides two ways to do this, the asynchronous PyReader interface and the synchronous Feed method
 The two data preparation methods compare as follows:
 ================= =================================== =====================================
-Item              Synchronous Feed method             Asynchronous PyReader interface
+Item              Synchronous Feed method             Asynchronous DataLoader interface
 ================= =================================== =====================================
-API               :code:`executor.run(feed=...)`      :code:`fluid.io.PyReader`
+API               :code:`executor.run(feed=...)`      :code:`fluid.io.DataLoader`
 Data format       Numpy Array or LoDTensor            Numpy Array or LoDTensor
 Data augmentation Done with other libraries in Python Done with other libraries in Python
 Speed             Slow                                Fast
@@ -72,7 +59,7 @@ API :code:`executor.run(feed=...)` :code:`fluid.io.PyReader`
 ================= =================================== =====================================
 How the Reader data type affects usage
-############################################
+########################################
 The concrete operations in the steps above differ depending on the Reader data type, as described below:
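@@ -81,49 +68,32 @@ How the Reader data type affects usage
 If the custom Reader returns a single sample each time, feed the data with the following steps:
-Step1. Assemble the data
-=============================
-Call the Reader interfaces provided by Fluid to assemble batches and do part of the data preprocessing. For details, see:
-.. toctree::
-   :maxdepth: 1
-   reader_cn.md
-Step2. Feed the data
-=================================
-To feed data through the asynchronous PyReader interface, call :code:`decorate_sample_generator` or :code:`decorate_sample_list_generator` . For details, see:
-- :ref:`user_guides_use_py_reader`
-To feed data with the synchronous Feed method, use the DataFeeder interface to convert the Reader data into LoDTensor format before feeding it into the network. For details, see :ref:`cn_api_fluid_DataFeeder`
+Step 1. Assemble the data
+=========================
+Call the Reader interfaces provided by Fluid to assemble batches and do part of the data preprocessing. For details, see `Data preprocessing tools <./reader_cn.html>`_ .
+Step 2. Feed the data
+=====================
+To feed data through the asynchronous DataLoader interface, call :code:`set_sample_generator` or :code:`set_sample_list_generator` . For details, see :ref:`user_guides_use_py_reader` .
+To feed data with the synchronous Feed method, use the DataFeeder interface to convert the Reader data into LoDTensor format before feeding it into the network. For details, see :ref:`cn_api_fluid_DataFeeder` .
 Reading data from a Batch-level Reader
-+++++++++++++++++++++++++++++++++++++++++++
-Step1. Assemble the data
-=========================
-The Batch is already assembled, so the condition of Step1 is met and you can go straight to Step2
-Step2. Feed the data
-=================================
-To feed data through the asynchronous PyReader interface, call PyReader's :code:`decorate_batch_generator` . For details, see:
-.. toctree::
-   :maxdepth: 1
-   use_py_reader.rst
-To feed data with the synchronous Feed method, see:
-.. toctree::
-   :maxdepth: 1
-   feeding_data.rst
+++++++++++++++++++++++++++++++++++++++++
+Step 1. Assemble the data
+=========================
+The Batch is already assembled, so the condition of Step 1 is met and you can go straight to Step 2.
+Step 2. Feed the data
+=====================
+To feed data through the asynchronous DataLoader interface, call DataLoader's :code:`set_batch_generator` . For details, see :ref:`user_guides_use_py_reader` .
+To feed data with the synchronous Feed method, see :ref:`user_guide_use_numpy_array_as_train_data` .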
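For the Sample-level path, "assemble the data" amounts to grouping consecutive samples into batches. The sketch below shows that idea in plain Python. `make_batches` and its `drop_last` flag are hypothetical names chosen for illustration, not the interface of the Fluid Reader tools referenced in this hunk.

```python
def make_batches(sample_reader, batch_size, drop_last=False):
    """Wrap a sample-level reader so that it yields lists of samples."""
    def batched_reader():
        current = []
        for sample in sample_reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        # Emit the trailing partial batch unless the caller asked to drop it.
        if current and not drop_last:
            yield current
    return batched_reader


def toy_samples():
    """Hypothetical sample-level reader producing (feature, label) pairs."""
    for i in range(5):
        yield (i, i * i)


for batch in make_batches(toy_samples, batch_size=2)():
    print(batch)
```

Returning a new reader function, rather than a list, keeps the wrapper lazy, so it can be stacked with other wrappers such as shuffling before the data is handed to the feeding step.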
......
@@ -4,52 +4,92 @@
 Prepare Steps
 #############
-PaddlePaddle Fluid supports two methods to feed data into networks:
-1. Synchronous method - Python Reader:Firstly, use :code:`fluid.layers.data` to set up data input layer. Then, feed in the training data through :code:`executor.run(feed=...)` in :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` .
-2. Asynchronous method - py_reader:Firstly, use :code:`fluid.layers.py_reader` to set up data input layer. Then configure the data source with functions :code:`decorate_paddle_reader` or :code:`decorate_tensor_provider` of :code:`py_reader` . After that, call :code:`fluid.layers.read_file` to read data.
-Comparisons of the two methods:
-========================= ==================================================== ===============================================
-Aspects                   Synchronous Python Reader                            Asynchronous py_reader
-========================= ==================================================== ===============================================
-API interface             :code:`executor.run(feed=...)`                       :code:`fluid.layers.py_reader`
-data type                 Numpy Array                                          Numpy Array or LoDTensor
-data augmentation         carried out by other libraries on Python end         carried out by other libraries on Python end
-velocity                  slow                                                 rapid
-recommended applications  model debugging                                      industrial training
-========================= ==================================================== ===============================================
-Synchronous Python Reader
-##########################
-Fluid provides Python Reader to feed in data.
-Python Reader is a pure Python-side interface, and data feeding is synchronized with the model training/prediction process. Users can pass in data through Numpy Array. For specific operations, please refer to:
-.. toctree::
-   :maxdepth: 1
-   feeding_data_en.rst
-Python Reader supports advanced functions like group batch, shuffle. For specific operations, please refer to:
-.. toctree::
-   :maxdepth: 1
-   reader.md
-Asynchronous py_reader
-########################
-Fluid provides asynchronous data feeding method PyReader. It is more efficient as data feeding is not synchronized with the model training/prediction process. For specific operations, please refer to:
-.. toctree::
-   :maxdepth: 1
-   use_py_reader_en.rst
+Data preparation in PaddlePaddle Fluid can be separated into 3 steps.
+Step 1: Define a reader to generate training/testing data
+##########################################################
+The generated data type can be Numpy Array or LoDTensor. According to the different data formats returned by the reader, it can be divided into Batch Reader and Sample Reader.
+A batch reader yields one mini-batch of data at a time, while a sample reader yields one sample at a time.
+If your reader yields one sample at a time, we provide a data augmentation and batching tool for you: :code:`Python Reader` .
+Step 2: Define data layer variables in network
+###############################################
+Users should use :code:`fluid.data` to define data layer variables. Name, dtype and shape are required when defining them. For example:
+.. code-block:: python
+    import paddle.fluid as fluid
+    image = fluid.data(name='image', dtype='float32', shape=[None, 28, 28])
+    label = fluid.data(name='label', dtype='int64', shape=[None, 1])
+None means that the dimension is uncertain. In this example, None means the batch size.
+Step 3: Send the data to network for training/testing
+######################################################
+PaddlePaddle Fluid provides 2 methods for sending data to the network: Asynchronous DataLoader API, and Synchronous Feed Method.
+- Asynchronous DataLoader API
+Users should use :code:`fluid.io.DataLoader` to define a DataLoader object and use its setter method to set the data source.
+When using DataLoader API, the process of data sending works asynchronously with network training/testing.
+It is an efficient way to send data and is the recommended method.
+- Synchronous Feed Method
+Users should create the feeding data beforehand and use :code:`executor.run(feed=...)` to send the data to :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` .
+Data preparation and network training/testing work synchronously, which is less efficient.
+A comparison of these 2 methods is as follows:
+========================== ================================= ======================================
+Comparison item            Synchronous Feed Method           Asynchronous DataLoader API
+========================== ================================= ======================================
+API                        :code:`executor.run(feed=...)`    :code:`fluid.io.DataLoader`
+Data type                  Numpy Array or LoDTensor          Numpy Array or LoDTensor
+Data augmentation          use Python for data augmentation  use Python for data augmentation
+Speed                      slow                              rapid
+Recommended applications   model debugging                   industrial training
+========================== ================================= ======================================
+Choose different usages for different data formats
+###################################################
+According to the different data formats of reader, users should choose different usages for data preparation.
+Read data from sample reader
++++++++++++++++++++++++++++++
+If the user-defined reader is a sample reader, users should use the following steps:
+Step 1. Batching
+=================
+Use the data reader interfaces in PaddlePaddle Fluid for data augmentation and batching. Please refer to `Python Reader <./reader.html>`_ for details.
+Step 2. Sending data
+=====================
+If using Asynchronous DataLoader API, please use :code:`set_sample_generator` or :code:`set_sample_list_generator` to set the data source for DataLoader. Please refer to :ref:`user_guide_use_py_reader_en` for details.
+If using Synchronous Feed Method, please use DataFeeder to convert the reader data to LoDTensor before sending to the network. Please refer to :ref:`api_fluid_DataFeeder` for details.
+Read data from batch reader
++++++++++++++++++++++++++++++
+Step 1. Batching
+=================
+Since the reader is already a batch reader, this step can be skipped.
+Step 2. Sending data
+=====================
+If using Asynchronous DataLoader API, please use :code:`set_batch_generator` to set the data source for DataLoader. Please refer to :ref:`user_guide_use_py_reader_en` for details.
+If using Synchronous Feed Method, please refer to :ref:`user_guide_use_numpy_array_as_train_data_en` for details.
\ No newline at end of file
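The switch from `shape=[28, 28]` to `shape=[None, 28, 28]` in this commit makes the batch dimension explicit. As a rough illustration of what a `None` dimension means, the sketch below checks a concrete batch against a declared shape. `shape_of` and `matches_declared_shape` are hypothetical helpers written for this note, nested lists stand in for Numpy Arrays, and nothing here reflects how Paddle itself validates shapes.

```python
def shape_of(nested):
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] gives [2, 2]."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return shape


def matches_declared_shape(declared, actual):
    """True if `actual` fits `declared`, where None accepts any size."""
    if len(declared) != len(actual):
        return False
    return all(d is None or d == a for d, a in zip(declared, actual))


image_shape = [None, 28, 28]  # same declaration as in the documentation example
batch_of_4 = [[[0.0] * 28 for _ in range(28)] for _ in range(4)]
print(matches_declared_shape(image_shape, shape_of(batch_of_4)))
```

Under this reading, a batch of 4 or of 64 images both satisfy the declaration, while a single un-batched 28 x 28 image does not, because the number of dimensions differs.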