From c3e4fc1b7f0b9a5682dbf74465aac15a2fb529b5 Mon Sep 17 00:00:00 2001
From: Zeng Jinle <32832641+sneaxiy@users.noreply.github.com>
Date: Wed, 9 Oct 2019 18:30:17 +0800
Subject: [PATCH] Refine prepare steps doc (#1482)

* refine prepare steps doc, test=develop

* replace layers.data with fluid.data, test=develop
---
 .../howto/prepare_data/prepare_steps.rst      |  88 +++++----------
 .../howto/prepare_data/prepare_steps_en.rst   | 100 ++++++++++++------
 2 files changed, 99 insertions(+), 89 deletions(-)
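
Note: the English page added below distinguishes sample readers from batch readers. A minimal plain-Python sketch of that distinction follows. It does not call any Paddle API; `sample_reader` and `to_batch_reader` are hypothetical names chosen for illustration, and the batching shown only approximates what the reader tools referenced in the doc do.

```python
def sample_reader():
    """Hypothetical sample-level reader: yields one (image, label) pair per step."""
    for i in range(5):
        image = [float(i)] * 4  # stand-in for a flattened image
        label = i % 2
        yield image, label


def to_batch_reader(reader, batch_size, drop_last=False):
    """Wrap a sample-level reader so that it yields lists of samples (batches)."""
    def batch_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        # Emit the final, smaller batch unless the caller asked to drop it.
        if batch and not drop_last:
            yield batch
    return batch_reader


# Five samples grouped two at a time: two full batches and one partial batch.
batch_sizes = [len(b) for b in to_batch_reader(sample_reader, batch_size=2)()]
```

A reader that already yields batches needs no such wrapper, which is why the batch-reader section of the doc skips the batching step.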

diff --git a/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst b/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst
index 69b39ee8d..3c6242e09 100644
--- a/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst
@@ -6,7 +6,7 @@
 
 使用PaddlePaddle Fluid准备数据分为三个步骤：
 
-Step1: 自定义Reader生成训练/预测数据
+Step 1: 自定义Reader生成训练/预测数据
 ###################################
 
 生成的数据类型可以为Numpy Array或LoDTensor。根据Reader返回的数据形式的不同，可分为Batch级的Reader和Sample（样本）级的Reader。
@@ -16,41 +16,28 @@ Batch级的Reader每次返回一个Batch的数据，Sample级的Reader每次返
 如果您的数据是Sample级的数据，我们提供了一个可以数据预处理和组建batch的工具：:code:`Python Reader` 。
 
 
-Step2: 在网络配置中定义数据层变量
+Step 2: 在网络配置中定义数据层变量
 ###################################
-用户需使用 :code:`fluid.layers.data` 在网络中定义数据层变量。定义数据层变量时需指明数据层的名称name、数据类型dtype和维度shape。例如：
+用户需使用 :code:`fluid.data` 在网络中定义数据层变量。定义数据层变量时需指明数据层的名称name、数据类型dtype和维度shape。例如：
 
 .. code-block:: python
 
     import paddle.fluid as fluid
 
-    image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28])
-    label = fluid.layers.data(name='label', dtype='int64', shape=[1])
+    image = fluid.data(name='image', dtype='float32', shape=[None, 28, 28])
+    label = fluid.data(name='label', dtype='int64', shape=[None, 1])
 
+其中，None表示不确定的维度。此例子中None的含义为batch size。
 
-需要注意的是，此处的shape是单个样本的维度，PaddlePaddle Fluid会在shape第0维位置添加-1，表示batch_size的维度，即此例中image.shape为[-1, 28, 28]，
-label.shape为[-1, 1]。
-
-若用户不希望框架在第0维位置添加-1，则可通过append_batch_size=False参数控制，即：
-
-.. code-block:: python
-
-   import paddle.fluid as fluid
-
-   image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28], append_batch_size=False)
-   label = fluid.layers.data(name='label', dtype='int64', shape=[1], append_batch_size=False)
-
-此时，image.shape为[28, 28]，label.shape为[1]。
-
-Step3: 将数据送入网络进行训练/预测
+Step 3: 将数据送入网络进行训练/预测
 ###################################
 
-Fluid提供两种方式，分别是异步PyReader接口方式或同步Feed方式，具体介绍如下：
+Fluid提供两种方式，分别是异步DataLoader接口方式或同步Feed方式，具体介绍如下：
 
-- 异步PyReader接口方式
+- 异步DataLoader接口方式
 
-用户需要先使用 :code:`fluid.io.PyReader` 定义PyReader对象，然后通过PyReader对象的decorate方法设置数据源。
-使用PyReader接口时，数据传入与模型训练/预测过程是异步进行的，效率较高，推荐使用。
+用户需要先使用 :code:`fluid.io.DataLoader` 定义DataLoader对象，然后通过DataLoader对象的set方法设置数据源。
+使用DataLoader接口时，数据传入与模型训练/预测过程是异步进行的，效率较高，推荐使用。
 
 - 同步Feed方式
 
@@ -62,9 +49,9 @@ Fluid提供两种方式，分别是异步PyReader接口方式或同步Feed方式
 这两种准备数据方法的比较如下:
 
 ========  =================================   =====================================
-对比项            同步Feed方式                          异步PyReader接口方式
+对比项            同步Feed方式                          异步DataLoader接口方式
 ========  =================================   =====================================
-API接口     :code:`executor.run(feed=...)`          :code:`fluid.io.PyReader`
+API接口     :code:`executor.run(feed=...)`          :code:`fluid.io.DataLoader`
 数据格式         Numpy Array或LoDTensor               Numpy Array或LoDTensor
 数据增强          Python端使用其他库完成                  Python端使用其他库完成
 速度                     慢                                   快
@@ -72,7 +59,7 @@ API接口     :code:`executor.run(feed=...)`          :code:`fluid.io.PyReader`
 ========  =================================   =====================================
 
 Reader数据类型对使用方式的影响
-###############################
+###########################
 
 根据Reader数据类型的不同，上述步骤的具体操作将有所不同，具体介绍如下:
 
@@ -81,49 +68,32 @@ Reader数据类型对使用方式的影响
 
 若自定义的Reader每次返回单个样本的数据，用户需通过以下步骤完成数据送入：
 
-Step1. 组建数据
-=============================
+Step 1. 组建数据
+================
 
-调用Fluid提供的Reader相关接口完成组batch和部分的数据预处理功能，具体请参见：
+调用Fluid提供的Reader相关接口完成组batch和部分的数据预处理功能，具体请参见： `数据预处理工具 <./reader_cn.html>`_ 。
 
-.. toctree::
-   :maxdepth: 1
+Step 2. 送入数据
+================
 
-   reader_cn.md
+若使用异步DataLoader接口方式送入数据，请调用 :code:`set_sample_generator` 或 :code:`set_sample_list_generator` 接口完成，具体请参见： :ref:`user_guides_use_py_reader` 。
 
-Step2. 送入数据
-=================================
-
-若使用异步PyReader接口方式送入数据，请调用 :code:`decorate_sample_generator` 或 :code:`decorate_sample_list_generator` 接口完成，具体请参见：
-
-- :ref:`user_guides_use_py_reader`
-
-若使用同步Feed方式送入数据，请使用DataFeeder接口将Reader数据转换为LoDTensor格式后送入网络，具体请参见 :ref:`cn_api_fluid_DataFeeder`
+若使用同步Feed方式送入数据，请使用DataFeeder接口将Reader数据转换为LoDTensor格式后送入网络，具体请参见 :ref:`cn_api_fluid_DataFeeder` 。
 
 读取Batch级Reader数据
-+++++++++++++++++++++++
-
-Step1. 组建数据
-=================
-
-由于Batch已经组好，已经满足了Step1的条件，可以直接进行Step2
-
-Step2. 送入数据
-=================================
-
-若使用异步PyReader接口方式送入数据，请调用PyReader的 :code:`decorate_batch_generator` 接口完成，具体方式请参见:
+++++++++++++++++++++
 
-.. toctree::
-   :maxdepth: 1
+Step 1. 组建数据
+================
 
-   use_py_reader.rst
+由于Batch已经组好，已经满足了Step 1的条件，可以直接进行Step 2。
 
-若使用同步Feed方式送入数据，具体请参见:
+Step 2. 送入数据
+================
 
-.. toctree::
-   :maxdepth: 1
+若使用异步DataLoader接口方式送入数据，请调用DataLoader的 :code:`set_batch_generator` 接口完成，具体方式请参见: :ref:`user_guides_use_py_reader` 。
 
-   feeding_data.rst
+若使用同步Feed方式送入数据，具体请参见: :ref:`user_guide_use_numpy_array_as_train_data` 。
 
 
 
diff --git a/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst b/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst
index b8c0a9948..609a6ac18 100644
--- a/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst
@@ -4,52 +4,92 @@
 Prepare Steps
 #############
 
-PaddlePaddle Fluid supports two methods to feed data into networks:
+Data preparation in PaddlePaddle Fluid can be separated into 3 steps.
 
-1. Synchronous method - Python Reader：Firstly, use :code:`fluid.layers.data` to set up data input layer. Then, feed in the training data through :code:`executor.run(feed=...)` in :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` .
+Step 1: Define a reader to generate training/testing data
+##########################################################
 
-2. Asynchronous method - py_reader：Firstly, use :code:`fluid.layers.py_reader` to set up data input layer. Then configure the data source with functions :code:`decorate_paddle_reader` or :code:`decorate_tensor_provider` of :code:`py_reader` . After that, call :code:`fluid.layers.read_file` to read data.
+The generated data can be a Numpy Array or a LoDTensor. Depending on the form of the data it returns, a reader is either a batch reader or a sample reader.
 
+A batch reader yields one mini-batch of data at a time, while a sample reader yields one sample at a time.
 
+If your reader yields one sample at a time, we provide a data augmentation and batching tool for you: :code:`Python Reader` .
 
-Comparisons of the two methods:
+Step 2: Define data layer variables in network
+###############################################
 
-=========================  ====================================================   ===============================================
-Aspects                                   Synchronous Python Reader                       Asynchronous py_reader
-=========================  ====================================================   ===============================================
-API interface                          :code:`executor.run(feed=...)`                 :code:`fluid.layers.py_reader`
-data type                                   Numpy Array                                Numpy Array or LoDTensor
-data augmentation          carried out by other libraries on Python end            carried out by other libraries on Python end 
-velocity                                        slow                                            rapid
-recommended applications                model debugging                                      industrial training
-=========================  ====================================================   ===============================================
+Use :code:`fluid.data` to define data layer variables in the network. Each definition requires a name, a dtype and a shape. For example:
 
-Synchronous Python Reader
-##########################
+.. code-block:: python
 
-Fluid provides Python Reader to feed in data.
+    import paddle.fluid as fluid
 
-Python Reader is a pure Python-side interface, and data feeding is synchronized with the model training/prediction process. Users can pass in data through Numpy Array. For specific operations, please refer to:
+    image = fluid.data(name='image', dtype='float32', shape=[None, 28, 28])
+    label = fluid.data(name='label', dtype='int64', shape=[None, 1])
 
+:code:`None` marks a dimension whose size is not known in advance. In this example, it stands for the batch size.
 
-.. toctree::
-   :maxdepth: 1
+Step 3: Send the data to network for training/testing
+######################################################
 
-   feeding_data_en.rst
+PaddlePaddle Fluid provides two ways to send data to the network: the Asynchronous DataLoader API and the Synchronous Feed Method.
 
-Python Reader supports advanced functions like group batch, shuffle. For specific operations, please refer to：
+- Asynchronous DataLoader API
 
-.. toctree::
-   :maxdepth: 1
+Users should first use :code:`fluid.io.DataLoader` to define a DataLoader object and then call one of its setter methods to set the data source.
+With the DataLoader API, data is sent asynchronously with network training/testing.
+This is the more efficient way to send data and is recommended.
 
-   reader.md
+- Synchronous Feed Method
 
-Asynchronous py_reader
-########################
+Users should prepare the data to feed beforehand and pass it to :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` through :code:`executor.run(feed=...)` .
+Data preparation and network training/testing run synchronously, which is less efficient.
 
-Fluid provides asynchronous data feeding method PyReader. It is more efficient as data feeding is not synchronized with the model training/prediction process. For specific operations, please refer to：
+A comparison of the two methods follows:
 
-.. toctree::
-   :maxdepth: 1
+==========================  =================================   ======================================
+Comparison item                 Synchronous Feed Method              Asynchronous DataLoader API
+==========================  =================================   ======================================
+API                           :code:`executor.run(feed=...)`          :code:`fluid.io.DataLoader`
+Data type                       Numpy Array or LoDTensor                Numpy Array or LoDTensor
+Data augmentation            use Python for data augmentation       use Python for data augmentation
+Speed                                     slow                                    rapid
+Recommended applications            model debugging                        industrial training
+==========================  =================================   ======================================
 
-   use_py_reader_en.rst
+Choose different usages for different data formats
+###################################################
+
+The steps above differ depending on the form of the data that the reader returns, as described below.
+
+Read data from sample reader
++++++++++++++++++++++++++++++
+
+If the user-defined reader is a sample reader, send the data with the following steps:
+
+Step 1. Batching
+=================
+
+Use the data reader interfaces in PaddlePaddle Fluid for data augmentation and batching. Please refer to `Python Reader <./reader.html>`_ for details.
+
+Step 2. Sending data
+=====================
+
+If you use the Asynchronous DataLoader API, call :code:`set_sample_generator` or :code:`set_sample_list_generator` to set the data source of the DataLoader. Please refer to :ref:`user_guide_use_py_reader_en` for details.
+
+If you use the Synchronous Feed Method, use DataFeeder to convert the reader data to LoDTensor before sending it to the network. Please refer to :ref:`api_fluid_DataFeeder` for details.
+
+Read data from batch reader
++++++++++++++++++++++++++++++
+
+Step 1. Batching
+=================
+
+Since the reader is already a batch reader, this step can be skipped.
+
+Step 2. Sending data
+=====================
+
+If you use the Asynchronous DataLoader API, call :code:`set_batch_generator` to set the data source of the DataLoader. Please refer to :ref:`user_guide_use_py_reader_en` for details.
+
+If you use the Synchronous Feed Method, please refer to :ref:`user_guide_use_numpy_array_as_train_data_en` for details.
\ No newline at end of file
-- 
GitLab