diff --git a/doc/fluid/api_guides/low_level/compiled_program_en.rst b/doc/fluid/api_guides/low_level/compiled_program_en.rst
index 8ba3583024c14eec8ca08f2857ca2f449dc4e5ed..77ea883d6cc9a3c9bf74202e64412cbb5a928fdf 100755
--- a/doc/fluid/api_guides/low_level/compiled_program_en.rst
+++ b/doc/fluid/api_guides/low_level/compiled_program_en.rst
@@ -44,6 +44,8 @@ The :ref:`api_fluid_CompiledProgram` is used to transform a program for various
                     feed=feed_dict,
                     fetch_list=[loss.name])
 
+**Note**: :code:`fluid.Program` and :code:`compiler.CompiledProgram` are completely different kinds of program. A :code:`fluid.Program` is composed of a series of operators, whereas :code:`compiler.CompiledProgram` compiles a :code:`fluid.Program` and converts it into a computational graph. A :code:`compiler.CompiledProgram` cannot be saved at present.
+
 
 - Related API :
- - :ref:`api_fluid_CompiledProgram`
\ No newline at end of file
+ - :ref:`api_fluid_CompiledProgram`
diff --git a/doc/fluid/api_guides/low_level/parallel_executor_en.rst b/doc/fluid/api_guides/low_level/parallel_executor_en.rst
index f8e2ddfe18efcfc5861d707ba132698e18c20d90..faad6688300667c76a90e9dc0f26012eb328dc9d 100644
--- a/doc/fluid/api_guides/low_level/parallel_executor_en.rst
+++ b/doc/fluid/api_guides/low_level/parallel_executor_en.rst
@@ -4,8 +4,7 @@
 Parallel Executor
 ##############################
 
-
-:code:`ParallelExecutor` is an executor that executes :code:`Program` separately on multiple nodes in a data-parallelism manner. Users can use the Python script to run :code:`ParallelExecutor`. The execution process of :code:`ParallelExecutor` is as follows:
+:code:`ParallelExecutor` is an upgraded version of :code:`Executor` that additionally supports data-parallel training of a :code:`Program`. Users can run :code:`ParallelExecutor` from a Python script. The execution process of :code:`ParallelExecutor` is as follows:
 
 - First it builds :code:`SSA Graph` and a thread pool based on :code:`Program`, the number of :code:`GPU` cards (or :code:`CPU` cores) and :ref:`api_fluid_BuildStrategy` ;
 - During execution, it executes the Op depending on whether the input of Op is ready, so that multiple Ops that do not depend on each other can be executed in parallel in the thread pool;
@@ -69,6 +68,7 @@ Since the execution speed of the model is related to the model structure and the
     train_loss, = train_exe.run(fetch_list=[loss.name], feed=feed_dict)
     test_loss, = test_exe.run(fetch_list=[loss.name], feed=feed_dict)
 
+**Note**: :code:`fluid.Executor` and :code:`fluid.ParallelExecutor` are two completely different executors. First, they execute different objects: :code:`fluid.Executor` executes a :code:`fluid.Program`, whereas :code:`fluid.ParallelExecutor` executes a Graph. Second, they schedule execution differently: :code:`fluid.Executor` runs operators one by one in the order in which they appear in the :code:`fluid.Program`, whereas :code:`fluid.ParallelExecutor` runs them concurrently according to the dependencies between nodes in the Graph.
 
 - Related API :
  - :ref:`api_fluid_ParallelExecutor`
diff --git a/doc/fluid/user_guides/howto/training/single_node.rst b/doc/fluid/user_guides/howto/training/single_node.rst
index 298094f7fc3f99ceff32503732df4ff632da927f..117f6facf524b37d4f20e0f73a4c522b6e201f93 100644
--- a/doc/fluid/user_guides/howto/training/single_node.rst
+++ b/doc/fluid/user_guides/howto/training/single_node.rst
@@ -21,12 +21,8 @@
    label = fluid.layers.data(name="label", shape=[1])
    hidden = fluid.layers.fc(input=image, size=100, act='relu')
    prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
-   loss = fluid.layers.mean(
-       fluid.layers.cross_entropy(
-           input=prediction,
-           label=label
-       )
-   )
+   loss = fluid.layers.cross_entropy(input=prediction, label=label)
+   loss = fluid.layers.mean(loss)
 
    sgd = fluid.optimizer.SGD(learning_rate=0.001)
    sgd.minimize(loss)
@@ -45,17 +41,13 @@
 
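 After the model is configured, the parameter initialization operations are written into
 :code:`fluid.default_startup_program()` . Running this program with :code:`fluid.Executor()`
-randomly initializes the parameters in the global :code:`fluid.global_scope()` . For example:
+initializes the parameters, which are placed in the global scope by default, i.e. :code:`fluid.global_scope()` . For example: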
 
 .. code-block:: python
 
    exe = fluid.Executor(fluid.CUDAPlace(0))
    exe.run(program=fluid.default_startup_program())
 
-Note that in multi-GPU training, the parameters must first be initialized on GPU0 and are then
-distributed to the other graphics cards by :code:`fluid.ParallelExecutor` .
-
-
 Load Predefined Parameters
 ==========================
 
@@ -72,8 +64,38 @@
 
 .. code-block:: python
 
-   ...
-   loss = fluid.layers.mean(...)
+    import paddle.fluid as fluid
+    import numpy
+
+    train_program = fluid.Program()
+    startup_program = fluid.Program()
+    with fluid.program_guard(train_program, startup_program):
+        data = fluid.layers.data(name='X', shape=[1], dtype='float32')
+        hidden = fluid.layers.fc(input=data, size=10)
+        loss = fluid.layers.mean(hidden)
+        sgd = fluid.optimizer.SGD(learning_rate=0.001)
+        sgd.minimize(loss)
+
+    use_cuda = True
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+    exe = fluid.Executor(place)
+
+    # Run the startup program once and only once.
+    # There is no need to optimize or compile the startup program.
+    startup_program.random_seed = 1
+    exe.run(startup_program)
+
+    # Run the main program directly, without compiling it.
+    x = numpy.random.random(size=(10, 1)).astype('float32')
+    loss_data, = exe.run(train_program,
+                         feed={"X": x},
+                         fetch_list=[loss.name])
+
+    # Or run it through CompiledProgram:
+    compiled_prog = fluid.compiler.CompiledProgram(train_program)
+    loss_data, = exe.run(compiled_prog,
+                         feed={"X": x},
+                         fetch_list=[loss.name])
 
 Multi-card Training
 #######################
@@ -81,21 +103,28 @@
 
 .. code-block:: python
 
-    exe = fluid.Executor(...)
-
-    compiled_prog = fluid.compiler.CompiledProgram(
-        fluid.default_main_program()).with_data_parallel(
-            loss_name=loss.name)
-
-    result = exe.run(program=compiled_prog,
-                    fetch_list=[loss.name],
-                    feed={"image": ..., "label": ...})
+    import os
+
+    # NOTE: When running on CPU, set CPU_NUM explicitly. Otherwise fluid
+    # uses the number of logical cores as CPU_NUM, in which case the
+    # batch size of the input must be greater than CPU_NUM, or the run
+    # fails with an exception.
+    if not use_cuda:
+        os.environ['CPU_NUM'] = str(2)
+
+    compiled_prog = fluid.compiler.CompiledProgram(
+        train_program).with_data_parallel(
+            loss_name=loss.name)
+    loss_data, = exe.run(compiled_prog,
+                         feed={"X": x},
+                         fetch_list=[loss.name])
 
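 Notes:
 
-1. The constructor of :ref:`cn_api_fluid_CompiledProgram` takes the :code:`fluid.Program` to be run, which cannot be modified at runtime.
-2. If :code:`exe` is initialized with CUDAPlace, the model runs on GPU. In GPU training mode, all graphics cards are occupied. Users can set `CUDA_VISIBLE_DEVICES <http://www.acceleware.com/blog/cudavisibledevices-masking-gpus>`_ to change which graphics cards are occupied.
-3. If :code:`exe` is initialized with CPUPlace, the model runs on CPU. In this case, multiple threads are used to run the model, and the number of threads equals the number of logical cores. Users can set ``CPU_NUM`` to change the number of threads in use.
+1. :ref:`cn_api_fluid_CompiledProgram` converts the given :code:`fluid.Program` into a computational graph, i.e. a Graph. Because :code:`compiled_prog` is a completely different object from the :code:`train_program` passed in, :code:`compiled_prog` cannot be saved at present.
+2. Multi-card training can also use :ref:`cn_api_fluid_ParallelExecutor` , but :ref:`cn_api_fluid_CompiledProgram` is now recommended.
+3. If :code:`exe` is initialized with CUDAPlace, the model runs on GPU. In GPU training mode, all graphics cards are occupied. Users can set `CUDA_VISIBLE_DEVICES <http://www.acceleware.com/blog/cudavisibledevices-masking-gpus>`_ to change which graphics cards are occupied.
+4. If :code:`exe` is initialized with CPUPlace, the model runs on CPU. In this case, multiple threads are used to run the model, and the number of threads equals the number of logical cores. Users can set ``CPU_NUM`` to change the number of threads in use.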
 
 Advanced Usage
 ###############
diff --git a/doc/fluid/user_guides/howto/training/single_node_en.rst b/doc/fluid/user_guides/howto/training/single_node_en.rst
index c8854cfbc24c1688f717bd74a340e35adb649ef2..04105f37b93c6f8750b0a2edf82d17a70cd52e3f 100644
--- a/doc/fluid/user_guides/howto/training/single_node_en.rst
+++ b/doc/fluid/user_guides/howto/training/single_node_en.rst
@@ -17,12 +17,8 @@ For example:
    label = fluid.layers.data(name="label", shape=[1])
    hidden = fluid.layers.fc(input=image, size=100, act='relu')
    prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
-   loss = fluid.layers.mean(
-       fluid.layers.cross_entropy(
-           input=prediction,
-           label=label
-       )
-   )
+   loss = fluid.layers.cross_entropy(input=prediction, label=label)
+   loss = fluid.layers.mean(loss)
 
    sgd = fluid.optimizer.SGD(learning_rate=0.001)
    sgd.minimize(loss)
@@ -38,16 +34,13 @@ Initialize Parameters
 Random Initialization of Parameters
 ====================================
 
-After the configuration of model,the initialization of parameters will be written into :code:`fluid.default_startup_program()` . By running this program in :code:`fluid.Executor()` , the random initialization of parameters will be finished in global :code:`fluid.global_scope()` .For example:
+After the model is configured, the parameter initialization operations are written into :code:`fluid.default_startup_program()` . Running this program with :code:`fluid.Executor()` randomly initializes the parameters in the global scope, i.e. :code:`fluid.global_scope()` . For example:
 
 .. code-block:: python
 
    exe = fluid.Executor(fluid.CUDAPlace(0))
    exe.run(program=fluid.default_startup_program())
 
-Note that in multi-GPU training, the parameters should be initialized on GPU0 and then will be distributed to multiple graphic cards through :code:`fluid.ParallelExecutor` .
-
-
 Load Predefined Parameters
 ===========================
 
@@ -62,12 +55,37 @@ In the runtime, feed data with :code:`run(feed=...)` and get persistable data wi
 
 .. code-block:: python
 
-   ...
-   loss = fluid.layers.mean(...)
-
-   exe = fluid.Executor(...)
-   # the result is an numpy array
-   result = exe.run(feed={"image": ..., "label": ...}, fetch_list=[loss])
+    import paddle.fluid as fluid
+    import numpy
+
+    train_program = fluid.Program()
+    startup_program = fluid.Program()
+    with fluid.program_guard(train_program, startup_program):
+        data = fluid.layers.data(name='X', shape=[1], dtype='float32')
+        hidden = fluid.layers.fc(input=data, size=10)
+        loss = fluid.layers.mean(hidden)
+        sgd = fluid.optimizer.SGD(learning_rate=0.001)
+        sgd.minimize(loss)
+
+    use_cuda = True
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+    exe = fluid.Executor(place)
+
+    # Run the startup program once and only once.
+    # There is no need to optimize or compile the startup program.
+    startup_program.random_seed = 1
+    exe.run(startup_program)
+
+    # Run the main program directly, without compiling it.
+    x = numpy.random.random(size=(10, 1)).astype('float32')
+    loss_data, = exe.run(train_program,
+                         feed={"X": x},
+                         fetch_list=[loss.name])
+    # Or run it through CompiledProgram:
+    # compiled_prog = fluid.compiler.CompiledProgram(train_program)
+    # loss_data, = exe.run(compiled_prog,
+    #                      feed={"X": x},
+    #                      fetch_list=[loss.name])
 
 Notes:
 
@@ -81,21 +99,28 @@ In multi-card training, you can use :code:`fluid.compiler.CompiledProgram` to co
 
 .. code-block:: python
 
-    exe = fluid.Executor(...)
-    
-    compiled_prog = fluid.compiler.CompiledProgram(
-        fluid.default_main_program()).with_data_parallel(
-            loss_name=loss.name)
-           
-    result = exe.run(program=compiled_prog, 
-                    fetch_list=[loss.name], 
-                    feed={"image": ..., "label": ...}) 
+    import os
+
+    # NOTE: When running on CPU, set CPU_NUM explicitly. Otherwise fluid
+    # uses the number of logical cores as CPU_NUM, in which case the
+    # batch size of the input must be greater than CPU_NUM, or the run
+    # fails with an exception.
+    if not use_cuda:
+        os.environ['CPU_NUM'] = str(2)
+
+    compiled_prog = fluid.compiler.CompiledProgram(
+        train_program).with_data_parallel(
+            loss_name=loss.name)
+    loss_data, = exe.run(compiled_prog,
+                         feed={"X": x},
+                         fetch_list=[loss.name])
 
 Notes:
 
-1. The constructor of :ref:`api_fluid_CompiledProgram` needs to be set with :code:`fluid.Program` to be run which can not be modified at runtime. 
-2. If :code:`exe` is initialized with CUDAPlace, the model will be run in GPU. In the mode of graphics card training, all graphics card will be occupied. Users can configure `CUDA_VISIBLE_DEVICES <http://www.acceleware.com/blog/cudavisibledevices-masking-gpus>`_ to change graphics cards that are being used. 
-3. If :code:`exe` is initialized with CPUPlace, the model will be run in CPU. In this situation, the multi-threads are used to run the model, and the number of threads is equal to the number of logic cores. Users can configure `CPU_NUM`  to change the number of threads that are being used. 
+1. :ref:`api_fluid_CompiledProgram` converts the input Program into a computational graph, so :code:`compiled_prog` is a completely different object from the :code:`train_program` passed in. At present, :code:`compiled_prog` cannot be saved.
+2. Multi-card training can also use :ref:`api_fluid_ParallelExecutor` , but :ref:`api_fluid_CompiledProgram` is now recommended.
+3. If :code:`exe` is initialized with CUDAPlace, the model runs on GPU. In GPU training mode, all graphics cards are occupied. Users can set `CUDA_VISIBLE_DEVICES <http://www.acceleware.com/blog/cudavisibledevices-masking-gpus>`_ to change which graphics cards are used.
+4. If :code:`exe` is initialized with CPUPlace, the model runs on CPU. In this case, multiple threads are used to run the model, and the number of threads equals the number of logical cores. Users can set ``CPU_NUM`` to change the number of threads in use.
 
 Advanced Usage
 ###############