Merge branch 'develop' into checker

48558218 · Luo Tao · 1a066975 · 80b45ad1 · 48558218 · 48558218
29 changed files
--- a/doc/api/data_provider/dataprovider_cn.rst
+++ b/doc/api/data_provider/dataprovider_cn.rst
+.. _api_dataprovider:
+
 DataProvider的介绍
 ==================

-DataProvider是PaddlePaddle负责提供数据的模块。其作用是将数据传入内存或显存，让神经网络可以进行训练或预测。用户可以通过简单使用Python接口 `PyDataProvider2 <pydataprovider2.html>`_ ，来自定义传数据的过程。如果有更复杂的使用，或者需要更高的效率，用户也可以在C++端自定义一个 ``DataProvider`` 。
+DataProvider是PaddlePaddle负责提供数据的模块。其作用是将数据传入内存或显存，让神经网络可以进行训练或预测。用户可以通过简单使用Python接口 :ref:`api_pydataprovider2` ，来自定义传数据的过程。如果有更复杂的使用，或者需要更高的效率，用户也可以在C++端自定义一个 ``DataProvider`` 。

 PaddlePaddle需要用户在网络配置（trainer_config.py）中定义使用哪种DataProvider，并且在DataProvider中实现如何访问训练文件列表（train.list）或测试文件列表（test.list）。


--- a/doc/api/data_provider/pydataprovider2_cn.rst
+++ b/doc/api/data_provider/pydataprovider2_cn.rst
+..  _api_pydataprovider2:
+
 PyDataProvider2的使用
 =====================


--- a/doc/api/predict/swig_py_paddle_cn.rst
+++ b/doc/api/predict/swig_py_paddle_cn.rst
+.. _api_swig_py_paddle:
+
 基于Python的预测
 ================


--- a/doc/api/trainer_config_helpers/evaluators.rst
+++ b/doc/api/trainer_config_helpers/evaluators.rst
+..  _api_trainer_config_helpers_evaluators:
+
 ==========
 Evaluators
 ==========

--- a/doc/api/trainer_config_helpers/layers.rst
+++ b/doc/api/trainer_config_helpers/layers.rst
@@ -187,6 +187,8 @@ get_output_layer
 Mixed Layer
 ===========

+..  _api_trainer_config_helpers_layers_mixed_layer:
+
 mixed_layer
 -----------
 ..  automodule:: paddle.trainer_config_helpers.layers
@@ -255,12 +257,16 @@ pooling_layer
    :members: pooling_layer
    :noindex:

+..  _api_trainer_config_helpers_layers_last_seq:
+
 last_seq
 --------
 ..  automodule:: paddle.trainer_config_helpers.layers
    :members: last_seq
    :noindex:

+..  _api_trainer_config_helpers_layers_first_seq:
+
 first_seq
 ---------
 ..  automodule:: paddle.trainer_config_helpers.layers
@@ -282,6 +288,8 @@ block_expand_layer
    :members: block_expand_layer
    :noindex:

+..  _api_trainer_config_helpers_layers_expand_layer:
+
 expand_layer
 ------------
 ..  automodule:: paddle.trainer_config_helpers.layers
@@ -374,6 +382,8 @@ sampling_id_layer
    :members: sampling_id_layer
    :noindex:

+..  _api_trainer_config_helpers_layers_cost_layers:
+
 Cost Layers
 ===========


--- a/doc/api/trainer_config_helpers/networks.rst
+++ b/doc/api/trainer_config_helpers/networks.rst
@@ -36,6 +36,8 @@ img_conv_group
    :members: img_conv_group
    :noindex:

+..  _api_trainer_config_helpers_network_simple_img_conv_pool:
+
 simple_img_conv_pool
 --------------------
 ..  automodule:: paddle.trainer_config_helpers.networks

--- a/doc/api/trainer_config_helpers/optimizers.rst
+++ b/doc/api/trainer_config_helpers/optimizers.rst
+..  _api_trainer_config_helpers_optimizers:
+
 ==========
 Optimizers
 ==========
@@ -50,6 +52,8 @@ RMSPropOptimizer
    :members: RMSPropOptimizer
    :noindex:

+..  _api_trainer_config_helpers_optimizers_settings:
+
 settings
 ========
 ..  automodule:: paddle.trainer_config_helpers.optimizers

--- a/doc/faq/index_cn.rst
+++ b/doc/faq/index_cn.rst
@@ -35,7 +35,7 @@ PyDataProvider使用的是异步加载，同时在内存里直接随即选取数

 ..  literalinclude:: src/reduce_min_pool_size.py

-这样做可以极大的减少内存占用，并且可能会加速训练过程，详细文档参考 `这里 <../ui/data_provider/pydataprovider2.html#provider>`_ 。
+这样做可以极大的减少内存占用，并且可能会加速训练过程，详细文档参考 :ref:`api_pydataprovider2` 。

 神经元激活内存
 ++++++++++++++
@@ -95,7 +95,6 @@ PaddlePaddle支持Sparse的训练，sparse训练需要训练特征是 :code:`spa

 ..  literalinclude:: src/word2vec_config.py

-更多关于sparse训练的内容请参考 `sparse训练的文档 <TBD>`_

 利用更多的计算资源
 ++++++++++++++++++
@@ -103,14 +102,17 @@ PaddlePaddle支持Sparse的训练，sparse训练需要训练特征是 :code:`spa
 利用更多的计算资源可以分为一下几个方式来进行\:

 * 单机CPU训练
+
  * 使用多线程训练。设置命令行参数 :code:`trainer_count`。

 * 单机GPU训练
+
  * 使用显卡训练。设置命令行参数 :code:`use_gpu`。
  * 使用多块显卡训练。设置命令行参数 :code:`use_gpu` 和 :code:`trainer_count` 。

 * 多机训练
-  * 具体的多机训练方法参考  `多机训练文档 <../ui/data_provider/pydataprovider2.html#provider>`_ 。
+
+  * 请参考 :ref:`cluster_train` 。


 3. 遇到“非法指令”或者是“illegal instruction”
@@ -302,4 +304,4 @@ PaddlePaddle的参数使用名字 :code:`name` 作为参数的ID，相同名字
    git submodule init
    git submodule update

-来获得所有第三方模块。
\ No newline at end of file
+来获得所有第三方模块。
--- a/doc/getstarted/build_and_install/index_cn.rst
+++ b/doc/getstarted/build_and_install/index_cn.rst
-编译与安装
+安装与编译
 ==========

-安装
-++++
+.. _install_steps:
+
+安装流程
++++++++

 PaddlePaddle提供数个预编译的二进制来进行安装，包括Docker镜像，ubuntu的deb安装包等。我们推荐使用Docker镜像来部署环境，同时欢迎贡献更多的安装包。

@@ -14,12 +16,12 @@ PaddlePaddle提供数个预编译的二进制来进行安装，包括Docker镜



-编译
-++++
+编译流程
++++++++

 ..  warning::

-    编译选项主要推荐高级用户查看，普通用户请走安装流程。
+    编译流程主要推荐高级用户查看，普通用户请走安装流程。

 ..  toctree::
    :maxdepth: 1

--- a/doc/howto/deep_model/rnn/hierarchical_layer_cn.rst
+++ b/doc/howto/deep_model/rnn/hierarchical_layer_cn.rst
@@ -22,7 +22,7 @@
 pooling_layer
 ==============

-pooling_layer 的使用示例如下，详细见 `pooling_layer`_ 配置API。
+pooling_layer 的使用示例如下，详细见 :ref:`api_trainer_config_helpers_layers_pooling_layer` 配置API。

 ..	code-block:: bash

@@ -47,7 +47,7 @@ pooling_layer 的使用示例如下，详细见 `pooling_layer`_ 配置API。
 last_seq 和 first_seq
 =====================

-last_seq 的使用示例如下（ `first_seq`_ 类似），详细见 `last_seq`_ 配置API。
+last_seq 的使用示例如下（ :ref:`api_trainer_config_helpers_layers_first_seq` 类似），详细见 :ref:`api_trainer_config_helpers_layers_last_seq` 配置API。

 ..	code-block:: bash

@@ -68,7 +68,7 @@ last_seq 的使用示例如下（ `first_seq`_ 类似），详细见 `last_seq`_
 expand_layer
 ============

-expand_layer 的使用示例如下，详细见 `expand_layer`_ 配置API。
+expand_layer 的使用示例如下，详细见 :ref:`api_trainer_config_helpers_layers_expand_layer` 配置API。

 ..	code-block:: bash

@@ -87,9 +87,3 @@ expand_layer 的使用示例如下，详细见 `expand_layer`_ 配置API。
  - 作用：一个单层序列经过运算扩展成一个双层序列
  - 输入：layer1必须是一个单层序列，是待扩展的数据；layer2 必须是一个双层序列，提供扩展的长度信息
  - 输出：一个双层序列，序列中含有元素的数目同 layer2 一致。要求单层序列含有元素的数目（0层序列）和双层序列含有subseq 的数目一致。单层序列第i个元素（0层序列），被扩展为一个单层序列，构成了输出双层序列的第i个 subseq 。
-
-
-.. _pooling_layer: ../../../doc/ui/api/trainer_config_helpers/layers.html#pooling-layer
-.. _last_seq: ../../../doc/ui/api/trainer_config_helpers/layers.html#last-seq
-.. _first_seq: ../../../doc/ui/api/trainer_config_helpers/layers.html#first-seq
-.. _expand_layer: ../../../doc/ui/api/trainer_config_helpers/layers.html#expand-layer
--- a/doc/howto/deep_model/rnn/recurrent_group_cn.md
+++ b/doc/howto/deep_model/rnn/recurrent_group_cn.md
@@ -12,7 +12,7 @@

 更进一步，`recurrent_group`同样可以扩展到双层序列的处理上。通过两个嵌套的`recurrent_group`分别定义子句级别和词语级别上需要完成的运算，最终实现一个层次化的复杂RNN。

-目前，在PaddlePaddle中，能够对双向序列进行处理的有`recurrent_group`和部分Layer，具体可参考文档：<a href = "hierarchical-layer.html">支持双层序列作为输入的Layer</a>。
+目前，在PaddlePaddle中，能够对双层序列进行处理的有`recurrent_group`和部分Layer，具体可参考文档：<a href = "hierarchical_layer_cn.html">支持双层序列作为输入的Layer</a>。
 
 ## 相关概念


--- a/doc/howto/index_cn.rst
+++ b/doc/howto/index_cn.rst
@@ -8,6 +8,7 @@
  :maxdepth: 1

  usage/concepts/use_concepts_cn.rst
+  usage/cluster/cluster_train_cn.md
  usage/cluster/k8s/k8s_cn.md
  usage/cluster/k8s/k8s_distributed_cn.md


--- a/doc/howto/usage/cluster/cluster_train_cn.md
+++ b/doc/howto/usage/cluster/cluster_train_cn.md
+```eval_rst
+.. _cluster_train:
+```
+
+# 运行分布式训练
+
+在本文中，我们将阐释如何在集群上运行分布式 Paddle 训练作业。我们将以单进程训练示例[推荐系统](https://github.com/baidu/Paddle/tree/develop/demo/recommendation)为例，创建它的分布式版本。
+
+在本文中使用的[脚本](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train)通过 SSH 运行分布式作业。 它们还可以供那些运行更复杂的集群管理系统（如 MPI 和 [Kubernetes](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/k8s) ）的用户参考。
+
+## 前提条件
+
+1. 上述脚本使用 Python 库 [fabric](http://www.fabfile.org/) 来运行 SSH 命令。 我们使用 `pip` 来安装 fabric:
+
+   ```bash
+   pip install fabric
+   ```
+
+2. 我们需要在集群的所有节点上安装 PaddlePaddle。 如果要启用GPU，需要在 `/usr/local/cuda` 中安装 CUDA; 否则 Paddle 将在运行时报错。
+
+3. 在 `cluster_train/conf.py` 中设置 `ROOT_DIR`，该 ROOT_DIR 要在所有节点上存在。为了方便起见，我们通常在所有节点上创建一个 Unix 用户 `paddle`，并设置 `ROOT_DIR=/home/paddle`。这样，我们可以将 SSH 公钥写入 `/home/paddle/.ssh/authorized_keys`，以便用户 `paddle` 可以 SSH 到所有节点而不用密码。
+
+## 准备工作空间
+
+我们将放置依赖库、配置等文件的目录视为 *工作空间（workspace）*。
+
+这些 `train/test` 数据应该在启动集群作业之前准备好。 为了满足训练/测试数据放置在工作空间中不同目录的要求，PADDLE 根据在模型配置文件中使用的名为 `train.list/test.list` 的索引文件引用训练/测试数据，所以训练/测试数据也包含 train.list/test.list 两个列表文件。所有本地训练 demo 已经提供了脚本来帮助您创建这两个文件，并且集群作业中的所有节点将在正常情况下处理具有相同逻辑代码的文件。
+
+通常，你可以使用本地训练中的相同模型文件进行集群训练。请记住，在模型文件的 `settings` 函数中设置的 `batch_size` 表示在集群作业**每个**节点中的 batch 大小，而不是使用同步 SGD 时的总 batch 大小。
+
+以下步骤基于 demo 目录中的 [demo/recommendation](https://github.com/PaddlePaddle/Paddle/tree/develop/demo/recommendation)。
+
+你只需完成 demo/recommendation 教程文档到 `Train` 的部分，之后你会得到训练/测试数据和模型配置文件。最后，只需使用 demo/recommendation 作为集群训练的工作空间。
+
+最后，你的工作空间应如下所示：
+```
+.
+|-- common_utils.py
+|-- data
+|   |-- config.json
+|   |-- config_generator.py
+|   |-- meta.bin
+|   |-- meta_config.json
+|   |-- meta_generator.py
+|   |-- ml-1m
+|   |-- ml_data.sh
+|   |-- ratings.dat.test
+|   |-- ratings.dat.train
+|   |-- split.py
+|   |-- test.list
+|   `-- train.list
+|-- dataprovider.py
+|-- evaluate.sh
+|-- prediction.py
+|-- preprocess.sh
+|-- requirements.txt
+|-- run.sh
+`-- trainer_config.py
+```
+虽然集群训练并不需要所有这些文件，但也没有必要删除无用的文件。
+
+`trainer_config.py`
+表示模型配置文件。
+
+`train.list` 和 `test.list`
+文件索引。它存储当前节点所有训练/测试数据的所有相对或绝对文件路径。
+
+`dataprovider.py`
+用于读取训练/测试样本。这与本地训练相同。
+
+`data`
+数据目录中的所有文件被 train.list/test.list 引用。
+
+
+## 准备集群作业配置
+
+以下选项必须在 cluster_train/conf.py 中认真设置
+
+`HOSTS`  所有节点运行集群作业的主机名或 IP 。你还可以将用户和 ssh 端口附加到主机名上，例如 root@192.168.100.17:9090。
+
+`ROOT_DIR` 用于放置 JOB 工作空间目录的工作空间 ROOT 目录
+
+`PADDLE_NIC` 集群通信通道的 NIC(Network Interface Card, 网络接口卡) 接口名称，例如以太网的 eth0，infiniband 的 ib0。
+
+`PADDLE_PORT` 集群通信通道的端口号
+
+`PADDLE_PORTS_NUM` 用于集群通信通道的端口数。 如果集群节点数量少（少于5〜6个节点），建议将其设置为较大，如2〜8，以获得更好的网络性能。
+
+`PADDLE_PORTS_NUM_FOR_SPARSE` 用于 sparse remote updater 集群通信信道的端口数。如果使用 sparse remote update，则可以像 `PADDLE_PORTS_NUM` 一样设置。
+
+`LD_LIBRARY_PATH` 为集群作业设置额外的 LD_LIBRARY_PATH。你可以使用它来设置 CUDA 库的路径。
+
+默认配置如下：
+
+```python
+HOSTS = [
+        "root@192.168.100.17",
+        "root@192.168.100.18",
+        ]
+
+'''
+工作空间配置
+'''
+
+#工作空间根目录
+ROOT_DIR = "/home/paddle"
+
+'''
+网络配置
+'''
+#pserver NIC
+PADDLE_NIC = "eth0"
+#pserver 端口
+PADDLE_PORT = 7164
+#pserver 端口数
+PADDLE_PORTS_NUM = 2
+#pserver sparse ports num
+PADDLE_PORTS_NUM_FOR_SPARSE = 2
+
+#集群作业中所有进程的环境设置
+LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
+```
+
+### 启动集群作业
+`paddle.py` 提供了自动化脚本来启动不同节点中的所有 PaddlePaddle 集群进程。默认情况下，所有命令行选项可以设置为 `paddle.py` 命令选项并且 `paddle.py` 将透明、自动地将这些选项应用到 PaddlePaddle 底层进程。
+
+`paddle.py` 为方便作业启动提供了两个独特的命令选项。
+
+`job_dispatch_package`  设为本地 `workspace` 目录，它将被分发到 conf.py 中设置的所有节点。  它有助于帮助频繁修改和访问工作区文件的用户减少负担，否则频繁的多节点工作空间部署可能会很麻烦。
+`job_workspace`  设为已部署的工作空间目录，`paddle.py` 将跳过分发阶段直接启动所有节点的集群作业。它可以帮助减少分发延迟。
+
+`cluster_train/run.sh` 提供了命令样例来运行 `demo/recommendation` 集群工作，只需用你定义的目录修改 `job_dispatch_package` 和 `job_workspace`，然后：
+```
+sh run.sh
+```
+
+集群作业将会在几秒后启动。
+
+### 终止集群作业
+`paddle.py`能获取`Ctrl + C` SIGINT 信号来自动终止它启动的所有进程。只需中断 `paddle.py` 任务来终止集群作业。如果程序崩溃你也可以手动终止。
+
+### 检查集群训练结果
+详细信息请检查 $workspace/log 里的日志，每一个节点都有相同的日志结构。
+
+`paddle_trainer.INFO`
+提供几乎所有训练的内部输出日志，与本地训练相同。可以在这里检查运行时模型的收敛情况。
+
+`paddle_pserver2.INFO`
+提供 pserver 运行日志，有助于诊断分布式错误。
+
+`server.log`
+提供 pserver 进程的 stderr 和 stdout。训练失败时可以检查错误日志。
+
+`train.log`
+提供 trainer 进程的 stderr 和 stdout。训练失败时可以检查错误日志。
+
+### 检查模型输出
+一个 pass 完成后，模型文件将被写入节点 0 的 `output` 目录中。
+工作空间中的 `nodefile` 表示当前集群作业的节点 ID。
--- a/doc/howto/usage/cluster/cluster_train_en.md
+++ b/doc/howto/usage/cluster/cluster_train_en.md
@@ -2,7 +2,7 @@

 In this article, we explain how to run distributed Paddle training jobs on clusters.  We will create the distributed version of the single-process training example, [recommendation](https://github.com/baidu/Paddle/tree/develop/demo/recommendation).

-[Scripts](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train) used in this article launch distributed jobs via SSH.  They also work as a reference for users running more sophisticated cluster management systems like MPI and Kubernetes.
+[Scripts](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train) used in this article launch distributed jobs via SSH.  They also work as a reference for users running more sophisticated cluster management systems like MPI and [Kubernetes](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/k8s).

 ## Prerequisite

@@ -20,13 +20,13 @@ In this article, we explain how to run distributed Paddle training jobs on clust

 We refer to the directory where we put dependent libraries, config files, etc., as *workspace*.

-These ```train/test``` data should be prepared before launching cluster job. To  satisfy the requirement that train/test data are placed in different directory from workspace, PADDLE refers train/test data according to index file named as ```train.list/test.list``` which are used in model config file. So the train/test data also contains train.list/test.list two list file. All local training demo already provides scripts to help you create these two files,  and all nodes in cluster job will handle files with same logical code in normal condition.
+The `train/test` data should be prepared before launching the cluster job. To support train/test data placed in a directory different from the workspace, PADDLE refers to the train/test data through index files named `train.list/test.list`, which are used in the model config file. So the train/test data also includes the two list files train.list and test.list. All local training demos already provide scripts to help you create these two files, and under normal conditions all nodes in a cluster job handle the files with the same logic.

-Generally, you can use same model file from local training for cluster training. What you should have in mind that, the ```batch_size``` set in ```setting``` function in model file means batch size in ```each``` node of cluster job instead of total batch size if synchronization SGD was used.
+Generally, you can use the same model file from local training for cluster training. Keep in mind that the `batch_size` set in the `settings` function of the model file is the batch size on *each* node of the cluster job, not the total batch size, when synchronous SGD is used.

-Following steps are based on demo/recommendation demo in demo directory.
+The following steps are based on the [demo/recommendation](https://github.com/PaddlePaddle/Paddle/tree/develop/demo/recommendation) demo in the demo directory.

-You just go through demo/recommendation tutorial doc until ```Train``` section, and at last you will get train/test data and model configuration file. Finaly, just use demo/recommendation as workspace for cluster training.
+Go through the demo/recommendation tutorial doc up to the `Train` section; you will then have the train/test data and the model configuration file. Finally, just use demo/recommendation as the workspace for cluster training.

 At last your workspace should look like as follow:
 ```
@@ -55,16 +55,16 @@ At last your workspace should look like as follow:
 ```
 Not all of these files are needed for cluster training, but it's not necessary to remove useless files.

-```trainer_config.py```
+`trainer_config.py`
 Indicates the model config file.

-```train.list``` and ```test.list```
+`train.list` and `test.list`
 File index. It stores all relative or absolute file paths of all train/test data at current node.

-```dataprovider.py```
+`dataprovider.py`
 used to read train/test samples. It's same as local training.

-```data```
+`data`
 all files in data directory are refered by train.list/test.list which are refered by data provider.


@@ -72,19 +72,19 @@ all files in data directory are refered by train.list/test.list which are refere

 The options below must be carefully set in cluster_train/conf.py

-```HOSTS```  all nodes hostname or ip that will run cluster job. You can also append user and ssh port with hostname, such as root@192.168.100.17:9090.
+`HOSTS`  hostnames or IPs of all nodes that will run the cluster job. You can also add a user and an SSH port to a hostname, such as root@192.168.100.17:9090.

-```ROOT_DIR``` workspace ROOT directory for placing JOB workspace directory
+`ROOT_DIR` workspace ROOT directory for placing JOB workspace directory

-```PADDLE_NIC``` the NIC(Network Interface Card) interface name for cluster communication channel, such as eth0 for ethternet, ib0 for infiniband.
+`PADDLE_NIC` the NIC (Network Interface Card) interface name for the cluster communication channel, such as eth0 for Ethernet, ib0 for InfiniBand.

-```PADDLE_PORT``` port number for cluster commnunication channel
+`PADDLE_PORT` port number for the cluster communication channel

-```PADDLE_PORTS_NUM``` the number of port used for cluster communication channle. if the number of cluster nodes is small(less than 5~6nodes), recommend you set it to larger, such as 2 ~ 8, for better network performance.
+`PADDLE_PORTS_NUM` the number of ports used for the cluster communication channel. If the number of cluster nodes is small (fewer than 5~6 nodes), we recommend setting it to a larger value, such as 2 ~ 8, for better network performance.

-```PADDLE_PORTS_NUM_FOR_SPARSE``` the number of port used for sparse updater cluster commnunication channel. if sparse remote update is used, set it like ```PADDLE_PORTS_NUM```
+`PADDLE_PORTS_NUM_FOR_SPARSE` the number of ports used for the sparse updater cluster communication channel. If sparse remote update is used, set it like `PADDLE_PORTS_NUM`.

-```LD_LIBRARY_PATH``` set addtional LD_LIBRARY_PATH for cluster job. You can use it to set CUDA libraries path.
+`LD_LIBRARY_PATH` set additional LD_LIBRARY_PATH for the cluster job. You can use it to set the CUDA libraries path.

 Default Configuration as follow:

@@ -118,15 +118,15 @@ LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
 ```

 ### Launching Cluster Job
-```paddle.py``` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can set as ```paddle.py``` command options and ```paddle.py``` will transparently and automatically set these options to PaddlePaddle lower level processes.
+`paddle.py` provides automated scripts to start all PaddlePaddle cluster processes on the different nodes. By default, all command line options can be set as `paddle.py` command options, and `paddle.py` will transparently and automatically pass these options to the lower-level PaddlePaddle processes.

-```paddle.py```provides two distinguished command option for easy job launching.
+`paddle.py` provides two distinct command options for easy job launching.

-```job_dispatch_package```  set it with local ```workspace```directory, it will be dispatched to all nodes set in conf.py. It could be helpful for frequent hacking workspace files, otherwise frequent mulit-nodes workspace deployment could make your crazy.
-```job_workspace```  set it with already deployed workspace directory, ```paddle.py``` will skip dispatch stage to directly launch cluster job with all nodes. It could help to reduce heavy
+`job_dispatch_package`  set it to the local `workspace` directory; it will be dispatched to all nodes set in conf.py. This helps when you frequently modify workspace files; otherwise frequent multi-node workspace deployment becomes tedious.
+`job_workspace`  set it to an already deployed workspace directory; `paddle.py` will skip the dispatch stage and directly launch the cluster job on all nodes. It could help to reduce heavy
 dispatch latency.

-```cluster_train/run.sh``` provides command line sample to run ```demo/recommendation``` cluster job, just modify ```job_dispatch_package``` and ```job_workspace``` with your defined directory, then:
+`cluster_train/run.sh` provides a sample command line to run the `demo/recommendation` cluster job. Just modify `job_dispatch_package` and `job_workspace` with your own directories, then:
 ```
 sh run.sh
 ```
@@ -134,23 +134,23 @@ sh run.sh
 The cluster Job will start in several seconds.

 ### Kill Cluster Job
-```paddle.py``` can capture ```Ctrl + C``` SIGINT signal to automatically kill all processes launched by it. So just stop ```paddle.py``` to kill cluster job. You should mannally kill job if program crashed.
+`paddle.py` can capture the `Ctrl + C` SIGINT signal to automatically kill all processes launched by it. So just stop `paddle.py` to kill the cluster job. You should manually kill the job if the program crashes.

 ### Check Cluster Training Result
 Check log in $workspace/log for details, each node owns same log structure.

-```paddle_trainer.INFO```
+`paddle_trainer.INFO`
 It provides almost all interal output log for training,  same as local training. Check runtime model convergence here.

-```paddle_pserver2.INFO```
+`paddle_pserver2.INFO`
 It provides pserver running log, which could help to diagnose distributed error.

-```server.log```
+`server.log`
 It provides stderr and stdout of pserver process. Check error log if training crashs.

-```train.log```
+`train.log`
 It provides stderr and stdout of trainer process. Check error log if training crashs.

 ### Check Model Output
-After one pass finished, model files will be writed in ```output``` directory in node 0.
-```nodefile``` in workspace indicates the node id of current cluster job.
+After one pass finishes, model files will be written to the `output` directory on node 0.
+`nodefile` in the workspace indicates the node id of the current cluster job.
--- a/doc/howto/usage/cluster/k8s/k8s_distributed_cn.md
+++ b/doc/howto/usage/cluster/k8s/k8s_distributed_cn.md
@@ -82,7 +82,7 @@ COPY start_paddle.py /root/
 CMD ["bash"," -c","/root/start.sh"]
 ```

-[`start.sh`](start.sh)文件拷贝训练文件到容器内，然后执行[`start_paddle.py`](start_paddle.py)脚本启动训练，前文提到的获取其他节点IP地址，分配`trainer_id`等都在`start_paddle.py`脚本中完成。
+[start.sh](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/cluster/k8s/start.sh)文件拷贝训练文件到容器内，然后执行[start_paddle.py](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/cluster/k8s/start_paddle.py)脚本启动训练，前文提到的获取其他节点IP地址，分配`trainer_id`等都在`start_paddle.py`脚本中完成。

 `start_paddle.py`脚本开始时，会先进行参数的初始化与解析。


--- a/doc/howto/usage/concepts/use_concepts_cn.rst
+++ b/doc/howto/usage/concepts/use_concepts_cn.rst
@@ -37,7 +37,7 @@ PaddlePaddle是一个深度学习框架，支持单机模式和多机模式。

 DataProvider是PaddlePaddle系统的数据提供器，将用户的原始数据转换成系统可以识别的数据类型。每当系统需要新的数据训练时, trainer进程会调用DataProvider函数返回数据。当所有数据读取完一轮后，DataProvider返回空数据，通知系统一轮数据读取结束，并且系统每一轮训练开始时会重置DataProvider。需要注意的是，DataProvider是被系统调用，而不是新数据驱动系统，一些随机化噪声添加都应该在DataProvider中完成。

-在不同的应用里，训练数据的格式往往各不相同。因此，为了用户能够灵活的处理数据，我们提供了Python处理数据的接口，称为 `PyDataProvider`_ 。在 ``PyDataProvider`` 中，系统C++模块接管了shuffle、处理batch、GPU和CPU通信、双缓冲、异步读取等问题，一些情况下(如：``min_pool_size=0``)需要Python接口里处理shuffle，可以参考 `PyDataProvider`_ 的相关文档继续深入了解。
+在不同的应用里，训练数据的格式往往各不相同。因此，为了用户能够灵活的处理数据，我们提供了Python处理数据的接口，称为 ``PyDataProvider`` 。在 ``PyDataProvider`` 中，系统C++模块接管了shuffle、处理batch、GPU和CPU通信、双缓冲、异步读取等问题，一些情况下(如：``min_pool_size=0``)需要Python接口里处理shuffle，可以参考 :ref:`api_pydataprovider2` 继续深入了解。


 训练配置文件
@@ -50,21 +50,21 @@ DataProvider是PaddlePaddle系统的数据提供器，将用户的原始数据
 ..  literalinclude:: src/trainer_config.py
    :linenos:

-文件开头 ``from paddle.trainer_config_helpers import *`` ，是因为PaddlePaddle配置文件与C++模块通信的最基础协议是protobuf，为了避免用户直接写复杂的protobuf string，我们为用户定以Python接口来配置网络，该Python代码可以生成protobuf包，这就是`trainer_config_helpers`_的作用。因此，在文件的开始，需要import这些函数。 这个包里面包含了模型配置需要的各个模块。
+文件开头 ``from paddle.trainer_config_helpers import *`` ，是因为PaddlePaddle配置文件与C++模块通信的最基础协议是protobuf，为了避免用户直接写复杂的protobuf string，我们为用户定义Python接口来配置网络，该Python代码可以生成protobuf包，这就是 :ref:`api_trainer_config` 的作用。因此，在文件的开始，需要import这些函数。 这个包里面包含了模型配置需要的各个模块。

 下面分别介绍数据源配置、优化算法配置、网络结构配置这三部分该概念。

 数据源配置
 ----------

-使用 `PyDataProvider`_ 的函数 ``define_py_data_sources2`` 配置数据源。``define_py_data_sources2`` 里通过train_list和test_list指定是训练文件列表和测试文件列表。 如果传入字符串的话，是指一个数据列表文件。这个数据列表文件中包含的是每一个训练或者测试文件的路径。如果传入一个list的话，则会默认生成一个list文件，再传入给train.list或者test.list。
+使用 ``PyDataProvider2`` 的函数 ``define_py_data_sources2`` 配置数据源。``define_py_data_sources2`` 里通过train_list和test_list指定是训练文件列表和测试文件列表。 如果传入字符串的话，是指一个数据列表文件。这个数据列表文件中包含的是每一个训练或者测试文件的路径。如果传入一个list的话，则会默认生成一个list文件，再传入给train.list或者test.list。

-``module`` 和 ``obj`` 指定了DataProvider的文件名和返回数据的函数名。更详细的使用，请参考 `PyDataProvider`_ 。
+``module`` 和 ``obj`` 指定了DataProvider的文件名和返回数据的函数名。更详细的使用，请参考 :ref:`api_pydataprovider2` 。

 优化算法配置
 ------------

-通过 `settings`_ 接口设置神经网络所使用的训练参数和 `优化算法`_ ，包括学习率、batch_size、优化算法、正则方法等，具体的使用方法请参考 `settings`_ 文档。
+通过 :ref:`api_trainer_config_helpers_optimizers_settings` 接口设置神经网络所使用的训练参数和 :ref:`api_trainer_config_helpers_optimizers` ，包括学习率、batch_size、优化算法、正则方法等，具体的使用方法请参考 :ref:`api_trainer_config_helpers_optimizers_settings` 文档。

 网络结构配置
 ------------
@@ -82,14 +82,13 @@ DataProvider是PaddlePaddle系统的数据提供器，将用户的原始数据
 
  这个配置文件网络由 ``data_layer`` 、 ``simple_img_conv_pool`` 、 ``fc_layer`` 组成。

-  - `data_layer`_  ： 通常每个配置文件都会包括 ``data_layer`` ，定义输入数据大小。
-  - `simple_img_conv_pool`_ ：是一个组合层，包括了图像的卷积 (convolution)和池化(pooling)。
-  - `fc_layer`_ ：全连接层，激活函数为Softmax，这里也可叫分类层。
+  - :ref:`api_trainer_config_helpers_layers_data_layer`  ： 通常每个配置文件都会包括 ``data_layer`` ，定义输入数据大小。
+  - :ref:`api_trainer_config_helpers_network_simple_img_conv_pool` ：是一个组合层，包括了图像的卷积 (convolution)和池化(pooling)。
+  - :ref:`api_trainer_config_helpers_layers_fc_layer` ：全连接层，激活函数为Softmax，这里也可叫分类层。

-  
 - 损失函数和评估器：损失函数即为网络的优化目标，评估器可以评价模型结果。

-  PaddlePaddle包括很多损失函数和评估起，详细可以参考 `损失函数层`_ 和 `评估器`_ 。这里 ``classification_cost`` 默认使用多类交叉熵损失函数和分类错误率统计评估器。
+  PaddlePaddle包括很多损失函数和评估器，详细可以参考 :ref:`api_trainer_config_helpers_layers_cost_layers` 和 :ref:`api_trainer_config_helpers_evaluators` 。这里 ``classification_cost`` 默认使用多类交叉熵损失函数和分类错误率统计评估器。
  
 - ``outputs``: 标记网络输出的函数为 ``outputs`` 。

@@ -106,7 +105,7 @@ DataProvider是PaddlePaddle系统的数据提供器，将用户的原始数据
       with mixed_layer(size=200) as out:
           out += full_matrix_projection(input=data)

-PaddlePaddle 可以使用 ``mixed layer`` 配置出非常复杂的网络，甚至可以直接配置一个完整的LSTM。用户可以参考 `mixed_layer`_ 的相关文档进行配置。
+PaddlePaddle 可以使用 ``mixed layer`` 配置出非常复杂的网络，甚至可以直接配置一个完整的LSTM。用户可以参考 :ref:`api_trainer_config_helpers_layers_mixed_layer` 的相关文档进行配置。


 分布式训练
@@ -138,18 +137,3 @@ PaddlePaddle多机采用经典的 Parameter Server 架构对多个节点的 trai
 * --ports_num_for_sparse\: 一个pserver进程共绑定多少端口用来做稀疏更新，默认是0。

 使用手工指定端口数量，是因为Paddle的网络通信中，使用了 int32 作为消息长度，比较容易在大模型下溢出。所以，在 pserver 进程中可以启动多个子线程去接受 trainer 的数据，这样单个子线程的长度就不会溢出了。但是这个值不可以调的过大，因为增加这个值，对性能尤其是内存占用有一定的开销，另外稀疏更新的端口如果太大的话，很容易导致某一个参数服务器没有分配到任何参数。
-
-详细的说明可以参考，使用 `集群训练Paddle`_ 。
-
-
-..  _PyDataProvider: ../ui/data_provider/pydataprovider2.html
-.. _settings: ../../doc/ui/api/trainer_config_helpers/optimizers.html#settings
-.. _优化算法: ../../doc/ui/api/trainer_config_helpers/optimizers.html#optimizers
-.. _trainer_config_helper: ../../doc/ui/api/trainer_config_helpers/index.html
-.. _data_layer: ../../doc/ui/api/trainer_config_helpers/layers.html#data-layer
-.. _simple_img_conv_pool: ../../doc/ui/api/trainer_config_helpers/networks.html#simple-img-conv-pool
-.. _fc_layer: ../../doc/ui/api/trainer_config_helpers/layers.html#fc-layer
-.. _损失函数层: ../../doc/ui/api/trainer_config_helpers/layers.html#cost-layers
-.. _评估器: ../../doc/ui/api/trainer_config_helpers/evaluators.html
-.. _mixed_layer: ../../doc/ui/api/trainer_config_helpers/layers.html#mixed-layer
-..  _集群训练Paddle: ../cluster/index.html
--- a/doc/tutorials/index_cn.md
+++ b/doc/tutorials/index_cn.md
 # 完整教程

-## 快速入门
-
-使用商品评论分类任务，系统性的介绍如何一步步改进，最终得到产品级的深度模型。
-
-* [阅读教程](quick_start/index_cn.rst)
-
-## 图像
-
-* TBD
-
-## 自然语言处理
-
-* [情感分类](sentiment_analysis/index_cn.md)
+* [快速入门](quick_start/index_cn.rst)
+* [个性化推荐](rec/ml_regression_cn.rst)
+* [情感分析](sentiment_analysis/index_cn.md)
 * [语义角色标注](semantic_role_labeling/index_cn.md)
-
-## 个性化推荐
-
-* TBD
+* [机器翻译](text_generation/index_cn.md)

 ## 常用模型

-* TBD
+* [ResNet模型](imagenet_model/resnet_model_cn.md)
--- a/doc/tutorials/index_en.md
+++ b/doc/tutorials/index_en.md
 # TUTORIALS
 There are several examples and demos here.

-## Quick Start
-
 * [Quick Start](quick_start/index_en.md)
-
-## Image
-
+* [MovieLens Regression](rec/ml_regression_en.rst)
 * [Image Classification](image_classification/index_en.md)
-
-## NLP
-
 * [Sentiment Analysis](sentiment_analysis/index_en.md)
-* [Text Generation](text_generation/index_en.md)
 * [Semantic Role Labeling](semantic_role_labeling/index_en.md)
-
-## Recommendation
-
-* [MovieLens Regression](rec/ml_regression_en.rst)
+* [Text Generation](text_generation/index_en.md)

 ## Model Zoo
 * [ImageNet: ResNet](imagenet_model/resnet_model_en.md)

--- a/doc/tutorials/quick_start/index_cn.rst
+++ b/doc/tutorials/quick_start/index_cn.rst
@@ -8,7 +8,7 @@
 安装
 ====

-请参考 `安装教程 <../../build_and_install/index.html>`_ 安装PaddlePaddle。
+请参考 :ref:`install_steps` 安装PaddlePaddle。

 使用概述
 ========
@@ -60,7 +60,7 @@
 Python脚本读取数据
 ------------------

-`DataProvider <../../ui/data_provider/index.html>`_ 是PaddlePaddle负责提供数据的模块。``DataProvider`` 主要职责在于将训练数据传入内存或者显存，让模型能够得到训练更新，其包括两个函数：
+``DataProvider`` 是PaddlePaddle负责提供数据的模块，主要职责在于将训练数据传入内存或者显存，让模型能够得到训练更新，其包括两个函数：

 * initializer：PaddlePaddle会在调用读取数据的Python脚本之前，先调用initializer函数。在下面例子里，我们在initialzier函数里初始化词表，并且在随后的读取数据过程中填充词表。
 * process：PaddlePaddle调用process函数来读取数据。每次读取一条数据后，process函数会用yield语句输出这条数据，从而能够被PaddlePaddle 捕获 (harvest)。
@@ -73,6 +73,7 @@ Python脚本读取数据
     :linenos:
     :emphasize-lines: 8,33

+详细内容请参见 :ref:`api_dataprovider` 。

 配置中的数据加载定义
 --------------------
@@ -93,7 +94,7 @@ Python脚本读取数据
 - obj="process": 指定生成数据的函数
 - args={"dictionary": word_dict}: 额外的参数，这里指定词典

-更详细数据格式和用例请参考 `PyDataProvider2 <../../ui/data_provider/pydataprovider2.html>`_ 。
+更详细数据格式和用例请参考 :ref:`api_pydataprovider2` 。

 模型网络结构
 ============
@@ -105,7 +106,7 @@ Python脚本读取数据
        :scale: 80%


-我们将以最基本的逻辑回归网络作为起点，并逐渐展示更加深入的功能。更详细的网络配置连接请参考 `Layer文档 <../../../doc/layer.html>`_ 。
+我们将以最基本的逻辑回归网络作为起点，并逐渐展示更加深入的功能。更详细的网络配置连接请参考 :ref:`api_trainer_config_helpers_layers` 。
 所有配置都能在 `源代码 <https://github.com/PaddlePaddle/Paddle>`_ 的 ``demo/quick_start`` 目录下找到。

 逻辑回归模型
@@ -306,7 +307,7 @@ Momentum, RMSProp，AdaDelta，AdaGrad，ADAM，Adamax等，这里采用Adam优
        --num_passes=15 \
        --use_gpu=false

-这里只简单介绍了单机训练，如何进行分布式训练，可以参考教程 `分布式训练 <../../cluster/index.html>`_ 。
+这里只简单介绍了单机训练，如何进行分布式训练，请参考 :ref:`cluster_train` 。

 预测
 =====
@@ -318,7 +319,7 @@ Momentum, RMSProp，AdaDelta，AdaGrad，ADAM，Adamax等，这里采用Adam优
    :scale: 80%

 之前配置文件中 ``test.list`` 指定的数据将会被测试，这里直接通过预测脚本 ``predict.sh`` 进行预测,
-更详细的说明，可以参考 `Python API预测 <../../ui/predict/swig_py_paddle.html>`_ 教程。
+更详细的说明，请参考 :ref:`api_swig_py_paddle` 。

    .. code-block:: bash

@@ -373,7 +374,7 @@ Momentum, RMSProp，AdaDelta，AdaGrad，ADAM，Adamax等，这里采用Adam优

 默认一个pass保存一次模型，也可以通过saving_period_by_batches设置每隔多少batch保存一次模型。
 可以通过show_parameter_stats_period设置打印参数信息等。
-其他参数请参考 `命令行参数文档 <../../ui/index.html#command-line-argument>`_ 。
+其他参数请参考 命令行参数文档（链接待补充）。

 输出日志
 ---------

--- a/doc/tutorials/quick_start/index_en.md
+++ b/doc/tutorials/quick_start/index_en.md
@@ -159,7 +159,7 @@ define_py_data_sources2(train_list='data/train.list',
 You can refer to the following link for more detailed examples and data formats: <a href = "../../api/data_provider/pydataprovider2_en.html">PyDataProvider2</a>.

 ## Network Architecture
-You will describe four kinds of network architectures in this section.
+We will describe four kinds of network architectures in this section.
 <center> ![](./src/PipelineNetwork_en.jpg) </center>

 First, you will build a logistic regression model. Later, you will also get chance to build other more powerful network architectures.
@@ -391,7 +391,7 @@ paddle train \
 --use_gpu=false
 ```

-We do not provide examples on how to train on clusters here. If you want to train on clusters, please follow the <a href = "../../howto/cluster/cluster_train_en.html">distributed training</a> documentation or other demos for more details.
+We do not provide examples on how to train on clusters here. If you want to train on clusters, please follow the <a href = "../../howto/usage/cluster/cluster_train_en.html">distributed training</a> documentation or other demos for more details.

 ## Inference
 You can use the trained model to perform prediction on the dataset with no labels. You can also evaluate the model on dataset with labels to obtain its test accuracy.
@@ -509,7 +509,7 @@ The scripts of data downloading, network configurations, and training scrips are
 * \--config_args：Other configuration arguments.
 * \--init_model_path：The path of the initial model parameter.

-By default, the trainer will save model every pass. You can also specify `saving_period_by_batches` to set the frequency of batch saving. You can use `show_parameter_stats_period` to print the statistics of the parameters, which are very useful for tuning parameters. Other command line arguments can be found in <a href = "../../howto/cmd_parameter/index_en.html">command line argument documentation</a>。
+By default, the trainer will save model every pass. You can also specify `saving_period_by_batches` to set the frequency of batch saving. You can use `show_parameter_stats_period` to print the statistics of the parameters, which are very useful for tuning parameters. Other command line arguments can be found in <a href = "../../howto/usage/cmd_parameter/index_en.html">command line argument documentation</a>.

 ### Log


--- a/doc/tutorials/rec/ml_regression_cn.rst
+++ b/doc/tutorials/rec/ml_regression_cn.rst
 MovieLens数据集评分回归模型
-=========================
+===========================

 这里我们在MovieLens数据集描述一种 **余弦相似度回归** 任务。
 该示例将展示paddle如何进行词向量嵌入，处理相似度回归，针对文本
@@ -12,9 +12,9 @@ MovieLens数据集评分回归模型
 让这个示例变得更好，希望能让我们知晓。**

 数据准备
-```````
+`````````
 下载并解压数据集
-''''''''''''''
+'''''''''''''''''
 这里我们使用 :ref:`demo_ml_dataset` 。
 要下载和解压数据集，只需要简单的运行下面的命令即可。

@@ -34,7 +34,7 @@ MovieLens数据集评分回归模型
 		+--- README 		# 数据集描述

 字段配置文件
-''''''''''
+'''''''''''''
 **字段配置文件** 用来具体说明数据集的字段和文件格式，
 例如，说明每个特征文件具体字段是 **什么** 类型。

@@ -50,7 +50,7 @@ ml-1m的字段配置文件在目录 :code:`demo/recommendation/data/config.json`
   :literal:

 准备数据
-```````
+`````````
 你需要安装python的第三方库。
 **强烈推荐使用VIRTUALENV来创造一个干净的python环境。**

@@ -68,14 +68,14 @@ ml-1m的字段配置文件在目录 :code:`demo/recommendation/data/config.json`
 下面介绍预处理过程具体的步骤。

 提取电影或用户的特征并生成python对象
-''''''''''''''''''''''''''''''''
+'''''''''''''''''''''''''''''''''''''

 在movielens 1m数据集中，电影和用户有许多的特征。
 评分文件的每一行仅仅提供电影或用户的编号来代表相应的电影或用户。
 我们首先处理电影或用户的特征文件，然后用pickle命令将特征( **Meta** )对象存储为文件。

 Meta配置文件
-...........
+.............

 **Meta配置文件** 用来具体描述 **如何** 解析数据集中的每一个字段。
 该文件可以从字段配置文件生成，或是手动编辑生成。文件的格式可以
@@ -185,7 +185,7 @@ meta文件 :code:`meta.bin` 的结构如下：


 分割训练/测试文件
-'''''''''''''''
+''''''''''''''''''

 我们将 :code:`ml-1m/ratings.dat` 文件分割为训练和测试文件。分割文件的方法是：对于每位用户，我们将评分分成两部分。
 这样的话每位用户在测试文件中将与训练文件含有同样的信息。
@@ -208,10 +208,10 @@ meta文件 :code:`meta.bin` 的结构如下：


 神经网络结构配置
-``````````````
+`````````````````

 训练器配置文件
-''''''''''''
+'''''''''''''''

 网络结构如下图所示：

@@ -251,7 +251,7 @@ meta文件 :code:`meta.bin` 的结构如下：
 *  声明Python数据源， :ref:`api_trainer_config_helpers_data_sources` 

 数据提供脚本
-'''''''''''
+'''''''''''''

 ..  literalinclude:: ../../../demo/recommendation/dataprovider.py
    :language: python
@@ -264,7 +264,7 @@ meta文件 :code:`meta.bin` 的结构如下：
 * use_seq\: :code:`dataprovider.py` 中的数据是否为序列模式。
 * process\: 返回数据的每一条样本给 :code:`paddle` 。

-数据提供脚本的细节文档可以参考 :ref:`api_pydataprovider` 。
+数据提供脚本的细节文档可以参考 :ref:`api_pydataprovider2` 。

 训练
 ````
@@ -316,7 +316,7 @@ meta文件 :code:`meta.bin` 的结构如下：
 模型被保存在 :code:`output/` 目录中。你可以在任何时候用 :code:`Ctrl-C` 来停止训练。

 模型评估和预测
-````````````
+```````````````

 在训练了几个轮次以后，你可以对模型进行评估，得到最好轮次下的模型。运行下面命令即可：


--- a/paddle/cuda/src/hl_cuda_cudnn.cc
+++ b/paddle/cuda/src/hl_cuda_cudnn.cc
@@ -175,11 +175,15 @@ void hl_cudnn_init(cudnnHandle_t* cudnn_handle, cudaStream_t stream) {
      << "PaddlePaddle Requirement: "
      << "(header v[2-3] with libcudnn v[2-3]) Or "
      << "(header v4 with libcudnn v4) Or "
-      << "(header v5 with libcudnn v5).";
+      << "(header v5 with libcudnn v5) Or "
+      << "(header v6 with libcudnn v6).";

-  CHECK(!(CUDNN_VERSION >= 5000 && CUDA_VERSION < 7050))
+  CHECK(!(CUDNN_VERSION < 6000 && CUDNN_VERSION >= 5000 && CUDA_VERSION < 7050))
      << "cudnn v5 requires cuda version >= 7.5";

+  CHECK(!(CUDNN_VERSION >= 6000 && CUDA_VERSION < 8000))
+      << "cudnn v6 requires cuda version >= 8.0";
+
  CHECK_CUDNN(dynload::cudnnCreate(cudnn_handle));
  CHECK_CUDNN(dynload::cudnnSetStream(*cudnn_handle, stream));

@@ -610,6 +614,23 @@ void hl_create_convolution_descriptor(hl_convolution_descriptor* conv,
  CHECK_CUDNN(dynload::cudnnCreateConvolutionDescriptor(&hl_conv->desc));

  cudnnConvolutionMode_t mode = CUDNN_CROSS_CORRELATION;
+
+#if CUDNN_VERSION >= 6000
+#ifndef PADDLE_TYPE_DOUBLE
+  cudnnDataType_t data_type = CUDNN_DATA_FLOAT;
+#else
+  cudnnDataType_t data_type = CUDNN_DATA_DOUBLE;
+#endif
+  CHECK_CUDNN(dynload::cudnnSetConvolution2dDescriptor(hl_conv->desc,
+                                                       padding_height,
+                                                       padding_width,
+                                                       stride_height,
+                                                       stride_width,
+                                                       1,
+                                                       1,
+                                                       mode,
+                                                       data_type));
+#else
  CHECK_CUDNN(dynload::cudnnSetConvolution2dDescriptor(hl_conv->desc,
                                                       padding_height,
                                                       padding_width,
@@ -618,6 +639,7 @@ void hl_create_convolution_descriptor(hl_convolution_descriptor* conv,
                                                       1,
                                                       1,
                                                       mode));
+#endif

  hl_conv->input_image = image;
  hl_conv->filter = filter;
@@ -645,6 +667,23 @@ void hl_reset_convolution_descriptor(hl_convolution_descriptor conv,

  cudnnConvolutionDescriptor_t conv_desc = GET_CONVOLUTION_DESCRIPTOR(conv);
  cudnnConvolutionMode_t mode = CUDNN_CROSS_CORRELATION;
+
+#if CUDNN_VERSION >= 6000
+#ifndef PADDLE_TYPE_DOUBLE
+  cudnnDataType_t data_type = CUDNN_DATA_FLOAT;
+#else
+  cudnnDataType_t data_type = CUDNN_DATA_DOUBLE;
+#endif
+  CHECK_CUDNN(dynload::cudnnSetConvolution2dDescriptor(conv_desc,
+                                                       padding_height,
+                                                       padding_width,
+                                                       stride_height,
+                                                       stride_width,
+                                                       1,
+                                                       1,
+                                                       mode,
+                                                       data_type));
+#else
  CHECK_CUDNN(dynload::cudnnSetConvolution2dDescriptor(conv_desc,
                                                       padding_height,
                                                       padding_width,
@@ -653,6 +692,7 @@ void hl_reset_convolution_descriptor(hl_convolution_descriptor conv,
                                                       1,
                                                       1,
                                                       mode));
+#endif

  cudnn_convolution_descriptor hl_conv = (cudnn_convolution_descriptor)conv;
  hl_conv->input_image = image;
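
The version guards added to `hl_cudnn_init` above can be read as a small predicate over the two version macros. The following is a minimal Python sketch of that logic only; `check_cudnn_cuda` is a hypothetical helper name and the integer encodings simply mirror the `CUDNN_VERSION` / `CUDA_VERSION` comparisons shown in the hunk.

```python
def check_cudnn_cuda(cudnn_version, cuda_version):
    """Return the error messages that the two CHECKs would raise.

    Hypothetical helper: it mirrors the comparisons in the diff,
    it is not part of PaddlePaddle.
    """
    errors = []
    # cuDNN v5.x (5000 <= version < 6000) needs CUDA 7.5 or newer.
    if 5000 <= cudnn_version < 6000 and cuda_version < 7050:
        errors.append("cudnn v5 requires cuda version >= 7.5")
    # cuDNN v6 and later needs CUDA 8.0 or newer.
    if cudnn_version >= 6000 and cuda_version < 8000:
        errors.append("cudnn v6 requires cuda version >= 8.0")
    return errors


if __name__ == "__main__":
    print(check_cudnn_cuda(6021, 7050))
    print(check_cudnn_cuda(6021, 8000))
```

Bounding the first check with `< 6000` keeps a v6 build from being reported with the v5 message, which the previous single check would have done.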

--- a/paddle/gserver/dataproviders/PyDataProvider2.cpp
+++ b/paddle/gserver/dataproviders/PyDataProvider2.cpp
@@ -252,19 +252,9 @@ private:
    // only for instance will make python reference-count error.
    //
    // So here, we increase reference count manually.
-    if (gModuleClsPtrs_.find((uintptr_t)module.get()) !=
-        gModuleClsPtrs_.end()) {
-      // Multi instance use same module
-      Py_XINCREF(module.get());
-      Py_XINCREF(moduleDict.get());
-    } else {
-      gModuleClsPtrs_.insert((uintptr_t)module.get());
-    }
-    if (gModuleClsPtrs_.find((uintptr_t)cls.get()) != gModuleClsPtrs_.end()) {
-      Py_XINCREF(cls.get());
-    } else {
-      gModuleClsPtrs_.insert((uintptr_t)cls.get());
-    }
+    Py_XINCREF(module.get());
+    Py_XINCREF(moduleDict.get());
+    Py_XINCREF(cls.get());

    PyObjectPtr fileListInPy = loadPyFileLists(fileListName);
    PyDict_SetItemString(kwargs.get(), "file_list", fileListInPy.get());
@@ -471,7 +461,6 @@ private:
  std::vector<std::string> fileLists_;
  std::vector<SlotHeader> headers_;
  static PyObjectPtr zeroTuple_;
-  static std::unordered_set<uintptr_t> gModuleClsPtrs_;

  class PositionRandom {
  public:
@@ -671,7 +660,6 @@ public:
  }
 };

-std::unordered_set<uintptr_t> PyDataProvider2::gModuleClsPtrs_;
 PyObjectPtr PyDataProvider2::zeroTuple_(PyTuple_New(0));

 REGISTER_DATA_PROVIDER_EX(py2, PyDataProvider2);

--- a/paddle/gserver/layers/BatchNormalizationLayer.cpp
+++ b/paddle/gserver/layers/BatchNormalizationLayer.cpp
@@ -59,24 +59,14 @@ void BatchNormalizationLayer::calMeanAndStd(const MatrixPtr& mat) {

 void BatchNormalizationLayer::calMovingMeanAndVar() {
  // calculating and saving moving mean and variance
-  MatrixPtr movingMean = movingMean_->getW();
-  MatrixPtr movingVar = movingVar_->getW();
-
-  if (!useGpu_ && FLAGS_trainer_count > 1) {
-    auto mvMean = std::dynamic_pointer_cast<SharedCpuMatrix>(movingMean);
-    auto mvVar = std::dynamic_pointer_cast<SharedCpuMatrix>(movingVar);
-    CHECK(mvMean && mvVar);
-
-    mvMean->add(*savedMean_, movingAvgFraction_, 1.0 - movingAvgFraction_);
-    mvVar->add(*savedInvVar_, movingAvgFraction_, 1.0 - movingAvgFraction_);
-  } else {
-    // movingMean =  movingMean * movingAvgFraction_
-    //            + savedMean_ * (1 - movingAvgFraction_)
-    movingMean->add(*savedMean_, movingAvgFraction_, 1.0 - movingAvgFraction_);
-    // movingVar =  movingVar * movingAvgFraction_
-    //           + savedInvVar_ * (1 - movingAvgFraction_)
-    movingVar->add(*savedInvVar_, movingAvgFraction_, 1.0 - movingAvgFraction_);
-  }
+  auto& movingMean = movingMean_->getW();
+  auto& movingVar = movingVar_->getW();
+  // movingMean =  movingMean * movingAvgFraction_
+  //            + savedMean_ * (1 - movingAvgFraction_)
+  movingMean->add(*savedMean_, movingAvgFraction_, 1.0 - movingAvgFraction_);
+  // movingVar =  movingVar * movingAvgFraction_
+  //           + savedInvVar_ * (1 - movingAvgFraction_)
+  movingVar->add(*savedInvVar_, movingAvgFraction_, 1.0 - movingAvgFraction_);
 }

 void BatchNormalizationLayer::setMeanAndStd() {
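
The comments kept in `calMovingMeanAndVar` describe an exponential moving average. As a rough illustration of the arithmetic only, here is a Python sketch on plain lists; `update_moving` is a hypothetical name, and the assumption (taken from the comments in the hunk) is that `add(b, p1, p2)` computes `this * p1 + b * p2`.

```python
def update_moving(moving, saved, fraction):
    """Blend the running statistic with the current batch statistic.

    Mirrors the comment in the diff:
        moving = moving * fraction + saved * (1 - fraction)
    Hypothetical helper operating on Python lists, not on Paddle matrices.
    """
    return [m * fraction + s * (1.0 - fraction) for m, s in zip(moving, saved)]


if __name__ == "__main__":
    # With fraction = 0.5 the result is the midpoint of the two inputs.
    print(update_moving([1.0, 2.0], [3.0, 4.0], 0.5))
```

The simplified C++ path relies on `add` being dispatched at run time, which appears to be why the `Matrix.h` hunk in this merge marks the two `add` overloads `virtual`.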

--- a/paddle/gserver/tests/test_RecurrentGradientMachine.cpp
+++ b/paddle/gserver/tests/test_RecurrentGradientMachine.cpp
@@ -127,7 +127,7 @@ TEST(RecurrentGradientMachine, HasSubSequence) {
  }
 }

-TEST(RecurrentGradientMachine, DISABLED_rnn) {
+TEST(RecurrentGradientMachine, rnn) {
  for (bool useGpu : {false, true}) {
    test("gserver/tests/sequence_rnn.conf",
         "gserver/tests/sequence_nest_rnn.conf",
@@ -136,7 +136,7 @@ TEST(RecurrentGradientMachine, DISABLED_rnn) {
  }
 }

-TEST(RecurrentGradientMachine, DISABLED_rnn_multi_input) {
+TEST(RecurrentGradientMachine, rnn_multi_input) {
  for (bool useGpu : {false, true}) {
    test("gserver/tests/sequence_rnn_multi_input.conf",
         "gserver/tests/sequence_nest_rnn_multi_input.conf",
@@ -145,7 +145,7 @@ TEST(RecurrentGradientMachine, DISABLED_rnn_multi_input) {
  }
 }

-TEST(RecurrentGradientMachine, DISABLED_rnn_multi_unequalength_input) {
+TEST(RecurrentGradientMachine, rnn_multi_unequalength_input) {
  for (bool useGpu : {false, true}) {
    test("gserver/tests/sequence_rnn_multi_unequalength_inputs.py",
         "gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.py",

--- a/paddle/math/Matrix.h
+++ b/paddle/math/Matrix.h
@@ -1973,8 +1973,8 @@ public:

 public:
  virtual void mul(CpuSparseMatrix* a, CpuMatrix* b, real scaleAB, real scaleT);
-  void add(Matrix& b, real p1, real p2);
-  void add(real p1, real p2);
+  virtual void add(Matrix& b, real p1, real p2);
+  virtual void add(real p1, real p2);

 private:
  using Matrix::mul;

--- a/python/paddle/trainer/PyDataProvider2.py
+++ b/python/paddle/trainer/PyDataProvider2.py
@@ -107,8 +107,7 @@ def integer_value_sub_sequence(dim):
    return integer_value(dim, seq_type=SequenceType.SUB_SEQUENCE)


-def integer_sequence(dim):
-    return index_slot(dim, seq_type=SequenceType.SEQUENCE)
+integer_sequence = integer_value_sequence


 class SingleSlotWrapper(object):
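
The hunk above replaces a wrapper function that delegated to `index_slot` with a plain alias. A minimal, self-contained sketch of the resulting behaviour follows; `SequenceType` and `integer_value` here are simplified stand-ins written for illustration and do not reproduce the real definitions in `PyDataProvider2.py`.

```python
class SequenceType(object):
    # Stand-in constants; the real values are an assumption.
    NO_SEQUENCE = 0
    SEQUENCE = 1
    SUB_SEQUENCE = 2


def integer_value(dim, seq_type=SequenceType.NO_SEQUENCE):
    # Stand-in: the real function builds an input-type object.
    return ("integer", dim, seq_type)


def integer_value_sequence(dim):
    return integer_value(dim, seq_type=SequenceType.SEQUENCE)


# After the change, both names refer to the same function object.
integer_sequence = integer_value_sequence


if __name__ == "__main__":
    print(integer_sequence is integer_value_sequence)
    print(integer_sequence(10))
```

Because the alias is a second name for the same object, the two spellings cannot drift apart the way two separately defined functions can.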

--- a/python/paddle/trainer_config_helpers/data_sources.py
+++ b/python/paddle/trainer_config_helpers/data_sources.py
@@ -78,21 +78,6 @@ def define_py_data_source(file_list,
    if not isinstance(args, basestring) and args is not None:
        args = pickle.dumps(args, 0)

-    if data_cls is None:
-
-        def py_data2(files, load_data_module, load_data_object, load_data_args,
-                     **kwargs):
-            data = DataBase()
-            data.type = 'py2'
-            data.files = files
-            data.load_data_module = load_data_module
-            data.load_data_object = load_data_object
-            data.load_data_args = load_data_args
-            data.async_load_data = True
-            return data
-
-        data_cls = py_data2
-
    cls(
        data_cls(
            files=file_list,
@@ -207,10 +192,22 @@ def define_py_data_sources2(train_list, test_list, module, obj, args=None):
    :return: None
    :rtype: None
    """
+
+    def py_data2(files, load_data_module, load_data_object, load_data_args,
+                 **kwargs):
+        data = DataBase()
+        data.type = 'py2'
+        data.files = files
+        data.load_data_module = load_data_module
+        data.load_data_object = load_data_object
+        data.load_data_args = load_data_args
+        data.async_load_data = True
+        return data
+
    define_py_data_sources(
        train_list=train_list,
        test_list=test_list,
        module=module,
        obj=obj,
        args=args,
-        data_cls=None)
+        data_cls=py_data2)
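
The `data_sources.py` change moves the `py_data2` factory out of the generic helper and into `define_py_data_sources2`, which now passes it explicitly as `data_cls`. The sketch below shows that shape in isolation; `DataBase` and the trimmed-down `define_py_data_source` signature are hypothetical stand-ins, not the real config-parser objects.

```python
class DataBase(object):
    """Stand-in for the config object created by the real helper."""
    pass


def define_py_data_source(file_list, module, obj, args, data_cls):
    # The generic helper no longer guesses a default: it just calls
    # whatever factory the caller supplied.
    return data_cls(
        files=file_list,
        load_data_module=module,
        load_data_object=obj,
        load_data_args=args)


def define_py_data_sources2(train_list, module, obj, args=None):
    def py_data2(files, load_data_module, load_data_object, load_data_args,
                 **kwargs):
        data = DataBase()
        data.type = 'py2'
        data.files = files
        data.load_data_module = load_data_module
        data.load_data_object = load_data_object
        data.load_data_args = load_data_args
        data.async_load_data = True
        return data

    return define_py_data_source(
        train_list, module, obj, args, data_cls=py_data2)


if __name__ == "__main__":
    d = define_py_data_sources2("data/train.list", "dataprovider", "process")
    print(d.type, d.files, d.async_load_data)
```

Passing the factory explicitly keeps the `py2`-specific fields next to the only function that needs them, instead of hiding them behind a `data_cls is None` branch.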
--- a/python/paddle/trainer_config_helpers/layers.py
+++ b/python/paddle/trainer_config_helpers/layers.py
@@ -1776,15 +1776,15 @@ def img_conv_layer(input,
                   trans=False,
                   layer_type=None):
    """
-    Convolution layer for image. Paddle only support square input currently and
-    thus input image's width equals height.
+    Convolution layer for image. Paddle currently supports both square and
+    non-square input.

    The details of convolution layer, please refer UFLDL's `convolution
    <http://ufldl.stanford.edu/tutorial/supervised/
    FeatureExtractionUsingConvolution/>`_ .

-    Convolution Transpose (deconv) layer for image. Paddle only support square
-    input currently and thus input image's width equals height.
+    Convolution Transpose (deconv) layer for image. Paddle currently supports
+    both square and non-square input.

    The details of convolution transpose layer,
    please refer to the following explanation and references therein