# Anakin ARM 性能测试
## 测试环境和参数:
+ 测试模型:Mobilenetv1, mobilenetv2, mobilenet-ssd
+ 采用android ndk交叉编译,gcc 4.9,enable neon, ABI:armeabi-v7a with neon -mfloat-abi=softfp
+ 测试平台
- 荣耀v9(root): 处理器:麒麟960, 4 big cores in 2.36GHz, 4 little cores in 1.8GHz
- nubia z17:处理器:高通835, 4 big cores in 2.36GHz, 4 little cores in 1.9GHz
- 360 N5:处理器:高通653, 4 big cores in 1.8GHz, 4 little cores in 1.4GHz
+ 多线程:openmp
+ 时间:warmup10次,运行10次取均值
+ ncnn版本:来源于github的master branch,commit ID:307a77f04be29875f40d337cfff6df747df09de6(msg:convert LogisticRegressionOutput)
+ TFlite版本:来源于github的master branch,commit ID:65c05bc2ac19f51f7027e66350bc71652662125c(msg:Removed unneeded file copy that was causing failure in Pi builds)
在BenchMark中本文将使用**`ncnn`**、**`TFlite`**和**`Anakin`**进行性能对比分析
## BenchMark model
> 注意在性能测试之前,请先将测试model通过[External Converter](./convert_paddle_to_anakin.html)转换为Anakin model
> 对这些model,本文在ARM上进行多线程的单batch size测试。
- [Mobilenet v1](#11) *caffe model 可以在[这儿](https://github.com/shicai/MobileNet-Caffe)下载*
- [Mobilenet v2](#22) *caffe model 可以在[这儿](https://github.com/shicai/MobileNet-Caffe)下载*
- [mobilenet-ssd](#33) *caffe model 可以在[这儿](https://github.com/chuanqi305/MobileNet-SSD)下载*
### <span id = '11'> mobilenetv1 </span>
|platform | Anakin (1) | Anakin (2) | Anakin (4) | ncnn (1) | ncnn (2) | ncnn (4) | TFlite (1) | TFlite (2) | TFlite (4)|
|:---: | :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:|
|麒麟960|107.7ms|61.1ms|38.2ms|152.8ms|85.2ms|51.9ms|152.6ms|nan|nan|
|高通835|105.7ms|63.1ms|~~46.8ms~~|152.7ms|87.0ms|~~92.7ms~~|146.9ms|nan|nan|
|高通653|120.3ms|64.2ms|46.6ms|202.5ms|117.6ms|84.8ms|158.6ms|nan|nan|
### <span id = '22'> mobilenetv2 </span>
|platform | Anakin (1) | Anakin (2) | Anakin (4) | ncnn (1) | ncnn (2) | ncnn (4) | TFlite (1) | TFlite (2) | TFlite (4)|
|:---: | :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:|
|麒麟960|93.1ms|53.9ms|34.8ms|144.4ms|84.3ms|55.3ms|100.6ms|nan|nan|
|高通835|93.0ms|55.6ms|41.1ms|139.1ms|88.4ms|58.1ms|95.2ms|nan|nan|
|高通653|106.6ms|64.2ms|48.0ms|199.9ms|125.1ms|98.9ms|108.5ms|nan|nan|
### <span id = '33'> mobilenet-ssd </span>
|platform | Anakin (1) | Anakin (2) | Anakin (4) | ncnn (1) | ncnn (2) | ncnn (4) | TFlite (1) | TFlite (2) | TFlite (4)|
|:---: | :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:|
|麒麟960|213.9ms|120.5ms|74.5ms|307.9ms|166.5ms|104.2ms|nan|nan|nan|
|高通835|213.0ms|125.7ms|~~98.4ms~~|292.9ms|177.9ms|~~167.8ms~~|nan|nan|nan|
|高通653|236.0ms|129.6ms|96.0ms|377.7ms|228.9ms|165.0ms|nan|nan|nan|
## How to run those Benchmark models?
1. 首先, 使用[External Converter](./convert_paddle_to_anakin.html)对caffe model 进行转换
2. 然后将转换后的Anakin model和编译好的benchmark_arm 二进制文件通过'adb push'命令上传至测试机
3. 接着在测试机含有Anakin model的目录中运行'./benchmark_arm ./ anakin_model.anakin.bin 1 10 10 1' 命令
4. 最后,终端显示器上将会打印该模型的运行时间
5. 其中运行命令的参数个数和含义可以通过运行'./benchmark_arm'看到
# Anakin 运行模型示例
Anakin目前只支持NCHW的格式
示例文件在test/framework/net下
## 在NV的GPU上运行CNN模型
示例文件为 example_nv_cnn_net.cpp,整体流程如下:
- 将模型的path设置为anakin模型的路径,初始化NV平台的图对象。anakin模型可以通过转换器转化caffe或Paddle的模型得到
- 根据模型设置网络图的输入尺寸,进行图优化
- 根据优化后的网络图初始化网络执行器
- 取出网络的输入tensor,将数据拷贝到输入tensor
- 运行推导
- 取出网络的输出tensor
以NV平台为例演示Anakin框架的使用方法,注意编译时需要打开GPU编译开关
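为了便于对照,下面给出一份最小化的流程示意代码(其中模型路径、输入尺寸和输入/输出结点名均为假设值,头文件与命名空间从略,完整写法请以 example_nv_cnn_net.cpp 和后文教程为准):
```c++
// 初始化NV平台的图对象, 并加载anakin模型(路径为示意值)
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
auto status = graph->load("your_model.anakin.bin");
// 根据模型设置网络图的输入尺寸, 并进行图优化
graph->Reshape("input_0", {1, 3, 224, 224}); // 输入结点名与尺寸均为假设
graph->Optimize();
// 根据优化后的网络图初始化网络执行器
Net<NV, AK_FLOAT, Precision::FP32> net(*graph);
// 取出输入tensor, 将数据拷贝到输入tensor(可先填充host端tensor再copy_from)
auto* d_tensor_in = net.get_in("input_0");
// ... 填充 d_tensor_in ...
// 运行推导
net.prediction();
// 取出输出tensor(结点名为假设)
auto* d_tensor_out = net.get_out("output_node_name");
```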
## 在X86上运行RNN模型
示例文件为example_x86_rnn_net.cpp
整体流程与在NV的GPU上运行CNN模型相似,不同之处如下:
- 使用X86标识初始化图对象和网络执行器对象
- rnn模型的输入尺寸是可变的,初始化图时的输入维度是维度的最大值,输入维度N代表总的词的个数。还需要设置输入tensor的seq_offset来标示这些词是如何划分为句子的,如{0,5,12}表示共有12个词,其中第0到第4个词是第一句话,第5到第11个词是第二句话
以X86平台为例演示Anakin框架的使用方法,注意编译时需要打开X86编译开关
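下面是一个设置输入尺寸与 seq_offset 的示意片段(其中输入结点名、输入维度以及 set_seq_offset 这一接口名均为假设,实际用法请以 example_x86_rnn_net.cpp 为准):
```c++
// 使用X86标识初始化图对象和网络执行器对象
auto graph = new Graph<X86, AK_FLOAT, Precision::FP32>();
graph->load("your_rnn_model.anakin.bin");          // 路径为示意值
graph->Reshape("input_0", {12, 1, 1, 1});          // N=12, 即本次最多输入12个词(维度为假设)
graph->Optimize();
Net<X86, AK_FLOAT, Precision::FP32> net(*graph);
auto* t_in = net.get_in("input_0");
// {0,5,12}: 共12个词, 第0~4个词是第一句话, 第5~11个词是第二句话
t_in->set_seq_offset({0, 5, 12});                  // 接口名为假设, 请以实际Tensor头文件为准
// ... 填充词数据后执行推导 ...
net.prediction();
```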
## 在NV的GPU上使用Anakin的线程池运行CNN模型
示例文件为example_nv_cnn_net_multi_thread.cpp ,示例使用worker的同步预测接口
整体流程与在NV的GPU上运行CNN模型相似,不同之处如下:
- 用模型地址和线程池大小初始化worker对象
- 将输入tensor注入任务队列,获得输出tensor
# Anakin GPU 性能测试
## 环境:
> CPU: `12-core Intel(R) Xeon(R) CPU E5-2620 v2 @2.10GHz`
> GPU: `Tesla P4`
> cuDNN: `v7`
## anakin 对比对象:
**`Anakin`** 将与高性能的推理引擎 **`NVIDIA TensorRT 3`** 进行比较
## Benchmark Model
> 注意在性能测试之前,请先将测试model通过 `External Converter` 工具转换为Anakin model
> 对这些model,本文在GPU上进行单线程单GPU卡的性能测试。
- [Vgg16](#1) *caffe model 可以在[这儿](https://gist.github.com/jimmie33/27c1c0a7736ba66c2395)下载*
- [Yolo](#2) *caffe model 可以在[这儿](https://github.com/hojel/caffe-yolo-model)下载*
- [Resnet50](#3) *caffe model 可以在[这儿](https://github.com/KaimingHe/deep-residual-networks#models)下载*
- [Resnet101](#4) *caffe model 可以在[这儿](https://github.com/KaimingHe/deep-residual-networks#models)下载*
- [Mobilenet v1](#5) *caffe model 可以在[这儿](https://github.com/shicai/MobileNet-Caffe)下载*
- [Mobilenet v2](#6) *caffe model 可以在[这儿](https://github.com/shicai/MobileNet-Caffe)下载*
- [RNN](#7) *暂不支持*
### <span id = '1'>VGG16 </span>
- Latency (`ms`) of different batch
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 8.53945 | 8.18737 |
| 2 | 14.2269 | 13.8976 |
| 4 | 24.2803 | 21.7976 |
| 8 | 45.6003 | 40.319 |
- GPU Memory Used (`MB`)
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 1053.88 | 762.73 |
| 2 | 1055.71 | 762.41 |
| 4 | 1003.22 | 832.75 |
| 8 | 1108.77 | 926.9 |
### <span id = '2'>Yolo </span>
- Latency (`ms`) of different batch
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 8.41606| 7.07977 |
| 2 | 16.6588| 15.2216 |
| 4 | 31.9955| 30.5102 |
| 8 | 66.1107 | 64.3658 |
- GPU Memory Used (`MB`)
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 1054.71 | 299.8 |
| 2 | 951.51 | 347.47 |
| 4 | 846.9 | 438.47 |
| 8 | 1042.31 | 515.15 |
### <span id = '3'> Resnet50 </span>
- Latency (`ms`) of different batch
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 4.10063 | 3.33845 |
| 2 | 6.10941 | 5.54814 |
| 4 | 9.90233 | 10.2763 |
| 8 | 17.3287 | 20.0783 |
- GPU Memory Used (`MB`)
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 1059.15 | 299.86 |
| 2 | 1077.8 | 340.78 |
| 4 | 903.04 | 395 |
| 8 | 832.53 | 508.86 |
### <span id = '4'> Resnet101 </span>
- Latency (`ms`) of different batch
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 7.29828 | 5.672 |
| 2 | 11.2037 | 9.42352 |
| 4 | 17.9306 | 18.0936 |
| 8 | 31.4804 | 35.7439 |
- GPU Memory Used (`MB`)
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 1161.94 | 429.22 |
| 2 | 1190.92 | 531.92 |
| 4 | 994.11 | 549.7 |
| 8 | 945.47 | 653.06 |
### <span id = '5'> MobileNet V1 </span>
- Latency (`ms`) of different batch
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 1.52692 | 1.39282 |
| 2 | 1.98091 | 2.05788 |
| 4 | 3.2705 | 4.03476 |
| 8 | 5.15652 | 7.06651 |
- GPU Memory Used (`MB`)
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 1144.35 | 99.6 |
| 2 | 1160.03 | 199.75 |
| 4 | 1098 | 184.33 |
| 8 | 990.71 | 232.11 |
### <span id = '6'> MobileNet V2</span>
- Latency (`ms`) of different batch
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 1.95961 | 1.78249 |
| 2 | 2.8709 | 3.01144 |
| 4 | 4.46131 | 5.43946 |
| 8 | 7.161 | 10.2081 |
- GPU Memory Used (`MB`)
| BatchSize | TensorRT | Anakin |
| --- | --- | --- |
| 1 | 1154.69 | 195.25 |
| 2 | 1187.25 | 227.6 |
| 4 | 1053 | 241.75 |
| 8 | 1062.48 | 352.18 |
## How to run those Benchmark models
1. 首先, 使用[External Converter](./convert_paddle_to_anakin.html)对caffe model 进行转换
2. 然后跳转至 *source_root/benchmark/CNN* 目录下,使用 'mkdir ./models'创建存放模型的目录,并将转换好的Anakin模型放在该目录下
3. 运行脚本 `sh run.sh`,运行结束后,该模型的运行时间将会显示到终端上
4. 如果你想获取每层OP的运行时间,只需将 CMakeLists.txt 中的`ENABLE_OP_TIMER` 设置为 `YES` 即可
# Parser的编写指南
Parser是一种网络框架转换工具,将其他框架如Caffe、TensorFlow的网络结构转换为Anakin网络结构图,然后对转换后的Anakin图进行预测处理
本文主要介绍Parser功能的框架结构和根据已有的网络框架改写Parser,以解析得到Anakin框架图,进行Anakin预测
下文称Anakin为AK,运算操作为OP,本文参考TensorFlow的Parser编写,参考代码目录为tools/external_converter_v2/parser/tensorflow
## Parser的功能和执行流程
Parser功能是将其他深度学习框架(如Caffe,TensorFlow,ONNX)的模型转换为AK的模型
对AK的作用是屏蔽不同框架间的差异,这种差异包括模型存储、OP的定义、图差异
因此Parser的执行流程是:
- 将源框架的模型载入Parser
- 将原框架的图解析为AK中的OP节点和OP节点的连接关系
- 进行OP定义的转换和图优化
- 将符合AK标准的图写入protobuf
## Parser的目录结构
Parser工具在tools/external_converter_v2/parser目录下
Parser的目录主要包含3部分:
- Parser的运行配置文件包括 config.py, config.yaml, converter.py, 用户只用执行converter.py,Parser就会按照config.yaml中的声明去解析模型
- Parser的公共定义,包括operations,pbs,proto三个目录。Parser的公共工具函数 graph*.py logger.py utils.py
- 各个框架对应的Parser,其目录的命名方式为框架名,如Caffe, TensorFlow
## Parser的编写流程
### 1、声明你的Parser
- 在config.yaml中填写你的Parser运行的必要信息,包括ProtoPath和SavePath等。OPTIONS/Framework改为你的Parser的类型,TARGET下填写对应的参数列表
- 添加你的Parser目录,如TensorFlow,导出你的Parser符号。注意,Parser的框架默认调用你的Parser类中的__call__方法来执行解析,这个方法需要返回填写完毕的GraphProtoIO对象
- 在config.py中Configuration下__init__函数中增加对你的Parser的调用,将yaml中读取的配置信息传给你的Parser,此处调用你的Parser中的__init__方法
### 2、添加你的Parser主体
可以参考parser_tf.py
- 你需要在Parser主体构造时获取模型路径,input、output名字等解析必须的信息
- 在__call__中返回填写好的GraphProtoIO对象,该对象为填写protobuf的辅助工具
- 建议Parser的解析过程分成三部分,先将原框架的模型载入并转换为一种便于修改的中间的图形式;对中间图修改使得图满足AK的要求;将满足要求的中间图利用NodeProtoIO和GraphProtoIO这两个辅助类填入protobuf,具体细节可以参考parser_tf
### 3、读取原始模型,并将模型转换为中间类型
可以参考parse_tf_2_med.py
- 这一步与原始框架结合紧密,你可能需要import原始框架的工具函数来完成模型的裁剪、固定、加载等操作
- 大部分的框架都是使用tensor来连接OP的,但AK中是OP直接相连,这点需要注意
- AK的shape默认是4维的,有的参数的shape不足4维,需要Parser补全
### 4、对中间类型的图进行优化
可以参考med_graph.py
- 由于AK不支持普通OP多输出的情况,需要在多输出的OP后面补上Splite类型的OP节点
- 对于Convolution后接Batchnorm这种可以合并又不会导致OP定义改变的情况,需要Parser在这一步完成合并
- AK规定所有的输入类型OP的名字必须是input_x这种命名方式,其中x为从0开始的数字
### 5、将中间类型的图以GraphProtoIO的方式保存
可以参考parse_med_2_ak.py 和 parser_tf.py
- 你首先需要构造Node节点,Node节点的名字是OP的名字(如conv2d_1_a_0),Node节点中OP成员变量的名字是Node节点的类型(如Convolution)
- Node节点需要按照输入的顺序用Node的add_in方法填写输入Node的名字,add_out方法按顺序填写输出Node的名字
- 通过调用GraphProtoIO的add_node方法将构造好的Node的__call__方法的返回值作为参数,将Node节点加入AK的graph中
- 调用GraphProtoIO的add_in_edge和add_out_edge完成AK图中OP间关系的构建。如果Node中的in和out填写正确,你也可以通过调用GraphProtoIO的format_edge_from_nodes方法完成这个工作
- AK的模型需要Parser给出输出Node的名字,使用GraphProtoIO的add_out方法填写输出Node的名字
### 6、检查模型解析的正确性
- 默认的config.yaml配置会在解析结束后启动一个web服务器展示解析后的AK模型图,你需要对比原框架的模型图进行验证。这里最容易出现的错误是边关系的错误,表现为图非常乱,你需要逐条边地检查错误;第二个容易出错的地方是参数漏填,需要你检查OP中的属性
- 将解析后的模型放入AK中执行,使用相同的输入,原框架与AK应有相同的输出。如果输出不一致,可以开启AK的DEBUG模式,在net.cpp中将每层的输出打印出来;如果AK在解析阶段陷入死循环,大概率是边的关系出错
## 如何添加新OP
- 需要在AK代码中加入该OP的实现,包括对应设备Saber的OP,Saber单测和Framework中的OP
- 根据Framework的OP在ops.py中添加Parser公共的OP定义
- 从原框架的模型中解析出该OP的节点,并在AK的graph中填入该OP节点
## AK模型与其他框架模型的不同之处
+ AK模型与caffe的模型相似,因此与其他模型有很多不同的地方,需要Parser在解析过程中处理掉
+ 最大的不同是与PaddlePaddle或TensorFlow的模型中OP粒度很细,而AK的模型中OP的粒度很粗(目的是为了节省访存开销)。这会导致解析这些框架的模型时存在大量的合并操作
+ 其次是OP的行为不同,如TensorFlow中Pooling默认都是exclusive的,而AK中是inclusive的。TensorFlow的Padding,如果是奇数pad,则在右方和下方多pad,而AK是在左方和上方多Pad(本节末尾给出一个数值例子)
+ AK默认的布局是NCHW,如果其他框架的OP是其他形式的,需要在Parser中做weights的布局转换,并处理reshape的问题
+ AK中有的weights是需要预先做布局转换的(如GRU,LSTM),AK中也支持同一OP的不同算法,如(GRU,Pooling)
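以上面提到的Padding差异为例(数值仅为示意):设输入宽 W=6、卷积核 k=3、步长 s=2,SAME padding 下输出宽为 ceil(6/2)=3,需要的总 padding 为 (3-1)*2+3-6=1。TensorFlow 会把这 1 列 pad 补在右侧(左0右1),而按照 AK 的约定则补在左侧(左1右0)。如果 Parser 不处理这类差异,转换后的模型输出就会产生错位。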
## ARM 源码编译 Anakin ##
目前Anakin支持ARM Android平台,采用Android NDK交叉编译工具链,已在mac os和centos上编译和测试通过。
### 安装概览 ###
* [系统需求](#0001)
* [安装第三方依赖](#0002)
* [Anakin源码编译](#0003)
* [验证安装](#0004)
### <span id = '0001'> 1. 系统需求 </span> ###
* 宿主机: linux, mac
* cmake 3.8.2+
* Android NDK r14, Linux 版本[从这里下载](https://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip)
### <span id = '0002'> 2. 安装第三方依赖 </span> ###
- 2.1 protobuf3.4.0
源码从这里[下载](https://github.com/google/protobuf/releases/tag/v3.4.0)
- 2.1.1 为宿主机编译protobuf
```bash
$ tar -xzf protobuf-3.4.0.tar.gz
$ cd protobuf-3.4.0
$ ./autogen.sh
$ ./configure
$ make
$ make check
$ make install
```
上述 $make install 执行后,可在 `/usr/local/include/google` 找到 libprotobuf 所需的头文件,将整个google文件夹拷贝至Anakin/third-party/arm-android/protobuf/下,然后执行下面的命令清除已经生成的文件。
如有问题,请参考[这里](https://github.com/google/protobuf/blob/v3.4.0/src/README.md)
```bash
$ make distclean
```
- 2.1.2 交叉编译Android`armeabi-v7a`的protobuf,注意设置ANDROID_NDK的路径,以及ARCH_ABI、HOSTOSN的值
```bash
$ export ANDROID_NDK=your_ndk_path
$ ARCH_ABI="arm-linux-androideabi-4.9"
$ HOSTOSN="darwin-x86_64"
$ export SYSROOT=$ANDROID_NDK/platforms/android-9/arch-arm
$ export PREBUILT=$ANDROID_NDK/toolchains/$ARCH_ABI
$ export LDFLAGS="--sysroot=$SYSROOT"
$ export LD="$ANDROID_NDK/toolchains/$ARCH_ABI/prebuilt/$HOSTOSN/arm-linux-androideabi/bin/ld $LDFLAGS"
$ export LIBS="-llog $ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/libs/armeabi-v7a/libgnustl_static.a"
$ export CPPFLAGS=""
$ export INCLUDES="-I$ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/include/ -I$ANDROID_NDK/platforms/android-9/arch-arm/usr/include/ -I$ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/libs/armeabi-v7a/include/"
$ export CXXFLAGS="-march=armv7-a -mfloat-abi=softfp -DGOOGLE_PROTOBUF_NO_RTTI --sysroot=$SYSROOT"
$ export CCFLAGS="$CXXFLAGS"
$ export CXX="$PREBUILT/prebuilt/$HOSTOSN/bin/arm-linux-androideabi-g++ $CXXFLAGS"
$ export CC="$CXX"
$ export RANLIB="$ANDROID_NDK/toolchains/$ARCH_ABI/prebuilt/$HOSTOSN/bin/arm-linux-androideabi-ranlib"
$ ./autogen.sh
$ ./configure --host=arm-linux-androideabi --with-sysroot=$SYSROOT --enable-cross-compile --with-protoc=protoc --disable-shared CXX="$CXX" CC="$CC" LD="$LD"
$ make
```
编译生成 *.a 静态库,若希望编译 *.so 动态链接库,请在 ./configure 参数中将 --disable-shared 改为 --disable-static --enable-shared。
生成文件在`src/.libs/`下,将生成的文件拷贝至`Anakin/third-party/arm-android/protobuf/lib`下。
在[cmake](../../cmake/find_modules.cmake)中更新`ARM_RPOTO_ROOT`的路径。
```cmake
set(ARM_RPOTO_ROOT "${CMAKE_SOURCE_DIR}/third-party/arm-android/protobuf")
```
- 2.2 opencv 2.4.3+(optional)
Anakin只在examples示例中使用opencv
Android系统的opencv从[这里下载](https://opencv.org/releases.html)
解压后将 `3rdparty/libs/armeabi-v7a`中的库文件拷贝到`libs/armeabi-v7a`
在[cmake](../../cmake/find_modules.cmake)中搜索`anakin_find_opencv`
并设置 `include_directories` 和 `LINK_DIRECTORIES`为自己安装的库的路径
```cmake
include_directories(${CMAKE_SOURCE_DIR}/third-party/arm-android/opencv/sdk/native/jni/include/)
LINK_DIRECTORIES(${CMAKE_SOURCE_DIR}/third-party/arm-android/opencv/sdk/native/libs/armeabi-v7a/)
```
### <span id = '0003'> 3. Anakin源码编译 </span> ###
#### 编译Android版本
克隆[源码](https://github.com/PaddlePaddle/Anakin/tree/arm)
```bash
cd your_dir
git clone https://github.com/PaddlePaddle/Anakin.git
cd Anakin
git fetch origin arm
git checkout arm
```
修改`android_build.sh`
- 修改NDK路径
```bash
#modify "your_ndk_path" to your NDK path
export ANDROID_NDK=your_ndk_path
```
- 修改ARM 处理器架构
对于32位ARM处理器, 将ANDROID_ABI 设置为 `armeabi-v7a with NEON`
对于64位ARM处理器, 可以将ANDROID_ABI 设置为 `armeabi-v7a with NEON`或者`arm64-v8a`
目前我们只支持 `armeabi-v7a with NEON`;`arm64-v8a` 还在开发中
```bash
-DANDROID_ABI="armeabi-v7a with NEON"
```
- 设置Android API
根据Android系统的版本设置API level, 例如API Level 21 -> Android 5.0.1
```bash
-DANDROID_NATIVE_API_LEVEL=21
```
- 选择编译静态库或动态库
设置`BUILD_SHARED=NO`编译静态库
设置`BUILD_SHARED=YES`编译动态库
```bash
-DBUILD_SHARED=NO
```
- OpenMP多线程支持
设置`USE_OPENMP=YES`开启OpenMP多线程
```bash
-DUSE_OPENMP=YES
```
- 编译单测文件
设置`BUILD_WITH_UNIT_TEST=YES`将会编译单测文件
```bash
-DBUILD_WITH_UNIT_TEST=YES
```
- 编译示例文件
设置`BUILD_EXAMPLES=YES`将会编译示例文件
```bash
-DBUILD_EXAMPLES=YES
```
- 开启opencv
如果使用opencv,设置`USE_OPENCV=YES`
```bash
-DUSE_OPENCV=YES
```
- 开始编译
运行脚本 `android_build.sh` 将自动编译Anakin
```bash
./android_build.sh
```
### <span id = '0004'> 4. 验证安装 </span> ###
编译好的库会放在目录`${Anakin_root}/output`下;
编译好的单测文件会放在`${Anakin_root}/output/unit_test`目录下;
编译好的示例文件会放在`${Anakin_root}/output/examples`目录下。
对于Android系统,打开设备的调试模式,通过ADB可以访问的目录是`data/local/tmp`,通过ADB push将测试文件、模型和数据发送到设备目录, 运行测试文件。
# Anakin 使用教程 ##
本教程将会简略的介绍Anakin的工作原理,一些基本的Anakin API,以及如何调用这些API。
## 内容 ###
- [Anakin的工作原理](#principle)
- [Anakin APIs](#api)
- [示例代码](#example)
## <span id = 'principle'> Anakin的工作原理</span> ###
![Anakin_principle](../pics/anakin_fm_ch.png)
用Anakin来进行前向计算主要分为三个步骤:
- 将外部模型通过[Anakin Parser](./convert_paddle_to_anakin.html)解析为Anakin模型
在使用Anakin之前,用户必须将所有其他模型转换成Anakin模型,我们提供了转换脚本,用户可通过[Anakin Parser](./convert_paddle_to_anakin.html)进行模型转换。
- 生成Anakin计算图
加载Anakin模型生成原始计算图,然后需要对原始计算图进行优化。你只需要调用相应的API优化即可。
- 执行计算图
Anakin会选择不同硬件平台执行计算图。
## <span id ='api'>Anakin APIs </span> ###
### Tensor ####
`Tensor`提供基础的数据操作和管理,为ops提供统一的数据接口。`Tensor`包含以下几个属性:
- Buffer
数据存储区
- Shape
数据的维度信息
- Event
用于异步计算的同步
`Tensor`类包含三个`Shape`对象, 分别是`_shape`, `_valid_shape`和`_offset`
- `_shape`表示`tensor`真正的存储空间信息
- `_valid_shape`表示当前`tensor`实际使用的空间信息
- `_offset`表示当前`tensor`数据指针相对于真正数据空间起始位置的偏移信息
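举个例子(数值仅为示意)帮助理解这三个`Shape`之间的关系:
```c++
Shape shape(1, 3, 8, 8);        // _shape: tensor真正的存储空间 NCHW = (1,3,8,8)
Shape valid_shape(1, 3, 4, 4);  // _valid_shape: 当前实际参与计算的空间
Shape offset(0, 0, 2, 2);       // _offset: 数据指针相对真正数据空间起点的偏移
Tensor<X86, AK_FLOAT> mytensor(shape);
// 三个Shape对象的LayOutType必须一致, set_shape的详细说明见下文
mytensor.set_shape(valid_shape, shape, offset);
```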
`Tensor`的不同维度分别与数学中的向量、矩阵等相对应,如下表所示
Dimensions | Math entity |
:----: | :----:
1 | vector
2 | matrix
3 | 3-tensor
n | n-tensor
#### 声明tensor对象
`Tensor`接受三个模板参数:
```c++
template<typename TargetType, DataType datatype, typename LayOutType = NCHW>
class Tensor .../* Inherit other class */{
//some implements
...
};
```
TargetType是平台类型,如X86,GPU等等,在Anakin内部有相应的标识与之对应;datatype是普通的数据类型,在Anakin内部也有相应的标志与之对应
[LayOutType](#layout)是数据分布类型,如batch x channel x height x width [NxCxHxW], 在Anakin内部用一个struct来标识
Anakin中数据类型与基本数据类型的对应如下:
1. <span id = 'target'> TargetType </span>
Anakin TargetType | platform
:----: | :----:
NV | NVIDIA GPU
ARM | ARM
AMD | AMD GPU
X86 | X86
NVHX86 | NVIDIA GPU with Pinned Memory
2. <span id = 'datatype'> DataType </span>
Anakin DataType | C++ | Description
:---: | :---: | :---:
AK_HALF | short | fp16
AK_FLOAT | float | fp32
AK_DOUBLE | double | fp64
AK_INT8 | char | int8
AK_INT16 | short | int16
AK_INT32 | int | int32
AK_INT64 | long | int64
AK_UINT8 | unsigned char | uint8
AK_UINT16 | unsigned short | uint16
AK_UINT32 | unsigned int | uint32
AK_STRING | std::string | /
AK_BOOL | bool | /
AK_SHAPE | / | Anakin Shape
AK_TENSOR | / | Anakin Tensor
3. <span id = 'layout'> LayOutType </span>
Anakin LayOutType ( Tensor LayOut ) | Tensor Dimention | Tensor Support | Op Support
:---: | :---: | :---: | :---:
W | 1-D | YES | NO
HW | 2-D | YES | NO
WH | 2-D | YES | NO
NW | 2-D | YES | YES
NHW | 3-D | YES |YES
NCHW ( default ) | 4-D | YES | YES
NHWC | 4-D | YES | NO
NCHW_C4 | 5-D | YES | YES
理论上,Anakin支持声明1维以上的tensor,但是对于Anakin中的Op来说,只支持NW、NHW、NCHW、NCHW_C4这四种LayOut,其中NCHW是默认的LayOutType,NCHW_C4是专门针对int8这种数据类型的。
例子
下面的代码将展示如何使用tensor, 我们建议先看看这些示例。
要想获得更多关于tensor的信息, 请参考 *source_path/core/tensor.h*
> 1. 使用shape对象初始化tensor
```c++
//create a null tensor. A null tensor holds for nothing.
//tensor's buffer is resident at CPU and its datatype is AK_FLOAT.
//tensor's Layout is NCHW(default)
Tensor<X86, AK_FLOAT> mytensor;
//1. using shape object to create a tensor.
Shape shape1(NUM); //1-D shape. NUM is the number of dimention.
Tensor<X86, AK_FLOAT, W> mytensor1(shape1); //1-D tensor.
// A 4-D shape
Shape shape2(N, C, H, W); // batch x channel x height x width
```
> 注意:Shape的维度必须和tensor的[LayOutType](#layout)相同,比如Shape(N,C,H,W),那么Tensor的LayOutType必须是NCHW,否则会出错。如下列代码所示
```c++
// A 4-D tensor.
Tensor<X86, AK_FLOAT> mytensor2(shape2); //right
//A 4-D tensor which is resident at GPU and its datatype is AK_INT8
Tensor<NV, AK_INT8> mytensor3(shape2); //right
Tensor<X86, AK_FLOAT, NHW> mytensor4(shape2); //wrong!! shape's dimetion must be equal to tensor's Layout.
Tensor<NV, AK_FLOAT, NCHW_C4> mytensor5(shape2); //wrong!!!!
```
> 2. 使用现有的数据和shape初始化tensor
```c++
/**
* A construtor of Tensor.
* data_ptr is a pointer to any data type of data
* TargetType is type of a platform [Anakin TargetType]
* id : device id
* shape: a Anakin shape
*/
Tensor(Dtype* data_ptr, TargetType_t target, int id, Shape shape);
//using existing data feed to a tensor
Tensor<X86, AK_FLOAT> mytensor(data_ptr, TargetType, device_id, shape); //shape must has dimention (N, C, H, W).
```
> 3. 使用tensor初始化tensor
```c++
Tensor<NV, AK_FLOAT> tensor(exist_tensor);
```
> 提示: 你可以用` typedef Tensor<X86, AK_FLOAT> Tensor4d_X86 `方便定义tensor
#### 填充tensor数据区
填充数据区得看你申明tensor的方式, 下面展示了如何填充tensor的数据区。
首先来看看tensor的四种声明方式:
```c++
1. Tensor<X86, AK_FLOAT> mytensor;
2. Tensor<X86, AK_FLOAT, W> mytensor1(shape1);
3. Tensor<X86, AK_FLOAT> mytensor(data_ptr, TargetType, device_id, shape);
4. Tensor<NV, AK_FLOAT> tensor(exist_tensor);
```
相关的声明方式的数据填充方法如下:
- 声明一个空的tensor,此时没有为其分配内存,所以,我们需要手动的为其分配内存。
```c++
//parama shape
mytensor.re_alloc(Shape shape);
//Get writable pointer to mytensor.
//parama index (int): where you start to write.
//Dtype is your data type such int, float or double.
Dtype *p = mytensor.mutable_data(index/*=0*/);
//write data to mytensor
for(int i = 0; i < mytensor.size(); i++){
p[i] = 1.0f;
}
//do something ...
```
- 这种声明方式会自动分配内存
```c++
//Get writable pointer to mytensor.
//parama index (int): where you start to write.
//Dtype is your data type such int, float or double.
Dtype *p = mytensor1.mutable_data(index/*=0*/);
//write data to mytensor
for(int i = 0; i < mytensor.size(); i++){
p[i] = 1.0f;
}
//do something ...
```
- 在该种声明方式中,我们仍不需要手动为其分配内存。但在构造函数内部是否为其分配内存,得依情况而定。如果data_ptr和声明的tensor都在同一个目标平台上,那么该tensor就会与data_ptr共享内存空间;相反,如果它们不在同一个平台上(如data_ptr在X86上,而tensor在GPU上),那么此时tensor就会开辟一个新的内存空间,并将data_ptr所指向的数据拷贝到tensor的buffer中。
```c++
//Get writable pointer to mytensor.
//parama index (int): where you start to write.
//Dtype is your data type such int, float or double.
Dtype *p = mytensor.mutable_data(index/*=0*/);
//write data to mytensor
for(int i = 0; i < mytensor.size(); i++){
p[i] = 1.0f;
}
//do something ...
```
- 该种方式仍不需要手动分配内存
```c++
//Get writable pointer to mytensor.
//parama index (int): where you start to write.
//Dtype is your data type such int, float or double.
Dtype *p = mytensor.mutable_data(index/*=0*/);
//write data to mytensor
for(int i = 0; i < mytensor.size(); i++){
p[i] = 1.0f;
}
//do something ...
```
- 另外,你还可以获取一个tensor的可读指针,示例如下:
```c++
//Get read-only pointer to mytensor.
//parama index (int): where you start to read.
//Dtype is your data type such int, float or double.
Dtype *p = mytensor.data(index/*=0*/);
//do something ...
```
如果想更详细地了解tensor,请查阅 *source_path/saber/core/tensor.h*
#### 获取tensor的shape
```c++
//some declarations
// ...
Shape shape = mytensor.shape();
//Get a first dimetion size of tesor, if it has.
int d1 = shape[0];
//Get a second dimention size of tensor, if it has.
int d2 = shape[1];
...
//Get a n-th dimention size of tensor, if it has.
int dn = shape[n-1];
//Get a tensor's dimention
int dims = mytensor.dims();
//Get the size of tensor.
//size = d1 x d2 x ... x dn.
int size = mytensor.size();
//Get the size of tensor at interval [Di, Dj)
// form i-th dimention to j-th dimention, but not including the j-th dimention.
// which means di x (di+1) x ... x (dj -1)
int size = mytensor.count(start, end);
```
#### 设置tensor的shape
我们可以用tensor的成员函数set_shape来设置tensor的shape。 下面是set_shape的定义
```c++
/**
* \brief set a tensor's shape
* \param valid_shape [a Shape object]
* \param shape [a Shape object]
* \param offset [a Shape object]
* \return the status of this operation, that means whether it succeeds or not.
*/
SaberStatus set_shape(Shape valid_shape, Shape shape = Shape::zero(TensorAPI::layout_dims::value), Shape offset = Shape::minusone(TensorAPI::layout_dims::value));
```
这个成员函数只设置tensor的shape。这些shape对象(valid_shape, shape, offset)的[LayOutType](#layout)必须和当前的tensor的相应三个shape对象的LayOutType相同,如果不同就会出错,返回SaberInvalidValue。 如果相同,那么将成功设置tensor的shape。
```c++
// some declarations
// ...
//valid_shape, shape , offset are Shape object;
//All these Shape object's LayOutType must be equal to mytensor's.
mytensor.set_shape(valid_shape, shape, offset);
```
#### 重置 tensor的shape
```c++
//some declarations
Shape shape, valid_shape, offset;
//do some initializations
...
mytensor.reshape(valid_shape, shape, offset);
```
注意: Reshape操作仍然需要shape的[LayOutType](#layout) 与tensor的相同
### Graph ###
`Graph`类负责加载Anakin模型生成计算图、对图进行优化、存储模型等操作。
#### 图的声明
`Tensor`一样,graph也接受三个模板参数。
```c++
template<typename TargetType, DataType Dtype, Precision Ptype>
class Graph ... /* inherit other class*/{
//some implements
...
};
```
前面已经介绍过[TargetType](#target)和[DataType](#datatype)是Anakin内部自定义的数据类型。[TargetType](#target)表示平台类型(如NV、X86),[DataType](#datatype)是Anakin基本数据类型,与C++/C中的基本数据类型相对应。[Precision](#precision)为op所支持的精度类型,稍后我们再介绍它。
```c++
//Create a empty graph object.
Graph<NV, AK_FLOAT, Precision::FP32> graph;
//Create a pointer to a empty graph.
Graph *graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
//Create a pointer to a empty graph.
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
```
#### 加载 Anakin 模型
```c++
//some declarations
...
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
std::string model_path = "the/path/to/where/your/models/are";
const char *model_path1 = "the/path/to/where/your/models/are";
//Loading Anakin model to generate a compute graph.
auto status = graph->load(model_path);
//Or this way.
auto status = graph->load(model_path1);
//Check whether load operation success.
if(!status){
std::cout << "error" << endl;
//do something...
}
```
#### 优化计算图
```c++
//some declarations
...
//Load graph.
...
//According to the ops of loaded graph, optimize compute graph.
graph->Optimize();
```
> 注意: 第一次加载原始图,必须要优化。
#### 保存模型
你可以在任何时候保存模型, 特别的, 你可以保存一个优化的模型,这样,下次再加载模型时,就不必进行优化操作。
```c++
//some declarations
...
//Load graph.
...
// save a model
//save_model_path: the path to where your model is.
auto status = graph->save(save_model_path);
//Checking
if(!status){
cout << "error" << endl;
//do somethin...
}
```
#### 重新设置计算图里的tensor的shape
```c++
//some declarations
...
//Load graph.
...
vector<int> shape{10, 256, 256, 10};
//input_name : std::string.
//Reshape a tensor named input_name.
graph->Reshape(input_name, shape);//Note: shape is a vector, not a Shape object.
```
#### 设置 batch size
`Graph` 支持重新设置batch size的大小。
```c++
//some declarations
...
//Load graph.
...
//input_name : std::string.
//Reset a tensor named input_name.
int new_batch_size = 4;
graph->ResetBatchSize(input_name, new_batch_size);
```
### Net ###
`Net` 是计算图的执行器。你可以通过Net对象获得输入和输出
#### Creating a graph executor
`Net`接受四个模板参数。
```c++
template<typename TargetType, DataType Dtype, Precision PType, OpRunType RunType = OpRunType::ASYNC>
class Net{
//some implements
...
};
```
由于有些Op可能支持多种精度,我们可以通过Precision来指定。OpRunType表示同步或异步类型,异步是默认类型。OpRunType::SYNC表示同步,在GPU上只有单个流;OpRunType::ASYNC表示异步,在GPU上有多个流并以异步方式执行。实际上,Precision和OpRunType都是enum class, 详细设计请参考*source_root/framework/core/types.h*.
1. <span id = 'precision'> Precision </span>
Precision | Op support
:---: | :---:
Precision::INT4 | NO
Precision::INT8 | NO
Precision::FP16 | NO
Precision::FP32 | YES
Precision::FP64 | NO
现在Op的精度只支持FP32, 但在将来我们会支持剩下的Precision.
2. <span id = '1'> OpRunType </span>
OpRunType | Sync/Aync |Description
:---: | :---: | :---:
OpRunType::SYNC | Synchronization | single-stream on GPU
OpRunType::ASYNC | Asynchronization | multi-stream on GPU
用graph对象创建一个执行器
```c++
//some declarations
...
//Create a pointer to a graph.
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
//do something...
...
//create a executor
Net<NV, AK_FLOAT, Precision::FP32> executor(*graph);
```
#### 获取输入输出tensor
获取输入输出tensor,并填充输入tensor的buffer。如果想要获取输入和输出tensor,那么必须指定输入的名字,如"input_0", "input_1", "input_2", ..., 必须传入如上字符串才能够获得输入tensor。另外,如果想知道input_i对应哪个输入,你需要去dash board查看,如何使用dash board请看[Anakin Parser](./convert_paddle_to_anakin.html)。请看如下示例代码
```c++
//some declaratinos
...
//create a executor
//TargetType is NV [NVIDIA GPU]
Net<NV, AK_FLOAT, Precision::FP32> executor(*graph);
//Get the first input tensor.
//The following tensors(tensor_in0, tensor_in2 ...) are resident at GPU.
//Note: Member function get_in returns an pointer to tensor.
Tensor<NV, AK_FLOAT>* tensor_in0 = executor.get_in("input_0");
//If you have multiple input tensors
//You just type this code below.
Tensor<NV, AK_FLOAT>* tensor_in1 = executor.get_in("input_1");
...
auto tensor_inn = executor.get_in("input_n");
```
当得到输入tensor之后,就可以填充它的数据区了。
```c++
//This tensor is resident at GPU.
auto tensor_d_in = executor.get_in("input_0");
//If we want to feed above tensor, we must feed the tensor which is resident at host. And then copy the host tensor to the device's one.
//using Tensor4d = Tensor<Ttype, Dtype>;
Tensor4d<X86, AK_FLOAT> tensor_h_in; //host tensor;
//Tensor<X86, AK_FLOAT> tensor_h_in;
//Allocate memory for host tensor.
tensor_h_in.re_alloc(tensor_d_in->valid_shape());
//Get a writable pointer to tensor.
float *h_data = tensor_h_in.mutable_data();
//Feed your tensor.
/** example
for(int i = 0; i < tensor_h_in.size(); i++){
h_data[i] = 1.0f;
}
*/
//Copy host tensor's data to device tensor.
tensor_d_in->copy_from(tensor_h_in);
// And then
```
类似的,我们可以利用成员函数get_out来获得输出tensor。但与获取输入tensor不同的是,我们需要指定输出结点的名字,这个可以从dash board中看到,请从[Anakin Parser](./convert_paddle_to_anakin.html)中查看dash board的使用方法。假如有个输出结点叫pred_out,那么我们可以通过如下代码获得相应的输出tensor:
```c++
//Note: this tensor are resident at GPU.
Tensor<NV, AK_FLOAT>* tensor_out_d = executor.get_out("pred_out");
```
#### Executing graph
当一切准备就绪后,我们就可以执行真正的计算了!
```c++
executor.prediction();
```
## <span id='example'> 示例代码 </span> ##
下面的例子展示了如何调用Anakin。
在这儿之前, 请确保你已经有了Anakin模型。如果还没有,那么请使用[Anakin Parser](./convert_paddle_to_anakin.html)转换你的模型。
### Single-thread
单线程例子在 *`source_root/test/framework/net/net_exec_test.cpp`*
```c++
std::string model_path = "your_Anakin_models/xxxxx.anakin.bin";
// Create an empty graph object.
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
// Load Anakin model.
auto status = graph->load(model_path);
if(!status ) {
LOG(FATAL) << " [ERROR] " << status.info();
}
// Reshape
graph->Reshape("input_0", {10, 384, 960, 10});
// You must optimize graph for the first time.
graph->Optimize();
// Create a executer.
Net<NV, AK_FLOAT, Precision::FP32> net_executer(*graph);
//Get your input tensors through some specific string such as "input_0", "input_1", and
//so on.
//And then, feed the input tensor.
//If you don't know Which input do these specific string ("input_0", "input_1") correspond with, you can launch dash board to find out.
auto d_tensor_in_p = net_executer.get_in("input_0");
Tensor4d<X86, AK_FLOAT> h_tensor_in;
auto valid_shape_in = d_tensor_in_p->valid_shape();
for (int i=0; i<valid_shape_in.size(); i++) {
LOG(INFO) << "detect input dims[" << i << "]" << valid_shape_in[i]; //see tensor's dimentions
}
h_tensor_in.re_alloc(valid_shape_in);
float* h_data = h_tensor_in.mutable_data();
for (int i=0; i<h_tensor_in.size(); i++) {
h_data[i] = 1.0f;
}
d_tensor_in_p->copy_from(h_tensor_in);
//Do inference.
net_executer.prediction();
//Get result tensor through the name of output node.
//And also, you need to see the dash board again to find out how many output nodes are and remember their name.
//For example, you've got an output node named obj_pred_out
//Then, you can get an output tensor.
auto d_tensor_out_0_p = net_executer.get_out("obj_pred_out"); //get_out returns a pointer to output tensor.
auto d_tensor_out_1_p = net_executer.get_out("lc_pred_out"); //get_out returns a pointer to output tensor.
//......
// do something else ...
//...
//save model.
//You might not optimize the graph when you load the saved model again.
std::string save_model_path = model_path + std::string(".saved");
status = graph->save(save_model_path);
if (!status ) {
LOG(FATAL) << " [ERROR] " << status.info();
}
```
# 模型转换指南
Anakin 支持不同框架的模型预测。但由于格式的差别,Anakin 需要您预先转换模型, 本文档介绍如何转换模型。
## 简介
Anakin 模型转换器输入支持 Caffe 和 Paddle 两种格式的预测模型,模型包含网络结构(model 或 prototxt)和权重参数(param 或 caffemodel)。
模型转换的输出是一个 bin 文件,它作为 Anakin 框架的 graph 参数导入。
您还可以使用模型转换器的 launch board 功能生成网络结构的 HTML 预览。
## 系统要求
- python 2.7+
- pyyaml
- flask
- protobuf 3.5+
## 用法
### 1、环境
转换器所需的依赖标注于*系统要求*一节。
### 2、配置
您需要对 *config.yaml* 文件进行修改以告知您的需求。工程中给出了 *config.yaml* 示例,下面作进一步说明。
#### config.yaml
```bash
OPTIONS:
Framework: CAFFE # 依框架类型填写 CAFFE 或 Paddle
SavePath: ./output # 转换结束后模型的保存位置
ResultName: googlenet # 输出模型的名字
Config:
LaunchBoard: ON # 是否生成网络结构预览页面
Server:
ip: 0.0.0.0
port: 8888 # 从一个可用端口访问预览页面
OptimizedGraph: # 当您使用了 Anakin 框架的 Optimized 功能时,才应该打开此项
enable: OFF
path: /path/to/anakin_optimized_anakin_model/googlenet.anakin.bin.saved
LOGGER:
LogToPath: ./log/ # 生成日志的路径
WithColor: ON
TARGET:
CAFFE:
# 当 Framework 为 CAFFE 时需填写
ProtoPaths:
- /path/to/caffe/src/caffe/proto/caffe.proto
PrototxtPath: /path/to/your/googlenet.prototxt
ModelPath: /path/to/your/googlenet.caffemodel
Paddle:
# 当 Framework 为 Paddle 时需填写
Debug: NULL
ProtoPaths:
- /
PrototxtPath: /path/to/paddle/inference_model
ModelPath: /path/to/paddle/inference_model
# ...
```
### 3、转换
在完成配置文件的修改后,您只需执行 ```python converter.py``` 就可以进行模型转换了。
### 4、预览
最后一步,就是在浏览器中查看转换结果!网址是在 *config.yaml* 中配置的,例如 http://0.0.0.0:8888 。
> 注意:若您使用了默认的 IP 地址 0.0.0.0,请在预览时使用真实的服务器地址 real_ip:port 替代它。
# 如何增加新的Operator
## 基本概念
简单介绍下几个同Operator相关的基本概念,详情请参考设计文档。
```framework```: 上层的逻辑代码,负责从parser中获取参数及weights,添加op时主要修改framework/operator目录下的内容。
```saber```: 底层的实现代码,Anakin通过saber封装了不同的backends,不同的实现(impl)分别特化出自己的实现,外层framework通过不同的template进入各自的impl完成调用。各个op的parameter放在saber/saber_funcs_param.h文件中,增加op主要修改saber/funcs下的内容。
saber的文件结构:
* saber/funcs下的是各个funcs的外部接口,这一层的op与具体的设备实现无关,只与各op完成的功能有关。由于跟实现(impl)无关,本层文件名均不带impl。
* saber/funcs/impl下是各个op的impl声明,特定设备需要完成该层声明的特化版本,如saber/funcs/impl/x86实现了上一层impl声明的x86特化版本,saber/funcs/impl/cuda实现了上一层impl声明的NV特化版本。当增加新的backends时需要特化出新的实现。本层代码同实现相关,均带有```impl_```前缀。
* saber/funcs/impl/cuda/base/cuda_c内有cuda```.cu```扩展名的文件,添加cuda的kernel需要在该文件目录下添加。
* saber/funcs/impl/cuda/base/sass 内有不同架构的汇编代码编译的静态库。
### 涉及到的基类及各个类之前的关系
简单介绍相关的基类
* ```anakin::Operator```: framework的operator基类,位于framework/core/operator/operator.h
* ```anakin::saber::BaseFunc```: saber对外的op接口基类,提供统一的对外接口,位于saber/funcs/base.h。BaseFunc的```compute_output_shape```接口只根据input的shape和param的参数计算输出的shape,并通过```tensor```的```set_shape```接口(只设置shape,不分配空间)设置到output中。```operator()```接口为各个op的计算接口。
* ```anakin::saber::ImplBase```: saber设备实现的op的接口,所有设备相关实现的基类。位于saber/funcs/impl/impl_base.h。实现版本中这里分为两类:一类以```vender_```为前缀,带有```vender_```前缀意为使用第三方库来实现该op,如cudnn的conv,或mkl的conv等等,这类op的性能我们难以调优,因此单独列为一类;另一类是带有源码的saber实现,这些实现都带有```saber_```前缀,此类实现带有源码,能够通过后续优化不断提升性能,实现起名时需要注意这一点。
## 添加operator
添加一个新的op需要以下几步:
1. 添加saber的param
2. 定义saber的Operator类
3. 定义新的impl声明
4. 完成新的impl实现
5. 增加framework的实现或特化
接下来就针对这几步,以一个简单例子为例介绍实现。
例如我们要添加新的Mul op。给出计算公式如下:$$Out = \alpha \cdot X \cdot Y$$
### 为operator增加param
涉及到的文件:```saber/saber_funcs_param.h```。如果之前已经存在需要添加的op的param,这一步可以跳过。
这里```XXXParam```是一个```struct```。包含一个无参数的构造函数,含参数的构造函数,复制构造函数,```operator=()```及```operator==()```。
```
template <typename opTensor> // 能够获得target, datatype, layout
struct MulParam{
MulParam()
: alpha(0)
{}
MulParam(float alpha_in)
: alpha(alpha_in)
{}
MulParam(const MulParam& right)
: alpha(right.alpha)
{}
MulParam &operator=(const MulParam &right) {
alpha = right.alpha;
return *this;
}
bool operator==(const MulParam &right) {
return alpha == right.alpha;
}
float alpha;
};
```
### 定义Operator类
涉及到的文件:```saber/funcs/mul.h```。如果之前定义过该op的类,这里需要修改输入的impl定义头文件。
下面给出一个相对完整的定义结构供参考。
```
//不同的设备需要包含对应的operator实现.[详见](#impl)
#ifdef NVIDIA_GPU
#include "saber/funcs/impl/cuda/saber_mul.h"
#include "saber/funcs/impl/cuda/vender_mul.h"
#endif
//如果一个设备现在还没有对应的operator实现,需要包含声明。[详见](#declare)
#ifdef USE_X86_PLACE
#include "saber/funcs/impl/impl_mul.h"
#endif
namespace anakin {
namespace saber {
template<typename TargetType,
DataType OpDtype,
DataType inDtype = AK_FLOAT,
DataType outDtype = AK_FLOAT,
typename LayOutType_op = NCHW,
typename LayOutType_in = NCHW,
typename LayOutType_out = NCHW>
class Mul : public BaseFunc<
Tensor<TargetType, inDtype, LayOutType_in>,
Tensor<TargetType, outDtype, LayOutType_out>,
Tensor<TargetType, OpDtype, LayOutType_op>,
ImplBase, MulParam> {
public:
using BaseFunc<
Tensor<TargetType, inDtype, LayOutType_in>,
Tensor<TargetType, outDtype, LayOutType_out>,
Tensor<TargetType, OpDtype, LayOutType_op>,
ImplBase, MulParam>::BaseFunc;
Mul() = default;
typedef Tensor<TargetType, inDtype, LayOutType_in> InDataTensor;
typedef Tensor<TargetType, outDtype, LayOutType_out> OutDataTensor;
typedef Tensor<TargetType, OpDtype, LayOutType_op> OpTensor;
typedef MulParam<OpTensor> Param_t;
typedef std::vector<InDataTensor *> Input_v;
typedef std::vector<OutDataTensor *> Output_v;
typedef std::vector<Shape> Shape_v;
virtual SaberStatus compute_output_shape(const Input_v &input,
Output_v &output, Param_t &param) override {
//计算输出的shape,
Shape output_shape = (input[0]->valid_shape());
/* code */
return output[0]->set_shape(output_shape);
}
virtual SaberStatus init_impl(ImplEnum implenum) override {
// 不同设备均使用此init_impl, 此接口创建对应impl的实现。
switch (implenum) {
case VENDER_IMPL:
this->_impl.push_back(new VenderMul <TargetType,
OpDtype, inDtype, outDtype,
LayOutType_op, LayOutType_in, LayOutType_out>);
return SaberSuccess;
case SABER_IMPL:
this->_impl.push_back(new SaberMul <TargetType,
OpDtype, inDtype, outDtype,
LayOutType_op, LayOutType_in, LayOutType_out>);
return SaberSuccess;
default:
return SaberUnImplError;
}
}
private:
virtual void pick_best_static() override {
if (true) // some condition?
this->_best_impl = this->_impl[0];
}
virtual void pick_best_specify(ImplEnum implenum) override {
this->_best_impl = this->_impl[0];
}
};
} // namespace saber
} // namespace anakin
```
### 为operator增加新的impl<span id="declare">声明</span>
涉及的文件:```saber/funcs/impl/impl_mul.h```。不同的设备都特化同一个声明,特化版本放在对应的文件夹下,这里的声明就是给出所有设备的统一声明。下面给出一个参考。
```
#include "saber/funcs/impl/impl_macro.h"
namespace anakin{
namespace saber{
DEFINE_OP_CLASS(Mul, MulParam); // 第一个参数是op的名字,第二个是对应param的名字
}
}
```
### 完成新的operator特定后端<span id="impl">实现</span>
涉及的文件:```saber/funcs/impl/xxx/vender_mul.h```或```saber/funcs/impl/xxx/saber_mul.h```
这里```xxx```指代特定的一种设备。```vender```是指的使用第三方库实现的op,```saber```指的源码实现的op。这里以cuda的vender实现为例,简单介绍一下特化出的函数的几个基本接口。
```
// include 对应的声明
#include "saber/funcs/impl/impl_mul.h"
namespace anakin{
namespace saber{
template <DataType OpDtype,
DataType inDtype,
DataType outDtype,
typename LayOutType_op,
typename LayOutType_in,
typename LayOutType_out>
class VenderMul<NV, //偏特化出需要的后端。
OpDtype, inDtype, outDtype,
LayOutType_op, LayOutType_in, LayOutType_out> :
public ImplBase<
Tensor<NV, inDtype, LayOutType_in>,
Tensor<NV, outDtype, LayOutType_out>,
Tensor<NV, OpDtype, LayOutType_op>,
MulParam<Tensor<NV, OpDtype, LayOutType_op> > >
{
public:
typedef Tensor<NV, inDtype, LayOutType_in> DataTensor_in;
typedef Tensor<NV, outDtype, LayOutType_out> DataTensor_out;
typedef Tensor<NV, OpDtype, LayOutType_op> OpTensor;
typedef typename DataTensor_in::Dtype InDataType;
typedef typename DataTensor_out::Dtype OutDataType;
typedef typename OpTensor::Dtype OpDataType;
VenderMul(){}
~VenderMul() {}
virtual SaberStatus init(const std::vector<DataTensor_in *>& inputs,
std::vector<DataTensor_out *>& outputs,
MulParam<OpTensor>& param, Context<NV>& ctx) {
this->_ctx = ctx;
return create(inputs, outputs, param, ctx);
}
virtual SaberStatus create(const std::vector<DataTensor_in *>& inputs,
std::vector<DataTensor_out *>& outputs,
MulParam<OpTensor>& param, Context<NV>& ctx) {
// set内部参数
}
virtual SaberStatus dispatch(const std::vector<DataTensor_in*>& inputs,
std::vector<DataTensor_out*>& outputs,
MulParam<OpTensor>& param) {
// dispatch kernel.
}
private:
};
}
}
```
```init```和```create```的区别:```init```接口是第一次初始化op的时候进入的接口,此函数只在第一次初始化op时调用,这个接口一般放一些只需要执行一次的代码,如malloc或者create之类的函数。```create```函数除了第一次init执行外,在输入发生变化或者param发生变化时会再次触发,create一般放置set函数,设置内部变量,当input发生变化时这里执行一些同input或weights直接相关的代码。但create因为触发位置在网络内,如果```create```函数执行了一些严重耗时的操作,这里会拖慢整个op的执行时间,需要慎重选择操作放置的位置。
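结合后文的单测代码,这几个接口的典型调用时序大致如下(示意):
```
// 第一次初始化: init()只执行一次, 其内部通常会调用create()
mul.init(input, output, param, SPECIFY, VENDER_IMPL, ctx1);
// 前向计算: 框架调用operator(), 内部进入dispatch()
mul(input, output, param, ctx1);
// 输入shape或param发生变化后: 再次调用时会先触发create(), 再执行dispatch()
param.alpha = 2.0;
mul(input, output, param, ctx1);
```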
### 添加framework的特化
涉及的文件:```framework/operators/mul.h```和```framework/operators/mul.cpp```。
这里简单介绍下如果添加或修改framework内的operator
```
#include "framework/core/base.h"
#include "framework/core/data_types.h"
#include "framework/core/operator/operator.h"
#include "utils/logger/logger.h"
#include "saber/funcs/mul.h" // 需要包对应的saber头文件
namespace anakin {
namespace ops {
template<typename Ttype, DataType Dtype, Precision Ptype>
class MulHelper;
template<typename Ttype, DataType Dtype, Precision Ptype>
class Mul : public Operator<Ttype, Dtype, Ptype> {
public:
Mul() {}
/// forward impl
virtual void operator() (OpContext<Ttype> &ctx,
const std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) {
LOG(ERROR) << "Not Impl Yet Operator power<TargetType:"<<"unknown"<<","
<<type_id<typename DataTypeWarpper<Dtype>::type>().type_info()<<">";
}
friend class MulHelper<Ttype, Dtype, Ptype>;
};
template<typename Ttype, DataType Dtype, Precision Ptype>
class MulHelper : public OperatorHelper<Ttype, Dtype, Ptype> {
public:
MulHelper() = default;
~MulHelper();
Status InitParam() override;
Status Init(OpContext<Ttype> &ctx,
const std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) override;
Status InferShape(const std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) override;
public:
saber::MulParam<Tensor4d<Ttype, Dtype>> _param_mul;
saber::Mul<Ttype, Dtype> _funcs_mul;
};
}
} /* namespace anakin */
```
对应的```.cpp```文件如下:
```
#include "framework/operators/mul.h"
namespace anakin {
namespace ops {
#ifdef USE_CUDA
template<>
void Mul<NV, AK_FLOAT, Precision::FP32>::operator()(
OpContext<NV>& ctx,
const std::vector<Tensor4dPtr<NV, AK_FLOAT> >& ins,
std::vector<Tensor4dPtr<NV, AK_FLOAT> >& outs) {
auto* impl =
static_cast<MulHelper<NV, AK_FLOAT, Precision::FP32>*>(this->_helper);
auto& param =
static_cast<MulHelper<NV, AK_FLOAT, Precision::FP32>*>(this->_helper)->_param_mul;
impl->_funcs_mul(ins, outs, param, ctx);
}
#endif
template<typename Ttype, DataType Dtype, Precision Ptype>
Status MulHelper<Ttype, Dtype, Ptype>::InitParam() {
auto alpha = GET_PARAMETER(float, alpha);
MulParam<Tensor4d<Ttype, Dtype>> param_mul(alpha);
_param_mul = param_mul;
return Status::OK();
}
template<typename Ttype, DataType Dtype, Precision Ptype>
Status MulHelper<Ttype, Dtype, Ptype>::Init(OpContext<Ttype>& ctx,
const std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) {
SABER_CHECK(_funcs_mul.init(ins, outs, _param_mul, SPECIFY, VENDER_IMPL, ctx));
return Status::OK();
}
template<typename Ttype, DataType Dtype, Precision Ptype>
Status MulHelper<Ttype, Dtype, Ptype>::InferShape(const
std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) {
SABER_CHECK(_funcs_mul.compute_output_shape(ins, outs, _param_mul));
return Status::OK();
}
#ifdef USE_CUDA
template class MulHelper<NV, AK_FLOAT, Precision::FP32>;
#endif
#ifdef USE_ARM_PLACE
template class MulHelper<ARM, AK_FLOAT, Precision::FP32>;
#endif
// register helper
#ifdef USE_CUDA
ANAKIN_REGISTER_OP_HELPER(Mul, MulHelper, NV, AK_FLOAT, Precision::FP32);
#endif
#ifdef USE_ARM_PLACE
ANAKIN_REGISTER_OP_HELPER(Mul, MulHelper, ARM, AK_FLOAT, Precision::FP32);
#endif
//! register op
ANAKIN_REGISTER_OP(Mul)
.Doc("Mul operator")
#ifdef USE_CUDA
.__alias__<NV, AK_FLOAT, Precision::FP32>("mul")
#endif
#ifdef USE_ARM_PLACE
.__alias__<ARM, AK_FLOAT, Precision::FP32>("mul")
#endif
.num_in(1)
.num_out(1)
.Args<float>("alpha", " alpha of Mul "); //注册
} /* namespace ops */
} /* namespace anakin */
```
## 实现单元测试
涉及的文件:```test/saber/xxx/test_saber_funcs_mul_xxx.cpp```
在对应的test下需要添加新的单元测试
```
TEST(TestSaberFuncNV, test_depthwise_conv) {
// init tensors and some param.
// start Reshape & doInfer
Context<NV> ctx1(0, 1, 1);
// create param
MulParam<Tensor<NV, AK_FLOAT, NCHW> > param(alpha);
std::vector<Tensor<NV, AK_FLOAT, NCHW>*> input;
std::vector<Tensor<NV, AK_FLOAT, NCHW>*> output;
// create saber op
Mul<NV, AK_FLOAT, AK_FLOAT, AK_FLOAT, NCHW> mul;
// compute output shape
mul.compute_output_shape(input, output, param);
// re_alloc output tensors memory based on output shape
output[0]->re_alloc(output[0]->shape());
// init saber op(calling init and create)
mul.init(input, output, param, SPECIFY, VENDER_IMPL, ctx1);
// call operator()
mul(input, output, param, ctx1);
// cuda specified, record events
cudaStream_t cuda_stream = ctx1.get_compute_stream();
output[0]->record_event(cuda_stream);
output[0]->sync();
// param changed
param.alpha = 2.0;
// auto calling saber op(create and dispatch)
mul(input, output, param, ctx1);
cudaDeviceSynchronize();
CUDA_CHECK(cudaPeekAtLastError());
}
int main(int argc, const char** argv){
anakin::saber::Env<NV>::env_init();
// initial logger
//logger::init(argv[0]);
InitTest();
RUN_ALL_TESTS(argv[0]);
return 0;
}
```
## 调试及注意事项
一个op需要有对外的op接口和内部实现。由于存在saber/funcs/impl的非特化版本声明,当有op在某种设备下没有对应实现时,也能够编译通过,但此时得到的是没有任何实现的空实现,调试时需要注意这种情况。
# 如何支持一个新的设备
## 概览
添加一个新的设备需要以下3个步骤:
* [在`CMakeList`中添加设备的支持](#0001)
* [在`saber`中添加设备的实现](#0002)
* [在`framework`中添加设备的具体化或实例化](#0003)
假设新设备的名称为`TNEW`, 以下将以这个设备名称进行演示。
## <span id = '0001'> 在`CMakeList`中添加设备的支持 </span> ##
* 修改根目录`CMakeList.txt`
```cmake
#select the plantform to build
anakin_option(USE_GPU_PLACE "Select the build mode for GPU place." NO)
anakin_option(USE_X86_PLACE "Select the build mode for X86 place." NO)
anakin_option(USE_ARM_PLACE "Select the build mode for ARM place." NO)
anakin_option(USE_TNEW_PLACE "Select the build mode for TNEW place." YES)
```
* 修改`saber/CMakeList.txt`
根据新增设备的目录完善`saber`目录下的`CMakeList.txt`
```cmake
if(USE_TNEW_PLACE)
anakin_fetch_files_with_suffix(${ANAKIN_SABER}/core/impl/tnew "cpp" ANAKIN_SABER_BASE_SRC)
anakin_fetch_files_with_suffix(${ANAKIN_SABER}/funcs/impl/tnew "cpp" ANAKIN_SABER_BASE_SRC)
endif()
```
* 修改`test/CMakeList.txt`
新增设备的单测文件放在`test/saber/tnew`目录下,修改`test`目录下的`CMakeList.txt`
```cmake
if(USE_TNEW_PLACE)
anakin_fetch_files_with_suffix(${ANAKIN_UNIT_TEST}/saber/tnew "cpp" ANAKIN_TEST_CASE_SRC)
endif()
```
* 修改`cmake/anakin_config.h.in`
```c++
// plantform to use
#cmakedefine USE_GPU_PLACE
#cmakedefine USE_X86_PLACE
#cmakedefine USE_ARM_PLACE
#cmakedefine USE_TNEW_PLACE
```
* 其他依赖和编译选项
修改`cmake`目录下的`compiler_options.cmake`和`find_modules.cmake`
## <span id = '0002'> 在`saber`中添加设备的实现 </span> ##
`saber`是`Anakin`的基础计算库,对外提供设备无关的统一的API,设备相关的实现都会封装到`TargetWrapper`中。
### 在`saber/saber_types.h`中添加设备
```c++
enum TargetTypeEnum {
eINVALID = -1,
eNV = 1,
eAMD = 2,
eARM = 3,
eX86 = 4,
eNVHX86 = 5,
eTNEW = 6
};
typedef TargetType<eNV> NV;
typedef TargetType<eARM> ARM;
typedef TargetType<eAMD> AMD;
typedef TargetType<eX86> X86;
typedef TargetType<eTNEW> TNEW;
```
### 在`saber/core`中添加设备的实现
1.`target_traits.h`中添加新设备
* 增加设备类型
```c++
struct __cuda_device{};
struct __arm_device{};
struct __amd_device{};
struct __x86_device{};
struct __tnew_device{};
```
* `TargetTypeTraits`模板具体化
```c++
template <>
struct TargetTypeTraits<TNEW> {
typedef __xxx_target target_category;//根据实际设备是host端还是device端进行选择
typedef __tnew_device target_type;
};
```
2.`data_traits.h`中特化`DataTrait`模板类
如果设备需要特殊的数据类型,则特化出设备的`DataTrait`类的实现,例如opencl数据类型的实现如下:
```c++
#ifdef USE_OPENCL
struct ClMem{
ClMem(){
dmem = nullptr;
offset = 0;
}
ClMem(cl_mem* mem_in, int offset_in = 0) {
dmem = mem_in;
offset = offset_in;
}
ClMem(ClMem& right) {
dmem = right.dmem;
offset = right.offset;
}
ClMem& operator=(ClMem& right) {
this->dmem = right.dmem;
this->offset = right.offset;
return *this;
}
ClMem& operator+(int offset_in) {
this->offset += offset_in;
return *this;
}
int offset{0};
cl_mem* dmem;
};
template <>
struct DataTrait<AMD, AK_FLOAT> {
typedef ClMem Dtype;
typedef float dtype;
};
template <>
struct DataTrait<AMD, AK_DOUBLE> {
typedef ClMem Dtype;
typedef double dtype;
};
template <>
struct DataTrait<AMD, AK_INT8> {
typedef ClMem Dtype;
typedef char dtype;
};
#endif //use_opencl
```
3.`target_wrapper.h`中特化`TargetWrapper`模板类
特化`TargetWrapper`模板类,在`target_wrapper.h`中声明函数,具体如下:
```c++
template <>
struct TargetWrapper<TNEW, __xxx_target> { //根据TNEW的具体类型修改__xxx_target,__host_target或者__device_target
typedef xxx_event event_t; //根据设备实现xxx_event
typedef xxx_stream stream_t; //根据设备实现xxx_stream
static void get_device_count(int& count);
static void set_device(int id);
//We should add strategy to avoid malloc directly
static void mem_alloc(void** ptr, size_t n);
static void mem_free(void* ptr);
static void mem_set(void* ptr, int value, size_t n);
static void create_event(event_t& event, bool flag = false);
static void create_stream(stream_t& stream);
static void create_stream_with_flag(stream_t& stream, unsigned int flag);
static void create_stream_with_priority(stream_t& stream, unsigned int flag, int priority);
static void destroy_stream(stream_t& stream);
static void destroy_event(event_t& event);
static void record_event(event_t& event, stream_t stream);
static void query_event(event_t& event);
static void sync_event(event_t& event);
static void sync_stream(event_t& event, stream_t& stream);
static void sync_memcpy(void* dst, int dst_id, const void* src, int src_id, \
size_t count, __DtoD);
static void async_memcpy(void* dst, int dst_id, const void* src, int src_id, \
size_t count, stream_t& stream, __DtoD);
static void sync_memcpy(void* dst, int dst_id, const void* src, int src_id, \
size_t count, __HtoD);
static void async_memcpy(void* dst, int dst_id, const void* src, int src_id, \
size_t count, stream_t& stream, __HtoD);
static void sync_memcpy(void* dst, int dst_id, const void* src, int src_id, \
size_t count, __DtoH);
static void async_memcpy(void* dst, int dst_id, const void* src, int src_id, \
size_t count, stream_t& stream, __DtoH);
static void sync_memcpy_p2p(void* dst, int dst_dev, const void* src, \
int src_dev, size_t count);
static void async_memcpy_p2p(void* dst, int dst_dev, const void* src, \
int src_dev, size_t count, stream_t& stream);
static int get_device_id();
};
```
4.`impl/`目录下添加设备目录和实现
`saber/core/impl`目录下添加设备目录`tnew`
* 实现`TargetWrapper<TNEW, __xxx_target>`结构体中各函数的定义。
如果`TargetWrapper<TNEW, __xxx_target>`的实现与默认的模板类一致,则不用特化出该类。
```c++
typedef TargetWrapper<TNEW, __xxx_target> TNEW_API;
void TNEW_API::get_device_count(int &count) {
// add implementation
}
void TNEW_API::set_device(int id){
// add implementation
}
void TNEW_API::mem_alloc(void** ptr, size_t n){
// add implementation
}
void TNEW_API::mem_free(void* ptr){
if(ptr != nullptr){
// add implementation
}
}
...
```
* 特化实现`device.h`中的`Device<TNEW>`
```c++
template <>
void Device<TNEW>::create_stream() {
// add implementation
}
template <>
void Device<TNEW>::get_info() {
// add implementation
}
```
### 在`saber/funcs`中实现设备相关的op
参考[如何增加新的Operator](./how_to_add_anakin_op.html)
## <span id = '0003'> 在`framework`中添加设备的具体化或实例化 </span> ##
### `framework/core`
* `net.cpp`中添加实例化
```c++
#ifdef USE_TNEW_PLACE
template class Net<TNEW, AK_FLOAT, Precision::FP32, OpRunType::ASYNC>;
template class Net<TNEW, AK_FLOAT, Precision::FP32, OpRunType::SYNC>;
#endif
```
* `operator_func.cpp`中添加实例化
```c++
#ifdef USE_TNEW_PLACE
template class OperatorFunc<TNEW, AK_FLOAT, Precision::FP32>;
#endif
```
* `worker.cpp`中添加实例化
```c++
#ifdef USE_TNEW_PLACE
template class Worker<TNEW, AK_FLOAT, Precision::FP32, OpRunType::ASYNC>;
template class Worker<TNEW, AK_FLOAT, Precision::FP32, OpRunType::SYNC>;
#endif
```
* `operator_attr.cpp`中添加实例化
```c++
template
OpAttrWarpper& OpAttrWarpper::__alias__<TNEW, AK_FLOAT, Precision::FP32>(const std::string& op_name);
template
OpAttrWarpper& OpAttrWarpper::__alias__<TNEW, AK_FLOAT, Precision::FP16>(const std::string& op_name);
template
OpAttrWarpper& OpAttrWarpper::__alias__<TNEW, AK_FLOAT, Precision::INT8>(const std::string& op_name);
```
* `parameter.h`中添加设备的实现
```c++
#ifdef USE_TNEW_PLACE
template<typename Dtype>
class PBlock<Dtype, TNEW> {
public:
typedef Tensor4d<TNEW, DataTypeRecover<Dtype>::type> type;
PBlock() {
_inner_tensor = std::make_shared<type>();
}
...
}
#endif //TNEW
```
* `type_traits_extend.h`中添加设备的实现
```c++
template<>
struct target_host<saber::TNEW> {
typedef saber::X86 type; //根据TNEW选择正确的host type
};
```
### `framework/graph`
* `graph.cpp`中添加实例化
```c++
#ifdef USE_TNEW_PLACE
template class Graph<TNEW, AK_FLOAT, Precision::FP32>;
template class Graph<TNEW, AK_FLOAT, Precision::FP16>;
template class Graph<TNEW, AK_FLOAT, Precision::INT8>;
#endif
```
### `framework/model_parser`
* `parser.cpp`中添加实例化
```c++
#ifdef USE_TNEW_PLACE
template
Status load<TNEW, AK_FLOAT, Precision::FP32>(graph::Graph<TNEW, AK_FLOAT, Precision::FP32>* graph,
const char* model_path);
template
Status load<TNEW, AK_FLOAT, Precision::FP16>(graph::Graph<TNEW, AK_FLOAT, Precision::FP16>* graph,
const char* model_path);
template
Status load<TNEW, AK_FLOAT, Precision::INT8>(graph::Graph<TNEW, AK_FLOAT, Precision::INT8>* graph,
const char* model_path);
template
Status save<TNEW, AK_FLOAT, Precision::FP32>(graph::Graph<TNEW, AK_FLOAT, Precision::FP32>* graph,
std::string& model_path);
template
Status save<TNEW, AK_FLOAT, Precision::FP16>(graph::Graph<TNEW, AK_FLOAT, Precision::FP16>* graph,
std::string& model_path);
template
Status save<TNEW, AK_FLOAT, Precision::INT8>(graph::Graph<TNEW, AK_FLOAT, Precision::INT8>* graph,
std::string& model_path);
template
Status load<TNEW, AK_FLOAT, Precision::FP32>(graph::Graph<TNEW, AK_FLOAT, Precision::FP32>* graph,
std::string& model_path);
template
Status load<TNEW, AK_FLOAT, Precision::FP16>(graph::Graph<TNEW, AK_FLOAT, Precision::FP16>* graph,
std::string& model_path);
template
Status load<TNEW, AK_FLOAT, Precision::INT8>(graph::Graph<TNEW, AK_FLOAT, Precision::INT8>* graph,
std::string& model_path);
template
Status save<TNEW, AK_FLOAT, Precision::FP32>(graph::Graph<TNEW, AK_FLOAT, Precision::FP32>* graph,
const char* model_path);
template
Status save<TNEW, AK_FLOAT, Precision::FP16>(graph::Graph<TNEW, AK_FLOAT, Precision::FP16>* graph,
const char* model_path);
template
Status save<TNEW, AK_FLOAT, Precision::INT8>(graph::Graph<TNEW, AK_FLOAT, Precision::INT8>* graph,
const char* model_path);
#endif
```
* `model_io.cpp`中添加实例化
```c++
#ifdef USE_TNEW_PLACE
template class NodeIO<TNEW, AK_FLOAT, Precision::FP32>;
template class NodeIO<TNEW, AK_FLOAT, Precision::FP16>;
template class NodeIO<TNEW, AK_FLOAT, Precision::INT8>;
#endif
```
### `framework/operators`
`framework/operators`目录下所有op添加实例化或具体化
`activation.cpp`为例,实例化如下:
```c++
#ifdef USE_TNEW_PLACE
INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::FP32);
INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::FP16);
INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::INT8);
template class ActivationHelper<TNEW, AK_FLOAT, Precision::FP32>;
ANAKIN_REGISTER_OP_HELPER(Activation, ActivationHelper, TNEW, AK_FLOAT, Precision::FP32);
#endif
```
如果TNEW设备函数的实现与现有模板实现不一致,可以特化实现如下(以init()为例):
```c++
#ifdef USE_TNEW_PLACE
INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::FP32);
INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::FP16);
INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::INT8);
template <>
Status ActivationHelper<TNEW, AK_FLOAT, Precision::FP32>::Init(OpContext<TNEW> &ctx,\
const std::vector<Tensor4dPtr<TNEW, AK_FLOAT> >& ins, \
std::vector<Tensor4dPtr<TNEW, AK_FLOAT> >& outs) {
SABER_CHECK(_funcs_activation.init(ins, outs, _param_activation, SPECIFY, SABER_IMPL, ctx)); //在这里选择实现方式
return Status::OK();
}
ANAKIN_REGISTER_OP_HELPER(Activation, ActivationHelper, TNEW, AK_FLOAT, Precision::FP32);
#endif
```
`ANAKIN_REGISTER_OP(Activation)`中添加TNEW的注册
```c++
#ifdef USE_TNEW_PLACE
.__alias__<TNEW, AK_FLOAT, Precision::FP32>("activation")
#endif
```
## 注意事项
不要修改`Tensor`/`Buffer`/`Env`/`Context`这些类函数的接口和实现
Anakin 预测引擎
#######################
使用文档
~~~~~~~
.. toctree::
:maxdepth: 1
install_anakin.md
convert_paddle_to_anakin.md
anakin_tutorial.md
anakin_run_on_arm.md
anakin_example.md
int8_design_anakin.md
anakin_gpu_benchmark.md
anakin_arm_benchmark.md
开发文档
~~~~~~~
.. toctree::
:maxdepth: 1
how_to_add_anakin_op.md
how_to_support_new_device_in_anakin.md
anakin_parser_design.md
## 源码编译安装Anakin ##
我们已经在CentOS 7.3上成功的安装和测试了Anakin,对于其他操作系统,我们将很快支持。
### 安装概览 ###
* [在CentOS上安装 Anakin]()
* [在Ubuntu上安装 Anakin]()
* [在ARM上安装 Anakin](./anakin_run_on_arm.html)
* [验证安装]()
### 在CentOS上安装 Anakin ###
#### 1. 系统要求 ####
* make 3.82+
* cmake 2.8.12+
* gcc 4.8.2+
* g++ 4.8.2+
#### 2. 编译CPU版Anakin ####
暂时不支持
#### 3. 编译支持NVIDIA GPU的Anakin ####
- 3.1. 安装依赖
- 3.1.1 protobuf
```
> git clone https://github.com/google/protobuf
> cd protobuf
> git submodule update --init --recursive
> ./autogen.sh
> ./configure --prefix=/path/to/your/insall_dir
> make
> make check
> make install
> sudo ldconfig
```
如安装protobuf遇到任何问题,请访问[这里](https://github.com/google/protobuf/blob/master/src/README.md)
- 3.2 CUDA Toolkit
- [CUDA 8.0](https://developer.nvidia.com/cuda-zone) or higher, 具体信息参见[NVIDIA's documentation](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
- [cuDNN v7](https://developer.nvidia.com/cudnn), 具体信息参见[NVIDIA's documentation](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
- 3.3 编译Anakin
```
> git clone https:/xxxxx
> cd anakin
> mkdir build
> cd build
> cmake ..
> make
```
#### 4. 编译支持AMD GPU的Anakin ####
暂时还不支持
### 在Ubuntu上安装 Anakin ###
暂时还不支持
### 在ARM上安装 Anakin ###
请参考[ARM安装文档](./anakin_run_on_arm.html)
### 验证安装 ###
安装完成后,如果没有报错信息,你可以通过运行 `output/unit_test`路径下的单测示例验证是否编译成功。
## ARM 源码编译 Anakin ##
目前Anakin支持ARM Android平台,采用Android NDK交叉编译工具链,已在mac os和centos上编译和测试通过。
### 安装概览 ###
* [系统需求](#0001)
* [安装第三方依赖](#0002)
* [Anakin源码编译](#0003)
* [验证安装](#0004)
### <span id = '0001'> 1. 系统需求 </span> ###
* 宿主机: linux, mac
* cmake 3.8.2+
* Android NDK r14, Linux 版本[从这里下载](https://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip)
### <span id = '0002'> 2. 安装第三方依赖 </span> ###
- 2.1 protobuf3.4.0
源码从这里[下载](https://github.com/google/protobuf/releases/tag/v3.4.0)
- 2.1.1 为宿主机编译protobuf
```bash
$ tar -xzf protobuf-3.4.0.tar.gz
$ cd protobuf-3.4.0
$ ./autogen.sh
$ ./configure
$ make
$ make check
$ make install
```
上述 $make install 执行后,可在 /usr/local/include/google 找到 libprotobuf 所需的头文件,将整个google文件夹拷贝至Anakin/third-party/arm-android/protobuf/下,然后执行下面的命令清除已经生成的文件。
如有问题,请参考[这里](https://github.com/google/protobuf/blob/v3.4.0/src/README.md)
```bash
$ make distclean
```
- 2.1.2 交叉编译Android`armeabi-v7a`的protobuf,注意设置ANDROID_NDK的路径,以及ARCH_ABI、HOSTOSN的值
```bash
$ export ANDROID_NDK=your_ndk_path
$ ARCH_ABI="arm-linux-androideabi-4.9"
$ HOSTOSN="darwin-x86_64"
$ export SYSROOT=$ANDROID_NDK/platforms/android-9/arch-arm
$ export PREBUILT=$ANDROID_NDK/toolchains/$ARCH_ABI
$ export LDFLAGS="--sysroot=$SYSROOT"
$ export LD="$ANDROID_NDK/toolchains/$ARCH_ABI/prebuilt/$HOSTOSN/arm-linux-androideabi/bin/ld $LDFLAGS"
$ export LIBS="-llog $ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/libs/armeabi-v7a/libgnustl_static.a"
$ export CPPFLAGS=""
$ export INCLUDES="-I$ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/include/ -I$ANDROID_NDK/platforms/android-9/arch-arm/usr/include/ -I$ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/libs/armeabi-v7a/include/"
$ export CXXFLAGS="-march=armv7-a -mfloat-abi=softfp -DGOOGLE_PROTOBUF_NO_RTTI --sysroot=$SYSROOT"
$ export CCFLAGS="$CXXFLAGS"
$ export CXX="$PREBUILT/prebuilt/$HOSTOSN/bin/arm-linux-androideabi-g++ $CXXFLAGS"
$ export CC="$CXX"
$ export RANLIB="$ANDROID_NDK/toolchains/$ARCH_ABI/prebuilt/$HOSTOSN/bin/arm-linux-androideabi-ranlib"
$ ./autogen.sh
$ ./configure --host=arm-linux-androideabi --with-sysroot=$SYSROOT --enable-cross-compile --with-protoc=protoc --disable-shared CXX="$CXX" CC="$CC" LD="$LD"
$ make
```
编译生成 *.a 静态库,若希望编译*.so 动态链接库 ,请在./configure参数中改--disable-shared为--disable-static --enable-shared。
生成文件在src/.libs/下,将生成的文件拷贝至Anakin/third-party/arm-android/protobuf/lib下。
[cmake](../../cmake/find_modules.cmake)中更新`ARM_RPOTO_ROOT`的路径。
```cmake
set(ARM_RPOTO_ROOT "${CMAKE_SOURCE_DIR}/third-party/arm-android/protobuf")
```
- 2.2 opencv 2.4.3+(optional)
Anakin只在examples示例中使用opencv
Android系统的opencv从[这里下载](https://opencv.org/releases.html)
解压后将 `3rdparty/libs/armeabi-v7a`中的库文件拷贝到`libs/armeabi-v7a`
[cmake](../../cmake/find_modules.cmake)中搜索`anakin_find_opencv`,
并设置 `include_directories` 和 `LINK_DIRECTORIES` 为自己安装的库的路径。
```cmake
include_directories(${CMAKE_SOURCE_DIR}/third-party/arm-android/opencv/sdk/native/jni/include/)
LINK_DIRECTORIES(${CMAKE_SOURCE_DIR}/third-party/arm-android/opencv/sdk/native/libs/armeabi-v7a/)
```
### <span id = '0003'> 3. Anakin源码编译 </span> ###
#### 编译Android版本
克隆[源码](https://github.com/PaddlePaddle/Anakin/tree/arm)
```bash
cd your_dir
git clone https://github.com/PaddlePaddle/Anakin.git
cd Anakin
git fetch origin arm
git checkout arm
```
修改`android_build.sh`
- 修改NDK路径
```bash
#modify "your_ndk_path" to your NDK path
export ANDROID_NDK=your_ndk_path
```
- 修改ARM 处理器架构
对于32位ARM处理器, 将ANDROID_ABI 设置为 `armeabi-v7a with NEON`
对于64位ARM处理器, 可以将ANDROID_ABI 设置为 `armeabi-v7a with NEON`或者`arm64-v8a`
目前我们只支持 `armeabi-v7a with NEON`,`arm64-v8a` 还在开发中。
```bash
-DANDROID_ABI="armeabi-v7a with NEON"
```
- 设置Android API
根据Android系统的版本设置API level, 例如API Level 21 -> Android 5.0.1
```bash
-DANDROID_NATIVE_API_LEVEL=21
```
- 选择编译静态库或动态库
设置`BUILD_SHARED=NO`编译静态库
设置`BUILD_SHARED=YES`编译动态库
```bash
-DBUILD_SHARED=NO
```
- OpenMP多线程支持
设置`USE_OPENMP=YES`开启OpenMP多线程
```bash
-DUSE_OPENMP=YES
```
- 编译单测文件
设置`BUILD_WITH_UNIT_TEST=YES`将会编译单测文件
```bash
-DBUILD_WITH_UNIT_TEST=YES
```
- 编译示例文件
设置`BUILD_EXAMPLES=YES`将会编译示例文件
```bash
-DBUILD_EXAMPLES=YES
```
- 开启opencv
如果使用opencv,设置`USE_OPENCV=YES`
```bash
-DUSE_OPENCV=YES
```
- 开始编译
运行脚本 `android_build.sh` 将自动编译Anakin
```bash
./android_build.sh
```
### <span id = '0004'> 4. 验证安装 </span> ###
编译好的库会放在目录`${Anakin_root}/output`
编译好的单测文件会放在`${Anakin_root}/output/unit_test`目录下
编译好的示例文件会放在`${Anakin_root}/output/examples`目录下
对于Android系统,打开设备的调试模式,通过ADB可以访问的目录是`data/local/tmp`,通过ADB push将测试文件、模型和数据发送到设备目录,运行测试文件。
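下面是一个通过 ADB 上传并运行测试的示意(文件与模型名仅作占位,请替换为实际编译生成的文件):
```bash
# 将单测可执行文件和模型上传到设备的可访问目录
adb push output/unit_test/your_test /data/local/tmp/
adb push your_model.anakin.bin /data/local/tmp/
# 在设备上运行测试
adb shell "cd /data/local/tmp && ./your_test"
```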
#################
如何进行基准测试
#################
本文介绍如何给深度学习框架做基准测试。基准测试主要包含验证模型的精度和性能两方面,下文包含搭建测试环境,选择基准测试模型,验证测试结果等几方面内容。
验证深度学习框架,可分为训练和测试两个阶段, 验证指标略有不同,本文只介绍训练阶段的指标验证。训练阶段关注的是模型训练集上的精度,训练集是完备的,因此关注大batch\_size下的训练速度,关注吞吐量,例如图像模型常用的batch\_size=128, 多卡情况下会加大;预测阶段关注的是在测试集上的精度,线上服务测试数据不能提前收集,因此关注小batch\_size下的预测速度,关注延迟,例如预测服务常用的batch\_size=1, 4等。
`Fluid <https://github.com/PaddlePaddle/Paddle>`__ 是PaddlePaddle从0.11.0版本开始引入的设计,本文的基准测试在该版本上完成。
环境搭建
""""""""""""
基准测试中模型精度和硬件、框架无关,由模型结构和数据共同决定;性能方面由测试硬件和框架性能决定。框架基准测试为了对比框架之间的差异,需控制硬件环境、系统库等版本一致。下文中的对比实验都在相同的硬件条件和系统环境下进行。
不同架构的GPU卡性能差异巨大,在验证模型在GPU上训练性能时,可使用NVIDIA提供的工具 :code:`nvidia-smi` 检验当前使用的GPU型号;如果测试多卡训练性能,需确认硬件连接是 `nvlink <https://zh.wikipedia.org/zh/NVLink>`__ 或 `PCIe <https://zh.wikipedia.org/zh-hans/PCI_Express>`__ 。同样地,CPU型号会极大影响模型在CPU上的训练性能,可读取 :code:`/proc/cpuinfo` 中的参数,确认当前正在使用的CPU型号。
下载GPU对应的Cuda Tool Kit和 Cudnn,或者使用NVIDIA官方发布的nvidia-docker镜像 `nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`__, 镜像内包含了Cuda和Cudnn,本文采用这种方式。 Cuda Tool Kit包含了GPU代码使用到的基础库,影响在此基础上编译出的Fluid二进制运行性能。
准备好Cuda环境后,从github上下载Paddle源码并编译,会生成最适合当前GPU的 `sm_arch <https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html>`__ 二进制。另外,cudnn对卷积类任务影响巨大,在基准测试中需要小版本一致,例如Cudnn7.0.2与Cudnn7.1.4在Resnet上有5%以上差异。
选择基准模型
""""""""""""
对框架做基准测试,需要覆盖不同训练任务和不同大小的模型,本文中选取了图像和NLP的最为常用的5个模型。
============ ============ ================= ============
任务种类 模型名称 网络结构 数据集
============ ============ ================= ============
图像分类 mnist Lenet mnist
图像分类 VGG VGG-16 Flowers102
图像分类 Resnet Resnet-50 Flowers102
文本分类 Stacked-LSTM Stacked-LSTM IMDB
机器翻译 seq2seq Stacked-LSTM wmt14
============ ============ ================= ============
其中mnist, VGG, Resnet属于CNN模型, stacked-lstm, seq2seq代表RNN模型。
`benchmark <https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/fluid>`__
基准模型测试脚本中,均跳过了前几个batch的训练过程,原因是加载数据和分配显存受系统当前运行情况影响,会导致统计性能不准确。运行完若干个轮次后,统计对应指标。
基准模型的数据的选择方面,数据量大且验证效果多的公开数据集为首选。图像模型VGG和resnet, 本文选择了 `flowers102 <http://www.robots.ox.ac.uk/~vgg/data/flowers/102/>`__ ,图像大小预处理为和Imagenet相同大小,因此性能可直接对比
NLP模型的公开且影响力大数据集较少,seq2seq模型选择了wmt14数据,stacked-lstm模型中选择了 `imdb <https://www.imdb.com/interfaces/>`__ 数据。
注意,图像模型每条样本大小相同,图像经过变换后大小一致,因此经过的计算路径基本相同,计算速度和显存占用波动较小,可以从若干个batch的数据中采样得到当前的训练性能数据。而NLP模型由于样本长度不定,计算路径和显存占用也不相同,因此只能完整运行若干个轮次后,统计速度和显存消耗。
显存分配是特别耗时的操作,因此Fluid默认会占用所有可用显存空间形成显存池,用以加速计算过程中的显存分配。如果需要统计模型真实显存消耗,可设置环境变量`FLAGS_fraction_of_gpu_memory_to_use=0.0`,观察最大显存开销。
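例如(仅作示意,train.py 代表您自己的训练脚本):

.. code-block:: bash

   FLAGS_fraction_of_gpu_memory_to_use=0.0 python train.py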
测试过程
""""""""""""
- CPU 单机单线程测试
测试CPU上单线程的性能,先设置CUDA的环境变量为空 ``CUDA_VISIBLE_DEVICES=``,并通过环境变量关闭OpenMP和MKL的多线程: ``OMP_NUM_THREADS=1``、``MKL_NUM_THREADS=1``。
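例如,可以在运行测试前这样设置(仅作示意):

.. code-block:: bash

   export CUDA_VISIBLE_DEVICES=
   export OMP_NUM_THREADS=1
   export MKL_NUM_THREADS=1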
然后代码中设置为使用CPUPlace,如果使用Paddle代码库中的脚本,只需要命令行参数传入 use_gpu=False即可。
.. code-block:: python
>>> import paddle.fluid as fluid
>>> place = fluid.CPUPlace()
.. code:: bash
docker run -it --name CASE_NAME --security-opt seccomp=unconfined -v $PWD/benchmark:/benchmark paddlepaddle/paddle:latest-dev /bin/bash
- GPU 单机单卡测试
本教程使用了Cuda8, Cudnn7.0.1,镜像来源为 :code:`nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04` 。
.. code:: bash
nvidia-docker run -it --name CASE_NAME --security-opt seccomp=unconfined -v $PWD/benchmark:/benchmark -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu paddlepaddle/paddle:latest-dev /bin/bash
在单卡上测试,设置CUDA的环境变量使用一块GPU,``CUDA_VISIBLE_DEVICES=0``
然后代码中设置为使用CUDAPlace,如果使用Paddle代码库中的脚本,只需要命令行参数传入 use_gpu=True即可。
.. code-block:: python
>>> import paddle.fluid as fluid
>>> place = fluid.CUDAPlace(0)  # 0 指第0块GPU
测试结果
""""""""""""
本教程对比相同环境下的Fluid0.12.0和TensorFlow1.4.0的性能表现。
硬件环境为 CPU: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, GPU: TITAN X(Pascal) 12G x 1, Nvidia-Driver 384.90。
系统环境为Ubuntu 16.04.3 LTS, 本文中采用了docker环境,系统版本为nvidia-docker17.05.0-ce。
测试的Fluid版本为\ `v.0.12.0 <https://github.com/PaddlePaddle/Paddle/releases/tag/v.0.12.0>`__ 。
TensorFlow版本为\ `v.1.4.0-rc1 <https://github.com/tensorflow/tensorflow/tree/v1.4.0-rc1>`__ 。
使用的脚本和配置见\ `benchmark <https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/fluid>`__ 。
图表中统计单位为samples/秒。
- CPU 单机单线程测试结果
================ ==================== ===================
Speed Fluid CPU TensorFlow CPU
================ ==================== ===================
mnist 1298.75 samples/s 637.57 samples/s
VGG-16 0.4147 images/s 0.1229 images/s
Resnet-50 1.6935 images/s 0.3657 images/s
Stacked-LSTM 472.3225 words/s 48.2293 words/s
Seq2Seq 217.1655 words/s 28.6164 words/s
================ ==================== ===================
- GPU 单机单卡测试结果
=============== ===================== =================
Speed Fluid GPU TensorFlow GPU
=============== ===================== =================
mnist 19710.90 samples/s 15576.3 samples/s
VGG-16 59.83327 images/s 40.9967 images/s
Resnet-50 105.84412 images/s 97.8923 images/s
Stacked-LSTM 1319.99315 words/s 1608.2526 words/s
Seq2Seq 7147.89081 words/s 6845.1161 words/s
=============== ===================== =================
============
GPU性能调优
============
.. contents::
此教程将向您分步介绍如何使用内置的定时工具、 **nvprof** 或 **nvvp** 来运行性能分析和调优。
- 什么是性能分析?
- 为什么需要性能分析?
- 如何进行性能分析?
- 性能分析工具介绍
- 详细教程
- 性能分析小技巧
什么是性能分析?
================
在软件工程的范畴里,性能分析(Profiling)是一个动态程序分析的术语,它可以指测量一个程序的空间(内存)复杂度或时间复杂度,
也可以说是某些特定指令的使用情况,或者是函数调用的频率和耗时等。通常情况下,分析得到的信息用于协助进行程序的优化。
简单来说,性能分析工具是用于给应用程序的性能做定量分析的。如果想很好的理解程序的行为,那程序分析工具是必不可少的利器。简单的性能分析,可以告诉您某个操作到底花了多长时间?而更深入的分析,甚至能解释为什么某个操作花了很长时间?
为什么需要性能分析?
============================
训练好一个深层神经网络通常要耗费非常长的时间,所以性能也就逐步变成了深度学习领域最重要的指标。
而优化性能的首要任务,是需要了解哪些步骤拖慢了整体。
如果某一块根本就不怎么耗时,那也就不需要急着优化性能啦!
如何进行性能分析?
========================
为了达到性能最优,您可以采用下面五个步骤:
- 对代码进行性能分析
- 找到运行慢的部分
- 找到运行慢的原因
- 修改成更快的版本
- 再次对代码进行性能分析
通常情况下,处理器有两个关键性能限制:一个是浮点计算量,另一个是内存操作量。
GPU则还需要高并行性,才能发挥其全部能力。这正是它们速度快的原因。
性能分析工具介绍
======================
就通常的GPU性能分析来说,市面上已经有NVIDIA或第三方提供的众多工具。
**nvprof** 是Nvidia性能分析工具, **nvvp** 则是带GUI的Nvidia可视化性能分析工具。
在这个教程中,我们主要会介绍nvprof和nvvp。
:code:`paddle/legacy/math/tests` 目录中的 :code:`test_GpuProfiler` 就是用于展示上述分析工具的用法。
.. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp
:language: c++
:lines: 137-151
:linenos:
上述的代码片段包含了两种方法,您可以任意使用一个或两个来对感兴趣的代码段做性能分析。
1. :code:`REGISTER_TIMER_INFO` 是一个内置的定时器封装,可以用来计算CPU函数或cuda内核的时间消耗。
2. :code:`REGISTER_GPU_PROFILER` 是 :code:`cudaProfilerStart` 和 :code:`cudaProfilerStop` 的通用封装对象,其内部实现可以避免纯CPU版本的PaddlePaddle在调用它们时发生崩溃。
您会在接下来的部分中获得更多的细节介绍。
详细教程
============
内置定时器
------------
如果想要启用PaddlePaddle的内置定时器,您首先需要在相关代码段中加入 :code:`REGISTER_TIMER_INFO`。
接下来就可以使用 :code:`printStatus` 或者 :code:`printAllStatus` 函数来将信息输出到界面中。
下面举个简单的例子:
1. 加入 :code:`REGISTER_TIMER_INFO` 和 :code:`printAllStatus` 函数(如高亮部分)。
.. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp
:language: c++
:lines: 137-151
:emphasize-lines: 8-12,14
:linenos:
2. cmake配置中将 **WITH_TIMER** 打开,重新编译PaddlePaddle。
.. code-block:: bash
cmake .. -DWITH_TIMER=ON
make
3. 执行您的代码,并观察结果(如高亮部分)。
.. code-block:: bash
:emphasize-lines: 1,12-15
> ./paddle/legacy/math/tests/test_GpuProfiler
I1117 11:13:42.313065 2522362816 Util.cpp:155] commandline: ./paddle/legacy/math/tests/test_GpuProfiler
I1117 11:13:42.845065 2522362816 Util.cpp:130] Calling runInitFunctions
I1117 11:13:42.845208 2522362816 Util.cpp:143] Call runInitFunctions done.
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Profiler
[ RUN ] Profiler.BilinearFwdBwd
I1117 11:13:42.845310 2522362816 test_GpuProfiler.cpp:114] Enable GPU Profiler Stat: [testBilinearFwdBwd] "numSamples = 10, channels = 16, im
gSizeX = 64, imgSizeY = 64"
I1117 11:13:42.850154 2522362816 ThreadLocal.cpp:37] thread use undeterministic rand seed:20659751
I1117 11:13:42.981501 2522362816 Stat.cpp:130] ======= StatSet: [GlobalStatInfo] status ======
I1117 11:13:42.981539 2522362816 Stat.cpp:133] Stat=testBilinearFwdBwd total=136.141 avg=136.141 max=136.141 min=136.141 count=1
I1117 11:13:42.981572 2522362816 Stat.cpp:141] ======= BarrierStatSet status ======
I1117 11:13:42.981575 2522362816 Stat.cpp:154] --------------------------------------------------
[ OK ] Profiler.BilinearFwdBwd (136 ms)
[----------] 1 test from Profiler (136 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (136 ms total)
[ PASSED ] 1 test.
nvprof 工具
----------------
要使用命令行分析工具 **nvprof**,您只需按如下步骤操作:
1. 将 :code:`REGISTER_GPU_PROFILER` 函数加到代码中(参考强调部分)。
.. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp
:language: c++
:lines: 137-151
:emphasize-lines: 6-7
:linenos:
2. cmake中将 **WITH_PROFILER** 配置打开,重新编译PaddlePaddle。
.. code-block:: bash
cmake .. -DWITH_PROFILER=ON
make
3. 使用 **nvprof** 来分析执行文件。
.. code-block:: bash
nvprof ./paddle/legacy/math/tests/test_GpuProfiler
然后,您就能获得如下的分析结果:
.. code-block:: bash
==78544== Profiling application: ./paddle/legacy/math/tests/test_GpuProfiler
==78544== Profiling result:
Time(%) Time Calls Avg Min Max Name
27.60% 9.6305ms 5 1.9261ms 3.4560us 6.4035ms [CUDA memcpy HtoD]
26.07% 9.0957ms 1 9.0957ms 9.0957ms 9.0957ms KeBilinearInterpBw
23.78% 8.2977ms 1 8.2977ms 8.2977ms 8.2977ms KeBilinearInterpFw
22.55% 7.8661ms 2 3.9330ms 1.5798ms 6.2863ms [CUDA memcpy DtoH]
==78544== API calls:
Time(%) Time Calls Avg Min Max Name
46.85% 682.28ms 8 85.285ms 12.639us 682.03ms cudaStreamCreateWithFlags
39.83% 580.00ms 4 145.00ms 302ns 550.27ms cudaFree
9.82% 143.03ms 9 15.892ms 8.7090us 142.78ms cudaStreamCreate
1.23% 17.983ms 7 2.5690ms 23.210us 6.4563ms cudaMemcpy
1.23% 17.849ms 2 8.9247ms 8.4726ms 9.3768ms cudaStreamSynchronize
0.66% 9.5969ms 7 1.3710ms 288.43us 2.4279ms cudaHostAlloc
0.13% 1.9530ms 11 177.54us 7.6810us 591.06us cudaMalloc
0.07% 1.0424ms 8 130.30us 1.6970us 453.72us cudaGetDevice
0.04% 527.90us 40 13.197us 525ns 253.99us cudaEventCreateWithFlags
0.03% 435.73us 348 1.2520us 124ns 42.704us cuDeviceGetAttribute
0.03% 419.36us 1 419.36us 419.36us 419.36us cudaGetDeviceCount
0.02% 260.75us 2 130.38us 129.32us 131.43us cudaGetDeviceProperties
0.02% 222.32us 2 111.16us 106.94us 115.39us cudaLaunch
0.01% 214.06us 4 53.514us 28.586us 77.655us cuDeviceGetName
0.01% 115.45us 4 28.861us 9.8250us 44.526us cuDeviceTotalMem
0.01% 83.988us 4 20.997us 578ns 77.760us cudaSetDevice
0.00% 38.918us 1 38.918us 38.918us 38.918us cudaEventCreate
0.00% 34.573us 31 1.1150us 279ns 12.784us cudaDeviceGetAttribute
0.00% 17.767us 1 17.767us 17.767us 17.767us cudaProfilerStart
0.00% 15.228us 2 7.6140us 3.5460us 11.682us cudaConfigureCall
0.00% 14.536us 2 7.2680us 1.1490us 13.387us cudaGetLastError
0.00% 8.6080us 26 331ns 173ns 783ns cudaSetupArgument
0.00% 5.5470us 6 924ns 215ns 2.6780us cuDeviceGet
0.00% 5.4090us 6 901ns 328ns 3.3320us cuDeviceGetCount
0.00% 4.1770us 3 1.3920us 1.0630us 1.8300us cuDriverGetVersion
0.00% 3.4650us 3 1.1550us 1.0810us 1.2680us cuInit
0.00% 830ns 1 830ns 830ns 830ns cudaRuntimeGetVersion
nvvp 工具
--------------
如果想使用可视化的分析器 **nvvp**,您可以导入 :code:`nvprof -o ...` 的输出,或者从工具的界面里运行您的应用。
**备注: nvvp 也支持CPU的性能分析** (需在nvvp界面中选上才能开启)
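一个典型的使用流程如下(输出文件名仅作示意):

.. code-block:: bash

   # 先用 nvprof 导出分析文件,再用 nvvp 打开查看
   nvprof -o profile.nvvp ./paddle/legacy/math/tests/test_GpuProfiler
   nvvp profile.nvvp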
.. image:: nvvp1.png
:align: center
:scale: 33%
从内核函数的角度, **nvvp** 可以精确说明一个长耗时操作的具体原因。
同时,如下图所示, **nvvp** 的内核block使用情况、寄存器使用情况和共享内存使用情况能让我们对GPU的整体使用有更好的理解。
.. image:: nvvp2.png
:align: center
:scale: 33%
而从应用的角度, **nvvp** 可以帮您提供一些定位性能瓶颈的建议。
例如,下图中就展示了一些关于内存数据迁徙和计算资源利用率的建议,为您做性能调优提供了方向。
.. image:: nvvp3.png
:align: center
:scale: 33%
.. image:: nvvp4.png
:align: center
:scale: 33%
性能分析小技巧
==================
- 开始阶段,从 **nvprof** 和 **nvvp** 的输出信息入手是个不错的选择。
- 接下来可以考虑下时间线的分析。
- 如果真想挖掘内核深处的某个秘密,您最好先确认:这一块的耗时比例真的太高,值得深入分析。
- 可能的情况下,试着让输出的分析数据和理论值对应。
1) 例如,如果我知道内核花了10ms来移动1GB数据,那我会期望分析工具统计到速度是100GB/s。
2) 若有不一致之处,很有可能实际应用就是没有按照您的预期情况运行。
- 了解您的硬件:如果您的GPU理论可以达到6 TFLOPs(6万亿次浮点运算每秒),而当前已经有5.5 TFLOPs了,那估计这里的潜力就没啥好挖的了……
性能分析是性能优化的关键一步。有的时候简简单单的改变就能在性能上产生明显的优化效果!
当然,具体情况因人而异。
参考资料
===========
Jeremy Appleyard, `GPU Profiling for Deep Learning <http://www.robots.ox.ac.uk/~seminars/seminars/Extra/2015_10_08_JeremyAppleyard.pdf>`_, 2015
# 如何贡献文档
PaddlePaddle非常欢迎您贡献文档。如果您撰写/翻译的文档满足我们的要求,您的文档将会呈现在paddlepaddle.org网站和Github上供PaddlePaddle的用户阅读。
Paddle的文档主要分为以下几个模块:
- 新手入门:包括安装说明、深度学习基础知识、学习资料等,旨在帮助用户快速安装和入门;
- 使用指南:包括数据准备、网络配置、训练、Debug、预测部署和模型库文档,旨在为用户提供PaddlePaddle基本用法讲解;
- 进阶使用:包括服务器端和移动端部署、如何贡献代码/文档、如何性能调优等,旨在满足开发者的需求;
我们的文档支持[reStructured Text](http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html)和[Markdown](https://guides.github.com/features/mastering-markdown/)(GitHub风格)格式的内容贡献。
撰写文档完成后,您可以使用预览工具查看文档在官网显示的效果,以验证您的文档是否能够在官网正确显示。
## 如何使用预览工具
如果您正在修改代码文档(即API),并在Docker容器中使用PaddlePaddle,请在您相应的docker容器中执行下列步骤。因为API的文档生成器依赖于PaddlePaddle。
如果您只改进了文本/媒体内容(不需要安装或构建PaddlePaddle),或者正在主机上构建PaddlePaddle,请继续在主机上执行下列步骤。
### 1. Clone你希望更新或测试的相关仓库:
首先下载完整的文档存储仓库,其中`--recurse-submodules`会同步更新FluidDoc中的submodule(所有的submodule均在`FluidDoc/external`中),以保证所有文档可以正常显示:
```
git clone --recurse-submodules https://github.com/PaddlePaddle/FluidDoc
```
其他可拉取的存储库有:
```
git clone https://github.com/PaddlePaddle/book.git
git clone https://github.com/PaddlePaddle/models.git
git clone https://github.com/PaddlePaddle/Mobile.git
```
您可以将这些本地副本放在电脑的任意目录下,稍后我们会在启动 PaddlePaddle.org时指定这些仓库的位置。
### 2. 在新目录下拉取 PaddlePaddle.org 并安装其依赖项
在此之前,请确认您的操作系统安装了python的依赖项
以ubuntu系统为例,运行:
```
sudo apt-get update && sudo apt-get install -y python-dev build-essential
```
然后:
```
git clone https://github.com/PaddlePaddle/PaddlePaddle.org.git
cd PaddlePaddle.org/portal
# To install in a virtual environment.
# virtualenv venv; source venv/bin/activate
pip install -r requirements.txt
```
**可选项**:如果你希望实现中英网站转换,以改善PaddlePaddle.org,请安装[GNU gettext](https://www.gnu.org/software/gettext/)
### 3. 在本地运行 PaddlePaddle.org
添加您希望加载和构建内容的目录列表(选项包括:--paddle,--book,--models,--mobile)
运行:
```
./runserver --paddle <path_to_FluidDoc_dir>
```
**注意:** `<path_to_FluidDoc_dir>`为第一步中FluidDoc副本在您本机的存储地址。
如果您需要处理依赖于`book`、`models`、`mobile`存储库内容的文档,您可以添加一个或多个可选项:
```
./runserver --paddle <path_to_fluiddoc_dir> \
--book <path_to_fluiddoc_dir>/external/book \
--models <path_to_fluiddoc_dir>/external/models \
--mobile <path_to_fluiddoc_dir>/external/mobile
```
然后:打开浏览器并导航到http://localhost:8000。
>*网站可能需要几秒钟才能成功加载,因为构建需要一定的时间*
>*如果您是在docker环境下运行的这些步骤,请检查ip确保可以将端口8000映射到您的主机*
## 贡献新文档或更新API
所有内容都应该以[Markdown](https://guides.github.com/features/mastering-markdown/) (GitHub风格)的形式编写(尽管在文档中有一些使用.rst格式的遗留内容)。
在完成安装步骤后,您还需要完成下列操作:
- 在你开始写作之前,我们建议你回顾一下这些关于贡献内容的指南
---
**贡献新文档**
- 创建一个新的 `.md` 文件或者在您当前操作的仓库中修改已存在的文章
- 将新增的文档名,添加到对应的index文件中
---
**贡献或修改Python API**
在编译代码的docker容器内,或主机的对应位置:
- 运行脚本 `paddle/scripts/paddle_build.sh`(在 Paddle repo 下)
```bash
# 编译paddle的python库
cd Paddle
./paddle/scripts/paddle_docker_build.sh gen_doc_lib full
cd ..
```
- 运行预览工具
```
# 在编译paddle的对应docker镜像中运行预览工具
docker run -it -v /Users/xxxx/workspace/paddlepaddle_workplace:/workplace -p 8000:8000 [images_id] /bin/bash
```
> 其中`/Users/xxxx/workspace/paddlepaddle_workplace`请替换成您本机的paddle工作环境,`/workplace`请替换成您相应的 docker 下的工作环境,这一映射会保证我们同时完成编译python库、修改FluidDoc和使用预览工具。
> [images_id]为docker中您使用的paddlepaddle的镜像id。
- 设置环境变量
```
# 在docker环境中
# 设置环境变量`PYTHONPATH`使预览工具可以找到 paddle 的 python 库
export PYTHONPATH=/workplace/Paddle/build/python/
```
- 清理旧文件
```
# 清除历史生成的文件,如果是第一次使用预览工具可以跳过这一步
rm -rf /workplace/FluidDoc/doc/fluid/menu.json /workplace/FluidDoc/doc/fluid/api/menu.json /tmp/docs/ /tmp/api/
```
- 启动预览工具
```
cd /workplace/PaddlePaddle.org/portal
pip install -r requirements.txt
./runserver --paddle /workplace/FluidDoc/
```
---
**预览修改**
打开浏览器并导航到http://localhost:8000。
在要更新的页面上,单击右上角的Refresh Content
进入使用文档单元后,API部分并不包含内容,希望预览API文档需要点击API目录,几分钟后您将看到生成的 API reference。
## 提交修改
如果您希望修改代码,请在`Paddle`仓库下参考[如何贡献代码](../development/contribute_to_paddle.html)执行操作。
如果您仅修改文档:
- 修改的内容在`doc`文件夹内,您只需要在`FluidDoc`仓库下提交`PR`
- 修改的内容在`external`文件夹内:
1.在您修改的仓库下提交PR。这是因为:`FluidDoc`仓库只是一个包装器,将其他仓库的链接(git术语的“submodule”)集合在了一起。
2.当您的修改被认可后,更新FluidDoc中对应的`submodule`到源仓库最新的commit-id。
> 例如,您更新了book仓库中的develop分支下的文档:
> - 进入`FluidDoc/external/book`目录
> - 更新 commit-id 到最新的提交:`git pull origin develop`
> - 在`FluidDoc`中提交你的修改
3.在`FluidDoc`仓库下为您的修改提交PR
提交修改与PR的步骤可以参考[如何贡献代码](../development/contribute_to_paddle.html)
## 帮助改进预览工具
我们非常欢迎您对平台和支持内容的各个方面做出贡献,以便更好地呈现这些内容。您可以Fork或Clone这个存储库,或者提出问题并提供反馈,以及在issues上提交bug信息。详细内容请参考[开发指南](https://github.com/PaddlePaddle/PaddlePaddle.org/blob/develop/DEVELOPING.md)
## 版权和许可
PaddlePaddle.org在Apache-2.0的许可下提供。
# 学习资料
## 要读的第一本书
基础理论习得的最直接来源就是书本。按机器学习理论、深度学习理论、编程语言三方面划分,这里推荐如下书籍辅助您。
### 机器学习理论
在开启深度学习之前,您需要先行掌握机器学习的理论。深度学习是机器学习中的一个分支,两者内在的理论基础存在强关联。
机器学习理论的书籍教材比较多,这里推荐一本易懂易学的书籍,可以重点关注神经网络部分。
书名:《机器学习》(周志华著,清华大学出版社,2016年版)
### 深度学习理论
打好机器学习的理论功底后,您可以开始钻研深度学习的理论。通常深度学习理论会给人留下抽象难懂的印象,且和数学结合紧密。
为了让您能够顺利入门,这里推荐一份易学易用的教材,无论深度学习理论还是数学理论,一本即可搞定。
书名:《Deep Learning(深度学习)》(Goodfellow, Bengio, Courville合著,赵申剑、黎彧君、符天凡和李凯合译,人民邮电出版社,2017年版)
此书电子版在Github上已经开源,详情可参考此链接 [《深度学习》](https://github.com/exacity/deeplearningbook-chinese)
### 编程语言
Python方向:这里推荐您学习Python,一方面各大主流深度学习框架的主力支撑编程语言均为Python;另一方面,对比其他语言,Python较为简单易学。
Python的教材种类较多,这里推荐一本实操和理论性都兼顾的教材,只要完成书中52个习题,跑代码然后发现问题解决,就能逐步上手。
书名:《“笨办法”学Python》(Zed Shaw著,王巍巍译,人民邮电出版社,2014年11月版)
C++方向:C++语言在底层框架中使用较多,您逐步掌握开源框架的基本操作后,在更高阶的框架应用中会用到这个技能点。
同前面提到的Python一样,学习C++时需要多上手操作。这里推荐迅速上手C++的书籍,不但能够学习功能和结构,还提供了解决方案的示例。
书名:《Essential C++》【美】李普曼(Lippman,S.B.)著,侯捷译,电子工业出版社2013年8月版
## 要看的视频公开课
在学习一门新技术的同时,除了看书,如果有老师面对面教授,可以更快更好的学会知识。相比于线下授课,视频公开课能够在省钱省力的同时,达到易学易掌握的效果。
目前深度学习的课程多是公开免费的,通过学习您可以更轻松的理解深度学习中的抽象理论,并在实操方面不绕弯路。
综合课程生动性、可操作性、紧凑性、连续性这些特点,这里推荐如下课程,同步附上网址,便于您查找学习。
### 理论知识详解视频课
[机器学习](http://open.163.com/special/opencourse/machinelearning.html) 斯坦福大学教授吴恩达公开课程,包含相关算法的详细讲解。
[AI技术](https://ai.baidu.com/paddlepaddle/player?id=13) 百度推出的“AI核心技术掌握”课程,每节课在20-30分钟左右,从AI技术到深度学习进行全面细致的解读。
[深度学习](http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17_2.html) 台湾李宏毅教授的在线课程,该课程为英文课程,会结合国外的科研成果,也适合新手入门和理解深度学习。
[编程语言](https://ai.baidu.com/paddlepaddle/openCourses) Python操作课程,从基础到进阶操作都提供详细说明,每节课时长20分钟左右。
### PaddlePaddle实操视频课
掌握好理论基础,具备编程能力后,您可以开始使用PaddlePaddle Fluid进行实操,从初阶开始学习,向着中高阶努力。
目前已有PaddlePaddle官方视频公开课在官网呈现,内含PaddlePaddle实战、PaddlePaddle应用场景和机器学习模型讲解课程,帮助开发者从零开始使用PaddlePaddle,从简单场景逐步过渡到工业级应用。[点击这里](http://ai.baidu.com/paddlepaddle/openCourses)您即可开始视频课的学习之旅。
# Learning Materials
## The first book to start your journey
Books are the most direct resources to pick up the rationale of a subject. We recommend the following books for you which are categorized into machine learning theory, deep learning theory and programming languages.
### Books for Machine Learning Theory
Machine learning theory is a prerequisite to deep learning. Deep learning, one of the branches of machine learning, has a theoretical basis strongly relevant to machine learning.
There have been various textbooks nowadays, from which we select an easier one for you. Please pay more attention to the chapters involving Neural Networks in the textbook.
book:《Machine Learning》(Zhihua Zhou, Tsinghua University Press, 2016)
### Books for Deep Learning Theory
Having consolidated your basis of machine learning, it is time to dive into deep learning.
It's commonplace that deep learning theory leaves an obscure and abstract impression on learners, and tightly connects with mathematics.
To help you smoothly get started with deep learning, we recommend the following easy-to-go textbook, which features a good explanation of both deep learning theory and its related mathematic basis.
book:《Deep Learning》(Goodfellow, Bengio, Courville)
### Books for Programming Languages
Python:
Python is our recommended programming language. On the one hand, Python is the main supportive language of mainstream deep learning frameworks; On the other hand, Python is easier than other languages for beginners.
Python textbooks abound in the market, and what lies here is a textbook that ingeniously balances theoretical knowledge with practical operations. By resolving the 52 exercises in the book, running your answer code, and addressing the problems that occur in the process, you can gradually get the hang of Python.
Book:《Learn Python the Hard Way》(Zed Shaw)
C++:
C++ is adopted widely in the low-level parts of frameworks. After you have gradually mastered the basic operations of an open-source framework, programming in C++ becomes an important skill for the more advanced usage of a framework.
C++ also requires frequent practical exercises like Python mentioned above.
The book lying here is a quick-to-start textbook with introduction to functions and structures, and examples of resolutions.
Book:《Essential C++》(Lippman,S.B.)
## Open Lectures
Besides textbooks, face-to-face instructions from teachers would contribute a robust and quick boost to your learning of new technology. Compared with on-campus lectures, open video lectures can not only make your learning simpler, but also save your time and energy.
Currently, most courses about deep learning are free and public. These courses will help you comprehend the abstract theory behind deep learning more easily and direct you straight towards practical applications. With regard to vitality, operability, continuity and compactness, we recommend the following courses, with their links attached afterwards to save you the time of searching.
### Lectures Aimed at Theory Analysis
[Machine Learning](http://open.163.com/special/opencourse/machinelearning.html) : Delivered by Andrew Ng, Stanford University. This series of lectures encompasses detailed analysis on relevant algorithms.
[Deep Learning](http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17_2.html) : An online English course delivered by Prof. Hung-yi Lee. It is combined with the abroad research contributions, and at the same time it is suitable for novices to get started and understand deep learning.
The following are several lectures delivered in Chinese:
[AI tech](https://ai.baidu.com/paddlepaddle/player?id=13) : The course named "Master Core AI Technology" organized by Baidu deciphers AI technology to Deep Learning in a comprehensive and fine-grained way. Each lesson lasts for 20 - 30 minutes.
[Programming Languages](https://ai.baidu.com/paddlepaddle/openCourses) : Python tutorials, with each lesson lasting about 20 minutes, covering everything from the basics to advanced usage.
### PaddlePaddle Hands-on Training
Having equipped with a firm grasp of theory basis and programming ability, you can now commence a practical adventure to PaddlePaddle Fluid, and grow up from a beginner level to a medium or high level.
Our official open courses are presented on the official site. The courses embrace PaddlePaddle practical operations, scenarios applied with PaddlePaddle, and introduction to PaddlePaddle machine learning models. Developers can take full advantage of our official courses to start PaddlePaddle from scratch and gradually move to industrial application.
[Click Here](http://ai.baidu.com/paddlepaddle/openCourses) to embark on your sailing in our official deep learning video lectures.
- `安装说明 <../beginners_guide/install/index_cn.html>`_:我们支持在Ubuntu/CentOS/Windows/MacOS环境上的安装
如果您初次接触深度学习,在学习PaddlePaddle之前建议您先阅读以下资料:
- `学习资料 <../beginners_guide/basics/learning_materials.html>`_:推荐机器学习、深度学习和编程语言三个方面的书籍与视频公开课
如果您已经具备一定的深度学习基础,第一次使用PaddlePaddle时,可以跟随下列简单的模型案例供您快速上手:
- `Fluid编程指南 <../beginners_guide/programming_guide/programming_guide.html>`_:介绍 Fluid 的基本概念和使用方法
install/index_cn.rst
quick_start/index.rst
basics/index.rst
basics/learning_materials.md
programming_guide/programming_guide.md
- `Installation Manuals <../beginners_guide/install/index_en.html>`_ :Installation on Ubuntu/CentOS/Windows/MacOS is supported.
The following resources are recommended for novices in deep learning:
- `Resources <../beginners_guide/basics/learning_materials_en.html>`_ :Selected books and lectures about machine learning, deep learning and programming languages.
If you have been armed with certain level of deep learning knowledge, and it happens to be the first time to try PaddlePaddle, the following cases of model building will expedite your learning process:
- `Programming with Fluid <../beginners_guide/programming_guide/programming_guide_en.html>`_ : Core concepts and basic usage of Fluid
:hidden:
install/index_en.rst
basics/learning_materials_en.md
programming_guide/programming_guide_en.md
从源码编译
======================
.. _requirements:
需要的软硬件
----------------
为了编译PaddlePaddle,我们需要
1. 一台电脑,可以装的是 Linux, Windows 或者 MacOS 操作系统
2. Docker
不需要依赖其他任何软件了。即便是 Python 和 GCC 都不需要,因为我们会把所有编译工具都安装进一个 Docker 镜像里。
.. _build_step:
编译方法
----------------
PaddlePaddle需要使用Docker环境完成编译,这样可以免去单独安装编译依赖的步骤,可选的不同编译环境Docker镜像
可以在 `这里 <https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/>`__ 找到,您也可以
在 `这里 <https://github.com/PaddlePaddle/Paddle/tree/develop/tools/manylinux1/>`__ 找到 paddle_manylinux_devel
镜像的编译以及使用方法。或者参考下述可选步骤,从源码中构建用于编译PaddlePaddle的Docker镜像。
如果您选择不使用Docker镜像,则需要在本机安装下面章节列出的 :ref:`编译依赖 <compile_deps>` ,之后才能开始编译。
编译PaddlePaddle,需要执行:
.. code-block:: bash
# 1. 获取源码
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
# 2. 可选步骤:源码中构建用于编译PaddlePaddle的Docker镜像
docker build -t paddle:dev .
# 3. 执行下面的命令编译CPU-Only的二进制
docker run -it -v $PWD:/paddle -w /paddle -e "PYTHON_ABI=cp27-cp27mu" -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 ./paddle/scripts/paddle_build.sh build
# 4. 或者也可以使用为上述可选步骤构建的镜像(必须先执行第2步)
docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddle:dev ./paddle/scripts/paddle_build.sh build
注:
- 上述命令把当前目录(源码树根目录)映射为 container 里的 :code:`/paddle` 目录。
- 如果您使用的是 manylinux 的镜像进行编译, 那么您需要通过环境变量 :code:`PYTHON_ABI` 来指定一个 `Python ABI <https://www.python.org/dev/peps/pep-0425/#id8>`__.
PaddlePaddle目前支持的 Python ABI 有 :code:`cp27-cp27m` 和 :code:`cp27-cp27mu`.
编译完成后会在build/python/dist目录下生成输出的whl包,可以选择在当前机器安装,也可以拷贝到目标机器安装:
.. code-block:: bash
pip install build/python/dist/*.whl
如果机器中已经安装过PaddlePaddle,有两种方法:
.. code-block:: bash
# 1. 先卸载之前的版本,再重新安装
pip uninstall paddlepaddle
pip install build/python/dist/*.whl
# 2. 直接升级到更新的版本
pip install build/python/dist/*.whl -U
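安装完成后,可以用下面的命令做一次简单的导入检查(仅作示意):

.. code-block:: bash

   python -c "import paddle.fluid as fluid"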
.. _run_test:
执行单元测试
----------------
如果您期望在编译完成后立即执行所有的单元测试,可以按照下面的方法:
设置 :code:`RUN_TEST=ON` 和 :code:`WITH_TESTING=ON` 就会在完成编译之后,立即执行单元测试。
开启 :code:`WITH_GPU=ON` 可以指定同时执行GPU上的单元测试。
.. code-block:: bash
docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=ON" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 ./paddle/scripts/paddle_build.sh test
如果期望执行其中一个单元测试,(比如 :code:`test_sum_op` ):
.. code-block:: bash
docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 /bin/bash
./paddle/scripts/paddle_build.sh build
cd build
ctest -R test_sum_op -V
.. _faq_docker:
常见问题
----------------
- 什么是 Docker?
如果您没有听说 Docker,可以把它想象为一个类似 virtualenv 的系统,但是虚拟的不仅仅是 Python 的运行环境。
- Docker 还是虚拟机?
有人用虚拟机来类比 Docker。需要强调的是:Docker 不会虚拟任何硬件,Docker container 里运行的编译工具实际上都是在本机的 CPU 和操作系统上直接运行的,性能和把编译工具安装在本机运行一样。
- 为什么用 Docker?
把工具和配置都安装在一个 Docker image 里可以标准化编译环境。这样如果遇到问题,其他人可以复现问题以便帮助。
另外,对于习惯使用Windows和MacOS的开发者来说,使用Docker就不用配置交叉编译环境了。
- 我可以选择不用Docker吗?
当然可以。大家可以用把开发工具安装进入 Docker image 一样的方式,把这些工具安装到本机。这篇文档介绍基于 Docker 的开发流程,是因为这个流程比其他方法都更简便。
- 学习 Docker 有多难?
理解 Docker 并不难,大概花十分钟看一下 `如何使用Docker <https://zhuanlan.zhihu.com/p/19902938>`_ 。这可以帮您省掉花一小时安装和配置各种开发工具,以及切换机器时需要新安装的辛苦。别忘了 PaddlePaddle 更新可能导致需要新的开发工具。更别提简化问题复现带来的好处了。
- 我可以用 IDE 吗?
当然可以,因为源码就在本机上。IDE 默认调用 make 之类的程序来编译源码,我们只需要配置 IDE 来调用 Docker 命令编译源码即可。
很多 PaddlePaddle 开发者使用 Emacs。他们在自己的 `~/.emacs` 配置文件里加两行
.. code-block:: emacs
(global-set-key "\C-cc" 'compile)
(setq compile-command "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev")
就可以按 `Ctrl-C` 和 `c` 键来启动编译了。
- 可以并行编译吗?
是的。我们的 Docker image 运行一个 `Paddle编译Bash脚本 <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh>`_ 。这个脚本调用 `make -j$(nproc)` 来启动和 CPU 核一样多的进程来并行编译。
- Docker 需要 sudo
如果用自己的电脑开发,自然也就有管理员权限(sudo)了。如果用公用的电脑开发,需要请管理员安装和配置好 Docker。此外,PaddlePaddle 项目在努力开始支持其他不需要 sudo 的集装箱技术,比如 rkt。
- 在 Windows/MacOS 上编译很慢
Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存,以保证编译高效。具体做法请参考 `如何为Windows/Mac计算机上的Docker增加内存和虚拟机 <https://github.com/PaddlePaddle/Paddle/issues/627>`_ 。
- 磁盘不够
本文中的例子里,`docker run` 命令里都用了 `--rm` 参数,这样保证运行结束之后的 containers 不会保留在磁盘上。可以用 `docker ps -a` 命令看到停止后但是没有删除的 containers。`docker build` 命令有时候会产生一些中间结果,是没有名字的 images,也会占用磁盘。可以参考 `如何删除Docker Container <https://zaiste.net/posts/removing_docker_containers/>`_ 来清理这些内容。
.. _compile_deps:
附录:编译依赖
----------------
PaddlePaddle编译需要使用到下面的依赖(包含但不限于),其他的依赖软件,会自动在编译时下载。
.. csv-table:: PaddlePaddle编译依赖
:header: "依赖", "版本", "说明"
:widths: 10, 15, 30
"CMake", ">=3.2", ""
"GCC", "4.8.2", "推荐使用CentOS的devtools2"
"Python", "2.7.x", "依赖libpython2.7.so"
"pip", ">=9.0", ""
"numpy", "", ""
"SWIG", ">=2.0", ""
"Go", ">=1.8", "可选"
.. _build_options:
附录:编译选项
----------------
PaddlePaddle的编译选项,包括生成CPU/GPU二进制文件、链接何种BLAS库等。
用户可在调用cmake的时候设置它们,详细的cmake使用方法可以参考
`官方文档 <https://cmake.org/cmake-tutorial>`_ 。
在cmake的命令行中,通过使用 ``-D`` 命令设置该类编译选项,例如:
.. code-block:: bash
cmake .. -DWITH_GPU=OFF
.. csv-table:: 编译选项说明
:header: "选项", "说明", "默认值"
:widths: 1, 7, 2
"WITH_GPU", "是否支持GPU", "ON"
"WITH_C_API", "是否仅编译CAPI", "OFF"
"WITH_DOUBLE", "是否使用双精度浮点数", "OFF"
"WITH_DSO", "是否运行时动态加载CUDA动态库,而非静态加载CUDA动态库。", "ON"
"WITH_AVX", "是否编译含有AVX指令集的PaddlePaddle二进制文件", "ON"
"WITH_PYTHON", "是否内嵌PYTHON解释器", "ON"
"WITH_STYLE_CHECK", "是否编译时进行代码风格检查", "ON"
"WITH_TESTING", "是否开启单元测试", "OFF"
"WITH_DOC", "是否编译中英文文档", "OFF"
"WITH_SWIG_PY", "是否编译PYTHON的SWIG接口,该接口可用于预测和定制化训练", "Auto"
"WITH_GOLANG", "是否编译go语言的可容错parameter server", "OFF"
"WITH_MKL", "是否使用MKL数学库,如果为否则是用OpenBLAS", "ON"
BLAS
+++++
PaddlePaddle支持 `MKL <https://software.intel.com/en-us/intel-mkl>`_ 和
`OpenBlAS <http://www.openblas.net/>`_ 两种BLAS库。默认使用MKL。如果使用MKL并且机器含有AVX2指令集,
还会下载MKL-DNN数学库,详细参考 `mkldnn设计文档 <https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/mkldnn#cmake>`_ 。
如果关闭MKL,则会使用OpenBLAS作为BLAS库。
CUDA/cuDNN
+++++++++++
PaddlePaddle在编译时/运行时会自动找到系统中安装的CUDA和cuDNN库进行编译和执行。
使用参数 :code:`-DCUDA_ARCH_NAME=Auto` 可以指定开启自动检测SM架构,加速编译。
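例如(仅作示意):

.. code-block:: bash

   cmake .. -DWITH_GPU=ON -DCUDA_ARCH_NAME=Auto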
PaddlePaddle可以使用cuDNN v5.1之后的任何一个版本来编译运行,但尽量请保持编译和运行使用的cuDNN是同一个版本。
我们推荐使用最新版本的cuDNN。
编译选项的设置
++++++++++++++
PaddlePaddle通过编译时指定路径来实现引用各种BLAS/CUDA/cuDNN库。cmake编译时,首先在系统路径( :code:`/usr/lib:/usr/local/lib` )中搜索这几个库,同时也会读取相关路径变量来进行搜索。通过使用 ``-D`` 命令可以设置,例如
.. code-block:: bash
cmake .. -DWITH_GPU=ON -DWITH_TESTING=OFF -DCUDNN_ROOT=/opt/cudnnv5
**注意:这几个编译选项的设置,只在第一次cmake的时候有效。如果之后想要重新设置,推荐清理整个编译目录(** :code:`rm -rf` )**后,再指定。**
Build from Sources
==========================
.. _requirements:
Requirements
----------------
To build PaddlePaddle, you need
1. A computer -- Linux, Windows, MacOS.
2. Docker.
Nothing else. Not even Python and GCC, because you can install all build tools into a Docker image.
We run all the tools by running this image.
.. _build_step:
How To Build
----------------
You need to use Docker to build PaddlePaddle
to avoid installing dependencies by yourself. We have several pre-built
Docker images `here <https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/>`_ ,
you can also find how to build and use paddle_manylinux_devel Docker image from
`here <https://github.com/PaddlePaddle/Paddle/tree/develop/tools/manylinux1/>`__
Or you can build your own image from source as the optional step below:
If you don't wish to use docker, you need to install several compile dependencies manually as :ref:`Compile Dependencies <compile_deps>` shows before starting compilation.
.. code-block:: bash
# 1. clone the source code
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
# 2. Optional: build development docker image from source
docker build -t paddle:dev .
# 3. Run the following command to build a CPU-Only binaries
docker run -it -v $PWD:/paddle -w /paddle -e "PYTHON_ABI=cp27-cp27mu" -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 ./paddle/scripts/paddle_build.sh build
# 4. Or, use your built Docker image to build PaddlePaddle (must run step 2)
docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddle:dev ./paddle/scripts/paddle_build.sh build
NOTE:
- The above command tries to mount the current working directory (root directory of the source code)
into the :code:`/paddle` directory inside the docker container.
- You need to pass in the required environment variable :code:`PYTHON_ABI` to specify a `Python ABI <https://www.python.org/dev/peps/pep-0425/#id8>`__.
Currently PaddlePaddle supported Python ABIs include :code:`cp27-cp27m` and :code:`cp27-cp27mu` .
When the compile finishes, you can get the output whl package under
build/python/dist, then you can choose to install the whl on local
machine or copy it to the target machine.
.. code-block:: bash
pip install build/python/dist/*.whl
If the machine has installed PaddlePaddle before, there are two methods:
.. code-block:: bash
# 1. uninstall and reinstall
pip uninstall paddlepaddle
pip install build/python/dist/*.whl
# 2. upgrade directly
pip install build/python/dist/*.whl -U
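After installation, you can run a quick import check to confirm that the wheel works (a minimal sketch):

.. code-block:: bash

   python -c "import paddle.fluid as fluid"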
.. _run_test:
Run Tests
----------------
If you wish to run the tests, you may follow the below steps:
When using Docker, setting :code:`RUN_TEST=ON` and :code:`WITH_TESTING=ON` will run the tests immediately after the build.
Set :code:`WITH_GPU=ON` to also run the tests on GPU.
.. code-block:: bash
docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=ON" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 ./paddle/scripts/paddle_build.sh test
If you wish to run only one unit test, like :code:`test_sum_op`:
.. code-block:: bash
docker run -it -v $PWD:/paddle -w /paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 /bin/bash
./paddle/scripts/paddle_build.sh build
cd build
ctest -R test_sum_op -V
.. _faq_docker:
Frequently Asked Questions
---------------------------
- What is Docker?
If you haven't heard of it, consider it something like Python's virtualenv.
- Docker or virtual machine?
Some people compare Docker with VMs, but Docker doesn't virtualize any hardware nor running a guest OS, which means there is no compromise on the performance.
- Why Docker?
Using a Docker image of build tools standardizes the building environment, which makes it easier for others to reproduce your problems and to help.
Also, some build tools don't run on Windows or Mac or BSD, but Docker runs almost everywhere, so developers can use whatever computer they want.
- Can I choose not to use Docker?
Sure, you don't have to install build tools into a Docker image; instead, you can install them on your local computer. This document exists because Docker would make the development way easier.
- How difficult is it to learn Docker?
It takes you ten minutes to read `an introductory article <https://docs.docker.com/get-started>`_ and saves you more than one hour to install all required build tools, configure them, especially when new versions of PaddlePaddle require some new tools. Not even to mention the time saved when other people trying to reproduce the issue you have.
- Can I use my favorite IDE?
Yes, of course. The source code resides on your local computer, and you can edit it using whatever editor you like.
Many PaddlePaddle developers are using Emacs. They add the following few lines into their `~/.emacs` configure file:
.. code-block:: emacs
(global-set-key "\C-cc" 'compile)
(setq compile-command "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev")
so they could type `Ctrl-C` and `c` to build PaddlePaddle from source.
- Does Docker do parallel building?
Our building Docker image runs a `Bash script <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh>`_ , which calls `make -j$(nproc)` to starts as many processes as the number of your CPU cores.
- Docker requires sudo
An owner of a computer has the administrative privilege, a.k.a., sudo, and Docker requires this privilege to work properly. If you use a shared computer for development, please ask the administrator to install and configure Docker. We will do our best to support rkt, another container technology that doesn't require sudo.
- Docker on Windows/MacOS builds slowly
On Windows and MacOS, Docker containers run in a Linux VM. You might want to give this VM some more memory and CPUs so to make the building efficient. Please refer to `this issue <https://github.com/PaddlePaddle/Paddle/issues/627>`_ for details.
- Not enough disk space
Examples in this article use option `--rm` with the `docker run` command. This option ensures that stopped containers do not exist on hard disks. We can use `docker ps -a` to list all containers, including stopped. Sometimes `docker build` generates some intermediate dangling images, which also take disk space. To clean them, please refer to `this article <https://zaiste.net/posts/removing_docker_containers/>`_ .
.. _compile_deps:
Appendix: Compile Dependencies
-------------------------------
PaddlePaddle need the following dependencies when compiling, other dependencies
will be downloaded automatically.
.. csv-table:: PaddlePaddle Compile Dependencies
:header: "Dependency", "Version", "Description"
:widths: 10, 15, 30
"CMake", ">=3.2", ""
"GCC", "4.8.2", "Recommend devtools2 for CentOS"
"Python", "2.7.x", "Need libpython2.7.so"
"pip", ">=9.0", ""
"numpy", "", ""
"SWIG", ">=2.0", ""
"Go", ">=1.8", "Optional"
.. _build_options:
Appendix: Build Options
-------------------------
Build options include whether build binaries for CPU or GPU, which BLAS
library to use etc. You may pass these settings when running cmake.
For a detailed cmake tutorial, please refer to `here <https://cmake.org/cmake-tutorial>`__ .
You can add :code:`-D` argument to pass such options, like:
.. code-block:: bash
cmake .. -DWITH_GPU=OFF
.. csv-table:: Bool Type Options
:header: "Option", "Description", "Default"
:widths: 1, 7, 2
"WITH_GPU", "Build with GPU support", "ON"
"WITH_C_API", "Build only CAPI", "OFF"
"WITH_DOUBLE", "Build with double precision", "OFF"
"WITH_DSO", "Dynamically load CUDA libraries", "ON"
"WITH_AVX", "Build with AVX support", "ON"
"WITH_PYTHON", "Build with integrated Python interpreter", "ON"
"WITH_STYLE_CHECK", "Check code style when building", "ON"
"WITH_TESTING", "Build unit tests", "OFF"
"WITH_DOC", "Build documentations", "OFF"
"WITH_SWIG_PY", "Build Python SWIG interface for V2 API", "Auto"
"WITH_GOLANG", "Build fault-tolerant parameter server written in go", "OFF"
"WITH_MKL", "Use MKL as BLAS library, else use OpenBLAS", "ON"
BLAS
+++++
PaddlePaddle supports `MKL <https://software.intel.com/en-us/intel-mkl>`_ and
`OpenBLAS <http://www.openblas.net/>`_ as the BLAS library. By default it uses MKL.
If you are using MKL and your machine supports AVX2, MKL-DNN will also be downloaded
and used, for more `details <https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/mkldnn#cmake>`_ .
If you choose not to use MKL, then OpenBLAS will be used.
CUDA/cuDNN
+++++++++++
PaddlePaddle will automatically find CUDA and cuDNN when compiling and running.
parameter :code:`-DCUDA_ARCH_NAME=Auto` can be used to detect SM architecture
automatically in order to speed up the build.
PaddlePaddle can be built with any cuDNN version later than v5.1, and we recommend keeping up with
the latest cuDNN releases. Be sure to run with the same version of cuDNN that you built with.
Pass Compile Options
++++++++++++++++++++++
You can pass compile options to use the intended BLAS/CUDA/cuDNN libraries.
When running the cmake command, it will search system paths like
:code:`/usr/lib:/usr/local/lib` and then search paths that you
passed to cmake, i.e.
.. code-block:: bash
cmake .. -DWITH_GPU=ON -DWITH_TESTING=OFF -DCUDNN_ROOT=/opt/cudnnv5
**NOTE: These options only take effect when running cmake for the first time, you need to clean the cmake cache or clean the build directory (** :code:`rm -rf` **) if you want to change it.**
使用Docker安装运行
================================
使用Docker安装和运行PaddlePaddle可以无需考虑依赖环境即可运行。并且也可以在Windows的docker中运行。
您可以在 `Docker官网 <https://docs.docker.com/get-started/>`_ 获得基本的Docker安装和使用方法。
如果您在使用Windows,可以参考
`这篇 <https://docs.docker.com/toolbox/toolbox_install_windows/>`_
教程,完成在Windows上安装和使用Docker。
在了解Docker的基本使用方法之后,即可开始下面的步骤:
.. _docker_pull:
获取PaddlePaddle的Docker镜像
------------------------------
执行下面的命令获取最新的PaddlePaddle Docker镜像,版本为cpu_avx_mkl:
.. code-block:: bash
docker pull paddlepaddle/paddle
对于国内用户,我们提供了加速访问的镜像源:
.. code-block:: bash
docker pull docker.paddlepaddlehub.com/paddle
下载GPU版本(cuda8.0_cudnn5_avx_mkl)的Docker镜像:
.. code-block:: bash
docker pull paddlepaddle/paddle:latest-gpu
docker pull docker.paddlepaddlehub.com/paddle:latest-gpu
选择下载使用不同的BLAS库的Docker镜像:
.. code-block:: bash
# 默认是使用MKL的镜像
docker pull paddlepaddle/paddle
# 使用OpenBLAS的镜像
docker pull paddlepaddle/paddle:latest-openblas
下载指定版本的Docker镜像,可以从 `DockerHub网站 <https://hub.docker.com/r/paddlepaddle/paddle/tags/>`_ 获取可选的tag,并执行下面的命令:
.. code-block:: bash
docker pull paddlepaddle/paddle:[tag]
# 比如:
docker pull docker.paddlepaddlehub.com/paddle:0.11.0-gpu
.. _docker_run:
在Docker中执行PaddlePaddle训练程序
----------------------------------
假设您已经在当前目录(比如在/home/work)编写了一个PaddlePaddle的程序 :code:`train.py` (可以参考
`PaddlePaddleBook <http://www.paddlepaddle.org/docs/develop/book/01.fit_a_line/index.cn.html>`_
编写),就可以使用下面的命令开始执行训练:
.. code-block:: bash
cd /home/work
docker run -it -v $PWD:/work paddlepaddle/paddle /work/train.py
上述命令中, :code:`-it` 参数说明容器以交互方式运行; :code:`-v $PWD:/work`
指定将当前路径(Linux中$PWD变量会展开为当前路径的绝对路径)挂载到容器内部的 :code:`/work`
目录; :code:`paddlepaddle/paddle` 指定需要使用的容器; 最后 :code:`/work/train.py`
为容器内执行的命令,即运行训练程序。
当然,您也可以进入到Docker容器中,以交互式的方式执行或调试您的代码:
.. code-block:: bash
docker run -it -v $PWD:/work paddlepaddle/paddle /bin/bash
cd /work
python train.py
**注:PaddlePaddle Docker镜像为了减小体积,默认没有安装vim,您可以在容器中执行** :code:`apt-get install -y vim` **安装后,在容器中编辑代码。**
.. _docker_run_book:
使用Docker启动PaddlePaddle Book教程
-----------------------------------
使用Docker可以快速在本地启动一个包含了PaddlePaddle官方Book教程的Jupyter Notebook,可以通过网页浏览。
PaddlePaddle Book是为用户和开发者制作的一个交互式的Jupyter Notebook。
如果您想要更深入了解deep learning,PaddlePaddle Book一定是您最好的选择。
大家可以通过它阅读教程,或者制作和分享带有代码、公式、图表、文字的交互式文档。
我们提供可以直接运行PaddlePaddle Book的Docker镜像,直接运行:
.. code-block:: bash
docker run -p 8888:8888 paddlepaddle/book
国内用户可以使用下面的镜像源来加速访问:
.. code-block:: bash
docker run -p 8888:8888 docker.paddlepaddlehub.com/book
然后在浏览器中输入以下网址:
.. code-block:: text
http://localhost:8888/
就这么简单,享受您的旅程!
.. _docker_run_gpu:
使用Docker执行GPU训练
------------------------------
为了保证GPU驱动能够在镜像里面正常运行,我们推荐使用
`nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`_ 来运行镜像。
请不要忘记提前在物理机上安装GPU最新驱动。
.. code-block:: bash
nvidia-docker run -it -v $PWD:/work paddlepaddle/paddle:latest-gpu /bin/bash
**注: 如果没有安装nvidia-docker,可以尝试以下的方法,将CUDA库和Linux设备挂载到Docker容器内:**
.. code-block:: bash
export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu
**关于AVX:**
AVX是一种CPU指令集,可以加速PaddlePaddle的计算。最新的PaddlePaddle Docker镜像默认
是开启AVX编译的,所以,如果您的电脑不支持AVX,需要单独
`编译 <./build_from_source_cn.html>`_ PaddlePaddle为no-avx版本。
以下指令能检查Linux电脑是否支持AVX:
.. code-block:: bash
if cat /proc/cpuinfo | grep -i avx; then echo Yes; else echo No; fi
如果输出是No,就需要选择使用no-AVX的镜像。
Run in Docker Containers
=================================
Run PaddlePaddle in a Docker container so that you don't need to care about
runtime dependencies; you can also run it under Windows. You can find basic
Docker tutorials `here <https://docs.docker.com/get-started/>`_ .
If you are using Windows, please refer to
`this <https://docs.docker.com/toolbox/toolbox_install_windows/>`_
tutorial to start running docker under windows.
After you've read the above tutorials, you may proceed with the following steps.
.. _docker_pull:
Pull PaddlePaddle Docker Image
------------------------------
Run the following command to download the latest Docker images, the version is cpu_avx_mkl:
.. code-block:: bash
docker pull paddlepaddle/paddle
For users in China, we provide a faster mirror:
.. code-block:: bash
docker pull docker.paddlepaddlehub.com/paddle
Download GPU version (cuda8.0_cudnn5_avx_mkl) images:
.. code-block:: bash
docker pull paddlepaddle/paddle:latest-gpu
docker pull docker.paddlepaddlehub.com/paddle:latest-gpu
Choose between different BLAS version:
.. code-block:: bash
# image using MKL by default
docker pull paddlepaddle/paddle
# image using OpenBLAS
docker pull paddlepaddle/paddle:latest-openblas
If you want to use legacy versions, choose a tag from
`DockerHub <https://hub.docker.com/r/paddlepaddle/paddle/tags/>`_
and run:
.. code-block:: bash
docker pull paddlepaddle/paddle:[tag]
# i.e.
docker pull docker.paddlepaddlehub.com/paddle:0.11.0-gpu
.. _docker_run:
Launch your training program in Docker
--------------------------------------
Assume that you have already written a PaddlePaddle program
named :code:`train.py` under directory :code:`/home/work` (refer to
`PaddlePaddleBook <http://www.paddlepaddle.org/docs/develop/book/01.fit_a_line/index.cn.html>`_
for more samples), then run the following command:
.. code-block:: bash
cd /home/work
docker run -it -v $PWD:/work paddlepaddle/paddle /work/train.py
In the above command, :code:`-it` means run the container interactively;
:code:`-v $PWD:/work` means mount the current directory ($PWD will expand
to the current absolute path in Linux) under :code:`/work` in the container;
:code:`paddlepaddle/paddle` specifies the image to use; finally,
:code:`/work/train.py` is the command to run inside the container.
Also, you can go into the container shell, run or debug your code
interactively:
.. code-block:: bash
docker run -it -v $PWD:/work paddlepaddle/paddle /bin/bash
cd /work
python train.py
**NOTE: We did not install vim in the default docker image to reduce the image size, you can run** :code:`apt-get install -y vim` **to install it if you need to edit python files.**
.. _docker_run_book:
PaddlePaddle Book
------------------
You can create a container serving the PaddlePaddle Book as a Jupyter Notebook in
one minute using Docker. The PaddlePaddle Book is an interactive Jupyter Notebook
for users and developers. If you want to dig deeper into deep learning,
the PaddlePaddle Book is definitely your best choice.
We provide a packaged book image, simply issue the command:
.. code-block:: bash
docker run -p 8888:8888 paddlepaddle/book
For users in China, we provide a faster mirror:
.. code-block:: bash
docker run -p 8888:8888 docker.paddlepaddlehub.com/book
Then, open the following address in your local browser:
.. code-block:: text
http://localhost:8888/
That's all. Enjoy your journey!
.. _docker_run_gpu:
Train with Docker with GPU
------------------------------
We recommend using
`nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`_
to run GPU training jobs. Please ensure you have the latest
GPU driver installed before moving on.
.. code-block:: bash
nvidia-docker run -it -v $PWD:/work paddlepaddle/paddle:latest-gpu /bin/bash
**NOTE: If you don't have nvidia-docker installed, try the following method to mount CUDA libs and devices into the container.**
.. code-block:: bash
export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu
**About AVX:**
AVX is a CPU instruction set extension that can accelerate PaddlePaddle's computation.
The latest PaddlePaddle Docker image turns AVX on by default, so, if your
computer doesn't support AVX, you'll probably need to
`build <./build_from_source_en.html>`_ with :code:`WITH_AVX=OFF`.
The following command will tell you whether your computer supports AVX.
.. code-block:: bash
if cat /proc/cpuinfo | grep -i avx; then echo Yes; else echo No; fi
安装与编译
==========
.. _install_steps:
PaddlePaddle针对不同的用户群体提供了多种安装方式。
专注深度学习模型开发
--------------------
PaddlePaddle提供了多种python wheel包,可通过pip一键安装:
.. toctree::
:maxdepth: 1
pip_install_cn.rst
这是最便捷的安装方式,请根据机器配置和系统选择对应的安装包。
关注底层框架
-------------
PaddlePaddle提供了基于Docker的安装方式,请参照以下教程:
.. toctree::
:maxdepth: 1
docker_install_cn.rst
我们推荐在Docker中运行PaddlePaddle,该方式具有以下优势:
- 无需单独安装第三方依赖
- 方便分享运行时环境,易于问题的复现
对于有定制化二进制文件需求的用户,我们同样提供了从源码编译安装PaddlePaddle的方法:
.. toctree::
:maxdepth: 1
build_from_source_cn.rst
.. warning::
需要提醒的是,这种安装方式会涉及到一些第三方库的下载、编译及安装,整个安装过程耗时较长。
常见问题汇总
--------------
如果在安装过程中遇到了问题,请先尝试在下面的页面寻找答案:
:ref:`常见问题解答 <install_faq>`
如果问题没有得到解决,欢迎向PaddlePaddle社区反馈问题:
`创建issue <https://github.com/PaddlePaddle/Paddle/issues/new>`_
Install and Compile
======================
.. _install_steps:
PaddlePaddle provides various installation methods for different users.
Focus on Deep Learning Model Development
----------------------------------------
PaddlePaddle provides several Python wheel packages that can be installed with pip:
.. toctree::
:maxdepth: 1
pip_install_en.rst
This is the most convenient way of installation. Please choose the right installation package according to your machine configuration and operating system.
Focus on the Underlying Framework
----------------------------------
PaddlePaddle also supports installation using Docker. Please refer to the tutorial below:
.. toctree::
:maxdepth: 1
docker_install_en.rst
We recommend running PaddlePaddle in Docker. This method has the following advantages:
- Does not require installation of third-party dependencies.
- Easy to share the runtime environment, which makes issues easier to reproduce.
Lastly, users can also compile and install PaddlePaddle from source code. The instructions are below:
.. toctree::
:maxdepth: 1
build_from_source_en.rst
.. warning::
One caveat with this approach is that developers will have to download, compile and install all third-party dependencies. Thus this process of installation is more time consuming.
FAQ
-----------
For any problems during installation, please refer to the page below for answers:
:ref:`FAQ <install_faq>`
If the problem still persists, you are welcome to seek assistance from the PaddlePaddle community:
`create an issue <https://github.com/PaddlePaddle/Paddle/issues/new>`_
使用pip安装
================================
PaddlePaddle可以使用常用的Python包管理工具
`pip <https://pip.pypa.io/en/stable/installing/>`_
完成安装,并可以在大多数主流的Linux操作系统以及MacOS上执行。
.. _pip_install:
使用pip安装
------------------------------
执行下面的命令即可在当前机器上安装PaddlePaddle的运行时环境,并自动下载安装依赖软件。
.. code-block:: bash
pip install paddlepaddle
当前的默认版本为0.12.0,cpu_avx_openblas,您可以通过指定版本号来安装其它版本,例如:
.. code-block:: bash
pip install paddlepaddle==0.11.0
如果需要安装支持GPU的版本(cuda8.0_cudnn5_avx_openblas),需要执行:
.. code-block:: bash
pip install paddlepaddle-gpu
当前的默认版本也是0.12.0,PaddlePaddle针对不同需求提供了更多版本的安装包,部分列表如下:
================================= ========================================
版本号 版本说明
================================= ========================================
paddlepaddle-gpu==0.12.0 使用CUDA 8.0和cuDNN 5编译的0.12.0版本
paddlepaddle-gpu==0.11.0.post87 使用CUDA 8.0和cuDNN 7编译的0.11.0版本
paddlepaddle-gpu==0.11.0.post8 使用CUDA 8.0和cuDNN 5编译的0.11.0版本
paddlepaddle-gpu==0.11.0 使用CUDA 7.5和cuDNN 5编译的0.11.0版本
================================= ========================================
您可以在 `Release History <https://pypi.org/project/paddlepaddle-gpu/#history>`_ 中找到paddlepaddle-gpu的各个发行版本。
如果需要获取并安装最新的(开发分支)PaddlePaddle,可以从我们的CI系统中下载最新的whl安装包和c-api开发包并安装,
您可以从下面的表格中找到需要的版本:
如果在点击下面链接时出现如下登录界面,点击“Log in as guest”即可开始下载:
.. image:: paddleci.png
:scale: 50 %
:align: center
.. csv-table:: 各个版本最新的whl包
:header: "版本说明", "cp27-cp27mu", "cp27-cp27m"
:widths: 1, 3, 3
"cpu_avx_mkl", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/.lastSuccessful/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/.lastSuccessful/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cpu_avx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/.lastSuccessful/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/.lastSuccessful/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cpu_noavx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/.lastSuccessful/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/.lastSuccessful/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`_"
"cuda8.0_cudnn5_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cuda8.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cuda9.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
.. _pip_dependency:
运行环境依赖
------------------------------
PaddlePaddle安装包由于不仅仅包含.py程序,而且包含了C++编写的部分,所以我们确保发布的二进制包可以支持主流的Linux操作系统,比如CentOS 6以上,Ubuntu 14.04以上,MacOS 10.12以上。
PaddlePaddle发布的安装包会尽量对齐 `manylinux1 <https://www.python.org/dev/peps/pep-0513/#the-manylinux1-policy>`_ 标准,通常使用CentOS 5作为编译环境。但由于CUDA库通常需要CentOS 6以上,而且CentOS 5即将停止维护,所以我们默认使用CentOS 6作为标准编译环境。
.. csv-table:: PaddlePaddle环境依赖
:header: "依赖", "版本", "说明"
:widths: 10, 15, 30
"操作系统", "Linux, MacOS", "CentOS 6以上,Ubuntu 14.04以上,MacOS 10.12以上"
"Python", "2.7.x", "暂时不支持Python3"
"libc.so", "GLIBC_2.7", "glibc至少包含GLIBC_2.7以上的符号"
"libstdc++.so", "GLIBCXX_3.4.11, CXXABI_1.3.3", "至少包含GLIBCXX_3.4.11, CXXABI_1.3.3以上的符号"
"libgcc_s.so", "GCC_3.3", "至少包含GCC_3.3以上的符号"
.. _pip_faq:
安装常见问题和解决方法
------------------------------
- paddlepaddle*.whl is not a supported wheel on this platform.
出现这个问题的主要原因是,没有找到和当前系统匹配的paddlepaddle安装包。请检查Python版本是否为2.7系列。另外最新的pip官方源中的安装包默认是manylinux1标准,需要使用最新的pip (>9.0.0) 才可以安装。可以使用下面的命令更新您的pip:
.. code-block:: bash
pip install --upgrade pip
如果仍然存在问题,可以执行:
.. code-block:: bash
python -c "import pip; print(pip.pep425tags.get_supported())"
获取当前系统支持的安装包格式,并检查和需安装的包是否匹配。pypi安装包可以在 `这个 <https://pypi.python.org/pypi/paddlepaddle/0.10.5>`_ 链接中找到。
如果系统支持的是 linux_x86_64 而安装包是 manylinux1_x86_64 ,需要升级pip版本到最新; 如果系统支持 manylinux1_x86_64 而安装包(本地)是 linux_x86_64 ,可以重命名这个whl包为 manylinux1_x86_64 再安装。
Install using pip
================================
You can use current widely used Python package management
tool `pip <https://pip.pypa.io/en/stable/installing/>`_
to install PaddlePaddle. This method can be used in
most of current Linux systems or MacOS.
.. _pip_install:
Install using pip
------------------------------
Run the following command to install PaddlePaddle on the current
machine, it will also download requirements.
.. code-block:: bash
pip install paddlepaddle
The default version is 0.12.0 (cpu_avx_openblas). You can specify another version to suit your needs, for example:
.. code-block:: bash
pip install paddlepaddle==0.11.0
If you need to install a GPU-enabled version (cuda8.0_cudnn5_avx_openblas), you need to run:
.. code-block:: bash
pip install paddlepaddle-gpu
The default version is also 0.12.0. PaddlePaddle provides several package versions for different needs, as shown in the table below:
================================= ========================================
Version                           Description
================================= ========================================
paddlepaddle-gpu==0.12.0 0.12.0 built with CUDA 8.0 and cuDNN 5
paddlepaddle-gpu==0.11.0.post87 0.11.0 built with CUDA 8.0 and cuDNN 7
paddlepaddle-gpu==0.11.0.post8 0.11.0 built with CUDA 8.0 and cuDNN 5
paddlepaddle-gpu==0.11.0 0.11.0 built with CUDA 7.5 and cuDNN 5
================================= ========================================
You can find all versions released of paddlepaddle-gpu in `Release History <https://pypi.org/project/paddlepaddle-gpu/#history>`_ .
If you wish to install the latest PaddlePaddle from the develop branch,
you can download the latest whl package from our CI system. Access
the links below, log in as guest, then click the "Artifacts"
tab to find the download links of the whl packages.
If the links below show a login form, just click "Log in as guest" to start the download:
.. image:: paddleci.png
:scale: 50 %
:align: center
.. csv-table:: whl package of each version
:header: "version", "cp27-cp27mu", "cp27-cp27m"
:widths: 1, 3, 3
"cpu_avx_mkl", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/.lastSuccessful/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/.lastSuccessful/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cpu_avx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/.lastSuccessful/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/.lastSuccessful/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cpu_noavx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/.lastSuccessful/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/.lastSuccessful/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cuda8.0_cudnn5_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cuda8.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
"cuda9.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/.lastSuccessful/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
.. _pip_dependency:
Runtime Dependency
------------------------------
PaddlePaddle installation packages (whl) contain not only .py files,
but also binaries built from C++ code. We ensure that PaddlePaddle can
run on current mainline platforms, such as CentOS 6 or later, Ubuntu 14.04
or later, and MacOS 10.12 or later.
PaddlePaddle whl packages try to follow the
`manylinux1 <https://www.python.org/dev/peps/pep-0513/#the-manylinux1-policy>`_
standard, which uses CentOS 5 as the default build environment. However, CUDA libraries
require at least CentOS 6, and CentOS 5 is about to reach end of life,
so we use CentOS 6 as the default build environment.
.. csv-table:: PaddlePaddle Runtime Deps
:header: "Dependency", "version", "description"
:widths: 10, 15, 30
"OS", "Linux, MacOS", "CentOS 6 or later,Ubuntu 14.04 or later,MacOS 10.12 or later"
"Python", "2.7.x", "Currently Python3 is not supported"
"libc.so", "GLIBC_2.7", "glibc at least include GLIBC_2.7 symbols"
"libstdc++.so", "GLIBCXX_3.4.11, CXXABI_1.3.3", "At least include GLIBCXX_3.4.11, CXXABI_1.3.3 symbols"
"libgcc_s.so", "GCC_3.3", "At least include GCC_3.3 symbols"
.. _pip_faq:
FAQ
------------------------------
- paddlepaddle*.whl is not a supported wheel on this platform.
The main cause of this issue is that no installation package matching your
current platform was found. Please check that you are using a Python 2.7 series interpreter.
Also, packages on the official pypi source follow the manylinux1 standard, so you'll need to
upgrade your pip to a version newer than 9.0.0. You can do so with the command below:
.. code-block:: bash
pip install --upgrade pip
If the problem still exists, run the following command:
.. code-block:: bash
python -c "import pip; print(pip.pep425tags.get_supported())"
This prints the package tags supported on your system; check whether they match
the file name of the whl package. The default whl packages can be found
`here <https://pypi.python.org/pypi/paddlepaddle/0.10.5>`_ .
If your system supports linux_x86_64 but the whl package is manylinux1_x86_64,
you'll need to update pip to the latest version; if your system supports
manylinux1_x86_64 but the whl package is linux_x86_64, you can rename the
whl file to use the manylinux1_x86_64 suffix and then install it.
# API注释撰写标准
- [API注释撰写标准](#api注释撰写标准)
- [API注释模块](#api注释模块)
- [格式及示例](#格式及示例)
- [完整示例](#完整示例)
## API注释模块
API文档须包含以下几个模块(排列顺序为文档撰写顺序):
- Python API Definition
API的代码定义。
- Function Description
API的功能描述。描述该API的含义、作用或对输入所做的操作,及参考文献和对应链接(如果有),必要时给出公式,并解释公式中关键变量的含义。
- Args Description
API参数介绍。按代码定义中的参数顺序逐个介绍,介绍内容包含数据类型、默认值(如果有)、含义等。
- Returns
API返回值介绍。介绍返回值含义,必要时给出对应的形状。若返回值为包含多个参数的tuple,则按顺序逐个介绍各参数。
- Raises(如果有)
可能抛出的异常或错误及可能的产生原因,当可能抛出多种异常或错误时应分条列出。
- Note(如果有)
注意事项。当有多条注意事项时,应分条列出。
- Examples
API的使用示例。
## 格式及示例
API文档须使用reStructuredText格式撰写,该格式详情请参考[链接](http://sphinx-doc-zh.readthedocs.io/en/latest/rest.html)。API文档各模块的内容格式及示例如下(以下以fc为例进行说明):
- Python API Definition
- 格式:
[Python API Definition]
- 示例
```
fc(input,
size,
num_flatten_dims=1,
param_attr=None,
bias_attr=None,
act=None,
name=None,
main_program=None,
startup_program=None)
```
- Function Description
- 格式
本模块应包含以下内容(排列顺序为文档撰写顺序):
[Function Description]
[Formula]
[Symbols' Descriptions if necessary]
[References if necessary]
- 示例
[Function Description]
```
**Fully Connected Layer**
The fully connected layer can take multiple tensors as its inputs. It
creates a variable called weights for each input tensor, which represents
a fully connected weight matrix from each input unit to each output unit.
The fully connected layer multiplies each input tensor with its corresponding
weight to produce an output Tensor. If multiple input tensors are given,
the results of multiple multiplications will be summed up. If bias_attr is
not None, a bias variable will be created and added to the output. Finally,
if activation is not None, it will be applied to the output as well.
```
[Formula]
```
This process can be formulated as follows:
.. math::
Out = Act({\sum_{i=0}^{N-1}X_iW_i + b})
```
[Symbols' Descriptions if necessary]
```
In the above equation:
* :math:`N`: Number of the input.
* :math:`X_i`: The input tensor.
* :math:`W`: The weights created by this layer.
* :math:`b`: The bias parameter created by this layer (if needed).
* :math:`Act`: The activation function.
* :math:`Out`: The output tensor.
```
[References if necessary]
因fc没有必要列出的参考文献,故该内容省略。其他情况下需明确给出对应的参考文献和对应链接,以 layer_norm 为例:
```
Refer to `Layer Normalization <https://arxiv.org/pdf/1607.06450v1.pdf>`_ for more details.
```
- Args Description
- 格式
\[Arg's Name\][(Data Type, Default Value)][Description]
- 示例
fc的部分参数注释如下:
```
Args:
input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
the input tensor(s) is at least 2.
param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
parameters/weights of this layer.
name (str, default None): The name of this layer.
```
- Returns
- 格式
[Name][Shape]
- 示例
```
Returns:
A tensor variable storing the transformation result.
```
当返回值为包含多个参数的tuple时,应按顺序逐个介绍各参数,以dynamic_lstm为例:
```
Returns:
A tuple containing:
The hidden state of LSTM whose shape is (T X D).
The cell state of LSTM whose shape is (T X D).
```
- Raises
- 格式
[Exception Type][Condition]
- 示例
```
Raises:
ValueError: If the rank of the input is less than 2.
```
- Note
- 格式
[Note]
- 示例
fc没有注意事项,故该模块省略不写。如有注意事项应明确给出,当有多条注意事项,须分条列出,以scaled\_dot\_product\_attention为例:
```
Note:
1. When num_heads > 1, three linear projections are learned respectively
to map input queries, keys and values into queries', keys' and values'.
queries', keys' and values' have the same shapes with queries, keys
and values.
2. When num_heads == 1, scaled_dot_product_attention has no learnable
parameters.
```
- Examples
- 格式
\[Python Code Snippet]
- 示例
```
Examples:
.. code-block:: python
data = fluid.layers.data(name="data", shape=[32, 32], dtype="float32")
fc = fluid.layers.fc(input=data, size=1000, act="tanh")
```
## 完整示例
fc 的完整注释见[示例](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/src/fc.py)
# API Doc Standard
- [API Doc Standard](#api-doc-standard)
- [API Doc Structure](#api-doc-structure)
- [Format and Examples](#format-and-examples)
- [Complete Example](#complete-example)
## API Doc Structure
API docs should contain the following parts (please write them in this order):
- Python API Definition
The definition of API
- Function Description
Description of the API's function.
The description includes: the meaning, purpose and operation performed on the input, references and corresponding links (if any), the formula (if necessary) and explanations of key variables in the formula.
- Args Description
Description of API parameters.
Introduce parameters one by one according to the order in API definition.
The introduction includes: data type, default value (if any), meaning, etc.
- Returns
Introduction of the API's return value.
Introduce the meaning of the return value, and provide the corresponding shape if necessary.
If the return value is a tuple containing multiple parameters, introduce the parameters one by one in order.
- Raises (if any)
Exceptions or errors that may be raised, and their possible causes. If more than one exception or error may occur, they should be listed separately.
- Note (if any)
Matters needing attention. If there is more than one note, they should be listed separately.
- Examples
Examples of how to use API.
## Format and Examples
API documentation must be written in reStructuredText format; please refer to [here](http://sphinx-doc-zh.readthedocs.io/en/latest/rest.html).
The format and examples of each part of the API documentation are as follows (taking fc as an example):
- Python API Definition
- Format
[Python API Definition]
- Example
```
fc(input,
size,
num_flatten_dims=1,
param_attr=None,
bias_attr=None,
act=None,
name=None,
main_program=None,
startup_program=None)
```
- Function Description
- Format
This part contains (please write them in order):
[Function Description]
[Formula]
[Symbols' Descriptions if necessary]
[References if necessary]
- Example
[Function Description]
```
**Fully Connected Layer**
The fully connected layer can take multiple tensors as its inputs. It
creates a variable called weights for each input tensor, which represents
a fully connected weight matrix from each input unit to each output unit.
The fully connected layer multiplies each input tensor with its corresponding
weight to produce an output Tensor. If multiple input tensors are given,
the results of multiple multiplications will be summed up. If bias_attr is
not None, a bias variable will be created and added to the output. Finally,
if activation is not None, it will be applied to the output as well.
```
[Formula]
```
This process can be formulated as follows:
.. math::
Out = Act({\sum_{i=0}^{N-1}X_iW_i + b})
```
[Symbols' Descriptions if necessary]
```
In the above equation:
* :math:`N`: Number of the input.
* :math:`X_i`: The input tensor.
* :math:`W`: The weights created by this layer.
* :math:`b`: The bias parameter created by this layer (if needed).
* :math:`Act`: The activation function.
* :math:`Out`: The output tensor.
```
[References if necessary]
Since fc does not need a reference, it is omitted here. Under other circumstances, please provide an explicit reference and link; take layer_norm as an example:
```
Refer to `Layer Normalization <https://arxiv.org/pdf/1607.06450v1.pdf>`_ for more details.
```
- Args Description
- Format
\[Arg's Name\][(Data Type, Default Value)][Description]
- Example
Some of fc's parameter descriptions are as follows:
```
Args:
input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
the input tensor(s) is at least 2.
param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
parameters/weights of this layer.
name (str, default None): The name of this layer.
```
- Returns
- Format
[Name][Shape]
- Example
```
Returns:
A tensor variable storing the transformation result.
```
When the return value is a tuple containing multiple parameters, please introduce each parameter in order; take dynamic_lstm as an example:
```
Returns:
A tuple containing:
The hidden state of LSTM whose shape is (T X D).
The cell state of LSTM whose shape is (T X D).
```
- Raises
- Format
[Exception Type][Condition]
- Example
```
Raises:
ValueError: If the rank of the input is less than 2.
```
- Note
- Format
[Note]
- Example
There is no Note for fc, so this part is omitted. If there are any notes, please state them clearly. If there is more than one note, please list them separately. Take scaled\_dot\_product\_attention as an example:
```
Note:
1. When num_heads > 1, three linear projections are learned respectively
to map input queries, keys and values into queries', keys' and values'.
queries', keys' and values' have the same shapes with queries, keys
and values.
2. When num_heads == 1, scaled_dot_product_attention has no learnable
parameters.
```
- Examples
- Format
\[Python Code Snippet]
- Example
```
Examples:
.. code-block:: python
data = fluid.layers.data(name="data", shape=[32, 32], dtype="float32")
fc = fluid.layers.fc(input=data, size=1000, act="tanh")
```
## Complete Example
For the complete example of fc, please see [here](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/src/fc.py).
../../v2/dev/contribute_to_paddle_cn.md
../../v2/dev/contribute_to_paddle_en.md
开发标准
------------
.. toctree::
:maxdepth: 1
contribute_to_paddle_cn.md
write_docs_cn.md
api_doc_std_cn.md
new_op_cn.md
op_notes.md
new_op_kernel.md
use_eigen_cn.md
name_convention.md
support_new_device.md
releasing_process_cn.md
op_markdown_format.md
Development
------------
.. toctree::
:maxdepth: 1
contribute_to_paddle_en.md
write_docs_en.md
api_doc_std_en.md
new_op_kernel.md
use_eigen_en.md
name_convention.md
releasing_process_en.md
op_markdown_format.md
# Operator's Parameter Name Convention
To make the operator documentation clearer, we recommend that operator names obey the following conventions.
## OpProtoMaker names
When defining an operator in Paddle, a corresponding [OpProtoMaker](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/operator.h#L170) (TODO: OpProtoMaker Doc) needs to be defined. All the Inputs/Outputs and Attributes will be written into the [OpProto](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/framework.proto#L61), which will be used in the client language to create the operator.
- Input/Output.
- Input/Output names follow **CamelCase**, e.g. `X`, `Y`, `Matrix`, `LastAxisInMatrix`. Inputs/Outputs are much like Variables, so prefer meaningful English words.
- If an operator's Inputs/Outputs are tensors in a mathematical sense and do not match any meaningful words, input names should start from `X`, e.g. `X`, `Y`, and output names should start from `Out`, e.g. `Out`. This rule is intended to unify operators that have few inputs/outputs.
- Attribute.
- Attribute names follow **snake_case**, e.g. `x`, `y`, `axis`, `rowwise_matrix`. Likewise, prefer meaningful English words for attribute names.
- Comments.
- Input/Output/Attr comments follow the format **(type, default value) usage**, stating which type it can be and how it will be used in the operator. E.g. for the Accumulator attribute `"gamma"`: `(float, default 1.0) Accumulation multiplier`.
- Operator comments use the format `R"DOC(your comment here)DOC"`. You should explain the input/output of the operator first. If there is a mathematical calculation in this operator, you should write the equation in the comment, e.g. `Out = X + Y`.
- Order.
- Follow the order of Input/Output, then Attribute, then Comments. See the example in best practice.
## Best Practice
Here we give some examples to show how these rules will be used.
- An operator with one input and one output, e.g. `relu`: inputs: `X`; outputs: `Out`.
- An operator with two inputs and one output, e.g. `rowwise_add`: inputs: `X`, `Y`; outputs: `Out`.
- An operator with an attribute, e.g. `cosine`: inputs: `X`; attributes: `axis`; outputs: `Out`.
We give a full example of Accumulator Operator.
```c++
class AccumulateOpMaker : public framework::OpProtoAndCheckerMaker {
public:
AccumulateOpMaker(OpProto *proto,
OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor) The input tensor that has to be accumulated to the output tensor.
If the output size is not the same as input size,
the output tensor is first reshaped and initialized to zero, and only then, accumulation is done.");
AddOutput("Out", "(Tensor) Accumulated output tensor");
AddAttr<float>("gamma", "(float, default 1.0) Accumulation multiplier").SetDefault(1.0f);
AddComment(R"DOC(
Accumulate Operator.
This operator accumulates the input tensor to the output tensor. If the
output tensor already has the right size, we add to it; otherwise, we first
initialize the output tensor to all zeros, and then do accumulation. Any
further calls to the operator, given that no one else fiddles with the output
in the interim, will do simple accumulations.
Accumulation is done as follows:
Out = 1*X + gamma*Out
where X is the input tensor, Out is the output tensor and gamma is the multiplier
argument.
)DOC");
}
};
```
# 如何写新的operator
- [概念简介](#概念简介)
- [实现C++类](#实现c类)
- [定义ProtoMaker类](#定义protomaker类)
- [定义Operator类](#定义operator类)
- [定义OpKernel类](#定义opkernel类)
- [注册Operator](#注册operator)
- [编译](#编译)
- [绑定Python](#绑定python)
- [实现单元测试](#实现单元测试)
- [前向Operator单测](#前向operator单测)
- [反向Operator单测](#反向operator单测)
- [编译和执行](#编译和执行)
- [注意事项](#注意事项)
## 概念简介
简单介绍需要用到基类,详细介绍请参考设计文档。
- `framework::OperatorBase`: Operator(简写,Op)基类。
- `framework::OpKernel`: Op计算函数的基类,称作Kernel。
- `framework::OperatorWithKernel`:继承自OperatorBase,Op有计算函数,称作有Kernel。
- `class OpProtoAndCheckerMaker`:描述该Op的输入、输出、属性、注释,主要用于Python API接口生成
依据是否包含kernel,可以将Op分为两种:包含Kernel的Op和不包含kernel的Op,前者Op的定义继承自`OperatorWithKernel`,后者继承自`OperatorBase`。本教程主要介绍带Kernel的Op如何写,简单总结Op需要包含的内容如下:
<table>
<thead>
<tr>
<th>内容</th>
<th>定义位置</th>
</tr>
</thead>
<tbody>
<tr>
<td>OpProtoMake定义 </td>
<td>.cc 文件,Backward Op不需要定义OpProtoMaker </td>
</tr>
<tr>
<td>Op定义 </td>
<td> .cc 文件</td>
</tr>
<tr>
<td>Kernel实现 </td>
<td> CPU、CUDA共享Kernel实现在.h 文件中,否则,CPU 实现在.cc 文件中,CUDA 实现在.cu 文件中。</td>
</tr>
<tr>
<td>注册Op </td>
<td> Op注册实现在.cc 文件;Kernel注册CPU实现在.cc 文件中,CUDA实现在.cu 文件中</td>
</tr>
</tbody>
</table>
实现新的op都添加至目录[paddle/fluid/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators)下,文件命名以`*_op.h`(如有)、`*_op.cc`、`*_op.cu`(如有)结尾。**系统会根据文件名自动构建op和其对应的Python扩展。**
下面以矩阵乘操作,即[MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc)为例来介绍如何写带Kernel的Operator。
## 实现C++类
### 定义ProtoMaker类
矩阵乘法的公式:$Out = X * Y$, 可见该计算由两个输入,一个输出组成。
首先定义`ProtoMaker`来描述该Op的输入、输出,并添加注释:
```cpp
class MulOpMaker : public framework::OpProtoAndCheckerMaker {
public:
MulOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor), 2D tensor of size (M x K)");
AddInput("Y", "(Tensor), 2D tensor of size (K x N)");
AddOutput("Out", "(Tensor), 2D tensor of size (M x N)");
AddComment(R"DOC(
Two Element Mul Operator.
The equation is: Out = X * Y
)DOC");
}
};
```
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L76-L127)继承自`framework::OpProtoAndCheckerMaker`,构造函数含有2个参数:
- `framework::OpProto` : 前者存储Op的输入输出和参数属性,将用于Python API接口的生成。
- `framework::OpAttrChecker` :后者用于检查参数属性的合法性。
构造函数里通过`AddInput`添加输入参数,通过`AddOutput`添加输出参数,通过`AddComment`添加Op的注释。这些函数会将对应内容添加到`OpProto`中。
上面的代码在`MulOp`中添加两个输入`X`和`Y`,添加了一个输出`Out`,并解释了各自含义,命名请遵守[命名规范](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/name_convention.md)。
再以[`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc#L38-L55)为例:
```cpp
template <typename AttrType>
class ScaleOpMaker : public framework::OpProtoAndCheckerMaker {
public:
ScaleOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor) Input tensor of scale operator.");
AddOutput("Out", "(Tensor) Output tensor of scale operator.");
AddComment(R"DOC(
Scale operator
$$Out = scale*X$$
)DOC");
AddAttr<AttrType>("scale",
"(float, default 1.0)"
"The scaling factor of the scale operator.")
.SetDefault(1.0);
}
};
```
这个例子有`AddAttr<AttrType>("scale", "...").SetDefault(1.0);` : 增加`scale`系数,作为参数属性,并且设置默认值为1.0。
### 定义GradProtoMaker类
每个Op必须有一个对应的GradProtoMaker。若未定制对应前向Op的GradProtoMaker,fluid提供了DefaultGradProtoMaker,默认注册会使用全部输入输出,包括Input, Output, Output@Grad等,使用不需要的变量会造成显存浪费。
下面示例定义了ScaleOp的GradProtoMaker。
```cpp
class ScaleGradMaker : public framework::SingleGradOpDescMaker {
public:
using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
std::unique_ptr<framework::OpDesc> Apply() const override {
auto *grad_op = new framework::OpDesc();
grad_op->SetType("scale");
grad_op->SetInput("X", OutputGrad("Out"));
grad_op->SetOutput("Out", InputGrad("X"));
grad_op->SetAttr("scale", GetAttr("scale"));
return std::unique_ptr<framework::OpDesc>(grad_op);
}
};
```
### 定义Operator类
下面实现了MulOp的定义:
```cpp
class MulOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
protected:
void InferShape(const framework::InferShapeContext &ctx) const override {
// never use Input<Tensor> or Output<Tensor> if you want to get a LoDTensor.
auto dim0 = ctx.Input<LoDTensor>("X")->dims();
auto dim1 = ctx.Input<LoDTensor>("Y")->dims();
PADDLE_ENFORCE_EQ(dim0.size(), 2,
"input X(%s) should be a tensor with 2 dims, a matrix",
ctx.op_.Input("X"));
PADDLE_ENFORCE_EQ(dim1.size(), 2,
"input Y(%s) should be a tensor with 2 dims, a matrix",
ctx.op_.Input("Y"));
PADDLE_ENFORCE_EQ(
dim0[1], dim1[0],
"First matrix's width must be equal with second matrix's height.");
ctx.Output<LoDTensor>("Out")->Resize({dim0[0], dim1[1]});
}
};
```
[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L22)继承自`OperatorWithKernel`。`public`成员:
```cpp
using framework::OperatorWithKernel::OperatorWithKernel;
```
这句表示使用基类`OperatorWithKernel`的构造函数,也可写成:
```cpp
MulOp(const std::string &type, const framework::VariableNameMap &inputs,
const framework::VariableNameMap &outputs,
const framework::AttributeMap &attrs)
: OperatorWithKernel(type, inputs, outputs, attrs) {}
```
还需要重写`InferShape`接口。`InferShape`为const函数,不能修改Op的成员变量,参数为`const framework::InferShapeContext &ctx`,通过该参数可获取到输入输出以及属性。它的功能是:
- 做检查, 尽早报错:检查输入数据维度、类型等是否合法。
- 设置输出Tensor的形状。
通常`OpProtoMaker`和`Op`类的定义写在`.cc`文件中,和下面将要介绍的注册函数一起放在`.cc`文件中。
### 定义OpKernel类
`MulKernel`继承自`framework::OpKernel`,带有下面两个模板参数:
- `typename DeviceContext`: 表示设备类型,不同设备(CPU、CUDA)共享同一个Kernel时,需加该模板参数,不共享则不加,一个不共享的例子是[`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.h#L43)
- `typename T` : 表示数据类型,如`float`, `double`等。
需要为`MulKernel`类重写`Compute`接口。
- `Compute`接受一个输入参数:`const framework::ExecutionContext& context`
- 与`InferShapeContext`相比,`ExecutionContext`增加了设备类型,同样可获取到输入输出和属性参数。
- `Compute`函数里实现`OpKernel`的具体计算逻辑。
Op的输入和输出可分别通过`ExecutionContext::Input<T>()`和`ExecutionContext::Output<T>()`获得。
**注意:** 若op的输入/输出的变量类型是`LoDTensor`(fluid默认所有的Tensor都是LoDTensor类型),请写成`ExecutionContext::Input<LoDTensor>()`和`ExecutionContext::Output<LoDTensor>()`,不要写`ExecutionContext::Input<Tensor>()`和`ExecutionContext::Output<Tensor>()`。因为若实际的变量类型为`SelectedRows`,`Input<Tensor>()`和`Output<Tensor>()`方法会将`SelectedRows`类型特化为`Tensor`,导致潜在的错误。
下面是 `MulKernel` 的 `Compute` 的实现:
```cpp
template <typename DeviceContext, typename T>
class MulKernel : public framework::OpKernel {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto* X = context.Input<LoDTensor>("X");
auto* Y = context.Input<LoDTensor>("Y");
auto* Z = context.Output<LoDTensor>("Out");
Z->mutable_data<T>(context.GetPlace());
auto& device_context = context.template device_context<DeviceContext>();
math::matmul<DeviceContext, T>(*X, false, *Y, false, 1, Z, 0, device_context);
}
};
```
需要注意:**不同设备(CPU、CUDA)共享一个Op定义,是否共享同一个`OpKernel`,取决于`Compute`调用的函数是否支持不同设备。**
`MulOp`的CPU、CUDA实现共享同一个`Kernel`。`OpKernel`不共享的例子可以参考:[`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.h#L43)。
为了使`OpKernel`的计算过程书写更加简单,并且CPU、CUDA的代码可以复用,我们通常借助 Eigen unsupported Tensor模块来实现`Compute`接口。关于在PaddlePaddle中如何使用Eigen库,请参考[使用文档](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/dev/use_eigen_cn.md)
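下面给出一个借助 Eigen 实现 `Compute` 的简化示意(以类似 scale 的逐元素计算为例,仅供参考,具体写法请以仓库中 `scale_op.h` 等实际代码为准,示例中的类名为举例):
```cpp
// 仅为示意:用 EigenVector 把输入/输出摊平成一维,在对应设备上完成逐元素计算,
// 这样同一份代码可同时作为 CPU 和 CUDA 的 Kernel 实现。
template <typename DeviceContext, typename T>
class ScaleLikeKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    auto* in = context.Input<framework::LoDTensor>("X");
    auto* out = context.Output<framework::LoDTensor>("Out");
    out->mutable_data<T>(context.GetPlace());

    auto scale = static_cast<T>(context.Attr<float>("scale"));
    auto eigen_in = framework::EigenVector<T>::Flatten(*in);
    auto eigen_out = framework::EigenVector<T>::Flatten(*out);
    auto& place =
        *context.template device_context<DeviceContext>().eigen_device();
    eigen_out.device(place) = scale * eigen_in;
  }
};
```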
到此,前向Op实现完成。接下来,需要在`.cc`文件中注册该op和kernel。
反向Op类的定义,反向OpKernel的定义与前向Op类似,这里不再赘述。**但需注意反向Op没有`ProtoMaker`**
### 注册Operator
- 在`.cc`文件中注册前向、反向Op类,注册CPU Kernel。
```cpp
namespace ops = paddle::operators;
REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>)
REGISTER_OPERATOR(mul_grad, ops::MulGradOp)
REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>);
```
在上面的代码中:
- `REGISTER_OPERATOR` : 注册`ops::MulOp`类,类型名为`mul`,该类的`ProtoMaker`为`ops::MulOpMaker`,注册`ops::MulOpGrad`,类型名为`mul_grad`。
- `REGISTER_OP_CPU_KERNEL` :注册`ops::MulKernel`类,并特化模板参数为`paddle::platform::CPUPlace`和`float`类型,同理,注册`ops::MulGradKernel`类。
- 在`.cu`文件中注册CUDA Kernel。
- 请注意,如果CUDA Kernel的实现基于Eigen unsupported模块,那么在 `.cu`的开始请加上宏定义 `#define EIGEN_USE_GPU`,代码示例如下:
```cpp
// if use Eigen unsupported module before include head files
#define EIGEN_USE_GPU
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel<paddle::platform::CUDADeviceContext, float>);
REGISTER_OP_CUDA_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
```
### 编译
运行下面命令可以进行编译:
```
make mul_op
```
## 绑定Python
系统会对新增的op自动绑定Python,并链接到生成的lib库中。
## 实现单元测试
单测包括对比前向Op不同设备(CPU、CUDA)的实现、对比反向Op不同设备(CPU、CUDA)的实现、反向Op的梯度测试。下面介绍[`MulOp`的单元测试](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_mul_op.py)。
### 前向Operator单测
Op单元测试继承自`OpTest`。各项更加具体的单元测试在`TestMulOp`里完成。测试Operator,需要:
1. 在`setUp`函数中定义输入、输出,以及相关的属性参数。
2. 生成随机的输入数据。
3. 在Python脚本中实现与前向operator相同的计算逻辑,得到输出值,与operator前向计算的输出进行对比。
4. 反向计算已经自动集成进测试框架,直接调用相应接口即可。
```python
import unittest
import numpy as np
from op_test import OpTest
class TestMulOp(OpTest):
def setUp(self):
self.op_type = "mul"
self.inputs = {
'X': np.random.random((32, 84)).astype("float32"),
'Y': np.random.random((84, 100)).astype("float32")
}
self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}
def test_check_output(self):
self.check_output()
def test_check_grad_normal(self):
self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
def test_check_grad_ingore_x(self):
self.check_grad(
['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
def test_check_grad_ingore_y(self):
self.check_grad(
['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
```
上面的代码首先导入依赖的包,下面是对`setUp`函数中操作的重要变量的详细解释:
- `self.op_type = "mul" ` : 定义类型,与operator注册时注册的类型一致。
- `self.inputs` : 定义输入,类型为`numpy.array`,并初始化。
- `self.outputs` : 定义输出,并在Python脚本中完成与operator同样的计算逻辑,返回Python端的计算结果。
### 反向operator单测
而反向测试中:
- `test_check_grad_normal`中调用`check_grad`使用数值法检测梯度正确性和稳定性。
- 第一个参数`["X", "Y"]` : 指定对输入变量`X`和`Y`做梯度检测。
- 第二个参数`"Out"` : 指定前向网络最终的输出目标变量`Out`
- 第三个参数`max_relative_error`:指定检测梯度时能容忍的最大错误值。
- `test_check_grad_ingore_x`和`test_check_grad_ingore_y`分支用来测试只需要计算一个输入梯度的情况。
### 编译和执行
`python/paddle/fluid/tests/unittests/` 目录下新增的 `test_*.py` 单元测试会被自动加入工程进行编译。
请注意,**不同于Op的编译测试,运行单元测试时需要编译整个工程**,并且编译时需要打开`WITH_TESTING`, 即`cmake paddle_dir -DWITH_TESTING=ON`。编译成功后,执行下面的命令来运行单元测试:
```bash
make test ARGS="-R test_mul_op -V"
```
或者:
```bash
ctest -R test_mul_op
```
## 注意事项
- 注册Op时的类型名,需要和该Op的名字一样。即不允许在`A_op.cc`里面,注册`REGISTER_OPERATOR(B, ...)`等,这将会导致单元测试出错。
- 如果Op没有实现CUDA Kernel,请不要创建空的`*_op.cu`,这将会导致单元测试出错。
- 如果多个Op依赖一些共用的函数,可以创建非`*_op.*`格式的文件来存放,如`gather.h`文件。
### PADDLE_ENFORCE使用注意
实现Op时检查数据的合法性需要使用PADDLE_ENFORCE以及PADDLE_ENFORCE_EQ等宏定义,基本格式如下:
```
PADDLE_ENFORCE(表达式, 错误提示信息)
PADDLE_ENFORCE_EQ(比较对象A, 比较对象B, 错误提示信息)
```
如果表达式为真,或者比较对象A=B,则检查通过,否则会终止程序运行,向用户反馈相应的错误提示信息。
为了确保提示友好易懂,开发者需要注意其使用方法。
#### 总体原则
任何使用了PADDLE_ENFORCE与PADDLE_ENFORCE_XX检查的地方,必须有详略得当的备注解释!**错误提示信息不能为空!**
#### 提示信息书写标准
1. [required] 哪里错了?为什么错了?
- 例如:`ValueError: Mismatched label shape`
2. [optional] 期望的输入是什么样的?实际的输入是怎样的?
- 例如:`Expected labels dimension=1. Received 4.`
3. [optional] 能否给出修改意见?
- 例如:`Suggested Fix:If your classifier expects one-hot encoding label,check your n_classes argument to the estimatorand/or the shape of your label.Otherwise, check the shape of your label.`
如果并非必要或者简洁的描述即可表达清楚以上要点,根据情况书写亦可。
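下面是一个把上述三点合并书写的示意(其中的变量名与维度仅为举例):
```
// 示意:错误原因 + 期望/实际输入 + 修改建议(label_dims 仅为举例)
PADDLE_ENFORCE_EQ(label_dims.size(), 1,
                  "ValueError: Mismatched label shape. "
                  "Expected labels dimension=1. Received %d. "
                  "Suggested Fix: check the n_classes argument and/or the shape of your label.",
                  label_dims.size());
```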
#### FAQ 典型问题
1. 无报错信息或报错信息过于简单,不能给用户提供有效的提示!
问题示例1 :未写提示信息
```
PADDLE_ENFORCE(ctx->HasInput("X"), "");
```
问题示例2 :提示信息过于简单
```
PADDLE_ENFORCE(i != nullptr, "i must be set"); // i是什么?
```
2. 在报错信息中使用开发人员定义的变量缩写,不易理解!
问题示例:
```
PADDLE_ENFORCE(forward_pd != nullptr,
"Fail to find eltwise_fwd_pd in device context"); //eltwise_fwd_pd用户可能看不懂
```
3. OP内部调用非法接口:Op内部如果出现Output = ShareDataWith(Input)
问题示例:
```cpp
auto *out = ctx.Output<framework::LoDTensor>("Out");
auto *in = ctx.Input<framework::LoDTensor>("X");
out->ShareDataWith(*in);
```
Op内部如果出现Output = ShareDataWith(Input),相当于operator图中有一条隐藏边,连接了Input和Output,这条边无法在图分析中表达,会引发基于图优化的错误。一种常见的替代做法见下面的示意。
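改用显式拷贝可以让 Input 到 Output 的数据流在图中可见(以下仅为示意,具体请以 `framework::TensorCopy` 的实际接口为准):
```cpp
// 示意:用显式拷贝代替 ShareDataWith(TensorCopy 会根据源 tensor 调整目标的形状并分配内存)
auto *out = ctx.Output<framework::LoDTensor>("Out");
auto *in = ctx.Input<framework::LoDTensor>("X");
framework::TensorCopy(*in, ctx.GetPlace(), ctx.device_context(), out);
```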
4. OP实现的性能实践
调用eigen的broadcast、chop等操作,性能会比手写cuda kernel差几倍以上。此时cpu的实现可以复用eigen,gpu实现则可以手写cuda kernel。
#### OP InferShape检查提示信息特别说明
- 检查输入输出变量,请统一遵循以下格式
`Input(变量名) of OP名 operator should not be null.`
正确示例:
```
PADDLE_ENFORCE(ctx->HasInput("Input"),
"Input(Input) of LSTMP operator should not be null.");
```
- 反向Op的输入输出检查,要写明反向Op的名字
正确示例:
```
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of LoDResetGrad opreator should not be null.");
```
# Add Kernels for a New Device
## Background
PaddlePaddle Fluid have hundreds of operators. Each operator could have one or more kernels. A kernel is an implementation of the operator for a certain device, which could be a hardware device, e.g., the CUDA GPU, or a library that utilizes a device, e.g., Intel MKL that makes full use of the Xeon CPU.
[This document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/new_op_en.md) explains how to add an operator, and its kernels. The kernels of an operator are indexed by a C++ type [`OpKernelType`](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/multi_devices/operator_kernel_type.md). An operator chooses the right kernel at runtime. This choosing mechanism is described [here](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/execution/switch.md).
## Write Kernels for A New Device
### Add A New Device
For some historical reasons, we misuse the word *library* for *device*. For example, we call the device type the *library type*. An example is the header file [`library_type.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/library_type.h#L24). We will correct this ASAP.
To register a new device, we need to add an enum value to `LibraryType`:
```
enum class LibraryType {
kPlain = 0,
kMKLDNN = 1,
kCUDNN = 2,
};
```
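For example, support for a hypothetical FPGA library could be sketched by extending the enum (the `kFPGA` value below is an assumption for illustration, not an existing Fluid entry):
```
enum class LibraryType {
  kPlain = 0,
  kMKLDNN = 1,
  kCUDNN = 2,
  kFPGA = 3,  // hypothetical new device/library added here
};
```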
### Add A New [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/place.h#L53)
If you have a new kind of Device, firstly you need to add a new kind of [`Place`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/place.h#L53). For example `CUDAPlace`:
```cpp
struct CUDAPlace {
CUDAPlace() : CUDAPlace(0) {}
explicit CUDAPlace(int d) : device(d) {}
inline int GetDeviceId() const { return device; }
// needed for variant equality comparison
inline bool operator==(const CUDAPlace &o) const {
return device == o.device;
}
inline bool operator!=(const CUDAPlace &o) const { return !(*this == o); }
int device;
};
typedef boost::variant<CUDAPlace, CPUPlace> Place;
```
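Continuing the hypothetical FPGA example, the new place could mirror `CUDAPlace` and be added to the `Place` variant (a sketch only; `FPGAPlace` is not existing Fluid code):
```cpp
// Sketch of a hypothetical FPGAPlace, mirroring CUDAPlace above.
struct FPGAPlace {
  FPGAPlace() : FPGAPlace(0) {}
  explicit FPGAPlace(int d) : device(d) {}
  inline int GetDeviceId() const { return device; }
  // needed for variant equality comparison
  inline bool operator==(const FPGAPlace &o) const { return device == o.device; }
  inline bool operator!=(const FPGAPlace &o) const { return !(*this == o); }
  int device;
};

// The new place also has to be added to the Place variant:
typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place;
```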
### Add [device context](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/device_context.h#L37)
After a new kind of Device is added, you should add a corresponding [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/device_context.h#L37) for it.
```cpp
class DeviceContext {
public:
virtual ~DeviceContext() {}
virtual Place GetPlace() const = 0;
virtual void Wait() const {}
};
```
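For the hypothetical FPGA device above, a minimal context could subclass `DeviceContext` and own the device's resources (a sketch under the same assumption):
```cpp
// Sketch of a device context for the hypothetical FPGAPlace.
class FPGADeviceContext : public DeviceContext {
 public:
  explicit FPGADeviceContext(FPGAPlace place) : place_(place) {}
  Place GetPlace() const override { return place_; }
  void Wait() const override { /* block until all queued FPGA work has finished */ }

 private:
  FPGAPlace place_;  // device id; command queues / library handles would also live here
};
```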
### Implement new [OpKernel](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/operator.h#L351) for your Device.
A detailed documentation can be found in [`new_op_and_kernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/new_op_en.md)
```cpp
class OpKernelBase {
public:
/**
* ExecutionContext is the only parameter of Kernel Run function.
* Run will get input/output variables, state such as momentum and
* device resource such as CUDA stream, cublas handle, etc. from
* ExecutionContext. User should construct it before run the Operator.
*/
virtual void Compute(const ExecutionContext& context) const = 0;
virtual ~OpKernelBase() = default;
};
template <typename T>
class OpKernel : public OpKernelBase {
public:
using ELEMENT_TYPE = T;
};
```
### Register the OpKernel to framework
After writing the components described above, we should register the kernel to the framework.
We use `REGISTER_OP_KERNEL` to do the registration.
```cpp
REGISTER_OP_KERNEL(
op_type,
library_type,
place_type,
kernel0, kernel1, ...)
```
kernel0, kernel1 are kernels that have the same `op_type`, `library_type`, `place_type` but different `data_types`.
take [`conv2d`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_cudnn_op.cu.cc#L318) as an example:
```cpp
REGISTER_OP_KERNEL(conv2d, CPU, paddle::platform::CPUPlace,
paddle::operators::GemmConvKernel<paddle::platform::CPUDeviceContext, float>,
paddle::operators::GemmConvKernel<paddle::platform::CPUDeviceContext, double>);
REGISTER_OP_KERNEL(conv2d, CUDNN, ::paddle::platform::CUDAPlace,
paddle::operators::CUDNNConvOpKernel<float>,
paddle::operators::CUDNNConvOpKernel<double>);
```
In the code above:
- `conv2d` is the type/name of the operator
- `CUDNN/CPU` is `library`
- `paddle::platform::CUDAPlace/CPUPlace` is `place`
- template parameter `float/double` on `CUDNNConvOpKernel<T>` is `data_type`.
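Continuing the hypothetical FPGA example from the sections above, a registration might look like this (a sketch; the `FPGA` library type, `FPGAPlace` and `FPGAConvOpKernel` are illustrative assumptions, not existing Fluid code):
```cpp
// Sketch: register conv2d kernels for the hypothetical FPGA library/place.
REGISTER_OP_KERNEL(conv2d, FPGA, ::paddle::platform::FPGAPlace,
                   paddle::operators::FPGAConvOpKernel<float>,
                   paddle::operators::FPGAConvOpKernel<double>);
```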
# Standard Markdown Format for Operators
The following should be the standard format for documentation for all the operators that will get rendered in the `html`:
```
Operator Name (In PaddlePaddle)
Operator Name (Standard)
Operator description.
LaTeX equation of how the operator performs an update.
The signature of the operator.
```
Each section mentioned above has been covered in further detail in the rest of the document.
## PaddlePaddle Operator Name
This should be in all small letters, in case of multiple words, we separate them with an underscore. For example:
`array to lod tensor` should be written as `array_to_lod_tensor`.
This naming convention should be standard across all PaddlePaddle operators.
## Standard Operator Name
This is the standard name of the operator as used in the community. The general standard is usually:
- Standard abbreviations like `SGD` are written in all capital letters.
- Operator names that have multiple words inside a single word use `camelCase` (capitalize word boundaries inside of a word).
- Keep numbers inside a word as is, with no boundary delimiters.
- Follow the name of the operator with the keyword: `Activation Operator.`
## Operator description
This section should contain the description of what the operator does, including the operation performed, the literature from where it comes and was introduced first, and other important details. The relevant paper/article including the hyperlink should be cited in this section.
## LaTeX equation
This section should contain an overall equation of the update or operation that the operator performs. The variables used in the equation should follow the naming convention of operators as described [here](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/name_convention.md). Two words in the same variable name should be separated by an underscore (`_`).
## The signature
This section describes the signature of the operator. A list of Inputs and Outputs, each of which have a small description of what the variable represents and the type of variable. The variable names follow the `CamelCase` naming convention. The proposed format for this is:
```
Section :
VariableName : (VariableType) VariableDescription
...
...
```
The following example for an `sgd` operator covers the above mentioned sections as they would ideally look like in the `html`:
```
sgd
SGD operator
This operator implements one step of the stochastic gradient descent algorithm.
param_out = param - learning_rate * grad
Inputs:
Param : (Tensor) Input parameter
LearningRate : (Tensor) Learning rate of SGD
Grad : (Tensor) Input gradient
Outputs:
ParamOut : (Tensor) Output parameter
```
# PaddlePaddle发行规范
PaddlePaddle使用Trunk Based Development,使用[Semantic Versioning](http://semver.org/)标准表示PaddlePaddle版本号。
PaddlePaddle每次发新的版本,遵循以下流程:
1.`develop`分支派生出新的分支,分支名为`release/版本号`。例如,`release/0.10.0`
2. 将新分支的版本打上tag,tag为`版本号rc-Patch号`。例如,第一个tag为`0.10.0-rc0`
3. 新分支一般不接受新的feature和优化。QA在release分支上进行测试。研发基于最新的develop开发。
4. QA和研发发现的bug,在develop上修复验证后,cherry-pick修复到release分支。直到release分支相对稳定。
5. 如果有需要,在release分支最新代码上打上新的tag,比如`0.10.0-rc1`,让更多的用户加入测试。重复3-4步。
6. release分支稳定后,打上正式的release tag,比如`0.10.0`
7. 将这个版本的python wheel包发布到pypi。
8. 更新Docker镜像(参考后面的操作细节)。
需要注意的是:
* bug修复需要先在develop上进行,然后进入release分支。而不是直接在release分支上开发。
* release分支原则上只接受修复类的修改,不接受新feature。
## 发布wheel包到pypi
1. 使用[PaddlePaddle CI](https://paddleci.ngrok.io/project.html?projectId=Manylinux1&tab=projectOverview)
完成自动化二进制编译,参考下图,选择需要发布的版本(通常包含一个CPU版本和一个GPU版本),点击"run"右侧的"..."按钮,可以
弹出下面的选择框,在第二个tab (Changes)里选择需要发布的分支,这里选择0.11.0,然后点击"Run Build"按钮。
<img src="https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/fluid/images/ci_build_whl.png">
1. 等待编译完成后可以在此页面的"Artifacts"下拉框中找到生成的3个二进制文件,分别对应CAPI、`cp27m`和`cp27mu`的版本。
1. 由于pypi.python.org目前遵循[严格的命名规范PEP 513](https://www.python.org/dev/peps/pep-0513),在使用twine上传之前,需要重命名wheel包中platform相关的后缀,比如将`linux_x86_64`修改成`manylinux1_x86_64`
1. 上传:
```
cd build/python
pip install twine
twine upload dist/[package to upload]
```
* 注:CI环境使用 https://github.com/PaddlePaddle/buildtools 这里的DockerImage作为编译环境以支持更多的Linux
发行版,如果需要手动编译,也可以使用这些镜像。这些镜像也可以从 https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/ 下载得到。
* pypi不支持覆盖上传,所以一个版本号的wheel包发布之后,不可以更改。下一个wheel包需要更新版本号才可以上传。
## 发布Docker镜像
上述PaddlePaddle CI编译wheel完成后会自动将Docker镜像push到DockerHub,所以,发布Docker镜像只需要对自动push的镜像打上
版本号对应的tag即可:
```
docker pull [镜像]:latest
docker tag [镜像]:latest [镜像]:[version]
docker push [镜像]:[version]
```
需要更新的镜像tag包括:
* `[version]`: CPU版本
* `[version]-openblas`: openblas版本
* `[version]-gpu`: GPU版本(CUDA 8.0 cudnn 5)
* `[version]-gpu-[cudaver]-[cudnnver]`: 不同cuda, cudnn版本的镜像
之后可进入 https://hub.docker.com/r/paddlepaddle/paddle/tags/ 查看是否发布成功。
## PaddlePaddle 分支规范
PaddlePaddle开发过程使用[Trunk Based Development](https://trunkbaseddevelopment.com/) 开发规范。
* `develop`分支为开发(develop branch)版本分支。每一个`develop`分支的版本都经过单元测试。并且会经过模型回归测试。
* `release/版本号`分支为每一次Release时建立的临时分支。release分支主要用于测试,bug修复和最终发版。
* `master`分支因为历史原因,已经废弃。
* 其他开发者fork的feature branch。
* 建议,开发者的feature branch需要同步主版本库的`develop`分支。
* 建议,开发者的feature branch需要基于主版本库中的`develop`分支。
* 当feature branch开发完毕后,向PaddlePaddle的主版本库提交`Pull Request`,进而进行代码评审。
* 在评审过程中,开发者修改自己的代码,可以继续在自己的feature branch提交代码。
## PaddlePaddle回归测试列表
TODO
### PaddlePaddle Book中所有章节
PaddlePaddle每次发版本首先要保证PaddlePaddle Book中所有章节功能的正确性。功能的正确性包括验证PaddlePaddle目前的`paddle_trainer`训练和纯使用`Python`训练(V2和Fluid)模型正确性。
<table>
<thead>
<tr>
<th></th>
<th>新手入门章节 </th>
<th> 识别数字</th>
<th> 图像分类</th>
<th>词向量</th>
<th> 情感分析</th>
<th>语意角色标注</th>
<th> 机器翻译</th>
<th>个性化推荐</th>
</tr>
</thead>
<tbody>
<tr>
<td>API.V2 + Docker + GPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> API.V2 + Docker + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>`paddle_trainer` + Docker + GPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>`paddle_trainer` + Docker + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> API.V2 + Ubuntu + GPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>API.V2 + Ubuntu + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> `paddle_trainer` + Ubuntu + GPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> `paddle_trainer` + Ubuntu + CPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
# PaddlePaddle Releasing Process
PaddlePaddle manages its branches using Trunk Based Development, and uses [Semantic Versioning](http://semver.org/) as its version numbering scheme.
Each time we release a new PaddlePaddle version, we should follow the below steps:
1. Create a new release branch from `develop`, named `release/[version]`, e.g. `release/0.10.0`.
2. Create a new tag on the release branch in the format `[version]-rc[patch number]`, e.g. the first tag is `0.10.0-rc0`.
3. The new release branch normally doesn't accept new features or optimizations. QA tests on the release branch, while developers keep working on the `develop` branch.
4. If QA or developers find bugs, they should first fix and verify them on the `develop` branch, then cherry-pick the fixes to the release branch, until the release branch is stable.
5. If necessary, create a new tag on the release branch, e.g. `0.10.0-rc1`, involve more users to try it, and repeat steps 3-4.
6. After the release branch is stable, create the official release tag, such as `0.10.0`.
7. Release the python wheel package to pypi.
8. Update the docker image (More details below).
NOTE:
* Bug fixes should happen on the `develop` branch and then be cherry-picked to the release branch. Avoid developing directly on the release branch.
* A release branch normally only accepts bug fixes, not new features.
## Publish Wheel Packages to pypi
1. Use our [CI tool](https://paddleci.ngrok.io/project.html?projectId=Manylinux1&tab=projectOverview)
to build all wheel packages needed to publish. As shown in the following picture, choose a build
version, click "..." button on the right side of "Run" button, and switch to the second tab in the
pop-up box, choose the current release branch and click "Run Build" button. You may repeat this
step to start different versions of builds.
<img src="https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/fluid/images/ci_build_whl.png">
1. After the build succeeds, download the outputs under "Artifacts" including capi, `cp27m` and `cp27mu`.
1. Since pypi.python.org follows [PEP 513](https://www.python.org/dev/peps/pep-0513), before we
upload the package using `twine`, we need to rename the package from `linux_x86_64` to
`manylinux1_x86_64`.
1. Start the upload:
```
cd build/python
pip install twine
twine upload dist/[package to upload]
```
* NOTE: We use a special Docker image to build our releases to support more Linux distributions, you can
download it from https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/, or build it using
scripts under `tools/manylinux1`.
* pypi does not allow overwriting an already uploaded version of a wheel package, even if you delete the
old version. You must bump the version number before uploading a new one.
### Publish wheel Packages for MacOS
You need to build the binary wheel package for MacOS before publishing. To
make sure that the package can be used by many versions of MacOS
(10.11, 10.12, 10.13) and different python installs (python.org, homebrew, etc.),
you must build the package following ***exactly*** the steps below:
Build steps:
1. install python from python.org downloads, and make sure it's currently in use
in your system.
1. `export MACOSX_DEPLOYMENT_TARGET=10.11`; using `10.11` is enough for recent versions.
1. `git clone https://github.com/PaddlePaddle/Paddle.git && cd Paddle && mkdir build && cd build`
1. `cmake -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_SYSTEM_BLAS=OFF ..`, make sure the output of `cmake` command is using the correct python interpreter installed from python.org
1. `make -j`
1. `pip install delocate`
1. `mkdir fixed_wheel && delocate-wheel -w fixed_wheel python/dist/*.whl`
Then the whl under `fixed_wheel` is ready to upload.
Install steps:
1. run `pip install paddlepaddle...whl`
1. find the `libpython.dylib` that is currently in use:
- for python.org package installs, do nothing.
- for other python installs, find the path of `libpython*.dylib` and `export LD_LIBRARY_PATH=your path && DYLD_LIBRARY_PATH=your path`
## Publish Docker Images
Our CI tool will push latest images to DockerHub, so we only need to push a version tag like:
```
docker pull [image]:latest
docker tag [image]:latest [image]:[version]
docker push [image]:[version]
```
Tags that need to be updated are:
* `[version]`: CPU only version image
* `[version]-openblas`: openblas version image
* `[version]-gpu`: GPU version(using CUDA 8.0 cudnn 5)
* `[version]-gpu-[cudaver]-[cudnnver]`: tag for different cuda, cudnn versions
You can then checkout the latest pushed tags at https://hub.docker.com/r/paddlepaddle/paddle/tags/.
## Branching Model
PaddlePaddle uses [Trunk Based Development](https://trunkbaseddevelopment.com/) as our branching model.
* The `develop` branch is used for development. Each commit to the `develop` branch goes through unit tests and model regression tests.
* A `release/[version]` branch is created for each release. The release branch is used for testing, bug fixes and the eventual release.
* The `master` branch has been deprecated for historical reasons.
* Developers' feature branches:
* Developer's feature branch should sync with upstream `develop` branch.
* Developer's feature branch should be forked from upstream `develop` branch.
* After feature branch is ready, create a `Pull Request` against the Paddle repo and go through code review.
* During review, developers modify their code and push to their own feature branches.
## PaddlePaddle Regression Test List
TODO
### All Chapters of PaddlePaddle Book
We need to guarantee that all the chapters of the PaddlePaddle Book run correctly, including
V1 (`paddle_trainer`) training, V2 training and Fluid training.
<table>
<thead>
<tr>
<th></th>
<th>Linear Regression</th>
<th>Recognize Digits</th>
<th>Image Classification</th>
<th>Word2Vec</th>
<th>Personalized Recommendation</th>
<th>Sentiment Analysis</th>
<th>Semantic Role Labeling</th>
<th>Machine Translation</th>
</tr>
</thead>
<tbody>
<tr>
<td>API.V2 + Docker + GPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> API.V2 + Docker + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>`paddle_trainer` + Docker + GPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>`paddle_trainer` + Docker + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> API.V2 + Ubuntu + GPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>API.V2 + Ubuntu + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> `paddle_trainer` + Ubuntu + GPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> `paddle_trainer` + Ubuntu + CPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
def fc(input,
size,
num_flatten_dims=1,
param_attr=None,
bias_attr=None,
act=None,
name=None):
"""
**Fully Connected Layer**
The fully connected layer can take multiple tensors as its inputs. It
creates a variable called weights for each input tensor, which represents
a fully connected weight matrix from each input unit to each output unit.
The fully connected layer multiplies each input tensor with its corresponding
weight to produce an output Tensor. If multiple input tensors are given,
the results of multiple multiplications will be summed up. If bias_attr is
not None, a bias variable will be created and added to the output. Finally,
if activation is not None, it will be applied to the output as well.
This process can be formulated as follows:
.. math::
Out = Act({\sum_{i=0}^{N-1}X_iW_i + b})
In the above equation:
* :math:`N`: Number of the input.
* :math:`X_i`: The input tensor.
* :math:`W`: The weights created by this layer.
* :math:`b`: The bias parameter created by this layer (if needed).
* :math:`Act`: The activation function.
* :math:`Out`: The output tensor.
Args:
input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
the input tensor(s) is at least 2.
size(int): The number of output units in this layer.
num_flatten_dims (int, default 1): The fc layer can accept an input tensor with more than
two dimensions. If this happens, the multidimensional tensor will first be flattened
into a 2-dimensional matrix. The parameter `num_flatten_dims` determines how the input
tensor is flattened: the first `num_flatten_dims` (inclusive, index starts from 1)
dimensions will be flatten to form the first dimension of the final matrix (height of
the matrix), and the rest `rank(X) - num_flatten_dims` dimensions are flattened to
form the second dimension of the final matrix (width of the matrix). For example, suppose
`X` is a 5-dimensional tensor with a shape [2, 3, 4, 5, 6], and `num_flatten_dims` = 3.
Then, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] = [24, 30].
param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
parameters/weights of this layer.
bias_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for the bias
of this layer. If it is set to None, no bias will be added to the output units.
act (str, default None): Activation to be applied to the output of this layer.
name (str, default None): The name of this layer.
Returns:
A tensor variable storing the transformation result.
Raises:
ValueError: If rank of the input tensor is less than 2.
Examples:
.. code-block:: python
data = fluid.layers.data(name="data", shape=[32, 32], dtype="float32")
fc = fluid.layers.fc(input=data, size=1000, act="tanh")
"""
# Design Doc: Supporting new Device/Library
## Background
Deep learning has a high demand for computing resources. New high-performance devices and computing libraries are appearing very frequently. Deep learning frameworks have to integrate these high-performance devices and computing libraries in a flexible and efficient manner.
On one hand, hardware and computing libraries usually do not have a one-to-one correspondence. For example, Intel CPUs support Eigen and MKL computing libraries while Nvidia GPUs support Eigen and cuDNN computing libraries. We have to implement operator specific kernels for each computing library.
On the other hand, users usually do not want to care about the low-level hardware and computing libraries when writing a neural network configuration. In Fluid, `Layer` is exposed in `Python`, and `Operator` is exposed in `C++`. Both `Layer` and `Operator` are hardware independent.
So, how to support a new Device/Library in Fluid becomes a challenge.
## Basic: Integrate A New Device/Library
For a general overview of fluid, please refer to the [overview doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/read_source.md).
There are mainly three parts that we have to consider while integrating a new device/library:
- Place and DeviceContext: indicate the device id and manage hardware resources
- Memory and Tensor: malloc/free data on certain device
- Math Functor and OpKernel: implement computing unit on certain devices/libraries
### Place and DeviceContext
Please note that devices and computing libraries do not correspond one to one. A device can have many computing libraries, and a computing library can also support several devices.
#### Place
Fluid uses class [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/place.h#L55) to represent the device memory where data is located. If we add another device, we have to add the corresponding `DevicePlace`.
```
| CPUPlace
Place --| CUDAPlace
| FPGAPlace
```
And `Place` is defined as follows:
```
typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place;
```
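As a rough, self-contained sketch (not Paddle's actual code), the snippet below uses `std::variant` in place of `boost::variant` and a hypothetical `FPGAPlace` to show how a variant-typed `Place` lets place-dependent logic dispatch on the concrete device type:
```cpp
#include <iostream>
#include <string>
#include <variant>

// Simplified stand-ins for the real Place structs; FPGAPlace is hypothetical.
struct CPUPlace {};
struct CUDAPlace { int device = 0; };
struct FPGAPlace { int device = 0; };

using Place = std::variant<CUDAPlace, CPUPlace, FPGAPlace>;

// A visitor dispatches on the concrete place type; this is the pattern that
// place-dependent logic (e.g. picking an allocator) can follow.
struct PlaceName {
  std::string operator()(const CPUPlace&) const { return "CPUPlace"; }
  std::string operator()(const CUDAPlace& p) const {
    return "CUDAPlace(" + std::to_string(p.device) + ")";
  }
  std::string operator()(const FPGAPlace& p) const {
    return "FPGAPlace(" + std::to_string(p.device) + ")";
  }
};

int main() {
  Place place = CUDAPlace{1};
  std::cout << std::visit(PlaceName{}, place) << std::endl;  // CUDAPlace(1)
}
```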
#### DeviceContext
Fluid uses class [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/device_context.h#L30) to manage the resources in different libraries, such as the CUDA stream in `CUDADeviceContext`. There are also inheritance relationships between different kinds of `DeviceContext`.
```
/-> CPUDeviceContext
DeviceContext ----> CUDADeviceContext
\-> FPGADeviceContext
```
An example of Nvidia GPU is as follows:
- DeviceContext
```
class DeviceContext {
virtual Place GetPlace() const = 0;
};
```
- CUDADeviceContext
```
class CUDADeviceContext : public DeviceContext {
Place GetPlace() const override { return place_; }
private:
CUDAPlace place_;
cudaStream_t stream_;
cublasHandle_t cublas_handle_;
std::unique_ptr<Eigen::GpuDevice> eigen_device_; // binds with stream_
};
```
### Memory and Tensor
#### memory module
Fluid provides the following [memory interfaces](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/memory/memory.h#L36):
```
template <typename Place>
void* Alloc(Place place, size_t size);
template <typename Place>
void Free(Place place, void* ptr);
template <typename Place>
size_t Used(Place place);
```
To implement these interfaces, we have to implement MemoryAllocator for different Devices.
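For illustration only, here is a minimal, self-contained sketch (not Paddle's implementation) of what a CPU specialization of these three interfaces could look like, using plain `malloc`/`free` plus a byte counter; the real CPU path goes through a `BuddyAllocator` instead:
```cpp
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <unordered_map>

struct CPUPlace {};  // stand-in for platform::CPUPlace

template <typename Place> void* Alloc(Place place, size_t size);
template <typename Place> void Free(Place place, void* ptr);
template <typename Place> size_t Used(Place place);

// Naive bookkeeping so that Used() can report the number of live bytes.
static std::unordered_map<void*, size_t> g_blocks;
static size_t g_bytes_in_use = 0;

template <> void* Alloc<CPUPlace>(CPUPlace, size_t size) {
  void* p = std::malloc(size);
  if (p != nullptr) { g_blocks[p] = size; g_bytes_in_use += size; }
  return p;
}

template <> void Free<CPUPlace>(CPUPlace, void* ptr) {
  auto it = g_blocks.find(ptr);
  if (it != g_blocks.end()) { g_bytes_in_use -= it->second; g_blocks.erase(it); }
  std::free(ptr);
}

template <> size_t Used<CPUPlace>(CPUPlace) { return g_bytes_in_use; }

int main() {
  CPUPlace cpu;
  void* p = Alloc(cpu, 1024);
  std::cout << Used(cpu) << std::endl;  // 1024
  Free(cpu, p);
  std::cout << Used(cpu) << std::endl;  // 0
}
```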
#### Tensor
[Tensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/tensor.h#L36) holds data with some shape in a specific Place.
```cpp
class Tensor {
public:
/*! Return a pointer to mutable memory block. */
template <typename T>
inline T* data();
/**
* @brief Return a pointer to mutable memory block.
* @note If not exist, then allocation.
*/
template <typename T>
inline T* mutable_data(platform::Place place);
/**
* @brief Return a pointer to mutable memory block.
*
* @param[in] dims The dimensions of the memory block.
* @param[in] place The place of the memory block.
*
* @note If not exist, then allocation.
*/
template <typename T>
inline T* mutable_data(DDim dims, platform::Place place);
/*! Resize the dimensions of the memory block. */
inline Tensor& Resize(const DDim& dims);
/*! Return the dimensions of the memory block. */
inline const DDim& dims() const;
private:
/*! holds the memory block if allocated. */
std::shared_ptr<Placeholder> holder_;
/*! points to dimensions of memory block. */
DDim dim_;
};
```
`Placeholder` is used to delay memory allocation; that is, we can first define a tensor, use `Resize` to configure its shape, and then call `mutable_data` to allocate the actual memory.
```cpp
paddle::framework::Tensor t;
paddle::platform::CPUPlace place;
// set size first
t.Resize({2, 3});
// allocate memory on CPU later
t.mutable_data(place);
```
### Math Functor and OpKernel
Fluid implements computing units based on different DeviceContexts. Some computing units are shared between operators; this common part is put in the operators/math directory as basic Functors.
Let's take [MaxOutFunctor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/math/maxouting.h#L27) as an example:
The interface is defined in the header file.
```
template <typename DeviceContext, typename T>
class MaxOutFunctor {
public:
void operator()(const DeviceContext& context, const framework::Tensor& input,
framework::Tensor* output, int groups);
};
```
The CPU implementation is in the `.cc` file:
```
template <typename T>
class MaxOutFunctor<platform::CPUDeviceContext, T> {
public:
void operator()(const platform::CPUDeviceContext& context,
const framework::Tensor& input, framework::Tensor* output,
int groups) {
...
}
};
```
The CUDA implementation is in the `.cu` file:
```
template <typename T>
class MaxOutFunctor<platform::CUDADeviceContext, T> {
public:
void operator()(const platform::CUDADeviceContext& context,
const framework::Tensor& input, framework::Tensor* output,
int groups) {
...
}
};
```
We first obtain the computing handle from a concrete DeviceContext and then compute on tensors.
The implementation of `OpKernel` is similar to that of the math functors; the extra thing we need to do is to register the OpKernel in a global map.
Fluid provides different registration interfaces in `op_registry.h`.
Let's take [Crop](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/crop_op.cc#L134) operator as an example:
In .cc file:
```
REGISTER_OP_CPU_KERNEL(crop, ops::CropKernel<float>);
REGISTER_OP_CPU_KERNEL(
crop_grad, ops::CropGradKernel<paddle::platform::CPUDeviceContext, float>);
```
In .cu file:
```
REGISTER_OP_CUDA_KERNEL(crop, ops::CropKernel<float>);
REGISTER_OP_CUDA_KERNEL(
crop_grad, ops::CropGradKernel<paddle::platform::CUDADeviceContext, float>);
```
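As a conceptual sketch only (the real `op_registry.h` is considerably more involved and also keys on data type and layout), the following self-contained toy shows the "global map" idea behind these registration macros:
```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Toy kernel registry keyed by (op_type, place).
using Kernel = std::function<void()>;
std::map<std::pair<std::string, std::string>, Kernel>& Registry() {
  static std::map<std::pair<std::string, std::string>, Kernel> registry;
  return registry;
}

// Registering a kernel is just inserting into the map; the REGISTER_OP_*_KERNEL
// macros arrange for this to happen at static-initialization time.
struct KernelRegistrar {
  KernelRegistrar(const std::string& op, const std::string& place, Kernel k) {
    Registry()[{op, place}] = std::move(k);
  }
};

static KernelRegistrar crop_cpu("crop", "CPU", [] { std::cout << "crop CPU kernel\n"; });
static KernelRegistrar crop_cuda("crop", "CUDA", [] { std::cout << "crop CUDA kernel\n"; });

int main() {
  // At run time the framework looks up the kernel that matches the op and place.
  Registry()[{"crop", "CPU"}]();
  Registry()[{"crop", "CUDA"}]();
}
```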
## Advanced topics: How to switch between different Device/Library
Generally, we will implement OpKernels for all Devices/Libraries of an Operator. We can easily train a Convolutional Neural Network on a GPU. However, some OpKernels are not suitable for a specific Device. For example, the crf operator can only run on the CPU, whereas most other operators can run on the GPU. To achieve high performance in such circumstances, we have to switch between different Devices/Libraries.
For more details, please refer to following docs:
- operator kernel type [doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/multi_devices/operator_kernel_type.md)
- switch kernel [doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/execution/switch.md)
# 在Paddle中如何使用Eigen
神经网络本质上是一个计算图,计算需要的数据存放在`Tensor`中,而计算过程是由`Operartor`来描述的。在执行时,`Operator`调用对应`OpKernel`中的`Compute`接口,实现对`Tensor`的操作。
## Eigen Tensor模块
Eigen Tensor模块对element-wise计算提供了强大的支持,并且书写一份代码,可以同时在CPU、GPU执行。但Eigen Tensor是一个正在开发中的模块,因此可能测试不够完备,文档较少。
关于Eigen Tensor模块的详细介绍请参考[Eigen文档](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md)
## paddle::framework::Tensor
Paddle Tensor定义在framework目录下,其主要接口如下:
```cpp
class Tensor {
public:
/*! Return a pointer to mutable memory block. */
template <typename T>
inline T* data();
/**
* @brief Return a pointer to mutable memory block.
* @note If not exist, then allocation.
*/
template <typename T>
inline T* mutable_data(platform::Place place);
/**
* @brief Return a pointer to mutable memory block.
*
* @param[in] dims The dimensions of the memory block.
* @param[in] place The place of the memory block.
*
* @note If not exist, then allocation.
*/
template <typename T>
inline T* mutable_data(DDim dims, platform::Place place);
/*! Resize the dimensions of the memory block. */
inline Tensor& Resize(const DDim& dims);
/*! Return the dimensions of the memory block. */
inline const DDim& dims() const;
private:
/*! holds the memory block if allocated. */
std::shared_ptr<Placeholder> holder_;
/*! points to dimensions of memory block. */
DDim dim_;
};
```
`Placeholder`的作用是延迟分配内存,即我们可以先定义一个Tensor,然后使用Resize接口设置Tensor的大小,最后再调用mutable_data接口分配实际的内存。
```cpp
paddle::framework::Tensor t;
paddle::platform::CPUPlace place;
// set size first
t.Resize({2, 3});
// allocate memory on CPU later
t.mutable_data(place);
```
### paddle::framework::Tensor使用样例
下面以AddOp为例说明Tensor的使用过程:
- InferShape
在运行神经网络计算图时,我们先调用每个`Operator`的`InferShape`接口,根据输入Tensor的大小来设置输出Tensor的大小,`Resize`接口会被调用。
```cpp
void InferShape(const framework::InferShapeContext &ctx) const override {
PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("X")->dims(),
ctx.Input<Tensor>("Y")->dims(),
"Two input of Add Op's dimension must be same.");
ctx.Output<Tensor>("Out")->Resize(ctx.Input<Tensor>("X")->dims());
}
```
- Run
`Operator`的`Run`接口最终会调用对应`OpKernel`的`Compute`接口,在这时才真正地分配内存,`mutable_data`接口会被调用。
```cpp
void Compute(const framework::ExecutionContext& context) const override {
auto* input0 = context.Input<Tensor>("X");
auto* input1 = context.Input<Tensor>("Y");
auto* output = context.Output<Tensor>("Out");
output->mutable_data<T>(context.GetPlace());
auto x = EigenVector<T>::Flatten(*input0);
auto y = EigenVector<T>::Flatten(*input1);
auto z = EigenVector<T>::Flatten(*output);
auto place = context.GetEigenDevice<Place>();
z.device(place) = x + y;
}
```
### paddle::framework::Tensor到EigenTensor的转换
如上一小节所示,在具体的计算中,我们需要先把输入Tensor和输出Tensor转换为Eigen支持的格式。我们在[eigen.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/eigen.h)中提供了一些全局函数用来实现paddle::framework::Tensor到EigenTensor/EigenMatrix/EigenVector/EigenScalar的转换。
以EigenTensor为例,做一个介绍
```cpp
Tensor t;
float* p = t.mutable_data<float>(make_ddim({1, 2, 3}), platform::CPUPlace());
for (int i = 0; i < 1 * 2 * 3; i++) {
p[i] = static_cast<float>(i);
}
EigenTensor<float, 3>::Type et = EigenTensor<float, 3>::From(t);
```
From是EigenTensor模板提供的一个接口,可以实现从paddle::framework::Tensor到EigenTensor的转换。由于Tensor的rank是模板参数,因此在转换时需要显式地指定。
在Eigen中,不同rank的Tensor是不同类型,Vector是rank为1的Tensor。需要额外注意的是,EigenVector<T>::From方法是把paddle中的一维Tensor转为Eigen的一维Tensor,在这里用EigenVector来表示;而EigenVector<T>::Flatten方法是把paddle中的一个Tensor进行reshape操作,压扁成为Eigen的一维Tensor,类型仍然为EigenVector。
更多的转换方法请参考eigen_test.cc中的[单元测试](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/eigen_test.cc)
## 实现计算
当需要完成计算时,我们需要等式左边的EigenTensor调用device接口。在这里需要注意的是,这里的EigenTensor之间的运算只是改变了原有Tensor中的数据,而不会改变原有Tensor的shape信息。
```cpp
auto x = EigenVector<T>::Flatten(*input0);
auto y = EigenVector<T>::Flatten(*input1);
auto z = EigenVector<T>::Flatten(*output);
auto place = context.GetEigenDevice<Place>();
z.device(place) = x + y;
```
在这段代码中,input0/input1/output可以是任意维度的Tensor。我们调用了EigenVector的Flatten接口,把任意维度的Tensor转为了一维的EigenVector。而在计算结束之后,input0/input1/output的原有shape信息不变。如果想改变原有Tensor的shape信息,可以调用Resize接口进行改变。
由于Eigen Tensor模块的文档较少,我们可以参考TensorFlow的[kernels](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/kernels)模块下的相关`OpKernel`的计算代码。
# How to use Eigen in Paddle
Essentially, a neural network is a compute graph. The data needed for the computation is stored in `Tensor`s, and its computation procedure is described by `Operator`s. An `Operator` calls the `Compute` interface in its corresponding `OpKernel` and operates on the `Tensor`.
## Eigen Tensor Module
The Eigen Tensor module supports powerful element-wise computation. In addition, a piece of code written using it can be run on both the CPU and the GPU.
Note that Eigen Tensor is still being actively developed, so its tests are not completely covered and its documentation may be sparse.
For details on Eigen Tensor module, please see [doc 1](https://github.com/RLovelett/eigen/blob/master/unsupported/Eigen/CXX11/src/Tensor/README.md) and [doc 2](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md).
## paddle::framework::Tensor
Paddle's Tensor is defined in the framework directory with the following interface:
```cpp
class Tensor {
public:
/*! Return a pointer to mutable memory block. */
template <typename T>
inline T* data();
/**
* @brief Return a pointer to mutable memory block.
* @note If not exist, then allocation.
*/
template <typename T>
inline T* mutable_data(platform::Place place);
/**
* @brief Return a pointer to mutable memory block.
*
* @param[in] dims The dimensions of the memory block.
* @param[in] place The place of the memory block.
*
* @note If not exist, then allocation.
*/
template <typename T>
inline T* mutable_data(DDim dims, platform::Place place);
/*! Resize the dimensions of the memory block. */
inline Tensor& Resize(const DDim& dims);
/*! Return the dimensions of the memory block. */
inline const DDim& dims() const;
private:
/*! holds the memory block if allocated. */
std::shared_ptr<Placeholder> holder_;
/*! points to dimensions of memory block. */
DDim dim_;
};
```
`Placeholder` is used to delay memory allocation; that is, we can first define a tensor, use `Resize` to configure its shape, and then call `mutable_data` to allocate the actual memory.
```cpp
paddle::framework::Tensor t;
paddle::platform::CPUPlace place;
// set size first
t.Resize({2, 3});
// allocate memory on CPU later
t.mutable_data(place);
```
### paddle::framework::Tensor Usage
`AddOp` demonstrates Tensor's usage.
- InferShape
When running a neural network's compute graph, we first call every `Operator`'s `InferShape` interface, which sets the size of the output tensors according to the size of the input tensors; this is where `Resize` gets called.
```cpp
void InferShape(const framework::InferShapeContext &ctx) const override {
PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("X")->dims(),
ctx.Input<Tensor>("Y")->dims(),
"Two input of Add Op's dimension must be same.");
ctx.Output<Tensor>("Out")->Resize(ctx.Input<Tensor>("X")->dims());
}
```
- Run
```cpp
void Compute(const framework::ExecutionContext& context) const override {
auto* input0 = context.Input<Tensor>("X");
auto* input1 = context.Input<Tensor>("Y");
auto* output = context.Output<Tensor>("Out");
output->mutable_data<T>(context.GetPlace());
auto x = EigenVector<T>::Flatten(*input0);
auto y = EigenVector<T>::Flatten(*input1);
auto z = EigenVector<T>::Flatten(*output);
auto place = context.GetEigenDevice<Place>();
z.device(place) = x + y;
}
```
## Transforming paddle::framework::Tensor to EigenTensor
As shown above, in actual computation we need to transform the input and output `Tensor`s into formats Eigen supports. We provide some functions in [eigen.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/eigen.h) to implement the transformation from `paddle::framework::Tensor` to `EigenTensor/EigenMatrix/EigenVector/EigenScalar`.
Using EigenTensor as an example:
```cpp
Tensor t;
float* p = t.mutable_data<float>(make_ddim({1, 2, 3}), platform::CPUPlace());
for (int i = 0; i < 1 * 2 * 3; i++) {
p[i] = static_cast<float>(i);
}
EigenTensor<float, 3>::Type et = EigenTensor<float, 3>::From(t);
```
`From` is an interface provided by the EigenTensor template, which implements the transformation from a `paddle::framework::Tensor` object to an EigenTensor. Since the rank is a template parameter, it needs to be explicitly specified at the time of the transformation.
In Eigen, tensors with different ranks are different types, with `Vector` being a rank-1 instance. Note that `EigenVector<T>::From` transforms a 1-dimensional Paddle tensor into a 1-dimensional Eigen tensor, while `EigenVector<T>::Flatten` reshapes a Paddle tensor of any rank and flattens it into a 1-dimensional Eigen tensor. Both resulting tensors are still typed EigenVector.
For more transformations, see the [unit tests](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/eigen_test.cc) in the `eigen_test.cc` file.
## Implementing Computation
While computing, the device interface is needed on the EigenTensor on the left-hand side of the assignment. Note that the computation between EigenTensors only changes the data originally in the Tensor and does not change the shape information associated with the Tensor.
```cpp
auto x = EigenVector<T>::Flatten(*input0);
auto y = EigenVector<T>::Flatten(*input1);
auto z = EigenVector<T>::Flatten(*output);
auto place = context.GetEigenDevice<Place>();
z.device(place) = x + y;
```
In this code segment, input0/input1/output can be Tensors of arbitrary dimension. We call Flatten from EigenVector to transform a tensor of any dimension into a 1-dimensional EigenVector. After the computation completes, input0/input1/output retain their original shape information; if you want to change the shape of a Tensor, call the `Resize` interface.
Because the Eigen Tensor module is under-documented, you can also refer to the `OpKernel` computation code in TensorFlow's [kernels module](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/kernels).
# Versioning (Work In Progress)
PaddlePaddle framework follows Semantic Versioning 2.0 (semver).
Each release has version of the following format: MAJOR.MINOR.PATCH
(e.g. 1.2.0). Some key points:
* A major version number change can result in backward-incompatible changes. Code that works in the old version does not necessarily work in the new version. In addition, data generated by the previous major version, such as program models and checkpointed parameters, might not work in the new version. We will attempt to build tools to help with release migration.
* A minor version number change always maintains backward compatibility. It normally contains compatible improvements and bug fixes.
* A patch number change is for bug fixes.
* Violations of the policy are considered bugs and should be fixed.
### What is Covered
* All public documented Python APIs, excluding those that live in the contrib namespace.
### What is Not Covered
* If an API's implementation has bugs, we reserve the right to fix the bugs and change the behavior.
* The Python APIs in contrib namespace.
* The Python function and classes that start with ‘_’.
* The offline tools.
* The data generated by the framework, such as serialized Program model file and checkpointed variables, are subject to different versioning scheme described below.
* C++ Inference APIs. (To be covered)
## Data
Data refers to the artifacts generated by the framework. Here, we specifically mean model Program file and the checkpointed variables.
* Backward Compatibility: A user sometimes generates Data with PaddlePaddle version 1.1 and expects it to be consumed by PaddlePaddle version 1.2.
This can happen when a new online system wants to serve an old model trained previously.
* Forward Compatibility: A user sometimes generates Data with PaddlePaddle version 1.2 and expects it to be consumed by PaddlePaddle version 1.1.
This can happen when a new, successful research model wants to be served by an old online system that is not frequently upgraded.
### Versioning
Data version: Data is assigned an integer version number. The version is increased when an incompatible change is introduced.
The PaddlePaddle framework supports an interval of Data versions. A PaddlePaddle release within the same major version (semver) cannot drop support for a lower Data version. Hence, a minor version change cannot drop support for a Data version.
For example, PaddlePaddle version 1.1 supports Program versions 3 to 5. Later, the Program version is increased from 5 to 6 due to the addition of an attribute. As a result, PaddlePaddle version 1.1 won't be able to consume it, so PaddlePaddle 1.2 should support Program versions 3 to 6. PaddlePaddle can only drop support for Program version 3 starting with PaddlePaddle version 2.0.
### Known Issues
Currently, forward compatibility for new Data version is best-effort.
###################
编译安装与单元测试
###################
1. 通过pip安装的PaddlePaddle在 :code:`import paddle.fluid` 报找不到 :code:`libmkldnn.so` 或 :code:`libmklml_intel.so`
------------------------------------------------------------------------------------------
出现这种问题的原因是在导入 :code:`paddle.fluid` 时需要加载 :code:`libmkldnn.so` 和 :code:`libmklml_intel.so`,
但是系统没有找到该文件。一般通过pip安装PaddlePaddle时会将 :code:`libmkldnn.so` 和 :code:`libmklml_intel.so`
拷贝到 :code:`/usr/local/lib` 路径下,所以解决办法是将该路径加到 :code:`LD_LIBRARY_PATH` 环境变量下,
即: :code:`export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH` 。
**注意**:如果是在虚拟环境中安装PaddlePaddle, :code:`libmkldnn.so` 和 :code:`libmklml_intel.so` 可能不在 :code:`/usr/local/lib` 路径下。
FAQ
====
本文档对PaddlePaddle使用中的一些常见问题提供了解答。如果您的问题未在此处得到解答,请您到 `PaddlePaddle社区 <https://github.com/PaddlePaddle/Paddle/issues>`_ 查找答案或直接提 `issue <https://github.com/PaddlePaddle/Paddle/issues/new>`_ ,我们会及时进行回复。
.. toctree::
:maxdepth: 1
faq.rst
# Paddle Fluid 开发者指南
---
### ==1==. 为什么需要 PaddlePaddle Fluid?
---
### 两个基础问题
<font size=6>
1. 如何描述机器学习模型和优化过程?
- 完备自洽,表达能力足以支持潜在出现的各种计算需求
1. 如何充分利用资源高效计算?
- 支持异步设备、多卡、分布式计算
- 降低计算/计算优化的开发成本
- ……
</font>
---
### 如何描述模型和优化过程?
<font size=6>
<table>
<thead>
<tr>
<th> </th>
<th>一组连续执行的layers</th>
<th>variable和operator构成的计算图 </th>
<th>不再有模型的概念 </th>
</tr>
</thead>
<tbody>
<tr>
<td> 2013</td>
<td> Caffe,Theano, Torch, PaddlePaddle </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> 2015 </td>
<td> </td>
<td> TensorFlow, MxNet, Caffe2, ONNX, n-graph </td>
<td> </td>
</tr>
<tr>
<td>2016 </td>
<td> </td>
<td> </td>
<td> PyTorch, TensorFlow Eager Execution, <font color=#483D8B>**==PaddlePaddle Fluid==**</font> </td>
</tr>
</tbody>
</table>
---
### <p align="center">目标 </p>
<font size=6>
- 提高对各类机器学习任务的描述能力:能够描述潜在出现的任意机器学习模型。
- 代码结构逻辑清晰,各模块充分解耦:内外部贡献者能够专注于自己所需的功能模块,基于框架进行再次开发。
- 从设计上,留下技术优化的空间和潜力。
- 代码解耦后降低多设备支持、计算优化等的开发成本。
- 在统一的设计理念下,实现自动可伸缩,自动容错的分布式计算。
</font>
---
## ==2.== Design Overview
---
# Fluid: 系统形态
- <span style="background-color:#ACD6FF;">[编译器式的执行流程,区分编译时和运行时](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/fluid/design/motivation/fluid_compiler.md)</span>
<br>
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/fluid-compiler.png" width=100%>
</p>
---
#### 让我们在Fluid程序实例中,区分编译时和运行时
---
### Fluid 编译时
<font size=5>
- ==**定义前向计算**==
```python
x = fluid.layers.data(name='x',shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(x=cost)
```
- ==**添加反向、正则、优化**==
```python
learning_rate = 0.01
sgd_optimizer = fluid.optimizer.SGD(learning_rate)
sgd_optimizer.minimize(avg_cost)
```
</font>
---
### `Program` vs. 计算图
<font size=5>
- 在科学计算领域,计算图是一种描述计算的经典方式。下图展示了从前向计算图(蓝色)开始,通过添加反向(红色)和优化算法相关(绿色)操作,构建出整个计算图的过程:
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/graph_construction_example_all.png" width=60%>
</p>
- Fluid ==使用`Program`而不是计算图==来描述模型和优化过程。`Program`由`Block`、`Operator`和`Variable`构成,相关概念会在后文详细展开。
- 编译时 Fluid 接受前向计算(这里可以先简单地理解为是一段有序的计算流)`Program`,为这段前向计算按照:前向 -> 反向 -> 梯度 clip -> 正则 -> 优化 的顺序,添加相关的 `Operator`和`Variable`,将`Program`补全为完整的计算。
</font>
---
### Fluid 运行时
<font size=5>
- ==**读入数据**==
```python
train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.uci_housing.train(), buf_size=500),
batch_size=20)
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
```
- ==**定义执行程序的设备**==
```python
place = fluid.CPUPlace()
feeder = fluid.DataFeeder(place=place,feed_list=[x, y])
```
- ==创建执行器(Executor),执行初始化 `Program`和训练`Program`==
```python
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
PASS_NUM = 100
for pass_id in range(PASS_NUM):
for data in train_reader():
avg_loss_value, = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[avg_cost])
print(avg_loss_value)
```
</font>
---
### 总结:框架做什么?用户做什么?
<br>
<font size=5>
<table>
<thead>
<tr>
<th>构建训练</th>
<th>执行训练</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<span style="background-color:#B3D9D9">用户</span>:描述前向运算<br><span style="background-color:#DAB1D5;">框架</span>:添加反向运算<br><span style="background-color:#DAB1D5;">框架</span>:添加优化运算<br><span style="background-color:#DAB1D5;">框架</span>:添加内存优化<br><span style="background-color:#DAB1D5;">框架</span>:添加并行/多设备/分布式相关的计算单元
</td>
<td>
<span style="background-color:#DAB1D5;">框架</span>:创建Operator(计算)+ Variable(数据)<br><span style="background-color:#DAB1D5;">框架</span>:创建`Block`<br><span style="background-color:#DAB1D5;">框架</span>:内存管理/设备管理<br><span style="background-color:#DAB1D5;">框架</span>:执行计算
</td>
</tr>
</tbody>
</table>
</font>
---
### <p align="center">总结:编译时</p>
<font size=5>
<span style="background-color:#A3D1D1;">**用户编写一段Python程序,描述模型的前向计算**</span>
1. 创建变量描述 `VarDesc`
1. 创建operators的描述 `OpDesc`
1. 创建operators的属性
1. 推断变量的类型和形状,进行静态检查:`inferShape`
1. 规划变量的内存复用
1. 创建反向计算
1. 添加优化相关的Operators
1. (可选)添加多卡/多机相关的Operator,生成在多卡/多机上运行的程序
</font>
---
### <p align="center">总结:运行时</p>
<font size=5>
<span style="background-color:#C7C7E2;">**执行规划好的计算**</span>
1. 创建`Executor`
1. 为将要执行的一段计算,在层级式的`Scope`空间中创建`Scope`
1. 创建`Block`,依次执行`Block`
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/compile_run_time.png" width=50%><br>
<font size=3> Figure. 编译时运行时概览</font>
</p>
</font>
---
<!-- *template: invert -->
## ==3==. 用户如何描述计算?
---
### Fluid:==像写程序一样==定义计算
<font size=5>
- 顺序执行
```python
x = fluid.layers.data(name='x',shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
```
- 条件分支: [switch](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/fluid/design/execution/switch.md)、[ifelse](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/fluid/design/execution/if_else_op.md)
```python
a = fluid.Var(10)
b = fluid.Var(0)
switch = fluid.switch()
with switch.block():
with switch.case(fluid.less_equal(a, 10)):
fluid.print("Case 1")
with switch.case(fluid.larger(a, 0)):
fluid.print("Case 2")
with switch.default():
fluid.print("Case 3")
```
>[A Lisp cond form may be compared to a continued if-then-else as found in many algebraic programming languages](https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node84.html).
</font>
---
### Fluid: ==像写程序一样==定义计算
<font size=5>
- 循环:[while](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_machine_translation.py#L105)
```python
d0 = layers.data("d0", shape=[10], dtype='float32')
data_array = layers.array_write(x=d0, i=i)
array_len = layers.fill_constant(shape=[1],dtype='int64', value=3)
cond = layers.less_than(x=i, y=array_len)
while_op = layers.While(cond=cond)
with while_op.block():
d = layers.array_read(array=data_array, i=i)
i = layers.increment(x=i, in_place=True)
layers.array_write(result, i=i, array=d)
layers.less_than(x=i, y=array_len, cond=cond)
```
- 完整实例请点击查看 [->](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_while_op.py#L36-L44)
- beam search [->]( https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_machine_translation.py#L105)
</font>
---
#### <p align="center">总结</p>
<font size=5>
1. 用户层提供的描述语法具有完备性、自洽性,有能力支持对复杂计算过程描述
1. 使用方式和核心概念可以类比编程语言,认知能够直接迁移
1. 能够支持:定义问题,逐步求解
</font>
---
## ==3.== 核心概念
---
### 编译时概念 :==变量和计算的描述==
<font size=5>
- `VarDesc` + `TensorDesc` + `OpDesc` -> `BlockDesc` -> `ProgramDesc`
- https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/framework.proto
- <span style="background-color:#DAB1D5;">什么是 Fluid Program</span>
- 在Fluid中,一个神经网络任务(训练/预测)被描述为一段`Program`
- `Program`包含对`Variable`(数据)和 `Operator`(对数据的操作)的描述
- `Variable`和`Operator` 被组织为多个可以嵌套的`Block`,构成一段完整的`Fluid Program`
>编译阶段最后,经过 Transpiler 的执行规划、变换处理,生成使用`protobuf`序列化后的`ProgramDesc`,可以发送给多卡或者网络中的其它计算节点执行
</font>
---
### 编译时概念 :==**[Transpiler](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/fluid/design/motivation/fluid_compiler.md)**==
<font size=5>
1. 接受一段`ProgramDesc`作为输入,生成一段新的`ProgramDesc`
- *Memory optimization transpiler*:向原始`ProgramDesc` 中插入 `FreeMemoryOps`,在一次迭代优化结束前提前释放内存,使得能够维持较小的 memory footprint
- *Distributed training transpiler*:将原始的`ProgramDesc`转化为对应的分布式版本,生成两段新的`ProgramDesc`:
1. trainer进程执行的`ProgramDesc`
1. parameter server执行的`ProgramDesc`
1. ==**WIP**==: 接受一段`ProgramDesc`,生成可直接被`gcc`, `nvcc`, `icc`等编译的代码,编译后得到可执行文件
</font>
---
### Transpiler
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/transpiler.png" width=70%>
</p>
---
### 打印 `ProgramDesc`
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/print_fluid_program.png" width=70%>
</p>
<font size=5>
- `default_startup_program`:创建可学习参数,对参数进行初始化
- `default_main_program`:由用户定义的模型,包括了前向、反向、优化及所有必要的计算
</font>
---
### 输出效果
<font size=5>
<table>
<thead>
<th>variable in block 0</th>
<th>variable in block 0</th>
</thead>
<tbody>
<tr>
<td><img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/program_desc1.png" width=70%></td>
<td><img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/program_desc2.png" width=70%></td>
</tr>
</tbody>
</table>
</font>
---
### 运行时概念
<font size=5>
- 数据相关
- `Tensor` / `LoDTensor` / `Variable`
- `Scope`
- 计算相关
- `Block`
- `Kernel`、`OpWithKernel`、`OpWithoutKernel`
<table>
<thead>
<th></th>
<th>protobuf messages</th>
<th>C++ class objects</th>
</thead>
<tbody>
<tr>
<td>Data</td>
<td>[VarDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/framework.proto#L107)
</td>
<td>[Variable](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/variable.h#L24)
</td>
</tr>
<tr>
<td>Operation</td>
<td>[OpDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/framework.proto#L35)
</td>
<td>[Operator](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/operator.h#L64)
</td>
</tr>
<tr>
<td>Block</td>
<td>BlockDesc
</td>
<td>Block
</td>
</tr>
</tbody>
</table>
- 执行相关 :`Executor`
</font>
---
#### Tensor 和 LoD(Level-of-Detail) Tensor
<font size=5>
- Tensor 是$n$-dimensional array的推广,LoDTensor是在Tensor基础上附加了序列信息
- Fluid中输入、输出,网络中的可学习参数全部统一使用LoDTensor(n-dimension array)表示
- 一个mini-batch输入数据是一个LoDTensor
- 在Fluid中,RNN 处理变长序列无需padding,得益于 `LoDTensor`表示
- 可以简单将 LoD 理解为:`std::vector<std::vector<int>>`
- 对非序列数据,LoD 信息为空
<table>
<thead>
<th></th>
<th>TensorFlow</th>
<th>PaddlePaddle</th>
</thead>
<tbody>
<tr>
<td>RNN</td>
<td>Support
</td>
<td>Support
</td>
</tr>
<tr>
<td>recursive RNN</td>
<td>Support
</td>
<td>Support
</td>
</tr>
<tr>
<td>padding zeros</td>
<td>Must
</td>
<td>No need
</td>
</tr>
<tr>
<td>blob data type</td>
<td>Tensor
</td>
<td>LODTensor
</td>
</tr>
</tbody>
</table>
</font>
---
#### LoD 信息实例
<font size=4>
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/LoDTensor.png" width=43%>
</p>
- 图(a)的LoD 信息
```cpp
[0, 5, 8, 10, 14]
```
- 图(b)的 LoD 信息
```cpp
[[0, 5, 8, 10, 14] /*level=1*/, [0, 2, 3, 5, 7, 8, 10, 13, 14] /*level=2*/]
```
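下面是一段与 Paddle 实现无关的最小示意代码,说明如何由图(a)的 LoD 偏移信息还原出每个序列的长度(相邻偏移之差即长度):
```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  // 图(a)的 level-1 LoD:记录每个序列在整个 batch 中的起止偏移
  std::vector<size_t> lod = {0, 5, 8, 10, 14};
  // 第 i 个序列覆盖区间 [lod[i], lod[i+1]),长度即相邻偏移之差
  for (size_t i = 0; i + 1 < lod.size(); ++i) {
    std::cout << "sequence " << i << " length = " << lod[i + 1] - lod[i] << std::endl;
  }
  // 输出长度依次为 5, 3, 2, 4
}
```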
</font>
---
#### Tensor, Variable, Scope 之间的关系
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/scope_variable_tensor.png" width=40%>
</p>
<font size=5>
1. `Block` 是一个实现层的概念,不在应用层暴露给用户。目前用户无法自行创建并利用`Block`,用户能够感知的只有`Program`这个概念。
1. 逻辑上,可以将 `Block` 类比为编程语言中的大括号:定义了一段作用域,其中运行一段代码
1. `Executor`会为每一个`Block`创建一个`Scope`;`Block`是可嵌套的,因此`Scope`也是可嵌套的
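下面用一段与 Paddle 实现无关的示意代码说明这种嵌套关系:子 `Scope` 中找不到变量时,会逐级回退到父 `Scope` 查找(仅为概念演示):
```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// 概念示意:用 int 代替真正的 Variable
struct Scope {
  const Scope* parent = nullptr;
  std::unordered_map<std::string, int> vars;
  const int* FindVar(const std::string& name) const {
    auto it = vars.find(name);
    if (it != vars.end()) return &it->second;
    return parent != nullptr ? parent->FindVar(name) : nullptr;
  }
};

int main() {
  Scope outer;                 // 外层 Block 对应的 Scope
  outer.vars["W"] = 1;
  Scope inner;                 // 嵌套 Block 对应的子 Scope
  inner.parent = &outer;
  inner.vars["tmp"] = 2;
  std::cout << (inner.FindVar("W") != nullptr) << std::endl;    // 1:内层可以看到外层变量
  std::cout << (outer.FindVar("tmp") != nullptr) << std::endl;  // 0:外层看不到内层变量
}
```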
</font>
---
### Executor
<font size=5>
<table>
<thead>
<th>接口</th>
<th>说明</th>
</thead>
<tbody>
<tr>
<td><p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/executor.png" width=60%>
</p></td>
<td><span style="background-color:#B3D9D9;">输入</span><br>1. `ProgramDesc`<br>2. `Scope`<br> 3.`block_id`<br><br><span style="background-color:#B3D9D9;">解释执行步骤</span><br>1. 创建所有 Variables<br> 2. 逐一创建 Operator 并运行
</td>
</tr>
</tbody>
</table>
---
### Operator/OpWithKernel/Kernel
<font size=5>
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/operator1.png" width=50%>
</p>
- operator 无状态,Operator的核心是==Run==方法
- 一个operator可以注册多个kernel
- operator 可以无 kernel:while_op 、ifelse op
</font>
---
#### Fluid Operator vs. PaddlePaddle layers
<font size=5>
<table>
<thead>
<th>Layer</th>
<th>Operator</th>
</thead>
<tbody>
<tr>
<td><p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/layer.png" width=70%>
</p></td>
<td><p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/operator2.png" width=73%>
</p></td>
</tr>
<tr>
<td>1. 内部维护状态<br>2. 包含forward和backward方法</td>
<td>1. 内部无状态<br>2. 只有Run方法</td>
</tr>
</tbody>
</table>
</font>
---
### ==4.== 内存管理
---
### 目标
- 为异构设备提供统一的内存分配、回收接口
- 最小化管理内存所需的时间,最小化管理开销
- 减少内存碎片
- 将内存管理与计算(Operators/Kernels)完全剥离
- 统一内存管理是内存优化的基础
---
<font size=5>
### Memory 接口
- 内存管理模块向上层应用逻辑提供三个基础接口:
```cpp
template <typename Place>
void* Alloc(Place place, size_t size);
template <typename Place>
void Free(Place place, void* ptr);
template <typename Place>
size_t Used(Place place);
struct Usage : public boost::static_visitor<size_t> {
size_t operator()(const platform::CPUPlace& cpu) const;
size_t operator()(const platform::CUDAPlace& gpu) const;
};
```
- 模板参数 `Place` 指示内存分配发生的设备
- 实现时,需特化支持的 `Place`, 提供以上三个接口的实现
</font>
---
### 代码结构
<font size=5>
内存管理模块可以理解为由以下两部分构成:
1. SystemAllocator:实际从物理设备上分配、释放的内存的接口
1. BuddyAllocator:内存管理算法
</font>
---
### System Allocator
<font size=5>
- SystemAllocator 是实现物理内存分配、回收的基类
- 不同设备上的内存分配和回收终将转化为标准接口调用
- 为不同设备实现MemoryAllocator,继承自SystemAllocator
```cpp
class SystemAllocator {
public:
virtual ~SystemAllocator() {}
virtual void* Alloc(size_t& index, size_t size) = 0;
virtual void Free(void* p, size_t size, size_t index) = 0;
virtual bool UseGpu() const = 0;
};
```
</font>
---
### CPU/GPU Allocator
<font size=5>
```cpp
class CPUAllocator : public SystemAllocator {
public:
virtual void* Alloc(size_t& index, size_t size);
virtual void Free(void* p, size_t size, size_t index);
virtual bool UseGpu() const;
};
#ifdef PADDLE_WITH_CUDA
class GPUAllocator : public SystemAllocator {
public:
virtual void* Alloc(size_t& index, size_t size);
virtual void Free(void* p, size_t size, size_t index);
virtual bool UseGpu() const;
private:
size_t gpu_alloc_size_ = 0;
size_t fallback_alloc_size_ = 0;
};
#endif
```
- CPUAllocator和GPUAllocator分别继承自SystemAllocator,分别调用相应的标准库函数实现物理内存的分配和释放。
- 一旦大块、连续的物理内存分配之后,将通过内存管理算法实现内存的按块分配、回收、重用等。
</font>
---
### CPU Allocator
<font size=5>
- CPU 内存的分配提供两种选项:
1. non-pinned memory:可分页内存
2. pinned memory:页锁定内存
- 分配过多的页锁定内存,可能会因为系统可使用的分页内存减少而影响系统性能;默认情况下,CPU 上分配的是可分页内存
- 通过gflags进行设置一次性分配内存的大小以及是否使用页锁定内存。
```cpp
DEFINE_bool(use_pinned_memory, true, "If set, allocate cpu pinned memory.");
DEFINE_double(fraction_of_cpu_memory_to_use, 1,
"Default use 100% of CPU memory for PaddlePaddle,"
"reserve the rest for page tables, etc");
```
</font>
---
### GPU Allocator
<font size=5>
- 通过 cudaMalloc 分配GPU显存
- GPUAllocator::Alloc 首先会计算指定GPU device上的可用显存
- 如果可用显存不小于请求分配大小,调用cudaMalloc进行分配
- 如果可用显存不足,目前会报错退出。
- 通过gflags控制GPU下一次性分配显存的大小:
```cpp
DEFINE_double(fraction_of_gpu_memory_to_use, 0.92,
"Default use 92% of GPU memory for PaddlePaddle,"
"reserve the rest for page tables, etc");
```
</font>
---
#### 内存管理算法: [Buddy Memory Allocation](https://en.wikipedia.org/wiki/Buddy_memory_allocation)
<font size=5>
- Memory Arena:一次性分配大块连续内存,之后会基于这块内存进行内存管理:动态分配、释放、重用内存块。
- 伙伴内存分配:
- 将内存划分为 2 的幂次方个分区,使用 best-fit 方法来分配内存请求。
- 当释放内存时,检查 buddy 块,查看相邻的内存块是否也已被释放。如果是,将内存块合并,以最小化内存碎片。
- 分配的内存在物理内存的自然边界对齐,提高内存访问效率。
- 算法的时间效率高,但由于使用 best-fit 方法,会产生一定的内存浪费(buddy 块地址的计算方式见下方示意代码)
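下面是一段与 Paddle 实现无关的最小示意:当块大小为 2 的幂、且块按自身大小对齐时,一个块的 buddy 偏移可以用“偏移 XOR 块大小”计算:
```cpp
#include <cstddef>
#include <iostream>

// 伙伴块偏移 = 当前块偏移 XOR 块大小(要求块大小为 2 的幂,且块按自身大小对齐)
size_t BuddyOffset(size_t offset, size_t block_size) {
  return offset ^ block_size;
}

int main() {
  std::cout << BuddyOffset(2048, 1024) << std::endl;  // 3072:偏移 2048 处 1KB 块的 buddy
  std::cout << BuddyOffset(3072, 1024) << std::endl;  // 2048:两者互为 buddy,都空闲时可合并为 2KB 块
}
```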
</font>
---
### Buddy Allocator
<font size=5>
- BuddyAllocator 是一个单例,每个设备(如: CPU、GPU(0)、GPU(1))拥有一个BuddyAllocator
- BuddyAllocator 内部拥有一个私有成员变量 SystemAllocator
- 当请求的内存超过BuddyAllocator管理的空余内存时,将会调用SystemAllocator去指定的设备上分配物理内存
</font>
---
### 实例:CPU 下内存管理接口的实现
<font size=5>
- 对上层应用,统一通过BuddyAllocator来实现内存的分配、释放以及用量查询
```cpp
template <>
void* Alloc<platform::CPUPlace>(platform::CPUPlace place, size_t size) {
VLOG(10) << "Allocate " << size << " bytes on " << platform::Place(place);
void* p = GetCPUBuddyAllocator()->Alloc(size);
VLOG(10) << " pointer=" << p;
return p;
}
template <>
void Free<platform::CPUPlace>(platform::CPUPlace place, void* p) {
VLOG(10) << "Free pointer=" << p << " on " << platform::Place(place);
GetCPUBuddyAllocator()->Free(p);
}
template <>
size_t Used<platform::CPUPlace>(platform::CPUPlace place) {
return GetCPUBuddyAllocator()->Used();
}
```
</font>
---
### ==5.== 多设备支持
---
### 多设备支持(一)
<font size=5>
- step 1:添加Place类型,<span style="background-color:#DAB1D5;">由用户实现添加到框架</span>
- 可以将Place类型理解为一个整数加上一个枚举型,包括:设备号 + 设备类型
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/place.png" width=40%>
</p>
- DeviceContext
- 不同的Place会对应一个相应的DeviceContext,用于组织管理与设备相关的信息
- 例如,GpuDeviceContext中会管理Cuda stream
- 目前实现中一些特殊的库也会对应有自己的DeviceContext:例如:
```cpp
class MKLDNNDeviceContext : public CPUDeviceContext {……}
```
- 每种设备对应的DeviceContext需要管理的内容不尽相同,视具体需求来实现
</font>
---
### 多设备支持(二)
<font size=5>
- step 2: 增加KernelType,为相应的KernelType注册Kernel对象,<span style="background-color:#DAB1D5;">由用户实现并注册给框架</span>。可以按照:
1. Place 执行设备
1. DataType 执行数据类型 FP32/FP64/INT32/INT64
1. Memory layout: 运行时 Tensor 在内存中的排布格式 NCHW、 NHWC
1. 使用的库
来区分Kernel,为同一个operator注册多个 Kernel。
```cpp
struct OpKernelType {
proto::DataType data_type_;
DataLayout data_layout_;
platform::Place place_;
LibraryType library_type_;
}
```
</font>
---
### 多设备支持(三)
<font size=5>
step 3: 运行时的 KernelType 推断和Kernel切换,<span style="background-color:#DAB1D5;">按需要修改Kernel推断和Kernel切换规则</span>
- Expected Kernel:期待调用的Kernel:由(1)`Place`和计算精度决定;或(2)用户在配置中显式指定使用的计算库,如`cudnn`、`mkldnn`等。
- Actual Kernel:运行时从`Operator`的输入(`Variable`)可以推断出实际需要的`KernelType`
- 当Expected Kernel和Actual Kernel不一致的时候,框架会插入`data_transformer`或者`data_layerout_transform`等,保证Expected Kernel可以执行,包括:
- CPUPlace -> GPUPlace :跨设备内存复制
- NCHW -> nChw8c :Layout转换
- FP32 -> FP16 :精度转换 _**尚未支持**_
- ……
- 以上过程实现在OperatorWithKernel类的Run方法中 [->](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/operator.cc#L497)
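下面用一段概念性的示意代码(并非 `OperatorWithKernel::Run` 的真实实现)说明这一判断与变换的流程:
```cpp
#include <iostream>
#include <string>

// 仅示意:用两个字符串字段代表 KernelType 中与本页相关的部分
struct KernelType {
  std::string place;   // 例如 "CPU" / "CUDA"
  std::string layout;  // 例如 "NCHW" / "nChw8c"
};

void RunWithKernelSwitch(const KernelType& expected, const KernelType& actual) {
  // Expected 与 Actual 不一致时,先插入相应的数据变换
  if (expected.place != actual.place) {
    std::cout << "insert data_transform: " << actual.place
              << " -> " << expected.place << std::endl;
  }
  if (expected.layout != actual.layout) {
    std::cout << "insert data_layout_transform: " << actual.layout
              << " -> " << expected.layout << std::endl;
  }
  std::cout << "run kernel on " << expected.place
            << " with layout " << expected.layout << std::endl;
}

int main() {
  RunWithKernelSwitch(/*expected=*/{"CUDA", "NCHW"}, /*actual=*/{"CPU", "NCHW"});
}
```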
</font>
---
## ==6.== while_op
---
### while_op
<font size=5>
- 循环执行一段`Program`,直到条件operator判断循环条件不满足时终止循环
- while_op 的特殊之处:
1. while_op 没有 kernel
1. while_op 拥有自己的`Block`,会形成一段嵌套的`Block`
1. ==while_op 内部创建了一个 Executor,来循环执行`Block`==
- while_op 输入输出 : LoDTensorArray
```cpp
namespace paddle {
namespace framework {
using LoDTensorArray = std::vector<LoDTensor>;
}
}
```
- 每一次循环,从原始输入中“切出”一个片段
- LoDTensorArray 在Python端暴露,是Fluid支持的基础数据结构之一,用户可以直接创建并使用
</font>
---
### while_op [Run](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/while_op.cc#L42) 方法概览
<font size=5>
```cpp
void Run(const framework::Scope &scope,
const platform::Place &dev_place) const override {
PADDLE_ENFORCE_NOT_NULL(scope.FindVar(Input(kCondition)));
auto &cond = scope.FindVar(Input(kCondition))->Get<LoDTensor>();
PADDLE_ENFORCE_EQ(cond.dims(), paddle::framework::make_ddim({1}));
framework::Executor executor(dev_place);
auto *block = Attr<framework::BlockDesc *>(kStepBlock);
auto *program = block->Program();
auto step_scopes =
scope.FindVar(Output(kStepScopes))->GetMutable<StepScopeVar>();
while (cond.data<bool>()[0]) {
auto &current_scope = scope.NewScope();
step_scopes->push_back(&current_scope);
executor.Run(*program, &current_scope, block->ID(),
false /*create_local_scope*/);
}
}
```
</font>
---
### while_op 的重要应用:Dynamic RNN
---
### 什么是 `dynamicRNN` ?
<font size=5>
<br>
1. 用户可以自定义在一个时间步之内的计算, 框架接受序列输入数据,在其上循环调用用户定义的单步计算
1. 可学习参数在多个时间步之间共享
1. `dynamicRNN`由`while_op` 实现
1. 如果`dynamicRNN`中定义了`memory`,将会构成一个循环神经网络,否则其行为就等于在输入序列上循环调用预定义的单步计算
</font>
---
#### `dynamic RNN` 用户接口
<font size=5>
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/user_interface.png" width=75%>
</p>
- `dynamicRNN` 中的重要元素
1. **step input**: `dynamicRNN` 每个时间步的输入
1. **step function**: 用户定义的单步计算
1. **memory**: 用于形成循环连接
1. **external/static memory**:单步计算的每一步都可以全部读取到的外部输入
</font>
---
#### dynamicRNN 中的 Memory
<font size=5>
`dynamicRNN`中`memory`的行为非常类似于 C++ 中的引用变量
- `memory` “指向” 一个operator的输出变量,记作: A
- `memory` 可以被 LoDTensor 初始化(当LoD信息为空时,为非序列,否则为序列),默认`memory`被初始化为零
- `memory` 在 operator A 前向计算之后,进行前向计算
- `memory` 的前向计算会 "指向" A 的输出 LoDTensor
- `memory` 的输出可以是另一个 operator 的输入,于是形成了“循环”连接
</font>
---
### DynamicRNN 实现细节
<font size=5>
- `while_op` <span style="background-color:#DAB1D5;">无法独立构成dynamicRNN</span>,必须和一组相关的 operator 及数据结构配合
- 依赖的 operators (这里仅列出最重要的,并非全部):
- `lod_rank_table` operator
- `lod_tensor_to_array` operator
- `array_to_lod_tensor` operator
- `shrink_memory` operator
- 依赖的数据结构
- `TensorArray`
- `LoDRankTable`
- 在Fluid中,RNN接受变长序列输入,无需填充,以上数据结构和相关的operator配合工作,实现了对变长输入以batch方式计算
</font>
---
### `dynamicRNN` 如何实现 batch 计算 ?
<font size=5>
- 问题:
- RNN 可以看作是一个展开的前向网络,前向网络的深度是最长序列的长度
- 如果不将变长序列填充到一样长度,每个mini-batch的输入将会不等长,每个样本展开的长度不一致,导致前向和反向计算难以实现
</font>
----
##### 实例 :RNN encoder-decoder with attention
<font size=5>
- 以机器翻译的RNN encoder-decoder 模型(涉及了`dynamicRNN`的所有设计要素)为例,下图是 RNN encoder-decoder 的原始输入:
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/raw_input.png" width=100%><br><font size=3> Figure. RNN encoder-decoder 原始batch 输入数据</font>
</p>
- source word sequences 是encoder RNN的输入,是一个LoDTensor
- target word sequences 是look_uptable的输入,是一个LoDTensor
- 上图中一个矩形方块是CPU/GPU内存中一片连续的内存空间,表示一个dense vector
</font>
---
### `dynamicRNN` 如何实现 batch 计算 ?
<font size=5>
1. 对一个mini batch中不等长样本进行排序,最长样本变成batch中的第一个,最短样本是batch中最后一个
    - `LoDTensor` -> `LoDRankTable` :heavy_plus_sign: `lod_rank_table operator`
    - 可以将`LoDRankTable`理解为对LoDTensor中的多个序列按照长度排序的结果;LoDRankTable 存储了排序之后的index
2. 构建每个时间步的batch输入:随着时间步增加,每个时间步的batch输入可能会逐渐缩小
    - `TensorArray` :heavy_plus_sign: `lod_tensor_to_array` -> `LoDTensor` (without LoD)
3. 每个时间步的输出写入一个输出 `LoDTensorArray`
4. `dynamicRNN`循环结束后,按照`LoDRankTable`中记录的信息对输出`LoDTensorArray`重排序,还原为原始输入顺序(见下方示意代码)
    - `TensorArray` :heavy_plus_sign: `array_to_lod_tensor` -> `LoDTensor`
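下面是一段与 Paddle 实现无关的示意代码,演示“按长度排序”以及“每个时间步 batch size 逐渐缩小”这两点:
```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  // 一个 mini-batch 中各样本的序列长度(示意数据)
  std::vector<size_t> seq_lens = {3, 7, 5, 2};
  // LoDRankTable 的核心信息:按长度降序排序后的原始样本下标
  std::vector<size_t> rank(seq_lens.size());
  std::iota(rank.begin(), rank.end(), 0);
  std::sort(rank.begin(), rank.end(),
            [&](size_t a, size_t b) { return seq_lens[a] > seq_lens[b]; });
  // 第 t 个时间步的 batch size = 长度大于 t 的序列个数,随 t 增大而缩小
  for (size_t t = 0; t < seq_lens[rank[0]]; ++t) {
    size_t batch_size = std::count_if(
        seq_lens.begin(), seq_lens.end(), [&](size_t len) { return len > t; });
    std::cout << "step " << t << " batch_size = " << batch_size << std::endl;
  }
}
```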
</font>
---
### 运行实例
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/sorted_input.png" width=100%>
</p>
---
### 运行实例
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/1.png" width=100%>
</p>
<font size=5>
- 执行到第5~7个batch时,batch size将会缩小
</font>
---
### 运行实例
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/1.png" width=80%>
</p>
<font size=5>
- 第5 ~ 7个batch时RNN的`memory`会发生什么?
- `memory` 指向某个operator的输出Tensor,在该operator前向计算之后,“取回”其计算结果
- 5 ~ 7时,遇到了序列的结束,==下一个时间步计算不再需要在已经结束的序列上展开==
-`dynamicRNN``shrink_memory` operator 用来缩小`memory`的batch输入
</font>
---
### 运行实例:batch 1 ~ 2
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/2.png" width=70%><br><font size=4>Figure. 第1、2个batch输入dynamicRNN的batch输入</font>
</p>
---
### 运行实例:batch 3 ~ 4
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/3.png" width=70%><br><font size=4>Figure. 第3、4个batch输入dynamicRNN的batch输入</font>
</p>
---
### 运行实例:batch 5 ~ 7
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/4.png" width=70%><br><font size=4>Figure. 第5、6、7个batch输入dynamicRNN的batch输入</font>
</p>
---
### ==7.== Fluid 代码结构
---
### Fluid 代码结构
<table>
<thead>
<tr>
<th>代码结构</th>
<th>模块结构</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/fluid_module_1.png" width=60%>
</p>
</td>
<td>
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Fluiddoc/develop/doc/fluid/images/fluid_module_2.png" width=60%>
</p>
</td>
</tr>
</tbody>
</table>
---
### ==8.== 文档总结
---
<font size=5>
- 设计概览
- 重构概览 [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/refactorization.md)
- fluid [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/fluid.md)
- fluid_compiler [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/fluid/design/motivation/fluid_compiler.md)
- 核心概念
- variable 描述 [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/var_desc.md)
- Tensor [->](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.md)
- LoDTensor [->](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md)
- TensorArray [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/tensor_array.md)
- Program [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/program.md)
- Block [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/block.md)
- Scope [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/scope.md)
---
- 重要功能模块
- backward [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/backward.md)
- 内存优化 [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/memory_optimization.md)
- evaluator [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/executor.md)
- python API [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/python_api.md)
- regularization [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/regularization.md)
- 开发指南
- 支持新设备/计算库 [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/design/support_new_device.md)
- 添加新的Operator [->](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/howto/dev/new_op_cn.md)
- 添加新的Kernel [->](
https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/howto/dev/new_op_kernel_en.md)
</font>
---
### ==9.== 开发指南
---
#### 建议开发环境:使用 Docker 编译和测试
<font size=5>
Docker编译PaddlePaddle源码: [->](http://www.paddlepaddle.org/docs/develop/documentation/fluid/zh/build_and_install/docker_install_cn.html)
PaddlePaddle 在 Dockerhub 地址:[->](
https://hub.docker.com/r/paddlepaddle/paddle/tags/)
1. 获取PaddlePaddle的Docker镜像
```bash
docker pull paddlepaddle/paddle:latest-dev
```
1. 启动 docker container
```bash
docker run -it -v $PWD/Paddle:/paddle paddlepaddle/paddle:latest-dev /bin/bash
```
1. 进入docker container后,从源码编译,请参考文档 [->]( http://www.paddlepaddle.org/docs/develop/documentation/fluid/zh/build_and_install/build_from_source_cn.html)
</font>
---
### 一些说明
<font size=5>
1. PaddlePaddle的Docker镜像为了减小体积,默认没有安装vim,可以在容器中执行`apt-get install -y vim`来安装vim。
1. 开发推荐使用tag为`latest-dev`的镜像,其中打包了所有编译依赖。`latest`和`latest-gpu`是production镜像,主要用于运行PaddlePaddle程序。
2. 在Docker中运行GPU程序,推荐使用nvidia-docker,[否则需要将CUDA库和设备挂载到Docker容器内](http://www.paddlepaddle.org/docs/develop/documentation/fluid/zh/build_and_install/docker_install_cn.html)
<font size=4>
```bash
nvidia-docker run -it -v $PWD/Paddle:/paddle paddlepaddle/paddle:latest-dev /bin/bash
```
</font>
</font>
---
### [如何贡献](http://www.paddlepaddle.org/docs/develop/documentation/fluid/zh/dev/contribute_to_paddle_cn.html)
<font size=5>
- ==提交PullRequest前请务必阅读==: [->](http://www.paddlepaddle.org/docs/develop/documentation/fluid/zh/dev/contribute_to_paddle_cn.html)
- 代码要求
1. 代码注释遵守 Doxygen 的样式
1. 确保编译器选项 WITH_STYLE_CHECK 已打开,并且编译能通过代码样式检查
1. 所有代码必须具有单元测试,且能够通过所有单元测试
- 使用 `pre-commit` 钩子提交Pull Request
1. 帮助格式化源代码(C++,Python)
1. 在提交前自动检查一些基本事宜:如每个文件只有一个 EOL,Git 中不要添加大文件等
1. 安装pre-commit,并在PaddlePaddle根目录运行:
```bash
➜ pip install pre-commit
➜ pre-commit install
```
</font>
---
### 如何贡献
<font size=5>
1. 开始开发之前请先建立issue。
- 让其它同学知道某项工作已经有人在进行,以避免多人开发同一功能的情况。
1. 提交PR必须关联相关的issue。做法请参考:[->](https://help.github.com/articles/closing-issues-using-keywords/)
- 目的:为了在提交的版本中留有记录描述这个PR是为了开发什么样的功能,为了解决什么样的问题。
- 当PR被merge后,关联的issue会被自动关闭。
1. PR review 中,reviewer的每条comment都必须回复。
- 如修改完可直接回复:Done。
- 目的:review comment 中可能会有(1)询问类型的问题;(2)可以在下一个PR修改的问题;(3)comment意见不合理等。需要明确回复,以便reviewer和其他人有历史可查,便于区分是否已经进行修改,或者准备下一个PR修改,或者意见不合理可以不用进行修改。
</font>
---
### ==10.== 添加新的 Operator
---
### 概念简介
<font size=5>
添加一个新的operator,会涉及实现以下C++类的派生类:
1. `framework::OperatorBase`: Operator(简写,Op)基类。
1. `framework::OpKernel`: Op计算函数的基类,称作Kernel。
1. `framework::OperatorWithKernel`:继承自OperatorBase,Op有计算函数,称作有Kernel。
1. `class OpProtoAndCheckerMaker`:描述该Op的输入、输出、属性、注释,主要用于Python API接口生成
依据是否包含kernel,可以将Op分为两种:
1. 包含Kernel的Op:继承自OperatorWithKernel,==绝大多数operator都属于这一类==
1. 不包含kernel的Op,继承自OperatorBase,只有少量Op属于这一类,例如while_op,ifelse_op
<span style="background-color:#DAB1D5;">这里主要介绍带Kernel的Op如何编写。</span>
</font>
---
#### 添加新的Operator需要修改/添加哪些文件?
<font size=5>
<table>
<thead>
<tr>
<th>内容</th>
<th>定义位置</th>
</tr>
</thead>
<tbody>
<tr>
<td>
OpProtoMaker定义
</td>
<td>
`.cc`文件,<span style="background-color:#DAB1D5;">Backward Op不需要OpProtoMaker</span>
</td>
</tr>
<tr>
<td>
Op定义
</td>
<td>
`.cc`文件
</td>
</tr>
<tr>
<td>
Kernel实现
</td>
<td>
<span style="background-color:#DAB1D5;">CPU、CUDA共享Kernel实现在`.h`文件中</span>,否则,CPU 实现在`.cc`文件中,CUDA 实现在`.cu`文件中。
</td>
</tr>
<tr>
<td>
注册Op
</td>
<td>
Op注册实现在`.cc`文件;Kernel注册CPU实现在`.cc`文件中,CUDA实现在`.cu`文件中
</td>
</tr>
</tbody>
</table>
- 添加 Operator 之前请阅读:[Operator 命名规范](https://github.com/PaddlePaddle/Paddle/blob/63cca04cfd488a4dab6d6273fd04a8017ef45932/doc/fluid/dev/name_convention.md)[Operator Markdown注释规范](https://github.com/PaddlePaddle/Paddle/blob/63cca04cfd488a4dab6d6273fd04a8017ef45932/doc/fluid/dev/op_markdown_format.md)
- 实现新的op都添加至目录[paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators)下,文件命名以`*_op.h`(如有) 、 `*_op.cc``*_op.cu`(如有)结尾。
- 根据文件名自动构建op和Python端绑定,<span style="background-color:#DAB1D5;">请务必遵守以上命名,否则需要进一步修改PyBind相关文件及CMakeLists.txt</span>
</font>
---
###### 实现带Kernel的Operator <span style="background-color:#c4e1e1;">step1</span>: 定义ProtoMaker类
<font size=5>
下面均以[clip_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/clip_op.h)为例进行介绍
- clip_op计算公式:$Out = \min(\max(X, min), max)$
- 首先定义`ProtoMaker`来描述该Op的输入、输出,并添加注释(<font size=4>*下面代码段中的注释进行了简化,实现时需按照规范添加注释*</font>):
```cpp
template <typename AttrType>
class ClipOpMaker : public framework::OpProtoAndCheckerMaker {
public:
ClipOpMaker(OpProto* proto, OpAttrChecker* op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X","(Tensor)The input of clip op.");
AddOutput("Out", "(Tensor),The output of clip op.");
AddAttr<AttrType>(
"min", "(float),Minimum value.");
AddAttr<AttrType>(
"max", "(float),Maximum value.");
AddComment(R"DOC(
……
)DOC");
}
};
```
</font>
---
###### 实现带Kernel的Operator <span style="background-color:#c4e1e1;">step2</span>: 定义Operator类
<font size=5>
下面的代码段实现了`clip_op`的定义:
```cpp
class ClipOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of ClipOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutput("Out"),
"Output(Out) of ClipOp should not be null.");
auto x_dims = ctx->GetInputDim("X");
auto max = ctx->Attrs().Get<float>("max");
auto min = ctx->Attrs().Get<float>("min");
PADDLE_ENFORCE_LT(min, max, "max should be greater than min.");
ctx->SetOutputDim("Out", x_dims);
ctx->ShareLoD("X", /*->*/ "Out");
}
};
```
</font>
---
### Operator 类中需要完成的工作
<font size=5>
1. clip_op 继承自`OperatorWithKernel`
```cpp
using framework::OperatorWithKernel::OperatorWithKernel;
```
表示使用基类`OperatorWithKernel`的构造函数。
1. 重写`InferShape`接口。
- `InferShape` 为const函数,不能修改Op的成员变量
- `InferShape` 的参数为 `const framework::InferShapeContext &ctx`,从中可获取到输入输出以及属性
- `InferShape` 会被调用两次,一次是编译时(创建op),一次是运行时(调用op的`Run`方法时),需要完成以下功能:
1. 做检查, 尽早报错:检查输入数据维度、类型等是否合法
2. 设置输出Tensor的形状
<span style="background-color:#DAB1D5;">通常`OpProtoMaker``Op`类的定义写在`.cc`文件中。</span>
</font>
---
### 补充说明
<font size=5>
1. `InferShape`目前支持两种实现方式,<span style="background-color:#DAB1D5;">二者最后都会生成一个functor注册给OpInfo结构体。</span>
1. 继承framework::InferShapeBase,实现为一个functor(参考 [mul_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L22)
2. override InferShape函数(参考 [clip_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/clip_op.cc#L24)
1. 什么是`functor` ?
- 类或结构体仅重载了`()`,一般是可被多个kernel复用的计算函数。
<font size=4>
```cpp
template <typename T>
class CrossEntropyFunctor<platform::CPUDeviceContext, T> {
public:
void operator()(const platform::CPUDeviceContext& ctx,
framework::Tensor* out,
const framework::Tensor* prob,
const framework::Tensor* labels, const bool softLabel) {
……
}
};
```
</font>
- 在 clip_op 内也会看到将一段计算函数抽象为functor的用法: [->](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/clip_op.h#L27)。
</font>
---
###### 实现带Kernel的Operator <span style="background-color:#c4e1e1;">step3</span>: 定义OpKernel类
<font size=5>
- `ClipKernel`继承自`framework::OpKernel`,带有下面两个模板参数:
1. `typename DeviceContext`: 表示设备类型,不同设备共享同一个Kernel时,需添加该模板参数。不共享时,需要提供针对不同设备的特化实现。
1. `typename T` : 表示支持的数据类型,如`float`, `double`
-`ClipKernel`类中重写`Compute`方法
1. `Compute`接受输入参数:`const framework::ExecutionContext& context`
- `ExecutionContext` 是从 `Scope`中将运行时Op的输入、输出`Variable`组织在一起,使得Op在调用`Compute`方法时,能够简单地通过名字拿到需要的输入输出`Variable`
-`InferShapeContext`相比,`ExecutionContext` 中增加了设备类型
1.`Compute`函数里实现`OpKernel`的具体计算逻辑
</font>
---
#### ClipKernel 代码概览
<font size=5>
```cpp
template <typename DeviceContext, typename T>
class ClipKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto max = context.Attr<T>("max");
auto min = context.Attr<T>("min");
auto* x = context.Input<Tensor>("X");
auto* out = context.Output<Tensor>("Out");
T* out_data = out->mutable_data<T>(context.GetPlace());
const T* x_data = x->data<T>();
int64_t numel = x->numel();
Transform<DeviceContext> trans;
trans(context.template device_context<DeviceContext>(), x_data,
x_data + numel, out_data, ClipFunctor<T>(min, max));
}
};
```
- 为了使`OpKernel`的计算过程书写更加简单,并且CPU、CUDA的代码可以复用, Fluid 使用 Eigen 作为基础的矩阵运算库
- Fluid对Eigen unsupported Tensor提供了一些基本的封装,可以在`Compute`接口中直接调用
- 关于在PaddlePaddle中如何使用Eigen库,请参考[使用文档](https://github.com/PaddlePaddle/Fluiddoc/blob/develop/doc/fluid/dev/use_eigen_cn.md)
</font>
---
###### 实现带Kernel的Operator <span style="background-color:#c4e1e1;">step4</span>: 实现反向Op
<font size=5>
- ==**反向Op没有`ProtoMaker`**==,除此之外,定义与实现方式与前向Op完全一致,不再赘述
- 这里仅对反向Op的输入输出进行说明:
1. 反向Op的输入
- 前向Op的输出
- 反向传播过程中传递给当前Op的梯度
- 需要注意,<span style="background-color:#e1c4c4;">Fluid中,不区分Cost Op和中间层Op,所有Op都必须正确处理接收到的梯度</span>
2. 反向Op的输出
- 对可学习参数的求导结果
- 对所有输入的求导结果
</font>
---
###### 实现带Kernel的Operator <span style="background-color:#c4e1e1;">step5</span>: 注册Op及Kernel
<font size=5>
至此Op和Op kernel都已经实现完毕,接下来,需要在`.cc`和`.cu`文件中注册op和kernel
1. 在`.cc`文件中注册前向、反向Op类,注册CPU Kernel。
<font size=4>
```cpp
namespace ops = paddle::operators;
REGISTER_OP(clip, ops::ClipOp, ops::ClipOpMaker<float>, clip_grad,
ops::ClipOpGrad);
REGISTER_OP_CPU_KERNEL(
clip, ops::ClipKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(
clip_grad, ops::ClipGradKernel<paddle::platform::CPUDeviceContext, float>);
```
- 在上面的代码片段中:
1. `REGISTER_OP` : 注册`ops::ClipOp`类,类型名为`clip`,该类的`ProtoMaker`为`ops::ClipOpMaker`,注册`ops::ClipOpGrad`,类型名为`clip_grad`
1. `REGISTER_OP_WITHOUT_GRADIENT` : 用于注册没有反向的Op,例如:优化算法相关的Op
1. `REGISTER_OP_CPU_KERNEL` :注册`ops::ClipKernel`类,并特化模板参数为`paddle::platform::CPUDeviceContext`和`float`类型,同理,注册`ops::ClipGradKernel`类
</font>
1. 按照同样方法,在`.cu`文件中注册GPU Kernel
- <span style="background-color:#e1c4c4;">如果CUDA Kernel的实现基于Eigen,需在 `.cu`的开始加上宏定义 `#define EIGEN_USE_GPU` </span>
</font>
---
##### 编译和Python端绑定
<font size=5>
- 运行下面命令可以仅编译新添加的Op:
```
make mul_op
```
- <span style="background-color:#e1c4c4;">需注意,运行单元测试需要编译整个工程</span>
- 如果遵循前文的文件命名规则,构建过程中,会自动为新增的op添加Python端绑定,并链接到生成的lib库中
</font>
---
###### 实现带Kernel的Operator <span style="background-color:#c4e1e1;">step6</span>: 添加前向单测及梯度检测
<font size=5>
- 新增Op的单元测试统一添加至:[python/paddle/fluid/tests/unittests](https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle/fluid/tests/unittests)目录
- 前向Operator单测
1. Op单元测试继承自`OpTest`,各项具体的单元测试在`TestClipOp`里完成,所有单测case都以`TestXX`命名
1. 单元测试Operator,需要:
1. 在`setUp`函数定义输入、输出,以及相关的属性参数
1. 生成随机的输入数据
1. 在Python脚本中实现与前向operator相同的计算逻辑,得到输出值,与operator前向计算的输出进行对比
1. 反向梯度检测流程测试框架已经实现,直接调用相应接口`check_grad`即可
- `clip_op` 单测代码请参考 [->](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_clip_op.py),这里不再展开;下面另附一个简化的单测示意
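与上述步骤对应的一个简化单测示意(以 clip Op 为例;属性与数值仅作示意,实际实现请以上面链接的 test_clip_op.py 为准):
```python
import numpy as np
from op_test import OpTest  # Paddle 单测框架提供的基类(位于 unittests 目录)

class TestClipOp(OpTest):
    def setUp(self):
        self.op_type = "clip"
        min_v, max_v = 0.3, 0.7
        x = np.random.random((4, 5)).astype("float32")
        # 避免数值恰好落在 min/max 附近,便于梯度检测
        x[np.abs(x - min_v) < 0.05] = 0.5
        x[np.abs(x - max_v) < 0.05] = 0.5
        self.inputs = {'X': x}
        self.attrs = {'min': min_v, 'max': max_v}
        # 在 Python 端实现与前向 Op 相同的计算逻辑,作为期望输出
        self.outputs = {'Out': np.clip(x, min_v, max_v)}

    def test_check_output(self):
        self.check_output()

    def test_check_grad_normal(self):
        # 反向梯度检测直接调用测试框架提供的 check_grad 接口
        self.check_grad(['X'], 'Out')
```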
</font>
---
#### 编译执行单测
<font size=5>
- `python/paddle/fluid/tests/unittests` 目录下新增的 `test_*.py` 单元测试会被自动加入工程进行编译
- <span style="background-color:#e1c4c4;">运行单元测试时需要编译整个工程,并且编译时需要打开`WITH_TESTING`</span>, 即`cmake paddle_dir -DWITH_TESTING=ON`
- 编译成功后,执行下面的命令来运行单元测试:
```bash
make test ARGS="-R test_mul_op -V"
```
或者:
```
ctest -R test_mul_op
```
</font>
---
### 添加Op的一些注意事项
<font size=5>
- 为每个Op创建单独的`*_op.h`(如有)、`*_op.cc`和`*_op.cu`(如有)。<span style="background-color:#e1c4c4;">不允许一个文件中包含多个Op</span>,否则将会导致编译出错。
- 注册Op时的类型名,需要和该Op的名字一样。<span style="background-color:#e1c4c4;">不允许在`A_op.cc`里面,注册`REGISTER_OP(B, ...)`</span>,会导致单元测试出错。
- 如果Op<span style="background-color:#e1c4c4;">没有实现CUDA Kernel,不要创建空的`*_op.cu`</span>,会导致单元测试出错。
- 如果多个Op依赖一些共用的函数,可以创建非`*_op.*`格式的文件来存放,如`gather.h`文件。
</font>
---
### ==10.== 使用相关问题
---
### 定义前向计算
<font size=5>
- 当在python端执行时:
```python
import paddle.fluid as fluid
```
[`framework.py`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/framework.py#L1040)定义了两个全局`Program`:
```python
# program is a global instance.
_main_program_ = Program()
_startup_program_ = Program()
```
- 前向定义的过程就是不断往`main_program`中添加Op和Variable(参见本页末尾的示例)
- 当需要执行一个新的`main_program`时,可以调用:
```python
def switch_main_program(program):
"""
Switch the main program to a new program.
This function returns the previous main program.
"""
……
```
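一个简化的示例(仅作示意):下面的 layer 调用会把相应的 Op 和 Variable 追加到全局的 default main program 中:
```python
import paddle.fluid as fluid

# 每一个 layer 调用都会向全局的 default main program 中追加 Op 和 Variable
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)

# 打印当前 main_program,可以看到已经添加的 Op
print(fluid.default_main_program())
```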
</font>
---
### 自定义参数的初始化
<font size=5>
- 调用`fluid.ParamAttr(……)`接口,自定义参数的初始化
```python
w_param_attrs = ParamAttr(name=None,
initializer=UniformInitializer(low=-1.0, high=1.0, seed=0),
learning_rate=1.0,
regularizer=L1Decay(1.0),
trainable=True,
clip=GradientClipByValue(-1.0, 1.0),
)
y_predict = fluid.layers.fc(input=x, size=1, param_attr=w_param_attrs)
```
- 补充问题:如何创建 `Variable`
```python
cur_program = Program()
cur_block = cur_program.current_block()
new_var = cur_block.create_var(name="X", shape=[-1, 16, 16], dtype="float32")
```
</font>
---
### 添加反向Op
<font size=5>
- 调用`fluid.backward.append_backward(X)`(`X`是一个Variable),来为一段前向`ProgramDesc`添加反向Op
```python
data = fluid.layers.data(name="data", shape=(2,3,4))
out = fluid.layers.fc(input=data,size=128,act=None)
loss = fluid.layers.reduce_sum(out)
fluid.backward.append_backward(loss=loss)
```
- 添加优化相关的Op
```python
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(loss)
```
- 可以随时调用`print(fluid.default_main_program())`来输出当前的`main_program`
- 当构建完成整个`Program`后,调用下面的接口执行内存优化:
```python
fluid.memory_optimize(fluid.default_main_program())
```
- _<span style="background-color:#e1c4c4;">注:内存优化目前仍在持续开发中,有可能不够稳定。</span>_
</font>
---
### 总结:编译时执行流程
<font size=5>
- 用户定义前向计算
- 添加反向Op到`default_main_program`
- 添加 gradient clipping Op 到`default_main_program`
- 添加 regularization Op 到`default_main_program`
- 为指定的优化算法,添加相关的状态 variable of optimizer 到`default_startup_program`
- 状态相关 variable是指如学习率, 历史 momentum, 二阶momentum等
- 添加初始化 variable 的Op 到 `default_startup_program`
- 为整个网络最后一个op,添加设置其接受到的梯度的Op到`default_main_program`
- 进行内存优化规划
</font>
---
### Feed 数据 (一):通过 feed 字典
<font size=5>
- 执行executor的run方法时,指定feed字典,feed op 会将指定的数据放到`x`和`y`两个Variable中
```python
y_data = np.random.randint(0, 8, [1]).astype("int32")
y_tensor = core.Tensor()
y_tensor.set(y_data, place)
x_data = np.random.uniform(0.1, 1, [11, 8]).astype("float32")
x_tensor = core.Tensor()
x_tensor.set(x_data, place)
……
cost = exe.run(
fluid.default_main_program(),
feed={'x': x_tensor,
'y': y_tensor},
fetch_list=[avg_cost])
```
- 这种方法较为底层,一般用于单测中
</font>
---
### Feed 数据 (二):使用 DataFeeder接口
<font size=5>
- 编写一个data_reader函数,data_reader是一个Python generator
```python
def demo_reader():
def random_generator():
yield np.random.uniform(0.1, 1, [4]), np.random.randint(0, 1, [1])
return random_generator
```
- 在训练任务中使用 DataFeeder 接口
```python
train_reader = paddle.batch(
paddle.reader.shuffle(demo_reader(), buf_size=500), batch_size=4)
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
for data in train_reader():
cost = exe.run(
fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[cost])
```
</font>
---
### 常见问题
<font size=5>
- 如何使用 evaluator ? [->](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_label_semantic_roles.py#L168)
```python
accuracy = fluid.evaluator.Accuracy(input=predict, label=label)
for pass_id in range(PASS_NUM):
accuracy.reset()
for data in train_reader():
loss, acc = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[avg_cost] + accuracy.metrics)
pass_acc = accuracy.eval(exe)
# acc 是当前一个 batch 的 accuracy
# pass_acc 是当前 pass 截至该 batch 的累计 accuracy
pass_total_acc = accuracy.eval(exe) # 整个pass的accuracy
```
- 如何在训练中测试?[->](https://github.com/dzhwinter/benchmark/blob/master/fluid/vgg16.py#L144)
- 如何保存训练好的模型?[->](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_recognize_digits.py#L143)
- 如何加载训练好的模型进行预测?[->](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_recognize_digits.py#L154)
- 如何在同一个训练任务中定义多个Program,并交替运行? [->](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/demo/fc_gan.py)
- 如何profile?Fluid 实现了profile 工具,可以直接调用。请参考示例 [->](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_profiler.py)
</font>
---
# Python Data Reader Design Doc
During the training and testing phases, PaddlePaddle programs need to read data. To help users write code that reads input data, we define the following:
- A *reader*: A function that reads data (from file, network, random number generator, etc) and yields the data items.
- A *reader creator*: A function that returns a reader function.
- A *reader decorator*: A function, which takes in one or more readers, and returns a reader.
- A *batch reader*: A function that reads data (from *reader*, file, network, random number generator, etc) and yields a batch of data items.
We also provide a function to convert a reader into a batch reader, as well as frequently used reader creators and reader decorators.
## Data Reader Interface
*Data reader* doesn't have to be a function that reads and yields data items. It can just be any function without any parameters that creates an iterable (anything can be used in `for x in iterable`) as follows:
```
iterable = data_reader()
```
The item produced from the iterable should be a **single** entry of data and **not** a mini batch. The entry of data could be a single item or a tuple of items. Item should be of one of the [supported types](http://www.paddlepaddle.org/doc/ui/data_provider/pydataprovider2.html?highlight=dense_vector#input-types) (e.g., numpy 1d array of float32, int, list of int etc.)
An example implementation for single item data reader creator is as follows:
```python
def reader_creator_random_image(width, height):
def reader():
while True:
yield numpy.random.uniform(-1, 1, size=width*height)
return reader
```
An example implementation for multiple item data reader creator is as follows:
```python
def reader_creator_random_image_and_label(width, height, label):
def reader():
while True:
yield numpy.random.uniform(-1, 1, size=width*height), label
return reader
```
## Batch Reader Interface
*Batch reader* can be any function without any parameters that creates an iterable (anything can be used in `for x in iterable`). The output of the iterable should be a batch (list) of data items. Each item inside the list should be a tuple.
Here are some valid outputs:
```python
# a mini batch of three data items. Each data item consists of three columns of data, each of which is 1.
[(1, 1, 1),
(2, 2, 2),
(3, 3, 3)]
# a mini batch of three data items, each data item is a list (single column).
[([1,1,1],),
([2,2,2],),
([3,3,3],)]
```
Please note that each item inside the list must be a tuple, below is an invalid output:
```python
# wrong, [1,1,1] needs to be inside a tuple: ([1,1,1],).
# Otherwise it is ambiguous whether [1,1,1] means a single column of data [1, 1, 1],
# or three columns of data, each of which is 1.
[[1,1,1],
[2,2,2],
[3,3,3]]
```
It is easy to convert from a reader to a batch reader:
```python
mnist_train = paddle.dataset.mnist.train()
mnist_train_batch_reader = paddle.batch(mnist_train, 128)
```
It is also straightforward to create a custom batch reader:
```python
def custom_batch_reader():
while True:
batch = []
for i in xrange(128):
batch.append((numpy.random.uniform(-1, 1, 28*28),)) # note that it's a tuple being appended.
yield batch
mnist_random_image_batch_reader = custom_batch_reader
```
## Usage
Following is how we can use the reader with PaddlePaddle:
The batch reader, a mapping from item(s) to data layer, the batch size and the number of total passes will be passed into `paddle.train` as follows:
```python
# two data layers are created:
image_layer = paddle.layer.data("image", ...)
label_layer = paddle.layer.data("label", ...)
# ...
batch_reader = paddle.batch(paddle.dataset.mnist.train(), 128)
paddle.train(batch_reader, {"image":0, "label":1}, 128, 10, ...)
```
## Data Reader Decorator
The *Data reader decorator* takes in a single reader or multiple data readers and returns a new data reader. It is similar to a [python decorator](https://wiki.python.org/moin/PythonDecorators), but it does not use `@` in the syntax.
Since we have a strict interface for data readers (no parameters and return a single data item), a data reader can be used in a flexible way using data reader decorators. Following are a few examples:
### Prefetch Data
Since reading data may take some time and training can not proceed without data, it is generally a good idea to prefetch the data.
Use `paddle.reader.buffered` to prefetch data:
```python
buffered_reader = paddle.reader.buffered(paddle.dataset.mnist.train(), 100)
```
`buffered_reader` will try to buffer (prefetch) `100` data entries.
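As a rough sketch (not the actual `paddle.reader.buffered` implementation), such a prefetching decorator can be built with a background thread and a bounded queue:
```python
import threading
import Queue  # Python 2 standard library, matching the examples in this document

def buffered(reader, size):
    """A prefetching decorator: a background thread keeps the queue
    filled with up to `size` data entries."""
    def data_reader():
        q = Queue.Queue(maxsize=size)
        end = object()  # sentinel marking the end of the underlying reader

        def producer():
            for item in reader():
                q.put(item)
            q.put(end)

        t = threading.Thread(target=producer)
        t.daemon = True
        t.start()
        while True:
            item = q.get()
            if item is end:
                break
            yield item
    return data_reader
```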
### Compose Multiple Data Readers
For example, if we want to use a source of real images (say, reusing the mnist dataset) and a source of random images as input for [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661), we can do the following:
```python
def reader_creator_random_image(width, height):
def reader():
while True:
yield numpy.random.uniform(-1, 1, size=width*height)
return reader
def reader_creator_bool(t):
def reader():
while True:
yield t
return reader
true_reader = reader_creator_bool(True)
false_reader = reader_creator_bool(False)
reader = paddle.reader.compose(paddle.dataset.mnist.train(), reader_creator_random_image(20, 20), true_reader, false_reader)
# Skipped 1 because paddle.dataset.mnist.train() produces two items per data entry.
# And we don't care about the second item at this time.
paddle.train(paddle.batch(reader, 128), {"true_image":0, "fake_image": 2, "true_label": 3, "false_label": 4}, ...)
```
### Shuffle
Given the shuffle buffer size `n`, `paddle.reader.shuffle` returns a data reader that buffers `n` data entries and shuffles them before a data entry is read.
Example:
```python
reader = paddle.reader.shuffle(paddle.dataset.mnist.train(), 512)
```
## Q & A
### Why does a reader return only a single entry, and not a mini batch?
Returning a single entry makes reusing existing data readers much easier (for example, if an existing reader returns 3 entries instead of a single entry, the training code will be more complicated because it needs to handle cases like a batch size of 2).
We provide a function: `paddle.batch` to turn (a single entry) reader into a batch reader.
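A minimal sketch (not the actual `paddle.batch` implementation) of how a single-entry reader can be wrapped into a batch reader:
```python
def batch(reader, batch_size):
    def batch_reader():
        b = []
        for item in reader():
            b.append(item)
            if len(b) == batch_size:
                yield b
                b = []
        if b:
            yield b  # the last, possibly smaller, batch
    return batch_reader
```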
### Why do we need a batch reader, isn't it sufficient to give the reader and batch_size as arguments during training?
In most cases, it would be sufficient to give the reader and batch_size as arguments to the train method. However, sometimes the user wants to customize the order of data entries inside a mini batch, or even change the batch size dynamically. For these cases, using a batch reader is very efficient and helpful.
### Why use a dictionary instead of a list to provide mapping?
Using a dictionary (`{"image":0, "label":1}`) instead of a list (`["image", "label"]`) gives the advantage that the user can easily reuse the items (e.g., using `{"image_a":0, "image_b":0, "label":1}`) or even skip an item (e.g., using `{"image_a":0, "label":2}`).
### How to create a custom data reader creator ?
```python
def image_reader_creator(image_path, label_path, n):
def reader():
f = open(image_path)
l = open(label_path)
images = numpy.fromfile(
f, 'ubyte', count=n * 28 * 28).reshape((n, 28 * 28)).astype('float32')
images = images / 255.0 * 2.0 - 1.0
labels = numpy.fromfile(l, 'ubyte', count=n).astype("int")
for i in xrange(n):
yield images[i, :], labels[i] # a single entry of data is created each time
f.close()
l.close()
return reader
# images_reader_creator creates a reader
reader = image_reader_creator("/path/to/image_file", "/path/to/label_file", 1024)
paddle.train(paddle.batch(reader, 128), {"image":0, "label":1}, ...)
```
### How is `paddle.train` implemented
An example implementation of paddle.train is:
```python
def train(batch_reader, mapping, batch_size, total_pass):
for pass_idx in range(total_pass):
for mini_batch in batch_reader(): # this loop will never end in online learning.
do_forward_backward(mini_batch, mapping)
```
# Design Doc: Model Format
## Motivation
A model is an output of the training process. One complete model consists of two parts, the **topology** and the **parameters**. In order to support industrial deployment, the model format must be self-complete and must not expose any training source code.
As a result, in PaddlePaddle, the **topology** is represented as a [ProgramDesc](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/doc/design/program.md), which describes the model structure. The **parameters** contain all the trainable weights in the model. We must support large size parameters and efficient serialization/deserialization of parameters.
## Implementation
The topology is saved as plain text in a detailed, self-contained protobuf file.
The parameters are saved as a binary file. As we all know, the protobuf message has a limit of [64M size](https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.coded_stream#CodedInputStream.SetTotalBytesLimit.details). We have done a [benchmark experiment](https://github.com/PaddlePaddle/Paddle/pull/4610), which shows that protobuf is not a good fit for this task.
As a result, we design a particular format for tensor serialization. By default, an arbitrary tensor in Paddle is a [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md), and has a description information proto of [LoDTensorDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L99). We save the DescProto as the byte string header. It contains all the necessary information, such as the `dims` and the `LoD` information in [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/paddle/framework/lod_tensor.md). A tensor stores values in a continuous memory buffer. For speed, we dump the raw memory to disk and save it as the byte string content. So, the binary format of one tensor is as follows.
The table below shows a tensor's byte view in detail. Note that all the signed values are written in little-endian format. A short illustrative sketch of writing this layout follows the table.
<table>
<thead>
<tr>
<th>field name</th>
<th>type </th>
<th>description </th>
</tr>
</thead>
<tbody>
<tr>
<td> version</td>
<td> uint32_t </td>
<td> Version of saved file. Always 0 now.</td>
</tr>
<tr>
<td> tensor desc length </td>
<td> uint32_t </td>
<td> TensorDesc(Protobuf message) length in bytes. </td>
</tr>
<tr>
<td>tensor desc </td>
<td> void*</td>
<td> TensorDesc protobuf binary message </td>
</tr>
<tr>
<td> tensor data </td>
<td> void* </td>
<td> Tensor's data in binary format. The length of `tensor_data` is decided by `TensorDesc.dims()` and `TensorDesc.data_type()` </td>
</tr>
<tr>
<td> lod_level</td>
<td> uint64_t </td>
<td> Level of LoD </td>
</tr>
<tr>
<td> length of lod[0] </td>
<td> uint64_t </td>
<td> [Optional] length of lod[0] in bytes. </td>
</tr>
<tr>
<td> data of lod[0] </td>
<td> uint64_t* </td>
<td> [Optional] lod[0].data() </td>
</tr>
<tr>
<td>... </td>
<td> ... </td>
<td> ... </td>
</tr>
</tbody>
</table>
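For illustration, here is a rough sketch (not the actual Fluid implementation) of writing one tensor in the byte layout above, assuming `desc_bytes` already holds the serialized TensorDesc protobuf message, `data_bytes` the raw tensor buffer, and `lod` the list of LoD levels:
```python
import struct

def write_tensor(f, desc_bytes, data_bytes, lod):
    f.write(struct.pack('<I', 0))                  # version, always 0 for now
    f.write(struct.pack('<I', len(desc_bytes)))    # tensor desc length in bytes
    f.write(desc_bytes)                            # TensorDesc protobuf binary
    f.write(data_bytes)                            # raw tensor data
    f.write(struct.pack('<Q', len(lod)))           # lod_level
    for level in lod:                              # optional: one entry per LoD level
        f.write(struct.pack('<Q', len(level) * 8))             # length of lod[i] in bytes
        f.write(struct.pack('<%dQ' % len(level), *level))      # lod[i].data()
```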
## Summary
- We introduce a model format.
- The model represented by its forward-pass computation procedure is saved in a **ProgramDesc** protobuf message.
- A bunch of specified format binary tensors describe the **parameters**.
新手入门
============
如果需要快速了解PaddlePaddle的使用,可以参考以下指南。
.. toctree::
:maxdepth: 1
quickstart_cn.rst
在使用PaddlePaddle构建应用时,需要了解一些基本概念。
这里以一个线性回归为例子,详细介绍了PaddlePaddle的使用流程,包括数据格式,模型配置与训练等。
.. toctree::
:maxdepth: 1
concepts/use_concepts_cn.rst
developer's_guide_to_paddle_fluid.md
GET STARTED
============
If you want to quickly know how to use PaddlePaddle, please refer to the following guide:
.. toctree::
:maxdepth: 1
quickstart_en.rst
While using PaddlePaddle to build applications, please understand some basic concepts.
Here is an example of linear regression. It introduces workflow of PaddlePaddle, including data format, model configuration and training, etc.
.. toctree::
:maxdepth: 1
concepts/index_en.rst
developer's_guide_to_paddle_fluid.md
快速开始
========
快速安装
--------
PaddlePaddle支持使用pip快速安装,目前支持CentOS 6以上, Ubuntu 14.04以及MacOS 10.12,并安装有Python2.7。
执行下面的命令完成快速安装,版本为cpu_avx_openblas:
.. code-block:: bash
pip install paddlepaddle
如果需要安装支持GPU的版本(cuda8.0_cudnn5_avx_openblas),需要执行:
.. code-block:: bash
pip install paddlepaddle-gpu
更详细的安装和编译方法参考: :ref:`install_steps` 。
快速使用
--------
创建一个 housing.py 并粘贴此Python代码:
.. code-block:: python
import paddle.dataset.uci_housing as uci_housing
import paddle.fluid as fluid
with fluid.scope_guard(fluid.core.Scope()):
# initialize executor with cpu
exe = fluid.Executor(place=fluid.CPUPlace())
# load inference model
[inference_program, feed_target_names,fetch_targets] = \
fluid.io.load_inference_model(uci_housing.fluid_model(), exe)
# run inference
result = exe.run(inference_program,
feed={feed_target_names[0]: uci_housing.predict_reader()},
fetch_list=fetch_targets)
# print predicted price is $12,273.97
print 'Predicted price: ${:,.2f}'.format(result[0][0][0] * 1000)
执行 :code:`python housing.py` ,瞧!它应该打印出对测试住房数据的预测结果列表。
Quick Start
============
Quick Install
-------------
You can use pip to install PaddlePaddle with a single command. It supports
CentOS 6 or above, Ubuntu 14.04 or above, and MacOS 10.12, with Python 2.7 installed.
Simply run the following command to install, the version is cpu_avx_openblas:
.. code-block:: bash
pip install paddlepaddle
If you need to install GPU version (cuda8.0_cudnn5_avx_openblas), run:
.. code-block:: bash
pip install paddlepaddle-gpu
For more details about installation and build: :ref:`install_steps` .
Quick Use
---------
Create a new file called housing.py, and paste this Python
code:
.. code-block:: python
import paddle.dataset.uci_housing as uci_housing
import paddle.fluid as fluid
with fluid.scope_guard(fluid.core.Scope()):
# initialize executor with cpu
exe = fluid.Executor(place=fluid.CPUPlace())
# load inference model
[inference_program, feed_target_names,fetch_targets] = \
fluid.io.load_inference_model(uci_housing.fluid_model(), exe)
# run inference
result = exe.run(inference_program,
feed={feed_target_names[0]: uci_housing.predict_reader()},
fetch_list=fetch_targets)
# print predicted price is $12,273.97
print 'Predicted price: ${:,.2f}'.format(result[0][0][0] * 1000)
Run :code:`python housing.py` and voila! It should print out a list of predictions
for the test housing data.
# Fluid 分布式版本使用指南
本篇文章将说明如何在PaddlePaddle Fluid版本下进行分布式训练的配置和执行,以及将单机训练脚本改造成支持集群训练的版本
## 准备工作
* 可用的集群
包含一个或多个计算节点的集群,每一个节点都能够执行PaddlePaddle的训练任务且拥有唯一的IP地址,集群内的所有计算节点可以通过网络相互通信。
* 安装PaddlePaddle Fluid with Distribution版本
所有的计算节点上均需要安装分布式版本的PaddlePaddle, 在用于GPU等设备的机器上还需要额外安装好相应的驱动程序和CUDA的库。
**注意:**当前对外提供的PaddlePaddle版本并不支持分布式,需要通过源码重新编译。编译和安装方法参见[编译和安装指南](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/index_en.html)。
cmake编译命令中需要将WITH_DISTRIBUTE设置为ON,下面是一个cmake编译指令示例:
``` bash
cmake .. -DWITH_DOC=OFF -DWITH_GPU=OFF -DWITH_DISTRIBUTE=ON -DWITH_SWIG_PY=ON -DWITH_PYTHON=ON
```
## 更新训练脚本
这里,我们以[Deep Learning 101](http://www.paddlepaddle.org/docs/develop/book/01.fit_a_line/index.html)课程中的第一章 fit a line 为例,描述如何将单机训练脚本改造成支持集群训练的版本。
### 单机训练脚本示例
```python
import paddle
import paddle.fluid as fluid
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(x=cost)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)
BATCH_SIZE = 20
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.uci_housing.train(), buf_size=500),
batch_size=BATCH_SIZE)
place = fluid.CPUPlace()
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
PASS_NUM = 100
for pass_id in range(PASS_NUM):
fluid.io.save_persistables(exe, "./fit_a_line.model/")
fluid.io.load_persistables(exe, "./fit_a_line.model/")
for data in train_reader():
avg_loss_value, = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[avg_cost])
if avg_loss_value[0] < 10.0:
exit(0) # if avg cost less than 10.0, we think our code is good.
exit(1)
```
我们创建了一个简单的全连接神经网络程序,并且通过Fluid的Executor执行了100次迭代,现在我们需要将该单机版本的程序更新为分布式版本的程序。
### 介绍Parameter Server
在非分布式版本的训练脚本中,只存在Trainer一种角色,它不仅处理常规的计算任务,也处理参数相关的计算、保存和优化任务。在分布式版本的训练过程中,由于存在多个Trainer节点进行同样的数据计算任务,因此需要有一个中心化的节点来统一处理参数相关的保存和分配。在PaddlePaddle中,我们称这样的节点为[Parameter Server](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/dist_train/parameter_server.md)
**因此,在分布式的Fluid环境中,我们有两个角色需要创建,分别是Parameter Server和Trainer。**
### 分布式训练
Fluid专门提供了工具[Distributed Transpiler](https://github.com/PaddlePaddle/Paddle/blob/ba65d54d9d3b41cd3c5171b00f476d4e60133ddb/doc/fluid/design/dist_train/distributed_architecture.md#distributed-transpiler)用于将单机版的训练程序转换为分布式版本的训练程序。工具背后的理念是找出程序的优化算子和梯度参数,将它们分割为两部分,通过send/recv 操作算子进行连接,优化算子和梯度参数可以在优化器的minimize函数的返回值中获取到。
```python
optimize_ops, params_grads = sgd_optimizer.minimize(avg_cost)
```
将Distributed Transpiler、优化算子和梯度参数结合起来使用的代码如下:
```python
... #define the program, cost, and create sgd optimizer
optimize_ops, params_grads = sgd_optimizer.minimize(avg_cost) #get optimize OPs and gradient parameters
t = fluid.DistributeTranspiler() # create the transpiler instance
# slice the program into 2 pieces with optimizer_ops and gradient parameters list, as well as pserver_endpoints, which is a comma separated list of [IP:PORT] and number of trainers
t.transpile(optimize_ops, params_grads, pservers=pserver_endpoints, trainers=2)
... #create executor
# in pserver, run this
#current_endpoint here means current pserver IP:PORT you wish to run on
pserver_prog = t.get_pserver_program(current_endpoint)
pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
exe.run(pserver_startup)
exe.run(pserver_prog)
# in trainer, run this
... # define data reader
exe.run(fluid.default_startup_program())
for pass_id in range(100):
for data in train_reader():
exe.run(t.get_trainer_program())
```
### 分布式训练脚本运行说明
分布式任务的运行需要将表格中说明的多个参数进行赋值:
<table>
<thead>
<tr>
<th>参数名</th>
<th> 值类型</th>
<th>说明</th>
<th> 示例</th>
</tr>
</thead>
<tbody>
<tr>
<td>trainer_id </td>
<td> int</td>
<td> 当前训练节点的ID,训练节点ID编号为0 - n-1, n为trainers的值 </td>
<td> 0/1/2/3 </td>
</tr>
<tr>
<td>pservers </td>
<td> str</td>
<td> parameter server 列表 </td>
<td> 127.0.0.1:6710,127.0.0.1:6711 </td>
</tr>
<tr>
<td>trainers </td>
<td>int </td>
<td> 训练节点的总个数,>0的数字 </td>
<td> 4 </td>
</tr>
<tr>
<td> server_endpoint</td>
<td> str </td>
<td> 当前所起的服务节点的IP:PORT </td>
<td> 127.0.0.1:8789 </td>
</tr>
<tr>
<td> training_role</td>
<td>str </td>
<td> 节点角色, TRAINER/PSERVER </td>
<td> PSERVER </td>
</tr>
</tbody>
</table>
**注意:** ```training_role```用来区分当前所启动服务的角色,仅在训练程序中使用,用户可根据需要自行定义;其他参数为fluid.DistributeTranspiler的transpile函数所需,需要在调用该函数前定义。PSERVER 分支的样例如下,其后另附 TRAINER 分支的简化示意:
```python
t = fluid.DistributeTranspiler()
t.transpile(
optimize_ops,
params_grads,
trainer_id,
pservers=pserver,
trainers=trainers)
if training_role == "PSERVER":
pserver_prog = t.get_pserver_program(server_endpoint)
pserver_startup = t.get_startup_program(server_endpoint, pserver_prog)
```
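以下是 TRAINER 分支的一个简化示意(沿用上文单机脚本中的 exe、train_reader、feeder、avg_cost 等变量,仅作示意,并非完整实现):
```python
if training_role == "TRAINER":
    # 训练节点:执行初始化后,循环运行 transpile 得到的 trainer program
    trainer_prog = t.get_trainer_program()
    exe.run(fluid.default_startup_program())
    for pass_id in range(100):
        for data in train_reader():
            loss, = exe.run(trainer_prog,
                            feed=feeder.feed(data),
                            fetch_list=[avg_cost])
```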
### Demo
完整的demo代码位于Fluid的test目录下的[book](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_fit_a_line.py)中。
第一步,进入demo代码所在目录:
```bash
cd /paddle/python/paddle/fluid/tests/book
```
第二步,启动Parameter Server:
```bash
PADDLE_PSERVER_PORT=6174 PADDLE_PSERVER_IPS=192.168.1.2 PADDLE_TRAINERS=2 PADDLE_CURRENT_IP=192.168.1.2 PADDLE_TRAINER_ID=1 PADDLE_TRAINING_ROLE=PSERVER python test_fit_a_line.py
```
执行命令后请等待出现提示: ```Server listening on 192.168.1.2:6174 ```, 表示Parameter Server已经正常启动。
第三步,启动Trainer:
```bash
PADDLE_PSERVER_PORT=6174 PADDLE_PSERVER_IPS=192.168.1.2 PADDLE_TRAINERS=2 PADDLE_CURRENT_IP=192.168.1.3 PADDLE_TRAINER_ID=1 PADDLE_TRAINING_ROLE=TRAINER python test_fit_a_line.py
```
由于我们定义的Trainer的数量是2个,因此需要在另外一个计算节点上再启动一个Trainer。
现在我们就启动了一个包含一个Parameter Server和两个Trainer的分布式训练任务。
# Fluid Distributed Training
## Introduction
In this article, we'll explain how to configure and run distributed training jobs with PaddlePaddle Fluid in a bare metal cluster.
## Preparations
### Getting the cluster ready
Prepare the compute nodes in the cluster. Nodes in this cluster can be of any specification that runs PaddlePaddle, and with a unique IP address assigned to it. Make sure they can communicate to each other.
### Have PaddlePaddle installed
PaddlePaddle must be installed on all nodes. If you have GPU cards on your nodes, be sure to properly install drivers and CUDA libraries.
PaddlePaddle build and installation guide can be found [here](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/index_en.html).
In addition to the above, the `cmake` command should be run with the option `WITH_DISTRIBUTE` set to ON. An example bare minimum `cmake` command would look as follows:
``` bash
cmake .. -DWITH_DOC=OFF -DWITH_GPU=OFF -DWITH_DISTRIBUTE=ON -DWITH_SWIG_PY=ON -DWITH_PYTHON=ON
```
### Update the training script
#### Non-cluster training script
Let's take [Deep Learning 101](http://www.paddlepaddle.org/docs/develop/book/01.fit_a_line/index.html)'s first chapter: "fit a line" as an example.
The non-cluster version of this demo with fluid API is as follows:
``` python
import paddle
import paddle.fluid as fluid
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(x=cost)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)
BATCH_SIZE = 20
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.uci_housing.train(), buf_size=500),
batch_size=BATCH_SIZE)
place = fluid.CPUPlace()
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
PASS_NUM = 100
for pass_id in range(PASS_NUM):
fluid.io.save_persistables(exe, "./fit_a_line.model/")
fluid.io.load_persistables(exe, "./fit_a_line.model/")
for data in train_reader():
avg_loss_value, = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[avg_cost])
if avg_loss_value[0] < 10.0:
exit(0) # if avg cost less than 10.0, we think our code is good.
exit(1)
```
We created a simple fully-connected neural network training program and handed it to the fluid executor to run for 100 passes.
Now let's try to convert it to a distributed version to run on a cluster.
#### Introducing parameter server
As we can see from the non-cluster version of the training script, there is only one role in the script: the trainer, which performs the computing as well as holds the parameters. In cluster training, since multiple trainers are working on the same task, they need one centralized place to hold and distribute parameters. This centralized place is called the Parameter Server in PaddlePaddle.
![parameter server architecture](src/trainer.png)
Parameter Server in fluid not only holds the parameters but is also assigned with a part of the program. Trainers communicate with parameter servers via send/receive OPs. For more technical details, please refer to [this document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/dist_refactor/distributed_architecture.md).
Now we need to create programs for both the trainers and the parameter servers. The question is: how?
#### Slice the program
Fluid provides a tool called "Distributed Transpiler" that automatically converts the non-cluster program into cluster program.
The idea behind this tool is to find the optimize OPs and gradient parameters, slice the program into 2 pieces and connect them with send/receive OP.
Optimize OPs and gradient parameters can be found from the return values of optimizer's minimize function.
To put them together:
``` python
... #define the program, cost, and create sgd optimizer
optimize_ops, params_grads = sgd_optimizer.minimize(avg_cost) #get optimize OPs and gradient parameters
t = fluid.DistributeTranspiler() # create the transpiler instance
# slice the program into 2 pieces with optimizer_ops and gradient parameters list, as well as pserver_endpoints, which is a comma separated list of [IP:PORT] and number of trainers
t.transpile(optimize_ops, params_grads, pservers=pserver_endpoints, trainers=2)
... #create executor
# in pserver, run this
#current_endpoint here means current pserver IP:PORT you wish to run on
pserver_prog = t.get_pserver_program(current_endpoint)
pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
exe.run(pserver_startup)
exe.run(pserver_prog)
# in trainer, run this
... # define data reader
exe.run(fluid.default_startup_program())
for pass_id in range(100):
for data in train_reader():
exe.run(t.get_trainer_program())
```
### E2E demo
Please find the complete demo from [here](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book_distribute/notest_dist_fit_a_line.py).
First `cd` into the folder that contains the `python` files. In this case:
```bash
cd /paddle/python/paddle/fluid/tests/book_distribute
```
In parameter server node run the following in the command line:
``` bash
PSERVERS=192.168.1.2:6174 SERVER_ENDPOINT=192.168.1.2:6174 TRAINING_ROLE=PSERVER python notest_dist_fit_a_line.py
```
*please note we assume that your parameter server runs at 192.168.1.2:6174*
Wait until the prompt `Server listening on 192.168.1.2:6174`
Then in 2 of your trainer nodes run this:
``` bash
PSERVERS=192.168.1.2:6174 SERVER_ENDPOINT=192.168.1.2:6174 TRAINING_ROLE=TRAINER python notest_dist_fit_a_line.py
```
*the reason you need to run this command on 2 nodes is that the script sets the trainer count to 2. You can change this setting on line 50*
Now you have 2 trainers and 1 parameter server up and running.
# How to use RecordIO in Fluid
If you want to use RecordIO as your training data format, you need to convert your training data
to RecordIO files and read them during training. PaddlePaddle Fluid provides some
interfaces to deal with RecordIO files.
## Generate RecordIO File
Before starting training with RecordIO files, you need to convert your training data
to the RecordIO format with `fluid.recordio_writer.convert_reader_to_recordio_file`. A sample
is as follows:
```python
import paddle
import paddle.fluid as fluid
import paddle.dataset.mnist as mnist

reader = paddle.batch(mnist.train(), batch_size=1)
feeder = fluid.DataFeeder(
feed_list=[ # order is image and label
fluid.layers.data(
name='image', shape=[784]),
fluid.layers.data(
name='label', shape=[1], dtype='int64'),
],
place=fluid.CPUPlace())
fluid.recordio_writer.convert_reader_to_recordio_file('./mnist.recordio', reader, feeder)
```
The above code snippet would generate a RecordIO file `./mnist.recordio` on your host.
**NOTE**: we recommend setting `batch_size=1` when generating the RecordIO files so that
the batch size can be adjusted flexibly at reading time.
## Use the RecordIO file in a Local Training Job
PaddlePaddle Fluid provides the interface `fluid.layers.io.open_recordio_file` to load your RecordIO file,
and you can then use it as a layer in your network configuration. A sample is as follows:
```python
import paddle.fluid as fluid

data_file = fluid.layers.io.open_recordio_file(
filename="./mnist.recordio",
shapes=[(-1, 784),(-1, 1)],
lod_levels=[0, 0],
dtypes=["float32", "int32"])
data_file = fluid.layers.io.batch(data_file, batch_size=4)
img, label = fluid.layers.io.read_file(data_file)
hidden = fluid.layers.fc(input=img, size=100, act='tanh')
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
fluid.optimizer.Adam(learning_rate=1e-3).minimize(avg_loss)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
avg_loss_np = []
# train a pass
batch_id = 0
while True:
tmp, = exe.run(fetch_list=[avg_loss])
avg_loss_np.append(tmp)
print(batch_id)
batch_id += 1
```
## Use the RecordIO files in Distributed Training
1. generate multiple RecordIO files
For a distributed training job, you may have multiple trainer nodes,
and one or more RecordIO files for each trainer node. You can use the interface
`fluid.recordio_writer.convert_reader_to_recordio_files` to convert your training data
into multiple RecordIO files, as follows:
```python
reader = paddle.batch(mnist.train(), batch_size=1)
feeder = fluid.DataFeeder(
feed_list=[ # order is image and label
fluid.layers.data(
name='image', shape=[784]),
fluid.layers.data(
name='label', shape=[1], dtype='int64'),
],
place=fluid.CPUPlace())
fluid.recordio_writer.convert_reader_to_recordio_files(
'./mnist.recordio', 100, reader, feeder)
```
The above codes would generate multiple RecordIO files on your host like:
```bash
.
\_mnist-00000.recordio
|-mnist-00001.recordio
|-mnist-00002.recordio
|-mnist-00003.recordio
|-mnist-00004.recordio
```
2. open multiple RecordIO files by `fluid.layers.io.open_files`
For a distributed training job, the distributed operator system will schedule trainer processes on multiple nodes,
and each trainer process reads a part of the whole training data. We usually take the following approach to make the
training data allocated to each trainer process as uniform as possible:
```python
import glob
import os
import paddle.fluid as fluid

def gen_train_list(file_pattern, trainers, trainer_id):
file_list = glob.glob(file_pattern)
ret_list = []
for idx, f in enumerate(file_list):
if (idx + trainers) % trainers == trainer_id:
ret_list.append(f)
return ret_list
trainers = int(os.getenv("PADDLE_TRAINERS"))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
data_file = fluid.layers.io.open_files(
filenames=gen_train_list("./mnist-[0-9]*.recordio", trainers, trainer_id),
thread_num=1,
shapes=[(-1, 784),(-1, 1)],
lod_levels=[0, 0],
dtypes=["float32", "int32"])
img, label = fluid.layers.io.read_file(data_file)
...
```
# Distributed Training with NCCL2 and RDMA
When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. We introduce a way to use NCCL2 for such training jobs to
achieve the best performance.
## Prepare Hardware with RDMA and Multiple GPUs
I'm using two Linux servers, each installed with 8 GPUs and
one 100Gb RDMA card.
Base environment is:
* OS: CentOS 7.4
* RDMA device: "Mellanox Technologies MT27700 Family [ConnectX-4]"
* Kernel version: `4.4.88-1.el7.elrepo.x86_64`
* Docker version: `1.12.6`
* Docker storage driver: `overlay2`
* IP addresses: 192.168.16.30,192.168.16.34
In general, the steps include:
1. Install GPU drivers
1. Install RDMA drivers
1. Install "InfiniBand Support"
1. Use docker to run tests and make sure GPUs and RDMA can work inside
the container.
I'll omit the section "Install GPU drivers" because we can find it easily
somewhere else.
### Install RDMA drivers
For my case, I've got two machines with device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
"CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
work with the latest overlay2 filesystem.
***NOTE: before you start, make sure you have a way to get a console
of the server other than ssh because we may need to re-configure the
network device.***
1. Go to http://www.mellanox.com/page/products_dyn?product_family=26,
download `MLNX_OFED` software in the bottom of the page, and upload it
onto the server.
1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything work; note that
this operation may cause the network to go down if you are using this
RDMA device as the default network device and are logged in to the server via ssh.
1. Re-configure the network interface, for example:
`ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
`ip route add default via 192.168.16.1 dev eth2`.
1. Do the same thing on the other node.
1. Use `ping` to test whether the two nodes have a working ICMP connection.
1. Use either `udaddy` or `ib_write_bw` to verify that the network connection is
ready and has the desired bandwidth.
### Prepare Docker Image to Run RDMA Programs
1. Build a docker image using cuda base image like: `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install paddlepaddle whl
package in it.
1. Start a docker container and mount GPU driver libs into it (you can
skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the docker image (see below section),
also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`
or just use privileged mode `--privileged`.
1. Start the container using host network mode: `--net=host`
### RDMA Library Files Needed
Usually, `MLNX_OFED` installs the latest supported libs under
`/usr/lib64/mlnx_ofed/valgrind`. Other libs needed to run RDMA programs
are listed below. These libs must be mounted into the docker container.
* Libs under `/usr/lib64/mlnx_ofed/valgrind`
* libibcm.so
* libibverbs.so
* libmlx4.so
* libmlx5.so
* libmlx5-rdmav2.so
* librdmacm.so
* Other libs:
* libnl-3.so.200
* libnl-route-3.so.200
* libnuma.so.1
## Start to Run the Training Job
Set the following NCCL environment variables to turn NCCL switches on and off:
| Env Name | Description |
| --- | --- |
| NCCL_SOCKET_IFNAME | The RDMA device, e.g. eth2 |
| NCCL_P2P_DISABLE | Set to 1 to disable P2P transfer between GPUs |
| NCCL_IB_DISABLE | Set to 1 to disable using RDMA |
| NCCL_IB_CUDA_SUPPORT | Set to 1 to enable GPU Direct if supported |
| NCCL_DEBUG | Set debug level: VERSION, WARN, INFO |
My two servers are `192.168.16.30` and `192.168.16.34`. On node 1, run:
```bash
PADDLE_TRAINER_ID=0 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.30 stdbuf -oL python vgg16.py
```
On node 2, Run:
```bash
PADDLE_TRAINER_ID=1 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.34 stdbuf -oL python vgg16.py
```
进阶使用
------------
.. toctree::
:maxdepth: 1
inference/index_cn.rst
optimization/index_cn.rst
HOW TO
------------
.. toctree::
:maxdepth: 1
optimization/index_en.rst
安装与编译C++预测库
===========================
直接下载安装
-------------
====================== ========================================
版本说明 C++预测库
====================== ========================================
cpu_avx_mkl `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/.lastSuccessful/fluid.tgz/?branch=0.14.0>`_
cpu_avx_openblas `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/.lastSuccessful/fluid.tgz/?branch=0.14.0>`_
cpu_noavx_openblas `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/.lastSuccessful/fluid.tgz/?branch=0.14.0>`_
cuda7.5_cudnn5_avx_mkl `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda75cudnn5cp27cp27mu/.lastSuccessful/fluid.tgz/?branch=0.14.0>`_
cuda8.0_cudnn5_avx_mkl `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/.lastSuccessful/fluid.tgz/?branch=0.14.0>`_
cuda8.0_cudnn7_avx_mkl `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/.lastSuccessful/fluid.tgz/?branch=0.14.0>`_
cuda9.0_cudnn7_avx_mkl `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/.lastSuccessful/fluid.tgz/?branch=0.14.0>`_
====================== ========================================
从源码编译
----------
用户也可以从 PaddlePaddle 核心代码编译C++预测库,只需在编译时配置下面这些编译选项:
================= =========
选项 值
================= =========
CMAKE_BUILD_TYPE Release
FLUID_INSTALL_DIR 安装路径
WITH_FLUID_ONLY ON(推荐)
WITH_SWIG_PY OFF(推荐)
WITH_PYTHON OFF(推荐)
WITH_GPU ON/OFF
WITH_MKL ON/OFF
================= =========
建议按照推荐值设置,以避免链接不必要的库。其它可选编译选项按需进行设定。
下面的代码片段从github拉取最新代码,配置编译选项(需要将PADDLE_ROOT替换为PaddlePaddle预测库的安装路径):
.. code-block:: bash
PADDLE_ROOT=/path/of/capi
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
mkdir build
cd build
cmake -DFLUID_INSTALL_DIR=$PADDLE_ROOT \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_FLUID_ONLY=ON \
-DWITH_SWIG_PY=OFF \
-DWITH_PYTHON=OFF \
-DWITH_MKL=OFF \
-DWITH_GPU=OFF \
..
make
make inference_lib_dist
成功编译后,使用C++预测库所需的依赖(包括:(1)编译出的PaddlePaddle预测库和头文件;(2)第三方链接库和头文件;(3)版本信息与编译选项信息)
均会存放于PADDLE_ROOT目录中。目录结构如下:
.. code-block:: text
PaddleRoot/
├── CMakeCache.txt
├── paddle
│   └── fluid
│   ├── framework
│   ├── inference
│   ├── memory
│   ├── platform
│   ├── pybind
│   └── string
├── third_party
│   ├── boost
│   │   └── boost
│   ├── eigen3
│   │   ├── Eigen
│   │   └── unsupported
│   └── install
│   ├── gflags
│   ├── glog
│   ├── mklml
│   ├── protobuf
│   ├── snappy
│   ├── snappystream
│   └── zlib
└── version.txt
version.txt 中记录了该预测库的版本信息,包括Git Commit ID、使用OpenBlas或MKL数学库、CUDA/CUDNN版本号,如:
.. code-block:: text
GIT COMMIT ID: c95cd4742f02bb009e651a00b07b21c979637dc8
WITH_MKL: ON
WITH_GPU: ON
CUDA version: 8.0
CUDNN version: v5
预测库
------------
.. toctree::
:maxdepth: 1
build_and_install_lib_cn.rst
inference_support_in_fluid_cn.md
# 使用指南
## 目录:
- Python Inference API
- Inference C++ API
- Inference实例
- Inference计算优化
## Python Inference API **[改进中]**
- 保存Inference模型 ([链接](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/io.py#L295))
```python
def save_inference_model(dirname,
feeded_var_names,
target_vars,
executor,
main_program=None,
model_filename=None,
params_filename=None):
```
Inference模型和参数将会保存到`dirname`目录下:
- 序列化的模型
- `model_filename`为`None`时,保存到`dirname/__model__`
- `model_filename`不为`None`时,保存到`dirname/model_filename`
- 参数
- `params_filename`为`None`时,单独保存到各个独立的文件,各文件以参数变量的名字命名
- `params_filename`不为`None`时,保存到`dirname/params_filename`
- 两种存储格式
- 参数保存到各个独立的文件
- 如,设置`model_filename`为`None`,`params_filename`为`None`
```bash
$ cd recognize_digits_conv.inference.model
$ ls
$ __model__ batch_norm_1.w_0 batch_norm_1.w_2 conv2d_2.w_0 conv2d_3.w_0 fc_1.w_0 batch_norm_1.b_0 batch_norm_1.w_1 conv2d_2.b_0 conv2d_3.b_0 fc_1.b_0
```
- 参数保存到同一个文件
- 如,设置`model_filename`为`None`,`params_filename`为`__params__`
```bash
$ cd recognize_digits_conv.inference.model
$ ls
$ __model__ __params__
```
- 加载Inference模型([链接](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/io.py#L380)),其后附一个简化的保存/加载调用示意
```python
def load_inference_model(dirname,
executor,
model_filename=None,
params_filename=None):
...
return [program, feed_target_names, fetch_targets]
```
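一个简化的保存/加载调用示意(网络结构、目录与变量名仅作示意):
```python
import paddle.fluid as fluid

# 构建一个最小网络(仅作示意)
x = fluid.layers.data(name="x", shape=[13], dtype="float32")
predict = fluid.layers.fc(input=x, size=1, act=None)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

# 保存:指定喂入变量名列表与目标变量列表
fluid.io.save_inference_model(
    dirname="./tiny.inference.model",
    feeded_var_names=["x"],
    target_vars=[predict],
    executor=exe)

# 加载:返回裁剪后的 program、feed 变量名列表与 fetch 目标列表
[inference_program, feed_target_names, fetch_targets] = \
    fluid.io.load_inference_model("./tiny.inference.model", exe)
```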
## 链接Fluid Inference库
- 示例项目([链接](https://github.com/luotao1/fluid_inference_example.git))
- GCC配置
```bash
$ g++ -o a.out -std=c++11 main.cc \
-I${PADDLE_ROOT}/ \
-I${PADDLE_ROOT}/third_party/install/gflags/include \
-I${PADDLE_ROOT}/third_party/install/glog/include \
-I${PADDLE_ROOT}/third_party/install/protobuf/include \
-I${PADDLE_ROOT}/third_party/eigen3 \
-L${PADDLE_ROOT}/paddle/fluid/inference -lpaddle_fluid \
-lrt -ldl -lpthread
```
- CMake配置
```cmake
include_directories(${PADDLE_ROOT}/)
include_directories(${PADDLE_ROOT}/third_party/install/gflags/include)
include_directories(${PADDLE_ROOT}/third_party/install/glog/include)
include_directories(${PADDLE_ROOT}/third_party/install/protobuf/include)
include_directories(${PADDLE_ROOT}/third_party/eigen3)
target_link_libraries(${TARGET_NAME}
${PADDLE_ROOT}/paddle/fluid/inference/libpaddle_fluid.so
-lrt -ldl -lpthread)
```
- 设置环境变量:
`export LD_LIBRARY_PATH=${PADDLE_ROOT}/paddle/fluid/inference:$LD_LIBRARY_PATH`
## C++ Inference API
- 推断流程([链接](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/test_helper.h#L91))
- 1、 初始化设备
```cpp
#include "paddle/fluid/framework/init.h"
paddle::framework::InitDevices(false);
```
- 2、 定义place,executor,scope
```cpp
auto place = paddle::platform::CPUPlace();
auto executor = paddle::framework::Executor(place);
auto* scope = new paddle::framework::Scope();
```
- 3、 加载模型
```cpp
#include "paddle/fluid/inference/io.h"
auto inference_program = paddle::inference::Load(executor, *scope, dirname);
// or
auto inference_program = paddle::inference::Load(executor,
*scope,
dirname + "/" + model_filename,
dirname + "/" + params_filename);
```
- 4、 获取`feed_target_names`和`fetch_target_names`
```cpp
const std::vector<std::string>& feed_target_names = inference_program->GetFeedTargetNames();
const std::vector<std::string>& fetch_target_names = inference_program->GetFetchTargetNames();
```
- 5、 准备`feed`数据
```cpp
#include "paddle/fluid/framework/lod_tensor.h"
std::vector<paddle::framework::LoDTensor*> cpu_feeds;
...
std::map<std::string, const paddle::framework::LoDTensor*> feed_targets;
for (size_t i = 0; i < feed_target_names.size(); ++i) {
// Please make sure that cpu_feeds[i] is right for feed_target_names[i]
feed_targets[feed_target_names[i]] = cpu_feeds[i];
}
```
- 6、 定义`Tensor`来`fetch`结果
```cpp
std::vector<paddle::framework::LoDTensor*> cpu_fetchs;
std::map<std::string, paddle::framework::LoDTensor*> fetch_targets;
for (size_t i = 0; i < fetch_target_names.size(); ++i) {
fetch_targets[fetch_target_names[i]] = cpu_fetchs[i];
}
```
- 7、 执行`inference_program`
```cpp
executor.Run(*inference_program, scope, feed_targets, fetch_targets);
```
- 8、 使用`fetch`数据
```cpp
for (size_t i = 0; i < cpu_fetchs.size(); ++i) {
std::cout << "lod_i: " << cpu_fetchs[i]->lod();
std::cout << "dims_i: " << cpu_fetchs[i]->dims();
std::cout << "result:";
float* output_ptr = cpu_fetchs[i]->data<float>();
for (int j = 0; j < cpu_fetchs[i]->numel(); ++j) {
std::cout << " " << output_ptr[j];
}
std::cout << std::endl;
}
```
针对不同的数据,4. - 8.可执行多次。
- 9、 释放内存
```cpp
delete scope;
```
- 接口说明
```cpp
void Run(const ProgramDesc& program, Scope* scope,
std::map<std::string, const LoDTensor*>& feed_targets,
std::map<std::string, LoDTensor*>& fetch_targets,
bool create_vars = true,
const std::string& feed_holder_name = "feed",
const std::string& fetch_holder_name = "fetch");
```
- 使用Python API `save_inference_model`保存的`program`里面包含了`feed_op`和`fetch_op`,用户提供的`feed_targets`和`fetch_targets`必须和`inference_program`中的`feed_op`、`fetch_op`保持一致。
- 用户提供的`feed_holder_name`和`fetch_holder_name`也必须和`inference_program`中的`feed_op`、`fetch_op`保持一致,可使用`SetFeedHolderName`和`SetFetchHolderName`接口重新设置`inference_program`
- 默认情况下,除了`persistable`属性设置为`True`的`Variable`之外,每次执行`executor.Run`会创建一个局部`Scope`,并且在这个局部`Scope`中创建和销毁所有的`Variable`,以最小化空闲时的内存占用。
- `persistable`属性为`True`的`Variable`有:
- Operators的参数`w`、`b`
- `feed_op`的输入变量
- `fetch_op`的输出变量
- **不在每次执行时创建和销毁变量
([PR](https://github.com/PaddlePaddle/Paddle/pull/9301))**
- 执行`inference_program`
```cpp
// Call once
executor.CreateVariables(*inference_program, scope, 0);
// Call as many times as you like
executor.Run(
*inference_program, scope, feed_targets, fetch_targets, false);
```
- **优点**
- 节省了频繁创建、销毁变量的时间(约占每次`Run`总时间的1% ~ 12%)
- 执行结束后可获取所有Operators的计算结果
- **缺点**
- 空闲时也会占用大量的内存
- 在同一个`Scope`中,相同的变量名是公用同一块内存的,容易引起意想不到的错误
- **不在每次执行时创建Op([PR](https://github.com/PaddlePaddle/Paddle/pull/9630))**
- 执行`inference_program`
```cpp
// Call once
auto ctx = executor.Prepare(*inference_program, 0);
// Call as many times as you like if you have no need to change the inference_program
executor.RunPreparedContext(ctx.get(), scope, feed_targets, fetch_targets);
```
- **优点**
- 节省了频繁创建、销毁Op的时间
- **缺点**
- 一旦修改了`inference_program`,则需要重新创建`ctx`
- **多线程共享Parameters([链接](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/test_multi_thread_helper.h))**
- 主线程
- 1、 初始化设备
- 2、 定义`place``executor``scope`
- 3、 加载模型,得到`inference_program`
- 从线程
- **复制`inference_program`得到`copy_program`,修改`copy_program`的`feed_holder_name`和`fetch_holder_name`**
```cpp
auto copy_program = std::unique_ptr<paddle::framework::ProgramDesc>(
new paddle::framework::ProgramDesc(*inference_program));
std::string feed_holder_name = "feed_" + paddle::string::to_string(thread_id);
std::string fetch_holder_name = "fetch_" + paddle::string::to_string(thread_id);
copy_program->SetFeedHolderName(feed_holder_name);
copy_program->SetFetchHolderName(fetch_holder_name);
```
- 4、 获取`copy_program`的`feed_target_names`和`fetch_target_names`
- 5、 准备feed数据,定义Tensor来fetch结果
- 6、 执行`copy_program`
```cpp
executor->Run(*copy_program, scope, feed_targets, fetch_targets, true, feed_holder_name, fetch_holder_name);
```
- 7、 使用fetch数据
- 主线程
- 8、 释放资源
- 基本概念
- 数据相关:
- [Tensor](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/tensor.md),一个N维数组,数据可以是任意类型(int,float,double等)
- [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/lod_tensor.md),带LoD(Level-of-Detail)即序列信息的Tensor
- [Scope](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md),记录了变量Variable
- 执行相关:
- [Executor](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/executor.md),无状态执行器,只跟设备相关
- Place
- CPUPlace,CPU设备
- CUDAPlace,CUDA GPU设备
- 神经网络表示:
- [Program](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/program.md).
详细介绍请参考[**Paddle Fluid开发者指南**](https://github.com/lcy-seso/learning_notes/blob/master/Fluid/developer's_guid_for_Fluid/Developer's_Guide_to_Paddle_Fluid.md)
## Inference实例
1. fit a line: [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_fit_a_line.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/book/test_inference_fit_a_line.cc)
1. image classification: [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_image_classification.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/book/test_inference_image_classification.cc)
1. label semantic roles: [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_label_semantic_roles.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/book/test_inference_label_semantic_roles.cc)
1. recognize digits: [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_recognize_digits.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/book/test_inference_recognize_digits.cc)
1. recommender system: [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_recommender_system.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/book/test_inference_recommender_system.cc)
1. understand sentiment: [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_understand_sentiment.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/book/test_inference_understand_sentiment.cc)
1. word2vec: [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_word2vec.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/book/test_inference_word2vec.cc)
## Inference计算优化
- 使用Python推理优化工具([inference_transpiler](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/inference_transpiler.py))
```python
class InferenceTranspiler:
def transpile(self, program, place, scope=None):
...
if scope is None:
scope = global_scope()
...
```
- 使用`InferenceTranspiler`将会直接修改`program`
- 使用`InferenceTranspiler`会修改参数的值,请确保`program`的参数在`scope`内。
- 支持的优化
- 融合batch_norm op的计算
- 使用示例([链接](https://github.com/Xreki/Xreki.github.io/blob/master/fluid/inference/inference_transpiler.py))
```python
import paddle.fluid as fluid
# NOTE: Applying the inference transpiler will change the inference_program.
t = fluid.InferenceTranspiler()
t.transpile(inference_program, place, inference_scope)
```
## 内存使用优化
- 使用Python内存优化工具([memory_optimization_transpiler](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/memory_optimization_transpiler.py))
```python
fluid.memory_optimize(inference_program)
```
基准
------------
.. toctree::
:maxdepth: 1
vgg16/README.md
README.md
Benchmark
------------
.. toctree::
:maxdepth: 1
vgg16/README.md
README.md
性能优化
------------
.. toctree::
:maxdepth: 1
timeline.md
cpu_profiling_cn.md
benchmark/index_cn.rst
Performance Optimization
---------------------------
.. toctree::
:maxdepth: 1
timeline.md
cpu_profiling_en.md
benchmark/index_en.rst
# Error Clip
## Overview
Error clip is widely used in model training to prevent gradient exploding. It takes some specific rules to adjust variables' gradients and prevent them from being too large. With it, the values of a gradient will be checked before they are taken by the next `grad_op`, and shrunk if necessary.
## Usage
Users are allowed to assign different error clip methods or attributes to different `Variable`s. Users can specify it as a parameter of `Variable`'s constructor:
```python
var = framework.Variable(..., error_clip=myErrorClip, ...)
```
The default value of `error_clip` is `None`, which means no error clip is employed. When it's not `None`, it should take an object of `BaseErrorClipAttr`'s derived class. So far, `BaseErrorClipAttr` has only one derived class: `ErrorClipByValue`, whose constructor is:
```python
ErrorClipByValue(max, min=None)
```
`max` and `min` represent the maximal and minimal clip threshold respectively. In the backward pass, all values of `var`'s gradient greater than `max` or less than `min` will be clipped to `max` and `min` respectively. When `min` is None, the minimal threshold will be assigned with `-max` automatically (a small numerical illustration of the clipping follows the next snippet).
So we can enable the error clip with threshold `[-5.0, 5.0]` for variable `var` by:
```python
var = framework.Variable(..., error_clip=ErrorClipByValue(max=5.0), ...)
```
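To make the clipping effect concrete, here is a tiny numerical illustration using NumPy (purely for exposition; it is not the actual `clip_op` implementation):
```python
import numpy as np

grad = np.array([-7.5, -2.0, 0.3, 6.1])
clipped = np.clip(grad, -5.0, 5.0)  # what the appended clip_op does to the gradient values
print(clipped)                      # -> [-5.  -2.   0.3  5. ]
```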
## Implementation
The `BaseErrorClipAttr` and its derived class `ErrorClipByValue` are defined in *clip.py*.
```python
class BaseErrorClipAttr(object):
def append_clip_op(self, block, grad_name):
raise NotImplementedError()
class ErrorClipByValue(BaseErrorClipAttr):
def __init__(self, max, min=None):
max = float(max)
if min is None:
min = -max
else:
min = float(min)
self.max = max
self.min = min
def append_clip_op(self, block, grad_name):
clip_op_desc = block.desc.append_op()
clip_op_desc.set_type("clip")
clip_op_desc.set_input("X", [grad_name])
clip_op_desc.set_output("Out", [grad_name])
clip_op_desc.set_attr("min", self.min)
clip_op_desc.set_attr("max", self.max)
```
The `BaseErrorClipAttr` class has one main member function: `append_clip_op(self, block, grad_name)`.
This function is used to create a `clip_op` and append it to the end of the given `block`. Since different error clip algorithms require different `clip_op`s, the function raises `NotImplementedError` in the base class, and all derived classes must implement their own versions of it.
These `clip_op`s should be inserted after `grad_op`s whose output gradients need to be clipped. It is equivalent to appending some `clip_op`s to the end of the target block every time a new `grad_op` is added.
```python
for op_desc in grad_op_descs:
new_op_desc = target_block.desc.append_op()
new_op_desc.copy_from(op_desc)
callback(block=target_block, context=grad_to_var)
```
Here we employ a callback function to do this kind of job. In the `_append_backward_ops_` function, each time a `grad_op` is added to the `target_block`, a callback function is invoked. The logic of `clip_op` appending can be implemented inside the callback function.
The callback function for `clip_op` appending is defined in *clip.py*:
```python
def error_clip_callback(block, context):
# the context is a grad_to_var map
grad_to_var = context
op_desc = block.desc.op(block.desc.op_size() - 1)
for grad_n in filter(lambda n: grad_to_var.has_key(n),
op_desc.output_arg_names()):
fwd_var = block.__var_recursive(grad_to_var[grad_n])
error_clip = getattr(fwd_var, "error_clip", None)
if not (error_clip is None or isinstance(error_clip,
BaseErrorClipAttr)):
raise TypeError(
"Variable's error_clip should be an instance of BaseErrorClipAttr or None."
)
if error_clip is not None:
error_clip.append_clip_op(block, grad_n)
```
This function takes a `block` and a `context` (which is actually a grad\_to\_var map) as inputs. It checks each output of the last `OpDesc` in the `block`. Notice that the last `OpDesc` of the `block` must be a `grad_op` and its outputs must be some forward variables' gradients. If an output gradient's corresponding forward variable has an `error_clip` attribute, `error_clip_callback` will call the `error_clip`'s `append_clip_op` function to append the required `clip_op` to the `block`.
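As a usage sketch, the callback can be passed explicitly when the backward pass is built; optimizers normally do this internally. This assumes, as in recent Fluid releases, that `append_backward` accepts a `callbacks` list and that `ErrorClipByValue` and `error_clip_callback` are importable from `paddle.fluid.clip`:
```python
import paddle.fluid as fluid
from paddle.fluid.backward import append_backward
from paddle.fluid.clip import ErrorClipByValue, error_clip_callback

x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(cost)

# Mark the forward variable whose gradient should be clipped to [-5.0, 5.0].
y_predict.error_clip = ErrorClipByValue(max=5.0)

# error_clip_callback appends a clip op right after each grad_op whose
# forward variable carries an error_clip attribute.
param_grads = append_backward(avg_cost, callbacks=[error_clip_callback])
```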
## Introduction
There are many performance analysis tools for [different programming languages and different software frameworks](https://en.wikipedia.org/wiki/List_of_performance_analysis_tools). Most popular deep learning frameworks use several programming languages and run on heterogeneous platforms; like them, PaddlePaddle uses C++, CUDA and Python as its basic programming languages and runs on both CPU and GPU devices. The [`nvprof` tool](http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview) is usually used to analyse CUDA programs, and we have [a document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/optimization/cpu_profiling.md) describing how to profile the CPU and Python parts of a program with [yep](https://pypi.python.org/pypi/yep) and [Google's perftools](https://github.com/google/pprof). But in [PaddlePaddle fluid](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/fluid.md) the operator is the basic computing unit, and developers usually want to collect the time of each operator to locate bottlenecks. `nvprof` collects the timeline of CUDA-related activities on both CPU and GPU, including kernel execution, memory transfers, memory sets, CUDA API calls, and events or metrics for CUDA kernels, while `yep` and Google's perftools cannot collect the timeline of a CUDA program at all. None of these tools can collect time at the operator level, so we designed this profiling tool.
## Architecture
The workflow of most tasks is shown below. Each operator runs many times over all the iterations, so the profiler must collect the total time of each operator across the iterations. Moreover, developers sometimes want to collect more detailed time spans inside an operator, or record time spans elsewhere, which requires the profiler to support nested time spans. To speed up training, all deep learning frameworks support parallel computing, including multiple threads on CPU and multiple GPUs, so the profiler must be able to collect the timeline for each thread. In addition, the profiler itself occupies certain resources, so it must be easy for developers to enable or disable. Finally, the profiler should present a human-readable report.
```python
for i in xrange(M): # M is the iteration number
for op in operator_lists: # The `operator_lists` contains all the operators in the network.
op.run();
```
In summary, the profiler should have the following features:
- records time spans in loops.
- supports nested time spans.
- supports multiple threads/multiple GPUs.
- can be enabled and disabled by users.
But how do we record the time in a mixed C++ and CUDA program? There are many C++ APIs to get the current calendar time in the host program. For the GPU, however, CUDA kernels may execute concurrently if they are in different [streams](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams), and CUDA kernels are asynchronous with respect to the host program unless an explicit synchronization follows them. CUDA provides [events](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#events) to monitor the device and perform accurate timing. Inspired by PyTorch and CUDA events, we also design and apply events to record the timeline, then summarize and present statistics based on these events.
The overall flow is shown as the following figure.
<img src="https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/howto/performance/images/profiler.png" align="center"/><br/>
### Event
In the above workflow, a pair of events is needed before and after the piece of code whose time is collected, so an event carries a flag marking whether it is a starting event or an ending event. Besides these two kinds of events, sometimes a pure marker with a text message is needed, for example, to mark where profiling starts or ends. Hence there are three kinds of events:
```c++
enum EventKind {
kMark,
kPushRange,
kPopRange
};
```
- kMark: only a marker without time range.
- kPushRange: mark the starting event for time range.
- kPopRange: mark the ending event for time range.
For CPU code, an event only needs to record the current time. For CUDA code, the [event management functions of CUDA](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT) are used. For many pieces of code, event lists are used to record each piece.
```c++
class Event {
public:
// The DeviceContext is used to get current CUDA stream.
Event(EventKind kind, std::string name, uint32_t thread_id,
const platform::DeviceContext* dev_ctx = nullptr);
double CpuElapsedUs(const Event& e) const;
double CudaElapsedUs(const Event& e) const;
private:
EventKind kind_;
std::string name_;
uint32_t thread_id_;
int64_t cpu_ns_;
#ifdef PADDLE_WITH_CUDA
cudaEvent_t event_ = nullptr;
int device_ = -1;
#endif
};
struct EventList {
std::forward_list<std::vector<Event>> event_blocks;
};
```
As mentioned above, there is no need to record the timeline when the profiler is disabled, so there is a global state to enable or disable it.
```c++
enum ProfilerState {
kDisabled,
kCPU,
kCUDA
};
ProfilerState g_state;
```
- kDisabled: the disabled state.
- kCPU: CPU profiling state.
- kCUDA: GPU profiling state.
A pair of starting and ending events are pushed to event lists in constructor and destructor of `RecordEvent`. So the timeline is recorded for the code in the lifecycle of an object of `RecordEvent`.
```c++
struct RecordEvent {
explicit RecordEvent(const std::string name,
platform::DeviceContext* dev_ctx = nullptr) {
if (g_state == ProfilerState::kDisabled) return;
// push the starting event to the event lists.
}
~RecordEvent() {
if (g_state == ProfilerState::kDisabled) return;
// push the ending event to the event lists.
}
};
```
### Report sample
```
Event Calls Total Min. Max. Ave. Ratio.
thread101::deserial 1410 392.302 0.032768 14.1058 0.278228 0.00117247
thread100::GetRPC 11 2951.13 7.60675 1426.75 268.284 0.00882
thread100::serial 14 75.3212 0.07584 36.2135 5.38009 0.000225112
thread100::SendRPC 14 13.9494 0.003072 3.97517 0.996389 4.16905e-05
thread99::GetRPC 15 3012.62 2.79062 1426.61 200.841 0.00900378
...
thread0::matmul_grad 1480 3674.28 0.375808 181.608 2.48262 0.0109813
thread0::matmul 1480 3365.82 0.196608 172.256 2.2742 0.0100594
thread0::mul_grad 3840 3167.39 0.411648 3.33824 0.82484 0.00946633
thread0::fetch_barrier 5 3082.82 354.385 1617.88 616.564 0.00921359
thread0::dropout 2480 3014.05 0.201728 6.76454 1.21534 0.00900807
```
Note: the profiler merges the time of an operator that runs multiple times in the same thread.
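As a usage sketch, a report like the one above can be produced from Python, assuming the profiler is exposed as `paddle.fluid.profiler` with a `profiler(state, sorted_key)` context manager (as in recent Fluid releases); the network and feed data below are placeholders:
```python
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

# A trivial network, just to have some operators to time.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

data = np.random.random(size=(8, 13)).astype('float32')

# Only the code inside the `with` block is profiled; the report is printed
# when the block exits, sorted by total time per event.
with profiler.profiler('CPU', sorted_key='total'):
    for i in range(10):
        exe.run(fluid.default_main_program(),
                feed={'x': data},
                fetch_list=[y_predict])
```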
# Design Doc: Add MKLDNN Kernel in Fluid Operator
## Principles
First of all, we should follow some basic principles:
1. [How to write a new operator](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/new_op_en.md). We are trying to add a new kind of kernel into operators, so basically we should follow this doc.
2. [Supporting new Device/Library](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/support_new_device.md). Since MKLDNN is a new library to fluid, we should add `MKLDNNDeviceContext` and maybe `mkldnn_helper.h`, just like [cudnn_helper.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/cudnn_helper.h).
3. [Switch Kernel](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/switch_kernel.md). Another important point is that we should ensure the data synchronization between different kernel types, which is this [topic](https://github.com/PaddlePaddle/Paddle/issues/6549). So basically we should override `GetExpectedKernelType` and `trans` functions to support switching kernels.
4. [The Keys of Operator Kernel Type](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/operator_kernel_type.md). Kernel Type is a pivotal conception which can record the `Place`, `Library`, `DataType` and `Layout`.
## Solution
In general, there are four steps to run an MKL-DNN primitive:
- Create a primitive descriptor that describes this operator
- Create the primitive itself from the primitive descriptor and the engine
- Create all memory buffers that the primitive needs
- Launch a stream to execute the created primitive
More details can be found [here](http://01org.github.io/mkl-dnn).
It's better to avoid re-initializing the primitives and memory handles of the first three steps in every iteration. So we plan to create a map that records all the `primitive` and `memory` objects, which should not take too much memory, as discussed [here](https://github.com/PaddlePaddle/Paddle/issues/6822).
It's assumed that the following three conditions are satisfied.
1. There is a unique key for each operator instance, which may be the actual name of the `Output Tensor`.
2. The `Input Tensor` inside the `Compute` function is the one that has already been converted.
3. We can get the phase (e.g. `is_test`) inside the `Compute` function; otherwise we need to expose this attribute to the user.
### Compute
The algorithm of `Compute` is described below, taking conv as an example.
```c++
PADDLE_ENFORCE(platform::is_cpu_place(ctx.GetPlace()), "It must use CPUPlace.");
PADDLE_ENFORCE(platform::is_mkldnn_library(ctx.GetLibrary()), "It must use MKLDNN Library.");
auto& dev_ctx = ctx.template device_context<platform::MKLDNNDeviceContext>();
// find primitive by unique key from mkldnn context
// the op_key should be a unique name of this op instance
auto& p = dev_ctx.findPrimitive(op_key + "_fwd");
// assuming the input tensor inside this compute function is the one after converted
// this point should be guarantee by another mechanism
auto& i = dev_ctx.findMemory(op_key + "_input");
if (p == nullptr || i == nullptr || inputSizeChanged(p, i)) {
auto fwd_primitive_desc = createPrimitiveDesc(ctx);
auto* input = ctx.Input<Tensor>("Input");
auto* filter = ctx.Input<Tensor>("Filter");
auto* output = ctx.Output<Tensor>("Output");
shared_ptr<mkldnn::memory> in(new mkldnn::memory(fwd_primitive_desc->src_primitive_desc(), input->data<T>()));
shared_ptr<mkldnn::memory> wgt(new mkldnn::memory(fwd_primitive_desc->weights_primitive_desc(), filter->data<T>()));
shared_ptr<mkldnn::memory> out(new mkldnn::memory(fwd_primitive_desc->dst_primitive_desc(), output->mutable_data<T>(ctx.GetPlace())));
shared_ptr<mkldnn::conv_fwd> fwd_primitive(new mkldnn::conv_fwd(*fwd_primitive_desc, *in, *wgt, *out));
dev_ctx.addMemory(op_key+"_input", in);
dev_ctx.addMemory(op_key+"_output", out);
dev_ctx.addMemory(op_key+"_filer", wgt);
dev_ctx.addPrimitive(op_key+"_fwd", fwd_primitive);
dev_ctx.addPrimitiveDesc(op_key+"_fwd_PD", fwd_primitive_desc);
}
p = dev_ctx.findPrimitive(op_key + "_fwd");
PADDLE_ENFORCE(p, "Should have forward Primitive");
PADDLE_ENFORCE(dev_ctx.findMemory(op_key+"_input"), "Should have input memory");
PADDLE_ENFORCE(dev_ctx.findMemory(op_key+"_output"), "Should have output memory");
PADDLE_ENFORCE(dev_ctx.findMemory(op_key+"_filter"), "Should have filter memory");
PADDLE_ENFORCE(dev_ctx.findPrimitiveDesc(op_key+"_fwd_PD"), "Should have forward PrimitiveDesc");
dev_ctx.submit(p);
dev_ctx.execute(); // the convert primitive should have already contained.
```
The `createPrimitiveDesc` function returns the primitive descriptor of this operator and would look like this:
```c++
auto* input = ctx.Input<Tensor>("Input");
auto* filter = ctx.Input<Tensor>("Filter");
auto* output = ctx.Output<Tensor>("Output");
std::vector<int> strides = ctx.Attr<std::vector<int>>("strides");
std::vector<int> paddings = ctx.Attr<std::vector<int>>("paddings");
std::vector<int> dilations = ctx.Attr<std::vector<int>>("dilations");
int groups = ctx.Attr<int>("groups");
algorithm algo = static_cast<algorithm>(ctx.Attr<int>("convolution_algorithm_option"));
prop_kind pk = ctx.Attr<bool>("is_test") ? prop_kind::forward_inference : prop_kind::forward_training;
auto fwd_desc = mkldnn::conv_fwd::desc(/* all the setting above*/);
shared_ptr<mkldnn::conv_fwd::primitive_desc> fwd_primitive_desc(new mkldnn::conv_fwd::primitive_desc(fwd_desc, ctx.getEngine()));
return fwd_primitive_desc;
}
```
### MKLDNNDeviceContext
`MKLDNNDeviceContext`, which is very straightforward, should contain some basic information such as the `stream`, the `engine` and the maps needed.
### mkldnn_helper
Some functions would be put in `paddle/platform/mkldnn_helper.h`.
- create MKLDNN memories
- create MKLDNN primitives
- error check function
- etc
### Kernel Switch
We should `reorder` data whose layout comes from, or goes to, another device or kernel type. The `GetExpectedKernelType` and `trans` functions help us implement this.
`GetExpectedKernelType` receives the context, from which this operator can return the best `KernelType`.
`trans` would be like this:
```c++
void trans(inputs, ctx) override {
if (NoNeedTrans()) {
return;
}
// find reorder primitive by op_key from context
auto& dev_ctx = ctx.template device_context<platform::MKLDNNDeviceContext>();
auto& p = dev_ctx.findPrimitive(op_key + "_reorder_input");
auto& i = dev_ctx.findMemory(op_key + "_src_input");
if (p == nullptr || i == nullptr || changeSized(i, input)) {
auto prim = createPrimitiveDesc(ctx);
auto src = createMemory(memoryDesc(input->dims(), actual_layout), input->data);
auto newbuffer = paddle::memory::Alloc(ctx.GetPlace(), input->size_in_bytes());
auto dst = createMemory(p->expected_desc(), newbuffer->data);
auto reorder_primitive(new mkldnn::reorder(src, dst));
dev_ctx.addMemory(op_key+"_src_input", src);
dev_ctx.addMemory(op_key+"_input", dst);
dev_ctx.addPrimitive(op_key+"_reorder_input", reorder_primitive);
}
p = dev_ctx.findPrimitive(op_key + "_reorder_input");
PADDLE_ENFORCE(p, "Should have Reorder Primitive");
dev_ctx.submit(p);
if (! this->isMKLDNNKernel()) {
// execute immediately only if this is not mkldnn kernel function.
// otherwise, it can be executed with the operator primitive in Compute
dev_ctx.stream();
}
// after submit, the input tensor in ExecutionContext should be changed as the converted one
// there should be another mechanism to ensure this
}
```
### Unit Test
All the functions should have corresponding unit tests.
TBD
# Design Doc: NCCL support in Paddle Fluid
## Abstract
This design doc covers the NCCL feature in Paddle. We propose an approach to support the NCCL library on both a single machine and multiple machines. We wrap the NCCL primitives `Broadcast`, `Allreduce` and `Reduce` as operators to utilize multi-GPU power in one script.
## Motivation
[NCCL](https://developer.nvidia.com/nccl) is an NVIDIA library that supports multi-GPU communication and is optimized for NVIDIA GPUs. It provides routines such as all-gather, all-reduce, broadcast, reduce and reduce-scatter that achieve high bandwidth over PCIe and NVLink high-speed interconnects. With the NCCL library, we can easily accelerate training in parallel.
- Pros
  1. Easy to plug in with the [NCCL2](https://developer.nvidia.com/nccl) library.
  1. High performance on NVIDIA GPUs.
  1. MPI-like primitives with a low learning cost for users.
- Cons
  1. Designed only for NVIDIA GPUs, not a general multi-device solution.
  1. Although NCCL1 is open-sourced under the BSD license, NCCL2 is no longer open source.
At the beginning of training, the framework needs to distribute the same parameters to every GPU, and to merge the gradients whenever the user requires.
As a result, during training we need peer-to-peer copies between different GPUs, aggregation of gradients/parameters from the GPUs, and broadcasting of parameters to the GPUs. Every GPU only needs to run its operators with the correct place information.
Besides, interfaces are needed to synchronize the model update across the different GPU cards.
## Implementation
As mentioned above, we wrap the NCCL routines as several kinds of operators. Note that NCCL needs to create a communicator among the GPUs at the beginning, so an NCCLInit operator is created.
### Transpiler
To be compatible with [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the transpiler compiles the user defined operation graph into sub-graphs to be executed on different devices.
1. The user-defined model will be a single-device program
2. Broadcast/Reduce operators between GPUs will be inserted into the program; for the multi-node case, `Send` and `Recv` operators may also be inserted.
*Broadcast, AllReduce in a single machine. And Broadcast, AllReduce, [Send, Recv](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md#graph-converter) in multiple machines*
<img src="images/multigpu_before_convert.png" width="300"/>
After compiling, the graph looks as follows:
<img src="images/multigpu_allreduce.png" width="1000"/>
Operators are added to the sub-graphs. Every GPU is assigned a role such as `rank0`, `rank1`, etc.
- **Broadcast**. The Broadcast operator distributes an initialized parameter from the GPU that owns it (e.g. the `rank0` GPU) to all the GPUs.
- **AllReduce**. The AllReduce operator synchronizes parameters/gradients between GPUs. AllReduce is implemented with the ring-based communication method, avoiding the bottleneck of a single GPU.
Note that the AllReduce operator forces the GPUs to synchronize at that point. Whether the whole training process runs in asynchronous or synchronous mode depends on where the AllReduce point sits in the graph.
As shown in the picture, after each GPU computes the gradient of `W`, a following `AllReduce` operator accumulates `dW` over the full batch of data; each GPU then runs the optimization process individually and applies the gradient to its own `W`.
- **AllReduce**
Note that our AllReduce operator is a ring-based AllReduce implementation. If we used the NCCL2 AllReduce primitive directly, every GPU would optimize over the full batch of data, wasting (n-1) GPUs' compute resources. In addition, NCCL2's built-in AllReduce only uses the communication resources during synchronization, and updating the gradients becomes a subsequent phase. In fact, we can amortize the gradient-update cost into the communication phase. The process is:
1. Every parameter has its root card. That card is responsible for aggregating the gradients from the GPUs.
2. The model's parameters are hashed to different root cards to ensure load balance between the GPUs.
3. Each card sends the parameter to its logical neighbor. After one round, the parameter's root card has aggregated the full gradients.
4. The root card then optimizes the parameter.
5. The root card sends its optimized result to its neighbor, and each neighbor passes the parameter on to the next card.
6. The synchronization round finishes.
The total time cost is 2 * (n-1) * per-parameter send time, so we reach the goal of amortizing the update time into the communication phase.
beginners_guide/index_en.rst
user_guides/index_en.rst
advanced_usage/index_en.rst
api/index_en.rst
# Overview
The documentation helps you learn and use PaddlePaddle. This part mainly presents two sections: **Tutorials** and **API**.
## Tutorials
If you want to learn about deep learning and how to use Fluid, you can find the relevant content in the tutorials:
- [Beginner's Guide](beginners_guide/index.html): installation instructions and several simple model examples to get you started quickly
- [User Guides](user_guides/index.html): instructions for using Fluid and the open-sourced [model zoo](user_guides/models/index.html) to help you apply Fluid better
- [Advanced Usage](advanced_usage/index.html): advanced topics such as mobile deployment, model tuning, and writing operators, so Fluid can better fit your needs
## API
If you are an experienced PaddlePaddle user looking for the APIs related to your project, you can read directly:
- [API Guide](api/api_guides/index.html): introduces the functionality of Fluid's main APIs and the interfaces described in the documentation
- [API](api/index.html): the design ideas and usage notes of Fluid's existing APIs
# PaddlePaddle Fluid Source Code Overview
Examples: https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle/fluid/tests/book
Core: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/framework
Operator: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators
Memory: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/memory
Platform: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/platform
# Compile Time
The following **defines** the NN. The definition goes into this [protocol buffer](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/framework.proto).
```python
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(x=cost)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)
```
- Variables: `x`, `y`, `y_predict`, `cost` and `avg_cost`. [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/framework.py#)
- Layers: `fluid.layers.data`, `fluid.layers.fc` and `fluid.layers.mean` are layers. [Python](https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle/fluid/layers)
- Every Layer has one or more operators and variables/parameters
- All the operators are defined at [`paddle/fluid/operators/`](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators). Other worth-looking files:
- Base class: [`paddle/fluid/framework/operator.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/operator.h)
- Operator Registration: [`paddle/fluid/framework/op_registry.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/op_registry.h)
- Operator Lookup: [`paddle/fluid/framework/op_info.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/op_info.h)
- Optimizer: `fluid.optimizer.SGD`. It does the following
- Add backward operators. [[Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/backward.py)]
- Add optimizer operators. [[Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/optimizer.py)]
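To inspect the `ProgramDesc` built by the definition above, the program can be printed directly. This is a quick sketch assuming `Program`'s readable string form (`to_string`), as provided by Fluid's Python front end:
```python
# Print a human-readable dump of the ProgramDesc built by the layer and
# optimizer calls above: its blocks, variables and operators (including the
# backward and SGD operators appended by sgd_optimizer.minimize).
print(fluid.default_main_program())
```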
# Run Time
The following **evaluates** the NN. It instantiates all the variables and operators.
```python
place = fluid.CPUPlace()
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe = fluid.Executor(place)
# Allocate memory. Initialize Parameter.
exe.run(fluid.default_startup_program())
# Allocate memory. Do computation.
exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[avg_cost])
```
- Place: `place`. One of CPU, GPU or FPGA. [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/place.h)
- The device handles are at [paddle/fluid/platform/device_context.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/device_context.h)
- Executor: `fluid.Executor(place)`. [[Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/executor.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/executor.cc)]
- Feeds the data: `feed=feeder.feed(data)`
- Evaluates all the operators
- Fetches the result: `fetch_list=[avg_cost]`
- Other worth looking files:
- Scope: [paddle/fluid/framework/scope.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/scope.h). Where all the variables live
- Variable: [paddle/fluid/framework/variable.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/variable.h). Where all the data (most likely tensors) live
- Tensor: [paddle/fluid/framework/tensor.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/tensor.h). Where we allocate memory through [`paddle/fluid/memory/`](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/memory)