all of advanced user guides (#625)

1. Add an index for all English versions of the "进阶使用" (Advanced Usage) guides.
2. Replace legacy softlinks to howto/optimization and dev/new_op_en with the original files.

a4b46eba · Hao Wang · Cheerego · 36217c58
21 changed files
--- a/doc/fluid/advanced_usage/deploy/index_en.rst
+++ b/doc/fluid/advanced_usage/deploy/index_en.rst
+#######################
+Deploy Inference Model
+#######################
+- `Server-side Deployment <inference/index_en.html>`_ : How to use the Fluid C++ API to deploy and release trained models.
+- `Paddle Mobile <mobile/index_en.html>`_ : Paddle-Mobile, the embedded deep learning framework in the PaddlePaddle organization.
+..  toctree::
+    :hidden:
+    inference/index_en.rst
+    mobile/index_en.rst
--- a/doc/fluid/advanced_usage/deploy/inference/index_en.rst
+++ b/doc/fluid/advanced_usage/deploy/inference/index_en.rst
+######################
+Server-side Deployment
+######################
+PaddlePaddle Fluid provides a C++ API to support deployment and release of trained models.
+.. toctree::
+   :titlesonly:
+   build_and_install_lib_en.rst
+   native_infer_en.md
+   paddle_tensorrt_infer_en.md
+   paddle_gpu_benchmark_en.md
+   windows_cpp_inference_en.md
--- a/doc/fluid/advanced_usage/deploy/mobile/index_en.rst
+++ b/doc/fluid/advanced_usage/deploy/mobile/index_en.rst
+#################
+Mobile Deployment
+#################
+This section introduces Paddle-Mobile, a deep learning framework in the PaddlePaddle organization:
+* `Brief Introduction to the Project <mobile_readme_en.html>`_ : A brief introduction to the effects, features, and user guides of Paddle-Mobile.
+* `Build Environment <mobile_build_en.html>`_ : How to build the environment for Mobile with or without Docker.
+.. toctree::
+   :hidden:
+   mobile_readme_en.md
+   mobile_build_en.md
--- a/doc/fluid/advanced_usage/development/contribute_to_paddle/index_en.rst
+++ b/doc/fluid/advanced_usage/development/contribute_to_paddle/index_en.rst
+#################################
+How to Contribute Code to Paddle
+#################################
+..  toctree::
+    :maxdepth: 1
+    local_dev_guide_en.md
+    submit_pr_guide_en.md
--- a/doc/fluid/advanced_usage/development/new_op/index_en.rst
+++ b/doc/fluid/advanced_usage/development/new_op/index_en.rst
+###################
+Write New Operators
+###################
+- `How to write new operator <../../../advanced_usage/development/new_op_en.html>`_ : A guide to writing new operators
+- `op notes <../../../advanced_usage/development/op_notes_en.html>`_ : Notes on developing new operators
+.. toctree::
+   :hidden:
+   new_op_en.md
+   op_notes_en.md
--- a/doc/fluid/advanced_usage/development/profiling/cpu_profiling_cn.md
+++ b/doc/fluid/advanced_usage/development/profiling/cpu_profiling_cn.md
-../../../howto/optimization/cpu_profiling_cn.md
\ No newline at end of file
--- a/doc/fluid/advanced_usage/development/profiling/cpu_profiling_cn.md
+++ b/doc/fluid/advanced_usage/development/profiling/cpu_profiling_cn.md
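+# CPU Performance Tuning
+This tutorial introduces how to use Python's cProfile package, the Python library yep, and Google perftools for profiling and performance tuning.
+Profiling means finding performance bottlenecks. The real bottlenecks in a system can differ greatly from what programmers imagine during development. Tuning means removing those bottlenecks. Performance optimization is usually a repeated cycle of profiling and tuning.
+PaddlePaddle users generally write deep learning programs by calling the Python API. Most Python API calls go into libpaddle.so, which is written in C++. Profiling and tuning PaddlePaddle therefore has two parts:
+* Profiling Python code
+* Profiling mixed Python and C++ code
+## Profiling Python code
+### Generating the profile file
+The Python standard library provides a profiling package, [cProfile](https://docs.python.org/2/library/profile.html). The command to profile a Python program is: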
+```bash
+python -m cProfile -o profile.out main.py
+```
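+Here `main.py` is the program to profile, and `-o` names the output file that stores the profiling result. If no file is given, `cProfile` prints to standard output.
+### Viewing the profile file
+`cProfile` writes `profile.out` after `main.py` finishes. We can view the result with [`cprofilev`](https://github.com/ymichael/cprofilev), a third-party Python library that starts an HTTP service and presents the profiling result as a web page: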
+```bash
+cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
+```
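+`-a` is the IP address that the HTTP service binds to; `0.0.0.0` allows access from external networks. `-p` is the port of the HTTP service. `-f` is the profiling result file. `main.py` is the source file being profiled.
+Open the corresponding address in a web browser to see the profiling result: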
+```
+   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
+        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
+     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
+     4696   12.040    0.003   12.040    0.003 {built-in method run}
+        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
+```
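+The meaning of each column is:
+<table>
+<thead>
+<tr>
+<th>Column</th>
+<th>Meaning</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td> ncalls</td>
+<td> Number of calls to the function</td>
+</tr>
+<tr>
+<td>tottime</td>
+<td> Total time spent in the function itself, excluding time spent in the functions it calls</td>
+</tr>
+<tr>
+<td> percall </td>
+<td> Average tottime per call</td>
+</tr>
+<tr>
+<td> cumtime</td>
+<td> Total time of the function, including time spent in the functions it calls</td>
+</tr>
+<tr>
+<td> percall</td>
+<td> Average cumtime per call</td>
+</tr>
+<tr>
+<td> filename:lineno(function) </td>
+<td> File name, line number, function name </td>
+</tr>
+</tbody>
+</table>
+### Finding performance bottlenecks
+`tottime` and `cumtime` are usually the key metrics for finding bottlenecks, because they represent the actual running time of a function.
+Sorting the profiling result by tottime gives: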
+```text
+     4696   12.040    0.003   12.040    0.003 {built-in method run}
+   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
+   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
+     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
+        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
+```
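+We can see that the most time-consuming function is the C++ `run` function. Tuning it requires the profiling of mixed `Python` and `C++` code described in the second section. The `sync_with_cpp` function also has a large total time and a large time per call, so we can click on the details of `sync_with_cpp` to inspect its call relationships.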
+```text
+Called By:
+   Ordered by: internal time
+   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
+Function                                                                                                 was called by...
+                                                                                                             ncalls  tottime  cumtime
+/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)  <-    4697    0.626    2.291  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
+/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)  <-    4696    0.019    2.316  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
+                                                                                                                  1    0.000    0.001  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)
+Called:
+   Ordered by: internal time
+   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
+```
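+Observing the call relationships between hot functions, together with the corresponding lines of code, is usually enough to locate the problematic code. After making a performance fix, profile again to check whether the change actually improves the program's performance.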
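The sorted view described above can also be produced without a web viewer. The following is a minimal sketch, assuming Python 3 and only the standard library `cProfile` and `pstats` modules; `busy_work` and `profile_sorted_by_tottime` are hypothetical names standing in for the real workload in `main.py`, not part of PaddlePaddle.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Hypothetical stand-in for the real workload in main.py."""
    return sum(i * i for i in range(n))


def profile_sorted_by_tottime(func, *args, top=5):
    """Profile func(*args) and return a text report sorted by tottime."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" sorts by time spent inside each function itself;
    # "cumtime" would sort by time that also includes callees.
    stats.sort_stats("tottime").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_sorted_by_tottime(busy_work, 100000))
```

A file written by `python -m cProfile -o profile.out main.py` can be loaded the same way by passing the file name to `pstats.Stats("profile.out")` instead of a profiler object.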
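+## Profiling mixed Python and C++ code
+### Generating the profile file
+There are many C++ profiling tools; common ones include `gprof`, `valgrind`, and `google-perftools`. However, profiling a dynamic library used from Python is much more complex than profiling a plain binary. Fortunately, the third-party Python library `yep` provides a convenient way to interact with `google-perftools`, so here we use `yep` to profile mixed Python and C++ code.
+Before using `yep`, install the `google-perftools` and `yep` packages. On Ubuntu the commands are: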
+```bash
+apt update
+apt install libgoogle-perftools-dev
+pip install yep
+```
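+After installation, we can run
+```bash
+python -m yep -v main.py
+```
+to generate the profile file. The generated file is `main.py.prof`.
+The `-v` flag prints the analysis result on the command line after the profile file is generated, which allows a quick look at the result. Unlike Python, C++ code may be compiled without debug information, and multithreading at run time may produce confusing, unreadable profiles. To get more readable results, take the following measures:
+1. Pass `-g` at compile time to generate debug information. With cmake, set CMAKE_BUILD_TYPE to `RelWithDebInfo`.
+2. Always enable optimization at compile time. The performance of a plain `Debug` build differs greatly from `-O2` or `-O3`, so profiling in `Debug` mode is meaningless.
+3. Start profiling with a single thread, then move on to multiple threads and then to multiple machines, since single-threaded debugging is easier. Setting the environment variable `OMP_NUM_THREADS=1` turns off OpenMP optimization.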
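+### Viewing the profile file
+Profiling produces a result file, which we can display with [`pprof`](https://github.com/google/pprof). Note that this is the `pprof` rewritten in `Go`, which has a web interface and better visualization.
+`pprof` is installed like any other `Go` program: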
+```bash
+go get github.com/google/pprof
+```
+We can then start an HTTP service with the following command:
+```bash
+pprof -http=0.0.0.0:3213 `which python`  ./main.py.prof
+```
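+In this command, `-http` starts the HTTP service. `which python` yields the full path of the current Python binary, which specifies the Python executable. `./main.py.prof` is the profiling result to load.
+Open the corresponding address to view the profiling result, shown in the following figure: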
+![result](./pprof_1.png)
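+### Finding performance bottlenecks
+As with pure Python code, finding bottlenecks in mixed Python and C++ code comes down to `tottime` and `cumtime`. The call graph displayed by `pprof` also helps to reveal performance problems.
+For example, in the following figure,
+![kernel_perf](./pprof_2.png)
+multiplication and its gradient take about 2% to 4% of the computing time in one training run, while `MomentumOp` takes about 17%. Clearly, `MomentumOp` has a performance problem.
+`pprof` marks the performance-critical path in red. Checking the critical path first and the other parts afterwards makes the optimization more orderly.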
--- a/doc/fluid/howto/optimization/cpu_profiling_en.md
+++ b/doc/fluid/howto/optimization/cpu_profiling_en.md
--- a/doc/fluid/advanced_usage/development/profiling/host_memory_profiling_cn.md
+++ b/doc/fluid/advanced_usage/development/profiling/host_memory_profiling_cn.md
-../../../howto/optimization/host_memory_profiling_cn.md
\ No newline at end of file
--- a/doc/fluid/advanced_usage/development/profiling/host_memory_profiling_cn.md
+++ b/doc/fluid/advanced_usage/development/profiling/host_memory_profiling_cn.md
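+# Heap Memory Profiling and Optimization
+Any computer program may leak memory. A **memory leak** usually occurs when a program allocates memory on the heap and never frees it. As the program runs, it occupies more and more memory, which affects its stability, may make it run slower and slower or cause an OOM (out of memory) error, and can even destabilize the machine it runs on and cause downtime.
+There are many memory leak analysis tools. The classic ones include [valgrind](http://valgrind.org/docs/manual/quick-start.html#quick-start.intro) and [gperftools](https://gperftools.github.io/gperftools/).
+Because Fluid runs a C++ core driven by Python, analyzing it directly with valgrind is very difficult: it requires building a dedicated debug version of Python with valgrind support, and most of the output consists of Python's own symbols and call information. valgrind also slows the program down dramatically, so it is not recommended.
+This tutorial mainly introduces the use of [gperftools](https://gperftools.github.io/gperftools/).
+gperftools mainly supports the following four functions:
+- thread-caching malloc
+- heap-checking using tcmalloc
+- heap-profiling using tcmalloc
+- CPU profiler
+Paddle also provides a [CPU profiling tutorial](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/howto/optimization/cpu_profiling_cn.md) based on gperftools.
+For heap analysis, we mainly use thread-caching malloc and heap-profiling using tcmalloc.
+## Environment
+This tutorial is based on the Docker development environment paddlepaddle/paddle:latest-dev provided by Paddle, which runs Ubuntu 16.04.4 LTS.
+## Workflow
+- Install google-perftools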
+```
+apt-get install libunwind-dev 
+apt-get install google-perftools
+```
+- Install pprof
+```
+go get -u github.com/google/pprof
+```
+- Set up the runtime environment
+```
+export PPROF_PATH=/root/gopath/bin/pprof
+export PPROF_BINARY_PATH=/root/gopath/bin/pprof
+export LD_PRELOAD=/usr/lib/libtcmalloc.so.4
+```
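+- Run the Python program with the heap profiler. In essence, this periodically takes a snapshot of heap allocations.
+```
+# HEAPPROFILE sets the directory and file prefix of the generated heap profile files
+# HEAP_PROFILE_ALLOCATION_INTERVAL sets how many bytes must be allocated between dumps; the default is 1 GB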
+env HEAPPROFILE="./perf_log/test.log" HEAP_PROFILE_ALLOCATION_INTERVAL=209715200 python trainer.py
+```
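The interval value in the command above is given in bytes: 209715200 is 200 × 1024 × 1024, that is, one dump per 200 MiB allocated. As a rough sketch, the same environment could be assembled in Python before launching the trainer; `heap_profile_env` is a hypothetical helper name, and only the two variable names come from the command above.

```python
import os

MIB = 1024 * 1024


def heap_profile_env(prefix, interval_mib, base_env=None):
    """Return a copy of the environment with the heap-profiler variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Directory and file prefix of the generated heap profile files.
    env["HEAPPROFILE"] = prefix
    # Number of bytes allocated between two dumps.
    env["HEAP_PROFILE_ALLOCATION_INTERVAL"] = str(interval_mib * MIB)
    return env


env = heap_profile_env("./perf_log/test.log", 200, base_env={})
print(env["HEAP_PROFILE_ALLOCATION_INTERVAL"])  # 209715200
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run(["python", "trainer.py"], env=...)`; in that case leave `base_env` unset so that `LD_PRELOAD` and the other exported variables are inherited.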
+As the program runs, many files are generated in the perf_log directory, for example:
+```
+-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0001.heap
+-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0002.heap
+-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0003.heap
+-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0004.heap
+-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0005.heap
+-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0006.heap
+```
+- Analyze the heap files with pprof. There are two modes:
+	- Full mode. It analyzes the current heap and shows the call paths that currently allocate memory.
+	```
+	pprof --pdf python test.log.0012.heap
+	```
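+	The command above generates a file named profile00x.pdf that can be opened directly, for example [memory_cpu_allocator](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_cpu_allocator.pdf). The figure below shows that, while the CPU version of Fluid runs, the module that allocates the most memory is CPUAllocator. Other modules allocate relatively little and are therefore omitted, which makes this view inconvenient for finding memory leaks: a leak grows slowly and cannot be seen in such a figure.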
+	![result](https://user-images.githubusercontent.com/3048612/40964027-a54033e4-68dc-11e8-836a-144910c4bb8c.png)
+	- Diff mode. It diffs the heap at two points in time, drops the modules whose memory allocation has not changed, and shows only the increments.
+	```
+	pprof --pdf --base test.log.0010.heap python test.log.1045.heap
+	```
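+	The result is [`memory_leak_protobuf`](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_leak_protobuf.pdf)
+	The figure shows that the ProgramDesc structure grew by more than 200 MB between the two snapshots, so a memory leak is very likely here. The final result confirmed that the leak was indeed caused here.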
+	![result](https://user-images.githubusercontent.com/3048612/40964057-b434d5e4-68dc-11e8-894b-8ab62bcf26c2.png)
+	![result](https://user-images.githubusercontent.com/3048612/40964063-b7dbee44-68dc-11e8-9719-da279f86477f.png)
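Diff mode needs two dumps chosen from the many files in `perf_log`. The sketch below shows one way to pick them and assemble the command. Comparing the earliest dump with the latest one is an assumption made for illustration, and `heap_dump_index` and `pprof_diff_command` are hypothetical helper names; only the `pprof --pdf --base` form comes from the command above.

```python
import re


def heap_dump_index(name):
    """Extract the sequence number from a name like test.log.0010.heap."""
    match = re.search(r"\.(\d+)\.heap$", name)
    if match is None:
        raise ValueError("not a heap dump file: %s" % name)
    return int(match.group(1))


def pprof_diff_command(heap_files, binary="python"):
    """Build the diff-mode pprof command from the earliest and latest dumps."""
    ordered = sorted(heap_files, key=heap_dump_index)
    if len(ordered) < 2:
        raise ValueError("need at least two heap dumps to diff")
    base, latest = ordered[0], ordered[-1]
    return ["pprof", "--pdf", "--base", base, binary, latest]


files = ["test.log.1045.heap", "test.log.0010.heap", "test.log.0500.heap"]
print(" ".join(pprof_diff_command(files)))
```

In practice it can be more informative to take the base dump after start-up has finished, so that one-time allocations do not show up as growth.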
--- a/doc/fluid/advanced_usage/development/profiling/index_cn.rst
+++ b/doc/fluid/advanced_usage/development/profiling/index_cn.rst
@@ -5,9 +5,8 @@
 ..  toctree::
 	:hidden:
-	benchmark.rst
 	cpu_profiling_cn.md
-	gpu_profiling_cn.rst
 	host_memory_profiling_cn.md
 	timeline_cn.md

--- a/doc/fluid/advanced_usage/development/profiling/index_en.rst
+++ b/doc/fluid/advanced_usage/development/profiling/index_en.rst
+#######################################
+Performance Profiling and Optimization
+#######################################
+..  toctree::
+	:hidden:
+	cpu_profiling_en.md
+	host_memory_profiling_en.md
+	timeline_en.md
+This section illustrates how to optimize the performance of Fluid:
+- `CPU profiling <cpu_profiling_en.html>`_ : How to use cProfile, yep, and Google perftools to profile and optimize model performance
+- `Heap Memory Profiling and Optimization <host_memory_profiling_en.html>`_ : How to use gperftools to profile heap memory and track down memory leaks.
+- `How to use timeline tool to do profiling <timeline_en.html>`_ : How to use the timeline tool for profiling and optimization
--- a/doc/fluid/advanced_usage/development/profiling/timeline_cn.md
+++ b/doc/fluid/advanced_usage/development/profiling/timeline_cn.md
-../../../howto/optimization/timeline_cn.md
\ No newline at end of file
--- a/doc/fluid/advanced_usage/development/profiling/timeline_cn.md
+++ b/doc/fluid/advanced_usage/development/profiling/timeline_cn.md
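+# Introduction to the timeline tool
+## <span id="local">Local usage</span>
+1. Add `profiler.start_profiler(...)` and `profiler.stop_profiler(...)` to the main training loop. After the run, the code generates a profile record file at `/tmp/profile`.
+	**Tip:**
+	Do not run too many iterations while recording timeline information, because the number of records in the timeline is proportional to the number of iterations.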
+	```python
+    for pass_id in range(pass_num):
+        for batch_id, data in enumerate(train_reader()):
+            if pass_id == 0 and batch_id == 5:
+                profiler.start_profiler("All")
+            elif pass_id == 0 and batch_id == 10:
+                profiler.stop_profiler("total", "/tmp/profile")
+            exe.run(fluid.default_main_program(),
+                    feed=feeder.feed(data),
+                    fetch_list=[])
+	            ...
+	```
+1. Run `python paddle/tools/timeline.py` to process `/tmp/profile`. By default it generates a `/tmp/timeline` file; you can change this path with command-line arguments, see [timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py).
+```bash
+python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline
+```
+1. Open the Chrome browser, visit <chrome://tracing/>, and use the `load` button to load the generated `timeline` file.
+	![chrome tracing](./tracing.jpeg)
+1. The result is shown in the figure below; you can zoom in to see the details of the timeline.
+	![chrome timeline](./timeline.jpeg)
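+## Distributed usage
+In general, a distributed training job consists of two kinds of programs: pserver and trainer. We provide a way to display the profile logs of both pserver and trainer with timeline.
+1. Enable the trainer profiler in the same way as step 1 of [Local usage](#local).
+2. Enable the pserver profiler by adding two environment variables, for example: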
+```
+FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
+```
+3. Merge the profile files of the pservers and trainers into one timeline file, for example:
+```
+python /paddle/tools/timeline.py \
+    --profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1 \
+    --timeline_path ./dist.timeline
+```
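The `--profile_path` value in the command above is a comma-separated list of `name=path` pairs, which is easy to mistype by hand once there are several trainers and pservers. A small sketch of assembling it in Python follows; `build_profile_path` is a hypothetical helper, not part of `timeline.py`, and the names and paths are simply the ones from the example.

```python
from collections import OrderedDict


def build_profile_path(profiles):
    """Join name=path pairs into the comma-separated --profile_path value."""
    for name, path in profiles.items():
        # "," and "=" act as separators, so they cannot appear in a name.
        if "," in name or "=" in name or "," in path:
            raise ValueError("unsupported character in %r=%r" % (name, path))
    return ",".join("%s=%s" % (name, path) for name, path in profiles.items())


profiles = OrderedDict([
    ("trainer0", "local_profile_10_pass0_0"),
    ("trainer1", "local_profile_10_pass0_1"),
    ("pserver0", "./pserver_0"),
    ("pserver1", "./pserver_1"),
])
command = [
    "python", "/paddle/tools/timeline.py",
    "--profile_path", build_profile_path(profiles),
    "--timeline_path", "./dist.timeline",
]
print(" ".join(command))
```

Keeping the command as a list means it can be handed to `subprocess.run(command)` without any shell quoting.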
+4. Load the dist.timeline file in Chrome in the same way as step 4 of [Local usage](#local).
--- a/doc/fluid/howto/optimization/timeline_en.md
+++ b/doc/fluid/howto/optimization/timeline_en.md
@@ -2,7 +2,7 @@
 ## <span id="local">Local</span>
-1. Add `profiler.start_profiler(...)`和`profiler.stop_profiler(...)` to the main training loop. After run, the code will generate a profile record file `/tmp/profile`. **Warning**: Please do not run too many batches when use profiler to record timeline information, for the profile record will grow with the batch number.
+1. Add `profiler.start_profiler(...)` and `profiler.stop_profiler(...)` to the main training loop. After the run, the code generates a profile record file `/tmp/profile`. **Warning**: Do not run too many batches while the profiler records timeline information, because the profile record grows with the number of batches.
 	```python
    for pass_id in range(pass_num):
@@ -17,37 +17,38 @@
 	            ...
 	```
-1. Run `python paddle/tools/timeline.py` to process `/tmp/profile`, it will generate another
+2. Run `python paddle/tools/timeline.py` to process `/tmp/profile`, it will generate another
 file `/tmp/timeline` by default. You can change the path by cmd parameter, please take a look at
 [timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py) for details.
 ```python
 python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline
 ```
-1. Open chrome and visit <chrome://tracing/>, use `load` button to load the generated `timeline` file.
+3. Open chrome and visit <chrome://tracing/>, use `load` button to load the generated `timeline` file.
 	![chrome tracing](./tracing.jpeg)
-1. The resulting timeline should be like:
-	![chrome timeline](./timeline.jpeg)
+4. The resulting timeline should look like this:<a name="local_step_4"></a>
+    ![chrome timeline](./timeline.jpeg)
 ## Distributed
 This tool can support distributed train programs(pserver and trainer) too.
 1. Open traniner profiler just like how to use in [local](#local).
-1. Open pserver profiler: add some enviroment variables, eg:
+2. Open pserver profiler: add two environment variables, e.g.:
 ```
 FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
 ```
-1. Merge pservers' and trainers' profiler file, eg:
+3. Merge the pservers' and trainers' profiler files, e.g.:
 ```
 python /paddle/tools/timeline.py
    --profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1
    --timeline_path ./dist.timeline
 ```
-1. Load `dist.timeline` in chrome://tracing
+4. Load `dist.timeline` in Chrome, just like the [fourth step in Local](#local_step_4).
--- a/doc/fluid/advanced_usage/index_en.rst
+++ b/doc/fluid/advanced_usage/index_en.rst
+####################
+Advanced User Guides
+####################
+..  todo::
+So far you have become familiar with Fluid. The next step may be to build a more efficient model or to write your own Operator. If so, read more on:
+    - `Fluid Design Principles <../advanced_usage/design_idea/fluid_design_idea_en.html>`_ : Design principles underlying Fluid to help you understand how the framework runs.
+    - `Deploy Inference Model <../advanced_usage/deploy/index_en.html>`_ : How to deploy the trained network to perform practical inference
+    - `Write new operators <../advanced_usage/development/new_op/index_en.html>`_ : How to write new operators and notes on creating them
+    - `Performance Profiling <../advanced_usage/development/profiling/index_en.html>`_ : How to do profiling for Fluid programs
+We welcome your contributions of code and documentation to our community. Read the following articles to learn how:
+    - `How to contribute code <../advanced_usage/development/contribute_to_paddle/index_en.html>`_ : Tutorials on contributing code to the PaddlePaddle open source community.
+    - `How to contribute documentation <../advanced_usage/development/write_docs_en.html>`_ : Tutorials on contributing documentation to the PaddlePaddle open source community.
+..  toctree::
+    :hidden:
+    design_idea/fluid_design_idea_en.md
+    deploy/index_en.rst
+    development/new_op/index_en.rst
+    development/profiling/index_en.rst
+    development/contribute_to_paddle/index_en.rst
+    development/write_docs_en.md
--- a/doc/fluid/dev/index_en.rst
+++ b/doc/fluid/dev/index_en.rst
@@ -7,7 +7,6 @@ Development
  contribute_to_paddle_en.md
  write_docs_en.md
  api_doc_std_en.md
-  new_op_en.md
  new_op_kernel.md
  use_eigen_en.md
  name_convention.md

--- a/doc/fluid/howto/optimization/cpu_profiling_cn.md
+++ b/doc/fluid/howto/optimization/cpu_profiling_cn.md
-# CPU性能调优
-此教程会介绍如何使用Python的cProfile包、Python库yep、Google perftools来进行性能分析 (profiling) 与调优（performance tuning）。
-Profling 指发现性能瓶颈。系统中的瓶颈可能和程序员开发过程中想象的瓶颈相去甚远。Tuning 指消除瓶颈。性能优化的过程通常是不断重复地 profiling 和 tuning。
-PaddlePaddle 用户一般通过调用 Python API 编写深度学习程序。大部分 Python API 调用用 C++ 写的 libpaddle.so。所以 PaddlePaddle 的性能分析与调优分为两个部分:
-* Python 代码的性能分析
-* Python 与 C++ 混合代码的性能分析
-## Python代码的性能分析
-### 生成性能分析文件
-Python标准库中提供了性能分析的工具包，[cProfile](https://docs.python.org/2/library/profile.html)。生成Python性能分析的命令如下:
-```bash
-python -m cProfile -o profile.out main.py
-```
-其中 `main.py` 是我们要分析的程序，`-o`标识了一个输出的文件名，用来存储本次性能分析的结果。如果不指定这个文件，`cProfile`会打印到标准输出。
-### 查看性能分析文件
-`cProfile` 在main.py 运行完毕后输出`profile.out`。我们可以使用[`cprofilev`](https://github.com/ymichael/cprofilev)来查看性能分析结果。`cprofilev`是一个Python的第三方库。使用它会开启一个HTTP服务，将性能分析结果以网页的形式展示出来：
-```bash
-cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
-```
-其中`-a`标识HTTP服务绑定的IP。使用`0.0.0.0`允许外网访问这个HTTP服务。`-p`标识HTTP服务的端口。`-f`标识性能分析的结果文件。`main.py`标识被性能分析的源文件。
-用Web浏览器访问对应网址，即可显示性能分析的结果：
-```
-   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
-        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
-     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
-     4696   12.040    0.003   12.040    0.003 {built-in method run}
-        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
-```
-每一列的含义是:
-<table>
-<thead>
-<tr>
-<th>列名</th>
-<th>含义 </th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td> ncalls</td>
-<td> 函数的调用次数</td>
-</tr>
-<tr>
-<td>tottime</td>
-<td> 函数实际使用的总时间。该时间去除掉本函数调用其他函数的时间</td>
-</tr>
-<tr>
-<td> percall </td>
-<td> tottime的每次调用平均时间</td>
-</tr>
-<tr>
-<td> cumtime</td>
-<td> 函数总时间。包含这个函数调用其他函数的时间</td>
-</tr>
-<tr>
-<td> percall</td>
-<td> cumtime的每次调用平均时间</td>
-</tr>
-<tr>
-<td> filename:lineno(function) </td>
-<td> 文件名, 行号，函数名 </td>
-</tr>
-</tbody>
-</table>
-### 寻找性能瓶颈
-通常`tottime`和`cumtime`是寻找瓶颈的关键指标。这两个指标代表了某一个函数真实的运行时间。
-将性能分析结果按照tottime排序，效果如下:
-```text
-     4696   12.040    0.003   12.040    0.003 {built-in method run}
-   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
-   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
-     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
-        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
-```
-可以看到最耗时的函数是C++端的`run`函数。这需要联合我们第二节`Python`与`C++`混合代码的性能分析来进行调优。而`sync_with_cpp`函数的总共耗时很长，每次调用的耗时也很长。于是我们可以点击`sync_with_cpp`的详细信息，了解其调用关系。
-```text
-Called By:
-   Ordered by: internal time
-   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
-Function                                                                                                 was called by...
-                                                                                                             ncalls  tottime  cumtime
-/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)  <-    4697    0.626    2.291  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
-/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)  <-    4696    0.019    2.316  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
-                                                                                                                  1    0.000    0.001  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)
-Called:
-   Ordered by: internal time
-   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
-```
-通常观察热点函数间的调用关系，和对应行的代码，就可以了解到问题代码在哪里。当我们做出性能修正后，再次进行性能分析(profiling)即可检查我们调优后的修正是否能够改善程序的性能。
-## Python与C++混合代码的性能分析
-### 生成性能分析文件
-C++的性能分析工具非常多。常见的包括`gprof`, `valgrind`, `google-perftools`。但是调试Python中使用的动态链接库与直接调试原始二进制相比增加了很多复杂度。幸而Python的一个第三方库`yep`提供了方便的和`google-perftools`交互的方法。于是这里使用`yep`进行Python与C++混合代码的性能分析
-使用`yep`前需要安装`google-perftools`与`yep`包。ubuntu下安装命令为
-```bash
-apt update
-apt install libgoogle-perftools-dev
-pip install yep
-```
-安装完毕后，我们可以通过
-```bash
-python -m yep -v main.py
-```
-生成性能分析文件。生成的性能分析文件为`main.py.prof`。
-命令行中的`-v`指定在生成性能分析文件之后，在命令行显示分析结果。我们可以在命令行中简单的看一下生成效果。因为C++与Python不同，编译时可能会去掉调试信息，运行时也可能因为多线程产生混乱不可读的性能分析结果。为了生成更可读的性能分析结果，可以采取下面几点措施:
-1. 编译时指定`-g`生成调试信息。使用cmake的话，可以将CMAKE_BUILD_TYPE指定为`RelWithDebInfo`。
-2. 编译时一定要开启优化。单纯的`Debug`编译性能会和`-O2`或者`-O3`有非常大的差别。`Debug`模式下的性能测试是没有意义的。
-3. 运行性能分析的时候，先从单线程开始，再开启多线程，进而多机。毕竟单线程调试更容易。可以设置`OMP_NUM_THREADS=1`这个环境变量关闭openmp优化。
-### 查看性能分析文件
-在运行完性能分析后，会生成性能分析结果文件。我们可以使用[`pprof`](https://github.com/google/pprof)来显示性能分析结果。注意，这里使用了用`Go`语言重构后的`pprof`，因为这个工具具有web服务界面，且展示效果更好。
-安装`pprof`的命令和一般的`Go`程序是一样的，其命令如下:
-```bash
-go get github.com/google/pprof
-```
-进而我们可以使用如下命令开启一个HTTP服务:
-```bash
-pprof -http=0.0.0.0:3213 `which python`  ./main.py.prof
-```
-这行命令中，`-http`指开启HTTP服务。`which python`会产生当前Python二进制的完整路径，进而指定了Python可执行文件的路径。`./main.py.prof`输入了性能分析结果。
-访问对应的网址，我们可以查看性能分析的结果。结果如下图所示:
-![result](./pprof_1.png)
-### 寻找性能瓶颈
-与寻找Python代码的性能瓶颈类似，寻找Python与C++混合代码的性能瓶颈也是要看`tottime`和`cumtime`。而`pprof`展示的调用图也可以帮助我们发现性能中的问题。
-例如下图中，
-![kernel_perf](./pprof_2.png)
-在一次训练中，乘法和乘法梯度的计算占用2%-4%左右的计算时间。而`MomentumOp`占用了17%左右的计算时间。显然，`MomentumOp`的性能有问题。
-在`pprof`中，对于性能的关键路径都做出了红色标记。先检查关键路径的性能问题，再检查其他部分的性能问题，可以更有次序的完成性能的优化。
--- a/doc/fluid/howto/optimization/host_memory_profiling_cn.md
+++ b/doc/fluid/howto/optimization/host_memory_profiling_cn.md
-# 堆内存分析和优化
-计算机程序都可能有内存泄漏的风险。**内存泄漏**一般是由于程序在堆(heap)上分配了内存而没有释放，随着程序的运行占用的内存越来越大，一方面会影响程序的稳定性，可能让运行速度越来越慢，或者造成oom，甚至会影响运行程序的机器的稳定性，造成宕机。
-目前有很多内存泄漏分析工具，比较经典的有[valgrind](http://valgrind.org/docs/manual/quick-start.html#quick-start.intro), [gperftools](https://gperftools.github.io/gperftools/)。
-因为Fluid是用Python驱动C++ core来运行，valgrind直接分析非常困难，需要自己编译debug版本的、带valgrind支持的专用Python版本，而且输出的信息中大部分是Python自己的符号和调用信息，分析起来很困难，另外使用valgrind会让程序运行速度变得非常慢，所以不建议使用。
-本教程主要介绍[gperftools](https://gperftools.github.io/gperftools/)的使用。
-gperftool主要支持以下四个功能：
- thread-caching malloc
- heap-checking using tcmalloc
- heap-profiling using tcmalloc
- CPU profiler
-Paddle也提供了基于gperftool的[CPU性能分析教程](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/howto/optimization/cpu_profiling_cn.md)。
-对于堆内存的分析，主要用到thread-caching malloc和heap-profiling using tcmalloc。
-## 环境
-本教程基于paddle提供的Docker开发环境paddlepaddle/paddle:latest-dev，基于Ubuntu 16.04.4 LTS环境。
-## 使用流程
- 安装google-perftools
-```
-apt-get install libunwind-dev 
-apt-get install google-perftools
-```
- 安装pprof
-```
-go get -u github.com/google/pprof
-```
- 设置运行环境
-```
-export PPROF_PATH=/root/gopath/bin/pprof
-export PPROF_BINARY_PATH=/root/gopath/bin/pprof
-export LD_PRELOAD=/usr/lib/libtcmalloc.so.4
-```
- 使用heap profile来运行python程序。本质上是周期性的对堆的分配情况做一次快照。
-```
-# HEAPPROFILE 设置生成的堆分析文件的目录和文件前缀
-# HEAP_PROFILE_ALLOCATION_INTERVAL 设置每分配多少存储dump一次dump，默认1GB
-env HEAPPROFILE="./perf_log/test.log" HEAP_PROFILE_ALLOCATION_INTERVAL=209715200 python trainer.py
-```
-随着程序的运行，会在perf_log这个文件夹下生成很多文件，如下：
-```
-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0001.heap
-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0002.heap
-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0003.heap
-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0004.heap
-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0005.heap
-rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0006.heap
-```
- 使用pprof对heap文件进行分析。分析有两种模式：
-	- 完整模式。会对当前heap做一个分析，显示目前分配内存一些调用路径。
-	```
-	pprof --pdf python test.log.0012.heap
-	```
-	上述命令会生成一个profile00x.pdf的文件，可以直接打开，例如：[memory_cpu_allocator](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_cpu_allocator.pdf)。从下图可以看出，在CPU版本fluid的运行过程中，分配存储最多的模块式CPUAllocator. 而别的模块相对而言分配内存较少，所以被忽略了，这对于分配内存泄漏是很不方便的，因为泄漏是一个缓慢的过程，在这种图中是无法看到的。
-	![result](https://user-images.githubusercontent.com/3048612/40964027-a54033e4-68dc-11e8-836a-144910c4bb8c.png)
-	- Diff模式。可以对两个时刻的heap做diff，把一些内存分配没有发生变化的模块去掉，而把增量部分显示出来。
-	```
-	pprof --pdf --base test.log.0010.heap python test.log.1045.heap
-	```
-	生成的结果为：[`memory_leak_protobuf`](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_leak_protobuf.pdf)
-	从图中可以看出：ProgramDesc这个结构，在两个版本之间增长了200MB+，所以这里有很大的内存泄漏的可能性，最终结果也确实证明是这里造成了泄漏。
-	![result](https://user-images.githubusercontent.com/3048612/40964057-b434d5e4-68dc-11e8-894b-8ab62bcf26c2.png)
-	![result](https://user-images.githubusercontent.com/3048612/40964063-b7dbee44-68dc-11e8-9719-da279f86477f.png)
--- a/doc/fluid/howto/optimization/timeline_cn.md
+++ b/doc/fluid/howto/optimization/timeline_cn.md
-# timeline工具简介
-## <span id="local">本地使用</span>
-1. 在训练的主循环外加上`profiler.start_profiler(...)`和`profiler.stop_profiler(...)`。运行之后，代码会在`/tmp/profile`目录下生成一个profile的记录文件。
-	**提示：**
-	请不要在timeline记录信息时运行太多次迭代，因为timeline中的记录数量和迭代次数是成正比的。
-	```python
-    for pass_id in range(pass_num):
-        for batch_id, data in enumerate(train_reader()):
-            if pass_id == 0 and batch_id == 5:
-                profiler.start_profiler("All")
-            elif pass_id == 0 and batch_id == 10:
-                profiler.stop_profiler("total", "/tmp/profile")
-            exe.run(fluid.default_main_program(),
-                    feed=feeder.feed(data),
-                    fetch_list=[])
-	            ...
-	```
-1. 运行`python paddle/tools/timeline.py`来处理`/tmp/profile`，这个程序默认会生成一个`/tmp/timeline`文件，你也可以用命令行参数来修改这个路径，请参考[timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py)。
-```python
-python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline
-```
-1. 打开chrome浏览器，访问<chrome://tracing/>，用`load`按钮来加载生成的`timeline`文件。
-	![chrome tracing](./tracing.jpeg)
-1. 结果如下图所示，可以放到来查看timetime的细节信息。
-	![chrome timeline](./timeline.jpeg)
-## 分布式使用
-一般来说，分布式的训练程序都会有两种程序：pserver和trainer。我们提供了把pserver和trainer的profile日志用timeline来显示的方式。 
-1. trainer打开方式与[本地使用](#local)部分的第1步相同
-1. pserver可以通过加两个环境变量打开profile，例如：
-```
-FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
-```
-3. 把pserver和trainer的profile文件生成一个timeline文件，例如：  
-```
-python /paddle/tools/timeline.py
-    --profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1
-    --timeline_path ./dist.timeline
-```
-4. 在chrome中加载dist.timeline文件，方法和[本地使用](#local)第4步相同。
--- a/doc/fluid/index_en.rst
+++ b/doc/fluid/index_en.rst
@@ -6,9 +6,8 @@
  beginners_guide/index_en.rst
  user_guides/index_en.rst
-  design/index_en.rst
+  advanced_usage/index_en.rst
-  howto/index_en.rst
-  dev/index_en.rst
  api/index_en.rst
  book/index_en.rst
-  advanced_usage/deploy/index_mobile.rst