modify yxt (#37)

* updata_yxt * update_yxt * add_yxt * update_yxt * update_yxt * update_yxt

ea2ed599 · Tink_Y · Shan Yi · c46c7baa · ea2ed599 · ea2ed599
23 changed file
--- a/source/advanced_usage/deploy/anakin_arm_benchmark.md
+++ b/source/advanced_usage/deploy/anakin_arm_benchmark.md
@@ -54,4 +54,3 @@
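 3. Next, in the directory on the test machine that contains the Anakin model, run './benchmark_arm ./ anakin_model.anakin.bin 1 10 10 1'.
 4. Finally, the model's running time is printed to the terminal.
 5. The number and meaning of the command's arguments are shown when you run './benchmark_arm'.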
--- a/source/advanced_usage/deploy/anakin_example.md
+++ b/source/advanced_usage/deploy/anakin_example.md
-../../../anakin/examples/example_introduction_cn.md
+# Example
\ No newline at end of file
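+Anakin currently supports only the NCHW format.
+The example files are under test/framework/net.
+## Running a CNN model on an NVIDIA GPU
+The example file is example_nv_cnn_net.cpp. The overall flow is as follows:
+- Set the model path to the path of the Anakin model and initialize the graph object for the NV platform. An Anakin model can be obtained by converting a Caffe or Fluid model with the converter.
+- Set the input size of the network graph according to the model, then optimize the graph.
+- Initialize the network executor from the optimized graph.
+- Get the network's input tensor and copy the data into it.
+- Run inference.
+- Get the network's output tensor.
+This example uses the NV platform to demonstrate how to use the Anakin framework. Note that the GPU build switch must be turned on at compile time.
+## Running an RNN model on X86
+The example file is example_x86_rnn_net.cpp.
+The overall flow is similar to running a CNN model on an NVIDIA GPU, with the following differences:
+- The graph object and the network executor object are initialized with the X86 tag.
+- The input size of an RNN model is variable. The input dimensions given when the graph is initialized are the maximum dimensions, and the input dimension N is the total number of words. You also need to set the input tensor's seq_offset to mark how the words are divided into sentences. For example, {0,5,12} means there are 12 words in total, where words 0 to 4 form the first sentence and words 5 to 11 form the second.
+This example uses the X86 platform to demonstrate how to use the Anakin framework. Note that the X86 build switch must be turned on at compile time.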
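The seq_offset convention described above can be illustrated with a small standalone sketch. It does not use the Anakin API; `split_sentences` is a hypothetical helper name, and the only assumption is the one stated in the text: consecutive offsets delimit half-open word ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Anakin): turn seq_offset boundaries into
// half-open word ranges [begin, end), one range per sentence.
std::vector<std::pair<int, int>> split_sentences(const std::vector<int>& seq_offset) {
    std::vector<std::pair<int, int>> sentences;
    for (std::size_t i = 0; i + 1 < seq_offset.size(); ++i) {
        // Offsets must be non-decreasing for the ranges to make sense.
        assert(seq_offset[i] <= seq_offset[i + 1]);
        sentences.emplace_back(seq_offset[i], seq_offset[i + 1]);
    }
    return sentences;
}
```

With the offsets {0, 5, 12} from the text, this yields the ranges [0, 5) and [5, 12), that is, words 0 to 4 and words 5 to 11.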
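+## Running a CNN model on an NVIDIA GPU with Anakin's thread pool
+The example file is example_nv_cnn_net_multi_thread.cpp. The example uses the worker's synchronous prediction interface.
+The overall flow is similar to running a CNN model on an NVIDIA GPU, with the following differences:
+- The worker object is initialized with the model path and the thread pool size.
+- The input tensor is pushed into the task queue, and the output tensor is returned.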
--- a/source/advanced_usage/deploy/anakin_gpu_benchmark.md
+++ b/source/advanced_usage/deploy/anakin_gpu_benchmark.md
@@ -168,8 +168,3 @@ We tested them on single-GPU with single-thread.
 > 2. Switch to the *source_root/benchmark/CNN* directory. Use 'mkdir ./models' to create ./models and put the Anakin models into this directory.
 > 3. Run 'sh run.sh'. It creates files in logs that save the model log for each batch size. Finally, a model latency summary is displayed on the screen.
 > 4. If you want more detailed per-op timing, modify CMakeLists.txt by setting `ENABLE_OP_TIMER` to `YES`, then recompile and run. You will find the detailed information in the model log file.
--- a/source/advanced_usage/deploy/anakin_tutorial.md
+++ b/source/advanced_usage/deploy/anakin_tutorial.md
-../../../anakin/docs/Manual/Tutorial_ch.md
+# Anakin Tutorial ##
\ No newline at end of file
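+This tutorial briefly introduces how Anakin works, some basic Anakin APIs, and how to call them.
+## Contents ###
+- [How Anakin works](#principle)
+- [Anakin APIs](#api)
+- [Example code](#example)
+## <span id = 'principle'> How Anakin works</span> ###
+![Anakin_principle](../pics/anakin_fm_ch.png)
+Running a forward pass with Anakin takes three main steps:
+- Parse the external model into an Anakin model with the [Anakin Parser](Converter_ch.md)  
+  Before using Anakin, you must convert any other model into an Anakin model. We provide conversion scripts, and you can convert models with the [Anakin Parser](Converter_ch.md).
+- Generate the Anakin compute graph
+  Loading the Anakin model generates the original compute graph, which then needs to be optimized. You only need to call the corresponding API to optimize it.
+- Execute the compute graph  
+  Anakin selects among different hardware platforms to execute the compute graph.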
+## <span id ='api'>Anakin APIs </span> ###
+### Tensor ####
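+`Tensor` provides basic data operations and management and gives ops a unified data interface. A `Tensor` has the following attributes:   
+- Buffer  
+   The data storage area
+- Shape  
+   The dimension information of the data
+- Event  
+   Used for synchronization in asynchronous computation
+ The `Tensor` class contains three `Shape` objects: `_shape`, `_valid_shape` and `_offset`. `_shape` is the real space information of the `tensor`, `_valid_shape` is the space information currently used by the `tensor`, and `_offset` is the position of the current `tensor`'s data pointer relative to the real data space. Tensors of different dimensions correspond to mathematical vectors, matrices and so on, as shown in the table below.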
+Dimensions | Math entity |
+ :----: | :----:
+1 | vector
+2 | matrix
+3 | 3-tensor
+n | n-tensor
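One way to read the relationship between `_shape`, `_valid_shape` and `_offset` is as a sub-region view into a larger buffer. The sketch below is an interpretation, not Anakin code: it assumes row-major storage and treats the offset as the per-dimension start of the valid region. `real_index` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (assumption: row-major storage, per-dimension offset).
// Maps a coordinate inside the valid region to a linear index in the real buffer.
std::size_t real_index(const std::vector<int>& shape,
                       const std::vector<int>& valid_shape,
                       const std::vector<int>& offset,
                       const std::vector<int>& coord) {
    assert(shape.size() == valid_shape.size() &&
           shape.size() == offset.size() &&
           shape.size() == coord.size());
    std::size_t index = 0;
    for (std::size_t d = 0; d < shape.size(); ++d) {
        // The coordinate must lie inside the valid region,
        // and the valid region must fit inside the real shape.
        assert(coord[d] >= 0 && coord[d] < valid_shape[d]);
        assert(offset[d] >= 0 && offset[d] + valid_shape[d] <= shape[d]);
        index = index * static_cast<std::size_t>(shape[d]) +
                static_cast<std::size_t>(offset[d] + coord[d]);
    }
    return index;
}
```

For a real shape of 4 x 6, a valid shape of 2 x 3 and an offset of (1, 2), the valid element (0, 0) sits at linear index 8 of the real buffer under these assumptions.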
+#### Declaring a tensor object
+`Tensor` takes three template parameters:
+```c++
+ template<typename TargetType, DataType datatype, typename LayOutType = NCHW>
+ class Tensor .../* Inherit other class */{
+  //some implements
+  ...
+ };
+```
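+TargetType is the platform type, such as X86 or GPU, and Anakin has a corresponding tag for each. datatype is an ordinary data type, which also has a corresponding tag inside Anakin. [LayOutType](#layout) is the data layout type, such as batch x channel x height x width [NxCxHxW], which Anakin identifies with a struct. The correspondence between Anakin's types and the basic data types is as follows:
+1. <span id='target'>TargetType</span>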
+ Anakin TargetType | platform
+  :----: | :----:|
+  NV | NVIDIA GPU
+  ARM | ARM
+  AMD | AMD GPU
+  X86 | X86
+  NVHX86 | NVIDIA GPU with Pinned Memory
+2. <span id='datatype'>DataType</span>
+Anakin DataType | C++ | Description 
+:---: | :---: | :---: |
+AK_HALF | short | fp16
+AK_FLOAT | float | fp32
+AK_DOUBLE | double | fp64
+AK_INT8 | char | int8
+AK_INT16 | short | int16
+AK_INT32 | int | int32
+AK_INT64 | long | int64
+AK_UINT8 | unsigned char | uint8
+AK_UINT16 | unsigned short | uint16
+AK_UINT32 | unsigned int | uint32
+AK_STRING | std::string | /
+AK_BOOL | bool | /
+AK_SHAPE | / | Anakin Shape 
+AK_TENSOR | / | Anakin Tensor 
+3. <span id = 'layout'>LayOutType </span>
+Anakin LayOutType ( Tensor LayOut ) | Tensor Dimension | Tensor Support | Op Support
+:---: | :---: | :---: | :---: |
+W | 1-D | YES | NO
+HW | 2-D | YES | NO
+WH | 2-D | YES | NO
+NW | 2-D | YES | YES
+NHW | 3-D | YES |YES
+NCHW ( default ) | 4-D | YES | YES
+NHWC | 4-D | YES | NO
+NCHW_C4 | 5-D | YES | YES
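+In theory, Anakin supports declaring tensors of one or more dimensions, but the Ops in Anakin support only four layouts: NW, NHW, NCHW and NCHW_C4. NCHW is the default LayOutType, and NCHW_C4 is dedicated to the int8 data type.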
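To make the default NCHW layout concrete, the sketch below computes where element (n, c, h, w) lands in a densely packed buffer. It is standard row-major arithmetic rather than Anakin code, and `nchw_offset` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: linear offset of element (n, c, h, w) in a densely
// packed NCHW buffer with C channels, H rows and W columns per image.
std::size_t nchw_offset(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                        std::size_t C, std::size_t H, std::size_t W) {
    // Channel, row and column indices must be inside their extents.
    assert(c < C && h < H && w < W);
    return ((n * C + c) * H + h) * W + w;
}
```

The width index varies fastest and the batch index slowest, so stepping w by one moves one element, while stepping n by one moves C x H x W elements.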
+Examples
+> The code below shows how to use a tensor. We recommend reading these examples first.
+> For more information about tensors, see *source_path/core/tensor.h*
+> 1. Initialize a tensor from a shape object
+``` c++  
+  //Create a null tensor. A null tensor holds no data.
+  //The tensor's buffer resides on the CPU and its datatype is AK_FLOAT.
+  //The tensor's layout is NCHW (default).
+   Tensor<X86, AK_FLOAT> mytensor;
+   //1. Use a shape object to create a tensor.
+   Shape shape1(NUM); //1-D shape. NUM is the size of its single dimension.
+   Tensor<X86, AK_FLOAT, W> mytensor1(shape1); //1-D tensor.
+  // A 4-D shape
+   Shape shape2(N, C, H, W); // batch x channel x height x width
+```
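+>`Note: the dimensions of the Shape must match the tensor's `[LayoutType](#layout)`. For example, with Shape(N,C,H,W) the Tensor's LayoutType must be NCHW, otherwise an error occurs, as the following code shows.`  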
+```c++
+   // A 4-D tensor.
+   Tensor<X86, AK_FLOAT> mytensor2(shape2);  //right
+   //A 4-D tensor which is resident at GPU and its datatype is AK_INT8
+   Tensor<NV, AK_INT8> mytensor3(shape2);   //right
+   Tensor<X86, AK_FLOAT, NHW> mytensor4(shape2); //wrong!! The shape's dimensions must match the tensor's layout.
+   Tensor<NV, AK_FLOAT, NCHW_C4> mytensor5(shape2); //wrong!!!!
+```
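The rule in the note above (a shape's dimension count must equal that of the tensor's layout) can be sketched independently of Anakin. The layout tags below are hypothetical stand-ins that carry only a dimension count taken from the LayOutType table. How Anakin itself enforces the rule is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout tags (not Anakin's types); only the dimension count matters.
struct LayoutW       { static constexpr std::size_t dims = 1; };
struct LayoutNHW     { static constexpr std::size_t dims = 3; };
struct LayoutNCHW    { static constexpr std::size_t dims = 4; };
struct LayoutNCHW_C4 { static constexpr std::size_t dims = 5; };

// Returns true when the shape has exactly as many dimensions as the layout.
template <typename Layout>
bool shape_matches_layout(const std::vector<int>& shape) {
    for (int d : shape) {
        assert(d >= 0);  // dimension sizes cannot be negative
    }
    return shape.size() == Layout::dims;
}
```

A 4-D shape passes the check for `LayoutNCHW` and fails it for `LayoutNHW` and `LayoutNCHW_C4`, mirroring the right and wrong declarations in the listing above.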
+> 2. Initialize a tensor from existing data and a shape
+```c++
+   /**
+   *  A constructor of Tensor.
+   *  data_ptr is a pointer to data of any data type
+   *  TargetType is the type of a platform [Anakin TargetType]
+   *  id : device id
+   *  shape: an Anakin shape
+   */
+   Tensor(Dtype* data_ptr, TargetType_t target, int id, Shape shape);
+   //Feed existing data to a tensor.
+   Tensor<X86, AK_FLOAT> mytensor(data_ptr, TargetType, device_id, shape); //shape must have dimensions (N, C, H, W).
+```
+> 3. Initialize a tensor from another tensor
+```c++
+   Tensor<NV, AK_FLOAT> tensor(exist_tensor);
+```
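+> Tip: you can use ` typedef Tensor<X86, AK_FLOAT> Tensor4d_X86 ` to make tensor declarations more convenient
+#### Filling a tensor's data area
+How you fill the data area depends on how the tensor was declared. The following shows how to fill a tensor's data area.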
+```c++
+// First, the four ways of declaring a tensor:
+// 1. Tensor<X86, AK_FLOAT> mytensor;
+// 2. Tensor<X86, AK_FLOAT, W> mytensor1(shape1);
+// 3. Tensor<X86, AK_FLOAT> mytensor(data_ptr, TargetType, device_id, shape);
+// 4. Tensor<NV, AK_FLOAT> tensor(exist_tensor);
+// The way to fill the data for each declaration is as follows.
+// 1: This declares an empty tensor. No memory has been allocated for it, so we must allocate memory manually.
+            //param shape
+            mytensor.re_alloc(shape);
+            //Get a writable pointer to mytensor.
+            //param index (int): where you start to write.
+            //Dtype is your data type, such as int, float or double.
+            Dtype *p = mytensor.mutable_data(index/*=0*/);
+            //write data to mytensor
+            for(int i = 0; i < mytensor.size(); i++){
+              p[i] = 1.0f;
+            }
+            //do something ...
+// 2: This declaration allocates memory automatically.
+          //Get a writable pointer to mytensor1.
+          //param index (int): where you start to write.
+          //Dtype is your data type, such as int, float or double.
+          Dtype *p = mytensor1.mutable_data(index/*=0*/);
+          //write data to mytensor1
+          for(int i = 0; i < mytensor1.size(); i++){
+            p[i] = 1.0f;
+          }
+          //do something ...
+// 3: With this declaration we still do not need to allocate memory manually, but whether the constructor
+// allocates memory depends on the situation. If data_ptr and the declared tensor are on the same target
+// platform, the tensor shares memory with data_ptr. If they are on different platforms (for example,
+// data_ptr on X86 and the tensor on GPU), the tensor allocates new memory and copies the data that
+// data_ptr points to into the tensor's buffer.
+          //Get a writable pointer to mytensor.
+          //param index (int): where you start to write.
+          //Dtype is your data type, such as int, float or double.
+          Dtype *p = mytensor.mutable_data(index/*=0*/);
+          //write data to mytensor
+          for(int i = 0; i < mytensor.size(); i++){
+            p[i] = 1.0f;
+          }
+          //do something ...
+// 4: This declaration does not require manual memory allocation either.
+          //Get a writable pointer to tensor.
+          //param index (int): where you start to write.
+          //Dtype is your data type, such as int, float or double.
+          Dtype *p = tensor.mutable_data(index/*=0*/);
+          //write data to tensor
+          for(int i = 0; i < tensor.size(); i++){
+            p[i] = 1.0f;
+          }
+          //do something ...
+// You can also get a read-only pointer to a tensor, for example:
+        //Get a read-only pointer to mytensor.
+        //param index (int): where you start to read.
+        //Dtype is your data type, such as int, float or double.
+         const Dtype *p = mytensor.data(index/*=0*/);
+        //do something ...
+```
+For more details about tensors, see *source_path/saber/core/tensor.h*
+#### Getting a tensor's shape
+```c++
+//some declarations
+// ...
+Shape shape = mytensor.shape();
+//Get the size of the first dimension of the tensor, if it has one.
+int d1 = shape[0];
+//Get the size of the second dimension of the tensor, if it has one.
+int d2 = shape[1];
+...
+//Get the size of the n-th dimension of the tensor, if it has one.
+int dn = shape[n-1];
+//Get the number of dimensions of the tensor.
+int dims = mytensor.dims();
+//Get the size of the tensor.
+//size = d1 x d2 x ... x dn.
+int size = mytensor.size();
+//Get the size of the tensor over the interval [Di, Dj),
+// from the i-th dimension to the j-th dimension, not including the j-th dimension,
+// which means d_i x d_(i+1) x ... x d_(j-1).
+int count = mytensor.count(start, end);
+```
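The half-open interval used by `count(start, end)` can be checked with a standalone sketch. This is not the Anakin implementation, only the product the comments above describe. `count_range` is a hypothetical name, and the shape is modelled as a plain vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: product of the dimension sizes in [start, end).
// An empty interval yields 1, the neutral element of multiplication.
long long count_range(const std::vector<int>& shape, std::size_t start, std::size_t end) {
    assert(start <= end && end <= shape.size());
    long long n = 1;
    for (std::size_t i = start; i < end; ++i) {
        n *= shape[i];
    }
    return n;
}
```

For a shape of 2 x 3 x 4 x 5, the interval [1, 3) covers the second and third dimensions and gives 3 x 4 = 12, while [0, 4) gives the full size of 120.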
+#### Setting a tensor's shape
+We can set a tensor's shape with its member function set_shape. The definition of set_shape is as follows:
+```c++
+/**
+ * \brief set a tensor's shape
+ * \param valid_shape [a Shape object]
+ * \param shape [a Shape object]
+ * \param offset [a Shape object]
+ * \return the status of this operation, that is, whether it succeeded or not.
+ */
+SaberStatus set_shape(Shape valid_shape, Shape shape = Shape::zero(TensorAPI::layout_dims::value), Shape offset = Shape::minusone(TensorAPI::layout_dims::value)); 
+```
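+This member function only sets the tensor's shape. The [LayOutType](#layout) of these shape objects (valid_shape, shape, offset) must be the same as the LayOutType of the tensor's corresponding three shape objects. If they differ, the call fails and returns SaberInvalidValue. If they match, the tensor's shape is set successfully.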
+```c++
+// some declarations
+// ...
+//valid_shape, shape and offset are Shape objects.
+//The LayOutType of all these Shape objects must be equal to mytensor's.
+mytensor.set_shape(valid_shape, shape, offset);
+```
+#### Reshaping a tensor
+```c++
+//some declarations
+Shape shape, valid_shape, offset;
+//do some initializations
+... 
+mytensor.reshape(valid_shape, shape, offset);
+```
+Note: the reshape operation also requires the shape's [LayOutType](#layout) to be the same as the tensor's.
+### Graph ###
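+The `Graph` class is responsible for loading an Anakin model to generate the compute graph, optimizing the graph, saving the model, and similar operations.
+#### Declaring a graph
+Like `Tensor`, graph also takes three template parameters.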
+```c++
+template<typename TargetType, DataType Dtype, Precision Ptype>
+class Graph ... /* inherit other class*/{
+  //some implements
+  ...
+};
+```
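+As introduced earlier, [TargetType](#target) and [DataType](#datatype) are data types defined inside Anakin. [TargetType](#target) is the platform type (such as NV or X86), and [DataType](#datatype) is Anakin's basic data type, which corresponds to the basic data types in C++/C. [Precision](#precision) is the precision type supported by the op, which we introduce later.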
+```c++
+//Create an empty graph object.
+Graph<NV, AK_FLOAT, Precision::FP32> graph;
+//Create a pointer to an empty graph.
+Graph<NV, AK_FLOAT, Precision::FP32> *graph_ptr = new Graph<NV, AK_FLOAT, Precision::FP32>();
+//The same, letting auto deduce the pointer type.
+auto graph_auto = new Graph<NV, AK_FLOAT, Precision::FP32>();
+```
+#### Loading an Anakin model
+```c++
+//some declarations
+...
+auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
+std::string model_path = "the/path/to/where/your/models/are";
+const char *model_path1 = "the/path/to/where/your/models/are";
+//Load the Anakin model to generate a compute graph.
+auto status = graph->load(model_path);
+//Or load from a C string instead:
+//auto status = graph->load(model_path1);
+//Check whether the load operation succeeded.
+if(!status){
+  std::cout << "error" << std::endl;
+  //do something...
+}
+```
+#### Optimizing the compute graph
+```c++
+//some declarations
+...
+//Load graph.
+...
+//According to the ops of loaded graph, optimize compute graph.
+graph->Optimize();
+```
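+> Note: the first time you load an original graph, you must optimize it.
+#### Saving the model
+You can save the model at any time. In particular, you can save an optimized model so that the next time you load it, you do not need to optimize it again.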
+```c++
+//some declarations
+...
+//Load graph.
+...
+// save a model
+//save_model_path: the path where the model will be saved.
+auto status = graph->save(save_model_path);
+//Check whether the save operation succeeded.
+if(!status){
+  std::cout << "error" << std::endl;
+  //do something...
+}
+```
+#### Resetting the shape of a tensor in the compute graph
+```c++
+//some declarations
+...
+//Load graph.
+...
+vector<int> shape{10, 256, 256, 10};
+//input_name : std::string.
+//Reshape a tensor named input_name.
+graph->Reshape(input_name, shape);//Note: shape is a vector, not a Shape object.
+```
+#### Setting the batch size
+`Graph` supports resetting the batch size.
+```c++
+//some declarations
+...
+//Load graph.
+...
+//input_name : std::string.
+//Reset the batch size of the tensor named input_name.
+int new_batch_size = 4;
+graph->ResetBatchSize(input_name, new_batch_size);
+```
+###  Net ###
+`Net` is the executor of the compute graph. You can get the inputs and outputs through a Net object.
+#### Creating a graph executor
+`Net` takes four template parameters.  
+```c++
+template<typename TargetType, DataType Dtype, Precision PType, OpRunType RunType = OpRunType::ASYNC>
+class Net{
+  //some implements
+  ...
+};
+```
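+Because some Ops may support several precisions, we can specify one through Precision. OpRunType selects synchronous or asynchronous execution, and asynchronous is the default. OpRunType::SYNC means synchronous, with a single stream on the GPU. OpRunType::ASYNC means asynchronous, with multiple streams on the GPU executing asynchronously. Both Precision and OpRunType are actually enum classes. For the detailed design, see *source_root/framework/core/types.h*.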
+1. <span id = 'precision'> Precision </span>
+Precision | Op support
+:---: | :---:
+Precision::INT4 | NO
+Precision::INT8 | NO
+Precision::FP16 | NO
+Precision::FP32 | YES
+Precision::FP64 | NO
+Currently Ops support only FP32 precision, but we will support the remaining Precision values in the future.
+2. OpRunType
+OpRunType | Sync/Async |Description
+:---: | :---: | :---:
+OpRunType::SYNC | Synchronization | single-stream on GPU
+OpRunType::ASYNC | Asynchronization | multi-stream on GPU
+Create an executor from a graph object.
+```c++
+//some declarations
+...
+//Create a pointer to a graph.
+auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
+//do something...
+...
+//create an executor
+Net<NV, AK_FLOAT, Precision::FP32> executor(*graph);
+```
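+#### Getting the input and output tensors
+Get the input and output tensors and fill the input tensor's buffer. To get an input tensor, you must specify its name, such as "input_0", "input_1", "input_2", and so on. Only these strings return the input tensors. To find out which input input_i corresponds to, check the dash board. For how to use the dash board, see the [Anakin Parser](Converter_ch.md). See the example code below.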
+```c++
+//some declarations
+...
+//create a executor
+//TargetType is NV [NVIDIA GPU]
+Net<NV, AK_FLOAT, Precision::FP32> executor(*graph);
+//Get the first input tensor.
+//The following tensors (tensor_in0, tensor_in1, ...) reside on the GPU.
+//Note: the member function get_in returns a pointer to a tensor.
+Tensor<NV, AK_FLOAT>* tensor_in0 = executor.get_in("input_0");
+//If you have multiple input tensors
+//You just type this code below.
+Tensor<NV, AK_FLOAT>* tensor_in1 = executor.get_in("input_1");
+...
+auto tensor_inn = executor.get_in("input_n");
+```
+Once you have the input tensor, you can fill its data area.
+```c++
+//This tensor is resident at GPU.
+auto tensor_d_in = executor.get_in("input_0");
+//To feed the tensor above, we must first fill a tensor that resides on the host, then copy the host tensor to the device tensor.
+//using Tensor4d = Tensor<Ttype, Dtype>;
+Tensor4d<X86, AK_FLOAT> tensor_h_in; //host tensor;
+//Tensor<X86, AK_FLOAT> tensor_h_in; 
+//Allocate memory for host tensor.
+tensor_h_in.re_alloc(tensor_d_in->valid_shape());
+//Get a writable pointer to tensor.
+float *h_data = tensor_h_in.mutable_data();
+//Feed your tensor.
+/** example
+for(int i = 0; i < tensor_h_in.size(); i++){
+  h_data[i] = 1.0f;
+}
+*/
+//Copy host tensor's data to device tensor.
+tensor_d_in->copy_from(tensor_h_in);
+// The input tensor is now ready for inference.
+```
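+Similarly, we can use the member function get_out to get an output tensor. Unlike getting an input tensor, we need to specify the name of the output node, which can be seen in the dash board. For how to use the dash board, see the [Anakin Parser](Converter_ch.md). Suppose there is an output node named pred_out. We can then get the corresponding output tensor with the following code: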
+```c++
+//Note: this tensor are resident at GPU.
+Tensor<NV, AK_FLOAT>* tensor_out_d = executor.get_out("pred_out");
+```
+#### Executing graph
+When everything is ready, we can run the actual computation.
+```c++
+executor.prediction();
+```
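+## <span id='example'> Example code </span> ##
+The following example shows how to call Anakin.
+Before going further, make sure you already have an Anakin model. If you do not, use the [Anakin Parser](Converter_ch.md) to convert your model.
+### Single-thread
+The single-thread example is in *source_root/test/framework/net/net_exec_test.cpp*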
+```c++
+std::string model_path = "your_Anakin_models/xxxxx.anakin.bin";
+// Create an empty graph object.
+auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
+// Load Anakin model.
+auto status = graph->load(model_path);
+if(!status ) {
+    LOG(FATAL) << " [ERROR] " << status.info();
+}
+// Reshape
+graph->Reshape("input_0", {10, 384, 960, 10});
+// You must optimize graph for the first time.
+graph->Optimize();
+// Create an executor.
+Net<NV, AK_FLOAT, Precision::FP32> net_executer(*graph);
+//Get your input tensors through specific strings such as "input_0", "input_1", and 
+//so on. 
+//Then feed the input tensor.
+//If you don't know which inputs these strings ("input_0", "input_1") correspond to, you can launch the dash board to find out.
+auto d_tensor_in_p = net_executer.get_in("input_0");
+Tensor4d<X86, AK_FLOAT> h_tensor_in;
+auto valid_shape_in = d_tensor_in_p->valid_shape();
+for (int i=0; i<valid_shape_in.size(); i++) {
+    LOG(INFO) << "detect input dims[" << i << "]" << valid_shape_in[i]; //see tensor's dimentions
+}
+h_tensor_in.re_alloc(valid_shape_in);
+float* h_data = h_tensor_in.mutable_data();
+for (int i=0; i<h_tensor_in.size(); i++) {
+    h_data[i] = 1.0f;
+}
+d_tensor_in_p->copy_from(h_tensor_in);
+//Do inference.
+net_executer.prediction();
+//Get the result tensor through the name of the output node.
+//You need to check the dash board again to find out how many output nodes there are and what their names are.
+//For example, suppose you have an output node named obj_pred_out.
+//Then you can get an output tensor.
+auto d_tensor_out_0_p = net_executer.get_out("obj_pred_out"); //get_out returns a pointer to output tensor.
+auto d_tensor_out_1_p = net_executer.get_out("lc_pred_out"); //get_out returns a pointer to output tensor.
+//......
+// do something else ...
+//...
+//Save the model.
+//You may not need to optimize the graph when you load the saved model again.
+std::string save_model_path = model_path + std::string(".saved");
+auto save_status = graph->save(save_model_path);
+if (!save_status ) {
+    LOG(FATAL) << " [ERROR] " << save_status.info();
+}
+```
--- a/source/advanced_usage/deploy/convert_paddle_to_anakin.md
+++ b/source/advanced_usage/deploy/convert_paddle_to_anakin.md
-../../../anakin/docs/Manual/Converter_ch.md
+# Model Conversion Guide
\ No newline at end of file
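+Anakin supports inference with models from different frameworks. Because the formats differ, Anakin requires you to convert the model in advance. This document describes how to convert a model.
+## Introduction
+The Anakin model converter accepts inference models in two formats, Caffe and Fluid. A model consists of the network structure (model or prototxt) and the weight parameters (param or caffemodel).   
+The output of the conversion is a bin file, which is imported into the Anakin framework as the graph parameter.   
+You can also use the converter's launch board feature to generate an HTML preview of the network structure.   
+## System requirements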
+- python 2.7+
+- pyyaml
+- flask
+- protobuf 3.5+
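+## Usage
+### 1. Environment
+The dependencies required by the converter are listed in the *System requirements* section.
+### 2. Configuration
+You need to edit the *config.yaml* file to describe what you need. The project provides a sample *config.yaml*, which is explained further below.
+#### config.yaml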
+```yaml
+OPTIONS:
+    Framework: CAFFE       # CAFFE or FLUID, depending on the framework type
+    SavePath: ./output     # where the model is saved after conversion
+    ResultName: googlenet  # name of the output model
+    Config:
+        LaunchBoard: ON    # whether to generate the network structure preview page
+        Server:
+            ip: 0.0.0.0
+            port: 8888     # an available port for accessing the preview page
+        OptimizedGraph:    # enable this only if you have used the Optimized feature of the Anakin framework
+            enable: OFF
+            path: /path/to/anakin_optimized_anakin_model/googlenet.anakin.bin.saved
+    LOGGER:
+        LogToPath: ./log/  # path where logs are written
+        WithColor: ON
+TARGET:
+    CAFFE:
+        # required when Framework is CAFFE
+        ProtoPaths:
+            - /path/to/caffe/src/caffe/proto/caffe.proto
+        PrototxtPath: /path/to/your/googlenet.prototxt
+        ModelPath: /path/to/your/googlenet.caffemodel
+    FLUID:
+        # required when Framework is FLUID
+        Debug: NULL
+        ProtoPaths:
+            - /
+        PrototxtPath: /path/to/fluid/inference_model
+        ModelPath: /path/to/fluid/inference_model
+	# ...
+```
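+### 3. Conversion
+After editing the configuration file, just run ```python converter.py``` to convert the model.
+### 4. Preview
+The last step is to view the conversion result in a browser. The address is the one configured in *config.yaml*, for example http://0.0.0.0:8888 .
+> Note: if you used the default IP address 0.0.0.0, replace it with the real server address real_ip:port when previewing.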
--- a/source/advanced_usage/deploy/how_to_add_anakin_op.md
+++ b/source/advanced_usage/deploy/how_to_add_anakin_op.md
-../../../anakin/docs/Manual/addCustomOp.md
+# How to Add a New Operator
\ No newline at end of file
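+## Basic concepts
+This section briefly introduces a few basic concepts related to Operators. For details, see the design document.
+```framework```: the upper-level logic code, responsible for getting parameters and weights from the parser. When adding an op, you mainly modify the contents of the framework/operator directory.
+```saber```: the low-level implementation code. Anakin uses saber to wrap the different backends. Each implementation (impl) specializes its own version, and the outer framework reaches each impl through different templates. The parameters of each op are in saber/saber_funcs_param.h, and adding an op mainly means modifying the contents of saber/funcs.
+File structure of saber:
+* saber/funcs holds the external interfaces of the funcs. Ops at this layer are independent of any device implementation and depend only on what each op does. Because this layer is unrelated to the implementation (impl), none of its file names contain impl.
+* saber/funcs/impl holds the impl declarations of the ops. Each device must provide a specialized version of these declarations. For example, saber/funcs/impl/x86 implements the x86 specializations of the impl declarations, and saber/funcs/impl/cuda implements the NV specializations. Adding a new backend requires a new set of specializations. The code at this layer is implementation-specific and carries the ```impl_``` prefix.
+* saber/funcs/impl/cuda/base/cuda_c contains the cuda files with the ```.cu``` extension. New cuda kernels are added in this directory.
+* saber/funcs/impl/cuda/base/sass contains static libraries compiled from assembly code for different architectures.
+### Base classes involved and the relationships between the classes
+A brief introduction to the relevant base classes:
+* ```anakin::Operator```: the operator base class of the framework, located in framework/core/operator/operator.h
+* ```anakin::saber::BaseFunc```: the base class of saber's external op interface, which provides a unified external interface. It is located in saber/funcs/base.h. BaseFunc's ```compute_output_shape``` interface computes the output shape only from the input shape and the param, and sets it on the output through the ```tensor```'s ```set_shape``` interface (which only sets the shape and does not allocate memory). The ```operator()``` interface is the compute interface of each op.
+* ```anakin::saber::ImplBase```: the op interface of saber's device implementations and the base class of all device-specific implementations. It is located in saber/funcs/impl/impl_base.h. Implementations fall into two categories. The first carries the ```vender_``` prefix, which means the op is implemented with a third-party library, such as cudnn's conv or mkl's conv. We can hardly tune the performance of such ops, so they form their own category. The second is the saber implementations that come with source code and carry the ```saber_``` prefix. These can be optimized further over time, so keep this distinction in mind when naming an implementation.
+## Adding an operator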
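+Adding a new op takes the following steps:
+1. Add the saber param
+2. Define the saber Operator class
+3. Define the new impl declaration
+4. Complete the new impl implementation
+5. Add the framework implementation or specialization
+The following sections walk through these steps with a simple example.
+For example, suppose we want to add a new Mul op. Its formula is: $$Out = \alpha \cdot X * Y$$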
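As a numeric reference for the formula, the sketch below computes the result on plain vectors. It assumes that `*` denotes an elementwise product and that X and Y have the same length. The source does not state either point, and `mul_reference` is a hypothetical name that is not part of Anakin.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference for Out = alpha * X * Y.
// Assumption: the product of X and Y is elementwise and both have equal length.
std::vector<float> mul_reference(float alpha,
                                 const std::vector<float>& x,
                                 const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = alpha * x[i] * y[i];
    }
    return out;
}
```

A CPU reference like this can serve as the expected value when checking a device implementation of the op in a unit test.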
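+### Adding a param for the operator
+File involved: ```saber/saber_funcs_param.h```. If a param for the op you are adding already exists, you can skip this step.
+Here ```XXXParam``` is a ```struct```. It contains a no-argument constructor, a constructor with arguments, a copy constructor, ```operator=()``` and ```operator==()```.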
+```
+template <typename opTensor> // gives access to target, datatype and layout
+struct MulParam{
+  MulParam()
+    : alpha(0)
+  {}
+  MulParam(float alpha_in)
+    : alpha(alpha_in)
+  {}
+  MulParam(const MulParam& right)
+    : alpha(right.alpha)
+  {}
+  MulParam &operator=(const MulParam &right) {
+    alpha = right.alpha;
+    return *this;
+  }
+  bool operator==(const MulParam &right) const {
+    return alpha == right.alpha;
+  }
+  float alpha;
+};
+```
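+### Defining the Operator class
+File involved: ```saber/funcs/mul.h```. If the class for this op has already been defined, you need to modify the included impl definition headers here.
+A relatively complete definition is given below for reference.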
+```
+// Each device needs to include its corresponding operator implementation. [See details](#impl)
+#ifdef NVIDIA_GPU
+#include "saber/funcs/impl/cuda/saber_mul.h"
+#include "saber/funcs/impl/cuda/vender_mul.h"
+#endif
+// If a device does not yet have a corresponding operator implementation, include the declaration instead. [See details](#declare)
+#ifdef USE_X86_PLACE
+#include "saber/funcs/impl/impl_mul.h"
+#endif
+namespace anakin {
+namespace saber {
+template<typename TargetType,
+        DataType OpDtype,
+        DataType inDtype = AK_FLOAT,
+        DataType outDtype = AK_FLOAT,
+        typename LayOutType_op = NCHW,
+        typename LayOutType_in = NCHW,
+        typename LayOutType_out = NCHW>
+class Mul : public BaseFunc<
+        Tensor<TargetType, inDtype, LayOutType_in>,
+        Tensor<TargetType, outDtype, LayOutType_out>,
+        Tensor<TargetType, OpDtype, LayOutType_op>,
+        ImplBase, MulParam> {
+public:
+    using BaseFunc<
+            Tensor<TargetType, inDtype, LayOutType_in>,
+            Tensor<TargetType, outDtype, LayOutType_out>,
+            Tensor<TargetType, OpDtype, LayOutType_op>,
+            ImplBase, MulParam>::BaseFunc;
+    Mul() = default;
+    typedef Tensor<TargetType, inDtype, LayOutType_in> InDataTensor;
+    typedef Tensor<TargetType, outDtype, LayOutType_out> OutDataTensor;
+    typedef Tensor<TargetType, OpDtype, LayOutType_op> OpTensor;
+    typedef MulParam<OpTensor> Param_t;
+    typedef std::vector<InDataTensor *> Input_v;
+    typedef std::vector<OutDataTensor *> Output_v;
+    typedef std::vector<Shape> Shape_v;
+    virtual SaberStatus compute_output_shape(const Input_v &input,
+                                             Output_v &output, Param_t &param) override {
+        // compute the output shape
+        Shape output_shape = (input[0]->valid_shape());
+        /* code */
+        return output[0]->set_shape(output_shape);
+    }
+    virtual SaberStatus init_impl(ImplEnum implenum) override {
+      // All devices use this init_impl. It creates the corresponding impl.
+      switch (implenum) {
+            case VENDER_IMPL:
+                this->_impl.push_back(new VenderMul <TargetType,
+                OpDtype, inDtype, outDtype,
+                LayOutType_op, LayOutType_in, LayOutType_out>);
+                return SaberSuccess;
+            case SABER_IMPL:
+                this->_impl.push_back(new SaberMul <TargetType,
+                OpDtype, inDtype, outDtype,
+                LayOutType_op, LayOutType_in, LayOutType_out>);
+                return SaberSuccess;
+            default:
+                return SaberUnImplError;
+        }
+    }
+private:
+    virtual void pick_best_static() override {
+        if (true) // some condition?
+            this->_best_impl = this->_impl[0];
+    }
+    virtual void pick_best_specify(ImplEnum implenum) override {
+        this->_best_impl = this->_impl[0];
+    }
+};
+} // namespace saber
+} // namespace anakin
+```
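+### Adding a new impl <span id="declare">declaration</span> for the operator
+File involved: ```saber/funcs/impl/impl_mul.h```. All devices specialize the same declaration, and each specialization goes in the corresponding folder. The declaration here is the unified declaration for all devices. A reference is given below.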
+```
+#include "saber/funcs/impl/impl_macro.h"
+namespace anakin{
+namespace saber{
+DEFINE_OP_CLASS(Mul, MulParam); // the first argument is the op name, the second is the name of its param
+}
+}
+```
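+### Completing the backend-specific <span id="impl">implementation</span> of the new operator
+Files involved: ```saber/funcs/impl/xxx/vender_mul.h``` or ```saber/funcs/impl/xxx/saber_mul.h```
+Here ```xxx``` stands for a specific device. ```vender``` refers to an op implemented with a third-party library, and ```saber``` refers to an op implemented in source code. Taking the cuda vender implementation as an example, the following briefly introduces the basic interfaces of the specialized class.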
+```
+// include the corresponding declaration
+#include "saber/funcs/impl/impl_mul.h"
+namespace anakin{
+namespace saber{
+template <DataType OpDtype,
+    DataType inDtype,
+    DataType outDtype,
+    typename LayOutType_op,
+    typename LayOutType_in,
+    typename LayOutType_out>
+class VenderMul<NV, // partially specialize for the required backend
+    OpDtype, inDtype, outDtype,
+    LayOutType_op, LayOutType_in, LayOutType_out> :
+    public ImplBase<
+        Tensor<NV, inDtype, LayOutType_in>,
+        Tensor<NV, outDtype, LayOutType_out>,
+        Tensor<NV, OpDtype, LayOutType_op>,
+        MulParam<Tensor<NV, OpDtype, LayOutType_op> > >
+{
+public:
+    typedef Tensor<NV, inDtype, LayOutType_in> DataTensor_in;
+    typedef Tensor<NV, outDtype, LayOutType_out> DataTensor_out;
+    typedef Tensor<NV, OpDtype, LayOutType_op> OpTensor;
+    typedef typename DataTensor_in::Dtype InDataType;
+    typedef typename DataTensor_out::Dtype OutDataType;
+    typedef typename OpTensor::Dtype OpDataType;
+    VenderMul(){}
+    ~VenderMul() {}
+    virtual SaberStatus init(const std::vector<DataTensor_in *>& inputs,
+                            std::vector<DataTensor_out *>& outputs,
+                            MulParam<OpTensor>& param, Context<NV>& ctx) {
+        this->_ctx = ctx;
+        return create(inputs, outputs, param, ctx);
+    }
+    virtual SaberStatus create(const std::vector<DataTensor_in *>& inputs,
+                            std::vector<DataTensor_out *>& outputs,
+                            MulParam<OpTensor>& param, Context<NV>& ctx) {
+        // set internal parameters
+        return SaberSuccess;
+    }
+    virtual SaberStatus dispatch(const std::vector<DataTensor_in*>& inputs,
+                          std::vector<DataTensor_out*>& outputs,
+                        MulParam<OpTensor>& param) {
+        // dispatch kernel.
+        return SaberSuccess;
+    }
+private:
+};
+}
+}
+```
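+Difference between ```init``` and ```create```: ```init``` is the interface entered the first time the op is initialized, and it is called only then. It usually holds code that needs to run only once, such as malloc or create-style calls. ```create``` runs during the first init and is triggered again whenever the input or the param changes. It usually holds set functions that configure internal variables, so when the input changes, code directly tied to the input or weights runs here. Because ```create``` is triggered inside the network, any seriously time-consuming operation in it slows down the whole op, so choose carefully where each operation goes.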
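The call pattern described above can be sketched with toy classes. `ToyImpl` and `ToyOp` are hypothetical stand-ins, not Anakin classes, and the trigger is simplified to a change in the input shape only. The text also lists a param change as a trigger, which is omitted here.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a device implementation (not an Anakin class).
struct ToyImpl {
    int init_calls = 0;
    int create_calls = 0;
    std::vector<int> configured_shape;

    void init(const std::vector<int>& input_shape) {
        ++init_calls;         // one-time work, e.g. allocating handles
        create(input_shape);  // the first init also runs create
    }
    void create(const std::vector<int>& input_shape) {
        assert(!input_shape.empty());
        ++create_calls;       // re-configuration for the new input shape
        configured_shape = input_shape;
    }
};

// Toy caller: init on first use, create again only when the shape changes.
struct ToyOp {
    ToyImpl impl;
    bool initialized = false;

    void run(const std::vector<int>& input_shape) {
        if (!initialized) {
            impl.init(input_shape);
            initialized = true;
        } else if (input_shape != impl.configured_shape) {
            impl.create(input_shape);
        }
        // dispatch would launch the kernel here
    }
};
```

Running the toy op three times with shapes A, A and B calls init once and create twice: once from init and once for the shape change. This is why expensive one-time work belongs in init rather than create.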
+### Adding the framework specialization
+Files involved: ```framework/operators/mul.h``` and ```framework/operators/mul.cpp```.
+This section briefly describes how to add or modify an operator inside the framework.
+```
+#include "framework/core/base.h"
+#include "framework/core/data_types.h"
+#include "framework/core/operator/operator.h"
+#include "utils/logger/logger.h"
+#include "saber/funcs/mul.h" // include the corresponding saber header
+namespace anakin {
+namespace ops {
+template<typename Ttype, DataType Dtype, Precision Ptype>
+class MulHelper;
+template<typename Ttype, DataType Dtype, Precision Ptype>
+class Mul : public Operator<Ttype, Dtype, Ptype> {
+public:
+    Mul() {}
+    /// forward impl
+    virtual void operator() (OpContext<Ttype> &ctx,
+                             const std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
+                             std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) {
+        LOG(ERROR) << "Not Impl Yet Operator mul<TargetType:"<<"unknown"<<","
+                   <<type_id<typename DataTypeWarpper<Dtype>::type>().type_info()<<">";
+    }
+    friend class MulHelper<Ttype, Dtype, Ptype>;
+};
+template<typename Ttype, DataType Dtype, Precision Ptype>
+class MulHelper : public OperatorHelper<Ttype, Dtype, Ptype> {
+public:
+    MulHelper() = default;
+    ~MulHelper();
+    Status InitParam() override;
+    Status Init(OpContext<Ttype> &ctx,
+                const std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
+                std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) override;
+    Status InferShape(const std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
+                      std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) override;
+public:
+    saber::MulParam<Tensor4d<Ttype, Dtype>> _param_mul;
+    saber::Mul<Ttype, Dtype> _funcs_mul;
+};
+}
+} /* namespace anakin */
+```
+The corresponding ```.cpp``` file is as follows:
+```
+#include "framework/operators/mul.h"
+namespace anakin {
+namespace ops {
+#ifdef USE_CUDA
+template<>
+void Mul<NV, AK_FLOAT, Precision::FP32>::operator()(
+    OpContext<NV>& ctx,
+    const std::vector<Tensor4dPtr<NV, AK_FLOAT> >& ins,
+    std::vector<Tensor4dPtr<NV, AK_FLOAT> >& outs) {
+    auto* impl =
+        static_cast<MulHelper<NV, AK_FLOAT, Precision::FP32>*>(this->_helper);
+    auto& param =
+        static_cast<MulHelper<NV, AK_FLOAT, Precision::FP32>*>(this->_helper)->_param_mul;
+    impl->_funcs_mul(ins, outs, param, ctx);
+}
+#endif
+template<typename Ttype, DataType Dtype, Precision Ptype>
+Status MulHelper<Ttype, Dtype, Ptype>::InitParam() {
+    auto alpha = GET_PARAMETER(float, alpha);
+    MulParam<Tensor4d<Ttype, Dtype>> param_mul(alpha);
+    _param_mul = param_mul;
+    return Status::OK();
+}
+template<typename Ttype, DataType Dtype, Precision Ptype>
+Status MulHelper<Ttype, Dtype, Ptype>::Init(OpContext<Ttype>& ctx,
+        const std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
+        std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) {
+    SABER_CHECK(_funcs_mul.init(ins, outs, _param_mul, SPECIFY, VENDER_IMPL, ctx));
+    return Status::OK();
+}
+template<typename Ttype, DataType Dtype, Precision Ptype>
+Status MulHelper<Ttype, Dtype, Ptype>::InferShape(const
+        std::vector<Tensor4dPtr<Ttype, Dtype> >& ins,
+        std::vector<Tensor4dPtr<Ttype, Dtype> >& outs) {
+    SABER_CHECK(_funcs_mul.compute_output_shape(ins, outs, _param_mul));
+    return Status::OK();
+}
+#ifdef USE_CUDA
+template class MulHelper<NV, AK_FLOAT, Precision::FP32>;
+#endif
+#ifdef USE_ARM_PLACE
+template class MulHelper<ARM, AK_FLOAT, Precision::FP32>;
+#endif
+// register helper
+#ifdef USE_CUDA
+ANAKIN_REGISTER_OP_HELPER(Mul, MulHelper, NV, AK_FLOAT, Precision::FP32);
+#endif
+#ifdef USE_ARM_PLACE
+ANAKIN_REGISTER_OP_HELPER(Mul, MulHelper, ARM, AK_FLOAT, Precision::FP32);
+#endif
+//! register op
+ANAKIN_REGISTER_OP(Mul)
+.Doc("Mul operator")
+#ifdef USE_CUDA
+.__alias__<NV, AK_FLOAT, Precision::FP32>("mul")
+#endif
+#ifdef USE_ARM_PLACE
+.__alias__<ARM, AK_FLOAT, Precision::FP32>("mul")
+#endif
+.num_in(1)
+.num_out(1)
+.Args<float>("alpha", " alpha of Mul "); // register the "alpha" argument
+} /* namespace ops */
+} /* namespace anakin */
+```
+## Implementing the unit test
+File involved: `test/saber/xxx/test_saber_funcs_mul_xxx.cpp`
+Add a new unit test under the corresponding test directory.
+```
+TEST(TestSaberFuncNV, test_mul) {
+    // init tensors and some param (elided here: fill `input`/`output` and define `alpha`).
+    // start Reshape & doInfer
+    Context<NV> ctx1(0, 1, 1);
+    // create param
+    MulParam<Tensor<NV, AK_FLOAT, NCHW> > param(alpha);
+    std::vector<Tensor<NV, AK_FLOAT, NCHW>*> input;
+    std::vector<Tensor<NV, AK_FLOAT, NCHW>*> output;
+    // create saber op
+    Mul<NV, AK_FLOAT, AK_FLOAT, AK_FLOAT, NCHW> mul;
+    // compute output shape
+    mul.compute_output_shape(input, output, param);
+    // re_alloc output tensors memory based on output shape
+    output[0]->re_alloc(output[0]->shape());
+    // init saber op(calling init and create)
+    mul.init(input, output, param, SPECIFY, VENDER_IMPL, ctx1);
+    // call operator()
+    mul(input, output, param, ctx1);
+    // cuda specified, record events
+    cudaStream_t cuda_stream = ctx1.get_compute_stream();
+    output[0]->record_event(cuda_stream);
+    output[0]->sync();
+    // param changed 
+    param.alpha = 2.0;
+    // auto calling saber op(create and dispatch)
+    mul(input, output, param, ctx1);
+    cudaDeviceSynchronize();
+    CUDA_CHECK(cudaPeekAtLastError());
+}
+int main(int argc, const char** argv){
+    anakin::saber::Env<NV>::env_init();
+    // initial logger
+    //logger::init(argv[0]);
+    InitTest();
+    RUN_ALL_TESTS(argv[0]);
+    return 0;
+}
+```
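The test above only launches the op; it does not compare the result with an expected value. One common way to do that is to copy the output back to the host and compare it with a host reference. The sketch below is an illustration only: `mul_reference` and `max_abs_diff` are hypothetical helpers, not part of Anakin, and the reference assumes that Mul computes `out[i] = alpha * in[i]`, which this document does not state.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host reference, ASSUMING the op computes out[i] = alpha * in[i].
std::vector<float> mul_reference(const std::vector<float>& in, float alpha) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = alpha * in[i];
    }
    return out;
}

// Largest absolute element-wise difference between two equally sized buffers.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::fmax(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

In a real test, the device output would first be copied into a host vector, and the test would then check that `max_abs_diff(host_output, mul_reference(host_input, alpha))` stays below a small tolerance.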
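+## Debugging and notes
+An op needs both an external op interface and an internal implementation. Because `saber/funcs/impl` contains a non-specialized declaration, an op that has no implementation for a given device still compiles, but in that case it is an empty implementation that does nothing.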
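The effect can be reproduced in isolation. This is a minimal sketch, not Anakin code: `DemoImpl`, `DeviceA`, `DeviceB`, and `run_demo` are hypothetical names, and the generic template stands in for the non-specialized declaration.

```cpp
#include <vector>

// Hypothetical device tags (stand-ins for NV, ARM, ...).
struct DeviceA {};
struct DeviceB {};

// Generic (non-specialized) version: compiles for every device,
// but does nothing. This mirrors the "empty implementation" case.
template <typename Device>
struct DemoImpl {
    static void run(std::vector<float>& data, float alpha) {
        (void)data;   // intentionally left untouched
        (void)alpha;
    }
};

// Specialization: only DeviceA has a real implementation.
template <>
struct DemoImpl<DeviceA> {
    static void run(std::vector<float>& data, float alpha) {
        for (float& v : data) {
            v *= alpha;
        }
    }
};

// Runs the op on a copy of the data and returns the result.
template <typename Device>
std::vector<float> run_demo(std::vector<float> data, float alpha) {
    DemoImpl<Device>::run(data, alpha);
    return data;
}
```

Calling `run_demo<DeviceB>` builds and runs without any diagnostic, yet the data comes back unchanged. When a new op seems to have no effect on one device, a missing specialization is worth checking first.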
--- a/source/advanced_usage/deploy/how_to_support_new_device_in_anakin.md
+++ b/source/advanced_usage/deploy/how_to_support_new_device_in_anakin.md
-../../../anakin/docs/Manual/addCustomDevice.md
+# How to support a new device
\ No newline at end of file
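+## Overview
+Adding a new device takes the following 3 steps:
+* [Add device support in `CMakeLists.txt`](#0001)
+* [Add the device implementation in `saber`](#0002)
+* [Add device specializations or instantiations in `framework`](#0003)
+Assume the new device is named `TNEW`; the rest of this document uses that name.
+## <span id = '0001'> Adding device support in `CMakeLists.txt` </span> ##
+* Modify the root `CMakeLists.txt`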
+```cmake
+# select the platform to build
+anakin_option(USE_GPU_PLACE "Select the build mode for GPU place." NO)
+anakin_option(USE_X86_PLACE "Select the build mode for X86 place." NO)
+anakin_option(USE_ARM_PLACE "Select the build mode for ARM place." NO)
+anakin_option(USE_TNEW_PLACE "Select the build mode for TNEW place." YES)
+```
+* Modify `saber/CMakeLists.txt`
+Extend the `CMakeLists.txt` in the `saber` directory to cover the directories of the new device.
+```cmake
+if(USE_TNEW_PLACE)
+    anakin_fetch_files_with_suffix(${ANAKIN_SABER}/core/impl/tnew "cpp" ANAKIN_SABER_BASE_SRC)
+    anakin_fetch_files_with_suffix(${ANAKIN_SABER}/funcs/impl/tnew "cpp" ANAKIN_SABER_BASE_SRC)
+endif()
+```
+* Modify `test/CMakeLists.txt`
+Unit tests for the new device go in the `test/saber/tnew` directory; update the `CMakeLists.txt` in the `test` directory accordingly.
+```cmake
+if(USE_TNEW_PLACE)
+    anakin_fetch_files_with_suffix(${ANAKIN_UNIT_TEST}/saber/tnew "cpp" ANAKIN_TEST_CASE_SRC)
+endif()
+```
+* Modify `cmake/anakin_config.h.in`
+```c++
+// platform to use
+#cmakedefine USE_GPU_PLACE
+#cmakedefine USE_X86_PLACE
+#cmakedefine USE_ARM_PLACE
+#cmakedefine USE_TNEW_PLACE
+```
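+* Other dependencies and build options    
+Modify `compiler_options.cmake` and `find_modules.cmake` in the `cmake` directory.
+## <span id = '0002'> Adding the device implementation in `saber` </span> ##
+`saber` is the basic compute library of `Anakin`. It exposes a unified, device-independent API, and all device-specific implementations are wrapped in `TargetWrapper`.
+### Adding the device in `saber/saber_types.h`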
+```c++
+enum TargetTypeEnum {
+    eINVALID = -1,
+    eNV = 1,
+    eAMD = 2,
+    eARM = 3,
+    eX86 = 4,
+    eNVHX86 = 5,
+    eTNEW = 6
+};
+typedef TargetType<eNV> NV;
+typedef TargetType<eARM> ARM;
+typedef TargetType<eAMD> AMD;
+typedef TargetType<eX86> X86;
+typedef TargetType<eTNEW> TNEW;
+```
+### Adding the device implementation in `saber/core`
+1. Add the new device in `target_traits.h`
+* Add the device type
+```c++
+struct __cuda_device{};
+struct __arm_device{};
+struct __amd_device{};
+struct __x86_device{};
+struct __tnew_device{};
+```
+* Specialize the `TargetTypeTraits` template
+```c++
+template <>
+struct TargetTypeTraits<TNEW> {
+    typedef __xxx_target target_category; // choose according to whether the device is host-side or device-side
+    typedef __tnew_device target_type;
+};
+```
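The enum-to-type mapping and the traits specialization work together: each enum value becomes a distinct C++ type, and the traits attach compile-time properties to that type. The following self-contained sketch shows the idea under simplified assumptions. All `Demo*` and `demo_*` names are hypothetical and only mimic the shape of `TargetType` and `TargetTypeTraits`; they are not the Anakin definitions.

```cpp
#include <type_traits>

// Hypothetical, simplified versions of the types shown above.
enum DemoTargetEnum { eDemoHost = 1, eDemoAccel = 2 };

template <DemoTargetEnum T>
struct DemoTarget {};  // one distinct C++ type per enum value

typedef DemoTarget<eDemoHost> DemoHost;
typedef DemoTarget<eDemoAccel> DemoAccel;

struct demo_host_target {};
struct demo_device_target {};

// Primary template is only declared: using an unregistered device
// fails at compile time instead of silently picking a default.
template <typename Target>
struct DemoTargetTraits;

template <>
struct DemoTargetTraits<DemoHost> {
    typedef demo_host_target target_category;
};

template <>
struct DemoTargetTraits<DemoAccel> {
    typedef demo_device_target target_category;
};

// Compile-time query: does this target live on the device side?
template <typename Target>
constexpr bool is_device_side() {
    return std::is_same<typename DemoTargetTraits<Target>::target_category,
                        demo_device_target>::value;
}
```

Generic code can then branch on `is_device_side<Target>()` or on `target_category` without knowing the concrete device, which is why a new device has to supply its own specialization.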
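+2. Specialize the `DataTrait` template class in `data_traits.h`
+If the device needs special data types, specialize `DataTrait` for it. For example, the OpenCL data types are implemented as follows: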
+```c++
+#ifdef USE_OPENCL
+struct ClMem{
+    ClMem(){
+        dmem = nullptr;
+        offset = 0;
+    }
+    ClMem(cl_mem* mem_in, int offset_in = 0) {
+        dmem = mem_in;
+        offset = offset_in;
+    }
+    ClMem(ClMem& right) {
+        dmem = right.dmem;
+        offset = right.offset;
+    }
+    ClMem& operator=(ClMem& right) {
+        this->dmem = right.dmem;
+        this->offset = right.offset;
+        return *this;
+    }
+    ClMem& operator+(int offset_in) {
+        this->offset += offset_in;
+        return *this;
+    }
+    int offset{0};
+    cl_mem* dmem;
+};
+template <>
+struct DataTrait<AMD, AK_FLOAT> {
+    typedef ClMem Dtype;
+    typedef float dtype;
+};
+template <>
+struct DataTrait<AMD, AK_DOUBLE> {
+    typedef ClMem Dtype;
+    typedef double dtype;
+};
+template <>
+struct DataTrait<AMD, AK_INT8> {
+    typedef ClMem Dtype;
+    typedef char dtype;
+};
+#endif //use_opencl
+```
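Note that in the quoted `ClMem`, `operator+` updates the offset of the object it is called on and returns a reference to it. If a new device needs a similar "handle plus offset" type, a value-style variant that leaves the left operand unchanged may be easier to reason about. The sketch below is one possible shape, not the Anakin implementation: `DemoMem` is a hypothetical name, and `void*` replaces `cl_mem*` so that it builds without OpenCL headers.

```cpp
// Hypothetical stand-in for ClMem: an opaque buffer handle plus an offset.
struct DemoMem {
    void* handle;
    int offset;

    DemoMem() : handle(nullptr), offset(0) {}
    DemoMem(void* handle_in, int offset_in = 0)
        : handle(handle_in), offset(offset_in) {}

    // Returns a new view shifted by `delta`; the left operand is unchanged.
    DemoMem operator+(int delta) const {
        return DemoMem(handle, offset + delta);
    }
};
```

With this variant, `base + 8` yields a new view while `base.offset` keeps its value, so the same base handle can be offset several times independently.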
+3. Specialize the `TargetWrapper` template class in `target_wrapper.h`
+Specialize the `TargetWrapper` template class and declare its functions in `target_wrapper.h`, as follows:
+```c++
+template <>
+struct TargetWrapper<TNEW, __xxx_target> { // replace __xxx_target with __host_target or __device_target, depending on the kind of device TNEW is
+    typedef xxx_event event_t;          // implement xxx_event for the device
+    typedef xxx_stream stream_t;        // implement xxx_stream for the device
+    static void get_device_count(int& count);
+    static void set_device(int id);
+    //We should add strategy to avoid malloc directly
+    static void mem_alloc(void** ptr, size_t n);
+    static void mem_free(void* ptr);
+    static void mem_set(void* ptr, int value, size_t n);
+    static void create_event(event_t& event, bool flag = false);
+    static void create_stream(stream_t& stream);
+    static void create_stream_with_flag(stream_t& stream, unsigned int flag);
+    static void create_stream_with_priority(stream_t& stream, unsigned int flag, int priority);
+    static void destroy_stream(stream_t& stream);
+    static void destroy_event(event_t& event);
+    static void record_event(event_t& event, stream_t stream);
+    static void query_event(event_t& event);
+    static void sync_event(event_t& event);
+    static void sync_stream(event_t& event, stream_t& stream);
+    static void sync_memcpy(void* dst, int dst_id, const void* src, int src_id, \
+                            size_t count, __DtoD);
+    static void async_memcpy(void* dst, int dst_id, const void* src, int src_id, \
+                             size_t count, stream_t& stream, __DtoD);
+    static void sync_memcpy(void* dst, int dst_id, const void* src, int src_id, \
+                            size_t count, __HtoD);
+    static void async_memcpy(void* dst, int dst_id, const void* src, int src_id, \
+                             size_t count, stream_t& stream, __HtoD);
+    static void sync_memcpy(void* dst, int dst_id, const void* src, int src_id, \
+                            size_t count, __DtoH);
+    static void async_memcpy(void* dst, int dst_id, const void* src, int src_id, \
+                             size_t count, stream_t& stream, __DtoH);
+    static void sync_memcpy_p2p(void* dst, int dst_dev, const void* src, \
+                                int src_dev, size_t count);
+    static void async_memcpy_p2p(void* dst, int dst_dev, const void* src, \
+                                 int src_dev, size_t count, stream_t& stream);
+    static int get_device_id();
+};
+```
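+4. Add the device directory and implementation under `impl/`
+Add the device directory `tnew` under `saber/core/impl`.
+* Define the functions of the `TargetWrapper<TNEW, __xxx_target>` struct.    
+If the implementation of `TargetWrapper<TNEW, __xxx_target>` is the same as the default template class, this class does not need to be specialized.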
+```c++
+typedef TargetWrapper<TNEW, __xxx_target> TNEW_API;
+void TNEW_API::get_device_count(int &count) {
+    // add implementation
+}
+void TNEW_API::set_device(int id){
+    // add implementation
+}
+void TNEW_API::mem_alloc(void** ptr, size_t n){
+    // add implementation
+}
+void TNEW_API::mem_free(void* ptr){
+    if(ptr != nullptr){
+        // add implementation
+    }
+}
+...
+```
+* Specialize `Device<TNEW>` from `device.h`
+```c++
+template <>
+void Device<TNEW>::create_stream() {
+    // add implementation
+}
+template <>
+void Device<TNEW>::get_info() {
+    // add implementation
+}
+```
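To make the `// add implementation` placeholders above more concrete, here is a sketch of what the memory-related functions could look like for a host-like device that uses ordinary system memory. `DemoHostAPI` is a hypothetical name and the bodies are assumptions. A real device would call its own runtime here, and the copy function omits the `__DtoD`/`__HtoD`/`__DtoH` tag parameter for brevity.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical host-side wrapper; the function names follow the
// declarations above, the bodies are assumptions for a CPU-like device.
struct DemoHostAPI {
    static void get_device_count(int& count) { count = 1; }

    static void mem_alloc(void** ptr, std::size_t n) {
        *ptr = std::malloc(n);
    }

    static void mem_free(void* ptr) {
        if (ptr != nullptr) {
            std::free(ptr);
        }
    }

    static void mem_set(void* ptr, int value, std::size_t n) {
        std::memset(ptr, value, n);
    }

    // Synchronous copy; device ids are ignored on a single host.
    static void sync_memcpy(void* dst, int /*dst_id*/, const void* src,
                            int /*src_id*/, std::size_t count) {
        std::memcpy(dst, src, count);
    }
};
```

Keeping every function `static` matches the declarations above: the wrapper holds no state, so callers select an implementation purely through the template arguments.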
+### Implementing device-specific ops in `saber/funcs`
+See [How to add a new Operator](addCustomOp.md)
+## <span id = '0003'> Adding device specializations or instantiations in `framework` </span> ##
+### `framework/core`
+* Add the instantiations in `net.cpp`
+```c++
+#ifdef USE_TNEW_PLACE
+template class Net<TNEW, AK_FLOAT, Precision::FP32, OpRunType::ASYNC>;
+template class Net<TNEW, AK_FLOAT, Precision::FP32, OpRunType::SYNC>;
+#endif
+```
+* Add the instantiation in `operator_func.cpp`
+```c++
+#ifdef USE_TNEW_PLACE
+template class OperatorFunc<TNEW, AK_FLOAT, Precision::FP32>;
+#endif
+```
+* Add the instantiations in `worker.cpp`
+```c++
+#ifdef USE_TNEW_PLACE
+template class Worker<TNEW, AK_FLOAT, Precision::FP32, OpRunType::ASYNC>;
+template class Worker<TNEW, AK_FLOAT, Precision::FP32, OpRunType::SYNC>;
+#endif
+```
+* Add the instantiations in `operator_attr.cpp`
+```c++
+template
+OpAttrWarpper& OpAttrWarpper::__alias__<TNEW, AK_FLOAT, Precision::FP32>(const std::string& op_name);
+template
+OpAttrWarpper& OpAttrWarpper::__alias__<TNEW, AK_FLOAT, Precision::FP16>(const std::string& op_name);
+template
+OpAttrWarpper& OpAttrWarpper::__alias__<TNEW, AK_FLOAT, Precision::INT8>(const std::string& op_name);
+```
+* Add the device implementation in `parameter.h`
+```c++
+#ifdef USE_TNEW_PLACE
+template<typename Dtype>
+class PBlock<Dtype, TNEW> {
+public:
+	typedef Tensor4d<TNEW, DataTypeRecover<Dtype>::type> type;
+	PBlock() {
+		_inner_tensor = std::make_shared<type>(); 
+	}
+	...
+};
+#endif //TNEW
+```
+* Add the device implementation in `type_traits_extend.h`
+```c++
+template<>
+struct target_host<saber::TNEW> {
+    typedef saber::X86 type; // choose the correct host type for TNEW
+};
+```
+### `framework/graph`
+* Add the instantiations in `graph.cpp`
+```c++
+  #ifdef USE_TNEW_PLACE
+  template class Graph<TNEW, AK_FLOAT, Precision::FP32>;
+  template class Graph<TNEW, AK_FLOAT, Precision::FP16>;
+  template class Graph<TNEW, AK_FLOAT, Precision::INT8>;
+  #endif
+```
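The `template class ...;` lines in these files are explicit instantiation definitions. They are needed because the member functions of these class templates are defined in `.cpp` files, so other translation units see only the declarations. The sketch below shows the mechanism in a single file. `DemoGraph`, `DemoName`, the device tags, and the `DEMO_USE_TNEW_PLACE` macro are hypothetical names chosen for illustration.

```cpp
#include <string>

// Hypothetical device tags.
struct DemoCpu {};
struct DemoTnew {};

// Per-device name, supplied through a small trait.
template <typename Target> struct DemoName;
template <> struct DemoName<DemoCpu> {
    static const char* value() { return "cpu"; }
};
template <> struct DemoName<DemoTnew> {
    static const char* value() { return "tnew"; }
};

// Class template whose member function is defined out of line,
// as it would be in a .cpp file.
template <typename Target>
class DemoGraph {
public:
    explicit DemoGraph(const std::string& name) : name_(name) {}
    std::string describe() const;
private:
    std::string name_;
};

template <typename Target>
std::string DemoGraph<Target>::describe() const {
    return name_ + "@" + DemoName<Target>::value();
}

// Explicit instantiation definitions: they emit the code for these
// template arguments in this translation unit, so that code which only
// sees the class declaration can still link against it.
template class DemoGraph<DemoCpu>;
#ifdef DEMO_USE_TNEW_PLACE
template class DemoGraph<DemoTnew>;
#endif
```

If the guarded instantiation is missing for a device, code that uses the template for that device from another file typically fails at link time with undefined references. That is why each file listed in this section needs a `TNEW` entry.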
+### `framework/model_parser`
+* Add the instantiations in `parser.cpp`
+```c++
+  #ifdef USE_TNEW_PLACE
+  template
+  Status load<TNEW, AK_FLOAT, Precision::FP32>(graph::Graph<TNEW, AK_FLOAT, Precision::FP32>* graph,
+          const char* model_path);
+  template
+  Status load<TNEW, AK_FLOAT, Precision::FP16>(graph::Graph<TNEW, AK_FLOAT, Precision::FP16>* graph,
+          const char* model_path);
+  template
+  Status load<TNEW, AK_FLOAT, Precision::INT8>(graph::Graph<TNEW, AK_FLOAT, Precision::INT8>* graph,
+          const char* model_path);
+  template
+  Status save<TNEW, AK_FLOAT, Precision::FP32>(graph::Graph<TNEW, AK_FLOAT, Precision::FP32>* graph,
+          std::string& model_path);
+  template
+  Status save<TNEW, AK_FLOAT, Precision::FP16>(graph::Graph<TNEW, AK_FLOAT, Precision::FP16>* graph,
+          std::string& model_path);
+  template
+  Status save<TNEW, AK_FLOAT, Precision::INT8>(graph::Graph<TNEW, AK_FLOAT, Precision::INT8>* graph,
+          std::string& model_path);
+  template
+  Status load<TNEW, AK_FLOAT, Precision::FP32>(graph::Graph<TNEW, AK_FLOAT, Precision::FP32>* graph,
+          std::string& model_path);
+  template
+  Status load<TNEW, AK_FLOAT, Precision::FP16>(graph::Graph<TNEW, AK_FLOAT, Precision::FP16>* graph,
+          std::string& model_path);
+  template
+  Status load<TNEW, AK_FLOAT, Precision::INT8>(graph::Graph<TNEW, AK_FLOAT, Precision::INT8>* graph,
+          std::string& model_path);
+  template
+  Status save<TNEW, AK_FLOAT, Precision::FP32>(graph::Graph<TNEW, AK_FLOAT, Precision::FP32>* graph,
+          const char* model_path);
+  template
+  Status save<TNEW, AK_FLOAT, Precision::FP16>(graph::Graph<TNEW, AK_FLOAT, Precision::FP16>* graph,
+          const char* model_path);
+  template
+  Status save<TNEW, AK_FLOAT, Precision::INT8>(graph::Graph<TNEW, AK_FLOAT, Precision::INT8>* graph,
+          const char* model_path);
+  #endif
+```
+* Add the instantiations in `model_io.cpp`
+```c++
+#ifdef USE_TNEW_PLACE
+template class NodeIO<TNEW, AK_FLOAT, Precision::FP32>;
+template class NodeIO<TNEW, AK_FLOAT, Precision::FP16>;
+template class NodeIO<TNEW, AK_FLOAT, Precision::INT8>;
+#endif
+```
+### `framework/operators`
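+Add instantiations or specializations for every op under the `framework/operators` directory.
+Taking `activation.cpp` as an example, the instantiation looks like this: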
+```c++
+#ifdef USE_TNEW_PLACE
+INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::FP32);
+INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::FP16);
+INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::INT8);
+template class ActivationHelper<TNEW, AK_FLOAT, Precision::FP32>;
+ANAKIN_REGISTER_OP_HELPER(Activation, ActivationHelper, TNEW, AK_FLOAT, Precision::FP32);
+#endif
+```
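+If a function's implementation for the TNEW device differs from the existing template implementation, specialize it as follows (taking `Init()` as an example):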
+```c++
+#ifdef USE_TNEW_PLACE
+INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::FP32);
+INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::FP16);
+INSTANCE_ACTIVATION(TNEW, AK_FLOAT, Precision::INT8);
+template <>
+Status ActivationHelper<TNEW, AK_FLOAT, Precision::FP32>::Init(OpContext<TNEW> &ctx,\
+        const std::vector<Tensor4dPtr<TNEW, AK_FLOAT> >& ins, \
+                std::vector<Tensor4dPtr<TNEW, AK_FLOAT> >& outs) {
+    SABER_CHECK(_funcs_activation.init(ins, outs, _param_activation, SPECIFY, SABER_IMPL, ctx)); // choose the implementation here
+    return Status::OK();
+}
+ANAKIN_REGISTER_OP_HELPER(Activation, ActivationHelper, TNEW, AK_FLOAT, Precision::FP32);
+#endif
+```
+Add the TNEW registration to `ANAKIN_REGISTER_OP(Activation)`:
+```c++
+#ifdef USE_TNEW_PLACE
+.__alias__<TNEW, AK_FLOAT, Precision::FP32>("activation")
+#endif
+```
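The registration reads as a chain of calls because each setter returns a reference to the object being configured, which lets device-specific lines be switched on or off with `#ifdef` in the middle of the chain. The sketch below imitates that pattern with hypothetical names (`DemoOpAttr`, `make_activation_attr`). It is not the Anakin registry, and a run-time flag stands in for the `USE_TNEW_PLACE` guard.

```cpp
#include <map>
#include <string>

// Hypothetical op description filled in by chained setters.
class DemoOpAttr {
public:
    DemoOpAttr& doc(const std::string& text) { doc_ = text; return *this; }
    DemoOpAttr& alias(const std::string& device, const std::string& name) {
        aliases_[device] = name;
        return *this;
    }
    DemoOpAttr& num_in(int n) { num_in_ = n; return *this; }
    DemoOpAttr& num_out(int n) { num_out_ = n; return *this; }

    bool supports(const std::string& device) const {
        return aliases_.count(device) != 0;
    }
    int inputs() const { return num_in_; }
    int outputs() const { return num_out_; }

private:
    std::string doc_;
    std::map<std::string, std::string> aliases_;
    int num_in_ = 0;
    int num_out_ = 0;
};

// Builds the description the way a registration chain would.
inline DemoOpAttr make_activation_attr(bool with_tnew) {
    DemoOpAttr attr;
    attr.doc("Activation operator").alias("NV", "activation");
    if (with_tnew) {
        attr.alias("TNEW", "activation");  // the line guarded by the device macro
    }
    attr.num_in(1).num_out(1);
    return attr;
}
```

In this sketch, an op built without the TNEW alias simply reports `supports("TNEW") == false`, which corresponds to forgetting the guarded `__alias__` line for the new device.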
+## Notes
+Do not modify the interfaces or implementations of the member functions of the `Tensor`, `Buffer`, `Env`, and `Context` classes.
--- a/source/advanced_usage/deploy/index_native.rst
+++ b/source/advanced_usage/deploy/index_native.rst
@@ -5,4 +5,4 @@
    :maxdepth: 2
    build_and_install_lib_cn.rst
-    native_inference_engine.rst
+    native_infer.rst
--- a/source/advanced_usage/deploy/install_anakin.md
+++ b/source/advanced_usage/deploy/install_anakin.md
-../../../anakin/docs/Manual/INSTALL_ch.md
+## Building and installing Anakin from source ##
\ No newline at end of file
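+We have successfully installed and tested Anakin on CentOS 7.3. Support for other operating systems is coming soon.
+### Installation overview ###
+* [Installing Anakin on CentOS]()
+* [Installing Anakin on Ubuntu]()
+* [Installing Anakin on ARM](run_on_arm_ch.md)
+* [Verifying the installation]()
+### Installing Anakin on CentOS ###
+#### 1. System requirements ####
+*  make 3.82+
+*  cmake 2.8.12+
+*  gcc 4.8.2+
+*  g++ 4.8.2+
+*  Other requirements to be added...
+#### 2. Building the CPU version of Anakin ####
+Not supported yet.
+#### 3. Building Anakin with NVIDIA GPU support ####
+- 3.1. Install dependencies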
+  - 3.1.1 protobuf  
+    >$ git clone https://github.com/google/protobuf  
+    >$ cd protobuf  
+    >$ git submodule update --init --recursive  
+    >$ ./autogen.sh  
+    >$ ./configure --prefix=/path/to/your/install_dir  
+    >$ make  
+    >$ make check  
+    >$ make install  
+    >$ sudo ldconfig
+    If you run into any problem while installing protobuf, see [here](https://github.com/google/protobuf/blob/master/src/README.md)
+- 3.2 CUDA Toolkit
+  - [CUDA 8.0](https://developer.nvidia.com/cuda-zone) or higher. For details, see [NVIDIA's documentation](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
+  - [cuDNN v7](https://developer.nvidia.com/cudnn). For details, see [NVIDIA's documentation](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/). 
+- 3.3  Build Anakin
+  >$ git clone https:/xxxxx  
+  >$ cd anakin  
+  >$ mkdir build  
+  >$ cmake ..  
+  >$ make
+#### 4. Building Anakin with AMD GPU support ####
+Not supported yet.
+### Installing Anakin on Ubuntu ###
+Not supported yet.
+### Installing Anakin on ARM ###
+Not supported yet.
+### Verifying the installation ###
+Coming soon...
--- a/source/advanced_usage/deploy/mobile_build.md
+++ b/source/advanced_usage/deploy/mobile_build.md
-../../../mobile/doc/build.md
+# Environment setup
\ No newline at end of file
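+## Using docker
+### 1. Install docker
+To install docker, follow the official documentation: [https://docs.docker.com/install/](https://docs.docker.com/install/)
+### 2. Set up the build environment with docker
+First change into the paddle-mobile directory and run `docker build`.
+The example below is for Linux/Mac (on Windows, running it in 'Docker Quickstart Terminal' is recommended).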
+```
+$ docker build -t paddle-mobile:dev - < Dockerfile
+```
+Use `docker images` to see the newly created image:
+```
+$ docker images
+REPOSITORY      TAG     IMAGE ID       CREATED         SIZE
+paddle-mobile   dev     33b146787711   45 hours ago    372MB
+```
+### 3. Build with docker
+Change into the paddle-mobile directory and run `docker run`:
+```
+$ docker run -it --mount type=bind,source=$PWD,target=/paddle-mobile paddle-mobile:dev
+root@5affd29d4fc5:/ # cd /paddle-mobile
+# generate the Makefile for the Android build
+root@5affd29d4fc5:/ # rm CMakeCache.txt
+root@5affd29d4fc5:/ # cmake -DCMAKE_TOOLCHAIN_FILE=tools/toolchains/arm-android-neon.cmake
+# generate the Makefile for the Linux build
+root@5affd29d4fc5:/ # rm CMakeCache.txt
+root@5affd29d4fc5:/ # cmake -DCMAKE_TOOLCHAIN_FILE=tools/toolchains/arm-linux-gnueabi.cmake
+```
+### 4. Set the build options
+Build options can be set with ccmake:
+```
+root@5affd29d4fc5:/ # ccmake .
+                                                     Page 1 of 1
+ CMAKE_ASM_FLAGS
+ CMAKE_ASM_FLAGS_DEBUG
+ CMAKE_ASM_FLAGS_RELEASE
+ CMAKE_BUILD_TYPE
+ CMAKE_INSTALL_PREFIX             /usr/local
+ CMAKE_TOOLCHAIN_FILE             /paddle-mobile/tools/toolchains/arm-android-neon.cmake
+ CPU                              ON
+ DEBUGING                         ON
+ FPGA                             OFF
+ LOG_PROFILE                      ON
+ MALI_GPU                         OFF
+ NET                              googlenet
+ USE_EXCEPTION                    ON
+ USE_OPENMP                       OFF
+```
+After changing the options, press `c` and then `g` to regenerate the Makefile.
+### 5. Build
+Build with the make command:
+```
+root@5affd29d4fc5:/ # make
+```
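+### 6. Inspect the build output
+The build output can be inspected on the host machine, under build and test/build in the paddle-mobile directory. Use adb or scp to transfer it to the device and run it there.
+## Without docker
+Without docker, you can generate the makefile directly with cmake and then build. Building Android applications with the NDK requires NDK_ROOT to be set correctly. Building Linux applications requires arm-linux-gnueabi-gcc or a similar cross-compilation toolchain; you may need to set the CC and CXX environment variables, modify arm-linux-gnueabi.cmake in tools/toolchains/, or add your own toolchain file.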
--- a/source/advanced_usage/deploy/mobile_dev.md
+++ b/source/advanced_usage/deploy/mobile_dev.md
-../../../mobile/doc/development_doc.md
+# iOS development guide
\ No newline at end of file
+## Building
+### 1. Build with build.sh
+```sh 
+sh build.sh ios
+# to build only the ops of one specific model, run the following command instead
+sh build.sh ios googlenet
+# the generated .a library can be found in this folder
+cd ../build/release/ios/build
+```
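+### 2. Build with xcode
+We also provide an xcode build environment, which is more familiar to iOS developers:
+open PaddleMobile.xcworkspace in the ios/ directory to build PaddleMobile or run the Demo.
+### 3. Integration
+#### Using the c++ interface
+Drag the following files into your project:
+```
+libpaddle-mobile.a 
+io.h  
+program.h 
+types.h 
+lod_tensor.h 
+tensor.h
+```
+io.h is the interface header; its API comments can be viewed on [github](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/src/io/io.h).
+#### Using the oc interface
+Drag the following files, generated by the xcode build, into your project:
+```
+libPaddleMobile.a 
+PaddleMobile.h
+```
+The interface is as follows: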
+```
+/*
+	Create the singleton object
+*/
++ (instancetype)sharedInstance;
+/*
+	Load the model and allocate memory
+*/
+- (BOOL)load:(NSString *)modelPath andWeightsPath:(NSString *)weighsPath;
+/*
+	Run prediction. means and scale are the preprocessing parameters used when training the model; if no such preprocessing was applied during training, use predict: directly
+*/
+- (NSArray *)predict:(CGImageRef)image means:(NSArray<NSNumber *> *)means scale:(float)scale;
+/*
+	Run prediction
+*/
+- (NSArray *)predict:(CGImageRef)image;
+/*
+	Release memory
+*/
+- (void)clear;
+```
--- a/source/advanced_usage/deploy/native_infer.rst
+++ b/source/advanced_usage/deploy/native_infer.rst
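+Paddle Inference API
+====================
+To make inference deployment simpler and more convenient, Fluid provides a set of high-level APIs
+that hide the different underlying optimized implementations.
+`Inference library code <https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/contrib/inference>`__
+includes
+-  the header file ``paddle_inference_api.h``, which defines all the interfaces
+-  the library file ``libpaddle_fluid.so`` or ``libpaddle_fluid.a``
+-  the library file ``libpaddle_inference_api.so`` or
+   ``libpaddle_inference_api.a``
+For building and dependencies, see :ref:`install_or_build_cpp_inference_lib`.
+Some API concepts are introduced below.
+PaddleTensor
+------------
+PaddleTensor defines the basic input and output data format for inference. Its definition is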
+.. code:: cpp
+    struct PaddleTensor {
+      std::string name;  // variable name.
+      std::vector<int> shape;
+      PaddleBuf data;  // blob of data.
+      PaddleDType dtype;
+    };
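+-  ``name`` specifies the name of the model variable that the input data corresponds to
+   (not used yet, but it will be enabled once arbitrary targets are supported)
+-  ``shape`` is the shape of the Tensor
+-  ``data`` is stored in contiguous memory inside a ``PaddleBuf``;
+   a ``PaddleBuf`` can wrap externally owned data or ``malloc`` its own memory.
+   See the related definitions in the header file for details.
+-  ``dtype`` is the data type of the Tensor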
+engine
+------
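+The high-level API is backed by several optimized implementations, which we call engines. There are currently three:
+-  the native engine, built from Paddle's native forward operators;
+   it naturally supports every model trained with Paddle,
+-  the Anakin engine, which wraps
+   `Anakin <https://github.com/PaddlePaddle/Anakin>`__;
+   it performs well on some models, but it only accepts its own model format and cannot support every Paddle
+   model,
+-  the TensorRT mixed engine, which supports
+   `TensorRT <https://developer.nvidia.com/tensorrt>`__ through subgraphs; it supports every Paddle
+   model and automatically offloads part of the computation subgraphs to TensorRT for acceleration (WIP)
+They are declared as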
+.. code:: cpp
+    enum class PaddleEngineKind {
+      kNative = 0,       // Use the native Fluid facility.
+      kAnakin,           // Use Anakin for inference.
+      kAutoMixedTensorRT // Automatically mixing TensorRT with the Fluid ops.
+    };
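+Inference deployment workflow
+-----------------------------
+The overall process consists of the following steps:
+1. Create a ``PaddlePredictor`` with a suitable configuration
+2. Create the input ``PaddleTensor`` and pass it to the ``PaddlePredictor``
+3. Fetch the output ``PaddleTensor`` and read the results
+The following walks through a simple model end to end, with some detail code omitted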
+.. code:: cpp
+    #include "paddle_inference_api.h"
+    // Create a config and adjust its settings
+    paddle::NativeConfig config;
+    config.model_dir = "xxx";
+    config.use_gpu = false;
+    // Create a native PaddlePredictor
+    auto predictor =
+          paddle::CreatePaddlePredictor<paddle::NativeConfig, paddle::PaddleEngineKind::kNative>(config);
+    // Create the input tensor
+    int64_t data[4] = {1, 2, 3, 4};
+    paddle::PaddleTensor tensor{.name = "",
+                                .shape = std::vector<int>({4, 1}),
+                                .data = paddle::PaddleBuf(data, sizeof(data)),
+                                .dtype = paddle::PaddleDType::INT64};
+    // Collect the inputs; `slots` is the input list passed to Run()
+    std::vector<paddle::PaddleTensor> slots(1, tensor);
+    // Create the output tensors; their memory can be reused
+    std::vector<paddle::PaddleTensor> outputs;
+    // Run inference
+    CHECK(predictor->Run(slots, &outputs));
+    // Fetch outputs ...
+When building, link against ``libpaddle_fluid.a/.so`` and
+``libpaddle_inference_api.a/.so``.
+Detailed code references
+------------------------
+-  `inference
+   demos <https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/contrib/inference/demo>`__
+-  `More complex single-threaded and multi-threaded examples <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/contrib/inference/test_paddle_inference_api_impl.cc>`__
--- a/source/advanced_usage/deploy/run_anakin_on_arm.md
+++ b/source/advanced_usage/deploy/run_anakin_on_arm.md
-../../../anakin/docs/Manual/run_on_arm_ch.md
+## Building Anakin from source ##
\ No newline at end of file
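+Anakin currently supports the ARM Android platform. It is cross-compiled with the Android NDK toolchain and has been built and tested on macOS and CentOS.
+### Installation overview ###
+* [System requirements](#0001)
+* [Installing third-party dependencies](#0002)
+* [Building Anakin from source](#0003)
+* [Verifying the installation](#0004)
+### <span id = '0001'> 1. System requirements </span> ###
+*  Host machine: linux, mac    
+*  cmake 3.8.2+    
+*  Android NDK r14; the Linux version can be [downloaded here](https://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip)
+### <span id = '0002'> 2. Installing third-party dependencies </span> ###
+- 2.1 protobuf3.4.0     
+   The source can be [downloaded here](https://github.com/google/protobuf/releases/tag/v3.4.0)    
+ - 2.1.1 Build protobuf for the host machine     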
+ ```bash
+   $ tar -xzf protobuf-3.4.0.tar.gz  
+   $ cd protobuf-3.4.0   
+   $ ./autogen.sh  
+   $ ./configure    
+   $ make  
+   $ make check   
+   $ make install
+   ```
+   After `make install` completes, the headers required by libprotobuf can be found in /usr/local/include/google. Copy the whole google folder to Anakin/third-party/arm-android/protobuf/.
+   If you run into problems, see [here](https://github.com/google/protobuf/blob/v3.4.0/src/README.md).
+   Then clean up the generated files.
+ ```bash
+   $ make distclean
+   ```
+ - 2.1.2 Cross-compile protobuf for Android `armeabi-v7a`. Make sure to set the ANDROID_NDK path and the values of ARCH_ABI and HOSTOSN.   
+ ```bash
+   $ export ANDROID_NDK=your_ndk_path 
+   $ ARCH_ABI="arm-linux-androideabi-4.9"
+   $ HOSTOSN="darwin-x86_64"
+   $ export SYSROOT=$ANDROID_NDK/platforms/android-9/arch-arm  
+   $ export PREBUILT=$ANDROID_NDK/toolchains/$ARCH_ABI
+   $ export LDFLAGS="--sysroot=$SYSROOT"
+   $ export LD="$ANDROID_NDK/toolchains/$ARCH_ABI/prebuilt/$HOSTOSN/arm-linux-androideabi/bin/ld $LDFLAGS"
+   $ export LIBS="-llog $ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/libs/armeabi-v7a/libgnustl_static.a"
+   $ export CPPFLAGS=""
+   $ export INCLUDES="-I$ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/include/ -I$ANDROID_NDK/platforms/android-9/arch-arm/usr/include/ -I$ANDROID_NDK/sources/cxx-stl/gnu-libstdc++/4.9/libs/armeabi-v7a/include/"
+   $ export CXXFLAGS="-march=armv7-a -mfloat-abi=softfp -DGOOGLE_PROTOBUF_NO_RTTI --sysroot=$SYSROOT"
+   $ export CCFLAGS="$CXXFLAGS"
+   $ export CXX="$PREBUILT/prebuilt/$HOSTOSN/bin/arm-linux-androideabi-g++ $CXXFLAGS"
+   $ export CC="$CXX"
+   $ export RANLIB="$ANDROID_NDK/toolchains/$ARCH_ABI/prebuilt/$HOSTOSN/bin/arm-linux-androideabi-ranlib"  
+   $ ./autogen.sh  
+   $ ./configure --host=arm-linux-androideabi --with-sysroot=$SYSROOT --enable-cross-compile --with-protoc=protoc --disable-shared CXX="$CXX" CC="$CC" LD="$LD"  
+   $ make
+  ```
+  This builds the *.a static libraries. To build *.so shared libraries instead, replace --disable-shared with --disable-static --enable-shared in the ./configure arguments.  
+  The generated files are in src/.libs/. Copy them to Anakin/third-party/arm-android/protobuf/lib.  
+  Update the `ARM_RPOTO_ROOT` path in [cmake](../../cmake/find_modules.cmake).        
+  ```cmake
+  set(ARM_RPOTO_ROOT "${CMAKE_SOURCE_DIR}/third-party/arm-android/protobuf")
+  ```
+- 2.2 opencv 2.4.3+(optional)    
+    Anakin uses opencv only in the examples.   
+    opencv for Android can be [downloaded here](https://opencv.org/releases.html)    
+    After unpacking, copy the library files in `3rdparty/libs/armeabi-v7a` to `libs/armeabi-v7a`.    
+    Search for `anakin_find_opencv` in [cmake](../../cmake/find_modules.cmake) 
+    and set `include_directories` and `LINK_DIRECTORIES` to the paths of your own installation.   
+    ```cmake
+    include_directories(${CMAKE_SOURCE_DIR}/third-party/arm-android/opencv/sdk/native/jni/include/)
+    LINK_DIRECTORIES(${CMAKE_SOURCE_DIR}/third-party/arm-android/opencv/sdk/native/libs/armeabi-v7a/)
+    ```
+### <span id = '0003'> 3. Building Anakin from source </span> ###
+#### Building the Android version
+   Clone the [source](https://github.com/PaddlePaddle/Anakin/tree/arm)
+```bash
+    cd your_dir
+    git clone https://github.com/PaddlePaddle/Anakin.git
+    cd Anakin
+    git fetch origin arm
+    git checkout arm
+  ```
+  Modify `android_build.sh`    
+- Set the NDK path    
+  ```bash
+    #modify "your_ndk_path" to your NDK path
+    export ANDROID_NDK=your_ndk_path
+  ```
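+- Set the ARM processor architecture     
+  For 32-bit ARM processors, set ANDROID_ABI to `armeabi-v7a with NEON`.  
+  For 64-bit ARM processors, ANDROID_ABI can be set to `armeabi-v7a with NEON` or `arm64-v8a`.        
+  Currently only `armeabi-v7a with NEON` is supported; `arm64-v8a` is still under development.      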
+  ```bash
+      -DANDROID_ABI="armeabi-v7a with NEON"
+  ```
+- Set the Android API level    
+  Set the API level according to the Android version, for example API Level 21 -> Android 5.0.1    
+  ```bash
+      -DANDROID_NATIVE_API_LEVEL=21
+  ```
+- Choose between a static and a shared library    
+  Set `BUILD_SHARED=NO` to build a static library    
+  Set `BUILD_SHARED=YES` to build a shared library    
+  ```bash
+      -DBUILD_SHARED=NO
+  ```
+- OpenMP multi-threading support    
+  Set `USE_OPENMP=YES` to enable OpenMP multi-threading    
+  ```bash
+      -DUSE_OPENMP=YES
+  ```
+- Build the unit tests    
+  Setting `BUILD_WITH_UNIT_TEST=YES` builds the unit tests    
+    ```bash
+        -DBUILD_WITH_UNIT_TEST=YES
+    ```
+- Build the examples    
+  Setting `BUILD_EXAMPLES=YES` builds the examples    
+    ```bash
+        -DBUILD_EXAMPLES=YES
+    ```
+- Enable opencv    
+  To use opencv, set `USE_OPENCV=YES`    
+    ```bash
+        -DUSE_OPENCV=YES
+    ```
+- Start the build    
+  Running the `android_build.sh` script builds Anakin automatically     
+  ```bash
+      ./android_build.sh
+  ```
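+### <span id = '0004'> 4. Verifying the installation </span> ###    
+  The built libraries are placed in `${Anakin_root}/output`;    
+  the built unit tests are placed in `${Anakin_root}/output/unit_test`;    
+  the built examples are placed in `${Anakin_root}/output/examples`.
+  On Android, enable debugging mode on the device. The directory accessible through ADB is `data/local/tmp`. Use ADB push to send the test binaries, models, and data to that directory on the device, then run the tests.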
--- a/source/advanced_usage/development/gpu_profiling_cn.rst
+++ b/source/advanced_usage/development/gpu_profiling_cn.rst
-../../../paddle/doc/v2/howto/optimization/gpu_profiling_cn.rst
+======================
\ No newline at end of file
+GPU Performance Tuning
+======================
+..  contents::
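+This tutorial walks you step by step through profiling and tuning with the built-in timer, **nvprof**, or **nvvp**.
+- What is profiling?
+- Why is profiling needed?
+- How to profile?
+- Profiling tools
+- Detailed tutorial
+- Profiling tips
+What is profiling?
+==================
+In software engineering, profiling is a form of dynamic program analysis. It can measure the space (memory) or time complexity of a program,
+the usage of particular instructions, or the frequency and duration of function calls. The results are usually used to guide optimization.
+In short, a profiler quantifies an application's performance, and it is an indispensable tool for understanding how a program behaves. A simple analysis tells you how long an operation took; a deeper one can even explain why it took that long.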
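+Why is profiling needed?
+========================
+Training a deep neural network usually takes a very long time, so performance has gradually become one of the most important metrics in deep learning.
+The first task in optimizing performance is to find out which steps slow down the whole run.
+If a part takes hardly any time, there is no hurry to optimize it!
+How to profile?
+===============
+To reach the best performance, you can follow these five steps:
+- Profile the code
+- Find the slow parts
+- Find out why they are slow
+- Rewrite them into a faster version
+- Profile the code again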
+A processor usually has two key performance limits: floating-point throughput and
+memory throughput. A GPU additionally needs a high degree of parallelism to reach its full potential,
+which is why GPUs can be so fast.
+Profiling tools
+===============
+For general GPU profiling, many tools are already available from NVIDIA and third parties.
+**nvprof** is NVIDIA's command-line profiler, and **nvvp** is the NVIDIA Visual Profiler with a GUI.
+This tutorial focuses on nvprof and nvvp.
+:code:`test_GpuProfiler` in the :code:`paddle/legacy/math/tests` directory is used to demonstrate
+the profilers above.
+.. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp
+   :language: c++
+   :lines: 137-151
+   :linenos:
+The snippet above contains two approaches; you can use either or both to profile the code of interest.
+1. :code:`REGISTER_TIMER_INFO` is a built-in timer wrapper that measures the time spent in CPU functions or CUDA kernels.
+2. :code:`REGISTER_GPU_PROFILER` is a general-purpose wrapper object around :code:`cudaProfilerStart` and :code:`cudaProfilerStop`; its implementation also prevents the program from crashing when the CPU-only version of PaddlePaddle invokes them.
+You will find more details in the following sections.
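+Detailed tutorial
+=================
+Built-in timer
+--------------
+To enable PaddlePaddle's built-in timer, first add :code:`REGISTER_TIMER_INFO` to the relevant code section.
+You can then use the :code:`printStatus` or :code:`printAllStatus` function to print the information to the screen.
+Here is a simple example:
+1. Add the :code:`REGISTER_TIMER_INFO` and :code:`printAllStatus` calls (see the highlighted lines).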
+    .. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp
+        :language: c++
+        :lines: 137-151
+        :emphasize-lines: 8-12,14
+        :linenos:
+2. Turn on **WITH_TIMER** in the cmake configuration and rebuild PaddlePaddle.
+    .. code-block:: bash
+        cmake .. -DWITH_TIMER=ON
+        make
+3. Run your code and inspect the result (see the highlighted lines).
+    .. code-block:: bash
+        :emphasize-lines: 1,12-15
+        > ./paddle/legacy/math/tests/test_GpuProfiler
+        I1117 11:13:42.313065 2522362816 Util.cpp:155] commandline: ./paddle/legacy/math/tests/test_GpuProfiler
+        I1117 11:13:42.845065 2522362816 Util.cpp:130] Calling runInitFunctions
+        I1117 11:13:42.845208 2522362816 Util.cpp:143] Call runInitFunctions done.
+        [==========] Running 1 test from 1 test case.
+        [----------] Global test environment set-up.
+        [----------] 1 test from Profiler
+        [ RUN      ] Profiler.BilinearFwdBwd
+        I1117 11:13:42.845310 2522362816 test_GpuProfiler.cpp:114] Enable GPU Profiler Stat: [testBilinearFwdBwd] "numSamples = 10, channels = 16, im
+        gSizeX = 64, imgSizeY = 64"
+        I1117 11:13:42.850154 2522362816 ThreadLocal.cpp:37] thread use undeterministic rand seed:20659751
+        I1117 11:13:42.981501 2522362816 Stat.cpp:130] ======= StatSet: [GlobalStatInfo] status ======
+        I1117 11:13:42.981539 2522362816 Stat.cpp:133] Stat=testBilinearFwdBwd     total=136.141    avg=136.141    max=136.141    min=136.141   count=1
+        I1117 11:13:42.981572 2522362816 Stat.cpp:141] ======= BarrierStatSet status ======
+        I1117 11:13:42.981575 2522362816 Stat.cpp:154] --------------------------------------------------
+        [       OK ] Profiler.BilinearFwdBwd (136 ms)
+        [----------] 1 test from Profiler (136 ms total)
+        [----------] Global test environment tear-down
+        [==========] 1 test from 1 test case ran. (136 ms total)
+        [  PASSED  ] 1 test.
+The nvprof Tool
+--------------------
+To use the command-line profiler **nvprof**, follow these steps:
+1. Add :code:`REGISTER_GPU_PROFILER` to the code (see the highlighted lines).
+    .. literalinclude:: ../../../../paddle/legacy/math/tests/test_GpuProfiler.cpp
+        :language: c++
+        :lines: 137-151
+        :emphasize-lines: 6-7
+        :linenos:
+2. Turn on **WITH_PROFILER** in the cmake configuration and rebuild PaddlePaddle.
+    .. code-block:: bash
+        cmake .. -DWITH_PROFILER=ON
+        make
+3. Profile the executable with **nvprof**.
+    .. code-block:: bash
+        nvprof  ./paddle/legacy/math/tests/test_GpuProfiler
+You will then get profiling results like the following:
+.. code-block:: bash
+    ==78544== Profiling application: ./paddle/legacy/math/tests/test_GpuProfiler
+    ==78544== Profiling result:
+    Time(%)     Time     Calls       Avg       Min       Max  Name
+    27.60%  9.6305ms         5  1.9261ms  3.4560us  6.4035ms  [CUDA memcpy HtoD]
+    26.07%  9.0957ms         1  9.0957ms  9.0957ms  9.0957ms  KeBilinearInterpBw
+    23.78%  8.2977ms         1  8.2977ms  8.2977ms  8.2977ms  KeBilinearInterpFw
+    22.55%  7.8661ms         2  3.9330ms  1.5798ms  6.2863ms  [CUDA memcpy DtoH]
+    ==78544== API calls:
+    Time(%)     Time     Calls       Avg       Min       Max  Name
+    46.85%  682.28ms         8  85.285ms  12.639us  682.03ms  cudaStreamCreateWithFlags
+    39.83%  580.00ms         4  145.00ms     302ns  550.27ms  cudaFree
+    9.82%   143.03ms         9  15.892ms  8.7090us  142.78ms  cudaStreamCreate
+    1.23%   17.983ms         7  2.5690ms  23.210us  6.4563ms  cudaMemcpy
+    1.23%   17.849ms         2  8.9247ms  8.4726ms  9.3768ms  cudaStreamSynchronize
+    0.66%   9.5969ms         7  1.3710ms  288.43us  2.4279ms  cudaHostAlloc
+    0.13%   1.9530ms        11  177.54us  7.6810us  591.06us  cudaMalloc
+    0.07%   1.0424ms         8  130.30us  1.6970us  453.72us  cudaGetDevice
+    0.04%   527.90us        40  13.197us     525ns  253.99us  cudaEventCreateWithFlags
+    0.03%   435.73us       348  1.2520us     124ns  42.704us  cuDeviceGetAttribute
+    0.03%   419.36us         1  419.36us  419.36us  419.36us  cudaGetDeviceCount
+    0.02%   260.75us         2  130.38us  129.32us  131.43us  cudaGetDeviceProperties
+    0.02%   222.32us         2  111.16us  106.94us  115.39us  cudaLaunch
+    0.01%   214.06us         4  53.514us  28.586us  77.655us  cuDeviceGetName
+    0.01%   115.45us         4  28.861us  9.8250us  44.526us  cuDeviceTotalMem
+    0.01%   83.988us         4  20.997us     578ns  77.760us  cudaSetDevice
+    0.00%   38.918us         1  38.918us  38.918us  38.918us  cudaEventCreate
+    0.00%   34.573us        31  1.1150us     279ns  12.784us  cudaDeviceGetAttribute
+    0.00%   17.767us         1  17.767us  17.767us  17.767us  cudaProfilerStart
+    0.00%   15.228us         2  7.6140us  3.5460us  11.682us  cudaConfigureCall
+    0.00%   14.536us         2  7.2680us  1.1490us  13.387us  cudaGetLastError
+    0.00%   8.6080us        26     331ns     173ns     783ns  cudaSetupArgument
+    0.00%   5.5470us         6     924ns     215ns  2.6780us  cuDeviceGet
+    0.00%   5.4090us         6     901ns     328ns  3.3320us  cuDeviceGetCount
+    0.00%   4.1770us         3  1.3920us  1.0630us  1.8300us  cuDriverGetVersion
+    0.00%   3.4650us         3  1.1550us  1.0810us  1.2680us  cuInit
+    0.00%      830ns         1     830ns     830ns     830ns  cudaRuntimeGetVersion
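The profiling result above is plain text, so it can be post-processed with a short script. The following is a minimal sketch, not part of PaddlePaddle or nvprof: the helper names `parse_rows` and `memcpy_share` are hypothetical, and the sample rows are copied from the "Profiling result" table above. It parses each row and sums the share of GPU time spent in memory copies.

```python
import re

# Rows copied from the "Profiling result" table above.
SAMPLE = """\
27.60%  9.6305ms         5  1.9261ms  3.4560us  6.4035ms  [CUDA memcpy HtoD]
26.07%  9.0957ms         1  9.0957ms  9.0957ms  9.0957ms  KeBilinearInterpBw
23.78%  8.2977ms         1  8.2977ms  8.2977ms  8.2977ms  KeBilinearInterpFw
22.55%  7.8661ms         2  3.9330ms  1.5798ms  6.2863ms  [CUDA memcpy DtoH]
"""

# Conversion factors from the unit suffixes used in the table to milliseconds.
UNIT_TO_MS = {"ns": 1e-6, "us": 1e-3, "ms": 1.0, "s": 1e3}

# Columns: Time(%), Time, Calls, Avg, Min, Max, Name.
ROW = re.compile(
    r"^\s*([\d.]+)%\s+([\d.]+)(ns|us|ms|s)\s+(\d+)\s+\S+\s+\S+\s+\S+\s+(.+?)\s*$"
)


def parse_rows(text):
    """Return one dict per table row; lines that are not rows are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        percent, value, unit, calls, name = match.groups()
        rows.append({
            "name": name,
            "percent": float(percent),
            "total_ms": float(value) * UNIT_TO_MS[unit],
            "calls": int(calls),
        })
    return rows


def memcpy_share(rows):
    """Sum the Time(%) column over all rows whose name mentions memcpy."""
    return sum(row["percent"] for row in rows if "memcpy" in row["name"])


rows = parse_rows(SAMPLE)
print(round(memcpy_share(rows), 2))
```

For the four rows shown, the two memory-copy entries account for roughly half of the measured GPU time (27.60% + 22.55%), so data transfer is worth examining alongside the two bilinear-interpolation kernels.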
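+The nvvp Tool
+--------------------
+To use the visual profiler **nvvp**, either import the output of :code:`nvprof -o ...` or run your application from within the tool's interface.
+**Note: nvvp also supports CPU profiling** (it must be enabled in the nvvp interface).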
+..  image:: nvvp1.png
+    :align: center
+    :scale: 33%
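+From the kernel's point of view, **nvvp** can pinpoint why a long-running operation takes so long.
+In addition, as the figure below shows, the kernel block usage, register usage, and shared-memory usage reported by **nvvp** give a better picture of how the GPU is being used overall.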
+..  image:: nvvp2.png
+    :align: center
+    :scale: 33%
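+From the application's point of view, **nvvp** offers suggestions that help locate performance bottlenecks.
+For example, the figures below show suggestions about memory data movement and compute-resource utilization, which give you a direction for tuning.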
+..  image:: nvvp3.png
+    :align: center
+    :scale: 33%
+..  image:: nvvp4.png
+    :align: center
+    :scale: 33%
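+Profiling Tips
+====================
+- To start with, the output of **nvprof** and **nvvp** is a good place to begin.
+- Next, consider analyzing the timeline.
+- Before digging deep into a particular kernel, first confirm that it accounts for a large enough share of the run time to be worth the effort.
+- Where possible, check the profiler's numbers against theoretical values.
+    1) For example, if a kernel takes 10 ms to move 1 GB of data, the profiler should report a throughput of about 100 GB/s.
+    2) If the numbers disagree, the application is most likely not running the way you expect.
+- Know your hardware: if your GPU peaks at a theoretical 6 TFLOPS (6 trillion floating-point operations per second) and you are already reaching 5.5 TFLOPS, there is little headroom left to gain.
+Profiling is a key step in performance optimization. Sometimes a very simple change yields a clear speedup.
+Of course, results vary from case to case.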
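The sanity check described in the tips (10 ms to move 1 GB should correspond to about 100 GB/s) can be written down as a small calculation. This is a minimal sketch under stated assumptions: the function names and the 10% tolerance are illustrative choices, not values taken from any profiler, and 1 GB is taken to be 10^9 bytes.

```python
def achieved_bandwidth_gb_per_s(bytes_moved, elapsed_ms):
    """Throughput in GB/s (1 GB = 1e9 bytes) for a transfer that took elapsed_ms."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return (bytes_moved / 1e9) / (elapsed_ms / 1e3)


def within_tolerance(measured, expected, rel_tol=0.1):
    """True if measured is within rel_tol (a fraction) of expected."""
    return abs(measured - expected) <= rel_tol * expected


# The example from the text: 1 GB moved in 10 ms.
expected = achieved_bandwidth_gb_per_s(1e9, 10.0)
print(expected)

# A hypothetical profiler reading of 62 GB/s would fail the check,
# which is the cue to look for unexpected behavior in the application.
print(within_tolerance(62.0, expected))
```

The same comparison works for compute: divide the operation count by the measured time and compare the result with the peak rate of the hardware.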
+References
+====================
+Jeremy Appleyard, `GPU Profiling for Deep Learning <http://www.robots.ox.ac.uk/~seminars/seminars/Extra/2015_10_08_JeremyAppleyard.pdf>`_, 2015
--- a/source/advanced_usage/pics/anakin_fm_ch.png
+++ b/source/advanced_usage/pics/anakin_fm_ch.png
--- a/source/beginners_guide/quick_start/fit_a_line/README.cn.md
+++ b/source/beginners_guide/quick_start/fit_a_line/README.cn.md
@@ -4,7 +4,7 @@
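 # Linear Regression
 Let us begin this tutorial with the classic Linear Regression \[[1](#参考文献)\] model. In this chapter, you will build a house-price prediction model from a real dataset and learn several important machine-learning concepts along the way.
-The source code for this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). First-time users should see the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书); for more, see the [video course](http://bit.baidu.com/course/detail/id/137.html) for this tutorial.
+The source code for this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). First-time users should see the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
 ## Background
 Given a dataset of size `$n$`, `${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$`, where `$x_{i1}, \ldots, x_{id}$` are the values of the `$d$` attributes of the `$i$`-th sample and `$y_i$` is the target to be predicted for that sample, the linear regression model assumes that the target `$y_i$` can be described by a linear combination of the attributes, i.e.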
@@ -52,22 +52,89 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
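 ### Dataset
 The dataset has 506 rows. Each row describes one type of house in the Boston suburbs together with the median price of that type of house. The attributes have the following meanings: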
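-| Attribute | Description | Type |
+<p align="center">
-| ------| ------ | ------ |
+<table>
-| CRIM | Per-capita crime rate by town | Continuous |
+    <thead>
-| ZN | Proportion of residential land zoned for lots over 25,000 sq. ft. | Continuous |
+    <tr>
-| INDUS | Proportion of non-retail business land | Continuous |
+        <th>Attribute</th>
-| CHAS | Whether the tract borders the Charles River  | Discrete, 1 = borders; 0 = does not border |
+        <th>Description</th>
-| NOX | Nitric oxide concentration | Continuous |
+        <th>Type</th>
-| RM | Average number of rooms per dwelling | Continuous |
+    </tr>
-| AGE | Proportion of owner-occupied units built before 1940 | Continuous |
+    </thead>
-| DIS | Weighted distance to five Boston employment centers | Continuous |
+    <tbody>
-| RAD | Index of accessibility to radial highways | Continuous |
+    <tr>
-| TAX | Full-value property tax rate | Continuous |
+        <td>CRIM</td>
-| PTRATIO | Pupil-teacher ratio | Continuous |
+        <td>Per-capita crime rate by town</td>
-| B | 1000(BK - 0.63)^2, where BK is the proportion of Black residents | Continuous |
+        <td>Continuous</td>
-| LSTAT | Proportion of lower-income population | Continuous |
+    </tr>
-| MEDV | Median price of homes of the same type | Continuous |
+    <tr>
+        <td>ZN</td>
+        <td>Proportion of residential land zoned for lots over 25,000 sq. ft.</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>INDUS</td>
+        <td>Proportion of non-retail business land</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>CHAS</td>
+        <td>Whether the tract borders the Charles River</td>
+        <td>Discrete, 1 = borders; 0 = does not border</td>
+    </tr>
+    <tr>
+        <td>NOX</td>
+        <td>Nitric oxide concentration</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>RM</td>
+        <td>Average number of rooms per dwelling</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>AGE</td>
+        <td>Proportion of owner-occupied units built before 1940</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>DIS</td>
+        <td>Weighted distance to five Boston employment centers</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>RAD</td>
+        <td>Index of accessibility to radial highways</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>TAX</td>
+        <td>Full-value property tax rate</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>PTRATIO</td>
+        <td>Pupil-teacher ratio</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>B</td>
+        <td>1000(BK - 0.63)^2, where BK is the proportion of Black residents</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>LSTAT</td>
+        <td>Proportion of lower-income population</td>
+        <td>Continuous</td>
+    </tr>
+    <tr>
+        <td>MEDV</td>
+        <td>Median price of homes of the same type</td>
+        <td>Continuous</td>
+    </tr>
+    </tbody>
+</table>
+</p>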
 ### Data Preprocessing
 #### Continuous and Discrete Values
@@ -260,4 +327,3 @@ print("infer results: ", results[0])
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
--- a/source/faq/faq.rst
+++ b/source/faq/faq.rst
-###
+########################################
-FAQ
+Build, Installation, and Unit Tests
-###
+########################################
+1. After a pip install, :code:`import paddle.fluid` cannot find :code:`libmkldnn.so` or :code:`libmklml_intel.so`
+------------------------------------------------------------------------------------------------------------------------
+This happens because importing :code:`paddle.fluid` needs to load :code:`libmkldnn.so` and :code:`libmklml_intel.so`,
+but the system cannot find these files. A pip install of PaddlePaddle normally copies :code:`libmkldnn.so` and :code:`libmklml_intel.so`
+to :code:`/usr/local/lib`, so the fix is to add that directory to the :code:`LD_LIBRARY_PATH` environment variable:
+:code:`export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH`.
+**Note**: if PaddlePaddle is installed in a virtual environment, :code:`libmkldnn.so` and :code:`libmklml_intel.so` may not be under :code:`/usr/local/lib`.
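Before changing the environment, it can help to confirm which of the two libraries is actually missing from the directories listed in `LD_LIBRARY_PATH`. The sketch below is an illustration only: `find_missing_libraries` is a hypothetical helper, not a PaddlePaddle API, and it inspects only the directories it is given, whereas the dynamic loader also consults other locations such as the system library cache.

```python
import os


def find_missing_libraries(lib_names, search_path):
    """Return the names in lib_names that exist in none of the directories
    listed in search_path (a string separated by os.pathsep)."""
    dirs = [d for d in search_path.split(os.pathsep) if d]
    missing = []
    for name in lib_names:
        if not any(os.path.isfile(os.path.join(d, name)) for d in dirs):
            missing.append(name)
    return missing


if __name__ == "__main__":
    wanted = ["libmkldnn.so", "libmklml_intel.so"]
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("not found via LD_LIBRARY_PATH:", find_missing_libraries(wanted, current))
```

If a library is reported as missing, locate it (for example inside the virtual environment) and add its directory to `LD_LIBRARY_PATH` as described above.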
--- a/source/faq/index_cn.rst
+++ b/source/faq/index_cn.rst
+FAQ
+====
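+This document answers some frequently asked questions about PaddlePaddle. If your question is not covered here, please look for an answer in the `PaddlePaddle community <https://github.com/PaddlePaddle/Paddle/issues>`_ or open an `issue <https://github.com/PaddlePaddle/Paddle/issues/new>`_ directly; we will reply promptly.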
+..  toctree::
+  :maxdepth: 1
+  faq.rst
--- a/source/user_guides/howto/training/checkpoint_doc_cn.md
+++ b/source/user_guides/howto/training/checkpoint_doc_cn.md
@@ -57,4 +57,4 @@ trainer = Trainer(..., checkpoint_config=config)
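 1. Make sure each training job's ```checkpoint_dir``` is independent of every other job's.
 2. Adjust the maximum number of retained copies, ```max_num_checkpoints```, to the disk capacity and the model size so that the disk stays usable.
 3. Do not set ```epoch_interval``` and ```step_interval``` too small; checkpointing too often slows training down.
 4. **During distributed training**, each Trainer saves its own parameters under the ```checkpoint_dir``` directory (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data from the same ```checkpoint_dir``` directory into a complete set, and training must be restored from the complete data.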
\ No newline at end of file
--- a/source/user_guides/howto/training/checkpoint_doc_en.md
+++ b/source/user_guides/howto/training/checkpoint_doc_en.md
@@ -59,4 +59,4 @@ After all the things done, the train will save checkpoint at the specified epoch
 1. Make sure each ```checkpoint_dir``` is used by only one training job.
 2. Adjust ```max_num_checkpoints``` according to the disk size and the model size.
 3. Checkpointing too often slows training down, so ```epoch_interval``` and ```step_interval``` should not be set too small.
 4. **In distributed training**, each Trainer saves its arguments in its ```checkpoint_dir``` (only Trainer 0 saves the model variables). A **distributed file system (HDFS, etc.)** is needed to merge all the ```checkpoint_dir``` data into the complete set.
\ No newline at end of file
--- a/source/user_guides/howto/training/save_load_variables.rst
+++ b/source/user_guides/howto/training/save_load_variables.rst
@@ -7,19 +7,19 @@
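 Categories of Model Variables
 ##############################
-In PaddlePaddle Fluid, all model variables are represented with :ref:`api_fluid_Variable` as the base class.
+In PaddlePaddle Fluid, all model variables are represented with :code:`fluid.Variable()` as the base class.
 Under this base class, model variables fall into the following main categories:
 1. Model parameters
  Model parameters are the variables that are trained and learned in a deep learning model. During training, the framework computes the current gradient of every model parameter with the backpropagation algorithm,
  and the optimizer updates the parameters according to those gradients. Training a model is, in essence, the process of iteratively updating its parameters.
  In PaddlePaddle Fluid, model parameters are represented by :code:`fluid.framework.Parameter`,
-  a subclass of :ref:`api_fluid_Variable`. In addition to the properties of :ref:`api_fluid_Variable`,
+  a subclass of :code:`fluid.Variable()`. In addition to the properties of :code:`fluid.Variable()`,
  :code:`fluid.framework.Parameter` can also configure its own initialization method, update rate, and other attributes.
 2. Persistable variables
  Persistable variables exist throughout the whole training process and are not destroyed when an iteration ends, for example a dynamically adjusted global learning rate.
-  In PaddlePaddle Fluid, a variable is marked as persistable by setting the :code:`persistable` attribute of its :ref:`api_fluid_Variable`
+  In PaddlePaddle Fluid, a variable is marked as persistable by setting the :code:`persistable` attribute of its :code:`fluid.Variable()`
  to :code:`True`. All model parameters are persistable variables, but not all persistable variables are model parameters.
 3. Temporary variables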
@@ -43,7 +43,7 @@
 ==========================
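 If the model is saved in order to run prediction on new samples, saving only the model parameters is enough. Use the
-:ref:`api_fluid_io_save_params` interface to save the model parameters.
+:code:`fluid.io.save_params()` interface to save the model parameters.
 For example: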
@@ -57,7 +57,7 @@
    fluid.io.save_params(executor=exe, dirname=param_path, main_program=None)
 In the example above, calling :code:`fluid.io.save_params` makes PaddlePaddle Fluid scan all model variables in the default
-:ref:`api_fluid_Program`, that is, :code:`prog`,
+:code:`fluid.Program`, that is, :code:`prog`,
 pick out all the model parameters, and save them to the specified :code:`param_path`.
@@ -66,7 +66,7 @@
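 During training, we may want to save the current training state at certain points
 so that the training environment can be restored later to continue training. This is commonly called a "checkpoint".
-To save a checkpoint, use the :ref:`api_fluid_io_save_checkpoint` interface.
+To save a checkpoint, use the :code:`fluid.io.save_checkpoint()` interface.
 For example: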
@@ -87,7 +87,7 @@
                                max_num_checkpoints=3)
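 In the example above, calling :code:`fluid.io.save_checkpoint` makes PaddlePaddle Fluid scan all model variables in the default
-:ref:`api_fluid_Program`, that is, :code:`prog`,
+:code:`fluid.Program`, that is, :code:`prog`,
 automatically select every variable that needs to be saved according to a set of built-in rules, and save them under the specified :code:`path` directory.
 Among the arguments of :code:`fluid.io.save_checkpoint`, :code:`trainer_id` can simply be set to 0 for single-machine training; :code:`trainer_args`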
@@ -125,8 +125,8 @@
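 Note in particular that the :code:`prog` here must match exactly the forward part of the :code:`prog` used in the call to :code:`fluid.io.save_params`,
 and must not contain any parameter-update operations. If the two differ,
 some variables may not be loaded correctly; if parameter-update operations are included by mistake, the parameters may be modified during normal prediction.
-The relationship between these two :ref:`api_fluid_Program` objects is similar to that between the training :ref:`api_fluid_Program`
+The relationship between these two :code:`fluid.Program` objects is similar to that between the training :code:`fluid.Program`
-and the test :ref:`api_fluid_Program`; for details, see :ref:`user_guide_test_while_training`.
+and the test :code:`fluid.Program`; for details, see :ref:`user_guide_test_while_training`.
 Also note that :code:`fluid.default_startup_program()` must be run before calling :code:`fluid.io.load_params`.
 If it is run afterwards, it may overwrite the loaded model parameters and cause errors.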

--- a/source/user_guides/howto/training/single_node.rst
+++ b/source/user_guides/howto/training/single_node.rst
@@ -8,8 +8,8 @@
 To run single-node training with PaddlePaddle Fluid, first complete :ref:`user_guide_prepare_data` and
 :ref:`user_guide_configure_simple_model`. Once
 :ref:`user_guide_configure_simple_model` is done, you have two
-:ref:`api_fluid_Program` objects, :code:`startup_program` and :code:`main_program`.
+:code:`fluid.Program` objects, :code:`startup_program` and :code:`main_program`.
-By default, :ref:`api_fluid_default_startup_program` and :ref:`api_fluid_default_main_program` return the global :ref:`api_fluid_Program` objects.
+By default, :code:`fluid.default_startup_program()` and :code:`fluid.default_main_program()` return the global :code:`fluid.Program` objects.
 For example:
@@ -44,8 +44,8 @@
 ==============
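 After the model is configured, the parameter-initialization operations are written into
-:code:`fluid.default_startup_program()`. Running this program with :ref:`api_fluid_Executor`
+:code:`fluid.default_startup_program()`. Running this program with :code:`fluid.Executor()`
-randomly initializes the parameters in the global :ref:`api_fluid_global_scope`. For example:
+randomly initializes the parameters in the global :code:`fluid.global_scope()`. For example: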
 .. code-block:: python
@@ -53,7 +53,7 @@
   exe.run(program=fluid.default_startup_program())
 Note: for multi-GPU training, the parameters must first be initialized on GPU0 and are then distributed to the other GPUs by
-:ref:`api_fluid_ParallelExecutor`.
+:code:`fluid.ParallelExecutor`.
 载入预定义参数
@@ -66,8 +66,8 @@
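 Single-GPU Training
 ####################
-To train on a single GPU, use the :code:`run()` method of :ref:`api_fluid_Executor` to run the training
+To train on a single GPU, use the :code:`run()` method of :code:`fluid.Executor()` to run the training
-:ref:`api_fluid_Program`. At run time, pass data in through the :code:`run(feed=...)`
+:code:`fluid.Program`. At run time, pass data in through the :code:`run(feed=...)`
 argument and retrieve persistable data through :code:`run(fetch_list=...)`. For example: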
 .. code-block:: python
@@ -86,14 +86,14 @@
   的Variable必须是persistable的。 :code:`fetch_list` 可以传入Variable的列表，\
   也可以传入Variable的名字列表。:code:`Executor.run` 返回Fetch结果列表。
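 3. If the data to fetch contains sequence information, set
-   :code:`exe.run(return_numpy=False, ...)` to return an :ref:`api_guide_lod_tensor` directly.
+   :code:`exe.run(return_numpy=False, ...)` to return a :code:`fluid.LoDTensor` directly.
-   You can then access the information in the :ref:`api_guide_lod_tensor` directly.
+   You can then access the information in the :code:`fluid.LoDTensor` directly.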
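 Multi-GPU Training
 ####################
-To train on multiple GPUs, use :ref:`api_fluid_ParallelExecutor` to run the training
+To train on multiple GPUs, use :code:`fluid.ParallelExecutor` to run the training
-:ref:`api_fluid_Program`. For example:
+:code:`fluid.Program`. For example: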
 .. code-block:: python
@@ -103,8 +103,8 @@
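 A few points to note:
-1. The :code:`ParallelExecutor` constructor must be given the :ref:`api_fluid_Program` to execute,
+1. The :code:`ParallelExecutor` constructor must be given the :code:`fluid.Program` to execute,
-   which cannot be modified during execution. The default is :ref:`api_fluid_default_main_program`.
+   which cannot be modified during execution. The default is :code:`fluid.default_main_program()`.
 2. :code:`ParallelExecutor` must be told explicitly whether to train on CUDA GPUs. In GPU training
    mode it occupies all GPUs. You can set `CUDA_VISIBLE_DEVICES <http://www.acceleware.com/blog/cudavisibledevices-masking-gpus>`_ to change which
    GPUs are used.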
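One way to limit which GPUs a training process occupies is to set `CUDA_VISIBLE_DEVICES` from Python before any CUDA work starts. The following is a minimal sketch: `restrict_visible_gpus` is a hypothetical helper, not a PaddlePaddle API, and the device indices are example values.

```python
import os


def restrict_visible_gpus(device_ids):
    """Expose only the given GPU indices to CUDA for this process."""
    value = ",".join(str(i) for i in device_ids)
    # The variable is read when the CUDA runtime initializes, so set it
    # before the training framework creates any executor on the GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Example: make only the first two GPUs visible to this process.
print(restrict_visible_gpus([0, 1]))
```

The same effect can be had from the shell by exporting the variable before launching the training script; inside the process, the visible devices are then renumbered starting from 0.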

--- a/source/user_guides/howto/training/test_while_training.rst
+++ b/source/user_guides/howto/training/test_while_training.rst
@@ -4,7 +4,7 @@
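 Evaluating the Model During Training
 ########################################
-The :ref:`api_fluid_Program` used for testing and evaluation differs from the one used for training. During evaluation:
+The :code:`fluid.Program` used for testing and evaluation differs from the one used for training. During evaluation:
 1. Evaluation performs no backpropagation and does not optimize or update parameters.
 2. Evaluation may execute different operations.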
@@ -13,13 +13,13 @@
   * The evaluation model can be completely different from the training model.
-Generating the Test :ref:`api_fluid_Program`
+Generating the Test :code:`fluid.Program`
 ##################################################
-Cloning the Training :ref:`api_fluid_Program` into a Test :ref:`api_fluid_Program`
+Cloning the Training :code:`fluid.Program` into a Test :code:`fluid.Program`
 ==========================================================================================
-The :code:`Program.clone()` method creates a copy of a :ref:`api_fluid_Program`. Setting
+The :code:`Program.clone()` method creates a copy of a :code:`fluid.Program`. Setting
 :code:`Program.clone(for_test=True)` produces a copy of the Program that contains the operations used for testing. A simple example:
 .. code-block:: python
@@ -45,11 +45,11 @@
 成一个 :code:`test_program` 。之后使用测试数据运行 :code:`test_program`,\
 就可以做到运行测试程序，而不影响训练结果。
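-Configuring the Training :ref:`api_fluid_Program` and the Test :ref:`api_fluid_Program` Separately
+Configuring the Training :code:`fluid.Program` and the Test :code:`fluid.Program` Separately
 ====================================================================================================
 If the training and test programs differ substantially, you can also define two entirely separate
-:ref:`api_fluid_Program` objects and use one for training and the other for testing. In PaddlePaddle Fluid,
+:code:`fluid.Program` objects and use one for training and the other for testing. In PaddlePaddle Fluid,
 every parameter has a name. If two different operations, or even two different networks, use parameters with the same name,
 their values and memory are shared.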
@@ -84,14 +84,14 @@ PaddlePaddle Fluid中使用 :code:`fluid.unique_name` 包来随机初始化用
   # fluid.default_main_program() is the train program
   # fluid.test_program is the test program
-Running the Test :ref:`api_fluid_Program`
+Running the Test :code:`fluid.Program`
 ##################################################
-Running the Test :ref:`api_fluid_Program` with :code:`Executor`
+Running the Test :code:`fluid.Program` with :code:`Executor`
 ======================================================================
 Use :code:`Executor.run(program=...)` to run the test
-:ref:`api_fluid_Program`.
+:code:`fluid.Program`.
 For example:
@@ -101,10 +101,10 @@ PaddlePaddle Fluid中使用 :code:`fluid.unique_name` 包来随机初始化用
   test_acc = exe.run(program=test_program, feed=test_data_batch, fetch_list=[acc])
   print 'Test accuracy is ', test_acc
-Running the Test :ref:`api_fluid_Program` with :code:`ParallelExecutor`
+Running the Test :code:`fluid.Program` with :code:`ParallelExecutor`
 ================================================================================
-Combine the training :code:`ParallelExecutor` with the test :ref:`api_fluid_Program`
+Combine the training :code:`ParallelExecutor` with the test :code:`fluid.Program`
 to create a test :code:`ParallelExecutor`, then call the test
 :code:`ParallelExecutor.run` to run the evaluation.