Merge branch 'PaddlePaddle:develop' into develop

c168938e · TeslaZhao · GitHub · 505aef76 · 3a670684 · c168938e
7 changed files
--- a/doc/C++Serving/Introduction_CN.md
+++ b/doc/C++Serving/Introduction_CN.md
@@ -41,7 +41,7 @@ Server端的核心是一个由项目代码编译产生的名称为serving的二
 <img src='../images/syn_mode.png' width = "350" height = "300">
 <p>
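 Asynchronous mode is mainly suited to cases where the model supports multi-sample batches (the maximum batch size M can be set through a configuration option), each Request carries a small batch (batch << M), and a single prediction takes a long time.
-In asynchronous mode, the N Server threads only receive Requests; the prediction engine is actually invoked in the threads of the asynchronous framework, whose thread count can be set through a configuration option. For ease of understanding, assume every Request has a batch of 1. The asynchronous framework then takes as many Requests as possible from the request pool, n (n≤M) of them, assembles them into 1 Request (batch=n), calls the prediction engine once to obtain 1 Response (batch = n), and splits it back into n Responses to return.
+In asynchronous mode, the N Server threads only receive Requests; the prediction engine is actually invoked in the thread pool of the asynchronous framework, whose thread count can be set through a configuration option. For ease of understanding, assume every Request has a batch of 1. The asynchronous framework then takes as many Requests as possible from the request pool, n (n≤M) of them, assembles them into 1 Request (batch=n), calls the prediction engine once to obtain 1 Response (batch = n), and splits it back into n Responses to return.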
 <p align="center">
 <img src='../images/asyn_mode.png'>
 <p>

--- a/doc/C++Serving/Performance_Tuning_CN.md
+++ b/doc/C++Serving/Performance_Tuning_CN.md
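-To be filled in!
+# C++ Serving Performance Analysis and Optimization
+# 1. Background
+1) First, make sure you are familiar with the common [features](Introduction_CN.md) of C++ Serving and with the [detailed guide to configuring and starting C++ Serving](../SERVING_CONFIGURE_CN.md).
+2) For performance analysis of the C++ Serving framework itself, see [C++ Serving Framework Performance Test](Frame_Performance_CN.md).
+3) You need some understanding of your model, your machine environment, and the business you are deploying. For example: whether you predict on CPU or GPU; whether TRT acceleration can be enabled; how many CPU cores your machine has; how many models your business involves; what processing each model's input and output requires; the peak online traffic of your business; and the maximum input batch your model supports.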
+
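+# 2. Server thread count
+
+First, a larger Server thread count N is not always better. Thread switching involves transitions between user space and kernel space and has a cost. With 1 core and 100000 threads, frequent switching brings a significant performance overhead.
+
+In the BRPC framework, the number of user-space coroutine workers M >> the thread count N, and a worker can run on any thread. When an RPC network IO operation yields the CPU, BRPC switches to another user-space coroutine worker, which raises the concurrency of the RPC framework. So in the extreme case where your code has no thread-blocking IO or network operation other than RPC communication, the thread count can simply equal the number of machine cores; you need not worry that all N threads are busy with RPC network IO and leave the CPU underused.
+
+Setting the Server <mark>**thread count N**</mark> requires weighing three factors together:
+
+## 2.1 Maximum concurrent requests M
+
+Set the Server thread count N according to the maximum number of concurrent requests. The data in [C++ Serving Framework Performance Test](Frame_Performance_CN.md) shows that <mark>**the thread count N should be equal to or slightly smaller than the maximum number of concurrent requests M**</mark>, which gives the lowest average processing latency.
+
+This is easy to understand. As an extreme example, if only 1 request arrives at a time, 1 Server thread is the most reasonable setting because there is no thread-switching overhead. Any thread count greater than 1 necessarily introduces thread-switching overhead.
+
+## 2.2 Number of machine cores C
+
+Set the Server thread count N according to the number of machine cores. A thread is the smallest unit scheduled onto a CPU core, so to use every core fully within one process, <mark>**the thread count should be at least the number of machine cores C**</mark>. The right ratio of thread count N to core count C depends on the proportions of network, IO, memory, and compute work in your code; in practice you can try different thread counts, measure CPU utilization, and adjust.
+
+## 2.3 Model prediction time T
+
+When you predict on CPU, the prediction-stage computation runs on the CPU; in that case set the thread count by the two factors above.
+
+When you predict on GPU, the situation differs: the prediction-stage computation runs on the GPU and the CPU is idle, yet the prediction call blocks its thread, much like a Sleep call. If the thread count equals the number of machine cores, there is no other thread to switch to, so some cores are inevitably idle. Specifically, when prediction time is short (<10ms), the Server thread count should not be too large (1 to 10 times the core count), otherwise thread-switching overhead becomes significant. When prediction time is long, the Server thread count should be somewhat larger (4 to 200 times the core count).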
+
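The three factors in section 2 can be combined into a rough starting range for N. The sketch below is only an illustration: the function name `suggest_server_threads` and its parameters are hypothetical, the multipliers are the ones quoted in section 2.3, and the CPU branch assumes no upper bound other than the concurrency cap from section 2.1. Treat the result as a starting point for measurement, not a rule.

```python
def suggest_server_threads(cores, max_concurrency, use_gpu, predict_ms):
    """Return a (low, high) starting range for the Server thread count N.

    Hypothetical helper: it only encodes the heuristics from section 2.
    """
    if use_gpu:
        # Section 2.3: 1-10x cores for short predictions (<10ms), 4-200x otherwise.
        low_mult, high_mult = (1, 10) if predict_ms < 10 else (4, 200)
        low, high = cores * low_mult, cores * high_mult
    else:
        # Section 2.2: at least one thread per core; the upper bound is an assumption.
        low, high = cores, max_concurrency
    # Section 2.1: N should not exceed the maximum number of concurrent requests M.
    high = max(1, min(high, max_concurrency))
    low = min(low, high)
    return low, high


print(suggest_server_threads(cores=8, max_concurrency=1000, use_gpu=True, predict_ms=5))
```

Within the returned range, the final value still has to come from testing different thread counts against CPU utilization and latency, as section 2.2 recommends.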
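+# 3. Asynchronous mode
+When <mark>**the batch size of most user Requests << the maximum batch size the model supports**</mark>, asynchronous mode brings a clear benefit.
+
+Asynchronous mode works by decoupling the model prediction stage from the RPC threads. The model gets its own thread pool with a configurable number of threads. After receiving a Request, the RPC thread puts the request data into the Task queue of the model's thread pool; threads in the pool take data from the queue, merge it into a batch, and run prediction, which raises QPS. For more detail see [Introduction to C++ Serving features](Introduction_CN.md); for a data comparison between synchronous and asynchronous mode see [C++ Serving vs TensorFlow Serving performance comparison](Benchmark_CN.md). Under the conditions of those tests, asynchronous mode is 50% faster than synchronous mode.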
+
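The merge-and-split behaviour of asynchronous mode can be sketched in a few lines of Python. This is a conceptual model only, not the C++ Serving implementation: `predict` is a stand-in for the prediction engine, `drain_and_predict` is a hypothetical name, and every request is assumed to carry a batch of 1.

```python
from collections import deque


def predict(batch):
    # Stand-in for the prediction engine: doubles every input in the batch.
    return [x * 2 for x in batch]


def drain_and_predict(request_pool, max_batch):
    """Merge up to max_batch single-sample requests, predict once, split the result."""
    taken = []
    while request_pool and len(taken) < max_batch:
        taken.append(request_pool.popleft())
    if not taken:
        return []
    merged = [payload for _, payload in taken]
    outputs = predict(merged)  # one engine call serves all merged requests
    # Split the batched output back into one response per original request.
    return [(req_id, out) for (req_id, _), out in zip(taken, outputs)]


pool = deque([("r1", 1), ("r2", 2), ("r3", 3)])
responses = drain_and_predict(pool, max_batch=2)
print(responses)
```

With `max_batch=2`, the first two requests share one engine call and the third stays in the pool for the next round.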
+
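+There are two ways to enable asynchronous mode.
+## 3.1 Starting the C++ Server with the Python helper command
+
+With `python3 -m paddle_serving_server.serve`, add `--runtime_thread_num 2` to enable asynchronous mode for the model. Here 2 is the number of threads in the model's asynchronous thread pool; the default is 0, which means asynchronous mode is off. Choose the value of `--runtime_thread_num` according to the model, the data, and the available GPU memory.
+
+Add `--batch_infer_size 32` to allow input batches of up to 32 for the model. This parameter takes effect only when asynchronous mode is enabled.
+
+## 3.2 Starting the C++ Server with the command line plus configuration files
+
+In this case, editing the `runtime_thread_num` and `batch_infer_size` fields in `model_toolkit.prototxt` achieves the same effect.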
+
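+# 4. Combining multiple models
+When <mark>**your business needs to call several models for prediction**</mark> and you want the best possible performance, consider implementing it with C++ Serving [custom OPs](OP_CN.md) and a [custom DAG](DAG_CN.md).
+
+## 4.1 Advantages
+Because the models are combined inside one service, the time for network IO and for serialization and deserialization is saved. The gain is especially clear for large payloads (in our measurement, a single transfer of 40MB took 160-170ms of RPC time).
+
+## 4.2 Disadvantages
+1) You need to write custom OPs and a custom DAG in C++ to define how the models are combined.
+2) If pre-processing or post-processing is needed between models, you also have to write that code in C++ between the OPs.
+3) The Server code must be recompiled.
+
+## 4.3 Example
+See the `C++ OCR Service` section in [examples/C++/PaddleOCR/ocr/README_CN.md](../../examples/C++/PaddleOCR/ocr/README_CN.md) and the examples in [Ensemble prediction in Paddle Serving](Model_Ensemble_CN.md).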
--- a/doc/SERVING_CONFIGURE.md
+++ b/doc/SERVING_CONFIGURE.md
-# Serving Side Configuration
+# Serving Configuration

+([简体中文](SERVING_CONFIGURE_CN.md)|English)

-The Paddle Serving configuration files are plain-text protobuf files. Every field in a configuration file must be defined beforehand in the corresponding .proto definition under configure/proto/ before protobuf can read and parse it.
+## Overview

-All Serving-side configuration is defined in configure/proto/server_configure.proto.
+This guide focuses on Paddle C++ Serving and Python Pipeline configuration:

-## 1. service.prototxt
-The entry point of the Serving-side service configuration is service.prototxt, which configures the list of services mounted by a Paddle Serving instance. For its protobuf format, see the `InferServiceConf` type in `configure/server_configure.protobuf`. (The actual file path on disk can be changed with the --inferservice_path and --inferservice_file command-line options.) An example:
+- [Model Configuration](#model-configuration): Auto-generated when the model is converted. Specifies the model input/output.
+- [C++ Serving](#c-serving): For high-performance scenarios. Covers quick start and starting with user-defined configuration.
+- [Python Pipeline](#python-pipeline): For scenarios that combine multiple models.

-```JSON
-port: 8010
-services {
-  name: "ImageClassifyService"
-  workflows: "workflow1"
+## Model Configuration
+
+The model configuration is generated when a model is converted to a PaddleServing model and is named serving_client_conf.prototxt/serving_server_conf.prototxt. It describes the model's inputs and outputs so that users can fill in parameters easily. The model configuration file should not be modified. See the [Saving guide](SAVE.md) for how to convert a model. The file must follow the format defined in `core/configure/proto/general_model_config.proto`.
+
+Example:
+
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "concat_1.tmp_0"
+  alias_name: "concat_1.tmp_0"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 3
+  shape: 640
+  shape: 640
 }
 ```

-Where:
+- feed_var: model input
+- fetch_var: model output
+- name: node name
+- alias_name: alias name
+- is_lod_tensor: whether the tensor is a LoD tensor; see [Lod Introduction](LOD.md)
+- feed_type/fetch_type: data type
+
+|feed_type|Type|
+|---------|----|
+|0|INT64|
+|1|FLOAT32|
+|2|INT32|
+|3|FP64|
+|4|INT16|
+|5|FP16|
+|6|BF16|
+|7|UINT8|
+|8|INT8|
+
+- shape: tensor shape
+
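To make the `feed_type`/`fetch_type` codes and the repeated `shape` fields concrete, here is a minimal Python sketch that mirrors the example above. The dictionaries stand in for parsed prototxt entries and the helper `describe` is hypothetical; no Paddle Serving API is used.

```python
from math import prod

# Type codes copied from the feed_type table above.
FEED_TYPE_NAMES = {
    0: "INT64", 1: "FLOAT32", 2: "INT32", 3: "FP64", 4: "INT16",
    5: "FP16", 6: "BF16", 7: "UINT8", 8: "INT8",
}


def describe(var):
    """Summarize one feed_var/fetch_var entry as 'alias: TYPE x element_count'."""
    dtype = FEED_TYPE_NAMES[var["type"]]
    # Repeated `shape` fields are the dimensions, in order.
    count = prod(var["shape"])
    return f'{var["alias_name"]}: {dtype} x {count}'


feed = {"alias_name": "x", "type": 1, "shape": [13]}
fetch = {"alias_name": "concat_1.tmp_0", "type": 1, "shape": [3, 640, 640]}
print(describe(feed))
print(describe(fetch))
```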
+## C++ Serving

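- port: The listening port on which the local serving instance starts. The default is 8010. It can also be set with the --port=8010 command-line argument.
- services: Multiple services can be configured. Paddle Serving is designed so that a single Serving instance can host several prediction services at once, distinguished by service name. For example, the following configures 2 prediction services: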
-```JSON
+### 1. Quick start
+
+The easiest way to start C++ Serving is to provide the `--model` and `--port` flags.
+
+Example of starting C++ Serving:
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model --port 9393
+```
+
+This command generates the server configuration files in `workdir_9393`:
+
+```
+workdir_9393
+├── general_infer_0
+│   ├── fluid_time_file
+│   ├── general_model.prototxt
+│   └── model_toolkit.prototxt
+├── infer_service.prototxt
+├── resource.prototxt
+└── workflow.prototxt
+```
+
+More flags:
+| Argument                                       | Type  | Default | Description                                                                |
+| ---------------------------------------------- | ----- | ------- | -------------------------------------------------------------------------- |
+| `thread`                                       | int   | `2`     | Number of brpc service threads                                             |
+| `op_num`                                       | int[] | `0`     | Number of threads for each model in asynchronous mode                      |
+| `op_max_batch`                                 | int[] | `32`    | Batch size for each model in asynchronous mode                             |
+| `gpu_ids`                                      | str[] | `"-1"`  | GPU card IDs for each model                                                |
+| `port`                                         | int   | `9292`  | Port exposed to users by the current service                               |
+| `model`                                        | str[] | `""`    | Path of the Paddle model directory to be served                            |
+| `mem_optim_off`                                | -     | -       | Disable memory / graphics memory optimization                              |
+| `ir_optim`                                     | bool  | False   | Enable analysis and optimization of the computation graph                  |
+| `use_mkl` (Only for cpu version)               | -     | -       | Run inference with MKL                                                     |
+| `use_trt` (Only for trt version)               | -     | -       | Run inference with TensorRT. Requires ir_optim.                            |
+| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -     | -       | Run PaddleLite inference. Requires ir_optim.                               |
+| `use_xpu`                                      | -     | -       | Run PaddleLite inference with Baidu Kunlun XPU. Requires ir_optim.         |
+| `precision`                                    | str   | FP32    | Precision mode; supports FP32, FP16, INT8                                  |
+| `use_calib`                                    | bool  | False   | Use TRT int8 calibration                                                   |
+| `gpu_multi_stream`                             | bool  | False   | Enable GPU multi-stream to get higher QPS                                  |
+
+#### Serving a model with multiple GPUs
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
+```
+#### Serving two models
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
+```
+
+### 2. Starting with user-defined Configuration
+
+In most cases the flags above are sufficient. For special cases, users can also edit the configuration files directly: service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt, and proj.conf.
+
+Example of starting with a user-defined configuration:
+
+```BASH
+/bin/serving --flagfile=proj.conf
+```
+
+#### 2.1 proj.conf
+
+proj.conf passes flags to the server and specifies the paths of the other configuration files:
+```
+# for paddle inference
+--precision=fp32
+--use_calib=False
+--reload_interval_s=10
+# for brpc
+--max_concurrency=0
+--num_threads=10
+--bthread_concurrency=10
+--max_body_size=536870912
+# default path
+--inferservice_path=conf
+--inferservice_file=infer_service.prototxt
+--resource_path=conf
+--resource_file=resource.prototxt
+--workflow_path=conf
+--workflow_file=workflow.prototxt
+```
+
+The table below describes each flag:
+| name | Default | Description |
+|------|--------|------|
+|precision|"fp32"|Precision mode; supports FP32, FP16, INT8|
+|use_calib|False|Only for deployment with TensorRT|
+|reload_interval_s|10|Reload interval (s)|
+|max_concurrency|0|Limit on requests processed in parallel, 0: unlimited|
+|num_threads|10|Number of brpc service threads|
+|bthread_concurrency|10|Number of bthreads|
+|max_body_size|536870912|Max size of brpc message|
+|inferservice_path|"conf"|Path of inferservice conf|
+|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
+|resource_path|"conf"|Path of resource conf|
+|resource_file|"resource.prototxt"|Filename of resource conf|
+|workflow_path|"conf"|Path of workflow conf|
+|workflow_file|"workflow.prototxt"|Filename of workflow conf|
+
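As a quick way to inspect a proj.conf before starting the server, the flag file can be read into a dictionary. The sketch below is a hypothetical helper (`parse_flagfile` is not part of Paddle Serving) and handles only the `--name=value` form shown above. The Chinese version of this guide notes that when a flag is passed more than once, the last value wins, and the dictionary overwrite reproduces that behaviour.

```python
def parse_flagfile(text):
    """Parse `--name=value` lines into a dict; later duplicates override earlier ones."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and `#` comments such as "# for brpc".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.lstrip("-").partition("=")
        flags[name] = value  # a repeated flag replaces the earlier value
    return flags


PROJ_CONF = """
# for brpc
--max_concurrency=0
--num_threads=10
--num_threads=16
# default path
--workflow_path=conf
--workflow_file=workflow.prototxt
"""

flags = parse_flagfile(PROJ_CONF)
print(flags["num_threads"])
```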
+#### 2.2 service.prototxt
+
+To set the listening port, modify service.prototxt. Use `--inferservice_path` and `--inferservice_file` to tell the server where to find service.prototxt. The file must follow the `InferServiceConf` format in `core/configure/server_configure.protobuf`.
+
+```
 port: 8010
 services {
-  name: "ImageClassifyService"
+  name: "GeneralModelService"
  workflows: "workflow1"
 }
-services {
-  name: "BuiltinEchoService"
-  workflows: "workflow2"
-}
 ```

- service.name: Fill in the service name from the serving/proto/xx.proto file. For example, serving/proto/image_class.proto declares the service name as follows:
-```JSON
-service ImageClassifyService {
-  rpc inference(Request) returns (Response);
-  rpc debug(Request) returns (Response);
-  option (pds.options).generate_impl = true;
-};
-```
-The service name is then `ImageClassifyService`
+- port: The listening port.
+- services: Use the default; no need to modify. `workflow1` is defined in workflow.prototxt.

- service.workflows: Specifies the list of workflows configured under this service. Multiple workflows can be configured. In this example, one workflow, `workflow1`, is configured for `ImageClassifyService`. `workflow1` is defined in workflow.prototxt
+#### 2.3 workflow.prototxt

-## 2. workflow.prototxt
+To serve user-defined OPs, modify workflow.prototxt. Use `--workflow_path` and `--workflow_file` to tell the server where to find workflow.prototxt. The file must follow the `Workflow` format in `core/configure/server_configure.protobuf`.

-workflow.prototxt describes each concrete workflow. For its protobuf format, see the `Workflow` type in `configure/server_configure.protobuf`. The file path on disk can be set with `--workflow_path` and `--workflow_file`. An example:
+In the example below, the model is served with 3 OPs. GeneralReaderOp converts the input data to tensors. GeneralInferOp, which depends on the output of GeneralReaderOp, runs prediction on those tensors. GeneralResponseOp returns the output data.

-```JSON
+```
 workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
-    name: "image_reader_op"
-    type: "ReaderOp"
+    name: "general_reader_0"
+    type: "GeneralReaderOp"
  }
  nodes {
-    name: "image_classify_op"
-    type: "ClassifyOp"
+    name: "general_infer_0"
+    type: "GeneralInferOp"
    dependencies {
-      name: "image_reader_op"
+      name: "general_reader_0"
      mode: "RO"
    }
  }
  nodes {
-    name: "write_json_op"
-    type: "WriteJsonOp"
+    name: "general_response_0"
+    type: "GeneralResponseOp"
    dependencies {
-      name: "image_classify_op"
+      name: "general_infer_0"
      mode: "RO"
    }
  }
 }
-
-workflows {
-  name: "workflow2"
-  workflow_type: "Sequence"
-  nodes {
-    name: "echo_op"
-    type: "CommonEchoOp"
-  }
-}
 ```
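-The example above configures 2 workflows: `workflow1` and `workflow2`. Taking `workflow1` as an example:

- name: The workflow name, used to look up the workflow from service.prototxt
- workflow_type: "Sequence" or "Parallel", indicating whether the OPs represented by the nodes in this workflow can run in parallel. **Only Sequence is currently supported; a workflow configured as Parallel will not be executed**
- nodes: All the nodes chained into the workflow; multiple nodes can be configured. Nodes are chained through their dependencies
- node.name: Arbitrary; a name that reflects the OP class executed by the node is recommended
- node.type: The class name of the OP executed by the node, matching the name of an OP class under serving/op/
- node.dependencies: The list of upstream nodes this node depends on
- node.dependencies.name: Must match the name of a node in the workflow
+- name: The name of the workflow.
+- workflow_type: Only "Sequence" is supported.
+- nodes: The nodes that make up the workflow.
+- node.name: The name of the node, corresponding to the node type. See `python/paddle_serving_server/dag.py`.
+- node.type: The operator bound to the node. See the OPs in `serving/op`.
+- node.dependencies: The list of upstream operators the node depends on.
+- node.dependencies.name: The name of an upstream node in the workflow.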
 - node.dependencies.mode: RO-Read Only, RW-Read Write
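Before starting the server with a hand-edited workflow, it can help to check that every `dependencies.name` refers to a node that has already been declared. The sketch below models the workflow as plain Python data rather than parsing the prototxt; the helper name and the rule that a dependency must appear earlier in the list are assumptions made for this illustration, based on the "Sequence" workflow type.

```python
# The three nodes from the example above, as plain Python data.
WORKFLOW = [
    {"name": "general_reader_0", "type": "GeneralReaderOp", "dependencies": []},
    {"name": "general_infer_0", "type": "GeneralInferOp",
     "dependencies": ["general_reader_0"]},
    {"name": "general_response_0", "type": "GeneralResponseOp",
     "dependencies": ["general_infer_0"]},
]


def undefined_dependencies(nodes):
    """Return (node, dependency) pairs whose dependency was not declared earlier."""
    seen = set()
    problems = []
    for node in nodes:
        for dep in node["dependencies"]:
            if dep not in seen:
                problems.append((node["name"], dep))
        seen.add(node["name"])
    return problems


print(undefined_dependencies(WORKFLOW))
```

An empty result means each node depends only on nodes declared before it.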

-# 3. resource.prototxt
+#### 2.4 resource.prototxt

-The entry point of the Serving-side resource configuration is resource.prototxt, which configures model information. For its protobuf format, see ResourceConf in `configure/proto/server_configure.proto`. The file path on disk can be set with `--resource_path` and `--resource_file`. An example:
+You may modify resource.prototxt to set the path of the model files. Use `--resource_path` and `--resource_file` to tell the server where to find resource.prototxt. The file must follow the `ResourceConf` format in `core/configure/server_configure.proto`.

-```JSON
-model_toolkit_path: "./conf"
-model_toolkit_file: "model_toolkit.prototxt"
-cube_config_file: "./conf/cube.conf"
-```

-Where:
+```
+model_toolkit_path: "conf"
+model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
+general_model_path: "conf"
+general_model_file: "general_infer_0/general_model.prototxt"
+```

- model_toolkit_path: The directory containing model_toolkit.prototxt
- model_toolkit_file: The file name of model_toolkit.prototxt
- cube_config_file: The path and file name of the cube configuration file
+- model_toolkit_path: The directory path of model_toolkit.prototxt.
+- model_toolkit_file: The file name of model_toolkit.prototxt.
+- general_model_path: The directory path of general_model.prototxt.
+- general_model_file: The file name of general_model.prototxt.

-Cube is the Paddle Serving component for large-scale sparse parameters.
+#### 2.5 model_toolkit.prototxt

-# 4. model_toolkit.prototxt
+model_toolkit.prototxt specifies the parameters of the predictor engines. The file must follow the `ModelToolkitConf` format in `core/configure/server_configure.proto`.

-Configures model information and the prediction engine used. For its protobuf format, see ModelToolkitConf in `configure/proto/server_configure.proto`. The disk path of model_toolkit.protobuf cannot be overridden by a command-line argument. An example:
+Example using the CPU engine:

-```JSON
+```
 engines {
-  name: "image_classification_resnet"
-  type: "FLUID_CPU_NATIVE_DIR"
-  reloadable_meta: "./data/model/paddle/fluid_time_file"
+  name: "general_infer_0"
+  type: "PADDLE_INFER"
+  reloadable_meta: "uci_housing_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
-  model_data_path: "./data/model/paddle/fluid/SE_ResNeXt50_32x4d"
-  runtime_thread_num: 0
-  batch_infer_size: 0
-  enable_batch_align: 0
-  sparse_param_service_type: LOCAL
-  sparse_param_service_table_name: "local_kv"
+  model_dir: "uci_housing_model"
+  gpu_ids: -1
  enable_memory_optimization: true
-  static_optimization: false
-  force_update_static_cache: false
+  enable_ir_optimization: false
+  use_trt: false
+  use_lite: false
+  use_xpu: false
+  use_gpu: false
+  combined_model: false
+  gpu_multi_stream: false
+  runtime_thread_num: 0
+  batch_infer_size: 32
+  enable_overrun: false
+  allow_split_request: true
 }
 ```

-Where:
+- name: The engine name, corresponding to the node name in workflow.prototxt.
+- type: Only "PADDLE_INFER" is supported.
+- reloadable_meta: The marker file used to trigger a reload.
+- reloadable_type: Supports timestamp_ne/timestamp_gt/md5sum/revision/none
+
+|reloadable_type|Description|
+|---------------|----|
+|timestamp_ne|Reload when the mtime of the reloadable_meta file changes|
+|timestamp_gt|Reload when the mtime of the reloadable_meta file is greater than the last recorded mtime|
+|md5sum|Not used|
+|revision|Not used|

- name: The model name. InferManager uses this name to find the model and prediction engine to use. See the arguments of the InferManager::instance().infer() method in serving/op/classify_op.h and serving/op/classify_op.cpp.
- type: The prediction engine type. The list of currently registered engines is in inferencer-fluid-cpu/src/fluid_cpu_engine.cpp
+- model_dir: The path of the model files.
+- gpu_ids: The GPU IDs to use. Multiple device IDs are supported:
+```
+# GPU 0, 1, 2
+gpu_ids: 0
+gpu_ids: 1
+gpu_ids: 2
+```
+- enable_memory_optimization: Enable memory optimization.
+- enable_ir_optimization: Enable IR optimization.
+- use_trt: Enable TensorRT. Requires use_gpu to be on.
+- use_lite: Enable PaddleLite.
+- use_xpu: Enable Kunlun XPU.
+- use_gpu: Enable GPU.
+- combined_model: Enable combined model.
+- gpu_multi_stream: Enable GPU multi-stream mode.
+- runtime_thread_num: When greater than 0, enables Async mode and creates predictors.
+- batch_infer_size: The maximum batch size in Async mode.
+- enable_overrun: Enable overrun in Async mode, which means putting the whole task into the task queue.
+- allow_split_request: Allow a request task to be split in Async mode.
+
+#### 2.6 general_model.prototxt
+
+The content of general_model.prototxt is the same as that of serving_server_conf.prototxt.

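-|Prediction engine|Meaning|
-|--------|----|
-|FLUID_CPU_ANALYSIS|Uses the fluid Analysis API; all model parameters are saved in a single file|
-|FLUID_CPU_ANALYSIS_DIR|Uses the fluid Analysis API; model parameters are saved as separate files and the whole model is placed in one directory|
-|FLUID_CPU_NATIVE|Uses the fluid Native API; all model parameters are saved in a single file|
-|FLUID_CPU_NATIVE_DIR|Uses the fluid Native API; model parameters are saved as separate files and the whole model is placed in one directory|
-|FLUID_GPU_ANALYSIS|GPU prediction with the fluid Analysis API; all model parameters are saved in a single file|
-|FLUID_GPU_ANALYSIS_DIR|GPU prediction with the fluid Analysis API; model parameters are saved as separate files and the whole model is placed in one directory|
-|FLUID_GPU_NATIVE|GPU prediction with the fluid Native API; all model parameters are saved in a single file|
-|FLUID_GPU_NATIVE_DIR|GPU prediction with the fluid Native API; model parameters are saved as separate files and the whole model is placed in one directory|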
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "fc_0.tmp_1"
+  alias_name: "price"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 1
+}
+```

+## Python Pipeline

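-**Difference between the fluid Analysis API and the fluid Native API**
+Python Pipeline provides a user-friendly programming framework for multi-model composite services.

-While loading a model, the Analysis API applies several optimizations to the model's computation logic, including but not limited to zero copy tensor and fusing adjacent OPs. **These optimizations do not necessarily speed up every model and sometimes even slow it down; rely on measured results.**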
+Example of config.yaml:
+```YAML
+#RPC port. The RPC port and HTTP port cannot be empty at the same time. If the RPC port is empty and the HTTP port is not empty, the RPC port is automatically set to HTTP port+1.
+rpc_port: 18090

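- reloadable_meta: The file content currently has no meaning; the file's mtime is used to decide whether the reload time threshold has been exceeded
- reloadable_type: The reload check condition: timestamp_ne/timestamp_gt/md5sum/revision/none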
+#HTTP port. The RPC port and the HTTP port cannot be empty at the same time. If the RPC port is available and the HTTP port is empty, the HTTP port is not automatically generated
+http_port: 9999

-|reloadable_type|Meaning|
-|---------------|----|
-|timestamp_ne|The mtime timestamp of the file specified by reloadable_meta has changed|
-|timestamp_gt|The mtime timestamp of the file specified by reloadable_meta is greater than or equal to the mtime recorded at the last check|
-|md5sum|Currently unused; never reloads when configured|
-|revision|Currently unused; never reloads when configured|
+#worker_num, the maximum concurrency.
+#When build_dag_each_worker=True, the server creates processes, each containing a GRPC Server and a DAG.
+#When build_dag_each_worker=False, worker_num sets the size of the GRPC thread pool.
+worker_num: 20

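- model_data_path: The path of the model files
- runtime_thread_num: If greater than 0, enables the bsf multi-threaded scheduling framework and runs multi-threaded prediction inside each prediction bthread worker. Note that when multi-threaded prediction inside the worker is enabled, OPs in the workflow must use the Serving framework's BatchTensor class for prediction input and output (predictor/framework/infer_data.h, `class BatchTensor`).
- batch_infer_size: The batch size of each prediction thread when bsf multi-threaded prediction is enabled
- enable_batch_align:
- sparse_param_service_type: Enum, optional. The type of large-scale sparse parameter service
+#build_dag_each_worker: False, create a process with one DAG; True, create processes with multiple independent DAGs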
+build_dag_each_worker: false

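-|sparse_param_service_type|Meaning|
-|-------------------------|--|
-|NONE|Do not use the large-scale sparse parameter service|
-|LOCAL|Single-machine local large-scale sparse parameter service, with rocksdb as the engine|
-|REMOTE|Distributed large-scale sparse parameter service, with Cube as the engine|
+dag:
+    #True, thread model; False, process model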
+    is_thread_op: False

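- sparse_param_service_table_name: Optional. The name of the table in the large-scale sparse parameter service that holds the parameters used by this model.
- enable_memory_optimization: bool, optional. Whether to enable memory optimization. Meaningful only with the fluid Analysis prediction API. Note that GPU memory optimization is performed for GPU prediction
- static_optimization: bool. Whether to perform static optimization. Meaningful only when memory optimization is enabled.
- force_update_static_cache: bool. Whether to force an update of the static optimization cache. Meaningful only when memory optimization is enabled.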
+    #retry times
+    retry: 1

-## 5. Command-line configuration parameters
+    # True, generate the TimeLine data; False, do not generate it
+    use_profile: false
+    tracer:
+        interval_s: 10

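-The following are the gflag options supported by the serving side, with their default values.
+op:
+    det:
+        #concurrency: number of threads when is_thread_op=True, otherwise number of processes
+        concurrency: 6

-| name | Default | Meaning |
-|------|--------|------|
-|workflow_path|./conf|Workflow configuration directory|
-|workflow_file|workflow.prototxt|Workflow configuration file name|
-|inferservice_path|./conf|Service configuration directory|
-|inferservice_file|service.prototxt|Service configuration file name|
-|resource_path|./conf|Resource manager directory|
-|resource_file|resource.prototxt|Resource manager file name|
-|reload_interval_s|10|Reload thread interval (s)|
-|enable_model_toolkit|true|Model management|
-|enable_protocol_list|baidu_std|List of brpc communication protocols|
-|log_dir|./log|log dir|
-|num_threads||Number of system threads used by the brpc server; defaults to the number of CPU cores|
-|port|8010|Listening port on which the Serving process receives requests|
-|gpuid|0|GPU device id used by the Serving process for GPU prediction. Only 1 GPU card can be bound|
-|bthread_concurrency|9|Concurrency of the underlying BRPC bthread. Use this parameter to limit the number of concurrent workers when using a GPU prediction engine|
-|bthread_min_concurrency|4|Min concurrency of the underlying BRPC bthread. Use this parameter, together with bthread_concurrency, to limit the number of concurrent workers when using a GPU prediction engine|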
+        #Loading local server configuration without server_endpoints.
+        local_service_conf:
+            #client type: brpc, grpc, or local_predictor.
+            client_type: local_predictor

-Defaults can be overridden in serving/conf/gflags.conf, for example
-```
--log_dir=./serving_log/
+            #det model path
+            model_config: ocr_det_model
+
+            #Fetch data list
+            fetch_list: ["concat_1.tmp_0"]
+
+            #Device ID
+            devices: ""
+
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 0
+
+            #use_mkldnn
+            #use_mkldnn: True
+
+            #ir_optim
+            ir_optim: True
+    rec:
+        #concurrency: number of threads when is_thread_op=True, otherwise number of processes
+        concurrency: 3
+
+        #time out, ms
+        timeout: -1
+
+        #retry times
+        retry: 1
+
+        #Loading local server configuration without server_endpoints.
+        local_service_conf:
+
+            #client type: brpc, grpc, or local_predictor.
+            client_type: local_predictor
+
+            #rec model path
+            model_config: ocr_rec_model
+
+            #Fetch data list
+            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]
+
+            #Device ID
+            devices: ""
+
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 0
+
+            #use_mkldnn
+            #use_mkldnn: True
+
+            #ir_optim
+            ir_optim: True
 ```
-This sets the log directory to ./serving_log

-### 5.1 gflags.conf
+### Single-machine and multi-card inference

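-Command-line parameters can be written to a configuration file, whose path defaults to `conf/gflags.conf`. If `conf/gflags.conf` exists, the serving side tries to parse the gflags commands in it. For example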
-```shell
--enable_model_toolkit
--port=8011
+Single-machine multi-card inference can be abstracted as M OP processes bound to N GPU cards. It depends on three parameters in config.yml: first select process mode (is_thread_op: False); concurrency is then the number of processes; and devices lists the GPU card IDs. Cards are assigned by cycling through the GPU card IDs as the processes start. For example, if 7 OP processes are started with devices: "0,1,2" in config.yml, the 1st, 4th, and 7th processes are bound to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.
+
+Reference config.yaml:
+```YAML
+#True, thread model; False, process model
+is_thread_op: False
+
+#concurrency: number of threads when is_thread_op=True, otherwise number of processes
+concurrency: 7
+
+devices: "0,1,2"
 ```

-Another command-line flag file can be specified with the following command
+### Heterogeneous Devices

-```shell
-bin/serving --g=true --flagfile=conf/gflags.conf.new
+In addition to CPU and GPU, Pipeline also supports deployment on a variety of heterogeneous hardware. This is controlled by device_type and devices in config.yml. device_type takes priority; when it is not set, the device type is inferred from devices. The device_type values are:
+- CPU(Intel) : 0
+- GPU : 1
+- TensorRT : 2
+- CPU(Arm) : 3
+- XPU : 4
+
+Reference config.yaml:
+```YAML
+# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+device_type: 0
+devices: "" # "0,1"
 ```
+
+### Low precision inference
+
+Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU and TensorRT are listed below:
+- CPU
+  - fp32(default)
+  - fp16
+  - bf16(mkldnn)
+- GPU
+  - fp32(default)
+  - fp16(takes effect with TensorRT)
+  - int8
+- TensorRT
+  - fp32(default)
+  - fp16
+  - int8
+
+```YAML
+#precision
+#GPU support: "fp32"(default), "fp16(TensorRT)", "int8";
+#CPU support: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
+precision: "fp32"
+```
\ No newline at end of file
--- a/doc/SERVING_CONFIGURE_CN.md
+++ b/doc/SERVING_CONFIGURE_CN.md
+# Serving Configuration
+
+(简体中文|[English](SERVING_CONFIGURE.md))
+
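+## Overview
+
+This document describes the configuration of C++ Serving and Python Pipeline:
+
+- [Model Configuration](#model-configuration): Generated automatically when a model is converted; describes the model's inputs and outputs
+- [C++ Serving](#c-serving): For high-performance scenarios; covers quick start and custom configuration
+- [Python Pipeline](#python-pipeline): For single-operator, multi-model combination scenarios
+
+## Model Configuration
+
+Before introducing the Server configuration, let us look at the model configuration file. Converting a model to a PaddleServing model generates serving_client_conf.prototxt and serving_server_conf.prototxt. The two have identical content: the parameter information for the model's inputs and outputs, which helps users assemble parameters. The file is used by both the Server and the Client and does not need to be modified by the user. For the conversion method, see [How to save a model for Paddle Serving](SAVE_CN.md). For the protobuf format, see `core/configure/proto/general_model_config.proto`.
+An example: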
+
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "concat_1.tmp_0"
+  alias_name: "concat_1.tmp_0"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 3
+  shape: 640
+  shape: 640
+}
+```
+
+Where:
+- feed_var: model input
+- fetch_var: model output
+- name: name
+- alias_name: alias, corresponding to the name
+- is_lod_tensor: whether the tensor is a LoD tensor; see [Lod field description](LOD_CN.md)
+- feed_type: data type
+
+|feed_type|Type|
+|---------|----|
+|0|INT64|
+|1|FLOAT32|
+|2|INT32|
+|3|FP64|
+|4|INT16|
+|5|FP16|
+|6|BF16|
+|7|UINT8|
+|8|INT8|
+
+- shape: data dimensions
+
+## C++ Serving
+
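+### 1. Quick start
+
+The service can be started quickly by specifying the model and the port. The start command is:
+
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model --port 9393
+```
+
+This command automatically generates the configuration files and starts C++ Serving with them. For example, the command above generates the workdir_9393 directory, whose structure is as follows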
+
+```
+workdir_9393
+├── general_infer_0
+│   ├── fluid_time_file
+│   ├── general_model.prototxt
+│   └── model_toolkit.prototxt
+├── infer_service.prototxt
+├── resource.prototxt
+└── workflow.prototxt
+```
+
+More start-up parameters are listed in the table below:
+| Argument                                       | Type  | Default | Description                                                                |
+| ---------------------------------------------- | ----- | ------- | -------------------------------------------------------------------------- |
+| `thread`                                       | int   | `2`     | Number of brpc service threads                                             |
+| `op_num`                                       | int[] | `0`     | Number of threads for each model in asynchronous mode                      |
+| `op_max_batch`                                 | int[] | `32`    | Batch size for each model in asynchronous mode                             |
+| `gpu_ids`                                      | str[] | `"-1"`  | GPU card IDs for each model                                                |
+| `port`                                         | int   | `9292`  | Port exposed to users by the current service                               |
+| `model`                                        | str[] | `""`    | Path of the Paddle model directory to be served                            |
+| `mem_optim_off`                                | -     | -       | Disable memory / graphics memory optimization                              |
+| `ir_optim`                                     | bool  | False   | Enable analysis and optimization of the computation graph                  |
+| `use_mkl` (Only for cpu version)               | -     | -       | Run inference with MKL                                                     |
+| `use_trt` (Only for trt version)               | -     | -       | Run inference with TensorRT. Requires ir_optim.                            |
+| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -     | -       | Run PaddleLite inference. Requires ir_optim.                               |
+| `use_xpu`                                      | -     | -       | Run PaddleLite inference with Baidu Kunlun XPU. Requires ir_optim.         |
+| `precision`                                    | str   | FP32    | Precision mode; supports FP32, FP16, INT8                                  |
+| `use_calib`                                    | bool  | False   | Use TRT int8 calibration                                                   |
+| `gpu_multi_stream`                             | bool  | False   | Enable GPU multi-stream to get higher QPS                                  |
+
+#### Deploying one model on multiple GPU cards
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
+```
+#### Deploying a service that contains two models
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
+```
+
+### 2. Starting with custom configuration
+
+In general, the automatically generated configuration handles most scenarios. For special scenarios, users can define the configuration files themselves. These files are service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt, and proj.conf. The start command is:
+```BASH
+/bin/serving --flagfile=proj.conf
+```
+
+#### 2.1 proj.conf
+
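+proj.conf passes parameters to the service and specifies the paths of the other configuration files. If a parameter is passed more than once, the last value wins.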
+```
+# for paddle inference
+--precision=fp32
+--use_calib=False
+--reload_interval_s=10
+# for brpc
+--max_concurrency=0
+--num_threads=10
+--bthread_concurrency=10
+--max_body_size=536870912
+# default path
+--inferservice_path=conf
+--inferservice_file=infer_service.prototxt
+--resource_path=conf
+--resource_file=resource.prototxt
+--workflow_path=conf
+--workflow_file=workflow.prototxt
+```
+The description and default value of each parameter are listed in the table below:
+| name | Default | Description |
+|------|--------|------|
+|precision|"fp32"|Precision mode; supports FP32, FP16, INT8|
+|use_calib|False|Only for deployment with TensorRT|
+|reload_interval_s|10|Reload interval (s)|
+|max_concurrency|0|Limit on requests processed in parallel, 0: unlimited|
+|num_threads|10|Number of brpc service threads|
+|bthread_concurrency|10|Number of bthreads|
+|max_body_size|536870912|Max size of brpc message|
+|inferservice_path|"conf"|Path of inferservice conf|
+|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
+|resource_path|"conf"|Path of resource conf|
+|resource_file|"resource.prototxt"|Filename of resource conf|
+|workflow_path|"conf"|Path of workflow conf|
+|workflow_file|"workflow.prototxt"|Filename of workflow conf|
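
The "last value wins" rule for repeated parameters in proj.conf can be illustrated with a small sketch. The real parsing is done by the flag library inside the `serving` binary; `parse_flagfile` below is a hypothetical helper written only to show the rule, and it handles just the `--name=value` and `#` comment forms that appear in the example above.

```python
def parse_flagfile(text):
    """Parse `--name=value` lines into a dict (hypothetical helper, not part of Paddle Serving)."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines such as "# for brpc".
        if not line or line.startswith("#"):
            continue
        if not line.startswith("--") or "=" not in line:
            raise ValueError("unsupported line: %r" % raw)
        name, value = line[2:].split("=", 1)
        # A later occurrence of the same flag overwrites the earlier one.
        flags[name] = value
    return flags


sample = """
# for brpc
--num_threads=10
--bthread_concurrency=10
--num_threads=16
"""

flags = parse_flagfile(sample)
print(flags["num_threads"])
```

Values are kept as strings here; converting them to typed values is left out to keep the sketch short.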
+
+#### 2.2 service.prototxt
+
+service.prototxt configures the list of services mounted by a Paddle Serving instance. Its load path is specified with `--inferservice_path` and `--inferservice_file`. For the protobuf format, see `InferServiceConf` in `core/configure/proto/server_configure.proto`. For example:
+
+```
+port: 8010
+services {
+  name: "GeneralModelService"
+  workflows: "workflow1"
+}
+```
+
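+Where:
+- port: the port that the Serving instance listens on.
+- services: use the default configuration; it must not be modified. name specifies the service name, and workflow1 is defined in workflow.prototxt.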
+
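+#### 2.3 workflow.prototxt
+
+workflow.prototxt describes the concrete workflow. Its load path is specified with `--workflow_path` and `--workflow_file`. For the protobuf format, see the `Workflow` type in `core/configure/proto/server_configure.proto`.
+In the example below, the workflow consists of 3 OPs: GeneralReaderOp reads the data, GeneralInferOp depends on GeneralReaderOp and runs prediction, and GeneralResponseOp returns the prediction result: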
+
+```
+workflows {
+  name: "workflow1"
+  workflow_type: "Sequence"
+  nodes {
+    name: "general_reader_0"
+    type: "GeneralReaderOp"
+  }
+  nodes {
+    name: "general_infer_0"
+    type: "GeneralInferOp"
+    dependencies {
+      name: "general_reader_0"
+      mode: "RO"
+    }
+  }
+  nodes {
+    name: "general_response_0"
+    type: "GeneralResponseOp"
+    dependencies {
+      name: "general_infer_0"
+      mode: "RO"
+    }
+  }
+}
+```
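+Where:
+
+- name: the workflow name, used to look up the concrete workflow from service.prototxt
+- workflow_type: only "Sequence" is supported
+- nodes: all the nodes chained into the workflow; multiple nodes can be configured. Nodes are chained together through their dependencies
+- node.name: corresponds one-to-one with node.type; see `python/paddle_serving_server/dag.py` for details
+- node.type: the class name of the OP executed by this node; it matches the name of a concrete OP class under serving/op/
+- node.dependencies: the list of upstream nodes this node depends on
+- node.dependencies.name: must match the name of a node in the workflow
+- node.dependencies.mode: RO-Read Only, RW-Read Write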
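
To make the role of `dependencies` concrete, here is a minimal sketch that derives an execution order from the node names in the example workflow. It uses Python's standard `graphlib` module (Python 3.9+) and is only an illustration; it is not how the Serving framework schedules OPs internally, and `execution_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each node maps to the list of upstream nodes it depends on,
# mirroring the `dependencies` blocks of the example workflow.
nodes = {
    "general_reader_0": [],
    "general_infer_0": ["general_reader_0"],
    "general_response_0": ["general_infer_0"],
}


def execution_order(graph):
    """Return node names so that every node comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(nodes)
print(order)
```

For a "Sequence" workflow like this one the result is simply the chain from reader to response; `TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a cycle.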
+
+#### 2.4 resource.prototxt
+
+resource.prototxt specifies the model configuration files. Its load path is specified with `--resource_path` and `--resource_file`. For the protobuf format, see `ResourceConf` in `core/configure/proto/server_configure.proto`. For example:
+
+```
+model_toolkit_path: "conf"
+model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
+general_model_path: "conf"
+general_model_file: "general_infer_0/general_model.prototxt"
+```
+
+Where:
+
+- model_toolkit_path: the directory that contains model_toolkit.prototxt
+- model_toolkit_file: the file name of model_toolkit.prototxt
+- general_model_path: the directory that contains general_model.prototxt
+- general_model_file: the file name of general_model.prototxt
+
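+#### 2.5 model_toolkit.prototxt
+
+model_toolkit.prototxt configures the model information and the prediction engine. For the protobuf format, see `ModelToolkitConf` in `core/configure/proto/server_configure.proto`. The on-disk path of model_toolkit.prototxt cannot be overridden by command-line parameters. For example: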
+
+```
+engines {
+  name: "general_infer_0"
+  type: "PADDLE_INFER"
+  reloadable_meta: "uci_housing_model/fluid_time_file"
+  reloadable_type: "timestamp_ne"
+  model_dir: "uci_housing_model"
+  gpu_ids: -1
+  enable_memory_optimization: true
+  enable_ir_optimization: false
+  use_trt: false
+  use_lite: false
+  use_xpu: false
+  use_gpu: false
+  combined_model: false
+  gpu_multi_stream: false
+  runtime_thread_num: 0
+  batch_infer_size: 32
+  enable_overrun: false
+  allow_split_request: true
+}
+```
+
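+Where:
+
+- name: the engine name; it corresponds to node.name in workflow.prototxt and to the name of the directory that contains this file
+- type: the type of the prediction engine. Only "PADDLE_INFER" is currently supported
+- reloadable_meta: the file content is currently meaningless; the mtime of this file is used to decide whether the reload time threshold has been exceeded
+- reloadable_type: the reload condition to check: timestamp_ne/timestamp_gt/md5sum/revision/none
+
+|reloadable_type|Meaning|
+|---------------|----|
+|timestamp_ne|The mtime timestamp of the file specified by reloadable_meta has changed|
+|timestamp_gt|The mtime timestamp of the file specified by reloadable_meta is greater than or equal to the mtime timestamp recorded at the last check|
+|md5sum|Currently unused; when configured, the model is never reloaded|
+|revision|Currently unused; when configured, the model is never reloaded|
+
+- model_dir: the path of the model files
+- gpu_ids: the GPU device ids the engine uses at runtime; multiple ids can be specified, for example: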
+```
+# Use GPUs 0, 1, and 2
+gpu_ids: 0
+gpu_ids: 1
+gpu_ids: 2
+```
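+- enable_memory_optimization: whether to enable memory optimization
+- enable_ir_optimization: whether to enable IR optimization
+- use_trt: whether to enable TensorRT; use_gpu must be enabled as well
+- use_lite: whether to enable PaddleLite
+- use_xpu: whether to use Kunlun XPU
+- use_gpu: whether to use GPU
+- combined_model: whether to use a combined model file
+- gpu_multi_stream: whether to enable GPU multi-stream mode
+- runtime_thread_num: if greater than 0, Async mode is enabled and that number of predictor instances is created
+- batch_infer_size: the maximum batch size in Async mode
+- enable_overrun: in Async mode, always put the whole task into the task queue
+- allow_split_request: in Async mode, allow a task to be split
+
+#### 2.6 general_model.prototxt
+
+general_model.prototxt has the same content as the model configuration file serving_server_conf.prototxt and describes the model's input and output parameters. For example: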
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "fc_0.tmp_1"
+  alias_name: "price"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 1
+}
+```
+
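+## Python Pipeline
+
+Python Pipeline provides a user-friendly programming framework for serving combinations of models, and suits applications that chain several models together.
+Its configuration file is in YAML format and is named config.yaml by default. For example: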
+```YAML
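+#rpc port. rpc_port and http_port must not both be empty. When rpc_port is empty and http_port is not, rpc_port is automatically set to http_port+1
+rpc_port: 18090
+
+#http port. rpc_port and http_port must not both be empty. When rpc_port is available and http_port is empty, http_port is not generated automatically
+http_port: 9999
+
+#worker_num, the maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes and builds a gRPC server and a DAG in each process
+##When build_dag_each_worker=False, the framework sets max_workers=worker_num for the gRPC thread pool of the main thread
+worker_num: 20
+
+#build_dag_each_worker: False, the framework creates one DAG inside the process; True, the framework builds an independent DAG in each process
+build_dag_each_worker: false
+
+dag:
+    #op resource type: True for the thread model; False for the process model
+    is_thread_op: False
+
+    #number of retries
+    retry: 1
+
+    #profiling: True generates Timeline performance data, at some performance cost; False disables it
+    use_profile: false
+    tracer:
+        interval_s: 10
+
+op:
+    det:
+        #concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
+        concurrency: 6
+
+        #when the op config has no server_endpoints, the local service configuration is read from local_service_conf
+        local_service_conf:
+            #client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and predicts in-process
+            client_type: local_predictor
+
+            #det model path
+            model_config: ocr_det_model
+
+            #fetch list; names follow the alias_name of fetch_var in client_config
+            fetch_list: ["concat_1.tmp_0"]
+
+            #compute device IDs. When devices is "" or omitted, prediction runs on CPU; when devices is "0" or "0,1,2", prediction runs on the listed GPU cards
+            devices: ""
+
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 0
+
+            #use_mkldnn
+            #use_mkldnn: True
+
+            #ir_optim
+            ir_optim: True
+    rec:
+        #concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
+        concurrency: 3
+
+        #timeout, in ms
+        timeout: -1
+
+        #number of retries when talking to Serving; by default no retry
+        retry: 1
+
+        #when the op config has no server_endpoints, the local service configuration is read from local_service_conf
+        local_service_conf:
+
+            #client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and predicts in-process
+            client_type: local_predictor
+
+            #rec model path
+            model_config: ocr_rec_model
+
+            #fetch list; names follow the alias_name of fetch_var in client_config
+            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]
+
+            #compute device IDs. When devices is "" or omitted, prediction runs on CPU; when devices is "0" or "0,1,2", prediction runs on the listed GPU cards
+            devices: ""
+
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 0
+
+            #use_mkldnn
+            #use_mkldnn: True
+
+            #ir_optim
+            ir_optim: True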
+```
+
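+### Multiple GPUs on one machine
+
+For single-machine multi-GPU inference, M OP processes are bound to N GPU cards, which requires 3 parameters in config.yaml. First select the process model, so that the concurrency is the number of processes, then configure devices. Processes are bound by iterating over the GPU card IDs as they start. For example, if 7 OP processes are started with the three device ids 0, 1, and 2, then the 1st, 4th, and 7th processes are bound to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.
+```YAML
+#op resource type: True for the thread model; False for the process model
+is_thread_op: False
+
+#concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
+concurrency: 7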
+concurrency: 7
+
+devices: "0,1,2"
+```
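
The binding rule described above (processes take GPU ids in turn as they start) amounts to a round-robin assignment. The sketch below reproduces the 7-process, 3-card example; `bind_devices` is a hypothetical helper for illustration, not a function of Python Pipeline.

```python
def bind_devices(concurrency, devices):
    """Map the 1-based start order of each OP process to a GPU id, round-robin.

    Hypothetical helper: `concurrency` and `devices` mirror the config.yaml
    fields of the same names.
    """
    ids = [d.strip() for d in devices.split(",") if d.strip()]
    if not ids:
        raise ValueError("devices must list at least one GPU id")
    # Process k (1-based) gets the id at position (k - 1) modulo the card count.
    return {k + 1: ids[k % len(ids)] for k in range(concurrency)}


binding = bind_devices(7, "0,1,2")
for process, gpu in binding.items():
    print("process %d -> GPU %s" % (process, gpu))
```

With `concurrency: 7` and `devices: "0,1,2"`, card 0 serves three processes and cards 1 and 2 serve two each, so a concurrency that is a multiple of the card count spreads the load evenly.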
+
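+### Heterogeneous hardware
+
+Besides CPU and GPU, Python Pipeline supports deployment on several kinds of heterogeneous hardware, controlled by device_type and devices in config.yaml. device_type takes priority; when it is omitted, the type is inferred from devices. The device_type values are:
+- CPU(Intel) : 0
+- GPU : 1
+- TensorRT : 2
+- CPU(Arm) : 3
+- XPU : 4
+
+Hardware configuration in config.yaml:
+```YAML
+#compute hardware type: when omitted, decided by devices (CPU/GPU). 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+device_type: 0
+#compute device IDs. The hardware type is decided by device_type first. When devices is "" or omitted, prediction runs on CPU; when it is "0" or "0,1,2", prediction runs on the listed GPU cards
+devices: "" # "0,1"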
+```
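
The precedence between device_type and devices can be written down as a small decision function. This is a sketch of the rule as stated in this section only; `resolve_device` and `DEVICE_NAMES` are hypothetical names, and the framework's own resolution logic may handle more cases.

```python
# device_type codes as listed in this section.
DEVICE_NAMES = {0: "cpu", 1: "gpu", 2: "tensorrt", 3: "arm cpu", 4: "kunlun xpu"}


def resolve_device(device_type=None, devices=""):
    """Return the hardware kind implied by the two config fields (hypothetical helper)."""
    if device_type is not None:
        # An explicit device_type takes priority over devices.
        if device_type not in DEVICE_NAMES:
            raise ValueError("unknown device_type: %r" % device_type)
        return DEVICE_NAMES[device_type]
    # device_type omitted: an empty devices string means CPU, otherwise GPU.
    return "gpu" if devices.strip() else "cpu"


print(resolve_device(None, ""))     # device_type omitted, no devices
print(resolve_device(None, "0,1"))  # device_type omitted, two GPU ids
print(resolve_device(2, "0"))       # explicit TensorRT
```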
+
+### Low-precision inference
+
+Python Pipeline supports low-precision inference. The precision types supported on CPU, GPU, and TensorRT are:
+- CPU
+  - fp32(default)
+  - fp16
+  - bf16(mkldnn)
+- GPU
+  - fp32(default)
+  - fp16(effective only with TRT)
+  - int8
+- TensorRT
+  - fp32(default)
+  - fp16
+  - int8
+
+```YAML
+#precision: inference precision; lowering it can speed up inference
+#GPU supports: "fp32"(default), "fp16"(TensorRT), "int8"
+#CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
+precision: "fp32"
+```
\ No newline at end of file
--- a/examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco/test_client.py
+++ b/examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco/test_client.py
@@ -19,9 +19,9 @@ import cv2

 preprocess = DetectionSequential([
    DetectionFile2Image(), DetectionResize(
-        (512, 512), False, interpolation=cv2.INTER_LINEAR), DetectionNormalize(
-            [123.675, 116.28, 103.53], [58.395, 57.12, 57.375], False),
-    DetectionTranspose((2, 0, 1))
+        (512, 512), False, interpolation=cv2.INTER_LINEAR),
+    DetectionNormalize([123.675, 116.28, 103.53], [58.395, 57.12, 57.375],
+                       False), DetectionTranspose((2, 0, 1))
 ])

 postprocess = RCNNPostprocess("label_list.txt", "output")

--- a/python/paddle_serving_client/io/__init__.py
+++ b/python/paddle_serving_client/io/__init__.py
@@ -19,7 +19,7 @@ from paddle.fluid.framework import core
 from paddle.fluid.framework import default_main_program
 from paddle.fluid.framework import Program
 from paddle.fluid import CPUPlace
-from paddle.fluid.io import save_inference_model
+from .paddle_io import save_inference_model, normalize_program
 import paddle.fluid as fluid
 from paddle.fluid.core import CipherUtils
 from paddle.fluid.core import CipherFactory
@@ -191,12 +191,14 @@ def save_model(server_model_folder,
    executor = Executor(place=CPUPlace())

    feed_var_names = [feed_var_dict[x].name for x in feed_var_dict]
+    feed_vars = [feed_var_dict[x] for x in feed_var_dict]
    target_vars = []
    target_var_names = []
    for key in sorted(fetch_var_dict.keys()):
        target_vars.append(fetch_var_dict[key])
        target_var_names.append(key)

+    main_program = normalize_program(main_program, feed_vars, target_vars)
    if not encryption and not show_proto:
        if not os.path.exists(server_model_folder):
            os.makedirs(server_model_folder)
@@ -209,7 +211,7 @@ def save_model(server_model_folder,
        new_params_path = os.path.join(server_model_folder, params_filename)

        with open(new_model_path, "wb") as new_model_file:
-            new_model_file.write(main_program.desc.serialize_to_string())
+            new_model_file.write(main_program._remove_training_info(False).desc.serialize_to_string())
        
        paddle.static.save_vars(
            executor=executor,
@@ -229,7 +231,7 @@ def save_model(server_model_folder,
        key = CipherUtils.gen_key_to_file(128, "key")
        params = fluid.io.save_persistables(
            executor=executor, dirname=None, main_program=main_program)
-        model = main_program.desc.serialize_to_string()
+        model = main_program._remove_training_info(False).desc.serialize_to_string()
        if not os.path.exists(server_model_folder):
            os.makedirs(server_model_folder)
        os.chdir(server_model_folder)

--- a/python/paddle_serving_client/io/paddle_io.py
+++ b/python/paddle_serving_client/io/paddle_io.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+
+import errno
+import inspect
+import logging
+import os
+import warnings
+import six
+import numpy as np
+
+import paddle
+from paddle.fluid import (
+    core,
+    Variable,
+    CompiledProgram,
+    default_main_program,
+    Program,
+    layers,
+    unique_name,
+    program_guard, )
+from paddle.fluid.io import prepend_feed_ops, append_fetch_ops
+from paddle.fluid.framework import static_only, Parameter
+from paddle.fluid.executor import Executor, global_scope
+from paddle.fluid.log_helper import get_logger
+
+__all__ = []
+
+_logger = get_logger(
+    __name__, logging.INFO, fmt='%(asctime)s-%(levelname)s: %(message)s')
+
+
+def _check_args(caller, args, supported_args=None, deprecated_args=None):
+    supported_args = [] if supported_args is None else supported_args
+    deprecated_args = [] if deprecated_args is None else deprecated_args
+    for arg in args:
+        if arg in deprecated_args:
+            raise ValueError(
+                "argument '{}' in function '{}' is deprecated, only {} are supported.".
+                format(arg, caller, supported_args))
+        elif arg not in supported_args:
+            raise ValueError(
+                "function '{}' doesn't support argument '{}',\n only {} are supported.".
+                format(caller, arg, supported_args))
+
+
+def _check_vars(name, var_list):
+    if not isinstance(var_list, list):
+        var_list = [var_list]
+    if not var_list or not all([isinstance(var, Variable) for var in var_list]):
+        raise ValueError(
+            "'{}' should be a Variable or a list of Variable.".format(name))
+
+
+def _normalize_path_prefix(path_prefix):
+    """
+    convert path_prefix to absolute path.
+    """
+    if not isinstance(path_prefix, six.string_types):
+        raise ValueError("'path_prefix' should be a string.")
+    if path_prefix.endswith("/"):
+        raise ValueError("'path_prefix' should not be a directory")
+    path_prefix = os.path.normpath(path_prefix)
+    path_prefix = os.path.abspath(path_prefix)
+    return path_prefix
+
+
+def _get_valid_program(program=None):
+    """
+    return default main program if program is None.
+    """
+    if program is None:
+        program = default_main_program()
+    elif isinstance(program, CompiledProgram):
+        program = program._program
+        if program is None:
+            raise TypeError(
+                "The type of input program is invalid, expected type is Program, but received None"
+            )
+        warnings.warn(
+            "The input is a CompiledProgram, this is not recommended.")
+    if not isinstance(program, Program):
+        raise TypeError(
+            "The type of input program is invalid, expected type is fluid.Program, but received %s"
+            % type(program))
+    return program
+
+
+def _clone_var_in_block(block, var):
+    assert isinstance(var, Variable)
+    if var.desc.type() == core.VarDesc.VarType.LOD_TENSOR:
+        return block.create_var(
+            name=var.name,
+            shape=var.shape,
+            dtype=var.dtype,
+            type=var.type,
+            lod_level=var.lod_level,
+            persistable=True)
+    else:
+        return block.create_var(
+            name=var.name,
+            shape=var.shape,
+            dtype=var.dtype,
+            type=var.type,
+            persistable=True)
+
+
+def normalize_program(program, feed_vars, fetch_vars):
+    """
+    :api_attr: Static Graph
+
+    Normalize/Optimize a program according to feed_vars and fetch_vars.
+
+    Args:
+        program(Program): Specify a program you want to optimize.
+        feed_vars(Variable | list[Variable]): Variables needed by inference.
+        fetch_vars(Variable | list[Variable]): Variables returned by inference.
+
+    Returns:
+        Program: Normalized/Optimized program.
+
+    Raises:
+        TypeError: If `program` is not a Program, an exception is thrown.
+        TypeError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
+        TypeError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
+
+    Examples:
+        .. code-block:: python
+
+            import paddle
+
+            paddle.enable_static()
+
+            path_prefix = "./infer_model"
+
+            # User defined network, here a softmax regression example
+            image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
+            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
+            predict = paddle.static.nn.fc(image, 10, activation='softmax')
+
+            loss = paddle.nn.functional.cross_entropy(predict, label)
+
+            exe = paddle.static.Executor(paddle.CPUPlace())
+            exe.run(paddle.static.default_startup_program())
+
+            # normalize main program.
+            program = paddle.static.default_main_program()
+            normalized_program = paddle.static.normalize_program(program, [image], [predict])
+
+    """
+    if not isinstance(program, Program):
+        raise TypeError(
+            "program type must be `fluid.Program`, but received `%s`" %
+            type(program))
+    if not isinstance(feed_vars, list):
+        feed_vars = [feed_vars]
+    if not all(isinstance(v, Variable) for v in feed_vars):
+        raise TypeError(
+            "feed_vars type must be a Variable or a list of Variable.")
+    if not isinstance(fetch_vars, list):
+        fetch_vars = [fetch_vars]
+    if not all(isinstance(v, Variable) for v in fetch_vars):
+        raise TypeError(
+            "fetch_vars type must be a Variable or a list of Variable.")
+
+    # remind users to set auc_states to 0 if auc op were found.
+    for op in program.global_block().ops:
+        # clear device of Op
+        device_attr_name = core.op_proto_and_checker_maker.kOpDeviceAttrName()
+        op._set_attr(device_attr_name, "")
+        if op.type == 'auc':
+            warnings.warn("Be sure that you have set auc states to 0 "
+                          "before saving inference model.")
+            break
+
+    # fix the bug that the activation op's output as target will be pruned.
+    # will affect the inference performance.
+    # TODO(Superjomn) add an IR pass to remove 1-scale op.
+    #with program_guard(program):
+    #    uniq_fetch_vars = []
+    #    for i, var in enumerate(fetch_vars):
+    #        if var.dtype != paddle.bool:
+    #            var = layers.scale(
+    #                var, 1., name="save_infer_model/scale_{}".format(i))
+    #        uniq_fetch_vars.append(var)
+    #    fetch_vars = uniq_fetch_vars
+
+    # serialize program
+    copy_program = program.clone()
+    global_block = copy_program.global_block()
+    remove_op_idx = []
+    for i, op in enumerate(global_block.ops):
+        op.desc.set_is_target(False)
+        if op.type == "feed" or op.type == "fetch":
+            remove_op_idx.append(i)
+    for idx in remove_op_idx[::-1]:
+        global_block._remove_op(idx)
+    copy_program.desc.flush()
+
+    feed_var_names = [var.name for var in feed_vars]
+    copy_program = copy_program._prune_with_input(
+        feeded_var_names=feed_var_names, targets=fetch_vars)
+    copy_program = copy_program._inference_optimize(prune_read_op=True)
+    fetch_var_names = [var.name for var in fetch_vars]
+    prepend_feed_ops(copy_program, feed_var_names)
+    append_fetch_ops(copy_program, fetch_var_names)
+    copy_program.desc._set_version()
+    return copy_program
+
+
+def is_persistable(var):
+    """
+    Check whether the given variable is persistable.
+
+    Args:
+        var(Variable): The variable to be checked.
+
+    Returns:
+        bool: True if the given `var` is persistable
+        False if not.
+
+    Examples:
+        .. code-block:: python
+
+            import paddle
+            import paddle.fluid as fluid
+
+            paddle.enable_static()
+            param = fluid.default_main_program().global_block().var('fc.b')
+            res = fluid.io.is_persistable(param)
+    """
+    if var.desc.type() == core.VarDesc.VarType.FEED_MINIBATCH or \
+                    var.desc.type() == core.VarDesc.VarType.FETCH_LIST or \
+                    var.desc.type() == core.VarDesc.VarType.READER:
+        return False
+    return var.persistable
+
+
+@static_only
+def serialize_program(feed_vars, fetch_vars, **kwargs):
+    """
+    :api_attr: Static Graph
+
+    Serialize default main program according to feed_vars and fetch_vars.
+
+    Args:
+        feed_vars(Variable | list[Variable]): Variables needed by inference.
+        fetch_vars(Variable | list[Variable]): Variables returned by inference.
+        kwargs: Supported keys include 'program'. Note that kwargs is mainly used for backward compatibility.
+          - program(Program): specify a program if you don't want to use default main program.
+
+    Returns:
+        bytes: serialized program.
+
+    Raises:
+        ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
+        ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
+
+    Examples:
+        .. code-block:: python
+
+            import paddle
+
+            paddle.enable_static()
+
+            path_prefix = "./infer_model"
+
+            # User defined network, here a softmax regression example
+            image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
+            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
+            predict = paddle.static.nn.fc(image, 10, activation='softmax')
+
+            loss = paddle.nn.functional.cross_entropy(predict, label)
+
+            exe = paddle.static.Executor(paddle.CPUPlace())
+            exe.run(paddle.static.default_startup_program())
+
+            # serialize the default main program to bytes.
+            serialized_program = paddle.static.serialize_program([image], [predict])
+
+            # deserialize bytes to program
+            deserialized_program = paddle.static.deserialize_program(serialized_program)
+
+    """
+    # verify feed_vars
+    _check_vars('feed_vars', feed_vars)
+    # verify fetch_vars
+    _check_vars('fetch_vars', fetch_vars)
+
+    program = _get_valid_program(kwargs.get('program', None))
+    program = normalize_program(program, feed_vars, fetch_vars)
+    return _serialize_program(program)
+
+
+def _serialize_program(program):
+    """
+    serialize given program to bytes.
+    """
+    return program.desc.serialize_to_string()
+
+
+@static_only
+def serialize_persistables(feed_vars, fetch_vars, executor, **kwargs):
+    """
+    :api_attr: Static Graph
+
+    Serialize parameters using given executor and default main program according to feed_vars and fetch_vars.
+
+    Args:
+        feed_vars(Variable | list[Variable]): Variables needed by inference.
+        fetch_vars(Variable | list[Variable]): Variables returned by inference.
+        kwargs: Supported keys include 'program'. Note that kwargs is mainly used for backward compatibility.
+          - program(Program): specify a program if you don't want to use default main program.
+
+    Returns:
+        bytes: serialized parameters.
+
+    Raises:
+        ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
+        ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
+
+    Examples:
+        .. code-block:: python
+
+            import paddle
+
+            paddle.enable_static()
+
+            path_prefix = "./infer_model"
+
+            # User defined network, here a softmax regression example
+            image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
+            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
+            predict = paddle.static.nn.fc(image, 10, activation='softmax')
+
+            loss = paddle.nn.functional.cross_entropy(predict, label)
+
+            exe = paddle.static.Executor(paddle.CPUPlace())
+            exe.run(paddle.static.default_startup_program())
+
+            # serialize parameters to bytes.
+            serialized_params = paddle.static.serialize_persistables([image], [predict], exe)
+
+            # deserialize bytes to parameters.
+            main_program = paddle.static.default_main_program()
+            deserialized_params = paddle.static.deserialize_persistables(main_program, serialized_params, exe)
+
+    """
+    # verify feed_vars
+    _check_vars('feed_vars', feed_vars)
+    # verify fetch_vars
+    _check_vars('fetch_vars', fetch_vars)
+
+    program = _get_valid_program(kwargs.get('program', None))
+    program = normalize_program(program, feed_vars, fetch_vars)
+    return _serialize_persistables(program, executor)
+
+
+def _serialize_persistables(program, executor):
+    """
+    Serialize parameters using given program and executor.
+    """
+    vars_ = list(filter(is_persistable, program.list_vars()))
+    # warn if no variable found in model
+    if len(vars_) == 0:
+        warnings.warn("no variable in your model, please ensure there are any "
+                      "variables in your model to save")
+        return None
+    # create a new program and clone persistable vars to it
+    save_program = Program()
+    save_block = save_program.global_block()
+    save_var_map = {}
+    for var in vars_:
+        if var.type != core.VarDesc.VarType.RAW:
+            var_copy = _clone_var_in_block(save_block, var)
+            save_var_map[var_copy.name] = var
+
+    # create in_vars and out_var, then append a save_combine op to save_program
+    in_vars = []
+    for name in sorted(save_var_map.keys()):
+        in_vars.append(save_var_map[name])
+
+    out_var_name = unique_name.generate("out_var")
+    out_var = save_block.create_var(
+        type=core.VarDesc.VarType.RAW, name=out_var_name)
+    out_var.desc.set_persistable(True)
+    save_block.append_op(
+        type='save_combine',
+        inputs={'X': in_vars},
+        outputs={'Y': out_var},
+        attrs={'file_path': '',
+               'save_to_memory': True})
+    # run save_program to save vars
+    # NOTE(zhiqiu): save op will add variable kLookupTablePath to save_program.desc,
+    # which leads to diff between save_program and its desc. Call _sync_with_cpp
+    # to keep consistency.
+    save_program._sync_with_cpp()
+    executor.run(save_program)
+    # return serialized bytes in out_var
+    return global_scope().find_var(out_var_name).get_bytes()
+
+
+def save_to_file(path, content):
+    """
+    Save content to given path.
+    Args:
+        path(str): Path to write content to.
+        content(bytes): Content to write.
+    Returns:
+        None
+    """
+
+    if not isinstance(content, bytes):
+        raise ValueError("'content' type should be bytes.")
+    with open(path, "wb") as f:
+        f.write(content)
+
+
+@static_only
+def save_inference_model(path_prefix, feed_vars, fetch_vars, executor,
+                         **kwargs):
+    """
+    :api_attr: Static Graph
+
+    Save current model and its parameters to given path. i.e.
+    Given path_prefix = "/path/to/modelname", after invoking
+    save_inference_model(path_prefix, feed_vars, fetch_vars, executor),
+    you will find two files named modelname.pdmodel and modelname.pdiparams
+    under "/path/to", which represent your model and parameters respectively.
+
+    Args:
+        path_prefix(str): Directory path to save model + model name without suffix.
+        feed_vars(Variable | list[Variable]): Variables needed by inference.
+        fetch_vars(Variable | list[Variable]): Variables returned by inference.
+        executor(Executor): The executor that saves the inference model. You can refer
+                            to :ref:`api_guide_executor_en` for more details.
+        kwargs: Supported keys including 'program' and "clip_extra". Attention please, kwargs is used for backward compatibility mainly.
+          - program(Program): specify a program if you don't want to use default main program.
+          - clip_extra(bool): set to True if you want to clip extra information for every operator.
+    Returns:
+        None
+
+    Raises:
+        ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
+        ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
+
+    Examples:
+        .. code-block:: python
+
+            import paddle
+
+            paddle.enable_static()
+
+            path_prefix = "./infer_model"
+
+            # User defined network, here a softmax regression example
+            image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
+            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
+            predict = paddle.static.nn.fc(image, 10, activation='softmax')
+
+            loss = paddle.nn.functional.cross_entropy(predict, label)
+
+            exe = paddle.static.Executor(paddle.CPUPlace())
+            exe.run(paddle.static.default_startup_program())
+
+            # Feed data and train process
+
+            # Save inference model. Note we don't save label and loss in this example
+            paddle.static.save_inference_model(path_prefix, [image], [predict], exe)
+
+            # In this example, save_inference_model will prune the default
+            # main program according to the network's input node (img) and output node(predict).
+            # The pruned inference program is going to be saved in file "./infer_model.pdmodel"
+            # and parameters are going to be saved in file "./infer_model.pdiparams".
+
+    """
+
+    # check path_prefix, set model_path and params_path
+    path_prefix = _normalize_path_prefix(path_prefix)
+    try:
+        # mkdir may conflict if pserver and trainer are running on the same machine
+        dirname = os.path.dirname(path_prefix)
+        os.makedirs(dirname)
+    except OSError as e:
+        if e.errno != errno.EEXIST:
+            raise
+    model_path = path_prefix + ".pdmodel"
+    params_path = path_prefix + ".pdiparams"
+    if os.path.isdir(model_path):
+        raise ValueError("'{}' is an existing directory.".format(model_path))
+    if os.path.isdir(params_path):
+        raise ValueError("'{}' is an existing directory.".format(params_path))
+
+    # verify feed_vars
+    _check_vars('feed_vars', feed_vars)
+    # verify fetch_vars
+    _check_vars('fetch_vars', fetch_vars)
+
+    program = _get_valid_program(kwargs.get('program', None))
+    clip_extra = kwargs.get('clip_extra', False)
+    program = normalize_program(program, feed_vars, fetch_vars)
+    # serialize and save program
+    program_bytes = _serialize_program(
+        program._remove_training_info(clip_extra=clip_extra))
+    save_to_file(model_path, program_bytes)
+    # serialize and save params
+    params_bytes = _serialize_persistables(program, executor)
+    save_to_file(params_path, params_bytes)
+
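
The path handling in `save_inference_model` above (validate `path_prefix`, make it absolute, then append `.pdmodel` and `.pdiparams`) can be tried without Paddle installed. The sketch below restates those steps with the standard library only; `inference_paths` is a hypothetical name and does not exist in `paddle_io.py`.

```python
import os


def inference_paths(path_prefix):
    """Return (model_path, params_path) for a prefix, mirroring the checks above.

    Hypothetical stdlib-only helper; it writes nothing to disk.
    """
    if not isinstance(path_prefix, str):
        raise ValueError("'path_prefix' should be a string.")
    # A trailing slash means a directory was passed instead of "dir/modelname".
    if path_prefix.endswith("/"):
        raise ValueError("'path_prefix' should not be a directory")
    prefix = os.path.abspath(os.path.normpath(path_prefix))
    return prefix + ".pdmodel", prefix + ".pdiparams"


model_path, params_path = inference_paths("./infer_model")
print(model_path)
print(params_path)
```

Both returned paths are absolute and share the same directory, which is the directory `save_inference_model` creates before writing the two files.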