Merge branch 'PaddlePaddle:develop' into develop

c168938e · TeslaZhao · GitHub · 505aef76 · 3a670684 · c168938e
7 changed file
--- a/doc/C++Serving/Introduction_CN.md
+++ b/doc/C++Serving/Introduction_CN.md
@@ -41,7 +41,7 @@ Server端的核心是一个由项目代码编译产生的名称为serving的二
 <img src='../images/syn_mode.png' width = "350" height = "300">
 <p>
 异步模型主要适用于模型支持多batch(最大batch数M可通过配置选项指定),单个Request请求的batch较小(batch << M)，单次预测时间较长的情况。
-异步模型下，Server端N个线程只负责接收Request请求，实际调用预测引擎是在异步框架的线程中，异步框架的线程数可以由配置选项来指定。为了方便理解，我们假设每个Request请求的batch均为1，此时异步框架会尽可能多得从请求池中取n(n≤M)个Request并将其拼装为1个Request(batch=n)，调用1次预测引擎，得到1个Response(batch = n)，再将其对应拆分为n个Response作为返回结果。
+异步模型下，Server端N个线程只负责接收Request请求，实际调用预测引擎是在异步框架的线程池中，异步框架的线程数可以由配置选项来指定。为了方便理解，我们假设每个Request请求的batch均为1，此时异步框架会尽可能多得从请求池中取n(n≤M)个Request并将其拼装为1个Request(batch=n)，调用1次预测引擎，得到1个Response(batch = n)，再将其对应拆分为n个Response作为返回结果。
 <p align="center">
 <img src='../images/asyn_mode.png'">
 <p>

--- a/doc/C++Serving/Performance_Tuning_CN.md
+++ b/doc/C++Serving/Performance_Tuning_CN.md
-待填写！
+# C++ Serving性能分析与优化
+# 1.背景知识介绍
+1) 首先，应确保您知道C++ Serving常用的一些[功能特点](Introduction_CN.md)和[C++ Serving 参数配置和启动的详细说明](../SERVING_CONFIGURE_CN.md。
+2) 关于C++ Serving框架本身的性能分析和介绍，请参考[C++ Serving框架性能测试](Frame_Performance_CN.md)。
+3) 您需要对您使用的模型、机器环境、需要部署上线的业务有一些了解，例如，您使用CPU还是GPU进行预测；是否可以开启TRT进行加速；你的机器CPU是多少core的；您的业务包含几个模型；每个模型的输入和输出需要做些什么处理；您业务的最大线上流量是多少；您的模型支持的最大输入batch是多少等等.
+# 2.Server线程数
+首先，Server端线程数N并不是越大越好。众所周知，线程的切换涉及到用户空间和内核空间的切换，有一定的开销，当您的core数=1，而线程数为100000时，线程的频繁切换将带来不可忽视的性能开销。
+在BRPC框架中，用户态协程worker数M >> 线程数N，用户态协程worker会工作在任意一个线程中，当RPC网络传输IO操作让出CPU资源时，BRPC会进行用户态协程worker的切换从而提高RPC框架的并发性。所以，极端情况下，若您的代码中除RPC通信外，没有阻塞线程的任何IO或网络操作，您的线程数完全可以 == 机器core数量，您不必担心N个线程都在进行RPC网络IO，而导致CPU利用率不高的问题。
+Server端<mark>**线程数N**</mark>的设置需要结合三个因素来综合考虑：
+## 2.1 最大并发请求量M
+根据最大并发请求量来设置Server端线程数N，根据[C++ Serving框架性能测试](Frame_Performance_CN.md)中的数据来看，此时<mark>**线程数N应等于或略小于最大并发请求量M**</mark>，此时平均处理时延最小。
+这也很容易理解，举个极端的例子，如果您每次只有1个请求，那此时Server端线程数设置1是最合理的，因为此时没有任何线程切换的开销。如果您设置线程数为任何大于1的数，必然就带来了线程切换的开销。
+## 2.2 机器core数量C
+根据机器core数量来设置Server端线程数N，众所周知，线程是CPU core调度执行的最小单元，若要在一个进程内充分使用所有的core，<mark>**线程数至少应该>=机器core数量C**</mark>,但具体线程数N/机器core数量C = ?需要您根据您的代码中网络、IO、内存和计算所占用的比例来决定，一般用户可以通过设置不同的线程数来测试CPU占用率来不断调整。
+## 2.3 模型预测时间长短T
+当您使用CPU进行预测时，预测阶段的计算是使用CPU完成的，此时，请参考前两者来进行设置线程数。
+当您使用GPU进行预测时，情况有些不同，此时预测阶段的计算是由GPU完成的，此时CPU资源是空闲的，而预测操作是阻塞该线程的，类似于Sleep操作，此时若您的线程数==机器core数量，将没有其他可切换的线程从而导致必然有部分core是空闲的状态。具体来说，当模型预测时间较短时（<10ms），Server端线程数不宜过多（线程数=1~10倍core数量），否则线程切换带来的开销不可忽视。当模型预测时间较长时，Server端线程数应稍大一些（线程数=4~200倍core数量）。
+# 3.异步模式
+当<mark>**大部分用户的Request请求batch数<<模型最大支持的Batch数**</mark>时，采用异步模式的收益是明显的。
+异步模型的原理是将模型预测阶段与RPC线程脱离，模型单独开辟一个线程数可指定的线程池，RPC收到Request后将请求数据放入模型的线程池中的Task队列中，线程池中的线程从Task中取出数据合并Batch后进行预测，从而提升QPS，更多详细的介绍见[C++Serving功能简介](Introduction_CN.md)，同步模式与异步模式的数据对比见[C++ Serving vs TensorFlow Serving 性能对比](Benchmark_CN.md)，在上述测试的条件下，异步模型比同步模式快百分50%。
+异步模式的开启有以下两种方式。
+## 3.1 Python命令辅助启动C++Server
+`python3 -m paddle_serving_server.serve`通过添加`--runtime_thread_num 2`指定该模型开启异步模式，其中2表示的是该模型异步线程池中的线程数为2，该数值默认值为0，此时表示不使用异步模式。`--runtime_thread_num`的具体数值设置根据模型、数据和显卡的可用显存来设置。
+通过添加`--batch_infer_size 32`来设置模型最大允许Batch == 32 的输入，此参数只有在异步模型开启的状态下，才有效。
+## 3.2 命令行+配置文件启动C++Server
+此时通过修改`model_toolkit.prototxt`中的`runtime_thread_num`字段和`batch_infer_size`字段同样能达到上述效果。
+# 4.多模型组合
+当<mark>**您的业务中需要调用多个模型进行预测**</mark>时，如果您追求极致的性能，您可以考虑使用C++Serving[自定义OP](OP_CN.md)和[自定义DAG图](DAG_CN.md)的方式来实现上述需求。
+## 4.1 优点
+由于在一个服务中做模型的组合，节省了网络IO的时间和序列化反序列化的时间，尤其当数据量比较大时，收益十分明显(实测单次传输40MB数据时，RPC耗时为160-170ms)。
+## 4.2 缺点
+1) 需要使用C++去自定义OP和自定义DAG图去定义模型之间的组合关系。
+2) 若多个模型之间需要前后处理，您也需要使用C++在OP之间去编写这部分代码。
+3) 需要重新编译Server端代码。
+## 4.3 示例
+请参考[examples/C++/PaddleOCR/ocr/README_CN.md](../../examples/C++/PaddleOCR/ocr/README_CN.md)中`C++ OCR Service服务章节`和[Paddle Serving中的集成预测](Model_Ensemble_CN.md)中的例子。
--- a/doc/SERVING_CONFIGURE.md
+++ b/doc/SERVING_CONFIGURE.md
--- a/doc/SERVING_CONFIGURE_CN.md
+++ b/doc/SERVING_CONFIGURE_CN.md
+# Serving Configuration
+(简体中文|[English](SERVING_CONFIGURE.md))
+## 简介
+本文主要介绍C++ Serving以及Python Pipeline的各项配置:
+- [模型配置文件](#模型配置文件): 转换模型时自动生成，描述模型输入输出信息
+- [C++ Serving](#c-serving): 用于高性能场景，介绍了快速启动以及自定义配置方法
+- [Python Pipeline](#python-pipeline): 用于单算子多模型组合场景
+## 模型配置文件
+在开始介绍Server配置之前，先来介绍一下模型配置文件。我们在将模型转换为PaddleServing模型时，会生成对应的serving_client_conf.prototxt以及serving_server_conf.prototxt，两者内容一致，为模型输入输出的参数信息，方便用户拼装参数。该配置文件用于Server以及Client，并不需要用户自行修改。转换方法参考文档《[怎样保存用于Paddle Serving的模型](SAVE_CN.md)》。protobuf格式可参考`core/configure/proto/general_model_config.proto`。
+样例如下：
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "concat_1.tmp_0"
+  alias_name: "concat_1.tmp_0"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 3
+  shape: 640
+  shape: 640
+}
+```
+其中
+- feed_var：模型输入
+- fetch_var：模型输出
+- name：名称
+- alias_name：别名，与名称对应
+- is_lod_tensor：是否为lod，具体可参考《[Lod字段说明](LOD_CN.md)》
+- feed_type：数据类型
+|feed_type|类型|
+|---------|----|
+|0|INT64|
+|1|FLOAT32|
+|2|INT32|
+|3|FP64|
+|4|INT16|
+|5|FP16|
+|6|BF16|
+|7|UINT8|
+|8|INT8|
+- shape：数据维度
+## C++ Serving
+### 1.快速启动
+可以通过配置模型及端口号快速启动服务，启动命令如下：
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model --port 9393
+```
+该命令会自动生成配置文件，并使用生成的配置文件启动C++ Serving。例如上述启动命令会自动生成workdir_9393目录，其结构如下
+```
+workdir_9393
+├── general_infer_0
+│   ├── fluid_time_file
+│   ├── general_model.prototxt
+│   └── model_toolkit.prototxt
+├── infer_service.prototxt
+├── resource.prototxt
+└── workflow.prototxt
+```
+更多启动参数详见下表：
+| Argument                                       | Type | Default | Description                                           |
+| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
+| `thread`                                       | int  | `2`     | Number of brpc service thread                         |
+| `op_num`                                       | int[]| `0`     | Thread Number for each model in asynchronous mode     |
+| `op_max_batch`                                 | int[]| `32`    | Batch Number for each model in asynchronous mode      |
+| `gpu_ids`                                      | str[]| `"-1"`  | Gpu card id for each model                            |
+| `port`                                         | int  | `9292`  | Exposed port of current service to users              |
+| `model`                                        | str[]| `""`    | Path of paddle model directory to be served           |
+| `mem_optim_off`                                | -    | -       | Disable memory / graphic memory optimization          |
+| `ir_optim`                                     | bool | False   | Enable analysis and optimization of calculation graph |
+| `use_mkl` (Only for cpu version)               | -    | -       | Run inference with MKL                                |
+| `use_trt` (Only for trt version)               | -    | -       | Run inference with TensorRT. Need open with ir_optim.                           |
+| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -    | -       | Run PaddleLite inference. Need open with ir_optim.                              |
+| `use_xpu`                                      | -    | -       | Run PaddleLite inference with Baidu Kunlun XPU. Need open with ir_optim.        |
+| `precision`                                    | str  | FP32    | Precision Mode, support FP32, FP16, INT8              |
+| `use_calib`                                    | bool | False   | Use TRT int8 calibration                              |
+| `gpu_multi_stream`                             | bool | False   | EnableGpuMultiStream to get larger QPS                |
+#### 当您的某个模型想使用多张GPU卡部署时.
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
+```
+#### 当您的一个服务包含两个模型部署时.
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
+```
+### 2.自定义配置启动
+一般情况下，自动生成的配置可以应对大部分场景。对于特殊场景，用户也可自行定义配置文件。这些配置文件包括service.prototxt、workflow.prototxt、resource.prototxt、model_toolkit.prototxt、proj.conf。启动命令如下:
+```BASH
+/bin/serving --flagfile=proj.conf
+```
+#### 2.1 proj.conf
+proj.conf用于传入服务参数，并指定了其他相关配置文件的路径。如果重复传入参数，则以最后序参数值为准。
+```
+# for paddle inference
+--precision=fp32
+--use_calib=False
+--reload_interval_s=10
+# for brpc
+--max_concurrency=0
+--num_threads=10
+--bthread_concurrency=10
+--max_body_size=536870912
+# default path
+--inferservice_path=conf
+--inferservice_file=infer_service.prototxt
+--resource_path=conf
+--resource_file=resource.prototxt
+--workflow_path=conf
+--workflow_file=workflow.prototxt
+```
+各项参数的描述及默认值详见下表：
+| name | Default | Description |
+|------|--------|------|
+|precision|"fp32"|Precision Mode, support FP32, FP16, INT8|
+|use_calib|False|Only for deployment with TensorRT|
+|reload_interval_s|10|Reload interval|
+|max_concurrency|0|Limit of request processing in parallel, 0: unlimited|
+|num_threads|10|Number of brpc service thread|
+|bthread_concurrency|10|Number of bthread|
+|max_body_size|536870912|Max size of brpc message|
+|inferservice_path|"conf"|Path of inferservice conf|
+|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
+|resource_path|"conf"|Path of resource conf|
+|resource_file|"resource.prototxt"|Filename of resource conf|
+|workflow_path|"conf"|Path of workflow conf|
+|workflow_file|"workflow.prototxt"|Filename of workflow conf|
+#### 2.2 service.prototxt
+service.prototxt用于配置Paddle Serving实例挂载的service列表。通过`--inferservice_path`和`--inferservice_file`指定加载路径。protobuf格式可参考`core/configure/server_configure.protobuf`的`InferServiceConf`。示例如下：
+```
+port: 8010
+services {
+  name: "GeneralModelService"
+  workflows: "workflow1"
+}
+```
+其中：
+- port: 用于配置Serving实例监听的端口号。
+- services: 使用默认配置即可，不可修改。name指定service名称，workflow1的具体定义在workflow.prototxt
+#### 2.3 workflow.prototxt
+workflow.prototxt用来描述具体的workflow。通过`--workflow_path`和`--workflow_file`指定加载路径。protobuf格式可参考`configure/server_configure.protobuf`的`Workflow`类型。
+如下示例，workflow由3个OP构成，GeneralReaderOp用于读取数据，GeneralInferOp依赖于GeneralReaderOp并进行预测，GeneralResponseOp将预测结果返回：
+```
+workflows {
+  name: "workflow1"
+  workflow_type: "Sequence"
+  nodes {
+    name: "general_reader_0"
+    type: "GeneralReaderOp"
+  }
+  nodes {
+    name: "general_infer_0"
+    type: "GeneralInferOp"
+    dependencies {
+      name: "general_reader_0"
+      mode: "RO"
+    }
+  }
+  nodes {
+    name: "general_response_0"
+    type: "GeneralResponseOp"
+    dependencies {
+      name: "general_infer_0"
+      mode: "RO"
+    }
+  }
+}
+```
+其中：
+- name: workflow名称，用于从service.prototxt索引到具体的workflow
+- workflow_type: 只支持"Sequence"
+- nodes: 用于串联成workflow的所有节点，可配置多个nodes。nodes间通过配置dependencies串联起来
+- node.name: 与node.type一一对应，具体可参考`python/paddle_serving_server/dag.py`
+- node.type: 当前node所执行OP的类名称，与serving/op/下每个具体的OP类的名称对应
+- node.dependencies: 依赖的上游node列表
+- node.dependencies.name: 与workflow内节点的name保持一致
+- node.dependencies.mode: RO-Read Only, RW-Read Write
+#### 2.4 resource.prototxt
+resource.prototxt，用于指定模型配置文件。通过`--resource_path`和`--resource_file`指定加载路径。它的protobuf格式参考`core/configure/proto/server_configure.proto`的`ResourceConf`。示例如下：
+```
+model_toolkit_path: "conf"
+model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
+general_model_path: "conf"
+general_model_file: "general_infer_0/general_model.prototxt"
+```
+其中：
+- model_toolkit_path:用来指定model_toolkit.prototxt所在的目录
+- model_toolkit_file: 用来指定model_toolkit.prototxt所在的文件名
+- general_model_path: 用来指定general_model.prototxt所在的目录
+- general_model_file: 用来指定general_model.prototxt所在的文件名
+#### 2.5 model_toolkit.prototxt
+用来配置模型信息和预测引擎。它的protobuf格式参考`core/configure/proto/server_configure.proto`的ModelToolkitConf。model_toolkit.protobuf的磁盘路径不能通过命令行参数覆盖。示例如下：
+```
+engines {
+  name: "general_infer_0"
+  type: "PADDLE_INFER"
+  reloadable_meta: "uci_housing_model/fluid_time_file"
+  reloadable_type: "timestamp_ne"
+  model_dir: "uci_housing_model"
+  gpu_ids: -1
+  enable_memory_optimization: true
+  enable_ir_optimization: false
+  use_trt: false
+  use_lite: false
+  use_xpu: false
+  use_gpu: false
+  combined_model: false
+  gpu_multi_stream: false
+  runtime_thread_num: 0
+  batch_infer_size: 32
+  enable_overrun: false
+  allow_split_request: true
+}
+```
+其中
+- name: 引擎名称，与workflow.prototxt中的node.name以及所在目录名称对应
+- type: 预测引擎的类型。当前只支持”PADDLE_INFER“
+- reloadable_meta: 目前实际内容无意义，用来通过对该文件的mtime判断是否超过reload时间阈值
+- reloadable_type: 检查reload条件：timestamp_ne/timestamp_gt/md5sum/revision/none
+|reloadable_type|含义|
+|---------------|----|
+|timestamp_ne|reloadable_meta所指定文件的mtime时间戳发生变化|
+|timestamp_gt|reloadable_meta所指定文件的mtime时间戳大于等于上次检查时记录的mtime时间戳|
+|md5sum|目前无用，配置后永远不reload|
+|revision|目前无用，配置后用于不reload|
+- model_dir: 模型文件路径
+- gpu_ids: 引擎运行时使用的GPU device id，支持指定多个，如：
+```
+# 指定GPU0，1，2
+gpu_ids: 0
+gpu_ids: 1
+gpu_ids: 2
+```
+- enable_memory_optimization: 是否开启memory优化
+- enable_ir_optimization: 是否开启ir优化
+- use_trt: 是否开启TensorRT，需同时开启use_gpu
+- use_lite: 是否开启PaddleLite
+- use_xpu: 是否使用昆仑XPU
+- use_gpu:是否使用GPU
+- combined_model: 是否使用组合模型文件
+- gpu_multi_stream: 是否开启gpu多流模式
+- runtime_thread_num: 若大于0， 则启用Async异步模式，并创建对应数量的predictor实例。
+- batch_infer_size: Async异步模式下的最大batch数
+- enable_overrun: Async异步模式下总是将整个任务放入任务队列
+- allow_split_request: Async异步模式下允许拆分任务
+#### 2.6 general_model.prototxt
+general_model.prototxt内容与模型配置serving_server_conf.prototxt相同，用了描述模型输入输出参数信息。示例如下：
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "fc_0.tmp_1"
+  alias_name: "price"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 1
+}
+```
+## Python Pipeline
+Python Pipeline提供了用户友好的多模型组合服务编程框架，适用于多模型组合应用的场景。
+其配置文件为YAML格式，一般默认为config.yaml。示例如下：
+```YAML
+#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时，会自动将rpc_port设置为http_port+1
+rpc_port: 18090
+#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时，不自动生成http_port
+http_port: 9999
+#worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程，每个进程内构建grpcSever和DAG
+##当build_dag_each_worker=False时，框架会设置主线程grpc线程池的max_workers=worker_num
+worker_num: 20
+#build_dag_each_worker, False，框架在进程内创建一条DAG；True，框架会每个进程内创建多个独立的DAG
+build_dag_each_worker: false
+dag:
+    #op资源类型, True, 为线程模型；False，为进程模型
+    is_thread_op: False
+    #重试次数
+    retry: 1
+    #使用性能分析, True，生成Timeline性能数据，对性能有一定影响；False为不使用
+    use_profile: false
+    tracer:
+        interval_s: 10
+op:
+    det:
+        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
+        concurrency: 6
+        #当op配置没有server_endpoints时，从local_service_conf读取本地服务配置
+        local_service_conf:
+            #client类型，包括brpc, grpc和local_predictor.local_predictor不启动Serving服务，进程内预测
+            client_type: local_predictor
+            #det模型路径
+            model_config: ocr_det_model
+            #Fetch结果列表，以client_config中fetch_var的alias_name为准
+            fetch_list: ["concat_1.tmp_0"]
+            #计算硬件ID，当devices为""或不写时为CPU预测；当devices为"0", "0,1,2"时为GPU预测，表示使用的GPU卡
+            devices: ""
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 0
+            #use_mkldnn
+            #use_mkldnn: True
+            #ir_optim
+            ir_optim: True
+    rec:
+        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
+        concurrency: 3
+        #超时时间, 单位ms
+        timeout: -1
+        #Serving交互重试次数，默认不重试
+        retry: 1
+        #当op配置没有server_endpoints时，从local_service_conf读取本地服务配置
+        local_service_conf:
+            #client类型，包括brpc, grpc和local_predictor。local_predictor不启动Serving服务，进程内预测
+            client_type: local_predictor
+            #rec模型路径
+            model_config: ocr_rec_model
+            #Fetch结果列表，以client_config中fetch_var的alias_name为准
+            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]
+            #计算硬件ID，当devices为""或不写时为CPU预测；当devices为"0", "0,1,2"时为GPU预测，表示使用的GPU卡
+            devices: ""
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 0
+            #use_mkldnn
+            #use_mkldnn: True
+            #ir_optim
+            ir_optim: True
+```
+### 单机多卡
+单机多卡推理，M个OP进程与N个GPU卡绑定，需要在config.ymal中配置3个参数。首先选择进程模式，这样并发数即进程数，然后配置devices。绑定方法是进程启动时遍历GPU卡ID，例如启动7个OP进程，设置了0，1，2三个device id，那么第1、4、7个启动的进程与0卡绑定，第2、5进程与1卡绑定，3、6进程与卡2绑定。
+```YAML
+#op资源类型, True, 为线程模型；False，为进程模型
+is_thread_op: False
+#并发数，is_thread_op=True时，为线程并发；否则为进程并发
+concurrency: 7
+devices: "0,1,2"
+```
+### 异构硬件
+Python Pipeline除了支持CPU、GPU之外，还支持多种异构硬件部署。在config.yaml中由device_type和devices控制。优先使用device_type指定，当其空缺时根据devices自动判断类型。device_type描述如下：
+- CPU(Intel) : 0
+- GPU : 1
+- TensorRT : 2
+- CPU(Arm) : 3
+- XPU : 4
+config.yml中硬件配置：
+```YAML
+#计算硬件类型: 空缺时由devices决定(CPU/GPU)，0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+device_type: 0
+#计算硬件ID，优先由device_type决定硬件类型。devices为""或空缺时为CPU预测；当为"0", "0,1,2"时为GPU预测，表示使用的GPU卡
+devices: "" # "0,1"
+```
+### 低精度推理
+Python Pipeline支持低精度推理，CPU、GPU和TensoRT支持的精度类型如下所示：
+- CPU
+  - fp32(default)
+  - fp16
+  - bf16(mkldnn)
+- GPU
+  - fp32(default)
+  - fp16(TRT下有效)
+  - int8
+- Tensor RT
+  - fp32(default)
+  - fp16
+  - int8 
+```YAML
+#precsion, 预测精度，降低预测精度可提升预测速度
+#GPU 支持: "fp32"(default), "fp16(TensorRT)", "int8"；
+#CPU 支持: "fp32"(default), "fp16", "bf16"(mkldnn); 不支持: "int8"
+precision: "fp32"
+```
\ No newline at end of file
--- a/examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco/test_client.py
+++ b/examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco/test_client.py
@@ -19,9 +19,9 @@ import cv2
 preprocess = DetectionSequential([
    DetectionFile2Image(), DetectionResize(
-        (512, 512), False, interpolation=cv2.INTER_LINEAR), DetectionNormalize(
+        (512, 512), False, interpolation=cv2.INTER_LINEAR),
-            [123.675, 116.28, 103.53], [58.395, 57.12, 57.375], False),
+    DetectionNormalize([123.675, 116.28, 103.53], [58.395, 57.12, 57.375],
-    DetectionTranspose((2, 0, 1))
+                       False), DetectionTranspose((2, 0, 1))
 ])
 postprocess = RCNNPostprocess("label_list.txt", "output")

--- a/python/paddle_serving_client/io/__init__.py
+++ b/python/paddle_serving_client/io/__init__.py
@@ -19,7 +19,7 @@ from paddle.fluid.framework import core
 from paddle.fluid.framework import default_main_program
 from paddle.fluid.framework import Program
 from paddle.fluid import CPUPlace
-from paddle.fluid.io import save_inference_model
+from .paddle_io import save_inference_model, normalize_program
 import paddle.fluid as fluid
 from paddle.fluid.core import CipherUtils
 from paddle.fluid.core import CipherFactory
@@ -191,12 +191,14 @@ def save_model(server_model_folder,
    executor = Executor(place=CPUPlace())
    feed_var_names = [feed_var_dict[x].name for x in feed_var_dict]
+    feed_vars = [feed_var_dict[x] for x in feed_var_dict]
    target_vars = []
    target_var_names = []
    for key in sorted(fetch_var_dict.keys()):
        target_vars.append(fetch_var_dict[key])
        target_var_names.append(key)
+    main_program = normalize_program(main_program, feed_vars, target_vars)
    if not encryption and not show_proto:
        if not os.path.exists(server_model_folder):
            os.makedirs(server_model_folder)
@@ -209,7 +211,7 @@ def save_model(server_model_folder,
        new_params_path = os.path.join(server_model_folder, params_filename)
        with open(new_model_path, "wb") as new_model_file:
-            new_model_file.write(main_program.desc.serialize_to_string())
+            new_model_file.write(main_program._remove_training_info(False).desc.serialize_to_string())
        paddle.static.save_vars(
            executor=executor,
@@ -229,7 +231,7 @@ def save_model(server_model_folder,
        key = CipherUtils.gen_key_to_file(128, "key")
        params = fluid.io.save_persistables(
            executor=executor, dirname=None, main_program=main_program)
-        model = main_program.desc.serialize_to_string()
+        model = main_program._remove_training_info(False).desc.serialize_to_string()
        if not os.path.exists(server_model_folder):
            os.makedirs(server_model_folder)
        os.chdir(server_model_folder)

--- a/python/paddle_serving_client/io/paddle_io.py
+++ b/python/paddle_serving_client/io/paddle_io.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import print_function
+import errno
+import inspect
+import logging
+import os
+import warnings
+import six
+import numpy as np
+import paddle
+from paddle.fluid import (
+    core,
+    Variable,
+    CompiledProgram,
+    default_main_program,
+    Program,
+    layers,
+    unique_name,
+    program_guard, )
+from paddle.fluid.io import prepend_feed_ops, append_fetch_ops
+from paddle.fluid.framework import static_only, Parameter
+from paddle.fluid.executor import Executor, global_scope
+from paddle.fluid.log_helper import get_logger
+__all__ = []
+_logger = get_logger(
+    __name__, logging.INFO, fmt='%(asctime)s-%(levelname)s: %(message)s')
+def _check_args(caller, args, supported_args=None, deprecated_args=None):
+    supported_args = [] if supported_args is None else supported_args
+    deprecated_args = [] if deprecated_args is None else deprecated_args
+    for arg in args:
+        if arg in deprecated_args:
+            raise ValueError(
+                "argument '{}' in function '{}' is deprecated, only {} are supported.".
+                format(arg, caller, supported_args))
+        elif arg not in supported_args:
+            raise ValueError(
+                "function '{}' doesn't support argument '{}',\n only {} are supported.".
+                format(caller, arg, supported_args))
+def _check_vars(name, var_list):
+    if not isinstance(var_list, list):
+        var_list = [var_list]
+    if not var_list or not all([isinstance(var, Variable) for var in var_list]):
+        raise ValueError(
+            "'{}' should be a Variable or a list of Variable.".format(name))
+def _normalize_path_prefix(path_prefix):
+    """
+    convert path_prefix to absolute path.
+    """
+    if not isinstance(path_prefix, six.string_types):
+        raise ValueError("'path_prefix' should be a string.")
+    if path_prefix.endswith("/"):
+        raise ValueError("'path_prefix' should not be a directory")
+    path_prefix = os.path.normpath(path_prefix)
+    path_prefix = os.path.abspath(path_prefix)
+    return path_prefix
+def _get_valid_program(program=None):
+    """
+    return default main program if program is None.
+    """
+    if program is None:
+        program = default_main_program()
+    elif isinstance(program, CompiledProgram):
+        program = program._program
+        if program is None:
+            raise TypeError(
+                "The type of input program is invalid, expected tyep is Program, but received None"
+            )
+        warnings.warn(
+            "The input is a CompiledProgram, this is not recommended.")
+    if not isinstance(program, Program):
+        raise TypeError(
+            "The type of input program is invalid, expected type is fluid.Program, but received %s"
+            % type(program))
+    return program
+def _clone_var_in_block(block, var):
+    assert isinstance(var, Variable)
+    if var.desc.type() == core.VarDesc.VarType.LOD_TENSOR:
+        return block.create_var(
+            name=var.name,
+            shape=var.shape,
+            dtype=var.dtype,
+            type=var.type,
+            lod_level=var.lod_level,
+            persistable=True)
+    else:
+        return block.create_var(
+            name=var.name,
+            shape=var.shape,
+            dtype=var.dtype,
+            type=var.type,
+            persistable=True)
+def normalize_program(program, feed_vars, fetch_vars):
+    """
+    :api_attr: Static Graph
+    Normalize/Optimize a program according to feed_vars and fetch_vars.
+    Args:
+        program(Program): Specify a program you want to optimize.
+        feed_vars(Variable | list[Variable]): Variables needed by inference.
+        fetch_vars(Variable | list[Variable]): Variables returned by inference.
+    Returns:
+        Program: Normalized/Optimized program.
+    Raises:
+        TypeError: If `program` is not a Program, an exception is thrown.
+        TypeError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
+        TypeError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
+    Examples:
+        .. code-block:: python
+            import paddle
+            paddle.enable_static()
+            path_prefix = "./infer_model"
+            # User defined network, here a softmax regession example
+            image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
+            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
+            predict = paddle.static.nn.fc(image, 10, activation='softmax')
+            loss = paddle.nn.functional.cross_entropy(predict, label)
+            exe = paddle.static.Executor(paddle.CPUPlace())
+            exe.run(paddle.static.default_startup_program())
+            # normalize main program.
+            program = paddle.static.default_main_program()
+            normalized_program = paddle.static.normalize_program(program, [image], [predict])
+    """
+    if not isinstance(program, Program):
+        raise TypeError(
+            "program type must be `fluid.Program`, but received `%s`" %
+            type(program))
+    if not isinstance(feed_vars, list):
+        feed_vars = [feed_vars]
+    if not all(isinstance(v, Variable) for v in feed_vars):
+        raise TypeError(
+            "feed_vars type must be a Variable or a list of Variable.")
+    if not isinstance(fetch_vars, list):
+        fetch_vars = [fetch_vars]
+    if not all(isinstance(v, Variable) for v in fetch_vars):
+        raise TypeError(
+            "fetch_vars type must be a Variable or a list of Variable.")
+    # remind users to set auc_states to 0 if auc op were found.
+    for op in program.global_block().ops:
+        # clear device of Op
+        device_attr_name = core.op_proto_and_checker_maker.kOpDeviceAttrName()
+        op._set_attr(device_attr_name, "")
+        if op.type == 'auc':
+            warnings.warn("Be sure that you have set auc states to 0 "
+                          "before saving inference model.")
+            break
+    # fix the bug that the activation op's output as target will be pruned.
+    # will affect the inference performance.
+    # TODO(Superjomn) add an IR pass to remove 1-scale op.
+    #with program_guard(program):
+    #    uniq_fetch_vars = []
+    #    for i, var in enumerate(fetch_vars):
+    #        if var.dtype != paddle.bool:
+    #            var = layers.scale(
+    #                var, 1., name="save_infer_model/scale_{}".format(i))
+    #        uniq_fetch_vars.append(var)
+    #    fetch_vars = uniq_fetch_vars
+    # serialize program
+    copy_program = program.clone()
+    global_block = copy_program.global_block()
+    remove_op_idx = []
+    for i, op in enumerate(global_block.ops):
+        op.desc.set_is_target(False)
+        if op.type == "feed" or op.type == "fetch":
+            remove_op_idx.append(i)
+    for idx in remove_op_idx[::-1]:
+        global_block._remove_op(idx)
+    copy_program.desc.flush()
+    feed_var_names = [var.name for var in feed_vars]
+    copy_program = copy_program._prune_with_input(
+        feeded_var_names=feed_var_names, targets=fetch_vars)
+    copy_program = copy_program._inference_optimize(prune_read_op=True)
+    fetch_var_names = [var.name for var in fetch_vars]
+    prepend_feed_ops(copy_program, feed_var_names)
+    append_fetch_ops(copy_program, fetch_var_names)
+    copy_program.desc._set_version()
+    return copy_program
+def is_persistable(var):
+    """
+    Check whether the given variable is persistable.
+    Args:
+        var(Variable): The variable to be checked.
+    Returns:
+        bool: True if the given `var` is persistable
+        False if not.
+    Examples:
+        .. code-block:: python
+            import paddle
+            import paddle.fluid as fluid
+            paddle.enable_static()
+            param = fluid.default_main_program().global_block().var('fc.b')
+            res = fluid.io.is_persistable(param)
+    """
+    if var.desc.type() == core.VarDesc.VarType.FEED_MINIBATCH or \
+                    var.desc.type() == core.VarDesc.VarType.FETCH_LIST or \
+                    var.desc.type() == core.VarDesc.VarType.READER:
+        return False
+    return var.persistable
+@static_only
+def serialize_program(feed_vars, fetch_vars, **kwargs):
+    """
+    :api_attr: Static Graph
+    Serialize default main program according to feed_vars and fetch_vars.
+    Args:
+        feed_vars(Variable | list[Variable]): Variables needed by inference.
+        fetch_vars(Variable | list[Variable]): Variables returned by inference.
+        kwargs: Supported keys including 'program'.Attention please, kwargs is used for backward compatibility mainly.
+          - program(Program): specify a program if you don't want to use default main program.
+    Returns:
+        bytes: serialized program.
+    Raises:
+        ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
+        ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
+    Examples:
+        .. code-block:: python
+            import paddle
+            paddle.enable_static()
+            path_prefix = "./infer_model"
+            # User defined network, here a softmax regession example
+            image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
+            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
+            predict = paddle.static.nn.fc(image, 10, activation='softmax')
+            loss = paddle.nn.functional.cross_entropy(predict, label)
+            exe = paddle.static.Executor(paddle.CPUPlace())
+            exe.run(paddle.static.default_startup_program())
+            # serialize the default main program to bytes.
+            serialized_program = paddle.static.serialize_program([image], [predict])
+            # deserialize bytes to program
+            deserialized_program = paddle.static.deserialize_program(serialized_program)
+    """
+    # verify feed_vars
+    _check_vars('feed_vars', feed_vars)
+    # verify fetch_vars
+    _check_vars('fetch_vars', fetch_vars)
+    program = _get_valid_program(kwargs.get('program', None))
+    program = normalize_program(program, feed_vars, fetch_vars)
+    return _serialize_program(program)
+def _serialize_program(program):
+    """
+    serialize given program to bytes.
+    """
+    return program.desc.serialize_to_string()
+@static_only
+def serialize_persistables(feed_vars, fetch_vars, executor, **kwargs):
+    """
+    :api_attr: Static Graph
+    Serialize parameters using given executor and default main program according to feed_vars and fetch_vars.
+    Args:
+        feed_vars(Variable | list[Variable]): Variables needed by inference.
+        fetch_vars(Variable | list[Variable]): Variables returned by inference.
+        kwargs: Supported keys including 'program'.Attention please, kwargs is used for backward compatibility mainly.
+          - program(Program): specify a program if you don't want to use default main program.
+    Returns:
+        bytes: serialized program.
+    Raises:
+        ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
+        ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
+    Examples:
+        .. code-block:: python
+            import paddle
+            paddle.enable_static()
+            path_prefix = "./infer_model"
+            # User defined network, here a softmax regession example
+            image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
+            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
+            predict = paddle.static.nn.fc(image, 10, activation='softmax')
+            loss = paddle.nn.functional.cross_entropy(predict, label)
+            exe = paddle.static.Executor(paddle.CPUPlace())
+            exe.run(paddle.static.default_startup_program())
+            # serialize parameters to bytes.
+            serialized_params = paddle.static.serialize_persistables([image], [predict], exe)
+            # deserialize bytes to parameters.
+            main_program = paddle.static.default_main_program()
+            deserialized_params = paddle.static.deserialize_persistables(main_program, serialized_params, exe)
+    """
+    # verify feed_vars
+    _check_vars('feed_vars', feed_vars)
+    # verify fetch_vars
+    _check_vars('fetch_vars', fetch_vars)
+    program = _get_valid_program(kwargs.get('program', None))
+    program = normalize_program(program, feed_vars, fetch_vars)
+    return _serialize_persistables(program, executor)
+def _serialize_persistables(program, executor):
+    """
+    Serialize parameters using given program and executor.
+    """
+    vars_ = list(filter(is_persistable, program.list_vars()))
+    # warn if no variable found in model
+    if len(vars_) == 0:
+        warnings.warn("no variable in your model, please ensure there are any "
+                      "variables in your model to save")
+        return None
+    # create a new program and clone persitable vars to it
+    save_program = Program()
+    save_block = save_program.global_block()
+    save_var_map = {}
+    for var in vars_:
+        if var.type != core.VarDesc.VarType.RAW:
+            var_copy = _clone_var_in_block(save_block, var)
+            save_var_map[var_copy.name] = var
+    # create in_vars and out_var, then append a save_combine op to save_program
+    in_vars = []
+    for name in sorted(save_var_map.keys()):
+        in_vars.append(save_var_map[name])
+    out_var_name = unique_name.generate("out_var")
+    out_var = save_block.create_var(
+        type=core.VarDesc.VarType.RAW, name=out_var_name)
+    out_var.desc.set_persistable(True)
+    save_block.append_op(
+        type='save_combine',
+        inputs={'X': in_vars},
+        outputs={'Y': out_var},
+        attrs={'file_path': '',
+               'save_to_memory': True})
+    # run save_program to save vars
+    # NOTE(zhiqiu): save op will add variable kLookupTablePath to save_program.desc,
+    # which leads to diff between save_program and its desc. Call _sync_with_cpp
+    # to keep consistency.
+    save_program._sync_with_cpp()
+    executor.run(save_program)
+    # return serialized bytes in out_var
+    return global_scope().find_var(out_var_name).get_bytes()
+def save_to_file(path, content):
+    """
+    Save content to given path.
+    Args:
+        path(str): Path to write content to.
+        content(bytes): Content to write.
+    Returns:
+        None
+    """
+    if not isinstance(content, bytes):
+        raise ValueError("'content' type should be bytes.")
+    with open(path, "wb") as f:
+        f.write(content)
+@static_only
+def save_inference_model(path_prefix, feed_vars, fetch_vars, executor,
+                         **kwargs):
+    """
+    :api_attr: Static Graph
+    Save current model and its parameters to given path. i.e.
+    Given path_prefix = "/path/to/modelname", after invoking
+    save_inference_model(path_prefix, feed_vars, fetch_vars, executor),
+    you will find two files named modelname.pdmodel and modelname.pdiparams
+    under "/path/to", which represent your model and parameters respectively.
+    Args:
+        path_prefix(str): Directory path to save model + model name without suffix.
+        feed_vars(Variable | list[Variable]): Variables needed by inference.
+        fetch_vars(Variable | list[Variable]): Variables returned by inference.
+        executor(Executor): The executor that saves the inference model. You can refer
+                            to :ref:`api_guide_executor_en` for more details.
+        kwargs: Supported keys including 'program' and "clip_extra". Attention please, kwargs is used for backward compatibility mainly.
+          - program(Program): specify a program if you don't want to use default main program.
+          - clip_extra(bool): set to True if you want to clip extra information for every operator.
+    Returns:
+        None
+    Raises:
+        ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
+        ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
+    Examples:
+        .. code-block:: python
+            import paddle
+            paddle.enable_static()
+            path_prefix = "./infer_model"
+            # User defined network, here a softmax regession example
+            image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
+            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
+            predict = paddle.static.nn.fc(image, 10, activation='softmax')
+            loss = paddle.nn.functional.cross_entropy(predict, label)
+            exe = paddle.static.Executor(paddle.CPUPlace())
+            exe.run(paddle.static.default_startup_program())
+            # Feed data and train process
+            # Save inference model. Note we don't save label and loss in this example
+            paddle.static.save_inference_model(path_prefix, [image], [predict], exe)
+            # In this example, the save_inference_mode inference will prune the default
+            # main program according to the network's input node (img) and output node(predict).
+            # The pruned inference program is going to be saved in file "./infer_model.pdmodel"
+            # and parameters are going to be saved in file "./infer_model.pdiparams".
+    """
+    # check path_prefix, set model_path and params_path
+    path_prefix = _normalize_path_prefix(path_prefix)
+    try:
+        # mkdir may conflict if pserver and trainer are running on the same machine
+        dirname = os.path.dirname(path_prefix)
+        os.makedirs(dirname)
+    except OSError as e:
+        if e.errno != errno.EEXIST:
+            raise
+    model_path = path_prefix + ".pdmodel"
+    params_path = path_prefix + ".pdiparams"
+    if os.path.isdir(model_path):
+        raise ValueError("'{}' is an existing directory.".format(model_path))
+    if os.path.isdir(params_path):
+        raise ValueError("'{}' is an existing directory.".format(params_path))
+    # verify feed_vars
+    _check_vars('feed_vars', feed_vars)
+    # verify fetch_vars
+    _check_vars('fetch_vars', fetch_vars)
+    program = _get_valid_program(kwargs.get('program', None))
+    clip_extra = kwargs.get('clip_extra', False)
+    program = normalize_program(program, feed_vars, fetch_vars)
+    # serialize and save program
+    program_bytes = _serialize_program(
+        program._remove_training_info(clip_extra=clip_extra))
+    save_to_file(model_path, program_bytes)
+    # serialize and save params
+    params_bytes = _serialize_persistables(program, executor)
+    save_to_file(params_path, params_bytes)