Merge branch 'PaddlePaddle:develop' into develop

2e98f359 · TeslaZhao · GitHub · c42fa9be · 50b24542 · 2e98f359
14 changed files
--- a/doc/Offical_Docs/5-3_Serving_Configure_CN.md
+++ b/doc/Offical_Docs/5-3_Serving_Configure_CN.md
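+# Serving Configuration
+## Overview
+This document describes the configuration options of C++ Serving and Python Pipeline:
+- [Model configuration file](#model-configuration-file): generated automatically when a model is converted; describes the model's inputs and outputs
+- [C++ Serving](#c-serving): for high-performance scenarios; covers quick start and custom configuration
+- [Python Pipeline](#python-pipeline): for single-operator and multi-model combination scenarios
+## Model configuration file
+Before introducing the Server configuration, we first look at the model configuration file. When a model is converted to a Paddle Serving model, two files are generated: serving_client_conf.prototxt and serving_server_conf.prototxt. Their contents are identical and describe the model's input and output parameters, which helps users assemble requests. The files are used by both the Server and the Client and do not need to be edited by hand. For the conversion procedure, see [How to save a model for Paddle Serving](./Save_CN.md). For the protobuf format, see `core/configure/proto/general_model_config.proto`.
+An example: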
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "concat_1.tmp_0"
+  alias_name: "concat_1.tmp_0"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 3
+  shape: 640
+  shape: 640
+}
+```
+Where:
+- feed_var: model input
+- fetch_var: model output
+- name: variable name
+- alias_name: alias that corresponds to the name
+- is_lod_tensor: whether the variable is a LoD tensor; see [LoD field description](./LOD_CN.md)
+- feed_type: data type (fetch_type for outputs); see the table below
+- shape: dimensions of the data
+| feed_type | 0     | 1       | 2     | 3    | 4     | 5    | 6    | 7     | 8    | 20     |
+|-----------|-------|---------|-------|------|-------|------|------|-------|------|--------|
+| Type      | INT64 | FLOAT32 | INT32 | FP64 | INT16 | FP16 | BF16 | UINT8 | INT8 | STRING |
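To make the type codes easier to work with on the client side, the table can be turned into a lookup. The sketch below is only an illustration: `FEED_TYPE_NAMES` and `describe_var` are hypothetical helpers, not part of Paddle Serving.

```python
# Hypothetical helper: the codes are copied from the feed_type table above.
FEED_TYPE_NAMES = {
    0: "INT64", 1: "FLOAT32", 2: "INT32", 3: "FP64", 4: "INT16",
    5: "FP16", 6: "BF16", 7: "UINT8", 8: "INT8", 20: "STRING",
}


def describe_var(name, type_code, shape):
    """Return a one-line summary of a feed_var or fetch_var entry."""
    if type_code not in FEED_TYPE_NAMES:
        raise ValueError(f"unknown type code: {type_code}")
    dims = "x".join(str(d) for d in shape)
    return f"{name}: {FEED_TYPE_NAMES[type_code]}[{dims}]"


# The two variables from the sample prototxt.
print(describe_var("x", 1, [13]))
print(describe_var("concat_1.tmp_0", 1, [3, 640, 640]))
```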
+## C++ Serving
+**I. Quick start and shutdown**
+A service can be started quickly by specifying the model and the port. The start command is:
+```BASH
+python3 -m paddle_serving_server.serve --model serving_model --port 9393
+```
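+This command generates the configuration files automatically and starts C++ Serving with them. For example, the command above creates a workdir_9393 directory with the following structure: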
+```
+workdir_9393
+├── general_infer_0
+│   ├── fluid_time_file
+│   ├── general_model.prototxt
+│   └── model_toolkit.prototxt
+├── infer_service.prototxt
+├── resource.prototxt
+└── workflow.prototxt
+```
+More start-up arguments are listed in the table below:
+| Argument                                       | Type | Default | Description                                           |
+| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
+| `thread`                                       | int  | `2`     | Number of BRPC service threads |
+| `runtime_thread_num`                           | int[]| `0`     | Number of threads per model in asynchronous mode |
+| `batch_infer_size`                             | int[]| `32`    | Batch size per model in asynchronous mode |
+| `gpu_ids`                                      | str[]| `"-1"`  | GPU ids for each model; for multi-card deployment, set for example "0,1,2" |
+| `port`                                         | int  | `9292`  | Port of the service |
+| `model`                                        | str[]| `""`    | Model file path(s); for two models, set for example "serving_model_1 serving_model_2" |
+| `mem_optim_off`                                | -    | -       | Disable memory optimization |
+| `ir_optim`                                     | bool | False   | Enable graph optimization |
+| `use_mkl` (Only for cpu version)               | -    | -       | Enable MKL; must be used together with ir_optim |
+| `use_trt` (Only for trt version)               | -    | -       | Enable TensorRT; must be used together with ir_optim |
+| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -    | -       | Enable PaddleLite; must be used together with ir_optim |
+| `use_xpu`                                      | -    | -       | Enable Baidu Kunlun XPU; must be used together with ir_optim |
+| `precision`                                    | str  | FP32    | Precision mode; supports FP32, FP16, INT8 |
+| `use_calib`                                    | bool | False   | Enable TRT int8 calibration mode |
+| `gpu_multi_stream`                             | bool | False   | Enable GPU multi-stream mode |
+| `use_ascend_cl`                                | bool | False   | Enable Ascend; alone it targets Ascend 910, together with use_lite it targets Ascend 310 |
+| `request_cache_size`                           | int  | `0`     | Capacity of the request cache; the default 0 disables the cache |
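When the service is launched from a script, the arguments in the table can be assembled programmatically. The following is a minimal sketch that only builds the command line and does not run it; `build_serve_command` is a hypothetical helper, and passing several models as separate argv entries after `--model` is an assumption based on the "serving_model_1 serving_model_2" form in the table.

```python
import shlex


def build_serve_command(models, port=9292, thread=2, gpu_ids=None):
    """Assemble the argv list for paddle_serving_server.serve (not executed)."""
    cmd = ["python3", "-m", "paddle_serving_server.serve", "--model", *models,
           "--port", str(port), "--thread", str(thread)]
    if gpu_ids:
        # Multi-card form from the table, e.g. "0,1,2".
        cmd += ["--gpu_ids", ",".join(str(g) for g in gpu_ids)]
    return cmd


cmd = build_serve_command(["serving_model"], port=9393)
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd) to actually start it
```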
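+**II. Starting with a custom configuration**
+In most cases the automatically generated configuration is sufficient. For special scenarios, users can also write the configuration files themselves. These files are service.prototxt (the list of services), workflow.prototxt (the OP workflow), resource.prototxt (the location of the model configuration files), model_toolkit.prototxt (model information and inference engine), and proj.conf (service parameters). The start command is: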
+```BASH
+/bin/serving --flagfile=proj.conf
+```
+1. proj.conf
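+proj.conf passes service parameters and specifies the paths of the other configuration files. If a parameter is given more than once, the last value takes effect.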
+```
+# for paddle inference
+--precision=fp32
+--use_calib=False
+--reload_interval_s=10
+# for brpc
+--max_concurrency=0
+--num_threads=10
+--bthread_concurrency=10
+--max_body_size=536870912
+# default path
+--inferservice_path=conf
+--inferservice_file=infer_service.prototxt
+--resource_path=conf
+--resource_file=resource.prototxt
+--workflow_path=conf
+--workflow_file=workflow.prototxt
+```
+The parameters and their default values are described in the table below:
+| name | Default | Description |
+|------|--------|------|
+|precision|"fp32"|Precision Mode, support FP32, FP16, INT8|
+|use_calib|False|Only for deployment with TensorRT|
+|reload_interval_s|10|Reload interval|
+|max_concurrency|0|Limit of request processing in parallel, 0: unlimited|
+|num_threads|10|Number of brpc service threads|
+|bthread_concurrency|10|Number of bthreads|
+|max_body_size|536870912|Max size of brpc message|
+|inferservice_path|"conf"|Path of inferservice conf|
+|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
+|resource_path|"conf"|Path of resource conf|
+|resource_file|"resource.prototxt"|Filename of resource conf|
+|workflow_path|"conf"|Path of workflow conf|
+|workflow_file|"workflow.prototxt"|Filename of workflow conf|
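The "last value wins" rule for repeated parameters can be illustrated with a small parser. This is a sketch under the assumption that every line has the `--name=value` form shown in the sample; `parse_flagfile` is a hypothetical helper and is not how the server itself reads the file.

```python
def parse_flagfile(text):
    """Parse '--name=value' lines; a later value overrides an earlier one."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.lstrip("-").partition("=")
        flags[name] = value
    return flags


sample = """
# for brpc
--num_threads=10
--max_body_size=536870912
--num_threads=16
"""
flags = parse_flagfile(sample)
print(flags["num_threads"])  # the second occurrence is the one that is kept
```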
+2. service.prototxt
+service.prototxt configures the list of services mounted on the Paddle Serving instance. Its location is set with `--inferservice_path` and `--inferservice_file`. For the protobuf format, see `InferServiceConf` in `core/configure/proto/server_configure.proto`. An example:
+```
+port: 8010
+services {
+  name: "GeneralModelService"
+  workflows: "workflow1"
+}
+```
+Where:
+- port: the port on which the Serving instance listens.
+- services: use the default configuration; it must not be modified. name specifies the service name; workflow1 is defined in workflow.prototxt.
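+3. workflow.prototxt
+workflow.prototxt describes the concrete workflow. Its location is set with `--workflow_path` and `--workflow_file`. For the protobuf format, see the `Workflow` type in `core/configure/proto/server_configure.proto`. For custom OPs, see [Custom OP]().
+In the example below, the workflow consists of three OPs: GeneralReaderOp reads the data, GeneralInferOp depends on GeneralReaderOp and runs inference, and GeneralResponseOp returns the prediction result: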
+```
+workflows {
+  name: "workflow1"
+  workflow_type: "Sequence"
+  nodes {
+    name: "general_reader_0"
+    type: "GeneralReaderOp"
+  }
+  nodes {
+    name: "general_infer_0"
+    type: "GeneralInferOp"
+    dependencies {
+      name: "general_reader_0"
+      mode: "RO"
+    }
+  }
+  nodes {
+    name: "general_response_0"
+    type: "GeneralResponseOp"
+    dependencies {
+      name: "general_infer_0"
+      mode: "RO"
+    }
+  }
+}
+```
+Where:
+- name: workflow name, used to look up the workflow from service.prototxt
+- workflow_type: only "Sequence" is supported
+- nodes: all nodes chained into the workflow; multiple nodes may be configured, and they are linked through dependencies
+- node.name: corresponds one-to-one with node.type; see `python/paddle_serving_server/dag.py`
+- node.type: class name of the OP executed by the node; it matches the name of a concrete OP class under serving/op/
+- node.dependencies: list of upstream nodes this node depends on
+- node.dependencies.name: must match the name of a node in the workflow
+- node.dependencies.mode: RO (read only) or RW (read write)
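The dependency rules above can be checked before a hand-written workflow is deployed. The sketch below is an illustration only: `execution_order` is a hypothetical helper, and it assumes nodes are listed upstream-first, as in the sample workflow.

```python
def execution_order(nodes):
    """Validate a Sequence workflow and return node names in listed order.

    nodes: list of dicts with 'name', 'type' and an optional 'dependencies'
    list holding the names of upstream nodes.
    """
    seen = []
    for node in nodes:
        name = node["name"]
        if name in seen:
            raise ValueError(f"duplicate node name: {name}")
        for dep in node.get("dependencies", []):
            if dep not in seen:
                raise ValueError(f"{name} depends on unknown node {dep}")
        seen.append(name)
    return seen


# The three-OP workflow from the example above.
workflow1 = [
    {"name": "general_reader_0", "type": "GeneralReaderOp"},
    {"name": "general_infer_0", "type": "GeneralInferOp",
     "dependencies": ["general_reader_0"]},
    {"name": "general_response_0", "type": "GeneralResponseOp",
     "dependencies": ["general_infer_0"]},
]
print(execution_order(workflow1))
```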
+4. resource.prototxt
+resource.prototxt specifies the model configuration files. Its location is set with `--resource_path` and `--resource_file`. For the protobuf format, see `ResourceConf` in `core/configure/proto/server_configure.proto`. An example:
+```
+model_toolkit_path: "conf"
+model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
+general_model_path: "conf"
+general_model_file: "general_infer_0/general_model.prototxt"
+```
+Where:
+- model_toolkit_path: directory that contains model_toolkit.prototxt
+- model_toolkit_file: file name of model_toolkit.prototxt
+- general_model_path: directory that contains general_model.prototxt
+- general_model_file: file name of general_model.prototxt
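+5. model_toolkit.prototxt
+model_toolkit.prototxt configures the model information and the inference engine. For the protobuf format, see ModelToolkitConf in `core/configure/proto/server_configure.proto`. The on-disk path of model_toolkit.prototxt cannot be overridden by command-line arguments. An example: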
+```
+engines {
+  name: "general_infer_0"
+  type: "PADDLE_INFER"
+  reloadable_meta: "uci_housing_model/fluid_time_file"
+  reloadable_type: "timestamp_ne"
+  model_dir: "uci_housing_model"
+  gpu_ids: -1
+  enable_memory_optimization: true
+  enable_ir_optimization: false
+  use_trt: false
+  use_lite: false
+  use_xpu: false
+  use_gpu: false
+  combined_model: false
+  gpu_multi_stream: false
+  use_ascend_cl: false
+  runtime_thread_num: 0
+  batch_infer_size: 32
+  enable_overrun: false
+  allow_split_request: true
+}
+```
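+Where:
+- name: engine name; corresponds to node.name in workflow.prototxt and to the name of the containing directory
+- type: inference engine type; only "PADDLE_INFER" is currently supported
+- reloadable_meta: the file content is currently meaningless; the file's mtime is used to decide whether the reload threshold has been exceeded
+- reloadable_type: reload condition to check: timestamp_ne/timestamp_gt/md5sum/revision/none; see the table below
+- model_dir: path of the model files
+- gpu_ids: GPU device ids used by the engine at run time; multiple ids may be specified, for example:
+- enable_memory_optimization: whether to enable memory optimization
+- enable_ir_optimization: whether to enable IR optimization
+- use_trt: whether to enable TensorRT; requires use_gpu to be enabled as well
+- use_lite: whether to enable PaddleLite
+- use_xpu: whether to use Kunlun XPU
+- use_gpu: whether to use the GPU
+- combined_model: whether a combined model file is used
+- gpu_multi_stream: whether to enable GPU multi-stream mode
+- use_ascend_cl: whether to use Ascend; alone it targets Ascend 910, together with lite it targets Ascend 310
+- runtime_thread_num: if greater than 0, Async mode is enabled and that many predictor instances are created
+- batch_infer_size: maximum batch size in Async mode
+- enable_overrun: in Async mode, always put the whole task into the task queue
+- allow_split_request: in Async mode, allow a task to be split
+|reloadable_type|Meaning|
+|---------------|----|
+|timestamp_ne|The mtime of the file given by reloadable_meta has changed|
+|timestamp_gt|The mtime of the file given by reloadable_meta is greater than or equal to the mtime recorded at the last check|
+|md5sum|Currently unused; never reloads when configured|
+|revision|Currently unused; never reloads when configured|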
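The reload conditions in the table can be restated as a small decision function. This sketch simply follows the wording of the table and is not taken from the server source; `should_reload` is a hypothetical name, and the mtime values would in practice come from something like `os.stat(path).st_mtime`.

```python
def should_reload(reloadable_type, last_mtime, current_mtime):
    """Decide whether to reload, following the table above literally."""
    if reloadable_type == "timestamp_ne":
        return current_mtime != last_mtime
    if reloadable_type == "timestamp_gt":
        # The table words this condition as "greater than or equal to".
        return current_mtime >= last_mtime
    # md5sum, revision and none never trigger a reload.
    return False


print(should_reload("timestamp_ne", 100.0, 100.0))
print(should_reload("timestamp_ne", 100.0, 105.0))
print(should_reload("md5sum", 100.0, 105.0))
```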
+6. general_model.prototxt
+general_model.prototxt has the same content as the model configuration file serving_server_conf.prototxt and describes the model's input and output parameters.
+## Python Pipeline
+**I. Quick start and shutdown**
+The Python Pipeline start script is shown below; for its implementation, see [Pipeline Serving]():
+```BASH
+python3 web_service.py
+```
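+To shut down the Serving service, run the command below in the directory where Pipeline was started or in the path given by the SERVING_HOME environment variable.
+The stop argument sends SIGINT to Pipeline Serving; on Linux, replacing it with kill sends SIGKILL to Pipeline Serving instead.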
+```BASH
+python3 -m paddle_serving_server.serve stop
+```
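+**II. Configuration file**
+Python Pipeline provides a user-friendly programming framework for multi-model services and suits applications that combine several models.
+Its configuration file is in YAML format and is named config.yaml by default. An example: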
+```YAML
+rpc_port: 18090
+http_port: 9999
+worker_num: 20
+build_dag_each_worker: false
+```
+| Argument                                       | Type | Default | Description                                           |
+| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
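+| `rpc_port`                                     | int  | `18090` | RPC port. rpc_port and http_port must not both be empty. When rpc_port is empty and http_port is not, rpc_port is set to http_port+1 automatically |
+| `http_port`                                    | int  | `9999`  | HTTP port. rpc_port and http_port must not both be empty. When rpc_port is available and http_port is empty, http_port is not generated automatically |
+| `worker_num`                                   | int  | `20`    | Maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each building its own gRPC server and DAG; when build_dag_each_worker=False, the framework sets max_workers=worker_num for the gRPC thread pool of the main thread |
+| `build_dag_each_worker`                        | bool | `false` | False: the framework creates one DAG inside the process; True: the framework creates multiple independent DAGs, one inside each process |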
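The port rules in the table can be written out as a short function. This is a sketch of the documented behaviour only; `resolve_ports` is a hypothetical name, and `None` stands for an empty setting.

```python
def resolve_ports(rpc_port=None, http_port=None):
    """Apply the documented rpc_port / http_port rules."""
    if rpc_port is None and http_port is None:
        raise ValueError("rpc_port and http_port must not both be empty")
    if rpc_port is None:
        rpc_port = http_port + 1  # derived automatically from http_port
    # http_port is never derived from rpc_port, so it may stay None.
    return rpc_port, http_port


print(resolve_ports(http_port=9999))
print(resolve_ports(rpc_port=18090))
```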
+```YAML
+dag:
+    is_thread_op: False
+    retry: 1
+    use_profile: false
+    tracer:
+        interval_s: 10
+    client_type: local_predictor
+    channel_size: 0
+    channel_recv_frist_arrive: False
+```
+| Argument                                       | Type | Default | Description                                           |
+| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
+| `is_thread_op`                                 | bool   | `false` | OP resource type. True: thread model; False: process model |
+| `retry`                                        | int    | `1`     | Number of retries |
+| `use_profile`                                  | bool   | `false` | Performance profiling. True: generate Timeline performance data, with some impact on performance; False: disabled |
+| `tracer:interval_s`                            | int    | `10`    | Interval, in seconds, at which the tracer reports monitoring data |
+| `client_type`                                  | string | `local_predictor` | Client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and runs inference in-process |
+| `channel_size`                                 | int    | `0`     | Maximum length of a channel; default 0 |
+| `channel_recv_frist_arrive`                    | bool   | `false` | For tensor parallelism in distributed large-model scenarios: keep the first result received and discard the rest to improve speed |
+```YAML
+op:
+    op1:
+        #Concurrency; thread concurrency when is_thread_op=True, otherwise process concurrency
+        concurrency: 6
+        #Serving IPs
+        #server_endpoints: ["127.0.0.1:9393"]
+        #Fetch list; use the alias_name of fetch_var in client_config
+        #fetch_list: ["concat_1.tmp_0"]
+        #Client-side configuration of the det model
+        #client_config: serving_client_conf.prototxt
+        #Timeout when talking to Serving, in ms
+        #timeout: 3000
+        #Number of retries when talking to Serving; no retry by default
+        #retry: 1
+        #Number of requests batched per Serving query, default 1. With batch_size>1, set auto_batching_timeout, otherwise the OP blocks until batch_size requests arrive
+        #batch_size: 2
+        #Batching timeout, used together with batch_size
+        #auto_batching_timeout: 2000
+        #When the OP has no server_endpoints, the local service configuration is read from local_service_conf
+        local_service_conf:
+            #Client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and runs inference in-process
+            client_type: local_predictor
+            #Path of the det model
+            model_config: ocr_det_model
+            #Fetch list; use the alias_name of fetch_var in client_config
+            fetch_list: ["concat_1.tmp_0"]
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
+            device_type: 0
+            #Compute device IDs. "" or omitted means CPU inference; "0" or "0,1,2" means GPU inference on the listed cards
+            devices: ""
+            #use_mkldnn; when mkldnn is enabled, ir_optim=True must also be set, otherwise it has no effect
+            #use_mkldnn: True
+            #ir_optim; when TensorRT is enabled, ir_optim=True must also be set, otherwise it has no effect
+            ir_optim: True
+            #Number of CPU compute threads; enabling this in CPU scenarios reduces the latency of a single request
+            #thread_num: 10
+            #precision, inference precision; lowering it can speed up inference
+            #GPU supports: "fp32"(default), "fp16", "int8";
+            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
+            precision: "fp32"
+            #mem_optim, memory / graphic memory optimization
+            #mem_optim: True
+            #use_calib, Use TRT int8 calibration
+            #use_calib: False
+            #The cache capacity of different input shapes for mkldnn
+            #mkldnn_cache_capacity: 0
+            #mkldnn_op_list, op list accelerated using MKLDNN, None default
+            #mkldnn_op_list: []
+            #mkldnn_bf16_op_list,op list accelerated using MKLDNN bf16, None default.
+            #mkldnn_bf16_op_list: []
+            #min_subgraph_size,the minimal subgraph size for opening tensorrt to optimize, 3 default
+            #min_subgraph_size: 3
+```
+| Argument                                       | Type | Default | Description                                           |
+| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
+| `concurrency`                                  | int    | `6`     | Concurrency; thread concurrency when is_thread_op=True, otherwise process concurrency |
+| `server_endpoints`                             | list   | `-`     | List of Serving endpoints |
+| `fetch_list`                                   | list   | `-`     | Fetch list; use the alias_name of fetch_var in client_config |
+| `client_config`                                | string | `-`     | Client-side configuration of the model |
+| `timeout`                                      | int    | `3000`  | Timeout when talking to Serving, in ms |
+| `retry`                                        | int    | `1`     | Number of retries when talking to Serving; no retry by default |
+| `batch_size`                                   | int    | `1`     | Number of requests batched per Serving query, default 1. With batch_size>1, set auto_batching_timeout, otherwise the OP blocks until batch_size requests arrive |
+| `auto_batching_timeout`                        | int    | `2000`  | Batching timeout, used together with batch_size |
+| `local_service_conf`                           | map    | `-`     | When the OP has no server_endpoints, the local service configuration is read from local_service_conf |
+| `client_type`                                  | string | `-`     | Client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and runs inference in-process |
+| `model_config`                                 | string | `-`     | Model path |
+| `fetch_list`                                   | list   | `-`     | Fetch list; use the alias_name of fetch_var in client_config |
+| `device_type`                                  | int    | `0`     | 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910 |
+| `devices`                                      | string | `-`     | Compute device IDs. "" or omitted means CPU inference; "0" or "0,1,2" means GPU inference on the listed cards |
+| `use_mkldnn`                                   | bool   | `True`  | When mkldnn is enabled, ir_optim=True must also be set, otherwise it has no effect |
+| `ir_optim`                                     | bool   | `True`  | When TensorRT is enabled, ir_optim=True must also be set, otherwise it has no effect |
+| `thread_num`                                   | int    | `10`    | Number of CPU compute threads; enabling this in CPU scenarios reduces the latency of a single request |
+| `precision`                                    | string | `fp32`  | Inference precision; lowering it can speed up inference. GPU supports "fp32"(default), "fp16", "int8"; CPU supports "fp32"(default), "fp16", "bf16"(mkldnn) and does not support "int8" |
+| `mem_optim`                                    | bool   | `True`  | Memory optimization option |
+| `use_calib`                                    | bool   | `False` | TRT int8 quantization calibration mode |
+| `mkldnn_cache_capacity`                        | int    | `0`     | Cache capacity for different input shapes in mkldnn |
+| `mkldnn_op_list`                               | list   | `-`     | List of OPs accelerated with mkldnn |
+| `mkldnn_bf16_op_list`                          | list   | `-`     | List of OPs accelerated with mkldnn bf16 |
+| `min_subgraph_size`                            | int    | `3`     | Minimum subgraph size for enabling TensorRT optimization |
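Several of the options above only work in combination. The following sketch collects those documented constraints into one check on an OP section that has already been loaded into a dict (for example with a YAML parser); `check_op_config` is a hypothetical helper and the messages are illustrative.

```python
def check_op_config(op):
    """Return a list of problems found in one op section of config.yaml."""
    problems = []
    local = op.get("local_service_conf")
    if not op.get("server_endpoints") and local is None:
        problems.append("neither server_endpoints nor local_service_conf is set")
    if op.get("batch_size", 1) > 1 and "auto_batching_timeout" not in op:
        problems.append("batch_size>1 needs auto_batching_timeout")
    if local and local.get("use_mkldnn") and not local.get("ir_optim"):
        problems.append("use_mkldnn has no effect without ir_optim=True")
    return problems


op1 = {
    "concurrency": 6,
    "batch_size": 2,
    "local_service_conf": {"client_type": "local_predictor",
                           "model_config": "ocr_det_model",
                           "use_mkldnn": True, "ir_optim": False},
}
for problem in check_op_config(op1):
    print(problem)
```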
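+**III. Advanced parameter configuration**
+For advanced configurations such as single-machine multi-card inference, heterogeneous hardware, and low-precision inference, see the table below and [Pipeline Serving typical examples]().
+| Feature | Document |
+| ---- | ---- |
+| Single-machine multi-card inference | [Pipeline Serving]() |
+| Heterogeneous hardware | [Pipeline Serving]() |
+| Low-precision inference | [Pipeline Serving]() |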
\ No newline at end of file
--- a/doc/Offical_Docs/5-4_HTTP_API_CN.md
+++ b/doc/Offical_Docs/5-4_HTTP_API_CN.md
+# Accessing the Server over HTTP
+The Paddle Serving server supports direct access over HTTP. This document describes the details.
+- [Basic principles](#1)
+  - [1.1 HTTP](#1.1)
+  - [1.2 HTTP + protobuf](#1.2)
+- [Example](#2)
+  - [2.1 Get the model](#2.1)
+  - [2.2 Start the server](#2.2)
+- [Client access](#3)
+  - [3.1 Sending HTTP requests with HttpClient](#3.1)
+  - [3.2 Sending HTTP requests with curl](#3.2)
+  - [3.3 HTTP compression](#3.3)
+<a name="1"></a>
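+## Basic principles
+The Server can be accessed over HTTP, and every major language has libraries for making HTTP requests. The following describes how to call the server for inference directly over HTTP from languages such as Java, Python, and Go.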
+<a name="1.1"></a>
+**I. HTTP:**
+Basic flow: the client packs the data into the body of the HTTP request in the format defined by the proto (see [`core/general-server/proto/general_model_service.proto`](../../core/general-server/proto/general_model_service.proto)).
+The Server then tries to deserialize the JSON string into the proto message and continues processing from there.
+<a name="1.2"></a>
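+**II. HTTP + protobuf:**
+Every major language supports ProtoBuf. If you are familiar with it, you can serialize the data with ProtoBuf first, put the serialized bytes into the HTTP request body, and set Content-Type: application/proto, so that the service is accessed with HTTP + binary protobuf.
+In our tests, the payload size and deserialization time of JSON over HTTP grow substantially as the data volume increases. For large payloads, HTTP + protobuf is recommended; it is currently supported by the Java and Python clients.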
+<a name="2"></a>
+## Example
+We use examples/C++/fit_a_line to show how to access the Server over HTTP.
+<a name="2.1"></a>
+**I. Get the model:**
+```shell
+sh get_data.sh
+```
+<a name="2.2"></a>
+**II. Start the server:**
+```shell
+python3.6 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9393
+```
+The server needs no modification to support both BRPC and HTTP.
+<a name="3"></a>
+## Client access
+<a name="3.1"></a>
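+**I. Sending HTTP requests with HttpClient (Python/Java):**
+To make it easy to call the Server's inference service over HTTP, common functions such as request body packing, compression, and request encryption are wrapped in an HttpClient class.
+In the simplest case, using HttpClient takes four steps:
+- 1. Create an HttpClient object.
+- 2. Load the Client prototxt configuration file (in this example, uci_housing_client/serving_client_conf.prototxt under examples/C++/fit_a_line).
+- 3. Call the connect function.
+- 4. Call the predict function to request the inference service over HTTP.
+In addition, you can configure the following as needed:
+- Server IP, port, and service name
+- Compression of the request body
+- Compressed transfer of the response
+- Encrypted model inference (requires the Server to be configured with an encrypted model)
+- Response timeout, and so on.
+1. Python HttpClient example: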
+```python
+from paddle_serving_client.httpclient import HttpClient
+import sys
+import numpy as np
+import time
+client = HttpClient()
+client.load_client_config(sys.argv[1])
+client.connect(["127.0.0.1:9393"])
+fetch_list = client.get_fetch_names()
+new_data = np.zeros((1, 13)).astype("float32")
+new_data[0] = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
+fetch_map = client.predict(
+    feed={"x": new_data}, fetch=fetch_list, batch=True)
+print(fetch_map)
+```
+2. Java HttpClient example:
+```java
+boolean http_proto(String model_config_path) {
+    float[] data = {0.0137f, -0.1136f, 0.2553f, -0.0692f,
+        0.0582f, -0.0727f, -0.1583f, -0.0584f,
+        0.6283f, 0.4919f, 0.1856f, 0.0795f, -0.0332f};
+    INDArray npdata = Nd4j.createFromArray(data);
+    long[] batch_shape = {1,13};
+    INDArray batch_npdata = npdata.reshape(batch_shape);
+    HashMap<String, Object> feed_data
+        = new HashMap<String, Object>() {{
+            put("x", batch_npdata);
+        }};
+    List<String> fetch = Arrays.asList("price");
+    Client client = new Client();
+    client.setIP("127.0.0.1");
+    client.setPort("9393");
+    client.loadClientConfig(model_config_path);
+    String result = client.predict(feed_data, fetch, true, 0);
+    System.out.println(result);
+    return true;
+}
+```
+For more Java HttpClient examples, see [`java/examples/src/main/java/PaddleServingClientExample.java`](../../java/examples/src/main/java/PaddleServingClientExample.java); for the interface, see [`java/src/main/java/io/paddle/serving/client/Client.java`](../../java/src/main/java/io/paddle/serving/client/Client.java).
+If this does not meet your needs, you can add functionality on top of it.
+<a name="3.2"></a>
+**II. Sending HTTP requests with curl:**
+```shell
+curl -XPOST http://127.0.0.1:9393/GeneralModelService/inference -d ' {"tensor":[{"float_data":[0.0137,-0.1136,0.2553,-0.0692,0.0582,-0.0727,-0.1583,-0.0584,0.6283,0.4919,0.1856,0.0795,-0.0332],"elem_type":1,"name":"x","alias_name":"x","shape":[1,13]}],"fetch_var_names":["price"],"log_id":0}'
+```
+Here `127.0.0.1:9393` is the IP and port; set them according to the IP and port on which your server was started.
+The `GeneralModelService` and `inference` fields are the service name and the RPC method name, respectively.
+The text after -d is the request body. The JSON must contain the required fields of the proto described below, otherwise the conversion fails and the request is rejected.
+Note that the shape field in the data is the shape the model actually expects, including the batch dimension.
+1. message
+Corresponds to a rapidjson Object, enclosed in braces; its elements are parsed recursively.
+```protobuf
+// protobuf
+message Foo {
+    required string field1 = 1;
+    required int32 field2 = 2;  
+}
+message Bar { 
+    required Foo foo = 1; 
+    optional bool flag = 2;
+    required string name = 3;
+}
+// rapidjson
+{"foo":{"field1":"hello", "field2":3},"name":"Tom" }
+```
+2. repeated field
+Corresponds to a rapidjson Array, enclosed in square brackets; its elements are parsed recursively. Unlike a message, all elements have the same type.
+```protobuf
+// protobuf
+repeated int32 numbers = 1;
+// rapidjson
+{"numbers" : [12, 17, 1, 24] }
+```
+3. elem_type
+Indicates the data type: 0 is int64, 1 is float32, 2 is int32, and 20 is bytes (string).
+4. fetch_var_names
+Names of the outputs to return; see `alias_name` under the `fetch_var` field in the model file serving_client_conf.prototxt.
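+<a name="3.3"></a>
+**III. HTTP compression:**
+gzip compression is supported, but gzip is not a particularly fast compression method. For small payloads, compression costs more than it saves; consider gzip only when the data is at least 512 bytes. In our tests, compression brought little benefit for payloads under 50 KB.
+1. Compressing the client request body
+This example reuses the fit_a_line request body from above, purely to demonstrate usage; compressing a payload this small is not worthwhile in practice.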
+```shell
+echo ' {"tensor":[{"float_data":[0.0137,-0.1136,0.2553,-0.0692,0.0582,-0.0727,-0.1583,-0.0584,0.6283,0.4919,0.1856,0.0795,-0.0332],"elem_type":1,"shape":[1,13]}],"fetch_var_names":["price"],"log_id":0}' | gzip -c > data.txt.gz
+```
+```shell
+curl --data-binary @data.txt.gz -H'Content-Encoding: gzip' -XPOST http://127.0.0.1:9393/GeneralModelService/inference
+```
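+**Note: when the request body is compressed, the request header must include Content-Encoding: gzip.**
+2. Compressing the Server response
+When the HTTP request header contains Accept-encoding: gzip, the Server tries to compress the response with gzip. "Tries" means that compression may not happen, under the following condition:
+- The body is smaller than the number of bytes given by -http_body_compress_threshold, which defaults to 512. gzip is not a fast compression algorithm; for a small body, the latency added by compression may exceed the network transfer time it saves, so leaving small packets uncompressed may be the better choice.
+In that case the server always returns an uncompressed result.
+With curl, the --compressed option is the recommended way to enable response compression: it sets Accept-encoding: gzip in the request automatically and decompresses the compressed response on receipt.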
+```shell
+curl --data-binary @data.txt.gz -H'Content-Encoding: gzip' --compressed -XPOST http://127.0.0.1:9393/GeneralModelService/inference
+```
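+If you only set -H'Accept-encoding: gzip' in the request header, the response you receive is compressed and you have to decompress it yourself.
+In other words, --compressed = -H'Accept-encoding: gzip' + automatic decompression, so --compressed is recommended. The command below only illustrates the principle of setting the header separately and decompressing manually.
+To verify that the response is really compressed, add only the header -H'Accept-encoding: gzip' and skip decompression; the output is then the compressed data (generally unreadable bytes).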
+```shell
+curl --data-binary @data.txt.gz -H'Content-Encoding: gzip' -H'Accept-encoding: gzip' -XPOST http://127.0.0.1:9393/GeneralModelService/inference | gunzip
+```
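The compression rules in this section can also be applied from a Python client. The sketch below is an illustration under assumptions: `maybe_compress` and `decode_response` are hypothetical helpers, and reusing the 512-byte threshold on the client side mirrors the documented server default rather than any client API.

```python
import gzip

COMPRESS_THRESHOLD = 512  # bytes; assumption: reuse the documented server default


def maybe_compress(body, threshold=COMPRESS_THRESHOLD):
    """Return (payload, headers); gzip only bodies of at least `threshold` bytes."""
    if len(body) < threshold:
        return body, {}
    return gzip.compress(body), {"Content-Encoding": "gzip"}


def decode_response(payload, headers):
    """Decompress a response body if the server marked it as gzip."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(payload)
    return payload


small = b'{"log_id":0}'
large = b'{"float_data":[' + b"0.0137," * 500 + b"0.0137]}"
print(maybe_compress(small)[1])
print(maybe_compress(large)[1])
```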
--- a/doc/Offical_Docs/5-5_RPC_API_CN.md
+++ b/doc/Offical_Docs/5-5_RPC_API_CN.md
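+# Accessing the Server over RPC
+Paddle Serving uses the [brpc framework](https://github.com/apache/incubator-brpc) for Client/Server communication. brpc is an RPC network framework open-sourced by Baidu that offers high concurrency and low latency. It already backs more than a million online inference instances and thousands of online inference services, including those inside Baidu, and has proven stable and reliable. Compared with the gRPC framework, it has lower latency and higher concurrency, and it supports multiple protocols such as <mark>**brpc/grpc/http+json/http+proto**</mark>. This document describes how to communicate over BRPC.
+- [Example](#1)
+  - [1.1 Get the model](#1.1)
+  - [1.2 Start the server](#1.2)
+- [Client requests](#2)
+  - [2.1 C++](#2.1)
+  - [2.2 Python](#2.2)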
+<a name="1"></a>
+## Example
+We use examples/C++/fit_a_line to show how to access the Server over RPC.
+<a name="1.1"></a>
+**I. Get the model:**
+```shell
+sh get_data.sh
+```
+<a name="1.2"></a>
+**II. Start the server:**
+```shell
+python3.6 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9393
+```
+The server needs no modification to support RPC.
+<a name="2"></a>
+## Client requests
+<a name="2.1"></a>
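+**I. C++:**
+Basic usage consists of four steps:
+- 1. Create a Client object.
+- 2. Load the Client prototxt configuration file (in this example, uci_housing_client/serving_client_conf.prototxt under examples/C++/fit_a_line).
+- 3. Prepare the request data.
+- 4. Call the predict function to request the inference service over brpc.
+An example:
+``` c++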
+  std::unique_ptr<ServingClient> client;
+  client.reset(new ServingBrpcClient());
+  std::vector<std::string> confs;
+  confs.push_back(conf);
+  if (client->init(confs, url) != 0) {
+    LOG(ERROR) << "Failed to init client!";
+    return 0;
+  }
+  PredictorInputs input;
+  PredictorOutputs output;
+  std::vector<std::string> fetch_name;
+  std::vector<float> float_feed = {0.0137f, -0.1136f, 0.2553f, -0.0692f,
+            0.0582f, -0.0727f, -0.1583f, -0.0584f,
+            0.6283f, 0.4919f, 0.1856f, 0.0795f, -0.0332f};
+  std::vector<int> float_shape = {1, 13};
+  std::string feed_name = "x";
+  fetch_name = {"price"};
+  std::vector<int> lod;
+  input.add_float_data(float_feed, feed_name, float_shape, lod);
+  if (client->predict(input, output, fetch_name, 0) != 0) {
+    LOG(ERROR) << "Failed to predict!";
+  }
+  else {
+    LOG(INFO) << output.print();
+  }
+```
+For complete usage, see [simple_client.cpp](./example/simple_client.cpp), which provides ready-made wrapper methods.
+```shell
+./simple_client --client_conf="uci_housing_client/serving_client_conf.prototxt" --server_port="127.0.0.1:9393" --test_type="brpc" --sample_type="fit_a_line"
+```
+| Argument                                       | Type | Default                              | Description                                           |
+| ---------------------------------------------- | ---- | ------------------------------------ | ----------------------------------------------------- |
+| `client_conf`                                  | str  | `"serving_client_conf.prototxt"`     | Path of client conf                                   |
+| `server_port`                                  | str  | `"127.0.0.1:9393"`                   | Exposed ip:port of server                             |
+| `test_type`                                    | str  | `"brpc"`                             | Mode of request "brpc"                                |
+| `sample_type`                                  | str  | `"fit_a_line"`                       | Type of sample include "fit_a_line,bert"              |
+<a name="2.2"></a>
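+**II. Python:**
+To make it easy to call the Server's inference service over RPC, common functions such as request body packing, compression, and request encryption are wrapped in a Client class.
+In the simplest case, using Client takes four steps:
+- 1. Create a Client object.
+- 2. Load the Client prototxt configuration file (in this example, uci_housing_client/serving_client_conf.prototxt under examples/C++/fit_a_line).
+- 3. Call the connect function.
+- 4. Call the predict function to request the inference service over RPC.
+```python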
+from paddle_serving_client import Client
+import sys
+import numpy as np
+client = Client()
+client.load_client_config(sys.argv[1])
+client.connect(["127.0.0.1:9393"])
+fetch_list = client.get_fetch_names()
+new_data = np.zeros((1, 13)).astype("float32")
+new_data[0] = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
+fetch_map = client.predict(
+    feed={"x": new_data}, fetch=fetch_list, batch=True)
+print(fetch_map)
+```
\ No newline at end of file
--- a/doc/Offical_Docs/6-3_Request_Cache_CN.md
+++ b/doc/Offical_Docs/6-3_Request_Cache_CN.md
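+# Request cache
+This document describes the request cache feature and how it works.
+## Basic principles
+A request to the service consists of the input tensors (tensor), the output names (fetch_var_names), the debug switch (profile_server), and the identifier (log_id); the prediction result contains the output tensors, among other fields. The cache stores request/result pairs as key-value entries. When a request hits the cache, the service skips model inference and returns the result directly from the cache. In some scenarios this reduces request latency significantly.
+The cache is enabled with `--request_cache_size`. The flag defaults to 0, which leaves the cache disabled. With a non-zero value, the service enables the cache and uses the value as the storage limit, in bytes.
+The cache key is a 64-bit integer: a 64-bit hash computed from the tensor and fetch_var_names data of the request. On a hit, the stored result is used to build the response. On a miss, the service runs model inference and stores the result in the cache while returning it. Because the cache has a storage limit, an eviction policy is needed to bound its size; the service currently uses least recently used (LRU) eviction.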
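The mechanism described above (a 64-bit key, a byte-based size limit, LRU eviction) can be sketched in a few lines. This is an illustration, not the server's implementation: `request_key` and `RequestCache` are hypothetical names, and BLAKE2b truncated to 8 bytes stands in for whatever 64-bit hash the service actually uses.

```python
import hashlib
from collections import OrderedDict


def request_key(tensor_bytes, fetch_var_names):
    """Hash tensor data and fetch names into a 64-bit integer key."""
    h = hashlib.blake2b(digest_size=8)  # assumed stand-in for the real hash
    h.update(tensor_bytes)
    for name in fetch_var_names:
        h.update(name.encode("utf-8") + b"\0")
    return int.from_bytes(h.digest(), "big")


class RequestCache:
    """LRU cache bounded by the total size of stored results, in bytes."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, result):
        if self.capacity <= 0 or len(result) > self.capacity:
            return  # cache disabled, or result can never fit
        if key in self.items:
            self.used -= len(self.items.pop(key))
        self.items[key] = result
        self.used += len(result)
        while self.used > self.capacity:
            _, evicted = self.items.popitem(last=False)  # least recently used
            self.used -= len(evicted)


cache = RequestCache(capacity_bytes=8)
k1 = request_key(b"tensor-1", ["price"])
cache.put(k1, b"1234")
print(cache.get(k1))
```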
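+## Notes
+ - Only successful predictions are cached. If a request fails or an error is returned during inference, the result is not cached.
+ - The cache is keyed on a hash of the request data, so two different requests may produce the same hash (a hash collision), in which case the service may return the wrong response. Because the hash is 64 bits wide, collisions are unlikely.
+ - The cache works in both synchronous and asynchronous mode.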
--- a/doc/Offical_Docs/6-4_Encryption_CN.md
+++ b/doc/Offical_Docs/6-4_Encryption_CN.md
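+# Encrypted model inference
+Paddle Serving provides encrypted model inference. This document describes the details.
+## Principle
+The model is encrypted with a symmetric encryption algorithm. Symmetric algorithms use the same key for encryption and decryption; they are computationally cheap and fast, and are the most commonly used encryption method.
+**I. Obtaining an encrypted model:**
+An ordinary model and its parameters can be thought of as a string. Applying an encryption algorithm to it, with your key as the parameter, turns them into an encrypted model and parameters.
+We provide a simple demo for encrypting a model; see [examples/C++/encryption/encrypt.py](../../examples/C++/encryption/encrypt.py).
+**II. Starting the encrypted service:**
+Assuming you already have an encrypted model (under `encrypt_server/`), you can start the encrypted model service by adding the extra command-line argument `--use_encryption_model`.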
+CPU Service
+```
+python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_encryption_model
+```
+GPU Service
+```
+python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_encryption_model --gpu_ids 0
+```
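+At this point the server does not actually start; it waits for the key.
+**III. Client-side encrypted inference:**
+First, you must have the key that was used to encrypt the model.
+You then configure the client with this key. When the client connects, the key is sent to the server, which keeps it.
+Once the server has the key, it uses it to decrypt the model and starts the inference service.
+**IV. Encrypted model inference example:**
+For an example of encrypted model inference, see [examples/C++/encryption/](../../examples/C++/encryption/).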
--- a/doc/Offical_Docs/6-6_OP_CN.md
+++ b/doc/Offical_Docs/6-6_OP_CN.md
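+# How to develop a new General Op
+- [Defining an Op](#1)
+- [Using `GeneralBlob` between Ops](#2)
+  - [2.1 Implementing `int inference()`](#2.1)
+- [Defining the Python API](#3)
+This document focuses on how to develop a new server-side operator for Paddle Serving. Before writing a new operator, let us look at some sample code to get the basic idea. We assume you already know the basic computation logic of the Paddle Serving server. The code below can be found in the `core/general-server/op` directory of the Serving repository.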
+``` c++
+#pragma once
+#include <string>
+#include <vector>
+#include "paddle_inference_api.h"  // NOLINT
+#include "core/general-server/general_model_service.pb.h"
+#include "core/general-server/op/general_infer_helper.h"
+namespace baidu {
+namespace paddle_serving {
+namespace serving {
+class GeneralInferOp
+    : public baidu::paddle_serving::predictor::OpWithChannel<GeneralBlob> {
+ public:
+  typedef std::vector<paddle::PaddleTensor> TensorVector;
+  DECLARE_OP(GeneralInferOp);
+  int inference();
+};
+}  // namespace serving
+}  // namespace paddle_serving
+}  // namespace baidu
+```
+<a name="1"></a>
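+## Defining an Op
+The header above declares a Paddle Serving operator named `GeneralInferOp`. At run time, the function `int inference()` is called. A server-side operator is usually defined as a subclass of `baidu::paddle_serving::predictor::OpWithChannel` and uses the `GeneralBlob` data structure.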
+<a name="2"></a>
+## 在Op之间使用 `GeneralBlob` 
+`GeneralBlob` 是一种可以在服务器端运算符之间使用的数据结构。 `tensor_vector` 是 `GeneralBlob` 中最重要的数据结构。 服务器端的操作员可以将多个 `paddle::PaddleTensor` 作为输入，并可以将多个 `paddle::PaddleTensor `作为输出。 特别是，`tensor_vector` 可以在没有内存拷贝的操作下输入到 Paddle 推理引擎中。
+``` c++
+struct GeneralBlob {
+  std::vector<paddle::PaddleTensor> tensor_vector;
+  int64_t time_stamp[20];
+  int p_size = 0;
+  int _batch_size;
+  void Clear() {
+    size_t tensor_count = tensor_vector.size();
+    for (size_t ti = 0; ti < tensor_count; ++ti) {
+      tensor_vector[ti].shape.clear();
+    }
+    tensor_vector.clear();
+  }
+  void SetBatchSize(int batch_size) { _batch_size = batch_size; }
+  int GetBatchSize() const { return _batch_size; }
+  std::string ShortDebugString() const { return "Not implemented!"; }
+};
+```
+<a name="2.1"></a>
+**1. Implement `int inference()`**
+``` c++
+int GeneralInferOp::inference() {
+  VLOG(2) << "Going to run inference";
+  const GeneralBlob *input_blob = get_depend_argument<GeneralBlob>(pre_name());
+  VLOG(2) << "Get precedent op name: " << pre_name();
+  GeneralBlob *output_blob = mutable_data<GeneralBlob>();
+  if (!input_blob) {
+    LOG(ERROR) << "Failed mutable depended argument, op:" << pre_name();
+    return -1;
+  }
+  const TensorVector *in = &input_blob->tensor_vector;
+  TensorVector *out = &output_blob->tensor_vector;
+  int batch_size = input_blob->GetBatchSize();
+  VLOG(2) << "input batch size: " << batch_size;
+  output_blob->SetBatchSize(batch_size);
+  VLOG(2) << "infer batch size: " << batch_size;
+  Timer timeline;
+  int64_t start = timeline.TimeStampUS();
+  timeline.Start();
+  if (InferManager::instance().infer(engine_name().c_str(), in, out, batch_size)) {
+    LOG(ERROR) << "Failed do infer in fluid model: " << engine_name().c_str();
+    return -1;
+  }
+  int64_t end = timeline.TimeStampUS();
+  CopyBlobInfo(input_blob, output_blob);
+  AddBlobInfo(output_blob, start);
+  AddBlobInfo(output_blob, end);
+  return 0;
+}
+DEFINE_OP(GeneralInferOp);
+```
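+Both `input_blob` and `output_blob` hold multiple `paddle::PaddleTensor`s, and the Paddle inference library is invoked through `InferManager::instance().infer(engine_name().c_str(), in, out, batch_size)`. Most of the other code in this function is for performance profiling, and we may remove the redundant code in the future.
+<a name="3"></a>
+## Define the Python API
+After defining the C++ operator for Paddle Serving on the server side, the last step is to register it in the Python API of the Paddle Serving server. The API registration code in `python/paddle_serving_server/dag.py` is as follows: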
+``` python
+self.op_list = [
+            "GeneralInferOp",
+            "GeneralReaderOp",
+            "GeneralResponseOp",
+            "GeneralTextReaderOp",
+            "GeneralTextResponseOp",
+            "GeneralSingleKVOp",
+            "GeneralDistKVInferOp",
+            "GeneralDistKVOp",
+            "GeneralCopyOp",
+            "GeneralDetectionOp",
+        ]
+```
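+In `python/paddle_serving_server/server.py`, add only the class names of custom C++ OPs that need to load a model and run inference. For example, `GeneralReaderOp` only does simple data processing and does not load a model or call prediction, so it must be added to the list above but not to the list below.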
+``` python
+default_engine_types = [
+                'GeneralInferOp',
+                'GeneralDistKVInferOp',
+                'GeneralDistKVQuantInferOp',
+                'GeneralDetectionOp',
+            ]
+```
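The registration rule for the two lists can be sketched as a small helper. This is only an illustration of the rule, not code from Paddle Serving; `register_custom_op` and `MyPreprocessOp` are hypothetical names, and the lists are shortened copies of the ones shown above.

```python
def register_custom_op(op_list, engine_types, class_name, loads_model):
    """Return updated copies of both lists for a new custom C++ OP class name.

    Every custom OP goes into op_list; only OPs that load a model and run
    inference also go into engine_types.
    """
    new_op_list = list(op_list)
    new_engine_types = list(engine_types)
    if class_name not in new_op_list:
        new_op_list.append(class_name)
    if loads_model and class_name not in new_engine_types:
        new_engine_types.append(class_name)
    return new_op_list, new_engine_types


# Shortened stand-ins for the lists in dag.py and server.py.
op_list = ["GeneralInferOp", "GeneralReaderOp", "GeneralResponseOp"]
default_engine_types = ["GeneralInferOp"]

# An OP that runs a model is registered in both lists.
op_list, default_engine_types = register_custom_op(
    op_list, default_engine_types, "GeneralRecOp", loads_model=True)

# A data-processing-only OP (hypothetical name) is registered in op_list only.
op_list, default_engine_types = register_custom_op(
    op_list, default_engine_types, "MyPreprocessOp", loads_model=False)
```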
--- a/doc/Offical_Docs/6-7_Model_Ensemble_CN.md
+++ b/doc/Offical_Docs/6-7_Model_Ensemble_CN.md
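+# How to define a model ensemble in C++
+If your processing pipeline contains more than one model inference stage (for example, OCR usually needs two stages, det+rec), there are two ways to meet your needs.
+1. Start two Serving services (for example Serving-det and Serving-rec). In your Client: read data -> det preprocessing -> call Serving-det for prediction -> det postprocessing -> rec preprocessing -> call Serving-rec for prediction -> rec postprocessing -> output result.
+    - Pro: no changes to Paddle Serving code are needed.
+    - Con: two service requests are needed, and efficiency drops as the request payload grows.
+2. Modify the code to customize the model prediction behavior (custom OP) and the service processing flow (custom DAG), integrating the combined processing of multiple models (the det preprocessing -> call Serving-det for prediction -> det postprocessing -> rec preprocessing -> call Serving-rec for prediction -> rec postprocessing chain above) into one Serving service. In your Client: read data -> call the integrated Serving -> output result.
+    - Pro: only one service request is needed, so efficiency is high.
+    - Con: code changes and recompilation are required.
+This document mainly describes how to customize the service processing flow. The basic steps are:
+1. Define a custom OP (that is, the preprocessing, model prediction, and postprocessing of a single model)
+2. Compile
+3. Start and call the service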
+## Custom OP
+An OP defines the preprocessing, model prediction, and postprocessing of a single model. Defining an OP takes the following 2 steps:
+1. Define the C++ .h header file
+2. Define the C++ .cpp source file
+**1. Define the C++ .h header file**
+Copy the code below and replace `/* custom class name */` with your own class name, such as `GeneralDetectionOp`.
+Place the file under `core/general-server/op/`; the file name is up to you, such as `general_detection_op.h`.
+``` C++
+#pragma once
+#include <string>
+#include <vector>
+#include "core/general-server/general_model_service.pb.h"
+#include "core/general-server/op/general_infer_helper.h"
+#include "paddle_inference_api.h"  // NOLINT
+namespace baidu {
+namespace paddle_serving {
+namespace serving {
+class /* custom class name */
+    : public baidu::paddle_serving::predictor::OpWithChannel<GeneralBlob> {
+ public:
+  typedef std::vector<paddle::PaddleTensor> TensorVector;
+  DECLARE_OP(/* custom class name */);
+  int inference();
+};
+}  // namespace serving
+}  // namespace paddle_serving
+}  // namespace baidu
+```
+**2. Define the C++ .cpp source file**
+Copy the code below and replace `/* custom class name */` with your own class name, such as `GeneralDetectionOp`.
+Add your preprocessing and postprocessing code at the positions marked by the preprocessing and postprocessing comments in the code below.
+Place the file under `core/general-server/op/`; the file name is up to you, such as `general_detection_op.cpp`.
+``` C++
+#include "core/general-server/op/<custom header file name>"
+#include <algorithm>
+#include <iostream>
+#include <memory>
+#include <sstream>
+#include "core/predictor/framework/infer.h"
+#include "core/predictor/framework/memory.h"
+#include "core/predictor/framework/resource.h"
+#include "core/util/include/timer.h"
+namespace baidu {
+namespace paddle_serving {
+namespace serving {
+using baidu::paddle_serving::Timer;
+using baidu::paddle_serving::predictor::MempoolWrapper;
+using baidu::paddle_serving::predictor::general_model::Tensor;
+using baidu::paddle_serving::predictor::general_model::Response;
+using baidu::paddle_serving::predictor::general_model::Request;
+using baidu::paddle_serving::predictor::InferManager;
+using baidu::paddle_serving::predictor::PaddleGeneralModelConfig;
+int /* custom class name */::inference() {
+  // Get the predecessor OP node.
+  const std::vector<std::string> pre_node_names = pre_names();
+  if (pre_node_names.size() != 1) {
+    LOG(ERROR) << "This op(" << op_name()
+               << ") can only have one predecessor op, but received "
+               << pre_node_names.size();
+    return -1;
+  }
+  const std::string pre_name = pre_node_names[0];
+  // Use the output of the predecessor OP as the input of this OP.
+  GeneralBlob *input_blob = mutable_depend_argument<GeneralBlob>(pre_name);
+  if (!input_blob) {
+    LOG(ERROR) << "input_blob is nullptr,error";
+    return -1;
+  }
+  TensorVector *in = &input_blob->tensor_vector;
+  uint64_t log_id = input_blob->GetLogId();
+  int batch_size = input_blob->_batch_size;
+  // Initialize the output of this OP.
+  GeneralBlob *output_blob = mutable_data<GeneralBlob>();
+  output_blob->SetLogId(log_id);
+  output_blob->_batch_size = batch_size;
+  VLOG(2) << "(logid=" << log_id << ") infer batch size: " << batch_size;
+  TensorVector *out = &output_blob->tensor_vector;
+  // Add preprocessing code here; preprocessing modifies TensorVector* in directly.
+  // Note: the data in `in` is the predecessor node's `out` after its postprocessing.
+  Timer timeline;
+  int64_t start = timeline.TimeStampUS();
+  timeline.Start();
+  // Pass the preprocessed `in` and the initialized `out` to model prediction;
+  // the prediction output is written directly into the memory `out` points to.
+  // To define an OP that only processes data without calling a model,
+  // delete the following block.
+  if (InferManager::instance().infer(
+          engine_name().c_str(), in, out, batch_size)) {
+    LOG(ERROR) << "(logid=" << log_id
+               << ") Failed do infer in fluid model: " << engine_name().c_str();
+    return -1;
+  }
+  // Add postprocessing code here; postprocessing modifies TensorVector* out directly.
+  // The postprocessed `out` is passed on to the following nodes.
+  int64_t end = timeline.TimeStampUS();
+  CopyBlobInfo(input_blob, output_blob);
+  AddBlobInfo(output_blob, start);
+  AddBlobInfo(output_blob, end);
+  return 0;
+}
+DEFINE_OP(/* custom class name */);
+}  // namespace serving
+}  // namespace paddle_serving
+}  // namespace baidu
+```
+1. TensorVector data structure
+`TensorVector* in` and `out` are both pointers of type TensorVector. They are used almost the same way as Tensor in the Paddle C++ API. The related data structures are shown below.
+``` C++
+//TensorVector
+typedef std::vector<paddle::PaddleTensor> TensorVector;
+//paddle::PaddleTensor
+struct PD_INFER_DECL PaddleTensor {
+  PaddleTensor() = default;
+  std::string name;  ///<  variable name.
+  std::vector<int> shape;
+  PaddleBuf data;  ///<  blob of data.
+  PaddleDType dtype;
+  std::vector<std::vector<size_t>> lod;  ///<  Tensor+LoD equals LoDTensor
+};
+//PaddleBuf
+class PD_INFER_DECL PaddleBuf {
+ public:
+ explicit PaddleBuf(size_t length)
+      : data_(new char[length]), length_(length), memory_owned_(true) {}
+  PaddleBuf(void* data, size_t length)
+      : data_(data), length_(length), memory_owned_{false} {}
+  explicit PaddleBuf(const PaddleBuf& other);
+  void Resize(size_t length);
+  void Reset(void* data, size_t length);
+  bool empty() const { return length_ == 0; }
+  void* data() const { return data_; }
+  size_t length() const { return length_; }
+  ~PaddleBuf() { Free(); }
+  PaddleBuf& operator=(const PaddleBuf&);
+  PaddleBuf& operator=(PaddleBuf&&);
+  PaddleBuf() = default;
+  PaddleBuf(PaddleBuf&& other);
+ private:
+  void Free();
+  void* data_{nullptr};  ///< pointer to the data memory.
+  size_t length_{0};     ///< number of memory bytes.
+  bool memory_owned_{true};
+};
+```
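+2. TensorVector code example
+```C++
+/* For example, to access the 1st Tensor in the input data */
+paddle::PaddleTensor& tensor_1 = in->at(0);
+/* For example, to change the name of the 1st Tensor in the input data */
+tensor_1.name = "new name";
+/* For example, to get the shape of the 1st Tensor in the input data */
+std::vector<int> tensor_1_shape = tensor_1.shape;
+/* For example, to modify the data of the 1st Tensor in the input data */
+void* data_1 = tensor_1.data.data();
+// From here, modify the memory that data_1 points to directly.
+// For example, if your data is of type int, cast the void* to int* and process it.
+```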
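+## Compile after modification
+At this point you need to recompile serving, set the environment variable with `export SERVING_BIN` to point to the serving binary you built, and install the related python packages with `pip3 install`. For details see [How to compile Serving](2-3_Compile_CN.md).
+## Start and call the service
+**1. Start the Server**
+Building on the previous two sections, to chain two models in one service, pass the relative paths of the model folders in order after `--model`, and pass the custom C++ OP class names in order after `--op`. The order of the models after `--model` must match the order of the class names after `--op`. Assuming two OPs, `GeneralDetectionOp` and `GeneralRecOp`, have already been defined, the script is:
+```shell
+# Chain multiple models in one service
+python3 -m paddle_serving_server.serve --model ocr_det_model ocr_rec_model --op GeneralDetectionOp GeneralRecOp --port 9292
+# In the chain, ocr_det_model maps to GeneralDetectionOp and ocr_rec_model maps to GeneralRecOp
+```
+**2. Call from the Client**
+The Client call also needs the paths of the two Client-side proto files or folders. Taking OCR as an example, you can write your own script based on [ocr_cpp_client.py](../../examples/C++/PaddleOCR/ocr/ocr_cpp_client.py). The Client is then called as follows:
+```shell
+# Chain multiple models in one service
+python3 your_custom_script.py ocr_det_client ocr_rec_client
+# ocr_det_client is the relative path of the first model's Client-side proto folder
+# ocr_rec_client is the relative path of the second model's Client-side proto folder
+```
+For the Server, the input data format follows the Client-side proto of the first model, and the output data format follows the Client-side proto of the last model. Normally you do not need to care about this; if you need the detailed proto definition, see [Serving configuration](5-3_Serving_Configure_CN.md).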
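Because the order of `--model` and `--op` must match, it can help to build the launch command from paired lists. The following is a minimal sketch under that assumption; `build_serve_command` is a hypothetical helper, not part of Paddle Serving, and it uses only the flags shown above.

```python
import shlex


def build_serve_command(models, ops, port):
    """Build the serve command, keeping models and OP class names paired by position."""
    if len(models) != len(ops):
        raise ValueError("each model needs exactly one OP class name")
    argv = ["python3", "-m", "paddle_serving_server.serve",
            "--model", *models,
            "--op", *ops,
            "--port", str(port)]
    return argv


argv = build_serve_command(
    ["ocr_det_model", "ocr_rec_model"],
    ["GeneralDetectionOp", "GeneralRecOp"],
    9292)
command = shlex.join(argv)  # shell-ready string for logging or copy-paste
```

The resulting `argv` list can be passed to `subprocess.run` as is; the length check catches a model without a matching OP before the server is launched.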
--- a/doc/Offical_Docs/6-8_C++_Server_Preformance_Tuning_CN.md
+++ b/doc/Offical_Docs/6-8_C++_Server_Preformance_Tuning_CN.md
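+# C++ Serving performance analysis and optimization
+## Background knowledge
+1) First, make sure you are familiar with the common [features](./Introduction_CN.md) of C++ Serving and the [detailed description of C++ Serving configuration and startup](../Serving_Configure_CN.md).
+2) For performance analysis of the C++ Serving framework itself, see [C++ Serving framework performance test](./Frame_Performance_CN.md).
+3) You need some understanding of your model, your machine environment, and the business you are deploying. For example: whether you predict on CPU or GPU; whether TRT acceleration can be enabled; how many cores your CPU has; how many models your business involves; what processing each model's inputs and outputs need; the peak online traffic of your business; the maximum input batch size your model supports; and so on.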
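+# 2. Server thread count
+First, a larger Server-side thread count N is not always better. Thread switching involves switching between user space and kernel space and has a cost. If you have 1 core and 100000 threads, frequent thread switching brings a non-negligible performance overhead.
+In the BRPC framework, the number of user-space coroutine workers M >> the number of threads N, and a coroutine worker can run on any thread. When an RPC network I/O operation yields the CPU, BRPC switches coroutine workers, which raises the concurrency of the RPC framework. So in the extreme case where your code has no thread-blocking I/O or network operations other than RPC communication, the thread count can simply equal the number of machine cores, and you need not worry that all N threads are busy with RPC network I/O and leave the CPU underused.
+Setting the Server-side <mark>**thread count N**</mark> requires weighing three factors:
+## 2.1 Maximum concurrent requests M
+When setting N from the maximum number of concurrent requests, the data in [C++ Serving framework performance test](./Frame_Performance_CN.md) suggests that <mark>**N should equal, or be slightly less than, the maximum concurrent requests M**</mark>; the average processing latency is lowest in that case.
+This is easy to understand. As an extreme example, if you only ever have 1 request at a time, setting the Server thread count to 1 is the most reasonable choice, because there is no thread-switching overhead. Any thread count greater than 1 necessarily introduces thread-switching overhead.
+## 2.2 Number of machine cores C
+When setting N from the number of machine cores: a thread is the smallest unit scheduled onto a CPU core, so to fully use all cores within one process, <mark>**the thread count should be at least the number of machine cores C**</mark>. The right ratio N/C, however, depends on the share of network, I/O, memory, and compute in your code. In practice you can try different thread counts, measure CPU utilization, and adjust iteratively.
+## 2.3 Model prediction time T
+When you predict on CPU, the prediction stage is computed by the CPU; in this case set the thread count according to the two factors above.
+When you predict on GPU, things differ. The prediction stage is computed by the GPU while the CPU is idle, yet the prediction call blocks the thread, much like a Sleep. If the thread count == the number of machine cores, there are no other threads to switch to, so some cores will inevitably sit idle. Specifically, when the model prediction time is short (<10ms), the Server thread count should not be too large (1 to 10 times the number of cores), otherwise the thread-switching overhead becomes non-negligible. When the model prediction time is long, the Server thread count should be somewhat larger (4 to 200 times the number of cores).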
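The three factors above can be read as a rough starting range for N. The sketch below is one possible reading, not a rule from Paddle Serving: `suggest_thread_range` is a hypothetical helper, and treating every prediction time of 10ms or more as "long" is an assumption, since the text only gives the <10ms threshold. Measure CPU utilization and latency to pick the final value.

```python
def suggest_thread_range(cores, max_concurrency, use_gpu, predict_ms):
    """Return a (low, high) starting range for the Server thread count N."""
    if not use_gpu:
        # CPU prediction: aim for at least one thread per core (2.2),
        # but never more than the maximum concurrent requests M (2.1).
        low = min(cores, max_concurrency)
        return low, max_concurrency
    if predict_ms < 10:
        # Short GPU prediction: 1 to 10 times the number of cores (2.3).
        return 1 * cores, 10 * cores
    # Long GPU prediction (assumed: >= 10ms): 4 to 200 times the number of cores.
    return 4 * cores, 200 * cores


cpu_range = suggest_thread_range(cores=8, max_concurrency=32, use_gpu=False, predict_ms=5)
gpu_short = suggest_thread_range(cores=8, max_concurrency=32, use_gpu=True, predict_ms=5)
gpu_long = suggest_thread_range(cores=8, max_concurrency=32, use_gpu=True, predict_ms=50)
```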
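+# 3. Asynchronous mode
+When <mark>**the batch size of most user Requests << the maximum Batch size the model supports**</mark>, asynchronous mode brings a clear benefit.
+In asynchronous mode, the model prediction stage is decoupled from the RPC threads: the model gets its own thread pool with a configurable number of threads. After RPC receives a Request, it puts the request data into the Task queue of the model's thread pool; threads in the pool take data from the Task queue, merge it into a Batch, and then run prediction, which raises QPS. See [C++ Serving feature overview](./Introduction_CN.md) for more details, and [C++ Serving vs TensorFlow Serving performance comparison](./Benchmark_CN.md) for data comparing synchronous and asynchronous modes; under the conditions of that test, asynchronous mode is 50% faster than synchronous mode.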
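The batch-merging idea can be sketched in a few lines. This is a simplified, single-threaded illustration of the principle and not the C++ Serving implementation; `merge_into_batches` and the greedy, order-preserving policy are assumptions.

```python
from collections import deque


def merge_into_batches(request_sizes, batch_infer_size):
    """Greedily merge queued requests into batches of at most batch_infer_size samples.

    Each entry of request_sizes is the batch size of one queued request.
    """
    queue = deque(request_sizes)
    batches = []
    current, current_size = [], 0
    while queue:
        size = queue.popleft()
        if size > batch_infer_size:
            raise ValueError("a single request exceeds batch_infer_size")
        if current_size + size > batch_infer_size:
            # The next request does not fit: run prediction on what we have.
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Six small requests are served with two prediction calls instead of six.
batches = merge_into_batches([1, 2, 1, 4, 3, 1], batch_infer_size=8)
```

When most requests are much smaller than the model's maximum batch, each prediction call handles several requests at once, which is where the QPS gain comes from.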
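+Asynchronous mode can be enabled in either of the following two ways.
+## 3.1 Start the C++ Server with the Python helper command
+Add `--runtime_thread_num 2` to `python3 -m paddle_serving_server.serve` to enable asynchronous mode for the model; the 2 means the model's asynchronous thread pool has 2 threads. The default value is 0, which means asynchronous mode is not used. Choose the value of `--runtime_thread_num` based on your model, your data, and the available GPU memory.
+Add `--batch_infer_size 32` to set the maximum allowed input Batch of the model to 32. This argument takes effect only when asynchronous mode is enabled.
+## 3.2 Start the C++ Server with the command line plus configuration files
+Here, modifying the `runtime_thread_num` and `batch_infer_size` fields in `model_toolkit.prototxt` achieves the same effect.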
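+# 4. Multi-model combination
+When <mark>**your business needs to call several models for prediction**</mark> and you want the best possible performance, consider implementing it with C++ Serving [custom OPs](./OP_CN.md) and a [custom DAG](./DAG_CN.md).
+## 4.1 Pros
+Because the models are combined inside one service, the time for network I/O and for serialization and deserialization is saved. The gain is especially clear when the data volume is large (in our measurement, a single transfer of 40MB of data took 160-170ms of RPC time).
+## 4.2 Cons
+1) You need to write custom OPs and a custom DAG in C++ to define how the models are combined.
+2) If pre- or post-processing is needed between the models, you also need to write that code in C++ between the OPs.
+3) The Server code must be recompiled.
+## 4.3 Example
+See the `C++ OCR Service` section of [examples/C++/PaddleOCR/ocr/README_CN.md](../../examples/C++/PaddleOCR/ocr/README_CN.md) and the example in [Ensemble prediction in Paddle Serving](./Model_Ensemble_CN.md).
+# 5. Request cache
+When <mark>**your business has many repeated requests**</mark>, consider using the C++ Serving [Request Cache](./Request_Cache_CN.md) to improve service performance.
+## 5.1 Pros
+The service can cache request results, storing request data and results as key-value pairs. When a repeated request arrives, the result is fetched directly from the cache by the request data and returned, without model prediction or other processing (the time taken depends on the request data size and is on the order of milliseconds).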
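The key-value idea behind the cache can be sketched as follows. This is a simplified illustration, not the C++ Serving Request Cache; the class `RequestCache`, the SHA-256 key, and the "stop inserting when full" policy are assumptions.

```python
import hashlib


class RequestCache:
    """Map a hash of the request bytes to the stored prediction result."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(request_bytes):
        return hashlib.sha256(request_bytes).hexdigest()

    def get_or_compute(self, request_bytes, predict):
        key = self._key(request_bytes)
        if key in self._store:
            self.hits += 1  # repeated request: skip prediction entirely
            return self._store[key]
        self.misses += 1  # miss: pay for the lookup and the prediction
        result = predict(request_bytes)
        if len(self._store) < self.max_entries:
            self._store[key] = result
        return result


calls = []


def fake_predict(data):
    """Stand-in for model prediction; records how often it is invoked."""
    calls.append(data)
    return len(data)


cache = RequestCache(max_entries=2)
first = cache.get_or_compute(b"abc", fake_predict)
second = cache.get_or_compute(b"abc", fake_predict)
```

The second call returns the stored result without invoking `fake_predict`, while a miss pays for both the key computation and the prediction.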
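+## 5.2 Cons
+1) Extra system memory is needed to cache request results; the cache size can be configured through a startup argument.
+2) For requests that miss the cache, extra time is spent looking up the cache by the request data (about a 1% increase in time).
+## 5.3 Example
+See the usage described in [Request Cache](./Request_Cache_CN.md).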
\ No newline at end of file
--- a/doc/Offical_Docs/8-0_Cube_CN.md
+++ b/doc/Offical_Docs/8-0_Cube_CN.md
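+# Sparse parameter indexing service Cube
+In sparse-parameter indexing scenarios such as recommendation and advertising systems, large-scale Embedding tables are commonly used. In industrial settings the number of sparse parameters is very large, on the order of 10^9, so serving large-scale sparse-parameter prediction on a single machine is impractical. We therefore introduce Cube, Baidu's industrial-grade product with years of use in sparse-parameter indexing, to provide a distributed sparse-parameter service.
+## How Cube works
+This chapter describes the basic usage and working principle of Cube. See [Cube architecture]()
+## Compiling and installing Cube
+This chapter describes how to compile and install each Cube component. See [Compiling and installing Cube]()
+## Cube basic features
+This chapter describes the basic features of Cube and how to use them. See [Cube basic features]()
+## Cube advanced features
+This chapter describes how to use the advanced features of Cube. See [Cube advanced features]()
+## Using Cube on K8S
+This chapter describes how to use Cube on the K8S platform. See [Using Cube on K8S]()
+## Cube deployment example
+This chapter describes a deployment example of Cube. See [Cube deployment example]()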
\ No newline at end of file
--- a/doc/Offical_Docs/8-1_Cube_Architecture_CN.md
+++ b/doc/Offical_Docs/8-1_Cube_Architecture_CN.md
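+# Sparse parameter indexing service Cube
+In sparse-parameter indexing scenarios such as recommendation and advertising systems, large-scale Embedding tables are commonly used. In industrial settings the number of sparse parameters is very large, on the order of 10^9, so serving large-scale sparse-parameter prediction on a single machine is impractical. We therefore introduce Cube, Baidu's industrial-grade product with years of use in sparse-parameter indexing. It is used to deploy large-scale sparse-parameter models, supports distributed management and fast updates of models, and supports low-latency batch access from Paddle Serving.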
+<img src="images/8-1_Cube_Architecture_CN_1.png">
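+## Cube components
+**1. cube-builder**
+cube-builder is the tool that generates shard files from a model and manages versions. A sparse-parameter file is usually one large file, so a hash function is used to split it into shards, and each node in the distributed deployment loads a different shard. Industrial scenarios also need periodic model delivery and streaming training, so model version management matters a great deal; this is missing when a model is saved during training, so cube-builder lets you attach version information manually while it generates the shards.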
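Hash-based sharding can be sketched as below. The text does not say which hash function or key layout cube-builder uses, so the MD5-based `shard_of` and the helper `split_into_shards` are assumptions chosen only to show the principle.

```python
import hashlib


def shard_of(key, num_shards):
    """Map a parameter key to a shard index with a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def split_into_shards(params, num_shards):
    """Split one large key -> value table into num_shards smaller tables."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in params.items():
        shards[shard_of(key, num_shards)][key] = value
    return shards


# A toy sparse-parameter table: feature id -> embedding vector.
params = {feature_id: [0.1 * feature_id] for feature_id in range(100)}
shards = split_into_shards(params, num_shards=4)

# A lookup only needs to contact the one shard that owns the key.
owner = shard_of(42, 4)
value = shards[owner][42]
```

Because the hash is stable, the builder and the query side agree on which node holds a given key without any shared lookup table.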
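+**2. cube-transfer**
+cube-transfer is the scheduling and management service. On one hand, it monitors the upstream model and downloads the model when it is updated. On the other hand, it calls cube-builder to shard the downloaded model, and then works with cube-agent to deliver the shard files.
+**3. cube-agent**
+cube-agent is the scheduling and management service used together with cube-transfer. It receives the shard files sent by cube-transfer and then signals the corresponding interface of cube-server to complete the delivery.
+**4. cube-server**
+cube-server provides the sparse-parameter service to the outside, based on Cube's KV capability. It offers a high-performance distributed query service through brpc and supports remote calls through a RestAPI.
+**5. cube-cli**
+cube-cli is the client of cube-server and is used to request sparse-parameter queries from cube-server. This component has been integrated into paddle serving: once the cube.conf configuration file is prepared and a kv_infer related op is specified in the server code, cube-cli is ready on the server side.
+## Delivery process
+A complete delivery goes through the following steps:
+- Store the trained model in the FileServer and generate a completion flag once the transfer finishes. The FileServer can be a file transfer service over the http protocol;
+- After cube-transfer detects the completion flag, it downloads the corresponding model files from the FileServer;
+- cube-transfer uses the cube-builder tool to shard the sparse-parameter model;
+- cube-transfer delivers the files to cube-agent;
+- cube-agent sends a load command to cube-server, telling it to hot-load the new parameter files;
+- cube-server responds to the query requests sent by Paddle Serving.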
\ No newline at end of file
--- a/doc/Offical_Docs/8-2_Cube_Compile_CN.md
+++ b/doc/Offical_Docs/8-2_Cube_Compile_CN.md
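+# Compiling Cube
+## Build dependencies
+**The main components and their build dependencies are listed below**
+|             Component             |             Description              |
+| :--------------------------: | :-------------------------------: |
+|             Cube-Server              |     C++ program that provides an efficient, fast RPC protocol      |
+|             Cube-Agent              |          Go program; requires a Go environment         |
+|           Cube-Transfer            |          Go program; requires a Go environment         |
+|            Cube-Builder             |           C++ program          |
+|            Cube-Cli            |          C++ component, already integrated into the C++ server; no separate build needed          |
+## Build method
+We recommend building with Docker; we have prepared the build environment with the dependencies above already configured. See [Image environment]().
+**1. Set the PYTHON environment variables**
+Decide which Python version to build against and set the corresponding environment variables as shown below. Three variables are needed: `PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES`, and `PYTHON_EXECUTABLE`. The following uses python 3.7 as an example.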
+```
+# Adjust these paths to match your own environment
+export PYTHON_INCLUDE_DIR=/usr/local/include/python3.7m/
+export PYTHON_LIBRARIES=/usr/local/lib/x86_64-linux-gnu/libpython3.7m.so
+export PYTHON_EXECUTABLE=/usr/local/bin/python3.7
+export GOPATH=$HOME/go
+export PATH=$PATH:$GOPATH/bin
+python3.7 -m pip install -r python/requirements.txt
+go env -w GO111MODULE=on
+go env -w GOPROXY=https://goproxy.cn,direct
+go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
+go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
+go install github.com/golang/protobuf/protoc-gen-go@v1.4.3
+go install google.golang.org/grpc@v1.33.0
+go env -w GO111MODULE=auto
+```
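+The meaning of each environment variable is shown in the table below.
+| cmake environment variable | Meaning | Notes | Needed in the Docker environment |
+|-----------------------|-------------------------------------|-------------------------------|--------------------|
+| PYTHON_INCLUDE_DIR | Directory containing Python.h, usually **/include/python3.7/Python.h | If it cannot be found, either 1) the development version of Python is not installed and must be reinstalled, or 2) you lack permission to view the relevant system directories. | Yes (/usr/local/include/python3.7) |
+| PYTHON_LIBRARIES | Directory containing libpython3.7.so or libpython3.7m.so, usually /usr/local/lib | If it cannot be found, either 1) the development version of Python is not installed and must be reinstalled, or 2) you lack permission to view the relevant system directories. | Yes (/usr/local/lib/x86_64-linux-gnu/libpython3.7m.so) |
+| PYTHON_EXECUTABLE | Directory containing python3.7, usually /usr/local/bin | | Yes (/usr/local/bin/python3.7) |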
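If you are unsure where these paths are on your machine, the running interpreter can report candidate values. The sketch below uses only the standard `sys` and `sysconfig` modules; `python_build_vars` is a hypothetical helper, and the reported library path is a guess that you should verify exists (on some builds `LDLIBRARY` names a static `.a` file, or `LIBDIR` is unset).

```python
import os
import sys
import sysconfig


def python_build_vars():
    """Collect candidate values for the three cmake variables from this interpreter."""
    include_dir = sysconfig.get_paths()["include"]
    lib_dir = sysconfig.get_config_var("LIBDIR") or ""
    lib_name = sysconfig.get_config_var("LDLIBRARY") or ""
    library = os.path.join(lib_dir, lib_name) if lib_dir and lib_name else ""
    return {
        "PYTHON_INCLUDE_DIR": include_dir,
        "PYTHON_LIBRARIES": library,
        "PYTHON_EXECUTABLE": sys.executable,
    }


build_vars = python_build_vars()
for name, value in build_vars.items():
    # Print lines that can be pasted into the shell after checking the paths.
    print(f"export {name}={value}")
```

Run it with the same interpreter you intend to build against, for example `python3.7`.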
+**2. Compile**
+```
+mkdir build_cube
+cd build_cube
+cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
+    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
+    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
+    -DSERVER=ON \
+    -DWITH_GPU=OFF ..
+make -j
+cd ..
+```
+When the build finishes, the compiled Cube components are under the `build_cube/core/cube` directory:
+- Cube-Server: build_cube/core/cube/cube-server/cube
+- Cube-Agent: build_cube/core/cube/cube-agent/src/cube-agent
+- Cube-Transfer: build_cube/core/cube/cube-transfer/src/cube-transfer
+- Cube-Builder: build_cube/core/cube/cube-builder/cube-builder
\ No newline at end of file
--- a/doc/Offical_Docs/images/8-1_Cube_Architecture_CN_1.png
+++ b/doc/Offical_Docs/images/8-1_Cube_Architecture_CN_1.png
--- a/doc/TensorRT_Dynamic_Shape_CN.md
+++ b/doc/TensorRT_Dynamic_Shape_CN.md
@@ -33,6 +33,7 @@ python -m paddle_serving_server.serve \
 **2. Set dynamic shape for C++ Serving**
+1. Method 1:
 Modify the following code in `**/paddle_inference/paddle/include/paddle_engine.h`
 ```
@@ -127,6 +128,55 @@ python -m paddle_serving_server.serve \
    }
 ```
+2. Method 2:
+In `**/python/paddle_serving_server/serve.py`, generate the configuration information following the code below,
+and apply it with the `server.set_trt_dynamic_shape_info(info)` method
+```
+def set_ocr_dynamic_shape_info():
+    info = []
+    min_input_shape = {
+        "x": [1, 3, 50, 50],
+        "conv2d_182.tmp_0": [1, 1, 20, 20],
+        "nearest_interp_v2_2.tmp_0": [1, 1, 20, 20],
+        "nearest_interp_v2_3.tmp_0": [1, 1, 20, 20],
+        "nearest_interp_v2_4.tmp_0": [1, 1, 20, 20],
+        "nearest_interp_v2_5.tmp_0": [1, 1, 20, 20]
+    }
+    max_input_shape = {
+        "x": [1, 3, 1536, 1536],
+        "conv2d_182.tmp_0": [20, 200, 960, 960],
+        "nearest_interp_v2_2.tmp_0": [20, 200, 960, 960],
+        "nearest_interp_v2_3.tmp_0": [20, 200, 960, 960],
+        "nearest_interp_v2_4.tmp_0": [20, 200, 960, 960],
+        "nearest_interp_v2_5.tmp_0": [20, 200, 960, 960],
+    }
+    opt_input_shape = {
+        "x": [1, 3, 960, 960],
+        "conv2d_182.tmp_0": [3, 96, 240, 240],
+        "nearest_interp_v2_2.tmp_0": [3, 96, 240, 240],
+        "nearest_interp_v2_3.tmp_0": [3, 24, 240, 240],
+        "nearest_interp_v2_4.tmp_0": [3, 24, 240, 240],
+        "nearest_interp_v2_5.tmp_0": [3, 24, 240, 240],
+    }
+    det_info = {
+        "min_input_shape": min_input_shape,
+        "max_input_shape": max_input_shape,
+        "opt_input_shape": opt_input_shape,
+    }
+    info.append(det_info)
+    min_input_shape = {"x": [1, 3, 32, 10], "lstm_1.tmp_0": [1, 1, 128]}
+    max_input_shape = {"x": [50, 3, 32, 1000], "lstm_1.tmp_0": [500, 50, 128]}
+    opt_input_shape = {"x": [6, 3, 32, 100], "lstm_1.tmp_0": [25, 5, 128]}
+    rec_info = {
+        "min_input_shape": min_input_shape,
+        "max_input_shape": max_input_shape,
+        "opt_input_shape": opt_input_shape,
+    }
+    info.append(rec_info)
+    return info
+```
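Before passing such a list to `server.set_trt_dynamic_shape_info(info)`, it can be worth checking that every tensor satisfies min <= opt <= max in each dimension, which TensorRT requires of a dynamic-shape range. The checker below is a hypothetical helper, not part of Paddle Serving; it assumes the same dictionary layout as the code above and reuses the `rec_info` values from it.

```python
def check_shape_info(info):
    """Raise ValueError unless min <= opt <= max holds per dimension for every tensor."""
    for model_index, entry in enumerate(info):
        mins = entry["min_input_shape"]
        maxs = entry["max_input_shape"]
        opts = entry["opt_input_shape"]
        if not (mins.keys() == maxs.keys() == opts.keys()):
            raise ValueError(f"model {model_index}: tensor names differ between min/max/opt")
        for name in mins:
            lo, opt, hi = mins[name], opts[name], maxs[name]
            if not (len(lo) == len(opt) == len(hi)):
                raise ValueError(f"model {model_index}, {name}: rank mismatch")
            for dim, (a, b, c) in enumerate(zip(lo, opt, hi)):
                if not a <= b <= c:
                    raise ValueError(
                        f"model {model_index}, {name}, dim {dim}: need min <= opt <= max")
    return True


# Same values as rec_info in the code above.
rec_info = {
    "min_input_shape": {"x": [1, 3, 32, 10], "lstm_1.tmp_0": [1, 1, 128]},
    "max_input_shape": {"x": [50, 3, 32, 1000], "lstm_1.tmp_0": [500, 50, 128]},
    "opt_input_shape": {"x": [6, 3, 32, 100], "lstm_1.tmp_0": [25, 5, 128]},
}
ok = check_shape_info([rec_info])
```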
 ## Pipeline Serving

--- a/doc/TensorRT_Dynamic_Shape_EN.md
+++ b/doc/TensorRT_Dynamic_Shape_EN.md
@@ -16,6 +16,8 @@ The following is the dynamic shape api
 For detail, please refer to API doc [C++](https://paddleinference.paddlepaddle.org.cn/api_reference/cxx_api_doc/Config/GPUConfig.html#tensorrt)/[Python](https://paddleinference.paddlepaddle.org.cn/api_reference/python_api_doc/Config/GPUConfig.html#tensorrt)
 ### C++ Serving
+1. Method 1: 
 Modify the following code in `**/paddle_inference/paddle/include/paddle_engine.h`
 ```
@@ -110,6 +112,54 @@ Modify the following code in `**/paddle_inference/paddle/include/paddle_engine.h
    }
 ```
+2. Method 2:
+Generate the configuration information by following the code below from `**/python/paddle_serving_server/serve.py`,
+then apply it with the `server.set_trt_dynamic_shape_info(info)` method.
+```
+def set_ocr_dynamic_shape_info():
+    info = []
+    min_input_shape = {
+        "x": [1, 3, 50, 50],
+        "conv2d_182.tmp_0": [1, 1, 20, 20],
+        "nearest_interp_v2_2.tmp_0": [1, 1, 20, 20],
+        "nearest_interp_v2_3.tmp_0": [1, 1, 20, 20],
+        "nearest_interp_v2_4.tmp_0": [1, 1, 20, 20],
+        "nearest_interp_v2_5.tmp_0": [1, 1, 20, 20]
+    }
+    max_input_shape = {
+        "x": [1, 3, 1536, 1536],
+        "conv2d_182.tmp_0": [20, 200, 960, 960],
+        "nearest_interp_v2_2.tmp_0": [20, 200, 960, 960],
+        "nearest_interp_v2_3.tmp_0": [20, 200, 960, 960],
+        "nearest_interp_v2_4.tmp_0": [20, 200, 960, 960],
+        "nearest_interp_v2_5.tmp_0": [20, 200, 960, 960],
+    }
+    opt_input_shape = {
+        "x": [1, 3, 960, 960],
+        "conv2d_182.tmp_0": [3, 96, 240, 240],
+        "nearest_interp_v2_2.tmp_0": [3, 96, 240, 240],
+        "nearest_interp_v2_3.tmp_0": [3, 24, 240, 240],
+        "nearest_interp_v2_4.tmp_0": [3, 24, 240, 240],
+        "nearest_interp_v2_5.tmp_0": [3, 24, 240, 240],
+    }
+    det_info = {
+        "min_input_shape": min_input_shape,
+        "max_input_shape": max_input_shape,
+        "opt_input_shape": opt_input_shape,
+    }
+    info.append(det_info)
+    min_input_shape = {"x": [1, 3, 32, 10], "lstm_1.tmp_0": [1, 1, 128]}
+    max_input_shape = {"x": [50, 3, 32, 1000], "lstm_1.tmp_0": [500, 50, 128]}
+    opt_input_shape = {"x": [6, 3, 32, 100], "lstm_1.tmp_0": [25, 5, 128]}
+    rec_info = {
+        "min_input_shape": min_input_shape,
+        "max_input_shape": max_input_shape,
+        "opt_input_shape": opt_input_shape,
+    }
+    info.append(rec_info)
+    return info
+```
 ### Pipeline Serving
@@ -151,4 +201,4 @@ if use_trt:
            names[3]: [10, head_number, 60, 60]
        })
 ```
\ No newline at end of file