@@ -16,12 +16,13 @@ Deep neural nets often have some preprocessing steps on input data, and postproc
### Simple series structure
PaddleServing has some predefined Computation Nodes in the framework. A very commonly used Computation Graph is the simple reader-inference-response mode that can cover most of the single-model inference scenarios. An example graph and the corresponding DAG definition code are as follows.
PaddleServing has some predefined Computation Nodes in the framework. A very commonly used Computation Graph is the simple reader-inference-response mode that can cover most of the single-model inference scenarios. Here is an example of a DAG graph.
If you use `the command line + configuration file method to start C++ server`, you only need to modify [the configuration file](./Serving_Configure_CN.md) and don't need to change any line of the 👆 code.
For simple series logic, we simplify it and build it with `OpSeqMaker`. The successor of each node is determined by default from the order in which nodes are added to `OpSeqMaker`, so you do not need to specify the successor of each node explicitly.
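A minimal sketch of this series structure with the `paddle_serving_server` Python API is shown below; the model path, workdir, and port are placeholders:

``` python
from paddle_serving_server import OpMaker, OpSeqMaker, Server

op_maker = OpMaker()
# create the three predefined nodes: reader -> inference -> response
read_op = op_maker.create('general_reader')
infer_op = op_maker.create('general_infer')
response_op = op_maker.create('general_response')

# the order of add_op determines each node's successor by default
op_seq_maker = OpSeqMaker()
op_seq_maker.add_op(read_op)
op_seq_maker.add_op(infer_op)
op_seq_maker.add_op(response_op)

server = Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
server.load_model_config("serving_server_model")  # placeholder model path
server.prepare_server(workdir="workdir", port=9292, device="cpu")
server.run_server()
```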
Since the code will be commonly used and users do not have to change it, PaddleServing releases an easy-to-use launching command for service startup. An example is as follows:
The C++ Serving framework supports [custom DAG graphs](./Model_Ensemble_CN.md) to express serial and parallel combinations of multiple models, and also allows users to [develop custom OP nodes in C++](./OP_CN.md). Compared with using an inner and an outer layer of services to combine multiple models, handling multiple models in a single service saves one RPC network round trip and therefore gives a certain performance improvement, especially when the amount of data transferred over RPC is large.
The C++ Serving framework supports creating [multi-model combinations](./2+_model.md) within a single service. Users can express serial and parallel combinations of multiple models through [custom DAG graphs](./Model_Ensemble_CN.md), and can also [develop custom OP nodes in C++](./OP_CN.md). Compared with using an inner and an outer layer of services to combine multiple models, handling multiple models in a single service saves one RPC network round trip and therefore gives a certain performance improvement, especially when the amount of data transferred over RPC is large.
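As a rough sketch of such a combination in the model-ensemble style of the Python API (the `engine_name`/`inputs` parameters and the model paths below are assumptions that may differ across versions), two inference nodes can run in parallel on the same reader output and be merged by the response node:

``` python
from paddle_serving_server import OpMaker, OpGraphMaker, Server

op_maker = OpMaker()
read_op = op_maker.create('general_reader')
# two inference nodes share the reader output (parallel branches)
cnn_infer_op = op_maker.create('general_infer', engine_name='cnn', inputs=[read_op])
bow_infer_op = op_maker.create('general_infer', engine_name='bow', inputs=[read_op])
# the response node merges both branches
response_op = op_maker.create('general_response', inputs=[cnn_infer_op, bow_infer_op])

op_graph_maker = OpGraphMaker()
op_graph_maker.add_op(read_op)
op_graph_maker.add_op(cnn_infer_op)
op_graph_maker.add_op(bow_infer_op)
op_graph_maker.add_op(response_op)

server = Server()
server.set_op_graph(op_graph_maker.get_op_graph())
# placeholder model paths keyed by engine name; the exact config format may differ
server.load_model_config({"cnn": "cnn_model", "bow": "bow_model"})
server.prepare_server(workdir="workdir", port=9393, device="cpu")
server.run_server()
```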
### 3.4 Model management and hot loading
The C++ Serving engine supports model management, including managing multiple models and multiple versions of a model. To keep the inference service available while a model is being replaced, the new model needs to be hot-loaded without interrupting the service. C++ Serving supports this feature and provides a tool that monitors newly produced models and updates the local model. For a concrete example, please refer to [Hot Loading in C++ Serving](./Hot_Loading_CN.md).
LOG(ERROR) << "Failed do infer in fluid model: " << engine_name().c_str();
return -1;
}
...
...
@@ -126,15 +126,16 @@ int GeneralInferOp::inference() {
DEFINE_OP(GeneralInferOp);
```
`input_blob` and `output_blob` both have multiple `paddle::PaddleTensor`, and the Paddle Inference library can be called through `InferManager::instance().infer(GENERAL_MODEL_NAME, in, out, batch_size)`. Most of the other code in this function is about profiling; we may remove redundant code in the future as well.
`input_blob` and `output_blob` both have multiple `paddle::PaddleTensor`, and the Paddle Inference library can be called through `InferManager::instance().infer(engine_name().c_str(), in, out, batch_size)`. Most of the other code in this function is about profiling; we may remove redundant code in the future as well.
Basically, the above code can implement a new operator. If you want to access a dictionary resource, you can refer to `core/predictor/framework/resource.cpp` to add globally visible resources. Resources are initialized when the server starts.
## Define Python API
After you have defined a C++ operator on the server side for Paddle Serving, the last step is to register it in the Python API of the PaddleServing server; `python/paddle_serving_server/__init__.py` in the repo contains the relevant code piece.
After you have defined a C++ operator on the server side for Paddle Serving, the last step is to register it in the Python API of the PaddleServing server; `python/paddle_serving_server/dag.py` in the repo contains the relevant code piece.
``` c++
``` python
self.op_dict = {
    "general_infer": "GeneralInferOp",
    "general_reader": "GeneralReaderOp",
...
...
@@ -145,3 +146,15 @@ self.op_dict = {
"general_dist_kv":"GeneralDistKVOp"
}
```
In the `python/paddle_serving_server/server.py` file, only the class names of the C++ OP classes that need to load a model and execute prediction are added.
For example, `general_reader` needs to be added in the 👆 code but not in the 👇 code, because it only does some simple data processing without loading a model or calling prediction.
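To make the distinction concrete, here is a hedged sketch of what such a model-loading OP list in `server.py` could look like; the variable name and the exact OP set are illustrative assumptions, not the repo's actual code:

``` python
# Illustrative assumption: only OPs that load a model and run prediction
# are registered as engine types here; pure data-processing OPs such as
# GeneralReaderOp are not listed.
default_engine_types = [
    "GeneralInferOp",
    "GeneralDistKVInferOp",
]
```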