Merge pull request #1037 from wangjiawei04/v0.5.0

cherry-pick #1036 #1035 #1034 #1033

c1e9e00c · Jiawei Wang · GitHub · 7ecd1c1d · 07dce1d9 · c1e9e00c
20 changed files
--- a/doc/DESIGN_DOC.md
+++ b/doc/DESIGN_DOC.md
@@ -4,158 +4,93 @@

 ## 1. Design Objectives

- Long Term Vision: Online deployment of deep learning models will be a user-facing application in the future. Any AI developer will face the problem of deploying an online service for his or her trained model.
-Paddle Serving is the official open source online deployment framework. The long term goal of Paddle Serving is to provide professional, reliable and easy-to-use online service to the last mile of AI application.
+Paddle Serving is the official open source online deployment framework. The long-term goal of Paddle Serving is to provide professional, reliable, and easy-to-use online services for the last mile of AI applications. In the future, online deployment of deep learning models will be a user-facing application, and every AI developer will face the problem of deploying an online service for a trained model.

- Easy-To-Use: For algorithmic developers to quickly deploy their models online, Paddle Serving designs APIs that can be used with Paddle's training process seamlessly, most Paddle models can be deployed as a service with one line command.
+- Industrial Oriented: To meet industrial deployment requirements, Paddle Serving supports many large-scale deployment functions: 1) Model management, model hot loading, and model encryption and decryption. 2) Cross-platform, multi-hardware deployment. 3) Distributed Sparse Embedding Indexing. 4) Online A/B testing.
+  
+- High Performance: Paddle Serving improves model inference performance along two dimensions, low latency and high throughput. 1) The high-performance prediction engine Paddle Inference is integrated. 2) NVIDIA TensorRT is supported. 3) The high-performance network framework brpc is integrated. 4) Asynchronous Pipeline mode greatly improves throughput.

- Industrial Oriented: To meet industrial deployment requirements, Paddle Serving supports lots of large-scale deployment functions: 1) Distributed Sparse Embedding Indexing. 2) Highly concurrent underlying communications. 3) Model Management, online A/B test, model online loading.
+- Easy-To-Use: So that algorithm developers can quickly deploy their models online, Paddle Serving designs APIs that work seamlessly with Paddle's training process. Most Paddle models can be deployed as a service with a one-line command, and more than 20 common model examples and documents are provided.

- Extensibility: Paddle Serving supports C++, Python and Golang client, and will support more clients with different languages. It is very easy to extend Paddle Serving to support other machine learning inference library, although currently Paddle inference library is the only official supported inference backend.
+- Extensibility: Paddle Serving provides client SDKs in four languages (C++, Python, Golang, and Java) and will support clients in more languages. It is easy to extend Paddle Serving to support other machine learning inference libraries, although the Paddle inference library is currently the only officially supported inference backend.

+----

-## 2. Module design and implementation
+## 2. Preliminary Design
+Any excellent software product must start from user needs and have clear positioning and a good preliminary design. The same goes for Paddle Serving, which aims to provide professional, reliable, and easy-to-use online services for the last mile of AI applications. We investigated the usage scenarios of a large number of users and abstracted them. For example, online services focus on high concurrency and low response time; offline services focus on high batch throughput and high resource utilization; and algorithm developers are good at using Python for model training and inference.

-### 2.1 Python API interface design
+### 2.1 Design selection

-#### 2.1.1 save a servable model
-The inference phase of Paddle model focuses on 1) input variables of the model. 2) output variables of the model. 3) model structure and model parameters. Paddle Serving Python API provides a `save_model` interface for trained model, and save necessary information for Paddle Serving to use during deployment phase. An example is as follows:
+To meet the needs of users in different scenarios, Paddle Serving's product positioning uses lower-dimensional features, such as response time, throughput, and development efficiency, to guide target selection and technology selection.

-``` python
-import paddle_serving_client.io as serving_io
-serving_io.save_model("serving_model", "client_conf",
-                      {"words": data}, {"prediction": prediction},
-                      fluid.default_main_program())
-```
-In the example, `{"words": data}` and `{"prediction": prediction}` assign the inputs and outputs of a model. `"words"` and `"prediction"` are alias names of inputs and outputs. The design of alias name is to help developers to memorize model inputs and model outputs. `data` and `prediction` are Paddle `[Variable](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Variable_cn.html#variable)` in training phase that often represents ([Tensor](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Tensor_cn.html#tensor)) or ([LodTensor](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/basic_concept/lod_tensor.html#lodtensor)). When the `save_model` API is called, two directories called `"serving_model"` and `"client_conf"` will be generated. The content of the saved model is as follows:
-
-``` shell
-.
-├── client_conf
-│   ├── serving_client_conf.prototxt
-│   └── serving_client_conf.stream.prototxt
-└── serving_model
-    ├── embedding_0.w_0
-    ├── fc_0.b_0
-    ├── fc_0.w_0
-    ├── fc_1.b_0
-    ├── fc_1.w_0
-    ├── fc_2.b_0
-    ├── fc_2.w_0
-    ├── lstm_0.b_0
-    ├── lstm_0.w_0
-    ├── __model__
-    ├── serving_server_conf.prototxt
-    └── serving_server_conf.stream.prototxt
-```
-`"serving_client_conf.prototxt"` and `"serving_server_conf.prototxt"` are the client side and the server side configurations of Paddle Serving, and `"serving_client_conf.stream.prototxt"` and `"serving_server_conf.stream.prototxt"` are the corresponding parts. Other contents saved in the directory are the same as Paddle saved inference model. We are considering to support `save_model` interface in Paddle training framework so that a user is not aware of the servable configurations. 
+| Response time | Throughput | Development efficiency | Resource utilization | Selection | Applications |
+|-----|------|-----|-----|------|------|
+| LOW | HIGH | LOW | HIGH | C++ Serving | High-performance scenarios: recall and ranking services of large-scale online recommendation systems |
+| HIGH | HIGH | HIGH | HIGH | Python Pipeline Serving | Balances throughput and development efficiency; asynchronous mode; fits single-operator, multi-model combination scenarios |
+| HIGH | LOW | HIGH | LOW | Python webserver | Low-traffic services, projects that require rapid iteration, and model effect verification |

-#### 2.1.2 Model loading on the server side
+Performance index description:
+1. Response time (ms): the average response time of a single request, together with the 50th, 90th, 95th, and 99th percentile response times. The lower the better.
+2. Throughput (QPS/TPS): the efficiency with which the service processes requests, measured as the number of requests processed per unit time. The higher the better.
+3. Development efficiency: the same work takes different amounts of time in different development languages; this covers the efficiency of development, debugging, and maintenance. The higher the better.
+4. Resource utilization: the CPU/GPU utilization of a deployed service. Low utilization wastes resources, so the higher the better.
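
Note on the indicators above: a minimal sketch of how the first two (response-time percentiles and throughput) could be computed from per-request latencies. The function names `percentile` and `summarize` and the sample data are hypothetical and not part of Paddle Serving; the nearest-rank method used here is only one of several common percentile definitions.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list; q is an integer in 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Integer ceiling of q * n / 100, which avoids floating-point rounding.
    rank = (q * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_ms, wall_time_s):
    """Return mean latency, p50/p90/p95/p99 latency, and throughput (QPS)."""
    ordered = sorted(latencies_ms)
    summary = {
        "mean_ms": sum(ordered) / len(ordered),
        "qps": len(ordered) / wall_time_s,
    }
    for q in (50, 90, 95, 99):
        summary["p%d_ms" % q] = percentile(ordered, q)
    return summary


# Hypothetical measurement: 100 requests with latencies 1..100 ms, sent over 2 s.
report = summarize(list(range(1, 101)), wall_time_s=2.0)
print(report)
```

Reporting the tail percentiles next to the mean matters because a few slow requests can leave the mean almost unchanged while p99 degrades noticeably.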

-Prediction logics on the server side can be defined through Paddle Serving Server API with a few lines of code, an example is as follows:
-``` python
-import paddle_serving_server as serving
-op_maker = serving.OpMaker()
-read_op = op_maker.create('general_reader')
-dist_kv_op = op_maker.create('general_dist_kv')
-general_infer_op = op_maker.create('general_infer')
-general_response_op = op_maker.create('general_response')
-
-op_seq_maker = serving.OpSeqMaker()
-op_seq_maker.add_op(read_op)
-op_seq_maker.add_op(dist_kv_op)
-op_seq_maker.add_op(general_infer_op)
-op_seq_maker.add_op(general_response_op)
-```
-Current Paddle Serving supports operator list on the server side as follows:
-
-<center>
-
-| Op Name | Description |
-|--------------|------|
-| `general_reader` | General Data Reading Operator |
-| `genreal_infer` | General Data Inference with Paddle Operator |
-| `general_response` | General Data Response Operator |
-| `general_dist_kv` | Distributed Sparse Embedding Indexing |
+Paddle Serving provides RPC and HTTP protocols for users. We recommend the HTTP service for medium- or low-traffic services where latency is not a strict requirement, and the RPC protocol for high-traffic services and services that require low latency. Users of the built-in distributed sparse parameter indexing service do not need to care about the underlying communication details. The following figure shows several scenarios in which users may want to use Paddle Serving.

-</center>
+<p align="center">
+    <br>
+<img src='user_groups.png' width = "700" height = "470">
+    <br>
+<p>

-Paddle Serving supports inference engine on multiple devices. Current supports are CPU and GPU engine. Docker Images of CPU and GPU are provided officially. User can use one line command to start an inference service either on CPU or on GPU. 
+For servable models saved with the Paddle Serving IO API, users do not need extra coding work to start up a service, but may need some coding work on the client side. To develop a Web Service plugin and get an HTTP service, a user needs to implement the Web Service's preprocessing and postprocessing where needed.

-``` shell
-python -m paddle_serving_server.serve --model your_servable_model --thread 10 --port 9292
-```
-``` shell
-python -m paddle_serving_server_gpu.serve --model your_servable_model --thread 10 --port 9292
-```
+### 2.2 Industrial Features

-Options of startup command are listed below: 
-<center>
+Paddle Serving takes into account a series of issues such as different operating systems, different development languages, multiple hardware devices, cross-deep learning platform model conversion, distributed sparse parameter indexing, and cloud deployment by different teams in industrial-level scenarios.

-| Arguments | Types | Defaults | Descriptions |
-|--------------|------|-----------|--------------------------------|
-| `thread` | int | `4` | Concurrency on server side, usually equal to the number of CPU core |
-| `port` | int | `9292` | Port exposed to users |
-| `name` | str | `""` | Service name that if a user specifies, the name of HTTP service is allocated |
-| `model` | str | `""` | Servable models for Paddle Serving |
-| `gpu_ids` | str | `""` | Supported only in paddle_serving_server_gpu, similar to the usage of CUDA_VISIBLE_DEVICES |
+> Cross-platform operation

-</center>
+Cross-platform means depending on neither the operating system nor the hardware environment: an application developed under one operating system can still run under another. Therefore, the design should consider not only the development language and cross-platform components, but also differences in how compilers on different systems interpret the code.

-For example, `python -m paddle_serving_server.serve --model your_servable_model --thread 10 --port 9292` is the same as the following code as user can define: 
-``` python
-from paddle_serving_server import OpMaker, OpSeqMaker, Server
-
-op_maker = OpMaker()
-read_op = op_maker.create('general_reader')
-general_infer_op = op_maker.create('general_infer')
-general_response_op = op_maker.create('general_response')
-op_seq_maker = OpSeqMaker()
-op_seq_maker.add_op(read_op)
-op_seq_maker.add_op(general_infer_op)
-op_seq_maker.add_op(general_response_op)
-server = Server()
-server.set_op_sequence(op_seq_maker.get_op_sequence())
-server.set_num_threads(10)
-server.load_model_config(”your_servable_model“)
-server.prepare_server(port=9292, device="cpu")
-server.run_server()
-```
+Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework. Refer to the image list in [Docker Images](DOCKER_IMAGES.md) and select an image that fits your usage. Docker usage is documented in [How to run PaddleServing in Docker](RUN_IN_DOCKER.md). Currently, the Python webserver mode can be deployed and run natively on both Linux and Windows; see [Paddle Serving for Windows Users](WINDOWS_TUTORIAL.md).

-#### 2.1.3 Paddle Serving Client API
-Paddle Serving supports remote service access through RPC(remote procedure call) and HTTP. RPC access of remote service can be called through Client API of Paddle Serving. A user can define data preprocess function before calling Paddle Serving's client API. The example below explains how to define the input data of Paddle Serving Client. The servable model has two inputs with alias name of `sparse` and `dense`. `sparse` corresponds to sparse sequence ids such as `[1, 1001, 100001]` and `dense` corresponds to dense vector such as `[0.2, 0.5, 0.1, 0.4, 0.11, 0.22]`. For sparse sequence data, current design supports `lod_level=0` and `lod_level=1` of Paddle, that corresponds to `Tensor` and `LodTensor`. For dense vector, current design supports any `N-D Tensor`. Users do not need to assign the shape of inference model input. The Paddle Serving Client API will check the input data's shape with servable configurations.
+> Client SDKs in multiple development languages

-``` python
-feed_dict["sparse"] = [1, 1001, 100001]
-feed_dict["dense"] = [0.2, 0.5, 0.1, 0.4, 0.11, 0.22]
-fetch_map = client.predict(feed=feed_dict, fetch=["prob"])
-```
+Paddle Serving provides client SDKs in 4 development languages: Python, C++, Java, and Golang. The Golang SDK is under construction, and we hope that interested open source developers can help by submitting PRs.

-The following code sample shows that Paddle Serving Client API connects to Server API with endpoint of the servers. To use the data parallelism ability during prediction, Paddle Serving Client allows users to define multiple server endpoints.
-``` python
-client = Client()
-client.load_client_config('servable_client_configs')
-client.connect(["127.0.0.1:9292"])
-```
+- Python: refer to the client examples under python/examples or the web service example in section 4.2.
+- C++: refer to [Writing a Prediction Service from Scratch](deprecated/CREATING.md).
+- Java: refer to [Paddle Serving Client Java SDK](JAVA_SDK.md).
+- Golang: refer to [How to use Go Client of Paddle Serving](deprecated/IMDB_GO_CLIENT.md).

-### 2.2 Underlying Communication Mechanism
-Paddle Serving adopts [baidu-rpc](https://github.com/apache/incubator-brpc) as underlying communication layer. baidu-rpc is an open-source RPC communication library with high concurrency and low latency advantages compared with other open source RPC library. Millions of instances and thousands of services are using baidu-rpc within Baidu.
+> Support multiple hardware devices

-### 2.3 Core Execution Engine
-The core execution engine of Paddle Serving is a Directed acyclic graph(DAG). In the DAG, each node represents a phase of inference service, such as paddle inference prediction, data preprocessing and data postprocessing. DAG can fully parallelize the computation efficiency and can fully utilize the computation resources. For example, when a user has input data that needs to be feed into two models, and combine the scores of the two models, the computation of model scoring is parallelized through DAG.
+The inference frameworks of well-known deep learning platforms support only CPU and GPU inference on the X86 platform. As AI algorithms rapidly grow more complex, the computing power of chips has greatly increased, which has accelerated the adoption of IoT applications and deployment on a variety of hardware. Paddle Serving integrates the high-performance inference engine Paddle Inference and the mobile inference engine Paddle Lite to provide inference services on multiple hardware devices. At present, in addition to X86 CPU and GPU, Paddle Serving supports deploying inference services on ARM CPU and Kunlun XPU. More hardware will be added to Paddle Serving in the future.

-<p align="center">
-    <br>
-<img src='design_doc.png'">
-    <br>
-<p>
+> Model conversion across deep learning platforms

-### 2.4 Micro service plugin
-The underlying communication of Paddle Serving is implemented with C++ as well as the core framework, it is hard for users who do not familiar with C++ to implement new Paddle Serving Server Operators. Another approach is to use the light-weighted Web Service in Paddle Serving Server that can be viewed as a plugin. A user can implement complex data preprocessing and postprocessing logics to build a complex AI service. If access of the AI service has a large volumn, it is worth to implement the service with high performance Paddle Serving Server operators. The relationship between Web Service and RPC Service can be referenced in `User Type`.
+Models trained on other deep learning platforms can be converted to Paddle models with the [PaddlePaddle/X2Paddle tool](https://github.com/PaddlePaddle/X2Paddle). We have converted multiple mainstream CV models, and conversion of TensorFlow, Caffe, ONNX, and PyTorch models has been tested. See [An End-to-end Tutorial from Training to Inference Service Deployment](TRAIN_TO_SERVICE.md).

-## 3. Industrial Features
+Because the feed and fetch parameter information cannot be viewed directly in the model file, it is inconvenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool that converts a Paddle model into Serving format and generates a prototxt file containing the feed and fetch parameter information. The generated prototxt file of the uci_housing example is shown below. For more conversion methods, refer to [How to save a servable model of Paddle Serving?](SAVE.md).
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "fc_0.tmp_1"
+  alias_name: "price"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 1
+}
+```
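
Note on the prototxt above: a minimal sketch of how a client-side script might read the feed and fetch aliases from such a file in order to assemble request parameters. This is a hand-rolled parser for the flat structure shown above only; it is not Paddle Serving's own loader, and the name `parse_vars` is hypothetical.

```python
import re

# The uci_housing prototxt from the design doc, inlined for a self-contained example.
PROTOTXT = '''
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}
'''


def parse_vars(text):
    """Parse flat `feed_var { ... }` and `fetch_var { ... }` blocks into dicts."""
    result = {"feed_var": [], "fetch_var": []}
    for kind, body in re.findall(r"(feed_var|fetch_var)\s*\{([^}]*)\}", text):
        entry = {"shape": []}
        for key, value in re.findall(r"(\w+)\s*:\s*(\S+)", body):
            value = value.strip('"')
            if key == "shape":
                entry["shape"].append(int(value))  # shape may repeat, one line per dimension
            else:
                entry[key] = value  # kept as a string; no type conversion is attempted
        result[kind].append(entry)
    return result


config = parse_vars(PROTOTXT)
feed_names = [var["alias_name"] for var in config["feed_var"]]
fetch_names = [var["alias_name"] for var in config["fetch_var"]]
print(feed_names, fetch_names)
```

The aliases (`x` and `price` here) are the names a client would use as feed keys and fetch targets, which is why they are worth extracting before building a request.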

-### 3.1 Distributed Sparse Parameter Indexing
+> Distributed Sparse Parameter Indexing

 Distributed Sparse Parameter Indexing is commonly seen in advertising and recommendation scenarios, and is often used together with distributed training. The figure below explains a commonly seen architecture for online recommendation. When the recommendation service receives a request from a user, the system automatically collects training logs for the offline distributed online training. Meanwhile, the request is sent to Paddle Serving Server. For sparse features, the distributed sparse parameter index service is called so that the sparse parameters can be looked up. The dense input features, together with the looked-up sparse model parameters, are fed into the Paddle Inference Node of the DAG in Paddle Serving Server. The score is then returned through RPC to the product service for item ranking.

@@ -164,41 +99,56 @@ Distributed Sparse Parameter Indexing is commonly seen in advertising and recomm
 <img src='cube_eng.png' width = "450" height = "230">
    <br>
 <p>
+
 Why do we need to support distributed sparse parameter indexing in Paddle Serving? 1) In some recommendation scenarios, the number of features can reach hundreds of billions, so a single node cannot hold the parameters in random access memory. 2) Paddle Serving supports distributed sparse parameter indexing that works together with Paddle inference. Users do not need to do extra work to get a low-latency inference engine with hundreds of billions of parameters.

-### 3.2 Online A/B test
+----

-After sufficient offline evaluation of the model, online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of A/B test with Paddle Serving. After the client is configured with the corresponding configuration, the traffic will be automatically distributed to different servers to achieve A/B test. Please refer to [ABTEST in Paddle Serving](ABTEST_IN_PADDLE_SERVING.md) for specific examples.
+## 3. C++ Serving design
+
+C++ Serving aims to provide high-performance inference services with high concurrency and low latency. Its network framework and core execution engine are written in C/C++ and provide powerful industrial-grade capabilities, including model management, model security, and A/B testing.
+
+### 3.1 Network Communication Mechanism
+Paddle Serving adopts [brpc](https://github.com/apache/incubator-brpc) as its underlying communication layer. brpc is an open-source RPC communication library that offers high concurrency and low latency compared with other open source RPC libraries. Millions of instances and thousands of services use brpc within Baidu.
+
+### 3.2 Core Execution Engine
+The core execution engine of Paddle Serving is a directed acyclic graph (DAG). In the DAG, each node represents a phase of the inference service, such as Paddle inference prediction, data preprocessing, or data postprocessing. A DAG lets independent phases run in parallel and makes full use of computation resources. For example, when a user's input data needs to be fed into two models and the scores of the two models combined, the scoring of the two models is parallelized through the DAG.

 <p align="center">
    <br>
-<img src='abtest.png' width = "345" height = "230">
+<img src='design_doc.png'>
    <br>
 <p>
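
Note on the DAG example in section 3.2: a minimal Python sketch of the two-model case, in which one input is preprocessed once, scored by two models in parallel, and the scores are then combined. The real engine is implemented in C++; the functions `preprocess`, `model_a`, `model_b`, and `combine` are hypothetical stand-ins, and a thread pool is used here only to illustrate the fan-out and fan-in shape.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(raw):
    """Preprocessing node: turn raw string fields into float features."""
    return [float(x) for x in raw]


def model_a(features):
    """Stand-in for the first model: mean of the features."""
    return sum(features) / len(features)


def model_b(features):
    """Stand-in for the second model: maximum feature value."""
    return max(features)


def combine(score_a, score_b):
    """Postprocessing node: average the two model scores."""
    return 0.5 * score_a + 0.5 * score_b


def run_dag(raw):
    features = preprocess(raw)
    # model_a and model_b depend only on `features`, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(model_a, features)
        future_b = pool.submit(model_b, features)
        return combine(future_a.result(), future_b.result())


score = run_dag(["1", "2", "3"])
print(score)  # 2.5
```

The graph structure is what makes the parallelism safe: only nodes with no dependency on each other are submitted together.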

+### 3.3 Model Management and Hot Reloading
+C++ Serving supports model management functions, including management of multiple models and multiple model versions. To ensure the availability of services, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool that monitors output models and updates local models. Please refer to [Hot loading in Paddle Serving](HOT_LOADING_IN_SERVING.md) for specific examples.

-### 3.3 Model Online Reloading     
+### 3.4 Model Encryption Inference
+Paddle Serving uses a symmetric encryption algorithm to encrypt the model and decrypts it in memory when the service loads the model. The current design provides basic model security and does not guarantee absolute model security. Users can build on this design to achieve a higher level of security. See [MODEL ENCRYPTION INFERENCE](ENCRYPTION.md).

-In order to ensure the availability of services, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool for monitoring output models to update local models. Please refer to [Hot loading in Paddle Serving](HOT_LOADING_IN_SERVING.md) for specific examples.
+### 3.5 A/B Test

-### 3.4 Model Management
-
-Paddle Serving's C++ engine supports model management. Currently, python API is not released yet, please wait for the next release.
-
-## 4. User Types
-Paddle Serving provides RPC and HTTP protocol for users. For HTTP service, we recommend users with median or small traffic services to use, and the latency is not a strict requirement. For RPC protocol, we recommend high traffic services and low latency required services to use. For users who use distributed sparse parameter indexing built-in service, it is not necessary to care about the underlying details of communication. The following figure gives out several scenarios that user may want to use Paddle Serving. 
+After sufficient offline evaluation of the model, online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of A/B test with Paddle Serving. After the client is configured with the corresponding configuration, the traffic will be automatically distributed to different servers to achieve A/B test. Please refer to [ABTEST in Paddle Serving](ABTEST_IN_PADDLE_SERVING.md) for specific examples.

 <p align="center">
    <br>
-<img src='user_groups.png' width = "700" height = "470">
+<img src='abtest.png' width = "345" height = "230">
    <br>
 <p>
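
Note on the A/B test description in section 3.5: a minimal sketch of weighted traffic splitting on the client side, assuming two server endpoints and a 90/10 split. The function `pick_endpoint`, the second endpoint, and the weights are hypothetical; Paddle Serving's actual client configuration is described in the linked ABTEST document.

```python
import random


def pick_endpoint(weighted_endpoints, rng=random):
    """Pick one endpoint with probability proportional to its weight."""
    total = sum(weight for _, weight in weighted_endpoints)
    threshold = rng.uniform(0, total)
    cumulative = 0
    for endpoint, weight in weighted_endpoints:
        cumulative += weight
        if threshold <= cumulative:
            return endpoint
    return weighted_endpoints[-1][0]  # guard against floating-point edge cases


# Hypothetical variants: 90% of traffic to the baseline, 10% to the candidate.
variants = [("127.0.0.1:9292", 90), ("127.0.0.1:9293", 10)]
rng = random.Random(0)  # fixed seed so the example is reproducible
counts = {endpoint: 0 for endpoint, _ in variants}
for _ in range(10000):
    counts[pick_endpoint(variants, rng)] += 1
print(counts)
```

With enough requests, the observed share of each endpoint approaches its configured weight, which is what makes the comparison between the two model versions meaningful.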

-For servable models saved from Paddle Serving IO API, users do not need to do extra coding work to startup a service, but may need some coding work on the client side. For development of Web Service plugin, a user needs to provide implementation of Web Service's preprocessing and postprocessing work if needed to get a HTTP service.
+### 3.6 Micro service plugin
+The underlying communication and the core framework of Paddle Serving are implemented in C++, so it is hard for users who are not familiar with C++ to implement new Paddle Serving Server Operators. Another approach is to use the lightweight Web Service in Paddle Serving Server, which can be viewed as a plugin. A user can implement complex data preprocessing and postprocessing logic to build a complex AI service. If the AI service receives a large volume of traffic, it is worth implementing the service with high-performance Paddle Serving Server operators. The relationship between Web Service and RPC Service is described in section 2.1.

-### 4.1 Web Service Development
+----

-Web Service has lots of open sourced framework. Currently Paddle Serving uses Flask as built-in service framework, and users are not aware of this. More efficient web service will be integrated in the furture if needed.
+## 4. Python Webserver Design
+
+### 4.1 Network Communication Mechanism
+There are many open source frameworks for web services. Paddle Serving currently integrates the Flask framework, but this part is not visible to users. In the future, a better-performing web framework may be provided as the underlying HTTP service integration engine.
+
+### 4.2 Web Service Development
+
+`WebService` is a base class that provides inheritable interfaces such as `preprocess` and `postprocess` for users to implement. In a class that inherits from `WebService`, users can define any functions they want, and the startup function interface is the same as for the RPC service.

 ``` python
 from paddle_serving_server.web_service import WebService
@@ -229,15 +179,36 @@ imdb_service.prepare_dict({"dict_file_path": sys.argv[4]})
 imdb_service.run_server()
 ```

-`WebService` is a Base Class, providing inheritable interfaces such `preprocess` and `postprocess` for users to implement. In the inherited class of `WebService` class, users can define any functions they want and the startup function interface is the same as RPC service.
+----
+
+## 5. Python Pipeline Serving Design
+End-to-end deep learning models currently cannot solve all problems. Combining multiple deep learning models is still a conventional way to solve real-world problems.
+
+### 5.1 Network Communication Mechanism
+The network framework of Pipeline Serving uses gRPC and gRPC gateway. The gRPC service receives RPC requests, while the gRPC gateway receives RESTful API requests and forwards them to the gRPC service through a reverse proxy server. Therefore, the network layer of Pipeline Serving accepts both RPC and RESTful API requests.
+<center>
+<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
+</center>
+
+### 5.2 Core Design And Use Cases
+
+The core design of Pipeline Serving is a graph execution engine whose basic processing units are OP and Channel. Combining them yields a directed acyclic graph. For the design and usage documents, see [Pipeline Serving](PIPELINE_SERVING.md).
+
+<center>
+<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
+</center>
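
Note on the OP and Channel design in section 5.2: a minimal sketch of the idea, assuming each OP is a worker thread that reads from an input channel, applies a function, and writes to an output channel. It shows only a linear chain (the simplest DAG); the `Op` class, `run_pipeline`, and the `STOP` sentinel are hypothetical and do not reflect the actual Pipeline Serving API.

```python
import queue
import threading

STOP = object()  # sentinel that tells an OP to shut down


class Op(threading.Thread):
    """A processing unit that moves items from one channel to the next."""

    def __init__(self, func, in_channel, out_channel):
        super().__init__(daemon=True)
        self.func = func
        self.in_channel = in_channel
        self.out_channel = out_channel

    def run(self):
        while True:
            item = self.in_channel.get()
            if item is STOP:
                self.out_channel.put(STOP)  # propagate shutdown downstream
                return
            self.out_channel.put(self.func(item))


def run_pipeline(funcs, inputs):
    """Chain one OP per function, connected by FIFO channels."""
    channels = [queue.Queue() for _ in range(len(funcs) + 1)]
    ops = [Op(func, channels[i], channels[i + 1]) for i, func in enumerate(funcs)]
    for op in ops:
        op.start()
    for item in inputs:
        channels[0].put(item)
    channels[0].put(STOP)
    outputs = []
    while True:
        item = channels[-1].get()
        if item is STOP:
            break
        outputs.append(item)
    for op in ops:
        op.join()
    return outputs


# Three OPs: strip whitespace, upper-case, then measure the length.
results = run_pipeline([str.strip, str.upper, len], ["  ab ", "cde"])
print(results)  # [2, 3]
```

Because every OP runs in its own thread, a later request can be preprocessed while an earlier one is still in a downstream stage, which is the source of the throughput gain in asynchronous mode.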

-## 5. Future Plan
+----

-### 5.1 Open DAG definition API
-Current version of Paddle Serving Server supports sequential type of execution flow. DAG definition API can be more helpful to users on complex tasks.

-### 5.2 Auto Deployment on Cloud
+## 6. Future Plan
+
+### 6.1 Auto Deployment on Cloud
 To make deployment on public clouds easier, Paddle Serving is considering providing Operators on Kubernetes for submitting service jobs.

-### 5.3 Vector Indexing and Tree based Indexing
+### 6.2 Vector Indexing and Tree based Indexing
 In recommendation and advertising systems, vector-based or tree-based indexing services are commonly used for candidate retrieval. These retrieval tasks will be built-in services of Paddle Serving.
+
+### 6.3 Service Monitoring
+Paddle Serving will integrate Prometheus, an open source combination of monitoring, alerting, and time series database, which is suitable for monitoring k8s and Docker systems.
--- a/doc/DESIGN_DOC_CN.md
+++ b/doc/DESIGN_DOC_CN.md
@@ -2,201 +2,155 @@

 (简体中文|[English](./DESIGN_DOC.md))

-## 1. Overall Design Objectives
+## 1. Design Objectives

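- Long-term mission: Paddle Serving is PaddlePaddle's open source online serving framework. Its long-term goal is to provide increasingly professional, reliable, and easy-to-use services for the last mile of AI deployment.
+Paddle Serving is PaddlePaddle's open source online serving framework. Its long-term goal is to provide increasingly professional, reliable, and easy-to-use services for the last mile of AI deployment.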

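 - Industrial grade: To meet the requirements of industrial-grade online deployment of deep learning models,
-Paddle Serving provides many deployment functions needed in large-scale scenarios: 1) distributed sparse parameter indexing; 2) highly concurrent underlying communication; 3) model management, online A/B traffic testing, and model hot loading.
+Paddle Serving provides many deployment functions needed in large-scale scenarios: 1) model management, model hot loading, and model encryption and decryption; 2) cross-platform, multi-hardware deployment; 3) distributed sparse parameter indexing; 4) online A/B traffic testing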

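- Easy to use: To let Paddle users deploy models at very low cost, PaddleServing designs a set of inference deployment APIs that connect seamlessly with the Paddle training framework; common models can be deployed as a service with a one-line command.
+- High performance: Model inference performance is improved along two dimensions, low latency and high throughput. 1) Integrates the high-performance inference engine Paddle Inference; 2) supports the high-performance inference engine Nvidia Tensor RT; 3) integrates the high-performance network framework brpc; 4) asynchronous Pipeline mode greatly improves throughput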

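- Extensibility: Currently, Paddle Serving supports C++, Python, and Golang clients, and will add clients in more languages for different types of customers in the future. In terms of framework design, although Paddle Serving currently focuses on deploying Paddle models as its core function,
+- Easy to use: To let Paddle users deploy models at very low cost, PaddleServing designs a set of inference deployment APIs that connect seamlessly with the Paddle training framework; common models can be deployed as a service with a one-line command. More than 20 common model examples and documents are provided.
+
+- Extensibility: Currently, Paddle Serving supports clients in 4 languages (C++, Python, Golang, and Java) and will support more languages in the future. In terms of framework design, although Paddle Serving currently focuses on deploying Paddle models as its core function,
 users can easily embed other machine learning libraries to deploy online inference.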

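-## 2. Module Design and Implementation
+----
+## 2. Preliminary Design

-### 2.1 Python API Interface Design
+Any excellent software product must start from user needs and have clear positioning and a good preliminary design. Paddle Serving is no exception: its goal is to provide increasingly professional, reliable, and easy-to-use services for the last mile of AI deployment. We investigated the usage scenarios of a large number of users and abstracted and summarized them. For example, online services focus on high concurrency and low average response time; offline services focus on high batch throughput and high resource utilization; and algorithm developers are good at using Python for model training and inference.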

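-#### 2.1.1 Saving a Trained Model
-Paddle model inference focuses on: 1) the model's input variables; 2) the model's output variables; 3) the model structure and model parameters. The Paddle Serving Python API provides an interface for saving the model during training, and packages and saves the configuration that Paddle Serving needs in the deployment phase. An example follows: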
-``` python
-import paddle_serving_client.io as serving_io
-serving_io.save_model("serving_model", "client_conf",
-                      {"words": data}, {"prediction": prediction},
-                      fluid.default_main_program())
-```
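-In the code example, `{"words": data}` and `{"prediction": prediction}` specify the model's inputs and outputs respectively. `"words"` and `"prediction"` are aliases for the input and output variables; the aliases are designed to help developers remember which fields correspond to the inputs and outputs of their trained model. `data` and `prediction` are `[Variable](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Variable_cn.html#variable)` objects in the Paddle training process, usually representing a tensor ([Tensor](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Tensor_cn.html#tensor)) or a variable-length tensor ([LodTensor](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/basic_concept/lod_tensor.html#lodtensor)). After the save command is called, two directories are generated according to the user-specified `"serving_model"` and `"client_conf"`, with the following contents: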
-``` shell
-.
-├── client_conf
-│   ├── serving_client_conf.prototxt
-│   └── serving_client_conf.stream.prototxt
-└── serving_model
-    ├── embedding_0.w_0
-    ├── fc_0.b_0
-    ├── fc_0.w_0
-    ├── fc_1.b_0
-    ├── fc_1.w_0
-    ├── fc_2.b_0
-    ├── fc_2.w_0
-    ├── lstm_0.b_0
-    ├── lstm_0.w_0
-    ├── __model__
-    ├── serving_server_conf.prototxt
-    └── serving_server_conf.stream.prototxt
-```
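-`"serving_client_conf.prototxt"` and `"serving_server_conf.prototxt"` are the configuration files that the Paddle Serving Client and Server need to load; `"serving_client_conf.stream.prototxt"` and `"serving_server_conf.stream.prototxt"` are the binary forms of those configuration files. The other contents saved under `"serving_model"` are the same as the model files saved by Paddle. In the future we will consider saving the serving configuration directly in the Paddle framework, so that saving the configuration is transparent to users.

-#### 2.1.2 Loading the Model on the Server
+### 2.1 Design Choices
+To meet user needs in different scenarios, Paddle Serving positions its products by lower-level characteristics, such as response time, throughput, and development efficiency, and uses them to choose both the target option and the technology.

-The server-side prediction logic can be defined manually through the Paddle Serving Server API. An example: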
-``` python
-import paddle_serving_server as serving
-op_maker = serving.OpMaker()
-read_op = op_maker.create('general_reader')
-dist_kv_op = op_maker.create('general_dist_kv')
-general_infer_op = op_maker.create('general_infer')
-general_response_op = op_maker.create('general_response')
-
-op_seq_maker = serving.OpSeqMaker()
-op_seq_maker.add_op(read_op)
-op_seq_maker.add_op(dist_kv_op)
-op_seq_maker.add_op(general_infer_op)
-op_seq_maker.add_op(general_response_op)
-```
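+| Response time | Throughput | Development efficiency | Resource utilization | Option | Use cases |
+|-----|------|-----|-----|------|------|
+| Low | High | Low | High | C++ Serving | High-performance scenarios; recall and ranking services of large online recommendation systems |
+| High | High | Fairly high | High | Python Pipeline Serving | Balances throughput and development efficiency; single-operator, multi-model combination scenarios; asynchronous mode |
+| High | Low | High | Low | Python webserver | Fast-iteration scenarios; small services, rapid iteration, or validating model quality |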

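-For the main Ops currently supported on the Server side of Paddle Serving, refer to the following list:

-<center>
+Performance metrics:
+1. Response time (ms): the average response time of a single request, reported together with the 50th, 90th, 95th, and 99th percentile response times. Lower is better.
+2. Throughput (QPS/TPS): how efficiently the service handles requests, measured as the number of requests processed per unit of time. Higher is better.
+3. Development efficiency: the same work takes different amounts of time in different programming languages; this covers development, debugging, and maintenance efficiency. Higher is better.
+4. Resource utilization: the CPU/GPU utilization of a deployed service. Low utilization wastes resources, so higher is better.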
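The response-time and throughput metrics above can be computed from raw per-request timings. The following is a minimal sketch, not part of Paddle Serving: the helper names `percentile` and `summarize`, the nearest-rank percentile method, and the sample latencies are all assumptions made for illustration.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer in 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms, wall_seconds):
    """Summarize one load-test run: mean, tail latencies, and QPS."""
    ordered = sorted(latencies_ms)
    return {
        "mean_ms": sum(ordered) / len(ordered),
        "p50_ms": percentile(ordered, 50),
        "p90_ms": percentile(ordered, 90),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        # Throughput: completed requests divided by elapsed wall-clock time.
        "qps": len(ordered) / wall_seconds,
    }


# Hypothetical latencies (ms) for 10 requests completed within 2 seconds.
report = summarize(
    [12.0, 15.0, 11.0, 40.0, 13.0, 14.0, 90.0, 12.5, 13.5, 16.0],
    wall_seconds=2.0,
)
print(report)
```

Reporting tail percentiles next to the mean matters because a few slow requests (40 ms and 90 ms here) barely move the median but dominate p95 and p99.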

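-| Op name | Description |
-|--------------|------|
-| `general_reader` | Reader Op for the general data format |
-| `general_infer` | Paddle prediction Op for the general data format |
-| `general_response` | Response Op for the general data format |
-| `general_dist_kv` | Distributed index Op |
+Paddle Serving offers its users two access protocols, RPC and HTTP. HTTP is aimed at AI service developers with small to medium traffic and no strict latency requirements. RPC is aimed at users with higher traffic and stricter latency requirements; in addition, the RPC client may itself be part of a larger system's services, a case for which the RPC service provided by Paddle Serving is a very good fit. Users of the distributed sparse parameter indexing service do not need to care about the underlying details; such a call is essentially one RPC service calling another RPC service. The figure below shows several scenarios in which users may use the Serving service under the current design.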

-</center>
+<p align="center">
+    <br>
+<img src='user_groups.png' width = "700" height = "470">
+    <br>
+<p>

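-The prediction engine in Paddle Serving currently supports prediction on CPU and GPU, with a separate installation package and image for each. Whether the model runs on CPU or GPU, prediction for an ordinary model can be started with a single command.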
-``` shell
-python -m paddle_serving_server.serve --model your_servable_model --thread 10 --port 9292
-```
-``` shell
-python -m paddle_serving_server_gpu.serve --model your_servable_model --thread 10 --port 9292
-```
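-The options of the startup command are listed below:
-<center>
+For ordinary models (that is, models saved through the IO interface provided by Serving, with no post-processing applied to the model), users can start an RPC service without extra development, but they need to write some Client-side code to use the service. To develop a Web service, users first implement pre-processing and post-processing in the Web Service framework provided by Paddle Serving, which yields the complete HTTP service.
+  
+### 2.2 Industrial-Grade Features

-| Argument | Type | Default | Description |
-|--------------|------|-----------|--------------------------------|
-| `thread` | int | `4` | Number of concurrent server threads; usually set to the number of CPU cores |
-| `port` | int | `9292` | Port the service exposes to users |
-| `name` | str | `""` | Service name; when specified, an HTTP service is started directly |
-| `model` | str | `""` | Path of the server-side model directory |
-| `gpu_ids` | str | `""` | Only available in paddle_serving_server_gpu; works the same way as CUDA_VISIBLE_DEVICES |
+From its top-level design onward, Paddle Serving takes into account that teams in industrial scenarios use different operating systems, different programming languages, and a variety of hardware devices, and that they need model conversion across deep learning platforms, distributed sparse parameter indexing, and cloud deployment.

-</center>
+> Cross-platform operation

-For example, `python -m paddle_serving_server.serve --model your_servable_model --thread 10 --port 9292` corresponds to the following Server-side configuration: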
-``` python
-from paddle_serving_server import OpMaker, OpSeqMaker, Server
-
-op_maker = OpMaker()
-read_op = op_maker.create('general_reader')
-general_infer_op = op_maker.create('general_infer')
-general_response_op = op_maker.create('general_response')
-op_seq_maker = OpSeqMaker()
-op_seq_maker.add_op(read_op)
-op_seq_maker.add_op(general_infer_op)
-op_seq_maker.add_op(general_response_op)
-server = Server()
-server.set_op_sequence(op_seq_maker.get_op_sequence())
-server.set_num_threads(10)
-server.load_model_config("your_servable_model")
-server.prepare_server(port=9292, device="cpu")
-server.run_server()
-```
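+Cross-platform means depending neither on the operating system nor on the hardware environment: an application developed on one operating system still runs on another. The design therefore has to ensure that the programming languages and components are cross-platform, and also account for differences in how compilers behave on different systems.
+Docker is an open source application container engine that lets developers package an application and its dependencies into a portable container and then publish it to any popular Linux or Windows machine. We package the Paddle Serving framework into several Docker images; see [Docker Images](DOCKER_IMAGES_CN.md) for the list and choose an image according to your use case. To make Docker easier to use, we provide the guide [How to Run PaddleServing in Docker](RUN_IN_DOCKER_CN.md). At present, the Python webserver mode can be deployed natively on both Linux and Windows; see [Using Paddle Serving on Windows](WINDOWS_TUTORIAL_CN.md).

-#### 2.1.3 Client Access API
-Paddle Serving supports two protocols for remote service access: RPC and HTTP. Users who access the service over RPC can use the Python Client API provided by Paddle Serving and define the format of the input data. The following example shows how the Paddle Serving Client defines input data. When saving a deployable model, an alias must be specified for each input, for example `sparse` and `dense`; the corresponding data can be a sequence of discrete IDs such as `[1, 1001, 100001]`, or a dense vector such as `[0.2, 0.5, 0.1, 0.4, 0.11, 0.22]`. In the current Client design, discrete ID sequences support `lod_level=0` and `lod_level=1` in Paddle, that is, tensors and one-dimensional variable-length tensors. Dense vectors support `N-D Tensor`. Users do not need to specify the shape of the input explicitly; the Paddle Serving Client API checks it against the input shape recorded when the configuration was saved.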
-``` python
-feed_dict["sparse"] = [1, 1001, 100001]
-feed_dict["dense"] = [0.2, 0.5, 0.1, 0.4, 0.11, 0.22]
-fetch_map = client.predict(feed=feed_dict, fetch=["prob"])
-```
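-The code that connects the Client to the Server usually only needs to load the Client-side configuration saved with the model and specify the service endpoints to access. To keep the ability to scale out with data parallelism internally, the Paddle Serving Client allows multiple service endpoints to be defined.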
-``` python
-client = Client()
-client.load_client_config('servable_client_configs')
-client.connect(["127.0.0.1:9292"])
-```
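+> SDKs for multiple programming languages

+Paddle Serving provides SDKs for 4 programming languages: Python, C++, Java, and Golang. The Golang SDK is under construction, and interested open source developers are welcome to submit PRs.
+ Python: see the client examples under python/examples, or the web service example in section 4.2
+ C++: see [Writing a Prediction Service from Scratch](deprecated/CREATING.md)
+ Java: see [Paddle Serving Client Java SDK](JAVA_SDK_CN.md)
+ Golang: see [How to Use the Go Client with Paddle Serving](deprecated/IMDB_GO_CLIENT_CN.md)

-### 2.2 Underlying Communication Mechanism
-Paddle Serving uses [baidu-rpc](https://github.com/apache/incubator-brpc) for underlying communication. baidu-rpc is an RPC communication library open-sourced by Baidu, featuring high concurrency and low latency. It already supports more than a million online prediction instances and thousands of online prediction services, including those at Baidu, and is stable and reliable.
+> Support for multiple hardware devices

-### 2.3 Core Execution Engine
-The core execution engine of Paddle Serving is a directed acyclic graph (DAG) in which each node represents one stage of the prediction service; computing a model's prediction score is one such stage. A DAG lets nodes that can run concurrently make full use of the computing resources of the deployed instance, which shortens latency. For example, when the same input must be sent to two different models and the two prediction scores are then combined in a weighted sum, the two scoring steps can run concurrently according to the topology of the DAG.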
-<p align="center">
-    <br>
-<img src='design_doc.png'>
-    <br>
-<p>
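+The inference frameworks of well-known deep learning platforms support only CPU and GPU inference on the X86 platform. As AI algorithms grow rapidly in complexity and chip computing power increases substantially, IoT applications are landing faster and are deployed on a variety of hardware. Paddle Serving integrates the high-performance inference engine Paddle Inference and the mobile inference engine Paddle Lite to provide inference services on multiple hardware devices. At present, besides X86 CPU and GPU, Paddle Serving can deploy inference services on ARM CPU and Kunlun XPU, and more hardware will be supported in the future.

-### 2.4 Microservice Plugin Mode
-Because Paddle Serving uses a C++ based communication component at the bottom layer, and the core framework is also written in C/C++, users who want to define complex pre-processing and post-processing logic on the server have two options. One is to modify the underlying Paddle Serving framework and recompile the source code. The other is to embed a lightweight Web service on the server and implement the more complex pre-processing logic in that Web service, building a logically complete service. When traffic exceeds what the Web service can handle, developers have good reason to write high-performance C++ pre-processing logic and embed it in Serving's native service library. For the relationship between the Web service and the RPC service and how they are combined, see the `User Types` section below.

-## 3. Industrial-Grade Features
+> Model conversion across deep learning platforms

-### 3.1 Distributed Sparse Parameter Indexing
+Models trained on other deep learning platforms can be converted to Paddle models with the [PaddlePaddle/X2Paddle tool](https://github.com/PaddlePaddle/X2Paddle), which handles many mainstream CV models; conversion of TensorFlow, Caffe, ONNX, and PyTorch models has been tested.

-Distributed sparse parameter indexing typically appears in advertising and recommendation scenarios, and works together with distributed training to form an integrated offline-online deployment. The figure below illustrates the flow. After the product's online service receives a user request, it sends the request to the prediction service, while the system logs the request for training-log processing and joining. The offline distributed training system trains the model incrementally on the streaming training logs; the incrementally produced model is delivered to the distributed sparse parameter indexing service, and the corresponding dense model parameters are delivered to the online prediction service. The online service consists of two parts: it extracts features from the user request and sends the features that need sparse parameter lookup to the distributed sparse parameter indexing service, and it then runs the rest of the deep learning model computation on the sparse parameters returned by that service to complete the prediction.
+Taking the IMDB review sentiment analysis task as an example, [End-to-End from Training to Deployment](TRAIN_TO_SERVICE_CN.md) walks through the full Paddle Serving workflow, from model training to deploying a prediction service, in 9 steps.
+
+Because the feed and fetch parameters cannot be read directly from a model file, it is inconvenient for users to assemble request parameters. Paddle Serving therefore provides a tool that converts a Paddle model into the Serving format and generates a prototxt file containing the feed and fetch parameter information. Below is the prototxt file generated for the uci_housing example; for more conversion methods see [How to Save a Model for Paddle Serving](SAVE_CN.md).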
+```
+feed_var {
+  name: "x"
+  alias_name: "x"
+  is_lod_tensor: false
+  feed_type: 1
+  shape: 13
+}
+fetch_var {
+  name: "fc_0.tmp_1"
+  alias_name: "price"
+  is_lod_tensor: false
+  fetch_type: 1
+  shape: 1
+}
+```
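To show how the feed and fetch information in such a prototxt file can be used when assembling request parameters, here is a small sketch that extracts the variables from the uci_housing text above. It is an assumption-laden illustration: the regex-based `parse_vars` helper is hypothetical and only handles this flat layout, whereas real tooling would read the file with the protobuf text format and the project's own message definitions.

```python
import re

# The uci_housing prototxt content shown above, inlined for a self-contained example.
PROTOTXT = """
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}
"""

BLOCK_RE = re.compile(r"(feed_var|fetch_var)\s*\{(.*?)\}", re.S)
FIELD_RE = re.compile(r"(\w+):\s*(\S+)")


def parse_vars(text):
    """Hypothetical helper: collect feed_var/fetch_var blocks into dictionaries."""
    result = {"feed_var": [], "fetch_var": []}
    for kind, body in BLOCK_RE.findall(text):
        fields = {}
        for key, value in FIELD_RE.findall(body):
            value = value.strip('"')
            if key == "shape":
                # shape is a repeated field, so collect every occurrence.
                fields.setdefault("shape", []).append(int(value))
            else:
                fields[key] = value
        result[kind].append(fields)
    return result


parsed = parse_vars(PROTOTXT)
feed_names = [var["alias_name"] for var in parsed["feed_var"]]
fetch_names = [var["alias_name"] for var in parsed["fetch_var"]]
print(feed_names, fetch_names)
```

The alias names are what a client passes as feed keys and fetch names, so for this model a request feeds `x` (13 values) and fetches `price`.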

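+> Distributed sparse parameter indexing
+
+Why use the distributed sparse parameter indexing service provided by Paddle Serving? 1) In some recommendation scenarios, the number of model input features can exceed one hundred billion, and a single machine cannot hold a terabyte-scale model in memory, so distributed storage is needed. 2) The distributed sparse parameter indexing service provided by Paddle Serving can query multiple nodes concurrently, which lets the prediction service complete with low latency.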
 <p align="center">
    <br>
 <img src='cube_eng.png' width = "450" height = "230">
    <br>
 <p>
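+Distributed sparse parameter indexing typically appears in advertising and recommendation scenarios, and works together with distributed training to form an integrated offline-online deployment. The figure above illustrates the flow. After the product's online service receives a user request, it sends the request to the prediction service, while the system logs the request for training-log processing and joining. The offline distributed training system trains the model incrementally on the streaming training logs; the incrementally produced model is delivered to the distributed sparse parameter indexing service, and the corresponding dense model parameters are delivered to the online prediction service. The online service consists of two parts: it extracts features from the user request and sends the features that need sparse parameter lookup to the distributed sparse parameter indexing service, and it then runs the rest of the deep learning model computation on the sparse parameters returned by that service to complete the prediction.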
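The idea of querying several index nodes concurrently can be sketched as follows. This is not the Paddle Serving implementation: the in-memory `SHARDS`, the modulo sharding rule, and the `lookup` helper are assumptions chosen to keep the example self-contained, with threads standing in for RPC calls to remote nodes.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4

# Hypothetical in-memory shards; a real deployment would hold these on separate nodes.
SHARDS = [dict() for _ in range(NUM_SHARDS)]
for feature_id in range(20):
    SHARDS[feature_id % NUM_SHARDS][feature_id] = [float(feature_id), feature_id * 0.5]


def lookup_shard(shard_index, keys):
    """Stand-in for one RPC to a single index node."""
    shard = SHARDS[shard_index]
    return {key: shard[key] for key in keys if key in shard}


def lookup(keys):
    """Group keys by shard, query all shards concurrently, and merge the results."""
    by_shard = {}
    for key in keys:
        by_shard.setdefault(key % NUM_SHARDS, []).append(key)
    merged = {}
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        futures = [
            pool.submit(lookup_shard, index, shard_keys)
            for index, shard_keys in by_shard.items()
        ]
        for future in futures:
            merged.update(future.result())
    return merged


embeddings = lookup([1, 6, 11, 19])
print(embeddings)
```

Because the per-shard requests run in parallel, total lookup latency is close to that of the slowest shard rather than the sum over all shards.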

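-Why use the distributed sparse parameter indexing service provided by Paddle Serving? 1) In some recommendation scenarios, the number of model input features can exceed one hundred billion, and a single machine cannot hold a terabyte-scale model in memory, so distributed storage is needed. 2) The distributed sparse parameter indexing service provided by Paddle Serving can query multiple nodes concurrently, which lets the prediction service complete with low latency.
-
-### 3.2 Online A/B Traffic Testing

-After a model has been thoroughly evaluated offline, an online A/B test is usually needed to decide whether to roll the service out at scale. The figure below shows the basic structure of an A/B test with Paddle Serving: once the Client is configured accordingly, it automatically distributes traffic to different Servers, completing the A/B test. For a concrete example see [How to Do A/B Testing with Paddle Serving](ABTEST_IN_PADDLE_SERVING_CN.md).
+----
+## 3. C++ Serving Design
+C++ Serving aims to provide high-performance inference services with high concurrency and low latency. Its network framework and core execution engine are both written in C/C++, and it provides strong industrial-grade capabilities, including model management, model security, and A/B testing.
+
+### 3.1 Communication Mechanism
+
+C++ Serving uses [better-rpc](https://github.com/apache/incubator-brpc) for underlying communication. better-rpc is an RPC communication library open-sourced by Baidu, featuring high concurrency and low latency. It already supports more than a million online prediction instances and thousands of online prediction services, including those at Baidu, and is stable and reliable. Compared with the gRPC network framework, it offers lower latency and higher concurrency; its drawback is weaker cross-platform and cross-language support.

+### 3.2 Core Execution Engine
+
+The core execution engine of C++ Serving is a directed acyclic graph (DAG) in which each node represents one stage of the prediction service; computing a model's prediction score is one such stage. A DAG lets nodes that can run concurrently make full use of the computing resources of the deployed instance, which shortens latency. For example, when the same input must be sent to two different models and the two prediction scores are then combined in a weighted sum, the two scoring steps can run concurrently according to the topology of the DAG.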
 <p align="center">
    <br>
-<img src='abtest.png' width = "345" height = "230">
+<img src='design_doc.png'>
    <br>
 <p>
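The two-model example described for the core execution engine can be sketched in Python. The engine itself is written in C/C++, so this is only an illustration of the DAG topology: `model_a`, `model_b`, and the weights are hypothetical stand-ins, and a thread pool plays the role of the graph scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def model_a(features):
    """Hypothetical first model: scores with the mean of the features."""
    return sum(features) / len(features)


def model_b(features):
    """Hypothetical second model: scores with the largest feature."""
    return max(features)


def predict(features, weight_a=0.7, weight_b=0.3):
    """Run both scoring nodes concurrently, then combine them in a final node."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two nodes share one input and do not depend on each other,
        # so the DAG allows them to run at the same time.
        future_a = pool.submit(model_a, features)
        future_b = pool.submit(model_b, features)
        # The weighted sum depends on both results, so it runs last.
        return weight_a * future_a.result() + weight_b * future_b.result()


score = predict([0.2, 0.4, 0.6])
print(round(score, 4))
```

With this topology the end-to-end latency is roughly the slower of the two models plus the cost of the combine step, rather than the sum of both models.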

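+### 3.3 Model Management and Hot Loading

-### 3.3 Model Hot Loading
+The C++ engine of Paddle Serving supports model management, covering multiple models and multiple versions of a model. To keep the inference service available while a model is being replaced, the model must be hot loaded without interrupting the service. Paddle Serving supports this feature and provides a tool that monitors newly produced models and updates the local model; for a concrete example see [Model Hot Loading in Paddle Serving](HOT_LOADING_IN_SERVING_CN.md).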
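The core idea behind hot loading, swapping the model reference while requests keep being served, can be sketched as below. This does not reflect how the C++ engine or the monitoring tool is implemented; `ModelHolder` and the lambda "models" are hypothetical and only show the swap pattern.

```python
import threading


class ModelHolder:
    """Hypothetical holder that lets a model be replaced without stopping service."""

    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    @property
    def version(self):
        with self._lock:
            return self._version

    def predict(self, x):
        # Take a reference under the lock, then run inference outside it,
        # so a slow prediction never blocks a reload.
        with self._lock:
            model = self._model
        return model(x)

    def reload(self, new_model, new_version):
        # The new model is fully built before the swap; in-flight requests
        # finish on the old model, later requests see the new one.
        with self._lock:
            self._model = new_model
            self._version = new_version


holder = ModelHolder(lambda x: x + 1, "v1")
before = holder.predict(1)
holder.reload(lambda x: x * 10, "v2")
after = holder.predict(1)
print(before, after, holder.version)
```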

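-To keep the service available, the model must be hot loaded without interrupting the service. Paddle Serving supports this feature and provides a tool that monitors newly produced models and updates the local model; for a concrete example see [Model Hot Loading in Paddle Serving](HOT_LOADING_IN_SERVING_CN.md).
+### 3.4 Model Encryption and Decryption

-### 3.4 Model Management
+Paddle Serving encrypts models with a symmetric encryption algorithm and decrypts them in memory while the service loads the model. This currently provides basic model security and does not guarantee absolute security; users can build on our design to achieve a higher level of security. See [Encrypted Model Prediction](ENCRYPTION_CN.md).

-The C++ engine of Paddle Serving supports model management; the configuration for this feature is not yet fully exposed in the Python API, so stay tuned.
+### 3.5 A/B Test

-## 4. User Types
-
-Paddle Serving offers its users two access protocols, RPC and HTTP. HTTP is aimed at AI service developers with small to medium traffic and no strict latency requirements. RPC is aimed at users with higher traffic and stricter latency requirements; in addition, the RPC client may itself be part of a larger system's services, a case for which the RPC service provided by Paddle Serving is a very good fit. Users of the distributed sparse parameter indexing service do not need to care about the underlying details; such a call is essentially one RPC service calling another RPC service. The figure below shows several scenarios in which users may use the Serving service under the current design.
+After a model has been thoroughly evaluated offline, an online A/B test is usually needed to decide whether to roll the service out at scale. The figure below shows the basic structure of an A/B test with Paddle Serving: once the Client is configured accordingly, it automatically distributes traffic to different Servers, completing the A/B test. For a concrete example see [How to Do A/B Testing with Paddle Serving](ABTEST_IN_PADDLE_SERVING_CN.md).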

 <p align="center">
    <br>
-<img src='user_groups.png' width = "700" height = "470">
+<img src='abtest.png' width = "345" height = "230">
    <br>
 <p>
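Client-side traffic splitting for an A/B test can be sketched as weighted random selection among server endpoints. This is an illustration only and does not use the Paddle Serving Client API: the `pick_endpoint` helper, the second endpoint address, and the 90/10 split are assumptions.

```python
import random


def pick_endpoint(variants, rng):
    """Choose one endpoint at random, in proportion to its traffic weight."""
    endpoints = [endpoint for endpoint, _ in variants]
    weights = [weight for _, weight in variants]
    return rng.choices(endpoints, weights=weights, k=1)[0]


# Hypothetical split: 90% of requests to the current model, 10% to the candidate.
VARIANTS = [("127.0.0.1:9292", 90), ("127.0.0.1:9293", 10)]

rng = random.Random(0)  # Seeded so that the simulation is repeatable.
counts = {}
for _ in range(10000):
    endpoint = pick_endpoint(VARIANTS, rng)
    counts[endpoint] = counts.get(endpoint, 0) + 1
print(counts)
```

Over many requests the observed split converges to the configured weights, which is what allows the two variants to be compared on live traffic.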

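-For ordinary models (that is, models saved through the IO interface provided by Serving, with no post-processing applied to the model), users can start an RPC service without extra development, but they need to write some Client-side code to use the service. To develop a Web service, users first implement pre-processing and post-processing in the Web Service framework provided by Paddle Serving, which yields the complete HTTP service.
+### 3.6 Microservice Plugin Mode
+Because Paddle Serving uses a C++ based communication component at the bottom layer, and the core framework is also written in C/C++, users who want to define complex pre-processing and post-processing logic on the server have two options. One is to modify the underlying Paddle Serving framework and recompile the source code. The other is to embed a lightweight Web service on the server and implement the more complex pre-processing logic in that Web service, building a logically complete service. When traffic exceeds what the Web service can handle, developers have good reason to write high-performance C++ pre-processing logic and embed it in Serving's native service library. For the relationship between the Web service and the RPC service and how they are combined, see the discussion of the RPC and HTTP access protocols in section 2.1.

-### 4.1 Web Service Development
+----
+## 4. Python webserver Design and Usage

-Many open source Web frameworks exist. Paddle Serving currently integrates Flask, though this is invisible to users; a higher-performance Web framework may be provided in the future as the underlying HTTP service engine. Users need to inherit from WebService in order to process the inputs and outputs of the RPC service.
+### 4.1 Network Framework
+Many open source Web frameworks exist. Paddle Serving currently integrates Flask, though this is invisible to users; a higher-performance Web framework may be provided in the future as the underlying HTTP service engine.

+### 4.2 Web Service Example
+`WebService` is a base class. It provides the `preprocess` interface, which converts the HTTP request received from the user into RPC input, and the `postprocess` interface, which post-processes the result returned by the RPC request. Subclasses of `WebService` can define member functions of any kind. `WebService` is started with the same API as an ordinary RPC service; overriding `preprocess` and `postprocess` is enough to implement the processing before and after model prediction.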
 ``` python
 from paddle_serving_server.web_service import WebService
 from imdb_reader import IMDBDataset
@@ -226,15 +180,32 @@ imdb_service.prepare_dict({"dict_file_path": sys.argv[4]})
 imdb_service.run_server()
 ```

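-`WebService` is a base class. It provides the `preprocess` interface, which converts the HTTP request received from the user into RPC input, and the `postprocess` interface, which post-processes the result returned by the RPC request. Subclasses of `WebService` can define member functions of any kind. `WebService` is started with the same API as an ordinary RPC service.

-## 5. Future Plans
+----
+
+## 5. Python Pipeline Serving Design
+End-to-end deep learning models cannot yet solve every problem; combining several deep learning models is still the usual way to solve real-world problems. Paddle Serving provides Pipeline Serving, a user-friendly programming framework for multi-model composite services, which aims to lower the programming barrier, improve resource utilization (especially of GPU devices), and increase overall prediction efficiency.

-### 5.1 Opening Up DAG Structure Definition
-The Python API in the current version only lets users define Sequential execution flows; supporting complex computation inside the Server process requires adding corresponding user APIs.
+### 5.1 Network Framework
+The network framework of Pipeline Serving uses gRPC and gRPC gateway. The gRPC service receives RPC requests; the gRPC gateway receives RESTful API requests and forwards them to the gRPC service through a reverse proxy server. In other words, the network layer of Pipeline Serving accepts both RPC and RESTful API requests.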
+<center>
+<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
+</center>

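-### 5.2 Automatic Cloud Deployment
+### 5.2 Core Design and Usage Examples
+The core of Pipeline Serving is a graph execution engine whose basic processing units are OPs and Channels, which are combined to form a directed acyclic graph. For design and usage documentation see [Pipeline Serving Design and Implementation](PIPELINE_SERVING_CN.md).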
+<center>
+<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
+</center>
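The OP and Channel pattern can be sketched with threads and queues. This is a simplified illustration, not the Pipeline Serving API: `run_op` and `run_pipeline` are hypothetical helpers, each OP is a plain function, each Channel is a `queue.Queue`, and only a linear chain (the simplest DAG) is shown.

```python
import queue
import threading

STOP = object()  # Sentinel that tells each OP to shut down.


def run_op(func, in_channel, out_channel):
    """One OP: read from its input Channel, process, write to its output Channel."""
    while True:
        item = in_channel.get()
        if item is STOP:
            out_channel.put(STOP)  # Propagate shutdown to the next OP.
            return
        out_channel.put(func(item))


def run_pipeline(ops, inputs):
    """Connect the OPs in a chain with one Channel between each pair."""
    channels = [queue.Queue() for _ in range(len(ops) + 1)]
    threads = [
        threading.Thread(target=run_op, args=(op, channels[i], channels[i + 1]))
        for i, op in enumerate(ops)
    ]
    for thread in threads:
        thread.start()
    for item in inputs:
        channels[0].put(item)
    channels[0].put(STOP)

    outputs = []
    while True:
        item = channels[-1].get()
        if item is STOP:
            break
        outputs.append(item)
    for thread in threads:
        thread.join()
    return outputs


# Two hypothetical OPs: a preprocessing step followed by a "model" step.
results = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
print(results)
```

Because every OP runs in its own thread, the second OP can work on one request while the first OP is already handling the next, which is where the throughput gain of a pipelined design comes from.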
+----
+
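+## 6. Future Plans
+
+### 6.1 Automatic Cloud Deployment
 To make it easier for users to deploy Paddle prediction models online, upcoming versions of Paddle Serving will provide tools for task orchestration in the Kubernetes ecosystem.

-### 5.3 Vector Retrieval and Tree-Structured Retrieval
+### 6.2 Vector Retrieval and Tree-Structured Retrieval
 In recall systems for recommendation and advertising, fast vector-based retrieval or fast tree-based retrieval is usually required; Paddle Serving will integrate or extend retrieval engines of this kind.
+
+### 6.3 Service Monitoring
+Integrate Prometheus monitoring, an open source combination of monitoring, alerting, and a time series database that suits Kubernetes and Docker environments.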
--- a/doc/INFERENCE_TO_SERVING.md
+++ b/doc/INFERENCE_TO_SERVING.md
-# How to Convert Paddle Inference Model To Paddle Serving Format
-
-([简体中文](./INFERENCE_TO_SERVING_CN.md)|English)
-
-You can use a built-in Python module called `paddle_serving_client.convert` to convert it.
-```shell
-python -m paddle_serving_client.convert --dirname ./your_inference_model_dir
-```
-Arguments are the same as `inference_model_to_serving` API.
-| Argument | Type | Default | Description |
-|--------------|------|-----------|--------------------------------|
-| `dirname` | str | - | Path of saved model files. Program file and parameter files are saved in this directory. |
-| `serving_server` | str | `"serving_server"` | The path of model files and configuration files for server. |
-| `serving_client` | str | `"serving_client"` | The path of configuration files for client. |
-| `model_filename` | str | None | The name of file to load the inference program. If it is None, the default filename `__model__` will be used. |
-| `params_filename` | str | None | The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. |
--- a/doc/INFERENCE_TO_SERVING_CN.md
+++ b/doc/INFERENCE_TO_SERVING_CN.md
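-# How to Convert a Prediction Model Saved by Paddle into a Deployable Paddle Serving Model
-
-([English](./INFERENCE_TO_SERVING.md)|简体中文)
-
-You can use the built-in module `paddle_serving_client.convert` provided by Paddle Serving for the conversion.
-```shell
-python -m paddle_serving_client.convert --dirname ./your_inference_model_dir
-```
-The module arguments are the same as those of the `inference_model_to_serving` API.
-| Argument | Type | Default | Description |
-|--------------|------|-----------|--------------------------------|
-| `dirname` | str | - | Path where the model files to convert are stored. The Program structure file and the parameter files are both saved in this directory. |
-| `serving_server` | str | `"serving_server"` | Path where the converted model files and configuration files are stored. Defaults to serving_server |
-| `serving_client` | str | `"serving_client"` | Path where the converted client configuration files are stored. Defaults to serving_client |
-| `model_filename` | str | None | Name of the file that stores the Inference Program structure of the model to convert. If set to None, `__model__` is used as the default filename |
-| `params_filename` | str | None | Name of the file that stores all parameters of the model to convert. It needs to be specified only when all model parameters are saved in a single binary file. If the parameters are stored in separate files, set it to None |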
--- a/doc/DESIGN.md
+++ b/doc/DESIGN.md
--- a/doc/DESIGN_CN.md
+++ b/doc/DESIGN_CN.md
--- a/python/examples/bert/README.md
+++ b/python/examples/bert/README.md
@@ -11,14 +11,16 @@ This example use model [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubd

 Install paddlehub first
 ```
-pip install paddlehub
+pip3 install paddlehub
 ```

 run 
 ```
-python prepare_model.py 128
+python3 prepare_model.py 128
 ```

+**PaddleHub only supports Python 3.5+**
+
 the 128 in the command above means max_seq_len in BERT model, which is the length of sample after preprocessing.
 the config file and model file for server side are saved in the folder bert_seq128_model.
 the config file generated for client side is saved in the folder bert_seq128_client.
@@ -28,8 +30,9 @@ You can also download the above model from BOS(max_seq_len=128). After decompres
 ```shell
 wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
 tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
+mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
+mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
 ```
-if your model is bert_chinese_L-12_H-768_A-12_model, replace the 'bert_seq128_model' field in the following command with 'bert_chinese_L-12_H-768_A-12_model',replace 'bert_seq128_client' with 'bert_chinese_L-12_H-768_A-12_client'.

 ### Getting Dict and Sample Dataset


--- a/python/examples/bert/README_CN.md
+++ b/python/examples/bert/README_CN.md
@@ -10,11 +10,11 @@
 This example uses the [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [Paddlehub](https://github.com/PaddlePaddle/PaddleHub).
 Install paddlehub first
 ```
-pip install paddlehub
+pip3 install paddlehub
 ```
 Run
 ```
-python prepare_model.py 128
+python3 prepare_model.py 128
 ```
 The argument 128 is max_seq_len in the BERT model, that is, the sample length after preprocessing.
 The generated server-side configuration file and model files are saved in the folder bert_seq128_model.
@@ -25,9 +25,9 @@ python prepare_model.py 128
 ```shell
 wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
 tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
+mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
+mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
 ```
-If you use the bert_chinese_L-12_H-768_A-12_model model, replace the bert_seq128_model field in the commands below with bert_chinese_L-12_H-768_A-12_model, and replace the bert_seq128_client field with bert_chinese_L-12_H-768_A-12_client.
-


 ### Getting the Dictionary and Sample Dataset

--- a/python/examples/detection/README.md
+++ b/python/examples/detection/README.md
@@ -12,6 +12,7 @@ Paddle Detection provides a large number of [Model Zoo](https://github.com/Paddl

 ### Serving example
 Several examples of PaddleDetection models used in Serving are given in this folder
+All examples support TensorRT.

 -[Faster RCNN](./faster_rcnn_r50_fpn_1x_coco)
 -[PPYOLO](./ppyolo_r50vd_dcn_1x_coco)

--- a/python/examples/detection/faster_rcnn_r50_fpn_1x_coco/README.md
+++ b/python/examples/detection/faster_rcnn_r50_fpn_1x_coco/README.md
@@ -13,6 +13,9 @@ tar xf faster_rcnn_r50_fpn_1x_coco.tar
 python -m paddle_serving_server_gpu.serve --model serving_server --port 9494 --gpu_ids 0
 ```

+This model supports TensorRT. For faster inference, use `--use_trt`.
+
+
 ### Perform prediction
 ```
 python test_client.py 000000570688.jpg

--- a/python/examples/detection/faster_rcnn_r50_fpn_1x_coco/README_CN.md
+++ b/python/examples/detection/faster_rcnn_r50_fpn_1x_coco/README_CN.md
@@ -13,6 +13,7 @@ wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/
 tar xf faster_rcnn_r50_fpn_1x_coco.tar
 python -m paddle_serving_server_gpu.serve --model pddet_serving_model --port 9494 --gpu_ids 0
 ```
+This model supports TensorRT. For faster inference, enable the `--use_trt` option.

 ### Perform prediction
 ```

--- a/python/examples/detection/ppyolo_r50vd_dcn_1x_coco/README.md
+++ b/python/examples/detection/ppyolo_r50vd_dcn_1x_coco/README.md
@@ -13,6 +13,8 @@ tar xf ppyolo_r50vd_dcn_1x_coco.tar
 python -m paddle_serving_server_gpu.serve --model serving_server --port 9494 --gpu_ids 0
 ```

+This model supports TensorRT. For faster inference, use `--use_trt`.
+
 ### Perform prediction
 ```
 python test_client.py 000000570688.jpg

--- a/python/examples/detection/ppyolo_r50vd_dcn_1x_coco/README_CN.md
+++ b/python/examples/detection/ppyolo_r50vd_dcn_1x_coco/README_CN.md
@@ -14,6 +14,8 @@ tar xf ppyolo_r50vd_dcn_1x_coco.tar
 python -m paddle_serving_server_gpu.serve --model pddet_serving_model --port 9494 --gpu_ids 0
 ```

+This model supports TensorRT. For faster inference, enable the `--use_trt` option.
+
 ### Perform prediction
 ```
 python test_client.py 000000570688.jpg

--- a/python/examples/detection/ttfnet_darknet53_1x_coco/README.md
+++ b/python/examples/detection/ttfnet_darknet53_1x_coco/README.md
@@ -12,6 +12,7 @@ wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/
 tar xf ttfnet_darknet53_1x_coco.tar
 python -m paddle_serving_server_gpu.serve --model serving_server --port 9494 --gpu_ids 0
 ```
+This model supports TensorRT. For faster inference, use `--use_trt`.

 ### Perform prediction
 ```

--- a/python/examples/detection/ttfnet_darknet53_1x_coco/README_CN.md
+++ b/python/examples/detection/ttfnet_darknet53_1x_coco/README_CN.md
@@ -14,6 +14,8 @@ tar xf ttfnet_darknet53_1x_coco.tar
 python -m paddle_serving_server_gpu.serve --model pddet_serving_model --port 9494 --gpu_ids 0
 ```

+This model supports TensorRT. For faster inference, enable the `--use_trt` option.
+
 ### Perform prediction
 ```
 python test_client.py 000000570688.jpg

--- a/python/examples/detection/yolov3_darknet53_270e_coco/README.md
+++ b/python/examples/detection/yolov3_darknet53_270e_coco/README.md
@@ -13,6 +13,8 @@ tar xf yolov3_darknet53_270e_coco.tar
 python -m paddle_serving_server_gpu.serve --model serving_server --port 9494 --gpu_ids 0
 ```

+This model supports TensorRT. For faster inference, use `--use_trt`.
+
 ### Perform prediction
 ```
 python test_client.py 000000570688.jpg

--- a/python/examples/detection/yolov3_darknet53_270e_coco/README_CN.md
+++ b/python/examples/detection/yolov3_darknet53_270e_coco/README_CN.md
@@ -14,6 +14,8 @@ tar xf yolov3_darknet53_270e_coco.tar
 python -m paddle_serving_server_gpu.serve --model pddet_serving_model --port 9494 --gpu_ids 0
 ```

+This model supports TensorRT. For faster inference, enable the `--use_trt` option.
+
 ### Perform prediction
 ```
 python test_client.py 000000570688.jpg

--- a/python/examples/pipeline/imagenet/README_CN.md
+++ b/python/examples/pipeline/imagenet/README_CN.md
 # Imagenet Pipeline WebService

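-This document uses the Uci service as an example to introduce the use of Pipeline WebService.
+This document uses the Imagenet service as an example to introduce the use of Pipeline WebService.

 ## Get the Model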
 ```
@@ -10,10 +10,11 @@ sh get_model.sh
 ## Start the Service

 ```
-python web_service.py &>log.txt &
+python resnet50_web_service.py &>log.txt &
 ```

 ## Test
 ```
-curl -X POST -k http://localhost:18082/uci/prediction -d '{"key": ["x"], "value": ["0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332"]}'
+python pipeline_rpc_client.py
 ```
+
--- a/python/paddle_serving_server/serve.py
+++ b/python/paddle_serving_server/serve.py
@@ -152,8 +152,8 @@ class MainService(BaseHTTPRequestHandler):
        if "key" not in post_data:
            return False
        else:
-            key = base64.b64decode(post_data["key"])
-            with open(args.model + "/key", "w") as f:
+            key = base64.b64decode(post_data["key"].encode())
+            with open(args.model + "/key", "wb") as f:
                f.write(key)
            return True

@@ -161,8 +161,8 @@ class MainService(BaseHTTPRequestHandler):
        if "key" not in post_data:
            return False
        else:
-            key = base64.b64decode(post_data["key"])
-            with open(args.model + "/key", "r") as f:
+            key = base64.b64decode(post_data["key"].encode())
+            with open(args.model + "/key", "rb") as f:
                cur_key = f.read()
            return (key == cur_key)
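The two hunks above switch the key file to binary mode. A minimal sketch of why, under Python 3: `base64.b64decode` returns `bytes`, and writing `bytes` to a file opened in text mode (`"w"`) raises `TypeError`, while comparing against text read with `"r"` would compare `bytes` to `str` and never match. The `store_key` and `check_key` helpers and the temporary directory are assumptions made for illustration; they are not functions from `serve.py`.

```python
import base64
import os
import tempfile


def store_key(model_dir, post_data):
    """Decode the base64 key from the request and save it as raw bytes."""
    key = base64.b64decode(post_data["key"].encode())
    with open(os.path.join(model_dir, "key"), "wb") as f:
        f.write(key)


def check_key(model_dir, post_data):
    """Compare the key in the request with the stored key, bytes to bytes."""
    key = base64.b64decode(post_data["key"].encode())
    with open(os.path.join(model_dir, "key"), "rb") as f:
        return key == f.read()


# Hypothetical 16-byte key, sent as base64 text in a JSON-like payload.
raw_key = bytes(range(16))
payload = {"key": base64.b64encode(raw_key).decode()}

with tempfile.TemporaryDirectory() as model_dir:
    store_key(model_dir, payload)
    key_matches = check_key(model_dir, payload)
    key_rejects = check_key(model_dir, {"key": base64.b64encode(b"wrong").decode()})
print(key_matches, key_rejects)
```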

@@ -203,7 +203,7 @@ class MainService(BaseHTTPRequestHandler):
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.end_headers()
-        self.wfile.write(json.dumps(response))
+        self.wfile.write(json.dumps(response).encode())


 if __name__ == "__main__":

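The last `serve.py` hunk encodes the JSON response before writing it. In Python 3 the `wfile` of an HTTP request handler is a binary stream, so it accepts `bytes`, not `str`. The sketch below uses `io.BytesIO` as a stand-in for `wfile`; the `write_json_response` helper is hypothetical and not part of `serve.py`.

```python
import io
import json


def write_json_response(wfile, response):
    """Serialize the response to JSON text, encode it to bytes, and write it."""
    body = json.dumps(response).encode()
    wfile.write(body)
    return len(body)


# io.BytesIO behaves like the binary wfile stream: it rejects str input.
wfile = io.BytesIO()
written = write_json_response(wfile, {"status": "ok"})
decoded = json.loads(wfile.getvalue().decode())

try:
    io.BytesIO().write(json.dumps({"status": "ok"}))
    str_write_failed = False
except TypeError:
    str_write_failed = True
print(written, decoded, str_write_failed)
```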
--- a/python/pipeline/channel.py
+++ b/python/pipeline/channel.py
@@ -767,7 +767,7 @@ class ThreadChannel(Queue.PriorityQueue):
            while self._stop is False and self._consumer_cursors[
                    op_name] - self._base_cursor >= len(self._output_buf):
                try:
-                    channeldata = self.get(timeout=0)
+                    channeldata = self.get(timeout=0)[1]
                    self._output_buf.append(channeldata)
                    list_values = list(channeldata.values())
                    _LOGGER.debug(