We consider deploying deep learning inference as an online service to be a user-facing application in the future. **The goal of this project**: once you have trained a deep neural network with [Paddle](https://github.com/PaddlePaddle/Paddle), you can deploy the model online without much effort. A demo of Paddle Serving is shown below:
<p align="center">
    <img src="doc/demo.gif" width="700">
</p>
<h2align="center">Some Key Features</h2>
- Integrates with the Paddle training pipeline seamlessly; most Paddle models can be deployed **with a one-line command**.
- **Industrial serving features** supported, such as model management, online loading, and online A/B testing.
- **Distributed Key-Value indexing** supported, which is especially useful for large-scale sparse features as model inputs.
- **Highly concurrent and efficient communication** between clients and servers.
- **Multiple programming languages** supported on the client side, such as Golang, C++ and Python.
- **Extensible framework design** that can support model serving beyond Paddle.
<h2align="center">Installation</h2>
...
...
Paddle Serving provides HTTP and RPC based services for users to access.
### HTTP service
Paddle Serving provides a built-in Python module called `paddle_serving_server.serve` that can start an RPC service or an HTTP service with a one-line command. If we specify the argument `--name uci`, it means that we will have an HTTP service with a URL of `$IP:$PORT/uci/prediction`.
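As a minimal sketch, assuming a server already started locally with something like `python -m paddle_serving_server.serve --model uci_housing_model --port 9292 --name uci` (the model directory and the `feed`/`fetch` payload layout below are illustrative assumptions, not a spec), an HTTP prediction request could look like this:

```python
# Sketch of an HTTP client; the "feed"/"fetch" payload layout and the
# feature values are illustrative assumptions.
import requests

url = "http://127.0.0.1:9292/uci/prediction"  # i.e. $IP:$PORT/uci/prediction
payload = {
    "feed": [{"x": [0.0137, -0.1136, 0.2553, -0.0692,
                    0.0582, -0.0727, -0.1583, -0.0584,
                    0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}],
    "fetch": ["price"],
}
resp = requests.post(url, json=payload)
print(resp.json())  # e.g. a dict containing the fetched "price" tensor
```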
A user can also start an RPC service with `paddle_serving_server.serve`. An RPC service is usually faster than an HTTP service, although the user needs to do some coding based on Paddle Serving's Python client API. Note that we do not specify `--name` here.
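A minimal sketch of the RPC client side, assuming a `paddle_serving_client.Client` class whose config-loading, connection, and prediction calls look as below (the exact method names and the `uci_housing_client` config path are assumptions for illustration):

```python
# Sketch of an RPC client; method names and file paths are assumptions.
from paddle_serving_client import Client

client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])  # RPC endpoint of the serving process

x = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583,
     -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": x}, fetch=["price"])
print(fetch_map)
```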
This document will use an example of a text classification task based on the IMDB dataset to show how to build an A/B Test framework using Paddle Serving. The structural relationship between the client and servers in the example is shown in the figure below.
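Beyond the figure, here is a minimal client-side sketch of such an A/B setup, assuming the client API exposes an `add_variant(tag, endpoints, weight)`-style call for weighted traffic splitting (the method name and all tags, ports, and weights are illustrative assumptions):

```python
# Sketch of client-side A/B routing; add_variant and all tags/ports/weights
# are illustrative assumptions.
from paddle_serving_client import Client

client = Client()
client.load_client_config("imdb_client_conf/serving_client_conf.prototxt")
# Route ~30% of requests to the "bow" variant and ~70% to "lstm".
client.add_variant("bow", ["127.0.0.1:8000"], 30)
client.add_variant("lstm", ["127.0.0.1:9000"], 70)
client.connect()
```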
The result is a complete serving solution.
## 2. Terminology
- **baidu-rpc**: Baidu's official open-source RPC framework. It supports multiple common communication protocols and provides custom interfaces based on protobuf.
- **Variant**: Paddle Serving's abstraction of a minimal prediction cluster. All instances (replicas) inside a Variant are completely homogeneous and logically correspond to a fixed version of a model.
- **Endpoint**: Multiple Variants form an Endpoint. Logically, an Endpoint represents a model, and the Variants within it represent different versions of that model.
- **OP**: In PaddlePaddle, an OP encapsulates a numerical computation operator; in Paddle Serving, an OP represents a basic business operation whose core interface is `inference`. An OP declares its upstream OP dependencies, which connects multiple OPs into a workflow.
- **Channel**: An abstraction of all request-level intermediate data of an OP; OPs exchange data through Channels.
- **Bus**: Manages all Channels in a thread and schedules the access relationships between OPs and Channels according to the DAG dependency graph.
- **Stage**: A set of OPs in a Workflow that, according to the topology described by the DAG, sit at the same level and can be executed in parallel.
- **Node**: An instance of an OP class combined with its parameter configuration; it is also an execution unit in a Workflow.
- **Workflow**: Executes the inference interface of each OP in order, according to the topology described by the DAG (see the sketch after this list).
- **DAG/Workflow**: Consists of several interdependent Nodes. Each Node can obtain the Request object through a specific interface, and a Node's OP obtains the outputs of its predecessor OPs through the dependency relationships. The output of the last Node is the Response object by default.
- **Service**: Encapsulates a single request (PV). A Service can configure several Workflows, which share the current request's Request object, execute in parallel or in series, and finally write the Response to the corresponding output slot. One Paddle Serving process can configure multiple Service interfaces; the upstream selects the Service to access based on the ServiceName.
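To make the Node/Channel/Workflow relationship concrete, here is a small, purely illustrative Python sketch of a two-node workflow; none of these classes belong to Paddle Serving's actual API.

```python
# Purely illustrative sketch of the Node/Channel/Workflow concepts above;
# these classes are NOT part of Paddle Serving's actual API.
class Channel:
    """Holds one OP's request-level intermediate output."""
    def __init__(self):
        self.data = None

class Node:
    """An OP instance plus its parameter configuration."""
    def __init__(self, name, op_fn, upstream=None):
        self.name = name
        self.op_fn = op_fn              # the OP's `inference` logic
        self.upstream = upstream or []  # dependency edges of the DAG
        self.out = Channel()

    def run(self, request):
        inputs = [n.out.data for n in self.upstream] or [request]
        self.out.data = self.op_fn(*inputs)

# A two-node workflow: read -> infer; the last Node's output is the Response.
read_op = Node("read", lambda req: {"x": req["x"]})
infer_op = Node("infer", lambda feed: {"price": sum(feed["x"])},
                upstream=[read_op])

for node in (read_op, infer_op):  # topological order of the DAG
    node.run({"x": [1.0, 2.0, 3.0]})
print(infer_op.out.data)          # Response: {'price': 6.0}
```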
## 3. Python Interface Design
...
...
### 3.3 Overall design
- The user starts the Client and Server through the Python API, which provides a function to check whether the connection works and whether the models to be accessed match.
- The Python API calls the pybind bindings of the client and server functions implemented by Paddle Serving; the information exchanged between client and server is transmitted through RPC.
- The Client Python API currently has two simple functions, `load_inference_conf` and `predict`, which load the model configuration to be used for prediction and perform prediction, respectively.
- The Server Python API is mainly responsible for loading the inference model and generating the various configurations required by Paddle Serving, including engines, workflows, resources, etc. (a sketch follows this list).
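A minimal sketch of what starting a server through the Python API could look like, assuming a `paddle_serving_server.Server` class with the configuration-building calls shown below (all method names, OP names, the model directory, and the port are illustrative assumptions, not the definitive interface):

```python
# Sketch of the Server Python API flow; method names, OP names, paths,
# and ports are illustrative assumptions.
from paddle_serving_server import OpMaker, OpSeqMaker, Server

# Build the workflow: reader -> inference -> response OPs.
op_maker = OpMaker()
read_op = op_maker.create("general_reader")
infer_op = op_maker.create("general_infer")
response_op = op_maker.create("general_response")

op_seq_maker = OpSeqMaker()
op_seq_maker.add_op(read_op)
op_seq_maker.add_op(infer_op)
op_seq_maker.add_op(response_op)

# Generate engine/workflow/resource configurations and serve.
server = Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
server.load_model_config("uci_housing_model")
server.prepare_server(workdir="work_dir", port=9292, device="cpu")
server.run_server()
```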
**Model Management Framework**: Connects with model files from multiple machine learning platforms and provides a unified inference interface.
**Business Scheduling Framework**: Abstracts the computation logic of different inference models, provides a general DAG scheduling framework, and connects different operators through a DAG diagram to jointly complete a prediction service. This abstraction lets users conveniently implement their own computation logic while facilitating operator sharing. (When users build their own prediction services, a large part of the work is building the DAG and providing operators.)
**PredictService**: Encapsulation of the externally provided prediction service interface. Communication fields with the client are defined through protobuf.
- Long Term Vision: Online deployment of deep learning models will be a user-facing application in the future. Any AI developer will face the problem of deploying an online service for their trained model.
# An End-to-end Tutorial from Training to Inference Service Deployment
([简体中文](./TRAIN_TO_SERVICE_CN.md)|English)
Paddle Serving is Paddle's high-performance online inference service framework, which can flexibly support the deployment of most models. In this article, the IMDB review sentiment analysis task is used as an example to show the entire process from model training to the deployment of an inference service in 9 steps.
## Step 1: Prepare the Running Environment
Paddle Serving can be deployed on Linux environments such as CentOS and Ubuntu. On other systems, or in environments where you do not want to install the serving module, you can still access the server-side prediction service through the HTTP service.