We consider deploying deep learning inference service online to be a user-facing application in the future. **The goal of this project**: When you have trained a deep neural net with [Paddle](https://github.com/PaddlePaddle/Paddle), you can deploy the model online with little effort. A demo of Paddle Serving is as follows:
<p align="center">
<img src="doc/demo.gif" width="700">
</p>
<h2 align="center">Some Key Features</h2>
- Integrate with Paddle training pipeline seamlessly, most Paddle models can be deployed **with one line command**.
- **Industrial serving features** supported, such as model management, online loading, online A/B testing etc.
- **Distributed Key-Value indexing** supported which is especially useful for large scale sparse features as model inputs.
- **Highly concurrent and efficient communication** between clients and servers supported.
- **Multiple programming languages** supported on client side, such as Golang, C++ and Python.
- **Extensible framework design** which can support model serving beyond Paddle.
<h2 align="center">Installation</h2>
...
...
Paddle Serving provides HTTP and RPC based services for users to access.
### HTTP service
Paddle Serving provides a built-in python module called `paddle_serving_server.serve` that can start an RPC service or an HTTP service with a one-line command. If we specify the argument `--name uci`, it means that we will have an HTTP service with a url of `$IP:$PORT/uci/prediction`.
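As a minimal sketch (not the full quickstart), assuming the service was started with `--name uci` on port 9292 and that it accepts the feed/fetch JSON payload used in the uci housing quickstart, an HTTP prediction request can be sent with any HTTP client, for example Python's `requests`:

```
# Minimal sketch of calling the HTTP service, assuming it was started with e.g.
#   python -m paddle_serving_server.serve --model uci_housing_model --port 9292 --name uci
# The feed/fetch names ("x", "price") follow the uci housing quickstart and are
# illustrative; replace them with the names in your client config.
import requests

payload = {
    "feed": [{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582,
                    -0.0727, -0.1583, -0.0584, 0.6283, 0.4919,
                    0.1856, 0.0795, -0.0332]}],  # illustrative 13-dim feature vector
    "fetch": ["price"],
}
resp = requests.post("http://127.0.0.1:9292/uci/prediction", json=payload)
print(resp.json())
```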
A user can also start an RPC service with `paddle_serving_server.serve`. An RPC service is usually faster than an HTTP service, although the user needs to do some coding based on Paddle Serving's python client API. Note that we do not specify `--name` here.
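For reference, a minimal RPC client sketch, assuming an RPC service is already running on port 9292 and that the client config exported with the model is available locally (paths and feed/fetch names follow the uci housing quickstart and are illustrative):

```
# Minimal RPC client sketch using Paddle Serving's python client API.
from paddle_serving_client import Client

client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])

# One 13-dimensional feature vector (illustrative values).
x = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583,
     -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": x}, fetch=["price"])
print(fetch_map)
```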
In this example, a BERT model is used for semantic understanding prediction: the input text is represented as a vector, which can be used for further analysis and prediction.
This example uses the [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [PaddleHub](https://github.com/PaddlePaddle/PaddleHub).
Install paddlehub first
```
pip install paddlehub
```
Run
```
python prepare_model.py 20
```
The argument `20` in the command above is the `max_seq_len` of the BERT model, i.e. the length of a sample after preprocessing.
The server-side config file and model file are saved in the folder `bert_seq20_model`.
The client-side config file is saved in the folder `bert_seq20_client`.
### Getting Dict and Sample Dataset
```
sh get_data.sh
```
This script will download the Chinese dictionary file vocab.txt and the Chinese sample data data-c.txt.
The client reads data from data-c.txt and sends prediction requests; the prediction result is a word vector. (Due to the massive size of the word vector, we do not print it.)
### HTTP Inference Service
```
export CUDA_VISIBLE_DEVICES=0,1
```
Set the environment variable to specify which GPUs the prediction service uses; the command above selects the two GPUs with indices 0 and 1.
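The same restriction can also be applied from Python before the server process is created, for example:

```
import os

# Equivalent to `export CUDA_VISIBLE_DEVICES=0,1`: only GPU 0 and GPU 1 will be
# visible to the prediction service started from this process (or its children).
# This must be set before any CUDA initialization happens.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"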
In the test, the 10 thousand samples in the sample data are replicated into 100 thousand samples, which are divided evenly among the client threads, so each thread sends 100,000 / (number of threads) samples. The batch size is 1, max_seq_len is 20, and the time unit is seconds.
When the number of client threads is 4, the prediction speed can reach 432 samples per second.
Because a single GPU can only perform serial calculations internally, increasing the number of client threads can only reduce the idle time of the GPU. Therefore, once the number of threads reaches 4, further increasing it does not improve the prediction speed.
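As a rough illustration of how such a benchmark client divides the work (a sketch only; `predict_one` is a hypothetical stand-in for the real Paddle Serving client call used in the example):

```
# Sketch of the thread-splitting scheme used in the benchmark: the sample set is
# divided evenly among the client threads and each thread sends its share with
# batch size 1. `predict_one` is a placeholder for the actual prediction call.
import time
from concurrent.futures import ThreadPoolExecutor


def predict_one(sample):
    # Placeholder: in the real benchmark this sends one request to the serving
    # endpoint and returns the prediction.
    pass


def run_thread(samples):
    for sample in samples:
        predict_one(sample)


def benchmark(samples, thread_num):
    chunk = len(samples) // thread_num
    start = time.time()
    with ThreadPoolExecutor(max_workers=thread_num) as pool:
        for i in range(thread_num):
            pool.submit(run_thread, samples[i * chunk:(i + 1) * chunk])
    elapsed = time.time() - start
    return len(samples) / elapsed  # throughput in samples per second


if __name__ == "__main__":
    samples = list(range(100000))  # stands in for the replicated test samples
    print(benchmark(samples, thread_num=4))
```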
| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total |
| ----------------- | ------ | ------------ | --- | --- | --- | ------- | ----- |
*The port of the server side in this example is 9393, and the sample data used by the client side is in the folder ./data. These parameters can be modified in practice.*
In this test, the client sends 25,000 test samples in total. The bar chart given later shows the single-thread latency in seconds, from which we can see that prediction efficiency is greatly improved by multi-threading compared to a single thread; 16 threads give an 8.7x improvement.
| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total |
| ----------------- | ------ | ------------ | --- | --- | --- | ------- | ----- |
The serving framework has a built-in function for timing each stage of the service. The client controls whether it is turned on through environment variables; once enabled, the timing information is printed to the screen.
```
export FLAGS_profile_client=1 # turn on client-side timing for each stage
export FLAGS_profile_server=1 # turn on server-side timing for each stage
```
After enabling this function, the client will print the corresponding log information to standard output during the prediction process.
In order to show the time consumed by each stage more intuitively, a script is provided to further analyze and process the log file.
To use it, first save the output of the client to a file, named `profile` in this example.
Here the `thread_num` parameter is the number of processes when the client is running, and the script will calculate the average time spent in each phase according to this parameter.
The script calculates the time spent in each stage, averages it over the number of threads, and prints the result to standard output.
The script converts the timing information in the log into JSON format and saves it to a trace file. The trace file can be visualized through the tracing function of the Chrome browser.
To do so, open the Chrome browser, enter `chrome://tracing/` in the address bar, click the `load` button, and open the saved trace file to visualize the timing of each stage of the prediction service.
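For reference, Chrome's tracing page loads the standard Trace Event Format. The sketch below is not the script shipped with Paddle Serving; it only illustrates, with made-up timing values, how per-stage timing records could be written into such a trace file:

```
# Illustration of the Chrome Trace Event Format that chrome://tracing can load.
# Each record is a "complete" event ("ph": "X") with a start timestamp and a
# duration in microseconds; pid/tid group events into rows on the timeline.
import json

# (stage name, process id, thread id, start time in us, duration in us) -- illustrative values.
records = [
    ("bert_pre", 0, 0, 0, 1500),
    ("client_infer", 0, 0, 1500, 4200),
]

events = [
    {"name": name, "ph": "X", "pid": pid, "tid": tid, "ts": ts, "dur": dur}
    for name, pid, tid, ts, dur in records
]

with open("trace", "w") as f:
    json.dump(events, f)  # open this file from the `load` button on chrome://tracing/
```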
The data visualization output is shown below; it uses the GPU inference service of the [bert as service example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert). The server runs prediction on 4 GPUs, the client starts 4 processes, and the batch size is 1. Among them, `bert_pre` represents the data preprocessing stage of the client, and `client_infer` represents the stage where the client completes sending and receiving prediction requests. `process` is the process number of the client, and the second line of each process shows the timeline of each op of the server.