It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you: [CPU Dockerfile.devel](../tools/Dockerfile.devel), [GPU Dockerfile.gpu.devel](../tools/Dockerfile.gpu.devel)
This document takes Python2 as an example to show how to compile Paddle Serving. If you want to compile with Python 3, just adjust the Python-related options of cmake.
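As a rough sketch only (check the compile document for the exact options; the paths below are placeholders for a typical Python 3 installation), switching the build to Python 3 means pointing the Python-related cmake variables at the Python 3 interpreter, headers, and library:
```
# hedged sketch: point the cmake Python variables at Python 3 before building.
# PYTHON_EXECUTABLE / PYTHON_INCLUDE_DIR / PYTHON_LIBRARIES are standard CMake variables;
# the concrete paths are placeholders, adjust them to your environment.
cmake -DPYTHON_EXECUTABLE=$(which python3) \
      -DPYTHON_INCLUDE_DIR=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['include'])") \
      -DPYTHON_LIBRARIES=/usr/lib/x86_64-linux-gnu/libpython3.6m.so \
      ..
```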
Docker (GPU version requires nvidia-docker to be installed on the GPU machine)
This document takes Python2 as an example to show how to run Paddle Serving in docker. You can also use Python3 to run related commands by replacing `python` with `python3`.
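For reference, a hedged sketch of starting a container is shown below; the image names are placeholders rather than the official ones, so take the actual image from the Docker documents above:
```
# hypothetical example; replace <paddle-serving-cpu-image> / <paddle-serving-gpu-image>
# with the actual images built from the Dockerfiles referenced above
docker run -it --name paddle_serving_cpu <paddle-serving-cpu-image> /bin/bash
# GPU version: nvidia-docker is required on the host
nvidia-docker run -it --name paddle_serving_gpu <paddle-serving-gpu-image> /bin/bash
```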
the 128 in the command above means max_seq_len in the BERT model, which is the length of a sample after preprocessing.
the config file and model file for the server side are saved in the folder bert_seq128_model.
the config file generated for the client side is saved in the folder bert_seq128_client.
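The command referred to above is not reproduced in this excerpt. As a hypothetical illustration only (the script name and the way the argument is passed are assumptions), the preparation step and the resulting folders could look like this:
```
# hypothetical sketch: a preparation script that takes max_seq_len as its argument
python prepare_model.py 128
ls bert_seq128_model/    # server-side model file and config file
ls bert_seq128_client/   # client-side config file
```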
You can also download the above model from BOS (max_seq_len=128). After decompression, the config file and model file for the server side are stored in the bert_chinese_L-12_H-768_A-12_model folder, and the config file generated for the client side is stored in the bert_chinese_L-12_H-768_A-12_client folder.
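A hedged sketch of the download-and-extract step is shown below; `<BOS-download-link>` stands for the actual BOS link mentioned above (not reproduced here), and the archive name is an assumption based on the folder names:
```
# hypothetical: substitute the real BOS link for <BOS-download-link>
wget "<BOS-download-link>" -O bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
ls bert_chinese_L-12_H-768_A-12_model bert_chinese_L-12_H-768_A-12_client
```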
...
this script will download the Chinese Dictionary File vocab.txt and the Chinese sample data.
### RPC Inference Service
Run
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #cpu inference service
```
Or
```
python -m paddle_serving_server_gpu.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
```
### RPC Inference
...
```
pip install paddle_serving_app
```
Run
```
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```
the client reads data from data-c.txt and sends prediction requests; the prediction result is returned as a word vector (because the word vector contains a large amount of data, we do not print it).
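For a quick single-sample check, you can send only the first line of the data file (a minor variation of the command above):
```
head -n 1 data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```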
...
```
export CUDA_VISIBLE_DEVICES=0,1
```
set the environment variable to specify which GPUs are used; the command above means GPU 0 and GPU 1 are used.
```
python bert_web_service.py bert_seq128_model/ 9292 #launch gpu inference service
```
### HTTP Inference
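As a hedged illustration (the feed key `words`, the fetch key `pooled_output`, and the `/bert/prediction` endpoint are assumptions based on the BERT demo, not something shown in this excerpt), an HTTP request to the web service started above could look like:
```
# hypothetical request; the field names and the endpoint path are assumptions,
# adjust them to the demo's actual client config and web service name
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```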
...
GPU: GPU V100 * 1
CUDA/cudnn Version: CUDA 9.2, cudnn 7.1.4
In the test, the 10,000 samples in the sample data are copied into 100,000 samples, which are distributed evenly across the client threads. The batch size is 1, max_seq_len is 20 (not 128 as described above), and the time unit is seconds.
When the number of client threads is 4, the prediction speed can reach 432 samples per second.
Because a single GPU can only perform calculations serially, increasing the number of client threads only reduces the idle time of the GPU. Therefore, after the number of threads reaches 4, increasing the number of threads further does not improve the prediction speed.
```
check_cmd "curl -H \"Content-Type:application/json\" -X POST -d '{\"words\": \"i am very sad | 0\", \"fetch\":[\"prediction\"]}' http://127.0.0.1:9292/imdb/prediction"
```
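`check_cmd` is not defined in this excerpt; a minimal sketch of what such a helper might look like (its error-handling behavior is an assumption) is:
```
# hypothetical helper: run the given command and exit with an error if it fails
function check_cmd() {
    eval "$@"
    if [ $? -ne 0 ]; then
        echo "check_cmd failed: $@"
        exit 1
    fi
}
```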