......@@ -267,6 +267,15 @@ output
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
<h3 align="center">Stop Serving/Pipeline service</h3>
**Method one**: Press Ctrl+C to quit.
**Method two**: In the directory where the Serving/Pipeline service was started, or in the directory set by the SERVING_HOME environment variable (the file ProcessInfo.json exists in this directory), run:
python3 -m paddle_serving_server.serve stop
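The lookup described in Method two can be illustrated with a small sketch that checks the SERVING_HOME directory first and then a start directory for ProcessInfo.json. The helper name `find_process_info` and the search order are assumptions made for illustration; this is not Paddle Serving's own implementation.

```python
import os
from pathlib import Path
from typing import Optional


def find_process_info(start_dir: str = ".") -> Optional[Path]:
    """Return the path to ProcessInfo.json if it can be found, else None.

    Hypothetical helper: the search order (SERVING_HOME first, then
    start_dir) is an assumption made for this sketch.
    """
    candidates = []
    serving_home = os.environ.get("SERVING_HOME")
    if serving_home:
        candidates.append(Path(serving_home))
    candidates.append(Path(start_dir))
    for directory in candidates:
        info_file = directory / "ProcessInfo.json"
        if info_file.is_file():
            return info_file
    return None
```

If the function returns None, the stop command is probably being run from a directory that does not match either condition above.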
<h2 align="center">Document</h2>
......@@ -269,6 +269,16 @@ python3 pipeline_rpc_client.py
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
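<h3 align="center">Stop Serving/Pipeline service</h3>
**Method one**: Press Ctrl+C to stop the service.
**Method two**: In the directory where the Serving/Pipeline service was started, or in the directory set by the SERVING_HOME environment variable (the file ProcessInfo.json exists in this directory), run: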
python3 -m paddle_serving_server.serve stop
<h2 align="center">Document</h2>
### Beginner Tutorials
## Build Bert-As-Service in 10 minutes
Given a sentence, Bert-As-Service represents it as a semantic vector and returns that vector to the user. The [Bert model](https://arxiv.org/abs/1810.04805) is a popular model in the current NLP field and has achieved good results on a variety of public NLP tasks. Using the semantic vector computed by the Bert model as input to other NLP models can also greatly improve their performance. Bert-As-Service lets users easily obtain the semantic vector representation of text and apply it to their own tasks. To this end, the five steps below show how to build such a service with Paddle Serving in ten minutes. All the code and files in the example can be found in the [Example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) directory of Paddle Serving.
If your Python version is 3.x, replace `pip` with `pip3` and `python` with `python3` in the following commands.
### Step1: Getting Model
#### method 1:
This example uses the [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [PaddleHub](https://github.com/PaddlePaddle/PaddleHub).
Install PaddleHub first, then prepare the model:
pip install paddlehub
python prepare_model.py 128
**PaddleHub only supports Python 3.5+**
The 128 in the command above is the max_seq_len of the BERT model, i.e. the length of each sample after preprocessing.
The config file and model file for the server side are saved in the folder bert_seq128_model.
The config file generated for the client side is saved in the folder bert_seq128_client.
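To make the meaning of max_seq_len concrete, here is a minimal sketch of forcing a token-id sequence to a fixed length. The function name `pad_or_truncate` and the padding id 0 are assumptions for illustration; the real preprocessing (tokenization, special tokens, additional input fields) is handled by the reader class in paddle_serving_app and is more involved.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_seq_len: int = 128,
                    pad_id: int = 0) -> List[int]:
    """Return a copy of token_ids with exactly max_seq_len entries.

    Longer sequences are cut off at max_seq_len; shorter ones are
    right-padded with pad_id (assumed to be 0 in this sketch).
    """
    clipped = token_ids[:max_seq_len]
    return clipped + [pad_id] * (max_seq_len - len(clipped))


# The ids below are arbitrary example values, not real vocabulary lookups.
print(len(pad_or_truncate([101, 2769, 102])))  # prints 128
```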
#### method 2:
You can also download the above model from BOS(max_seq_len=128). After decompression, the config file and model file for server side are stored in the bert_chinese_L-12_H-768_A-12_model folder, and the config file generated for client side is stored in the bert_chinese_L-12_H-768_A-12_client folder:
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
### Step2: Getting Dict and Sample Dataset
sh get_data.sh
This script downloads the Chinese dictionary file vocab.txt and the Chinese sample data data-c.txt.
### Step3: Launch Service
Start the CPU inference service:
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #cpu inference service
Or start the GPU inference service:
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
| Parameters | Meaning |
| ---------- | ---------------------------------------- |
| model | server configuration and model file path |
| thread | server-side threads |
| port | server port number |
| gpu_ids | GPU index number |
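When the service is launched from a script, the parameters in the table can be assembled programmatically. The sketch below builds the argument list for the launch command shown above. The helper `build_serve_command` is hypothetical, and the `--thread` spelling is inferred from the table rather than from a command in this document.

```python
from typing import List, Optional


def build_serve_command(model: str, port: int,
                        gpu_ids: Optional[str] = None,
                        thread: Optional[int] = None) -> List[str]:
    """Build the argv list for launching the inference service.

    Hypothetical helper; --model, --port and --gpu_ids appear in the
    commands above, --thread is assumed from the parameter table.
    """
    cmd = ["python", "-m", "paddle_serving_server.serve",
           "--model", model, "--port", str(port)]
    if thread is not None:
        cmd += ["--thread", str(thread)]
    if gpu_ids is not None:
        cmd += ["--gpu_ids", gpu_ids]
    return cmd


print(" ".join(build_serve_command("bert_seq128_model/", 9292, gpu_ids="0")))
```

The returned list can be passed to `subprocess.Popen` to start the service in the background.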
### Step4: Data Preprocessing Logic on Client Side
Paddle Serving has built-in data preprocessing logic for many models. To compute the Chinese Bert semantic representation, we use the ChineseBertReader class in paddle_serving_app for data preprocessing. With it, developers can easily obtain the multiple model input fields that correspond to a raw Chinese sentence.
Install paddle_serving_app
pip install paddle_serving_app
### Step5: Client Visit Serving
#### method 1: RPC Inference
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
The client reads data from data-c.txt and sends prediction requests; the prediction result is a word vector. (Because the word vector contains a large amount of data, it is not printed.)
#### method 2: HTTP Inference
This method is divided into two steps:
1. Start an HTTP prediction server.
Start the CPU HTTP inference service:
python bert_web_service.py bert_seq128_model/ 9292 #launch cpu inference service
Or start the GPU HTTP inference service:
Before launching, set the environment variable that specifies which GPUs are used (with CUDA this is typically CUDA_VISIBLE_DEVICES); a value of 0,1 means GPU 0 and GPU 1 are used.
python bert_web_service_gpu.py bert_seq128_model/ 9292 #launch gpu inference service
2. Prediction via HTTP request
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}'
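The same request can be prepared from Python with only the standard library. The sketch below mirrors the JSON body and header of the curl command. The helper name `build_prediction_request` is hypothetical, and the service URL is not given in this excerpt, so the caller must supply it.

```python
import json
import urllib.request


def build_prediction_request(url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying the same JSON body as the curl example.

    Hypothetical helper; `url` must be the prediction endpoint of the
    running HTTP service, which this document does not spell out.
    """
    payload = {"feed": [{"words": text}], "fetch": ["pooled_output"]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request once the service is running.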
### Benchmark
We tested the performance of the Paddle Serving based Bert-As-Service on V100 GPUs and compared it with the TensorFlow based Bert-As-Service. From the perspective of user configuration, we used the same batch size and number of concurrent requests for stress testing. The overall throughput obtained with 4 V100s is as follows.
yum install -y libXext libSM libXrender
pip install paddlehub paddle_serving_server paddle_serving_client
sh pip_app.sh
python bert_10.py
sh server.sh &
wget https://paddle-serving.bj.bcebos.com/bert_example/data-c.txt --no-check-certificate
head -n 500 data-c.txt > data.txt
cat data.txt | python bert_client.py
if [[ $? -eq 0 ]]; then
    echo "test success"
else
    echo "test fail"
fi
ps -ef | grep "paddle_serving_server" | grep -v grep | awk '{print $2}' | xargs kill
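## Build Bert-As-Service in 10 minutes
Given a sentence, Bert-As-Service represents it as a semantic vector and returns that vector to the user. The [Bert model](https://arxiv.org/abs/1810.04805) is a popular model in the current NLP field and has achieved good results on a variety of public NLP tasks. Using the semantic vector computed by the Bert model as input to other NLP models also helps a great deal in improving their performance. Bert-As-Service lets users easily obtain the semantic vector representation of text and apply it to their own tasks. To this end, the following steps show how to build such a service with Paddle Serving in ten minutes. All the code and files in the example can be found in the [Example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) directory of Paddle Serving.
If your Python version is 3.x, replace pip with pip3 and python with python3 in the following commands.
### Step1: Getting Model
#### Method 1: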
pip install paddlehub
python prepare_model.py 128
#### Method 2:
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
### Step2: Getting Dict and Sample Dataset
sh get_data.sh
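### Step3: Launch Service
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #launch cpu inference service
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
| Parameters | Meaning |
| ---------- | ---------------------------------------- |
| model | server configuration and model file path |
| thread | number of server-side threads |
| port | server port number |
| gpu_ids | GPU index number |
### Step4: Data Preprocessing Logic on Client Side
Paddle Serving has built-in data preprocessing logic for many classic models. To compute the Chinese Bert semantic representation, we use the ChineseBertReader class in paddle_serving_app for data preprocessing. With it, developers can easily obtain the multiple model input fields that correspond to a raw Chinese sentence.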
pip install paddle_serving_app
### Step5: Client Visit Serving
#### Method 1: RPC Inference
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
#### Method 2: HTTP Inference
Start the CPU HTTP inference service:
python bert_web_service.py bert_seq128_model/ 9292 #launch cpu inference service
Or start the GPU HTTP inference service:
python bert_web_service_gpu.py bert_seq128_model/ 9292 #launch gpu inference service
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}'
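### Benchmark
We tested the performance of the Paddle Serving based Bert-As-Service on V100 GPUs and compared it with the TensorFlow based Bert-As-Service. From the perspective of user configuration, we used the same batch size and number of concurrent requests for stress testing. The overall throughput obtained with 4 V100s is as follows.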
# Performance Optimization
Because model structures differ, prediction services consume different amounts of computing resources when performing predictions. For online prediction services, models that require fewer computing resources spend a higher proportion of their time on communication; such services are called communication-intensive. Models that require more computing resources spend a higher proportion of their time on inference computation; such services are called computation-intensive.
For a prediction service, the easiest way to determine the type of service is to look at the time ratio. Paddle Serving provides [Timeline tool](../python/examples/util/README_CN.md), which can intuitively display the time spent in each stage of the prediction service.
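The time-ratio check can be expressed in a few lines once per-stage timings are available, for example read off the Timeline output. In the sketch below, the stage names, the function names, and the 0.5 threshold are all hypothetical choices for illustration; Paddle Serving does not prescribe them.

```python
from typing import Dict, Set


def communication_share(stage_seconds: Dict[str, float],
                        communication_stages: Set[str]) -> float:
    """Return the fraction of total time spent in communication stages."""
    total = sum(stage_seconds.values())
    if total <= 0:
        raise ValueError("total time must be positive")
    comm = sum(seconds for name, seconds in stage_seconds.items()
               if name in communication_stages)
    return comm / total


def classify_service(stage_seconds: Dict[str, float],
                     communication_stages: Set[str],
                     threshold: float = 0.5) -> str:
    """Label a service by where most of its time goes (threshold is assumed)."""
    share = communication_share(stage_seconds, communication_stages)
    if share > threshold:
        return "communication-intensive"
    return "computation-intensive"


# Hypothetical timings in seconds for one request.
timings = {"client_send": 0.004, "inference": 0.002, "client_recv": 0.004}
print(classify_service(timings, {"client_send", "client_recv"}))
```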
For communication-intensive prediction services, requests can be aggregated, and within a limit that can tolerate delay, multiple prediction requests can be combined into a batch for prediction.
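Request aggregation under a delay limit can be sketched with a queue: collect requests until either the batch is full or the time budget runs out, then run one prediction on the whole batch. The function name `collect_batch` and its parameters are assumptions for illustration; this is not an API provided by Paddle Serving.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: queue.Queue, max_batch_size: int,
                  max_wait_seconds: float) -> List[Any]:
    """Gather up to max_batch_size requests, waiting at most max_wait_seconds.

    Returns whatever has arrived when the batch is full or the wait
    budget is exhausted (possibly an empty list).
    """
    batch: List[Any] = []
    deadline = time.monotonic() + max_wait_seconds
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

A larger `max_wait_seconds` gives fuller batches at the cost of added latency per request, which is the trade-off described above.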
For computation-intensive prediction services, you can use GPU prediction services instead of CPU prediction services, or increase the number of graphics cards for GPU prediction services.
Under the same conditions, the communication time of the HTTP prediction service provided by Paddle Serving is longer than that of the RPC prediction service, so for communication-intensive services, please give priority to using RPC communication.
Parameters for performance optimization:
The memory/GPU memory optimization option is enabled by default in Paddle Serving. It reduces memory/GPU memory usage and usually does not affect performance. To turn it off, pass --mem_optim_off on the command line.
ir_optim optimizes the computation graph and increases inference speed. It is turned off by default and is turned on with --ir_optim on the command line.
| Parameters | Type | Default | Description |
| ---------- | ---- | ------- | ------------------------------------------------------------ |
| mem_optim_off | - | - | Disable memory/GPU memory optimization |
| ir_optim | - | - | Enable analysis and optimization of the computation graph, including OP fusion, etc. |
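The two flags have opposite polarity: one switches a default-on feature off, the other switches a default-off feature on. The sketch below shows how such flags could map onto the `mem_optim` and `ir_optim` keyword values used in the Python API. It is an illustration built with `argparse`, not Paddle Serving's actual argument parser, and `parse_optim_flags` is a hypothetical name.

```python
import argparse
from typing import Dict, List


def parse_optim_flags(argv: List[str]) -> Dict[str, bool]:
    """Translate the two command-line flags into keyword-style settings."""
    parser = argparse.ArgumentParser()
    # Memory optimization is on by default; this flag turns it off.
    parser.add_argument("--mem_optim_off", action="store_true")
    # Graph optimization is off by default; this flag turns it on.
    parser.add_argument("--ir_optim", action="store_true")
    args = parser.parse_args(argv)
    return {"mem_optim": not args.mem_optim_off, "ir_optim": args.ir_optim}


print(parse_optim_flags([]))  # prints {'mem_optim': True, 'ir_optim': False}
```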
When the prediction service is started from Python code, the two parameters above are set through the following API:
RPC Service
from paddle_serving_server import Server
server = Server()
HTTP Service
from paddle_serving_server import WebService
class NewService(WebService):
    ...  # service-specific methods elided
new_service = NewService(name="new")
new_service.prepare_server(mem_optim=True, ir_optim=False)
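# Performance Optimization
For a prediction service, the easiest way to determine which type it belongs to is to look at the time ratio. Paddle Serving provides the [Timeline tool](../python/examples/util/README_CN.md), which intuitively displays the time spent in each stage of the prediction service.
Under the same conditions, the communication time of the HTTP prediction service provided by Paddle Serving is longer than that of the RPC prediction service, so for communication-intensive services, please give priority to RPC communication.
The memory/GPU memory optimization option is enabled by default in Paddle Serving. It reduces memory/GPU memory usage and usually does not affect performance. To turn it off, pass --mem_optim_off when starting the service from the command line.
| Parameters | Type | Default | Description |
| ---------- | ---- | ------- | ------------------------------------------------------------ |
| mem_optim_off | - | - | Disable memory/GPU memory optimization |
| ir_optim | - | - | Enable analysis and optimization of the computation graph, including OP fusion, etc. |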
from paddle_serving_server import Server
server = Server()
from paddle_serving_server import WebService
class NewService(WebService):
    ...  # service-specific methods elided
new_service = NewService(name="new")
new_service.prepare_server(mem_optim=True, ir_optim=False)
