Commit fe604bc3 authored by TeslaZhao, committed by bjjwwang

Merge pull request #1479 from TeslaZhao/develop

update doc directory
Parent 03d5c0d1
## Build Bert-As-Service in 10 minutes
([简体中文](./BERT_10_MINS_CN.md)|English)
The goal of Bert-As-Service is simple: given a sentence, the service returns its semantic-vector representation to the user. The [Bert model](https://arxiv.org/abs/1810.04805) is a popular model in the NLP field and has achieved good results on a variety of public NLP tasks; feeding the semantic vectors computed by Bert into other NLP models can also greatly improve their performance. Bert-As-Service lets users easily obtain the semantic-vector representation of text and apply it to their own tasks. To show how this is done, we build such a service with Paddle Serving in five steps that take about ten minutes. All the code and files used here can be found in the [example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) directory of Paddle Serving.
If your Python version is 3.X, replace 'pip' with 'pip3' and 'python' with 'python3' in the following commands.
### Step1: Getting Model
#### Method 1:
This example uses the [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [PaddleHub](https://github.com/PaddlePaddle/PaddleHub).
Install paddlehub first:
```
pip install paddlehub
```
Then run:
```
python prepare_model.py 128
```
**PaddleHub only supports Python 3.5+.**
The 128 in the command above is the max_seq_len of the BERT model, i.e. the length of a sample after preprocessing.
The config file and model file for the server side are saved in the folder bert_seq128_model.
The config file generated for the client side is saved in the folder bert_seq128_client.
#### Method 2:
You can also download the above model (max_seq_len=128) from BOS. After decompression, the config file and model file for the server side are stored in the bert_chinese_L-12_H-768_A-12_model folder, and the config file generated for the client side is stored in the bert_chinese_L-12_H-768_A-12_client folder:
```shell
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
```
### Step2: Getting Dict and Sample Dataset
```
sh get_data.sh
```
This script downloads the Chinese dictionary file vocab.txt and the Chinese sample data file data-c.txt.
### Step3: Launch Service
To start the CPU inference service, run:
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #cpu inference service
```
Or, to start the GPU inference service, run:
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
```
| Parameters | Meaning |
| ---------- | ---------------------------------------- |
| model | server configuration and model file path |
| thread | server-side threads |
| port | server port number |
| gpu_ids | GPU index number |
### Step4: Data Preprocessing Logic on the Client Side
Paddle Serving ships with built-in data preprocessing logic for many common models. For computing the Chinese Bert semantic representation we use the ChineseBertReader class from paddle_serving_app, which lets developers easily obtain the model input fields corresponding to a raw Chinese sentence.
Install paddle_serving_app:
```shell
pip install paddle_serving_app
```
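As a quick illustration of this preprocessing step, the hedged sketch below feeds a raw sentence to ChineseBertReader; the option name max_seq_len and the returned field names follow the bert example and may differ across paddle_serving_app versions.
```python
# Hedged sketch: preprocess one raw sentence with ChineseBertReader.
# "max_seq_len" must match the value used when exporting the model (128 here).
from paddle_serving_app.reader import ChineseBertReader

reader = ChineseBertReader({"max_seq_len": 128})
feed_dict = reader.process("A sample sentence to encode.")
# feed_dict maps model input names (e.g. input_ids, position_ids, segment_ids,
# input_mask) to the preprocessed values expected by the BERT model.
print(list(feed_dict.keys()))
```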
### Step5: Client Visit Serving
#### Method 1: RPC Inference
Run:
```
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```
The client reads data from data-c.txt and sends prediction requests; the prediction result is the vector representation of each sentence (it is not printed because the output is large).
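For reference, a minimal sketch of the RPC call made by bert_client.py is shown below; it assumes the Step 3 server is listening on 127.0.0.1:9292 and the fetch variable is named pooled_output, and the exact predict signature may vary across Serving versions.
```python
# Minimal RPC client sketch (see bert_client.py in the example for the full script).
import numpy as np
from paddle_serving_client import Client
from paddle_serving_app.reader import ChineseBertReader

client = Client()
client.load_client_config("bert_seq128_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])  # assumed server address from Step 3

reader = ChineseBertReader({"max_seq_len": 128})
feed_dict = reader.process("A sample sentence to encode.")
for key in feed_dict:
    # reshape each field to (max_seq_len, 1) as the seq128 model expects
    feed_dict[key] = np.array(feed_dict[key]).reshape((128, 1))

result = client.predict(feed=feed_dict, fetch=["pooled_output"], batch=False)
```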
#### Method 2: HTTP Inference
This method involves two steps:
1. Start an HTTP prediction server.
To start the CPU HTTP inference service, run:
```
python bert_web_service.py bert_seq128_model/ 9292 #launch cpu inference service
```
Or, to start the GPU HTTP inference service, first run:
```
export CUDA_VISIBLE_DEVICES=0,1
```
This environment variable specifies which GPUs to use; the command above selects GPU 0 and GPU 1. Then run:
```
python bert_web_service_gpu.py bert_seq128_model/ 9292 #launch gpu inference service
```
2. Prediction via HTTP request
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```
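The same request can also be sent from Python; the sketch below simply mirrors the curl command above using the requests library.
```python
# Sketch: HTTP inference request equivalent to the curl command above.
import requests

payload = {"feed": [{"words": "hello"}], "fetch": ["pooled_output"]}
resp = requests.post("http://127.0.0.1:9292/bert/prediction", json=payload)
print(resp.json())
```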
### Benchmark
We tested the performance of the Paddle Serving based Bert-As-Service on V100 GPUs and compared it with a Tensorflow based Bert-As-Service. From the user-configuration perspective, we used the same batch size and concurrency for the stress test. The overall throughput obtained on 4 V100s is shown below.
![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)
<!--
yum install -y libXext libSM libXrender
pip install paddlehub paddle_serving_server paddle_serving_client
sh pip_app.sh
python bert_10.py
sh server.sh &
wget https://paddle-serving.bj.bcebos.com/bert_example/data-c.txt --no-check-certificate
head -n 500 data-c.txt > data.txt
cat data.txt | python bert_client.py
if [[ $? -eq 0 ]]; then
echo "test success"
else
echo "test fail"
fi
ps -ef | grep "paddle_serving_server" | grep -v grep | awk '{print $2}' | xargs kill
-->
## Build Bert-As-Service in 10 Minutes
(简体中文|[English](./BERT_10_MINS.md))
The goal of Bert-As-Service is simple: given a sentence, the service returns its semantic-vector representation to the user. The [Bert model](https://arxiv.org/abs/1810.04805) is a popular model in the NLP field and has achieved good results on a variety of public NLP tasks; feeding the semantic vectors computed by Bert into other NLP models also greatly improves their performance. Bert-As-Service lets users easily obtain the semantic-vector representation of text and apply it to their own tasks. The steps below show how to build such a service with Paddle Serving within ten minutes. All the code and files used here can be found in the [example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) directory of Paddle Serving.
If your Python version is 3.X, replace 'pip' with 'pip3' and 'python' with 'python3' in the following commands.
### Step1: Get the Model
#### Method 1:
This example uses the [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [PaddleHub](https://github.com/PaddlePaddle/PaddleHub).
Install paddlehub first:
```
pip install paddlehub
```
Run:
```
python prepare_model.py 128
```
The argument 128 is the max_seq_len of the BERT model, i.e. the length of a sample after preprocessing.
The generated server-side config file and model file are stored in the bert_seq128_model folder.
The generated client-side config file is stored in the bert_seq128_client folder.
#### Method 2:
You can also download the above model (max_seq_len=128) directly from BOS. After decompression, the server-side config file and model file are stored in the bert_chinese_L-12_H-768_A-12_model folder, and the client-side config file is stored in the bert_chinese_L-12_H-768_A-12_client folder:
```shell
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
```
### Step2: Get the Dictionary and Sample Data
```
sh get_data.sh
```
This script downloads the Chinese dictionary file vocab.txt and the Chinese sample data file data-c.txt.
### Step3: Launch the Service
To start the CPU inference service, run:
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #start the CPU inference service
```
Or, to start the GPU inference service, run:
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #start the GPU inference service on GPU 0
```
| Parameter | Meaning |
| ------- | -------------------------- |
| model | server-side configuration and model file path |
| thread | number of server-side threads |
| port | server port number |
| gpu_ids | GPU index numbers |
### Step4: Data Preprocessing Logic on the Client Side
Paddle Serving ships with built-in data preprocessing logic for many common models. For computing the Chinese Bert semantic representation we use the ChineseBertReader class from paddle_serving_app, which lets developers easily obtain the model input fields corresponding to a raw Chinese sentence.
Install paddle_serving_app:
```shell
pip install paddle_serving_app
```
### Step5: Client Access
#### Method 1: RPC Inference
Run:
```
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```
The client reads the data in data-c.txt and sends prediction requests; the prediction result is the vector representation of each text (the script does not print it because the output is large). The server address can be modified inside the script.
#### Method 2: HTTP Inference
This method involves two steps:
1. Start an HTTP prediction server.
To start the CPU HTTP inference service, run:
```
python bert_web_service.py bert_seq128_model/ 9292 #start the CPU inference service
```
Or, to start the GPU HTTP inference service, first run:
```
export CUDA_VISIBLE_DEVICES=0,1
```
This environment variable specifies which GPUs the GPU inference service uses; the example selects the two GPUs with indices 0 and 1. Then run:
```
python bert_web_service_gpu.py bert_seq128_model/ 9292 #start the GPU inference service
```
2. Send a prediction request via HTTP.
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```
### Benchmark
We tested the performance of the Paddle Serving based Bert-As-Service on V100 GPUs and compared it with a Tensorflow based Bert-As-Service. From the user-configuration perspective, we used the same batch size and concurrency for the stress test. The overall throughput obtained on 4 V100s is shown below.
![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)
# Introduction to the gRPC API
- [1. Comparison with the bRPC API](#1-comparison-with-the-brpc-api)
  - [1.1 Server side](#11-server-side)
  - [1.2 Client side](#12-client-side)
  - [1.3 Other notes](#13-other-notes)
- [2. Example: Linear Regression Prediction Service](#2-example-linear-regression-prediction-service)
  - [Get the data](#get-the-data)
  - [Start the gRPC server](#start-the-grpc-server)
  - [Client-side prediction](#client-side-prediction)
    - [Synchronous prediction](#synchronous-prediction)
    - [Asynchronous prediction](#asynchronous-prediction)
    - [Batch prediction](#batch-prediction)
    - [Generic pb prediction](#generic-pb-prediction)
    - [Prediction timeout](#prediction-timeout)
    - [List input](#list-input)
- [3. More Examples](#3-more-examples)
With the gRPC API, clients written in different languages can call the service from Windows/Linux/macOS. The gRPC implementation is structured as follows:
![](images/grpc_impl.png)
## 1. Comparison with the bRPC API
#### 1.1 Server side
* Because the gRPC server internally contains a bRPC client, the bRPC client is initialized inside the gRPC server. The gRPC server's `load_model_config` function therefore adds a `client_config_path` parameter, which specifies the path of the data-format configuration file used when initializing the bRPC client. When `client_config_path` is not specified it defaults to None, in which case `load_model_config` falls back to `<server_config_path>/serving_server_conf.prototxt`, meaning the bRPC client and the bRPC server share the same data-format configuration file.
```
def load_model_config(self, server_config_paths, client_config_path=None)
```
In some examples the configuration files of the bRPC server and the bRPC client differ (for instance, in cube local the client data is first handed to cube and only after cube processing passed to the inference library). In that case the gRPC server must set the gRPC client configuration `client_config_path` manually, as illustrated in the sketch below.
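For illustration only, a hedged sketch of setting `client_config_path` explicitly follows; it assumes the gRPC server class is `MultiLangServer` in paddle_serving_server and uses placeholder paths, so adapt both to your Serving version and model.
```python
# Hedged sketch: load a model on the gRPC server with an explicit client config.
# MultiLangServer and the paths below are assumptions / placeholders.
from paddle_serving_server import MultiLangServer

server = MultiLangServer()
# When the bRPC client and bRPC server configs differ (e.g. the cube local case),
# point client_config_path at the client-side prototxt explicitly.
server.load_model_config(
    "uci_housing_model",
    client_config_path="uci_housing_client/serving_client_conf.prototxt")
```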
#### 1.2 Client side
* The gRPC client removes the `load_client_config` step: the `connect` step fetches the corresponding prototxt over RPC (any endpoint will do).
* The gRPC client sets the timeout over RPC (the call form is kept consistent with the bRPC client). Because the bRPC client cannot change its timeout after `connect`, when the gRPC server receives a timeout-change request it recreates the bRPC client instance with the new timeout; meanwhile the gRPC client also sets the gRPC deadline. **Note: the timeout-setting API and the inference API must not be called at the same time (they are not thread safe); for performance reasons no lock is added for now.**
* The gRPC client's `predict` function adds the `asyn` and `is_python` parameters (see the sketch after the numbered list below):
```
def predict(self, feed, fetch, batch=True, need_variant_tag=False, asyn=False, is_python=True, log_id=0)
```
1. `asyn` selects asynchronous calling. With `asyn=True` the call is asynchronous and returns a `MultiLangPredictFuture` object; call `MultiLangPredictFuture.result()` to block and fetch the prediction. With `asyn=False` the call is synchronous.
2. `is_python` selects the proto format. With `is_python=True` data is transferred as numpy bytes, which currently only works with Python; with `is_python=False` data is transferred in a generic format, which is more portable. Transferring numpy bytes is much faster than the generic format (see [#654](https://github.com/PaddlePaddle/Serving/pull/654)).
3. `batch` indicates whether the feed data needs an extra dimension. With `batch=True` the feed data is used as-is; with `batch=False` a batch dimension is added. For example, if feed.shape is originally [2,2], `batch=False` reshapes it to [1,2,2].
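The hedged sketch below strings these calls together using the fit_a_line example from section 2; `MultiLangClient`, the feed name `x`, the fetch name `price`, and `set_rpc_timeout_ms` are taken from that example and the bRPC client API, and may differ across Serving versions.
```python
# Hedged sketch of the gRPC client API described above (names follow the
# fit_a_line example and may differ across Serving versions).
import numpy as np
from paddle_serving_client import MultiLangClient

client = MultiLangClient()
client.connect(["127.0.0.1:9393"])  # no load_client_config: the prototxt is fetched over RPC
client.set_rpc_timeout_ms(5000)     # assumed to mirror the bRPC client's timeout call

x = np.random.rand(1, 13).astype("float32")

# Synchronous call (asyn defaults to False)
fetch_map = client.predict(feed={"x": x}, fetch=["price"], batch=True)

# Asynchronous call: returns a MultiLangPredictFuture; result() blocks
future = client.predict(feed={"x": x}, fetch=["price"], batch=True, asyn=True)
fetch_map = future.result()

# fetch_map carries a "status_code" field that distinguishes a normal prediction
# from a gRPC error (see the timeout example).
print(fetch_map)
```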
#### 1.3 Other notes
* Exception handling: when the bRPC client inside the gRPC server fails to predict (returns `None`), the gRPC client also returns None. Other gRPC exceptions are caught inside the client, and a "status_code" field is added to the returned fetch_map to indicate whether the prediction succeeded (see the timeout example).
* Because gRPC only supports the pick_first and round_robin load-balancing policies, the ABTEST feature is not fully supported yet.
* System compatibility:
* [x] CentOS
* [x] macOS
* [x] Windows
* Supported client languages:
- Python
- Java
- Go
## 2. Example: Linear Regression Prediction Service
The following is a linear regression prediction example implemented over gRPC; the full code is available at this [link](../python/examples/grpc_impl_example/fit_a_line).
#### Get the data
```shell
sh get_data.sh
```
#### Start the gRPC server
``` shell
python test_server.py uci_housing_model/
```
You can also start the default gRPC service with a single command:
```shell
python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9393 --use_multilang
```
Note: the --use_multilang flag enables the multi-language client.
### Client-side prediction
#### Synchronous prediction
``` shell
python test_sync_client.py
```
#### Asynchronous prediction
``` shell
python test_asyn_client.py
```
#### Batch prediction
``` shell
python test_batch_client.py
```
#### Prediction timeout
``` shell
python test_timeout_client.py
```
## 3. More Examples
See the example files under [`python/examples/grpc_impl_example`](../python/examples/grpc_impl_example).
# Performance Optimization
([简体中文](./PERFORMANCE_OPTIM_CN.md)|English)
Different model structures consume different amounts of computing resources at prediction time. For online prediction services, a model that requires little computation spends a larger share of its time on communication and is called a communication-intensive service; a model that requires heavy computation spends most of its time on inference and is called a computation-intensive service.
For a prediction service, the easiest way to determine the type of service is to look at the time ratio. Paddle Serving provides [Timeline tool](../python/examples/util/README_CN.md), which can intuitively display the time spent in each stage of the prediction service.
For communication-intensive prediction services, requests can be aggregated, and within a limit that can tolerate delay, multiple prediction requests can be combined into a batch for prediction.
For computation-intensive prediction services, you can use GPU prediction services instead of CPU prediction services, or increase the number of graphics cards for GPU prediction services.
Under the same conditions, the communication time of the HTTP prediction service provided by Paddle Serving is longer than that of the RPC prediction service, so for communication-intensive services, please give priority to using RPC communication.
Parameters for performance optimization:
The memory/graphics-memory optimization option is enabled by default in Paddle Serving; it reduces memory/graphics-memory usage and usually does not affect performance. If you need to turn it off, pass --mem_optim_off on the command line.
ir_optim optimizes the computation graph and increases inference speed. It is turned off by default and can be turned on with --ir_optim on the command line.
| Parameters | Type | Default | Description |
| ---------- | ---- | ------- | ------------------------------------------------------------ |
| mem_optim_off | - | - | Disable memory / graphics-memory optimization |
| ir_optim | - | - | Enable analysis and optimization of the computation graph, including OP fusion, etc. |
When the prediction service is started from Python code, the two parameters above are set through the following APIs:
RPC Service
```
from paddle_serving_server import Server
server = Server()
...
server.set_memory_optimize(mem_optim)
server.set_ir_optimize(ir_optim)
...
```
HTTP Service
```
from paddle_serving_server import WebService
class NewService(WebService):
...
new_service = NewService(name="new")
...
new_service.prepare_server(mem_optim=True, ir_optim=False)
...
```
# Performance Optimization
(简体中文|[English](./PERFORMANCE_OPTIM.md))
Because model structures differ, different prediction services consume different amounts of computing resources. For online prediction services, a model that requires little computation spends a larger share of its time on communication and is called a communication-intensive service; a model that requires heavy computation spends most of its time on inference and is called a computation-intensive service. These two service types can be optimized in different ways according to actual needs.
To decide which type a prediction service belongs to, the easiest way is to look at the time breakdown. Paddle Serving provides a [Timeline tool](../python/examples/util/README_CN.md) that visualizes the time spent in each stage of the prediction service.
For communication-intensive prediction services, requests can be aggregated: within a tolerable latency limit, multiple prediction requests can be merged into one batch for prediction.
For computation-intensive prediction services, a GPU prediction service can be used instead of a CPU one, or the number of GPUs for the GPU prediction service can be increased.
Under the same conditions, the communication time of the HTTP prediction service provided by Paddle Serving is longer than that of the RPC prediction service, so for communication-intensive services please prefer RPC communication.
Parameters related to performance optimization:
The memory/graphics-memory optimization option is enabled by default in Paddle Serving; it reduces memory/graphics-memory usage and usually does not affect performance. If you need to turn it off, pass --mem_optim_off on the command line.
ir_optim optimizes the computation graph and increases inference speed. It is turned off by default and can be turned on with --ir_optim on the command line.
| Parameter | Type | Default | Meaning |
| --------- | ---- | ------ | -------------------------------- |
| mem_optim_off | - | - | Disable memory / graphics-memory optimization |
| ir_optim | - | - | Enable analysis and optimization of the computation graph, including OP fusion, etc. |
When the prediction service is started from Python code, the two parameters above are set through the following APIs:
RPC Service
```
from paddle_serving_server import Server
server = Server()
...
server.set_memory_optimize(mem_optim)
server.set_ir_optimize(ir_optim)
...
```
HTTP Service
```
from paddle_serving_server import WebService
class NewService(WebService):
...
new_service = NewService(name="new")
...
new_service.prepare_server(mem_optim=True, ir_optim=False)
...
```