## BERT as Service

([简体中文](./README_CN.md)|English)

In this example, a BERT model is used for semantic understanding prediction: each piece of text is encoded as a vector, which can then be used for further analysis and prediction.

### Getting Model

This example uses the [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [PaddleHub](https://github.com/PaddlePaddle/PaddleHub).

Install paddlehub first:
```
pip install paddlehub
```

Run
```
python prepare_model.py 20
```

The 20 in the command above is the max_seq_len of the BERT model, i.e. the length of each sample after preprocessing.
The config file and model file for the server side are saved in the folder bert_seq20_model.
The config file generated for the client side is saved in the folder bert_seq20_client.
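
For reference, below is a rough sketch of what prepare_model.py typically does: it pulls the BERT module from PaddleHub and exports a serving model plus client config via paddle_serving_client.io.save_model. The module handle, variable names, and save_model arguments here are assumptions, not a verbatim copy of the shipped script.

```
# Sketch of prepare_model.py (assumed structure, not a verbatim copy of the shipped script).
import sys
import paddlehub as hub
import paddle_serving_client.io as serving_io

max_seq_len = int(sys.argv[1])  # e.g. 20, the value passed on the command line
module = hub.Module(name="bert_chinese_L-12_H-768_A-12")
inputs, outputs, program = module.context(trainable=False, max_seq_len=max_seq_len)

# Save the server-side model/config to bert_seq20_model and the client-side
# config to bert_seq20_client (for max_seq_len == 20).
serving_io.save_model(
    "bert_seq{}_model".format(max_seq_len),
    "bert_seq{}_client".format(max_seq_len),
    {
        "input_ids": inputs["input_ids"],
        "position_ids": inputs["position_ids"],
        "segment_ids": inputs["segment_ids"],
        "input_mask": inputs["input_mask"],
    },
    {"pooled_output": outputs["pooled_output"]},
    program)
```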

### Getting Dict and Sample Dataset

```
sh get_data.sh
```
This script downloads the Chinese dictionary file vocab.txt and the Chinese sample data data-c.txt.

### RPC Inference Service
Run
```
python -m paddle_serving_server.serve --model bert_seq20_model/ --port 9292  #cpu inference service
```
Or
```
python -m paddle_serving_server_gpu.serve --model bert_seq20_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
```

### RPC Inference

Before prediction, install paddle_serving_app. This module provides data preprocessing for the BERT model.
```
pip install paddle_serving_app
```
Run
```
head data-c.txt | python bert_client.py --model bert_seq20_client/serving_client_conf.prototxt
```

The client reads data from data-c.txt and sends prediction requests; the prediction result is a word vector for each sample. (Because the word vector contains a large amount of data, it is not printed.)
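
For reference, here is a minimal sketch of what such an RPC client looks like, assuming the Client class from paddle_serving_client and the ChineseBertReader preprocessor from paddle_serving_app; the import path and reader arguments are assumptions, so see the shipped bert_client.py for the exact code.

```
# Minimal RPC client sketch (assumed structure; see bert_client.py for the real script).
import sys
from paddle_serving_client import Client
from paddle_serving_app import ChineseBertReader  # import path may differ by version

reader = ChineseBertReader({"max_seq_len": 20})  # must match the exported max_seq_len
client = Client()
client.load_client_config("bert_seq20_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])

for line in sys.stdin:
    feed_dict = reader.process(line)  # tokenize and pad the raw Chinese text
    result = client.predict(feed=feed_dict, fetch=["pooled_output"])
    # result["pooled_output"] holds the sentence vector; it is large, so it is not printed here.
```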

### HTTP Inference Service
```
 export CUDA_VISIBLE_DEVICES=0,1
```
Set this environment variable to specify which GPUs are used; the command above makes GPU 0 and GPU 1 available.
```
 python bert_web_service.py bert_seq20_model/ 9292 #launch gpu inference service
```
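
For reference, below is a rough sketch of how such a web service is usually assembled from the WebService class in paddle_serving_server_gpu; the class and method names here are assumptions about that API, so check the shipped bert_web_service.py for the exact code.

```
# Rough sketch of a BERT web service (assumed API; see the shipped bert_web_service.py).
import os
import sys
from paddle_serving_server_gpu.web_service import WebService
from paddle_serving_app import ChineseBertReader  # import path may differ by version

class BertService(WebService):
    def load(self):
        self.reader = ChineseBertReader({"max_seq_len": 20})

    def preprocess(self, feed={}, fetch=[]):
        # Turn the raw "words" field of each request into BERT feed tensors.
        feed_batch = [self.reader.process(ins["words"]) for ins in feed]
        return feed_batch, fetch

bert_service = BertService(name="bert")      # "bert" gives the /bert/prediction URL
bert_service.load()
bert_service.load_model_config(sys.argv[1])  # e.g. bert_seq20_model/
bert_service.set_gpus(os.environ["CUDA_VISIBLE_DEVICES"])
bert_service.prepare_server(workdir="workdir", port=int(sys.argv[2]), device="gpu")
bert_service.run_server()
```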
### HTTP Inference 

```
curl -H "Content-Type:application/json" -X POST -d '{"words": "hello", "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```
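
The same request can also be sent from Python with the requests library; a small equivalent snippet, assuming the service started above on port 9292:

```
# Python equivalent of the curl command above.
import requests

url = "http://127.0.0.1:9292/bert/prediction"
payload = {"words": "hello", "fetch": ["pooled_output"]}
resp = requests.post(url, json=payload)
print(resp.json())  # the response carries the fetched "pooled_output" vector
```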

### Benchmark

Model: bert_chinese_L-12_H-768_A-12

GPU: V100 * 1

CUDA/cuDNN version: CUDA 9.2, cuDNN 7.1.4


In the test, the 10 thousand samples in the sample data are duplicated to 100 thousand samples. Each client thread sends an equal share of the samples (100 thousand divided by the number of threads). The batch size is 1, max_seq_len is 20, and times are in seconds.

When the number of client threads is 4, the prediction throughput reaches about 432 samples per second (100 thousand samples / 231.45 s ≈ 432).
Because a single GPU can only perform its computations serially, increasing the number of client threads only reduces the idle time of the GPU. Therefore, once the number of threads reaches 4, adding more threads does not improve the prediction speed further.

| client thread num | preprocess | client infer | op0   | op1    | op2  | postprocess | total  |
| ----------------- | ---------- | ------------ | ----- | ------ | ---- | ----------- | ------ |
| 1                  | 3.05   | 290.54       | 0.37  | 239.15 | 6.43 | 0.71    | 365.63 |
| 4                  | 0.85   | 213.66       | 0.091 | 200.39 | 1.62 | 0.2     | 231.45 |
| 8                  | 0.42   | 223.12       | 0.043 | 110.99 | 0.8  | 0.098   | 232.05 |
| 12                 | 0.32   | 225.26       | 0.029 | 73.87  | 0.53 | 0.078   | 231.45 |
| 16                 | 0.23   | 227.26       | 0.022 | 55.61  | 0.4  | 0.056   | 231.9  |

The following is the bar chart of latency versus the number of client threads:
![bert benchmark](../../../doc/bert-benchmark-batch-size-1.png)