# Low-Precision Deployment for Paddle Serving
(English|[简体中文](./LOW_PRECISION_DEPLOYMENT_CN.md))

Intel CPUs support int8 and bfloat16 models; NVIDIA TensorRT supports int8 and float16 models.
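
The support matrix above can be written down as a small lookup, which is handy for validating a deployment configuration before starting a service. This is only an illustrative sketch: the backend keys `intel_cpu` and `nvidia_tensorrt` and the helper `is_supported` are hypothetical names, not Paddle Serving identifiers.

```python
# Low-precision formats per backend, as stated in this document.
# The dictionary keys are hypothetical labels chosen for this sketch.
SUPPORTED_PRECISIONS = {
    "intel_cpu": {"int8", "bfloat16"},
    "nvidia_tensorrt": {"int8", "float16"},
}


def is_supported(backend, precision):
    """Return True if the backend is listed as supporting the precision."""
    return precision in SUPPORTED_PRECISIONS.get(backend, set())
```

For example, `is_supported("nvidia_tensorrt", "int8")` is true, while `is_supported("intel_cpu", "float16")` is false.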

## Obtain the quantized model with PaddleSlim
To train a low-precision model, refer to the [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html) documentation.

## Deploy the PaddleSlim quantized model with Paddle Serving in NVIDIA TensorRT int8 mode

First, download the [ResNet50 int8 model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert it into a Paddle Serving model.
```
wget https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz
tar zxvf ResNet50_quant.tar.gz

python -m paddle_serving_client.convert --dirname ResNet50_quant
```
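
If you prefer to script these steps, the same download, extract, and convert sequence can be sketched in Python with the standard library. The helper names `download`, `extract`, and `convert_command` are hypothetical; only the URL and the `--dirname` flag come from the commands above. A typical call sequence would be `download(MODEL_URL, "ResNet50_quant.tar.gz")`, then `extract("ResNet50_quant.tar.gz")`, then `subprocess.run(convert_command("ResNet50_quant"), check=True)`.

```python
import sys
import tarfile
import urllib.request

# Same archive as in the wget command above.
MODEL_URL = (
    "https://paddle-inference-dist.bj.bcebos.com/"
    "inference_demo/python/resnet50/ResNet50_quant.tar.gz"
)


def download(url, dest_path):
    """Download url to dest_path (hypothetical helper, equivalent to wget)."""
    urllib.request.urlretrieve(url, dest_path)
    return dest_path


def extract(archive_path, dest_dir="."):
    """Unpack a .tar.gz archive into dest_dir (equivalent to tar zxvf)."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)


def convert_command(model_dir):
    """Build the argument list for the conversion step shown above."""
    return [
        sys.executable, "-m", "paddle_serving_client.convert",
        "--dirname", model_dir,
    ]
```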
Start the RPC service, specifying the GPU ID and the precision mode:
```
python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_ids 0 --use_gpu --use_trt --precision int8 
```
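
The server may need some time to start before it accepts connections, so a script that launches it and then sends requests should wait for the port first. The following is a minimal sketch of such a readiness check using only the standard library; `wait_for_port` is a hypothetical helper, not part of Paddle Serving, and the default timeout is an arbitrary assumption. With the command above, you would call `wait_for_port("127.0.0.1", 9393)` before connecting the client.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Retries every `interval` seconds and returns False if no connection
    could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```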
Send a request to the service with the client:
```
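from paddle_serving_client import Client
from paddle_serving_app.reader import Sequential, File2Image, Resize, CenterCrop
from paddle_serving_app.reader import RGB2BGR, Transpose, Div, Normalize

# Load the client config; adjust the path to the client directory
# produced by the convert step if it differs.
client = Client()
client.load_client_config(
    "resnet_v2_50_imagenet_client/serving_client_conf.prototxt")
# Connect to the RPC service started above (port 9393).
client.connect(["127.0.0.1:9393"])

# Preprocessing: read the image, resize and crop to 224x224, reorder to
# channel-first layout, scale to [0, 1], then normalize per channel.
seq = Sequential([
    File2Image(), Resize(256), CenterCrop(224), RGB2BGR(), Transpose((2, 0, 1)),
    Div(255), Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225], True)
])

image_file = "daisy.jpg"
img = seq(image_file)
# Run inference and print the flattened score vector.
fetch_map = client.predict(feed={"image": img}, fetch=["score"])
print(fetch_map["score"].reshape(-1))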
```
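
The client prints the full score vector, which is hard to read for a classifier with many classes. As a sketch of one possible post-processing step, the helper below picks the highest-scoring class indices from any flat sequence of scores. `top_k` is a hypothetical helper, not part of Paddle Serving, and mapping indices to label names is left out because it depends on the label file you use.

```python
def top_k(scores, k=5):
    """Return the k highest (class_index, score) pairs, best first."""
    indexed = list(enumerate(float(s) for s in scores))
    # Sort by score in descending order; ties keep their original order.
    indexed.sort(key=lambda pair: pair[1], reverse=True)
    return indexed[:k]


# Small illustrative input; real scores come from the prediction result.
example_scores = [0.05, 0.7, 0.1, 0.15]
print(top_k(example_scores, k=2))  # [(1, 0.7), (3, 0.15)]
```

With the client code above, this could be called as `top_k(fetch_map["score"].reshape(-1))`.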

## Reference
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* [Deploy the quantized model using Paddle Inference on Intel CPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)
* [Deploy the quantized model using Paddle Inference on NVIDIA GPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)