diff --git a/README.md b/README.md index 8bc5146864676883f1fcdb6c4f10781acf3d0db7..d018db0a1f0a358bde750da0075b1736b15a7d39 100644 --- a/README.md +++ b/README.md @@ -261,6 +261,8 @@ curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"url": "https://pa - [How to develop a new Web Service?](doc/NEW_WEB_SERVICE.md) - [Golang client](doc/IMDB_GO_CLIENT.md) - [Compile from source code](doc/COMPILE.md) +- [Deploy Web Service with uWSGI](doc/UWSGI_DEPLOY.md) +- [Hot loading for model file](doc/HOT_LOADING_IN_SERVING.md) ### About Efficiency - [How to profile Paddle Serving latency?](python/examples/util) diff --git a/README_CN.md b/README_CN.md index 79b698e2452306ac8ffaeb6bb88057f3c578db0f..3e39e1854c9fda8545172dfb7679fde881827741 100644 --- a/README_CN.md +++ b/README_CN.md @@ -267,6 +267,8 @@ curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"url": "https://pa - [如何开发一个新的Web Service?](doc/NEW_WEB_SERVICE_CN.md) - [如何在Paddle Serving使用Go Client?](doc/IMDB_GO_CLIENT_CN.md) - [如何编译PaddleServing?](doc/COMPILE_CN.md) +- [如何使用uWSGI部署Web Service](doc/UWSGI_DEPLOY_CN.md) +- [如何实现模型文件热加载](doc/HOT_LOADING_IN_SERVING_CN.md) ### 关于Paddle Serving性能 - [如何测试Paddle Serving性能?](python/examples/util/) diff --git a/doc/PERFORMANCE_OPTIM.md b/doc/PERFORMANCE_OPTIM.md index 4b025e94d6f8d3ed69fb76898eb6afada9ca6613..0de06c16988d14d8f92eced491db7dc423831afe 100644 --- a/doc/PERFORMANCE_OPTIM.md +++ b/doc/PERFORMANCE_OPTIM.md @@ -1,8 +1,10 @@ -# Performance optimization +# Performance Optimization + +([简体中文](./PERFORMANCE_OPTIM_CN.md)|English) Due to different model structures, different prediction services consume different computing resources when performing predictions. For online prediction services, models that require less computing resources will have a higher proportion of communication time cost, which is called communication-intensive service. Models that require more computing resources have a higher time cost for inference calculations, which is called computationa-intensive services. -For a prediction service, the easiest way to determine what type it is is to look at the time ratio. Paddle Serving provides [Timeline tool] (../python/examples/util/README_CN.md), which can intuitively display the time spent in each stage of the prediction service. +For a prediction service, the easiest way to determine what type it is is to look at the time ratio. Paddle Serving provides [Timeline tool](../python/examples/util/README_CN.md), which can intuitively display the time spent in each stage of the prediction service. For communication-intensive prediction services, requests can be aggregated, and within a limit that can tolerate delay, multiple prediction requests can be combined into a batch for prediction. 
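As a concrete illustration of the request aggregation recommended in `doc/PERFORMANCE_OPTIM.md` above, the sketch below combines several samples into a single `predict` call, following the feed-batch pattern used by `python/examples/imagenet/benchmark.py` later in this diff. It is only a sketch: the client config path, endpoint, and image URLs are placeholders, and it assumes an ImageNet ResNet50 service is already running.

```
# Illustrative sketch only: aggregate requests that arrive within the
# tolerated delay window and send them as one batch, so the per-request
# communication overhead is paid once per batch instead of once per sample.
# The config path, endpoint, and image URLs below are placeholders.
from paddle_serving_client import Client
from paddle_serving_app.reader import (Sequential, URL2Image, Resize,
                                       CenterCrop, RGB2BGR, Transpose, Div,
                                       Normalize)

preprocess = Sequential([
    URL2Image(), Resize(256), CenterCrop(224), RGB2BGR(), Transpose((2, 0, 1)),
    Div(255), Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225], True)
])

client = Client()
client.load_client_config(
    "ResNet50_vd_client_config/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9696"])

urls = ["https://example.com/daisy.jpg"] * 4  # placeholder image URLs
feed_batch = [{"image": preprocess(url)} for url in urls]
fetch_map = client.predict(feed=feed_batch, fetch=["score"])
```

For computation-intensive services the inference time dominates, so batching of this kind yields less benefit than it does for communication-intensive ones.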
diff --git a/doc/PERFORMANCE_OPTIM_CN.md b/doc/PERFORMANCE_OPTIM_CN.md index 7bd64d3e2d645c9328ead55e867d0b97946840ad..1a2c3840942930060a1805bcb999f01b5780cbae 100644 --- a/doc/PERFORMANCE_OPTIM_CN.md +++ b/doc/PERFORMANCE_OPTIM_CN.md @@ -1,5 +1,7 @@ # 性能优化 +(简体中文|[English](./PERFORMANCE_OPTIM.md)) + 由于模型结构的不同,在执行预测时不同的预测服务对计算资源的消耗也不相同。对于在线的预测服务来说,对计算资源要求较少的模型,通信的时间成本占比就会较高,称为通信密集型服务,对计算资源要求较多的模型,推理计算的时间成本较高,称为计算密集型服务。对于这两种服务类型,可以根据实际需求采取不同的方式进行优化 对于一个预测服务来说,想要判断属于哪种类型,最简单的方法就是看时间占比,Paddle Serving提供了[Timeline工具](../python/examples/util/README_CN.md),可以直观的展现预测服务中各阶段的耗时。 diff --git a/python/examples/bert/README.md b/python/examples/bert/README.md index d598fc3b057c85d80e8d10549f7c5b0cf1e725fb..4cfa5590ffb4501c78e9e6ff886f5f82c94dd2db 100644 --- a/python/examples/bert/README.md +++ b/python/examples/bert/README.md @@ -71,28 +71,3 @@ set environmental variable to specify which gpus are used, the command above mea ``` curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction ``` - -### Benchmark - -Model:bert_chinese_L-12_H-768_A-12 - -GPU:GPU V100 * 1 - -CUDA/cudnn Version:CUDA 9.2,cudnn 7.1.4 - - -In the test, 10 thousand samples in the sample data are copied into 100 thousand samples. Each client thread sends a sample of the number of threads. The batch size is 1, the max_seq_len is 20(not 128 as described above), and the time unit is seconds. - -When the number of client threads is 4, the prediction speed can reach 432 samples per second. -Because a single GPU can only perform serial calculations internally, increasing the number of client threads can only reduce the idle time of the GPU. Therefore, after the number of threads reaches 4, the increase in the number of threads does not improve the prediction speed. - -| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total | -| ------------------ | ------ | ------------ | ----- | ------ | ---- | ------- | ------ | -| 1 | 3.05 | 290.54 | 0.37 | 239.15 | 6.43 | 0.71 | 365.63 | -| 4 | 0.85 | 213.66 | 0.091 | 200.39 | 1.62 | 0.2 | 231.45 | -| 8 | 0.42 | 223.12 | 0.043 | 110.99 | 0.8 | 0.098 | 232.05 | -| 12 | 0.32 | 225.26 | 0.029 | 73.87 | 0.53 | 0.078 | 231.45 | -| 16 | 0.23 | 227.26 | 0.022 | 55.61 | 0.4 | 0.056 | 231.9 | - -the following is the client thread num - latency bar chart: -![bert benchmark](../../../doc/bert-benchmark-batch-size-1.png) diff --git a/python/examples/bert/README_CN.md b/python/examples/bert/README_CN.md index 7f1d2911ba4a5017137e659fe1f1367e64026de4..93ec8f2adbd9ae31489011900472a0077cb33783 100644 --- a/python/examples/bert/README_CN.md +++ b/python/examples/bert/README_CN.md @@ -67,27 +67,3 @@ head data-c.txt | python bert_client.py --model bert_seq128_client/serving_clien ``` curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction ``` - -### Benchmark - -模型:bert_chinese_L-12_H-768_A-12 - -设备:GPU V100 * 1 - -环境:CUDA 9.2,cudnn 7.1.4 - -测试中将样例数据中的1W个样本复制为10W个样本,每个client线程发送线程数分之一个样本,batch size为1,max_seq_len为20(而不是上面的128),时间单位为秒. 
- -在client线程数为4时,预测速度可以达到432样本每秒。 -由于单张GPU内部只能串行计算,client线程增多只能减少GPU的空闲时间,因此在线程数达到4之后,线程数增多对预测速度没有提升。 - -| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total | -| ------------------ | ------ | ------------ | ----- | ------ | ---- | ------- | ------ | -| 1 | 3.05 | 290.54 | 0.37 | 239.15 | 6.43 | 0.71 | 365.63 | -| 4 | 0.85 | 213.66 | 0.091 | 200.39 | 1.62 | 0.2 | 231.45 | -| 8 | 0.42 | 223.12 | 0.043 | 110.99 | 0.8 | 0.098 | 232.05 | -| 12 | 0.32 | 225.26 | 0.029 | 73.87 | 0.53 | 0.078 | 231.45 | -| 16 | 0.23 | 227.26 | 0.022 | 55.61 | 0.4 | 0.056 | 231.9 | - -总耗时变化规律如下: -![bert benchmark](../../../doc/bert-benchmark-batch-size-1.png) diff --git a/python/examples/faster_rcnn_model/new_test_client.py b/python/examples/faster_rcnn_model/test_client.py similarity index 86% rename from python/examples/faster_rcnn_model/new_test_client.py rename to python/examples/faster_rcnn_model/test_client.py index 0c6c615f8f3dff10626256de59101c401457509f..ce577a3c4396d33af33e45694a573f8b1cbcb52b 100755 --- a/python/examples/faster_rcnn_model/new_test_client.py +++ b/python/examples/faster_rcnn_model/test_client.py @@ -14,6 +14,8 @@ from paddle_serving_client import Client from paddle_serving_app.reader import * +import sys +import numpy as np preprocess = Sequential([ File2Image(), BGR2RGB(), Div(255.0), @@ -24,11 +26,10 @@ preprocess = Sequential([ postprocess = RCNNPostprocess("label_list.txt", "output") client = Client() -client.load_client_config( - "faster_rcnn_client_conf/serving_client_conf.prototxt") -client.connect(['127.0.0.1:9393']) +client.load_client_config(sys.argv[1]) +client.connect(['127.0.0.1:9494']) -im = preprocess(sys.argv[2]) +im = preprocess(sys.argv[3]) fetch_map = client.predict( feed={ "image": im, @@ -36,5 +37,5 @@ fetch_map = client.predict( "im_shape": np.array(list(im.shape[1:]) + [1.0]) }, fetch=["multiclass_nms"]) -fetch_map["image"] = sys.argv[1] +fetch_map["image"] = sys.argv[3] postprocess(fetch_map) diff --git a/python/examples/imagenet/benchmark.py b/python/examples/imagenet/benchmark.py index 6b21719e7b665906e7abd02a7a3b8aef50136685..caa952f121fbd8725c2a6bfe36f0dd84b6a82707 100644 --- a/python/examples/imagenet/benchmark.py +++ b/python/examples/imagenet/benchmark.py @@ -1,3 +1,5 @@ +# -*- coding: utf-8 -*- +# # Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -13,16 +15,26 @@ # limitations under the License. 
# pylint: disable=doc-string-missing +from __future__ import unicode_literals, absolute_import +import os import sys -from image_reader import ImageReader +import time +import requests +import json +import base64 from paddle_serving_client import Client from paddle_serving_client.utils import MultiThreadRunner from paddle_serving_client.utils import benchmark_args -import time -import os +from paddle_serving_app.reader import Sequential, URL2Image, Resize +from paddle_serving_app.reader import CenterCrop, RGB2BGR, Transpose, Div, Normalize args = benchmark_args() +seq_preprocess = Sequential([ + URL2Image(), Resize(256), CenterCrop(224), RGB2BGR(), Transpose((2, 0, 1)), + Div(255), Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225], True) +]) + def single_func(idx, resource): file_list = [] @@ -31,30 +43,61 @@ def single_func(idx, resource): img_list = [] for i in range(1000): img_list.append(open("./image_data/n01440764/" + file_list[i]).read()) + profile_flags = False + if "FLAGS_profile_client" in os.environ and os.environ[ + "FLAGS_profile_client"]: + profile_flags = True if args.request == "rpc": reader = ImageReader() fetch = ["score"] client = Client() client.load_client_config(args.model) client.connect([resource["endpoint"][idx % len(resource["endpoint"])]]) + start = time.time() + for i in range(1000): + if args.batch_size >= 1: + feed_batch = [] + i_start = time.time() + for bi in range(args.batch_size): + img = seq_preprocess(img_list[i]) + feed_batch.append({"image": img}) + i_end = time.time() + if profile_flags: + print("PROFILE\tpid:{}\timage_pre_0:{} image_pre_1:{}". + format(os.getpid(), + int(round(i_start * 1000000)), + int(round(i_end * 1000000)))) + + result = client.predict(feed=feed_batch, fetch=fetch) + else: + print("unsupport batch size {}".format(args.batch_size)) + elif args.request == "http": + py_version = 2 + server = "http://" + resource["endpoint"][idx % len(resource[ + "endpoint"])] + "/image/prediction" start = time.time() - for i in range(100): - img = reader.process_image(img_list[i]) - fetch_map = client.predict(feed={"image": img}, fetch=["score"]) - end = time.time() - return [[end - start]] + for i in range(1000): + if py_version == 2: + image = base64.b64encode( + open("./image_data/n01440764/" + file_list[i]).read()) + else: + image = base64.b64encode(open(image_path, "rb").read()).decode( + "utf-8") + req = json.dumps({"feed": [{"image": image}], "fetch": ["score"]}) + r = requests.post( + server, data=req, headers={"Content-Type": "application/json"}) + end = time.time() return [[end - start]] -if __name__ == "__main__": +if __name__ == '__main__': multi_thread_runner = MultiThreadRunner() - endpoint_list = ["127.0.0.1:9292"] - #card_num = 4 - #for i in range(args.thread): - # endpoint_list.append("127.0.0.1:{}".format(9295 + i % card_num)) + endpoint_list = ["127.0.0.1:9696"] + #endpoint_list = endpoint_list + endpoint_list + endpoint_list result = multi_thread_runner.run(single_func, args.thread, {"endpoint": endpoint_list}) + #result = single_func(0, {"endpoint": endpoint_list}) avg_cost = 0 for i in range(args.thread): avg_cost += result[0][i] diff --git a/python/examples/imagenet/benchmark.sh b/python/examples/imagenet/benchmark.sh index 16fadbbac6cd7e616d11135653cfbcfeebe6d4f2..618a62c063c0bc4955baf8516bc5bc93e4832394 100644 --- a/python/examples/imagenet/benchmark.sh +++ b/python/examples/imagenet/benchmark.sh @@ -1,9 +1,12 @@ rm profile_log -for thread_num in 1 2 4 8 16 +for thread_num in 1 2 4 8 do - $PYTHONROOT/bin/python 
benchmark.py --thread $thread_num --model ResNet101_vd_client_config/serving_client_conf.prototxt --request rpc > profile 2>&1 +for batch_size in 1 2 4 8 16 32 64 128 +do + $PYTHONROOT/bin/python benchmark.py --thread $thread_num --batch_size $batch_size --model ResNet50_vd_client_config/serving_client_conf.prototxt --request rpc > profile 2>&1 echo "========================================" echo "batch size : $batch_size" >> profile_log $PYTHONROOT/bin/python ../util/show_profile.py profile $thread_num >> profile_log tail -n 1 profile >> profile_log done +done diff --git a/python/examples/imagenet/benchmark_batch.py b/python/examples/imagenet/benchmark_batch.py deleted file mode 100644 index 1646fb9a94d6953f90f9f4907aa74940f13c2730..0000000000000000000000000000000000000000 --- a/python/examples/imagenet/benchmark_batch.py +++ /dev/null @@ -1,99 +0,0 @@ -# -*- coding: utf-8 -*- -# -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# pylint: disable=doc-string-missing - -from __future__ import unicode_literals, absolute_import -import os -import sys -import time -from paddle_serving_client import Client -from paddle_serving_client.utils import MultiThreadRunner -from paddle_serving_client.utils import benchmark_args -import requests -import json -import base64 -from image_reader import ImageReader - -args = benchmark_args() - - -def single_func(idx, resource): - file_list = [] - for file_name in os.listdir("./image_data/n01440764"): - file_list.append(file_name) - img_list = [] - for i in range(1000): - img_list.append(open("./image_data/n01440764/" + file_list[i]).read()) - profile_flags = False - if "FLAGS_profile_client" in os.environ and os.environ[ - "FLAGS_profile_client"]: - profile_flags = True - if args.request == "rpc": - reader = ImageReader() - fetch = ["score"] - client = Client() - client.load_client_config(args.model) - client.connect([resource["endpoint"][idx % len(resource["endpoint"])]]) - start = time.time() - for i in range(1000): - if args.batch_size >= 1: - feed_batch = [] - i_start = time.time() - for bi in range(args.batch_size): - img = reader.process_image(img_list[i]) - feed_batch.append({"image": img}) - i_end = time.time() - if profile_flags: - print("PROFILE\tpid:{}\timage_pre_0:{} image_pre_1:{}". 
- format(os.getpid(), - int(round(i_start * 1000000)), - int(round(i_end * 1000000)))) - - result = client.predict(feed=feed_batch, fetch=fetch) - else: - print("unsupport batch size {}".format(args.batch_size)) - - elif args.request == "http": - py_version = 2 - server = "http://" + resource["endpoint"][idx % len(resource[ - "endpoint"])] + "/image/prediction" - start = time.time() - for i in range(1000): - if py_version == 2: - image = base64.b64encode( - open("./image_data/n01440764/" + file_list[i]).read()) - else: - image = base64.b64encode(open(image_path, "rb").read()).decode( - "utf-8") - req = json.dumps({"feed": [{"image": image}], "fetch": ["score"]}) - r = requests.post( - server, data=req, headers={"Content-Type": "application/json"}) - end = time.time() - return [[end - start]] - - -if __name__ == '__main__': - multi_thread_runner = MultiThreadRunner() - endpoint_list = ["127.0.0.1:9292"] - #endpoint_list = endpoint_list + endpoint_list + endpoint_list - result = multi_thread_runner.run(single_func, args.thread, - {"endpoint": endpoint_list}) - #result = single_func(0, {"endpoint": endpoint_list}) - avg_cost = 0 - for i in range(args.thread): - avg_cost += result[0][i] - avg_cost = avg_cost / args.thread - print("average total cost {} s.".format(avg_cost)) diff --git a/python/examples/imagenet/benchmark_batch.py.lprof b/python/examples/imagenet/benchmark_batch.py.lprof new file mode 100644 index 0000000000000000000000000000000000000000..7ff4f1411ded79aba3390e606193ec4fedacf06f Binary files /dev/null and b/python/examples/imagenet/benchmark_batch.py.lprof differ diff --git a/python/examples/imagenet/benchmark_batch.sh b/python/examples/imagenet/benchmark_batch.sh deleted file mode 100644 index 4118ffcc755e6d47c69924efbb1b7d5474db8b00..0000000000000000000000000000000000000000 --- a/python/examples/imagenet/benchmark_batch.sh +++ /dev/null @@ -1,12 +0,0 @@ -rm profile_log -for thread_num in 1 2 4 8 16 -do -for batch_size in 1 2 4 8 16 32 64 128 256 512 -do - $PYTHONROOT/bin/python benchmark_batch.py --thread $thread_num --batch_size $batch_size --model ResNet101_vd_client_config/serving_client_conf.prototxt --request rpc > profile 2>&1 - echo "========================================" - echo "batch size : $batch_size" >> profile_log - $PYTHONROOT/bin/python ../util/show_profile.py profile $thread_num >> profile_log - tail -n 1 profile >> profile_log -done -done diff --git a/python/examples/imdb/README.md b/python/examples/imdb/README.md index 5f4d204d368a98cb47d4dac2ff3d481e519adb9d..e2b9a74c98e8993f19b14888f3e21343f526b81d 100644 --- a/python/examples/imdb/README.md +++ b/python/examples/imdb/README.md @@ -30,27 +30,3 @@ python text_classify_service.py imdb_cnn_model/ workdir/ 9292 imdb.vocab ``` curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "i am very sad | 0"}], "fetch":["prediction"]}' http://127.0.0.1:9292/imdb/prediction ``` - -### Benchmark - -CPU :Intel(R) Xeon(R) Gold 6271 CPU @ 2.60GHz * 48 - -Model :[CNN](https://github.com/PaddlePaddle/Serving/blob/develop/python/examples/imdb/nets.py) - -server thread num : 16 - -In this test, client sends 25000 test samples totally, the bar chart given later is the latency of single thread, the unit is second, from which we know the predict efficiency is improved greatly by multi-thread compared to single-thread. 8.7 times improvement is made by 16 threads prediction. 
- -| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total | -| ------------------ | ------ | ------------ | ------ | ----- | ------ | ------- | ----- | -| 1 | 1.09 | 28.79 | 0.094 | 20.59 | 0.047 | 0.034 | 31.41 | -| 4 | 0.22 | 7.41 | 0.023 | 5.01 | 0.011 | 0.0098 | 8.01 | -| 8 | 0.11 | 4.7 | 0.012 | 2.61 | 0.0062 | 0.0049 | 5.01 | -| 12 | 0.081 | 4.69 | 0.0078 | 1.72 | 0.0042 | 0.0035 | 4.91 | -| 16 | 0.058 | 3.46 | 0.0061 | 1.32 | 0.0033 | 0.003 | 3.63 | -| 20 | 0.049 | 3.77 | 0.0047 | 1.03 | 0.0025 | 0.0022 | 3.91 | -| 24 | 0.041 | 3.86 | 0.0039 | 0.85 | 0.002 | 0.0017 | 3.98 | - -The thread-latency bar chart is as follow: - -![total cost](../../../doc/imdb-benchmark-server-16.png) diff --git a/python/examples/imdb/README_CN.md b/python/examples/imdb/README_CN.md index 2b79938bbf0625786033d13ec2960ad2bc73acda..a669e29e94f6c6cce238473a8fc33405e29e8471 100644 --- a/python/examples/imdb/README_CN.md +++ b/python/examples/imdb/README_CN.md @@ -29,27 +29,3 @@ python text_classify_service.py imdb_cnn_model/ workdir/ 9292 imdb.vocab ``` curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "i am very sad | 0"}], "fetch":["prediction"]}' http://127.0.0.1:9292/imdb/prediction ``` - -### Benchmark - -设备 :Intel(R) Xeon(R) Gold 6271 CPU @ 2.60GHz * 48 - -模型 :[CNN](https://github.com/PaddlePaddle/Serving/blob/develop/python/examples/imdb/nets.py) - -server thread num : 16 - -测试中,client共发送25000条测试样本,图中数据为单个线程的耗时,时间单位为秒。可以看出,client端多线程的预测速度相比单线程有明显提升,在16线程时预测速度是单线程的8.7倍。 - -| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total | -| ------------------ | ------ | ------------ | ------ | ----- | ------ | ------- | ----- | -| 1 | 1.09 | 28.79 | 0.094 | 20.59 | 0.047 | 0.034 | 31.41 | -| 4 | 0.22 | 7.41 | 0.023 | 5.01 | 0.011 | 0.0098 | 8.01 | -| 8 | 0.11 | 4.7 | 0.012 | 2.61 | 0.0062 | 0.0049 | 5.01 | -| 12 | 0.081 | 4.69 | 0.0078 | 1.72 | 0.0042 | 0.0035 | 4.91 | -| 16 | 0.058 | 3.46 | 0.0061 | 1.32 | 0.0033 | 0.003 | 3.63 | -| 20 | 0.049 | 3.77 | 0.0047 | 1.03 | 0.0025 | 0.0022 | 3.91 | -| 24 | 0.041 | 3.86 | 0.0039 | 0.85 | 0.002 | 0.0017 | 3.98 | - -预测总耗时变化规律如下: - -![total cost](../../../doc/imdb-benchmark-server-16.png)
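For reference, the multi-threaded timing pattern shared by the benchmark scripts touched in this diff (and behind the benchmark tables removed from the READMEs above) reduces to the sketch below. It is a rough outline, not a drop-in replacement for `benchmark.py`: the endpoint is a placeholder and the worker body is elided.

```
# Rough sketch of the MultiThreadRunner benchmarking pattern used by the
# example benchmark scripts: each worker thread measures its own elapsed
# time and the main block averages the per-thread cost.
import time
from paddle_serving_client.utils import MultiThreadRunner
from paddle_serving_client.utils import benchmark_args

args = benchmark_args()  # parses --thread, --batch_size, --model, --request


def single_func(idx, resource):
    endpoint = resource["endpoint"][idx % len(resource["endpoint"])]
    start = time.time()
    # ... send this thread's share of requests to `endpoint` here ...
    end = time.time()
    return [[end - start]]


if __name__ == '__main__':
    multi_thread_runner = MultiThreadRunner()
    result = multi_thread_runner.run(single_func, args.thread,
                                     {"endpoint": ["127.0.0.1:9292"]})
    avg_cost = sum(result[0]) / args.thread
    print("average total cost {} s.".format(avg_cost))
```

Driven by the loop in `python/examples/imagenet/benchmark.sh`, this produces one `profile_log` entry per (thread_num, batch_size) combination.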