diff --git a/README.md b/README.md
index 17730e2a071facf7c939cb7fb686596b2b752aa6..0fff8732f69e675cd502e502e7230d78c7ca080c 100644
--- a/README.md
+++ b/README.md
@@ -261,6 +261,8 @@ curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"url": "https://pa
 - [How to develop a new Web Service?](doc/NEW_WEB_SERVICE.md)
 - [Golang client](doc/IMDB_GO_CLIENT.md)
 - [Compile from source code](doc/COMPILE.md)
+- [Deploy Web Service with uWSGI](doc/UWSGI_DEPLOY.md)
+- [Hot loading for model file](doc/HOT_LOADING_IN_SERVING.md)
 
 ### About Efficiency
 - [How to profile Paddle Serving latency?](python/examples/util)
diff --git a/README_CN.md b/README_CN.md
index 3302d4850e8255e8d2d6460c201892fd6035b260..6a6bc82e83737c182ddee4c7c4515039790e1475 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -267,6 +267,8 @@ curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"url": "https://pa
 - [如何开发一个新的Web Service?](doc/NEW_WEB_SERVICE_CN.md)
 - [如何在Paddle Serving使用Go Client?](doc/IMDB_GO_CLIENT_CN.md)
 - [如何编译PaddleServing?](doc/COMPILE_CN.md)
+- [如何使用uWSGI部署Web Service](doc/UWSGI_DEPLOY_CN.md)
+- [如何实现模型文件热加载](doc/HOT_LOADING_IN_SERVING_CN.md)
 
 ### 关于Paddle Serving性能
 - [如何测试Paddle Serving性能?](python/examples/util/)
diff --git a/python/examples/bert/README.md b/python/examples/bert/README.md
index d598fc3b057c85d80e8d10549f7c5b0cf1e725fb..4cfa5590ffb4501c78e9e6ff886f5f82c94dd2db 100644
--- a/python/examples/bert/README.md
+++ b/python/examples/bert/README.md
@@ -71,28 +71,3 @@ set environmental variable to specify which gpus are used, the command above mea
 ```
 curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
 ```
-
-### Benchmark
-
-Model:bert_chinese_L-12_H-768_A-12
-
-GPU:GPU V100 * 1
-
-CUDA/cudnn Version:CUDA 9.2,cudnn 7.1.4
-
-
-In the test, 10 thousand samples in the sample data are copied into 100 thousand samples. Each client thread sends a sample of the number of threads. The batch size is 1, the max_seq_len is 20(not 128 as described above), and the time unit is seconds.
-
-When the number of client threads is 4, the prediction speed can reach 432 samples per second.
-Because a single GPU can only perform serial calculations internally, increasing the number of client threads can only reduce the idle time of the GPU. Therefore, after the number of threads reaches 4, the increase in the number of threads does not improve the prediction speed.
-
-| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total |
-| ------------------ | ------ | ------------ | ----- | ------ | ---- | ------- | ------ |
-| 1 | 3.05 | 290.54 | 0.37 | 239.15 | 6.43 | 0.71 | 365.63 |
-| 4 | 0.85 | 213.66 | 0.091 | 200.39 | 1.62 | 0.2 | 231.45 |
-| 8 | 0.42 | 223.12 | 0.043 | 110.99 | 0.8 | 0.098 | 232.05 |
-| 12 | 0.32 | 225.26 | 0.029 | 73.87 | 0.53 | 0.078 | 231.45 |
-| 16 | 0.23 | 227.26 | 0.022 | 55.61 | 0.4 | 0.056 | 231.9 |
-
-the following is the client thread num - latency bar chart:
-![bert benchmark](../../../doc/bert-benchmark-batch-size-1.png)
diff --git a/python/examples/bert/README_CN.md b/python/examples/bert/README_CN.md
index 7f1d2911ba4a5017137e659fe1f1367e64026de4..93ec8f2adbd9ae31489011900472a0077cb33783 100644
--- a/python/examples/bert/README_CN.md
+++ b/python/examples/bert/README_CN.md
@@ -67,27 +67,3 @@ head data-c.txt | python bert_client.py --model bert_seq128_client/serving_clien
 ```
 curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
 ```
-
-### Benchmark
-
-模型:bert_chinese_L-12_H-768_A-12
-
-设备:GPU V100 * 1
-
-环境:CUDA 9.2,cudnn 7.1.4
-
-测试中将样例数据中的1W个样本复制为10W个样本,每个client线程发送线程数分之一个样本,batch size为1,max_seq_len为20(而不是上面的128),时间单位为秒.
-
-在client线程数为4时,预测速度可以达到432样本每秒。
-由于单张GPU内部只能串行计算,client线程增多只能减少GPU的空闲时间,因此在线程数达到4之后,线程数增多对预测速度没有提升。
-
-| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total |
-| ------------------ | ------ | ------------ | ----- | ------ | ---- | ------- | ------ |
-| 1 | 3.05 | 290.54 | 0.37 | 239.15 | 6.43 | 0.71 | 365.63 |
-| 4 | 0.85 | 213.66 | 0.091 | 200.39 | 1.62 | 0.2 | 231.45 |
-| 8 | 0.42 | 223.12 | 0.043 | 110.99 | 0.8 | 0.098 | 232.05 |
-| 12 | 0.32 | 225.26 | 0.029 | 73.87 | 0.53 | 0.078 | 231.45 |
-| 16 | 0.23 | 227.26 | 0.022 | 55.61 | 0.4 | 0.056 | 231.9 |
-
-总耗时变化规律如下:
-![bert benchmark](../../../doc/bert-benchmark-batch-size-1.png)
diff --git a/python/examples/imdb/README.md b/python/examples/imdb/README.md
index 5f4d204d368a98cb47d4dac2ff3d481e519adb9d..e2b9a74c98e8993f19b14888f3e21343f526b81d 100644
--- a/python/examples/imdb/README.md
+++ b/python/examples/imdb/README.md
@@ -30,27 +30,3 @@ python text_classify_service.py imdb_cnn_model/ workdir/ 9292 imdb.vocab
 ```
 curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "i am very sad | 0"}], "fetch":["prediction"]}' http://127.0.0.1:9292/imdb/prediction
 ```
-
-### Benchmark
-
-CPU :Intel(R) Xeon(R) Gold 6271 CPU @ 2.60GHz * 48
-
-Model :[CNN](https://github.com/PaddlePaddle/Serving/blob/develop/python/examples/imdb/nets.py)
-
-server thread num : 16
-
-In this test, client sends 25000 test samples totally, the bar chart given later is the latency of single thread, the unit is second, from which we know the predict efficiency is improved greatly by multi-thread compared to single-thread. 8.7 times improvement is made by 16 threads prediction.
-
-| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total |
-| ------------------ | ------ | ------------ | ------ | ----- | ------ | ------- | ----- |
-| 1 | 1.09 | 28.79 | 0.094 | 20.59 | 0.047 | 0.034 | 31.41 |
-| 4 | 0.22 | 7.41 | 0.023 | 5.01 | 0.011 | 0.0098 | 8.01 |
-| 8 | 0.11 | 4.7 | 0.012 | 2.61 | 0.0062 | 0.0049 | 5.01 |
-| 12 | 0.081 | 4.69 | 0.0078 | 1.72 | 0.0042 | 0.0035 | 4.91 |
-| 16 | 0.058 | 3.46 | 0.0061 | 1.32 | 0.0033 | 0.003 | 3.63 |
-| 20 | 0.049 | 3.77 | 0.0047 | 1.03 | 0.0025 | 0.0022 | 3.91 |
-| 24 | 0.041 | 3.86 | 0.0039 | 0.85 | 0.002 | 0.0017 | 3.98 |
-
-The thread-latency bar chart is as follow:
-
-![total cost](../../../doc/imdb-benchmark-server-16.png)
diff --git a/python/examples/imdb/README_CN.md b/python/examples/imdb/README_CN.md
index 2b79938bbf0625786033d13ec2960ad2bc73acda..a669e29e94f6c6cce238473a8fc33405e29e8471 100644
--- a/python/examples/imdb/README_CN.md
+++ b/python/examples/imdb/README_CN.md
@@ -29,27 +29,3 @@ python text_classify_service.py imdb_cnn_model/ workdir/ 9292 imdb.vocab
 ```
 curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "i am very sad | 0"}], "fetch":["prediction"]}' http://127.0.0.1:9292/imdb/prediction
 ```
-
-### Benchmark
-
-设备 :Intel(R) Xeon(R) Gold 6271 CPU @ 2.60GHz * 48
-
-模型 :[CNN](https://github.com/PaddlePaddle/Serving/blob/develop/python/examples/imdb/nets.py)
-
-server thread num : 16
-
-测试中,client共发送25000条测试样本,图中数据为单个线程的耗时,时间单位为秒。可以看出,client端多线程的预测速度相比单线程有明显提升,在16线程时预测速度是单线程的8.7倍。
-
-| client thread num | prepro | client infer | op0 | op1 | op2 | postpro | total |
-| ------------------ | ------ | ------------ | ------ | ----- | ------ | ------- | ----- |
-| 1 | 1.09 | 28.79 | 0.094 | 20.59 | 0.047 | 0.034 | 31.41 |
-| 4 | 0.22 | 7.41 | 0.023 | 5.01 | 0.011 | 0.0098 | 8.01 |
-| 8 | 0.11 | 4.7 | 0.012 | 2.61 | 0.0062 | 0.0049 | 5.01 |
-| 12 | 0.081 | 4.69 | 0.0078 | 1.72 | 0.0042 | 0.0035 | 4.91 |
-| 16 | 0.058 | 3.46 | 0.0061 | 1.32 | 0.0033 | 0.003 | 3.63 |
-| 20 | 0.049 | 3.77 | 0.0047 | 1.03 | 0.0025 | 0.0022 | 3.91 |
-| 24 | 0.041 | 3.86 | 0.0039 | 0.85 | 0.002 | 0.0017 | 3.98 |
-
-预测总耗时变化规律如下:
-
-![total cost](../../../doc/imdb-benchmark-server-16.png)
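
The curl requests kept as context in the example READMEs above can also be issued from Python. The following is a minimal sketch, not part of the patch, assuming the IMDB web service from `python/examples/imdb` (started with `text_classify_service.py` as in the README) is running locally on port 9292 and the `requests` package is installed:

```python
# Minimal sketch: call the IMDB example's HTTP prediction endpoint from Python
# instead of curl. Assumes the service is serving on 127.0.0.1:9292.
import requests

url = "http://127.0.0.1:9292/imdb/prediction"
payload = {
    "feed": [{"words": "i am very sad | 0"}],  # same request body as the curl example
    "fetch": ["prediction"],
}

# Passing json= makes requests send the Content-Type: application/json header.
resp = requests.post(url, json=payload)
print(resp.json())
```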