diff --git a/README.md b/README.md index 61d043074a2ef65d78086e48beefab1388c8e7ae..911b4a7c056f9f487f80f977a6ad3fe351828d04 100755 --- a/README.md +++ b/README.md @@ -6,327 +6,144 @@

- -


- Build Status + Build Status + Docs + Release + Python + License + Forks + Issues + Contributors + Community - Release - Issues - License - Slack

-- [Motivation](./README.md#motivation) -- [AIStudio Tutorial](./README.md#aistuio-tutorial) -- [Installation](./README.md#installation) -- [Quick Start Example](./README.md#quick-start-example) -- [Document](README.md#document) -- [Community](README.md#community) +*** -

Motivation

+The goal of Paddle Serving is to provide high-performance, flexible and easy-to-use industrial-grade online inference services for machine learning developers and enterprises. Paddle Serving supports multiple protocols such as RESTful, gRPC and bRPC, provides inference solutions for a variety of hardware and operating system environments, and ships with many well-known pre-trained model examples. The core features are as follows: -We consider deploying deep learning inference service online to be a user-facing application in the future. **The goal of this project**: When you have trained a deep neural net with [Paddle](https://github.com/PaddlePaddle/Paddle), you are also capable to deploy the model online easily. A demo of Paddle Serving is as follows: -

Some Key Features of Paddle Serving

+- Integrates the high-performance server-side inference engine Paddle Inference and the mobile-side engine Paddle Lite. Models from other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated to Paddle through [x2paddle](https://github.com/PaddlePaddle/X2Paddle). +- Offers two frameworks: the high-performance C++ Serving and the easy-to-use Python pipeline. C++ Serving is built on the bRPC network framework to create high-throughput, low-latency inference services, and its performance indicators lead competing products. The Python pipeline is built on the gRPC/gRPC-Gateway network framework and the Python language to provide highly easy-to-use, high-throughput inference services. To choose between them, see [Technical Selection](doc/Serving_Design_EN.md). +- Supports multiple [protocols](doc/C++_Serving/Inference_Protocols_CN.md) such as HTTP, gRPC and bRPC, and provides C++, Python and Java SDKs. +- Designs and implements a high-performance asynchronous pipeline inference framework based on a directed acyclic graph (DAG), with features such as multi-model combination, asynchronous scheduling, concurrent inference, dynamic batching, and multi-card multi-stream inference. +- Adapts to a variety of commonly used computing hardware, such as x86 (Intel) CPU, ARM CPU, Nvidia GPU and Kunlun XPU; integrates the Intel MKLDNN and Nvidia TensorRT acceleration libraries, as well as low-precision and quantized inference. +- Provides a model security deployment solution, including encrypted model deployment, an authentication mechanism and an HTTPS security gateway, all used in practice. +- Supports cloud deployment and provides a deployment case for Baidu Intelligent Cloud Kubernetes clusters. +- Provides more than 40 classic pre-trained model deployment examples from suites such as PaddleOCR, PaddleClas, PaddleDetection, PaddleSeg, PaddleNLP and PaddleRec, with more models being added continuously. +- Supports distributed deployment of large-scale sparse parameter index models, with features such as multiple tables, multiple shards, multiple replicas and a local high-frequency cache; it can be deployed on a single machine or in the cloud. -- Integrate with Paddle training pipeline seamlessly, most paddle models can be deployed **with one line command**. -- **Industrial serving features** supported, such as models management, online loading, online A/B testing etc. -- **Highly concurrent and efficient communication** between clients and servers supported. -- **Multiple programming languages** supported on client side, such as C++, python and Java. -*** +

Tutorial

+ -- Any model trained by [PaddlePaddle](https://github.com/paddlepaddle/paddle) can be directly used or [Model Conversion Interface](./doc/SAVE.md) for online deployment of Paddle Serving. -- Support [Multi-model Pipeline Deployment](./doc/PIPELINE_SERVING.md), and provide the requirements of the REST interface and RPC interface itself, [Pipeline example](./python/examples/pipeline). -- Support the model zoos from the Paddle ecosystem, such as [PaddleDetection](./python/examples/detection), [PaddleOCR](./python/examples/ocr), [PaddleRec](https://github.com/PaddlePaddle/PaddleRec/tree/master/recserving/movie_recommender). -- Provide a variety of pre-processing and post-processing to facilitate users in training, deployment and other stages of related code, bridging the gap between AI developers and application developers, please refer to -[Serving Examples](./python/examples/). +- AIStudio tutorial (Chinese): [Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945) +- Video tutorial (Chinese): [深度学习服务化部署-以互联网应用为例](https://aistudio.baidu.com/aistudio/course/introduce/19084)

+

Documentation

+ + +> Set up + +This chapter guides you through the installation and deployment steps. It is strongly recommended to deploy Paddle Serving with Docker; if you do not use Docker, you can skip the Docker-related steps. Paddle Serving can be deployed on cloud servers with Kubernetes and runs on many commonly used hardware platforms such as ARM CPU, Intel CPU, Nvidia GPU and Kunlun XPU. The latest development packages of the develop branch are built every day for developers to use. + +- [Install Paddle Serving using docker](doc/Install_EN.md) +- [Build Paddle Serving from Source with Docker](doc/Compile_EN.md) +- [Deploy Paddle Serving on Kubernetes](doc/Run_On_Kubernetes_EN.md) +- [Deploy Paddle Serving with Security gateway](doc/Serving_Auth_Docker.md) +- [Deploy Paddle Serving on more hardwares](doc/Run_On_XPU_EN.md) +- [Latest Wheel packages](doc/Latest_Packages_CN.md) (updated daily on the develop branch) -

AIStudio Turorial

+> Use -Here we provide tutorial on AIStudio(Chinese Version) [AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945) +The first step is to call the model save interface to generate a model parameter configuration file (.prototxt), which will be used on both the client and the server. The second step is to read the configuration and startup parameters and start the service. The third step is to write client requests based on the SDK, following the API documents and your use case, and to test the inference service (a minimal end-to-end sketch follows the list below). -The tutorial provides - +- [Quick Start](doc/Quick_Start_EN.md) +- [Save a servable model](doc/Save_EN.md) +- [Description of configuration and startup parameters](doc/Serving_Configure_EN.md) +- [Guide for RESTful/gRPC/bRPC APIs](doc/C++_Serving/Http_Service_EN.md) +- [Inference on quantized models](doc/Low_Precision_CN.md) +- [Data format of classic models](doc/Process_Data_CN.md) +- [C++ Serving](doc/C++_Serving/Introduction_EN.md) + - [Protocols](doc/C++_Serving/Inference_Protocols_CN.md) + - [Hot loading models](doc/C++_Serving/Hot_Loading_EN.md) + - [A/B Test](doc/C++_Serving/ABTest_EN.md) + - [Encryption](doc/C++_Serving/Encryption_EN.md) + - [Analyze and optimize performance(Chinese)](doc/C++_Serving/Performance_Tuning_CN.md) + - [Benchmark(Chinese)](doc/C++_Serving/Benchmark_CN.md) +- [Python Pipeline](doc/Python_Pipeline/Pipeline_Design_EN.md) + - [Analyze and optimize performance](doc/Python_Pipeline/Pipeline_Design_EN.md) + - [Benchmark(Chinese)](doc/Python_Pipeline/Benchmark_CN.md) +- Client SDK + - [Python SDK(Chinese)](doc/C++_Serving/Http_Service_CN.md) + - [JAVA SDK](doc/Java_SDK_EN.md) + - [C++ SDK(Chinese)](doc/C++_Serving/Creat_C++Serving_CN.md) +- [Large-scale sparse parameter server](doc/Cube_Local_EN.md) +
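The three steps map directly onto the Boston house price (fit_a_line / uci_housing) example used in this repository. Below is a minimal sketch; the save interface mentioned in the step 1 comment (`paddle_serving_client.io.save_model`) and its argument order are assumptions taken from the save documentation, while the client code mirrors the quick start example.

```python
# Step 1 (at training time): call the model save interface, e.g.
#   paddle_serving_client.io.save_model(server_dir, client_dir, feed_vars, fetch_vars, program)
# to produce the server model directory and the client .prototxt configuration
# (see doc/Save_EN.md; the exact signature above is an assumption).
#
# Step 2 (shell): start the service with the saved model directory, e.g.
#   python3 -m paddle_serving_server.serve --model uci_housing_model --port 9292
#
# Step 3: write a client request based on the Python SDK and test the service.
import numpy as np
from paddle_serving_client import Client

client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
        -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1, 13, 1)}, fetch=["price"])
print(fetch_map)
```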
-

Installation

+> Developers -We **highly recommend** you to **run Paddle Serving in Docker**, please visit [Run in Docker](doc/RUN_IN_DOCKER.md). See the [document](doc/DOCKER_IMAGES.md) for more docker images. +For Paddle Serving developers, we provide extended documents on topics such as custom operators (OPs) and level-of-detail (LOD) data processing; a short sketch of feeding LOD data follows the list below. +- [Custom Operators](doc/C++_Serving/OP_EN.md) +- [Processing LOD Data](doc/LOD_EN.md) +- [FAQ(Chinese)](doc/FAQ_CN.md) -**Attention:**: Currently, the default GPU environment of paddlepaddle 2.1 is Cuda 10.2, so the sample code of GPU Docker is based on Cuda 10.2. We also provides docker images and whl packages for other GPU environments. If users use other environments, they need to carefully check and select the appropriate version. +
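For LOD (variable-length) inputs specifically, the client feed can carry the sequence offsets next to the tensor. A minimal sketch, assuming the `<variable>.lod` feed key convention described in the LOD document and a hypothetical text model whose input variable is named `words`:

```python
import numpy as np
from paddle_serving_client import Client

client = Client()
client.load_client_config("serving_client_conf.prototxt")  # generated when the model was saved
client.connect(["127.0.0.1:9292"])

# Two sequences of lengths 3 and 5 concatenated into one tensor;
# the lod list records the cumulative offsets [0, 3, 8].
words = np.array([8, 233, 52, 601, 3, 194, 721, 37], dtype="int64").reshape(8, 1)
fetch_map = client.predict(
    feed={"words": words, "words.lod": [0, 3, 8]},  # "<name>.lod" key is an assumed convention
    fetch=["prediction"],  # fetch name "prediction" is hypothetical
    batch=True)
print(fetch_map)
```

The offsets follow Paddle's level-of-detail convention: each entry marks where a sequence starts in the flattened batch, so the server can split the concatenated tensor back into individual sequences.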

Model Zoo

-**Attention:** the following so-called 'python' or 'pip' stands for one of Python 3.6/3.7/3.8. -``` -# Run CPU Docker -docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-devel -docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-devel bash -docker exec -it test bash -git clone https://github.com/PaddlePaddle/Serving -``` -``` -# Run GPU Docker -nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel -nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel bash -nvidia-docker exec -it test bash -git clone https://github.com/PaddlePaddle/Serving -``` -install python dependencies -``` -cd Serving -pip3 install -r python/requirements.txt -``` - -```shell -pip3 install paddle-serving-client==0.6.2 -pip3 install paddle-serving-server==0.6.2 # CPU -pip3 install paddle-serving-app==0.6.2 -pip3 install paddle-serving-server-gpu==0.6.2.post102 #GPU with CUDA10.2 + TensorRT7 -# DO NOT RUN ALL COMMANDS! check your GPU env and select the right one -pip3 install paddle-serving-server-gpu==0.6.2.post101 # GPU with CUDA10.1 + TensorRT6 -pip3 install paddle-serving-server-gpu==0.6.2.post11 # GPU with CUDA10.1 + TensorRT7 -``` - -You may need to use a domestic mirror source (in China, you can use the Tsinghua mirror source, add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to pip command) to speed up the download. - -If you need install modules compiled with develop branch, please download packages from [latest packages list](./doc/LATEST_PACKAGES.md) and install with `pip install` command. If you want to compile by yourself, please refer to [How to compile Paddle Serving?](./doc/COMPILE.md) - -Packages of paddle-serving-server and paddle-serving-server-gpu support Centos 6/7, Ubuntu 16/18, Windows 10. - -Packages of paddle-serving-client and paddle-serving-app support Linux and Windows, but paddle-serving-client only support python3.6/3.7/3.8. - -**For latest version, Cuda 9.0 or Cuda 10.0 are no longer supported, Python2.7/3.5 is no longer supported.** - -Recommended to install paddle >= 2.1.0 - - -``` -# CPU users, please run -pip3 install paddlepaddle==2.1.0 - -# GPU Cuda10.2 please run -pip3 install paddlepaddle-gpu==2.1.0 -``` - -**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly, you need to refer to [Paddle official documentation-multi-version whl package list -](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release) - -Select the url link of the corresponding GPU environment and install it. For example, for Python3.6 users of Cuda 10.1, please select `cp36-cp36m` and -The url corresponding to `cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`, copy it and run -``` -pip3 install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl -``` - -the default `paddlepaddle-gpu==2.1.0` is Cuda 10.2 with no TensorRT. If you want to install PaddlePaddle with TensorRT. please also check the documentation-multi-version whl package list and find key word `cuda10.2-cudnn8.0-trt7.1.3`. More info please check [Paddle Serving uses TensorRT](./doc/TENSOR_RT.md) - -If it is other environment and Python version, please find the corresponding link in the table and install it with pip. 
- - -For **Windows Users**, please read the document [Paddle Serving for Windows Users](./doc/WINDOWS_TUTORIAL.md) - -

Quick Start Example

- -This quick start example is mainly for those users who already have a model to deploy, and we also provide a model that can be used for deployment. in case if you want to know how to complete the process from offline training to online service, please refer to the AiStudio tutorial above. - -### Boston House Price Prediction model - -get into the Serving git directory, and change dir to `fit_a_line` -``` shell -cd Serving/python/examples/fit_a_line -sh get_data.sh -``` - -Paddle Serving provides HTTP and RPC based service for users to access - -### RPC service - -A user can also start a RPC service with `paddle_serving_server.serve`. RPC service is usually faster than HTTP service, although a user needs to do some coding based on Paddle Serving's python client API. Note that we do not specify `--name` here. -``` shell -python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 -``` -
- -| Argument | Type | Default | Description | -| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- | -| `thread` | int | `2` | Number of brpc service thread | -| `runtime_thread_num` | int[]| `0` | Thread Number for each model in asynchronous mode | -| `batch_infer_size` | int[]| `0` | Batch Number for each model in asynchronous mode | -| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model | -| `port` | int | `9292` | Exposed port of current service to users | -| `model` | str[]| `""` | Path of paddle model directory to be served | -| `mem_optim_off` | - | - | Disable memory / graphic memory optimization | -| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph | -| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL | -| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT | -| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference | -| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU | -| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 | -| `use_calib` | bool | False | Use TRT int8 calibration | -| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS | - -#### Description of asynchronous model - Asynchronous mode is suitable for 1. When the number of requests is very large, 2. When multiple models are concatenated and you want to specify the concurrency number of each model. - Asynchronous mode helps to improve the throughput (QPS) of service, but for a single request, the delay will increase slightly. - In asynchronous mode, each model will start n threads of the number you specify, and each thread contains a model instance. In other words, each model is equivalent to a thread pool containing N threads, and the task is taken from the task queue of the thread pool to execute. - In asynchronous mode, each RPC server thread is only responsible for putting the request into the task queue of the model thread pool. After the task is executed, the completed task is removed from the task queue. - In the above table, the number of RPC server threads is specified by --thread, and the default value is 2. - --runtime_thread_num specifies the number of threads in the thread pool of each model. The default value is 0, indicating that asynchronous mode is not used. - --batch_infer_size specifies the number of batches for each model. The default value is 32. It takes effect when --runtime_thread_num is not 0. -#### When you want a model to use multiple GPU cards. -python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2 -#### When you want 2 models. -python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 -#### When you want 2 models, and want each of them use multiple GPU cards. -python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 -#### When a service contains two models, and each model needs to specify multiple GPU cards, and needs asynchronous mode, each model specifies different concurrency number. 
-python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --runtime_thread_num 4 8 +Paddle Serving works closely with the Paddle model suites and implements a large number of service deployment examples, covering image classification, object detection, language and text recognition, Chinese part-of-speech tagging, sentiment analysis, content recommendation and other types of tasks, for a total of 42 models. + +
+ +| PaddleOCR | PaddleDetection | PaddleClas | PaddleSeg | PaddleRec | Paddle NLP | +| :----: | :----: | :----: | :----: | :----: | :----: | +| 8 | 12 | 13 | 2 | 3 | 4 | + +
+ +For more model examples, read [Model zoo](doc/Model_Zoo_EN.md) + +
+ +
-```python -# A user can visit rpc service through paddle_serving_client API -from paddle_serving_client import Client -import numpy as np -client = Client() -client.load_client_config("uci_housing_client/serving_client_conf.prototxt") -client.connect(["127.0.0.1:9292"]) -data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, - -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332] -fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"]) -print(fetch_map) -``` -Here, `client.predict` function has two arguments. `feed` is a `python dict` with model input variable alias name and values. `fetch` assigns the prediction variables to be returned from servers. In the example, the name of `"x"` and `"price"` are assigned when the servable model is saved during training. - - -### WEB service - -Users can also put the data format processing logic on the server side, so that they can directly use curl to access the service, refer to the following case whose path is `python/examples/fit_a_line` - -``` -python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci -``` -for client side, -``` -curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction -``` -the response is -``` -{"result":{"price":[[18.901151657104492]]}} -``` -

Pipeline Service

- -Paddle Serving provides industry-leading multi-model tandem services, which strongly supports the actual operating business scenarios of major companies, please refer to [OCR word recognition](./python/examples/pipeline/ocr). - -we get two models -``` -python3 -m paddle_serving_app.package --get_model ocr_rec -tar -xzvf ocr_rec.tar.gz -python3 -m paddle_serving_app.package --get_model ocr_det -tar -xzvf ocr_det.tar.gz -``` -then we start server side, launch two models as one standalone web service -``` -python3 web_service.py -``` -http request -``` -python3 pipeline_http_client.py -``` -grpc request -``` -python3 pipeline_rpc_client.py -``` -output -``` -{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]} -``` - -

Stop Serving/Pipeline service

- -**Method one** :Ctrl+C to quit - -**Method Two** :In the path where starting the Serving/Pipeline service or the path which environment variable SERVING_HOME set (the file named ProcessInfo.json exists in this path) - -``` -python3 -m paddle_serving_server.serve stop -``` - -

Document

- -### New to Paddle Serving -- [How to save a servable model?](doc/SAVE.md) -- [Write Bert-as-Service in 10 minutes](doc/BERT_10_MINS.md) -- [Paddle Serving Examples](python/examples) -- [How to process natural data in Paddle Serving?(Chinese)](doc/PROCESS_DATA.md) -- [How to process level of detail(LOD)?](doc/LOD.md) - -### Developers -- [How to deploy Paddle Serving on K8S?(Chinese)](doc/PADDLE_SERVING_ON_KUBERNETES.md) -- [How to route Paddle Serving to secure endpoint?(Chinese)](doc/SERVING_AUTH_DOCKER.md) -- [How to develop a new Web Service?](doc/NEW_WEB_SERVICE.md) -- [Compile from source code](doc/COMPILE.md) -- [Develop Pipeline Serving](doc/PIPELINE_SERVING.md) -- [Deploy Web Service with uWSGI](doc/UWSGI_DEPLOY.md) -- [Hot loading for model file](doc/HOT_LOADING_IN_SERVING.md) -- [Paddle Serving uses TensorRT](doc/TENSOR_RT.md) - -### About Efficiency -- [How to profile Paddle Serving latency?](python/examples/util) -- [How to optimize performance?](doc/PERFORMANCE_OPTIM.md) -- [Deploy multi-services on one GPU(Chinese)](doc/MULTI_SERVICE_ON_ONE_GPU_CN.md) -- [GPU Benchmarks(Chinese)](doc/BENCHMARKING_GPU.md) - -### Design -- [Design Doc](doc/DESIGN_DOC.md) - -### FAQ -- [FAQ(Chinese)](doc/FAQ.md)

Community

+If you want to communicate with developers and other users, welcome to join us. You can join the community through any of the methods below. + +### WeChat +- WeChat users, please scan the QR code + +### QQ +- PaddlePaddle inference deployment QQ group (Group No.: 696965088) + ### Slack -To connect with other users and contributors, welcome to join our [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ) +- [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ) -### Contribution +> Contribution -If you want to contribute code to Paddle Serving, please reference [Contribution Guidelines](doc/CONTRIBUTE.md) +If you want to contribute code to Paddle Serving, please refer to the [Contribution Guidelines](doc/Contribute_EN.md) - Special Thanks to [@BeyondYourself](https://github.com/BeyondYourself) in complementing the gRPC tutorial, updating the FAQ doc and modifying the mkdir command - Special Thanks to [@mcl-stone](https://github.com/mcl-stone) in updating faster_rcnn benchmark - Special Thanks to [@cg82616424](https://github.com/cg82616424) in updating the unet benchmark and modifying resize comment error - Special Thanks to [@cuicheng01](https://github.com/cuicheng01) for providing 11 PaddleClas models -### Feedback +> Feedback For any feedback or to report a bug, please propose a [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues). -### License +> License [Apache 2.0 License](https://github.com/PaddlePaddle/Serving/blob/develop/LICENSE) diff --git a/README_CN.md b/README_CN.md old mode 100755 new mode 100644 index f766c57365bdebb665b1154fcdbadd1e4b8599e0..f80e62436b1faea245580a3a7f7b244ef60f195f --- a/README_CN.md +++ b/README_CN.md @@ -6,330 +6,139 @@

- -


- Build Status + Build Status + Docs + Release + Python + License + Forks + Issues + Contributors + Community - Release - Issues - License - Slack

+*** +Paddle Serving依托深度学习框架PaddlePaddle旨在帮助深度学习开发者和企业提供高性能、灵活易用的工业级在线推理服务。Paddle Serving支持RESTful、gRPC、bRPC等多种协议,提供多种异构硬件和多种操作系统环境下推理解决方案,和多种经典预训练模型示例。核心特性如下: -- [动机](./README_CN.md#动机) -- [教程](./README_CN.md#教程) -- [安装](./README_CN.md#安装) -- [快速开始示例](./README_CN.md#快速开始示例) -- [文档](README_CN.md#文档) -- [社区](README_CN.md#社区) - -

动机

- -Paddle Serving 旨在帮助深度学习开发者轻易部署在线预测服务。 **本项目目标**: 当用户使用 [Paddle](https://github.com/PaddlePaddle/Paddle) 训练了一个深度神经网络,就同时拥有了该模型的预测服务。 - -

Paddle Serving的核心功能

+- 集成高性能服务端推理引擎Paddle Inference和移动端引擎Paddle Lite,其他机器学习平台(Caffe/TensorFlow/ONNX/PyTorch)可通过[x2paddle](https://github.com/PaddlePaddle/X2Paddle)工具迁移模型 +- 具有高性能C++和高易用Python 2套框架。C++框架基于高性能bRPC网络框架打造高吞吐、低延迟的推理服务,性能领先竞品。Python框架基于gRPC/gRPC-Gateway网络框架和Python语言构建高易用、高吞吐推理服务框架。技术选型参考[技术选型](doc/Serving_Design_CN.md) +- 支持HTTP、gRPC、bRPC等多种[协议](doc/C++_Serving/Inference_Protocols_CN.md);提供C++、Python、Java语言SDK +- 设计并实现基于有向无环图(DAG)的异步流水线高性能推理框架,具有多模型组合、异步调度、并发推理、动态批量、多卡多流推理等特性 +- 适配x86(Intel) CPU、ARM CPU、Nvidia GPU、昆仑XPU等多种硬件;集成Intel MKLDNN、Nvidia TensorRT加速库,以及低精度和量化推理 +- 提供一套模型安全部署解决方案,包括加密模型部署、鉴权校验、HTTPS安全网关,并在实际项目中应用 +- 支持云端部署,提供百度智能云Kubernetes集群部署Paddle Serving案例 +- 提供丰富的经典预训练模型部署示例,如PaddleOCR、PaddleClas、PaddleDetection、PaddleSeg、PaddleNLP、PaddleRec等套件,共计40+个预训练精品模型,更多模型持续扩展 +- 支持大规模稀疏参数索引模型分布式部署,具有多表、多分片、多副本、本地高频cache等特性,可单机或云端部署 -- 与Paddle训练紧密连接,绝大部分Paddle模型可以 **一键部署**. -- 支持 **工业级的服务能力** 例如模型管理,在线加载,在线A/B测试等. -- 支持客户端和服务端之间 **高并发和高效通信**. -- 支持 **多种编程语言** 开发客户端,例如C++, Python和Java. -*** +

教程

-- 任何经过[PaddlePaddle](https://github.com/paddlepaddle/paddle)训练的模型,都可以经过直接保存或是[模型转换接口](./doc/SAVE_CN.md),用于Paddle Serving在线部署。 -- 支持[多模型串联服务部署](./doc/PIPELINE_SERVING_CN.md), 同时提供Rest接口和RPC接口以满足您的需求,[Pipeline示例](./python/examples/pipeline)。 -- 支持Paddle生态的各大模型库, 例如[PaddleDetection](./python/examples/detection),[PaddleOCR](./python/examples/ocr),[PaddleRec](https://github.com/PaddlePaddle/PaddleRec/tree/master/recserving/movie_recommender)。 -- 提供丰富多彩的前后处理,方便用户在训练、部署等各阶段复用相关代码,弥合AI开发者和应用开发者之间的鸿沟,详情参考[模型示例](./python/examples/)。 +- AIStudio教程-[Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945) +- 视频教程-[深度学习服务化部署-以互联网应用为例](https://aistudio.baidu.com/aistudio/course/introduce/19084)

-

教程

- -Paddle Serving开发者为您提供了简单易用的[AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945) - -教程提供了如下内容 - - - - -

安装

- -**强烈建议**您在**Docker内构建**Paddle Serving,请查看[如何在Docker中运行PaddleServing](doc/RUN_IN_DOCKER_CN.md)。更多镜像请查看[Docker镜像列表](doc/DOCKER_IMAGES_CN.md)。 +

文档

-**提示**:目前paddlepaddle 2.1版本的默认GPU环境是Cuda 10.2,因此GPU Docker的示例代码以Cuda 10.2为准。镜像和pip安装包也提供了其余GPU环境,用户如果使用其他环境,需要仔细甄别并选择合适的版本。 +*** -**提示**:本项目仅支持Python3.6/3.7/3.8,接下来所有的与Python/Pip相关的操作都需要选择正确的Python版本。 +> 部署 + +此章节引导您完成安装和部署步骤,强烈推荐使用Docker部署Paddle Serving,如您不使用docker,省略docker相关步骤。在云服务器上可以使用Kubernetes部署Paddle Serving。在异构硬件如ARM CPU、昆仑XPU上编译或使用Paddle Serving,可参考下面的文档。每天编译生成develop分支的最新开发包供开发者使用。 +- [使用docker安装Paddle Serving](doc/Install_CN.md) +- [源码编译安装Paddle Serving](doc/Compile_CN.md) +- [在Kubernetes集群上部署Paddle Serving](doc/Run_On_Kubernetes.md) +- [部署Paddle Serving安全网关](doc/Serving_Auth_Docker.md) +- [在异构硬件部署Paddle Serving](doc/Run_On_XPU_CN.md) +- [最新Wheel开发包](doc/Latest_Packages_CN.md)(develop分支每日更新) + +> 使用 + +安装Paddle Serving后,使用快速开始将引导您运行Serving。第一步,调用模型保存接口,生成模型参数配置文件(.prototxt)用以在客户端和服务端使用;第二步,阅读配置和启动参数并启动服务;第三步,根据API和您的使用场景,基于SDK编写客户端请求,并测试推理服务。您想了解更多特性的使用场景和方法,请详细阅读以下文档。 +- [快速开始](doc/Quick_Start_CN.md) +- [保存用于Paddle Serving的模型和配置](doc/SAVE_CN.md) +- [配置和启动参数的说明](doc/Serving_Configure_CN.md) +- [RESTful/gRPC/bRPC API指南](doc/C++_Serving/Http_Service_CN.md) +- [低精度推理](doc/Low_Precision_CN.md) +- [常见模型数据处理](doc/Process_data_CN.md) +- [C++ Serving简介](doc/C++_Serving/Introduction_CN.md) + - [协议](doc/C++_Serving/Inference_Protocols_CN.md) + - [模型热加载](doc/C++_Serving/Hot_Loading_CN.md) + - [A/B Test](doc/C++_Serving/ABTest_CN.md) + - [加密模型推理服务](doc/C++_Serving/Encryption_CN.md) + - [性能优化指南](doc/C++_Serving/Performance_Tuning_CN.md) + - [性能指标](doc/C++_Serving/Benchmark_CN.md) +- [Python Pipeline简介](doc/Python_Pipeline/Pipeline_Design_CN.md) + - [性能优化指南](doc/Python_Pipeline/Pipeline_Design_CN.md) + - [性能指标](doc/Python_Pipeline/Benchmark_CN.md) +- 客户端SDK + - [Python SDK](doc/C++_Serving/Http_Service_CN.md) + - [JAVA SDK](doc/Java_SDK_CN.md) + - [C++ SDK](doc/C++_Serving/Creat_C++Serving_CN.md) +- [大规模稀疏参数索引服务](doc/Cube_Local_CN.md) + +> 开发者 + +为Paddle Serving开发者,提供自定义OP、变长数据(LOD)处理等扩展文档。 +- [自定义OP](doc/C++_Serving/OP_CN.md) +- [变长数据(LOD)处理](doc/LOD_CN.md) +- [常见问答](doc/FAQ_CN.md) + +

模型库

+ +Paddle Serving与Paddle模型套件紧密配合,实现大量服务化部署,包括图像分类、物体检测、语言文本识别、中文词性、情感分析、内容推荐等多种类型示例,以及Paddle全链条项目,共计42个模型。 +
+ +| PaddleOCR | PaddleDetection | PaddleClas | PaddleSeg | PaddleRec | Paddle NLP | +| :----: | :----: | :----: | :----: | :----: | :----: | +| 8 | 12 | 13 | 2 | 3 | 4 | -``` -# 启动 CPU Docker -docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-devel -docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-devel bash -docker exec -it test bash -git clone https://github.com/PaddlePaddle/Serving -``` -``` -# 启动 GPU Docker -nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel -nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel bash -nvidia-docker exec -it test bash -git clone https://github.com/PaddlePaddle/Serving -``` - -安装所需的pip依赖 -``` -cd Serving -pip3 install -r python/requirements.txt -``` - -```shell -pip3 install paddle-serving-client==0.6.2 -pip3 install paddle-serving-server==0.6.2 # CPU -pip3 install paddle-serving-app==0.6.2 -pip3 install paddle-serving-server-gpu==0.6.2.post102 #GPU with CUDA10.2 + TensorRT7 -# 其他GPU环境需要确认环境再选择执行哪一条 -pip3 install paddle-serving-server-gpu==0.6.2.post101 # GPU with CUDA10.1 + TensorRT6 -pip3 install paddle-serving-server-gpu==0.6.2.post11 # GPU with CUDA10.1 + TensorRT7 -``` - -您可能需要使用国内镜像源(例如清华源, 在pip命令中添加`-i https://pypi.tuna.tsinghua.edu.cn/simple`)来加速下载。 - -如果需要使用develop分支编译的安装包,请从[最新安装包列表](./doc/LATEST_PACKAGES.md)中获取下载地址进行下载,使用`pip install`命令进行安装。如果您想自行编译,请参照[Paddle Serving编译文档](./doc/COMPILE_CN.md)。 - -paddle-serving-server和paddle-serving-server-gpu安装包支持Centos 6/7, Ubuntu 16/18和Windows 10。 - -paddle-serving-client和paddle-serving-app安装包支持Linux和Windows,其中paddle-serving-client仅支持python3.6/3.7/3.8。 - -**最新的0.6.2的版本,已经不支持Cuda 9.0和Cuda 10.0,Python已不支持2.7和3.5。** - -推荐安装2.1.0及以上版本的paddle - -``` -# CPU环境请执行 -pip3 install paddlepaddle==2.1.0 - -# GPU Cuda10.2环境请执行 -pip3 install paddlepaddle-gpu==2.1.0 -``` - -**注意**: 如果您的Cuda版本不是10.2,请勿直接执行上述命令,需要参考[Paddle官方文档-多版本whl包列表](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-release) - -选择相应的GPU环境的url链接并进行安装,例如Cuda 10.1的Python3.6用户,请选择表格当中的`cp36-cp36m`和`cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`对应的url,复制下来并执行 -``` -pip3 install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl -``` -由于默认的`paddlepaddle-gpu==2.1.0`是Cuda 10.2,并没有联编TensorRT,因此如果需要和在`paddlepaddle-gpu`上使用TensorRT,需要在上述多版本whl包列表当中,找到`cuda10.2-cudnn8.0-trt7.1.3`,下载对应的Python版本。更多信息请参考[如何使用TensorRT?](doc/TENSOR_RT_CN.md)。 - -如果是其他环境和Python版本,请在表格中找到对应的链接并用pip安装。 - -对于**Windows 10 用户**,请参考文档[Windows平台使用Paddle Serving指导](./doc/WINDOWS_TUTORIAL_CN.md)。 - - -

快速开始示例

- -这个快速开始示例主要是为了给那些已经有一个要部署的模型的用户准备的,而且我们也提供了一个可以用来部署的模型。如果您想知道如何从离线训练到在线服务走完全流程,请参考前文的AiStudio教程。 - -

波士顿房价预测

- -进入到Serving的git目录下,进入到`fit_a_line`例子 -``` shell -cd Serving/python/examples/fit_a_line -sh get_data.sh -``` - -Paddle Serving 为用户提供了基于 HTTP 和 RPC 的服务 - -

RPC服务

- -用户还可以使用`paddle_serving_server.serve`启动RPC服务。 尽管用户需要基于Paddle Serving的python客户端API进行一些开发,但是RPC服务通常比HTTP服务更快。需要指出的是这里我们没有指定`--name`。 - -``` shell -python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 -``` -
- -| Argument | Type | Default | Description | -| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- | -| `thread` | int | `2` | Number of brpc service thread | -| `runtime_thread_num` | int[]| `0` | Thread Number for each model in asynchronous mode | -| `batch_infer_size` | int[]| `32` | Batch Number for each model in asynchronous mode | -| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model | -| `port` | int | `9292` | Exposed port of current service to users | -| `model` | str[]| `""` | Path of paddle model directory to be served | -| `mem_optim_off` | - | - | Disable memory / graphic memory optimization | -| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph | -| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL | -| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT | -| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference | -| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU | -| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 | -| `use_calib` | bool | False | Use TRT int8 calibration | -| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS | - -#### 异步模型的说明 - 异步模式适用于1、请求数量非常大的情况,2、多模型串联,想要分别指定每个模型的并发数的情况。 - 异步模式有助于提高Service服务的吞吐(QPS),但对于单次请求而言,时延会有少量增加。 - 异步模式中,每个模型会启动您指定个数的N个线程,每个线程中包含一个模型实例,换句话说每个模型相当于包含N个线程的线程池,从线程池的任务队列中取任务来执行。 - 异步模式中,各个RPC Server的线程只负责将Request请求放入模型线程池的任务队列中,等任务被执行完毕后,再从任务队列中取出已完成的任务。 - 上表中通过 --thread 10 指定的是RPC Server的线程数量,默认值为2,--runtime_thread_num 指定的是各个模型的线程池中线程数N,默认值为0,表示不使用异步模式。 - --batch_infer_size 指定的各个模型的batch数量,默认值为32,该参数只有当--runtime_thread_num不为0时才生效。 - -#### 当您的某个模型想使用多张GPU卡部署时. -python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2 -#### 当您的一个服务包含两个模型部署时. -python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 -#### 当您的一个服务包含两个模型,且每个模型都需要指定多张GPU卡部署时. -python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 -#### 当您的一个服务包含两个模型,且每个模型都需要指定多张GPU卡,且需要异步模式每个模型指定不同的并发数时. -python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --runtime_thread_num 4 8 +
+更多模型示例参考Repo,可进入[模型库](doc/Model_Zoo_CN.md) +
+ + + +
-``` python -# A user can visit rpc service through paddle_serving_client API -from paddle_serving_client import Client - -client = Client() -client.load_client_config("uci_housing_client/serving_client_conf.prototxt") -client.connect(["127.0.0.1:9292"]) -data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, - -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332] -fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"]) -print(fetch_map) - -``` -在这里,`client.predict`函数具有两个参数。 `feed`是带有模型输入变量别名和值的`python dict`。 `fetch`被要从服务器返回的预测变量赋值。 在该示例中,在训练过程中保存可服务模型时,被赋值的tensor名为`"x"`和`"price"`。 - - -

HTTP服务

- -用户也可以将数据格式处理逻辑放在服务器端进行,这样就可以直接用curl去访问服务,参考如下案例,在目录`python/examples/fit_a_line`. - -``` -python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci -``` -客户端输入 -``` -curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction -``` -返回结果 -``` -{"result":{"price":[[18.901151657104492]]}} -``` - -

Pipeline服务

- -Paddle Serving提供业界领先的多模型串联服务,强力支持各大公司实际运行的业务场景,参考 [OCR文字识别案例](python/examples/pipeline/ocr),在目录`python/examples/pipeline/ocr` - -我们先获取两个模型 -``` -python3 -m paddle_serving_app.package --get_model ocr_rec -tar -xzvf ocr_rec.tar.gz -python3 -m paddle_serving_app.package --get_model ocr_det -tar -xzvf ocr_det.tar.gz -``` -然后启动服务端程序,将两个串联的模型作为一个整体的服务。 -``` -python3 web_service.py -``` -最终使用http的方式请求 -``` -python3 pipeline_http_client.py -``` -也支持rpc的方式 -``` -python3 pipeline_rpc_client.py -``` -输出 -``` -{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]} -``` - -

关闭Serving/Pipeline服务

- -**方式一** :Ctrl+C关停服务 - -**方式二** :在启动Serving/Pipeline服务路径或者环境变量SERVING_HOME路径下(该路径下存在文件ProcessInfo.json) - -``` -python3 -m paddle_serving_server.serve stop -``` +

社区

-

文档

-### 新手教程 -- [怎样保存用于Paddle Serving的模型?](doc/SAVE_CN.md) -- [十分钟构建Bert-As-Service](doc/BERT_10_MINS_CN.md) -- [Paddle Serving示例合辑](python/examples) -- [如何在Paddle Serving处理常见数据类型](doc/PROCESS_DATA.md) -- [如何在Serving上处理level of details(LOD)?](doc/LOD_CN.md) - -### 开发者教程 -- [如何开发一个新的Web Service?](doc/NEW_WEB_SERVICE_CN.md) -- [如何编译PaddleServing?](doc/COMPILE_CN.md) -- [如何开发Pipeline?](doc/PIPELINE_SERVING_CN.md) -- [如何在K8S集群上部署Paddle Serving?](doc/PADDLE_SERVING_ON_KUBERNETES.md) -- [如何在Paddle Serving上部署安全网关?](doc/SERVING_AUTH_DOCKER.md) -- [如何开发Pipeline?](doc/PIPELINE_SERVING_CN.md) -- [如何使用uWSGI部署Web Service](doc/UWSGI_DEPLOY_CN.md) -- [如何实现模型文件热加载](doc/HOT_LOADING_IN_SERVING_CN.md) -- [如何使用TensorRT?](doc/TENSOR_RT_CN.md) - -### 关于Paddle Serving性能 -- [如何测试Paddle Serving性能?](python/examples/util/) -- [如何优化性能?](doc/PERFORMANCE_OPTIM_CN.md) -- [在一张GPU上启动多个预测服务](doc/MULTI_SERVICE_ON_ONE_GPU_CN.md) -- [GPU版Benchmarks](doc/BENCHMARKING_GPU.md) - -### 设计文档 -- [Paddle Serving设计文档](doc/DESIGN_DOC_CN.md) - -### FAQ -- [常见问答](doc/FAQ.md) +您想要同开发者和其他用户沟通吗?欢迎加入我们,通过如下方式加入社群 -

社区

+### 微信 +- 微信用户请扫码 + +### QQ +- 飞桨推理部署交流群(群号:696965088) ### Slack +- [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ) -想要同开发者和其他用户沟通吗?欢迎加入我们的 [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ) -### 贡献代码 +> 贡献代码 -如果您想为Paddle Serving贡献代码,请参考 [Contribution Guidelines](doc/CONTRIBUTE.md) +如果您想为Paddle Serving贡献代码,请参考 [Contribution Guidelines](doc/Contribute.md) - 特别感谢 [@BeyondYourself](https://github.com/BeyondYourself) 提供grpc教程,更新FAQ教程,整理文件目录。 - 特别感谢 [@mcl-stone](https://github.com/mcl-stone) 提供faster rcnn benchmark脚本 - 特别感谢 [@cg82616424](https://github.com/cg82616424) 提供unet benchmark脚本和修改部分注释错误 - 特别感谢 [@cuicheng01](https://github.com/cuicheng01) 提供PaddleClas的11个模型 -### 反馈 +> 反馈 如有任何反馈或是bug,请在 [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues)提交 -### License +> License [Apache 2.0 License](https://github.com/PaddlePaddle/Serving/blob/develop/LICENSE) diff --git a/doc/C++Serving/ABTest_CN.md b/doc/C++_Serving/ABTest_CN.md similarity index 100% rename from doc/C++Serving/ABTest_CN.md rename to doc/C++_Serving/ABTest_CN.md diff --git a/doc/C++Serving/ABTest_EN.md b/doc/C++_Serving/ABTest_EN.md similarity index 100% rename from doc/C++Serving/ABTest_EN.md rename to doc/C++_Serving/ABTest_EN.md diff --git a/doc/C++Serving/Benchmark_CN.md b/doc/C++_Serving/Benchmark_CN.md similarity index 100% rename from doc/C++Serving/Benchmark_CN.md rename to doc/C++_Serving/Benchmark_CN.md diff --git a/doc/C++Serving/Client_Configure_CN.md b/doc/C++_Serving/Client_Configure_CN.md similarity index 100% rename from doc/C++Serving/Client_Configure_CN.md rename to doc/C++_Serving/Client_Configure_CN.md diff --git a/doc/C++Serving/Creat_C++Serving_CN.md b/doc/C++_Serving/Creat_C++Serving_CN.md similarity index 99% rename from doc/C++Serving/Creat_C++Serving_CN.md rename to doc/C++_Serving/Creat_C++Serving_CN.md index 933385921fe6433d0a488a91659b405ae97fadc8..075c38d310ee264dbd2d3b08655cc86c53a9ac28 100755 --- a/doc/C++Serving/Creat_C++Serving_CN.md +++ b/doc/C++_Serving/Creat_C++Serving_CN.md @@ -75,7 +75,7 @@ service ImageClassifyService { #### 2.2.2 示例配置 -关于Serving端的配置的详细信息,可以参考[Serving端配置](../SERVING_CONFIGURE_CN.md) +关于Serving端的配置的详细信息,可以参考[Serving端配置](../Serving_Configure_CN.md) 以下配置文件将ReaderOP, ClassifyOP和WriteJsonOP串联成一个workflow (关于OP/workflow等概念,可参考[OP介绍](OP_CN.md)和[DAG介绍](DAG_CN.md)) diff --git a/doc/C++Serving/DAG_CN.md b/doc/C++_Serving/DAG_CN.md similarity index 100% rename from doc/C++Serving/DAG_CN.md rename to doc/C++_Serving/DAG_CN.md diff --git a/doc/C++Serving/DAG_EN.md b/doc/C++_Serving/DAG_EN.md similarity index 100% rename from doc/C++Serving/DAG_EN.md rename to doc/C++_Serving/DAG_EN.md diff --git a/doc/C++Serving/Encryption_CN.md b/doc/C++_Serving/Encryption_CN.md similarity index 100% rename from doc/C++Serving/Encryption_CN.md rename to doc/C++_Serving/Encryption_CN.md diff --git a/doc/C++Serving/Encryption_EN.md b/doc/C++_Serving/Encryption_EN.md similarity index 100% rename from doc/C++Serving/Encryption_EN.md rename to doc/C++_Serving/Encryption_EN.md diff --git a/doc/C++Serving/Frame_Performance_CN.md b/doc/C++_Serving/Frame_Performance_CN.md similarity index 100% rename from doc/C++Serving/Frame_Performance_CN.md rename to doc/C++_Serving/Frame_Performance_CN.md diff --git a/doc/C++Serving/Hot_Loading_CN.md b/doc/C++_Serving/Hot_Loading_CN.md similarity index 100% rename from doc/C++Serving/Hot_Loading_CN.md rename to doc/C++_Serving/Hot_Loading_CN.md diff --git a/doc/C++Serving/Hot_Loading_EN.md b/doc/C++_Serving/Hot_Loading_EN.md 
similarity index 100% rename from doc/C++Serving/Hot_Loading_EN.md rename to doc/C++_Serving/Hot_Loading_EN.md diff --git a/doc/C++Serving/Http_Service_CN.md b/doc/C++_Serving/Http_Service_CN.md similarity index 100% rename from doc/C++Serving/Http_Service_CN.md rename to doc/C++_Serving/Http_Service_CN.md diff --git a/doc/C++Serving/Introduction_CN.md b/doc/C++_Serving/Introduction_CN.md similarity index 99% rename from doc/C++Serving/Introduction_CN.md rename to doc/C++_Serving/Introduction_CN.md index 35dee0d5d49f4a006fe4c20b8a9c186fc68d49ad..a8ddb8d398ecefa5d981fe4338212fe0eacac688 100755 --- a/doc/C++Serving/Introduction_CN.md +++ b/doc/C++_Serving/Introduction_CN.md @@ -32,7 +32,7 @@ Server端的核心是一个由项目代码编译产生的名称为serving的二 为了方便用户快速的启动C++ Serving的Server端,除了用户自行修改配置文件并通过命令行传参运行serving二进制可执行文件以外,我们也提供了另外一种通过python脚本启动的方式。python脚本启动本质上仍是运行serving二进制可执行文件,但python脚本中会自动完成两件事:1、配置文件的生成;2、根据需要配置的参数,生成命令行,通过命令行的方式,传入参数信息并运行serving二进制可执行文件。 -更多详细说明和示例,请参考[C++ Serving 参数配置和启动的详细说明](../SERVING_CONFIGURE_CN.md)。 +更多详细说明和示例,请参考[C++ Serving 参数配置和启动的详细说明](../Serving_Configure_CN.md)。 ### 3.2 同步/异步模式 同步模式比较简单直接,适用于模型预测时间短,单个Request请求的batch已经比较大的情况。 diff --git a/doc/C++Serving/Model_Ensemble_CN.md b/doc/C++_Serving/Model_Ensemble_CN.md similarity index 100% rename from doc/C++Serving/Model_Ensemble_CN.md rename to doc/C++_Serving/Model_Ensemble_CN.md diff --git a/doc/C++Serving/Model_Ensemble_EN.md b/doc/C++_Serving/Model_Ensemble_EN.md similarity index 100% rename from doc/C++Serving/Model_Ensemble_EN.md rename to doc/C++_Serving/Model_Ensemble_EN.md diff --git a/doc/C++Serving/OP_CN.md b/doc/C++_Serving/OP_CN.md similarity index 100% rename from doc/C++Serving/OP_CN.md rename to doc/C++_Serving/OP_CN.md diff --git a/doc/C++Serving/OP_EN.md b/doc/C++_Serving/OP_EN.md similarity index 100% rename from doc/C++Serving/OP_EN.md rename to doc/C++_Serving/OP_EN.md diff --git a/doc/C++Serving/Performance_Tuning_CN.md b/doc/C++_Serving/Performance_Tuning_CN.md similarity index 100% rename from doc/C++Serving/Performance_Tuning_CN.md rename to doc/C++_Serving/Performance_Tuning_CN.md diff --git a/doc/COMPILE_CN.md b/doc/Compile_CN.md similarity index 99% rename from doc/COMPILE_CN.md rename to doc/Compile_CN.md index 89178cee78746013915fc416b212b5a49f6762c2..fca4627cc40f08227ce2628841dc6da9b3eddebd 100644 --- a/doc/COMPILE_CN.md +++ b/doc/Compile_CN.md @@ -1,6 +1,6 @@ # 如何编译PaddleServing -(简体中文|[English](./COMPILE.md)) +(简体中文|[English](./Compile_EN.md)) ## 编译环境设置 diff --git a/doc/COMPILE.md b/doc/Compile_EN.md similarity index 98% rename from doc/COMPILE.md rename to doc/Compile_EN.md index ea7f53b2ed1704777b611a58c3a8d971d48eb312..e88887e84514d0b650c087e816b00c351c9cc93a 100644 --- a/doc/COMPILE.md +++ b/doc/Compile_EN.md @@ -1,6 +1,6 @@ # How to compile PaddleServing -([简体中文](./COMPILE_CN.md)|English) +([简体中文](./Compile_CN.md)|English) ## Compilation environment requirements @@ -23,7 +23,7 @@ | libSM | 1.2.2 | | libXrender | 0.9.10 | -It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](DOCKER_IMAGES.md). +It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](Docker_Images_EN.md). ## Get Code @@ -159,8 +159,7 @@ cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \ -DSERVER=ON .. 
make -j10 ``` - -**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md#Notes). +**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](Compile_EN.md#Notes). ## Compile Client diff --git a/doc/CONTRIBUTE.md b/doc/Contribute_EN.md similarity index 98% rename from doc/CONTRIBUTE.md rename to doc/Contribute_EN.md index 2bb909b43ad0f865d2b2fa25d371ee04ce354ff8..6a8e5841d7c28cbdcc37a21b74a362caf152b3ab 100644 --- a/doc/CONTRIBUTE.md +++ b/doc/Contribute_EN.md @@ -68,7 +68,7 @@ Paddle Serving uses this [Git branching model](http://nvie.com/posts/a-successfu 1. Build and test - Users can build Paddle Serving natively on Linux, see the [BUILD steps](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md). + Users can build Paddle Serving natively on Linux, see the [BUILD steps](Compile_EN.md). 1. Keep pulling diff --git a/doc/CUBE_LOCAL_CN.md b/doc/Cube_Local_CN.md similarity index 96% rename from doc/CUBE_LOCAL_CN.md rename to doc/Cube_Local_CN.md index ea4c00c9f7e593dd40465c2ceff3bebd2ec299fb..136641e9e8b2f55352023d5ccf3775797283296d 100644 --- a/doc/CUBE_LOCAL_CN.md +++ b/doc/Cube_Local_CN.md @@ -1,14 +1,14 @@ # 稀疏参数索引服务Cube单机版使用指南 -(简体中文|[English](./CUBE_LOCAL.md)) +(简体中文|[English](./Cube_Local_EN.md)) ## 引言 在python/examples下有两个关于CTR的示例,他们分别是criteo_ctr, criteo_ctr_with_cube。前者是在训练时保存整个模型,包括稀疏参数。后者是将稀疏参数裁剪出来,保存成两个部分,一个是稀疏参数,另一个是稠密参数。由于在工业级的场景中,稀疏参数的规模非常大,达到10^9数量级。因此在一台机器上启动大规模稀疏参数预测是不实际的,因此我们引入百度多年来在稀疏参数索引领域的工业级产品Cube,提供分布式的稀疏参数服务。 - + -本文档使用的都是未经过任何压缩算法处理的原始模型,如果有量化模型上线需求,请阅读[Cube稀疏参数索引量化存储使用指南](./CUBE_QUANT_CN.md) +本文档使用的都是未经过任何压缩算法处理的原始模型,如果有量化模型上线需求,请阅读[Cube稀疏参数索引量化存储使用指南](./Cube_Quant_CN.md) ## 示例 diff --git a/doc/CUBE_LOCAL.md b/doc/Cube_Local_EN.md similarity index 97% rename from doc/CUBE_LOCAL.md rename to doc/Cube_Local_EN.md index def736230173018ce151dd0f85f1e8b4e15099fe..030a17eb85747c91dfdedce00ec3040fc8f2e02d 100644 --- a/doc/CUBE_LOCAL.md +++ b/doc/Cube_Local_EN.md @@ -1,15 +1,15 @@ # Cube: Sparse Parameter Indexing Service (Local Mode) -([简体中文](./CUBE_LOCAL_CN.md)|English) +([简体中文](./Cube_Local_CN.md)|English) ## Overview There are two examples on CTR under python / examples, they are criteo_ctr, criteo_ctr_with_cube. The former is to save the entire model during training, including sparse parameters. The latter is to cut out the sparse parameters and save them into two parts, one is the sparse parameter and the other is the dense parameter. Because the scale of sparse parameters is very large in industrial cases, reaching the order of 10 ^ 9. Therefore, it is not practical to start large-scale sparse parameter prediction on one machine. Therefore, we introduced Baidu's industrial-grade product Cube to provide the sparse parameter service for many years to provide distributed sparse parameter services. The local mode of Cube is different from distributed Cube, which is designed to be convenient for developers to use in experiments and demos. - + -This document uses the original model without any compression algorithm. If there is a need for a quantitative model to go online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md) +This document uses the original model without any compression algorithm. 
If there is a need for a quantitative model to go online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md) ## Example in directory python/example/criteo_ctr_with_cube, run diff --git a/doc/CUBE_QUANT_CN.md b/doc/Cube_Quant_CN.md similarity index 97% rename from doc/CUBE_QUANT_CN.md rename to doc/Cube_Quant_CN.md index d8c66968c633708742c636a020ceec905588d20b..d0d1b75802838a3e774f01e43eff7374e9b58595 100644 --- a/doc/CUBE_QUANT_CN.md +++ b/doc/Cube_Quant_CN.md @@ -1,6 +1,6 @@ # Cube稀疏参数索引量化存储使用指南 -(简体中文|[English](./CUBE_QUANT.md)) +(简体中文|[English](./Cube_Quant_EN.md)) ## 总体概览 @@ -9,7 +9,7 @@ ## 前序要求 -请先读取 [稀疏参数索引服务Cube单机版使用指南](./CUBE_LOCAL_CN.md) +请先读取 [稀疏参数索引服务Cube单机版使用指南](./Cube_Local_CN.md) ## 组件介绍 diff --git a/doc/CUBE_QUANT.md b/doc/Cube_Quant_EN.md similarity index 97% rename from doc/CUBE_QUANT.md rename to doc/Cube_Quant_EN.md index 870b49fcf0e72b9aba0729fdf762b67e2a7004e1..a2ddea099e9731eb7a03188512687a833c7adfe3 100644 --- a/doc/CUBE_QUANT.md +++ b/doc/Cube_Quant_EN.md @@ -1,6 +1,6 @@ # Quantization Storage on Cube Sparse Parameter Indexing -([简体中文](./CUBE_QUANT_CN.md)|English) +([简体中文](./Cube_Quant_CN.md)|English) ## Overview @@ -8,7 +8,7 @@ In our previous article, we know that the sparse parameter is a series of floati ## Precondition -Please Read [Cube: Sparse Parameter Indexing Service (Local Mode)](./CUBE_LOCAL_CN.md) +Please Read [Cube: Sparse Parameter Indexing Service (Local Mode)](./Cube_Local_EN.md) ## Components diff --git a/doc/CUBE_TEST_CN.md b/doc/Cube_Test_CN.md similarity index 96% rename from doc/CUBE_TEST_CN.md rename to doc/Cube_Test_CN.md index c9e8c23ca3be43390ffd959d83c456cf47722056..ae2b68b0e30fcba3c595f5ed5f5b439f5297afa5 100644 --- a/doc/CUBE_TEST_CN.md +++ b/doc/Cube_Test_CN.md @@ -13,7 +13,7 @@ ### 预备知识 -- 需要会编译Paddle Serving,参见[编译文档](./COMPILE.md) +- 需要会编译Paddle Serving,参见[编译文档](./Compile_EN.md) ### 用法 diff --git a/doc/DOCKER_IMAGES_CN.md b/doc/Docker_Images_CN.md similarity index 99% rename from doc/DOCKER_IMAGES_CN.md rename to doc/Docker_Images_CN.md index 9446bbf5272679c00d05b102d1927e1030321b9c..092d9fb4a6c420e394841795247767b9924849fb 100644 --- a/doc/DOCKER_IMAGES_CN.md +++ b/doc/Docker_Images_CN.md @@ -1,6 +1,6 @@ # Docker 镜像 -(简体中文|[English](DOCKER_IMAGES.md)) +(简体中文|[English](Docker_Images_EN.md)) 该文档维护了 Paddle Serving 提供的镜像列表。 diff --git a/doc/DOCKER_IMAGES.md b/doc/Docker_Images_EN.md similarity index 99% rename from doc/DOCKER_IMAGES.md rename to doc/Docker_Images_EN.md index 0e38e668752a06a768cd5abdec8ec7c7aa142960..ad64570830b854bb4affd192dbebb9303ce9a5e7 100644 --- a/doc/DOCKER_IMAGES.md +++ b/doc/Docker_Images_EN.md @@ -1,6 +1,6 @@ # Docker Images -([简体中文](DOCKER_IMAGES_CN.md)|English) +([简体中文](Docker_Images_CN.md)|English) This document maintains a list of docker images provided by Paddle Serving. 
diff --git a/doc/FAQ.md b/doc/FAQ_CN.md similarity index 96% rename from doc/FAQ.md rename to doc/FAQ_CN.md index cdad1a3dda5339aa1fac55a223a5e3a38f33d031..18712c1cde16ffdef31c0598ff5a53b2a339e027 100644 --- a/doc/FAQ.md +++ b/doc/FAQ_CN.md @@ -142,7 +142,7 @@ make: *** [all] Error 2 #### Q:使用过程中出现CXXABI错误。 -这个问题出现的原因是Python使用的gcc版本和Serving所需的gcc版本对不上。对于Docker用户,推荐使用[Docker容器](./RUN_IN_DOCKER_CN.md),由于Docker容器内的Python版本与Serving在发布前都做过适配,这样就不会出现类似的错误。如果是其他开发环境,首先需要确保开发环境中具备GCC 8.2,如果没有gcc 8.2,参考安装方式 +这个问题出现的原因是Python使用的gcc版本和Serving所需的gcc版本对不上。对于Docker用户,推荐使用[Docker容器](./Run_In_Docker_CN.md),由于Docker容器内的Python版本与Serving在发布前都做过适配,这样就不会出现类似的错误。如果是其他开发环境,首先需要确保开发环境中具备GCC 8.2,如果没有gcc 8.2,参考安装方式 ```bash wget -q https://paddle-ci.gz.bcebos.com/gcc-8.2.0.tar.xz @@ -198,7 +198,7 @@ wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \ (1)Cuda显卡驱动:文件名通常为 `libcuda.so.$DRIVER_VERSION` 例如驱动版本为440.10.15,文件名就是`libcuda.so.440.10.15`。 -(2)Cuda和Cudnn动态库:文件名通常为 `libcudart.so.$CUDA_VERSION`,和 `libcudnn.so.$CUDNN_VERSION`。例如Cuda9就是 `libcudart.so.9.0`,Cudnn7就是 `libcudnn.so.7`。Cuda和Cudnn与Serving的版本匹配参见[Serving所有镜像列表](DOCKER_IMAGES_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8). +(2)Cuda和Cudnn动态库:文件名通常为 `libcudart.so.$CUDA_VERSION`,和 `libcudnn.so.$CUDNN_VERSION`。例如Cuda9就是 `libcudart.so.9.0`,Cudnn7就是 `libcudnn.so.7`。Cuda和Cudnn与Serving的版本匹配参见[Serving所有镜像列表](Docker_Images_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8). (3) Cuda10.1及更高版本需要TensorRT。安装TensorRT相关文件的脚本参考 [install_trt.sh](../tools/dockerfiles/build_scripts/install_trt.sh). @@ -232,15 +232,15 @@ InvalidArgumentError: Device id must be less than GPU count, but received id is: #### Q: 目前Paddle Serving支持哪些镜像环境? -**A:** 目前(0.4.0)仅支持CentOS,具体列表查阅[这里](https://github.com/PaddlePaddle/Serving/blob/develop/doc/DOCKER_IMAGES.md) +**A:** 目前(0.4.0)仅支持CentOS,具体列表查阅[这里](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md) #### Q: python编译的GCC版本与serving的版本不匹配 -**A:**:1)使用[GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/RUN_IN_DOCKER.md#gpunvidia-docker)解决环境问题;2)修改anaconda的虚拟环境下安装的python的gcc版本[改变python的GCC编译环境](https://www.jianshu.com/p/c498b3d86f77) +**A:**:1)使用[GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Run_In_Docker_CN.md#gpunvidia-docker)解决环境问题;2)修改anaconda的虚拟环境下安装的python的gcc版本[改变python的GCC编译环境](https://www.jianshu.com/p/c498b3d86f77) #### Q: paddle-serving是否支持本地离线安装 -**A:** 支持离线部署,需要把一些相关的[依赖包](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md)提前准备安装好 +**A:** 支持离线部署,需要把一些相关的[依赖包](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Compile_CN.md)提前准备安装好 #### Q: Docker中启动server IP地址 127.0.0.1 与 0.0.0.0 差异 **A:** 您必须将容器的主进程设置为绑定到特殊的 0.0.0.0 “所有接口”地址,否则它将无法从容器外部访问。在Docker中 127.0.0.1 代表“这个容器”,而不是“这台机器”。如果您从容器建立到 127.0.0.1 的出站连接,它将返回到同一个容器;如果您将服务器绑定到 127.0.0.1,接收不到来自外部的连接。 @@ -280,7 +280,7 @@ client.connect(["127.0.0.1:9393"]) #### Q: 如何使用多语言客户端 -**A:** 多语言客户端要与多语言服务端配套使用。当前版本下(0.4.0),服务端需要将Server改为MultiLangServer(如果是以命令行启动的话只需要添加--use_multilang参数),Python客户端需要将Client改为MultiLangClient,同时去除load_client_config的过程。[Java客户端参考文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/JAVA_SDK_CN.md) +**A:** 多语言客户端要与多语言服务端配套使用。当前版本下(0.4.0),服务端需要将Server改为MultiLangServer(如果是以命令行启动的话只需要添加--use_multilang参数),Python客户端需要将Client改为MultiLangClient,同时去除load_client_config的过程。[Java客户端参考文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Java_SDK_CN.md) #### Q: 如何在Windows下使用Paddle Serving diff --git 
a/doc/deprecated/IMDB_GO_CLIENT_CN.md b/doc/Imdb_GO_Client_CN.md similarity index 98% rename from doc/deprecated/IMDB_GO_CLIENT_CN.md rename to doc/Imdb_GO_Client_CN.md index 5067d1ef79218d176aee0c0d7d41506a0b6dc428..84a7fcad2c0c3e9a2e1a87f5a3379fbae9dcb531 100644 --- a/doc/deprecated/IMDB_GO_CLIENT_CN.md +++ b/doc/Imdb_GO_Client_CN.md @@ -1,6 +1,6 @@ # 如何在Paddle Serving使用Go Client -(简体中文|[English](./IMDB_GO_CLIENT.md)) +(简体中文|[English](./Imdb_GO_Client_EN.md)) 本文档说明了如何将Go用作客户端语言。对于Paddle Serving中的Go客户端,提供了一个简单的客户端程序包https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, 用户可以根据需要引用该程序包。这是一个基于IMDB数据集的情感分析任务的简单示例。 diff --git a/doc/deprecated/IMDB_GO_CLIENT.md b/doc/Imdb_GO_Client_EN.md similarity index 98% rename from doc/deprecated/IMDB_GO_CLIENT.md rename to doc/Imdb_GO_Client_EN.md index a9f610cfb154548ffe6f89820c1f61b114303351..da241936af8c6ea2859d7ea9ae660b9775f43eb9 100644 --- a/doc/deprecated/IMDB_GO_CLIENT.md +++ b/doc/Imdb_GO_Client_EN.md @@ -1,6 +1,6 @@ # How to use Go Client of Paddle Serving -([简体中文](./IMDB_GO_CLIENT_CN.md)|English) +([简体中文](./Imdb_GO_Client_CN.md)|English) This document shows how to use Go as your client language. For Go client in Paddle Serving, a simple client package is provided https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, a user can import this package as needed. Here is a simple example of sentiment analysis task based on IMDB dataset. diff --git a/doc/Install_CN.md b/doc/Install_CN.md new file mode 100644 index 0000000000000000000000000000000000000000..9e6fb07d8b73f4e9795a8b2de547c8762c87e8b0 --- /dev/null +++ b/doc/Install_CN.md @@ -0,0 +1,72 @@ +# 使用Docker安装Paddle Serving + +(简体中文|[English](./Install_EN.md)) + +**强烈建议**您在**Docker内构建**Paddle Serving,请查看[如何在Docker中运行PaddleServing](Run_In_Docker_CN.md)。更多镜像请查看[Docker镜像列表](Docker_Images_CN.md)。 + +**提示**:目前paddlepaddle 2.1版本的默认GPU环境是Cuda 10.2,因此GPU Docker的示例代码以Cuda 10.2为准。镜像和pip安装包也提供了其余GPU环境,用户如果使用其他环境,需要仔细甄别并选择合适的版本。 + +**提示**:本项目仅支持Python3.6/3.7/3.8,接下来所有的与Python/Pip相关的操作都需要选择正确的Python版本。 + +``` +# 启动 CPU Docker +docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-devel +docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-devel bash +docker exec -it test bash +git clone https://github.com/PaddlePaddle/Serving +``` +``` +# 启动 GPU Docker +nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel +nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel bash +nvidia-docker exec -it test bash +git clone https://github.com/PaddlePaddle/Serving +``` + +安装所需的pip依赖 +``` +cd Serving +pip3 install -r python/requirements.txt +``` + +```shell +pip3 install paddle-serving-client==0.6.2 +pip3 install paddle-serving-server==0.6.2 # CPU +pip3 install paddle-serving-app==0.6.2 +pip3 install paddle-serving-server-gpu==0.6.2.post102 #GPU with CUDA10.2 + TensorRT7 +# 其他GPU环境需要确认环境再选择执行哪一条 +pip3 install paddle-serving-server-gpu==0.6.2.post101 # GPU with CUDA10.1 + TensorRT6 +pip3 install paddle-serving-server-gpu==0.6.2.post11 # GPU with CUDA10.1 + TensorRT7 +``` + +您可能需要使用国内镜像源(例如清华源, 在pip命令中添加`-i https://pypi.tuna.tsinghua.edu.cn/simple`)来加速下载。 + +如果需要使用develop分支编译的安装包,请从[最新安装包列表](Latest_Packages_CN.md)中获取下载地址进行下载,使用`pip install`命令进行安装。如果您想自行编译,请参照[Paddle Serving编译文档](Compile_CN.md)。 + +paddle-serving-server和paddle-serving-server-gpu安装包支持Centos 6/7, Ubuntu 16/18和Windows 10。 + 
+paddle-serving-client和paddle-serving-app安装包支持Linux和Windows,其中paddle-serving-client仅支持python3.6/3.7/3.8。 + +**最新的0.6.2的版本,已经不支持Cuda 9.0和Cuda 10.0,Python已不支持2.7和3.5。** + +推荐安装2.1.0及以上版本的paddle + +``` +# CPU环境请执行 +pip3 install paddlepaddle==2.1.0 + +# GPU Cuda10.2环境请执行 +pip3 install paddlepaddle-gpu==2.1.0 +``` + +**注意**: 如果您的Cuda版本不是10.2,请勿直接执行上述命令,需要参考[Paddle官方文档-多版本whl包列表](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-release) + +选择相应的GPU环境的url链接并进行安装,例如Cuda 10.1的Python3.6用户,请选择表格当中的`cp36-cp36m`和`cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`对应的url,复制下来并执行 +``` +pip3 install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl +``` +由于默认的`paddlepaddle-gpu==2.1.0`是Cuda 10.2,并没有联编TensorRT,因此如果需要和在`paddlepaddle-gpu`上使用TensorRT,需要在上述多版本whl包列表当中,找到`cuda10.2-cudnn8.0-trt7.1.3`,下载对应的Python版本 + +如果是其他环境和Python版本,请在表格中找到对应的链接并用pip安装。 + +对于**Windows 10 用户**,请参考文档[Windows平台使用Paddle Serving指导](Windows_Tutorial_CN.md)。 diff --git a/doc/Install_EN.md b/doc/Install_EN.md new file mode 100644 index 0000000000000000000000000000000000000000..13a3d39900ab7dce094f5b398a26621187324fcf --- /dev/null +++ b/doc/Install_EN.md @@ -0,0 +1,79 @@ +# Install Paddle Serving with Docker + +([简体中文](Install_CN.md)|English) + +We **highly recommend** you to **run Paddle Serving in Docker**, please visit [Run in Docker](Run_In_Docker_EN.md). See the [document](Docker_Images_EN.md) for more docker images. + +**Attention:**: Currently, the default GPU environment of paddlepaddle 2.1 is Cuda 10.2, so the sample code of GPU Docker is based on Cuda 10.2. We also provides docker images and whl packages for other GPU environments. If users use other environments, they need to carefully check and select the appropriate version. + +**Attention:** the following so-called 'python' or 'pip' stands for one of Python 3.6/3.7/3.8. + +``` +# Run CPU Docker +docker pull registry.baidubce.com/paddlepaddle/serving:0.6.0-devel +docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.0-devel bash +docker exec -it test bash +git clone https://github.com/PaddlePaddle/Serving +``` +``` +# Run GPU Docker +nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-cudnn8-devel +nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-cudnn8-devel bash +nvidia-docker exec -it test bash +git clone https://github.com/PaddlePaddle/Serving +``` +install python dependencies +``` +cd Serving +pip install -r python/requirements.txt +``` + +```shell +pip install paddle-serving-client==0.6.0 +pip install paddle-serving-server==0.6.0 # CPU +pip install paddle-serving-app==0.6.0 +pip install paddle-serving-server-gpu==0.6.0.post102 #GPU with CUDA10.2 + TensorRT7 +# DO NOT RUN ALL COMMANDS! check your GPU env and select the right one +pip install paddle-serving-server-gpu==0.6.0.post101 # GPU with CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.6.0.post11 # GPU with CUDA10.1 + TensorRT7 + + +You may need to use a domestic mirror source (in China, you can use the Tsinghua mirror source, add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to pip command) to speed up the download. + +If you need install modules compiled with develop branch, please download packages from [latest packages list](Latest_Package_CN.md) and install with `pip install` command. 
If you want to compile by yourself, please refer to [How to compile Paddle Serving?](Compile_EN.md) + +Packages of paddle-serving-server and paddle-serving-server-gpu support Centos 6/7, Ubuntu 16/18, Windows 10. + +Packages of paddle-serving-client and paddle-serving-app support Linux and Windows, but paddle-serving-client only support python3.6/3.7/3.8. + +**For latest version, Cuda 9.0 or Cuda 10.0 are no longer supported, Python2.7/3.5 is no longer supported.** + +Recommended to install paddle >= 2.1.0 + + +``` +# CPU users, please run +pip install paddlepaddle==2.1.0 + +# GPU Cuda10.2 please run +pip install paddlepaddle-gpu==2.1.0 +``` + +**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly, you need to refer to [Paddle official documentation-multi-version whl package list +](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release) + +Select the url link of the corresponding GPU environment and install it. For example, for Python3.6 users of Cuda 10.1, please select `cp36-cp36m` and +The url corresponding to `cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`, copy it and run +``` +pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl +``` + +the default `paddlepaddle-gpu==2.1.0` is Cuda 10.2 with no TensorRT. If you want to install PaddlePaddle with TensorRT. please also check the documentation-multi-version whl package list and find key word `cuda10.2-cudnn8.0-trt7.1.3`. + +If it is other environment and Python version, please find the corresponding link in the table and install it with pip. + +For **Windows Users**, please read the document [Paddle Serving for Windows Users](Windows_Tutorial_EN.md) + +
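+
+As a quick sanity check, a minimal snippet like the one below (a sketch; it assumes the packages above were installed into the same Python 3 environment, and that the CPU server package was chosen) can confirm that the installation is importable before moving on:
+
+```python
+# Verify that PaddlePaddle and the Paddle Serving packages can be imported.
+import paddle
+import paddle_serving_client
+import paddle_serving_app
+import paddle_serving_server  # for the GPU build this module may be paddle_serving_server_gpu instead
+
+# The Paddle version should be 2.1.0 or newer, as recommended above.
+print(paddle.__version__)
+
+# Optional: Paddle's built-in environment check (verifies CPU/GPU availability).
+paddle.utils.run_check()
+```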

+## Quick Start Example

+ +This quick start example is mainly for those users who already have a model to deploy, and we also provide a model that can be used for deployment. in case if you want to know how to complete the process from offline training to online service, please refer to the AiStudio tutorial above. diff --git a/doc/JAVA_SDK_CN.md b/doc/Java_SDK_CN.md similarity index 96% rename from doc/JAVA_SDK_CN.md rename to doc/Java_SDK_CN.md index 333a29a67f1664db608803781babeb1b91435de0..1e254b1eee41fe08495fab20bb645da272b9826d 100644 --- a/doc/JAVA_SDK_CN.md +++ b/doc/Java_SDK_CN.md @@ -1,6 +1,6 @@ # Paddle Serving Client Java SDK -(简体中文|[English](JAVA_SDK.md)) +(简体中文|[English](Java_SDK_EN.md)) Paddle Serving 提供了 Java SDK,支持 Client 端用 Java 语言进行预测,本文档说明了如何使用 Java SDK。 diff --git a/doc/JAVA_SDK.md b/doc/Java_SDK_EN.md similarity index 96% rename from doc/JAVA_SDK.md rename to doc/Java_SDK_EN.md index cb1d60bc6a16ebf1d8621b2b4fd650271ca6ab87..f660538bcb54ae83ac5fed6c633e6a355e2a811f 100644 --- a/doc/JAVA_SDK.md +++ b/doc/Java_SDK_EN.md @@ -1,6 +1,6 @@ # Paddle Serving Client Java SDK -([简体中文](JAVA_SDK_CN.md)|English) +([简体中文](Java_SDK_CN.md)|English) Paddle Serving provides Java SDK,which supports predict on the Client side with Java language. This document shows how to use the Java SDK. diff --git a/doc/LOD_CN.md b/doc/LOD_CN.md index 5995dd8050eb1dd99febd9c8a418c636a118eceb..69343a4b0505b735f31fca70e40dcbea37b3a6c0 100644 --- a/doc/LOD_CN.md +++ b/doc/LOD_CN.md @@ -1,6 +1,6 @@ # Lod字段说明 -(简体中文|[English](LOD.md)) +(简体中文|[English](LOD_EN.md)) ## 概念 diff --git a/doc/LOD.md b/doc/LOD_EN.md similarity index 100% rename from doc/LOD.md rename to doc/LOD_EN.md diff --git a/doc/LATEST_PACKAGES.md b/doc/Latest_Packages_CN.md similarity index 100% rename from doc/LATEST_PACKAGES.md rename to doc/Latest_Packages_CN.md diff --git a/doc/LOW_PRECISION_DEPLOYMENT_CN.md b/doc/Low_Precision_CN.md similarity index 89% rename from doc/LOW_PRECISION_DEPLOYMENT_CN.md rename to doc/Low_Precision_CN.md index f77f4e241f3f4b95574d22b9ca55788b5abc968e..f9de43003cf230c2db35dcc50ff13f00482c7b50 100644 --- a/doc/LOW_PRECISION_DEPLOYMENT_CN.md +++ b/doc/Low_Precision_CN.md @@ -1,12 +1,13 @@ -# Paddle Serving低精度部署 -(简体中文|[English](./LOW_PRECISION_DEPLOYMENT.md)) +## Paddle Serving低精度部署 + +(简体中文|[English](./Low_Precision_EN.md)) 低精度部署, 在Intel CPU上支持int8、bfloat16模型,Nvidia TensorRT支持int8、float16模型。 -## 通过PaddleSlim量化生成低精度模型 +### 通过PaddleSlim量化生成低精度模型 详细见[PaddleSlim量化](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html) -## 使用TensorRT int8加载PaddleSlim Int8量化模型进行部署 +### 使用TensorRT int8加载PaddleSlim Int8量化模型进行部署 首先下载Resnet50 [PaddleSlim量化模型](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz),并转换为Paddle Serving支持的部署模型格式。 ``` wget https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz @@ -40,7 +41,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["score"]) print(fetch_map["score"].reshape(-1)) ``` -## 参考文档 +### 参考文档 * [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) * PaddleInference Intel CPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) * PaddleInference NV GPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) diff --git a/doc/LOW_PRECISION_DEPLOYMENT.md b/doc/Low_Precision_EN.md similarity index 87% rename from doc/LOW_PRECISION_DEPLOYMENT.md rename to doc/Low_Precision_EN.md index 
fb3bd208f2f52399afff1f96228543685f3cf389..1679ee1130eebd17a9643fdcf5f3ae8ef9b77c97 100644 --- a/doc/LOW_PRECISION_DEPLOYMENT.md +++ b/doc/Low_Precision_EN.md @@ -1,12 +1,13 @@ -# Low-Precision Deployment for Paddle Serving -(English|[简体中文](./LOW_PRECISION_DEPLOYMENT_CN.md)) +## Low-Precision Deployment for Paddle Serving + +(English|[简体中文](./Low_Precision_CN.md)) Intel CPU supports int8 and bfloat16 models, NVIDIA TensorRT supports int8 and float16 models. -## Obtain the quantized model through PaddleSlim tool +### Obtain the quantized model through PaddleSlim tool Train the low-precision models please refer to [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html). -## Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode +### Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode Firstly, download the [Resnet50 int8 model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert to Paddle Serving's saved model。 ``` @@ -41,7 +42,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0 print(fetch_map["save_infer_model/scale_0.tmp_0"].reshape(-1)) ``` -## Reference +### Reference * [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) * [Deploy the quantized model Using Paddle Inference on Intel CPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) * [Deploy the quantized model Using Paddle Inference on Nvidia GPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) diff --git a/doc/Model_Zoo_CN.md b/doc/Model_Zoo_CN.md index 47c440efc3ea3ae4bbfb76c5626d724509592635..d42e6601e033b1f364d7d5110eca6087851c3bcc 100644 --- a/doc/Model_Zoo_CN.md +++ b/doc/Model_Zoo_CN.md @@ -1,6 +1,6 @@ # Model Zoo -([English](./Model_Zoo.md)|简体中文) +([English](./Model_Zoo_EN.md)|简体中文) 本页面展示了Paddle Serving目前支持的预训练模型以及下载链接 若您想为Paddle Serving提供新的模型,可通过[pull request](https://github.com/PaddlePaddle/Serving/pulls)提交PR diff --git a/doc/Model_Zoo.md b/doc/Model_Zoo_EN.md similarity index 100% rename from doc/Model_Zoo.md rename to doc/Model_Zoo_EN.md diff --git a/doc/PROCESS_DATA.md b/doc/Process_data_CN.md similarity index 100% rename from doc/PROCESS_DATA.md rename to doc/Process_data_CN.md diff --git a/doc/python_server/BENCHMARKING_GPU.md b/doc/Python_Pipeline/Benchmark_CN.md similarity index 100% rename from doc/python_server/BENCHMARKING_GPU.md rename to doc/Python_Pipeline/Benchmark_CN.md diff --git a/doc/python_server/PIPELINE_SERVING_CN.md b/doc/Python_Pipeline/Pipeline_Design_CN.md similarity index 98% rename from doc/python_server/PIPELINE_SERVING_CN.md rename to doc/Python_Pipeline/Pipeline_Design_CN.md index aad71638f87b1ab4bb69456d33e615373d6d2629..5592585e60608853ba2db750da707294919ff086 100644 --- a/doc/python_server/PIPELINE_SERVING_CN.md +++ b/doc/Python_Pipeline/Pipeline_Design_CN.md @@ -1,13 +1,13 @@ # Pipeline Serving -(简体中文|[English](PIPELINE_SERVING.md)) - -- [架构设计](PIPELINE_SERVING_CN.md#1架构设计) -- [详细设计](PIPELINE_SERVING_CN.md#2详细设计) -- [典型示例](PIPELINE_SERVING_CN.md#3典型示例) -- [高阶用法](PIPELINE_SERVING_CN.md#4高阶用法) -- [日志追踪](PIPELINE_SERVING_CN.md#5日志追踪) -- [性能分析与优化](PIPELINE_SERVING_CN.md#6性能分析与优化) +(简体中文|[English](Pipeline_Design_EN.md)) + +- [架构设计](Pipeline_Design_CN.md#1架构设计) +- [详细设计](Pipeline_Design_CN.md#2详细设计) +- [典型示例](Pipeline_Design_CN.md#3典型示例) +- [高阶用法](Pipeline_Design_CN.md#4高阶用法) +- [日志追踪](Pipeline_Design_CN.md#5日志追踪) +- 
[性能分析与优化](Pipeline_Design_CN.md#6性能分析与优化) 在许多深度学习框架中,Serving通常用于单模型的一键部署。在AI工业大生产的背景下,端到端的深度学习模型当前还不能解决所有问题,多个深度学习模型配合起来使用还是解决现实问题的常规手段。但多模型应用设计复杂,为了降低开发和维护难度,同时保证服务的可用性,通常会采用串行或简单的并行方式,但一般这种情况下吞吐量仅达到可用状态,而且GPU利用率偏低。 diff --git a/doc/python_server/PIPELINE_SERVING.md b/doc/Python_Pipeline/Pipeline_Design_EN.md similarity index 98% rename from doc/python_server/PIPELINE_SERVING.md rename to doc/Python_Pipeline/Pipeline_Design_EN.md index 15555aa4b76bdf16ce95b538b87201f801c72fb8..8a0b313c7081429f34e50b683ab0199d206fc395 100644 --- a/doc/python_server/PIPELINE_SERVING.md +++ b/doc/Python_Pipeline/Pipeline_Design_EN.md @@ -1,13 +1,13 @@ # Pipeline Serving -([简体中文](PIPELINE_SERVING_CN.md)|English) - -- [Architecture Design](PIPELINE_SERVING.md#1architecture-design) -- [Detailed Design](PIPELINE_SERVING.md#2detailed-design) -- [Classic Examples](PIPELINE_SERVING.md#3classic-examples) -- [Advanced Usages](PIPELINE_SERVING.md#4advanced-usages) -- [Log Tracing](PIPELINE_SERVING.md#5log-tracing) -- [Performance Analysis And Optimization](PIPELINE_SERVING.md#6performance-analysis-and-optimization) +([简体中文](Pipeline_Design_CN.md)|English) + +- [Architecture Design](Pipeline_Design_EN.md#1architecture-design) +- [Detailed Design](Pipeline_Design_EN.md#2detailed-design) +- [Classic Examples](Pipeline_Design_EN.md#3classic-examples) +- [Advanced Usages](Pipeline_Design_EN.md#4advanced-usages) +- [Log Tracing](Pipeline_Design_EN.md#5log-tracing) +- [Performance Analysis And Optimization](Pipeline_Design_EN.md#6performance-analysis-and-optimization) In many deep learning frameworks, Serving is usually used for the deployment of single model.but in the context of AI industrial, the end-to-end deep learning model can not solve all the problems at present. Usually, it is necessary to use multiple deep learning models to solve practical problems.However, the design of multi-model applications is complicated. In order to reduce the difficulty of development and maintenance, and to ensure the availability of services, serial or simple parallel methods are usually used. In general, the throughput only reaches the usable state and the GPU utilization rate is low. diff --git a/doc/Quick_Start_CN.md b/doc/Quick_Start_CN.md new file mode 100644 index 0000000000000000000000000000000000000000..3bcca5d2d44cd80168063727f1e6fd199d04b0f3 --- /dev/null +++ b/doc/Quick_Start_CN.md @@ -0,0 +1,125 @@ +## Paddle Serving 快速开始示例 + +([English](./Quick_Start_EN.md)|简体中文) + +这个快速开始示例主要是为了给那些已经有一个要部署的模型的用户准备的,而且我们也提供了一个可以用来部署的模型。如果您想知道如何从离线训练到在线服务走完全流程,请参考前文的AiStudio教程。 + +
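+
+如果您手上已经有用Paddle保存的推理模型,可以参考《[怎样保存用于Paddle Serving的模型](Save_CN.md)》先将其转换为Serving可部署的格式。下面给出一个示意性的最小示例(模型目录、输出目录均为占位名称,接口参数请以Save文档为准):
+
+```python
+# 将Paddle推理模型转换为Serving部署所需的server/client配置(示意)
+import paddle_serving_client.io as serving_io
+
+serving_io.inference_model_to_serving(
+    dirname="your_inference_model_dir",   # 占位:已保存的推理模型目录
+    serving_server="serving_server",      # 转换后server端模型与配置的输出目录
+    serving_client="serving_client")      # 转换后client端配置的输出目录
+```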

+### 波士顿房价预测

+
+进入 Serving 的 git 目录,再进入 `fit_a_line` 示例目录:
+``` shell
+cd Serving/python/examples/fit_a_line
+sh get_data.sh
+```
+
+Paddle Serving 为用户提供了基于 HTTP 和 RPC 的服务
+
+

+### RPC服务

+
+用户还可以使用`paddle_serving_server.serve`启动RPC服务。尽管用户需要基于Paddle Serving的python客户端API进行一些开发,但是RPC服务通常比HTTP服务更快。需要指出的是这里我们没有指定`--name`。
+
+``` shell
+python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
+```
+ +| Argument | Type | Default | Description | +| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- | +| `thread` | int | `2` | Number of brpc service thread | +| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode | +| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode | +| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model | +| `port` | int | `9292` | Exposed port of current service to users | +| `model` | str[]| `""` | Path of paddle model directory to be served | +| `mem_optim_off` | - | - | Disable memory / graphic memory optimization | +| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph | +| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL | +| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT | +| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference | +| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU | +| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 | +| `use_calib` | bool | False | Use TRT int8 calibration | +| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS | + +#### 异步模型的说明 + 异步模式适用于1、请求数量非常大的情况,2、多模型串联,想要分别指定每个模型的并发数的情况。 + 异步模式有助于提高Service服务的吞吐(QPS),但对于单次请求而言,时延会有少量增加。 + 异步模式中,每个模型会启动您指定个数的N个线程,每个线程中包含一个模型实例,换句话说每个模型相当于包含N个线程的线程池,从线程池的任务队列中取任务来执行。 + 异步模式中,各个RPC Server的线程只负责将Request请求放入模型线程池的任务队列中,等任务被执行完毕后,再从任务队列中取出已完成的任务。 + 上表中通过 --thread 10 指定的是RPC Server的线程数量,默认值为2,--op_num 指定的是各个模型的线程池中线程数N,默认值为0,表示不使用异步模式。 + --op_max_batch 指定的各个模型的batch数量,默认值为32,该参数只有当--op_num不为0时才生效。 + +#### 当您的某个模型想使用多张GPU卡部署时. +python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2 +#### 当您的一个服务包含两个模型部署时. +python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 +#### 当您的一个服务包含两个模型,且每个模型都需要指定多张GPU卡部署时. +python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 +#### 当您的一个服务包含两个模型,且每个模型都需要指定多张GPU卡,且需要异步模式每个模型指定不同的并发数时. +python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --op_num 4 8 + + + +
+
+``` python
+# A user can visit rpc service through paddle_serving_client API
+from paddle_serving_client import Client
+import numpy as np
+
+client = Client()
+client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
+client.connect(["127.0.0.1:9292"])
+data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
+        -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
+fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
+print(fetch_map)
+```
+在这里,`client.predict`函数具有两个参数:`feed`是带有模型输入变量别名和值的`python dict`;`fetch`指定要从服务器返回的预测变量。在该示例中,在训练过程中保存可服务模型时,输入和输出的tensor名分别为`"x"`和`"price"`。
+
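+
+上面返回的`fetch_map`是一个以fetch变量别名为key的字典,value为numpy数组(以下为示意,数组形状以实际模型输出为准):
+
+```python
+# 从fetch_map中取出"price"的预测值(示意)
+price = fetch_map["price"]      # numpy数组,形状约为(1, 1)
+print(float(price[0][0]))       # 例如 18.90...
+```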

+### HTTP服务

+
+用户也可以将数据格式处理逻辑放在服务器端进行,这样就可以直接用curl去访问服务。参考如下案例(位于目录`python/examples/fit_a_line`):
+
+```
+python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
+```
+客户端输入
+```
+curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
+```
+返回结果
+```
+{"result":{"price":[[18.901151657104492]]}}
+```
+

+### Pipeline服务

+ +Paddle Serving提供业界领先的多模型串联服务,强力支持各大公司实际运行的业务场景,参考 [OCR文字识别案例](python/examples/pipeline/ocr),在目录`python/examples/pipeline/ocr` + +我们先获取两个模型 +``` +python3 -m paddle_serving_app.package --get_model ocr_rec +tar -xzvf ocr_rec.tar.gz +python3 -m paddle_serving_app.package --get_model ocr_det +tar -xzvf ocr_det.tar.gz +``` +然后启动服务端程序,将两个串联的模型作为一个整体的服务。 +``` +python3 web_service.py +``` +最终使用http的方式请求 +``` +python3 pipeline_http_client.py +``` +也支持rpc的方式 +``` +python3 pipeline_rpc_client.py +``` +输出 +``` +{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]} +``` diff --git a/doc/Quick_Start_EN.md b/doc/Quick_Start_EN.md new file mode 100644 index 0000000000000000000000000000000000000000..5de6ea894c76fe55949c9adb0d692eb5e6447680 --- /dev/null +++ b/doc/Quick_Start_EN.md @@ -0,0 +1,96 @@ +## Paddle Serving Quick Start Examples + +(English|[简体中文](./Quick_Start_CN.md)) + +This quick start example is mainly for those users who already have a model to deploy, and we also provide a model that can be used for deployment. in case if you want to know how to complete the process from offline training to online service, please refer to the AiStudio tutorial above. + +### Boston House Price Prediction model + +get into the Serving git directory, and change dir to `fit_a_line` +``` shell +cd Serving/python/examples/fit_a_line +sh get_data.sh +``` + +Paddle Serving provides HTTP and RPC based service for users to access + +### RPC service + +A user can also start a RPC service with `paddle_serving_server.serve`. RPC service is usually faster than HTTP service, although a user needs to do some coding based on Paddle Serving's python client API. Note that we do not specify `--name` here. +``` shell +python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 +``` +
+ +| Argument | Type | Default | Description | +| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- | +| `thread` | int | `4` | Concurrency of current service | +| `port` | int | `9292` | Exposed port of current service to users | +| `model` | str | `""` | Path of paddle model directory to be served | +| `mem_optim_off` | - | - | Disable memory / graphic memory optimization | +| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph | +| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL | +| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT | +| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference | +| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU | +| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 | +| `use_calib` | bool | False | Only for deployment with TensorRT | + +
+ +```python +# A user can visit rpc service through paddle_serving_client API +from paddle_serving_client import Client +import numpy as np +client = Client() +client.load_client_config("uci_housing_client/serving_client_conf.prototxt") +client.connect(["127.0.0.1:9292"]) +data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, + -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332] +fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"]) +print(fetch_map) +``` +Here, `client.predict` function has two arguments. `feed` is a `python dict` with model input variable alias name and values. `fetch` assigns the prediction variables to be returned from servers. In the example, the name of `"x"` and `"price"` are assigned when the servable model is saved during training. + + +### WEB service +Users can also put the data format processing logic on the server side, so that they can directly use curl to access the service, refer to the following case whose path is `python/examples/fit_a_line` + +``` +python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci +``` +for client side, +``` +curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction +``` +the response is +``` +{"result":{"price":[[18.901151657104492]]}} +``` +
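+
+The same HTTP endpoint can also be called from Python instead of curl. Below is a minimal sketch using the `requests` library (install it separately if needed; the service name `uci` comes from the `--name uci` flag above):
+
+```python
+import json
+import requests
+
+# The web service started with "--name uci" exposes /uci/prediction
+url = "http://127.0.0.1:9292/uci/prediction"
+headers = {"Content-Type": "application/json"}
+data = {"feed": [{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
+                        -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}],
+        "fetch": ["price"]}
+
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+print(r.json())  # e.g. {"result": {"price": [[18.90...]]}}
+```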

+### Pipeline Service

+ +Paddle Serving provides industry-leading multi-model tandem services, which strongly supports the actual operating business scenarios of major companies, please refer to [OCR word recognition](./python/examples/pipeline/ocr). + +we get two models +``` +python3 -m paddle_serving_app.package --get_model ocr_rec +tar -xzvf ocr_rec.tar.gz +python3 -m paddle_serving_app.package --get_model ocr_det +tar -xzvf ocr_det.tar.gz +``` +then we start server side, launch two models as one standalone web service +``` +python3 web_service.py +``` +http request +``` +python3 pipeline_http_client.py +``` +grpc request +``` +python3 pipeline_rpc_client.py +``` +output +``` +{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]} +``` \ No newline at end of file diff --git a/doc/RUN_IN_DOCKER_CN.md b/doc/Run_In_Docker_CN.md similarity index 91% rename from doc/RUN_IN_DOCKER_CN.md rename to doc/Run_In_Docker_CN.md index 3485f195e711dd9445a0a83063d3d9a336fa8a4b..7e4e4ad3dc9652772028a3e28d9453a032b25297 100644 --- a/doc/RUN_IN_DOCKER_CN.md +++ b/doc/Run_In_Docker_CN.md @@ -1,6 +1,6 @@ # 如何在Docker中运行PaddleServing -(简体中文|[English](RUN_IN_DOCKER.md)) +(简体中文|[English](Run_In_Docker_EN.md)) Docker最大的好处之一就是可移植性,可在多种操作系统和主流的云计算平台部署。使用Paddle Serving Docker镜像可在Linux、Mac和Windows平台部署。 @@ -14,7 +14,7 @@ Docker(GPU版本需要在GPU机器上安装nvidia-docker) ### 获取镜像 -参考[该文档](DOCKER_IMAGES_CN.md)获取镜像: +参考[该文档](Docker_Images_CN.md)获取镜像: 以CPU编译镜像为例 @@ -59,9 +59,9 @@ docker exec -it test bash ### 安装PaddleServing -请参照首页的指导,下载对应版本的pip包。[最新安装包合集](LATEST_PACKAGES.md) +请参照首页的指导,下载对应版本的pip包。[最新安装包合集](Latest_Packages_CN.md) ## 注意事项 -- 运行时镜像不能用于开发编译。如果想要从源码编译,请查看[如何编译PaddleServing](COMPILE.md)。 +- 运行时镜像不能用于开发编译。如果想要从源码编译,请查看[如何编译PaddleServing](Compile_CN.md)。 - 由于Cuda10和Cuda9的环境受限于GCC版本,无法同时运行CPU版本的`paddle_serving_server`,因此如果想要在GPU环境中同时使用CPU版本的`paddle_serving_server`,请选择Cuda10.1,Cuda10.2和Cuda11版本的镜像。 diff --git a/doc/RUN_IN_DOCKER.md b/doc/Run_In_Docker_EN.md similarity index 88% rename from doc/RUN_IN_DOCKER.md rename to doc/Run_In_Docker_EN.md index 9469cc875dd5ed6df127b035b8d876f84f296438..44a516cb0b611315ade0440b6cea81632d8e62f6 100644 --- a/doc/RUN_IN_DOCKER.md +++ b/doc/Run_In_Docker_EN.md @@ -1,6 +1,6 @@ # How to run PaddleServing in Docker -([简体中文](RUN_IN_DOCKER_CN.md)|English) +([简体中文](Run_In_Docker_CN.md)|English) One of the biggest benefits of Docker is portability, which can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows platforms. 
@@ -14,7 +14,7 @@ This document takes Python2 as an example to show how to run Paddle Serving in d ### Get docker image -Refer to [this document](DOCKER_IMAGES.md) for a docker image: +Refer to [this document](Docker_Images_EN.md) for a docker image: ```shell docker pull registry.baidubce.com/paddlepaddle/serving:latest-devel @@ -41,7 +41,7 @@ The GPU version is basically the same as the CPU version, with only some differe ### Get docker image -Refer to [this document](DOCKER_IMAGES.md) for a docker image, the following is an example of an `cuda9.0-cudnn7` image: +Refer to [this document](Docker_Images_EN.md) for a docker image, the following is an example of an `cuda9.0-cudnn7` image: ```shell docker pull registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel @@ -67,9 +67,9 @@ The `-p` option is to map the `9292` port of the container to the `9292` port of The mirror comes with `paddle_serving_server_gpu`, `paddle_serving_client`, and `paddle_serving_app` corresponding to the mirror tag version. If users don’t need to change the version, they can use it directly, which is suitable for environments without extranet services. -If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./LATEST_PACKAGES.md) +If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./Latest_Packages_CN.md) ## Precautious -- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](COMPILE.md). +- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](Compile_EN.md). - If you use Cuda9 and Cuda10 docker images, you cannot use `paddle_serving_server` CPU version at the same time, due to the limitation of gcc version. If you want to use both in one docker image, please choose images of Cuda10.1, Cuda10.2 and Cuda11. 
diff --git a/doc/PADDLE_SERVING_ON_KUBERNETES.md b/doc/Run_On_Kubernetes_CN.md similarity index 100% rename from doc/PADDLE_SERVING_ON_KUBERNETES.md rename to doc/Run_On_Kubernetes_CN.md diff --git a/doc/BAIDU_KUNLUN_XPU_SERVING_CN.md b/doc/Run_On_XPU_CN.md similarity index 92% rename from doc/BAIDU_KUNLUN_XPU_SERVING_CN.md rename to doc/Run_On_XPU_CN.md index fb7de26e016388dbcc3e5db23d8232743fdd792e..191e03381cc04fcf7e9cd3286c61d18f1f3520a3 100644 --- a/doc/BAIDU_KUNLUN_XPU_SERVING_CN.md +++ b/doc/Run_On_XPU_CN.md @@ -1,11 +1,11 @@ -# Paddle Serving使用百度昆仑芯片部署 -(简体中文|[English](./BAIDU_KUNLUN_XPU_SERVING.md)) +## Paddle Serving使用百度昆仑芯片部署 +(简体中文|[English](./Run_On_XPU_EN.md)) Paddle Serving支持使用百度昆仑芯片进行预测部署。目前支持在百度昆仑芯片和arm服务器(如飞腾 FT-2000+/64), 或者百度昆仑芯片和Intel CPU服务器,上进行部署,后续完善对其他异构硬件服务器部署能力。 -# 编译、安装 -基本环境配置可参考[该文档](COMPILE_CN.md)进行配置。下面以飞腾FT-2000+/64机器为例进行介绍。 -## 编译 +## 编译、安装 +基本环境配置可参考[该文档](Compile_CN.md)进行配置。下面以飞腾FT-2000+/64机器为例进行介绍。 +### 编译 * 编译server部分 ``` cd Serving @@ -50,23 +50,23 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \ make -j10 ``` -## 安装wheel包 +### 安装wheel包 以上编译步骤完成后,会在各自编译目录$build_dir/python/dist生成whl包,分别安装即可。例如server步骤,会在server-build-arm/python/dist目录下生成whl包, 使用命令```pip install -u xxx.whl```进行安装。 -# 请求参数说明 +## 请求参数说明 为了支持arm+xpu服务部署,使用Paddle-Lite加速能力,请求时需使用以下参数。 | 参数 | 参数说明 | 备注 | | :------- | :-------------------------- | :--------------------------------------------------------------- | | use_lite | 使用Paddle-Lite Engine | 使用Paddle-Lite cpu预测能力 | | use_xpu | 使用Baidu Kunlun进行预测 | 该选项需要与use_lite配合使用 | | ir_optim | 开启Paddle-Lite计算子图优化 | 详细见[Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) | -# 部署使用示例 -## 下载模型 +## 部署使用示例 +### 下载模型 ``` wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz tar -xzf uci_housing.tar.gz ``` -## 启动rpc服务 +### 启动rpc服务 主要有三种启动配置: * 使用cpu+xpu部署,使用Paddle-Lite xpu优化加速能力; * 单独使用cpu部署,使用Paddle-Lite优化加速能力; @@ -86,7 +86,7 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --po ``` python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292 ``` -## client调用 +### client调用 ``` from paddle_serving_client import Client import numpy as np @@ -98,9 +98,9 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"]) print(fetch_map) ``` -# 其他说明 +## 其他说明 -## 模型实例及说明 +### 模型实例及说明 以下提供部分样例,其他模型可参照进行修改。 | 示例名称 | 示例链接 | | :--------- | :---------------------------------------------------------- | @@ -108,6 +108,7 @@ print(fetch_map) | resnet | [resnet_v2_50_xpu](../python/examples/xpu/resnet_v2_50_xpu) | 注:支持昆仑芯片部署模型列表见[链接](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html)。不同模型适配上存在差异,可能存在不支持的情况,部署使用存在问题时,欢迎以[Github issue](https://github.com/PaddlePaddle/Serving/issues),我们会实时跟进。 -## 昆仑芯片支持相关参考资料 + +### 昆仑芯片支持相关参考资料 * [昆仑XPU芯片运行飞桨](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html) * [PaddleLite使用百度XPU预测部署](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html) diff --git a/doc/BAIDU_KUNLUN_XPU_SERVING.md b/doc/Run_On_XPU_EN.md similarity index 92% rename from doc/BAIDU_KUNLUN_XPU_SERVING.md rename to doc/Run_On_XPU_EN.md index 02568642bad6aafd147b628a1c6607fd8af9fed3..9737f413b3030dcaa5fa819d049dd3228868eccd 100644 --- a/doc/BAIDU_KUNLUN_XPU_SERVING.md +++ b/doc/Run_On_XPU_EN.md @@ -1,13 +1,14 @@ -# Paddle Serving Using Baidu Kunlun Chips -(English|[简体中文](./BAIDU_KUNLUN_XPU_SERVING_CN.md)) +## Paddle 
Serving Using Baidu Kunlun Chips + +(English|[简体中文](./Run_On_XPU_CN.md)) Paddle serving supports deployment using Baidu Kunlun chips. Currently, it supports deployment on the ARM CPU server with Baidu Kunlun chips (such as Phytium FT-2000+/64), or Intel CPU with Baidu Kunlun chips. We will improve the deployment capability on various heterogeneous hardware servers in the future. -# Compilation and installation -Refer to [compile](COMPILE.md) document to setup the compilation environment. The following is based on FeiTeng FT-2000 +/64 platform. -## Compilatiton +## Compilation and installation +Refer to [compile](Compile.md) document to setup the compilation environment. The following is based on FeiTeng FT-2000 +/64 platform. +### Compilatiton * Compile the Serving Server ``` cd Serving @@ -52,11 +53,11 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \ make -j10 ``` -## Install the wheel package +### Install the wheel package After the compilations stages above, the whl package will be generated in ```python/dist/``` under the specific temporary directories. For example, after the Server Compiation step,the whl package will be produced under the server-build-arm/python/dist directory, and you can run ```pip install -u python/dist/*.whl``` to install the package. -# Request parameters description +## Request parameters description In order to deploy serving service on the arm server with Baidu Kunlun xpu chips and use the acceleration capability of Paddle-Lite,please specify the following parameters during deployment. | param | param description | about | @@ -64,13 +65,13 @@ In order to deploy serving | use_lite | using Paddle-Lite Engine | use the inference capability of Paddle-Lite | | use_xpu | using Baidu Kunlun for inference | need to be used with the use_lite option | | ir_optim | open the graph optimization | refer to[Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) | -# Deplyment examples -## Download the model +## Deplyment examples +### Download the model ``` wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz tar -xzf uci_housing.tar.gz ``` -## Start RPC service +### Start RPC service There are mainly three deployment methods: * deploy on the cpu server with Baidu xpu using the acceleration capability of Paddle-Lite and xpu; * deploy on the cpu server standalone with Paddle-Lite; @@ -90,7 +91,7 @@ Start the rpc service, deploying on cpu server. ``` python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292 ``` -## +### ``` from paddle_serving_client import Client import numpy as np @@ -102,8 +103,8 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"]) print(fetch_map) ``` -# Others -## Model example and explanation +## Others +### Model example and explanation Some examples are provided below, and other models can be modifed with reference to these examples. | sample name | sample links | @@ -113,6 +114,6 @@ Some examples are provided below, and other models can be modifed with reference Note:Supported model lists refer to [doc](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). There are differences in the adaptation of different models, and there may be some unsupported cases. If you have any problem,please submit [Github issue](https://github.com/PaddlePaddle/Serving/issues), and we will follow up in real time. 
-## Kunlun chip related reference materials +### Kunlun chip related reference materials * [PaddlePaddle on Baidu Kunlun xpu chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html) * [Deployment on Baidu Kunlun xpu chips using PaddleLite](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html) diff --git a/doc/SAVE_CN.md b/doc/Save_CN.md similarity index 99% rename from doc/SAVE_CN.md rename to doc/Save_CN.md index 42606372a06bc26591b70d1ae6db119cd5a8749d..fcfbcf19fa43a170a9046c10d88101397b964a46 100644 --- a/doc/SAVE_CN.md +++ b/doc/Save_CN.md @@ -1,6 +1,6 @@ # 怎样保存用于Paddle Serving的模型? -(简体中文|[English](./SAVE.md)) +(简体中文|[English](./Save_EN.md)) ## 从已保存的模型文件中导出 如果已使用Paddle 的`save_inference_model`接口保存出预测要使用的模型,你可以使用Paddle Serving提供的名为`paddle_serving_client.convert`的内置模块进行转换。 diff --git a/doc/SAVE.md b/doc/Save_EN.md similarity index 99% rename from doc/SAVE.md rename to doc/Save_EN.md index 9da923bf6df1437923539aba6da99a429082da29..38f6f383642b358c7705a402ebc1bacc8a9145b2 100644 --- a/doc/SAVE.md +++ b/doc/Save_EN.md @@ -1,6 +1,6 @@ # How to save a servable model of Paddle Serving? -([简体中文](./SAVE_CN.md)|English) +([简体中文](./Save_CN.md)|English) ## Export from saved model files diff --git a/doc/SERVING_AUTH_DOCKER.md b/doc/Serving_Auth_Docker_CN.md similarity index 98% rename from doc/SERVING_AUTH_DOCKER.md rename to doc/Serving_Auth_Docker_CN.md index a3c303e3136a5d0fd0967bf68b1de2a1bc94eb33..9e9ee66d2674255d3fd877eba9eed02b74056437 100644 --- a/doc/SERVING_AUTH_DOCKER.md +++ b/doc/Serving_Auth_Docker_CN.md @@ -63,7 +63,7 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36 ### Step 1:启动Serving服务 -我们仍然以 [Uci房价预测](../python/examples/fit_a_line)服务作为例子,这里省略了镜像制作的过程,详情可以参考 [在Kubernetes集群上部署Paddle Serving](./PADDLE_SERVING_ON_KUBERNETES.md)。 +我们仍然以 [Uci房价预测](../python/examples/fit_a_line)服务作为例子,这里省略了镜像制作的过程,详情可以参考 [在Kubernetes集群上部署Paddle Serving](./Run_On_Kubernetes.md)。 在这里我们直接执行 ``` diff --git a/doc/SERVING_CONFIGURE_CN.md b/doc/Serving_Configure_CN.md similarity index 99% rename from doc/SERVING_CONFIGURE_CN.md rename to doc/Serving_Configure_CN.md index aede147abbf95324bb36777299c06147d5bea380..6e28f6ef864eacd49785138311a37f4d34e6eba7 100644 --- a/doc/SERVING_CONFIGURE_CN.md +++ b/doc/Serving_Configure_CN.md @@ -1,6 +1,6 @@ # Serving Configuration -(简体中文|[English](SERVING_CONFIGURE.md)) +(简体中文|[English](Serving_Configure_EN.md)) ## 简介 diff --git a/doc/SERVING_CONFIGURE.md b/doc/Serving_Configure_EN.md similarity index 98% rename from doc/SERVING_CONFIGURE.md rename to doc/Serving_Configure_EN.md index 4b42960db88d96e46e0d5eed70a6cab301def814..0a2bd4265dec6d6ca69d665382a611167f9b6b6c 100644 --- a/doc/SERVING_CONFIGURE.md +++ b/doc/Serving_Configure_EN.md @@ -1,6 +1,6 @@ # Serving Configuration -([简体中文](SERVING_CONFIGURE_CN.md)|English) +([简体中文](Serving_Configure_CN.md)|English) ## Overview @@ -12,7 +12,7 @@ This guide focuses on Paddle C++ Serving and Python Pipeline configuration: ## Model Configuration -The model configuration is generated by converting PaddleServing model and named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the info of input/output so that users can fill parameters easily. The model configuration file should not be modified. See the [Saving guide](SAVE.md) for model converting. The model configuration file provided must be a `core/configure/proto/general_model_config.proto`. 
+The model configuration is generated by converting PaddleServing model and named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the info of input/output so that users can fill parameters easily. The model configuration file should not be modified. See the [Saving guide](Save_EN.md) for model converting. The model configuration file provided must be a `core/configure/proto/general_model_config.proto`. Example: @@ -39,7 +39,7 @@ fetch_var { - fetch_var:model output - name:node name - alias_name:alias name -- is_lod_tensor:lod tensor, ref to [Lod Introduction](LOD.md) +- is_lod_tensor:lod tensor, ref to [Lod Introduction](LOD_EN.md) - feed_type/fetch_type:data type |feed_type|类型| diff --git a/doc/DESIGN_DOC_CN.md b/doc/Serving_Design_CN.md similarity index 92% rename from doc/DESIGN_DOC_CN.md rename to doc/Serving_Design_CN.md index 9e00840f5817ea69f4ea5ad9c5d4aee528fcfec9..ff45613ddcd6cc30f9dbbd0c70c9770aed1d6016 100644 --- a/doc/DESIGN_DOC_CN.md +++ b/doc/Serving_Design_CN.md @@ -1,6 +1,6 @@ # Paddle Serving设计文档 -(简体中文|[English](./DESIGN_DOC.md)) +(简体中文|[English](./Serving_Design_EN.md)) ## 1. 设计目标 @@ -55,15 +55,15 @@ Paddle Serving从做顶层设计时考虑到不同团队在工业级场景中会 > 跨平台运行 跨平台是不依赖于操作系统,也不依赖硬件环境。一个操作系统下开发的应用,放到另一个操作系统下依然可以运行。因此,设计上既要考虑开发语言、组件是跨平台的,同时也要考虑不同系统上编译器的解释差异。 -Docker 是一个开源的应用容器引擎,让开发者可以打包他们的应用以及依赖包到一个可移植的容器中,然后发布到任何流行的Linux机器或Windows机器上。我们将Paddle Serving框架打包了多种Docker镜像,镜像列表参考《[Docker镜像](DOCKER_IMAGES_CN.md)》,根据用户的使用场景选择镜像。为方便用户使用Docker,我们提供了帮助文档《[如何在Docker中运行PaddleServing](RUN_IN_DOCKER_CN.md)》。目前,Python webservice模式可在原生系统Linux和Windows双系统上部署运行。《[Windows平台使用Paddle Serving指导](WINDOWS_TUTORIAL_CN.md)》 +Docker 是一个开源的应用容器引擎,让开发者可以打包他们的应用以及依赖包到一个可移植的容器中,然后发布到任何流行的Linux机器或Windows机器上。我们将Paddle Serving框架打包了多种Docker镜像,镜像列表参考《[Docker镜像](Docker_Images_CN.md)》,根据用户的使用场景选择镜像。为方便用户使用Docker,我们提供了帮助文档《[如何在Docker中运行PaddleServing](Run_In_Dokcer_CN.md)》。目前,Python webservice模式可在原生系统Linux和Windows双系统上部署运行。《[Windows平台使用Paddle Serving指导](Windows_Tutorial_CN.md)》 > 支持多种开发语言SDK -Paddle Serving提供了4种开发语言SDK,包括Python、C++、Java、Golang。Golang SDK在建设中,有兴趣的开源开发者可以提交PR。 +Paddle Serving提供了3种开发语言SDK,包括Python、C++、Java。Golang SDK在建设中,有兴趣的开源开发者可以提交PR。 + Python,参考python/examples下client示例 或 4.2 web服务示例 -+ C++,参考《[从零开始写一个预测服务](CREATING.md)》 -+ Java,参考《[Paddle Serving Client Java SDK](JAVA_SDK_CN.md)》 -+ Golang,参考《[如何在Paddle Serving使用Go Client](deprecated/IMDB_GO_CLIENT_CN.md)》 ++ C++,参考《[从零开始写一个预测服务](C++_Serving/Creat_C++Serving_CN.md)》 ++ Java,参考《[Paddle Serving Client Java SDK](Java_SDK_CN.md)》 + > 支持多种硬件设备 @@ -76,7 +76,7 @@ Paddle Serving提供了4种开发语言SDK,包括Python、C++、Java、Golang 以IMDB评论情感分析任务为例通过9步展示,Paddle Serving从模型的训练到部署预测服务的全流程《[AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)》 -由于无法直接查看模型文件中feed和fetch参数信息,不方便用户拼装参数。因此,Paddle Serving开发一个工具将Paddle模型转成Serving的格式,生成包含feed和fetch参数信息的prototxt文件。下图是uci_housing示例的生成的prototxt文件,更多转换方法参考文档《[怎样保存用于Paddle Serving的模型](SAVE_CN.md)》。 +由于无法直接查看模型文件中feed和fetch参数信息,不方便用户拼装参数。因此,Paddle Serving开发一个工具将Paddle模型转成Serving的格式,生成包含feed和fetch参数信息的prototxt文件。下图是uci_housing示例的生成的prototxt文件,更多转换方法参考文档《[怎样保存用于Paddle Serving的模型](Save_CN.md)》。 ``` feed_var { name: "x" @@ -124,15 +124,15 @@ C++ Serving的核心执行引擎是一个有向无环图,图中的每个节点 ### 3.3 模型管理与热加载 -Paddle Serving的C++引擎支持模型管理功能,支持多种模型和模型不同版本的管理。为了保证在模型更换期间推理服务的可用性,需要在服务不中断的情况下对模型进行热加载。Paddle Serving对该特性进行了支持,并提供了一个监控产出模型更新本地模型的工具,具体例子请参考《[Paddle Serving中的模型热加载](HOT_LOADING_IN_SERVING_CN.md)》。 +Paddle Serving的C++引擎支持模型管理功能,支持多种模型和模型不同版本的管理。为了保证在模型更换期间推理服务的可用性,需要在服务不中断的情况下对模型进行热加载。Paddle 
Serving对该特性进行了支持,并提供了一个监控产出模型更新本地模型的工具,具体例子请参考《[Paddle Serving中的模型热加载](C++_Serving/Hot_Loading_CN.md)》。 ### 3.4 模型加解密 -Paddle Serving采用对称加密算法对模型进行加密,在服务加载模型过程中在内存中解密。目前,提供基础的模型安全能力,并不保证模型绝对安全性,用户可根据我们的设计加以完善,实现更高级别的安全性。说明文档参考《[加密模型预测](ENCRYPTION_CN.md)》 +Paddle Serving采用对称加密算法对模型进行加密,在服务加载模型过程中在内存中解密。目前,提供基础的模型安全能力,并不保证模型绝对安全性,用户可根据我们的设计加以完善,实现更高级别的安全性。说明文档参考《[加密模型预测](C++_Serving/Encryption_CN.md)》 ### 3.5 A/B Test -在对模型进行充分的离线评估后,通常需要进行在线A/B测试,来决定是否大规模上线服务。下图为使用Paddle Serving做A/B测试的基本结构,Client端做好相应的配置后,自动将流量分发给不同的Server,从而完成A/B测试。具体例子请参考《[如何使用Paddle Serving做ABTEST](ABTEST_IN_PADDLE_SERVING_CN.md)》。 +在对模型进行充分的离线评估后,通常需要进行在线A/B测试,来决定是否大规模上线服务。下图为使用Paddle Serving做A/B测试的基本结构,Client端做好相应的配置后,自动将流量分发给不同的Server,从而完成A/B测试。具体例子请参考《[如何使用Paddle Serving做ABTEST](C++_Serving/ABTEST_CN.md)》。


@@ -193,7 +193,7 @@ Pipeline Serving的网络框架采用gRPC和gPRC gateway。gRPC service接收RPC

### 5.2 核心设计与使用用例 -Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Channel,通过组合实现一套有向无环图,设计与使用文档参考《[Pipeline Serving设计与实现](PIPELINE_SERVING_CN.md)》 +Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Channel,通过组合实现一套有向无环图,设计与使用文档参考《[Pipeline Serving设计与实现](Python_Pipeline/Pipeline_Design_CN.md)》
@@ -201,11 +201,8 @@ Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Chann ## 6. 未来计划 -### 6.1 云端自动部署能力 -为了方便用户更容易将Paddle的预测模型部署到线上,Paddle Serving在接下来的版本会提供Kubernetes生态下任务编排的工具。 - -### 6.2 向量检索、树结构检索 +### 6.1 向量检索、树结构检索 在推荐与广告场景的召回系统中,通常需要采用基于向量的快速检索或者基于树结构的快速检索,Paddle Serving会对这方面的检索引擎进行集成或扩展。 -### 6.3 服务监控 +### 6.2 服务监控 集成普罗米修斯监控,一套开源的监控&报警&时间序列数据库的组合,适合k8s和docker的监控系统。 diff --git a/doc/DESIGN_DOC.md b/doc/Serving_Design_EN.md similarity index 93% rename from doc/DESIGN_DOC.md rename to doc/Serving_Design_EN.md index 268d2ff67cbec90ff7714723d10a6a38df1b61b3..895e55c5bf21c26dd55e5f509a392d0ec152195d 100644 --- a/doc/DESIGN_DOC.md +++ b/doc/Serving_Design_EN.md @@ -1,6 +1,6 @@ # Paddle Serving Design Doc -([简体中文](./DESIGN_DOC_CN.md)|English) +([简体中文](./Serving_Design_CN.md)|English) ## 1. Design Objectives @@ -53,16 +53,16 @@ Paddle Serving takes into account a series of issues such as different operating Cross-platform is not dependent on the operating system, nor on the hardware environment. Applications developed under one operating system can still run under another operating system. Therefore, the design should consider not only the development language and the cross-platform components, but also the interpretation differences of the compilers on different systems. -Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework. Refer to the image list《[Docker Images](DOCKER_IMAGES.md)》, Select mirrors according to user's usage. We provide Docker usage documentation《[How to run PaddleServing in Docker](RUN_IN_DOCKER.md)》.Currently, the Python webservice mode can be deployed and run on the native Linux and Windows dual systems.《[Paddle Serving for Windows Users](WINDOWS_TUTORIAL.md)》 +Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework. Refer to the image list《[Docker Images](Docker_Images_EN.md)》, Select mirrors according to user's usage. We provide Docker usage documentation《[How to run PaddleServing in Docker](Run_In_Docker_EN.md)》.Currently, the Python webservice mode can be deployed and run on the native Linux and Windows dual systems.《[Paddle Serving for Windows Users](Windows_Tutorial_EN.md)》 > Support multiple development languages client ​​SDKs -Paddle Serving provides 4 development language client SDKs, including Python, C++, Java, and Golang. Golang SDK is under construction, We hope that interested open source developers can help submit PR. +Paddle Serving provides 3 development language client SDKs, including Python, C++, Java, we hope that interested open source developers can help submit PR. + Python, Refer to the client example under python/examples or 4.2 web service example. 
-+ C++, Refer to《[从零开始写一个预测服务](CREATING.md)》 -+ Java, Refer to《[Paddle Serving Client Java SDK](JAVA_SDK.md)》 -+ Golang, Refer to《[How to use Go Client of Paddle Serving](deprecated/IMDB_GO_CLIENT.md)》 ++ C++, Refer to《[从零开始写一个预测服务](C++_Serving/Creat_C++Serving_CN.md)》 ++ Java, Refer to《[Paddle Serving Client Java SDK](Java_SDK_EN.md)》 + > Support multiple hardware devices @@ -72,7 +72,7 @@ The inference framework of the well-known deep learning platform only supports C Models trained on other deep learning platforms can be passed《[PaddlePaddle/X2Paddle工具](https://github.com/PaddlePaddle/X2Paddle)》.We convert multiple mainstream CV models to Paddle models. TensorFlow, Caffe, ONNX, PyTorch model conversion is tested.《[AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)》 -Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into Serving format and generate a prototxt file containing feed and fetch parameter information. The following figure is the generated prototxt file of the uci_housing example. For more conversion methods, refer to the document《[How to save a servable model of Paddle Serving?](SAVE.md)》. +Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into Serving format and generate a prototxt file containing feed and fetch parameter information. The following figure is the generated prototxt file of the uci_housing example. For more conversion methods, refer to the document《[How to save a servable model of Paddle Serving?](Save_EN.md)》. ``` feed_var { name: "x" @@ -121,14 +121,14 @@ The core execution engine of Paddle Serving is a Directed acyclic graph(DAG). In

### 3.3 Model Management and Hot Reloading -C++ Serving supports model management functions, including management of multiple models and multiple model versions.In order to ensure the availability of services, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool for monitoring output models to update local models. Please refer to [Hot loading in Paddle Serving](HOT_LOADING_IN_SERVING.md) for specific examples. +C++ Serving supports model management functions, including management of multiple models and multiple model versions.In order to ensure the availability of services, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool for monitoring output models to update local models. Please refer to [Hot loading in Paddle Serving](C++_Serving/Hot_Loading_EN.md) for specific examples. ### 3.4 MOEDL ENCRYPTION INFERENCE -Paddle Serving uses a symmetric encryption algorithm to encrypt the model, and decrypts it in memory during the service loading model. At present, providing basic model security capabilities does not guarantee absolute model security. Users can improve them according to our design to achieve a higher level of security. Documentation reference《[MOEDL ENCRYPTION INFERENCE](ENCRYPTION.md)》 +Paddle Serving uses a symmetric encryption algorithm to encrypt the model, and decrypts it in memory during the service loading model. At present, providing basic model security capabilities does not guarantee absolute model security. Users can improve them according to our design to achieve a higher level of security. Documentation reference《[MOEDL ENCRYPTION INFERENCE](C++_Serving/Encryption_EN.md)》 ### 3.5 A/B Test -After sufficient offline evaluation of the model, online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of A/B test with Paddle Serving. After the client is configured with the corresponding configuration, the traffic will be automatically distributed to different servers to achieve A/B test. Please refer to [ABTEST in Paddle Serving](ABTEST_IN_PADDLE_SERVING.md) for specific examples. +After sufficient offline evaluation of the model, online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of A/B test with Paddle Serving. After the client is configured with the corresponding configuration, the traffic will be automatically distributed to different servers to achieve A/B test. Please refer to [ABTEST in Paddle Serving](C++_Serving/ABTEST_EN.md) for specific examples.
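+
+For illustration, the client-side configuration for such an A/B test might look like the sketch below (a sketch only: the variant tags, endpoints and traffic weights are hypothetical placeholders, and the `add_variant` interface follows the ABTEST document referenced above):
+
+```python
+from paddle_serving_client import Client
+
+client = Client()
+client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
+# Hypothetical variants: send 30% of traffic to one server group and 70% to the other.
+client.add_variant("var1", ["127.0.0.1:9292"], 30)
+client.add_variant("var2", ["127.0.0.1:9393"], 70)
+client.connect()
+```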


@@ -193,7 +193,7 @@ The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC s ### 5.2 Core Design And Use Cases -The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. Reference for design and use documents《[Pipeline Serving](PIPELINE_SERVING.md)》 +The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. Reference for design and use documents《[Pipeline Serving](Python_Pipeline/Pipeline_Design_EN.md)》
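+
+As a concrete illustration of calling such a pipeline service, the sketch below follows the key/value HTTP request format used by the pipeline example clients (the port, service name and input key are hypothetical placeholders in the style of the OCR example; check that example's config.yml for the actual values):
+
+```python
+import base64
+import json
+import requests
+
+# Hypothetical endpoint: the web port and service name are set in the pipeline config.yml
+url = "http://127.0.0.1:9999/ocr/prediction"
+
+with open("test.jpg", "rb") as f:
+    image = base64.b64encode(f.read()).decode("utf8")
+
+# Pipeline HTTP requests carry parallel "key"/"value" lists.
+data = {"key": ["image"], "value": [image]}
+r = requests.post(url=url, data=json.dumps(data))
+print(r.json())  # e.g. {"err_no": 0, "err_msg": "", "key": ["res"], "value": [...]}
+```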

diff --git a/doc/TENSOR_RT.md b/doc/TENSOR_RT.md deleted file mode 100644 index a18bc0b0c7c9fb61d57d1d532a719170b79d8047..0000000000000000000000000000000000000000 --- a/doc/TENSOR_RT.md +++ /dev/null @@ -1,65 +0,0 @@ -## Paddle Serving uses TensorRT - -(English|[简体中文](./TENSOR_RT_CN.md)) - -### Background - -Deploying models trained on mainstream frameworks through the tensorRT tool launched by Nvidia can greatly increase the speed of model inference, which is often at least 1 times faster than the original framework, and it also takes up more device memory. less. Therefore, it is very useful for all users who need to deploy models to master the method of deploying deep learning models with tensorRT. Paddle Serving provides comprehensive TensorRT ecological support. - -### surroundings - -Serving Cuda10.1 Cuda10.2 and Cuda11 versions support TensorRT. - -#### Install Paddle - -In [Development using Docker environment](./RUN_IN_DOCKER.md) and [Docker image list](./DOCKER_IMAGES.md), we give the development image of TensorRT. After using the mirror to start, you need to install the Paddle whl package that supports TensorRT, refer to the documentation on the home page - -``` -# GPU Cuda10.2 environment please execute -pip install paddlepaddle-gpu==2.0.0 -``` - -**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly, you need to refer to [Paddle official documentation-multi-version whl package list -](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release) - -Select the URL link of the corresponding GPU environment and install it. For example, for Python2.7 users of Cuda 10.1, please select `cp27-cp27mu` and -`cuda10.1-cudnn7.6-trt6.0.1.5` corresponding url, copy it and execute -``` -pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.0.0-gpu-cuda10.1-cudnn7-mkl/paddlepaddle_gpu-2.0.0.post101-cp27-cp27mu-linux_x86_64.whl -``` -Since the default `paddlepaddle-gpu==2.0.0` is Cuda 10.2 and TensorRT is not built, if you need to use TensorRT on `paddlepaddle-gpu`, you need to find `cuda10 in the above multi-version whl package list .2-cudnn8.0-trt7.1.3`, download the corresponding Python version. - - -#### Install Paddle Serving -``` -# Cuda10.2 -pip install paddle-server-server==${VERSION}.post102 -# Cuda 10.1 -pip install paddle-server-server==${VERSION}.post101 -# Cuda 11 -pip install paddle-server-server==${VERSION}.post11 -``` - -### Use TensorRT - -#### RPC mode - -In [Serving model example](../python/examples), we have given models that can be accelerated using TensorRT, such as [Faster_RCNN model](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco) under detection - -We just need -``` -wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar -tar xf faster_rcnn_r50_fpn_1x_coco.tar -python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt -``` -The TensorRT version of the faster_rcnn model server is started - - -#### Local Predictor mode - -In [local_predictor](../python/paddle_serving_app/local_predict.py#L52), users can explicitly specify `use_trt=True` and pass it to `load_model_config`. -Other methods are no different from other Local Predictor methods, and you need to pay attention to the compatibility of the model with TensorRT. 
- -#### Pipeline Mode - -In [Pipeline mode](./PIPELINE_SERVING.md), our [imagenet example](../python/examples/pipeline/imagenet/config.yml#L23) gives the way to set TensorRT. diff --git a/doc/TENSOR_RT_CN.md b/doc/TENSOR_RT_CN.md deleted file mode 100644 index 453a08379196df94a348a13746ed288632d44486..0000000000000000000000000000000000000000 --- a/doc/TENSOR_RT_CN.md +++ /dev/null @@ -1,67 +0,0 @@ -## Paddle Serving 使用 TensorRT - -([English](./TENSOR_RT.md)|简体中文) - -### 背景 - -通过Nvidia推出的tensorRT工具来部署主流框架上训练的模型能够极大的提高模型推断的速度,往往相比与原本的框架能够有至少1倍以上的速度提升,同时占用的设备内存也会更加的少。因此对是所有需要部署模型的用户来说,掌握用tensorRT来部署深度学习模型的方法是非常有用的。Paddle Serving提供了全面的TensorRT生态支持。 - -### 环境 - -Serving 的Cuda10.1 Cuda10.2和Cuda11版本支持TensorRT。 - -#### 安装Paddle - -在[使用Docker环境开发](./RUN_IN_DOCKER_CN.md) 和 [Docker镜像列表](./DOCKER_IMAGES_CN.md)当中,我们给出了TensorRT的开发镜像。使用镜像启动之后,需要安装支持TensorRT的Paddle whl包,参考首页的文档 - -``` -# GPU Cuda10.2环境请执行 -pip install paddlepaddle-gpu==2.0.0 -``` - -**注意**: 如果您的Cuda版本不是10.2,请勿直接执行上述命令,需要参考[Paddle官方文档-多版本whl包列表 -](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-release) - -选择相应的GPU环境的url链接并进行安装,例如Cuda 10.1的Python2.7用户,请选择表格当中的`cp27-cp27mu`和 -`cuda10.1-cudnn7.6-trt6.0.1.5`对应的url,复制下来并执行 -``` -pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.0.0-gpu-cuda10.1-cudnn7-mkl/paddlepaddle_gpu-2.0.0.post101-cp27-cp27mu-linux_x86_64.whl -``` -由于默认的`paddlepaddle-gpu==2.0.0`是Cuda 10.2,并没有联编TensorRT,因此如果需要和在`paddlepaddle-gpu`上使用TensorRT,需要在上述多版本whl包列表当中,找到`cuda10.2-cudnn8.0-trt7.1.3`,下载对应的Python版本。 - - -#### 安装Paddle Serving -``` -# Cuda10.2 -pip install paddle-server-server==${VERSION}.post102 -# Cuda 10.1 -pip install paddle-server-server==${VERSION}.post101 -# Cuda 11 -pip install paddle-server-server==${VERSION}.post11 -``` - -### 使用TensorRT - -#### RPC模式 - -在[Serving模型示例](../python/examples)当中,我们有给出可以使用TensorRT加速的模型,例如detection下的[Faster_RCNN模型](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco) - -我们只需 -``` -wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar -tar xf faster_rcnn_r50_fpn_1x_coco.tar -python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt -``` -TensorRT版本的faster_rcnn模型服务端就启动了 - - -#### Local Predictor模式 - -在 [local_predictor](../python/paddle_serving_app/local_predict.py#L52)当中,用户可以显式制定`use_trt=True`传入到`load_model_config`当中。 -其他方式和其他Local Predictor使用方法没有区别,需要注意模型对TensorRT的兼容性。 - -#### Pipeline模式 - -在 [Pipeline模式](./PIPELINE_SERVING_CN.md)当中,我们的[imagenet例子](../python/examples/pipeline/imagenet/config.yml#L23)给出了设置TensorRT的方式。 - - diff --git a/doc/WINDOWS_TUTORIAL_CN.md b/doc/Windows_Tutorial_CN.md similarity index 98% rename from doc/WINDOWS_TUTORIAL_CN.md rename to doc/Windows_Tutorial_CN.md index 0b501d80d8d0415c7653b569d1eec86ba0355849..76f87bae2c532a114031d6b15facfae217525b93 100644 --- a/doc/WINDOWS_TUTORIAL_CN.md +++ b/doc/Windows_Tutorial_CN.md @@ -1,6 +1,6 @@ ## Windows平台使用Paddle Serving指导 -([English](./WINDOWS_TUTORIAL.md)|简体中文) +([English](./Windows_Turtial_EN.md)|简体中文) ### 综述 @@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data)) print(r.json()) ``` -用户只需要按照如上指示,在对应函数中实现相关内容即可。更多信息请参见[如何开发一个新的Web Service?](./NEW_WEB_SERVICE_CN.md) +用户只需要按照如上指示,在对应函数中实现相关内容即可。更多信息请参见[如何开发一个新的Web Service?](./C++_Serving/Http_Service_CN.md) 开发完成后执行 diff --git a/doc/WINDOWS_TUTORIAL.md b/doc/Windows_Tutorial_EN.md similarity index 98% rename from doc/WINDOWS_TUTORIAL.md rename to doc/Windows_Tutorial_EN.md index 
0133ff01049d14ddf53fcfcff1495377a6192440..b22628fe1f76fe6373fc39407ee0ae541c79e563 100644 --- a/doc/WINDOWS_TUTORIAL.md +++ b/doc/Windows_Tutorial_EN.md @@ -1,6 +1,6 @@ ## Paddle Serving for Windows Users -(English|[简体中文](./WINDOWS_TUTORIAL_CN.md)) +(English|[简体中文](./Windows_Tutorial_CN.md)) ### Summary @@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data)) print(r.json()) ``` -The user only needs to follow the above instructions and implement the relevant content in the corresponding function. For more information, please refer to [How to develop a new Web Service? ](./NEW_WEB_SERVICE.md) +The user only needs to follow the above instructions and implement the relevant content in the corresponding function. For more information, please refer to [How to develop a new Web Service? ](./C++_Serving/Http_Service_EN.md) Execute after development diff --git a/doc/deprecated/CLUSTERING.md b/doc/deprecated/CLUSTERING.md deleted file mode 100644 index b9da2e56165f17ee07116c46dbbdfefd806cb56f..0000000000000000000000000000000000000000 --- a/doc/deprecated/CLUSTERING.md +++ /dev/null @@ -1,142 +0,0 @@ -# 搭建预测服务集群 - -从[客户端配置](../CLIENT_CONFIGURE.md)中我们已经知道,通过在客户端SDK的配置文件predictors.prototxt适当配置,可以搭建多副本和多Variant的预测集群。以下以图像分类任务为例,在单机上模拟搭建单Variant的多副本、和多Variant的预测集群 - -## 1. 单Variant多副本的预测集群 - -### 1.1 在本机创建一个serving副本 - -首先复制一个sering目录 - -```shell -$ cd /path/to/paddle-serving/build/output/demo -$ cp -r serving/ serving_new/ -$ cd serving_new/ - -``` - -在serving_new目录中,在conf/gflags.conf中增加如下一行,修改其启动端口为8011,这是为了让该副本监听不同端口 - -```shell ---port=8011 -``` - -然后启动新副本 - -```shell -$ bin/serving& -``` - -### 1.2 修改client端配置,将新副本地址加入ip列表: - -```shell -$ cd /path/to/paddle-serving/build/output/demo/client/image_classification -``` - -修改conf/predictors.prototxt ImageClassifyService部分如下所示 - -```JSON -predictors { - name: "ximage" - service_name: "baidu.paddle_serving.predictor.image_classification.ImageClassifyService" - endpoint_router: "WeightedRandomRender" - weighted_random_render_conf { - variant_weight_list: "50" - } - variants { - tag: "var1" - naming_conf { - cluster: "list://127.0.0.1:8010, 127.0.0.1:8011" # 在这里增加一个新的副本地址 - } - } -} -``` - -重启client端 - -```shell -$ bin/ximage& -``` - -查看2个serving副本目录下是否均有收到请求: - -```shell -$ cd /path/to/paddle-serving/build/output/demo/serving -$ tail -f log/serving.INFO - -$ cd /path/to/paddle-serving/build/output/demo/serving_new -$ tail -f log/serving.INFO -``` - -## 2. 
多Variant - -### 2.1 本机创建新的serving副本 - -步骤同1.1节,略过 - -### 2.2 修改client配置,增加一个Variant - -```shell -$ cd /path/to/paddle-serving/build/output/demo/client/image_classification -``` - -修改conf/predictors.prototxt ImageClassifyService部分如下所示 - -```JSON -predictors { - name: "ximage" - service_name: "baidu.paddle_serving.predictor.image_classification.ImageClassifyService" - endpoint_router: "WeightedRandomRender" - weighted_random_render_conf { - variant_weight_list: "50 | 50" # 一共2个variant,代表模型的2个版本。这里的权重代表调度的流量比例关系 - } - variants { - tag: "var1" - naming_conf { - cluster: "list://127.0.0.1:8010" - } - } - variants { # 增加一个variant - tag: "var2" - naming_conf { - cluster: "list://127.0.0.1:8011" - } - } -} -``` - -重启client端 - -```shell -$ bin/ximage& -``` - -查看2个serving副本目录下是否均有收到请求: - -```shell -$ cd /path/to/paddle-serving/build/output/demo/serving -$ tail -f log/serving.INFO - -$ cd /path/to/paddle-serving/build/output/demo/serving_new -$ tail -f log/serving.INFO -``` - -查看client端是否有收到来自Variant1和Variant2的响应 - -```shell -$ cd /path/to/paddle-serving/build/output/demo/client/image_classification -$ tail -f log/ximage.INFO - -``` - -以下是正常的输出 - -``` -I0307 17:54:22.862087 24719 ximage.cpp:172] Debug string: -I0307 17:54:22.862650 24719 ximage.cpp:110] sample-0's classify result: n02112018,博美犬, prop: 0.522815 -I0307 17:54:22.862666 24719 ximage.cpp:114] Succ call predictor[ximage], the tag is: var1, elapse_ms: 333 - -I0307 17:54:23.194780 24719 ximage.cpp:172] Debug string: -I0307 17:54:23.195322 24719 ximage.cpp:110] sample-0's classify result: n02112018,博美犬, prop: 0.522815 -I0307 17:54:23.195334 24719 ximage.cpp:114] Succ call predictor[ximage], the tag is: var2, elapse_ms: 332 -``` diff --git a/doc/deprecated/CTR_PREDICTION.md b/doc/deprecated/CTR_PREDICTION.md deleted file mode 100644 index acfb6c08e61a8a393d3978dc4ca29b9cda631d20..0000000000000000000000000000000000000000 --- a/doc/deprecated/CTR_PREDICTION.md +++ /dev/null @@ -1,334 +0,0 @@ -# CTR预估模型 - -## 1. 背景 - -在搜索、推荐、在线广告等业务场景中,embedding参数的规模常常非常庞大,达到数百GB甚至T级别;训练如此规模的模型需要用到多机分布式训练能力,将参数分片更新和保存;另一方面,训练好的模型,要应用于在线业务,也难以单机加载。Paddle Serving提供大规模稀疏参数读写服务,用户可以方便地将超大规模的稀疏参数以kv形式托管到参数服务,在线预测只需将所需要的参数子集从参数服务读取回来,再执行后续的预测流程。 - -我们以CTR预估模型为例,演示Paddle Serving中如何使用大规模稀疏参数服务。关于模型细节请参考[原始模型](https://github.com/PaddlePaddle/models/tree/v1.5/PaddleRec/ctr) - -根据[对数据集的描述](https://www.kaggle.com/c/criteo-display-ad-challenge/data),该模型原始输入为13维integer features和26维categorical features。在我们的模型中,13维integer feature作为dense feature整体feed到一个data layer,而26维categorical features各自作为一个feature分别feed到一个data layer。除此之外,为计算auc指标,还将label作为一个feature输入。 - -若按缺省训练参数,本模型的embedding dim为100w,size为10,也就是参数矩阵为1000000 x 10的float型矩阵,实际占用内存共1000000 x 10 x sizeof(float) = 39MB;**实际场景中,embedding参数要大的多;因此该demo仅为演示使用**。 - - -## 2. 
模型裁剪 - -在写本文档时([v1.5](https://github.com/PaddlePaddle/models/tree/v1.5)),训练脚本用PaddlePaddle py_reader加速样例读取速度,program中带有py_reader相关OP,且训练过程中只保存了模型参数,没有保存program,保存的参数没法直接用预测库加载;另外原始网络中最终输出的tensor是auc和batch_auc,而实际模型用于预测时只需要每个样例的predict,需要改掉模型的输出tensor为predict。再有,为了演示稀疏参数服务的使用,我们要有意将embedding layer包含的lookup_table OP从预测program中拿掉,以embedding layer的output variable作为网络的输入,然后再添加对应的feed OP,使得我们能够在预测时从稀疏参数服务获取到embedding向量后,将数据直接feed到各个embedding的output variable。 - -基于以上几方面考虑,我们需要对原始program进行裁剪。大致过程为: - -1) 去掉py_reader相关代码,改为用fluid自带的reader和DataFeed -2) 修改原始网络配置,将predict变量作为fetch target -3) 修改原始网络配置,将26个稀疏参数的embedding layer的output作为feed target,以与后续稀疏参数服务配合使用 -4) 修改后的网络,本地train 1个batch后,调用`fluid.io.save_inference_model()`,获得裁剪后的模型program -5) 裁剪后的program,用python再次处理,去掉embedding layer的lookup_table OP。这是因为,当前Paddle Fluid在第4步`save_inference_model()`时没有裁剪干净,还保留了embedding的lookup_table OP;如果这些OP不去除掉,那么embedding的output variable就会有2个输入OP:一个是feed OP(我们要添加的),一个是lookup_table;而lookup_table又没有输入,它的输出会与feed OP的输出互相覆盖,导致错乱。另外网络中还保留了SparseFeatFactors这个variable(全局共享的embedding矩阵对应的变量),这个variable也要去掉,否则网络加载时还会尝试从磁盘读取embedding参数,就失去了我们这个demo的意义。 -6) 第4步拿到的program,与分布式训练保存的模型参数(除embedding之外)保存到一起,形成完整的预测模型 - -第1) - 第5)步裁剪完毕后的模型网络配置如下: - -![Pruned CTR prediction network](../images/pruned-ctr-network.png) - - -整个裁剪过程具体说明如下: - -### 2.1 网络配置中去除py_reader - -Inference program调用ctr_dnn_model()函数时添加`user_py_reader=False`参数。这会在ctr_dnn_model定义中将py_reader相关的代码去掉 - -修改前: -```python -def train(): - args = parse_args() - - if not os.path.isdir(args.model_output_dir): - os.mkdir(args.model_output_dir) - - loss, auc_var, batch_auc_var, py_reader, _ = ctr_dnn_model(args.embedding_size, args.sparse_feature_dim) - ... -``` - -修改后: -```python -def train(): - args = parse_args() - - if not os.path.isdir(args.model_output_dir): - os.mkdir(args.model_output_dir) - - loss, auc_var, batch_auc_var, py_reader, _ = ctr_dnn_model(args.embedding_size, args.sparse_feature_dim, use_py_reader=False) - ... 
-``` - - -### 2.2 网络配置中修改feed targets和fetch targets - -如第2节开头所述,为了使program适合于演示稀疏参数的使用,我们要裁剪program,将`ctr_dnn_model`中feed variable list和fetch variable分别改掉: - -1) Inference program中26维稀疏特征的输入改为每个特征的embedding layer的output variable -2) fetch targets中返回的是predict,取代auc_var和batch_auc_var - -截至写本文时,原始的网络配置 (network_conf.py中)`ctr_dnn_model`定义如下: - -```python -def ctr_dnn_model(embedding_size, sparse_feature_dim, use_py_reader=True): - - def embedding_layer(input): - emb = fluid.layers.embedding( - input=input, - is_sparse=True, - # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190 - # if you want to set is_distributed to True - is_distributed=False, - size=[sparse_feature_dim, embedding_size], - param_attr=fluid.ParamAttr(name="SparseFeatFactors", - initializer=fluid.initializer.Uniform())) - return fluid.layers.sequence_pool(input=emb, pool_type='average') # 需修改1 - - dense_input = fluid.layers.data( - name="dense_input", shape=[dense_feature_dim], dtype='float32') - - sparse_input_ids = [ - fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1, dtype='int64') - for i in range(1, 27)] - - label = fluid.layers.data(name='label', shape=[1], dtype='int64') - - words = [dense_input] + sparse_input_ids + [label] - - py_reader = None - if use_py_reader: - py_reader = fluid.layers.create_py_reader_by_data(capacity=64, - feed_list=words, - name='py_reader', - use_double_buffer=True) - words = fluid.layers.read_file(py_reader) - - sparse_embed_seq = list(map(embedding_layer, words[1:-1])) # 需修改2 - concated = fluid.layers.concat(sparse_embed_seq + words[0:1], axis=1) - - fc1 = fluid.layers.fc(input=concated, size=400, act='relu', - param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( - scale=1 / math.sqrt(concated.shape[1])))) - fc2 = fluid.layers.fc(input=fc1, size=400, act='relu', - param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( - scale=1 / math.sqrt(fc1.shape[1])))) - fc3 = fluid.layers.fc(input=fc2, size=400, act='relu', - param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( - scale=1 / math.sqrt(fc2.shape[1])))) - predict = fluid.layers.fc(input=fc3, size=2, act='softmax', - param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( - scale=1 / math.sqrt(fc3.shape[1])))) - - cost = fluid.layers.cross_entropy(input=predict, label=words[-1]) - avg_cost = fluid.layers.reduce_sum(cost) - accuracy = fluid.layers.accuracy(input=predict, label=words[-1]) - auc_var, batch_auc_var, auc_states = \ - fluid.layers.auc(input=predict, label=words[-1], num_thresholds=2 ** 12, slide_steps=20) - - return avg_cost, auc_var, batch_auc_var, py_reader, words # 需修改3 -``` - -修改后 - -```python -def ctr_dnn_model(embedding_size, sparse_feature_dim, use_py_reader=True): - def embedding_layer(input): - emb = fluid.layers.embedding( - input=input, - is_sparse=True, - # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190 - # if you want to set is_distributed to True - is_distributed=False, - size=[sparse_feature_dim, embedding_size], - param_attr=fluid.ParamAttr(name="SparseFeatFactors", - initializer=fluid.initializer.Uniform())) - seq = fluid.layers.sequence_pool(input=emb, pool_type='average') - return emb, seq # 对应上文修改处1 - dense_input = fluid.layers.data( - name="dense_input", shape=[dense_feature_dim], dtype='float32') - sparse_input_ids = [ - fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1, dtype='int64') - for i in range(1, 27)] - label = fluid.layers.data(name='label', shape=[1], dtype='int64') - words = [dense_input] + 
sparse_input_ids + [label] - sparse_embed_and_seq = list(map(embedding_layer, words[1:-1])) - - emb_list = [x[0] for x in sparse_embed_and_seq] # 对应上文修改处2 - sparse_embed_seq = [x[1] for x in sparse_embed_and_seq] - - concated = fluid.layers.concat(sparse_embed_seq + words[0:1], axis=1) - - train_feed_vars = words # 对应上文修改处2 - inference_feed_vars = emb_list + words[0:1] - - fc1 = fluid.layers.fc(input=concated, size=400, act='relu', - param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( - scale=1 / math.sqrt(concated.shape[1])))) - fc2 = fluid.layers.fc(input=fc1, size=400, act='relu', - param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( - scale=1 / math.sqrt(fc1.shape[1])))) - fc3 = fluid.layers.fc(input=fc2, size=400, act='relu', - param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( - scale=1 / math.sqrt(fc2.shape[1])))) - predict = fluid.layers.fc(input=fc3, size=2, act='softmax', - param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( - scale=1 / math.sqrt(fc3.shape[1])))) - cost = fluid.layers.cross_entropy(input=predict, label=words[-1]) - avg_cost = fluid.layers.reduce_sum(cost) - accuracy = fluid.layers.accuracy(input=predict, label=words[-1]) - auc_var, batch_auc_var, auc_states = \ - fluid.layers.auc(input=predict, label=words[-1], num_thresholds=2 ** 12, slide_steps=20) - fetch_vars = [predict] - - # 对应上文修改处3 - return avg_cost, auc_var, batch_auc_var, train_feed_vars, inference_feed_vars, fetch_vars -``` - -说明: - -1) 修改处1,我们将embedding layer的输出变量返回 -2) 修改处2,我们将embedding layer的输出变量保存到`emb_list`,后者进一步保存到`inference_feed_vars`,用来将来在`save_inference_model()`时指定feed variable list。 -3) 修改处3,我们将`words`变量作为训练时的feed variable list (`train_feed_vars`),将embedding layer的output variable作为infer时的feed variable list (`inference_feed_vars`),将`predict`作为fetch target (`fetch_vars`),分别返回。`inference_feed_vars`和`fetch_vars`用于`fluid.io.save_inference_model()`时指定feed variable list和fetch target list - - -### 2.3 fluid.io.save_inference_model()保存裁剪后的program - -`fluid.io.save_inference_model()`不仅保存模型参数,还能够根据feed variable list和fetch target list参数,对program进行裁剪,形成适合inference用的program。大致原理是,根据前向网络配置,从fetch target list开始,反向查找其所依赖的OP列表,并将每个OP的输入加入目标variable list,再次递归地反向找到所有依赖OP和variable list。 - -在2.2节中我们已经拿到所需的`inference_feed_vars`和`fetch_vars`,接下来只要在训练过程中每次保存模型参数时改为调用`fluid.io.save_inference_model()`: - -修改前: - -```python -def train_loop(args, train_program, py_reader, loss, auc_var, batch_auc_var, - trainer_num, trainer_id): - -...省略 - for pass_id in range(args.num_passes): - pass_start = time.time() - batch_id = 0 - py_reader.start() - - try: - while True: - loss_val, auc_val, batch_auc_val = pe.run(fetch_list=[loss.name, auc_var.name, batch_auc_var.name]) - loss_val = np.mean(loss_val) - auc_val = np.mean(auc_val) - batch_auc_val = np.mean(batch_auc_val) - - logger.info("TRAIN --> pass: {} batch: {} loss: {} auc: {}, batch_auc: {}" - .format(pass_id, batch_id, loss_val/args.batch_size, auc_val, batch_auc_val)) - if batch_id % 1000 == 0 and batch_id != 0: - model_dir = args.model_output_dir + '/batch-' + str(batch_id) - if args.trainer_id == 0: - fluid.io.save_persistables(executor=exe, dirname=model_dir, - main_program=fluid.default_main_program()) - batch_id += 1 - except fluid.core.EOFException: - py_reader.reset() - print("pass_id: %d, pass_time_cost: %f" % (pass_id, time.time() - pass_start)) -...省略 -``` - -修改后 - -```python -def train_loop(args, - train_program, - train_feed_vars, - inference_feed_vars, # 裁剪program用的feed variable list - fetch_vars, # 
裁剪program用的fetch variable list - loss, - auc_var, - batch_auc_var, - trainer_num, - trainer_id): - # 因为已经将py_reader去掉,这里用fluid自带的DataFeeder - dataset = reader.CriteoDataset(args.sparse_feature_dim) - train_reader = paddle.batch( - paddle.reader.shuffle( - dataset.train([args.train_data_path], trainer_num, trainer_id), - buf_size=args.batch_size * 100), - batch_size=args.batch_size) - - inference_feed_var_names = [var.name for var in inference_feed_vars] - - place = fluid.CPUPlace() - exe = fluid.Executor(place) - exe.run(fluid.default_startup_program()) - total_time = 0 - pass_id = 0 - batch_id = 0 - - feed_var_names = [var.name for var in feed_vars] - feeder = fluid.DataFeeder(feed_var_names, place) - - for data in train_reader(): - loss_val, auc_val, batch_auc_val = exe.run(fluid.default_main_program(), - feed = feeder.feed(data), - fetch_list=[loss.name, auc_var.name, batch_auc_var.name]) - fluid.io.save_inference_model(model_dir, - inference_feed_var_names, - fetch_vars, - exe, - fluid.default_main_program()) - break # 我们只要裁剪后的program,不需要模型参数,因此只train一个batch就停止了 - loss_val = np.mean(loss_val) - auc_val = np.mean(auc_val) - batch_auc_val = np.mean(batch_auc_val) - logger.info("TRAIN --> pass: {} batch: {} loss: {} auc: {}, batch_auc: {}" - .format(pass_id, batch_id, loss_val/args.batch_size, auc_val, batch_auc_val)) -``` - -### 2.4 用python再次处理inference program,去除lookup_table OP和SparseFeatFactors变量 - -这一步是因为`fluid.io.save_inference_model()`裁剪出的program没有将lookup_table OP去除。未来如果`save_inference_model`接口完善,本节可跳过 - -主要代码: - -```python -def prune_program(): - args = parse_args() - - # 从磁盘打开网络配置文件并反序列化成protobuf message - model_dir = args.model_output_dir + "/inference_only" - model_file = model_dir + "/__model__" - with open(model_file, "rb") as f: - protostr = f.read() - f.close() - proto = framework_pb2.ProgramDesc.FromString(six.binary_type(protostr)) - - # 去除lookup_table OP - block = proto.blocks[0] - kept_ops = [op for op in block.ops if op.type != "lookup_table"] - del block.ops[:] - block.ops.extend(kept_ops) - - # 去除SparseFeatFactors var - kept_vars = [var for var in block.vars if var.name != "SparseFeatFactors"] - del block.vars[:] - block.vars.extend(kept_vars) - - # 写回磁盘文件 - with open(model_file + ".pruned", "wb") as f: - f.write(proto.SerializePartialToString()) - f.close() - with open(model_file + ".prototxt.pruned", "w") as f: - f.write(text_format.MessageToString(proto)) - f.close() -``` - -### 2.5 裁剪过程串到一起 - -我们提供了完整的裁剪CTR预估模型的脚本文件save_program.py,同[CTR分布式训练和Serving流程化部署](https://github.com/PaddlePaddle/Serving/blob/master/doc/DEPLOY.md)一起发布,可以在trainer和pserver容器的训练脚本目录下找到,也可以在[这里](https://github.com/PaddlePaddle/Serving/tree/master/doc/resource)下载。 - -## 3. 整个预测计算流程 - -Client端: -1) Dense feature: 从dataset每条样例读取13个integer features,形成1个dense feature -2) Sparse feature: 从dataset每条样例读取26个categorical feature,分别经过hash(str(feature_index) + feature_string)签名,得到每个feature的id,形成26个sparse feature - -Serving端: -1) Dense feature: dense feature共13个float型数字,一起feed到网络dense_input这个variable对应的LodTensor -2) Sparse feature: 26个sparse feature id,分别访问kv服务获取对应的embedding向量,feed到对应的26个embedding layer的output variable。在我们裁剪出来的网络中,这些variable分别对应的变量名为embedding_0.tmp_0, embedding_1.tmp_0, ... embedding_25.tmp_0 -3) 执行预测,获取预测结果。 diff --git a/doc/deprecated/FAQ.md b/doc/deprecated/FAQ.md deleted file mode 100644 index 2ba9ec9d5e0a5d7c8f0ccc3ebfc480f21170751d..0000000000000000000000000000000000000000 --- a/doc/deprecated/FAQ.md +++ /dev/null @@ -1,26 +0,0 @@ -# FAQ -## 1. 如何修改端口配置? 
-使用该框架搭建的服务需要申请一个端口,可以通过以下方式修改端口号: - -- 如果在inferservice_file里指定了port:xxx,那么就去申请该端口号; -- 否则,如果在gflags.conf里指定了--port:xxx,那就去申请该端口号; -- 否则,使用程序里指定的默认端口号:8010。 - -## 2. GPU预测中为何请求的响应时间波动会非常大? -PaddleServing依托PaddlePaddle预测库执行预测计算;在GPU设备上,由于同一个进程内目前共用1个GPU stream,进程内的多个请求的预测计算会被严格串行。所以如果有2个请求同时到达某个Serving实例,不管该实例启动时创建了多少个worker线程,都不能起到加速作用,后到的请求会被排队,直到前面请求计算完成。 - -## 3. 如何充分利用GPU卡的计算能力? -如问题2所说,由于预测库的限制,单个Serving进程只能绑定单张GPU卡,且进程内共用1个GPU stream,所有请求必须串行计算。 - -为提高GPU卡使用率,目前可以想到的方法是:在单张GPU卡上启动多个Serving进程,每个进程绑定一个GPU stream,多个stream并行计算。这种方法是否能起到加速作用,受限于多个因素,主要有: - -1. 单个stream占用GPU算力;假如单个stream已经将GPU算力占用超过50%,那么增加stream很可能会导致2个stream的job分别排队,拖慢各自的响应时间 -2. GPU显存:Serving进程需要将模型参数加载到显存中,并且计算时要在GPU显存池分配临时变量;假如单个Serving进程已经用掉超过50%的显存,则增加Serving进程会造成显存不足,导致进程报错退出 - -为此,可采用如下步骤,进行测试: - -1. 加载模型时,在model_toolkit.prototxt中,model type选择FLUID_GPU_ANALYSIS或FLUID_GPU_ANALYSIS_DIR;会对模型进行静态分析,进行一定程度显存优化 -2. 在步骤1完成后,启动单个Serving进程,启动参数:`--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4`;启动一个client,进行并发度为1的压力测试,batch size从小到大,记下平响;由于算力的限制,当batch size增大到一定程度,应该会出现响应时间明显变大;或虽然没有明显变大,但已经不满足系统需求 -3. 再启动1个Serving进程,与步骤2启动时使用相同的参数略有不同: `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8011` 其中--port=8011用来让新启动的进程使用一个新的服务端口;然后同时对这2个Serving进程进行压测,继续观察batch size从小到大时平均响应时间的变化,直到取得batch size和响应时间的折中 -4. 重复步骤2-3 -5. 以2-4步的测试,来决定:单张GPU卡可以由多少个Serving进程共用; 实际部署时,就在一张GPU卡上启动这么多个Serving进程同时提供服务 diff --git a/doc/deprecated/HTTP_INTERFACE.md b/doc/deprecated/HTTP_INTERFACE.md deleted file mode 100644 index 96df2edc7b98aaa995e93fcd806cded01d044bd7..0000000000000000000000000000000000000000 --- a/doc/deprecated/HTTP_INTERFACE.md +++ /dev/null @@ -1,131 +0,0 @@ -# HTTP Inferface - -Paddle Serving服务均可以通过HTTP接口访问,客户端只需按照Service定义的Request消息格式构造json字符串即可。客户端构造HTTP请求,将json格式数据以POST请求发给serving端,serving端**自动**按Service定义的Protobuf消息格式,将json数据转换成protobuf消息。 - -本文档介绍以python和PHP语言访问Serving的HTTP服务接口的用法。 - -## 1. 访问地址 - -访问Serving节点的HTTP服务与C++服务使用同一个端口(例如8010),访问URL规则为: - -``` -http://127.0.0.1:8010/ServiceName/inference -http://127.0.0.1:8010/ServiceName/debug -``` - -其中ServiceName应该与Serving的配置文件`conf/services.prototxt`中配置的一致,假如有如下2个service: - -```protobuf -services { - name: "BuiltinTestEchoService" - workflows: "workflow3" -} - -services { - name: "TextClassificationService" - workflows: "workflow6" -} -``` - -则访问上述2个Serving服务的HTTP URL分别为: - -``` -http://127.0.0.1:8010/BuiltinTestEchoService/inference -http://127.0.0.1:8010/BuiltinTestEchoService/debug - -http://127.0.0.1:8010/TextClassificationService/inference -http://127.0.0.1:8010/TextClassificationService/debug -``` - -## 2. Python访问HTTP Serving - -Python语言访问HTTP Serving,关键在于构造json格式的请求数据,可以通过以下步骤完成: - -1) 按照Service定义的Request消息格式构造python object -2) `json.dump()` / `json.dumps()` 等函数将python object转换成json格式字符串 - -以TextClassificationService为例,关键代码如下: - -```python -# Connect to server -conn = httplib.HTTPConnection("127.0.0.1", 8010) - -# samples是一个list,其中每个元素是一个ids字典: -# samples[0] = [190, 1, 70, 382, 914, 5146, 190...] 
-for i in range(0, len(samples) - BATCH_SIZE, BATCH_SIZE): - # 构建批量预测数据 - batch = samples[i: i + BATCH_SIZE] - ids = [] - for x in batch: - ids.append({"ids" : x}) - ids = {"instances": ids} - - # python object转成json - request_json = json.dumps(ids) - - # 请求HTTP服务,打印response - try: - conn.request('POST', "/TextClassificationService/inference", request_json, {"Content-Type": "application/json"}) - response = conn.getresponse() - print response.read() - except httplib.HTTPException as e: - print e.reason -``` - -完整示例请参考[text_classification.py](https://github.com/PaddlePaddle/Serving/blob/develop/tools/cpp_examples/demo-client/python/text_classification.py) - -## 3. PHP访问HTTP Serving - -PHP语言构造json格式字符串的步骤如下: - -1) 按照Service定义的Request消息格式,构造PHP array -2) `json_encode()`函数将PHP array转换成json字符串 - -以TextCLassificationService为例,关键代码如下: - -```PHP -function http_post(&$ch, $data) { - // array to json string - $data_string = json_encode($data); - - // post data 封装 - curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string); - - // set header - curl_setopt($ch, - CURLOPT_HTTPHEADER, - array( - 'Content-Length: ' . strlen($data_string) - ) - ); - - // 执行 - $result = curl_exec($ch); - return $result; -} - -$ch = &http_connect('http://127.0.0.1:8010/TextClassificationService/inference'); - -$count = 0; - -# $samples是一个2层array,其中每个元素是一个如下array: -# $samples[0] = array( -# "ids" => array( -# [0] => int(190), -# [1] => int(1), -# [2] => int(70), -# [3] => int(382), -# [4] => int(914), -# [5] => int(5146), -# [6] => int(190)...) -# ) - -for ($i = 0; $i < count($samples) - BATCH_SIZE; $i += BATCH_SIZE) { - $instances = array_slice($samples, $i, BATCH_SIZE); - echo http_post($ch, array("instances" => $instances)) . "\n"; -} - -curl_close($ch); -``` - -完整代码请参考[text_classification.php](https://github.com/PaddlePaddle/Serving/blob/develop/tools/cpp_examples/demo-client/php/text_classification.php) diff --git a/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md b/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md deleted file mode 100644 index e0fc00301302e75ffb18c0d0ec9c385174127b71..0000000000000000000000000000000000000000 --- a/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md +++ /dev/null @@ -1,121 +0,0 @@ -# Model Ensemble in Paddle Serving - -([简体中文](MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md)|English) - -In some scenarios, multiple models with the same input may be used to predict in parallel and integrate predicted results for better prediction effect. Paddle Serving also supports this feature. - -Next, we will take the text classification task as an example to show model ensemble in Paddle Serving (This feature is still serial prediction for the time being. We will support parallel prediction as soon as possible). - -## Simple example - -In this example (see the figure below), the server side predict the bow and CNN models with the same input in a service in parallel, The client side fetchs the prediction results of the two models, and processes the prediction results to get the final predict results. - -![simple example](../images/model_ensemble_example.png) - -It should be noted that at present, only multiple models with the same format input and output in the same service are supported. In this example, the input and output formats of CNN and BOW model are the same. - -The code used in the example is saved in the `python/examples/imdb` path: - -```shell -. 
-├── get_data.sh -├── imdb_reader.py -├── test_ensemble_client.py -└── test_ensemble_server.py -``` - -### Prepare data - -Get the pre-trained CNN and BOW models by the following command (you can also run the `get_data.sh` script): - -```shell -wget --no-check-certificate https://fleet.bj.bcebos.com/text_classification_data.tar.gz -wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imdb-demo/imdb_model.tar.gz -tar -zxvf text_classification_data.tar.gz -tar -zxvf imdb_model.tar.gz -``` - -### Start server - -Start server by the following Python code (you can also run the `test_ensemble_server.py` script): - -```python -from paddle_serving_server import OpMaker -from paddle_serving_server import OpGraphMaker -from paddle_serving_server import Server - -op_maker = OpMaker() -read_op = op_maker.create('general_reader') -cnn_infer_op = op_maker.create( - 'general_infer', engine_name='cnn', inputs=[read_op]) -bow_infer_op = op_maker.create( - 'general_infer', engine_name='bow', inputs=[read_op]) -response_op = op_maker.create( - 'general_response', inputs=[cnn_infer_op, bow_infer_op]) - -op_graph_maker = OpGraphMaker() -op_graph_maker.add_op(read_op) -op_graph_maker.add_op(cnn_infer_op) -op_graph_maker.add_op(bow_infer_op) -op_graph_maker.add_op(response_op) - -server = Server() -server.set_op_graph(op_graph_maker.get_op_graph()) -model_config = {cnn_infer_op: 'imdb_cnn_model', bow_infer_op: 'imdb_bow_model'} -server.load_model_config(model_config) -server.prepare_server(workdir="work_dir1", port=9393, device="cpu") -server.run_server() -``` - -Different from the normal prediction service, here we need to use DAG to describe the logic of the server side. - -When creating an Op, you need to specify the predecessor of the current Op (in this example, the predecessor of `cnn_infer_op` and `bow_infer_op` is `read_op`, and the predecessor of `response_op` is `cnn_infer_op` and `bow_infer_op`. For the infer Op `infer_op`, you need to define the prediction engine name `engine_name` (You can also use the default value. It is recommended to set the value to facilitate the client side to obtain the order of prediction results). - -At the same time, when configuring the model path, you need to create a model configuration dictionary with the infer Op as the key and the corresponding model path as value to inform Serving which model each infer OP uses. - -### Start client - -Start client by the following Python code (you can also run the `test_ensemble_client.py` script): - -```python -from paddle_serving_client import Client -from imdb_reader import IMDBDataset - -client = Client() -# If you have more than one model, make sure that the input -# and output of more than one model are the same. -client.load_client_config('imdb_bow_client_conf/serving_client_conf.prototxt') -client.connect(["127.0.0.1:9393"]) - -# you can define any english sentence or dataset here -# This example reuses imdb reader in training, you -# can define your own data preprocessing easily. 
-imdb_dataset = IMDBDataset() -imdb_dataset.load_resource('imdb.vocab') - -for i in range(3): - line = 'i am very sad | 0' - word_ids, label = imdb_dataset.get_words_and_label(line) - feed = {"words": word_ids} - fetch = ["acc", "cost", "prediction"] - fetch_maps = client.predict(feed=feed, fetch=fetch) - if len(fetch_maps) == 1: - print("step: {}, res: {}".format(i, fetch_maps['prediction'][0][1])) - else: - for model, fetch_map in fetch_maps.items(): - print("step: {}, model: {}, res: {}".format(i, model, fetch_map[ - 'prediction'][0][1])) -``` - -Compared with the normal prediction service, the client side has not changed much. When multiple model predictions are used, the prediction service will return a dictionary with engine name `engine_name`(the value is defined on the server side) as the key, and the corresponding model prediction results as the value. - -### Expected result - -```shell -step: 0, model: cnn, res: 0.560272455215 -step: 0, model: bow, res: 0.633530199528 -step: 1, model: cnn, res: 0.560272455215 -step: 1, model: bow, res: 0.633530199528 -step: 2, model: cnn, res: 0.560272455215 -step: 2, model: bow, res: 0.633530199528 -``` diff --git a/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md b/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md deleted file mode 100644 index 12d6e5c3ed697c2b06ca346357fa0f2617116e4d..0000000000000000000000000000000000000000 --- a/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md +++ /dev/null @@ -1,121 +0,0 @@ -# Paddle Serving中的集成预测 - -(简体中文|[English](MODEL_ENSEMBLE_IN_PADDLE_SERVING.md)) - -在一些场景中,可能使用多个相同输入的模型并行集成预测以获得更好的预测效果,Paddle Serving提供了这项功能。 - -下面将以文本分类任务为例,来展示Paddle Serving的集成预测功能(暂时还是串行预测,我们会尽快支持并行化)。 - -## 集成预测样例 - -该样例中(见下图),Server端在一项服务中并行预测相同输入的BOW和CNN模型,Client端获取两个模型的预测结果并进行后处理,得到最终的预测结果。 - -![simple example](../images/model_ensemble_example.png) - -需要注意的是,目前只支持在同一个服务中使用多个相同格式输入输出的模型。在该例子中,CNN模型和BOW模型的输入输出格式是相同的。 - -样例中用到的代码保存在`python/examples/imdb`路径下: - -```shell -. 
-├── get_data.sh -├── imdb_reader.py -├── test_ensemble_client.py -└── test_ensemble_server.py -``` - -### 数据准备 - -通过下面命令获取预训练的CNN和BOW模型(您也可以直接运行`get_data.sh`脚本): - -```shell -wget --no-check-certificate https://fleet.bj.bcebos.com/text_classification_data.tar.gz -wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imdb-demo/imdb_model.tar.gz -tar -zxvf text_classification_data.tar.gz -tar -zxvf imdb_model.tar.gz -``` - -### 启动Server - -通过下面的Python代码启动Server端(您也可以直接运行`test_ensemble_server.py`脚本): - -```python -from paddle_serving_server import OpMaker -from paddle_serving_server import OpGraphMaker -from paddle_serving_server import Server - -op_maker = OpMaker() -read_op = op_maker.create('general_reader') -cnn_infer_op = op_maker.create( - 'general_infer', engine_name='cnn', inputs=[read_op]) -bow_infer_op = op_maker.create( - 'general_infer', engine_name='bow', inputs=[read_op]) -response_op = op_maker.create( - 'general_response', inputs=[cnn_infer_op, bow_infer_op]) - -op_graph_maker = OpGraphMaker() -op_graph_maker.add_op(read_op) -op_graph_maker.add_op(cnn_infer_op) -op_graph_maker.add_op(bow_infer_op) -op_graph_maker.add_op(response_op) - -server = Server() -server.set_op_graph(op_graph_maker.get_op_graph()) -model_config = {cnn_infer_op: 'imdb_cnn_model', bow_infer_op: 'imdb_bow_model'} -server.load_model_config(model_config) -server.prepare_server(workdir="work_dir1", port=9393, device="cpu") -server.run_server() -``` - -与普通预测服务不同的是,这里我们需要用DAG来描述Server端的运行逻辑。 - -在创建Op的时候需要指定当前Op的前继(在该例子中,`cnn_infer_op`与`bow_infer_op`的前继均是`read_op`,`response_op`的前继是`cnn_infer_op`和`bow_infer_op`),对于预测Op`infer_op`还需要定义预测引擎名称`engine_name`(也可以使用默认值,建议设置该值方便Client端获取预测结果)。 - -同时在配置模型路径时,需要以预测Op为key,对应的模型路径为value,创建模型配置字典,来告知Serving每个预测Op使用哪个模型。 - -### 启动Client - -通过下面的Python代码运行Client端(您也可以直接运行`test_ensemble_client.py`脚本): - -```python -from paddle_serving_client import Client -from imdb_reader import IMDBDataset - -client = Client() -# If you have more than one model, make sure that the input -# and output of more than one model are the same. -client.load_client_config('imdb_bow_client_conf/serving_client_conf.prototxt') -client.connect(["127.0.0.1:9393"]) - -# you can define any english sentence or dataset here -# This example reuses imdb reader in training, you -# can define your own data preprocessing easily. -imdb_dataset = IMDBDataset() -imdb_dataset.load_resource('imdb.vocab') - -for i in range(3): - line = 'i am very sad | 0' - word_ids, label = imdb_dataset.get_words_and_label(line) - feed = {"words": word_ids} - fetch = ["acc", "cost", "prediction"] - fetch_maps = client.predict(feed=feed, fetch=fetch) - if len(fetch_maps) == 1: - print("step: {}, res: {}".format(i, fetch_maps['prediction'][0][1])) - else: - for model, fetch_map in fetch_maps.items(): - print("step: {}, model: {}, res: {}".format(i, model, fetch_map[ - 'prediction'][0][1])) -``` - -Client端与普通预测服务没有发生太大的变化。当使用多个模型预测时,预测服务将返回一个key为Server端定义的引擎名称`engine_name`,value为对应的模型预测结果的字典。 - -### 预期结果 - -```txt -step: 0, model: cnn, res: 0.560272455215 -step: 0, model: bow, res: 0.633530199528 -step: 1, model: cnn, res: 0.560272455215 -step: 1, model: bow, res: 0.633530199528 -step: 2, model: cnn, res: 0.560272455215 -step: 2, model: bow, res: 0.633530199528 -``` diff --git a/doc/images/detection.png b/doc/images/detection.png new file mode 100644 index 0000000000000000000000000000000000000000..f6d5e4e4d5ed071d19b844ae3d393d730f4485ec Binary files /dev/null and b/doc/images/detection.png differ