diff --git a/README.md b/README.md
index 747c140ded49f279c289b0bc8a3b4b1963243040..84fbf579579194076d9994079628bf056506f4b0 100644
--- a/README.md
+++ b/README.md
@@ -82,7 +82,8 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
 | `port` | int | `9292` | Exposed port of current service to users|
 | `name` | str | `""` | Service name, can be used to generate HTTP request url |
 | `model` | str | `""` | Path of paddle model directory to be served |
-| `mem_optim` | bool | `False` | Enable memory optimization |
+| `mem_optim` | bool | `False` | Enable memory / graphics memory optimization |
+| `ir_optim` | bool | `False` | Enable analysis and optimization of the computation graph |
 
 Here, we use `curl` to send a HTTP POST request to the service we just started. Users can use any python library to send HTTP POST as well, e.g, [requests](https://requests.readthedocs.io/en/master/).
 
diff --git a/README_CN.md b/README_CN.md
index 266fca330d7597d6188fa0022e6376bc23149c74..6e0843e40588bb6f5af91b17c2eb85bf4bebc8e8 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -87,6 +87,7 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
 | `name` | str | `""` | Service name, can be used to generate HTTP request url |
 | `model` | str | `""` | Path of paddle model directory to be served |
 | `mem_optim` | bool | `False` | Enable memory optimization |
+| `ir_optim` | bool | `False` | Enable analysis and optimization of the computation graph |
 
 我们使用 `curl` 命令来发送HTTP POST请求给刚刚启动的服务。用户也可以调用python库来发送HTTP POST请求,请参考英文文档 [requests](https://requests.readthedocs.io/en/master/)。
 
diff --git a/doc/PERFORMANCE_OPTIM.md b/doc/PERFORMANCE_OPTIM.md
new file mode 100644
index 0000000000000000000000000000000000000000..4b025e94d6f8d3ed69fb76898eb6afada9ca6613
--- /dev/null
+++ b/doc/PERFORMANCE_OPTIM.md
@@ -0,0 +1,27 @@
+# Performance optimization
+
+Due to different model structures, different prediction services consume different amounts of computing resources when performing predictions. For an online prediction service, a model that requires fewer computing resources spends a larger share of its time on communication; such a service is called communication-intensive. A model that requires more computing resources spends most of its time on inference computation; such a service is called computation-intensive.
+
+For a given prediction service, the easiest way to determine its type is to look at the proportion of time spent in each stage. Paddle Serving provides a [Timeline tool](../python/examples/util/README_CN.md) that visualizes the time consumed by each stage of the prediction service.
+
+For communication-intensive prediction services, requests can be aggregated: as long as the added latency is tolerable, multiple prediction requests can be combined into one batch for prediction.
+
+For computation-intensive prediction services, you can use a GPU prediction service instead of a CPU one, or increase the number of graphics cards of the GPU prediction service.
+
+Under the same conditions, the HTTP prediction service provided by Paddle Serving has a longer communication time than the RPC prediction service, so for communication-intensive services, please give priority to RPC communication.
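+
+As a minimal sketch of how to apply these switches (assuming `mem_optim` and `ir_optim` are exposed as boolean command line flags of `paddle_serving_server.serve`, matching the parameter table below), the quick start model could be served with both optimizations enabled:
+
+```shell
+# Start the uci_housing demo service with memory / graphics memory optimization
+# and computation graph optimization turned on; the flag names assume the
+# parameters below map directly to command line options.
+python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --mem_optim --ir_optim
+```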
+
+Parameters for performance optimization:
+
+| Parameters | Type | Default | Description |
+| ---------- | ---- | ------- | ------------------------------------------------------------ |
+| mem_optim | bool | False | Enable memory / graphics memory optimization |
+| ir_optim | bool | False | Enable analysis and optimization of the computation graph, including OP fusion, etc. |
diff --git a/doc/PERFORMANCE_OPTIM_CN.md b/doc/PERFORMANCE_OPTIM_CN.md
index dd17bc8afab8472f8f55b4870f73e4c481e97cd3..7bd64d3e2d645c9328ead55e867d0b97946840ad 100644
--- a/doc/PERFORMANCE_OPTIM_CN.md
+++ b/doc/PERFORMANCE_OPTIM_CN.md
@@ -1,6 +1,6 @@
 # 性能优化
 
-由于模型结构的不同,在执行预测时不同的预测对计算资源的消耗也不相同,对于在线的预测服务来说,对计算资源要求较少的模型,通信的时间成本占比就会较高,称为通信密集型服务,对计算资源要求较多的模型,推理计算的时间成本较高,称为计算密集型服务。对于这两种服务类型,可以根据实际需求采取不同的方式进行优化
+由于模型结构的不同,在执行预测时不同的预测服务对计算资源的消耗也不相同。对于在线的预测服务来说,对计算资源要求较少的模型,通信的时间成本占比就会较高,称为通信密集型服务,对计算资源要求较多的模型,推理计算的时间成本较高,称为计算密集型服务。对于这两种服务类型,可以根据实际需求采取不同的方式进行优化
 
 对于一个预测服务来说,想要判断属于哪种类型,最简单的方法就是看时间占比,Paddle Serving提供了[Timeline工具](../python/examples/util/README_CN.md),可以直观的展现预测服务中各阶段的耗时。
 
@@ -10,4 +10,9 @@
 
 在相同条件下,Paddle Serving提供的HTTP预测服务的通信时间是大于RPC预测服务的,因此对于通信密集型的服务请优先考虑使用RPC的通信方式。
 
-对于模型较大,预测服务内存或显存占用较多的情况,可以通过将--mem_optim选项设置为True来开启内存/显存优化。
+性能优化相关参数:
+
+| 参数 | 类型 | 默认值 | 含义 |
+| --------- | ---- | ------ | -------------------------------- |
+| mem_optim | bool | False | 开启内存/显存优化 |
+| ir_optim | bool | False | 开启计算图分析优化,包括OP融合等 |