Merge branch 'develop' into newtest

c46ea3d4 · Jiawei Wang · GitHub · 6130299f · 1329841c · c46ea3d4
17 changed files
--- a/README.md
+++ b/README.md
@@ -165,17 +165,19 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
 ```
 <center>

-| Argument | Type | Default | Description |
-|--------------|------|-----------|--------------------------------|
-| `thread` | int | `4` | Concurrency of current service |
-| `port` | int | `9292` | Exposed port of current service to users|
-| `model` | str | `""` | Path of paddle model directory to be served |
-| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
-| `ir_optim` | - | - | Enable analysis and optimization of calculation graph |
-| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
-| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT  |
-| `use_lite` (Only for ARM) | - | - | Run PaddleLite inference |
-| `use_xpu` (Only for ARM+XPU) | - | - | Run PaddleLite XPU inference |
+| Argument                                       | Type | Default | Description                                           |
+| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
+| `thread`                                       | int  | `4`     | Concurrency of current service                        |
+| `port`                                         | int  | `9292`  | Exposed port of current service to users              |
+| `model`                                        | str  | `""`    | Path of paddle model directory to be served           |
+| `mem_optim_off`                                | -    | -       | Disable memory / graphic memory optimization          |
+| `ir_optim`                                     | bool | False   | Enable analysis and optimization of calculation graph |
+| `use_mkl` (Only for cpu version)               | -    | -       | Run inference with MKL                                |
+| `use_trt` (Only for trt version)               | -    | -       | Run inference with TensorRT                           |
+| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -    | -       | Run PaddleLite inference                              |
+| `use_xpu`                                      | -    | -       | Run PaddleLite inference with Baidu Kunlun XPU        |
+| `precision`                                    | str  | FP32    | Precision Mode, support FP32, FP16, INT8              |
+| `use_calib`                                    | bool | False   | Only for deployment with TensorRT                     |

 </center>
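
The argument table maps directly onto the command line of `python -m paddle_serving_server.serve`. A minimal sketch of that mapping, assuming `model`, `thread`, `port`, and `precision` take a value and the remaining flags are bare switches (as `--ir_optim` and `--use_lite` are used elsewhere in this diff). `build_serve_command` is a hypothetical helper, not part of Paddle Serving, and `use_calib` is left out because the table does not show how it is passed:

```python
import shlex

# Flags copied from the argument table; how each one is passed is an assumption.
VALUE_FLAGS = ("model", "thread", "port", "precision")
SWITCH_FLAGS = ("mem_optim_off", "ir_optim", "use_mkl", "use_trt", "use_lite", "use_xpu")


def build_serve_command(**options):
    """Return the argv list for `python -m paddle_serving_server.serve`."""
    argv = ["python", "-m", "paddle_serving_server.serve"]
    for name, value in options.items():
        if name in VALUE_FLAGS:
            argv += [f"--{name}", str(value)]
        elif name in SWITCH_FLAGS:
            if value:  # a switch is emitted only when it is truthy
                argv.append(f"--{name}")
        else:
            raise ValueError(f"unknown argument: {name}")
    return argv


cmd = build_serve_command(model="uci_housing_model", thread=10, port=9292, ir_optim=True)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result usable with `subprocess.run` without shell quoting concerns.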


--- a/README_CN.md
+++ b/README_CN.md
@@ -164,18 +164,19 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
 ```
 <center>

-| Argument | Type | Default | Description |
-|--------------|------|-----------|--------------------------------|
-| `thread` | int | `4` | Concurrency of current service |
-| `port` | int | `9292` | Exposed port of current service to users|
-| `name` | str | `""` | Service name, can be used to generate HTTP request url |
-| `model` | str | `""` | Path of paddle model directory to be served |
-| `mem_optim_off` | - | - | Disable memory optimization |
-| `ir_optim` | - | - | Enable analysis and optimization of calculation graph |
-| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
-| `use_trt` (Only for Cuda>=10.1 version) | - | - | Run inference with TensorRT  |
-| `use_lite` (Only for ARM) | - | - | Run PaddleLite inference |
-| `use_xpu` (Only for ARM+XPU) | - | - | Run PaddleLite XPU inference |
+| Argument                                       | Type | Default | Description                                            |
+| ---------------------------------------------- | ---- | ------- | ------------------------------------------------------ |
+| `thread`                                       | int  | `4`     | Concurrency of current service                         |
+| `port`                                         | int  | `9292`  | Exposed port of current service to users               |
+| `name`                                         | str  | `""`    | Service name, can be used to generate HTTP request url |
+| `model`                                        | str  | `""`    | Path of paddle model directory to be served            |
+| `mem_optim_off`                                | -    | -       | Disable memory optimization                            |
+| `ir_optim`                                     | bool | False   | Enable analysis and optimization of calculation graph  |
+| `use_mkl` (Only for cpu version)               | -    | -       | Run inference with MKL                                 |
+| `use_trt` (Only for Cuda>=10.1 version)        | -    | -       | Run inference with TensorRT                            |
+| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -    | -       | Run PaddleLite inference                               |
+| `use_xpu`                                      | -    | -       | Run PaddleLite inference with Baidu Kunlun XPU         |
+| `precision`                                    | str  | FP32    | Precision Mode, support FP32, FP16, INT8               |

 </center>


--- a/doc/BAIDU_KUNLUN_XPU_SERVING.md
+++ b/doc/BAIDU_KUNLUN_XPU_SERVING.md
 # Paddle Serving Using Baidu Kunlun Chips
 (English|[简体中文](./BAIDU_KUNLUN_XPU_SERVING_CN.md))

-Paddle serving supports deployment using Baidu Kunlun chips. At present, the pilot support is deployed on the ARM server with Baidu Kunlun chips
- (such as Phytium FT-2000+/64). We will improve
+Paddle Serving supports deployment on Baidu Kunlun chips. Currently, it supports ARM CPU servers with Baidu Kunlun chips
+ (such as Phytium FT-2000+/64) and Intel CPU servers with Baidu Kunlun chips. We will improve
 the deployment capability on various heterogeneous hardware servers in the future. 

 # Compilation and installation
-Refer to [compile](COMPILE.md) document to setup the compilation environment。
+Refer to the [compile](COMPILE.md) document to set up the compilation environment. The following steps are based on the Phytium FT-2000+/64 platform.
 ## Compilation
 * Compile the Serving Server
 ```
@@ -54,11 +54,11 @@ make -j10
 ```
 ## Install the wheel package
 After the compilation stages above, the whl package will be generated in ```python/dist/``` under the specific temporary directories.
-For example, after the Server Compiation step，the whl package will be produced under the server-build-arm/python/dist directory, and you can run ```pip install -u python/dist/*.whl``` to install the package。
+For example, after the server compilation step, the whl package will be produced under the server-build-arm/python/dist directory, and you can run ```pip install -U python/dist/*.whl``` to install the package.

 # Request parameters description
 In order to deploy serving
- service on the arm server with Baidu Kunlun xpu chips and use the acceleration capability of Paddle-Lite，please specify the following parameters during deployment。
+ service on an ARM server with Baidu Kunlun XPU chips and use the acceleration capability of Paddle-Lite, please specify the following parameters during deployment.
 | param    | param description                | about                                                              |
 | :------- | :------------------------------- | :----------------------------------------------------------------- |
 | use_lite | using Paddle-Lite Engine         | use the inference capability of Paddle-Lite                        |
@@ -72,23 +72,23 @@ tar -xzf uci_housing.tar.gz
 ```
 ## Start RPC service
 There are three main deployment methods:
-* deploy on the ARM server with Baidu xpu using the acceleration capability of Paddle-Lite and xpu；
-* deploy on the ARM server standalone with Paddle-Lite；
-* deploy on the ARM server standalone without Paddle-Lite。
+* deploy on a CPU server with Baidu XPU, using the acceleration capability of Paddle-Lite and the XPU;
+* deploy on a standalone CPU server with Paddle-Lite;
+* deploy on a standalone CPU server without Paddle-Lite.
    
-The first two deployment methods are recommended。
+The first two deployment methods are recommended.

-Start the rpc service, deploying on ARM server with Baidu Kunlun chips，and accelerate with Paddle-Lite and Baidu Kunlun xpu.
+Start the RPC service on a CPU server with Baidu Kunlun chips, accelerated with Paddle-Lite and the Baidu Kunlun XPU.
 ```
-python3 -m paddle_serving_server_gpu.serve --model uci_housing_model --thread 6 --port 9292 --use_lite --use_xpu --ir_optim
+python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292 --use_lite --use_xpu --ir_optim
 ```
-Start the rpc service, deploying on ARM server，and accelerate with Paddle-Lite.
+Start the RPC service on a CPU server, accelerated with Paddle-Lite.
 ```
-python3 -m paddle_serving_server_gpu.serve --model uci_housing_model --thread 6 --port 9292 --use_lite --ir_optim
+python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292 --use_lite --ir_optim
 ```
-Start the rpc service, deploying on ARM server.
+Start the RPC service on a CPU server.
 ```
-python3 -m paddle_serving_server_gpu.serve --model uci_housing_model --thread 6 --port 9292
+python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292
 ```
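
Every command in this diff makes the same substitution: the module `paddle_serving_server_gpu` becomes `paddle_serving_server`. A minimal sketch of applying that rename to your own launch scripts, assuming a plain textual replacement is enough for them. `migrate` is a hypothetical helper, not part of Paddle Serving:

```python
import re

# Match the old module name only as a whole identifier, so longer names
# that merely start with it are left untouched.
OLD_MODULE = re.compile(r"\bpaddle_serving_server_gpu\b")


def migrate(text):
    """Rewrite references to the old GPU module to the unified module."""
    return OLD_MODULE.sub("paddle_serving_server", text)


old = "python3 -m paddle_serving_server_gpu.serve --model uci_housing_model --thread 6 --port 9292"
print(migrate(old))
```

The same function works on Python source lines such as `from paddle_serving_server_gpu.web_service import WebService`.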
 ## Client call
 ```
@@ -102,8 +102,17 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
 fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
 print(fetch_map)
 ```
-Some examples are provided below, and other models can be modifed with reference to these examples。
+# Others
+## Model example and explanation
+
+Some examples are provided below; other models can be modified with reference to these examples.
 | sample name | sample links                                                |
 | :---------- | :---------------------------------------------------------- |
 | fit_a_line  | [fit_a_line_xpu](../python/examples/xpu/fit_a_line_xpu)     |
 | resnet      | [resnet_v2_50_xpu](../python/examples/xpu/resnet_v2_50_xpu) |
+
+Note: For the list of supported models, refer to the [doc](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). Adaptation differs between models, and some cases may be unsupported. If you have any problems, please submit a [GitHub issue](https://github.com/PaddlePaddle/Serving/issues), and we will follow up promptly.
+
+## Kunlun chip related reference materials
+* [PaddlePaddle on Baidu Kunlun xpu chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
+* [Deployment on Baidu Kunlun xpu chips using PaddleLite](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)
--- a/doc/BAIDU_KUNLUN_XPU_SERVING_CN.md
+++ b/doc/BAIDU_KUNLUN_XPU_SERVING_CN.md
 # Deploying Paddle Serving on Baidu Kunlun Chips
 (简体中文|[English](./BAIDU_KUNLUN_XPU_SERVING.md))

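-Paddle Serving supports inference deployment on Baidu Kunlun chips. Experimental support is currently available for deployment on ARM servers (such as Phytium FT-2000+/64) with Baidu Kunlun chips; deployment on other heterogeneous hardware servers will be improved later.
+Paddle Serving supports inference deployment on Baidu Kunlun chips. Deployment is currently supported on ARM servers (such as Phytium FT-2000+/64) with Baidu Kunlun chips, or on Intel CPU servers with Baidu Kunlun chips; deployment on other heterogeneous hardware servers will be improved later.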

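 # Compilation and installation
-For basic environment setup, refer to [this document](COMPILE_CN.md).
+For basic environment setup, refer to [this document](COMPILE_CN.md). The following uses a Phytium FT-2000+/64 machine as an example.
 ## Compilation
 * Compile the server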
 ```
@@ -20,7 +20,7 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \
    -DSERVER=ON ..
 make -j10
 ```
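-You can run `make install` to place the build output in the `./output` directory; add the `-DCMAKE_INSTALL_PREFIX=./output` option at the cmake stage to specify the location.
+You can run `make install` to place the build output in the `./output` directory; add the `-DCMAKE_INSTALL_PREFIX=./output` option at the cmake stage to specify the location. On Intel CPU platforms that support the AVX2 instruction set, specify the ```-DWITH_MKL=ON``` build option.
 * Compile the client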
 ```
 mkdir -p client-build-arm && cd client-build-arm
@@ -55,11 +55,11 @@ make -j10

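 # Request parameters description
 To deploy a service on ARM + XPU with Paddle-Lite acceleration, specify the following parameters in the request.
-|Parameter|Description|Notes|
-|:--|:--|:--|
-|use_lite|Use the Paddle-Lite engine|Uses Paddle-Lite CPU inference|
-|use_xpu|Use Baidu Kunlun for inference|Must be used together with use_lite|
-|ir_optim|Enable Paddle-Lite computation subgraph optimization|See [Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) for details|
+| Parameter | Description                                          | Notes                                                                      |
+| :-------- | :--------------------------------------------------- | :------------------------------------------------------------------------- |
+| use_lite  | Use the Paddle-Lite engine                           | Uses Paddle-Lite CPU inference                                             |
+| use_xpu   | Use Baidu Kunlun for inference                       | Must be used together with use_lite                                        |
+| ir_optim  | Enable Paddle-Lite computation subgraph optimization | See [Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) for details |
 # Deployment examples
 ## Download the model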
 ```
@@ -68,23 +68,23 @@ tar -xzf uci_housing.tar.gz
 ```
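 ## Start RPC service
 There are three main startup configurations:
-* deploy on ARM CPU + XPU, using Paddle-Lite XPU acceleration;
-* deploy on ARM CPU only, using Paddle-Lite acceleration;
-* deploy on ARM CPU without Paddle-Lite acceleration.
+* deploy on CPU + XPU, using Paddle-Lite XPU acceleration;
+* deploy on CPU only, using Paddle-Lite acceleration;
+* deploy on CPU without Paddle-Lite acceleration.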
    
 The first two deployment methods are recommended.

-Start the RPC service on ARM CPU + XPU, using Paddle-Lite XPU acceleration
+Start the RPC service on CPU + XPU, using Paddle-Lite XPU acceleration
 ```
-python3 -m paddle_serving_server_gpu.serve --model uci_housing_model --thread 6 --port 9292 --use_lite --use_xpu --ir_optim
+python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292 --use_lite --use_xpu --ir_optim
 ```
-Start the RPC service on ARM CPU, using Paddle-Lite acceleration
+Start the RPC service on CPU, using Paddle-Lite acceleration
 ```
-python3 -m paddle_serving_server_gpu.serve --model uci_housing_model --thread 6 --port 9292 --use_lite --ir_optim
+python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292 --use_lite --ir_optim
 ```
-Start the RPC service on ARM CPU, without Paddle-Lite acceleration
+Start the RPC service on CPU, without Paddle-Lite acceleration
 ```
-python3 -m paddle_serving_server_gpu.serve --model uci_housing_model --thread 6 --port 9292
+python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292
 ```
 ## Client call
 ```
@@ -98,8 +98,16 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
 fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
 print(fetch_map)
 ```
+# Other notes
+
+## Model examples and notes
 Some examples are provided below; other models can be modified with reference to them.
-|Sample name|Sample link|
-|:-----|:--|
-|fit_a_line|[fit_a_line_xpu](../python/examples/xpu/fit_a_line_xpu)|
-|resnet|[resnet_v2_50_xpu](../python/examples/xpu/resnet_v2_50_xpu)|
+| Sample name | Sample link                                                 |
+| :---------- | :---------------------------------------------------------- |
+| fit_a_line  | [fit_a_line_xpu](../python/examples/xpu/fit_a_line_xpu)     |
+| resnet      | [resnet_v2_50_xpu](../python/examples/xpu/resnet_v2_50_xpu) |
+
+Note: For the list of models supported for deployment on Kunlun chips, see this [link](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). Adaptation differs between models, and some may be unsupported. If you run into problems during deployment, please submit a [GitHub issue](https://github.com/PaddlePaddle/Serving/issues), and we will follow up promptly.
+## Kunlun chip related reference materials
+* [Running PaddlePaddle on Kunlun XPU chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
+* [Inference deployment on Baidu XPU with PaddleLite](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)
--- a/doc/BERT_10_MINS.md
+++ b/doc/BERT_10_MINS.md
@@ -52,7 +52,7 @@ python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292  #c
 ```
 Or, start the GPU inference service by running:
 ```
-python -m paddle_serving_server_gpu.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
+python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
 ```
 | Parameters | Meaning                                  |
 | ---------- | ---------------------------------------- |

--- a/doc/BERT_10_MINS_CN.md
+++ b/doc/BERT_10_MINS_CN.md
@@ -50,7 +50,7 @@ python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292  #
 ```
 Or, start the GPU inference service by running:
 ```
-python -m paddle_serving_server_gpu.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
+python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0

 ```


--- a/doc/COMPILE.md
+++ b/doc/COMPILE.md
@@ -117,13 +117,13 @@ Compared with CPU environment, GPU environment needs to refer to the following t
 | CUDA_CUDART_LIBRARY | The directory where libcudart.so.* is located, usually /usr/local/cuda/lib64/ | Required for all environments | No (/usr/local/cuda/lib64/) |
 | TENSORRT_ROOT | The upper level directory of the directory where libnvinfer.so.* is located, depends on the TensorRT installation directory | Cuda 9.0/10.0 does not need, other needs | No (/usr) |

-If not in a Docker environment, users can refer to the following commands. The specific paths depend on the current environment, and the code is for reference only. TENSORRT_LIBRARY_PATH is related to the TensorRT version and should be set according to the actual situation. For example, in the cuda10.1 environment, the TensorRT version is 6.0 (/usr/local/TensorRT-6.0.1.5/targets/x86_64-linux-gnu/); in the cuda10.2 environment, the TensorRT version is 7.1 (/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/).
+If not in a Docker environment, users can refer to the following commands. The specific paths depend on the current environment, and the code is for reference only. TENSORRT_LIBRARY_PATH is related to the TensorRT version and should be set according to the actual situation. For example, in the cuda10.1 environment, the TensorRT version is 6.0 (/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/); in the cuda10.2 and cuda11.0 environments, the TensorRT version is 7.1 (/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/).
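
The CUDA-to-TensorRT pairing described above can be written down as a small lookup. This is only a sketch: the paths are the example locations quoted in the paragraph, real installations may differ, and `tensorrt_library_path` and `describe` are hypothetical helpers rather than part of the build system.

```python
import os

# Example locations quoted in the paragraph above; real installs may differ.
TENSORRT_PATHS = {
    "10.1": "/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/",
    "10.2": "/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/",
    "11.0": "/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/",
}


def tensorrt_library_path(cuda_version):
    """Return the example TENSORRT_LIBRARY_PATH for a CUDA version, or None."""
    return TENSORRT_PATHS.get(cuda_version)


def describe(cuda_version):
    """Summarize the example path and whether it exists on this machine."""
    path = tensorrt_library_path(cuda_version)
    if path is None:
        return f"CUDA {cuda_version}: no example path listed"
    status = "found" if os.path.isdir(path) else "not found on this machine"
    return f"CUDA {cuda_version}: {path} ({status})"


for version in ("10.1", "10.2", "11.0"):
    print(describe(version))
```

Checking the directory before exporting `TENSORRT_LIBRARY_PATH` can catch a mismatched path before the cmake step runs.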

 ``` shell
 export CUDA_PATH='/usr/local/cuda'
 export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
 export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
-export TENSORRT_LIBRARY_PATH="/usr/local/TensorRT-6.0.1.5/targets/x86_64-linux-gnu/"
+export TENSORRT_LIBRARY_PATH="/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/"

 mkdir server-build-gpu && cd server-build-gpu
 cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \

--- a/doc/COMPILE_CN.md
+++ b/doc/COMPILE_CN.md
@@ -116,13 +116,13 @@ make -j10
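 | CUDA_CUDART_LIBRARY   | The directory where libcudart.so.* is located, usually /usr/local/cuda/lib64/ | Required for all environments | No (/usr/local/cuda/lib64/) |
 | TENSORRT_ROOT         | The upper level directory of the directory where libnvinfer.so.* is located, depends on the TensorRT installation directory | Cuda 9.0/10.0 does not need, other needs | No (/usr) |

-If not in a Docker environment, users can refer to the following commands. The specific paths depend on the current environment, and the code is for reference only. TENSORRT_LIBRARY_PATH is related to the TensorRT version and should be set according to the actual situation. For example, in the cuda10.1 environment, the TensorRT version is 6.0 (/usr/local/TensorRT-6.0.1.5/targets/x86_64-linux-gnu/); in the cuda10.2 environment, the TensorRT version is 7.1 (/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/).
+If not in a Docker environment, users can refer to the following commands. The specific paths depend on the current environment, and the code is for reference only. TENSORRT_LIBRARY_PATH is related to the TensorRT version and should be set according to the actual situation. For example, in the cuda10.1 environment, the TensorRT version is 6.0 (/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/); in the cuda10.2 and cuda11.0 environments, the TensorRT version is 7.1 (/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/).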

 ``` shell
 export CUDA_PATH='/usr/local/cuda'
 export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
 export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
-export TENSORRT_LIBRARY_PATH="/usr/local/TensorRT-6.0.1.5/targets/x86_64-linux-gnu/"
+export TENSORRT_LIBRARY_PATH="/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/"

 mkdir server-build-gpu && cd server-build-gpu
 cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \

--- a/doc/ENCRYPTION.md
+++ b/doc/ENCRYPTION.md
@@ -25,7 +25,7 @@ python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_
 ```
 GPU Service
 ```
-python -m paddle_serving_server_gpu.serve --model encrypt_server/ --port 9300 --use_encryption_model --gpu_ids 0
+python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_encryption_model --gpu_ids 0
 ```

 At this point, the server does not really start, but waits for the key.

--- a/doc/ENCRYPTION_CN.md
+++ b/doc/ENCRYPTION_CN.md
@@ -25,7 +25,7 @@ python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_
 ```
 GPU Service
 ```
-python -m paddle_serving_server_gpu.serve --model encrypt_server/ --port 9300 --use_encryption_model --gpu_ids 0
+python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_encryption_model --gpu_ids 0
 ```

 At this point, the server does not really start, but waits for the key.

--- a/doc/MULTI_SERVICE_ON_ONE_GPU_CN.md
+++ b/doc/MULTI_SERVICE_ON_ONE_GPU_CN.md
@@ -5,8 +5,8 @@
 For example:

 ```shell
-python -m paddle_serving_server_gpu.serve --model bert_seq128_model --port 9292 --gpu_ids 0
-python -m paddle_serving_server_gpu.serve --model ResNet50_vd_model --port 9393 --gpu_ids 0
+python -m paddle_serving_server.serve --model bert_seq128_model --port 9292 --gpu_ids 0
+python -m paddle_serving_server.serve --model ResNet50_vd_model --port 9393 --gpu_ids 0
 ```

 On GPU 0, the bert example and the imagenet example are deployed at the same time.
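
When several services share one GPU as above, each needs its own port. A minimal sketch of checking that before launch, using the two services from the example. `conflicting_ports` and `serve_command` are hypothetical helpers, not part of Paddle Serving:

```python
from collections import Counter

# Services from the example above: two models sharing GPU 0.
SERVICES = [
    {"model": "bert_seq128_model", "port": 9292, "gpu_ids": "0"},
    {"model": "ResNet50_vd_model", "port": 9393, "gpu_ids": "0"},
]


def conflicting_ports(services):
    """Return ports requested by more than one service, sorted."""
    counts = Counter(service["port"] for service in services)
    return sorted(port for port, n in counts.items() if n > 1)


def serve_command(service):
    """Format the launch command for one service."""
    return (
        "python -m paddle_serving_server.serve "
        f"--model {service['model']} --port {service['port']} "
        f"--gpu_ids {service['gpu_ids']}"
    )


if conflicting_ports(SERVICES):
    raise SystemExit(f"port conflict: {conflicting_ports(SERVICES)}")
for service in SERVICES:
    print(serve_command(service))
```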

--- a/doc/SAVE.md
+++ b/doc/SAVE.md
@@ -38,7 +38,7 @@ We can see that the `serving_server` and `serving_client` folders hold the serve
 Start the server (GPU)

 ```
-python -m paddle_serving_server_gpu.serve --model serving_server --port 9393 --gpu_id 0
+python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_id 0
 ```

 Client (`test_client.py`)

--- a/doc/SAVE_CN.md
+++ b/doc/SAVE_CN.md
@@ -37,7 +37,7 @@ python -m paddle_serving_client.convert --dirname . --model_filename dygraph_mod

 Start the server (GPU)
 ```
-python -m paddle_serving_server_gpu.serve --model serving_server --port 9393 --gpu_id 0
+python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_id 0
 ```

 Client code, saved as `test_client.py`

--- a/doc/TENSOR_RT.md
+++ b/doc/TENSOR_RT.md
@@ -50,7 +50,7 @@ We just need
 ```
 wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar
 tar xf faster_rcnn_r50_fpn_1x_coco.tar
-python -m paddle_serving_server_gpu.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
+python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
 ```
 The TensorRT version of the faster_rcnn model server is started


--- a/doc/TENSOR_RT_CN.md
+++ b/doc/TENSOR_RT_CN.md
@@ -50,7 +50,7 @@ pip install paddle-server-server==${VERSION}.post11
 ```
 wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar
 tar xf faster_rcnn_r50_fpn_1x_coco.tar
-python -m paddle_serving_server_gpu.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
+python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
 ```
 The TensorRT version of the faster_rcnn model server is started


--- a/doc/WINDOWS_TUTORIAL.md
+++ b/doc/WINDOWS_TUTORIAL.md
@@ -54,7 +54,7 @@ Currently Windows supports the Local Predictor of the Web Service framework. The
 ```
 # filename:your_webservice.py
 from paddle_serving_server.web_service import WebService
-# If it is the GPU version, please use from paddle_serving_server_gpu.web_service import WebService
+# If it is the GPU version, please use from paddle_serving_server.web_service import WebService
 class YourWebService(WebService):
    def preprocess(self, feed=[], fetch=[]):
        #Implement pre-processing here

--- a/doc/WINDOWS_TUTORIAL_CN.md
+++ b/doc/WINDOWS_TUTORIAL_CN.md
@@ -54,7 +54,7 @@ python ocr_web_client.py
 ```
 # filename:your_webservice.py
 from paddle_serving_server.web_service import WebService
-# If it is the GPU version, please use from paddle_serving_server_gpu.web_service import WebService
+# If it is the GPU version, please use from paddle_serving_server.web_service import WebService
 class YourWebService(WebService):
    def preprocess(self, feed=[], fetch=[]):
         #Implement pre-processing here