Unverified commit 0f5edea6 authored by Jiawei Wang, committed by GitHub

Merge branch 'develop' into develop

......@@ -2,7 +2,7 @@
<p align="center">
<br>
<img src='doc/serving_logo.png' width = "600" height = "130">
<img src='doc/images/serving_logo.png' width = "600" height = "130">
<br>
<p>
......@@ -47,7 +47,7 @@ We consider deploying deep learning inference service online to be a user-facing
[Serving Examples](./python/examples/).
<p align="center">
<img src="doc/demo.gif" width="700">
<img src="doc/images/demo.gif" width="700">
</p>
......@@ -267,6 +267,15 @@ output
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
<h3 align="center">Stop Serving/Pipeline service</h3>
**Method 1**: press Ctrl+C to quit.
**Method 2**: run the following command in the path where the Serving/Pipeline service was started, or in the path set by the environment variable SERVING_HOME (the file ProcessInfo.json exists in this path):
```
python3 -m paddle_serving_server.serve stop
```
<h2 align="center">Document</h2>
......
......@@ -2,7 +2,7 @@
<p align="center">
<br>
<img src='doc/serving_logo.png' width = "600" height = "130">
<img src='doc/images/serving_logo.png' width = "600" height = "130">
<br>
<p>
......@@ -48,7 +48,7 @@ Paddle Serving 旨在帮助深度学习开发者轻易部署在线预测服务
- Provides rich pre- and post-processing utilities so that users can reuse the related code in training, deployment and other stages, bridging the gap between AI developers and application developers. For details, see [Model Examples](./python/examples/)
<p align="center">
<img src="doc/demo.gif" width="700">
<img src="doc/images/demo.gif" width="700">
</p>
<h2 align="center">Tutorial</h2>
......@@ -269,6 +269,16 @@ python3 pipeline_rpc_client.py
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
<h3 align="center">Stop the Serving/Pipeline service</h3>
**Method 1**: press Ctrl+C to stop the service.
**Method 2**: run the following command in the path where the Serving/Pipeline service was started, or in the path set by the environment variable SERVING_HOME (the file ProcessInfo.json exists in this path):
```
python3 -m paddle_serving_server.serve stop
```
<h2 align="center">Documentation</h2>
### Tutorials for Beginners
......
## Build Bert-As-Service in 10 minutes
([简体中文](./BERT_10_MINS_CN.md)|English)
The goal of Bert-As-Service is that, given a sentence, the service represents it as a semantic vector and returns it to the user. The [Bert model](https://arxiv.org/abs/1810.04805) is a popular model in the current NLP field and has achieved good results on a variety of public NLP tasks. Using the semantic vector computed by the Bert model as input to other NLP models can also greatly improve their performance. Bert-As-Service allows users to easily obtain the semantic vector representation of text and apply it to their own tasks. To achieve this goal, we show in five steps how such a service can be built with Paddle Serving in ten minutes. All the code and files in the example can be found in the [example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) directory of Paddle Serving.
If your Python version is 3.X, replace 'pip' with 'pip3' and 'python' with 'python3' in the following commands.
### Step1: Getting Model
#### method 1:
This example use model [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [Paddlehub](https://github.com/PaddlePaddle/PaddleHub).
Install paddlehub first
```
pip install paddlehub
```
run
```
python prepare_model.py 128
```
**PaddleHub only supports Python 3.5+**
The 128 in the command above is the max_seq_len of the BERT model, i.e. the sample length after preprocessing.
The config file and model files for the server side are saved in the folder bert_seq128_model.
The config file generated for the client side is saved in the folder bert_seq128_client.
#### method 2:
You can also download the above model from BOS(max_seq_len=128). After decompression, the config file and model file for server side are stored in the bert_chinese_L-12_H-768_A-12_model folder, and the config file generated for client side is stored in the bert_chinese_L-12_H-768_A-12_client folder:
```shell
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
```
### Step2: Getting Dict and Sample Dataset
```
sh get_data.sh
```
This script downloads the Chinese dictionary file vocab.txt and the Chinese sample data data-c.txt.
### Step3: Launch Service
To start the CPU inference service, run
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #cpu inference service
```
Or, to start the GPU inference service, run
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
```
| Parameters | Meaning |
| ---------- | ---------------------------------------- |
| model | server configuration and model file path |
| thread | server-side threads |
| port | server port number |
| gpu_ids | GPU index number |
### Step4: data preprocessing logic on Client Side
Paddle Serving has built-in data preprocessing logic for many common models. To compute the Chinese Bert semantic representation, we use the ChineseBertReader class under paddle_serving_app for data preprocessing, so developers can easily obtain the model input fields that correspond to a raw Chinese sentence.
Install paddle_serving_app
```shell
pip install paddle_serving_app
```
### Step5: Client Visit Serving
#### method 1: RPC Inference
Run
```
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```
The client reads data from data-c.txt and sends prediction requests; the prediction result is a word vector (since the vector contains a lot of data, it is not printed).
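For reference, the core of such a client script looks roughly like the sketch below. The fetch name `pooled_output` and the server address come from this example; the import path of ChineseBertReader and whether the feed values need reshaping vary between Paddle Serving versions, so treat this as a sketch rather than the exact example code.
```python
import sys
import numpy as np
from paddle_serving_client import Client
from paddle_serving_app.reader import ChineseBertReader  # older versions: from paddle_serving_app import ChineseBertReader

reader = ChineseBertReader({"max_seq_len": 128})
client = Client()
client.load_client_config("bert_seq128_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])

for line in sys.stdin:
    # turn a raw Chinese sentence into the model input fields
    feed_dict = reader.process(line.strip())
    # some versions expect each field shaped as (max_seq_len, 1)
    feed_dict = {k: np.array(v).reshape((128, 1)) for k, v in feed_dict.items()}
    fetch_map = client.predict(feed=feed_dict, fetch=["pooled_output"])
```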
#### method 2: HTTP Inference
This method is divided into two steps:
1. Start an HTTP prediction server.
To start the CPU HTTP inference service, run
```
python bert_web_service.py bert_seq128_model/ 9292 #launch cpu inference service
```
Or, to start the GPU HTTP inference service, run
```
export CUDA_VISIBLE_DEVICES=0,1
```
Set the environment variable to specify which GPUs are used; the command above uses GPU 0 and GPU 1.
```
python bert_web_service_gpu.py bert_seq128_model/ 9292 #launch gpu inference service
```
2. Prediction via HTTP request
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```
### Benchmark
We tested the performance of the Paddle Serving based Bert-As-Service on V100 GPUs and compared it with a Tensorflow based Bert-As-Service. From the perspective of user configuration, we used the same batch size and concurrency for the stress test. The overall throughput obtained with 4 V100 cards is as follows.
![4v100_bert_as_service_benchmark](4v100_bert_as_service_benchmark.png)
<!--
yum install -y libXext libSM libXrender
pip install paddlehub paddle_serving_server paddle_serving_client
sh pip_app.sh
python bert_10.py
sh server.sh &
wget https://paddle-serving.bj.bcebos.com/bert_example/data-c.txt --no-check-certificate
head -n 500 data-c.txt > data.txt
cat data.txt | python bert_client.py
if [[ $? -eq 0 ]]; then
echo "test success"
else
echo "test fail"
fi
ps -ef | grep "paddle_serving_server" | grep -v grep | awk '{print $2}' | xargs kill
-->
## Build Bert-As-Service in 10 Minutes
(简体中文|[English](./BERT_10_MINS.md))
The goal of Bert-As-Service is that, given a sentence, the service represents it as a semantic vector and returns it to the user. The [Bert model](https://arxiv.org/abs/1810.04805) is currently a popular model in the NLP field and has achieved good results on a variety of public NLP tasks. Using the semantic vector computed by the Bert model as input to other NLP models can also greatly improve their performance. Bert-As-Service allows users to easily obtain the semantic vector representation of text and apply it to their own tasks. To achieve this goal, we show in the following steps how such a service can be built with Paddle Serving in ten minutes. All the code and files in the example can be found in the [example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) directory of Paddle Serving.
If your Python version is 3.X, replace pip with pip3 and python with python3 in the following commands.
### Step1: Getting the Model
#### Method 1:
This example uses the [BERT Chinese model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [Paddlehub](https://github.com/PaddlePaddle/PaddleHub).
Install paddlehub first
```
pip install paddlehub
```
Run
```
python prepare_model.py 128
```
The parameter 128 is the max_seq_len of the BERT model, i.e. the sample length after preprocessing.
The generated server-side config file and model files are stored in the bert_seq128_model folder.
The generated client-side config file is stored in the bert_seq128_client folder.
#### Method 2:
You can also download the above model (max_seq_len=128) directly from BOS. After decompression, the server-side config file and model files are stored in the bert_chinese_L-12_H-768_A-12_model folder, and the client-side config file is stored in the bert_chinese_L-12_H-768_A-12_client folder:
```shell
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
```
### Step2: Getting the Dictionary and Sample Data
```
sh get_data.sh
```
The script downloads the Chinese dictionary vocab.txt and the Chinese sample data data-c.txt.
### Step3: Launch the Service
To start the CPU inference service, run
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #start the CPU inference service
```
Or, to start the GPU inference service, run
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch the GPU inference service on GPU 0
```
| Parameters | Meaning |
| ------- | -------------------------- |
| model   | server-side configuration and model file path |
| thread  | number of server-side threads |
| port    | server port number |
| gpu_ids | GPU index number |
### Step4: Client-side Data Preprocessing
Paddle Serving has built-in data preprocessing logic for many typical models. To compute the Chinese Bert semantic representation, we use the ChineseBertReader class under paddle_serving_app for data preprocessing, so developers can easily obtain the model input fields that correspond to a raw Chinese sentence.
Install paddle_serving_app
```shell
pip install paddle_serving_app
```
### Step5: Client Access
#### Method 1: RPC Inference
Run
```
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```
The client reads the data in data-c.txt and sends prediction requests; the prediction result is the vector representation of the text (since the output is large, the script does not print it). The server address can be changed in the script.
#### Method 2: HTTP Inference
This method consists of two steps:
1. Start an HTTP prediction server.
To start the CPU HTTP inference service, run
```
python bert_web_service.py bert_seq128_model/ 9292 #launch the CPU inference service
```
Or, to start the GPU HTTP inference service, run
```
export CUDA_VISIBLE_DEVICES=0,1
```
Use the environment variable to specify which GPUs the GPU inference service uses; the example above selects the two GPUs with index 0 and 1.
```
python bert_web_service_gpu.py bert_seq128_model/ 9292 #launch the GPU inference service
```
2. Run prediction via HTTP request.
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```
### Benchmark
We tested the performance of the Paddle Serving based Bert-As-Service on V100 GPUs and compared it with a Tensorflow based Bert-As-Service. From the perspective of user configuration, we used the same batch size and concurrency for the stress test. The overall throughput obtained with 4 V100 cards is as follows.
![4v100_bert_as_service_benchmark](4v100_bert_as_service_benchmark.png)
......@@ -88,7 +88,7 @@ this step is not necessary, but it can help you to verify if the model is ready.
```
If you succeed, you will see this:
<p align="center">
<img src="cube-cli.png" width="700">
<img src="images/cube-cli.png" width="700">
</p>
If you see that each key has a corresponding value in the output, it means that the delivery was successful. This file can also be used by Serving to perform cube queries in the general kv infer op.
......
......@@ -91,7 +91,7 @@ cd cube
If the command succeeds, you will see the following result:
<p align="center">
<img src="cube-cli.png" width="700">
<img src="images/cube-cli.png" width="700">
</p>
......
......@@ -39,7 +39,7 @@ Paddle Serving provides RPC and HTTP protocol for users. For HTTP service, we re
<p align="center">
<br>
<img src='user_groups.png' width = "700" height = "470">
<img src='images/user_groups.png' width = "700" height = "470">
<br>
<p>
......@@ -96,7 +96,7 @@ Distributed Sparse Parameter Indexing is commonly seen in advertising and recomm
<p align="center">
<br>
<img src='cube_eng.png' width = "450" height = "230">
<img src='images/cube_eng.png' width = "450" height = "230">
<br>
<p>
......@@ -116,7 +116,7 @@ The core execution engine of Paddle Serving is a Directed acyclic graph(DAG). In
<p align="center">
<br>
<img src='design_doc.png'>
<img src='images/design_doc.png'>
<br>
<p>
......@@ -132,7 +132,7 @@ After sufficient offline evaluation of the model, online A/B test is usually nee
<p align="center">
<br>
<img src='abtest.png' width = "345" height = "230">
<img src='images/abtest.png' width = "345" height = "230">
<br>
<p>
......@@ -188,7 +188,7 @@ the end-to-end deep learning model can not solve all the problems at present. Us
### 5.1 Network Communication Mechanism
The network framework of Pipeline Serving uses gRPC and the gRPC gateway. The gRPC service receives RPC requests, and the gRPC gateway receives RESTful API requests and forwards them to the gRPC service through a reverse proxy server. Therefore, the network layer of Pipeline Serving accepts both RPC and RESTful API requests.
<center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
</center>
### 5.2 Core Design And Use Cases
......@@ -196,7 +196,7 @@ The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC s
The core design of Pipeline Serving is a graph execution engine whose basic processing units are OPs and Channels; by combining them, a set of directed acyclic graphs can be built. For the design and usage documentation, see [Pipeline Serving](PIPELINE_SERVING.md).
<center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</center>
----
......
......@@ -42,7 +42,7 @@ Paddle Serving面向的用户提供RPC和HTTP两种访问协议。对于HTTP协
<p align="center">
<br>
<img src='user_groups.png' width = "700" height = "470">
<img src='images/user_groups.png' width = "700" height = "470">
<br>
<p>
......@@ -99,7 +99,7 @@ fetch_var {
Why use the distributed sparse parameter indexing service provided by Paddle Serving? 1) In some recommendation scenarios, the input feature scale of a model can reach hundreds of billions; a single machine cannot hold a terabyte-scale model in memory, so distributed storage is required. 2) The distributed sparse parameter indexing service provided by Paddle Serving can issue concurrent requests to multiple nodes, so the prediction can be completed with low latency.
<p align="center">
<br>
<img src='cube_eng.png' width = "450" height = "230">
<img src='images/cube.png' width = "450" height = "230">
<br>
<p>
Distributed sparse parameter indexing is commonly used in advertising and recommendation, and works together with distributed training to form a complete offline-online integrated deployment. The figure below explains the workflow: after the product's online service receives a user request, it sends the request to the prediction service, and the system also records the request for training-log processing and joining. The offline distributed training system incrementally trains the model on the streaming training logs; the incrementally produced model is delivered to the distributed sparse parameter indexing service, while the corresponding dense model parameters are delivered to the online prediction service. The online service consists of two parts: after extracting features from the user request, it sends the features that need sparse parameter lookup to the distributed sparse parameter indexing service, and then runs the rest of the deep learning model computation on the returned sparse parameters to complete the prediction.
......@@ -118,7 +118,7 @@ C++ Serving采用[better-rpc](https://github.com/apache/incubator-brpc)进行底
The core execution engine of C++ Serving is a directed acyclic graph, in which each node represents one stage of the prediction service; for example, computing the model prediction score is one such stage. A DAG lets nodes that can run concurrently make full use of the computing resources within a deployment instance and shortens latency. For example, when the same input needs to be fed into two different models and the two scores are combined by a weighted sum, the scoring of the two models can run concurrently thanks to the DAG topology.
<p align="center">
<br>
<img src='design_doc.png'>
<img src='images/design_doc.png'>
<br>
<p>
......@@ -136,7 +136,7 @@ Paddle Serving采用对称加密算法对模型进行加密,在服务加载模
<p align="center">
<br>
<img src='abtest.png' width = "345" height = "230">
<img src='images/abtest.png' width = "345" height = "230">
<br>
<p>
......@@ -189,13 +189,13 @@ imdb_service.run_server()
### 5.1 Network Framework
The network framework of Pipeline Serving uses gRPC and the gRPC gateway. The gRPC service receives RPC requests; the gRPC gateway receives RESTful API requests and forwards them to the gRPC service through a reverse proxy server. That is, the network layer of Pipeline Serving accepts both RPC and RESTful API requests.
<center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
</center>
### 5.2 Core Design and Use Cases
The core design of Pipeline Serving is a graph execution engine whose basic processing units are OPs and Channels; by combining them, a set of directed acyclic graphs can be built. For the design and usage documentation, see [Pipeline Serving](PIPELINE_SERVING_CN.md).
<center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</center>
----
......
# Introduction to the gRPC Interface
- [1. Comparison with the bRPC interface](#1与brpc接口对比)
  - [1.1 Server side](#11-服务端对比)
  - [1.2 Client side](#12-客服端对比)
  - [1.3 Others](#13-其他)
- [2. Example: linear regression prediction service](#2示例线性回归预测服务)
  - [Get the data](#获取数据)
  - [Start the gRPC server](#开启-grpc-服务端)
  - [Client prediction](#客户端预测)
    - [Synchronous prediction](#同步预测)
    - [Asynchronous prediction](#异步预测)
    - [Batch prediction](#batch-预测)
    - [General pb prediction](#通用-pb-预测)
    - [Prediction timeout](#预测超时)
    - [List input](#list-输入)
- [3. More examples](#3更多示例)
With the gRPC interface, the client can be written in different languages and run on Windows/Linux/MacOS. The gRPC interface is implemented as follows:
![](https://github.com/PaddlePaddle/Serving/blob/develop/doc/grpc_impl.png)
## 1. Comparison with the bRPC Interface
#### 1.1 Server side
* Since the gRPC server actually contains a bRPC client, the bRPC client is initialized inside the gRPC server. Therefore the `load_model_config` function on the gRPC server side adds a `client_config_path` parameter, which specifies the path of the data-format configuration file used when initializing the bRPC client. When `client_config_path` is not specified it defaults to None, in which case `load_model_config` sets it to `<server_config_path>/serving_server_conf.prototxt`, i.e. the bRPC client and the bRPC server share the same data-format configuration file.
```
def load_model_config(self, server_config_paths, client_config_path=None)
```
In some examples the configuration files of the bRPC server and the bRPC client may differ (e.g. in cube local, the client-side data is first handed to cube, processed by cube, and then passed to the inference library); in this case the gRPC server needs to set the gRPC client configuration `client_config_path` manually.
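For orientation, a gRPC server script is structured roughly as below. This is a sketch modeled on the fit_a_line grpc_impl_example; the op names and method calls follow other Paddle Serving examples and may differ slightly between versions.
```python
import sys
from paddle_serving_server import OpMaker, OpSeqMaker
from paddle_serving_server import MultiLangServer as Server

# build the usual reader -> infer -> response op sequence
op_maker = OpMaker()
read_op = op_maker.create('general_reader')
infer_op = op_maker.create('general_infer')
response_op = op_maker.create('general_response')

op_seq_maker = OpSeqMaker()
op_seq_maker.add_op(read_op)
op_seq_maker.add_op(infer_op)
op_seq_maker.add_op(response_op)

server = Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
# client_config_path is optional; by default it falls back to
# <server_config_path>/serving_server_conf.prototxt as described above
server.load_model_config(sys.argv[1], client_config_path=None)
server.prepare_server(workdir="work_dir1", port=9393, device="cpu")
server.run_server()
```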
#### 1.2 Client side
* The gRPC client removes the `load_client_config` step:
  The `connect` step obtains the corresponding prototxt via RPC (it can be fetched from any endpoint).
* The gRPC client sets the timeout via RPC (the call form is consistent with the bRPC client).
  Because the bRPC client cannot change its timeout after `connect`, when the gRPC server receives a request to change the timeout it re-creates the bRPC client instance with the new timeout, and the gRPC client sets the gRPC deadline at the same time.
  **Note: the timeout-setting interface and the inference interface must not be called at the same time (they are not thread safe); for performance reasons no lock is added for now.**
* The gRPC client `predict` function adds the `asyn` and `is_python` parameters:
```
def predict(self, feed, fetch, batch=True, need_variant_tag=False, asyn=False, is_python=True,log_id=0)
```
1. `asyn` is the asynchronous-call option. When `asyn=True` the call is asynchronous and returns a `MultiLangPredictFuture` object; the prediction is obtained by blocking on `MultiLangPredictFuture.result()`. When `asyn=False` the call is synchronous (see the sketch after this list).
2. `is_python` is the proto-format option. When `is_python=True`, data is transferred in numpy bytes format, which currently only applies to Python; when `is_python=False`, data is transferred in the ordinary format, which is more general. Transferring numpy bytes takes much less time than the ordinary format (see [#654](https://github.com/PaddlePaddle/Serving/pull/654)).
3. `batch` controls whether the feed data needs an extra dimension. When `batch=True` the feed data is kept as is; when `batch=False` an extra dimension is added, e.g. a feed of shape [2,2] is reshaped to [1,2,2] when `batch=False`.
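A rough sketch of synchronous and asynchronous calls through the gRPC client is shown below. The feed name `x` and fetch name `price` come from the fit_a_line example referenced in this document; the timeout call and exact shapes are assumptions that may need adjusting for your model.
```python
import numpy as np
from paddle_serving_client import MultiLangClient as Client

client = Client()
client.connect(["127.0.0.1:9393"])
client.set_rpc_timeout_ms(100)   # change the timeout via RPC, as described above

x = np.array([[0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583,
               -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]], dtype=np.float32)

# synchronous call
fetch_map = client.predict(feed={"x": x}, fetch=["price"], batch=True)

# asynchronous call: returns a MultiLangPredictFuture
future = client.predict(feed={"x": x}, fetch=["price"], batch=True, asyn=True)
fetch_map = future.result()
print(fetch_map)   # fetch_map["status_code"] distinguishes success from failure
```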
#### 1.3 Others
* Exception handling: when the bRPC client inside the gRPC server fails to predict (returns `None`), the gRPC client also returns None. Other gRPC exceptions are caught inside the client, and a "status_code" field is added to the returned fetch_map to indicate whether the prediction succeeded (see the timeout example).
* Since gRPC only supports the pick_first and round_robin load-balancing policies, the ABTEST feature is not fully supported yet.
* System compatibility:
* [x] CentOS
* [x] macOS
* [x] Windows
* Supported client languages:
- Python
- Java
- Go
## 2. Example: Linear Regression Prediction Service
The following is an example of linear regression prediction implemented with gRPC; the full code is available at this [link](../python/examples/grpc_impl_example/fit_a_line).
#### Get the data
```shell
sh get_data.sh
```
#### Start the gRPC server
``` shell
python test_server.py uci_housing_model/
```
You can also start the default gRPC service with the following one-line command:
```shell
python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9393 --use_multilang
```
Note: the --use_multilang flag enables the multi-language client.
### Client prediction
#### Synchronous prediction
``` shell
python test_sync_client.py
```
#### Asynchronous prediction
``` shell
python test_asyn_client.py
```
#### Batch prediction
``` shell
python test_batch_client.py
```
#### Prediction timeout
``` shell
python test_timeout_client.py
```
## 3. More Examples
See the example files under [`python/examples/grpc_impl_example`](../python/examples/grpc_impl_example).
File mode changed from 100755 to 100644
# Multiple Model Prediction Services on a Single GPU Card
When client requests are infrequent, server-side computing resources, especially GPU resources, are wasted. In this case, you can start multiple prediction services on the server to improve resource utilization. Paddle Serving supports deploying multiple prediction services on a single GPU card: when starting each service, simply bind it to the card with the --gpu_ids flag, so that multiple services are bound to the same card.
For example:
```shell
python -m paddle_serving_server.serve --model bert_seq128_model --port 9292 --gpu_ids 0
python -m paddle_serving_server.serve --model ResNet50_vd_model --port 9393 --gpu_ids 0
```
On card 0, both the bert example and the imagenet example are deployed.
**Note:** inference inside a single GPU card is still computed serially; this approach only reduces the idle time of the server-side GPU.
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
# Performance Optimization
([简体中文](./PERFORMANCE_OPTIM_CN.md)|English)
Due to different model structures, different prediction services consume different computing resources when performing predictions. For online prediction services, models that require less computing have a higher proportion of communication time and are called communication-intensive services; models that require more computing spend more time on inference computation and are called computation-intensive services.
For a prediction service, the easiest way to determine the type of service is to look at the time ratio. Paddle Serving provides [Timeline tool](../python/examples/util/README_CN.md), which can intuitively display the time spent in each stage of the prediction service.
For communication-intensive prediction services, requests can be aggregated, and within a limit that can tolerate delay, multiple prediction requests can be combined into a batch for prediction.
For computation-intensive prediction services, you can use GPU prediction services instead of CPU prediction services, or increase the number of graphics cards for GPU prediction services.
Under the same conditions, the communication time of the HTTP prediction service provided by Paddle Serving is longer than that of the RPC prediction service, so for communication-intensive services, please give priority to using RPC communication.
Parameters for performance optimization:
The memory/graphics-memory optimization option is enabled by default in Paddle Serving; it reduces memory/GPU-memory usage and usually does not affect performance. If you need to turn it off, pass --mem_optim_off on the command line.
ir_optim optimizes the computation graph and increases inference speed. It is off by default and is enabled with --ir_optim on the command line.
| Parameters | Type | Default | Description |
| ---------- | ---- | ------- | ------------------------------------------------------------ |
| mem_optim_off | - | - | Disable memory / graphic memory optimization |
| ir_optim | - | - | Enable analysis and optimization of the computation graph, including OP fusion, etc. |
For the mode of using Python code to start the prediction service, the API of the above two parameters is as follows:
RPC Service
```
from paddle_serving_server import Server
server = Server()
...
server.set_memory_optimize(mem_optim)
server.set_ir_optimize(ir_optim)
...
```
HTTP Service
```
from paddle_serving_server import WebService
class NewService(WebService):
...
new_service = NewService(name="new")
...
new_service.prepare_server(mem_optim=True, ir_optim=False)
...
```
# Performance Optimization
(简体中文|[English](./PERFORMANCE_OPTIM.md))
Due to different model structures, different prediction services consume different computing resources when performing predictions. For online prediction services, models that require less computing have a higher proportion of communication time and are called communication-intensive services; models that require more computing spend more time on inference computation and are called computation-intensive services. For these two service types, different optimization strategies can be adopted according to actual needs.
For a prediction service, the easiest way to determine its type is to look at the time ratio. Paddle Serving provides a [Timeline tool](../python/examples/util/README_CN.md) that visualizes the time spent in each stage of the prediction service.
For communication-intensive prediction services, requests can be aggregated: within a tolerable latency limit, multiple prediction requests are merged into one batch for prediction.
For computation-intensive prediction services, you can use a GPU prediction service instead of a CPU one, or increase the number of GPU cards of the GPU prediction service.
Under the same conditions, the communication time of the HTTP prediction service provided by Paddle Serving is longer than that of the RPC prediction service, so for communication-intensive services please prefer RPC communication.
Parameters related to performance optimization:
The memory/graphics-memory optimization option is enabled by default in Paddle Serving; it reduces memory/GPU-memory usage and usually does not affect performance. If you need to turn it off, pass --mem_optim_off on the command line.
ir_optim optimizes the computation graph and increases inference speed. It is off by default and is enabled with --ir_optim on the command line.
| Parameters | Type | Default | Description |
| --------- | ---- | ------ | -------------------------------- |
| mem_optim_off | - | - | Disable memory / graphics memory optimization |
| ir_optim | - | - | Enable analysis and optimization of the computation graph, including OP fusion, etc. |
For services started from Python code, the APIs of the above two parameters are as follows:
RPC service
```
from paddle_serving_server import Server
server = Server()
...
server.set_memory_optimize(mem_optim)
server.set_ir_optimize(ir_optim)
...
```
HTTP service
```
from paddle_serving_server import WebService
class NewService(WebService):
...
new_service = NewService(name="new")
...
new_service.prepare_server(mem_optim=True, ir_optim=False)
...
```
......@@ -32,17 +32,17 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36
Our earlier serving container exposes port 9393, the KONG gateway port is 8443, and the KONG web console port is 8001. Next, open `https://$IP_ADDR:8001` in a browser, where IP_ADDR is the host IP.
<img src="kong-dashboard.png">
<img src="images/kong-dashboard.png">
After registering and logging in, you can see the DASHBOARD. Look at SERVICES first: you will see `serving_service`, which means the Serving service on port 9393 has been registered in KONG.
<img src="kong-services.png">
<img src="kong-routes.png">
<img src="images/kong-services.png">
<img src="images/kong-routes.png">
Then, under ROUTES, you can see that serving is linked to `/serving-uci`.
Finally, click CONSUMERS - default_user - Credentials - API KEYS, and you will see many keys under `Api Keys`.
<img src="kong-api_keys.png">
<img src="images/kong-api_keys.png">
Next, you can access the service with curl.
......@@ -194,6 +194,3 @@ credentials:
curl -H "Content-Type:application/json" -H "apikey:ZGVmYXVsdC1hcGlrZXkK" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' https://$IP:$PORT/foo/uci/prediction -k
```
You can see that the apikey has been added to the header of the curl request.
# Deploy HTTP service with uWSGI
([简体中文](./UWSGI_DEPLOY_CN.md)|English)
In the fit_a_line example, after starting the HTTP prediction service, you will see the following information:
```shell
web service address:
http://10.127.3.150:9393/uci/prediction
* Serving Flask app "serve" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://0.0.0.0:9393/ (Press CTRL+C to quit)
```
Here you are warned that the started HTTP service is in development mode and cannot be used for production deployment.
The prediction service started by Flask is not stable enough to withstand a large number of concurrent requests, so in actual deployment a WSGI (Web Server Gateway Interface) server is used.
Next, we will show how to use the [uWSGI](https://github.com/unbit/uwsgi) module to deploy HTTP prediction services for production environments.
```python
#uwsgi_service.py
from paddle_serving_server.web_service import WebService
#Define prediction service
uci_service = WebService(name = "uci")
uci_service.load_model_config("./uci_housing_model")
uci_service.prepare_server(workdir="./workdir", port=int(9500), device="cpu")
uci_service.run_rpc_service()
#Get flask application
app_instance = uci_service.get_app_instance()
```
Start service with uWSGI
```bash
uwsgi --http :9393 --module uwsgi_service:app_instance
```
Use the --processes parameter to specify the number of service processes.
For more information about uWSGI, please refer to [uWSGI documentation](https://uwsgi-docs.readthedocs.io/en/latest/)
# Deploy the HTTP Prediction Service with uWSGI
(简体中文|[English](./UWSGI_DEPLOY.md))
In the provided fit_a_line example, after starting the HTTP prediction service you will see the following information:
```shell
web service address:
http://10.127.3.150:9393/uci/prediction
* Serving Flask app "serve" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://0.0.0.0:9393/ (Press CTRL+C to quit)
```
This warns that the started HTTP service is in development mode and cannot be used for production deployment. The service started by Flask is not stable enough to withstand a large number of concurrent requests, so in actual deployment it needs to be used together with a WSGI (Web Server Gateway Interface) server.
Below we show how to use the [uWSGI](https://github.com/unbit/uwsgi) module to deploy an HTTP prediction service for production.
Write the HTTP service script:
```python
#uwsgi_service.py
from paddle_serving_server.web_service import WebService
#define the prediction service
uci_service = WebService(name = "uci")
uci_service.load_model_config("./uci_housing_model")
uci_service.prepare_server(workdir="./workdir", port=int(9500), device="cpu")
uci_service.run_rpc_service()
#get the flask application
app_instance = uci_service.get_app_instance()
```
Start the HTTP service with uWSGI
```bash
uwsgi --http :9393 --module uwsgi_service:app_instance
```
Use the --processes parameter to specify the number of service processes.
For more information about uWSGI, please refer to the [uWSGI documentation](https://uwsgi-docs.readthedocs.io/en/latest/).
......@@ -4,7 +4,7 @@
This document will use an example of text classification task based on IMDB dataset to show how to build a A/B Test framework using Paddle Serving. The structure relationship between the client and servers in the example is shown in the figure below.
<img src="abtest.png" style="zoom:25%;" />
<img src="images/abtest.png" style="zoom:25%;" />
Note: A/B testing is only applicable to RPC mode, not web mode.
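As a rough sketch of the client-side setup, the following shows how two server variants can be registered with relative traffic weights. The `add_variant` call mirrors the IMDB A/B test example; the tag names, ports and weights here are illustrative.
```python
from paddle_serving_client import Client

client = Client()
client.load_client_config('imdb_bow_client_conf/serving_client_conf.prototxt')
# route roughly 10% of the traffic to the first variant and 90% to the second
client.add_variant("bow", ["127.0.0.1:8000"], 10)
client.add_variant("cnn", ["127.0.0.1:9000"], 90)
client.connect()
```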
......
......@@ -4,7 +4,7 @@
This document uses a text classification task based on the IMDB dataset as an example to show how to build an A/B test framework with Paddle Serving. The structure of the client and servers in the example is shown in the figure below.
<img src="abtest.png" style="zoom:33%;" />
<img src="images/abtest.png" style="zoom:33%;" />
Note: A/B testing is only applicable to RPC mode, not web mode.
......
......@@ -45,11 +45,11 @@ Models that can be predicted using the Paddle Inference Library, models saved du
### 3.4 Server Interface
![Server Interface](server_interface.png)
![Server Interface](images/server_interface.png)
### 3.5 Client Interface
<img src='client_inferface.png' width = "600" height = "200">
<img src='images/client_inferface.png' width = "600" height = "200">
### 3.6 Client io used during Training
......@@ -66,7 +66,7 @@ def save_model(server_model_folder,
## 4. Paddle Serving Underlying Framework
![Paddle-Serving Overall Architecture](framework.png)
![Paddle-Serving Overall Architecture](images/framework.png)
**Model Management Framework**: Connects model files of multiple machine learning platforms and provides a unified inference interface
**Business Scheduling Framework**: Abstracts the calculation logic of various different inference models, provides a general DAG scheduling framework, and connects different operators through DAG diagrams to complete a prediction service together. This abstract model allows users to conveniently implement their own calculation logic, and at the same time facilitates operator sharing. (Users build their own forecasting services. A large part of their work is to build DAGs and provide operators.)
......@@ -102,18 +102,18 @@ class FluidFamilyCore {
With reference to the abstract idea of model calculation of the TensorFlow framework, the business logic is abstracted into a DAG diagram, driven by configuration, generating a workflow, and skipping C ++ code compilation. Each specific step of the service corresponds to a specific OP. The OP can configure the upstream OP that it depends on. Unified message passing between OPs is achieved by the thread-level bus and channel mechanisms. For example, the service process of a simple prediction service can be abstracted into 3 steps including reading request data-> calling the prediction interface-> writing back the prediction result, and correspondingly implemented to 3 OP: ReaderOp-> ClassifyOp-> WriteOp
![Infer Service](predict-service.png)
![Infer Service](images/predict-service.png)
Regarding the dependencies between OPs and how to build workflows from OPs, see [Creating a Prediction Service from Scratch](CREATING.md) (Simplified Chinese version).
Server instance perspective
![Server instance perspective](server-side.png)
![Server instance perspective](images/server-side.png)
#### 4.2.2 Paddle Serving Multi-Service Mechanism
![Paddle Serving multi-service](multi-service.png)
![Paddle Serving multi-service](images/multi-service.png)
Paddle Serving instances can load multiple models at the same time, and each model uses a Service (and its configured workflow) to undertake services. You can refer to [service configuration file in Demo example](../tools/cpp_examples/demo-serving/conf/service.prototxt) to learn how to configure multiple services for the serving instance
......@@ -121,12 +121,12 @@ Paddle Serving instances can load multiple models at the same time, and each mod
From the client's perspective, a Paddle Serving service can be divided into three levels: Service, Endpoint, and Variant from top to bottom.
![Call hierarchy relationship](multi-variants.png)
![Call hierarchy relationship](images/multi-variants.png)
One Service corresponds to one inference model, and there is one endpoint under the model. Different versions of the model are implemented through multiple variant concepts under endpoint:
The same model prediction service can configure multiple variants, and each variant has its own downstream IP list. The client code can configure relative weights for each variant to achieve the relationship of adjusting the traffic ratio (refer to the description of variant_weight_list in [Client Configuration](CLIENT_CONFIGURE.md) section 3.2).
![Client-side proxy function](client-side-proxy.png)
![Client-side proxy function](images/client-side-proxy.png)
## 5. User Interface
......
......@@ -47,11 +47,11 @@ PaddlePaddle是百度开源的机器学习框架,广泛支持各种深度学
### 3.4 Server Interface
![Server Interface](server_interface.png)
![Server Interface](images/server_interface.png)
### 3.5 Client Interface
<img src='client_inferface.png' width = "600" height = "200">
<img src='images/client_inferface.png' width = "600" height = "200">
### 3.6 Client io Used During Training
......@@ -68,7 +68,7 @@ def save_model(server_model_folder,
## 4. Paddle Serving Underlying Framework
![Paddle-Serving Overall Architecture](framework.png)
![Paddle-Serving Overall Architecture](images/framework.png)
**Model management framework**: connects model files from multiple machine-learning platforms and provides a unified inference interface upward.
**Business scheduling framework**: abstracts the computation logic of different inference models and provides a general DAG scheduling framework; different operators are connected through a DAG graph to jointly complete one prediction service. This abstraction lets users conveniently implement their own computation logic and also facilitates operator sharing. (When users build their own prediction services, a large part of the work is building the DAG and providing operators.)
......@@ -104,18 +104,18 @@ class FluidFamilyCore {
Following the abstraction of model computation in the TensorFlow framework, the business logic is abstracted into a DAG, driven by configuration, generating a workflow and skipping C++ code compilation. Each concrete step of the service corresponds to a specific OP, and an OP can declare the upstream OPs it depends on. Message passing between OPs is uniformly implemented by the thread-level bus and channel mechanisms. For example, the process of a simple prediction service can be abstracted into three steps, reading request data -> calling the inference interface -> writing back the prediction result, implemented as three OPs: ReaderOp -> ClassifyOp -> WriteOp.
![Infer Service](predict-service.png)
![Infer Service](images/predict-service.png)
Regarding the dependencies between OPs and how to build workflows from OPs, see the relevant chapters of [Creating a Prediction Service from Scratch](CREATING.md).
Server instance perspective
![Server instance perspective](server-side.png)
![Server instance perspective](images/server-side.png)
#### 4.2.2 Paddle Serving Multi-Service Mechanism
![Paddle Serving multi-service](multi-service.png)
![Paddle Serving multi-service](images/multi-service.png)
A Paddle Serving instance can load multiple models at the same time, and each model serves through a Service (and its configured workflow). See the [service configuration file in the demo example](../tools/cpp_examples/demo-serving/conf/service.prototxt) to learn how to configure multiple services for a serving instance.
......@@ -123,12 +123,12 @@ Paddle Serving实例可以同时加载多个模型,每个模型用一个Servic
From the client's perspective, a Paddle Serving service can be divided top-down into three levels: Service, Endpoint, and Variant.
![Call hierarchy relationship](multi-variants.png)
![Call hierarchy relationship](images/multi-variants.png)
One Service corresponds to one inference model, and the model has one endpoint. Different versions of the model are implemented through multiple variants under the endpoint:
The same model prediction service can configure multiple variants, each with its own downstream IP list. The client code can assign a relative weight to each variant to adjust the traffic ratio (see the description of variant_weight_list in section 3.2 of [Client Configuration](CLIENT_CONFIGURE.md)).
![Client-side proxy function](client-side-proxy.png)
![Client-side proxy function](images/client-side-proxy.png)
## 5. User Interface
......
......@@ -9,7 +9,7 @@ This document shows the concept of computation graph on server. How to define co
Deep neural nets often have some preprocessing steps on input data, and postprocessing steps on model inference scores. Since deep learning frameworks are now very flexible, it is possible to do preprocessing and postprocessing outside the training computation graph. If we want to do input data preprocessing and inference result postprocessing on the server side, we have to add the corresponding computation logic on the server. Moreover, if a user wants to do inference with the same inputs on more than one model, the best way is to do the inference concurrently on the server side given only one client request, so that we can save some network overhead. For the above two reasons, it is natural to think of a Directed Acyclic Graph (DAG) as the main computation method for server inference. One example of a DAG is as follows:
<center>
<img src='server_dag.png' width = "450" height = "500" align="middle"/>
<img src='images/server_dag.png' width = "450" height = "500" align="middle"/>
</center>
## How to define Node
......@@ -19,7 +19,7 @@ Deep neural nets often have some preprocessing steps on input data, and postproc
PaddleServing has some predefined computation nodes in the framework. A very commonly used computation graph is the simple reader-inference-response pattern, which covers most single-model inference scenarios. An example graph and the corresponding DAG definition code are as follows.
<center>
<img src='simple_dag.png' width = "260" height = "370" align="middle"/>
<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center>
``` python
......@@ -51,7 +51,7 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
An example containing multiple input nodes is given in [MODEL_ENSEMBLE_IN_PADDLE_SERVING](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md). An example graph and the corresponding DAG definition code are as follows.
<center>
<img src='complex_dag.png' width = "480" height = "400" align="middle"/>
<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center>
```python
......
......@@ -9,7 +9,7 @@
Deep neural networks usually have some preprocessing steps on the input data and some postprocessing steps on the model inference scores. Since deep learning frameworks are now very flexible, preprocessing and postprocessing can be done outside the training computation graph. If we want to do input preprocessing and inference-result postprocessing on the server side, the corresponding computation logic must be added on the server. Moreover, if a user wants to run inference on multiple models with the same input, the best way is to run the inference concurrently on the server side given a single client request, which saves some network overhead. For these two reasons, it is natural to treat a directed acyclic graph (DAG) as the main computation method for server inference. An example DAG is as follows:
<center>
<img src='server_dag.png' width = "450" height = "500" align="middle"/>
<img src='images/server_dag.png' width = "450" height = "500" align="middle"/>
</center>
## How to Define Nodes
......@@ -18,7 +18,7 @@
PaddleServing has some predefined computation nodes in the framework. A very commonly used computation graph is the simple reader-infer-response pattern, which covers most single-model inference scenarios. An example graph and the corresponding DAG definition code are shown below.
<center>
<img src='simple_dag.png' width = "260" height = "370" align="middle"/>
<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center>
``` python
......@@ -50,7 +50,7 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
The [Model Ensemble in Paddle Serving](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md) document gives an example containing multiple input nodes; the diagram and code are shown below.
<center>
<img src='complex_dag.png' width = "480" height = "400" align="middle"/>
<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center>
```python
......
......@@ -26,7 +26,7 @@
After pruning steps 1) to 5), the model network configuration is as follows:
![Pruned CTR prediction network](../pruned-ctr-network.png)
![Pruned CTR prediction network](../images/pruned-ctr-network.png)
The whole pruning process is described in detail as follows:
......
......@@ -10,7 +10,7 @@ Next, we will take the text classification task as an example to show model ense
In this example (see the figure below), the server side predicts the BOW and CNN models in parallel with the same input within one service; the client side fetches the prediction results of the two models and post-processes them to get the final prediction.
![simple example](../model_ensemble_example.png)
![simple example](../images/model_ensemble_example.png)
It should be noted that, at present, only multiple models with the same input and output formats are supported within one service. In this example, the input and output formats of the CNN and BOW models are the same.
......
......@@ -10,7 +10,7 @@
In this example (see the figure below), the server side predicts the BOW and CNN models in parallel with the same input within one service; the client side fetches the prediction results of the two models and post-processes them to get the final prediction.
![simple example](../model_ensemble_example.png)
![simple example](../images/model_ensemble_example.png)
Note that, at present, only multiple models with the same input and output formats are supported within one service. In this example, the input and output formats of the CNN and BOW models are the same.
......
# Multiple Serving Instances over Single GPU Card
Paddle Serving relies on the PaddlePaddle inference library for the actual prediction computation. Due to the current limitation of the GPU inference library, a single Serving instance can bind only one GPU card, and all worker threads in the process share one GPU stream. In other words, no matter how many worker threads Serving starts, all requests are strictly serialized on the GPU, so more threads bring no speedup. This causes a problem: if the model computation is light, the Serving process will not fully use the GPU's compute capacity.
To make full use of the GPU's compute capacity, consider starting multiple Serving instances on a single card and using multiple GPU streams to saturate the GPU. The launch commands can look like this:
```
bin/serving --gpuid=0 --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8010&
bin/serving --gpuid=0 --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8011&
```
The two commands above start two Serving instances listening on ports 8010 and 8011 respectively, but both bound to the same card (gpuid = 0).
Meaning of the command-line arguments:
```
-gpuid=N: specifies the ID of the GPU card to bind
-bthread_concurrency and -bthread_min_concurrency together limit the number of workers started by the process: in GPU inference mode, increasing the number of worker threads does not improve concurrency, so they are simply capped to save resources; both are set to 4 because that is the minimum value allowed by bthread
-port xxx: the port the Serving instance listens on
```
However, whether this approach can actually improve GPU utilization without hurting response time and other metrics is constrained by several factors, specifically:
1. A single stream already consumes GPU compute. If one stream uses more than 50% of the GPU's compute capacity, adding another stream is likely to make the jobs of the two streams queue separately and slow down both response times.
2. GPU memory: a Serving process needs to load the model parameters into GPU memory and allocate temporary variables from the GPU memory pool during computation. If a single Serving process already uses more than 50% of the GPU memory, adding another Serving process will run out of GPU memory and cause the process to exit with an error.
Therefore, the following steps can be used for testing:
1. When loading the model, choose FLUID_GPU_ANALYSIS or FLUID_GPU_ANALYSIS_DIR as the model type in model_toolkit.prototxt; this performs static analysis of the model and a certain degree of GPU memory optimization.
2. After step 1, start a single Serving process with the arguments `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4`; start one client and run a stress test with concurrency 1, increasing the batch size from small to large and recording the average response time. Due to the compute limit, the response time should increase noticeably once the batch size grows to a certain point, or, even without an obvious increase, it may no longer meet the system requirements.
3. Start one more Serving process with almost the same arguments as in step 2: `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8011`, where --port=8011 makes the new process use a new service port. Then stress-test both Serving processes at the same time and keep observing how the average response time changes as the batch size grows, until you find a trade-off between batch size and response time.
4. Repeat steps 2-3.
5. Use the tests in steps 2-4 to decide how many Serving processes can share a single GPU card; in actual deployment, start that many Serving processes on one GPU card to serve simultaneously.
# How to develop a new Web service?
([简体中文](NEW_WEB_SERVICE_CN.md)|English)
This document will take the image classification service based on the Imagenet data set as an example to introduce how to develop a new web service. The complete code can be visited at [here](../../python/examples/imagenet/resnet50_web_service.py).
## WebService base class
Paddle Serving implements the [WebService](https://github.com/PaddlePaddle/Serving/blob/develop/python/paddle_serving_server/web_service.py#L23) base class. You need to override its `preprocess` and `postprocess` method. The default implementation is as follows:
```python
class WebService(object):
def preprocess(self, feed={}, fetch=[]):
return feed, fetch
def postprocess(self, feed={}, fetch=[], fetch_map=None):
return fetch_map
```
### preprocess
The preprocess method has two input parameters, `feed` and `fetch`. For an HTTP request `request`:
- The value of `feed` is the feed part `request.json["feed"]` in the request data
- The value of `fetch` is the fetch part `request.json["fetch"]` in the request data
The return values are the feed and fetch values used in the prediction.
### postprocess
The postprocess method has three input parameters, `feed`, `fetch` and `fetch_map`:
- The value of `feed` is the feed part `request.json["feed"]` in the request data
- The value of `fetch` is the fetch part `request.json["fetch"]` in the request data
- The value of `fetch_map` is the model output value.
The return value will be processed as `{"reslut": fetch_map}` as the return of the HTTP request.
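As a hedged illustration of a postprocess override (not part of the original example), the sketch below turns the raw score matrix into a predicted class id per sample; the fetch name "score" is an assumption borrowed from the imagenet example and may differ for other models.
```python
import numpy as np
from paddle_serving_server.web_service import WebService

class ImagenetService(WebService):
    def postprocess(self, feed={}, fetch=[], fetch_map=None):
        # illustrative post-processing: pick the highest-scoring class per sample
        scores = np.array(fetch_map["score"])
        fetch_map["class_id"] = np.argmax(scores, axis=1).tolist()
        return fetch_map
```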
## Develop ImageService class
```python
class ImageService(WebService):
def preprocess(self, feed={}, fetch=[]):
reader = ImageReader()
feed_batch = []
for ins in feed:
if "image" not in ins:
raise ("feed data error!")
sample = base64.b64decode(ins["image"])
img = reader.process_image(sample)
feed_batch.append({"image": img})
return feed_batch, fetch
```
For the above `ImageService`, only the `preprocess` method is rewritten to process the image data in Base64 format into the data format required by prediction.
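To expose such a service, the launch sequence typically looks like the sketch below; the model path, workdir and port are placeholders, and the run_* method names follow other WebService examples in this repo.
```python
image_service = ImageService(name="image")
image_service.load_model_config("ResNet50_vd_model")   # placeholder model path
image_service.prepare_server(workdir="workdir", port=9393, device="cpu")
image_service.run_rpc_service()   # start the underlying RPC service
image_service.run_web_service()   # start the HTTP endpoint (e.g. /image/prediction)
```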
# How to Develop a New Web Service?
(简体中文|[English](NEW_WEB_SERVICE.md))
This document takes the Imagenet image classification service as an example to introduce how to develop a new web service. The complete code can be found [here](../../python/examples/imagenet/resnet50_web_service.py).
## The WebService Base Class
Paddle Serving implements the [WebService](https://github.com/PaddlePaddle/Serving/blob/develop/python/paddle_serving_server/web_service.py#L23) base class; you need to override its `preprocess` and `postprocess` methods. The default implementation is as follows:
```python
class WebService(object):
def preprocess(self, feed={}, fetch=[]):
return feed, fetch
def postprocess(self, feed={}, fetch=[], fetch_map=None):
return fetch_map
```
### The preprocess Method
The preprocess method has two input parameters, `feed` and `fetch`. For an HTTP request `request`:
- the value of `feed` is the feed part of the request data, `request.json["feed"]`
- the value of `fetch` is the fetch part of the request data, `request.json["fetch"]`
The return values are the feed and fetch values used in prediction.
### The postprocess Method
The postprocess method has three input parameters, `feed`, `fetch` and `fetch_map`:
- the value of `feed` is the feed part of the request data, `request.json["feed"]`
- the value of `fetch` is the fetch part of the request data, `request.json["fetch"]`
- the value of `fetch_map` is the model output value
The return value will be processed into `{"reslut": fetch_map}` as the response of the HTTP request.
## Developing the ImageService Class
```python
class ImageService(WebService):
def preprocess(self, feed={}, fetch=[]):
reader = ImageReader()
feed_batch = []
for ins in feed:
if "image" not in ins:
raise ("feed data error!")
sample = base64.b64decode(ins["image"])
img = reader.process_image(sample)
feed_batch.append({"image": img})
return feed_batch, fetch
```
For the `ImageService` above, only the preprocess method is overridden, converting the Base64-encoded image data into the data format required by the model.
BERT_10_MINS.md
ABTEST_IN_PADDLE_SERVING.md
......@@ -18,7 +18,7 @@ Paddle Serving provides a user-friendly programming framework for multi-model co
The Server side is built based on <b>RPC Service</b> and <b>graph execution engine</b>. The relationship between them is shown in the following figure.
<div align=center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
</div>
### 1.1 RPC Service
......@@ -61,7 +61,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- For cases where large data needs to be transferred between OPs, consider RAM DB external memory for global storage and data transfer by passing index keys in Channel.
<div align=center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</div>
......@@ -80,7 +80,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- The following illustration shows the design of Channel in the graph execution engine, using input buffer and output buffer to align data between multiple OP inputs and multiple OP outputs, with a queue in the middle to buffer.
<div align=center>
<img src='pipeline_serving-image3.png' height = "500" align="middle"/>
<img src='images/pipeline_serving-image3.png' height = "500" align="middle"/>
</div>
......@@ -323,7 +323,7 @@ All examples of pipelines are in [examples/pipeline/](../python/examples/pipelin
Here, we build a simple imdb model ensemble example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The server-side structure in the example is shown in the following figure:
<div align=center>
<img src='pipeline_serving-image4.png' height = "200" align="middle"/>
<img src='images/pipeline_serving-image4.png' height = "200" align="middle"/>
</div>
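A heavily abridged sketch of what such a pipeline server script can look like is shown below. The structure is modeled on the imdb_model_ensemble example layout; the constructor arguments, the fetch name and the config file name are assumptions that may differ from the actual example code.
```python
from paddle_serving_server.pipeline import Op, RequestOp, ResponseOp
from paddle_serving_server.pipeline import PipelineServer

read_op = RequestOp()                         # unpacks the client request
bow_op = Op(name="bow",
            input_ops=[read_op],
            server_endpoints=["127.0.0.1:9393"],
            fetch_list=["prediction"],
            client_config="imdb_bow_client_conf/serving_client_conf.prototxt",
            concurrency=1)
response_op = ResponseOp(input_ops=[bow_op])  # packs the result for the client

server = PipelineServer()
server.set_response_op(response_op)
server.prepare_server("config.yml")           # ports, worker counts, etc.
server.run_server()
```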
### 3.1 Files required for pipeline deployment
......
......@@ -20,7 +20,7 @@ Paddle Serving提供了用户友好的多模型组合服务编程框架,Pipeli
The Server side is built on the <b>RPC service layer</b> and the <b>graph execution engine</b>; their relationship is shown in the figure below.
<div align=center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
</div>
</n>
......@@ -65,7 +65,7 @@ Response中`err_no`和`err_msg`表达处理结果的正确性和错误信息,`
- For cases where overly large data needs to be transferred between OPs, consider using RAM DB external storage for global storage and transferring the data by passing the index key through the Channel.
<div align=center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</div>
......@@ -84,7 +84,7 @@ Response中`err_no`和`err_msg`表达处理结果的正确性和错误信息,`
- The figure below shows the design of the Channel in the graph execution engine: an input buffer and an output buffer align the data of multiple OP inputs or multiple OP outputs, with a queue in the middle for buffering.
<div align=center>
<img src='pipeline_serving-image3.png' height = "500" align="middle"/>
<img src='images/pipeline_serving-image3.png' height = "500" align="middle"/>
</div>
#### <b>1.2.3 Design of Prediction Types</b>
......@@ -328,7 +328,7 @@ class ResponseOp(Op):
Take imdb_model_ensemble as an example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder, and the server-side structure in the example is shown in the figure below:
<div align=center>
<img src='pipeline_serving-image4.png' height = "200" align="middle"/>
<img src='images/pipeline_serving-image4.png' height = "200" align="middle"/>
</div>
### 3.1 Files Required for Pipeline Deployment
......