fix bert benchmark

0032ec64 · MRXLT · 025f16e5 · 1a69d452 · 0032ec64 · 0032ec64
20 changed files
--- a/README.md
+++ b/README.md
@@ -18,19 +18,19 @@

 <h2 align="center">Motivation</h2>

-We consider deploying deep learning inference service online to be a user-facing application in the future. **The goal of this project**: When you have trained a deep neural net with [Paddle](https://github.com/PaddlePaddle/Paddle), you can put the model online without much effort. A demo of serving is as follows:
+We consider deploying a deep learning inference service online to be a user-facing application in the future. **The goal of this project**: When you have trained a deep neural net with [Paddle](https://github.com/PaddlePaddle/Paddle), you can also deploy the model online easily. A demo of Paddle Serving is as follows:
 <p align="center">
    <img src="doc/demo.gif" width="700">
 </p>

 <h2 align="center">Some Key Features</h2>

- Integrate with Paddle training pipeline seemlessly, most paddle models can be deployed **with one line command**.
+- Integrates seamlessly with the Paddle training pipeline; most Paddle models can be deployed **with a one-line command**.
 - **Industrial serving features** supported, such as models management, online loading, online A/B testing etc.
- **Distributed Key-Value indexing** supported that is especially useful for large scale sparse features as model inputs.
- **Highly concurrent and efficient communication** between clients and servers.
- **Multiple programming languages** supported on client side, such as Golang, C++ and python
- **Extensible framework design** that can support model serving beyond Paddle.
+- **Distributed Key-Value indexing** supported, which is especially useful for large-scale sparse features as model inputs.
+- **Highly concurrent and efficient communication** between clients and servers supported.
+- **Multiple programming languages** supported on the client side, such as Golang, C++ and Python.
+- **Extensible framework design** which can support model serving beyond Paddle.

 <h2 align="center">Installation</h2>

@@ -53,7 +53,7 @@ Paddle Serving provides HTTP and RPC based service for users to access

 ### HTTP service

-Paddle Serving provides a built-in python module called `paddle_serving_server.serve` that can start a rpc service or a http service with one-line command. If we specify the argument `--name uci`, it means that we will have a HTTP service with a url of `$IP:$PORT/uci/prediction`
+Paddle Serving provides a built-in Python module called `paddle_serving_server.serve` that can start an RPC service or an HTTP service with a one-line command. If we specify the argument `--name uci`, we will have an HTTP service with a URL of `$IP:$PORT/uci/prediction`
 ``` shell
 python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
 ```
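The request such a service expects can be sketched in Python with only the standard library. The helper name `build_prediction_request` is hypothetical; the URL pattern `$IP:$PORT/<name>/prediction` and the JSON fields (`x`, `fetch`) are taken from the README text and the `curl` example in this diff. The function only assembles the request and sends nothing.

```python
import json
import urllib.request


def build_prediction_request(host, port, name, feed, fetch):
    """Assemble (but do not send) an HTTP POST for a service started with --name."""
    url = "http://{}:{}/{}/prediction".format(host, port, name)
    payload = dict(feed)            # model inputs keyed by alias, e.g. "x"
    payload["fetch"] = list(fetch)  # names of the outputs to return
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same 13 feature values as the curl example for the uci_housing model.
features = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583,
            -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
req = build_prediction_request("127.0.0.1", 9292, "uci", {"x": features}, ["price"])
print(req.full_url)
# With a server running, the request could be sent with urllib.request.urlopen(req).
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a live server.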
@@ -75,7 +75,7 @@ curl -H "Content-Type:application/json" -X POST -d '{"x": [0.0137, -0.1136, 0.25

 ### RPC service

-A user can also start a rpc service with `paddle_serving_server.serve`. RPC service is usually faster than HTTP service, although a user needs to do some coding based on Paddle Serving's python client API. Note that we do not specify `--name` here. 
+A user can also start an RPC service with `paddle_serving_server.serve`. An RPC service is usually faster than an HTTP service, although a user needs to do some coding based on Paddle Serving's Python client API. Note that we do not specify `--name` here.
 ``` shell
 python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
 ```
@@ -239,26 +239,26 @@ curl -H "Content-Type:application/json" -X POST -d '{"url": "https://paddle-serv

 ### New to Paddle Serving
 - [How to save a servable model?](doc/SAVE.md)
- [An end-to-end tutorial from training to serving(Chinese)](doc/TRAIN_TO_SERVICE.md)
- [Write Bert-as-Service in 10 minutes(Chinese)](doc/BERT_10_MINS.md)
+- [An End-to-end tutorial from training to inference service deployment](doc/TRAIN_TO_SERVICE.md)
+- [Write Bert-as-Service in 10 minutes](doc/BERT_10_MINS.md)

 ### Developers
 - [How to config Serving native operators on server side?](doc/SERVER_DAG.md)
- [How to develop a new Serving operator](doc/NEW_OPERATOR.md)
+- [How to develop a new Serving operator?](doc/NEW_OPERATOR.md)
 - [Golang client](doc/IMDB_GO_CLIENT.md)
- [Compile from source code(Chinese)](doc/COMPILE.md)
+- [Compile from source code](doc/COMPILE.md)

 ### About Efficiency
- [How profile serving efficiency?(Chinese)](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/util)
- [Benchmarks](doc/BENCHMARK.md)
+- [How to profile Paddle Serving latency?(Chinese)](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/util)
+- [CPU Benchmarks(Chinese)](doc/BENCHMARKING.md)
+- [GPU Benchmarks(Chinese)](doc/GPU_BENCHMARKING.md)

 ### FAQ
 - [FAQ(Chinese)](doc/FAQ.md)


 ### Design
- [Design Doc(Chinese)](doc/DESIGN_DOC.md)
- [Design Doc(English)](doc/DESIGN_DOC_EN.md)
+- [Design Doc](doc/DESIGN_DOC.md)

 <h2 align="center">Community</h2>


--- a/README_CN.md
+++ b/README_CN.md
-<img src='https://paddle-serving.bj.bcebos.com/imdb-demo%2FLogoMakr-3Bd2NM-300dpi.png' width = "600" height = "127">
+<p align="center">
+    <br>
+<img src='https://paddle-serving.bj.bcebos.com/imdb-demo%2FLogoMakr-3Bd2NM-300dpi.png' width = "600" height = "130">
+    <br>
+</p>
+    
+<p align="center">
+    <br>
+    <a href="https://travis-ci.com/PaddlePaddle/Serving">
+        <img alt="Build Status" src="https://img.shields.io/travis/com/PaddlePaddle/Serving/develop">
+    </a>
+    <img alt="Release" src="https://img.shields.io/badge/Release-0.0.3-yellowgreen">
+    <img alt="Issues" src="https://img.shields.io/github/issues/PaddlePaddle/Serving">
+    <img alt="License" src="https://img.shields.io/github/license/PaddlePaddle/Serving">
+    <img alt="Slack" src="https://img.shields.io/badge/Join-Slack-green">
+    <br>
+</p>
+
+<h2 align="center">Motivation</h2>

-[![Build Status](https://img.shields.io/travis/com/PaddlePaddle/Serving/develop)](https://travis-ci.com/PaddlePaddle/Serving)
-[![Release](https://img.shields.io/badge/Release-0.0.3-yellowgreen)](Release)
-[![Issues](https://img.shields.io/github/issues/PaddlePaddle/Serving)](Issues)
-[![License](https://img.shields.io/github/license/PaddlePaddle/Serving)](LICENSE)
-[![Slack](https://img.shields.io/badge/Join-Slack-green)](https://paddleserving.slack.com/archives/CU0PB4K35)
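+Paddle Serving aims to help deep learning developers deploy online prediction services with ease. **The goal of this project**: once a user has trained a deep neural network with [Paddle](https://github.com/PaddlePaddle/Paddle), they also have a prediction service for that model.

-## Motivation
-Paddle Serving helps deep learning developers deploy online prediction services with ease. **The goal of this project**: as soon as you have trained a deep neural network with [Paddle](https://github.com/PaddlePaddle/Paddle), you also have a prediction service for that model.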
 <p align="center">
    <img src="doc/demo.gif" width="700">
 </p>

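-## Key Features
+<h2 align="center">Key Features</h2>
+
 - Tightly integrated with Paddle training; the vast majority of Paddle models can be deployed **with one command**.
 - Supports **industrial-grade serving features** such as model management, online loading, and online A/B testing.
 - Supports **distributed key-value indexing**, which helps when large-scale sparse features are used as model inputs.
@@ -20,7 +33,7 @@ Paddle Serving helps deep learning developers deploy online prediction services with ease. **
 - Supports **multiple programming languages** for client development, such as Golang, C++ and Python.
 - **Extensible framework design** that can support model serving beyond Paddle.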

-## Installation
+<h2 align="center">Installation</h2>

 We strongly recommend building Paddle Serving inside Docker; see [How to run PaddleServing in Docker](doc/RUN_IN_DOCKER_CN.md)

@@ -29,17 +42,51 @@ pip install paddle-serving-client
 pip install paddle-serving-server
 ```

-## Quick Start Example
+<h2 align="center">Quick Start Example</h2>
+
+<h3 align="center">Boston House Price Prediction</h3>

 ``` shell
 wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
 tar -xzf uci_housing.tar.gz
-python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
 ```

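-Python client request
+Paddle Serving provides HTTP-based and RPC-based services for users.
+
+
+<h3 align="center">HTTP Service</h3>
+
+Paddle Serving provides a built-in Python module called `paddle_serving_server.serve` that can start an RPC service or an HTTP service with a one-line command. If we specify the argument `--name uci`, we will have an HTTP service whose URL is `$IP:$PORT/uci/prediction`.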
+
+``` shell
+python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
+```
+<center>
+
+| Argument | Type | Default | Description |
+|--------------|------|-----------|--------------------------------|
+| `thread` | int | `4` | Concurrency of current service |
+| `port` | int | `9292` | Exposed port of current service to users|
+| `name` | str | `""` | Service name, can be used to generate HTTP request url |
+| `model` | str | `""` | Path of paddle model directory to be served |
+
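+Here we use the `curl` command to send an HTTP POST request to the service we just started. Users can also send HTTP POST requests from a Python library; see the [requests](https://requests.readthedocs.io/en/master/) documentation.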
+</center>
+
+``` shell
+curl -H "Content-Type:application/json" -X POST -d '{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
+```
+
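+<h3 align="center">RPC Service</h3>
+
+Users can also start an RPC service with `paddle_serving_server.serve`. Although users need to do some development based on Paddle Serving's Python client API, an RPC service is usually faster than an HTTP service. Note that we do not specify `--name` here.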
+
+``` shell
+python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
+```

 ``` python
+# A user can visit rpc service through paddle_serving_client API
 from paddle_serving_client import Client

 client = Client()
@@ -51,24 +98,105 @@ fetch_map = client.predict(feed={"x": data}, fetch=["price"])
 print(fetch_map)

 ```
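+Here, the `client.predict` function takes two arguments. `feed` is a `python dict` of model input variable alias names and values. `fetch` specifies the prediction variables to be returned from the server. In this example, the names `"x"` and `"price"` were assigned when the servable model was saved during training.
+
+<h2 align="center">Pre-built Services with Paddle Serving</h2>
+
+<h3 align="center">Chinese Word Segmentation</h3>
+
+- **Description**: 
+``` shell
+This example is a Chinese word segmentation HTTP service that can be deployed with one command
+```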
+
+- **Download Servable Package**: 
+``` shell
+wget --no-check-certificate https://paddle-serving.bj.bcebos.com/lac/lac_model_jieba_web.tar.gz
+```
+- **Start the web service**: 
+``` shell
+tar -xzf lac_model_jieba_web.tar.gz
+python lac_web_service.py jieba_server_model/ lac_workdir 9292
+```
+- **Sample client request**: 
+``` shell
+curl -H "Content-Type:application/json" -X POST -d '{"words": "我爱北京天安门", "fetch":["word_seg"]}' http://127.0.0.1:9292/lac/prediction
+```
+- **Sample response**: 
+``` shell
+{"word_seg":"我|爱|北京|天安门"}
+```
+
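+<h3 align="center">Image Classification</h3>
+
+- **Description**: 
+``` shell
+The image classification model is trained on the ImageNet dataset; the service returns a label and its probability
+```
+
+- **Download Servable Package**: 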
+``` shell
+wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imagenet-example/imagenet_demo.tar.gz
+```
+- **Start the web service**: 
+``` shell
+tar -xzf imagenet_demo.tar.gz
+python image_classification_service_demo.py resnet50_serving_model
+```
+- **Sample client request**: 
+
+<p align="center">
+    <br>
+<img src='https://paddle-serving.bj.bcebos.com/imagenet-example/daisy.jpg' width = "200" height = "200">
+    <br>
+</p>
+    
+``` shell
+curl -H "Content-Type:application/json" -X POST -d '{"url": "https://paddle-serving.bj.bcebos.com/imagenet-example/daisy.jpg", "fetch": ["score"]}' http://127.0.0.1:9292/image/prediction
+```
+- **Sample response**: 
+``` shell
+{"label":"daisy","prob":0.9341403245925903}
+```
+
+<h2 align="center">Documentation</h2>
+
+### New to Paddle Serving
+- [How to save a servable model for Paddle Serving?](doc/SAVE_CN.md)
+- [An end-to-end walkthrough from training to deployment](doc/TRAIN_TO_SERVICE_CN.md)
+- [Build Bert-As-Service in 10 minutes](doc/BERT_10_MINS_CN.md)
+
+### Developers
+- [How to configure the computation graph on the server side?](doc/SERVER_DAG_CN.md)
+- [How to develop a new General Op?](doc/NEW_OPERATOR_CN.md)
+- [How to use the Go client with Paddle Serving?](doc/IMDB_GO_CLIENT_CN.md)
+- [How to compile PaddleServing?](doc/COMPILE_CN.md)
+
+### About Paddle Serving Performance
+- [How to measure Paddle Serving performance?](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/util)
+- [CPU Benchmarks](doc/BENCHMARKING.md)
+- [GPU Benchmarks](doc/GPU_BENCHMARKING.md)
+
+### FAQ
+- [FAQ](doc/deprecated/FAQ.md)

-## Documentation
+### Design
+- [Paddle Serving Design Doc](doc/DESIGN_DOC_CN.md)

-[Development Doc](doc/DESIGN.md)
+<h2 align="center">Community</h2>

-[How to configure native Ops on the server side?](doc/SERVER_DAG.md)
+### Slack

-[How to develop a new Op?](doc/NEW_OPERATOR.md)
+Want to talk with developers and other users? You are welcome to join our [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ)

-[Golang client](doc/IMDB_GO_CLIENT.md)
+### Contribution

-[Compile from source code](doc/COMPILE.md)
+If you would like to contribute code to Paddle Serving, please refer to the [Contribution Guidelines](doc/CONTRIBUTE.md)

-[FAQ](doc/FAQ.md)
+### Feedback

-## Join the Community
-If you would like to get in touch with other users and developers, you are welcome to join our [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ)
+For any feedback or bug reports, please submit a [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues)

-## How to Contribute
+### License

-If you would like to contribute code to Paddle Serving, please refer to the [Contribution Guidelines](doc/CONTRIBUTE.md)
+[Apache 2.0 License](https://github.com/PaddlePaddle/Serving/blob/develop/LICENSE)
--- a/doc/ABTEST_IN_PADDLE_SERVING.md
+++ b/doc/ABTEST_IN_PADDLE_SERVING.md
 # ABTEST in Paddle Serving

+([简体中文](./ABTEST_IN_PADDLE_SERVING_CN.md)|English)
+
 This document uses a text classification task based on the IMDB dataset as an example to show how to build an A/B Test framework using Paddle Serving. The structural relationship between the client and servers in the example is shown in the figure below.

 <img src="abtest.png" style="zoom:33%;" />

--- a/doc/ABTEST_IN_PADDLE_SERVING_CN.md
+++ b/doc/ABTEST_IN_PADDLE_SERVING_CN.md
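 # How to do ABTEST with Paddle Serving

+(简体中文|[English](./ABTEST_IN_PADDLE_SERVING.md))
+
 This document uses a text classification task based on the IMDB dataset as an example to show how to build an A/B Test framework with Paddle Serving. The structure of the client and servers in the example is shown in the figure below.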

 <img src="abtest.png" style="zoom:33%;" />

--- a/doc/deprecated/BENCHMARKING.md
+++ b/doc/deprecated/BENCHMARKING.md
--- a/doc/DESIGN.md
+++ b/doc/DESIGN.md
@@ -14,17 +14,17 @@ The result is a complete serving solution.

 ## 2. Terms explanation

- baidu-rpc: Baidu's official open source RPC framework, supports multiple common communication protocols, and provides a custom interface experience based on protobuf
- Variant: Paddle Serving architecture is an abstraction of a minimal prediction cluster, which is characterized by all internal instances (replicas) being completely homogeneous and logically corresponding to a fixed version of a model
- Endpoint: Multiple Variants form an Endpoint. Logically, Endpoint represents a model, and Variants within the Endpoint represent different versions.
- OP: PaddlePaddle is used to encapsulate a numerical calculation operator, Paddle Serving is used to represent a basic business operation operator, and the core interface is inference. OP configures its dependent upstream OP to connect multiple OPs into a workflow
- Channel: An abstraction of all request-level intermediate data of the OP; data exchange between OPs through Channels
- Bus: manages all channels in a thread, and schedules the access relationship between the two sets of OP and Channel according to the DAG dependency graph between DAGs
- Stage: Workflow according to the topology diagram described by DAG, a collection of OPs that belong to the same link and can be executed in parallel
- Node: An Op operator instance composed of an Op operator class combined with parameter configuration, which is also an execution unit in Workflow
- Workflow: executes the inference interface of each OP in order according to the topology described by DAG
- DAG/Workflow: consists of several interdependent Nodes. Each Node can obtain the Request object through a specific interface. The node Op obtains the output object of its pre-op through the dependency relationship. The output of the last Node is the Response object by default.
- Service: encapsulates a pv request, can configure several Workflows, reuse the current PV's Request object with each other, and then execute each in parallel/serial execution, and finally write the Response to the corresponding output slot; a Paddle-serving process Multiple sets of Service interfaces can be configured. The upstream determines the Service interface currently accessed based on the ServiceName.
+- **baidu-rpc**: Baidu's official open source RPC framework, which supports multiple common communication protocols and provides a custom interface experience based on protobuf
+- **Variant**: Paddle Serving's abstraction of a minimal prediction cluster, characterized by all internal instances (replicas) being completely homogeneous and logically corresponding to one fixed version of a model
+- **Endpoint**: Multiple Variants form an Endpoint. Logically, an Endpoint represents a model, and the Variants within the Endpoint represent different versions.
+- **OP**: In PaddlePaddle, an OP encapsulates a numerical computation operator; in Paddle Serving, it represents a basic business operation operator whose core interface is inference. An OP configures the upstream OPs it depends on, which connects multiple OPs into a workflow
+- **Channel**: An abstraction of all request-level intermediate data of an OP; OPs exchange data through Channels
+- **Bus**: Manages all Channels in a thread, and schedules access between the set of OPs and the set of Channels according to the DAG dependency graph
+- **Stage**: In the topology described by the DAG of a Workflow, a collection of OPs that belong to the same step and can be executed in parallel
+- **Node**: An OP instance composed of an OP class combined with its parameter configuration; it is also an execution unit in a Workflow
+- **Workflow**: Executes the inference interface of each OP in order, according to the topology described by the DAG
+- **DAG/Workflow**: Consists of several interdependent Nodes. Each Node can obtain the Request object through a specific interface. A Node's OP obtains the output objects of its preceding OPs through the dependency relationship. The output of the last Node is the Response object by default.
+- **Service**: Encapsulates one PV request. It can configure several Workflows, which share the current PV's Request object, execute in parallel or serially, and finally write the Response to the corresponding output slot. Multiple sets of Service interfaces can be configured in one Paddle-serving process; the upstream determines which Service interface to access based on the ServiceName.

 ## 3. Python Interface Design

@@ -38,10 +38,10 @@ Models that can be predicted using the Paddle Inference Library, models saved du

 ### 3.3 Overall design:

-The user starts the Client and Server through the Python Client. The Python API has a function to check whether the interconnection and the models to be accessed match.
-The Python API calls the pybind corresponding to the client and server functions implemented by Paddle Serving, and the information transmitted through RPC is implemented through RPC.
-The Client Python API currently has two simple functions, load_inference_conf and predict, which are used to perform loading of the model to be predicted and prediction, respectively.
-The Server Python API is mainly responsible for loading the estimation model and generating various configurations required by Paddle Serving, including engines, workflow, resources, etc.
+- The user starts the Client and Server through the Python Client. The Python API has a function to check whether the interconnection and the models to be accessed match.
+- The Python API calls the pybind bindings for the client and server functions implemented by Paddle Serving, and the information exchanged between them is transmitted through RPC.
+- The Client Python API currently has two simple functions, load_inference_conf and predict, which are used to perform loading of the model to be predicted and prediction, respectively.
+- The Server Python API is mainly responsible for loading the inference model and generating various configurations required by Paddle Serving, including engines, workflow, resources, etc.

 ### 3.4 Server Interface

@@ -69,8 +69,8 @@ def save_model(server_model_folder,
 ![Paddle-Serving Overall Architecture](framework.png)

 **Model Management Framework**: Connects model files of multiple machine learning platforms and provides a unified inference interface
-**Business Scheduling Framework**: Abstracts the calculation logic of various different prediction models, provides a general DAG scheduling framework, and connects different operators through DAG diagrams to complete a prediction service together. This abstract model allows users to conveniently implement their own calculation logic, and at the same time facilitates operator sharing. (Users build their own forecasting services. A large part of their work is to build DAGs and provide operators.)
-**PredictService**: Encapsulation of the externally provided prediction service interface. Define communication fields with the client through protobuf.
+**Business Scheduling Framework**: Abstracts the calculation logic of various inference models, provides a general DAG scheduling framework, and connects different operators through DAGs to complete a prediction service together. This abstraction lets users conveniently implement their own calculation logic and also facilitates operator sharing. (When users build their own prediction services, a large part of the work is building DAGs and providing operators.)
+**Predict Service**: Encapsulates the externally provided prediction service interface. Communication fields with the client are defined through protobuf.

 ### 4.1 Model Management Framework


--- a/doc/DESIGN_CN.md
+++ b/doc/DESIGN_CN.md
@@ -4,7 +4,7 @@

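 ## 1. Background

-PaddlePaddle is the company's open source machine learning framework, which broadly supports customized development of all kinds of deep learning models; Paddle serving is the online prediction part of Paddle, which connects seamlessly with Paddle model training and provides machine learning prediction cloud services. This article describes the Paddle Serving design from the bottom up, at the model, service, and access levels.
+PaddlePaddle is Baidu's open source machine learning framework, which broadly supports customized development of all kinds of deep learning models; Paddle Serving is the online prediction part of Paddle, which connects seamlessly with Paddle model training and provides machine learning prediction cloud services. This article describes the Paddle Serving design from the bottom up, at the model, service, and access levels.

 1. The model is the core of Paddle Serving prediction, including the management of model data and inference computation;
 2. The prediction framework encapsulates model inference computation and exposes an RPC interface to connect with different upstreams;
@@ -14,23 +14,23 @@ PaddlePaddle is the company's open source machine learning framework, which broadly supports

 ## 2. Terms explanation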

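- baidu-rpc Baidu's official open source RPC framework, which supports multiple common communication protocols and provides a custom interface experience based on protobuf
- Variant Paddle Serving's abstraction of a minimal prediction cluster, characterized by all internal instances (replicas) being completely homogeneous and logically corresponding to one fixed version of a model
- Endpoint Multiple Variants form an Endpoint. Logically, an Endpoint represents a model, and the Variants within the Endpoint represent different versions
- OP In PaddlePaddle, an OP encapsulates a numerical computation operator; in Paddle Serving, it represents a basic business operation operator whose core interface is inference. An OP configures the upstream OPs it depends on, which connects multiple OPs into a workflow
- Channel An abstraction of all request-level intermediate data of an OP; OPs exchange data through Channels
- Bus Manages all channels in a thread, and schedules access between the set of OPs and the set of Channels according to the DAG dependency graph
- Stage In the topology described by the DAG of a Workflow, a collection of OPs that belong to the same step and can be executed in parallel
- Node An Op instance composed of an Op class combined with its parameter configuration; it is also an execution unit in a Workflow
- Workflow Executes the inference interface of each OP in order, according to the topology described by the DAG
- DAG/Workflow Consists of several interdependent Nodes. Each Node can obtain the Request object through a specific interface, a Node's Op obtains the output objects of its preceding Ops through the dependency relationship, and the output of the last Node is the Response object by default
- Service Encapsulates one pv request. It can configure several Workflows, which share the current PV's Request object, execute in parallel or serially, and finally write the Response to the corresponding output slot. Multiple sets of Service interfaces can be configured in one Paddle-serving process; the upstream determines which Service interface to access based on the ServiceName.
+- **baidu-rpc**: Baidu's official open source RPC framework, which supports multiple common communication protocols and provides a custom interface experience based on protobuf
+- **Variant**: Paddle Serving's abstraction of a minimal prediction cluster, characterized by all internal instances (replicas) being completely homogeneous and logically corresponding to one fixed version of a model
+- **Endpoint**: Multiple Variants form an Endpoint. Logically, an Endpoint represents a model, and the Variants within the Endpoint represent different versions
+- **OP**: In PaddlePaddle, an OP encapsulates a numerical computation operator; in Paddle Serving, it represents a basic business operation operator whose core interface is inference. An OP configures the upstream OPs it depends on, which connects multiple OPs into a workflow
+- **Channel**: An abstraction of all request-level intermediate data of an OP; OPs exchange data through Channels
+- **Bus**: Manages all channels in a thread, and schedules access between the set of OPs and the set of Channels according to the DAG dependency graph
+- **Stage**: In the topology described by the DAG of a Workflow, a collection of OPs that belong to the same step and can be executed in parallel
+- **Node**: An OP instance composed of an OP class combined with its parameter configuration; it is also an execution unit in a Workflow
+- **Workflow**: Executes the inference interface of each OP in order, according to the topology described by the DAG
+- **DAG/Workflow**: Consists of several interdependent Nodes. Each Node can obtain the Request object through a specific interface, a Node's OP obtains the output objects of its preceding OPs through the dependency relationship, and the output of the last Node is the Response object by default
+- **Service**: Encapsulates one PV request. It can configure several Workflows, which share the current PV's Request object, execute in parallel or serially, and finally write the Response to the corresponding output slot. Multiple sets of Service interfaces can be configured in one Paddle-serving process; the upstream determines which Service interface to access based on the ServiceName.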

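 ## 3. Python Interface Design

 ### 3.1 Core Targets:

-A set of Paddle Serving dynamic libraries that supports remote prediction services for general models saved by Paddle, and calls the underlying functions of PaddleServing through the Python Interface.
+Build a complete set of Paddle Serving dynamic libraries that supports remote prediction services for general models saved by Paddle, and calls the underlying functions of PaddleServing through the Python Interface.

 ### 3.2 General Model:

@@ -38,10 +38,10 @@ PaddlePaddle is the company's open source machine learning framework, which broadly supports

 ### 3.3 Overall design:

-The user starts the Client and Server through the Python Client. The Python API can check whether the connection and the model to be accessed match
-Behind the Python API are the pybind bindings for the client and server functions implemented by Paddle Serving; the information exchanged between them is transmitted through RPC
-The Client Python API currently has two simple functions, load_inference_conf and predict, which load the model to be predicted and run prediction, respectively
-The Server Python API is mainly responsible for loading the inference model and generating the configurations required by Paddle Serving, including engines, workflow, resource, etc.
+- The user starts the Client and Server through the Python Client. The Python API can check whether the connection and the model to be accessed match
+- Behind the Python API are the pybind bindings for the client and server functions implemented by Paddle Serving; the information exchanged between them is transmitted through RPC
+- The Client Python API currently has two simple functions, load_inference_conf and predict, which load the model to be predicted and run prediction, respectively
+- The Server Python API is mainly responsible for loading the inference model and generating the configurations required by Paddle Serving, including engines, workflow, resource, etc.

 ### 3.4 Server Interface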


--- a/doc/DESIGN_DOC.md
+++ b/doc/DESIGN_DOC.md
 # Paddle Serving Design Doc

+([简体中文](./DESIGN_DOC_CN.md)|English)
+
 ## 1. Design Objectives

 - Long Term Vision: Online deployment of deep learning models will be a user-facing application in the future. Any AI developer will face the problem of deploying an online service for his or her trained model.

--- a/doc/DESIGN_DOC_CN.md
+++ b/doc/DESIGN_DOC_CN.md
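 # Paddle Serving Design Doc

+(简体中文|[English](./DESIGN_DOC.md))
+
 ## 1. Design Objectives

 - Long Term Vision: Paddle Serving is an open source online serving framework from PaddlePaddle. Its long-term goal is to provide increasingly professional, reliable, and easy-to-use services for the last mile of putting AI into production.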

--- a/doc/README.md
+++ b/doc/README.md
@@ -109,7 +109,7 @@ for data in test_reader():

 [Design Doc](DESIGN.md)

-[FAQ](FAQ.md)
+[FAQ](./deprecated/FAQ.md)

 ### Senior Developer Guidelines


--- a/doc/README_CN.md
+++ b/doc/README_CN.md
@@ -109,7 +109,7 @@ for data in test_reader():

 [Design Doc](DESIGN_CN.md)

-[FAQ](FAQ.md)
+[FAQ](./deprecated/FAQ.md)

 ### Senior Developer Guidelines


--- a/doc/RUN_IN_DOCKER.md
+++ b/doc/RUN_IN_DOCKER.md
 # How to run PaddleServing in Docker

+([简体中文](./RUN_IN_DOCKER_CN.md)|English)
+
 ## Requirements

 Docker (GPU version requires nvidia-docker to be installed on the GPU machine)

--- a/doc/RUN_IN_DOCKER_CN.md
+++ b/doc/RUN_IN_DOCKER_CN.md
 # How to run PaddleServing in Docker

+(简体中文|[English](RUN_IN_DOCKER.md))
+
 ## Requirements

 Docker (the GPU version requires nvidia-docker to be installed on the GPU machine)

--- a/doc/TRAIN_TO_SERVICE.md
+++ b/doc/TRAIN_TO_SERVICE.md
-# End-to-end process from training to deployment
+# An End-to-end Tutorial from Training to Inference Service Deployment

 ([简体中文](./TRAIN_TO_SERVICE_CN.md)|English)

-Paddle Serving is Paddle's high-performance online prediction service framework, which can flexibly support the deployment of most models. In this article, the IMDB review sentiment analysis task is used as an example to show the entire process from model training to deployment of prediction service through 9 steps.
+Paddle Serving is Paddle's high-performance online inference service framework, which can flexibly support the deployment of most models. In this article, the IMDB review sentiment analysis task is used as an example to show the entire process from model training to deployment of inference service through 9 steps.

 ## Step1：Prepare for Running Environment
 Paddle Serving can be deployed on Linux environments such as CentOS and Ubuntu. On other systems or in environments where you do not want to install the serving module, you can still access the server-side prediction service through the HTTP service.

--- a/doc/timeline-example.png
+++ b/doc/timeline-example.png
--- a/python/examples/bert/benchmark_batch.py
+++ b/python/examples/bert/benchmark_batch.py
@@ -57,8 +57,7 @@ def single_func(idx, resource):
                        os.getpid(),
                        int(round(b_start * 1000000)),
                        int(round(b_end * 1000000))))
-                result = client.batch_predict(
-                    feed_batch=feed_batch, fetch=fetch)
+                result = client.predict(feed=feed_batch, fetch=fetch)
            else:
                print("unsupport batch size {}".format(args.batch_size))
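The hunk above passes the whole batch to `client.predict` through `feed` instead of calling a separate `batch_predict`. A minimal sketch of how one entry point might accept either a single sample or a batch follows. `FakeClient` and `normalize_feed` are hypothetical stand-ins written for illustration and are not Paddle Serving's actual client code; no inference is performed.

```python
def normalize_feed(feed):
    """Return a list of per-sample dicts whether feed is a dict or a list of dicts."""
    if isinstance(feed, dict):
        return [feed]
    if isinstance(feed, list) and all(isinstance(item, dict) for item in feed):
        return list(feed)
    raise TypeError("feed must be a dict or a list of dicts")


class FakeClient(object):
    """Hypothetical stand-in: returns one placeholder per sample for each fetch name."""

    def predict(self, feed, fetch):
        batch = normalize_feed(feed)
        return {name: [None] * len(batch) for name in fetch}


client = FakeClient()
single = client.predict(feed={"x": [1.0]}, fetch=["price"])
batched = client.predict(feed=[{"x": [1.0]}, {"x": [2.0]}], fetch=["price"])
print(len(single["price"]), len(batched["price"]))
```

Normalizing the input once at the top keeps the rest of the call path identical for batch size 1 and larger batches, which is one plausible reason to fold both cases into a single method.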


--- a/python/examples/criteo_ctr_with_cube/test_server_gpu.py
+++ b/python/examples/criteo_ctr_with_cube/test_server_gpu.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# pylint: disable=doc-string-missing
+
+import os
+import sys
+from paddle_serving_server_gpu import OpMaker
+from paddle_serving_server_gpu import OpSeqMaker
+from paddle_serving_server_gpu import Server
+
+op_maker = OpMaker()
+read_op = op_maker.create('general_reader')
+general_dist_kv_infer_op = op_maker.create('general_dist_kv_infer')
+response_op = op_maker.create('general_response')
+
+op_seq_maker = OpSeqMaker()
+op_seq_maker.add_op(read_op)
+op_seq_maker.add_op(general_dist_kv_infer_op)
+op_seq_maker.add_op(response_op)
+
+server = Server()
+server.set_op_sequence(op_seq_maker.get_op_sequence())
+server.set_num_threads(4)
+server.load_model_config(sys.argv[1])
+server.prepare_server(workdir="work_dir1", port=9292, device="cpu")
+server.run_server()
--- a/python/examples/util/README.md
+++ b/python/examples/util/README.md
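 ## Using the Timeline Tool

-The serving framework has built-in timing instrumentation for each stage of the prediction service, which is switched on or off through environment variables.
+The serving framework has built-in timing instrumentation for each stage of the prediction service. On the client side it is switched on or off through environment variables; when enabled, the timing information is printed to the screen.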
 ```
 export FLAGS_profile_client=1 # enable per-stage timing on the client side
 export FLAGS_profile_server=1 # enable per-stage timing on the server side
@@ -13,6 +13,8 @@ export FLAGS_profile_server=1 # enable per-stage timing on the server side
 ```
 python show_profile.py profile ${thread_num}
 ```
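+Here the thread_num argument is the number of client processes at run time; the script uses it to compute the average time spent in each stage.
+
 The script computes the time spent in each stage, divides it by the number of threads to obtain the average, and prints the result to standard output.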

 ```
@@ -22,6 +24,6 @@ python timeline_trace.py profile trace

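 Steps: open the Chrome browser, enter chrome://tracing/ in the address bar to go to the tracing page, click the load button, and open the saved trace file to visualize the timing of each stage of the prediction service.

-The result is shown in the figure below, which displays the per-stage timeline of the bert example with 4 client processes. bert_pre is the client-side data preprocessing stage, client_infer is the stage in which the client sends the prediction request and receives the result, and the second row of each process shows the timeline of each op on the server.
+The result is shown in the figure below, which displays the per-stage timeline of a GPU prediction service using the [bert example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert), with the server predicting on 4 GPUs, the client running 4 processes, and a batch size of 1. bert_pre is the client-side data preprocessing stage, client_infer is the stage in which the client sends the prediction request and receives the result, process in the figure is the client process ID, and the second row of each process shows the timeline of each op on the server.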

 ![timeline](../../../doc/timeline-example.png)
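
The averaging that `show_profile.py` is described as doing (sum each stage's elapsed time, then divide by `thread_num`) can be sketched as follows. The record layout `(stage, start_us, end_us)` and the function name `average_stage_time` are assumptions made for illustration; the actual profile log format is not shown in this diff.

```python
from collections import defaultdict


def average_stage_time(records, thread_num):
    """Sum elapsed microseconds per stage, then divide by the number of client processes."""
    if thread_num <= 0:
        raise ValueError("thread_num must be positive")
    totals = defaultdict(int)
    for stage, start_us, end_us in records:
        totals[stage] += end_us - start_us
    return {stage: total / float(thread_num) for stage, total in totals.items()}


# Made-up spans: two client processes, each reporting one bert_pre and one client_infer.
records = [
    ("bert_pre", 0, 1000),
    ("client_infer", 1000, 9000),
    ("bert_pre", 0, 3000),
    ("client_infer", 3000, 9000),
]
averages = average_stage_time(records, thread_num=2)
print(averages)
```

Because the divisor is supplied by the caller, passing a `thread_num` that differs from the real number of client processes would skew every average, which is why the README text stresses what the argument means.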
--- a/python/paddle_serving_server_gpu/__init__.py
+++ b/python/paddle_serving_server_gpu/__init__.py
@@ -55,6 +55,7 @@ class OpMaker(object):
            "general_text_reader": "GeneralTextReaderOp",
            "general_text_response": "GeneralTextResponseOp",
            "general_single_kv": "GeneralSingleKVOp",
+            "general_dist_kv_infer": "GeneralDistKVInferOp",
            "general_dist_kv": "GeneralDistKVOp"
        }

@@ -104,6 +105,7 @@ class Server(object):
        self.infer_service_fn = "infer_service.prototxt"
        self.model_toolkit_fn = "model_toolkit.prototxt"
        self.general_model_config_fn = "general_model.prototxt"
+        self.cube_config_fn = "cube.conf"
        self.workdir = ""
        self.max_concurrency = 0
        self.num_threads = 4
@@ -184,6 +186,11 @@ class Server(object):
                      "w") as fout:
                fout.write(str(self.model_conf))
            self.resource_conf = server_sdk.ResourceConf()
+            for workflow in self.workflow_conf.workflows:
+                for node in workflow.nodes:
+                    if "dist_kv" in node.name:
+                        self.resource_conf.cube_config_path = workdir
+                        self.resource_conf.cube_config_file = self.cube_config_fn
            self.resource_conf.model_toolkit_path = workdir
            self.resource_conf.model_toolkit_file = self.model_toolkit_fn
            self.resource_conf.general_model_path = workdir
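
The added loop enables the cube configuration only when some workflow node name contains `dist_kv`. A self-contained sketch of that check follows. The `Node`, `Workflow`, and `ResourceConf` classes are hypothetical plain-Python stand-ins for the protobuf-generated objects, the node names are illustrative, and unlike the loop in the diff this helper returns as soon as it finds a match.

```python
class Node(object):
    def __init__(self, name):
        self.name = name


class Workflow(object):
    def __init__(self, nodes):
        self.nodes = nodes


class ResourceConf(object):
    """Stand-in holding only the two cube fields that the loop sets."""

    def __init__(self):
        self.cube_config_path = ""
        self.cube_config_file = ""


def configure_cube(resource_conf, workflows, workdir, cube_config_fn="cube.conf"):
    """Point resource_conf at the cube config if any node name contains 'dist_kv'."""
    for workflow in workflows:
        for node in workflow.nodes:
            if "dist_kv" in node.name:
                resource_conf.cube_config_path = workdir
                resource_conf.cube_config_file = cube_config_fn
                return True
    return False


with_kv = ResourceConf()
used = configure_cube(
    with_kv,
    [Workflow([Node("general_reader_0"), Node("general_dist_kv_infer_0")])],
    "work_dir1",
)
without_kv = ResourceConf()
unused = configure_cube(without_kv, [Workflow([Node("general_infer_0")])], "work_dir1")
print(used, with_kv.cube_config_file, unused)
```

The substring test matches both `general_dist_kv` and `general_dist_kv_infer`, so either op in a workflow would switch the cube settings on.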

--- a/tools/serving_build.sh
+++ b/tools/serving_build.sh
@@ -211,12 +211,11 @@ function python_run_criteo_ctr_with_cube() {
            cp ../../../build-server-$TYPE/output/bin/cube* ./cube/ 
            mkdir -p $PYTHONROOT/lib/python2.7/site-packages/paddle_serving_server/serving-cpu-avx-openblas-0.1.3/
            yes | cp ../../../build-server-$TYPE/output/demo/serving/bin/serving $PYTHONROOT/lib/python2.7/site-packages/paddle_serving_server/serving-cpu-avx-openblas-0.1.3/
-
            sh cube_prepare.sh &
            check_cmd "mkdir work_dir1 && cp cube/conf/cube.conf ./work_dir1/"    
            python test_server.py ctr_serving_model_kv &
            check_cmd "python test_client.py ctr_client_conf/serving_client_conf.prototxt ./ut_data >score"
-            tail -n 2 score
+            tail -n 2 score | awk 'NR==1'
            AUC=$(tail -n 2  score | awk 'NR==1')
            VAR2="0.67" #TODO: temporarily relax the threshold to 0.67
            RES=$( echo "$AUC>$VAR2" | bc )
@@ -229,6 +228,30 @@ function python_run_criteo_ctr_with_cube() {
            ps -ef | grep "cube" | grep -v grep | awk '{print $2}' | xargs kill
            ;;
        GPU)
+            check_cmd "wget https://paddle-serving.bj.bcebos.com/unittest/ctr_cube_unittest.tar.gz"
+            check_cmd "tar xf ctr_cube_unittest.tar.gz"
+            check_cmd "mv models/ctr_client_conf ./"
+            check_cmd "mv models/ctr_serving_model_kv ./"
+            check_cmd "mv models/data ./cube/"
+            check_cmd "mv models/ut_data ./"
+            cp ../../../build-server-$TYPE/output/bin/cube* ./cube/
+            mkdir -p $PYTHONROOT/lib/python2.7/site-packages/paddle_serving_server_gpu/serving-gpu-0.1.3/
+            yes | cp ../../../build-server-$TYPE/output/demo/serving/bin/serving $PYTHONROOT/lib/python2.7/site-packages/paddle_serving_server_gpu/serving-gpu-0.1.3/
+            sh cube_prepare.sh &
+            check_cmd "mkdir work_dir1 && cp cube/conf/cube.conf ./work_dir1/"
+            python test_server_gpu.py ctr_serving_model_kv &
+            check_cmd "python test_client.py ctr_client_conf/serving_client_conf.prototxt ./ut_data >score"
+            tail -n 2 score | awk 'NR==1'
+            AUC=$(tail -n 2  score | awk 'NR==1')
+            VAR2="0.67" #TODO: temporarily relax the threshold to 0.67
+            RES=$( echo "$AUC>$VAR2" | bc )
+            if [[ $RES -eq 0 ]]; then
+                echo "error with criteo_ctr_with_cube inference auc test, auc should > 0.67"
+                exit 1
+            fi
+            echo "criteo_ctr_with_cube inference auc test success"
+            ps -ef | grep "paddle_serving_server" | grep -v grep | awk '{print $2}' | xargs kill
+            ps -ef | grep "cube" | grep -v grep | awk '{print $2}' | xargs kill
            ;;
        *)
            echo "error type"
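
The AUC gate in this script takes the second-to-last line of the `score` file (`tail -n 2 score | awk 'NR==1'`) and compares it with the threshold stored in `VAR2`. The same check in Python, as a sketch: `auc_passes` is a hypothetical helper, the sample file contents are made up for illustration, and the helper assumes the output has at least two lines.

```python
def auc_passes(score_text, threshold=0.67):
    """Return (auc, ok) using the second-to-last line, like tail -n 2 | awk 'NR==1'."""
    lines = score_text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least two lines of output")
    auc = float(lines[-2].strip())
    return auc, auc > threshold


sample = "some log line\n0.6912\ndone\n"  # made-up contents, not real test output
auc, ok = auc_passes(sample)
print(auc, ok)
```

Keeping the threshold as a parameter mirrors the `VAR2` variable in the script, so the comparison and any error message can be driven from one value.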