modify doc directory

078a0029 · TeslaZhao · 433ff7f9
90 changed files
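
The hunks below all make the same kind of change: an image reference gains an `images/` path segment directly before the file name. As a rough illustration only (this is not the tool the author used, and the function name `rewrite_image_refs` and the extension list are assumptions), such a rewrite could be sketched in Python as:

```python
import re

# Hypothetical helper, not part of the Paddle Serving repository.
# Extensions are limited to the ones that appear in this commit.
_IMAGE_REF = re.compile(
    r"""(?P<head>!\[[^\]]*\]\(|src\s*=\s*['"])"""   # markdown ![alt]( or HTML src=' / src="
    r"""(?P<dir>(?:[^'")\s/]+/)*)"""                # optional relative directory prefix
    r"""(?P<name>[^'")\s/]+\.(?:png|gif|jpe?g))"""  # image file name
    r"""(?=['")\s])"""                              # the reference must end right here
)


def rewrite_image_refs(text: str) -> str:
    """Insert 'images/' before the file name of each relative image reference.

    Absolute URLs are left untouched because '://' contains an empty path
    segment, which the directory pattern above cannot match.
    """
    def _fix(match: re.Match) -> str:
        directory = match.group("dir")
        if directory.endswith("images/"):
            # Already points into an images/ folder, so keep it as-is.
            return match.group(0)
        return match.group("head") + directory + "images/" + match.group("name")

    return _IMAGE_REF.sub(_fix, text)


if __name__ == "__main__":
    print(rewrite_image_refs("<img src='doc/demo.gif' width=\"700\">"))
    print(rewrite_image_refs("![timeline](../../doc/timeline-example.png)"))
```

Running the function a second time on its own output changes nothing, so it could be applied repeatedly over the same files. Cases such as the `GRPC_IMPL_CN.md` hunk, where an absolute GitHub URL is replaced by a relative path, would still need a manual edit.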
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

 <p align="center">
    <br>
-<img src='doc/serving_logo.png' width = "600" height = "130">
+<img src='doc/images/serving_logo.png' width = "600" height = "130">
    <br>
 <p>

@@ -47,7 +47,7 @@ We consider deploying deep learning inference service online to be a user-facing
 [Serving Examples](./python/examples/).

 <p align="center">
-    <img src="doc/demo.gif" width="700">
+    <img src="doc/images/demo.gif" width="700">
 </p>



--- a/README_CN.md
+++ b/README_CN.md
@@ -2,7 +2,7 @@

 <p align="center">
    <br>
-<img src='doc/serving_logo.png' width = "600" height = "130">
+<img src='doc/images/serving_logo.png' width = "600" height = "130">
    <br>
 <p>

@@ -48,7 +48,7 @@ Paddle Serving 旨在帮助深度学习开发者轻易部署在线预测服务
 - 提供丰富多彩的前后处理，方便用户在训练、部署等各阶段复用相关代码，弥合AI开发者和应用开发者之间的鸿沟，详情参考[模型示例](./python/examples/)。

 <p align="center">
-    <img src="doc/demo.gif" width="700">
+    <img src="doc/images/demo.gif" width="700">
 </p>

 <h2 align="center">教程</h2>

--- a/doc/ABTEST_IN_PADDLE_SERVING.md
+++ b/doc/ABTEST_IN_PADDLE_SERVING.md
@@ -4,7 +4,7 @@

 This document will use an example of text classification task based on IMDB dataset to show how to build a A/B Test framework using Paddle Serving. The structure relationship between the client and servers in the example is shown in the figure below.

-<img src="abtest.png" style="zoom:25%;" />
+<img src="images/abtest.png" style="zoom:25%;" />

 Note that:  A/B Test is only applicable to RPC mode, not web mode.


--- a/doc/ABTEST_IN_PADDLE_SERVING_CN.md
+++ b/doc/ABTEST_IN_PADDLE_SERVING_CN.md
@@ -4,7 +4,7 @@

 该文档将会用一个基于IMDB数据集的文本分类任务的例子，介绍如何使用Paddle Serving搭建A/B Test框架，例中的Client端、Server端结构如下图所示。

-<img src="abtest.png" style="zoom:33%;" />
+<img src="images/abtest.png" style="zoom:33%;" />

 需要注意的是：A/B Test只适用于RPC模式，不适用于WEB模式。


--- a/doc/BERT_10_MINS.md
+++ b/doc/BERT_10_MINS.md
@@ -115,7 +115,7 @@ curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}]

 We tested the performance of Bert-As-Service based on Padde Serving based on V100 and compared it with the Bert-As-Service based on Tensorflow. From the perspective of user configuration, we used the same batch size and concurrent number for stress testing. The overall throughput performance data obtained under 4 V100s is as follows.

-![4v100_bert_as_service_benchmark](4v100_bert_as_service_benchmark.png)
+![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)

 <!--
 yum install -y libXext libSM libXrender

--- a/doc/BERT_10_MINS_CN.md
+++ b/doc/BERT_10_MINS_CN.md
@@ -111,4 +111,4 @@ curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}]

 我们基于V100对基于Padde Serving研发的Bert-As-Service的性能进行测试并与基于Tensorflow实现的Bert-As-Service进行对比，从用户配置的角度，采用相同的batch size和并发数进行压力测试，得到4块V100下的整体吞吐性能数据如下。

-![4v100_bert_as_service_benchmark](4v100_bert_as_service_benchmark.png)
+![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)
--- a/doc/C++DESIGN.md
+++ b/doc/C++DESIGN.md
@@ -45,11 +45,11 @@ Models that can be predicted using the Paddle Inference Library, models saved du

 ### 3.4 Server Inferface

-![Server Interface](server_interface.png)
+![Server Interface](images/server_interface.png)

 ### 3.5 Client Interface

-<img src='client_inferface.png' width = "600" height = "200">
+<img src='images/client_inferface.png' width = "600" height = "200">

 ### 3.6 Client io used during Training

@@ -66,7 +66,7 @@ def save_model(server_model_folder,

 ## 4. Paddle Serving Underlying Framework

-![Paddle-Serging Overall Architecture](framework.png)
+![Paddle-Serging Overall Architecture](images/framework.png)

 **Model Management Framework**: Connects model files of multiple machine learning platforms and provides a unified inference interface
 **Business Scheduling Framework**: Abstracts the calculation logic of various different inference models, provides a general DAG scheduling framework, and connects different operators through DAG diagrams to complete a prediction service together. This abstract model allows users to conveniently implement their own calculation logic, and at the same time facilitates operator sharing. (Users build their own forecasting services. A large part of their work is to build DAGs and provide operators.)
@@ -102,18 +102,18 @@ class FluidFamilyCore {

 With reference to the abstract idea of model calculation of the TensorFlow framework, the business logic is abstracted into a DAG diagram, driven by configuration, generating a workflow, and skipping C ++ code compilation. Each specific step of the service corresponds to a specific OP. The OP can configure the upstream OP that it depends on. Unified message passing between OPs is achieved by the thread-level bus and channel mechanisms. For example, the service process of a simple prediction service can be abstracted into 3 steps including reading request data-> calling the prediction interface-> writing back the prediction result, and correspondingly implemented to 3 OP: ReaderOp-> ClassifyOp-> WriteOp

-![Infer Service](predict-service.png)
+![Infer Service](images/predict-service.png)

 Regarding the dependencies between OPs, and the establishment of workflows through OPs, you can refer to [从零开始写一个预测服务](CREATING.md) (simplified Chinese Version)

 Server instance perspective

-![Server instance perspective](server-side.png)
+![Server instance perspective](images/server-side.png)


 #### 4.2.2 Paddle Serving Multi-Service Mechanism

-![Paddle Serving multi-service](multi-service.png)
+![Paddle Serving multi-service](images/multi-service.png)

 Paddle Serving instances can load multiple models at the same time, and each model uses a Service (and its configured workflow) to undertake services. You can refer to [service configuration file in Demo example](../tools/cpp_examples/demo-serving/conf/service.prototxt) to learn how to configure multiple services for the serving instance

@@ -121,12 +121,12 @@ Paddle Serving instances can load multiple models at the same time, and each mod

 From the client's perspective, a Paddle Serving service can be divided into three levels: Service, Endpoint, and Variant from top to bottom.

-![Call hierarchy relationship](multi-variants.png)
+![Call hierarchy relationship](images/multi-variants.png)

 One Service corresponds to one inference model, and there is one endpoint under the model. Different versions of the model are implemented through multiple variant concepts under endpoint:
 The same model prediction service can configure multiple variants, and each variant has its own downstream IP list. The client code can configure relative weights for each variant to achieve the relationship of adjusting the traffic ratio (refer to the description of variant_weight_list in [Client Configuration](CLIENT_CONFIGURE.md) section 3.2).

-![Client-side proxy function](client-side-proxy.png)
+![Client-side proxy function](images/client-side-proxy.png)

 ## 5. User Interface


--- a/doc/C++DESIGN_CN.md
+++ b/doc/C++DESIGN_CN.md
@@ -47,11 +47,11 @@ PaddlePaddle是百度开源的机器学习框架，广泛支持各种深度学

 ### 3.4 Server Inferface

-![Server Interface](server_interface.png)
+![Server Interface](images/server_interface.png)

 ### 3.5 Client Interface

-<img src='client_inferface.png' width = "600" height = "200">
+<img src='images/client_inferface.png' width = "600" height = "200">

 ### 3.6 训练过程中使用的Client io

@@ -68,7 +68,7 @@ def save_model(server_model_folder,

 ## 4. Paddle Serving底层框架

-![Paddle-Serging总体框图](framework.png)
+![Paddle-Serging总体框图](images/framework.png)

 **模型管理框架**：对接多种机器学习平台的模型文件，向上提供统一的inference接口
 **业务调度框架**：对各种不同预测模型的计算逻辑进行抽象，提供通用的DAG调度框架，通过DAG图串联不同的算子，共同完成一次预测服务。该抽象模型使用户可以方便的实现自己的计算逻辑，同时便于算子共用。（用户搭建自己的预测服务，很大一部分工作是搭建DAG和提供算子的实现）
@@ -104,18 +104,18 @@ class FluidFamilyCore {

 参考TF框架的模型计算的抽象思想，将业务逻辑抽象成DAG图，由配置驱动，生成workflow，跳过C++代码编译。业务的每个具体步骤，对应一个具体的OP，OP可配置自己依赖的上游OP。OP之间消息传递统一由线程级Bus和channel机制实现。例如，一个简单的预测服务的服务过程，可以抽象成读请求数据->调用预测接口->写回预测结果等3个步骤，相应的实现到3个OP: ReaderOp->ClassifyOp->WriteOp

-![预测服务Service](predict-service.png)
+![预测服务Service](images/predict-service.png)

 关于OP之间的依赖关系，以及通过OP组建workflow，可以参考[从零开始写一个预测服务](CREATING.md)的相关章节

 服务端实例透视图

-![服务端实例透视图](server-side.png)
+![服务端实例透视图](images/server-side.png)


 #### 4.2.2 Paddle Serving的多服务机制

-![Paddle Serving的多服务机制](multi-service.png)
+![Paddle Serving的多服务机制](images/multi-service.png)

 Paddle Serving实例可以同时加载多个模型，每个模型用一个Service（以及其所配置的workflow）承接服务。可以参考[Demo例子中的service配置文件](../tools/cpp_examples/demo-serving/conf/service.prototxt)了解如何为serving实例配置多个service

@@ -123,12 +123,12 @@ Paddle Serving实例可以同时加载多个模型，每个模型用一个Servic

 从客户端看，一个Paddle Serving service从顶向下可分为Service, Endpoint, Variant等3个层级

-![调用层级关系](multi-variants.png)
+![调用层级关系](images/multi-variants.png)

 一个Service对应一个预测模型，模型下有1个endpoint。模型的不同版本，通过endpoint下多个variant概念实现：
 同一个模型预测服务，可以配置多个variant，每个variant有自己的下游IP列表。客户端代码可以对各个variant配置相对权重，以达到调节流量比例的关系（参考[客户端配置](CLIENT_CONFIGURE.md)第3.2节中关于variant_weight_list的说明）。

-![Client端proxy功能](client-side-proxy.png)
+![Client端proxy功能](images/client-side-proxy.png)

 ## 5. 用户接口


--- a/doc/CUBE_LOCAL.md
+++ b/doc/CUBE_LOCAL.md
@@ -88,7 +88,7 @@ this step is not necessary, but it can help you to verify if the model is ready.
 ```
 if you succeed, you will see this
 <p align="center">
-    <img src="cube-cli.png" width="700">
+    <img src="images/cube-cli.png" width="700">
 </p>

 If you see that each key has a corresponding value output, it means that the delivery was successful. This file can also be used by Serving to perform cube query in general kv infer op in Serving.

--- a/doc/CUBE_LOCAL_CN.md
+++ b/doc/CUBE_LOCAL_CN.md
@@ -91,7 +91,7 @@ cd cube

 如果执行成功，会看到如下结果
 <p align="center">
-    <img src="cube-cli.png" width="700">
+    <img src="images/cube-cli.png" width="700">
 </p>



--- a/doc/DESIGN_DOC.md
+++ b/doc/DESIGN_DOC.md
@@ -39,7 +39,7 @@ Paddle Serving provides RPC and HTTP protocol for users. For HTTP service, we re

 <p align="center">
    <br>
-<img src='user_groups.png' width = "700" height = "470">
+<img src='images/user_groups.png' width = "700" height = "470">
    <br>
 <p>

@@ -96,7 +96,7 @@ Distributed Sparse Parameter Indexing is commonly seen in advertising and recomm

 <p align="center">
    <br>
-<img src='cube_eng.png' width = "450" height = "230">
+<img src='images/cube_eng.png' width = "450" height = "230">
    <br>
 <p>

@@ -116,7 +116,7 @@ The core execution engine of Paddle Serving is a Directed acyclic graph(DAG). In

 <p align="center">
    <br>
-<img src='design_doc.png'">
+<img src='images/design_doc.png'">
    <br>
 <p>

@@ -132,7 +132,7 @@ After sufficient offline evaluation of the model, online A/B test is usually nee

 <p align="center">
    <br>
-<img src='abtest.png' width = "345" height = "230">
+<img src='images/abtest.png' width = "345" height = "230">
    <br>
 <p>

@@ -188,7 +188,7 @@ the end-to-end deep learning model can not solve all the problems at present. Us
 ### 5.1 Network Communication Mechanism
 The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC service receives the RPC request, and the gPRC gateway receives the RESTful API request and forwards the request to the gRPC Service through the reverse proxy server. Therefore, the network layer of Pipeline Serving receives both RPC and RESTful API.
 <center>
-<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
+<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
 </center>

 ### 5.2 Core Design And Use Cases
@@ -196,7 +196,7 @@ The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC s
 The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. Reference for design and use documents《[Pipeline Serving](PIPELINE_SERVING.md)》

 <center>
-<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
+<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
 </center>

 ----

--- a/doc/DESIGN_DOC_CN.md
+++ b/doc/DESIGN_DOC_CN.md
@@ -42,7 +42,7 @@ Paddle Serving面向的用户提供RPC和HTTP两种访问协议。对于HTTP协

 <p align="center">
    <br>
-<img src='user_groups.png' width = "700" height = "470">
+<img src='images/user_groups.png' width = "700" height = "470">
    <br>
 <p>

@@ -99,7 +99,7 @@ fetch_var {
 为什么要使用Paddle Serving提供的分布式稀疏参数索引服务？1）在一些推荐场景中，模型的输入特征规模通常可以达到上千亿，单台机器无法支撑T级别模型在内存的保存，因此需要进行分布式存储。2）Paddle Serving提供的分布式稀疏参数索引服务，具有并发请求多个节点的能力，从而以较低的延时完成预估服务。
 <p align="center">
    <br>
-<img src='cube_eng.png' width = "450" height = "230">
+<img src='images/cube.png' width = "450" height = "230">
    <br>
 <p>
 分布式稀疏参数索引通常在广告推荐中出现，并与分布式训练配合形成完整的离线-在线一体化部署。下图解释了其中的流程，产品的在线服务接受用户请求后将请求发送给预估服务，同时系统会记录用户的请求以进行相应的训练日志处理和拼接。离线分布式训练系统会针对流式产出的训练日志进行模型增量训练，而增量产生的模型会配送至分布式稀疏参数索引服务，同时对应的稠密的模型参数也会配送至在线的预估服务。在线服务由两部分组成，一部分是针对用户的请求提取特征后，将需要进行模型的稀疏参数索引的特征发送请求给分布式稀疏参数索引服务，针对分布式稀疏参数索引服务返回的稀疏参数再进行后续深度学习模型的计算流程，从而完成预估。
@@ -118,7 +118,7 @@ C++ Serving采用[better-rpc](https://github.com/apache/incubator-brpc)进行底
 C++ Serving的核心执行引擎是一个有向无环图，图中的每个节点代表预估服务的一个环节，例如计算模型预测打分就是其中一个环节。有向无环图有利于可并发节点充分利用部署实例内的计算资源，缩短延时。一个例子，当同一份输入需要送入两个不同的模型进行预估，并将两个模型预估的打分进行加权求和时，两个模型的打分过程即可以通过有向无环图的拓扑关系并发。
 <p align="center">
    <br>
-<img src='design_doc.png'">
+<img src='images/design_doc.png'">
    <br>
 <p>

@@ -136,7 +136,7 @@ Paddle Serving采用对称加密算法对模型进行加密，在服务加载模

 <p align="center">
    <br>
-<img src='abtest.png' width = "345" height = "230">
+<img src='images/abtest.png' width = "345" height = "230">
    <br>
 <p>

@@ -189,13 +189,13 @@ imdb_service.run_server()
 ### 5.1 网络框架
 Pipeline Serving的网络框架采用gRPC和gPRC gateway。gRPC service接收RPC请求，gPRC gateway接收RESTful API请求通过反向代理服务器将请求转发给gRPC Service。即，Pipeline Serving的网络层同时接收RPC和RESTful API。
 <center>
-<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
+<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
 </center>

 ### 5.2 核心设计与使用用例
 Pipeline Serving核心设计是图执行引擎，基本处理单元是OP和Channel，通过组合实现一套有向无环图，设计与使用文档参考《[Pipeline Serving设计与实现](PIPELINE_SERVING_CN.md)》
 <center>
-<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
+<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
 </center>
 ----


--- a/doc/GRPC_IMPL_CN.md
+++ b/doc/GRPC_IMPL_CN.md
@@ -18,7 +18,7 @@
  
 使用gRPC接口，Client端可以在Win/Linux/MacOS平台上调用不同语言。gRPC 接口实现结构如下：

-![](https://github.com/PaddlePaddle/Serving/blob/develop/doc/grpc_impl.png)
+![](images/grpc_impl.png)

 ## 1.与bRPC接口对比


--- a/doc/PIPELINE_SERVING.md
+++ b/doc/PIPELINE_SERVING.md
@@ -18,7 +18,7 @@ Paddle Serving provides a user-friendly programming framework for multi-model co
 The Server side is built based on <b>RPC Service</b> and <b>graph execution engine</b>. The relationship between them is shown in the following figure.

 <div align=center>
-<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
+<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
 </div>

 ### 1.1 RPC Service
@@ -61,7 +61,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
 - For cases where large data needs to be transferred between OPs, consider RAM DB external memory for global storage and data transfer by passing index keys in Channel.

 <div align=center>
-<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
+<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
 </div>


@@ -80,7 +80,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
 - The following illustration shows the design of Channel in the graph execution engine, using input buffer and output buffer to align data between multiple OP inputs and multiple OP outputs, with a queue in the middle to buffer.

 <div align=center>
-<img src='pipeline_serving-image3.png' height = "500" align="middle"/>
+<img src='images/pipeline_serving-image3.png' height = "500" align="middle"/>
 </div>


@@ -323,7 +323,7 @@ All examples of pipelines are in [examples/pipeline/](../python/examples/pipelin
 Here, we build a simple imdb model enable example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The Server-side structure in the example is shown in the following figure:

 <div align=center>
-<img src='pipeline_serving-image4.png' height = "200" align="middle"/>
+<img src='images/pipeline_serving-image4.png' height = "200" align="middle"/>
 </div>

 ### 3.1 Files required for pipeline deployment

--- a/doc/PIPELINE_SERVING_CN.md
+++ b/doc/PIPELINE_SERVING_CN.md
@@ -20,7 +20,7 @@ Paddle Serving提供了用户友好的多模型组合服务编程框架，Pipeli
 Server端基于<b>RPC服务层</b>和<b>图执行引擎</b>构建，两者的关系如下图所示。

 <div align=center>
-<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
+<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
 </div>

 </n>
@@ -65,7 +65,7 @@ Response中`err_no`和`err_msg`表达处理结果的正确性和错误信息，`
 - 对于 OP 之间需要传输过大数据的情况，可以考虑 RAM DB 外存进行全局存储，通过在 Channel 中传递索引的 Key 来进行数据传输

 <div align=center>
-<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
+<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
 </div>


@@ -84,7 +84,7 @@ Response中`err_no`和`err_msg`表达处理结果的正确性和错误信息，`
 - 下图为图执行引擎中 Channel 的设计，采用 input buffer 和 output buffer 进行多 OP 输入或多 OP 输出的数据对齐，中间采用一个 Queue 进行缓冲

 <div align=center>
-<img src='pipeline_serving-image3.png' height = "500" align="middle"/>
+<img src='images/pipeline_serving-image3.png' height = "500" align="middle"/>
 </div>

 #### <b>1.2.3 预测类型的设计</b>
@@ -328,7 +328,7 @@ class ResponseOp(Op):
 以 imdb_model_ensemble 为例来展示如何使用 Pipeline Serving，相关代码在 `python/examples/pipeline/imdb_model_ensemble` 文件夹下可以找到，例子中的 Server 端结构如下图所示：

 <div align=center>
-<img src='pipeline_serving-image4.png' height = "200" align="middle"/>
+<img src='images/pipeline_serving-image4.png' height = "200" align="middle"/>
 </div>

 ### 3.1 Pipeline部署需要的文件

--- a/doc/SERVER_DAG.md
+++ b/doc/SERVER_DAG.md
@@ -9,7 +9,7 @@ This document shows the concept of computation graph on server. How to define co
 Deep neural nets often have some preprocessing steps on input data, and postprocessing steps on model inference scores. Since deep learning frameworks are now very flexible, it is possible to do preprocessing and postprocessing outside the training computation graph. If we want to do input data preprocessing and inference result postprocess on server side, we have to add the corresponding computation logics on server. Moreover, if a user wants to do inference with the same inputs on more than one model, the best way is to do the inference concurrently on server side given only one client request so that we can save some network computation overhead. For the above two reasons, it is naturally to think of a Directed Acyclic Graph(DAG) as the main computation method for server inference. One example of DAG is as follows:

 <center>
-<img src='server_dag.png' width = "450" height = "500" align="middle"/>
+<img src='images/server_dag.png' width = "450" height = "500" align="middle"/>
 </center>

 ## How to define Node
@@ -19,7 +19,7 @@ Deep neural nets often have some preprocessing steps on input data, and postproc
 PaddleServing has some predefined Computation Node in the framework. A very commonly used Computation Graph is the simple reader-inference-response mode that can cover most of the single model inference scenarios. A example graph and the corresponding DAG definition code is as follows.

 <center>
-<img src='simple_dag.png' width = "260" height = "370" align="middle"/>
+<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/>
 </center>

 ``` python
@@ -51,7 +51,7 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
 An example containing multiple input nodes is given in the [MODEL_ENSEMBLE_IN_PADDLE_SERVING](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md). A example graph and the corresponding DAG definition code is as follows.

 <center>
-<img src='complex_dag.png' width = "480" height = "400" align="middle"/>
+<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/>
 </center>

 ```python

--- a/doc/SERVER_DAG_CN.md
+++ b/doc/SERVER_DAG_CN.md
@@ -9,7 +9,7 @@
 深度神经网络通常在输入数据上有一些预处理步骤，而在模型推断分数上有一些后处理步骤。 由于深度学习框架现在非常灵活，因此可以在训练计算图之外进行预处理和后处理。 如果要在服务器端进行输入数据预处理和推理结果后处理，则必须在服务器上添加相应的计算逻辑。 此外，如果用户想在多个模型上使用相同的输入进行推理，则最好的方法是在仅提供一个客户端请求的情况下在服务器端同时进行推理，这样我们可以节省一些网络计算开销。 由于以上两个原因，自然而然地将有向无环图（DAG）视为服务器推理的主要计算方法。 DAG的一个示例如下：

 <center>
-<img src='server_dag.png' width = "450" height = "500" align="middle"/>
+<img src='images/server_dag.png' width = "450" height = "500" align="middle"/>
 </center>

 ## 如何定义节点
@@ -18,7 +18,7 @@

 PaddleServing在框架中具有一些预定义的计算节点。 一种非常常用的计算图是简单的reader-infer-response模式，可以涵盖大多数单一模型推理方案。 示例图和相应的DAG定义代码如下。
 <center>
-<img src='simple_dag.png' width = "260" height = "370" align="middle"/>
+<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/>
 </center>

 ``` python
@@ -50,7 +50,7 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
 在[Paddle Serving中的集成预测](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md)文档中给出了一个包含多个输入节点的样例，示意图和代码如下。

 <center>
-<img src='complex_dag.png' width = "480" height = "400" align="middle"/>
+<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/>
 </center>

 ```python

--- a/doc/SERVING_AUTH_DOCKER.md
+++ b/doc/SERVING_AUTH_DOCKER.md
@@ -32,17 +32,17 @@ ee59a3dd4806        registry.baidubce.com/serving_dev/serving-runtime:cpu-py36

 其中我们之前serving容器 以 9393端口暴露，KONG网关的端口是8443， KONG的Web控制台的端口是8001。接下来我们在浏览器访问 `https://$IP_ADDR:8001`, 其中 IP_ADDR就是宿主机的IP。

-<img src="kong-dashboard.png">
+<img src="images/kong-dashboard.png">
 可以看到在注册结束后，登陆，看到了 DASHBOARD，我们先看SERVICES，可以看到`serving_service`，这意味着我们端口在9393的Serving服务已经在KONG当中被注册。

-<img src="kong-services.png">
-<img src="kong-routes.png">
+<img src="images/kong-services.png">
+<img src="images/kong-routes.png">

 然后在ROUTES中，我们可以看到 serving 被链接到了 `/serving-uci`。

 最后我们点击 CONSUMERS - default_user - Credentials - API KEYS ，我们可以看到 `Api Keys` 下看到很多key

-<img src="kong-api_keys.png">
+<img src="images/kong-api_keys.png">

 接下来可以通过curl访问

@@ -194,6 +194,3 @@ credentials:
 curl -H "Content-Type:application/json" -H "apikey:ZGVmYXVsdC1hcGlrZXkK" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' https://$IP:$PORT/foo/uci/prediction -k
 ```
 我们可以看到 apikey 已经加入到了curl请求的header当中。
-
-
-
--- a/doc/architecture.png
+++ b/doc/architecture.png
--- a/doc/bert-benchmark-batch-size-1.png
+++ b/doc/bert-benchmark-batch-size-1.png
--- a/doc/blank.png
+++ b/doc/blank.png
--- a/doc/coding_mode.png
+++ b/doc/coding_mode.png
--- a/doc/deprecated/BENCHMARKING.md
+++ b/doc/deprecated/BENCHMARKING.md
--- a/doc/deprecated/CTR_PREDICTION.md
+++ b/doc/deprecated/CTR_PREDICTION.md
@@ -26,7 +26,7 @@

 第1) - 第5)步裁剪完毕后的模型网络配置如下：

-![Pruned CTR prediction network](../pruned-ctr-network.png)
+![Pruned CTR prediction network](../images/pruned-ctr-network.png)


 整个裁剪过程具体说明如下：

--- a/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md
+++ b/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md
@@ -10,7 +10,7 @@ Next, we will take the text classification task as an example to show model ense

 In this example (see the figure below), the server side predict the bow and CNN models with the same input in a service in parallel, The client side fetchs the prediction results of the two models, and processes the prediction results to get the final predict results.

-![simple example](../model_ensemble_example.png)
+![simple example](../images/model_ensemble_example.png)

 It should be noted that at present, only multiple models with the same format input and output in the same service are supported. In this example, the input and output formats of CNN and BOW model are the same.


--- a/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md
+++ b/doc/deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md
@@ -10,7 +10,7 @@

 该样例中（见下图），Server端在一项服务中并行预测相同输入的BOW和CNN模型，Client端获取两个模型的预测结果并进行后处理，得到最终的预测结果。

-![simple example](../model_ensemble_example.png)
+![simple example](../images/model_ensemble_example.png)

 需要注意的是，目前只支持在同一个服务中使用多个相同格式输入输出的模型。在该例子中，CNN模型和BOW模型的输入输出格式是相同的。


--- a/doc/gpu-local-qps-batchsize.png
+++ b/doc/gpu-local-qps-batchsize.png
--- a/doc/gpu-local-qps-concurrency.png
+++ b/doc/gpu-local-qps-concurrency.png
--- a/doc/gpu-local-time-batchsize.png
+++ b/doc/gpu-local-time-batchsize.png
--- a/doc/gpu-local-time-concurrency.png
+++ b/doc/gpu-local-time-concurrency.png
--- a/doc/gpu-serving-multi-card-multi-concurrency-qps-batchsize-concurrency-client1.png
+++ b/doc/gpu-serving-multi-card-multi-concurrency-qps-batchsize-concurrency-client1.png
--- a/doc/gpu-serving-multi-card-multi-concurrency-qps-batchsize-concurrency-client2.png
+++ b/doc/gpu-serving-multi-card-multi-concurrency-qps-batchsize-concurrency-client2.png
--- a/doc/gpu-serving-multi-card-multi-concurrency-time-batchsize-concurrency-client1.png
+++ b/doc/gpu-serving-multi-card-multi-concurrency-time-batchsize-concurrency-client1.png
--- a/doc/gpu-serving-multi-card-multi-concurrency-time-batchsize-concurrency-client2.png
+++ b/doc/gpu-serving-multi-card-multi-concurrency-time-batchsize-concurrency-client2.png
--- a/doc/gpu-serving-multi-card-single-concurrency-qps-batchsize-client1.png
+++ b/doc/gpu-serving-multi-card-single-concurrency-qps-batchsize-client1.png
--- a/doc/gpu-serving-multi-card-single-concurrency-qps-batchsize-client2.png
+++ b/doc/gpu-serving-multi-card-single-concurrency-qps-batchsize-client2.png
--- a/doc/gpu-serving-multi-card-single-concurrency-time-batchsize-client1.png
+++ b/doc/gpu-serving-multi-card-single-concurrency-time-batchsize-client1.png
--- a/doc/gpu-serving-multi-card-single-concurrency-time-batchsize-client2.png
+++ b/doc/gpu-serving-multi-card-single-concurrency-time-batchsize-client2.png
--- a/doc/gpu-serving-single-card-qps-batchsize.png
+++ b/doc/gpu-serving-single-card-qps-batchsize.png
--- a/doc/gpu-serving-single-card-qps-concurrency.png
+++ b/doc/gpu-serving-single-card-qps-concurrency.png
--- a/doc/gpu-serving-single-card-time-batchsize.png
+++ b/doc/gpu-serving-single-card-time-batchsize.png
--- a/doc/gpu-serving-single-card-time-concurrency.png
+++ b/doc/gpu-serving-single-card-time-concurrency.png
--- a/doc/4v100_bert_as_service_benchmark.png
+++ b/doc/4v100_bert_as_service_benchmark.png
--- a/doc/abtest.png
+++ b/doc/abtest.png
--- a/doc/client-side-proxy.png
+++ b/doc/client-side-proxy.png
--- a/doc/client_inferface.png
+++ b/doc/client_inferface.png
--- a/doc/complex_dag.png
+++ b/doc/complex_dag.png
--- a/doc/criteo-cube-benchmark-avgcost.png
+++ b/doc/criteo-cube-benchmark-avgcost.png
--- a/doc/criteo-cube-benchmark-qps.png
+++ b/doc/criteo-cube-benchmark-qps.png
--- a/doc/cube-cli.png
+++ b/doc/cube-cli.png
--- a/doc/cube.png
+++ b/doc/cube.png
--- a/doc/cube_eng.png
+++ b/doc/cube_eng.png
--- a/doc/demo.gif
+++ b/doc/demo.gif
--- a/doc/design_doc.png
+++ b/doc/design_doc.png
--- a/doc/framework.png
+++ b/doc/framework.png
--- a/doc/grpc_impl.png
+++ b/doc/grpc_impl.png
--- a/doc/kong-api_keys.png
+++ b/doc/kong-api_keys.png
--- a/doc/kong-dashboard.png
+++ b/doc/kong-dashboard.png
--- a/doc/kong-routes.png
+++ b/doc/kong-routes.png
--- a/doc/kong-services.png
+++ b/doc/kong-services.png
--- a/doc/model_ensemble_example.png
+++ b/doc/model_ensemble_example.png
--- a/doc/multi-service.png
+++ b/doc/multi-service.png
--- a/doc/multi-variants.png
+++ b/doc/multi-variants.png
--- a/doc/pipeline_serving-image1.png
+++ b/doc/pipeline_serving-image1.png
--- a/doc/pipeline_serving-image2.png
+++ b/doc/pipeline_serving-image2.png
--- a/doc/pipeline_serving-image3.png
+++ b/doc/pipeline_serving-image3.png
--- a/doc/pipeline_serving-image4.png
+++ b/doc/pipeline_serving-image4.png
--- a/doc/predict-service.png
+++ b/doc/predict-service.png
--- a/doc/pruned-ctr-network.png
+++ b/doc/pruned-ctr-network.png
--- a/doc/server-side.png
+++ b/doc/server-side.png
--- a/doc/server_dag.png
+++ b/doc/server_dag.png
--- a/doc/server_interface.png
+++ b/doc/server_interface.png
--- a/doc/serving_logo.png
+++ b/doc/serving_logo.png
--- a/doc/simple_dag.png
+++ b/doc/simple_dag.png
--- a/doc/timeline-example.png
+++ b/doc/timeline-example.png
--- a/doc/user_groups.png
+++ b/doc/user_groups.png
--- a/doc/imdb-benchmark-server-16.png
+++ b/doc/imdb-benchmark-server-16.png
--- a/doc/imdb_loss.png
+++ b/doc/imdb_loss.png
--- a/doc/qps-threads-bow.png
+++ b/doc/qps-threads-bow.png
--- a/doc/qps-threads-cnn.png
+++ b/doc/qps-threads-cnn.png
--- a/doc/qps-threads-lstm.png
+++ b/doc/qps-threads-lstm.png
--- a/doc/qq.jpeg
+++ b/doc/qq.jpeg
--- a/doc/serving-timings.png
+++ b/doc/serving-timings.png
--- a/doc/wechat.jpeg
+++ b/doc/wechat.jpeg
--- a/python/examples/criteo_ctr_with_cube/README.md
+++ b/python/examples/criteo_ctr_with_cube/README.md
@@ -65,8 +65,8 @@ bash benchmark.sh

 the average latency of threads

-![avg cost](../../../doc/criteo-cube-benchmark-avgcost.png)
+![avg cost](../../../doc/images/criteo-cube-benchmark-avgcost.png)

 The QPS is 

-![qps](../../../doc/criteo-cube-benchmark-qps.png)
+![qps](../../../doc/images/criteo-cube-benchmark-qps.png)
--- a/python/examples/criteo_ctr_with_cube/README_CN.md
+++ b/python/examples/criteo_ctr_with_cube/README_CN.md
@@ -63,8 +63,8 @@ bash benchmark.sh

 平均每个线程耗时图如下

-![avg cost](../../../doc/criteo-cube-benchmark-avgcost.png)
+![avg cost](../../../doc/images/criteo-cube-benchmark-avgcost.png)

 每个线程QPS耗时如下

-![qps](../../../doc/criteo-cube-benchmark-qps.png)
+![qps](../../../doc/images/criteo-cube-benchmark-qps.png)
--- a/python/examples/util/README.md
+++ b/python/examples/util/README.md
@@ -28,4 +28,4 @@ Specific operation: Open the chrome browser, enter `chrome://tracing/` in the ad

 The data visualization output is shown as follow, it uses [bert as service example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) GPU inference service. The server starts 4 GPU prediction, the client starts 4 `processes`, and the timeline of each stage when the batch size is 1. Among them, `bert_pre` represents the data preprocessing stage of the client, and `client_infer` represents the stage where the client completes sending and receiving prediction requests. `process` represents the process number of the client, and the second line of each process shows the timeline of each op of the server.

-![timeline](../../../doc/timeline-example.png)
+![timeline](../../../doc/images/timeline-example.png)
--- a/python/examples/util/README_CN.md
+++ b/python/examples/util/README_CN.md
@@ -28,4 +28,4 @@ python3 timeline_trace.py profile trace

 效果如下图，图中展示了使用[bert示例](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert)的GPU预测服务，server端开启4卡预测，client端启动4进程，batch size为1时的各阶段timeline，其中bert_pre代表client端的数据预处理阶段，client_infer代表client完成预测请求的发送和接收结果的阶段，图中的process代表的是client的进程号，每个进进程的第二行展示的是server各个op的timeline。

-![timeline](../../../doc/timeline-example.png)
+![timeline](../../../doc/images/timeline-example.png)
--- a/python/paddle_serving_app/README.md
+++ b/python/paddle_serving_app/README.md
@@ -149,7 +149,7 @@ The server side starts service with 4 GPU cards, the client side starts 4 proces
 In the figure, bert_pre represents the data pre-processing stage of the client, and client_infer represents the stage where the client completes the sending of the prediction request to the receiving result.
 The process in the figure represents the process number of the client, and the second line of each process shows the timeline of each op of the server.

-![timeline](../../doc/timeline-example.png)
+![timeline](../../doc/images/timeline-example.png)

 ## Debug tools


--- a/python/paddle_serving_app/README_CN.md
+++ b/python/paddle_serving_app/README_CN.md
@@ -138,7 +138,7 @@ paddle_serving_app针对CV和NLP领域的模型任务，提供了多种常见的
   效果如下图，图中展示了使用[bert示例](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert)的GPU预测服务，server端开启4卡预测，client端启动4进程，batch size为1时的各阶段timeline。
 其中bert_pre代表client端的数据预处理阶段，client_infer代表client完成预测请求的发送到接收结果的阶段，图中的process代表的是client的进程号，每个进程的第二行展示的是server各个op的timeline。

-   ![timeline](../../doc/timeline-example.png)
+   ![timeline](../../doc/images/timeline-example.png)

 ## Debug工具